
Upload large files fast with Dropzone.js

I have previously covered how to upload large files with dropzone.js, but it didn’t allow for parallel chunk uploads. In this article we will go over that new addition, as well as several other improvements.

Download the final executable or view the code on GitHub now if you don't want to read the article.

This is the final result for now, and it can be customized if you so choose. Obviously I'm no graphic designer, but I'd pick function over style anyway.

Design

Before we dive into code, let's think about the design. We will need to somehow handle multiple parts of a single file being uploaded at the same time, in a random order. How do we keep track of that?

Thankfully, Dropzone provides the server with a few different pieces of information with each chunk, including:

  • dzuuid – unique ID per upload file
  • dzchunkindex – the chunk number of the current upload
  • dztotalfilesize – Total size of the upload
  • dzchunksize – Max size per chunk
  • dztotalchunkcount – The number of chunks in this file
  • dzchunkbyteoffset – The place in the file this chunk starts

In my mind there are two clear ways to approach the problem. The first option is to create a sparse file of the full size to start with, using dztotalfilesize, and then with every incoming chunk, seek to the position given by dzchunkbyteoffset and write the data starting there.

The advantage of this method is that it only requires a single file on disk. The disadvantage is you have to worry about multiple threads accessing the same file at the same time.
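
A minimal sketch of what that seek-and-write approach could look like (the function and names here are my own invention, and this is not the route the rest of the article takes): offset would come from dzchunkbyteoffset and total_size from dztotalfilesize.

from pathlib import Path
from threading import Lock

write_lock = Lock()

def write_chunk_in_place(final_path: Path, chunk_data: bytes, offset: int, total_size: int):
    # Guard against parallel chunk uploads touching the same file at once
    with write_lock:
        if not final_path.exists():
            # Create the file at its full size on the first write
            with open(final_path, "wb") as f:
                f.truncate(total_size)
        with open(final_path, "r+b") as f:
            f.seek(offset)
            f.write(chunk_data)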

The second choice is to write each chunk to a separate file, then, when they are all uploaded, concatenate them into a single file and remove the individual chunks. The disadvantages are that you require twice the space for a short time and have to deal with cleanup of the temporary files.

I personally preferred the second option, as it seemed a bit safer.

Upload Function

As a quick warning, I am now using Bottle instead of Flask for this upload, so a bit of the form syntax has changed since the last post.

from pathlib import Path
from threading import Lock
from collections import defaultdict
import shutil
import uuid

from bottle import route, run, request, error, response, HTTPError, static_file
from werkzeug.utils import secure_filename

lock = Lock()
chunks = defaultdict(list)

chunk_path = Path(__file__).parent / "chunks"
storage_path = Path(__file__).parent / "storage"
chunk_path.mkdir(exist_ok=True, parents=True)
storage_path.mkdir(exist_ok=True, parents=True)

@route("/upload", method="POST")
def upload():
    file = request.files.get("file")
    if not file:
        raise HTTPError(status=400, body="No file provided")

    dz_uuid = request.forms.get("dzuuid")
    if not dz_uuid:
        # Assume this file has not been chunked
        with open(storage_path / f"{uuid.uuid4()}_{secure_filename(file.filename)}", "wb") as f:
            file.save(f)
        return "File Saved"

    # Chunked upload
    try:
        current_chunk = int(request.forms["dzchunkindex"])
        total_chunks = int(request.forms["dztotalchunkcount"])
    except KeyError as err:
        raise HTTPError(status=400, body=f"Not all required fields supplied, missing {err}")
    except ValueError:
        raise HTTPError(status=400, body="Values provided were not in expected format")
    
    # Create a new directory for this file in the chunks dir, using the UUID as the folder name
    save_dir = chunk_path / dz_uuid
    if not save_dir.exists():
        save_dir.mkdir(exist_ok=True, parents=True)

    # Save the individual chunk
    with open(save_dir / str(request.forms["dzchunkindex"]), "wb") as f:
        file.save(f)

    # See if we have all the chunks uploaded
    with lock:
        chunks[dz_uuid].append(current_chunk)
        completed = len(chunks[dz_uuid]) == total_chunks

    # Concat all the files into the final file when all are uploaded
    if completed:
        with open(storage_path / f"{dz_uuid}_{secure_filename(file.filename)}", "wb") as f:
            for file_number in range(total_chunks):
                f.write((save_dir / str(file_number)).read_bytes())
        print(f"{file.filename} has been uploaded")
        shutil.rmtree(save_dir)

    return "Chunk upload successful"

if __name__ == "__main__":
    run(server="paste")

Hopefully the code is decently self-documented. We do a few checks at the start as we pull in the required parameters. Then we prepare the directory where the temporary chunks will be stored and write the incoming chunk there. We track which chunks have completed in a global dictionary, and once they have all been uploaded they are assembled into the final file.

File Downloading

Now that we can put files on the server, what about getting them back? I personally don't want people to host random files on my server, but others may. To accomplish that, we shouldn't list all the files to everyone that visits the site, only to whoever uploaded them. Thankfully we can just store the UUID in a cookie on the frontend and then have a very basic download function.

@route("/download/<dz_uuid>")
def download(dz_uuid):
    for file in storage_path.iterdir():
        if file.is_file() and file.name.startswith(dz_uuid):
            return static_file(file.name, root=file.parent.absolute(), download=True)
    return HTTPError(status=404)

This does complicate our frontend a bit, as we want to save both the UUID and the filename as text fields in a cookie. There are a lot of great libraries out there to make life with JavaScript and cookies easier, but I wanted to keep it pure JS other than Dropzone, which makes the code a bit more involved than last time.

Dropzone frontend

Instead of being a standalone file, I have put this directly into the Python file to make using it as an f-string a lot easier, though it makes it a little harder to read.

<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link rel="stylesheet" href="{dropzone_cdn.rstrip('/')}/{dropzone_version}/min/dropzone.min.css"/>
    <link rel="stylesheet" href="{dropzone_cdn.rstrip('/')}/{dropzone_version}/min/basic.min.css"/>
    <script type="application/javascript"
        src="{dropzone_cdn.rstrip('/')}/{dropzone_version}/min/dropzone.min.js">
    </script>
    <title>pyfiledrop</title>
</head>
<body>

    <div id="content" style="width: 800px; margin: 0 auto;">
        <h2>Upload new files</h2>
        <form method="POST" action='/upload' class="dropzone dz-clickable" id="dropper" enctype="multipart/form-data">
        </form>

        <h2>
            Uploaded
            <input type="button" value="Clear" onclick="clearCookies()" />
        </h2>
        <div id="uploaded">

        </div>

        <script type="application/javascript">
            function clearCookies() {{
                document.cookie = "files=; Max-Age=0";
                document.getElementById("uploaded").innerHTML = "";
            }}

            function getFilesFromCookie() {{
                try {{ return document.cookie.split("=", 2)[1].split("||");}} catch (error) {{ return []; }}
            }}

            function saveCookie(new_file) {{
                    let all_files = getFilesFromCookie();
                    all_files.push(new_file);
                    document.cookie = `files=${{all_files.join("||")}}`;
            }}

            function generateLink(combo){{
                const uuid = combo.split('|^^|')[0];
                const name = combo.split('|^^|')[1];
                if ({'true' if allow_downloads else 'false'}) {{
                    return `<a href="/download/${{uuid}}" download="${{name}}">${{name}}</a>`;
                }}
                return name;
            }}


            function init() {{

                Dropzone.options.dropper = {{
                    paramName: 'file',
                    chunking: true,
                    forceChunking: {dropzone_force_chunking},
                    url: '/upload',
                    retryChunks: true,
                    parallelChunkUploads: {dropzone_parallel_chunks},
                    timeout: {dropzone_timeout}, // milliseconds
                    maxFilesize: {dropzone_max_file_size}, // megabytes
                    chunkSize: {dropzone_chunk_size}, // bytes
                    init: function () {{
                        this.on("complete", function (file) {{
                            let combo = `${{file.upload.uuid}}|^^|${{file.upload.filename}}`;
                            saveCookie(combo);
                            document.getElementById("uploaded").innerHTML += generateLink(combo)  + "<br />";
                        }});
                    }}
                }}

                if (typeof document.cookie !== 'undefined' ) {{
                    let content = "";
                     getFilesFromCookie().forEach(function (combo) {{
                        content += generateLink(combo) + "<br />";
                    }});

                    document.getElementById("uploaded").innerHTML = content;
                }}
            }}

            init();

        </script>
    </div>
</body>
</html>

Notice we are using a slew of Python variables that we will make configurable at launch.

Command line options

import argparse
...

allow_downloads = False
dropzone_cdn = "https://cdnjs.cloudflare.com/ajax/libs/dropzone"
dropzone_version = "5.7.6"
dropzone_timeout = "120000"
dropzone_max_file_size = "100000"
dropzone_chunk_size = "1000000"
dropzone_parallel_chunks = "true"
dropzone_force_chunking = "true"

...


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--port", type=int, default=16273, required=False)
    parser.add_argument("--host", type=str, default="0.0.0.0", required=False)
    parser.add_argument("-s", "--storage", type=str, default=str(storage_path), required=False)
    parser.add_argument("-c", "--chunks", type=str, default=str(chunk_path), required=False)
    parser.add_argument(
        "--max-size",
        type=str,
        default=dropzone_max_file_size,
        help="Max file size (MB)",
    )
    parser.add_argument(
        "--timeout",
        type=str,
        default=dropzone_timeout,
        help="Timeout (ms) for each chunk upload",
    )
    parser.add_argument("--chunk-size", type=str, default=dropzone_chunk_size, help="Chunk size (bytes)")
    parser.add_argument("--disable-parallel-chunks", required=False, default=False, action="store_true")
    parser.add_argument("--disable-force-chunking", required=False, default=False, action="store_true")
    parser.add_argument("-a", "--allow-downloads", required=False, default=False, action="store_true")
    parser.add_argument("--dz-cdn", type=str, default=None, required=False)
    parser.add_argument("--dz-version", type=str, default=None, required=False)
    return parser.parse_args()


if __name__ == "__main__":

    args = parse_args()
    storage_path = Path(args.storage)
    chunk_path = Path(args.chunks)
    dropzone_chunk_size = args.chunk_size
    dropzone_timeout = args.timeout
    dropzone_max_file_size = args.max_size
    try:
        if int(dropzone_timeout) < 1 or int(dropzone_chunk_size) < 1 or int(dropzone_max_file_size) < 1:
            raise Exception("Invalid dropzone option, make sure max-size, timeout, and chunk-size are all positive")
    except ValueError:
        raise Exception("Invalid dropzone option, make sure max-size, timeout, and chunk-size are all integers")

    if args.dz_cdn:
        dropzone_cdn = args.dz_cdn
    if args.dz_version:
        dropzone_version = args.dz_version
    if args.disable_parallel_chunks:
        dropzone_parallel_chunks = "false"
    if args.disable_force_chunking:
        dropzone_force_chunking = "false"
    if args.allow_downloads:
        allow_downloads = True

    if not storage_path.exists():
        storage_path.mkdir(exist_ok=True)
    if not chunk_path.exists():
        chunk_path.mkdir(exist_ok=True)

    print(f"""Timeout: {int(dropzone_timeout) // 1000} seconds per chunk
Chunk Size: {int(dropzone_chunk_size) // 1024} KB
Max File Size: {int(dropzone_max_file_size)} MB
Force Chunking: {dropzone_force_chunking}
Parallel Chunks: {dropzone_parallel_chunks}
Storage Path: {storage_path.absolute()}
Chunk Path: {chunk_path.absolute()}
""")
    run(server="paste", port=args.port, host=args.host)

As this will become an executable, we want it to be configurable by passing parameters at launch.

Favicon

Now this is getting into the realm of silly. But to be an all-in-one script, we need to provide a binary file (the favicon) in the script itself. Thankfully ICO files compress rather well, so we are going to embed the compressed bytes in the script and decompress them when requested.
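
The compressed blob below was generated ahead of time with a one-off helper along these lines (assuming a favicon.ico file sits next to the script):

import zlib
from pathlib import Path

# Prints the compressed bytes literal that gets pasted into the favicon route below
print(zlib.compress((Path(__file__).parent / "favicon.ico").read_bytes(), 9))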

@route("/favicon.ico")
def favicon():
    return zlib.decompress(
        b"x\x9c\xedVYN\xc40\x0c5J%[\xe2\xa3|q\x06\x8e1G\xe1(=ZoV"
        b"\xb2\xa7\x89\x97R\x8d\x84\x04\xe4\xa5\xcb(\xc9\xb3\x1do"
        b"\x1d\x80\x17?\x1e\x0f\xf0O\x82\xcfw\x00\x7f\xc1\x87\xbf"
        b"\xfd\x14l\x90\xe6#\xde@\xc1\x966n[z\x85\x11\xa6\xfcc"
        b"\xdfw?s\xc4\x0b\x8e#\xbd\xc2\x08S\xe1111\xf1k\xb1NL"
        b"\xfcU<\x99\xe4T\xf8\xf43|\xaa\x18\xf8\xc3\xbaHFw\xaaj\x94"
        b"\xf4c[F\xc6\xee\xbb\xc2\xc0\x17\xf6\xf4\x12\x160\xf9"
        b"\xa3\xfeQB5\xab@\xf4\x1f\xa55r\xf9\xa4KGG\xee\x16\xdd\xff"
        b"\x8e\x9d\x8by\xc4\xe4\x17\tU\xbdDg\xf1\xeb\xf0Zh\x8e"
        b"\xd3s\x9c\xab\xc3P\n<e\xcb$\x05 b\xd8\x84Q1\x8a\xd6Kt\xe6"
        b"\x85(\x13\xe5\xf3]j\xcf\x06\x88\xe6K\x02\x84\x18\x90"
        b"\xc5\xa7Kz\xd4\x11\xeeEZK\x012\xe9\xab\xa5\xbf\xb3@i\x00"
        b"\xce\xe47\x0b\xb4\xfe\xb1d\xffk\xebh\xd3\xa3\xfd\xa4:`5J"
        b"\xa3\xf1\xf5\xf4\xcf\x02tz\x8c_\xd2\xa1\xee\xe1\xad"
        b"\xaa\xb7n-\xe5\xafoSQ\x14'\x01\xb7\x9b<\x15~\x0e\xf4b"
        b"\x8a\x90k\x8c\xdaO\xfb\x18<H\x9d\xdfj\xab\xd0\xb43\xe1"
        b'\xe3nt\x16\xdf\r\xe6\xa1d\xad\xd0\xc9z\x03"\xc7c\x94v'
        b"\xb6I\xe1\x8f\xf5,\xaa2\x93}\x90\xe0\x94\x1d\xd2\xfcY~f"
        b"\xab\r\xc1\xc8\xc4\xe4\x1f\xed\x03\x1e`\xd6\x02\xda\xc7k"
        b"\x16\x1a\xf4\xcb2Q\x05\xa0\xe6\xb4\x1e\xa4\x84\xc6"
        b"\xcc..`8'\x9a\xc9-\n\xa8\x05]?\xa3\xdfn\x11-\xcc\x0b"
        b"\xb4\x7f67:\x0c\xcf\xd5\xbb\xfd\x89\x9ebG\xf8:\x8bG"
        b"\xc0\xfb\x9dm\xe2\xdf\x80g\xea\xc4\xc45\xbe\x00\x03\xe9\xd6\xbb"
    )

Putting it all together

Here is the culmination of everything we talked about put into a script.

This may not always be the newest version; if you want to use it yourself, please download the final executable or view GitHub for the latest code.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pathlib import Path
from threading import Lock
from collections import defaultdict
import shutil
import argparse
import uuid
import zlib

from bottle import route, run, request, error, response, HTTPError, static_file
from werkzeug.utils import secure_filename

storage_path: Path = Path(__file__).parent / "storage"
chunk_path: Path = Path(__file__).parent / "chunk"

allow_downloads = False
dropzone_cdn = "https://cdnjs.cloudflare.com/ajax/libs/dropzone"
dropzone_version = "5.7.6"
dropzone_timeout = "120000"
dropzone_max_file_size = "100000"
dropzone_chunk_size = "1000000"
dropzone_parallel_chunks = "true"
dropzone_force_chunking = "true"

lock = Lock()
chunks = defaultdict(list)


@error(500)
def handle_500(error_message):
    response.status = 500
    response.body = f"Error: {error_message}"
    return response


@route("/")
def index():
    index_file = Path(__file__).parent / "index.html"
    if index_file.exists():
        return index_file.read_text()
    return f"""
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link rel="stylesheet" href="{dropzone_cdn.rstrip('/')}/{dropzone_version}/min/dropzone.min.css"/>
    <link rel="stylesheet" href="{dropzone_cdn.rstrip('/')}/{dropzone_version}/min/basic.min.css"/>
    <script type="application/javascript"
        src="{dropzone_cdn.rstrip('/')}/{dropzone_version}/min/dropzone.min.js">
    </script>
    <title>pyfiledrop</title>
</head>
<body>

    <div id="content" style="width: 800px; margin: 0 auto;">
        <h2>Upload new files</h2>
        <form method="POST" action='/upload' class="dropzone dz-clickable" id="dropper" enctype="multipart/form-data">
        </form>

        <h2>
            Uploaded
            <input type="button" value="Clear" onclick="clearCookies()" />
        </h2>
        <div id="uploaded">

        </div>

        <script type="application/javascript">
            function clearCookies() {{
                document.cookie = "files=; Max-Age=0";
                document.getElementById("uploaded").innerHTML = "";
            }}

            function getFilesFromCookie() {{
                try {{ return document.cookie.split("=", 2)[1].split("||");}} catch (error) {{ return []; }}
            }}

            function saveCookie(new_file) {{
                    let all_files = getFilesFromCookie();
                    all_files.push(new_file);
                    document.cookie = `files=${{all_files.join("||")}}`;
            }}

            function generateLink(combo){{
                const uuid = combo.split('|^^|')[0];
                const name = combo.split('|^^|')[1];
                if ({'true' if allow_downloads else 'false'}) {{
                    return `<a href="/download/${{uuid}}" download="${{name}}">${{name}}</a>`;
                }}
                return name;
            }}


            function init() {{

                Dropzone.options.dropper = {{
                    paramName: 'file',
                    chunking: true,
                    forceChunking: {dropzone_force_chunking},
                    url: '/upload',
                    retryChunks: true,
                    parallelChunkUploads: {dropzone_parallel_chunks},
                    timeout: {dropzone_timeout}, // milliseconds
                    maxFilesize: {dropzone_max_file_size}, // megabytes
                    chunkSize: {dropzone_chunk_size}, // bytes
                    init: function () {{
                        this.on("complete", function (file) {{
                            let combo = `${{file.upload.uuid}}|^^|${{file.upload.filename}}`;
                            saveCookie(combo);
                            document.getElementById("uploaded").innerHTML += generateLink(combo)  + "<br />";
                        }});
                    }}
                }}

                if (typeof document.cookie !== 'undefined' ) {{
                    let content = "";
                     getFilesFromCookie().forEach(function (combo) {{
                        content += generateLink(combo) + "<br />";
                    }});

                    document.getElementById("uploaded").innerHTML = content;
                }}
            }}

            init();

        </script>
    </div>
</body>
</html>
    """


@route("/favicon.ico")
def favicon():
    return zlib.decompress(
        b"x\x9c\xedVYN\xc40\x0c5J%[\xe2\xa3|q\x06\x8e1G\xe1(=ZoV"
        b"\xb2\xa7\x89\x97R\x8d\x84\x04\xe4\xa5\xcb(\xc9\xb3\x1do"
        b"\x1d\x80\x17?\x1e\x0f\xf0O\x82\xcfw\x00\x7f\xc1\x87\xbf"
        b"\xfd\x14l\x90\xe6#\xde@\xc1\x966n[z\x85\x11\xa6\xfcc"
        b"\xdfw?s\xc4\x0b\x8e#\xbd\xc2\x08S\xe1111\xf1k\xb1NL"
        b"\xfcU<\x99\xe4T\xf8\xf43|\xaa\x18\xf8\xc3\xbaHFw\xaaj\x94"
        b"\xf4c[F\xc6\xee\xbb\xc2\xc0\x17\xf6\xf4\x12\x160\xf9"
        b"\xa3\xfeQB5\xab@\xf4\x1f\xa55r\xf9\xa4KGG\xee\x16\xdd\xff"
        b"\x8e\x9d\x8by\xc4\xe4\x17\tU\xbdDg\xf1\xeb\xf0Zh\x8e"
        b"\xd3s\x9c\xab\xc3P\n<e\xcb$\x05 b\xd8\x84Q1\x8a\xd6Kt\xe6"
        b"\x85(\x13\xe5\xf3]j\xcf\x06\x88\xe6K\x02\x84\x18\x90"
        b"\xc5\xa7Kz\xd4\x11\xeeEZK\x012\xe9\xab\xa5\xbf\xb3@i\x00"
        b"\xce\xe47\x0b\xb4\xfe\xb1d\xffk\xebh\xd3\xa3\xfd\xa4:`5J"
        b"\xa3\xf1\xf5\xf4\xcf\x02tz\x8c_\xd2\xa1\xee\xe1\xad"
        b"\xaa\xb7n-\xe5\xafoSQ\x14'\x01\xb7\x9b<\x15~\x0e\xf4b"
        b"\x8a\x90k\x8c\xdaO\xfb\x18<H\x9d\xdfj\xab\xd0\xb43\xe1"
        b'\xe3nt\x16\xdf\r\xe6\xa1d\xad\xd0\xc9z\x03"\xc7c\x94v'
        b"\xb6I\xe1\x8f\xf5,\xaa2\x93}\x90\xe0\x94\x1d\xd2\xfcY~f"
        b"\xab\r\xc1\xc8\xc4\xe4\x1f\xed\x03\x1e`\xd6\x02\xda\xc7k"
        b"\x16\x1a\xf4\xcb2Q\x05\xa0\xe6\xb4\x1e\xa4\x84\xc6"
        b"\xcc..`8'\x9a\xc9-\n\xa8\x05]?\xa3\xdfn\x11-\xcc\x0b"
        b"\xb4\x7f67:\x0c\xcf\xd5\xbb\xfd\x89\x9ebG\xf8:\x8bG"
        b"\xc0\xfb\x9dm\xe2\xdf\x80g\xea\xc4\xc45\xbe\x00\x03\xe9\xd6\xbb"
    )


@route("/upload", method="POST")
def upload():
    file = request.files.get("file")
    if not file:
        raise HTTPError(status=400, body="No file provided")

    dz_uuid = request.forms.get("dzuuid")
    if not dz_uuid:
        # Assume this file has not been chunked
        with open(storage_path / f"{uuid.uuid4()}_{secure_filename(file.filename)}", "wb") as f:
            file.save(f)
        return "File Saved"

    # Chunked upload
    try:
        current_chunk = int(request.forms["dzchunkindex"])
        total_chunks = int(request.forms["dztotalchunkcount"])
    except KeyError as err:
        raise HTTPError(status=400, body=f"Not all required fields supplied, missing {err}")
    except ValueError:
        raise HTTPError(status=400, body="Values provided were not in expected format")

    save_dir = chunk_path / dz_uuid

    if not save_dir.exists():
        save_dir.mkdir(exist_ok=True, parents=True)

    # Save the individual chunk
    with open(save_dir / str(request.forms["dzchunkindex"]), "wb") as f:
        file.save(f)

    # See if we have all the chunks uploaded
    with lock:
        chunks[dz_uuid].append(current_chunk)
        completed = len(chunks[dz_uuid]) == total_chunks

    # Concat all the files into the final file when all are uploaded
    if completed:
        with open(storage_path / f"{dz_uuid}_{secure_filename(file.filename)}", "wb") as f:
            for file_number in range(total_chunks):
                f.write((save_dir / str(file_number)).read_bytes())
        print(f"{file.filename} has been uploaded")
        shutil.rmtree(save_dir)

    return "Chunk upload successful"


@route("/download/<dz_uuid>")
def download(dz_uuid):
    if not allow_downloads:
        raise HTTPError(status=403)
    for file in storage_path.iterdir():
        if file.is_file() and file.name.startswith(dz_uuid):
            return static_file(file.name, root=file.parent.absolute(), download=True)
    return HTTPError(status=404)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--port", type=int, default=16273, required=False)
    parser.add_argument("--host", type=str, default="0.0.0.0", required=False)
    parser.add_argument("-s", "--storage", type=str, default=str(storage_path), required=False)
    parser.add_argument("-c", "--chunks", type=str, default=str(chunk_path), required=False)
    parser.add_argument(
        "--max-size",
        type=str,
        default=dropzone_max_file_size,
        help="Max file size (MB)",
    )
    parser.add_argument(
        "--timeout",
        type=str,
        default=dropzone_timeout,
        help="Timeout (ms) for each chunk upload",
    )
    parser.add_argument("--chunk-size", type=str, default=dropzone_chunk_size, help="Chunk size (bytes)")
    parser.add_argument("--disable-parallel-chunks", required=False, default=False, action="store_true")
    parser.add_argument("--disable-force-chunking", required=False, default=False, action="store_true")
    parser.add_argument("-a", "--allow-downloads", required=False, default=False, action="store_true")
    parser.add_argument("--dz-cdn", type=str, default=None, required=False)
    parser.add_argument("--dz-version", type=str, default=None, required=False)
    return parser.parse_args()


if __name__ == "__main__":

    args = parse_args()
    storage_path = Path(args.storage)
    chunk_path = Path(args.chunks)
    dropzone_chunk_size = args.chunk_size
    dropzone_timeout = args.timeout
    dropzone_max_file_size = args.max_size
    try:
        if int(dropzone_timeout) < 1 or int(dropzone_chunk_size) < 1 or int(dropzone_max_file_size) < 1:
            raise Exception("Invalid dropzone option, make sure max-size, timeout, and chunk-size are all positive")
    except ValueError:
        raise Exception("Invalid dropzone option, make sure max-size, timeout, and chunk-size are all integers")

    if args.dz_cdn:
        dropzone_cdn = args.dz_cdn
    if args.dz_version:
        dropzone_version = args.dz_version
    if args.disable_parallel_chunks:
        dropzone_parallel_chunks = "false"
    if args.disable_force_chunking:
        dropzone_force_chunking = "false"
    if args.allow_downloads:
        allow_downloads = True

    if not storage_path.exists():
        storage_path.mkdir(exist_ok=True)
    if not chunk_path.exists():
        chunk_path.mkdir(exist_ok=True)

    print(
        f"""Timeout: {int(dropzone_timeout) // 1000} seconds per chunk
Chunk Size: {int(dropzone_chunk_size) // 1024} KB
Max File Size: {int(dropzone_max_file_size)} MB
Force Chunking: {dropzone_force_chunking}
Parallel Chunks: {dropzone_parallel_chunks}
Storage Path: {storage_path.absolute()}
Chunk Path: {chunk_path.absolute()}
"""
    )
    run(server="paste", port=args.port, host=args.host)

Make it yours, and give back if you can!

What will you add to this script? Set a max time for how long you can see the uploaded files? A way to ensure the file exists on the server before trying to download it? Checksum comparison to avoid using space for duplicate files?
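
For that last idea, a duplicate check could hash each finished upload and skip storing a file whose digest already exists. A rough sketch of the hashing half (not part of the current project):

import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    # Stream the file in blocks so large uploads don't have to fit in memory
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(block)
    return sha.hexdigest()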

However you make it better, please consider opening a pull request for your features so everyone can benefit from them!

Keep Windows from going to sleep, no power settings needed!

Say you have a long-running process that you let go overnight, only to come back the next morning and realize “Oh no, only an hour into it, my computer went to sleep!” Thanks to some Windows internals, it's possible to make the system realize there is a background task running and that it should not go to sleep.

Artwork by Clara Griffith

I personally use this exact method in my FastFlix program to make sure the computer doesn't interrupt the encoding process. It should work with any version of Python 3, and it boils down to:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import ctypes
import time

CONTINUOUS = 0x80000000
SYSTEM_REQUIRED = 0x00000001
DISPLAY_REQUIRED = 0x00000002 

ctypes.windll.kernel32.SetThreadExecutionState(CONTINUOUS | SYSTEM_REQUIRED)
try:
    # This is where you would do stuff
    while True:
        time.sleep(600) 
finally:
    ctypes.windll.kernel32.SetThreadExecutionState(CONTINUOUS)

As this example shows, you can let this run in the background to keep your computer from ever sleeping. Just be careful you don't abuse your IT policies by keeping your computer awake a lot longer than they intend.

It’s pretty straightforward code, the main function being SetThreadExecutionState.

  • ES_CONTINUOUS (0x80000000): Informs the system that the state being set should remain in effect until the next call that uses ES_CONTINUOUS and one of the other state flags is cleared.
  • ES_SYSTEM_REQUIRED (0x00000001): Forces the system to be in the working state by resetting the system idle timer.
  • ES_DISPLAY_REQUIRED (0x00000002): Forces the display to be on by resetting the display idle timer.

Table from https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setthreadexecutionstate

Notice that it doesn't take multiple arguments like you would expect in Python. Instead it accepts a single value that represents the combination of all the options you want to set, hence why you see us passing them in via a “bitwise or” expression. For example, if you also wanted to force the screen to stay on, you just need to add:

ctypes.windll.kernel32.SetThreadExecutionState(CONTINUOUS | SYSTEM_REQUIRED | DISPLAY_REQUIRED)

After this is set, it is extremely important to remember to always reset it back to ES_CONTINUOUS after your work is done.

Use atexit for safety

In our example above, we simply add this in a finally clause. For a more robust program you may want to ensure it’s always called via atexit.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import ctypes
import atexit

CONTINUOUS = 0x80000000
SYSTEM_REQUIRED = 0x00000001
DISPLAY_REQUIRED = 0x00000002 

ctypes.windll.kernel32.SetThreadExecutionState(CONTINUOUS | SYSTEM_REQUIRED)

@atexit.register
def cleanup():
    ctypes.windll.kernel32.SetThreadExecutionState(CONTINUOUS)

# Do Stuff    

Create an image in Task Manager from CPU Usage

A lot of people have passed around an internet video of someone using a crazy high core count computer to display a video via Task Manager's CPU usage graphs. The one problem with it: it's probably fake. Task Manager displays the last 60 seconds of activity, not an instant view like the one shown in the video. But what if we kept the usage the same for sixty seconds, could we at least make an image?

So the first problem is: how do we generate enough load for CPU usage to noticeably increase? Luckily there is an old go-to in the benchmark world that is really simple to implement: square roots. Throw a few million square roots at a CPU and watch it light up.

from itertools import count
import math

def run_benchmark():
    # Warning, no way to stop other than Ctrl+C or killing the process
    for num in count():
        math.sqrt(num)

run_benchmark()

The real problem is making sure we have a way to stop it. The easiest way is with a timeout, but even the slight work of checking the system time may throw off how much load we generate. So let's only do that, say, every 100,000 operations.

from itertools import count
import math
import time 

def run_benchmark(timeout=60):
    start_time = time.perf_counter()
    for num in count():
        if num % 100_000 == 0.0:
            if time.perf_counter() > start_time + timeout:
                return
        math.sqrt(num)

run_benchmark()

Excellent, now we can pump a core up to max workload. However, we currently have no control over which CPU it will run on! We have to set the “affinity” of the process. There is no easy way to do that with Python directly on Windows, so we have to go straight to the Win32 API. We will use the pywin32 library to access win32process.

pip install pywin32

We will then use the SetProcessAffinityMask function to lock our program to a specific CPU, which also requires a handle from the GetCurrentProcess Win32 API function. One thing not really documented anywhere I could find is that you set the CPU core affinity not by its actual number, but by a bitmask, which we can create by taking 2 ** cpu_core.

from itertools import count
import math
import time 

import win32process  # pywin32

def run_benchmark(cpu_core=0, timeout=60):
    start_time = time.perf_counter()
    process_id = win32process.GetCurrentProcess()
    win32process.SetProcessAffinityMask(process_id, 2 ** cpu_core)

    for num in count():
        if num % 100_000 == 0.0:
            if time.perf_counter() > start_time + timeout:
                return
        math.sqrt(num)

run_benchmark()

So now every time we run the program, it should only max out the first core. If we want to do this across multiple cores, we will have to create multiple processes at the same time. My favorite way to do that is with multiprocessing maps. So instead of lighting up a single processor, let’s pump them all to the roof.

from itertools import count
import math
import time
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

import win32process  # pywin32


def run_benchmark(cpu_core=0, timeout=60):
    start_time = time.perf_counter()
    process_id = win32process.GetCurrentProcess()
    win32process.SetProcessAffinityMask(process_id, 2 ** cpu_core)

    for num in count():
        if num % 100_000 == 0.0:
            if time.perf_counter() > start_time + timeout:
                return
        math.sqrt(num)


if __name__ == '__main__':
    arguments = [(core, 60) for core in range(cpu_count())]
    # e.g. [(0, 60), (1, 60), (2, 60), (3, 60)]
    # each tuple will be passed to run_benchmark as its arguments

    with Pool(processes=cpu_count()) as p:
        p.starmap(run_benchmark, arguments)

We have to put the multiprocessing code inside the if __name__ == '__main__': block due to how CPython starts new processes on Windows. (It's also just a good idea for any script.)

At this point you could also change up which cores it runs on to see how they correspond to the graphs in Task Manager. For example, you could change the range increments to only launch on every other core: range(0, cpu_count(), 2)

On my 8 core machine (so 16 logical cores) I can make a quick X shape by selecting certain cores:

arguments = [(core, 100) for core in [0, 3, 5, 6, 9, 10, 12, 15]]

Now remember there are two types of CPU cores according to the operating system, physical and logical. Physical is the number of actual cores on the CPU, but if the CPU has SMT (simultaneous multithreading) the logical count doubles that. That means every odd-numbered core is actually a “fake” one (remember cores start at 0), so it has to share resources with the previous core. Hence why cores 2, 4, 8 and 14 are showing higher usage.
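
If you want to check that split on your own machine, the psutil package (an extra dependency, not used anywhere else in these scripts) can report both counts:

from multiprocessing import cpu_count

import psutil  # pip install psutil

# Compare the logical count multiprocessing sees against the physical core count
print(f"logical: {cpu_count()}, physical: {psutil.cpu_count(logical=False)}")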

But what if we want even cooler graphics and don't want to be limited to just 100% CPU usage? Well then we need to tell the computer not to work for very small amounts of time, aka sleep. So let's try adding a sleep every 100K ops for, oh say, 60 milliseconds.

def run_benchmark(cpu_core=0, timeout=60):
    start_time = time.perf_counter()
    process_id = win32process.GetCurrentProcess()
    win32process.SetProcessAffinityMask(process_id, 2 ** cpu_core)

    for num in count():
        if num % 100_000 == 0.0:
            if time.perf_counter() > start_time + timeout:
                return
            time.sleep(0.06)
        math.sqrt(num)

This time I am also going to run it just on physical cores.

arguments = [(core, 100) for core in range(0, cpu_count(), 2)]

How about that, now it's at just about 50% usage on each core on my computer. If you're trying this on your own, this is where the fun begins. I suggest trying out different sleep times to see if you can get a list of times for 10 to 90% usage. For example, mine is close to:

usage_to_sleep_time = {
    100: 0,
    90: 0.014,
    80: 0.02,
    70: 0.03,
    60: 0.04,
    50: 0.06,
    40: 0.08,
    30: 0.1,
    20: 0.2,
    10: 0.5
}

Then let's throw that into the run_benchmark function to be able to set a precise amount of usage per core.

def run_benchmark(cpu_core=0, timeout=60, usage=100):
    start_time = time.perf_counter()
    process_id = win32process.GetCurrentProcess()
    win32process.SetProcessAffinityMask(process_id, 2 ** cpu_core)

    for num in count():
        if num % 100_000 == 0.0:
            if time.perf_counter() > start_time + timeout:
                return
            if usage != 100:
                time.sleep(usage_to_sleep_time[usage])
        math.sqrt(num)

Then if you have enough cores, you can see them all in action at the same time.

arguments = [(core, 100, usage) for core, usage in enumerate(usage_to_sleep_time)]

(I was impatient and didn’t do this one while the computer was idle, hence the messy lower ones.)

From here I am sure some of you could go crazy making individual cores ramp up and down and create something truly spectacular, but my curiosity has been satisfied. It is totally possible to control exactly how much CPU usage per core shows up in Task Manager with Python.

Here is the full final code, hope you enjoyed!

from itertools import count
import math
import time
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

import win32process  # pywin32

usage_to_sleep_time = {
    100: 0,
    90: 0.014,
    80: 0.02,
    70: 0.03,
    60: 0.04,
    50: 0.06,
    40: 0.08,
    30: 0.1,
    20: 0.2,
    10: 0.5
}

def run_benchmark(cpu_core=0, timeout=60, usage=100):
    start_time = time.perf_counter()
    process_id = win32process.GetCurrentProcess()
    win32process.SetProcessAffinityMask(process_id, 2 ** cpu_core)

    for num in count():
        if num % 100_000 == 0.0:
            if time.perf_counter() > start_time + timeout:
                return num
            if usage != 100:
                time.sleep(usage_to_sleep_time[usage])
        math.sqrt(num)



if __name__ == '__main__':
    arguments = [(core, 100, usage) for core, usage in enumerate(usage_to_sleep_time)]

    with Pool(processes=len(arguments)) as p:
        print(p.starmap(run_benchmark, arguments))

Let it Djan-go!

https://www.claragriffith.com/
Artwork by Clara Griffith

Do you know someone who hails from the strange land of Django development? Where there are code-based settings files, it is customary to use an ORM for all database calls, and classes reign supreme. It's a dangerous place that sits perilously close to the Javalands. Most who originally traveled there were told there would be riches, but they lost themselves along the way and simply need our help to come home again.

Classless classes

The first bad habit that must be broken is the horrendous overuse of classes. While classes are considered the building blocks of most object-oriented languages, Python has the power of first-class functions and a clean namespacing system, which negates many of the usual reasons for classes.

When using Django's framework, where a lot of its powerful operations are performed by subclassing, classes actually make sense. But creating a small class for a one-use instance in a library or script is most likely just overcomplicating and slowing down your code.

Let’s use an example I have seen recently, where someone wanted to run a command, but needed to know if a custom shell wrapper existed before running the command.

import subprocess
import os

class DoIt:

    def __init__(self, command):
        self.command = command
        self.shell = "/bin/secure_shell" if os.path.exists("/bin/secure_shell") else "/bin/bash"

    def execute(self):
        return subprocess.run([self.shell, "-c", self.command], stdout=subprocess.PIPE)

runner = DoIt('echo "order 66" ')
runner.execute() 

Most of this class is setup and unnecessary, especially if the state of the system won't change while the script is running (i.e. "/bin/secure_shell" won't be added or removed during execution). Let's clean this up a little.

import subprocess
import os

shell = "/bin/secure_shell" if os.path.exists("/bin/secure_shell") else "/bin/bash"


def do_it(command):
    return subprocess.run([shell, "-c", command], stdout=subprocess.PIPE)

do_it('echo "order 66" ')

Shorter, cleaner, faster, easier to read. It doesn't get much better than that. Now if the system state may change, you would need to move the shell check inside the function itself, which is also possible.
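
A minimal sketch of that variant, with the check moved inside the function so it runs on every call:

import subprocess
import os


def do_it(command):
    # Re-check on every call in case the secure shell appears or disappears mid-run
    shell = "/bin/secure_shell" if os.path.exists("/bin/secure_shell") else "/bin/bash"
    return subprocess.run([shell, "-c", command], stdout=subprocess.PIPE)

do_it('echo "order 66" ')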

To learn more, watch the great PyCon talk entitled “Stop Writing Classes.” The video goes over more details of why classes can stink up your code.

Help me ORM-Kenobi!

Now onto probably the hardest change, and the one that will have the most significant impact. If you, or someone you know with mesothelioma, learned how to interact with databases by using an ORM (Django's Models) and don't know how to hand-write SQL queries, it's time to make an effort to learn.

An ORM is always going to be slower than its raw SQL equivalent, and in many cases it will be much, much slower. ORMs are brilliant, but they still can't always optimize queries. Even when they do, they then load the retrieved data into Python objects, which takes longer and uses more memory. Their abstraction also comes at the price of not always being clear about what is actually happening under the hood.

I am not suggesting never using an ORM. They have a lot of benefits for simple datasets, such as ease of development, cross-database compatibility, custom field transforms and so on. However, if you don't know SQL to begin with, your ORM usage will most likely be extremely inefficient and underpowered, and the SQL tables will be non-normalized.

For example, one of the worst sins in SQL is to use SELECT * in production, which is the default behavior for ORM queries. In fact, the main Django query documentation page doesn't even go over how to limit the columns returned!
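
To be fair, the capability is there, just buried. Assuming a hypothetical Book model living in a configured Django app, either of these names its columns instead of issuing SELECT *:

from django.db import models

class Book(models.Model):  # hypothetical model, used only for illustration
    title = models.CharField(max_length=200)
    author = models.CharField(max_length=100)

def cheap_titles():
    # Both avoid SELECT * by naming the columns to fetch
    deferred = Book.objects.only("title")               # model instances, other columns deferred
    as_dicts = Book.objects.values("title", "author")   # plain dicts, no model instance overhead
    return deferred, as_dicts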

Now, I am not going to tell you when to use an ORM or not, or link to a dozen other articles that fight over this point further. All I want is for everyone that thinks they should use an ORM to at least sit down and learn a moderate amount of SQL before making that decision.

Text Config is King

When was the last time you had to convince an IT admin to update a source code file? Yeah, good luck with that.

By separating settings from code you gain a lot of flexibility and safety. It's the difference between “Mr. CEO, you need the default 'from' email changed? For our Django Docker deploy we'll need to make the change in code, commit it to the feature branch, PR it to develop, pray we don't have other changes in flight and need to hotfix this instead, then PR again for release to master, wait for the CI/CD to finish, and then the auto-deployment should be good by end of day.” vs. “One sec, let me update the config file and restart it. Okay, good to go.”

“Code based setting files are more powerful!” – Old Sith Proverb

That’s the path to the dark side. If you think you need programmatic help in a configuration file, your logic is in the wrong place. A config file shouldn't validate itself, have if statements, or have to find the absolute path of the relative directory structure the user entered.

The best configuration handling I have seen uses plain text files that are read, sanity checked, and loaded into memory, with paths resolved as needed, any necessary safety checks performed, and then any additional logic run. This allows almost anyone to update a setting safely and quickly.

The scary Django way

Let's take a chunk from the “one Django settings.py file to rule them all.”

import os 
import json 

DEBUG = os.environ.get("APP_DEBUG") == "1"

ALLOWED_HOSTS = os.environ \
  .get("APP_ALLOWED_HOSTS", "localhost") \
  .split(",")

DEFAULT_DATABASES = {...}
DATABASES = (
  json.loads(os.environ["APP_DATABASES"])
  if "APP_DATABASES" in os.environ
  else DEFAULT_DATABASES
)

The basic design is to allow for overrides at runtime based on environment variables, which is very common and good practice. However, the way this example is implemented honestly scares me. If the json.loads call fails, you are going to have an exception raised in your freaking settings file, which, depending on the project, may even happen before logging is set up.

Second, say you have “APP_ALLOWED_HOSTS” (or worse “APP_DATABASES”) set to an empty string accidentally. Then the “ALLOWED_HOSTS” would be set to [""] !
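
That comes straight from how str.split behaves on an empty string:

# An empty environment variable still produces a one-element list, not an empty one
print("".split(","))                        # ['']
print("localhost,example.com".split(","))   # ['localhost', 'example.com']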

Finally, spending all that effort grabbing environment variables makes it very unclear where to actually update the settings. For example, to update ALLOWED_HOSTS you would have to change the default in the os.environ.get call, which wouldn't make much sense to someone who didn't know Python.

The safe way

Why not use a much more easily readable file? In this case a YAML file paired with python-box (there are lots of other options as well; this was simply fast to write and easy to digest).

# config.yaml
---
debug: false
allowed_hosts:
  - 'localhost'
databases:
  - 'sqlite3'

Wonderful, now we have a super easy to digest config file that would be easy to update. Let’s implement the parser for it.

import os
from box import Box

config = Box.from_yaml(filename="config.yaml")

config.debug |= os.getenv("DEBUG") == "1"

# Doing it this way will make sure it exists and is not empty
if os.getenv("ALLOWED_HOSTS"):
    # This is overwritten like in settings.py above instead of extended 
    config.allowed_hosts = os.getenv("ALLOWED_HOSTS").split(",")

if os.getenv("APP_DATABASES"):
    try:
        databases = Box.from_json(filename=os.environ["APP_DATABASES"])
    except Exception as err:
        # Can add proper logging or print message as well / instead of exception
        raise Exception(f"Could not load database set from {os.environ['APP_DATABASES']}: {err}")
    else:
        config.databases.extend(databases)

Tada! Now it is a lot easier to understand the logic that is happening, much easier to update the config file, and a crazy amount safer.

Now the only thing to argue about with your coworkers is what type of config file to use. cfg is old-school, json doesn’t support comments, yaml is slow, toml is fake ini, hocon is overcomplicated, and you should probably just stop coding if you’re even considering xml. But no matter which you go with, they are all better (except xml) than using a language based settings file.

Wrap Up

So you’ve heard me prattle on and on about the dangers Djangoers might bring with them to your next project, but also keep in mind the good things they will bring. Django helps people learn how to document well and keep a rigid project structure, and it has a very warm and supportive community that we should all strive to match.

Thanks for reading, and I hope you enjoyed and found this informative or at least amusing!

On demand gaming server for pennies a month!

The only downside with most dedicated gaming servers is the cost. If you need a 24/7 server or someone to manage it for you, you can't avoid it. However, if you are self-sufficient in setting up the gaming software and only play certain games intermittently, your costs can be significantly reduced.

Let’s use Factorio as an example. A dedicated server will cost around $5 to $10 per month, used or not. Now, if you are like me and play with friends and family only a few nights a month, that's an unnecessary expense. Sure, you can self-host, but then everyone is dependent on a single person to have it up and running. Instead, I switched to using a Digital Ocean Droplet and have paid less than a quarter this past month for an on-demand gaming server.

Even if we played every night after work and on weekends, it would only be $1 or $2 a month. You may be thinking “oh, you just turn off the droplet when you're not using it!”, but alas, it is not that simple. A turned-off droplet is still holding onto resources, so it still incurs cost, which is standard across all the hosting giants I looked into.

The trick to saving cash? When you're not playing, turn the droplet off, snapshot it, then destroy the droplet. When you want to play again, restore the snapshot to a new droplet. That way you are only paying server costs while it is running, and only paying the super cheap snapshot storage cost the rest of the time.

This is painful to do by hand, but really easy with a helper script. (If you want to take it further, you could make it a website with access controls so only the people you choose can control it whenever they want.)
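
To give a feel for what the helper automates, here is a rough sketch of the snapshot-and-destroy side using the DigitalOcean v2 API directly with the requests library (endpoints and fields are my reading of their public API docs; the real logic lives in the dogs repo):

import requests

API = "https://api.digitalocean.com/v2"
HEADERS = {"Authorization": "Bearer <your token>"}


def snapshot_and_destroy(droplet_id: int, snapshot_name: str):
    # Power the droplet off, snapshot it, then destroy it so it stops billing
    requests.post(f"{API}/droplets/{droplet_id}/actions", headers=HEADERS, json={"type": "power_off"})
    requests.post(f"{API}/droplets/{droplet_id}/actions", headers=HEADERS,
                  json={"type": "snapshot", "name": snapshot_name})
    # In practice you would poll each action's status before moving on
    requests.delete(f"{API}/droplets/{droplet_id}", headers=HEADERS)


def restore(name: str, snapshot_id: int, region: str, size: str, ssh_key: int):
    # Create a fresh droplet from the stored snapshot when it's game time again
    requests.post(f"{API}/droplets", headers=HEADERS, json={
        "name": name, "region": region, "size": size, "image": snapshot_id, "ssh_keys": [ssh_key],
    })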

Digital Ocean Gaming Service (DOGS)

The code repo is available on GitHub and the instructions are boringly standard.

git clone https://github.com/cdgriffith/dogs.git
# Alternatively, just download and extract the zip file
# https://github.com/cdgriffith/dogs/archive/master.zip

# Create a venv if you are python savvy
pip install -r requirements.txt
cp config.yaml.example config.yaml
# Update the config file to match your digital ocean settings
python -m dogs

This script does have some prerequisites. You need to have a Digital Ocean account token, and will have to manually create and start up the server once before you can manage it via this script. (Pull requests are always considered if you want to ease that pain for others.)

Droplet setup

When you do create the droplet, make sure to pick the smallest specs you think you will need. You can always upgrade to a larger disk size, but you cannot go back down to a smaller one. Also, during this process, record the hostname, which we will use as our server name in the config file.

You will also need your SSH key ID, which is the digits at the very end of the public key. For example, if the key ends with rsa-key-20190721, the SSH ID is 20190721. You can always find this info in the Accounts > Security section as well by selecting a key and hitting “Edit”.

Also, if you hook a firewall up to the droplet, you will need its ID, which you can retrieve via the API.
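
For example, listing firewall IDs with the same kind of API token (the /v2/firewalls endpoint from DigitalOcean's public API):

import requests

resp = requests.get("https://api.digitalocean.com/v2/firewalls",
                    headers={"Authorization": "Bearer <your token>"})
for firewall in resp.json()["firewalls"]:
    print(firewall["id"], firewall["name"])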

DOGS Server Config

Once you have all that info, add it to the config.yaml file.

token: <your 64 character hex string>
servers:
  ubuntu-1804-factorio:
    region: nyc1
    size: s-1vcpu-2gb
    firewall_id: <unique hex string separated by dashes>
    snapshot_max: 2
    ssh_key: <8 digits from end of ssh public cert>

You can get a full list of regions and sizes from their API as well.
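
Something along these lines will print the available slugs (again, just the documented /v2/regions and /v2/sizes list endpoints; only the first page of results is shown):

import requests

headers = {"Authorization": "Bearer <your token>"}
regions = requests.get("https://api.digitalocean.com/v2/regions", headers=headers).json()
sizes = requests.get("https://api.digitalocean.com/v2/sizes", headers=headers).json()
print([region["slug"] for region in regions["regions"]])
print([size["slug"] for size in sizes["sizes"]])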

Again, to run you just need to be in the dogs directory and run python -m dogs.

Binary (EXE) files

If you want to package it into an easy-to-use exe (you can also modify this for Mac or Linux binaries), just use the included build scripts.

# Windows specific requirements, otherwise just install 'pyinstaller'
pip install -r requirements-build.txt
python dogs\build.py

And now you should have a super handy dogs.exe in the dist directory. Don't forget to keep your config.yaml in the same directory as the executable!

Gaming Setup Scripts

I have a directory for files that set up and update game servers. Right now I only have Factorio, but feel free to add your own; a PR would be very appreciated! All you need is a way to near-automatically create/install the service and a way to have it auto-start (in my example, using systemd for standard Ubuntu servers).

Check the scripts out on GitHub.