Uploading large files by chunking – featuring Python Flask and Dropzone.js

It can be a real pain to upload huge files. Many services limit their upload sizes to a few megabytes, and you don’t want a single connection open forever either. The super simple way to get around that is simply send the file in lots of small parts, aka chunking.

UPDATE: Check out the new article, which includes adding parallel chucking for speed improvements.

Chunking Food - Artwork by Clara Griffith
Chunking Food – Artwork by Clara Griffith

Finished code example can be viewed at github.

So there are going to be two parts to making this work, the front-end (website) and backend (server). Lets start on what the user will see.

Webpage with Dropzone.js

Beautiful, ain’t it? The best part is, the code powering it is just as succinct.

<!doctype html>
<html lang="en">
<head>

    <meta charset="UTF-8">

    <link rel="stylesheet" 
     href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.css"/>

    <link rel="stylesheet" 
     href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/basic.min.css"/>

    <script type="application/javascript" 
     src="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.js">
    </script>

    <title>File Dropper</title>
</head>
<body>

<form method="POST" action='/upload' class="dropzone dz-clickable" 
      id="dropper" enctype="multipart/form-data">
</form>


</body>
</html>

This is using the dropzone.js library, which has no additional dependencies and decent CSS included. All you have to do is add the class “dropzone” to a form and it automatically turns it into one of their special drag and drop fields (you can also click and select).

However, by default, dropzone does not chunk files. Luckily, it is really easy to enable. We are going to add some custom JavaScript and insert it between the form and the end of the body

</form>

<script type="application/javascript">
    Dropzone.options.dropper = {
        paramName: 'file',
        chunking: true,
        forceChunking: true,
        url: '/upload',
        maxFilesize: 1025, 
        chunkSize: 1000000 
    }
</script>

</body>

When enabling chunking, it will break up any files larger than the chunkSize and send them to the server over multiple requests. It accomplishes this by adding form data that has information about the chunk (uuid, current chunk, total chunks, chunk size, total size). By default, anything under that size will not have that information send as part of the form data and the server would have to have an additional logic path. Thankfully, there is the forceChunking option which will always send that information, even if it’s a smaller file. Everything else is pretty self-explanatory, but if you want more details about the possible options, just check out their list of configuration options.

Python Flask Server

Onto the backend. I am going to be using Flask, which is currently the most popular Python web framework (by github stars), other good options include Bottle and CherryPy. If you hate yourself or your colleagues, you could also use Django or Pyramid. There are a ton of good example Flask projects, and boiler plates to start from, I am going to use one that I have created for my own use that fits my needs, but don’t feel obligated to use it.

This type of upload will work across any real website back-end. You will simply need two routes, one that displays the frontend, and the other that accepts the file as an upload. At first, lets just view what dropzone is sending us. In this example my project’s name is called ‘pydrop’, and if you’re using my FlaskBootstrap code, this is the views/templated.py file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import logging
import os

from flask import render_template, Blueprint, request, make_response
from werkzeug.utils import secure_filename

from pydrop.config import config

blueprint = Blueprint('templated', __name__, template_folder='templates')

log = logging.getLogger('pydrop')


@blueprint.route('/')
@blueprint.route('/index')
def index():
    # Route to serve the upload form
    return render_template('index.html',
                           page_name='Main',
                           project_name="pydrop")


@blueprint.route('/upload', methods=['POST'])
def upload():
    # Route to deal with the uploaded chunks
    log.info(request.form)
    log.info(request.files)
    return make_response(('ok', 200))

Run the flask server and upload a small file (under the size of the chunk limit). It should log a single instance of a POST to /upload:

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
     ('dzuuid', '807f99b7-7f58-4d9b-ac05-2a20f5e53782'), 
     ('dzchunkindex', '0'), 
     ('dztotalfilesize', '1742'), 
     ('dzchunksize', '1000000'), 
     ('dztotalchunkcount', '1'), 
     ('dzchunkbyteoffset', '0')])

[INFO] pydrop: ImmutableMultiDict([
     ('file', &lt;FileStorage: 'README.md' ('application/octet-stream')&gt;)])

Lets break down what information we are getting:

dzuuid – Unique identifier of the file being uploaded

dzchunkindex – Which block number we are currently on

dztotalfilesize – The entire file’s size

dzchunksize – The max chunk size set on the frontend (note this may be larger than the actual chuck’s size)

dztotalchunkcount – The number of chunks to expect

dzchunkbyteoffset – The file offset we need to keep appending to the file being  uploaded

Next, let’s upload something just a bit larger that will require it to be chunked into multiple parts:

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'), 
    ('dzchunkindex', '0'), 
    ('dztotalfilesize', '1191708'), 
    ('dzchunksize', '1000000'), 
    ('dztotalchunkcount', '2'), 
    ('dzchunkbyteoffset', '0')])

[INFO] pydrop: ImmutableMultiDict([
    ('file', &lt;FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')&gt;)])



[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'), 
    ('dzchunkindex', '1'),
    ('dztotalfilesize', '1191708'),  
    ('dzchunksize', '1000000'), 
    ('dztotalchunkcount', '2'), 
    ('dzchunkbyteoffset', '1000000')])

[INFO] pydrop: ImmutableMultiDict([
    ('file', &lt;FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')&gt;)])

Notice how /upload has been called twice. And that the dzchunkindex and dzchunkbyteoffset have been updated accordingly.  That means our upload function has to be smart enough to handle both new requests and existing multipart uploads.  That means for new requests we should open existing files and only write data after the data already in them, whereas we will create a file and start at the beginning for new uploads. Luckily, both can be accomplished by opening with the same code. First open file in append mode,  then ‘seek’ to the end of the current data (in this case we are relying on the seek offset to be provided by dropzone.)

@blueprint.route('/upload', methods=['POST'])
def upload():
    # Remember the paramName was set to 'file', we can use that here to grab it
    file = request.files['file']

    # secure_filename makes sure the filename isn't unsafe to save
    save_path = os.path.join(config.data_dir, secure_filename(file.filename))

    # We need to append to the file, and write as bytes
    with open(save_path, 'ab') as f:
        # Goto the offset, aka after the chunks we already wrote 
        f.seek(int(request.form['dzchunkbyteoffset']))
        f.write(file.stream.read())
       
    # Giving it a 200 means it knows everything is ok
    return make_response(('Uploaded Chunk', 200))

At this point you should have a working upload script, tada!

But lets beef this up a little bit. The following code improvements make it so we don’t overwrite existing files that have already been uploaded, checks the file size matches what we expect when we’re done, and gives a little more output along the way.

@blueprint.route('/upload', methods=['POST'])
def upload():
    file = request.files['file']

    save_path = os.path.join(config.data_dir, secure_filename(file.filename))
    current_chunk = int(request.form['dzchunkindex'])

    # If the file already exists it's ok if we are appending to it,
    # but not if it's new file that would overwrite the existing one
    if os.path.exists(save_path) and current_chunk == 0:
        # 400 and 500s will tell dropzone that an error occurred and show an error
        return make_response(('File already exists', 400))

    try:
        with open(save_path, 'ab') as f:
            f.seek(int(request.form['dzchunkbyteoffset']))
            f.write(file.stream.read())
    except OSError:
        # log.exception will include the traceback so we can see what's wrong 
        log.exception('Could not write to file')
        return make_response(("Not sure why,"
                              " but we couldn't write the file to disk", 500))

    total_chunks = int(request.form['dztotalchunkcount'])

    if current_chunk + 1 == total_chunks:
        # This was the last chunk, the file should be complete and the size we expect
        if os.path.getsize(save_path) != int(request.form['dztotalfilesize']):
            log.error(f"File {file.filename} was completed, "
                      f"but has a size mismatch."
                      f"Was {os.path.getsize(save_path)} but we"
                      f" expected {request.form['dztotalfilesize']} ")
            return make_response(('Size mismatch', 500))
        else:
            log.info(f'File {file.filename} has been uploaded successfully')
    else:
        log.debug(f'Chunk {current_chunk + 1} of {total_chunks} '
                  f'for file {file.filename} complete')

    return make_response(("Chunk upload successful", 200))

Now lets give this a try:

[DEBUG] pydrop: Chunk 1 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 2 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 3 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 4 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 5 of 6 for file DSC_0051-1.jpg complete
[INFO] pydrop: File DSC_0051-1.jpg has been uploaded successfully

Sweet! But wait, what if we remove the directories where the files are stored? Or try to upload the same file again?

(Dropzone’s text out of the box is a little hard to read, but it says “File already exists” on the left and “Not sure why, but we couldn’t write file the disk” on the right. Exactly what we’d expect.)

2018-05-28 14:29:19,311 [ERROR] pydrop: Could not write to file
Traceback (most recent call last):
    ....
FileNotFoundError: [Errno 2] No such file or directory:

We get error message on the webpage and in the logs, perfect.

I hope you found this information useful and if you have any suggestions on how to improve it, please let me know!

Thinking further down the road

In the long-term I would have a database or some permanent storage option to keep track of file uploads. That way you could see if one fails or stops halfway and be able to remove incomplete ones. I would also base saving files first into a temp directory based off their UUID then, when complete, moving them to a place based off their file hash. Would also be nice to have a page to see everything uploaded and manage directories or other options, or even password protected uploads.