It can be a real pain to upload huge files. Many services limit their upload sizes to a few megabytes, and you don’t want a single connection open forever either. The super simple way to get around that is to send the file in lots of small parts, aka chunking.
UPDATE: Check out the new article, which includes adding parallel chunking for speed improvements.
The finished code example can be viewed on GitHub.
There are going to be two parts to making this work: the front-end (website) and the back-end (server). Let’s start with what the user will see.
Webpage with Dropzone.js
Beautiful, ain’t it? The best part is, the code powering it is just as succinct.
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link rel="stylesheet"
          href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.css"/>
    <link rel="stylesheet"
          href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/basic.min.css"/>
    <script type="application/javascript"
            src="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.js">
    </script>
    <title>File Dropper</title>
</head>
<body>

<form method="POST"
      action='/upload'
      class="dropzone dz-clickable"
      id="dropper"
      enctype="multipart/form-data">
</form>

</body>
</html>
This is using the dropzone.js library, which has no additional dependencies and decent CSS included. All you have to do is add the class “dropzone” to a form and it automatically turns it into one of their special drag and drop fields (you can also click and select).
However, by default, dropzone does not chunk files. Luckily, it is really easy to enable. We are going to add some custom JavaScript and insert it between the end of the form (</form>) and the end of the body (</body>):
</form>

<script type="application/javascript">
    Dropzone.options.dropper = {
        paramName: 'file',
        chunking: true,
        forceChunking: true,
        url: '/upload',
        maxFilesize: 1025,  // megabytes
        chunkSize: 1000000  // bytes
    }
</script>

</body>
When chunking is enabled, dropzone will break up any file larger than the chunkSize and send it to the server over multiple requests. It accomplishes this by adding form data that has information about the chunk (uuid, current chunk, total chunks, chunk size, total size). By default, anything under that size will not have that information sent as part of the form data, and the server would need an additional logic path to handle it. Thankfully, there is the forceChunking option, which will always send that information, even for smaller files. Everything else is pretty self-explanatory, but if you want more details about the possible options, just check out their list of configuration options.
Python Flask Server
Onto the backend. I am going to be using Flask, which is currently the most popular Python web framework (by GitHub stars); other good options include Bottle and CherryPy. If you hate yourself or your colleagues, you could also use Django or Pyramid. There are a ton of good example Flask projects and boilerplates to start from; I am going to use one that I have created for my own use that fits my needs, but don’t feel obligated to use it.
This type of upload will work with any real website back-end. You simply need two routes: one that displays the frontend, and one that accepts the file as an upload. First, let’s just look at what dropzone is sending us. In this example my project is named ‘pydrop’, and if you’re using my FlaskBootstrap code, this is the views/templated.py file.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import logging
import os

from flask import render_template, Blueprint, request, make_response
from werkzeug.utils import secure_filename

from pydrop.config import config

blueprint = Blueprint('templated', __name__, template_folder='templates')

log = logging.getLogger('pydrop')


@blueprint.route('/')
@blueprint.route('/index')
def index():
    # Route to serve the upload form
    return render_template('index.html',
                           page_name='Main',
                           project_name="pydrop")


@blueprint.route('/upload', methods=['POST'])
def upload():
    # Route to deal with the uploaded chunks
    log.info(request.form)
    log.info(request.files)
    return make_response(('ok', 200))
Run the Flask server and upload a small file (under the size of the chunk limit). It should log a single POST to /upload:
[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -
[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', '807f99b7-7f58-4d9b-ac05-2a20f5e53782'),
    ('dzchunkindex', '0'),
    ('dztotalfilesize', '1742'),
    ('dzchunksize', '1000000'),
    ('dztotalchunkcount', '1'),
    ('dzchunkbyteoffset', '0')])
[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: 'README.md' ('application/octet-stream')>)])
Let’s break down what information we are getting (a small parsing sketch follows the list):
dzuuid – Unique identifier of the file being uploaded
dzchunkindex – Which chunk number we are currently on (zero-indexed)
dztotalfilesize – The entire file’s size
dzchunksize – The max chunk size set on the frontend (note this may be larger than the actual chunk’s size)
dztotalchunkcount – The number of chunks to expect
dzchunkbyteoffset – The byte offset in the file at which this chunk’s data should be written
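To make those fields concrete, here is a minimal sketch (my own helper, not part of the article’s code) of pulling the chunk metadata out of Flask’s request.form. Dropzone sends everything as strings, so the numeric fields need casting:

# Hypothetical helper for reading Dropzone's chunk metadata in a Flask view.
# The field names come straight from the log output above; the function
# name and dict layout are invented for illustration.
from flask import request

def get_chunk_info():
    return {
        'uuid': request.form['dzuuid'],
        'chunk_index': int(request.form['dzchunkindex']),
        'total_file_size': int(request.form['dztotalfilesize']),
        'chunk_size': int(request.form['dzchunksize']),
        'total_chunk_count': int(request.form['dztotalchunkcount']),
        'chunk_byte_offset': int(request.form['dzchunkbyteoffset']),
    }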
Next, let’s upload something a bit larger, so that it will need to be chunked into multiple parts:
[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -
[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'),
    ('dzchunkindex', '0'),
    ('dztotalfilesize', '1191708'),
    ('dzchunksize', '1000000'),
    ('dztotalchunkcount', '2'),
    ('dzchunkbyteoffset', '0')])
[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')>)])

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -
[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'),
    ('dzchunkindex', '1'),
    ('dztotalfilesize', '1191708'),
    ('dzchunksize', '1000000'),
    ('dztotalchunkcount', '2'),
    ('dzchunkbyteoffset', '1000000')])
[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')>)])
Notice how /upload has been called twice, and that dzchunkindex and dzchunkbyteoffset have been updated accordingly. That means our upload function has to be smart enough to handle both brand-new uploads and the later chunks of an in-progress upload: for the first chunk we create the file and start writing at the beginning, whereas for subsequent chunks we open the existing file and write after the data already in it. Luckily, both can be accomplished with the same code: open the file in append mode, then ‘seek’ to the end of the current data (in this case we are relying on the seek offset provided by dropzone).
@blueprint.route('/upload', methods=['POST'])
def upload():
    # Remember the paramName was set to 'file', we can use that here to grab it
    file = request.files['file']

    # secure_filename makes sure the filename isn't unsafe to save
    save_path = os.path.join(config.data_dir, secure_filename(file.filename))

    # We need to append to the file, and write as bytes
    with open(save_path, 'ab') as f:
        # Goto the offset, aka after the chunks we already wrote
        f.seek(int(request.form['dzchunkbyteoffset']))
        f.write(file.stream.read())

    # Giving it a 200 means it knows everything is ok
    return make_response(('Uploaded Chunk', 200))
At this point you should have a working upload script, tada!
But let’s beef this up a little bit. The following code improvements make sure we don’t overwrite existing files that have already been uploaded, check that the file size matches what we expect when we’re done, and give a little more output along the way.
@blueprint.route('/upload', methods=['POST'])
def upload():
    file = request.files['file']
    save_path = os.path.join(config.data_dir, secure_filename(file.filename))
    current_chunk = int(request.form['dzchunkindex'])

    # If the file already exists it's ok if we are appending to it,
    # but not if it's a new file that would overwrite the existing one
    if os.path.exists(save_path) and current_chunk == 0:
        # 400 and 500s will tell dropzone that an error occurred and show an error
        return make_response(('File already exists', 400))

    try:
        with open(save_path, 'ab') as f:
            f.seek(int(request.form['dzchunkbyteoffset']))
            f.write(file.stream.read())
    except OSError:
        # log.exception will include the traceback so we can see what's wrong
        log.exception('Could not write to file')
        return make_response(("Not sure why,"
                              " but we couldn't write the file to disk", 500))

    total_chunks = int(request.form['dztotalchunkcount'])

    if current_chunk + 1 == total_chunks:
        # This was the last chunk, the file should be complete and the size we expect
        if os.path.getsize(save_path) != int(request.form['dztotalfilesize']):
            log.error(f"File {file.filename} was completed, "
                      f"but has a size mismatch. "
                      f"Was {os.path.getsize(save_path)} but we "
                      f"expected {request.form['dztotalfilesize']}")
            return make_response(('Size mismatch', 500))
        else:
            log.info(f'File {file.filename} has been uploaded successfully')
    else:
        log.debug(f'Chunk {current_chunk + 1} of {total_chunks} '
                  f'for file {file.filename} complete')

    return make_response(("Chunk upload successful", 200))
Now let’s give this a try:
[DEBUG] pydrop: Chunk 1 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 2 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 3 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 4 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 5 of 6 for file DSC_0051-1.jpg complete
[INFO] pydrop: File DSC_0051-1.jpg has been uploaded successfully
Sweet! But wait, what if we remove the directories where the files are stored? Or try to upload the same file again?
(Dropzone’s text out of the box is a little hard to read, but it says “File already exists” on the left and “Not sure why, but we couldn’t write the file to disk” on the right. Exactly what we’d expect.)
2018-05-28 14:29:19,311 [ERROR] pydrop: Could not write to file
Traceback (most recent call last):
    ....
FileNotFoundError: [Errno 2] No such file or directory:
We get error messages on the webpage and in the logs. Perfect.
I hope you found this information useful and if you have any suggestions on how to improve it, please let me know!
Thinking further down the road
In the long term I would have a database or some other permanent storage option to keep track of file uploads. That way you could see if one fails or stops halfway and be able to remove incomplete ones. I would also save files first into a temp directory based on their UUID and then, when complete, move them to a location based on their file hash. It would also be nice to have a page to see everything uploaded and manage directories or other options, or even password-protected uploads.
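As a rough, hypothetical sketch of that temp-directory idea (the directory paths and the finalize_upload helper are invented for illustration, and the database bookkeeping is omitted):

# Hypothetical sketch: chunks accumulate in a temp dir keyed by dzuuid,
# and the finished file moves to a location keyed by its hash.
import hashlib
import os
import shutil

TMP_DIR = '/tmp/pydrop_incomplete'   # invented path: in-progress uploads
DATA_DIR = '/var/lib/pydrop/files'   # invented path: completed files

def finalize_upload(dzuuid, filename):
    temp_path = os.path.join(TMP_DIR, dzuuid)
    # Hash the completed file so identical uploads land in the same place
    sha256 = hashlib.sha256()
    with open(temp_path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            sha256.update(block)
    final_dir = os.path.join(DATA_DIR, sha256.hexdigest())
    os.makedirs(final_dir, exist_ok=True)
    shutil.move(temp_path, os.path.join(final_dir, filename))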
How do you activate the parallel upload of chunks for one big file?
On the JS side, you need to enable the parallelChunkUploads option: https://www.dropzonejs.com/#config-parallelChunkUploads

On the Python server side, you need to change it so that on the first request it writes out dummy data to the full file size, so you can seek to any part of it before writing, I believe (unsure / untested myself).
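To make that concrete, here is an equally untested sketch of the pre-allocation idea (the write_chunk helper is invented, and truly parallel requests would still need locking around the file-creation check):

# Untested sketch of pre-allocation for parallel chunks. On the first
# chunk seen for a file, reserve the full size up front, then open in
# 'r+b' so seek() actually positions the write ('ab' append mode always
# writes at the end of the file, regardless of seek).
import os

def write_chunk(save_path, offset, total_size, data):
    if not os.path.exists(save_path):
        with open(save_path, 'wb') as f:
            f.truncate(total_size)  # sparse pre-allocation to full size
    with open(save_path, 'r+b') as f:
        f.seek(offset)
        f.write(data)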
Could you show how the same concept could be done in the Go language? I need to write almost exactly the same program for my university, but I’ve been struggling for two days.
Can you please share a piece of code for the Python server side for parallel uploads?
Only two years late, but I finally have an article on just that! https://codecalamity.com/upload-large-files-fast-with-dropzone-js/
What if I want to do it in the background using Celery? How would I create a task for the chunks, any idea?
There is no real way to do it “in the background” from the client side, as they are constantly connected while uploading. From the server side of things, as long as you are using a WSGI server that is threaded/async/multiprocessed, like CherryPy or Tornado, you will still be able to handle multiple requests. Aka don’t use the default Flask one (which you really shouldn’t be using in prod anyway). If you want to use it with Bottle instead of Flask, there are solutions like greenlets.
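For example, a minimal sketch of serving the app with CherryPy’s threaded WSGI server (cheroot); the import path for the Flask app object is an assumption about your project layout:

# Serve the Flask app with cheroot (CherryPy's WSGI server) instead of
# the built-in development server, so concurrent chunk uploads are handled.
from cheroot.wsgi import Server

from pydrop.app import app  # assumption: import your Flask app from wherever it lives

server = Server(('0.0.0.0', 8080), app)
try:
    server.start()
except KeyboardInterrupt:
    server.stop()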
Thanks for the great post Chris! I want to use this ‘sequential’ chunking instead of parallel to reduce the risk of crashes on bad internet lines. I’ve implemented your code and it seems to work well 🙂 However, the progress bar always shows 100%. Do you know how it can be forced into showing a percentage of successfully uploaded chunks / total chunks?