Is the world ready for Python 3?

The trek from Python 2 to Python 3 has been drawn out, arduous, and fraught with peril. How close are our dear developer knights to reaching the long-sought glory of Python 3?

Quest for the Python 3 – Artwork by Clara Griffith (Link may contain NSFW art)

PIP Downloads

Let’s first jump into what is being used the most right now. This data examines fourteen different libraries downloaded via pip, broken down by Python version. We are only including 2.7 and 3.4+, the Python versions that are currently supported.

The libraries analyzed are ones that have over 10K stars on GitHub and are downloadable via pip. The contenders are: celery, django, flask, ipython, keras, mitmproxy, numpy, pandas, python-box, requests, scrapy, selenium, tensorflow, and tornado. (To be fair, numpy and python-box didn’t have 10K stars, but I used them in the script to make these graphics, so I gave them some spotlight too.)

As of January 2019, Python 3 downloads are eclipsing Python 2 by over 20%, with Python 3.6 alone accounting for over 39% of all downloads, almost matching Python 2.7’s total.

That is good, but not great, news. Thankfully Python 2 won’t just stop working at the end of this year, but those are rookie Python 3 numbers, and we’ve got to pump them up!

Of course, we have to remember this is a small subset of all downloads. Furthermore, pip downloads by themselves don’t tell the whole tale, but they do give us an idea of how things are going.

This is accomplished by using the PyPI BigQuery data and some SQL (adapted from Artem Golubin’s post about this from last year), then throwing it into matplotlib.

SELECT
  SUBSTR(details.python, 0, 3) as python_version,
  COUNT(*) as download_count
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -30, "DAY"),
    CURRENT_TIMESTAMP()
  )
WHERE
  details.installer.name = 'pip' and
  file.project = 'requests' -- change project name here
GROUP BY
  python_version
ORDER BY
  download_count DESC
LIMIT 100
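
For the plotting half, a minimal sketch looks something like the following. The numbers here are placeholders rather than real query results; they are just in the same shape the query hands back, (python_version, download_count) pairs for a single project:

import matplotlib.pyplot as plt

# Placeholder rows, shaped like the BigQuery output: (python_version, download_count)
results = [
    ("2.7", 1_200_000),
    ("3.6", 1_150_000),
    ("3.7", 450_000),
    ("3.5", 250_000),
    ("3.4", 80_000),
]

versions = [row[0] for row in results]
downloads = [row[1] for row in results]

plt.bar(versions, downloads)
plt.title("requests downloads by Python version (last 30 days)")
plt.ylabel("pip downloads")
plt.show()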

Library Brawl: Who are the Python 3 champs?

In each of these head-to-heads, we are going to compare two similar libraries and see how they are doing on the switch to Python 3.

Web Frameworks

The first two up are very popular web frameworks to develop in, Flask and Django.

It’s a dead heat! Both libraries are doing well at attracting developers with a fresh mindset.

Machine Learning

The most popular GitHub package by far was tensorflow, with over a hundred thousand stars. Here it’s paired against its younger brother keras, which actually depends on it (or other AI tools) to operate.

Machine learning needs to teach its developers how to update! It’s a sad day for AI.

Hacker vs Web Scraper

Okay, a man-in-the-middle proxy and a web scraper aren’t really directly comparable tools, but it’s still an interesting matchup.

With this duo I was surprised they didn’t correlate more closely. I was honestly expecting the mitm tool to have less Python 3 love, as a lot of “hacker” tools depend on the broken way Python 2 handles strings vs unicode and are thus hard to update.

Good job hackers, always keep your tool belt fresh! Scrapers….scrape it together.

Data Science

The last head-to-head is for the data scientists out there. You’ve got science in your name and numbers in your veins; you should be at the bleeding edge of tech!

Ouch, yinz need to get with the times.

Python Version Developers Use More Often

This is hard data to gather as an individual, so I’m going to have to cheat and just base this information off JetBrains’ yearly state of the ecosystem reports from 2017 and 2018.

In 2017, 53% of devs reported using Python 3 as their main language, which went up 22 points in 2018 to 75%. Based on those two data points, we can come to a crystal clear, no-doubt conclusion as to how many developers will be using Python 3 as their main language in 2019.

That’s right, based on the past two year trend, 97% of developers should be using Python 3 in 2019.
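
(For the curious, that “projection” is nothing fancier than extending the 2017-to-2018 jump one more year; a back-of-the-napkin sketch, not a real forecast.)

# Naive linear extrapolation of the JetBrains survey numbers
python3_share = {2017: 53, 2018: 75}  # % of devs using Python 3 as their main language

yearly_gain = python3_share[2018] - python3_share[2017]        # 22 points
projection_2019 = min(python3_share[2018] + yearly_gain, 100)  # capped at 100%

print(f"Projected Python 3 share in 2019: {projection_2019}%")  # 97%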

Okay, well, maybe not. But I personally expect that number to be over 90% by the time Python 2 is EOL, which is excellent news.

Operating System Default Language

OSes have the fun job of being in the crosshairs of everyone from desktop to server users, trying to figure out the right combo of what’s best for their users and for their own technology stack going forward. Every major Linux distribution agrees Python 3 is the way of the future and that they will need to switch over. The hard part is deciding when the change will impact users least and fit best with their own release cycle. This has caused lots of headaches over the years. So where do we stand now?

OS              Python Version
Windows 10      None
OSX 10.8        2.7
Debian 9        2.7
RedHat 8*       3.6
Fedora 29       3.7
Ubuntu 19.04*   3.7

(* denotes upcoming releases this year)

Windows has the easy stance of just saying “do it yourself” and Mac is, as usual, not bothering to innovate and just humming along until it breaks. Thankfully most Linux distros, which power the internet, are either already updated or updating this year. I haven’t seen for sure that Debian 10 will be released with Python 3, or that it will be out before year’s end, but I would be surprised if either were not true. Then there’s Arch Linux. Arch has had Python 3 as the standard for almost as long as it has existed, good boy!

Are we ready?

In all honesty, we are. We are far more prepared for this than the financial sector was for Y2K, and we all survived that. Sure, there are always going to be code bases that can’t update to the latest version easily, but that’s true across the entire software development world. Add to that the fact that the Python Software Foundation has given an extended eleven years, which has allowed even the slowest of companies ample time to migrate to Python 3.

Python 3 everywhere? Bring it on!


Stop using plus signs to concatenate strings!

In Python, using plus signs to concatenate strings together is one of the first things you learn, i.e. print("hello" + "world!"), and it should be one of the first things you stop using. Using plus signs to “add” strings together is inherently more error prone, messier and unprofessional. Instead you should be using .format() or f-strings.

Hunter – Artwork by Clara Griffith

Before diving into what’s really wrong with + plus sign concatenation, we are going to take a quick step back and look at the possible different ways to merge strings together in Python, so we can get a better understanding of where to use what.

Concatenating strings

Method      When to use                                      When to avoid
+           Never                                            Always
%           Legacy code, logging module                      Python 3+
format      Everywhere
f-string    Python 3.6+                                      When you need to escape characters inside the {}s (example below)
join        On an iterable (list, tuple, etc.) of strings
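
About that f-string caveat in the table: before Python 3.12 (which relaxed this rule), the expression part of an f-string can’t contain a backslash, and literal curly braces always have to be doubled, so .format() or a temporary variable is the easier route in those cases. A quick illustration:

names = ["Alice", "Bob"]

# f"{'\n'.join(names)}" is a SyntaxError on Python 3.6 - 3.11
print("{}".format("\n".join(names)))    # works everywhere

# Literal braces are escaped by doubling them, in f-strings and .format alike
print(f"{{braces}} around {names[0]}")  # prints: {braces} around Alice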

Here is a quick demo of each of those methods in action, using the same tuple of strings. For an already existing iterable of strings, join makes the most sense if you want the same character(s) between all of them. However, in most other cases join won’t be applicable, so we are going to ignore it for the rest of this post.

variables = ("these", "are", "strings")

print(" ".join(variables))
print("%s %s %s" % variables)
print("{} {} {}".format(*variables))
print(f"{variables[0]} {variables[1]} {variables[2]}")
print(variables[0] + " " + variables[1] + " " + variables[2])

# They all print "these are strings"

In many cases the words or strings you are concatenating won’t all live in the same structure, so even though the f-string here looks more cumbersome than the others, it wins out in simplicity in other scenarios. I honestly use f-strings more than anything else, but .format does have advantages we will look at later. Anyway, back to why using plus signs with strings is bad.

Errors lurking in the shadows

Consider the following code, which has four different perfectly working examples of string concatenation.

wait_time = "0.1"
time_amount = "seconds"

print("We are going to wait {} {}".format(wait_time, time_amount))

print(f"We are going to wait {wait_time} {time_amount}")

print("We are going to wait %s %s" % (wait_time, time_amount))

print("We are going to wait " + wait_time + " " + time_amount)

# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds

Everything works as expected, but wait, if we are going to put a time.sleep in there, it takes the wait time as a float. Let’s update that and add the sleep.

Concatenation TypeErrors

import time

wait_time = 0.1 # Changed from string to float
time_amount = "seconds"

print("We are going to wait {} {}".format(wait_time, time_amount))

print(f"We are going to wait {wait_time} {time_amount}")

print("We are going to wait %s %s" % (wait_time, time_amount))

print("We are going to wait " + wait_time + " " + time_amount)

time.sleep(wait_time)

print("All done!")


# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# Traceback (most recent call last):
#    print("We are going to wait " + wait_time + " " + time_amount)
# TypeError: can only concatenate str (not "float") to str

That’s right, the only method of string concatenation to break our code was using + plus signs. Here it was very obvious it was going to happen, but what about when you come back to your code a few weeks or months later? Or even worse, when you are using someone else’s code as a library and they do this? It can become quite an avoidable headache.

Formatting issues

Another issue that you will run into frequently when using plus signs is unclear formatting. It’s very easy to forget to add whitespace around variables when you aren’t working with a single string with replacement fields like every other method. What looks very similar will yield two different results:

print(f"{wait_time} {time_amount}")
print(wait_time + time_amount)

# 0.1 seconds
# 0.1seconds

Did you even notice we had that issue in the very first paragraph’s code? print("hello" + "world!")

Messy

This is the most subjective of my reasons to avoid it, but I personally think it becomes very unreadable compared to any of the other methods, as shown in the following example.

mixed_type_vars = {
    "a": "My",
    "b": 2056,
    "c": "bodyguards",
    "d": {"have": "feelings"}
}


def plus_string(variables):
    return variables["a"] + " " + str(variables["b"]) + \
           " " + variables["c"] + " " + str(variables["d"])


def format_string(variables):
    return "{a} {b} {c} {d}".format(**variables)


def percent_string(variables):
    return "%s %d %s %s" % (variables["a"], variables["b"], 
                            variables["c"], variables["d"])

print(plus_string(mixed_type_vars))
print(format_string(mixed_type_vars))
print(percent_string(mixed_type_vars))

String format is very powerful because it is a function, and can take positional or keyword args and replace them as such in the string. In the example above .format(**variables) is equivalent to

.format(a="My", b=2056, c="bodyguards", d={"have": "feelings"})

That way in the string you can reference them by their keywords (in this case single characters a through d).

"Thing string is {opinion} formatted".format(opinion="very nicely")

Which means with format you have a lot of options to make the string a lot more readable, or you can reuse positional or named variables easily.

print("{0} is not {1} but it is {0} just like "
      "{fruit} is not a {vegetable} but is a {fruit}"
      "".format(1, 2, fruit="apple", vegetable="potato"))

Slower string conversion

Using the functions from the Messy section, we can see that it is also slower when concatenating a mix of types.

import timeit
plus = timeit.timeit('plus_string(mixed_type_vars)',
                     number=1000000,
                     setup='from __main__ import mixed_type_vars, plus_string')

form = timeit.timeit('format_string(mixed_type_vars)',
                     number=1000000,
                     setup='from __main__ import mixed_type_vars, format_string')

percent = timeit.timeit('percent_string(mixed_type_vars)',
                        number=1000000,
                        setup='from __main__ import mixed_type_vars, percent_string')

print("Concatenating a mix of types into a string one million times:")
print(f"{plus:.04f} seconds - plus signs")
print(f"{form:.04f} seconds - string format")
print(f"{percent:.04f} seconds - percent signs")

# Concatenating a mix of types into a string one million times:
# 1.9958 seconds - plus signs
# 1.3123 seconds - string format
# 1.0439 seconds - percent signs

On my machine, percent signs were slightly faster than string format, but both smoked plus signs with their explicit str() conversions.
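
If you’re curious how f-strings stack up, you can add a fourth contender to the same benchmark (this assumes the functions and imports from the snippets above are in the same script); run it yourself rather than take my word for the ordering:

def fstring_string(variables):
    return (f"{variables['a']} {variables['b']} "
            f"{variables['c']} {variables['d']}")


fstr = timeit.timeit('fstring_string(mixed_type_vars)',
                     number=1000000,
                     setup='from __main__ import mixed_type_vars, fstring_string')

print(f"{fstr:.04f} seconds - f-string")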

Unprofessional

This isn’t only something to call out teammates on during code review; it can even negatively impact you if you’re applying for Python jobs. Using “+” everywhere for strings is a red flag that you are still a novice. I don’t know anyone personally who has been turned away over something so trivial, but it does show that you are unfamiliar with Python’s awesome, feature-rich strings and haven’t had a lot of experience coding in a group.

If you ever saw Batman or James Bond coding in Python, they wouldn’t be using +s in their string concatenation, and nor should you!

Summary

"If" + "👏" + "you" + "👏" +"use" + "👏" + "plus signs" + "👏" + "to" + 
"👏" + "concatenate"  + "👏" + "your"  + "👏" + "strings"  + "👏" + "you" 
 + "👏" + "are"  + "👏" + "more"  + "👏" + "annoying"  + "👏" + "than"  + 
"👏" + "this"  + "👏" + "meme!"

Truffle: going from ganache to testnet (ropsten)

Truffle is an amazing suite of tools created by Consensys for developing smart contracts on the Ethereum blockchain network. However, it can be a bit jarring to make the leap from local development to the real test network, ropsten.

Required Setup

For this walkthrough, I have Node.js (with npm) and the Truffle suite installed, along with Ganache for local development.

I will be using the default example Truffle project, MetaCoin, which you can learn how to unbox here, or you can follow along using your own project.

First things first: if you do NOT have a package.json file yet, make sure to run npm init. This will turn the directory into a node package that we can easily manage with the npm package manager, so we don’t have to download dependencies into the global package scope.

Now we can download all the things we are going to need:

npm install bip39 dotenv --save
  • bip39 – used to generate a wallet mnemonic
  • dotenv – a simple way to read environment variable files

That’s everything we need development-wise.

Storing Secrets outside the code

We will have to create a private key or mnemonic, and that means we need somewhere relatively secure to store it. For testnet stuff, this can be as simple as making sure it’s not put into version control alongside the code. To that end, we are going to use environment variables and store them in a file called .env (that’s it, basically just an extension. Make sure to add it to your .gitignore if you’re using git). To learn more, check out the github page for dotenv. For our purposes, all you need to know is that this file has the format:

ENV_VARIABLE_NAME=someting
ANOTHER_ENV=something else

Accessing testnet

The easiest way to reach out to testnet is by using a provider. I personally like using infura.io (free, just requires registration). After you register and have your API key emailed to you, make sure you select the URL for the test network and add it to the .env file under a variable named ROPSTEN_URL.

ROPSTEN_URL=https://ropsten.infura.io/<your-api-key>

It’s also possible to use your own geth node set to testnet, but that is not required.

Next we are going to create our own wallet. If you already have one set up, for example with MetaMask, you can skip this next part.

Creating your testnet wallet

So now that you have a place to put your secrets, let’s create some. This is where bip39 comes in: it will create a random mnemonic, a series of 12 random words, which can be used as the basis for a wallet’s private key.

We could put this generation in a file, but it’s easy enough to do straight from the command line:

node -e "console.log(require('bip39').generateMnemonic())"

This will output 12 words. DO NOT SHARE THESE ANYWHERE. The ones I am using below are examples, and also should NOT be used. Put them in the .env file as the variable MNEMONIC, so your .env file should now contain:

MNEMONIC=candy maple cake sugar pudding cream honey rich smooth crumble sweet treat
ROPSTEN_URL=https://ropsten.infura.io/<your-api-key>

We have our seed, so it’s time to hook it into our code. In your truffle.js or truffle-config.js file, you will now need to import the environment variables and a wallet provider at the top of the file.

require('dotenv').config()
const HDWalletProvider = require('truffle-hdwallet-provider')

After that is added, move down to the exports section, where we are going to add a new network named ropsten. It will use the HDWalletProvider, supplied with the mnemonic and the Infura URL provided via environment variables.

module.exports = {
  networks: {
    ropsten: {
      provider: () => new HDWalletProvider(
        process.env.MNEMONIC,
        process.env.ROPSTEN_URL),
      network_id: 3
    },
  },
}

Test and make sure everything’s working by opening a truffle console, specifying our new network.

truffle console --network ropsten

We can then get our public account address via the console.

truffle(ropsten)> web3.eth.getAccounts((err, accounts) => console.log(accounts))
[ '0x627306090abab3a6e1400e9345bc60c78a8bef57' ]

If you are seeing this same wallet address, you did it wrong. Go back and make your own mnemonic, don’t copy the candy one from above.

Funding the wallet

In your development environment, the wallet already has ETH in it to pay for gas and deploy the contract. On mainnet, you will have to buy some real ETH. On testnet, you can get some for free by using a faucet, such as https://faucet.ropsten.be/, or if you’re using MetaMask just use https://faucet.metamask.io/.

Make sure to give the faucet the address you gathered from the console, and soon you should have test funds to play around with and actually deploy your contract.

Deploying the Contract

Now is where the rubber meets the road: getting your contract out into the real (test) world.

truffle deploy --network ropsten

If everything is successful, you’ll get messages like these:

Using network 'ropsten'.

Running migration: 1_initial_migration.js
  Deploying Migrations...
  ... 0xefe70115c578c92bfa97154f70f9c3fbaa2b8400b1da1ee7cdxxxxxxxxxxxxxx
  Migrations: 0x6eeedefb64bd6ee6618ac54623xxxxxxxxxxxxxx
Saving successful migration to network...
  ... 0xd4294e35c166e2dca771ba9bf5eb3801bc1793e30db6a53d4dxxxxxxxxxxxxxx
Saving artifacts...
Running migration: 2_deploy_contracts.js
  Deploying Capture...
  ... 0x446d5e92d6976bb05c85bb95b243d6f7405af6bb12b3b6fe08xxxxxxxxxxxxxx
  Capture: 0x1d2f60c6ef979ca86f53af1942xxxxxxxxxxxxxx
Saving successful migration to network...
  ... 0x0b6f918ccc8e3b82cdf43038a2c32fe1fef66d0fa9aeb2260bxxxxxxxxxxxxxx
Saving artifacts...

Tada! You now have your custom contracts deployed to testnet!

Or you got an out-of-gas error. It is not uncommon to have to adjust the gas price to get onto the network, as Truffle does not automatically figure that out for you. A follow-up post will show how to calculate and adjust the gas price as needed.


Discover AWS State Machines using Python Lambdas for an ETL process

Step Functions, State Machines, and Lambdas oh my! AWS has really been expanding what you can do without needing to actually stand up any servers. I’m going to walk through a very basic example of how to get going with your own Python code to create an ETL (Extract Transform Load) process using Amazon’s services. And don’t worry, all this goodness is included in the free tier!

The goal of this exercise will be to have an aggregation of news headlines downloaded and transformed into CSV format and uploaded to another service. We are going to achieve this by breaking up each step of the process into its own AWS Lambda.

What are Lambdas?

AWS Lambdas are a “serverless”, stateless way to run snippets of code with no extra initialization or shutdown time.

When to use Lambdas

They are great if you have small, highly reusable pieces of code that serve a single purpose. (If you have a few that go together really well, that’s where state machines come in.) For example, maybe you have some code that does image recognition and you need to use it across multiple projects, or you just want it to run faster or be more accessible, as Lambdas have several ways they can be initiated, including via an API you can define.

They will NOT fit your purpose if you need something that does a multitude of tasks, runs for a long time, uses a lot of memory, or updates frequently.

Creating a Lambda

Creating your own is a lot easier than many other tutorials make it seem. If you haven’t already, sign up for an AWS account. Then open your AWS console and search for Lambda.

You’ll most likely be presented with a welcome screen; after clicking through “Get Started” (or whatever they’ve updated it to this month), you’ll have a screen where you can create new functions as well as check on existing ones.

See the big orange button that even Trump would be proud of? Click it.

As this is probably your first Lambda, you will have to create a new role. It’s super simple, and you don’t even have to leave the page. Just give it a new name and a policy template. I used the Simple microservice permissions as it seemed to fit the bill best.

Then you will be greeted with a page with a large amount of info and stuff going on. The part we are going to be most concerned about is the Function Code area (we will also need Environment Variables to store API keys in).

It may seem like we need to set up triggers or resources for this information to go to, but as we plan to use these inside a state machine, that will handle all that bother for us.

ETL – Extract Transform Load

Now that we know how to make a Lambda, let’s look at some code we could use with it. For the state machine we will create later, I want an entire process where I pull in information from an outside source (extract), modify it to fit my needs (transform), and then put it into my own system (load).

Extract

As stated above, this scenario involves pulling down data from a news source. In this case we are using News API, which allows you to create a free API key to grab top news headlines and links to their stories.

That code is dead simple:

import json
from urllib import request


def retrieve_news(key, source):
    url = f"https://newsapi.org/v2/top-headlines?sources={source}&apiKey={key}"
    with request.urlopen(url) as req:
        return json.loads(req.read())

print(retrieve_news(my_key, 'associated-press'))  # assuming my_key holds your News API key

If I wasn’t using this in a Lambda, I would be using the wonderful Requests module instead, but Python 3’s urllib is at least a lot better than Python 2’s.

So now we need a way for the Lambda function to call this code and pass along the results in a manner we can use later. On the code entry page, under Function Code, you’ll see a field that lists the Handler; this is the entry point to your code. lambda_function.lambda_handler is the default, which means it will use the function lambda_handler inside the file lambda_function.py as the entry.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import os
import json
from urllib import request


def retrieve_news(key, source):
    url = f"https://newsapi.org/v2/top-headlines?sources={source}&apiKey={key}"
    with request.urlopen(url) as req:
        return json.load(req)


# What AWS Lambda calls
def lambda_handler(event, context):
    key = os.getenv('NEWSAPI_KEY')
    if not key:
        raise Exception('Not all environment variables set')

    if not event.get('source'):  # .get avoids a KeyError if 'source' is missing entirely
        raise Exception('Source not set')

    return {'data': retrieve_news(key, event['source']),
            'source': event['source']}

There are two arguments passed into the function. The first is event, which is all the information sent to the Lambda function (if using a standard JSON object this will be a dictionary, as seen above). The second is context, an object that can tell you about the current Lambda invocation if necessary; you can learn more about it here, but it will not be used in this example.
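
The context object won’t be used here, but for reference, a purely illustrative handler that peeks at a few of the attributes the Lambda runtime provides on it could look roughly like this:

# Purely illustrative - a handler that only inspects the context object
def lambda_handler(event, context):
    print(context.function_name)                   # name of this Lambda
    print(context.memory_limit_in_mb)              # configured memory limit
    print(context.aws_request_id)                  # unique id for this invocation
    print(context.get_remaining_time_in_millis())  # milliseconds left before the timeout
    return {'ok': True}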

Testing the lambda

You may also notice that we are pulling the API key not from the event, but from an environment variable, so make sure to set that on the page as well. Last but not least, I would suggest increasing the timeout for the Lambda from the default 3 seconds to 10.

Before we go on and add the other functions, let’s make sure this one works properly. At the top of the page, in the drop-down beside the Test button and Actions on the right, click Configure test events; we are going to add a new one with the details that will be passed into the event dictionary.

{
  "source": "associated-press"
}

On the pop-up, copy in the above JSON and save it as a new test event.

Hit the test button at the top, and see the results. You should get a big green window that shows you how it ran. If you have a red error window, you will have to figure out what went wrong first.

Transform

This will be our second Lambda, so we get to go through the process of creating a new one again (you can use the existing role from the last one) and copying this code into it. No environment variables needed this time!

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import csv
from io import StringIO


# What AWS Lambda calls
def lambda_handler(event, context):

    sio = StringIO()
    writer = csv.writer(sio)
    writer.writerow(["Source", "Title", "Author", "URL"])
    for article in event['data']['articles']:
        writer.writerow([
            article['source']['name'],
            article['title'],
            article['author'],
            article['url']
        ])

    csv_content = sio.getvalue()
    print(csv_content)

    return {'data': csv_content,
            'source': event['source']}


The tricky part here is that now you need good test data for it. Luckily you can copy the output of the last Lambda (snippet provided below) to do just that.

{
  "data": {
    "status": "ok",
    "totalResults": 5,
    "articles": [
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "FRANCES D'EMILIO",
        "title": "Pope accepts resignation of McCarrick after sex abuse claims",
        "description": "VATICAN CITY (AP) — In a move described as unprecedented, Pope Francis has effectively stripped U.S. prelate Theodore McCarrick of his cardinal's title and rank following allegations of sexual abuse, including one involving an 11-year-old boy. The Vatican ann…",
        "url": "https://apnews.com/46e8e15911034e7f971c7542b60a6444",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:b5c82ad2f2b74b50ab9faccf51898309/2628.jpeg",
        "publishedAt": "2018-07-28T16:21:57Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "KEVIN FREKING",
        "title": "On trade policy, Trump is turning GOP orthodoxy on its head",
        "description": "WASHINGTON (AP) — President Donald Trump's trade policies are turning long-established Republican orthodoxy on its head, marked by tariff fights and now $12 billion in farm aid that represents the type of government intervention GOP voters railed against a de…",
        "url": "https://apnews.com/57cd042b57054e5790b9b444c561ac3b",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:90f04d837f514d0b984e25bd5153be8a/3000.jpeg",
        "publishedAt": "2018-07-28T16:20:11Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "SETH BORENSTEIN and FRANK JORDANS",
        "title": "Science Says: Record heat, fires worsened by climate change",
        "description": "Heat waves are setting all-time temperature records across the globe, again. Europe suffered its deadliest wildfire in more than a century, and one of nearly 90 large fires in the U.S. West burned dozens of homes and forced the evacuation of at least 37,000 p…",
        "url": "https://apnews.com/a4255779e2b6461b9cc8dbf24ea4b96c",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:f9b76dc0354e47caafcfad96c36443ca/3000.jpeg",
        "publishedAt": "2018-07-28T15:03:01Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "MICHAEL KUNZELMAN and LARRY NEUMEISTER",
        "title": "No mystery to Supreme Court nominee Kavanaugh's gun views",
        "description": "SILVER SPRING, Md. (AP) — Supreme Court nominee Brett Kavanaugh says he recognizes that gun, drug and gang violence \"has plagued all of us.\" Still, he believes the Constitution limits how far government can go to restrict gun use to prevent crime. As a federa…",
        "url": "https://apnews.com/c8fc0785b429497abf9621efcdb345e8",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:4c3619ea948b4c91b8f2fcdd50162d26/3000.jpeg",
        "publishedAt": "2018-07-28T14:11:06Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "HOPE YEN, JOSH BOAK and CHRISTOPHER RUGABER",
        "title": "AP FACT CHECK: Trump's hyped claims on economy, NKorea, vets",
        "description": "WASHINGTON (AP) — President Donald Trump received positive economic news this past week and twisted it out of proportion. That impulse ran through days of rhetoric as he hailed the success of a veterans program that hasn't started and saw progress with North …",
        "url": "https://apnews.com/5b405824a9d843a09a641754d84aa1ab",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:636c2c3068b94181ba3c5bcb8d2a3ae9/3000.jpeg",
        "publishedAt": "2018-07-28T12:30:33Z"
      }
    ]
  },
  "source": "associated-press"
}

Configure and run the test like before using the above data.

In this case I also printed the output so you could see that any standard output is captured by the logs.

Load

Now to actually submit this data to a server. You could set up your own, or use file.io, a free file-dropping service, as the code below does. No API key needed!

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from urllib import request, parse
import json


# What AWS Lambda calls
def lambda_handler(event, context):
    url = 'https://file.io'

    encoded_args = parse.urlencode({'text': event['data']}).encode('utf-8')

    with request.urlopen(url, encoded_args) as req:
        info = json.load(req)

    return {'data': info, 'source': event['source']}

Again, as this is reaching out to an external API, I would increase the Lambda’s timeout from the default 3 seconds to 10.

Woo! We now have three Lambdas that can each take the previous one’s output and together perform a full ETL process. Now let’s put them together.

State Machines

AWS Step Functions allow you to chain a set of actions together, which are then presented in a pretty auto-generated graph. Back at the console, find Step Functions.

Then create a new state machine.

This is probably the hardest part: the actual state machine definition. The states language can be confusing; thankfully, for our needs we don’t have to do anything complicated.

You can use this code, and will just have to update the actual Resource links under Extract, Transform and Load. (You can even click on them and should be presented with a drop down of your previously created resources so you don’t have to copy the ARNs manually.)

{
  "StartAt": "Set Source",
  "States": {
    "Set Source": {
      "Type": "Pass",
      "Result": {"source": "associated-press"},
      "ResultPath": "$",
      "Next": "Extract"
    },
    "Extract": {
      "Type": "Task",
      "Resource": "<ARN>:function:google-news-extract",
      "ResultPath": "$",
      "Next": "Transform"
    },
    "Transform": {
      "Type": "Task",
      "Resource": "<ARN>:function:google-news-transform",
      "ResultPath": "$",
      "Next": "Load"
    },
     "Load": {
      "Type": "Task",
      "Resource": "<ARN>:function:google-news-load",
      "ResultPath": "$",
      "End": true
    }
  }
}

Notice the first step is not a task, but rather a pass-through state that sets the source. We could have done this during initialization, but I wanted to highlight the ability to add information where needed.

After creation, we will need to start a new execution. It doesn’t need any input, but doesn’t hurt to include a comment if you want.

Then run it!


While an execution is in progress, the graph shows what has run successfully, what is currently in progress, and what has errored. At any time, you can click on a specific block to see what its inputs and outputs were.

This state machine can then be run whenever you want to kick off the full ETL process!
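
If you ever want to start an execution from code rather than the console, a minimal boto3 sketch looks roughly like this (the state machine ARN is a placeholder you’d swap for your own):

import json

import boto3

sfn = boto3.client('stepfunctions')

response = sfn.start_execution(
    stateMachineArn='<your-state-machine-ARN>',  # placeholder
    input=json.dumps({})  # our machine sets its own source, so no input is needed
)
print(response['executionArn'])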

Scheduling

For a process like this, you want to run it on a schedule. That means creating a new CloudWatch rule. Search for CloudWatch in the console, then click on Rules on the left hand side.

Then, click the big blue button.

It’s pretty simple to create a fixed rate schedule, and then just make sure to select the right state machine on the right side!
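
The console handles all the wiring for you, but for completeness, a rough boto3 equivalent of a fixed-rate rule could look like this (the rule name, ARNs, and role are placeholders, and the role must be allowed to start executions on the state machine):

import boto3

events = boto3.client('events')

# Create (or update) a rule that fires once a day
events.put_rule(
    Name='daily-news-etl',
    ScheduleExpression='rate(1 day)',
    State='ENABLED'
)

# Point the rule at the state machine
events.put_targets(
    Rule='daily-news-etl',
    Targets=[{
        'Id': 'news-etl-state-machine',
        'Arn': '<your-state-machine-ARN>',             # placeholder
        'RoleArn': '<role-that-can-start-executions>'  # placeholder
    }]
)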


Uploading large files by chunking – featuring Python Flask and Dropzone.js

It can be a real pain to upload huge files. Many services limit their upload sizes to a few megabytes, and you don’t want a single connection open forever either. The super simple way to get around that is to send the file in lots of small parts, aka chunking.

Chunking Food – Artwork by Clara Griffith

Finished code example can be downloaded here.

So there are going to be two parts to making this work: the front end (website) and the back end (server). Let’s start with what the user will see.

Webpage with Dropzone.js

Beautiful, ain’t it? The best part is, the code powering it is just as succinct.

<!doctype html>
<html lang="en">
<head>

    <meta charset="UTF-8">

    <link rel="stylesheet" 
     href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.css"/>

    <link rel="stylesheet" 
     href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/basic.min.css"/>

    <script type="application/javascript" 
     src="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.js">
    </script>

    <title>File Dropper</title>
</head>
<body>

<form method="POST" action='/upload' class="dropzone dz-clickable" 
      id="dropper" enctype="multipart/form-data">
</form>


</body>
</html>

This is using the dropzone.js library, which has no additional dependencies and decent CSS included. All you have to do is add the class “dropzone” to a form and it automatically turns it into one of their special drag and drop fields (you can also click and select).

However, by default, dropzone does not chunk files. Luckily, it is really easy to enable. We are going to add some custom JavaScript and insert it between the form and the end of the body:

</form>

<script type="application/javascript">
    Dropzone.options.dropper = {
        paramName: 'file',
        chunking: true,
        forceChunking: true,
        url: '/upload',
        maxFilesize: 1025, // megabytes
        chunkSize: 1000000 // bytes
    }
</script>

</body>

When enabling chunking, it will break up any files larger than the chunkSize and send them to the server over multiple requests. It accomplishes this by adding form data that has information about the chunk (uuid, current chunk, total chunks, chunk size, total size). By default, anything under that size will not have that information sent as part of the form data, and the server would have to have an additional logic path. Thankfully, there is the forceChunking option, which will always send that information, even if it’s a smaller file. Everything else is pretty self-explanatory, but if you want more details about the possible options, just check out their list of configuration options.

Python Flask Server

Onto the backend. I am going to be using Flask, which is currently the most popular Python web framework (by GitHub stars); other good options include Bottle and CherryPy. If you hate yourself or your colleagues, you could also use Django or Pyramid. There are a ton of good example Flask projects and boilerplates to start from. I am going to use one that I have created for my own use that fits my needs, but don’t feel obligated to use it.

This type of upload will work with any real website back end. You will simply need two routes: one that displays the frontend, and one that accepts the file upload. First, let’s just view what dropzone is sending us. In this example my project is called ‘pydrop’, and if you’re using my FlaskBootstrap code, this is the views/templated.py file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import logging
import os

from flask import render_template, Blueprint, request, make_response
from werkzeug.utils import secure_filename

from pydrop.config import config

blueprint = Blueprint('templated', __name__, template_folder='templates')

log = logging.getLogger('pydrop')


@blueprint.route('/')
@blueprint.route('/index')
def index():
    # Route to serve the upload form
    return render_template('index.html',
                           page_name='Main',
                           project_name="pydrop")


@blueprint.route('/upload', methods=['POST'])
def upload():
    # Route to deal with the uploaded chunks
    log.info(request.form)
    log.info(request.files)
    return make_response(('ok', 200))

Run the flask server and upload a small file (under the size of the chunk limit). It should log a single instance of a POST to /upload:

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
     ('dzuuid', '807f99b7-7f58-4d9b-ac05-2a20f5e53782'), 
     ('dzchunkindex', '0'), 
     ('dztotalfilesize', '1742'), 
     ('dzchunksize', '1000000'), 
     ('dztotalchunkcount', '1'), 
     ('dzchunkbyteoffset', '0')])

[INFO] pydrop: ImmutableMultiDict([
     ('file', <FileStorage: 'README.md' ('application/octet-stream')>)])

Let’s break down the information we are getting:

dzuuid – Unique identifier of the file being uploaded

dzchunkindex – Which block number we are currently on

dztotalfilesize – The entire file’s size

dzchunksize – The max chunk size set on the frontend (note this may be larger than the actual chunk’s size)

dztotalchunkcount – The number of chunks to expect

dzchunkbyteoffset – The offset within the file where this chunk’s data should be written

Next, let’s upload something just a bit larger that will require it to be chunked into multiple parts:

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'), 
    ('dzchunkindex', '0'), 
    ('dztotalfilesize', '1191708'), 
    ('dzchunksize', '1000000'), 
    ('dztotalchunkcount', '2'), 
    ('dzchunkbyteoffset', '0')])

[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')>)])



[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'), 
    ('dzchunkindex', '1'),
    ('dztotalfilesize', '1191708'),  
    ('dzchunksize', '1000000'), 
    ('dztotalchunkcount', '2'), 
    ('dzchunkbyteoffset', '1000000')])

[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')>)])

Notice how /upload has been called twice, and that dzchunkindex and dzchunkbyteoffset have been updated accordingly. That means our upload function has to be smart enough to handle both brand new uploads and uploads already in progress. For chunks of an existing upload we should open the existing file and only write data after what is already in it, whereas for new uploads we will create the file and start at the beginning. Luckily, both can be accomplished by opening the file with the same code: first open the file in append mode, then ‘seek’ to the end of the current data (in this case we are relying on the offset provided by dropzone).

@blueprint.route('/upload', methods=['POST'])
def upload():
    # Remember the paramName was set to 'file', we can use that here to grab it
    file = request.files['file']

    # secure_filename makes sure the filename isn't unsafe to save
    save_path = os.path.join(config.data_dir, secure_filename(file.filename))

    # We need to append to the file, and write as bytes
    with open(save_path, 'ab') as f:
        # Goto the offset, aka after the chunks we already wrote 
        f.seek(int(request.form['dzchunkbyteoffset']))
        f.write(file.stream.read())
       
    # Giving it a 200 means it knows everything is ok
    return make_response(('Uploaded Chunk', 200))

At this point you should have a working upload script, tada!

But let’s beef this up a little bit. The following code improvements make it so we don’t overwrite existing files that have already been uploaded, check that the file size matches what we expect when we’re done, and give a little more output along the way.

@blueprint.route('/upload', methods=['POST'])
def upload():
    file = request.files['file']

    save_path = os.path.join(config.data_dir, secure_filename(file.filename))
    current_chunk = int(request.form['dzchunkindex'])

    # If the file already exists it's ok if we are appending to it,
    # but not if it's new file that would overwrite the existing one
    if os.path.exists(save_path) and current_chunk == 0:
        # 400 and 500s will tell dropzone that an error occurred and show an error
        return make_response(('File already exists', 400))

    try:
        with open(save_path, 'ab') as f:
            f.seek(int(request.form['dzchunkbyteoffset']))
            f.write(file.stream.read())
    except OSError:
        # log.exception will include the traceback so we can see what's wrong 
        log.exception('Could not write to file')
        return make_response(("Not sure why,"
                              " but we couldn't write the file to disk", 500))

    total_chunks = int(request.form['dztotalchunkcount'])

    if current_chunk + 1 == total_chunks:
        # This was the last chunk, the file should be complete and the size we expect
        if os.path.getsize(save_path) != int(request.form['dztotalfilesize']):
            log.error(f"File {file.filename} was completed, "
                      f"but has a size mismatch."
                      f"Was {os.path.getsize(save_path)} but we"
                      f" expected {request.form['dztotalfilesize']} ")
            return make_response(('Size mismatch', 500))
        else:
            log.info(f'File {file.filename} has been uploaded successfully')
    else:
        log.debug(f'Chunk {current_chunk + 1} of {total_chunks} '
                  f'for file {file.filename} complete')

    return make_response(("Chunk upload successful", 200))

Now let’s give this a try:

[DEBUG] pydrop: Chunk 1 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 2 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 3 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 4 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 5 of 6 for file DSC_0051-1.jpg complete
[INFO] pydrop: File DSC_0051-1.jpg has been uploaded successfully

Sweet! But wait, what if we remove the directories where the files are stored? Or try to upload the same file again?

(Dropzone’s text out of the box is a little hard to read, but it says “File already exists” on the left and “Not sure why, but we couldn’t write the file to disk” on the right. Exactly what we’d expect.)

2018-05-28 14:29:19,311 [ERROR] pydrop: Could not write to file
Traceback (most recent call last):
    ....
FileNotFoundError: [Errno 2] No such file or directory:

We get error messages on the webpage and in the logs. Perfect.

I hope you found this information useful and if you have any suggestions on how to improve it, please let me know!

Thinking further down the road

In the long term I would have a database or some other permanent storage option to keep track of file uploads. That way you could see if one fails or stops halfway and be able to remove incomplete ones. I would also save files into a temp directory based off their UUID first, then, when complete, move them to a location based off their file hash. It would also be nice to have a page to see everything uploaded and manage directories or other options, or even password-protected uploads.
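
As a rough sketch of that last idea (the function name, directory layout, and arguments are all made up for illustration), chunks could be written under a temp directory keyed by the upload’s UUID, and the finished file moved into place under its hash once everything checks out:

import hashlib
import os
import shutil


def finalize_upload(temp_dir, upload_uuid, original_name, final_dir):
    """Move a completed upload from its UUID-keyed temp location
    to a permanent path named after the file's SHA-256 hash."""
    temp_path = os.path.join(temp_dir, upload_uuid, original_name)

    # Hash the file in 1 MB blocks so huge uploads don't blow up memory
    sha256 = hashlib.sha256()
    with open(temp_path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            sha256.update(block)

    final_path = os.path.join(final_dir, sha256.hexdigest())
    os.makedirs(final_dir, exist_ok=True)
    shutil.move(temp_path, final_path)

    # Clean up the now-empty temp directory for this upload
    shutil.rmtree(os.path.join(temp_dir, upload_uuid), ignore_errors=True)
    return final_path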