Chris Griffith

Personal Cloud Media Server – Encrypted, Streamable, Affordable…Possible?

Is it possible to build a personal media server that is hosted in the cloud, while making privacy, security, and accessibility paramount? I wanted to find out, and this post will dive into the options available to achieve such a possibility. (Spoiler: I did end up making my own software to do just this!)

First, of course, is the question of why even try this when other options already exist. For example, Plex and Subsonic are some great options if you want to host a media server from your own home. The catch is that you then need good upload speeds, storage space, and an always-running server or NAS, and you still have to be concerned about how private your data really is. Because at the end of the day these are companies, not just software, and they are beholden to the requests of government agencies. They also have all user data in a single, potentially hackable, silo.

Cost Breakdown

So first, fast upload speeds. If you got it, you’re golden, but if your ISP doesn’t offer high enough speeds, you’re screwed. And even if you do have fast upload speeds you now need a server with hard drives that is always being fed electricity. Time to figure out what it costs to remove that need entirely.

On the flip side, a media server is pretty simple, logistics wise. You need a server to host the web page or API, and a storage provider. These could even be the same thing; however, I have yet to find a price-conscious option that includes both. Instead, for my personal needs, I priced out the difference between buying a 2-bay NAS (as it offers low electricity use and data redundancy) to use as a local server, and continually paying for an online one storing roughly 2TB of data (BackBlaze for storage, DigitalOcean for the webserver).

Local           Low    High     Cloud        Low    High
2-Bay NAS       $150   $300     2TB Storage  $140   $600
2TB HDDs        $70    $130     Web server   $25    $80
Electric        $5     $30
Internet        $0     $50

Immediate Cost  $220   $430
Yearly Cost     $5     $80                   $165   $680

And the numbers speak for themselves. It is much, much cheaper, and more viable, to buy a NAS and use the established software. The low end of the yearly cloud costs would match the cost of the high end NAS after only five years. Only an idiot with a paranoiac need for total control and security would even think about making their own software and paying so much to host it online.
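The five-year figure is a simple break-even calculation. Here is a rough sketch using the numbers from the table (the function name is mine, purely for illustration):

```python
def break_even_years(upfront_cost, yearly_cost, alt_yearly_cost):
    """Years until paying alt_yearly_cost every year overtakes
    upfront_cost plus yearly_cost every year."""
    yearly_difference = alt_yearly_cost - yearly_cost
    if yearly_difference <= 0:
        raise ValueError("The alternative never costs more per year")
    return upfront_cost / yearly_difference


# High-end NAS ($430 up front, $80 a year) vs. low-end cloud ($165 a year)
high_nas_vs_low_cloud = break_even_years(430, 80, 165)
print(f"{high_nas_vs_low_cloud:.1f} years")  # 5.1 years
```

Swap in your own quotes for hardware and hosting to see where your break-even lands.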

I naturally started working on the architecture for how to build my cloud media player after the price analysis.

Design

Having content that is both easily streamable and encrypted is a doozy. It wouldn’t really have been possible for the his and hers at home a few years ago. But thanks to MPEG DASH and HLS, we now have video formats with those features built in!

HLS is far more common, but it is a proprietary format developed by Apple and doesn’t have nearly the same feature set as DASH. (Note: Apple should rename their company to Sour Apple, because they refuse to support the internationally standardized DASH format because they hate competition.) So for my own purposes, I chose MPEG DASH.

The real downside to either of these formats, though, is that you now have to re-encode all your videos before uploading them, ugh. But it really can’t be helped, and at least it standardizes your library. After figuring out a bit more of how DASH works, I created a super basic structure I wanted to follow:

The webserver needs to allow for finding and playing the encrypted movies. DASH supports multiple DRM methods, but the best option for a home user is ‘cleartext DRM’ aka a password. Well, you don’t want to store raw passwords in the database, so that means anything in there also has to be encrypted. Oh, and if you really want the storage provider to have no clue what’s on there if they scan your stuff, that means subtitles and cover files need to be encrypted too. Oy.
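For the “don’t store raw passwords” part, a common approach is to keep only a salted hash and re-derive it at login. This is a minimal sketch using Python’s standard library; it is not how Zabava (a Node app) does it, and the function names and iteration count are my own assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor, tune for your hardware


def hash_password(password, salt=None):
    """Return (salt, digest) so only these are stored, never the password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-derive the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong password", salt, stored))  # False
```

Hashing only covers checking the password. Encrypting the titles, subtitles and covers is a separate step that needs a real cipher from a vetted library.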

But I really wanted to see if this was possible and learn this new tech, so I plowed on. I was also working heavily with JavaScript at work, so I wrote it to use Node instead of my beloved Python. Two weeks, forty dependencies and a job change later, I had ZABAVA!

Solution?

Zabava translates to “fun” or “entertainment” from Bosnian / Czech / Croatian (and maybe more?) and I thought it was a cool sounding word. Every time I say it aloud, I imitate Jim Carrey from Ace Ventura saying “shikaka”. (No idea if that’s correct, but no one’s stopped me yet.)

It has user authentication with JWT tokens. Though, admittedly, just a single user right now with admin rights.

It of course supports editing video information and changing the cover file.

And has a script that allows for automatically converting videos to DASH, adding them to the DB and uploading them to the storage provider. I only designed the backend for BackBlaze B2 currently, as that is what I use, but it has a fairly agnostic provider setup to allow easy creation of others.

Sexy right? I am quite proud I was able to get nearly everything to work as envisioned.

Of course, not all fairy tales have the ending we imagine. Some videos still don’t like being converted or played via DASH format and it takes forever and a day to convert and upload terabytes of media files. The code is also not up to my personal quality standard, as a lot was written while figuring out how the tech worked without consideration to overall architecture.

In the end, after uploading all my media and not using it for a few months, I stopped work on the project and have bought a home NAS anyways (as I needed some solution for tons of other files as well).

I may go back and refactor it some weekend when I am in a crazy mood, but I don’t think it will ever fit my personal standard for code quality. However, if you are interested in the code and working on it yourself, you can find it on my GitHub.

Summary

I learned a lot about building a streaming site and different security methodologies for it, so even though I wouldn’t qualify the code as a success, it’s surely a personal win.

If I were to do this again I would of course do things a little differently:

  • Try using HLS instead of DASH for wider support
  • Write the backend in Python instead of Node
  • SQL instead of Mongo (in my defense I was using Node at the time)

So, let’s go over the criteria again:

  • Encrypted – Yes!
    • Everything was secure
  • Streamable – Yes!
    • Some conversions might have issues until tweaked right, but the majority of the content worked as expected.
  • Affordable – No
    • The NAS that I ended up buying cost more than streaming the media, but it was used for a lot more than just the few VHS backups I have. So not the cheapest option, but not a bank breaker.
    • Oh wait, factor in the time spent building the new app… yeah, it ain’t cheap.

As expected, you can’t have all the perks with no downsides, but if Security and Accessibility are your goals more than cost, something like this might interest you.

Is the world ready for Python 3?

The trek from Python 2 to Python 3 has been drawn-out, arduous and fraught with perils. How close are our dear Knights (er, developers) to all reaching the long sought glory of Python 3?

Quest for the Python 3 – Artwork by Clara Griffith (Link may contain NSFW art)

PIP Downloads

Let’s first jump into what is being used the most currently. This data examines fifteen different libraries downloaded via PIP for a particular Python version. We are only including 2.7 and 3.4+, the Python Versions that are currently supported.

The libraries analyzed are ones that have over 10K stars on github and have been downloaded via PIP. The contenders are: celery, django, flask, ipython, keras, mitmproxy, numpy, pandas, python-box, requests, scrapy, selenium, tensorflow, and tornado. (To be fair, numpy and python-box didn’t have 10K stars, but I used them in the script to make these graphics, so gave them some spotlight too.)

As of January 2019, Python 3 downloads are eclipsing Python 2 by over 20% with Python 3.6 bringing over 39% of it, almost directly matching Python 2.7’s total.

That is good, but not great news. Thankfully Python 2 won’t just stop working at the end of this year, but those are rookie Python 3 numbers, we got to pump them up!

Of course, we have to remember this is a small subset of all downloads. Additionally, pip downloads by themselves don’t tell the whole tale, but this does give us an idea of how things are going.

This is accomplished by using the PyPI BigQuery data and some SQL (adapted from Artem Golubin’s post about this from last year), then throwing it into matplotlib.

SELECT
  SUBSTR(details.python, 0, 3) as python_version,
  COUNT(*) as download_count
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -30, "DAY"),
    CURRENT_TIMESTAMP()
  )
WHERE
 details.installer.name='pip' and
 file.project = 'requests' -- change project name here
GROUP BY
  python_version
ORDER BY
  download_count DESC
LIMIT 100
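
Once the query returns its version/count rows, turning them into percentages like the ones quoted earlier is a small aggregation. Here is a sketch of that step; the rows are made-up placeholder numbers, not the real PyPI results, and the function name is hypothetical.

```python
from collections import defaultdict


def version_shares(rows):
    """Map each Python version to its percentage of all downloads."""
    totals = defaultdict(int)
    for python_version, download_count in rows:
        totals[python_version] += download_count
    grand_total = sum(totals.values())
    return {version: 100 * count / grand_total
            for version, count in totals.items()}


# Placeholder rows shaped like the query output (python_version, download_count)
example_rows = [("2.7", 400), ("3.6", 390), ("3.7", 150), ("3.5", 60)]

for version, share in sorted(version_shares(example_rows).items()):
    print(f"Python {version}: {share:.1f}%")
```

The resulting dictionary is the kind of thing you can hand straight to a matplotlib bar or pie chart.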

Library Brawl: Who’s the Python 3 champs?

In this head to head, we are going to compare two similar libraries, and see how they are doing on the switch to Python 3.

Web Frameworks

The first two up are very popular web frameworks to develop in, Flask and Django.

It’s a dead heat! Both libraries are doing well at attracting developers with a fresh mindset.

Machine Learning

The most popular github package by far was tensorflow with over a hundred thousand stars. Here it’s paired against its younger brother keras, which actually depends on it (or other AI tools) to operate.

Machine learning needs to teach its developers how to update! It’s a sad day for AI.

Hacker vs Web Scraper

Okay, not really directly comparable tools with a man-in-the-middle proxy and a web scraper, but it’s still an interesting match up.

With this duo I was surprised they didn’t have a higher correlation. I was honestly expecting the mitm tool to have less Python 3 love, as a lot of “hacker” tools depend on the broken way Python 2 handles strings vs unicode, thus are hard to update.

Good job hackers, always keep your tool belt fresh! Scrapers….scrape it together.

Data Science

The last head to head is for the data scientists out there, and if you’ve got science in your name and numbers in your veins, you should be at the bleeding edge of tech!

Ouch, yinz need to get with the times.

Python Version Developers Use More Often

This is some hard to gather data as an individual, so I’m going to have to cheat and just base this information off JetBrains’ yearly state of the ecosystem reports from 2017 and 2018.

In 2017, 53% of devs reported using Python 3 as their main language, which went up 22 percentage points in 2018 to 75%. Based on those two points of data, we can come to a crystal clear, no doubt conclusion about how many developers will be using Python 3 as their main language in 2019.

That’s right, based on the past two year trend, 97% of developers should be using Python 3 in 2019.
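
The (tongue-in-cheek) math behind that number is just a straight line drawn through two points:

```python
def extrapolate_next(previous, current):
    """Assume the same change happens again next year (naive linear trend)."""
    return current + (current - previous)


# Percent of devs mainly on Python 3, per the survey figures quoted above
usage = {2017: 53, 2018: 75}
print(extrapolate_next(usage[2017], usage[2018]))  # 97
```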

Okay, well, maybe not. But I personally expect that number to be over 90% by the time Python 2 is EOL, which is excellent news.

Operating System Default Language

OSes have a fun time of being in the cross hairs of everyone from desktop to server users, trying to figure out the right combo of what’s best for their users and for their own technology stack going forward. Every major Linux distribution agrees Python 3 is the way to the future and they will need to change over. The hard part is deciding when the change will impact users least and fit best with their own release cycle. This has caused lots of headaches over the years. So where do we stand now?

OS               Python Version
Windows 10       None
OSX 10.8         2.7
Debian 9         2.7
RedHat 8*        3.6
Fedora 29        3.7
Ubuntu 19.04*    3.7

(* denotes upcoming releases this year)

Windows has the easy stance of just saying “do it yourself” and Mac is, as usual, not bothering to innovate and will just hum along until it breaks. Thankfully most Linux distros, which power the internet, are either already updated or updating this year. I haven’t seen for sure that Debian 10 will be released with Python 3 or that it will be out before year’s end, but I would be surprised if either were not true. Then there’s Arch Linux. Arch has had Python 3 as the standard for almost as long as it existed, good boy!

Are we ready?

In all honesty, we are. We are far more prepared for this than the financial sector was for Y2K, and we all survived that. There are always going to be code bases that can’t update to the latest version easily, but that’s true across the entire software development world. Add to that the fact that the Python Software Foundation has given an extended eleven years, which has allowed even the slowest of companies ample time to migrate to Python 3.

Python 3 everywhere? Bring it on!


Stop using plus signs to concatenate strings!

In Python, using plus signs to concatenate strings together is one of the first things you learn, e.g. print("hello" + "world!"), and it should be one of the first things you stop using. Using plus signs to “add” strings together is inherently more error prone, messier and less professional. Instead you should be using .format() or f-strings.

Hunter – Artwork by Clara Griffith

Before diving into what’s really wrong with + plus sign concatenation, we are going to take a quick step back and look at the possible different ways to merge strings together in Python, so we can get a better understanding of where to use what.

Concatenating strings

Method     When to use                                    When to avoid
+          Never                                          Always
%          Legacy code, logging module                    Python 3+
format     Everywhere
f-string   Python 3.6+                                    When you need to escape characters inside the {}s
join       On an iterable (list, tuple, etc) of strings

Here is a quick demo of each of those methods in action using the same tuple of strings. For an already existing iterable of strings, join makes the most sense if you want them to have the same character(s) between all of them. However, in most other cases join won’t be applicable so we are going to ignore it for the rest of this post.

variables = ("these", "are", "strings")

print(" ".join(variables))
print("%s %s %s" % variables)
print("{} {} {}".format(*variables))
print(f"{variables[0]} {variables[1]} {variables[2]}")
print(variables[0] + " " + variables[1] + " " + variables[2])

# They all print "these are strings"
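
The table lists the logging module as the one place % style still earns its keep. That is because logging takes the format string and its arguments separately, and only merges them if the message is actually going to be emitted. A small sketch (the logger name is arbitrary):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
log = logging.getLogger("concat_demo")

variables = ("these", "are", "strings")

# Pass the arguments separately; logging does the % merge itself
log.info("%s %s %s", *variables)

# DEBUG is below the configured level, so this string is never built
log.debug("%s %s %s", *variables)
```

Pre-formatting the message with an f-string or .format() still works, but it does the string building even when the message is thrown away.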

In many cases the words or strings you are concatenating won’t all live in the same structure, so even though something like f-strings looks more cumbersome than the others here, it wins out in simplicity in other scenarios. I honestly use f-strings more than anything else, but .format does have advantages we will look at later. Anyways, back to why using plus signs with strings is bad.

Errors lurking in the shadows

Consider the following code, which has four different perfectly working examples of string concatenation.

wait_time = "0.1"
time_amount = "seconds"

print("We are going to wait {} {}".format(wait_time, time_amount))

print(f"We are going to wait {wait_time} {time_amount}")

print("We are going to wait %s %s" % (wait_time, time_amount))

print("We are going to wait " + wait_time + " " + time_amount)

# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds

Everything works as expected, but wait, if we are going to put a time.sleep in there, it takes the wait time as a float. Let’s update that and add the sleep.

Concatenation TypeErrors

import time

wait_time = 0.1 # Changed from string to float
time_amount = "seconds"

print("We are going to wait {} {}".format(wait_time, time_amount))

print(f"We are going to wait {wait_time} {time_amount}")

print("We are going to wait %s %s" % (wait_time, time_amount))

print("We are going to wait " + wait_time + " " + time_amount)

time.sleep(wait_time)

print("All done!")


# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# We are going to wait 0.1 seconds
# Traceback (most recent call last):
#    print("We are going to wait " + wait_time + " " + time_amount)
# TypeError: can only concatenate str (not "float") to str

That’s right, the only method of string concatenation to break our code was using + plus signs. Now here it was very obvious it was going to happen. But what about going back to your code a few weeks or months later? Or even worse, if you are using someone else’s code as a library and they do this. It can become quite an avoidable headache.

Formatting issues

Another common issue that you will run into frequently using plus signs is unclear formatting. It’s very easy to forget to add white space around variables when you aren’t using a single string with replace characters like every other method. What can look very similar will yield two different results:

wait_time = "0.1"  # a string again for this example
print(f"{wait_time} {time_amount}")
print(wait_time + time_amount)

# 0.1 seconds
# 0.1seconds

Did you even notice we had that issue in the very first paragraph’s code? print("hello" + "world!")

Messy

This is the most subjective of my reasons to avoid it, but I personally think it becomes very unreadable compared to any other methods, as shown with the following example.

mixed_type_vars = {
    "a": "My",
    "b": 2056,
    "c": "bodyguards",
    "d": {"have": "feelings"}
}


def plus_string(variables):
    return variables["a"] + " " + str(variables["b"]) + \
           " " + variables["c"] + " " + str(variables["d"])


def format_string(variables):
    return "{a} {b} {c} {d}".format(**variables)


def percent_string(variables):
    return "%s %d %s %s" % (variables["a"], variables["b"], 
                            variables["c"], variables["d"])

print(plus_string(mixed_type_vars))
print(format_string(mixed_type_vars))
print(percent_string(mixed_type_vars))

String format is very powerful because it is a function, and can take positional or keyword args and replace them as such in the string. In the example above .format(**variables) is equivalent to

.format(a="My", b=2056, c="bodyguards", d={"have": "feelings"})

That way in the string you can reference them by their keywords (in this case single characters a through d).

"This string is {opinion} formatted".format(opinion="very nicely")

Which means with format you have a lot of options to make the string a lot more readable, or you can reuse positional or named variables easily.

print("{0} is not {1} but it is {0} just like "
      "{fruit} is not a {vegetable} but is a {fruit}"
      "".format(1, 2, fruit="apple", vegetable="potato"))

Slower string conversion

Using the functions from the Messy section we can see that it is also slower when concatenating a mix of types.

import timeit
plus = timeit.timeit('plus_string(mixed_type_vars)',
                     number=1000000,
                     setup='from __main__ import mixed_type_vars, plus_string')

form = timeit.timeit('format_string(mixed_type_vars)',
                     number=1000000,
                     setup='from __main__ import mixed_type_vars, format_string')

percent = timeit.timeit('percent_string(mixed_type_vars)',
                     number=1000000,
                     setup='from __main__ import mixed_type_vars, percent_string')

print("Concatenating a mix of types into a string one million times:")
print(f"{plus:.04f} seconds - plus signs")
print(f"{form:.04f} seconds - string format")
print(f"{percent:.04f} seconds - percent signs")

# Concatenating a mix of types into a string one million times:
# 1.9958 seconds - plus signs
# 1.3123 seconds - string format
# 1.0439 seconds - percent signs

On my machine, percent signs were slightly faster than string format, but both smoked using plus signs and explicit conversion.
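
f-strings were left out of that timing run. If you want to check them on your own machine, a sketch along the same lines might look like this. No timing is quoted here because results depend on the machine and Python version, and the function name f_string is made up for the example.

```python
import timeit

mixed_type_vars = {
    "a": "My",
    "b": 2056,
    "c": "bodyguards",
    "d": {"have": "feelings"}
}


def f_string(variables):
    return (f"{variables['a']} {variables['b']} "
            f"{variables['c']} {variables['d']}")


# globals=globals() lets timeit see the names defined above
f_time = timeit.timeit('f_string(mixed_type_vars)',
                       number=100000,
                       globals=globals())

print(f_string(mixed_type_vars))
print(f"{f_time:.04f} seconds - f-strings (100,000 runs)")
```

Note the run count is lower than the one million used above, so scale the result before comparing.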

Unprofessional

This isn’t only something to call out teammates on during code review; it can even negatively impact you if you’re applying for Python jobs. Using “+” everywhere for strings is a red flag that you are still a novice. I don’t know anyone personally that has been turned away because of something so trivial, but it does show that you are unfamiliar with Python’s awesome feature rich strings and haven’t had a lot of experience in group coding.

If you ever saw Batman or James Bond coding in Python, they wouldn’t be using +s in their string concatenation, and nor should you!

Summary

"If" + "👏" + "you" + "👏" +"use" + "👏" + "plus signs" + "👏" + "to" + "👏" + "concatenate" + "👏" + "your" + "👏" + "strings" + "👏" + "you" + "👏" + "are" + "👏" + "more" + "👏" + "annoying" + "👏" + "than" + "👏" + "this" + "👏" + "meme!"

Truffle: going from ganache to testnet (ropsten)

Truffle is an amazing suite of tools created by Consensys to develop smart contracts for the Ethereum blockchain network. However, it can be a bit jarring to make the leap from local development to the real test network, ropsten.

Required Setup

For this walk through, I have installed:

I will be using the default example truffle project, MetaCoin, that you can walk through how to unbox here or follow along using your own project.

First things first, if you do NOT have a package.json file yet, make sure to run npm init. This will turn the directory into a node package that we can easily manage with the npm package manager, so we don’t have to download dependencies into the global package scope.

Now we can download all the things we are going to need:

npm install bip39 dotenv --save
  • bip39 – used to generate wallet mnemonic
  • dotenv – simple way to read environment variable files

We now have everything we need, development wise.

Storing Secrets outside the code

We will have to create a private key or mnemonic, and that means we need somewhere relatively secure to store it. For testnet stuff, this can be as simple as making sure it’s not being put into version control alongside the code. To that end, we are going to use environment variables, and will store them in a file called .env (that’s it, just an extension basically. Make sure to add it to your .gitignore if you’re using git). To learn more, check out the github page for dotenv. But for our purposes, all you need to know is that this file will have a format of:

ENV_VARIABLE_NAME=something
ANOTHER_ENV=something else
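
dotenv does the reading for you in Node. Purely to illustrate how simple the file format is, here is a rough Python sketch of the core idea. It ignores quoting, escapes and the other niceties the real library handles, and parse_env is a made-up name.

```python
def parse_env(text):
    """Parse KEY=value lines into a dict, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split on the first = only
        values[key.strip()] = value.strip()
    return values


example = """
ENV_VARIABLE_NAME=something
ANOTHER_ENV=something else
"""
print(parse_env(example))
# {'ENV_VARIABLE_NAME': 'something', 'ANOTHER_ENV': 'something else'}
```

Splitting on only the first equals sign matters, since values such as URLs with query strings can contain more of them.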

Accessing testnet

The easiest way to reach out to testnet is by using a provider. I personally like using infura.io (free, just requires registration). After you register and have your API key emailed to you, make sure you select the URL for the test network and add it to the .env file using a variable named ROPSTEN_URL.

ROPSTEN_URL=https://ropsten.infura.io/<your-api-key>

It’s also possible to use your own geth node set to testnet, but that is not required.

Next we are going to create our own wallet. If you already have one set up, like with MetaMask, you can skip this next part.

Creating your testnet wallet

So now you have a place to put your secrets, let’s create some. This is where bip39 comes in; it will create random mnemonics which can be used as the basis for the private key of a wallet. It will be a series of 12 random words.

We could put this generation in a file, but it’s easy enough to just do straight from the command line:

node -e "console.log(require('bip39').generateMnemonic())"

This will output 12 words. DO NOT SHARE THESE ANYWHERE. The ones I am using below are example ones, and also should NOT be used. Put them in the .env file as the variable MNEMONIC. Your .env file should now contain:

MNEMONIC=candy maple cake sugar pudding cream honey rich smooth crumble sweet treat
ROPSTEN_URL=https://ropsten.infura.io/<your-api-key>

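We have our seed, so it’s time to hook it into our code. In your truffle.js or truffle-config.js file, you will now need to import the environment variables and a wallet provider at the top of the file. (The wallet provider comes from the truffle-hdwallet-provider package, so install that with npm as well if you have not already.)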

require('dotenv').config()
const HDWalletProvider = require('truffle-hdwallet-provider')

After that is added, we will move down to the exports section, where we are going to add a new network named ropsten. Then we are going to use the HDWalletProvider and supply it with the mnemonic and Infura URL provided via environment variables.

module.exports = {
  networks: {
    ropsten: {
      provider: () => new HDWalletProvider(
        process.env.MNEMONIC,
        process.env.ROPSTEN_URL),
      network_id: 3
    },
  },
}

Test and make sure everything’s working by opening a truffle console, specifying our new network.

truffle console --network ropsten

We can then get our public account address via the console.

truffle(ropsten)> web3.eth.getAccounts((err, accounts) => console.log(accounts))
[ '0x627306090abab3a6e1400e9345bc60c78a8bef57' ]

If you are seeing this same wallet address, you did it wrong. Go back and make your own mnemonic, don’t copy the candy one from above.

Funding the wallet

In your development environment, the wallet already has ETH in it to pay for gas and deploying the contract. On the mainnet, you will have to buy some real ETH. On testnet, you can get some for free by using a Faucet, such as https://faucet.ropsten.be/ or if you’re using MetaMask just use https://faucet.metamask.io/.

Make sure to use the address you gathered from the console for the faucet,  and soon you should have test funds to play around with and actually deploy your contract.

Deploying the Contract

Now where the rubber meets the road, getting your contract out into the real (test) world.

truffle deploy --network ropsten

If everything is successful, you’ll get messages like these:

Using network 'ropsten'.

Running migration: 1_initial_migration.js
  Deploying Migrations...
  ... 0xefe70115c578c92bfa97154f70f9c3fbaa2b8400b1da1ee7cdxxxxxxxxxxxxxx
  Migrations: 0x6eeedefb64bd6ee6618ac54623xxxxxxxxxxxxxx
Saving successful migration to network...
  ... 0xd4294e35c166e2dca771ba9bf5eb3801bc1793e30db6a53d4dxxxxxxxxxxxxxx
Saving artifacts...
Running migration: 2_deploy_contracts.js
  Deploying Capture...
  ... 0x446d5e92d6976bb05c85bb95b243d6f7405af6bb12b3b6fe08xxxxxxxxxxxxxx
  Capture: 0x1d2f60c6ef979ca86f53af1942xxxxxxxxxxxxxx
Saving successful migration to network...
  ... 0x0b6f918ccc8e3b82cdf43038a2c32fe1fef66d0fa9aeb2260bxxxxxxxxxxxxxx
Saving artifacts...

Tada! You now have your custom contracts deployed to testnet!

Or, you got an out of gas error, as it is not uncommon to have to adjust the gas price to get it onto the network, as truffle does not automatically figure that out for you. A follow up post will show how to calculate and adjust gas price as needed.

 

 

 

Discover AWS State Machines using Python Lambdas for an ETL process

Step Functions, State Machines, and Lambdas oh my! AWS has really been expanding what you can do without needing to actually stand up any servers. I’m going to walk through a very basic example of how to get going with your own Python code to create an ETL (Extract Transform Load) process using Amazon’s services. And don’t worry, all this goodness is included in the free tier!

The goal of this exercise will be to have an aggregation of news headlines downloaded and transformed into CSV format and uploaded to another service. We are going to achieve this by breaking up each step of the process into its own AWS Lambda.

What are Lambdas?

AWS Lambdas are a “serverless”, stateless way to run snippets of code with no extra initialization or shutdown time.

When to use Lambdas

They are great if you have small, highly reusable pieces of code that serve a single purpose. (If you have a few that go together really well, that’s where state machines come in.) For example, you might have some code that does image recognition and need to use it across multiple projects. Or you might just want it to run faster or be more accessible, as Lambdas have several ways they can be initiated, including via an API you can define.

They will NOT fit your purpose if you need something that does a multitude of tasks, runs for a long time, uses a lot of memory or updates frequently.

Creating a Lambda

Creating your own is a lot easier than a lot of other tutorials seem to show. If you haven’t already, sign up for an AWS account. Then open your AWS console and search for Lambda.

You’ll most likely be presented with a welcome screen. After clicking through “Get Started” or whatever they updated it to this month, you’ll have a screen where you can create new functions as well as check on existing ones.

See the big orange button that even Trump would be proud of? Click it.

As this is probably your first Lambda, you will have to create a new role. Super simple, you don’t even have to leave the page. Just give it a new name, and give it a policy template. I used the Simple Microservice permissions as it seemed to fit the bill best for me.

Then you will be greeted with a page with a large amount of info and stuff going on. The part that we are going to be most concerned about is the Function Code area (and will also need Environment Variables to store API keys in).

It may seem like we need to set up triggers or resources for this information to go to, but as we plan to use these inside a state machine, that will handle all that bother for us.

ETL – Extract Transform Load

Now that we know how to make a Lambda, let’s look at some code we could use with it. For the state machine we will create later, I want to have an entire process where I pull in information from an outside source (extract), modify it to fit my needs (transform) and then put it into my own system (load).
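
Conceptually, the state machine just feeds the return value of one Lambda in as the event of the next. Here is a plain-Python sketch of that hand-off, using stand-in functions and made-up data instead of the real Lambdas:

```python
def extract(event, context=None):
    # Stand-in for the real download step
    return {"source": event["source"],
            "data": [{"title": "Example headline"}]}


def transform(event, context=None):
    # Reshape the extracted data; here just pull out the titles
    titles = [article["title"] for article in event["data"]]
    return {"source": event["source"], "data": titles}


def load(event, context=None):
    # Stand-in for uploading to another service
    return {"source": event["source"], "loaded": len(event["data"])}


def run_pipeline(event, steps):
    """Pass each step's output to the next, like a chain of states."""
    for step in steps:
        event = step(event)
    return event


print(run_pipeline({"source": "associated-press"}, [extract, transform, load]))
# {'source': 'associated-press', 'loaded': 1}
```

Keeping every step to a dictionary in and a dictionary out is what makes the pieces easy to chain and to test on their own.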

Extract

As stated above, this scenario involves pulling down data from a news source. In this case we are using News API, which allows you to create a free API key to grab top news headlines and links to their stories.

That code is dead simple:

import json
from urllib import request


def retrieve_news(key, source):
    url = f"https://newsapi.org/v2/top-headlines?sources={source}&apiKey={key}"
    with request.urlopen(url) as req:
        return json.loads(req.read())

# my_key holds your News API key
print(retrieve_news(my_key, 'associated-press'))

If I wasn’t using this in a Lambda, I would be using the wonderful Requests module instead, but Python 3’s urllib is at least a lot better than Python 2’s.

So now, we need a way for the Lambda function to call this code and pass along the results in a manner we can use later. On the page where you fill in the code, under Function Code, you’ll see a field labeled Handler; this is the entry point to your code. lambda_function.lambda_handler is the default, which means it will use the function lambda_handler inside the file lambda_function.py as the entry.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import os
import json
from urllib import request


def retrieve_news(key, source):
    url = f"https://newsapi.org/v2/top-headlines?sources={source}&apiKey={key}"
    with request.urlopen(url) as req:
        return json.load(req)


# What AWS Lambda calls
def lambda_handler(event, context):
    key = os.getenv('NEWSAPI_KEY')
    if not key:
        raise Exception('Not all environment variables set')

    if not event.get('source'):
        raise Exception('Source not set')

    return {'data': retrieve_news(key, event['source']),
            'source': event['source']}

There are two arguments passed into the function. The first is event, which is all the information sent to the lambda function (if using a standard JSON object this will be a dictionary, as seen above). The second is context, which is an object that will tell you about the current lambda function if necessary. You can learn more about it here, but it will not be used in this example.
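
Because the handler is an ordinary function, you can also exercise it locally before pasting it into the console, by handing it a dictionary for event and None for context. Here is a sketch with the network call swapped for a stub (the stub and its return value are invented for illustration):

```python
import os


def retrieve_news(key, source):
    # Stub standing in for the real News API request
    return {"status": "ok", "articles": []}


def lambda_handler(event, context):
    key = os.getenv('NEWSAPI_KEY')
    if not key:
        raise Exception('Not all environment variables set')
    if not event.get('source'):
        raise Exception('Source not set')
    return {'data': retrieve_news(key, event['source']),
            'source': event['source']}


# Locally the environment variable has to be set by hand
os.environ['NEWSAPI_KEY'] = 'dummy-key-for-local-testing'
result = lambda_handler({'source': 'associated-press'}, None)
print(result['source'])  # associated-press
```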

Testing the lambda

You may also notice that we are pulling the API key not from the event, but from an environment variable, so make sure to set that as well on the page. Last but not least, I would suggest increasing the timeout for the lambda to 10 seconds, from the default 3.

Before we go on and add the other functions, let’s make sure this one works properly. At the top of the page, where there is a drop down beside Test and Actions on the right, click Configure test events. We are going to add a new one with the details that will be passed into the event dictionary.

{
  "source": "associated-press"
}

On the pop-up, copy in the above JSON and save it as a new test event.

Hit the test button at the top, and see the results. You should get a big green window that shows you how it ran. If you have a red error window, you will have to figure out what went wrong first.
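At this stage, two likely causes of a red window are a missing environment variable or a test event without a source. Here is a minimal sketch, runnable locally, of checks that name the problem directly. check_inputs is a hypothetical helper, not part of the lambda above, and the environ parameter exists only so it can be tried without touching real environment variables.

```python
import os


def check_inputs(event, environ=os.environ):
    """Return (key, source) or raise with a message naming what is missing."""
    key = environ.get("NEWSAPI_KEY")
    if not key:
        raise ValueError("NEWSAPI_KEY environment variable is not set")
    # .get avoids a bare KeyError when the test event has no source at all
    source = event.get("source")
    if not source:
        raise ValueError("event is missing a 'source' value")
    return key, source


# A well-formed test event with a placeholder key
print(check_inputs({"source": "associated-press"}, {"NEWSAPI_KEY": "abc"}))

# An empty test event reports which field is missing
try:
    check_inputs({}, {"NEWSAPI_KEY": "abc"})
except ValueError as err:
    print(err)
```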

Transform

This will be our second lambda, so we get to go through the process again of creating a new one (you can use the existing role from the last one) and copying this code into it. No environment variables needed this time!

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import csv
from io import StringIO


# What AWS Lambda calls
def lambda_handler(event, context):

    sio = StringIO()
    writer = csv.writer(sio)
    writer.writerow(["Source", "Title", "Author", "URL"])
    for article in event['data']['articles']:
        writer.writerow([
            article['source']['name'],
            article['title'],
            article['author'],
            article['url']
        ])

    csv_content = sio.getvalue()
    print(csv_content)

    return {'data': csv_content,
            'source': event['source']}


The tricky part here is that you now need good test data for it. Luckily you can copy the output of the last Lambda (snippet provided below) to do just that.

{
  "data": {
    "status": "ok",
    "totalResults": 5,
    "articles": [
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "FRANCES D'EMILIO",
        "title": "Pope accepts resignation of McCarrick after sex abuse claims",
        "description": "VATICAN CITY (AP) — In a move described as unprecedented, Pope Francis has effectively stripped U.S. prelate Theodore McCarrick of his cardinal's title and rank following allegations of sexual abuse, including one involving an 11-year-old boy. The Vatican ann…",
        "url": "https://apnews.com/46e8e15911034e7f971c7542b60a6444",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:b5c82ad2f2b74b50ab9faccf51898309/2628.jpeg",
        "publishedAt": "2018-07-28T16:21:57Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "KEVIN FREKING",
        "title": "On trade policy, Trump is turning GOP orthodoxy on its head",
        "description": "WASHINGTON (AP) — President Donald Trump's trade policies are turning long-established Republican orthodoxy on its head, marked by tariff fights and now $12 billion in farm aid that represents the type of government intervention GOP voters railed against a de…",
        "url": "https://apnews.com/57cd042b57054e5790b9b444c561ac3b",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:90f04d837f514d0b984e25bd5153be8a/3000.jpeg",
        "publishedAt": "2018-07-28T16:20:11Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "SETH BORENSTEIN and FRANK JORDANS",
        "title": "Science Says: Record heat, fires worsened by climate change",
        "description": "Heat waves are setting all-time temperature records across the globe, again. Europe suffered its deadliest wildfire in more than a century, and one of nearly 90 large fires in the U.S. West burned dozens of homes and forced the evacuation of at least 37,000 p…",
        "url": "https://apnews.com/a4255779e2b6461b9cc8dbf24ea4b96c",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:f9b76dc0354e47caafcfad96c36443ca/3000.jpeg",
        "publishedAt": "2018-07-28T15:03:01Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "MICHAEL KUNZELMAN and LARRY NEUMEISTER",
        "title": "No mystery to Supreme Court nominee Kavanaugh's gun views",
        "description": "SILVER SPRING, Md. (AP) — Supreme Court nominee Brett Kavanaugh says he recognizes that gun, drug and gang violence \"has plagued all of us.\" Still, he believes the Constitution limits how far government can go to restrict gun use to prevent crime. As a federa…",
        "url": "https://apnews.com/c8fc0785b429497abf9621efcdb345e8",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:4c3619ea948b4c91b8f2fcdd50162d26/3000.jpeg",
        "publishedAt": "2018-07-28T14:11:06Z"
      },
      {
        "source": {
          "id": "associated-press",
          "name": "Associated Press"
        },
        "author": "HOPE YEN, JOSH BOAK and CHRISTOPHER RUGABER",
        "title": "AP FACT CHECK: Trump's hyped claims on economy, NKorea, vets",
        "description": "WASHINGTON (AP) — President Donald Trump received positive economic news this past week and twisted it out of proportion. That impulse ran through days of rhetoric as he hailed the success of a veterans program that hasn't started and saw progress with North …",
        "url": "https://apnews.com/5b405824a9d843a09a641754d84aa1ab",
        "urlToImage": "https://storage.googleapis.com/afs-prod/media/media:636c2c3068b94181ba3c5bcb8d2a3ae9/3000.jpeg",
        "publishedAt": "2018-07-28T12:30:33Z"
      }
    ]
  },
  "source": "associated-press"
}

Configure and run the test like before using the above data.

In this case I also printed the output so you could see that any standard output is captured by the logs.
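If you want to sanity check the CSV locally before pasting the test event into the console, something like the following sketch works. It repeats the same row layout on a single trimmed-down article from the test data and reads the result back with csv.reader. The to_csv name is just an illustrative wrapper, not something the lambda needs.

```python
import csv
from io import StringIO


def to_csv(articles):
    """Same row layout as the Transform lambda: Source, Title, Author, URL."""
    sio = StringIO()
    writer = csv.writer(sio)
    writer.writerow(["Source", "Title", "Author", "URL"])
    for article in articles:
        writer.writerow([
            article["source"]["name"],
            article["title"],
            article["author"],
            article["url"],
        ])
    return sio.getvalue()


articles = [{
    "source": {"id": "associated-press", "name": "Associated Press"},
    "author": "SETH BORENSTEIN and FRANK JORDANS",
    "title": "Science Says: Record heat, fires worsened by climate change",
    "url": "https://apnews.com/a4255779e2b6461b9cc8dbf24ea4b96c",
}]

csv_content = to_csv(articles)

# Reading it back shows the comma inside the title survived the round trip
rows = list(csv.reader(StringIO(csv_content)))
print(rows[1][1])
```

The csv module quotes the title because it contains a comma, which is the main reason to use it here rather than joining strings by hand.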

Load

Now to actually submit this data to a server. You could set up your own, or use file.io, a free file drop website, as the code below does. No API key needed!

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from urllib import request, parse
import json


# What AWS Lambda calls
def lambda_handler(event, context):
    url = 'https://file.io'

    encoded_args = parse.urlencode({'text': event['data']}).encode('utf-8')

    with request.urlopen(url, encoded_args) as req:
        info = json.load(req)

    return {'data': info, 'source': event['source']}

Again, as this is reaching out to an external API, I would increase the timeout of the Lambda from the default 3 seconds to 10.
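Passing a second argument to urlopen turns the request into a POST with a form-encoded body. Here is a small sketch of what that body looks like, using a made-up two-line CSV and building the request object without actually sending anything:

```python
from urllib import parse, request

# Made-up sample standing in for the Transform output
csv_content = "Source,Title\r\nAssociated Press,Some title\r\n"

# urlencode escapes commas, spaces and newlines so the CSV fits in a form field
encoded_args = parse.urlencode({"text": csv_content}).encode("utf-8")
print(encoded_args)

# Supplying data makes urllib use POST instead of GET
req = request.Request("https://file.io", data=encoded_args)
print(req.get_method())

# Decoding the body gives back the original CSV text
decoded = parse.parse_qs(encoded_args.decode("utf-8"))["text"][0]
print(decoded == csv_content)
```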

Woo! We now have three lambdas that can take each other’s outputs in a row and do a full ETL process. Now let’s put them together.

State Machines

AWS Step Functions lets you chain a set of actions together, and then presents them in a pretty auto-generated graph. Back at the console, find Step Functions.

Then create a new state machine.

The actual state machine definition is probably the hardest part. The state language can be confusing, but thankfully we don’t need to do anything complicated for our needs.

You can use this code as is, and will just have to update the actual Resource links under Extract, Transform and Load. (You can even click on them and should be presented with a drop down of your previously created resources so you don’t have to copy the ARNs manually.)

{
  "StartAt": "Set Source",
  "States": {
    "Set Source": {
      "Type": "Pass",
      "Result": {"source": "associated-press"},
      "ResultPath": "$",
      "Next": "Extract"
    },
    "Extract": {
      "Type": "Task",
      "Resource": "<ARN>:function:google-news-extract",
      "ResultPath": "$",
      "Next": "Transform"
    },
    "Transform": {
      "Type": "Task",
      "Resource": "<ARN>:function:google-news-transform",
      "ResultPath": "$",
      "Next": "Load"
    },
     "Load": {
      "Type": "Task",
      "Resource": "<ARN>:function:google-news-load",
      "ResultPath": "$",
      "End": true
    }
  }
}

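Notice the first step is not a task, but rather a pass through state that sets the source. We could do this during initialization, but I wanted to highlight the ability to add information where needed. Because every state uses a ResultPath of "$", each state’s result replaces the whole input handed to the next one.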
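To see how the data flows from state to state, here is a rough local simulation. It is only a sketch under heavy assumptions: run_state_machine is a made-up helper that understands nothing but Pass and Task states whose result replaces the entire input, the Resource values are plain dictionary keys rather than ARNs, and the three handlers are placeholders that merely have the same input and output shape as the real lambdas.

```python
def run_state_machine(definition, tasks, state_input=None):
    """Tiny local stand-in for Step Functions; handles only Pass and Task
    states where the result replaces the whole input."""
    name = definition["StartAt"]
    data = state_input or {}
    while True:
        state = definition["States"][name]
        if state["Type"] == "Pass":
            data = state.get("Result", data)
        elif state["Type"] == "Task":
            # Each task receives the previous state's output as its event
            data = tasks[state["Resource"]](data, None)
        else:
            raise ValueError(f"unsupported state type: {state['Type']}")
        if state.get("End"):
            return data
        name = state["Next"]


definition = {
    "StartAt": "Set Source",
    "States": {
        "Set Source": {"Type": "Pass",
                       "Result": {"source": "associated-press"},
                       "Next": "Extract"},
        "Extract": {"Type": "Task", "Resource": "extract", "Next": "Transform"},
        "Transform": {"Type": "Task", "Resource": "transform", "Next": "Load"},
        "Load": {"Type": "Task", "Resource": "load", "End": True},
    },
}

# Placeholder handlers; the returned values are invented for illustration
tasks = {
    "extract": lambda event, context: {
        "data": {"articles": []}, "source": event["source"]},
    "transform": lambda event, context: {
        "data": "Source,Title,Author,URL\r\n", "source": event["source"]},
    "load": lambda event, context: {
        "data": {"uploaded": True}, "source": event["source"]},
}

result = run_state_machine(definition, tasks)
print(result)
```

The source set by the first state only reaches the end because every handler copies it into its own return value, which is why each of the three lambdas returns 'source' alongside 'data'.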

After creation, we will need to start a new execution. It doesn’t need any input, but it doesn’t hurt to include a comment if you want.

Then run it!

 

In the middle of an execution, it will show what has run successfully and what is currently in progress or has failed. At any time, you can click on a specific block to see what its inputs and outputs were.

This state machine can then be run whenever you want to perform the full ETL process!

Scheduling

For a process like this, you want to run it on a schedule. That means creating a new CloudWatch rule. Search for CloudWatch in the console, then click on Rules on the left-hand side.

Then, click the big blue button.

It’s pretty simple to create a fixed rate schedule, and then just make sure to select the right state machine on the right side!
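Behind the console form, a fixed rate schedule is expressed as a short string such as rate(1 hour). The console fills this in for you, so the sketch below is only to show the shape of it; rate_expression is a hypothetical helper I am using for illustration, not anything AWS provides.

```python
def rate_expression(value, unit):
    """Build a fixed-rate schedule expression such as rate(1 hour)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be minute, hour or day")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


print(rate_expression(1, "hour"))
print(rate_expression(30, "minute"))
```

The singular versus plural unit is the usual thing to trip over if you ever type one of these by hand.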