Cartoon or Photo? – Image detection with Python

This all started with me just wanting a fast way to sort through image folders and remove cartoon images. That led me down a spiraling rabbit hole of possibilities, from using OpenCV for different types of detection to training a Machine Learning model from scratch with Keras. This article goes through the options available and how accurate each one ends up being.

What makes an image a cartoon?

Model is Jessica Marie Frye

If we are going to detect the differences between photos and cartoons, first we need to figure out how they are different. Importantly, how they are different in a quantifiable way. These are the measurable differences I could think up:

  • Cartoons have smoother gradients
  • Real images use a larger color palette
  • Cartoons usually have drawn edge outlines

Below we will try each of these options and see how well they fare.

Results First

You’re probably more interested in what will work the best for you and less about how I spent days toiling away at this, so to cut to the chase, here is how everything performed.

The best OpenCV contender turned out to be counting colors, with a combined 75% accuracy.

Overall, Machine Learning Image Classification using the Xception model wiped the floor with a combined 96% accuracy. This probably isn’t even as good as it could get with further fine tuning, but was more than good enough for my own needs.

These results were also with a hard threshold set to force every image into either the real or the cartoon bucket. For my own use I narrow the ranges that count as a definite answer and put the “unsure” images in another folder, further reducing errant picks.
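The three-bucket idea can be sketched in a few lines. This is only an illustration: the function name `bucket` and the cutoff values are assumptions, and it assumes a score between 0 and 1 where values near 1 mean photo and values near 0 mean cartoon.

```python
def bucket(score: float, low: float = 0.02, high: float = 0.98) -> str:
    """Sort a 0..1 score into one of three buckets (cutoffs are assumed)."""
    if score >= high:
        return "real"
    if score <= low:
        return "cartoon"
    return "unsure"


for value in (0.995, 0.5, 0.01):
    print(value, "->", bucket(value))
```

Widening the gap between `low` and `high` trades more manual sorting of the unsure folder for fewer wrong picks in the other two.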

Machine Learning with Keras is the obvious pick if you have a good set of data to train with and a computer beefy enough to process it. However, it is not as portable and takes much longer than simply trying out one of the OpenCV methods.

OpenCV Gradient Differences

The first method we will use is pretty straightforward. We will blur the image a little, compare it to its unaltered form and quantify the difference. First we need OpenCV installed for Python. Go into your venv and run:

pip install opencv-python

Now open up your IDE and create a new .py file to get started. The first thing we are going to do is read the image into OpenCV, which is the cv2 module. We will use a JPEG image; cv2.imread returns it as a three-channel array in BGR (blue, green, red) order.

import cv2

img = cv2.imread("/path/to/my/image.jpg")

Next we will blur the image a little using a bilateral filter, to even out the colors. We will also resize the image to a standard size so the blur across every image is the same.

img = cv2.resize(img, (1024, 1024))
color_blurred = cv2.bilateralFilter(img, 6, 250, 250)

You can check out the result to see how strong the effect is by previewing the image. Press any key to close the window.

# Optional Preview
cv2.imshow("blurred", color_blurred)
cv2.waitKey(0)  # wait for a key press
cv2.destroyAllWindows()

Then we need to compare this new color_blurred image to the original image. I accomplished this by comparing the histograms. We will have to do that for each color channel individually.

diffs = []
# OpenCV stores channels in BGR order, so index 0 is blue, 1 is green, 2 is red
for k, color in enumerate(('b', 'g', 'r')):
    print(f"Comparing histogram for color {color}")
    # calcHist expects a list of images and a list of channel indexes
    real_histogram = cv2.calcHist([img], [k], None, [256], [0, 256])
    color_histogram = cv2.calcHist([color_blurred], [k], None, [256], [0, 256])
    diffs.append(cv2.compareHist(real_histogram, color_histogram, cv2.HISTCMP_CORREL))

result = sum(diffs) / 3

With HISTCMP_CORREL, compareHist gives us a correlation score of at most 1 (one being the most similar). We will need to set a threshold for how similar the blurred and original versions of a cartoon will be. I have mine set at 0.98 (aka 98% similar).

if result > 0.98:
    print("It's a cartoon!")
else:
    print("It's a photo!")
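For intuition about the score being thresholded: the correlation comparison is the Pearson correlation between the two histograms. Below is a plain-Python sketch of that formula on toy histograms. The helper name `correlation` and the numbers are made up for illustration, and this is not OpenCV's own code.

```python
from math import sqrt


def correlation(h1, h2):
    """Pearson correlation between two equal-length histograms."""
    mean1 = sum(h1) / len(h1)
    mean2 = sum(h2) / len(h2)
    numerator = sum((a - mean1) * (b - mean2) for a, b in zip(h1, h2))
    denominator = sqrt(
        sum((a - mean1) ** 2 for a in h1) * sum((b - mean2) ** 2 for b in h2)
    )
    return numerator / denominator


original = [10, 40, 30, 20]
barely_changed = [11, 39, 31, 19]  # the blur barely moved the bins
reshuffled = [40, 10, 20, 30]      # many pixels moved between bins

print(correlation(original, original))
print(correlation(original, barely_changed))
print(correlation(original, reshuffled))
```

The closer the blurred histogram stays to the original, the closer the score is to 1, which is why a cartoon with flat colors tends to score higher than a photo.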

And that’s it! Now you can test it out and see how it works for your images. I found this iteration to work very fast and have a roughly 70% proper detection rate.

Let’s put it all together into a usable script!

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from pathlib import Path
from typing import Union
import argparse

import cv2

def is_cartoon(
    image: Union[str, Path],
    threshold: float = 0.98,
    preview: bool = False,
) -> bool:
    # read and resize image
    img = cv2.imread(str(image))
    img = cv2.resize(img, (1024, 1024))

    # blur the image to "even out" the colors
    color_blurred = cv2.bilateralFilter(img, 6, 250, 250)

    if preview:
        cv2.imshow("blurred", color_blurred)
        cv2.waitKey(0)  # keep the window open until a key is pressed

    # compare the colors from the original image to blurred one.
    diffs = []
    for k, color in enumerate(("b", "g", "r")):  # OpenCV channel order is BGR
        # print(f"Comparing histogram for color {color}")
        real_histogram = cv2.calcHist([img], [k], None, [256], [0, 256])
        color_histogram = cv2.calcHist([color_blurred], [k], None, [256], [0, 256])
        diffs.append(
            cv2.compareHist(real_histogram, color_histogram, cv2.HISTCMP_CORREL)
        )

    return sum(diffs) / 3 > threshold

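def command_line_options():
    # Argument names must match the is_cartoon() parameters: image, threshold, preview
    args = argparse.ArgumentParser(
        description="Determine if an image is likely a cartoon or photo."
    )
    args.add_argument("--preview", action="store_true", help="Show the blurred image")
    args.add_argument("--threshold", type=float, default=0.98, help="Cutoff threshold")
    args.add_argument("image", type=Path, help="Path to image file")
    return vars(args.parse_args())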

if __name__ == "__main__":
    options = command_line_options()
    if not options["image"].exists():
        raise FileNotFoundError(f"No image exists at {options['image'].absolute()}")
    if is_cartoon(**options):
        print(f"{options['image'].name} is a cartoon!")
    else:
        print(f"{options['image'].name} is a photo!")

OpenCV Color Counting

Using a subset of 512 colors in a 1024×1024 image, determine how much of the image can be reproduced

The next possible way to approach the problem was just figuring out how many colors were used for the majority of the image. We will use the same code as above to load and resize the image, but this time we will add a loop to count all the colors (slow way, faster option below):

    # Find count of each color
    a = {}
    for row in img:
        for item in row:
            value = tuple(item)  # one (b, g, r) pixel
            if value not in a:
                a[value] = 1
            else:
                a[value] += 1

Next we will sort the dictionary by the most used color, and add the counts of the top 512 colors together.

The actual calculation is a lot of functionality in a small block of code. First let’s get a visual preview of what is happening by re-creating the image with only the selected colors with:

mask = numpy.zeros(img.shape[:2], dtype=bool)

for color, _ in sorted(a.items(), key=lambda pair: pair[1], reverse=True)[:512]:
    mask |= (img == color).all(-1)

img[~mask] = (255, 255, 255)

cv2.imshow("img", img)
cv2.waitKey(0)  # keep the window open until a key is pressed

Here’s the code that basically calculates what you are seeing. We divide the sum of all the top colors by the size of the image to get what percent could be recreated with just those colors.

    # Identify the percent of the image that uses the top 512 colors
    most_common_colors = sum([x[1] for x in sorted(a.items(), key=lambda pair: pair[1], reverse=True)[:512]])
    return (most_common_colors / (1024 * 1024)) > 0.3 # new threshold
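The same calculation can be tried without OpenCV on a tiny stand-in image. This is a minimal sketch using `collections.Counter`; the function name `top_color_coverage` and the pixel values are assumptions made for illustration.

```python
from collections import Counter

# A tiny stand-in "image": a flat list of (b, g, r) pixel tuples
pixels = [(255, 255, 255)] * 6 + [(0, 0, 0)] * 3 + [(10, 20, 30)]


def top_color_coverage(pixels, top_n):
    """Fraction of pixels covered by the top_n most frequent colors."""
    counts = Counter(pixels)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(pixels)


print(top_color_coverage(pixels, 1))
print(top_color_coverage(pixels, 2))
```

A cartoon with large flat areas should reach a high coverage with few colors, while a photo spreads its pixels over many more.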

This script is pretty similar to the last one and has a slightly higher success rate on my data set, at 76%! It is 20% better at detecting what is a cartoon, but 10% worse at photos. It is also much slower, some 20 to 50 times (about a second per image instead of 0.05 seconds). However, we can speed that up with some numpy trickery.

    # Replace everything after "Find count of each color" with this faster version, but it doesn't work with the preview.
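    flattened = numpy.reshape(img, ((1024 * 1024), 3))
    # Pack each pixel into one integer; every channel is below 1000, so these weights cannot collide
    multiplied = numpy.multiply(flattened, [1_000_000, 1_000, 1])
    sums = multiplied.sum(axis=1)
    unique, counts = numpy.unique(sums, return_counts=True)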

    # Identify the percent of the image that uses the top 512 colors
    most_common_colors = sum(sorted(counts, reverse=True)[:512])
    return (most_common_colors / (1024 * 1024)) > threshold
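Packing a pixel into a single integer only works when two different colors can never produce the same number, and that depends on the weights. A plain-Python sketch of that property follows; the helper `pack` is hypothetical and exists only for this illustration.

```python
def pack(pixel, weights):
    """Combine a (b, g, r) pixel into one integer using per-channel weights."""
    return sum(channel * weight for channel, weight in zip(pixel, weights))


first = (0, 1, 0)
second = (0, 0, 100)

# Weights that are too close together let different colors collapse together
print(pack(first, (100_000, 100, 1)), pack(second, (100_000, 100, 1)))

# Weights spaced wider than the largest channel value (255) keep them apart
print(pack(first, (1_000_000, 1_000, 1)), pack(second, (1_000_000, 1_000, 1)))
```

Any spacing larger than 255 per channel works; `(65536, 256, 1)` is the tightest choice.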

Here is the entire script with the slower version that works with preview image. However after getting a good threshold I would suggest replacing the code with the faster version above.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from pathlib import Path
from typing import Union
import argparse

import cv2
import numpy

def is_cartoon(
    image: Union[str, Path],
    threshold: float = 0.3,
    preview: bool = False,
) -> bool:
    # read and resize image
    img = cv2.imread(str(image))
    img = cv2.resize(img, (1024, 1024))

    # Find count of each color
    a = {}
    for row in img:
        for item in row:
            value = tuple(item)
            if value not in a:
                a[value] = 1
            else:
                a[value] += 1

    if preview:
        mask = numpy.zeros(img.shape[:2], dtype=bool)

        for color, _ in sorted(a.items(), key=lambda pair: pair[1], reverse=True)[:512]:
            mask |= (img == color).all(-1)

        img[~mask] = (255, 255, 255)

        cv2.imshow("img", img)
        cv2.waitKey(0)  # keep the window open until a key is pressed

    # Identify the percent of the image that uses the top 512 colors
    most_common_colors = sum(
        [x[1] for x in sorted(a.items(), key=lambda pair: pair[1], reverse=True)[:512]]
    )
    return (most_common_colors / (1024 * 1024)) > threshold

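def command_line_options():
    # Argument names must match the is_cartoon() parameters: image, threshold, preview
    args = argparse.ArgumentParser(
        description="Determine if an image is likely a cartoon or photo."
    )
    args.add_argument("--preview", action="store_true", help="Show the blurred image")
    args.add_argument("--threshold", type=float, default=0.3, help="Cutoff threshold")
    args.add_argument("image", type=Path, help="Path to image file")
    return vars(args.parse_args())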

if __name__ == "__main__":
    options = command_line_options()
    if not options["image"].exists():
        raise FileNotFoundError(f"No image exists at {options['image'].absolute()}")
    if is_cartoon(**options):
        print(f"{options['image'].name} is a cartoon!")
    else:
        print(f"{options['image'].name} is a photo!")

OpenCV Edge Detection

This one I haven’t had much luck with: it is only about 55% successful at determining what type of image it is, about as accurate as a coin toss. I don’t recommend using it as is, but if you have any ideas or improvements, let me hear them!

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from pathlib import Path
from typing import Union
import argparse

import cv2
import numpy

def is_cartoon(
    image: Union[str, Path],
    threshold: float = 4500,
    preview: bool = False,
) -> bool:
    # read and resize image
    img = cv2.imread(str(image))
    img = cv2.resize(img, (1024, 1024))

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blurred_gray = cv2.medianBlur(gray, 3)

    edges = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 5, 10
    )
    blurred_edges = cv2.adaptiveThreshold(
        blurred_gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 5, 10
    )

    if preview:
        cv2.imshow("edges", edges)
        cv2.imshow("blurred edges", blurred_edges)
        cv2.waitKey(0)  # keep the windows open until a key is pressed

    count_1 = numpy.count_nonzero(edges)
    count_2 = numpy.count_nonzero(blurred_edges)

    return abs(count_2 - count_1) < threshold

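def command_line_options():
    # Argument names must match the is_cartoon() parameters: image, threshold, preview
    args = argparse.ArgumentParser(
        description="Determine if an image is likely a cartoon or photo."
    )
    args.add_argument("--preview", action="store_true", help="Show the blurred image")
    args.add_argument("--threshold", type=float, default=4500, help="Cutoff threshold")
    args.add_argument("image", type=Path, help="Path to image file")
    return vars(args.parse_args())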

if __name__ == "__main__":
    options = command_line_options()
    if not options["image"].exists():
        raise FileNotFoundError(f"No image exists at {options['image'].absolute()}")
    if is_cartoon(**options):
        print(f"{options['image'].name} is a cartoon!")
    else:
        print(f"{options['image'].name} is a photo!")

Keras Machine Learning Model

Time to take the kid gloves off, let’s use some ML image detection.

First we have to train a minified Xception model with a lot of hand picked good data. I used 556 cartoon images and 2295 real images in the training datasets. Then used that model for detection between another 2000+ unsorted images.

To get step by step details, please check out the same tutorial I used for this.

import shutil

import numpy as np
import os
import time
from pathlib import Path

import tensorflow as tf

root_dir = Path("/training/")
image_size = (180, 180)
batch_size = 32
epochs = 20
model_name = "my_model"

def make_model(input_shape, num_classes):
    inputs = tf.keras.Input(shape=input_shape)
    # Image augmentation block
    data_augmentation = tf.keras.Sequential(
    x = data_augmentation(inputs)

    # Entry block
    x = tf.keras.layers.Rescaling(1.0 / 255)(x)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)

    x = tf.keras.layers.Conv2D(64, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)

    previous_block_activation = x  # Set aside residual

    for size in [128, 256, 512, 728]:
        x = tf.keras.layers.Activation("relu")(x)
        x = tf.keras.layers.SeparableConv2D(size, 3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)

        x = tf.keras.layers.Activation("relu")(x)
        x = tf.keras.layers.SeparableConv2D(size, 3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)

        x = tf.keras.layers.MaxPooling2D(3, strides=2, padding="same")(x)

        # Project residual
        residual = tf.keras.layers.Conv2D(size, 1, strides=2, padding="same")(
            previous_block_activation
        )
        x = tf.keras.layers.add([x, residual])  # Add back residual
        previous_block_activation = x  # Set aside next residual

    x = tf.keras.layers.SeparableConv2D(1024, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)

    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    if num_classes == 2:
        activation = "sigmoid"
        units = 1
    else:
        activation = "softmax"
        units = num_classes

    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(units, activation=activation)(x)
    return tf.keras.Model(inputs, outputs)

def train_data():
    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    val_ds = tf.keras.preprocessing.image_dataset_from_directory(

    train_ds = train_ds.prefetch(buffer_size=32)
    val_ds = val_ds.prefetch(buffer_size=32)

    model = make_model(input_shape=image_size + (3,), num_classes=2)
    tf.keras.utils.plot_model(model, show_shapes=True)

    callbacks = [
        train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,

def clean_images():
    move_to = root_dir.parent / "bad_data"  # needs to be not in the training directory
    moved = 0
    for directory in root_dir.glob("*"):
        if directory.name == move_to.name:
            continue
        if directory.is_dir():
            for i, file in enumerate(directory.glob("*")):
                if not file.name.lower().endswith(("jpg", "jpeg")) or tf.compat.as_bytes("JFIF") not in file.open("rb").read(10):
                    shutil.move(file, move_to / file.name)
                    moved += 1
        print("moved unclean data", moved, "from", directory)

def move_images():
    model = tf.keras.models.load_model(f"{model_name}_{epochs}.h5")
    cartoon_dir = Path("/cartoon/")
    real_dir = Path("/real/")

    real, cartoon, unknown = 0, 0, 0

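    for file in Path("/unsorted/").iterdir():
        # Only look at image files; compare whole extensions, case-insensitively
        if file.suffix.lower() not in (".jpg", ".jpeg", ".png"):
            continue

        img = tf.keras.preprocessing.image.load_img(
            str(file), target_size=image_size
        )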
        img_array = tf.keras.preprocessing.image.img_to_array(img)
        img_array = tf.expand_dims(img_array, 0)  # Create batch axis

        predictions = model.predict(img_array)
        score = predictions[0][0]
        if score > 0.98:
            real += 1
            shutil.move(file, real_dir / file.name)
        elif score < 0.02:
            cartoon += 1
            shutil.move(file, cartoon_dir / file.name)
        else:
            unknown += 1
            print(f"Could not figure out {file} as it was {score * 100}")
    print(f"Moved {real} to real and {cartoon} to cartoon, {unknown} were unmoved")

if __name__ == '__main__':

I personally think 95% overall accuracy is amazing for no additional tuning, especially as this is not exactly the model’s intended use case. Generally we think about classifying items in the image (cat vs dog), not the type of image (anime cat vs real cat).

Creating a successful project – Part 3: Development Tools/Equipment

Every single year that I’ve been doing this, I hear about the next “totally awesome” way to write code.  And more often than not, the new thing is certainly very shiny.

When it comes to projects, with the exception of coding standards (which will be part 4 of this series) I am not a fan of telling developers how to write code.  If you’ve got someone who likes to write code using Notepad on a Microsoft Windows machine, more power to them.  Oh, you like coding in SublimeText3 on Mac – go for it.

If you work on one of my projects there are only a few rules I have about how you write your code:

  1. It must maintain the agreed-upon standard (such as PEP8)
  2. Your code – under penalty of my ire – must work on the designated system.  If it WFM, “Works for Me” then you must get it working on the chosen system. (More on this topic in the test and build posts) And trust me, there’s plenty of people out there – including other contributors to this site – who would shudder to think of my ire directed singly upon them.
  3. Use whatever source code control system has been agreed upon (preferably Git).
  4. Use whatever build system is in play.  Usually, this is done via a Jenkins server, but I’m not picky.  I want consistency, and I want to make sure that the output of the project is reliable.  More on build systems in the CI/CD section.

Notice something odd in there: nowhere did I say you had to use this particular editor or debugger.  I honestly couldn’t care less if you like to write your code using Comic Sans or SourceCodePro.  I really don’t care if you like to code using EMACS or Sublime.  The tools one uses to write code should be selected through a similar vetting process to purchasing a good chef’s knife: use what you feel most comfortable using.

But, in the interest of showing what a rather more seasoned coder uses, here’s my setup:

Keyboard – Microsoft Natural Ergonomic Keyboard – I spend 8-16 hours a day on a keyboard, so I want my keyboard to be comfortable and able to handle heavy use.  The good thing (besides this being a great keyboard) is that they’re nice and cheap, so when one dies, I just buy another.

Mouse – ROCCAT Kone Pure Color – This is just a really great mouse.

Editor – Vim or, as of recently, Neovim – I’ve used Vi/Vim for decades, so I’m a bit of an old hand at using them.

Operating System – Debian Linux – When you want the best and you don’t want extra crap getting in your way; accept only the best.

I use that same setup at work as well as home.  I am not endorsed by any of the product manufacturers; I just know what works for me.  If I find a keyboard in the same form-factor as the one I’m using with Cherry MX Browns, I’ll buy two of them in a heartbeat.

I have also made use of PyCharm and Atom, both of which I still use with Vim keybindings.


Introducing Box – Python dictionaries with recursive dot notation access

Box logo

Everyone loves Python’s dictionaries; they’re fast, easy to create and quite handy for a range of reasons. However, there are times that ["typing"]["out"]["all"]["those"] extra quotes and brackets seem excessive. Wouldn’t it be nicer to access.them.like.class.attributes?

Say hello to box.


from box import Box

movie_data = {
  "movies": {
    "Spaceballs": {
      "imdb_stars": 7.1,
      "rating": "PG",
      "length": 96,
      "Director": "Mel Brooks",
      "Stars": [{"name": "Mel Brooks", "imdb": "nm0000316", "role": "President Skroob"},
                {"name": "John Candy","imdb": "nm0001006", "role": "Barf"},
                {"name": "Rick Moranis", "imdb": "nm0001548", "role": "Dark Helmet"}
      ]
    },
    "Robin Hood: Men in Tights": {
      "imdb_stars": 6.7,
      "rating": "PG-13",
      "length": 104,
      "Director": "Mel Brooks",
      "Stars": [
                {"name": "Cary Elwes", "imdb": "nm0000144", "role": "Robin Hood"},
                {"name": "Richard Lewis", "imdb": "nm0507659", "role": "Prince John"},
                {"name": "Roger Rees", "imdb": "nm0715953", "role": "Sheriff of Rottingham"},
                {"name": "Amy Yasbeck", "imdb": "nm0001865", "role": "Marian"}
      ]
    }
  }
}

my_box = Box(movie_data)


'Mel Brooks'

my_box.movies.Spaceballs.Stars[0]
# <Box: {'name': 'Mel Brooks', 'imdb': 'nm0000316', 'role': 'President Skroob'}>

Box is a creation I made over three years ago, originally in the reusables code base named Namespace, inspired by JavaScript Object access methods.

Install is super simple:

pip install python-box

Or just grab the file box.py directly from the github project.

Every Box is usable as a drop in replacement to dictionaries in 99%* of cases. And every time you add a dictionary or list to a Box object, they become Box (subclass of dict) or BoxList (subclass of list) objects as well.

# box.Box
assert isinstance(my_box, dict)

# box.BoxList
assert isinstance(my_box.movies.Spaceballs.Stars, list)

my_box.movies.Spaceballs.Stars[0].additional_info = {'Birth name': 'Melvin Kaminsky', 'Birthday': "05/28/1926"}

my_box.movies.Spaceballs.Stars[0].additional_info
# <Box: {'Birth name': 'Melvin Kaminsky', 'Birthday': '05/28/1926'}>
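To make the idea of recursive dot access concrete, here is a toy sketch in plain Python. It is not how Box is implemented, and it leaves out almost everything Box does; the class name `DotDict` is made up for this illustration.

```python
class DotDict(dict):
    """Toy dict with recursive attribute access (illustration only)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        return self._wrap(value)

    @classmethod
    def _wrap(cls, value):
        # Wrap nested dictionaries and lists when they are accessed
        if isinstance(value, dict) and not isinstance(value, cls):
            return cls(value)
        if isinstance(value, list):
            return [cls._wrap(item) for item in value]
        return value


data = DotDict({"movies": {"Spaceballs": {"Stars": [{"name": "Mel Brooks"}]}}})
print(data.movies.Spaceballs.Stars[0].name)
```

Because the toy class subclasses dict, normal `data["movies"]` access keeps working alongside the dotted form, and a missing key raises AttributeError.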

At any level you can change a Box object back into a standard dictionary.


my_box.movies.Spaceballs.to_dict()
{'Director': 'Mel Brooks',
 'Stars': [
  {'additional_info': {'Birth name': 'Melvin Kaminsky', 'Birthday': '05/28/1926'},
   'imdb': 'nm0000316',
   'name': 'Mel Brooks',
   'role': 'President Skroob'},
  {'imdb': 'nm0001006', 'name': 'John Candy', 'role': 'Barf'},
  {'imdb': 'nm0001548', 'name': 'Rick Moranis', 'role': 'Dark Helmet'},
  {'imdb': 'nm0000597', 'name': 'Bill Pullman', 'role': 'Lone Starr'}],
 'imdb_stars': 7.1,
 'length': 96,
 'rating': 'PG'}

You can also run to_list() on lists in the Box to return them to a standard list, with all inner Box and BoxList objects transformed back to normal.

Box also has built in functions for dealing with json and yaml**.


# {
#    "imdb_stars": 7.1,
#    "rating": "PG",
#    "length": 96,
#    "Director": "Mel Brooks",
#    "Stars": [
# ...


# Director: Mel Brooks
# imdb_stars: 7.1
# length: 96
# rating: PG
# Stars:
# - imdb: nm0000316
#   name: Mel Brooks
#   role: President Skroob
# ...

Calling a Box object will return its keys. It’s also possible to access the attributes via the standard dictionary method, which is required for keys that are numeric or have spaces.

my_box.movies()
# ('Spaceballs', 'Robin Hood: Men in Tights')

my_box.movies['Robin Hood: Men in Tights']
# <Box: {'imdb_stars': 6.7, 'rating': 'PG-13', 'length': 104, ...

Unlike addict it does not act as a default dictionary, so you will get built-in errors if you try to access something that isn’t there.


my_box.tv_shows
# Traceback (most recent call last):
# ...
# AttributeError: tv_shows

Another power previously mentioned is that you can add dictionaries into lists and they will automatically be converted into Box objects.

my_box.movies.Spaceballs.Stars.append(
    {"name": "Bill Pullman", "imdb": "nm0000597", "role": "Lone Starr"})

my_box.movies.Spaceballs.Stars[3].name
'Bill Pullman'

It also protects itself from having its functions overwritten accidentally.

my_box.to_dict = '3'
# AttributeError: Key name 'to_dict' is protected

Box is also a substitute for the Namespace used by argparse, making it super easy to convert incoming arguments to a dict if wanted. This allows incoming arguments to be easily passed to function arguments.

import argparse
from box import Box

parser = argparse.ArgumentParser()
parser.add_argument('floats', metavar='N', type=float, nargs='+')
parser.add_argument("-v", "--verbosity", action="count", default=0)

args = parser.parse_args(['1', '2', '3', '-vv'], namespace=Box())

{'floats': [1.0, 2.0, 3.0], 'verbosity': 2}
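Without Box, the standard-library route to the same result is calling `vars()` on the parsed namespace. A minimal sketch follows; `describe` is a hypothetical function used only for illustration.

```python
import argparse


def describe(floats, verbosity):
    """Hypothetical function that takes the parsed values as keyword arguments."""
    return f"{len(floats)} floats, total {sum(floats)}, verbosity {verbosity}"


parser = argparse.ArgumentParser()
parser.add_argument("floats", metavar="N", type=float, nargs="+")
parser.add_argument("-v", "--verbosity", action="count", default=0)

args = parser.parse_args(["1", "2", "3", "-vv"])  # a plain argparse.Namespace
print(describe(**vars(args)))
```

The keyword names of the function have to match the argument names, since the dictionary is unpacked directly into the call.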

def example_func(floats, verbosity):


If you have any questions, suggestions or feedback, please open a github issue and let me know!

Hope you enjoy!


* Based off nothing but pure guess and personal experience. The only time the drop-in replacement doesn’t work is when converting or dumping, so make sure to convert back to a standard dictionary first for those cases.

** If you don’t have PyYAML installed, the to_yaml function will not be available.

Run, Subprocess, Run!

Python is awesome, and can pretty much do everything you ever wanted, but on rare occasion, you may want to call an external program. The original way to do this with Python was to use os.system.

import os

return_code = os.system("echo 'May the force be with you'")

The message “May the force be with you” would be printed to the terminal via stdout, and the return code variable would be 0 as it did not error. Great for running a program, not so great if you need to capture its output.

So the Secret Order of the Pythonic Brotherhood* met, performed the required rituals to appease our Benevolent Dictator for Life**, and brought forth subprocess.

* not real  ** real

Subprocess is a module dedicated to running other processes. You have probably already used or encountered it in its many forms: subprocess.call, subprocess.check_call, subprocess.check_output, or even the direct call to the process constructor, subprocess.Popen.

These made life a lot easier, as you could now have easy interaction with the process pipes, running it as a direct call or opening a new shell first, and more. The downside was the inconvenience of having to switch between them depending on needs, as they all had different outputs considering how you interacted with them.

That all changed in Python 3.5, with the introduction of subprocess.run (for older versions check out reusables.run). It is simply the only subprocess command you should ever need! Let’s look at a quick example.

import subprocess

response = subprocess.run("echo 'Join the Dark Side!' 1>&2",
                          shell=True,
                          stderr=subprocess.PIPE)

# CompletedProcess(args="echo 'Join the Dark Side!' 1>&2",
#                  returncode=0, 
#                  stderr=b"'Join the Dark Side!' \r\n")

Now check that response out. It’s an organized class, that stores what args you sent to start the subprocess, the returncode as well as stdout and/or stderr if there was a pipe specified for them. (If something was sent to stdout or stderr and there wasn’t a pipe specified, it would send it to the current terminal.)

As the return value is a class, you can access any of those attributes as normal.

# Join the Dark Side!
# 0
response.check_returncode() # Would return None in this case

It also includes a check_returncode function that will raise subprocess.CalledProcessError if the return code is not 0. 
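A minimal sketch of reading those attributes follows. It runs the Python interpreter itself (`sys.executable`) as the child process so it does not depend on any external program; that choice is an assumption made for portability.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "import sys; print('to stderr', file=sys.stderr)"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

print(result.args[1:])         # the arguments that were passed in
print(result.returncode)       # exit status of the child process
print(result.stdout)           # bytes captured from stdout (empty here)
print(result.stderr.decode())  # bytes captured from stderr, decoded to text
result.check_returncode()      # raises CalledProcessError on a non-zero status
```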

Basically, you should use subprocess.run and never look back. Its only real limitation is that it is equivalent to using Popen(...).communicate(), which means you cannot provide multiple inputs, wait for certain output, or behave interactively in any manner.

There are plenty of additional capabilities that are good to know. This article will cover:

  1. Timeouts
  2. Shell
  3. Passing arguments as string or list
  4. Pipes and Buffers
  5. Input
  6. Working Directory
  7. Environment Variables


Timeouts

In Python 2 it’s a real pain to have a timeout for a subprocess. You could potentially poll for a maximum amount of time before calling it quits, but if you had input it was much harder. On Linux you could use signals, but Windows required either a forever running background thread or running in a separate process.

Thankfully in the modern world we can simply specify one to the run command.

subprocess.run("ping", shell=True, timeout=1)

You’ll see some ping responses being printed to the terminal (as we didn’t send it to a pipe) then in a second (literally) see a traceback.

subprocess.TimeoutExpired: Command 'ping' timed out after 1 seconds

No crazy multiprocessing or signaling needed; you only need to pass a keyword argument.
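In a script you will usually want to catch the exception rather than let the traceback end the program. A minimal sketch, using the Python interpreter as a stand-in for a slow command (an assumption made so the example runs anywhere):

```python
import subprocess
import sys

try:
    # The child sleeps for 5 seconds but is only given half a second
    subprocess.run([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
except subprocess.TimeoutExpired as error:
    timed_out = True
    print(f"Gave up after {error.timeout} seconds")
else:
    timed_out = False
```

When the timeout expires, run kills the child process before raising, so nothing is left running in the background.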


Shell

I see shell being overused and misunderstood a lot, so I want to define its behavior very clearly here. When shell=False is set, there is no system shell started up, so the first argument must be a path to an executable file or else it will fail.

Setting shell=True will first spin up a system dependent shell process (commonly /bin/sh on Linux or cmd.exe on Windows) and run the command within it. With a shell you can use environment variables, shell built-in commands and have glob “*” expansion.

Also keep in mind a lot of programs are actual files on Linux, whereas they are shell built-ins on Windows. That’s why “echo” with shell=False will work on Linux but will break on Windows:

subprocess.run(["echo", "hi"])

# Linux: CompletedProcess(args=['echo', 'hi'], returncode=0)
# Windows: FileNotFoundError: 
#          [WinError 2] The system cannot find the file specified

“So, just always use shell?” Nope, it’s actually better to avoid it whenever possible. It’s costly, aka slower, to spin up a new shell, and it’s susceptible to shell injection vulnerabilities.

If you are going to be calling an executable file, it’s best to always keep shell=False unless you need one of the shell’s features.
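The injection risk is easiest to see with a value that contains shell syntax. In this sketch the `untrusted` string is a made-up example, and the Python interpreter stands in for the real program being called.

```python
import shlex
import subprocess
import sys

untrusted = "notes.txt; echo injected"  # imagine this came from a user

# shell=False: the whole value is delivered as one argument, nothing is interpreted
safe = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
print(safe.stdout.strip())

# If a shell really is needed on a POSIX system, quote the value first
quoted = shlex.quote(untrusted)
print(quoted)
```

Had the same value been pasted unquoted into a shell=True command string, the shell would have treated everything after the semicolon as a second command.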

Arguments as string or list

The first argument passed to subprocess functions, including run, has a seemingly odd behavior: it should change from a list to a string if you use shell=True. In general, if shell=False (the default behavior), pass in a list of arguments. If shell=True, pass in a string.

subprocess.run(['echo', 'howdy'])             # List when shell=False

subprocess.run('echo "howdy"', shell=True)    # String when shell=True

However, it’s important to know why you should do that. It’s because of how Python has to create the new processes and send them information, which differs across operating systems.

On Windows you can get away with passing either a list or a string in either case. When shell=False, Python has to convert the list of arguments into a string anyway to create the new process; when shell=True, the string is sent directly to the shell as-is.

On Linux, the same scenario happens when shell=True. The string will be passed directly to the newly spawned shell as is, so it can expand globs and environment variables.  However, if a list is passed, it is sent as positional arguments to the shell. So if you have:

subprocess.run(['echo', 'howdy'], shell=True)

It is not sending “howdy” as an argument to echo , but rather to /bin/sh.

/bin/sh -c "echo" "howdy"

Which will result in confusing behavior of nothing being returned to stdout and no error.

And going the other direction can be a pain on Linux. When shell=False and a string is provided, the entire thing is treated as the path to the program. Which is helpful if you want to run something without passing any arguments, but can be confusing at first when it  returns a FileNotFoundError .

subprocess.run('echo "howdy"')

# FileNotFoundError: [Errno 2] No such file or directory: 'echo "howdy"'

So to be safe, simply remember:

subprocess.run(['echo', 'howdy'])             # List when shell=False

subprocess.run('echo "howdy"', shell=True)    # String when shell=True

You can also “cheat” by always building a string, then use shlex.split on it if you don’t need to use a shell.

import shlex

args = shlex.split("conquer --who 'mine enemy' "
                   "--when 'Sometime in the next, eh, \"6\" minutes'")

# ['conquer', '--who', 'mine enemy',
#  '--when', 'Sometime in the next, eh, "6" minutes']


(Note that shlex.split should also be sent posix=False when being used on Windows)

Stream, Pipes and Buffers

Pipes and buffers are both sections of memory used for the storage and exchange of data between processes. Pipes are designed to transfer and hold the data, while buffers act as temporary vessels to transfer data to files.

Setting the stdout or stderr streams to subprocess.PIPE will save any output from the program to memory, where it is then stored in the CompletedProcess instance under the corresponding attribute name. If you do not set them, the process inherits the parent's standard file descriptors (stdout is file descriptor 1, and so on), so output goes wherever the parent's output is going at the operating system level. Note that reassigning sys.stdout inside Python does not change file descriptor 1, so on its own it does not redirect the subprocess output.
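As a small illustration of capturing both streams, here is a sketch that runs a tiny child Python program (invented for this example) which writes one line to each stream.

```python
import subprocess
import sys

# A throwaway child program that writes to both stdout and stderr.
child = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"

result = subprocess.run([sys.executable, "-c", child],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)

# Without an encoding or text mode, both attributes hold bytes.
print(result.stdout)
print(result.stderr)
```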

Another common use case is to send their output to a file, i.e:

subprocess.run('sh info_gathering.sh', 
               stdout=open('comp_info.txt', 'w'), 
               shell=True)

That way the output is stored into a file, using a buffer to temporarily hold information in memory until a large enough section is worth writing to the file.
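One thing I would add when doing this in a longer running program is a with block, so the file is closed as soon as the process finishes. A minimal sketch, writing to a temporary directory and using a stand-in child command rather than the info_gathering.sh script above:

```python
import os
import subprocess
import sys
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "comp_info.txt")

# The with block closes the file once run() has returned.
with open(out_path, "w") as out_file:
    subprocess.run([sys.executable, "-c", "print('gathered info')"],
                   stdout=out_file,
                   stderr=subprocess.STDOUT)  # send errors to the same file

with open(out_path) as in_file:
    contents = in_file.read()

print(contents)
```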

If encoding is specified (Python 3.6+), the incoming bytes will be decoded and the buffer will be treated as text, aka “text mode”. This can also happen if either errors or universal_newlines keyword arguments are specified.
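To see the difference text mode makes, this sketch runs the same stand-in command twice, once without and once with an encoding.

```python
import subprocess
import sys

command = [sys.executable, "-c", "print('plain text')"]

# No encoding given: the captured output is raw bytes.
as_bytes = subprocess.run(command, stdout=subprocess.PIPE)

# Encoding given: the output is decoded and newlines are normalized.
as_text = subprocess.run(command, stdout=subprocess.PIPE, encoding="utf-8")

print(type(as_bytes.stdout), type(as_text.stdout))
```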

There are multiple different ways to use buffering:

bufsize   shorthand        description
0         unbuffered       Data will be directly written to file
1         line buffered    Text mode only, will write out buffer on `\n`
-1        system default   Line buffered if text mode, otherwise will generally be 4096 or 8192
> 1       sized buffer     Write out when (approx) that amount of bytes are in the buffer


With run there is a simple keyword argument, input, which works the same as Popen().communicate(input). It is a one-time dump to the standard input of the new process, which can read any of it at its choosing. However, it is not possible to wait for certain output or an event before sending input; that is better suited to pexpect or similar.
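A quick sketch of that one-time dump, using a small child program (made up for the example) that reads everything from standard input and shouts it back:

```python
import subprocess
import sys

# The child reads all of stdin, upper-cases it and prints it.
child = "import sys; print(sys.stdin.read().upper())"

result = subprocess.run([sys.executable, "-c", child],
                        input=b"is anyone there?",  # bytes, as no encoding is set
                        stdout=subprocess.PIPE)

print(result.stdout)
```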


The check keyword argument allows run to be a drop-in replacement for check_call. It makes sure the return code is 0, and raises subprocess.CalledProcessError otherwise.

subprocess.run('exit 1', check=True, shell=True)
# subprocess.CalledProcessError: Command 'exit 1' returned non-zero exit status 1.

subprocess.check_call('exit 1', shell=True)
# subprocess.CalledProcessError: Command 'exit 1' returned non-zero exit status 1.
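In practice you will usually want to catch that exception and look at what went wrong. A small sketch, with a child command that deliberately exits with status 3:

```python
import subprocess
import sys

failed_code = None

try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"],
                   check=True)
except subprocess.CalledProcessError as err:
    # The exception carries the command and its exit status.
    failed_code = err.returncode
    print("command failed with status", failed_code)
```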

Working directory

To start the process from a certain directory, pass in the argument cwd with the directory you want to start at.

subprocess.run('dir', shell=True, cwd="C:\\")
# Volume in drive C has no label.
# Volume Serial Number is 58F1-C44C
# Directory of C:\
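The same idea without relying on a Windows shell: this sketch starts a child Python process in a temporary directory and asks it where it thinks it is.

```python
import os
import subprocess
import sys
import tempfile

start_dir = tempfile.mkdtemp()

result = subprocess.run([sys.executable, "-c", "import os; print(os.getcwd())"],
                        cwd=start_dir,
                        stdout=subprocess.PIPE,
                        encoding="utf-8")

child_dir = result.stdout.strip()
print(child_dir)
```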

Environment Variables

To have environment variables expanded in the command itself, it needs to be run in a shell.

subprocess.run('echo "hey sport, here is a %TEST_VAR%"', 
               shell=True, env={'TEST_VAR': 'fun toy'})

"hey sport, here is a fun toy"
# CompletedProcess(args='echo "hey sport, here is a %TEST_VAR%"',
#                  returncode=0)

It is not uncommon to want to pass the current environment variables as well as your own.

import os

subprocess.run('echo "hey sport, here is a %TEST_VAR%. Being run on %OS%"', 
               shell=True, env=dict(os.environ, TEST_VAR='fun toy'))

"hey sport, here is a fun toy. Being run on Windows_NT"
# CompletedProcess(args='echo "hey sport, here is a %TEST_VAR%. Being run on %OS%"',
#                  returncode=0)
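If the program reads the variable itself, rather than expecting the command line to expand it, the same env argument does the job without a shell. A sketch of that, assuming a child Python process as the program:

```python
import os
import subprocess
import sys

# The child looks the variable up in its own environment.
child = "import os; print(os.environ['TEST_VAR'])"

result = subprocess.run([sys.executable, "-c", child],
                        env=dict(os.environ, TEST_VAR="fun toy"),
                        stdout=subprocess.PIPE,
                        encoding="utf-8")

print(result.stdout)
```

Copying os.environ first keeps things like PATH available to the child, which an env dictionary containing only your own keys would not.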

Reusables – Part 1: Overview and File Management

Reusables 0.8 has just been released, and it’s about time I give it a proper introduction.

I started this project three years ago, with a simple goal of keeping code that I inevitably end up reusing grouped into a single library. It’s for the stuff that’s too small to do well as its own library, but common enough it’s handy to reuse rather than rewrite each time.

It is designed to make the developer’s life easier in a number of ways. First, it requires no external modules. It is possible to supplement some functionality with the modules specified in the requirements.txt file, but they are only required for specific use cases; for example, rarfile is only used to extract, you guessed it, rar files.

Second, everything is tested on both Python 2.6+ and Python 3.3+, as well as on PyPy. It is cross-platform compatible (Windows/Linux), unless a specific function or class specifies otherwise.

Third, everything is documented via docstrings, so they are available at readthedocs, or through the built-in help() command in python.

Lastly, all functions and classes are available at the root level (except CLI helpers), and can be broadly categorized as follows:

  • File Management
    • Functions that deal with file system operations.
  • Logging
    • Functions to help setup and modify logging capabilities.
  • Multiprocessing
    • Fast and dynamic multiprocessing or threading tools.
  • Web
    • Things related to dealing with networking, urls, downloading, serving, etc.
  • Wrappers
    • Function wrappers.
  • Namespace
    • Custom class to expand the usability of python dictionaries as objects.
  • DateTime
    • Custom datetime class primarily for easier formatting.
  • Browser Cookie Management
    • Find, extract or modify cookies of Firefox and Chrome on a system.
  • Command Line Helpers
    • Bash analogues to help system admins perform faster operations from inside an interactive python shell.

In this overview, we will cover:

  1. Installation
  2. Getting Started
  3. File, Folder and String Management
    1. Find Files Fast
    2. Archives (Extraction and Compression)
    3. Run Command
    4. File Hashing
    5. Finding Duplicate Files
    6. Safe File and Folder Names
    7. Touch (ing a file)
    8. Simple JSON and CSV
    9. Cut (ing a string into equal lengths)
    10. Config to dictionary


Very straightforward install, just do a simple pip or easy_install from PyPI.

pip install reusables

or

easy_install reusables

If you need to install it on an offline computer, grab the appropriate Python 2.x or 3.x wheel from PyPI, and just pip install it directly.

There are no additional modules required for install, so if either of those doesn’t work, please open an issue on GitHub.

Getting Started

import reusables 

reusables.add_stream_handler('reusables', level=10)

The logger’s name is ‘reusables’, and by default it does not have any handlers associated with it. For these examples we will have logging on debug level. If you aren’t familiar with logging, please read my post about logging.

File, Folder and String Management

Everything here deals with managing something on the disk, or strings that relate to files. From checking for safe filenames to saving data files.

I’m going to start the show off with my most reused function, that is also one of the most versatile and powerful, find_files. It is basically an advanced implementation of os.walk.

Find Files Fast

                              name="sam", depth=3)

# ['F:\\Pictures\\Family\\SAM.JPG', 
# 'F:\\Pictures\\Family\\Family pictures - assorted\\Sam in 2009.jpg']

With a single line, we are able to search a directory for files by a case insensitive name, a list (or single string) of extensions and even specify a depth. It’s also really fast, taking under five seconds to search through 70,000 files and 30,000 folders, just half a second longer than the Windows built-in equivalent dir /s *sam* | findstr /i "\.jpg \.png \.jpeg \.gif \.bmp \.tif \.tiff \.ico \.mng \.tga \.xcf \.svg".

If you don’t need it as a list, use the generator itself.

for pic in reusables.find_files("F:\\Pictures", name="*chris *.jpg"):
    print(pic)

# F:\Pictures\Family\Family pictures - assorted\Chris 1st grade.jpg
# F:\Pictures\Family\Family pictures - assorted\Chris 6th grade.jpg
# F:\Pictures\Family\Family pictures - assorted\Chris at 3.jpg

That’s right, it also supports glob wildcards. It even supports using the external module scandir for older versions of Python that don’t have it natively (only if enable_scandir=True is specified of course, it’s one of those supplemental modules). Check out the full documentation and more examples at readthedocs.
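If you are curious what an "advanced os.walk" looks like in general, here is a rough standard-library sketch of the same idea. To be clear, this is not the Reusables implementation: find_files_sketch and its exact depth handling are my own assumptions for illustration, and the folder tree is a throwaway one built in a temp directory.

```python
import fnmatch
import os
import tempfile


def find_files_sketch(directory, ext=(), name="", depth=None):
    """Yield file paths under directory that match the given filters."""
    exts = tuple(e.lower() for e in ([ext] if isinstance(ext, str) else ext))
    pattern = name.lower()
    is_glob = any(char in pattern for char in "*?[")
    base_depth = directory.rstrip(os.sep).count(os.sep)

    for root, dirs, files in os.walk(directory):
        # Assumed meaning: depth 1 is only the starting directory, 2 adds
        # its immediate sub-directories, and so on.
        if depth is not None and root.count(os.sep) - base_depth >= depth - 1:
            dirs[:] = []  # emptying dirs in place stops os.walk descending
        for file_name in files:
            lowered = file_name.lower()
            if exts and not lowered.endswith(exts):
                continue
            if is_glob and not fnmatch.fnmatchcase(lowered, pattern):
                continue
            if not is_glob and pattern not in lowered:
                continue
            yield os.path.join(root, file_name)


# Build a small throwaway tree to search through.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "Family", "Old"))
for parts in (("SAM.JPG",), ("notes.txt",),
              ("Family", "Sam in 2009.jpg"), ("Family", "Old", "sam.png")):
    open(os.path.join(top, *parts), "w").close()

found = sorted(os.path.relpath(path, top) for path in
               find_files_sketch(top, ext=(".jpg", ".png"), name="sam", depth=2))
print(found)
```

The file three levels down is skipped because of the depth limit, and notes.txt is filtered out by extension.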


Dealing with the idiosyncrasies between the compression libraries provided by Python can be a real pain. I set out to make a super simple and straight forward way to archive and extract folders.

reusables.archive(['reusables',    # Folder with files 
                   'tests',        # Folder with subfolders
                   'AUTHORS.rst'], # Standalone file

# 'C:\Users\Me\Reusables\my_archive.bz2'

It will compress everything, store it, and keep folder structure in the archives.

To extract files, it is very similar behavior. Given a ‘wallpapers.zip’ file like this:

It is trivial to extract it to a location without having to specify its archive type.

                  path="C:\\Users\\Me\\Desktop\\New Folder 6\\")
# ... DEBUG File wallpapers.zip detected as a zip file
# ... DEBUG Extracting files to C:\Users\Me\Desktop\New Folder 6\
# 'C:\\Users\\Me\\Desktop\\New Folder 6'

We can see that it extracted everything and again kept its folder structure.

The only support difference between the two is that you can extract rar files if you have installed rarfile and dependencies (and specified enable_rar=True), but cannot archive them due to licensing.
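For comparison, the standard library can do a basic version of this round trip through shutil. The sketch below is only that, a plain shutil example on a made-up folder, and says nothing about how Reusables handles archives internally.

```python
import os
import shutil
import tempfile

work = tempfile.mkdtemp()

# A pretend folder of wallpapers with one sub-folder.
source = os.path.join(work, "wallpapers")
os.makedirs(os.path.join(source, "nature"))
with open(os.path.join(source, "nature", "forest.txt"), "w") as handle:
    handle.write("pretend this is a picture")

# make_archive appends the extension itself and returns the final path.
archive_path = shutil.make_archive(os.path.join(work, "my_archive"),
                                   "zip", root_dir=source)

# unpack_archive works out the format from the file extension.
target = os.path.join(work, "extracted")
shutil.unpack_archive(archive_path, target)

extracted_file = os.path.join(target, "nature", "forest.txt")
print(archive_path, os.path.isfile(extracted_file))
```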

Run Command

Ok, so it may not always deal with the file system, but it’s better here than anywhere else. As you may or may not know, in Python 3.5 they released the excellent subprocess.run, which is a convenient wrapper around Popen that returns a clean CompletedProcess class instance. reusables.run is designed to be a version agnostic clone, and will even directly run subprocess.run on Python 3.5 and higher.

reusables.run("cat setup.cfg", shell=True)

# CompletedProcess(args='cat setup.cfg', returncode=0, 
#                 stdout=b'[metadata]\ndescription-file = README.rst')

It does have a few subtle differences that I want to highlight:

  • By default, it sets stdout and stderr to subprocess.PIPE, so that the result is always in the returned CompletedProcess instance.
  • Has an additional copy_local_env argument, which will copy your current shell environment to the subprocess if True.
  • Timeout is accepted, but will raise a NotImplementedError if set on Python 2.x.
  • It doesn’t take positional Popen arguments, only keyword args (2.6 limitation).
  • It returns the same output as Popen, so on Python 2.x stdout and stderr are strings, and on 3.x they are bytes.

Here you can see an example of copy_local_env in action running on Python 2.6.

import os

os.environ['MYVAR'] = 'Butterfly'

reusables.run("echo $MYVAR", copy_local_env=True, shell=True)

# CompletedProcess(args='echo $MYVAR', returncode=0, 
#                 stdout='Butterfly\n')

File Hashing

Python already has nice hashing capabilities through hashlib, but it’s a pain to rewrite the custom code needed to handle large files without a large memory impact, which consists of opening a file, iterating over it in chunks and updating the hash. Instead, here is a convenient function.

reusables.file_hash("reusables\\reusables.py", hash_type="sha")

# '50c5425f9780d5adb60a137528b916011ed09b06'

By default it returns an md5 hash, but can be set to anything available on that system, and returns it in the hexdigest format. If the keyword argument hex_digest is set to False, it will be returned as bytes.

reusables.file_hash("reusables\\reusables.py", hex_digest=False)

# b'4\xe6\x03zPs\xf5\xe9\x8dX\x9c/=/<\x94'
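For the curious, the chunked approach looks roughly like this in plain hashlib. It is a sketch of the general technique, not the Reusables source; file_hash_sketch and its chunk_size default are names and values I picked for the example.

```python
import hashlib
import os
import tempfile


def file_hash_sketch(path, hash_type="md5", chunk_size=65536):
    """Hash a file in fixed-size chunks so memory use stays small."""
    hasher = hashlib.new(hash_type)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Try it on a throwaway file that is larger than a single chunk.
data = b"x" * 200000
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(sample_path, "wb") as handle:
    handle.write(data)

chunked_digest = file_hash_sketch(sample_path, hash_type="sha256")
print(chunked_digest)
```

The digest is identical to hashing the whole file in one go, only the peak memory use differs.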

Starting with python 2.7.9, you can quickly view the available hashes directly from hashlib via hashlib.algorithms_available.

# CPython 3.6
import hashlib

hashlib.algorithms_available

# {'sha3_256', 'MD4', 'sha512', 'sha3_512', 'DSA-SHA', 'md4', ...

reusables.file_hash("wallpapers.zip", "sha3_256")

# 'b7c357d582f8932977d785a24f728b267cef1de87537076aadac5049f4e4fa70'

Duplicate Files

You know you’ve seen this picture before, you shouldn’t have to save it again, where did that sucker go? Wonder no more, find it!


# ['F:\\Pictures\\20131005_212718.jpg',
#  'F:\\Pictures\\Me\\20131005_212718.jpg',
#  'F:\\Pictures\\Personal Favorite\\20131005_212718.jpg']

dup_finder is a generator that will search for a given file in a directory and all of its sub-directories. This is a very fast function, as it does a three-step escalation to detect duplicates: if a step does not match, it will not continue with the other checks. They are verified in this order:

  1. File size
  2. First twenty bytes
  3. Full SHA256 compare
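The escalation itself can be sketched in a few lines. This is my own minimal take on those three steps, with invented helper names and made-up file contents, rather than the actual dup_finder code.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Full-file SHA256, read in one go for brevity."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def is_duplicate_sketch(original, candidate, head_size=20):
    # Step 1: files of different sizes can never be the same file.
    if os.path.getsize(original) != os.path.getsize(candidate):
        return False
    # Step 2: compare only the first few bytes.
    with open(original, "rb") as first, open(candidate, "rb") as second:
        if first.read(head_size) != second.read(head_size):
            return False
    # Step 3: only now pay for a full hash of both files.
    return sha256_of(original) == sha256_of(candidate)


folder = tempfile.mkdtemp()
contents = {"photo.jpg": b"A" * 100, "copy.jpg": b"A" * 100,
            "other.jpg": b"B" * 100, "small.jpg": b"A" * 10}
paths = {}
for file_name, data in contents.items():
    paths[file_name] = os.path.join(folder, file_name)
    with open(paths[file_name], "wb") as handle:
        handle.write(data)

matches = [name for name in ("copy.jpg", "other.jpg", "small.jpg")
           if is_duplicate_sketch(paths["photo.jpg"], paths[name])]
print(matches)
```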

That is excellent for finding a single file, but how about all duplicates in a directory? The traditional option is to create a dictionary of hashes of all the files to compare against. It works, but is slow. Reusables has a directory_duplicates function, which does a file size comparison first, and only moves on to hash comparisons if the sizes match.


# [['.\\.git\\refs\\heads\\master', '.\\.git\\refs\\tags\\0.5.2'], 
#  ['.\\test\\empty', '.\\test\\fake_dir']]

It returns a list of lists, each internal list is a group of matching files.  (To be clear “empty” and “fake_dir” are both empty files used for testing.)
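The size-first idea is easy to sketch with the standard library. The function below is an illustration under my own assumptions (its name, hashing with SHA256 and reading each file whole), not the code inside directory_duplicates.

```python
import collections
import hashlib
import os
import tempfile


def directory_duplicates_sketch(directory):
    """Group identical files, hashing only those that share a size."""
    by_size = collections.defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for file_name in files:
            path = os.path.join(root, file_name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, skip hashing
        by_hash = collections.defaultdict(list)
        for path in same_size:
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_hash[digest].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


folder = tempfile.mkdtemp()
samples = {"a.txt": b"same", "b.txt": b"same",
           "c.txt": b"diff", "d.txt": b"longer one"}
for file_name, data in samples.items():
    with open(os.path.join(folder, file_name), "wb") as handle:
        handle.write(data)

duplicate_groups = directory_duplicates_sketch(folder)
print(duplicate_groups)
```

Here d.txt is never hashed at all, and c.txt is hashed only because it happens to share a size with the real duplicates.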

Just how much faster is it this way? Here’s a benchmark on my system of searching through over sixty-six thousand (66,000) files in thirty thousand (30,000) directories.

The comparison code (the Reusables duplicate finder is referred to as ‘size map’)

import reusables

@reusables.time_it(message="hash map took {seconds:.2f} seconds")
def hash_map(directory):
    hashes = {}
    for file in reusables.find_files(directory):
        file_hash = reusables.file_hash(file)
        hashes.setdefault(file_hash, []).append(file)

    return [v for v in hashes.values() if len(v) > 1]

@reusables.time_it(message="size map took {seconds:.2f} seconds")
def size_map(directory):
    return reusables.directory_duplicates(directory)

if __name__ == '__main__':
    directory = "F:\\Pictures"

    size_map_run = size_map(directory)
    print(f"size map returned {len(size_map_run)} duplicates")

    hash_map_run = hash_map(directory)
    print(f"hash map returned {len(hash_map_run)} duplicates")

The speed up of checking size first in our scenario is significant, roughly 16 times faster.

size map took 40.23 seconds
size map returned 3511 duplicates

hash map took 642.68 seconds
hash map returned 3511 duplicates

It jumps from under a minute when using reusables.directory_duplicates to over ten minutes when using a traditional hash map. This is the fastest pure Python method I have found. If you find a faster one, let me know!

Safe File Names

There are plenty of instances where you want to save a meaningful filename supplied by a user, say for a file transfer program or web upload service, but what if they are trying to crash your system?

Reusables has three functions to help you out.

  • check_filename: returns true if safe to use, else false
  • safe_filename: returns a pruned filename
  • safe_path: returns a safe path

These are designed not off of all legally allowed characters per system, but a restricted set of letters, numbers, spaces, hyphens, underscores and periods.

# False

# 'safeFile_.txt'

reusables.safe_path("C:\\test'\\%my_file%\\;'1 OR 1\\filename.txt")
# 'C:\\test_\\_my_file_\\__1 OR 1\\filename.txt'
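The whitelist approach described above boils down to one regular expression. A minimal sketch, where the function names and the underscore replacement are my own assumptions and not the Reusables source:

```python
import re

# Anything that is NOT a letter, number, space, hyphen, underscore or period.
UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9 _\-.]")


def check_filename_sketch(filename):
    """Return True only if every character is in the allowed set."""
    return UNSAFE_CHARS.search(filename) is None


def safe_filename_sketch(filename, replacement="_"):
    """Swap every disallowed character for the replacement."""
    return UNSAFE_CHARS.sub(replacement, filename)


is_safe = check_filename_sketch("safeFile?.txt")
cleaned = safe_filename_sketch("safeFile?.txt")
print(is_safe, cleaned)
```

Allowing only known-good characters is easier to reason about than trying to list every character that could cause trouble on every operating system.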


Designed to be the same as the Linux touch command. It will create the file if it does not exist, and update the access and modified times to now.

# 1484450442.2250443


# 1484450443.804158
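On Python 3 the standard library gets you most of the way there with pathlib. A small sketch of the same behavior, where touch_sketch is just a name I made up:

```python
import os
import tempfile
import time
from pathlib import Path


def touch_sketch(path):
    """Create the file if needed, bump its times, return the modified time."""
    Path(path).touch()  # exist_ok defaults to True, so existing files are fine
    return os.path.getmtime(path)


new_file = os.path.join(tempfile.mkdtemp(), "new_file")

first_touch = touch_sketch(new_file)   # file is created here
time.sleep(0.05)
second_touch = touch_sketch(new_file)  # file exists, only the times change

print(first_touch, second_touch)
```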

Simple JSON and CSV save and restore

These are already super simple to implement in pure Python with the standard library, and are just here for the convenience of not having to remember conventions.

List of lists to CSV file and back

my_list = [["Name", "Location"],
           ["Chris", "South Pole"],
           ["Harry", "Depth of Winter"],
           ["Bob", "Skull"]]

reusables.list_to_csv(my_list, "example.csv")

# example.csv
# "Name","Location"
# "Chris","South Pole"
# "Harry","Depth of Winter"
# "Bob","Skull"


# [['Name', 'Location'], ['Chris', 'South Pole'], ['Harry', 'Depth of Winter'], ['Bob', 'Skull']]

Save JSON with default indent of 4

my_dict = {"key_1": "val_1",
           "key_for_dict": {"sub_dict_key": 8}}


# example.json
# {
#     "key_1": "val_1",
#     "key_for_dict": {
#         "sub_dict_key": 8
#     }
# }


# {'key_1': 'val_1', 'key_for_dict': {'sub_dict_key': 8}}
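For reference, these are the standard library conventions that are so easy to forget. A plain csv and json sketch, writing to a temp directory, with quoting and indent picked to match the files shown above:

```python
import csv
import json
import os
import tempfile

work = tempfile.mkdtemp()

# CSV: newline="" is the part everyone forgets, it avoids blank rows on Windows.
rows = [["Name", "Location"], ["Chris", "South Pole"], ["Bob", "Skull"]]
csv_path = os.path.join(work, "example.csv")
with open(csv_path, "w", newline="") as handle:
    csv.writer(handle, quoting=csv.QUOTE_ALL).writerows(rows)
with open(csv_path, newline="") as handle:
    restored_rows = list(csv.reader(handle))

# JSON: indent=4 gives the human readable layout.
data = {"key_1": "val_1", "key_for_dict": {"sub_dict_key": 8}}
json_path = os.path.join(work, "example.json")
with open(json_path, "w") as handle:
    json.dump(data, handle, indent=4)
with open(json_path) as handle:
    restored_data = json.load(handle)

print(restored_rows)
print(restored_data)
```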

Cut a string into equal lengths

Ok, I admit, this one has absolutely nothing to do with the file system, but it’s just too handy to not mention right now (and doesn’t really fit anywhere else). One of the features I was most surprised not to find in the standard library was a function that could cut strings into even sections.

I haven’t seen any PEPs about it either way, but I wouldn’t be surprised if one of the reasons is ‘what to do with leftover characters?’. Instead of forcing you to stick with one answer, Reusables has four different ways it can behave to fit your requirement.

By default, it will simply cut everything into even segments, and not worry if the last one has matching length.

# ['ab', 'cd', 'ef', 'gh', 'i']

The other options are to remove it entirely, combine it into the previous grouping (still uneven but now last item is longer than rest instead of shorter) or raise an IndexError exception.

reusables.cut("abcdefghi", 2, "remove")
# ['ab', 'cd', 'ef', 'gh']

reusables.cut("abcdefghi", 2, "combine")
# ['ab', 'cd', 'ef', 'ghi']

reusables.cut("abcdefghi", 2, "error")
# Traceback (most recent call last):
#     ...
# IndexError: String of length 9 not divisible by 2 to splice
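All four behaviors fit in a handful of lines. Here is a sketch of how I would write it from scratch; the parameter names and the "normal" label for the default are assumptions for this example and may not match the real signature.

```python
def cut_sketch(string, characters=2, trailing="normal"):
    """Split a string into pieces of equal length, with leftover handling."""
    pieces = [string[i:i + characters]
              for i in range(0, len(string), characters)]
    leftover = len(string) % characters
    if not leftover or trailing == "normal":
        return pieces
    if trailing == "remove":
        return pieces[:-1]
    if trailing == "combine":
        if len(pieces) == 1:
            return pieces  # nothing before it to combine with
        return pieces[:-2] + [pieces[-2] + pieces[-1]]
    if trailing == "error":
        raise IndexError("String of length {0} not divisible by {1}".format(
            len(string), characters))
    raise ValueError("Unknown trailing option: {0}".format(trailing))


print(cut_sketch("abcdefghi", 2))
print(cut_sketch("abcdefghi", 2, "remove"))
print(cut_sketch("abcdefghi", 2, "combine"))
```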

Config to Dictionary

Everybody and their co-worker has written a ‘better’ config file handler of some sort. This isn’t trying to add to that pile, I swear. This is simply a very quick converter using the built-in parser directly to dictionary format, or to a Python object I call a Namespace (more on that in a future post).

Just to be clear, this only reads configs, it does not write any changes. So given an example config.ini file:

[General]
example=A regular string

[Section 2]
my_bool=yes
anint=234
examplelist=234,123,234,543
floatly=4.4

It reads it as is into a dictionary. Notice there is no automatic parsing or anything fancy going on at all.

# {'General': {'example': 'A regular string'},
#  'Section 2': {'anint': '234',
#                'examplelist': '234,123,234,543',
#                'floatly': '4.4',
#                'my_bool': 'yes'}}

You can also take it into a ConfigNamespace.

config = reusables.config_namespace("config.ini")
# <ConfigNamespace: {'General': {'example': 'A regular string'}, 'Section 2': ...

Namespaces are special dictionaries that allow for dot notation, similar to Bunch but recursively convert dictionaries into Namespaces.

# <ConfigNamespace: {'example': 'A regular string'}>

ConfigNamespace has handy built-in type specific retrieval. Notice that dot notation will not work if items have spaces in them, but the regular dictionary key notation still works.

config['Section 2'].bool("my_bool")
# True

config['Section 2'].bool("bool_that_doesn't_exist", default=False)
# False
# If no default specified, will raise AttributeError

config['Section 2'].float('floatly')
# 4.4

It supports booleans, floats, ints and, unlike the default config parser, lists, which even accept a modifier function.

config['Section 2'].list('examplelist', mod=int)
# [234, 123, 234, 543]
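If you want to see what the built-in parser is doing underneath, here is a standard library sketch of the same ideas. The config text is rebuilt from the dictionary output shown earlier, and as_list is a helper I made up to mimic the list retrieval, so treat it as an approximation rather than the Reusables code.

```python
import configparser

# Config contents inferred from the dictionary printed earlier in the post.
CONFIG_TEXT = """
[General]
example=A regular string

[Section 2]
my_bool=yes
anint=234
examplelist=234,123,234,543
floatly=4.4
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Straight conversion: every value stays a string, nothing fancy.
config = {section: dict(parser.items(section))
          for section in parser.sections()}


def as_list(value, mod=None):
    """Split a comma separated value, optionally converting each item."""
    items = [item.strip() for item in value.split(",")]
    return [mod(item) for item in items] if mod else items


print(config["Section 2"]["anint"])
print(parser.getboolean("Section 2", "my_bool"))
print(as_list(config["Section 2"]["examplelist"], mod=int))
```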


That’s all for this first overview. I hope you found something useful that will make your life easier!

Related links: