Tutorial

Is it easier to ask for forgiveness in Python?

Yes, we are looking for graphic artists


Is Python more suited to the EAFP (Easier to Ask for Forgiveness than Permission) or the LBYL (Look Before You Leap) coding style? It has been debated across message boards, email chains, Stack Overflow, Twitter and workplaces. It’s not as heated as some other battles, like how spaces are better than tabs, or that nano is indisputably the best terminal editor, but people seem to have their own preference on the subject. What many gloss over is that Python as a language doesn’t have a preference. Rather, different design patterns lend themselves better to one or the other.

I disagree with the position that EAFP is better than LBYL, or “generally recommended” by Python – Guido van Rossum

If you haven’t already, I suggest taking a look at Brett Cannon’s blog post about how EAFP is a valid method with Python that programmers from other languages may not be familiar with. The benefits of EAFP are simple: explicit code, fail fast, succeed faster, and DRY (don’t repeat yourself). In general I find myself using it more than LBYL, but that doesn’t mean it’s always the right way.

Use Case Dependent

First, a little code comparison to see the differences between them. Let’s try to work with a list of lists of strings. Our goal is that if the inner list is at least three items long, we copy the third item into another list.

messages = [["hi there", "how are you?", "I'm doing fine"]]
out = []
 
# LBYL
if messages and messages[0] and len(messages[0]) >= 3:
    out.append(messages[0][2])

# EAFP
try:
    out.append(messages[0][2])
except IndexError as err:
    print(err)  # In the real world, this should be a logging `.exception()`

Both pieces of code are short, and it is easy to follow what they accomplish. However, when there are at least three items in the inner list, LBYL will take almost twice the amount of time! When there aren’t any errors happening, the checks are just additional weight slowing the process down.

But if we prune the messages down to [["hi there", "how are you?"]]   then EAFP will be much slower, because it will always try a bad lookup, fail and then have to catch that failure. Whereas LBYL will only perform the check, see it can’t do it, and move on. Check out my speed comparison gist yourself.
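If you would rather measure this on your own machine, here is a minimal sketch using timeit. It is not the gist mentioned above; the function names lbyl and eafp and the repeat count are my own choices for illustration, and the timings will vary by interpreter and hardware.

```python
import timeit


def lbyl(messages, out):
    # Look before you leap: confirm every index exists before using it
    if messages and messages[0] and len(messages[0]) >= 3:
        out.append(messages[0][2])


def eafp(messages, out):
    # Ask forgiveness: attempt the lookup and handle the failure
    try:
        out.append(messages[0][2])
    except IndexError:
        pass


good = [["hi there", "how are you?", "I'm doing fine"]]
bad = [["hi there", "how are you?"]]

for label, data in (("good data", good), ("bad data", bad)):
    for func in (lbyl, eafp):
        # A fresh output list is passed on every call so runs don't interfere
        seconds = timeit.timeit(lambda: func(data, []), number=100_000)
        print(f"{label:9} {func.__name__}: {seconds:.4f}s")
```

Run it a few times before drawing conclusions, as a single timeit pass can be noisy.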

Some people might think putting both the append and lookup in the same try block is wrong. However, it is both faster, as we only have to do the lookup once, and the proper way, as we are only catching the possible IndexError and not anything that would arise from the append.

The real takeaway from this is to know which side of the coin you are dealing with beforehand. Is your incoming data almost always going to be correct? Then I’d lean towards EAFP, whereas if it’s a crapshoot, LBYL may make more sense.

Sometimes one is always faster

There are some instances when one is almost always faster than the other (though speed isn’t everything). For example, file operations are faster with EAFP because the less IO wait, the better.

Let’s try making a directory when we are not sure if it was created yet or not, while pretending it’s still ye olde times and os.makedirs(..., exist_ok=True) doesn’t exist yet.

import os 

# LBYL 
if not os.path.exists("folder"):
    os.mkdir("folder")

# EAFP 
try:
    os.mkdir("folder")
except FileExistsError:
    pass

In this case, it’s always preferable to use EAFP, as it’s faster (even when executed on an M.2 SSD), there are no side effects, and the error is highly specific so there is no need for special handling.

Be careful though. Many times when dealing with files you don’t want to leave something in a bad state, and in those cases you would use LBYL.

What bad things can happen when you don’t ask permission?

Side effects

In many cases, if something fails to execute, the state of the program or associated files might have changed. If it’s not easy to revert to a known good state in the exception clause, EAFP should be avoided.

# DO NOT DO

with open("file.txt", "w") as f:
    try: 
        f.write("Hi there!\n") 
        f.write(message[3])   
    except IndexError: 
        pass  # Don't do, need to be able to get file back to original state

Catching too much

If you are wrapping something with except Exception or worse, the dreaded blank except:, then you shouldn’t be using EAFP (if you’re still using a blank except:, you need to read up more on that and stop it!)

# DO NOT DO

try:
    do_something()
except:  # If others see this you will be reported to the Python Secret Police
    pass 

Also, sometimes errors seem specific, like an OSError, but could raise multiple different child errors that must be parsed through.

# DO
import errno
import logging
import os

log = logging.getLogger(__name__)

try:
    os.scandir("secret_files")
except FileNotFoundError:  # A child of OSError
    # could be put as 'elif err.errno == errno.ENOENT' below
    log.error("Yo dawg, that directory doesn't exist, "
              "we'll try the backup folder")
except OSError as err:
    if err.errno in (errno.EACCES, errno.EPERM):
        log.error("We don't have permission to access that capt'n! "
                  "We'll try the backup folder")
    else:
        log.exception(f"Unexpected OSError: {err}")
        raise  # If you don't expect an error, don't keep going.
               # This isn't JavaScript

When to use what?

EAFP (Easier to Ask for Forgiveness than Permission)

  • IO operations (Hard drive and Networking)
  • Actions that will almost always be successful
  • Database operations (when dealing with transactions and can rollback)
  • Fast prototyping in a throw away environment

LBYL (Look Before You Leap):

  • Irrevocable actions, or anything that may have a side effect
  • Operation that may fail more times than succeed
  • When an exception that needs special attention could be easily caught beforehand

Sometimes you might even find yourself using both for complex or mission critical applications. Like programming the space shuttle, or getting your code above that 500 lines the teacher wants.

 

Python Decorators

Oftentimes in Python there comes a need to have multiple functions act in a similar manner. It could be anything from making sure that similar functions output the right type of result, to having them all log when they are called or all report exceptions in the same way.

An easy and repeatable way to accomplish this is with decorators. They look like:

@decorator
def my_function(print_string="Hello World"):
    print(print_string)

my_function()
# Hello World

Decorators are simply wrapping a function with another function. In other words, it is exactly the same as doing:

def my_function(print_string="Hello World"):
    print(print_string)

decorator(my_function)("My message")
# My message

So what does a decorator look like?

Decorator Template

from functools import wraps

def decorator(func):
    """This is an example decorator"""
    @wraps(func)
    def container(*args, **kwargs):
        # perform actions before running contained function
        output = func(*args, **kwargs)
        # actions to run after contained function
        return output
    return container

Running the code above does nothing extra currently. It is showing how a decorator runs another function within itself.

@decorator
def my_function():
    print("Hello World")

my_function()
# Hello World

Line by line breakdown

def decorator(func):

The name of the decorator function, which takes the wrapped function as its single argument.

@wraps(func)

wraps is a function from the standard library’s functools module that replaces the name and docstring of the inner wrapper function with those of the wrapped function. The section after the line by line breakdown explains why this is necessary.

def container(*args, **kwargs):

This inner function collects the parameters and keyword parameters that are going to be passed to the original function. This allows the decorator access to incoming arguments to verify or modify before the function is ever run.

output = func(*args, **kwargs)

This runs the function with the original arguments and captures the output. It is also possible to return a completely different result. It is more common to check or modify the output, like if you wanted to make sure everything returned was an integer.

return output

Don’t forget to actually return the original function’s output (or a custom one). Otherwise the function will simply return None.

return container

The container function is the actual function being called,  hence why *args, **kwargs are passed to it. It is necessary to return it from the outside decorator so it can be called.
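To make the “before” and “after” slots of the template concrete, here is a small sketch of a decorator that logs every call. The names log_calls and the logger name are my own for illustration and are not part of the template above.

```python
import logging
from functools import wraps

log = logging.getLogger("decorator_example")  # hypothetical logger name


def log_calls(func):
    """Log each call to the wrapped function and what it returned."""
    @wraps(func)
    def container(*args, **kwargs):
        # Action before running the contained function
        log.info("calling %s args=%r kwargs=%r", func.__name__, args, kwargs)
        output = func(*args, **kwargs)
        # Action after the contained function has finished
        log.info("%s returned %r", func.__name__, output)
        return output
    return container


@log_calls
def add(num1, num2):
    """Add two numbers together and return the results"""
    return num1 + num2


print(add(1, 2))
# 3
```

The log lines only show up if a handler is configured at INFO level or lower, for example with logging.basicConfig(level=logging.INFO).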

The importance of Wraps

We need to incorporate wraps so that the function name and docstring appear to be from the wrapped function, and not those of the wrapper itself.

@decorator
def my_func():
    """Example function"""
    return "Hello"

@decorator_no_wrap
def second_func():
    """Some awesome docstring"""
    return "World"

help(my_func)
# Help on function my_func:

# my_func()
#    Example function

help(second_func)
# Help on function container:

# container(*args, **kwargs)

It is possible, though more work, to accomplish the same thing yourself.

def decorator(func):
    """This is an example decorator"""
    def container(*args, **kwargs):
        return func(*args, **kwargs)
    container.__name__ = func.__name__
    container.__doc__ = func.__doc__
    return container
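The decorator_no_wrap used in the help() comparison is never defined in this post. Assuming it is simply the template with @wraps left off, a self-contained sketch of the difference looks like this:

```python
from functools import wraps


def decorator(func):
    @wraps(func)  # copies __name__, __doc__ and friends onto container
    def container(*args, **kwargs):
        return func(*args, **kwargs)
    return container


def decorator_no_wrap(func):
    # Assumed definition: same as above, minus @wraps
    def container(*args, **kwargs):
        return func(*args, **kwargs)
    return container


@decorator
def my_func():
    """Example function"""
    return "Hello"


@decorator_no_wrap
def second_func():
    """Some awesome docstring"""
    return "World"


print(my_func.__name__, "|", my_func.__doc__)
# my_func | Example function

print(second_func.__name__, "|", second_func.__doc__)
# container | None
```

Both functions still return the right values; only the metadata that help() and other inspection tools rely on is lost.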

Useful Example

Now let’s turn it into something useful. In this example we will make sure that the function returns the expected type of result. Otherwise it will raise an exception for us so there are no hidden complications down the road.

from functools import wraps

def isint(func):
    """ 
    This decorator will make sure the resulting value 
    is an integer or else it will raise an exception.
    """
    @wraps(func)
    def container(*args, **kwargs):
        output = func(*args, **kwargs)
        if not isinstance(output, int):
            raise TypeError("function did not return integer")
        return output
    return container

@isint
def add(num1, num2):
    """Add two numbers together and return the results"""
    return num1 + num2

print(add(1, 2))
# 3

print(add("this", "that"))
# TypeError: function did not return integer

A regular decorator is applied by Python directly to the function beneath it, so you do not need to add () after its name, as with @isint. However, if the decorator accepts arguments, aka a meta-decorator, it will require () even if nothing additional is passed to it.

Meta-decorators

Passing arguments to a decorator turns it into a meta-decorator. To pass these arguments in, you either wrap another function around the decorator or turn the entire thing into a class.

from functools import wraps

def istype(instance_type=int):
    def decorator(func):
        """ 
        This decorator will make sure the resulting value is the 
        type specified or else it will raise an exception. 
        """
        @wraps(func)
        def container(*args, **kwargs):
            output = func(*args, **kwargs)
            if not isinstance(output, instance_type):
                raise TypeError("function did not return proper type")
            return output
        return container
    return decorator

@istype()
def add(num1, num2):
    """Add two numbers together and return the results"""
    return num1 + num2

@istype(str)
def reverse(forward_string):
    """Reverse and return incoming string"""
    return forward_string[::-1]


print(add(1, 2))
# 3

print(reverse("Hello"))
# "olleH"

print(add("this", "that"))
# TypeError: function did not return proper type

Remember running a decorator is equivalent to:

decorator(my_function)("My message")

Running a meta-decorator adds an additional layer.

reversed_string = istype(str)(reverse)("Reverse Me")

This is why @decorator doesn’t need to be called when put above a function, but @istype() does.

You can also create this meta-decorator as a class instead of another function.

class IsType: # Instead of 'def istype'

    def __init__(self, inc_type=int):
        self.inc_type = inc_type

    def __call__(self, func): # Replaces 'def decorator(func)'
        @wraps(func)
        def container(*args, **kwargs):
            output = func(*args, **kwargs)
            if not isinstance(output, self.inc_type):
                raise TypeError("function did not return proper type")
            return output
        return container

In functionality they are the same, but be aware they are technically different types. This will really only impact code inspectors and those trying to manually navigate code, so it is not a huge issue, but it is something to be aware of.

type(IsType)
<class 'type'>

type(istype)
<class 'function'>

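Things to keep in mind

  1. Order of operation does matter. If you have multiple decorators around a single function, they are applied from the bottom up, so at call time the wrappers run from top to bottom.
  2. Once a function is wrapped, calling it by name always goes through the wrapper. (Decorators built with functools.wraps do keep the original reachable through the __wrapped__ attribute.)
  3. Meta-decorators require the additional parentheses, regular decorators do not.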
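A quick sketch of the first two points. The tag meta-decorator is made up for this illustration; it just records which wrapper ran. The last part relies on functools.wraps storing the wrapped function as __wrapped__, which is how the original can still be reached even though calling the function by name always goes through the wrappers.

```python
from functools import wraps

calls = []


def tag(label):
    """Hypothetical meta-decorator that records when its wrapper runs."""
    def decorator(func):
        @wraps(func)
        def container(*args, **kwargs):
            calls.append(label)
            return func(*args, **kwargs)
        return container
    return decorator


@tag("top")
@tag("bottom")
def greet():
    return "Hello"


greet()
print(calls)
# ['top', 'bottom']

# Each layer of wraps keeps a reference to what it wrapped
calls.clear()
original = greet.__wrapped__.__wrapped__
print(original(), calls)
# Hello []
```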

Creating a successful project – Part 3: Development Tools/Equipment

Every single year that I’ve been doing this, I hear about the next “totally awesome” way to write code.  And more often than not, the new thing is certainly very shiny.

When it comes to projects, with the exception of coding standards (which will be part 4 of this series) I am not a fan of telling developers how to write code.  If you’ve got someone who likes to write code using Notepad on a Microsoft Windows machine, more power to them.  Oh, you like coding in SublimeText3 on Mac – go for it.

If you work on one of my projects there are only a few rules I have about how you write your code:

  1. It must maintain the agreed-upon standard (such as PEP8)
  2. Your code – under penalty of my ire – must work on the designated system.  If it WFM, “Works for Me” then you must get it working on the chosen system. (More on this topic in the test and build posts) And trust me, there’s plenty of people out there – including other contributors to this site – who would shudder to think of my ire directed singly upon them.
  3. Use whatever source code control system is agreed upon (preferably Git).
  4. Use whatever build system is in play.  Usually, this is done via a Jenkins server, but I’m not picky.  I want consistency, and I want to make sure that the output of the project is reliable.  More on build systems in the CI/CD section.

Notice something odd in there: nowhere did I say you had to use this particular editor or debugger.  I honestly couldn’t care less if you like to write your code using Comic Sans or SourceCodePro.  I really don’t care if you like to code using EMACS or Sublime.  The tools one uses to write code should be selected through a similar vetting process to purchasing a good chef’s knife: use what you feel most comfortable using.

But, in the interest of showing what a rather more seasoned coder uses, here’s my setup:

Keyboard – Microsoft Natural Ergonomic Keyboard – I spend 8-16 hours a day on a keyboard, so I want my keyboard to be comfortable and able to handle heavy use.  The good thing (besides this being a great keyboard) is that they’re nice and cheap.  So when one dies, I just buy another.

Mouse – ROCCAT Kone Pure Color – This is just a really great mouse.

Editor – Vim or, as of recently, Neovim – I’ve used Vi/Vim for decades so I’m a bit of an old hand at using them.

Operating System – Debian Linux – When you want the best and you don’t want extra crap getting in your way; accept only the best.

I use that same setup at work as well as home.  I am not endorsed by any of the product manufacturers; I just know what works for me.  If I find a keyboard in the same form-factor as the one I’m using with Cherry MX Browns, I’ll buy two of them in a heartbeat.

I have also made use of PyCharm and Atom, both of which I still use with Vim keybindings.

 

Introducing Box – Python dictionaries with recursive dot notation access

Box logo

Everyone loves Python’s dictionaries; they’re fast, easy to create and quite handy for a range of reasons. However, there are times that ["typing"]["out"]["all"]["those"] extra quotes and brackets seem excessive. Wouldn’t it be nicer to access.them.like.class.methods?

Say hello to box.


from box import Box

movie_data = {
  "movies": {
    "Spaceballs": {
      "imdb_stars": 7.1,
      "rating": "PG",
      "length": 96,
      "Director": "Mel Brooks",
      "Stars": [{"name": "Mel Brooks", "imdb": "nm0000316", "role": "President Skroob"},
                {"name": "John Candy","imdb": "nm0001006", "role": "Barf"},
                {"name": "Rick Moranis", "imdb": "nm0001548", "role": "Dark Helmet"}
      ]
    },
    "Robin Hood: Men in Tights": {
      "imdb_stars": 6.7,
      "rating": "PG-13",
      "length": 104,
      "Director": "Mel Brooks",
      "Stars": [
                {"name": "Cary Elwes", "imdb": "nm0000144", "role": "Robin Hood"},
                {"name": "Richard Lewis", "imdb": "nm0507659", "role": "Prince John"},
                {"name": "Roger Rees", "imdb": "nm0715953", "role": "Sheriff of Rottingham"},
                {"name": "Amy Yasbeck", "imdb": "nm0001865", "role": "Marian"}
      ]
    }
  }
}


my_box = Box(movie_data)

my_box.movies.Spaceballs.rating
'PG'

my_box.movies.Spaceballs.Stars[0].name
'Mel Brooks'

my_box.movies.Spaceballs.Stars[0]
# <Box: {'name': 'Mel Brooks', 'imdb': 'nm0000316', 'role': 'President Skroob'}>

Box is a creation I made over three years ago, originally in the reusables code base named Namespace, inspired by JavaScript Object access methods.

Install is super simple:

pip install python-box

Or just grab the file box.py directly from the GitHub project.

Every Box is usable as a drop-in replacement for dictionaries in 99%* of cases. And every time you add a dictionary or list to a Box object, they become Box (subclass of dict) or BoxList (subclass of list) objects as well.

type(my_box)
# box.Box
assert isinstance(my_box, dict)

type(my_box.movies.Spaceballs.Stars)
# box.BoxList
assert isinstance(my_box.movies.Spaceballs.Stars, list)

my_box.movies.Spaceballs.Stars[0].additional_info = {'Birth name': 'Melvin Kaminsky', 'Birthday': "05/28/1926"}

my_box.movies.Spaceballs.Stars[0].additional_info
# <Box: {'Birth name': 'Melvin Kaminsky', 'Birthday': '05/28/1926'}>
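For anyone curious how dot access on a dictionary can work at all, here is a deliberately tiny sketch. To be clear, this is not Box’s implementation; DotDict and _convert are hypothetical names, and this version converts nested values on every read rather than when they are added, which Box does not need to do.

```python
class DotDict(dict):
    """Hypothetical stand-in, not Box: a dict whose keys read as attributes."""

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails
        try:
            value = self[item]
        except KeyError:
            raise AttributeError(item) from None
        return _convert(value)


def _convert(value):
    # Wrap nested dicts (and dicts inside lists) so dot access keeps working
    if isinstance(value, dict) and not isinstance(value, DotDict):
        return DotDict(value)
    if isinstance(value, list):
        return [_convert(v) for v in value]
    return value


data = DotDict({"movies": {"Spaceballs": {"rating": "PG",
                                          "Stars": [{"name": "Mel Brooks"}]}}})

print(data.movies.Spaceballs.rating)
# PG

print(data.movies.Spaceballs.Stars[0].name)
# Mel Brooks
```

Because the wrapping happens on read, assigning through a dotted path here would modify a copy for lists; a real implementation has to handle that, along with protected names and conversion back to plain dicts.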

At any level you can change a Box object back into a standard dictionary.

my_box.movies.Spaceballs.to_dict()

{'Director': 'Mel Brooks',
 'Stars': [
  {'additional_info': {'Birth name': 'Melvin Kaminsky', 'Birthday': '05/28/1926'},
   'imdb': 'nm0000316',
   'name': 'Mel Brooks',
   'role': 'President Skroob'},
  {'imdb': 'nm0001006', 'name': 'John Candy', 'role': 'Barf'},
  {'imdb': 'nm0001548', 'name': 'Rick Moranis', 'role': 'Dark Helmet'},
  {'imdb': 'nm0000597', 'name': 'Bill Pullman', 'role': 'Lone Starr'}],
 'imdb_stars': 7.1,
 'length': 96,
 'rating': 'PG'}

You can also run to_list() on lists in the Box to return them to a standard list, with all inner Box and BoxList objects transformed back to normal.

Box also has built in functions for dealing with json and yaml**.

my_box.movies.Spaceballs.to_json()

# {
#    "imdb_stars": 7.1,
#    "rating": "PG",
#    "length": 96,
#    "Director": "Mel Brooks",
#    "Stars": [
# ...


my_box.movies.Spaceballs.to_yaml()

# Director: Mel Brooks
# imdb_stars: 7.1
# length: 96
# rating: PG
# Stars:
# - imdb: nm0000316
#   name: Mel Brooks
#   role: President Skroob
# ...


Calling a Box object will return its keys. It’s also possible to access the attributes using the standard dictionary method, which is required for keys that are numeric or have spaces.

my_box.movies()
# ('Spaceballs', 'Robin Hood: Men in Tights')

my_box.movies['Robin Hood: Men in Tights']
# <Box: {'imdb_stars': 6.7, 'rating': 'PG-13', 'length': 104, ...

Unlike addict it does not act as a default dictionary, so you will get built-in errors if you try to access something that isn’t there.

my_box.tv_shows

# Traceback (most recent call last):
# ...
# AttributeError: tv_shows

Another power previously mentioned is that you can add dictionaries into lists and they will automatically be converted into Box objects.

my_box.movies.Spaceballs.Stars.append(
    {"name": "Bill Pullman", "imdb": "nm0000597", "role": "Lone Starr"})

my_box.movies.Spaceballs.Stars[-1].name
'Bill Pullman'

It also protects itself from having its functions overwritten accidentally.

my_box.to_dict = '3'
# AttributeError: Key name 'to_dict' is protected

Box is also a substitute for the Namespace used by argparse, making it super easy to convert incoming arguments to a dict if wanted. This allows incoming arguments to be easily passed to function arguments.

import argparse
from box import Box

parser = argparse.ArgumentParser()
parser.add_argument('floats', metavar='N', type=float, nargs='+')
parser.add_argument("-v", "--verbosity", action="count", default=0)

args = parser.parse_args(['1', '2', '3', '-vv'], namespace=Box())

print(args.to_dict())
{'floats': [1.0, 2.0, 3.0], 'verbosity': 2}


def example_func(floats, verbosity):
    print(verbosity)

example_func(**args)
2

If you have any questions, suggestions or feedback, please open a GitHub issue and let me know!

Hope you enjoy!

Caveats

*  Based on nothing but pure guesswork and personal experience. The only time the drop-in replacement doesn’t work is when converting or dumping, so make sure to use to_dict() first in those cases.

** If you don’t have PyYAML installed, the to_yaml function will not be available.

Reusables – Part 1: Overview and File Management

Reusables 0.8 has just been released, and it’s about time I give it a proper introduction.

I started this project three years ago, with a simple goal of keeping code that I inevitably end up reusing grouped into a single library. It’s for the stuff that’s too small to do well as its own library, but common enough it’s handy to reuse rather than rewrite each time.

It is designed to make the developer’s life easier in a number of ways. First, it requires no external modules. It’s possible to supplement some functionality with the modules specified in the requirements.txt file, but they are only required for specific use cases; for example, rarfile is only used to extract, you guessed it, rar files.

Second, everything is tested on both Python 2.6+ and Python 3.3+, and also on pypy. It is cross-platform, working on both Windows and Linux, unless a specific function or class specifies otherwise.

Third, everything is documented via docstrings, so they are available at readthedocs, or through the built-in help() command in Python.

Lastly, all functions and classes are all available at the root level (except CLI helpers), and can be broadly categorized as follows:

  • File Management
    • Functions that deal with file system operations.
  • Logging
    • Functions to help setup and modify logging capabilities.
  • Multiprocessing
    • Fast and dynamic multiprocessing or threading tools.
  • Web
    • Things related to dealing with networking, urls, downloading, serving, etc.
  • Wrappers
    • Function wrappers.
  • Namespace
    • Custom class to expand the usability of python dictionaries as objects.
  • DateTime
    • Custom datetime class primarily for easier formatting.
  • Browser Cookie Management
    • Find, extract or modify cookies of Firefox and Chrome on a system.
  • Command Line Helpers
    • Bash analogues to help system admins perform faster operations from inside an interactive python shell.

In this overview, we will cover:

  1. Installation
  2. Getting Started
  3. File, Folder and String Management
    1. Find Files Fast
    2. Archives (Extraction and Compression)
    3. Run Command
    4. File Hashing
    5. Finding Duplicate Files
    6. Safe File and Folder Names
    7. Touch (ing a file)
    8. Simple JSON and CSV
    9. Cut (ing a string into equal lengths)
    10. Config to dictionary

Installation

Very straightforward install, just do a simple pip or easy_install from PyPI.

pip install reusables

OR

easy_install reusables

If you need to install it on an offline computer, grab the appropriate Python 2.x or 3.x wheel from PyPI, and just pip install it directly.

There are no additional modules required for install, so if either of those doesn’t work, please open an issue on GitHub.

Getting Started

import reusables 

reusables.add_stream_handler('reusables', level=10)

The logger’s name is ‘reusables’, and by default it does not have any handlers associated with it. For these examples we will have logging on debug level. If you aren’t familiar with logging, please read my post about logging.

File, Folder and String Management

Everything here deals with managing something on the disk, or strings that relate to files. From checking for safe filenames to saving data files.

I’m going to start the show off with my most reused function, that is also one of the most versatile and powerful, find_files. It is basically an advanced implementation of os.walk.

Find Files Fast

reusables.find_files_list("F:\\Pictures",
                          ext=reusables.exts.pictures,
                          name="sam", depth=3)

# ['F:\\Pictures\\Family\\SAM.JPG', 
# 'F:\\Pictures\\Family\\Family pictures - assorted\\Sam in 2009.jpg']

With a single line, we are able to search a directory for files by a case insensitive name, a list (or single string) of extensions and even specify a depth.  It’s also really fast, taking under five seconds to search through 70,000 files and 30,000 folders, which is just half a second longer than using the Windows built-in equivalent dir /s *sam* | findstr /i "\.jpg \.png \.jpeg \.gif \.bmp \.tif \.tiff \.ico \.mng \.tga \.xcf \.svg".

If you don’t need it as a list, use the generator itself.

for pic in reusables.find_files("F:\\Pictures", name="*chris *.jpg"):
    print(pic)

# F:\Pictures\Family\Family pictures - assorted\Chris 1st grade.jpg
# F:\Pictures\Family\Family pictures - assorted\Chris 6th grade.jpg
# F:\Pictures\Family\Family pictures - assorted\Chris at 3.jpg

That’s right, it also supports glob wildcards. It even supports using the external module scandir for older versions of Python that don’t have it natively (only if enable_scandir=True is specified of course, it’s one of those supplemental modules). Check out the full documentation and more examples at readthedocs.
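If you are wondering what “an advanced implementation of os.walk” roughly involves, here is a stripped down sketch using only the standard library. It is not the Reusables code: find_files_sketch is a hypothetical name, the name argument is treated purely as a glob, and the meaning of depth (1 means only the top directory) is my own assumption.

```python
import fnmatch
import os


def find_files_sketch(directory, name="*", ext=(), depth=None):
    """Yield paths under directory matching a case insensitive glob and extensions."""
    ext = tuple(e.lower() for e in ([ext] if isinstance(ext, str) else ext))
    root_depth = os.path.abspath(directory).rstrip(os.sep).count(os.sep)
    for root, dirs, files in os.walk(directory):
        if depth is not None:
            current = os.path.abspath(root).rstrip(os.sep).count(os.sep) - root_depth
            if current >= depth - 1:
                dirs[:] = []  # emptying dirs in place stops os.walk descending
        for file_name in files:
            lowered = file_name.lower()
            if ext and not lowered.endswith(ext):
                continue
            if fnmatch.fnmatch(lowered, name.lower()):
                yield os.path.join(root, file_name)


for path in find_files_sketch(".", ext=(".jpg", ".png"), depth=2):
    print(path)
```

Being a generator, it starts yielding matches immediately instead of waiting for the whole tree to be walked.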

Archives

Dealing with the idiosyncrasies between the compression libraries provided by Python can be a real pain. I set out to make a super simple and straightforward way to archive and extract folders.

reusables.archive(['reusables',    # Folder with files 
                   'tests',        # Folder with subfolders
                   'AUTHORS.rst'], # Standalone file
                   name="my_archive.bz2")

# 'C:\Users\Me\Reusables\my_archive.bz2'

It will compress everything, store it, and keep folder structure in the archives.

To extract files, the behavior is very similar. Given a ‘wallpapers.zip’ file, it is trivial to extract it to a location without having to specify its archive type.

reusables.extract("wallpapers.zip",
                  path="C:\\Users\\Me\\Desktop\\New Folder 6\\")
# ... DEBUG File wallpapers.zip detected as a zip file
# ... DEBUG Extracting files to C:\Users\Me\Desktop\New Folder 6\
# 'C:\\Users\\Me\\Desktop\\New Folder 6'

We can see that it extracted everything and again kept its folder structure.

The only support difference between the two is that you can extract rar files if you have installed rarfile and dependencies (and specified enable_rar=True), but cannot archive them due to licensing.

Run Command

Ok, so it may not always deal with the file system, but it’s better here than anywhere else. As you may or may not know, in Python 3.5 they released the excellent subprocess.run, which is a convenient wrapper around Popen that returns a clean CompletedProcess class instance. reusables.run is designed to be a version agnostic clone, and will even directly run subprocess.run on Python 3.5 and higher.

reusables.run("cat setup.cfg", shell=True)

# CompletedProcess(args='cat setup.cfg', returncode=0, 
#                 stdout=b'[metadata]\ndescription-file = README.rst')

It does have a few subtle differences that I want to highlight:

  • By default, it sets stdout and stderr to subprocess.PIPE, so the result is always in the returned CompletedProcess instance.
  • Has an additional copy_local_env argument, which will copy your current shell environment to the subprocess if True.
  • Timeout is accepted, but will raise a NotImplementedError if set on Python 2.x.
  • It doesn’t take positional Popen arguments, only keyword args (2.6 limitation).
  • It returns the same output as Popen, so on Python 2.x stdout and stderr are strings, and on 3.x they are bytes.

Here you can see an example of copy_local_env  in action running on Python 2.6.

import os

os.environ['MYVAR'] = 'Butterfly'

reusables.run("echo $MYVAR", copy_local_env=True, shell=True)

# CompletedProcess(args='echo $MYVAR', returncode=0, 
#                 stdout='Butterfly\n')
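On Python 3.5+ the first difference in that list can be approximated with a thin wrapper around subprocess.run. This is only a sketch of that one idea, not the Reusables source, and run_sketch is a hypothetical name.

```python
import subprocess
import sys


def run_sketch(command, **kwargs):
    """Call subprocess.run, capturing stdout and stderr unless told otherwise."""
    kwargs.setdefault("stdout", subprocess.PIPE)
    kwargs.setdefault("stderr", subprocess.PIPE)
    return subprocess.run(command, **kwargs)


# Using the current interpreter keeps the example independent of the shell
result = run_sketch([sys.executable, "-c", "print('Butterfly')"])
print(result.returncode, result.stdout)
```

As noted in the list, stdout comes back as bytes on Python 3, so decode it before treating it as text.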

File Hashing

Python already has nice hashing capabilities through hashlib, but it’s a pain to rewrite the custom code needed to handle large files without a large memory impact, which consists of opening a file, iterating over it in chunks and updating the hash. Instead, here is a convenient function.

reusables.file_hash("reusables\\reusables.py", hash_type="sha")

# '50c5425f9780d5adb60a137528b916011ed09b06'

By default it returns an md5 hash, but it can be set to anything available on that system. It returns the hash in hexdigest format; if the keyword argument hex_digest is set to False, it will be returned as bytes.

reusables.file_hash("reusables\\reusables.py", hex_digest=False)

# b'4\xe6\x03zPs\xf5\xe9\x8dX\x9c/=/<\x94'
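For reference, the “custom code” being avoided looks roughly like the sketch below. It mirrors the hash_type and hex_digest arguments shown above, but file_hash_sketch and the blocksize value are my own choices and not taken from Reusables.

```python
import hashlib


def file_hash_sketch(path, hash_type="md5", blocksize=65536, hex_digest=True):
    """Hash a file in fixed-size chunks so it never sits fully in memory."""
    hasher = hashlib.new(hash_type)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest() if hex_digest else hasher.digest()
```

Feeding the hasher chunk by chunk gives the same digest as hashing the whole file at once, just without the memory spike.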

Starting with Python 2.7.9, you can quickly view the available hashes directly from hashlib via hashlib.algorithms_available.

# CPython 3.6
import hashlib

print(f"{hashlib.algorithms_available}")
# {'sha3_256', 'MD4', 'sha512', 'sha3_512', 'DSA-SHA', 'md4', ...

reusables.file_hash("wallpapers.zip", "sha3_256")

# 'b7c357d582f8932977d785a24f728b267cef1de87537076aadac5049f4e4fa70'

Duplicate Files

You know you’ve seen this picture before, you shouldn’t have to save it again, where did that sucker go? Wonder no more, find it!

list(reusables.dup_finder("F:\\Pictures\\20131005_212718.jpg", 
                          directory="F:\\Pictures"))

# ['F:\\Pictures\\20131005_212718.jpg',
#  'F:\\Pictures\\Me\\20131005_212718.jpg',
#  'F:\\Pictures\\Personal Favorite\\20131005_212718.jpg']

dup_finder is a generator that will search for a given file in a directory and all its sub-directories. It is a very fast function, as it does a three-step escalation to detect duplicates: if a step does not match, it will not continue with the other checks. They are verified in this order:

  1. File size
  2. First twenty bytes
  3. Full SHA256 compare
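The escalation above can be sketched for a single pair of files like this. It follows the three listed steps but is my own illustration, with same_file and _sha256 as hypothetical helper names rather than anything exported by Reusables.

```python
import hashlib
import os


def _sha256(path, blocksize=65536):
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def same_file(first, second, peek=20):
    """Return True if both files have identical content, cheapest checks first."""
    # 1. File size: a single stat call each, no reading required
    if os.path.getsize(first) != os.path.getsize(second):
        return False
    # 2. First twenty bytes: one tiny read each
    with open(first, "rb") as a, open(second, "rb") as b:
        if a.read(peek) != b.read(peek):
            return False
    # 3. Full SHA256 compare: only reached by files that passed both checks
    return _sha256(first) == _sha256(second)
```

Most non-duplicates are rejected by the first or second step, so the expensive full read is rarely needed.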

That is excellent for finding a single file, but how about all duplicates in a directory? The traditional option is to create a dictionary of hashes of all the files to compare against. It works, but is slow. Reusables has a directory_duplicates function, which does a file size comparison first, and only moves on to hash comparisons if the sizes match.

reusables.directory_duplicates(".")

# [['.\\.git\\refs\\heads\\master', '.\\.git\\refs\\tags\\0.5.2'], 
#  ['.\\test\\empty', '.\\test\\fake_dir']]

It returns a list of lists, each internal list is a group of matching files.  (To be clear “empty” and “fake_dir” are both empty files used for testing.)
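The size-first idea can be sketched in a few lines of standard library code. Again, this is an illustration of the approach and not the actual directory_duplicates source; for brevity it reads each candidate file fully into memory when hashing.

```python
import hashlib
import os
from collections import defaultdict


def directory_duplicates_sketch(directory):
    """Group files by size first, then hash only the sizes seen more than once."""
    by_size = defaultdict(list)
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, skip the hashing
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates
```

The saving comes from never opening files whose size is unique in the tree, which in a typical picture folder is most of them.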

Just how much faster is it this way? Here’s a benchmark on my system of searching through over sixty-six thousand (66,000) files in thirty thousand (30,000) directories.

The comparison code (the Reusables duplicate finder is referred to as ‘size map’)

import reusables

@reusables.time_it(message="hash map took {seconds:.2f} seconds")
def hash_map(directory):
    hashes = {}
    for file in reusables.find_files(directory):
        file_hash = reusables.file_hash(file)
        hashes.setdefault(file_hash, []).append(file)

    return [v for v in hashes.values() if len(v) > 1]


@reusables.time_it(message="size map took {seconds:.2f} seconds")
def size_map(directory):
    return reusables.directory_duplicates(directory)


if __name__ == '__main__':
    directory = "F:\\Pictures"

    size_map_run = size_map(directory)
    print(f"size map returned {len(size_map_run)} duplicates")

    hash_map_run = hash_map(directory)
    print(f"hash map returned {len(hash_map_run)} duplicates")

The speed-up from checking size first is significant in our scenario: roughly 16 times faster.

size map took 40.23 seconds
size map returned 3511 duplicates

hash map took 642.68 seconds
hash map returned 3511 duplicates

It jumps from under a minute using reusables.directory_duplicates to over ten minutes using a traditional hash map. This is the fastest pure Python method I have found; if you find a faster one, let me know!

Safe File Names

There are plenty of instances where you want to save a meaningful filename supplied by a user, say for a file transfer program or web upload service. But what if they are trying to crash your system?

Reusables has three functions to help you out.

  • check_filename: returns True if safe to use, else False
  • safe_filename: returns a pruned filename
  • safe_path: returns a safe path

These are not based on every character each system legally allows, but on a restricted set of letters, numbers, spaces, hyphens, underscores and periods.

reusables.check_filename("safeFile?.text")
# False

reusables.safe_filename("safeFile?.txt")
# 'safeFile_.txt'

reusables.safe_path("C:\\test'\\%my_file%\\;'1 OR 1\\filename.txt")
# 'C:\\test_\\_my_file_\\__1 OR 1\\filename.txt'
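For a feel of how a whitelist like this can work, here is a small sketch built on a regular expression. It assumes the restricted set is exactly the one described above, and the names is_safe_name and make_safe_name are mine, not part of Reusables. The real functions may handle more edge cases.

```python
import re

# Matches anything that is NOT a letter, digit, space, hyphen, underscore or period
_UNSAFE = re.compile(r"[^A-Za-z0-9 \-_.]")


def is_safe_name(filename):
    """Hypothetical check: True only if every character is in the allowed set."""
    return _UNSAFE.search(filename) is None


def make_safe_name(filename, replacement="_"):
    """Hypothetical cleaner: swap each disallowed character for the replacement."""
    return _UNSAFE.sub(replacement, filename)


print(is_safe_name("safeFile?.text"))   # False
print(make_safe_name("safeFile?.txt"))  # safeFile_.txt
```

Whitelisting what is allowed, rather than blacklisting what is dangerous, means an unexpected character is rejected by default.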

Touch

Designed to behave the same as the Linux touch command. It creates the file if it does not exist, and updates the access and modified times to now.

import os
import time

import reusables

time.time()
# 1484450442.2250443

reusables.touch("new_file")

os.path.getmtime("new_file")
# 1484450443.804158
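For comparison, the same behaviour can be sketched in a few lines of standard library code. This is an illustration under my own name touch_file, not the Reusables source. On Python 3, pathlib.Path.touch() covers similar ground.

```python
import os


def touch_file(path):
    """Hypothetical stand-in for a touch command."""
    # Append mode creates the file if missing, without truncating existing content
    with open(path, "a"):
        pass
    # Passing None sets both access and modified times to the current time
    os.utime(path, None)
```

Opening in append mode rather than write mode is the important detail, as "w" would wipe an existing file.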

Simple JSON and CSV save and restore

These are already super simple to implement in pure Python with the standard library, and are just here for the convenience of not having to remember the conventions.

List of lists to CSV file and back

my_list = [["Name", "Location"],
           ["Chris", "South Pole"],
           ["Harry", "Depth of Winter"],
           ["Bob", "Skull"]]

reusables.list_to_csv(my_list, "example.csv")

# example.csv
#
# "Name","Location"
# "Chris","South Pole"
# "Harry","Depth of Winter"
# "Bob","Skull"


reusables.csv_to_list("example.csv")

# [['Name', 'Location'], ['Chris', 'South Pole'], ['Harry', 'Depth of Winter'], ['Bob', 'Skull']]
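For reference, this is roughly what you would otherwise write by hand with the csv module on Python 3. The helper names write_rows and read_rows are made up for the example, and I am assuming quote-everything output to match the file shown above.

```python
import csv


def write_rows(rows, path):
    # newline="" lets the csv module manage line endings itself
    with open(path, "w", newline="") as f:
        # QUOTE_ALL wraps every field in double quotes
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)


def read_rows(path):
    with open(path, newline="") as f:
        # Every value comes back as a string, whatever it was when written
        return list(csv.reader(f))
```

The newline="" argument and the quoting constant are exactly the sort of conventions that are easy to forget.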

Save JSON with default indent of 4

my_dict = {"key_1": "val_1",
           "key_for_dict": {"sub_dict_key": 8}}

reusables.save_json(my_dict, "example.json")

# example.json
# 
# {
#     "key_1": "val_1",
#     "key_for_dict": {
#         "sub_dict_key": 8
#     }
# }

reusables.load_json("example.json")

# {'key_1': 'val_1', 'key_for_dict': {'sub_dict_key': 8}}
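The standard library equivalent is short too. Here is a sketch with the json module, where write_json and read_json are hypothetical names and the indent of 4 mirrors the default mentioned above.

```python
import json


def write_json(data, path, indent=4):
    with open(path, "w") as f:
        # indent controls the pretty-printing of nested structures
        json.dump(data, f, indent=indent)


def read_json(path):
    with open(path) as f:
        return json.load(f)
```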

Cut a string into equal lengths

Ok, I admit, this one has absolutely nothing to do with the file system, but it’s just too handy not to mention right now (and doesn’t really fit anywhere else). One of the features I was most surprised not to find in the standard library is a function that cuts strings into even sections.

I haven’t seen any PEPs about it either way, but I wouldn’t be surprised if one of the reasons is ‘what to do with leftover characters?’. Instead of forcing a single answer on you, Reusables supports four different behaviors to fit your requirement.

By default, it will simply cut everything into even segments, without worrying whether the last one is a matching length.

reusables.cut("abcdefghi")
# ['ab', 'cd', 'ef', 'gh', 'i']

The other options are to remove it entirely, combine it into the previous grouping (still uneven but now last item is longer than rest instead of shorter) or raise an IndexError exception.

reusables.cut("abcdefghi", 2, "remove")
# ['ab', 'cd', 'ef', 'gh']

reusables.cut("abcdefghi", 2, "combine")
# ['ab', 'cd', 'ef', 'ghi']

reusables.cut("abcdefghi", 2, "error")
# Traceback (most recent call last):
#     ...
# IndexError: String of length 9 not divisible by 2 to splice
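The four behaviors are easy to picture with a small sketch. This is not the Reusables code: cut_string is a hypothetical name, "keep" is my own label for the default mode, and the other three mode names are borrowed from the calls above.

```python
def cut_string(text, size=2, leftover="keep"):
    """Hypothetical helper: split text into size-length pieces."""
    if leftover not in ("keep", "remove", "combine", "error"):
        raise ValueError("unknown leftover mode: {}".format(leftover))

    pieces = [text[i:i + size] for i in range(0, len(text), size)]

    # Nothing special to do when the text divides evenly (or is empty)
    if not pieces or len(pieces[-1]) == size or leftover == "keep":
        return pieces

    if leftover == "remove":
        return pieces[:-1]
    if leftover == "combine":
        if len(pieces) == 1:
            return pieces  # no previous piece to merge into
        return pieces[:-2] + [pieces[-2] + pieces[-1]]
    # Only "error" is left at this point
    raise IndexError("String of length {} not divisible by {}".format(len(text), size))


print(cut_string("abcdefghi"))                # ['ab', 'cd', 'ef', 'gh', 'i']
print(cut_string("abcdefghi", 2, "combine"))  # ['ab', 'cd', 'ef', 'ghi']
```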

Config to Dictionary

Everybody and their co-worker has written a ‘better’ config file handler of some sort. This isn’t trying to add to that pile, I swear. This is simply a very quick converter that uses the built-in parser to go directly to dictionary format, or to a Python object I call a Namespace (more on that in a future post).

To be clear, this only reads configs; it does not write any changes. So given an example config.ini file:

[General]
example=A regular string

[Section 2]
my_bool=yes
anint=234
exampleList=234,123,234,543
floatly=4.4

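It reads it as is into a dictionary. Notice there is no automatic type conversion or anything fancy going on at all: every value stays a string, and option names come back lowercased, as they do with the built-in parser.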

reusables.config_dict("config.ini")
# {'General': {'example': 'A regular string'},
#  'Section 2': {'anint': '234',
#                'examplelist': '234,123,234,543',
#                'floatly': '4.4',
#                'my_bool': 'yes'}}
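Under the hood this kind of conversion needs very little. Here is a sketch using configparser from the Python 3 standard library, with ini_to_dict as a made-up name. It is my guess at the general shape, not the Reusables implementation.

```python
import configparser


def ini_to_dict(path):
    """Hypothetical helper: read an INI file into a plain nested dictionary."""
    parser = configparser.ConfigParser()
    # Note: read() silently skips files that do not exist
    parser.read(path)
    # Option names are lowercased by the parser; values are left as strings
    return {section: dict(parser.items(section)) for section in parser.sections()}
```

One thing to be aware of: parser.items() also folds in anything from a [DEFAULT] section, so those values show up under every section.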

You can also load it into a ConfigNamespace.

config = reusables.config_namespace("config.ini")
# <ConfigNamespace: {'General': {'example': 'A regular string'}, 'Section 2': ...

Namespaces are special dictionaries that allow for dot notation, similar to Bunch, but they also recursively convert nested dictionaries into Namespaces.

config.General
# <ConfigNamespace: {'example': 'A regular string'}>
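The dot-notation idea itself fits in a dozen lines. Here is a stripped-down sketch, using the hypothetical name DotDict; the real Namespace does considerably more than this.

```python
class DotDict(dict):
    """Hypothetical minimal namespace: a dict whose keys also work as attributes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Recursively wrap nested dictionaries so dot access works at every level
        for key, value in list(self.items()):
            if isinstance(value, dict) and not isinstance(value, DotDict):
                self[key] = DotDict(value)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


config = DotDict({"General": {"example": "A regular string"}})
print(config.General.example)  # A regular string
```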

ConfigNamespace has handy built-in type-specific retrieval. Notice that dot notation will not work if an item has spaces in its name, but regular dictionary key notation still works.

config['Section 2'].bool("my_bool")
# True

config['Section 2'].bool("bool_that_doesn't_exist", default=False)
# False
# If no default specified, will raise AttributeError

config['Section 2'].float('floatly')
# 4.4

It supports booleans, floats, ints and, unlike the default config parser, lists, which even accept a modifier function.

config['Section 2'].list('examplelist', mod=int)
# [234, 123, 234, 543]
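List retrieval with a modifier boils down to split, strip and map. Here is a rough sketch, assuming a comma separator; get_list and its separator parameter are hypothetical, and only mod comes from the call above.

```python
def get_list(section, key, mod=None, separator=","):
    """Hypothetical helper: split a config string into a list, optionally converting items."""
    items = [item.strip() for item in section[key].split(separator)]
    # Apply the modifier (int, float, str.upper, ...) to each item if one was given
    return [mod(item) for item in items] if mod is not None else items


section = {"examplelist": "234,123,234,543"}
print(get_list(section, "examplelist", mod=int))  # [234, 123, 234, 543]
```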

Finale

That’s all for this first overview. I hope you found something useful that will make your life easier!

Related links: