Reusables 0.8 has just been released, and it’s about time I give it a proper introduction.
I started this project three years ago with a simple goal: keep the code I inevitably end up reusing grouped into a single library. It's for the stuff that's too small to stand well on its own as a library, but common enough that it's handy to reuse rather than rewrite each time.
It is designed to make the developer's life easier in a number of ways. First, it requires no external modules. It is possible to supplement some functionality with the modules listed in the requirements.txt file, but they are only required for specific use cases; for example, rarfile is only used to extract, you guessed it, rar files.
Second, everything is tested on both Python 2.6+ and Python 3.3+, as well as PyPy. It is cross-platform compatible (Windows/Linux), unless a specific function or class specifies otherwise.
Third, everything is documented via docstrings, so it is all available at readthedocs, or through the built-in help() command in Python.
Lastly, all functions and classes are available at the root level (except CLI helpers), and can be broadly categorized as follows:
- File Management
- Functions that deal with file system operations.
- Logging
- Functions to help setup and modify logging capabilities.
- Multiprocessing
- Fast and dynamic multiprocessing or threading tools.
- Web
- Things related to dealing with networking, urls, downloading, serving, etc.
- Wrappers
- Function wrappers.
- Namespace
- Custom class to expand the usability of python dictionaries as objects.
- DateTime
- Custom datetime class primarily for easier formatting.
- Browser Cookie Management
- Find, extract or modify cookies of Firefox and Chrome on a system.
- Command Line Helpers
- Bash analogues to help system admins perform faster operations from inside an interactive python shell.
In this overview, we will cover:
- Installation
- Getting Started
- File, Folder and String Management
Installation
Very straightforward install: just do a simple pip or easy_install from PyPI.
pip install reusables
OR
easy_install reusables
If you need to install it on an offline computer, grab the appropriate Python 2.x or 3.x wheel from PyPI, and just pip install it directly.
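For example (the wheel filename below is hypothetical; use whichever version and tags match what you downloaded):

pip install reusables-0.8.0-py2.py3-none-any.whl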
There are no additional modules required for install, so if either of those doesn't work, please open an issue on GitHub.
Getting Started
import reusables

reusables.add_stream_handler('reusables', level=10)
The logger's name is 'reusables', and by default it does not have any handlers associated with it. For these examples we will have logging at debug level; if you aren't familiar with logging, please read my post about logging.
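If magic numbers aren't your thing, the standard logging level constants work just as well (logging.DEBUG is 10), so the setup above could also be written as:

import logging
import reusables

# logging.DEBUG == 10, same handler setup as above
reusables.add_stream_handler('reusables', level=logging.DEBUG)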
File, Folder and String Management
Everything here deals with managing something on the disk, or strings that relate to files, from checking for safe filenames to saving data files.
I'm going to start the show off with my most reused function, which is also one of the most versatile and powerful: find_files. It is basically an advanced implementation of os.walk.
Find Files Fast
reusables.find_files_list("F:\\Pictures",
                          ext=reusables.exts.pictures,
                          name="sam",
                          depth=3)
# ['F:\\Pictures\\Family\\SAM.JPG',
#  'F:\\Pictures\\Family\\Family pictures - assorted\\Sam in 2009.jpg']
With a single line, we are able to search a directory for files by a case-insensitive name, a list (or single string) of extensions, and even specify a depth. It's also really fast, taking under five seconds to search through 70,000 files and 30,000 folders, just half a second longer than the Windows built-in equivalent dir /s *sam* | findstr /i "\.jpg \.png \.jpeg \.gif \.bmp \.tif \.tiff \.ico \.mng \.tga \.xcf \.svg".
If you don’t need it as a list, use the generator itself.
for pic in reusables.find_files("F:\\Pictures", name="*chris *.jpg"):
    print(pic)
# F:\Pictures\Family\Family pictures - assorted\Chris 1st grade.jpg
# F:\Pictures\Family\Family pictures - assorted\Chris 6th grade.jpg
# F:\Pictures\Family\Family pictures - assorted\Chris at 3.jpg
That's right, it also supports glob wildcards. It even supports using the external module scandir for older versions of Python that don't have it natively (only if enable_scandir=True is specified, of course; it's one of those supplemental modules). Check out the full documentation and more examples at readthedocs.
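As a quick sketch of what that looks like (the path and extension here are just placeholders; the optional scandir module must be installed):

# Uses the supplemental scandir module on interpreters lacking os.scandir
reusables.find_files_list("F:\\Pictures", ext=".jpg", enable_scandir=True)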
Archives
Dealing with the idiosyncrasies between the compression libraries provided by Python can be a real pain. I set out to make a super simple and straightforward way to archive and extract folders.
reusables.archive(['reusables',    # Folder with files
                   'tests',        # Folder with subfolders
                   'AUTHORS.rst'], # Standalone file
                  name="my_archive.bz2")
# 'C:\Users\Me\Reusables\my_archive.bz2'
It will compress everything, store it, and keep the folder structure in the archive.
To extract files, the behavior is very similar. Given a 'wallpapers.zip' file, it is trivial to extract it to a location without having to specify its archive type.
reusables.extract("wallpapers.zip",
                  path="C:\\Users\\Me\\Desktop\\New Folder 6\\")
# ... DEBUG File wallpapers.zip detected as a zip file
# ... DEBUG Extracting files to C:\Users\Me\Desktop\New Folder 6\
# 'C:\\Users\\Me\\Desktop\\New Folder 6'
We can see that it extracted everything and again kept its folder structure.
The only difference in format support between the two is that you can extract rar files if you have installed rarfile and its dependencies (and specified enable_rar=True), but cannot archive them due to licensing.
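A hedged sketch of what that looks like (the archive name and path here are hypothetical; the optional rarfile module must be installed):

# Requires the supplemental rarfile module and its dependencies
reusables.extract("some_archive.rar",
                  path="C:\\Users\\Me\\Desktop",
                  enable_rar=True)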
Run Command
Ok, so this may not always deal with the file system, but it's better here than anywhere else. As you may or may not know, Python 3.5 introduced the excellent subprocess.run, which is a convenient wrapper around Popen that returns a clean CompletedProcess class instance. reusables.run is designed to be a version-agnostic clone, and will even directly run subprocess.run on Python 3.5 and higher.
reusables.run("cat setup.cfg", shell=True)
# CompletedProcess(args='cat setup.cfg', returncode=0,
#                  stdout=b'[metadata]\ndescription-file = README.rst')
It does have a few subtle differences that I want to highlight:
- By default, it sets stdout and stderr to subprocess.PIPE, so the result is always in the returned CompletedProcess instance.
- It has an additional copy_local_env argument, which will copy your current shell environment to the subprocess if True.
- Timeout is accepted, but will raise a NotImplementedError if set on Python 2.x.
- It doesn't take positional Popen arguments, only keyword args (a 2.6 limitation).
- It returns the same output as Popen, so on Python 2.x stdout and stderr are strings, and on 3.x they are bytes.
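Because stdout and stderr are piped by default, the captured output is always right there on the returned object, as in this small sketch reusing the setup.cfg call from above:

result = reusables.run("cat setup.cfg", shell=True)
result.returncode  # 0
result.stdout      # b'[metadata]\ndescription-file = README.rst'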
Here you can see an example of copy_local_env in action, running on Python 2.6.
import os

os.environ['MYVAR'] = 'Butterfly'
reusables.run("echo $MYVAR", copy_local_env=True, shell=True)
# CompletedProcess(args='echo $MYVAR', returncode=0,
#                  stdout='Butterfly\n')
File Hashing
Python already has nice hashing capabilities through hashlib, but it's a pain to rewrite the custom code for handling large files without a large memory impact: opening the file and iterating over it in chunks, updating the hash as you go.
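That boilerplate looks roughly like this (a minimal sketch using only the standard library; the chunk size is an arbitrary choice):

import hashlib

def chunked_md5(path, chunk_size=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as opened_file:
        # Read fixed-size chunks so large files don't fill memory
        for chunk in iter(lambda: opened_file.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

Instead, here is a convenient function.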
reusables.file_hash("reusables\\reusables.py", hash_type="sha")
# '50c5425f9780d5adb60a137528b916011ed09b06'
By default it returns an md5 hash, but it can be set to anything available on that system, and it returns the result in hexdigest format. If the keyword argument hex_digest is set to False, it will be returned as bytes.
reusables.file_hash("reusables\\reusables.py", hex_digest=False)
# b'4\xe6\x03zPs\xf5\xe9\x8dX\x9c/=/<\x94'
Starting with Python 2.7.9, you can quickly view the available hashes directly from hashlib via hashlib.algorithms_available.
# CPython 3.6
import hashlib

print(f"{hashlib.algorithms_available}")
# {'sha3_256', 'MD4', 'sha512', 'sha3_512', 'DSA-SHA', 'md4', ...

reusables.file_hash("wallpapers.zip", "sha3_256")
# 'b7c357d582f8932977d785a24f728b267cef1de87537076aadac5049f4e4fa70'
Duplicate Files
You know you've seen this picture before, and you shouldn't have to save it again, but where did that sucker go? Wonder no more, find it!
list(reusables.dup_finder("F:\\Pictures\\20131005_212718.jpg",
                          directory="F:\\Pictures"))
# ['F:\\Pictures\\20131005_212718.jpg',
#  'F:\\Pictures\\Me\\20131005_212718.jpg',
#  'F:\\Pictures\\Personal Favorite\\20131005_212718.jpg']
dup_finder is a generator that will search for a given file in a directory and all of its sub-directories. It is a very fast function, as it uses a three-step escalation to detect duplicates; if a step does not match, it will not continue with the remaining checks (see the sketch after this list). The checks are verified in this order:
- File size
- First twenty bytes
- Full SHA256 compare
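The escalation logic looks roughly like this, a minimal sketch of the idea rather than the library's actual implementation (the helper name is made up for illustration):

import hashlib
import os

def could_be_duplicate(original, candidate):
    # Step 1: cheapest check first, the file sizes must match
    if os.path.getsize(original) != os.path.getsize(candidate):
        return False
    # Step 2: compare the first twenty bytes
    with open(original, 'rb') as f1, open(candidate, 'rb') as f2:
        if f1.read(20) != f2.read(20):
            return False
    # Step 3: full SHA256 compare, only reached for likely matches
    def sha256(path):
        hasher = hashlib.sha256()
        with open(path, 'rb') as opened_file:
            for chunk in iter(lambda: opened_file.read(65536), b''):
                hasher.update(chunk)
        return hasher.digest()
    return sha256(original) == sha256(candidate)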
That is excellent for finding a single file, but how about all duplicates in a directory? The traditional option is to create a dictionary of hashes of all the files to compare against. It works, but it is slow. Reusables has a directory_duplicates function, which does a file size comparison first, and only moves on to hash comparisons if the sizes match.
reusables.directory_duplicates(".")
# [['.\\.git\\refs\\heads\\master', '.\\.git\\refs\\tags\\0.5.2'],
#  ['.\\test\\empty', '.\\test\\fake_dir']]
It returns a list of lists; each internal list is a group of matching files. (To be clear, "empty" and "fake_dir" are both empty files used for testing.)
Just how much faster is it this way? Here’s a benchmark on my system of searching through over sixty-six thousand (66,000) files in thirty thousand (30,000) directories.
The comparison code (the Reusables duplicate finder is referred to as 'size map'):
import reusables

@reusables.time_it(message="hash map took {seconds:.2f} seconds")
def hash_map(directory):
    hashes = {}
    for file in reusables.find_files(directory):
        file_hash = reusables.file_hash(file)
        hashes.setdefault(file_hash, []).append(file)
    return [v for v in hashes.values() if len(v) > 1]

@reusables.time_it(message="size map took {seconds:.2f} seconds")
def size_map(directory):
    return reusables.directory_duplicates(directory)

if __name__ == '__main__':
    directory = "F:\\Pictures"

    size_map_run = size_map(directory)
    print(f"size map returned {len(size_map_run)} duplicates")

    hash_map_run = hash_map(directory)
    print(f"hash map returned {len(hash_map_run)} duplicates")
The speedup from checking size first in our scenario is significant: roughly 16 times faster.
size map took 40.23 seconds
size map returned 3511 duplicates
hash map took 642.68 seconds
hash map returned 3511 duplicates
It jumps from under a minute using reusables.directory_duplicates to over ten minutes using a traditional hash map. This is the fastest pure Python method I have found; if you find a faster one, let me know!
Safe File Names
There are plenty of instances where you want to save a meaningful filename supplied by a user, say for a file transfer program or web upload service, but what if they are trying to crash your system?
Reusables has three functions to help you out.
- check_filename: returns True if the filename is safe to use, else False
- safe_filename: returns a pruned filename
- safe_path: returns a safe path
These are designed not around all legally allowed characters per system, but around a restricted set of letters, numbers, spaces, hyphens, underscores and periods.
reusables.check_filename("safeFile?.text")
# False

reusables.safe_filename("safeFile?.txt")
# 'safeFile_.txt'

reusables.safe_path("C:\\test'\\%my_file%\\;'1 OR 1\\filename.txt")
# 'C:\\test_\\_my_file_\\__1 OR 1\\filename.txt'
Touch
Designed to work the same as the Linux touch command, it will create the file if it does not exist, and update the access and modified times to now.
time.time()
# 1484450442.2250443

reusables.touch("new_file")

os.path.getmtime("new_file")
# 1484450443.804158
Simple JSON and CSV save and restore
These are already super simple to implement in pure Python with the standard library, and are just here for the convenience of not having to remember the conventions.
List of lists to CSV file and back
my_list = [["Name", "Location"],
           ["Chris", "South Pole"],
           ["Harry", "Depth of Winter"],
           ["Bob", "Skull"]]

reusables.list_to_csv(my_list, "example.csv")

# example.csv
#
# "Name","Location"
# "Chris","South Pole"
# "Harry","Depth of Winter"
# "Bob","Skull"

reusables.csv_to_list("example.csv")
# [['Name', 'Location'], ['Chris', 'South Pole'], ['Harry', 'Depth of Winter'], ['Bob', 'Skull']]
Save JSON with default indent of 4
my_dict = {"key_1": "val_1",
           "key_for_dict": {"sub_dict_key": 8}}

reusables.save_json(my_dict, "example.json")

# example.json
#
# {
#     "key_1": "val_1",
#     "key_for_dict": {
#         "sub_dict_key": 8
#     }
# }

reusables.load_json("example.json")
# {'key_1': 'val_1', 'key_for_dict': {'sub_dict_key': 8}}
Cut a string into equal lengths
Ok, I admit, this one has absolutely nothing to do with the file system, but it's just too handy not to mention right now (and it doesn't really fit anywhere else). One of the features I was most surprised not to find in the standard library was a function that could cut strings into even sections.
I haven't seen any PEPs about it either way, but I wouldn't be surprised if one of the reasons is 'what do you do with the leftover characters?'. Instead of forcing you to stick with one answer, Reusables has four different ways it can behave, depending on your requirement.
By default, it will simply cut everything into even segments, and not worry if the last one doesn't have a matching length.
reusables.cut("abcdefghi")
# ['ab', 'cd', 'ef', 'gh', 'i']
The other options are to remove it entirely, combine it into the previous grouping (still uneven, but now the last item is longer than the rest instead of shorter), or raise an IndexError exception.
reusables.cut("abcdefghi", 2, "remove")
# ['ab', 'cd', 'ef', 'gh']

reusables.cut("abcdefghi", 2, "combine")
# ['ab', 'cd', 'ef', 'ghi']

reusables.cut("abcdefghi", 2, "error")
# Traceback (most recent call last):
#     ...
# IndexError: String of length 9 not divisible by 2 to splice
Config to Dictionary
Everybody and their co-worker has written a 'better' config file handler of some sort; this isn't trying to add to that pile, I swear. This is simply a very quick converter that uses the built-in parser directly to produce a dictionary, or a Python object I call a Namespace (more on that in a future post).
Just to make clear, this only reads configs, it does not write any changes. So given an example config.ini file:
[General]
example=A regular string

[Section 2]
my_bool=yes
anint=234
exampleList=234,123,234,543
floatly=4.4
It reads it as is into a dictionary. Notice there is no automatic parsing or anything fancy going on at all.
reusables.config_dict("config.ini")
# {'General': {'example': 'A regular string'},
#  'Section 2': {'anint': '234',
#                'examplelist': '234,123,234,543',
#                'floatly': '4.4',
#                'my_bool': 'yes'}}
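This is in the same spirit as flattening the standard library parser yourself, roughly like this sketch (Python 3 module naming; not the library's exact code):

import configparser

parser = configparser.ConfigParser()
parser.read("config.ini")

# Flatten each section into a plain dictionary of raw strings
{section: dict(parser.items(section)) for section in parser.sections()}
# {'General': {'example': 'A regular string'}, 'Section 2': {...}}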
You can also load it into a ConfigNamespace.
config = reusables.config_namespace("config.ini")
# <ConfigNamespace: {'General': {'example': 'A regular string'}, 'Section 2': ...
Namespaces are special dictionaries that allow for dot notation, similar to Bunch, but they recursively convert dictionaries into Namespaces.
config.General
# <ConfigNamespace: {'example': 'A regular string'}>
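That recursion is what makes nested access work; here is a quick sketch with the general Namespace class described above (the dictionary is made up for illustration):

# Nested dicts become Namespaces too, so dot notation chains
ns = reusables.Namespace({"data": {"key": "value"}})
ns.data.key
# 'value'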
ConfigNamespace has handy built-in type-specific retrieval. Notice that dot notation will not work if an item has spaces in it, but the regular dictionary key notation works as well.
config['Section 2'].bool("my_bool")
# True

config['Section 2'].bool("bool_that_doesn't_exist", default=False)
# False
# If no default is specified, it will raise an AttributeError

config['Section 2'].float('floatly')
# 4.4
It supports booleans, floats, ints and, unlike the default config parser, lists, which even accept a modifier function.
config['Section 2'].list('examplelist', mod=int)
# [234, 123, 234, 543]
Finale
That's all for this first overview. I hope you found something useful that will make your life easier!
Related links: