The Hundred Minute Hack: Processing Lots of Text Files with Python

Working with text, and with files containing text, is a very common task. Python makes dealing with a single file a breeze. But what if you need to deal with LOTS of text files? That can be a bit trickier. Fortunately, Python is very much up to that challenge as well.

My Setup For This

The example discussed here runs on 64-bit Linux Mint 17.2. Since we're dealing with Blender, you'll need the 64-bit version for Linux installed. The code shown here is confirmed to run on 64-bit Python 3.4 for Linux.

The Goal In This Example

There's a piece of art software called Blender that interests me. In Blender programming, there's an idea called "region types". Region types help position interfaces. My interest is in lines that look similar to this.

bl_region_type = "WINDOW"

What region types are commonly used within the Python scripts that come packaged with Blender? That is the question to answer here. Answering it means scanning for info within the contents of multiple files. There are four general steps to take in this sort of situation.

Decide on what files are of interest.
Decide what lines of those files are of interest.
Get whatever it is you need from those lines.
Act on that extracted info.

Let's break this down.

Picking Your Files

For this example, we are interested in files ending in .py within a specific set of directories.

The walk function from the os module is very handy here. It makes iterating through nested directories very simple. It is also a generator.

There are two choices for dealing with generator-based info. On one hand, you could consume the generator and return a full list of Python file paths. On the other, you could yield each path one at a time as it is created. We're going with this second choice. There's no reason that we need a full list of Python file paths in memory all at once. Having pyfiles as a generator is fine.
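
A minimal sketch of what a pyfiles generator along these lines might look like. The signature (an iterable of directories) and the use of os.path.join to build each path are assumptions, not necessarily what the original script did.

```python
import os


def pyfiles(directories):
    """Lazily yield the path of every .py file under the given directories."""
    for directory in directories:
        # os.walk yields one (dir_name, sub_dirs, file_names) tuple per
        # directory it visits, descending into nested directories for us.
        for dir_name, _sub_dirs, file_names in os.walk(directory):
            for file_name in file_names:
                if file_name.endswith(".py"):
                    # Hand back one path at a time instead of building a list.
                    yield os.path.join(dir_name, file_name)
```

Nothing is read from disk until something iterates over pyfiles, so a caller that stops early never pays for the directories it didn't reach.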

Picking The Lines

Let's assume that the line of code in question won't be broken up across multiple lines. We'll also assume that the expression in question won't have comments on the same line. That allows us to write naive code like what's below.
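
Here is a sketch of that naive approach, under the two assumptions just stated. The token default, the encoding arguments, and the show_matches helper are illustrative assumptions; the helper exists only to show the shape of the nested loop.

```python
def pick_lines(file_path, token="bl_region_type"):
    """Lazily yield every line of file_path that contains token."""
    # The with block closes the file once the loop finishes or the
    # generator is closed. errors="replace" is an assumption, so that a
    # stray undecodable byte doesn't stop the scan.
    with open(file_path, encoding="utf-8", errors="replace") as source:
        for line in source:
            if token in line:
                yield line


def show_matches(file_paths, token="bl_region_type"):
    """Hypothetical helper: print each matching line from each file."""
    for file_path in file_paths:  # one path at a time, e.g. from a generator
        for line in pick_lines(file_path, token):  # one matching line at a time
            print(line.rstrip())
```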

Some things to note here. First is the usage of the with statement. Using with in this way allows the file object's context manager to help us along. We are dealing with dozens of files here. Letting that context manager close those files for us makes sense.

The second thing to note is that pick_lines is a generator too. We don't need a full collection of file lines in memory so why have it?

The third thing is the setup of the for loop. It is double-nested. It handles both file name collecting and file reading duties. And yet, it is still easy to read and doesn't burden resources. It's lazy. It's beautiful. It's also practical for what's being done here.

Getting What's Needed From The Lines

The goal is to get specific information from this kind of line.

bl_region_type = "WINDOW"

We want the WINDOW part and nothing else. This only requires a few lines of straight-line imperative code. Adding more functions or generators isn't called for here. Let's expand on our loop.
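
A rough sketch of that extraction step. To keep it runnable on its own, the loop here runs over two made-up sample lines standing in for whatever the nested file-and-line loop produces; the variable names are assumptions.

```python
# Made-up sample lines standing in for the output of the file-scanning loop.
lines = [
    'bl_region_type = "WINDOW"',
    "    bl_region_type = 'UI'\n",
]

region_types = []
for line in lines:
    # Keep only what follows the first equals sign.
    _before, _equals, value = line.partition("=")
    # Toss surrounding whitespace first, then either kind of quote.
    value = value.strip().strip("\"'")
    region_types.append(value)
```

Using partition rather than split means a line with no equals sign yields an empty string instead of raising an error. Lines that mention the name without a simple assignment would still produce junk, which is the price of the naive assumptions made earlier.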

This is self-explanatory. The important stuff comes after the equal sign. The stuff before it doesn't matter. Quotes and extra spaces don't count either. All that gets tossed before throwing the desired info into a list. And since this list is our final target, having it concretely in memory is the right thing to do.

Act On The Information

This can be whatever the situation calls for. That could be writing info to a database or another file or whatever else you need. Here's what I did.
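
One way to do this kind of tallying is with collections.Counter from the standard library. This is a sketch on made-up sample data, and whether the original script used Counter or a plain dict is an assumption.

```python
from collections import Counter

# Made-up sample standing in for the list built by the extraction step.
region_types = ["WINDOW", "UI", "WINDOW", "TOOLS", "WINDOW", "UI"]

# Counter maps each distinct value to the number of times it appears.
tallies = Counter(region_types)

for name, count in tallies.items():
    print("{}: {}".format(name, count))
```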

This is pretty much what it looks like. My main interest was seeing what region types were actually being used in Blender addon scripts. Having that info and count tallies output to the console is enough.

And Finally The Completed Script
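
Pulling the four steps together, a minimal sketch of such a script might look like this. It is a reconstruction from the description above rather than the original listing: count_region_types, the use of collections.Counter and os.path.join, and taking the directories from the command line are all assumptions.

```python
import os
import sys
from collections import Counter


def pyfiles(directories):
    """Lazily yield the path of every .py file under the given directories."""
    for directory in directories:
        for dir_name, _sub_dirs, file_names in os.walk(directory):
            for file_name in file_names:
                if file_name.endswith(".py"):
                    yield os.path.join(dir_name, file_name)


def pick_lines(file_path, token):
    """Lazily yield every line of file_path that contains token."""
    with open(file_path, encoding="utf-8", errors="replace") as source:
        for line in source:
            if token in line:
                yield line


def count_region_types(directories, token="bl_region_type"):
    """Tally the value assigned on every line that mentions token."""
    region_types = []
    for file_path in pyfiles(directories):
        for line in pick_lines(file_path, token):
            # Keep what follows the equals sign, minus spaces and quotes.
            _before, _equals, value = line.partition("=")
            region_types.append(value.strip().strip("\"'"))
    return Counter(region_types)


if __name__ == "__main__":
    # Directories to scan are passed as command-line arguments.
    for name, count in count_region_types(sys.argv[1:]).items():
        print("{}: {}".format(name, count))
```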

And here are the results.

WINDOW: 90
TOOLS: 56
UI: 78

A testament to the power of Python is what can be accomplished in barely a couple dozen lines of code. Those few lines can parse through thousands of lines of text spread across dozens of files from various directories. What you can do beyond that is up to your imagination.

3 comments:

Boris, February 9, 2016 at 9:31 AM
This is slightly unnecessary. You can use `find` to find the files you are interested in, `awk` to get the lines you are interested in, and either count from inside `awk` or use `sort` and `uniq`. It comes to fewer than 10 "information" lines of code.

Anonymous, February 9, 2016 at 12:13 PM
You'll enjoy Part 2 of David Beazley's presentation (page I-34ff) http://www.dabeaz.com/generators/Generators.pdf

Unknown, September 22, 2016 at 2:51 AM
> "{}{}{}".format(dir_name, os.path.sep, file_name)
Man doesn't even know the OS standard library module!
> os.path.join(dir_name, file_name)

Tuesday, February 9, 2016
