Thursday, June 9, 2016

Using Python To Play With Binary Files

Suppose you were to open up an image file in Python 3. Even without knowing much about image formats or binary files, what's possible here?  Let's see what we can find out about these files.  Let's also see what it says about learning strategies.

Some General Ideas

Reading binary files isn't much different from reading text.  One difference is that you open the file with "rb" for "read binary" instead of "rt" for "read text".  If the file in question is small enough, you can grab the data with a one-liner like this....

raw_data = open(file_path, "rb").read()
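That one-liner leaves it up to Python to close the file whenever it gets around to it.  A slightly more careful sketch uses a with block and reads only as many bytes as you ask for.  The read_head name and its default size are made up for illustration.

```python
def read_head(file_path, size=64):
    """Return up to `size` bytes from the start of a file."""
    # "rb" gives back bytes; the with block closes the file for us.
    with open(file_path, "rb") as binary_file:
        return binary_file.read(size)
```

For identifying a file, the first few dozen bytes are usually all you need, so there's no reason to pull a huge image into memory.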

One issue with managing bytes is that there is sometimes text mixed in that's worth a look.  Normal text operations often won't play well with bytes.  For that reason, we decode the bytes so that we have actual text strings to work with.

text_data = raw_data.decode("ascii", errors="ignore")

And yes, a lot of data in the source binary won't translate nicely to text.  Right now, we don't care about that stuff so we ignore it.  
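To see what gets tossed, here is a small sketch using the eight bytes that start every PNG file.  The variable names are just for illustration.

```python
# The 8-byte signature found at the start of every PNG file.
sample = b"\x89PNG\r\n\x1a\n"

# 0x89 is outside the ASCII range, so errors="ignore" silently drops it.
text = sample.decode("ascii", errors="ignore")
print(repr(text))  # 'PNG\r\n\x1a\n'

# Without errors="ignore", that same byte raises an exception instead.
try:
    sample.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
    print("strict decode failed")
```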

What about going the other direction?  Sometimes, a text string needs to be a pile of bytes.  So we want to encode this.

byte_data = text_data.encode("ascii")
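A quick round trip shows both directions at once.  This is only a sketch with made-up sample strings, and it also shows what happens when the text isn't plain ASCII.

```python
text_data = "GIF89a"
byte_data = text_data.encode("ascii")
print(byte_data)  # b'GIF89a'

# Decoding the bytes gets the original string back.
round_trip = byte_data.decode("ascii")
print(round_trip == text_data)  # True

# Characters outside ASCII can't be encoded this way.
try:
    "café".encode("ascii")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False
    print("not plain ASCII")
```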

Do you need to do encoding and decoding everywhere you go?  Not necessarily.  It's just a generally good thing to understand the idea of working in both directions. Now, with those ideas established, let's poke around with some image files.

Identifying Image File Formats (sort of)

Here's a thought.  You can dump bytes from an image file straight to the console.  A good Linux shell environment is handy like that. Heck, you could even page through the mess if you felt so inclined.  For example, you can do this.....

cat pinkie1.png | less

From that, you get a screen full of mostly garbled characters.

While most of the data isn't readable, there are bits of text in there.  Also worth noting: PNG, GIF, and JFIF (aka JPEG) files appear to have something in common.  The first bit of text that shows up seems to consistently identify the image format.  So that's useful.
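You can get a similar view from inside Python.  Here's a rough sketch that shows printable bytes as-is and swaps everything else for a dot.  The printable_preview helper is a made-up name, and the sample bytes are typical file openings typed in by hand rather than read from real files.

```python
def printable_preview(data):
    """Show printable ASCII bytes as-is and everything else as a dot."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)

# Typical opening bytes for each format (sample values, not read from disk).
samples = {
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF89a",
    "jpeg": b"\xff\xd8\xff\xe0\x00\x10JFIF\x00",
}

for name, head in samples.items():
    print(name, printable_preview(head))
```

In each case the first readable run of letters is the format tag.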

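Finding text looks like a job for regex.  But wait a minute!  This is bytes.  Regex is mostly used with text.  No problem.  We'll just decode the bytes we can and toss the rest.  (Python's re module can also search bytes directly if the pattern is a bytes object, but strings are simpler to work with here.)  The result is a function that looks like this.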
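A minimal sketch of such a function follows.  The regex (first run of three or more uppercase letters) and the 64-byte read limit are assumptions made for this example, so treat it as one way to do it rather than the definitive version.

```python
import re

def image_format(file_path, max_bytes=64):
    """Guess an image format from the first ASCII text block in the file."""
    # Only the start of the file matters, so read a small chunk.
    with open(file_path, "rb") as image_file:
        raw_data = image_file.read(max_bytes)
    # Decode what we can and toss the rest.
    text_data = raw_data.decode("ascii", errors="ignore")
    # Assumption: the format tag is the first run of 3+ uppercase letters.
    match = re.search(r"[A-Z]{3,}", text_data)
    return match.group(0) if match else None
```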

So if you put in image_format("pinkie1.png"), you get PNG as the result.  Pretty nifty.

But also pretty useless.  This "pick-the-first-ascii-text-block-you-find" approach has only been seen to work with samples of PNG, GIF, and JPEG files.  It bombs on TIFF, Windows BMP, and probably several other image formats too.  Oh well.

Stirring Up Curiosity Is Its Own Reward

Making a turd like this one did get me thinking.  How would the pros do it?  The Python 3 standard library has a function called what in the imghdr module.  How, exactly, did they do it?  Oh wait!  There is source code out there.  Here's an unofficial GitHub resource right here.  This implementation is good reading and beautiful in its simplicity.
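The gist of that approach is to compare the first few bytes against known signatures, one format at a time.  Here's a standalone sketch loosely modeled on that idea.  It doesn't import imghdr (the module was deprecated in Python 3.11 and removed in 3.13), the guess_image_type name is made up, and it only covers five formats.

```python
def guess_image_type(header):
    """Guess an image type from the first bytes of a file, or return None."""
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    # JPEG files carry a JFIF or Exif tag a few bytes into the file.
    if header[6:10] in (b"JFIF", b"Exif"):
        return "jpeg"
    # TIFF starts with a byte-order mark: "MM" (big) or "II" (little endian).
    if header[:2] in (b"MM", b"II"):
        return "tiff"
    if header.startswith(b"BM"):
        return "bmp"
    return None
```

No decoding and no regex.  Working on the raw bytes directly is what lets it handle TIFF and BMP, where the first-text-block trick falls over.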

This is a good learning strategy for a lot of things.  You poke around with no particular goal.  You stumble upon something.  You dive in.  You screw it up.  You step back. You see how the experts handle it.  Lather.  Rinse.  Repeat.  It's both educational and a lot of fun.  So go do that!  
