When scraping sites, I sometimes end up with upwards of a couple hundred HTML files in one local directory. By taking advantage of the lazy evaluation that the yield syntax allows, it's really easy to process them all.
A couple of things to note here. Since there is a yield statement in soup_line, it behaves as a generator. One Beautiful Soup object is created and yielded, and then the function pauses until the loop asks for the next one. Only one parsed file needs to be in memory at a time, and no time is spent parsing files the loop never reaches.
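Here is a minimal sketch of a generator along these lines. It is not the exact listing: the `directory` and `parse` parameters are my assumptions, and the parser is passed in as a callable so the sketch runs without any third-party package installed.

```python
import os


def soup_line(directory, parse):
    """Lazily yield one parsed document per .html file in `directory`.

    `parse` is a hypothetical hook: any callable that takes the file's
    text and returns a parsed object.
    """
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".html"):
            continue  # skip anything that is not an HTML file
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            # Execution pauses at this yield until the caller asks for
            # the next item, so files are read and parsed one at a time.
            yield parse(handle.read())
```

With Beautiful Soup installed, `parse` could be something like `lambda text: BeautifulSoup(text, "html.parser")`, and the loop is just `for soup in soup_line("pages", parse): ...`, where `"pages"` is a made-up directory name. Nothing is opened or parsed until the loop actually asks for an item.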
I wanted a nice way to keep track of which file each soup object came from, which called for some sort of tuple. SoupFilePair was the solution. This is an example of what's called a namedtuple. These objects behave just like regular tuples but can also double as a sort of "light-weight class". I like them because you don't have to remember what order the values you returned were in; you can ask for each one by name. Running "type" on the object tells you what kind of tuple you are holding, and printing it shows the field names next to the values. It's a nice bit of extra documentation whenever you need it.
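As a rough sketch of the idea, here is a namedtuple pairing a soup object with its file name. The field names `soup` and `filename` are my guesses for illustration, and a plain string stands in for the parsed document so the snippet runs on its own.

```python
from collections import namedtuple

# Field names are illustrative assumptions.
SoupFilePair = namedtuple("SoupFilePair", ["soup", "filename"])

# A placeholder string stands in for a real Beautiful Soup object.
pair = SoupFilePair(soup="<parsed document>", filename="page_001.html")

# Access by name, no need to remember which position is which.
print(pair.filename)          # page_001.html

# It is still a regular tuple, so unpacking and indexing work.
soup, filename = pair
print(pair[1] == filename)    # True

# The type and field names act as built-in documentation.
print(type(pair).__name__)    # SoupFilePair
print(pair._fields)           # ('soup', 'filename')
```

A generator can yield these pairs instead of bare soup objects, so the loop body always knows which file it is looking at.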
I really like named tuples and laziness. They make some of the bigger, more complicated data beasties so much simpler to tame.