from bs4 import BeautifulSoup


def soup_line(dir_name, *exclusions):
    """
    Pair up soups with the files they're based on.
    :param dir_name: Directory with the html files needed.
    :param exclusions: Don't include these files.
    :return: A generator of (soup, file name) pairs.
    """
    import os
    from collections import namedtuple

    SoupFilePair = namedtuple("SoupFilePair", ["soup", "file_name"])

    def is_valid_file(fpath):
        # Exclusions are compared against the full path, e.g. "sources/skip.html".
        return (
            os.path.isfile(fpath)
            and fpath.endswith(".html")
            and fpath not in exclusions
        )

    for fname in os.listdir(dir_name):
        file_path = "{0}/{1}".format(dir_name, fname)
        if is_valid_file(file_path):
            with open(file_path, "rt", encoding="utf-8") as file_ob:
                # Name the parser explicitly so results don't depend on what's installed.
                soup = BeautifulSoup(file_ob.read().strip(), "html.parser")
            yield SoupFilePair(soup, fname)
A couple of things to note here. Because soup_line contains a yield statement, calling it returns a generator. Each pass through the loop builds one Beautiful Soup object and yields it, then the function pauses until the loop asks for the next one. This saves time and memory compared with parsing every file up front and holding all of the soups at once.
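To see that pausing behaviour in isolation, here is a minimal sketch that needs no HTML files. The lazy_items function and its log list are hypothetical stand-ins for soup_line, made up purely for illustration.

```python
def lazy_items(names, log):
    """Hypothetical stand-in for soup_line: yield one item at a time."""
    for name in names:
        log.append("building " + name)  # runs only when the item is requested
        yield name.upper()


log = []
items = lazy_items(["a.html", "b.html", "c.html"], log)
calls_before = len(log)  # creating the generator runs none of the body

first = next(items)  # runs the body up to the first yield, then pauses
calls_after_first = len(log)

rest = list(items)  # drains whatever is left
```

Nothing is logged until next is called, and only one item is built per request. A for loop does the same thing, calling next behind the scenes on every pass.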
I wanted a nice way to keep track of the file name that goes with each soup object, which called for some sort of tuple. SoupFilePair was the solution. It's an example of what's called a namedtuple. These objects behave just like regular tuples but can also double as a sort of "light-weight class". I like them because you don't have to remember what order the values were in when you returned them; you can get at each one by name instead. Running "type" on the object will tell you what's what. It's a nice bit of extra documentation whenever you need it.
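Here is a small sketch of what that looks like in practice. It assumes the type name passed to namedtuple matches the variable name, and it uses a plain string in place of a real Beautiful Soup object so it runs on its own.

```python
from collections import namedtuple

SoupFilePair = namedtuple("SoupFilePair", ["soup", "file_name"])

# A plain string stands in for a real BeautifulSoup object here.
pair = SoupFilePair(soup="<html></html>", file_name="index.html")

by_name = pair.file_name  # attribute access, no need to remember the order
by_index = pair[1]  # still works like a regular tuple
soup, file_name = pair  # and unpacks like one too

type_name = type(pair).__name__  # the name given to namedtuple
fields = pair._fields  # the field names, in order
```

Printing pair also shows the field names next to the values, which is handy when poking around in the interpreter.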
for soup, file_name in soup_line("sources"):
    print(file_name, soup("title"))