Recipe 2.6. Processing Every Word in a File

Credit: Luther Blissett

Problem

You need to do something with each and every word in a file.

Solution

This task is best handled by two nested loops, one on lines and another on the words in each line:

for line in open(thefilepath):
    for word in line.split():
        dosomethingwith(word)

The nested for statement's header, through line.split(), implicitly defines words as maximal sequences of non-whitespace characters separated by runs of whitespace (just as the Unix program wc does). For other definitions of words, you can use regular expressions. For example:

import re
re_word = re.compile(r"[\w'-]+")
for line in open(thefilepath):
    for word in re_word.finditer(line):
        dosomethingwith(word.group(0))

In this case, a word is defined as a maximal sequence of alphanumerics, hyphens, and apostrophes (plus underscores, which \w also matches).
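To make the difference between the two definitions concrete, here is a small sketch that applies both to the same line of text. The sample string and the names sample, by_whitespace, and by_regex are illustrative assumptions, not part of the recipe.

```python
import re

# Hypothetical sample line, chosen to include punctuation, a hyphen, and apostrophes.
sample = "Don't panic: it's a well-known trick!"

# Definition 1: words are runs of non-whitespace characters.
by_whitespace = sample.split()

# Definition 2: words are maximal runs of \w characters, hyphens, and apostrophes.
re_word = re.compile(r"[\w'-]+")
by_regex = [mo.group(0) for mo in re_word.finditer(sample)]

print(by_whitespace)
print(by_regex)
```

Whitespace splitting leaves trailing punctuation attached to the word (panic:, trick!), while the regular expression drops it but keeps internal hyphens and apostrophes.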

Discussion

If you want to use other definitions of words, you will obviously need different regular expressions. The outer loop, on all lines in the file, won't change.

It's often a good idea to wrap iterations as iterator objects, and this kind of wrapping is most commonly and conveniently obtained by coding simple generators:

def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()

for word in words_of_file(thefilepath):
    dosomethingwith(word)

This approach lets you separate, cleanly and effectively, two different concerns: how to iterate over all items (in this case, words in a file) and what to do with each item in the iteration. Once you have cleanly encapsulated iteration concerns in an iterator object (often, as here, a generator), most of your uses of iteration become simple for statements. You can often reuse the iterator in many spots in your program, and if maintenance is ever needed, you can perform that maintenance in just one place (the definition of the iterator) rather than having to hunt for all uses. The advantages are thus very similar to those you obtain in any programming language by appropriately defining and using functions, rather than copying and pasting pieces of code all over the place. With Python's iterators, you can get these reuse advantages for all of your looping-control structures, too.
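As a rough illustration of that reuse, the sketch below feeds the same generator to three unrelated consumers. It restates words_of_file with a with block (an assumption of this sketch, not the recipe's own form) and writes a small temporary file so that it runs on its own; the file contents and the names counts, longest, and total are hypothetical.

```python
import os
import tempfile
from collections import Counter

def words_of_file(thefilepath, line_to_words=str.split):
    # Restatement of the recipe's generator; the with block (an assumption
    # of this sketch) closes the file once the generator finishes or is closed.
    with open(thefilepath) as the_file:
        for line in the_file:
            for word in line_to_words(line):
                yield word

# Hypothetical sample file, created only so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    sample_path = tmp.name

# Use 1: tally how often each word occurs.
counts = Counter(words_of_file(sample_path))

# Use 2: find the longest word (the first one wins on ties).
longest = max(words_of_file(sample_path), key=len)

# Use 3: count the words without building a list.
total = sum(1 for _ in words_of_file(sample_path))

os.remove(sample_path)
print(counts["the"], longest, total)
```

None of the three consumers knows how lines are read or split; if the definition of a word changes, only words_of_file (or its line_to_words argument) needs to change.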

We've taken the opportunity afforded by the refactoring of the loop into a generator to perform two minor enhancements: ensuring the file is explicitly closed, which is always a good idea, and generalizing the way each line is split into words (defaulting to the split method of string objects, but leaving a door open to more generality). For example, when we need words as defined by a regular expression, we can code another wrapper on top of words_of_file thanks to this "hook":

import re
def words_by_re(thefilepath, repattern=r"[\w'-]+"):
    wre = re.compile(repattern)
    def line_to_words(line):
        for mo in wre.finditer(line):
            yield mo.group(0)  # yield each match, so line_to_words is itself a generator
    return words_of_file(thefilepath, line_to_words)

Here, too, we supply a reasonable default for the regular expression pattern defining a word but still make it easy to pass a different value in those cases in which different definitions are necessary. Excessive generalization is a pernicious temptation, but a little tasteful generalization suggested by experience will most often amply repay the modest effort it requires. Having a function accept an optional argument, while providing the most likely value for the argument as the default value, is among the simplest and handiest ways to implement this modest and often worthwhile kind of generalization.
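A minimal sketch of how the default and an overriding value might be used side by side follows. It repeats both functions so that it runs standalone, with words_of_file written using a with block as an assumption of this sketch; the sample text, the \d+ pattern, and the names default_words and numbers are hypothetical.

```python
import os
import re
import tempfile

def words_of_file(thefilepath, line_to_words=str.split):
    # Generator over all words of a file; how a line is split is pluggable.
    with open(thefilepath) as the_file:
        for line in the_file:
            for word in line_to_words(line):
                yield word

def words_by_re(thefilepath, repattern=r"[\w'-]+"):
    # Wrapper that defines words by a regular expression pattern.
    wre = re.compile(repattern)
    def line_to_words(line):
        for mo in wre.finditer(line):
            yield mo.group(0)
    return words_of_file(thefilepath, line_to_words)

# Hypothetical sample file, created only so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Order 66: ship 12 well-made crates by 2024\n")
    sample_path = tmp.name

# Most callers rely on the default pattern.
default_words = list(words_by_re(sample_path))

# A caller with a different notion of "word" passes its own pattern,
# here one that treats only runs of digits as words.
numbers = list(words_by_re(sample_path, r"\d+"))

os.remove(sample_path)
print(default_words)
print(numbers)
```

The common call stays short, and the unusual one costs a single extra argument rather than a second copy of the function.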
