Recipe 19.10. Reading a Text File by Paragraphs

Credit: Alex Martelli, Magnus Lie Hetland, Terry Reedy

Problem

You need to read a text file (or any other iterable whose items are lines of text) paragraph by paragraph, where a "paragraph" is defined as a sequence of nonwhite lines (i.e., paragraphs are separated by lines made up exclusively of whitespace).

Solution

A generator is quite suitable for bunching up lines this way:

def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)

if __name__ == '__main__':
    text = 'a first\nparagraph\n\nand a\nsecond one\n\n'
    for p in paragraphs(text.splitlines(True)):
        print repr(p)

Discussion

Python doesn't directly support paragraph-oriented file reading, but it's not hard to add such functionality. We define a "paragraph" as the string formed by joining a nonempty sequence of nonseparator lines, separated from any adjoining paragraphs by nonempty sequences of separator lines. A separator line is one that satisfies the predicate passed in as argument is_separator. (A predicate is a function whose result is taken as a logical truth value, and we say a predicate is satisfied when the predicate returns a result that is true.) By default, a line is a separator if it is made up entirely of whitespace characters (space, tab, newline, and so on).
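To illustrate how the choice of predicate shapes the result, here is a minimal sketch. It is written in Python 3 syntax (an assumption; the recipe's own listing uses Python 2 print statements), and the predicate name is_blank and the sample lines are hypothetical. It also shows one caveat of the default: str.isspace returns False for the empty string, so lines whose newlines have already been stripped need a different predicate.

```python
def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    # Same logic as the recipe's generator, repeated so the sketch runs alone.
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)

def is_blank(line):
    # Hypothetical predicate: true for empty strings as well as
    # whitespace-only ones, which suits lines stripped of their newlines.
    return not line.strip()

# Sample input (an assumption): lines that no longer carry a trailing newline.
stripped = ['a first', 'paragraph', '', 'and a', 'second one', '']

# With the default predicate, '' is not a separator because ''.isspace()
# is False, so all the lines end up in a single paragraph.
default_result = list(paragraphs(stripped, joiner=' '.join))

# With is_blank, the empty strings split the input as intended.
blank_result = list(paragraphs(stripped, is_blank, ' '.join))
print(blank_result)  # ['a first paragraph', 'and a second one']
```

Passing ' '.join as the joiner here is only a convenience for newline-free lines; the recipe's default ''.join is the natural choice when each line keeps its own line terminator.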

The recipe's code is quite straightforward. The state of the generator during iteration is entirely held in local variable paragraph, a list to which we append the nonseparator lines that make up the current paragraph. Whenever we meet a separator in the body of the for statement, we test if paragraph to check whether the list is currently empty. If the list is empty, we're already skipping a run of separators and need do nothing special to handle the current separator line. If the list is not empty, we've just met a separator line that terminates the current paragraph, so we must join up the list, yield the resulting paragraph string, and then set the list back to empty. After the loop ends, one last if paragraph test yields any paragraph that is still being accumulated, which is what happens when the input does not end with a separator line.
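The behavior described above can be checked with a small sketch in Python 3 syntax (an assumption; the recipe's listing targets Python 2). An io.StringIO object stands in for an open text file, and its content is hypothetical. The input starts with a run of separators, has several separators between paragraphs, and ends without one.

```python
import io

def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    # Same logic as the recipe's generator, repeated so the sketch runs alone.
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)

# io.StringIO stands in for an open text file: iterating over either one
# yields lines that keep their trailing newline.
sample = io.StringIO('\n\nfirst line\nsecond line\n \n\t\n\nlast paragraph')

result = list(paragraphs(sample))
print(result)  # ['first line\nsecond line\n', 'last paragraph']
```

The leading separators produce nothing, the run of three separators in the middle ends the first paragraph exactly once, and the test after the loop delivers the final paragraph even though no separator follows it.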

This recipe implements a special case of sequence adaptation by bunching: an underlying iterable is "bunched up" into another iterable with "bigger" items. Python's generators let you express sequence adaptation tasks very directly and linearly. By passing as arguments, with reasonable default values, the is_separator predicate, and the joiner callable that determines what happens to each "bigger item" when we're done bunching it up, we achieve a satisfactory amount of generality without any extra complexity. To see this, consider a snippet such as:

import operator
numbers = [1, 2, 3, 0, 0, 6, 5, 3, 0, 12]
bunch_up = paragraphs
for s in bunch_up(numbers, operator.not_, sum): print 'S', s
for l in bunch_up(numbers, bool, len): print 'L', l

In this snippet, we use the paragraphs generator (under the name of bunch_up, which is clearer in this context) to get the sums of "runs" of nonzero numbers separated by runs of zeros, then the lengths of the runs of zeros: applications that, at first sight, might appear to be quite different from the recipe's stated purpose. That's the magic of abstraction: when appropriately and tastefully applied, it can easily turn the solution of a problem into a family of solutions for many other apparently unrelated problems.
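For readers who want to see what the snippet produces, here is a sketch of the same idea in Python 3 syntax (an assumption; the original uses Python 2 print statements). The variable names sums and lengths are introduced only for illustration.

```python
import operator

def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    # Same logic as the recipe's generator, repeated so the sketch runs alone.
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)

numbers = [1, 2, 3, 0, 0, 6, 5, 3, 0, 12]
bunch_up = paragraphs

# operator.not_ is true for zeros, so zeros act as separators and each
# run of nonzero numbers is passed to sum.
sums = list(bunch_up(numbers, operator.not_, sum))
print('S', sums)     # S [6, 14, 12]

# bool is true for nonzero numbers, so the roles flip: nonzero numbers
# separate, and each run of zeros is passed to len.
lengths = list(bunch_up(numbers, bool, len))
print('L', lengths)  # L [2, 1]
```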

An elementary issue, but a crucial one for getting good performance in the "main" use case of this recipe, is that the paragraphs generator builds up each resulting paragraph as a list of strings, then concatenates all strings in the list with ''.join to obtain each result it yields. An alternative approach, where a large string is built up as a string, by repeated application of += or +, is never the right approach in Python: it is both slow and clumsy. Good Pythonic style absolutely demands that we use a list as the intermediate accumulator, whenever we are building a long string by concatenating a number of smaller ones. Python 2.4 has diminished the performance penalty of the wrong approach. For example, to join a list of 52 one-character strings into a 52-character string on my machine, Python 2.3 takes 14.2 microseconds with the right approach, 73.6 with the wrong one; but Python 2.4 takes 12.7 microseconds with the right approach, 41.6 with the wrong one, so the penalty in this case has decreased from over five times to over three. Nevertheless, there is no reason to choose to pay such a performance penalty without any returns, even the lower penalty that Python 2.4 manages to extract!
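A comparison along these lines can be sketched with the standard timeit module. This is not the measurement setup behind the figures quoted above, which the text does not show; the function names with_join and with_concat, the use of string.ascii_letters as the 52 one-character strings, and the repetition count are all assumptions. Timings depend heavily on the machine and the interpreter version, so no expected numbers are given.

```python
import string
import timeit

# 52 one-character strings (an assumption: the letters a-z and A-Z).
pieces = list(string.ascii_letters)

def with_join():
    # The list-plus-join approach the recipe recommends.
    return ''.join(pieces)

def with_concat():
    # Building the string up piece by piece with +=.
    result = ''
    for piece in pieces:
        result += piece
    return result

# Both approaches must produce the same string before timing them.
assert with_join() == with_concat()

repetitions = 20000
for func in (with_join, with_concat):
    seconds = timeit.timeit(func, number=repetitions)
    micros = seconds / repetitions * 1e6
    print('%-12s %.3f microseconds per call' % (func.__name__, micros))
```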

Python 2.4 offers a new itertools.groupby function that is quite suitable for sequence-bunching tasks. Using it, we could express the paragraphs generator in a really tight and concise way:

from itertools import groupby
def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    for separator_group, lineiter in groupby(lines, key=is_separator):
        if not separator_group:
            yield joiner(lineiter)

itertools.groupby, like SQL's GROUP BY clause, which inspired it, is not exactly trivial to use, but it can be quite useful indeed for sequence-bunching tasks once you have mastered it thoroughly.
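One way to get comfortable with groupby is to look at the groups it produces before they are filtered. The sketch below, in Python 3 syntax (an assumption), uses hypothetical sample text and materializes each group into a list purely for display; the names groups and result are illustrative.

```python
from itertools import groupby

# Sample text (an assumption), split into lines that keep their newlines.
lines = 'a first\nparagraph\n\n \nand a\nsecond one\n\n'.splitlines(True)

# groupby starts a new group each time the key's value changes, so
# consecutive separator lines and consecutive text lines alternate.
groups = [(key, list(group)) for key, group in groupby(lines, key=str.isspace)]
for key, group in groups:
    print(key, group)
# False ['a first\n', 'paragraph\n']
# True ['\n', ' \n']
# False ['and a\n', 'second one\n']
# True ['\n']

# Keeping only the groups whose key is false and joining each one gives
# the paragraphs.
result = [''.join(group) for key, group in groups if not key]
print(result)  # ['a first\nparagraph\n', 'and a\nsecond one\n']
```

Two points are worth keeping in mind: groupby only groups adjacent items, which is exactly what paragraph bunching needs, and each group is an iterator that shares the underlying input, so it should be consumed before moving on to the next group.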
