Recipe 19.12. Iterating on a Stream of Data Blocks as a Stream of Lines

Credit: Scott David Daniels, Peter Cogolo

Problem

You want to loop over all lines of a stream, but the stream arrives as a sequence of data blocks of arbitrary size (e.g., from a network socket).

Solution

We need to code a generator that gets blocks and yields lines:

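    def ilines(source_iterable, eol='\r\n', out_eol='\n'):
        tail = ''
        for block in source_iterable:
            # prepend the leftover from the previous block, then split on EOL
            pieces = (tail + block).split(eol)
            # the last piece is an incomplete line ('' if we ended on an EOL)
            tail = pieces.pop()
            for line in pieces:
                yield line + out_eol
        # emit a final line that has no trailing EOL, if any
        if tail:
            yield tail

    if __name__ == '__main__':
        s = 'one\r\ntwo\r,\nthree,four,five\r\n,six,\r\nseven\r\nlast'.split(',')
        for line in ilines(s):
            print(repr(line))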

When run as a main script, this code emits:

    'one\n'
    'two\n'
    'threefourfive\n'
    'six\n'
    'seven\n'
    'last'

Discussion

Many data sources produce their data in fits and starts: sockets, RSS feeds, the results of expanding compressed text, and (at its heart) most I/O. The data often doesn't arrive at convenient boundaries, but you nevertheless want to consume it in logical units. For text, the logical units are often lines.

This recipe shows generator ilines, a simple way to consume a source_iterable, which yields blocks of data, producing an iterator that yields lines of text instead. ilines is vastly simplified by assuming that lines are separated, on input, by a known end-of-line (EOL) string (by default '\r\n', which is the standard EOL marker in most Internet protocols). ilines' implementation is further simplified by taking a high-level approach, relying on the split method of Python's string types to do most of the work. This basically leaves ilines with the single task of "buffering" data between successive input blocks, on all occasions when a line starts in one block and ends in a following one (including those occasions in which block boundaries "split" an EOL marker).
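The buffering described above can be checked with a small sketch. It repeats ilines from the Solution so that it runs on its own, and cuts one sample text into blocks of every possible size. The helper fixed_size_blocks and the sample text are assumptions made for illustration, not part of the recipe. Whatever the block size, including sizes that cut an EOL marker in half, the lines come out the same:

```python
def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    # Same generator as in the Solution, repeated to keep the sketch self-contained.
    tail = ''
    for block in source_iterable:
        pieces = (tail + block).split(eol)
        tail = pieces.pop()
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail

def fixed_size_blocks(text, size):
    # Hypothetical helper: cut text into blocks of `size` characters,
    # ignoring line boundaries, much as a socket or decompressor might.
    for start in range(0, len(text), size):
        yield text[start:start + size]

# Illustrative sample data (an assumption, not taken from the recipe).
data = 'HELO example.org\r\nMAIL FROM:<a@example.org>\r\nQUIT\r\n'
expected = ['HELO example.org\n', 'MAIL FROM:<a@example.org>\n', 'QUIT\n']

# Map each block size to the lines that ilines produces for it.
lines_by_size = {}
for size in range(1, len(data) + 1):
    lines_by_size[size] = list(ilines(fixed_size_blocks(data, size)))

# Every block size yields the same list of lines.
all_equal = all(lines == expected for lines in lines_by_size.values())
print(all_equal)
```

Because the text ends with an EOL marker, the final tail is empty and nothing is yielded after the last complete line.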

ilines easily accomplishes its buffering task through its local variable tail, which starts empty and, at each leg of the loop, holds whatever followed the latest EOL marker seen so far. When tail+block ends with an EOL marker, the expression (tail+block).split(eol) produces a list whose last item is an empty string (''), exactly what we need; otherwise, the last item of the list is whatever followed the last EOL, which again is exactly what we need. (When tail+block contains no EOL marker at all, split returns a one-item list, so everything accumulated so far becomes the new tail and no line is yielded yet.)
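A short sketch of the three cases, using only the built-in split and pop methods on sample strings chosen for illustration:

```python
eol = '\r\n'

# Case 1: the text ends with an EOL marker, so the last item is ''.
pieces = 'one\r\ntwo\r\n'.split(eol)
tail_after_eol = pieces.pop()      # '' (nothing is pending)
complete_after_eol = pieces        # ['one', 'two']

# Case 2: the text ends in the middle of a line (here, even in the middle
# of an EOL marker), so the last item is the unfinished remainder.
pieces = 'one\r\ntwo\r'.split(eol)
tail_mid_line = pieces.pop()       # 'two\r' (kept for the next block)
complete_mid_line = pieces         # ['one']

# Case 3: no EOL marker at all, so the whole text is the remainder.
pieces = 'no marker yet'.split(eol)
tail_no_eol = pieces.pop()         # 'no marker yet'
complete_no_eol = pieces           # [] (no line to yield)
```

In every case, pop removes exactly the part that must wait for more data, and what remains in the list is the complete lines.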

Python's built-in file objects are even more powerful than ilines, since they support a universal newlines reading mode (mode 'U'), which is able to recognize and deal with all common EOL markers (even when different markers are mixed within the same stream!). However, ilines is more flexible, since you may apply it in many situations where you have a stream of arbitrary blocks of text and want to process it as a stream of lines, with a known EOL marker.
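To make the comparison concrete, here is a minimal sketch that assumes Python 3, where text-mode streams handle universal newlines by default (newline=None) rather than through mode 'U'. It wraps an in-memory byte buffer in io.TextIOWrapper to stand in for a file, then applies ilines (repeated from the Solution) to blocks that use a custom marker. The sample data and the '|' marker are illustrative assumptions:

```python
import io

def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    # Same generator as in the Solution, repeated to keep the sketch self-contained.
    tail = ''
    for block in source_iterable:
        pieces = (tail + block).split(eol)
        tail = pieces.pop()
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail

# Universal newlines: '\r\n', '\r', and '\n' are all recognized, even when
# mixed in one stream, and each is translated to '\n' on reading.
raw = io.BytesIO(b'one\r\ntwo\rthree\nlast')
text_stream = io.TextIOWrapper(raw, encoding='ascii', newline=None)
universal_lines = list(text_stream)

# ilines knows only one marker, but it accepts any iterable of text blocks
# and any marker string, not just file-like objects and newline characters.
record_lines = list(ilines(['alpha|be', 'ta|gam', 'ma'], eol='|'))

print(universal_lines)
print(record_lines)
```

The first list has each of the three different markers normalized to '\n'. The second shows ilines reassembling records whose '|' separators fall anywhere relative to the block boundaries.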
