Recipe 19.12. Iterating on a Stream of Data Blocks as a Stream of Lines

Credit: Scott David Daniels, Peter Cogolo

Problem

You want to loop over all lines of a stream, but the stream arrives as a sequence of data blocks of arbitrary size (e.g., from a network socket).

Solution

We need to code a generator that gets blocks and yields lines:

    def ilines(source_iterable, eol='\r\n', out_eol='\n'):
        tail = ''
        for block in source_iterable:
            # prepend the leftover from earlier blocks, then split on EOL
            pieces = (tail+block).split(eol)
            # the last piece is an incomplete line (or ''): keep it for later
            tail = pieces.pop()
            for line in pieces:
                yield line + out_eol
        # emit a final line that was not terminated by an EOL marker
        if tail:
            yield tail

    if __name__ == '__main__':
        s = 'one\r\ntwo\r,\nthree,four,five\r\n,six,\r\nseven\r\nlast'.split(',')
        for line in ilines(s):
            print(repr(line))

When run as a main script, this code emits:

    'one\n'
    'two\n'
    'threefourfive\n'
    'six\n'
    'seven\n'
    'last'

Discussion

Many data sources produce their data in fits and starts: sockets, RSS feeds, the results of expanding compressed text, and (at its heart) most I/O. The data often doesn't arrive at convenient boundaries, but you nevertheless want to consume it in logical units. For text, the logical units are often lines. This recipe shows generator ilines, a simple way to consume a source_iterable that yields blocks of data, producing an iterator that yields lines of text instead.

ilines is vastly simplified by assuming that lines are separated, on input, by a known end-of-line (EOL) string, by default '\r\n', which is the standard EOL marker in most Internet protocols. The implementation of ilines is further simplified by taking a high-level approach, relying on the split method of Python's string types to do most of the work. This leaves ilines with the single task of "buffering" data between successive input blocks, whenever a line starts in one block and ends in a following one (including the cases in which a block boundary "splits" an EOL marker).

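To make the buffering concrete, here is a small sketch that traces what is carried over from one block to the next when a block boundary falls inside an EOL marker. The helper trace_tail is hypothetical and not part of the recipe; it simply repeats the split-and-pop step of ilines and records the intermediate values. ilines itself is repeated so that the sketch runs on its own.

```python
def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    # Same generator as in the Solution, repeated so this sketch is standalone.
    tail = ''
    for block in source_iterable:
        pieces = (tail + block).split(eol)
        tail = pieces.pop()
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail

def trace_tail(blocks, eol='\r\n'):
    # Hypothetical helper (an assumption for illustration only): for each
    # block, record the complete lines found and the leftover carried forward.
    tail = ''
    steps = []
    for block in blocks:
        pieces = (tail + block).split(eol)
        tail = pieces.pop()
        steps.append((block, pieces, tail))
    return steps

# The first block ends with '\r' and the second starts with '\n', so the
# EOL marker straddles the block boundary.
blocks = ['two\r', '\nthree', 'four\r\n']
steps = trace_tail(blocks)
for block, complete, tail in steps:
    print('%r -> complete lines %r, carried over %r' % (block, complete, tail))

lines = list(ilines(blocks))
print(lines)
```

After the first block, no line is complete and the whole of 'two\r' is carried over. Only when the second block supplies the '\n' does the split find the marker. After the last block, the leftover is the empty string, so ilines yields no unterminated final line.
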
ilines easily accomplishes its buffering task through its local variable tail, which starts empty and, at each leg of the loop, holds whatever followed the latest EOL marker seen so far. When tail+block ends with an EOL marker, the expression (tail+block).split(eol) produces a list whose last item is an empty string (''), exactly what we need; otherwise, the last item of the list is whatever followed the last EOL, which again is exactly what we need.

Python's built-in file objects are even more powerful than ilines, since they support a universal newlines reading mode (mode 'U'), which is able to recognize and deal with all common EOL markers (even when different markers are mixed within the same stream!). However, ilines is more flexible, since you may apply it in many situations where you have a stream of arbitrary blocks of text and want to process it as a stream of lines, with a known EOL marker.

See Also

Library Reference and Python in a Nutshell docs about built-in file objects; Chapter 2 for general issues about handling files.