2.1 Some Common Tasks

2.1.1 Problem: Quickly sorting lines on custom criteria

Sorting is one of the real meat-and-potatoes algorithms of text processing and, in fact, of most programming. Fortunately for Python developers, the native [].sort() method is extraordinarily fast. Moreover, Python lists with almost any heterogeneous objects as elements can be sorted; Python cannot rely on the uniform arrays of a language like C. (An unfortunate exception to this general power was introduced in recent Python versions: comparisons of complex numbers raise a TypeError, so [1+1j, 2+2j].sort() dies for that reason; Unicode strings in lists can cause similar problems.)

SEE ALSO: complex;

The list sort method is wonderful when you want to sort items in their "natural" order or in the order that Python considers natural, in the case of items of varying types. Unfortunately, a lot of times, you want to sort things in "unnatural" orders. For lines of text, in particular, any order that is not simple alphabetization of the lines is "unnatural." But often text lines contain meaningful bits of information in positions other than the first character position: A last name may occur as the second word of a list of people (for example, with first name as the first word); an IP address may occur several fields into a server log file; a money total may occur at position 70 of each line; and so on. What if you want to sort lines based on this style of meaningful order that Python doesn't quite understand?

The list sort method [].sort() supports an optional custom comparison function argument. The job of this function is to return -1 if the first thing should come first, return 0 if the two things are equal order-wise, and return 1 if the first thing should come second. The built-in function cmp() does this in a manner identical to the default [].sort() (except in terms of speed; lst.sort() is much faster than lst.sort(cmp)). For short lists and quick solutions, a custom comparison function is probably the best thing. In a lot of cases, you can even get by with an in-line lambda function as the custom comparison function, which is a pleasant and handy idiom.
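For instance, to sort lines on their second word, an in-line lambda passed to sort() suffices (a minimal sketch; any field could be used):

 lines = ['b z', 'a y', 'c x']
 # Compare pairs of lines by their second whitespace-separated word
 lines.sort(lambda ln1, ln2: cmp(ln1.split()[1], ln2.split()[1]))
 # lines is now ['c x', 'a y', 'b z']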

When it comes to speed, however, use of custom comparison functions is fairly awful. Part of the problem is Python's function call overhead, but a lot of other factors contribute to the slowness. Fortunately, a technique called "Schwartzian Transforms" can make for much faster custom sorts. Schwartzian Transforms are named after Randal Schwartz, who proposed the technique for working with Perl; but the technique is equally applicable to Python.

The pattern involved in the Schwartzian Transform technique consists of three steps (these can more precisely be called the Guttman-Rosler Transform, which is based on the Schwartzian Transform):

  1. Transform the list in a reversible way into one that sorts "naturally."

  2. Call Python's native [].sort() method.

  3. Reverse the transformation in (1) to restore the original list items (in new sorted order).

The reason this technique works is that, for a list of size N, it only requires O(2N) transformation operations, which is easy to amortize over the necessary O(N log N) compare/flip operations for large lists. The sort dominates computational time, so anything that makes the sort more efficient is a win in the limit case (this limit is reached quickly).
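In schematic form, the whole pattern is only a few lines. For instance, sorting strings by their length (a minimal sketch of the decorate-sort-undecorate steps):

 words = ['transform', 'sort', 'undecorate']
 pairs = [(len(w), w) for w in words]   # 1. decorate with the sort key
 pairs.sort()                           # 2. fast native sort
 words = [w for (key, w) in pairs]      # 3. restore the original items
 # words is now ['sort', 'transform', 'undecorate']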

Below is an example of a simple, but plausible, custom sorting algorithm. The sort is on the fourth and subsequent words of a list of input lines. Lines that are shorter than four words sort to the bottom. Running the test against a file with about 20,000 lines (about 1 megabyte) performed the Schwartzian Transform sort in less than 2 seconds, while the custom comparison function sort took over 12 seconds (outputs were verified as identical). Any number of factors will change the exact relative timings, but a better than six-times gain can generally be expected.

schwartzian_sort.py
# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
#   lexicographically on the 4th, 5th, etc. words.
#   Any line with fewer than four words will be sorted to
#   the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write

# naive custom sort
def fourth_word(ln1, ln2):
    lst1 = string.split(ln1)
    lst2 = string.split(ln2)
    #-- Compare "long" lines
    if len(lst1) >= 4 and len(lst2) >= 4:
        return cmp(lst1[3:], lst2[3:])
    #-- Long lines before short lines
    elif len(lst1) >= 4 and len(lst2) < 4:
        return -1
    #-- Short lines after long lines
    elif len(lst1) < 4 and len(lst2) >= 4:
        return 1
    else:                   # Natural order
        return cmp(ln1, ln2)

# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()

# Time the custom comparison sort
start = time.time()
lines.sort(fourth_word)
end = time.time()
wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
# open('tmp.custom','w').writelines(lines)

# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()

# Time the Schwartzian sort
start = time.time()
for n in range(len(lines)):       # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4:             # Tuple w/ sort info first
        lines[n] = (lst[3:], lines[n])
    else:                         # Short lines to end
        lines[n] = (['\377'], lines[n])
lines.sort()                      # Native sort
for n in range(len(lines)):       # Restore original lines
    lines[n] = lines[n][1]
end = time.time()
wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
# open('tmp.schwartzian','w').writelines(lines)

Only one particular example is presented, but readers should be able to generalize this technique to any sort they need to perform frequently or on large files.

2.1.2 Problem: Reformatting paragraphs of text

While I mourn the decline of plaintext ASCII as a communication format, and its eclipse by unnecessarily complicated and large (and often proprietary) formats, there is still plenty of life left in text files full of prose. READMEs, HOWTOs, email, Usenet posts, and this book itself are written in plaintext (or at least something close enough to plaintext that generic processing techniques are valuable). Moreover, many formats like HTML and LaTeX are frequently enough hand-edited that their plaintext appearance is important.

One task that is extremely common when working with prose text files is reformatting paragraphs to conform to desired margins. Python 2.3 adds the module textwrap, which performs more limited reformatting than the code below. Most of the time, this task gets done within text editors, which are indeed quite capable of performing the task. However, sometimes it would be nice to automate the formatting process. The task is simple enough that it is slightly surprising that Python has no standard module function to do this. There is the class formatter.DumbWriter, or the possibility of inheriting from and customizing formatter.AbstractWriter. These classes are discussed in Chapter 5; but frankly, the amount of customization and sophistication needed to use these classes and their many methods is way out of proportion for the task at hand.
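For the simplest cases under Python 2.3+, the textwrap module just mentioned may be enough by itself (a quick illustration of its default filling):

 >>> import textwrap
 >>> para = "We want this paragraph rewrapped to a narrow width."
 >>> print textwrap.fill(para, width=30)
 We want this paragraph
 rewrapped to a narrow width.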

Below is a simple solution that can be used either as a command-line tool (reading from STDIN and writing to STDOUT) or by import to a larger application.

reformat_para.py
# Simple paragraph reformatter.  Allows specification
# of left and right margins, and of justification style
# (using constants defined in module).

LEFT, RIGHT, CENTER = 'LEFT', 'RIGHT', 'CENTER'

def reformat_para(para='', left=0, right=72, just=LEFT):
    words = para.split()
    lines = []
    line  = ''
    word = 0
    end_words = 0
    while not end_words:
        if len(words[word]) > right-left: # Handle very long words
            line = words[word]
            word += 1
            if word >= len(words):
                end_words = 1
        else:                             # Compose line of words
            while len(line)+len(words[word]) <= right-left:
                line += words[word]+' '
                word += 1
                if word >= len(words):
                    end_words = 1
                    break
        lines.append(line)
        line = ''
    if just == CENTER:
        r, l = right, left
        return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
    elif just == RIGHT:
        return '\n'.join([line.rjust(right) for line in lines])
    else: # left justify
        return '\n'.join([' '*left+line for line in lines])

if __name__ == '__main__':
    import sys
    if len(sys.argv) <> 4:
        print "Please specify left_margin, right_marg, justification"
    else:
        left  = int(sys.argv[1])
        right = int(sys.argv[2])
        just  = sys.argv[3].upper()
        # Simplistic approach to finding initial paragraphs
        for p in sys.stdin.read().split('\n\n'):
            print reformat_para(p, left, right, just), '\n'

A number of enhancements are left to readers, if needed. You might want to allow hanging indents or indented first lines, for example. Or paragraphs meeting certain criteria might not be appropriate for wrapping (e.g., headers). A custom application might also determine the input paragraphs differently, either by a different parsing of an input file, or by generating paragraphs internally in some manner.
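For the hanging-indent enhancement, under Python 2.3+ the textwrap module shows one way to do it (a sketch alongside, not a change to, reformat_para()):

 >>> import textwrap
 >>> para = "We want this paragraph rewrapped to a narrow width."
 >>> print textwrap.fill(para, width=30, subsequent_indent='    ')
 We want this paragraph
     rewrapped to a narrow
     width.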

2.1.3 Problem: Column statistics for delimited or flat-record files

Data feeds, DBMS dumps, log files, and flat-file databases all tend to contain ontologically similar records: one per line, with a collection of fields in each record. Usually such fields are separated either by a specified delimiter or by the specific column positions where fields are to occur.

Parsing these structured text records is quite easy, and performing computations on fields is equally straightforward. But in working with a variety of such "structured text databases," it is easy to keep writing almost the same code over again for each variation in format and computation.

The example below provides a generic framework for performing this style of computation on a structured text database.

fields_stats.py
# Perform calculations on one or more of the
# fields in a structured text database.
import operator
from types import *
from xreadlines import xreadlines  # req 2.1, but is much faster...
                                   # could use .readline() meth < 2.1
#-- Symbolic Constants
DELIMITED = 1
FLATFILE = 2

#-- Some sample "statistical" funcs (in functional programming style)
nillFunc = lambda lst: None
toFloat  = lambda lst: map(float, lst)
avg_lst  = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
sum_lst  = lambda lst: reduce(operator.add, toFloat(lst))
max_lst  = lambda lst: reduce(max, toFloat(lst))

class FieldStats:
    """Gather statistics about structured text database fields

    text_db may be either string (incl. Unicode) or file-like object
    style may be in (DELIMITED, FLATFILE)
    delimiter specifies the field separator in DELIMITED style text_db
    column_positions lists all field positions for FLATFILE style,
                     using one-based indexing (first column is 1).
              E.g.:  (1, 7, 40) would take fields one, two, three
                     from columns 1, 7, 40 respectively.
    field_funcs is a dictionary with column positions as keys,
                and functions on lists as values.
         E.g.:  {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
                average of column one, the sum of column 4, and the
                max of column 5.  All other cols (incl 2, 3, >=6)
                are ignored.
    """
    def __init__(self,
                 text_db='',
                 style=DELIMITED,
                 delimiter=',',
                 column_positions=(1,),
                 field_funcs={}):
        self.text_db = text_db
        self.style = style
        self.delimiter = delimiter
        self.column_positions = column_positions
        self.field_funcs = field_funcs

    def calc(self):
        """Calculate the column statistics"""
        #-- 1st, create a list of lists for data (incl. unused flds)
        used_cols = self.field_funcs.keys()
        used_cols.sort()
        # one-based column naming: column[0] is always unused
        columns = []
        for n in range(1+used_cols[-1]):
            # hint: '[[]]*num' creates refs to same list
            columns.append([])
        #-- 2nd, fill lists used for calculated fields
        # might use a string directly for text_db
        if type(self.text_db) in (StringType, UnicodeType):
            for line in self.text_db.split('\n'):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]   # zero-based index
                    columns[col].append(field)
        else:   # Something file-like for text_db
            for line in xreadlines(self.text_db):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]   # zero-based index
                    columns[col].append(field)
        #-- 3rd, apply the field funcs to column lists
        results = [None] * (1+used_cols[-1])
        for col in used_cols:
            results[col] = \
                apply(self.field_funcs[col], (columns[col],))
        #-- Finally, return the result list
        return results

    def splitter(self, line):
        """Split a line into fields according to curr inst specs"""
        if self.style == DELIMITED:
            return line.split(self.delimiter)
        elif self.style == FLATFILE:
            fields = []
            # Adjust offsets to Python zero-based indexing,
            # and also add final position after the line
            num_positions = len(self.column_positions)
            offsets = [(pos-1) for pos in self.column_positions]
            offsets.append(len(line))
            for pos in range(num_positions):
                start = offsets[pos]
                end = offsets[pos+1]
                fields.append(line[start:end])
            return fields
        else:
            raise ValueError, \
                  "Text database must be DELIMITED or FLATFILE"

#-- Test data
# First Name, Last Name, Salary, Years Seniority, Department
delim = '''
Kevin,Smith,50000,5,Media Relations
Tom,Woo,30000,7,Accounting
Sally,Jones,62000,10,Management
'''.strip()     # no leading/trailing newlines

# Comment     First     Last      Salary    Years  Dept
flat = '''
tech note     Kevin     Smith     50000     5      Media Relations
more filler   Tom       Woo       30000     7      Accounting
yet more...   Sally     Jones     62000     10     Management
'''.strip()     # no leading/trailing newlines

#-- Run self-test code
if __name__ == '__main__':
    getdelim = FieldStats(delim, field_funcs={3:avg_lst, 4:max_lst})
    print 'Delimited Calculations:'
    results = getdelim.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]
    getflat = FieldStats(flat, field_funcs={3:avg_lst, 4:max_lst},
                               style=FLATFILE,
                               column_positions=(15, 25, 35, 45, 52))
    print 'Flat Calculations:'
    results = getflat.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]

The example above includes some efficiency considerations that make it a good model for working with large data sets. In the first place, class FieldStats can (optionally) deal with a file-like object, rather than keeping the whole structured text database in memory. The generator xreadlines.xreadlines() is an extremely fast and efficient file reader, but it requires Python 2.1+; for earlier versions, use FILE.readline() or FILE.readlines() (for memory or speed efficiency, respectively). Moreover, only the data that is actually of interest is collected into lists, in order to save memory. However, rather than requiring multiple passes to collect statistics on multiple fields, as many field columns and summary functions as wanted can be used in one pass.
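For instance, to run the same statistics against a large file on disk rather than an in-memory string, pass an open file object (a usage sketch; 'data.txt' is a hypothetical file with the same comma-delimited layout as the test data above):

 # Hypothetical file with the same layout as 'delim' above
 stats = FieldStats(open('data.txt'),
                    field_funcs={3:avg_lst, 4:max_lst})
 results = stats.calc()
 print 'Average salary -', results[3]
 print 'Max years worked -', results[4]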

One possible improvement would be to allow multiple summary functions against the same field during a pass. But that is left as an exercise to the reader, if she desires to do it.

2.1.4 Problem: Counting characters, words, lines, and paragraphs

There is a wonderful utility under Unix-like systems called wc. What it does is so basic, and so obvious, that it is hard to imagine working without it. wc simply counts the characters, words, and lines of files (or STDIN). A few command-line options control which results are displayed, but I rarely use them.

In writing this chapter, I found myself on a system without wc and felt a remedy was in order. The example below is actually an "enhanced" wc, since it also counts paragraphs (though it lacks the command-line switches). Unlike the external wc, the technique is easy to use directly within Python, and it is available anywhere Python is. The main trick, inasmuch as there is one, is a compact use of the "".join() and "".split() methods (string.join() and string.split() could also be used, for example, to be compatible with Python 1.5.2 or below).

wc.py
# Report the chars, words, lines, paragraphs
# on STDIN or in wildcard filename patterns
import sys, glob
if len(sys.argv) > 1:
    c, w, l, p = 0, 0, 0, 0
    for pat in sys.argv[1:]:
        for file in glob.glob(pat):
            s = open(file).read()
            wc = len(s), len(s.split()), \
                 len(s.split('\n')), len(s.split('\n\n'))
            print '\t'.join(map(str, wc)), '\t'+file
            c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
    wc = (c, w, l, p)
    print '\t'.join(map(str, wc)), '\tTOTAL'
else:
    s = sys.stdin.read()
    wc = len(s), len(s.split()), len(s.split('\n')), \
         len(s.split('\n\n'))
    print '\t'.join(map(str, wc)), '\tSTDIN'

This little functionality could be wrapped up in a function, but it is almost too compact to bother with doing so. Most of the work is in the interaction with the shell environment, with the counting basically taking only two lines.

The solution above is quite likely the "one obvious way to do it," and therefore Pythonic. On the other hand, a slightly more adventurous reader might consider this assignment (if only for fun):

 >>> wc = map(len, [s]+map(s.split, (None,'\n','\n\n')))

A real daredevil might be able to reduce the entire program to a single print statement.
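One step in that direction combines the counting and the report into a single print (a sketch, assuming s has already been read from STDIN as above):

 print '\t'.join(map(str, map(len, [s]+map(s.split, (None,'\n','\n\n')))))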

2.1.5 Problem: Transmitting binary data as ASCII

Many channels require that the information traveling over them be 7-bit ASCII. Any byte with its high-order bit set will be handled unpredictably when transmitting data over protocols like Simple Mail Transport Protocol (SMTP), Network News Transport Protocol (NNTP), or HTTP (depending on content encoding), or even just when displaying it in many standard tools like editors. In order to encode 8-bit binary data as ASCII, a number of techniques have been invented over time.

An obvious, but obese, encoding technique is to translate each binary byte into its two hexadecimal digits. UUencoding is an older standard that developed around the need to transmit binary files over Usenet and on BBSs. Binhex is a similar technique from the MacOS world. In recent years, base64, which is specified by RFC 1521, has edged out the other styles of encoding. All of these techniques are basically 4/3 encodings; that is, four ASCII bytes are used to represent three binary bytes. They differ somewhat in line-ending and header conventions (as well as in the encoding as such). Quoted printable is yet another format, but one of variable encoding length. In quoted printable encoding, most plain ASCII bytes are left unchanged, but a few special characters and all high-bit bytes are escaped.
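To see just how obese the hexadecimal style is, the standard binascii module can demonstrate (each binary byte becomes two ASCII bytes):

 >>> import binascii
 >>> binascii.hexlify('Page\n')    # 5 binary bytes -> 10 ASCII bytes
 '506167650a'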

Python provides modules for all the encoding styles mentioned. The high-level wrappers uu, binhex, base64, and quopri all operate on input and output file-like objects, encoding the data therein. They also each have slightly different method names and arguments. binhex, for example, closes its output file after encoding, which makes it unusable in conjunction with a cStringIO file-like object. All of the high-level encoders utilize the services of the low-level C module binascii. binascii, in turn, implements the actual low-level block conversions, but assumes that it will be passed blocks of the right size for a given encoding.
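For instance, the base64 wrapper reads and writes file-like objects (a small sketch, with StringIO objects standing in for real files):

 >>> import base64, StringIO
 >>> infile = StringIO.StringIO('binary \x00\xff data')
 >>> outfile = StringIO.StringIO()
 >>> base64.encode(infile, outfile)
 >>> outfile.getvalue()
 'YmluYXJ5IAD/IGRhdGE=\n'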

The standard library, therefore, does not contain quite the right intermediate-level functionality for when the goal is just encoding the binary data in arbitrary strings. It is easy to wrap that up, though:

encode_binary.py
# Provide encoders for arbitrary binary data
# in Python strings.  Handles block size issues
# transparently, and returns a string.
# Precompression of the input string can reduce
# or eliminate any size penalty for encoding.
import sys
import zlib
import binascii

UU = 45
BASE64 = 57
BINHEX = sys.maxint

def ASCIIencode(s='', type=BASE64, compress=1):
    """ASCII encode a binary string"""
    # First, decide the encoding style
    if type == BASE64:   encode = binascii.b2a_base64
    elif type == UU:     encode = binascii.b2a_uu
    elif type == BINHEX: encode = binascii.b2a_hqx
    else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
    # Second, compress the source if specified
    if compress: s = zlib.compress(s)
    # Third, encode the string, block-by-block
    offset = 0
    blocks = []
    while 1:
        blocks.append(encode(s[offset:offset+type]))
        offset += type
        if offset > len(s):
            break
    # Fourth, return the concatenated blocks
    return ''.join(blocks)

def ASCIIdecode(s='', type=BASE64, compress=1):
    """Decode ASCII to a binary string"""
    # First, decide the encoding style
    if type == BASE64:   s = binascii.a2b_base64(s)
    elif type == BINHEX: s = binascii.a2b_hqx(s)
    elif type == UU:
        s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
    # Second, decompress the source if specified
    if compress: s = zlib.decompress(s)
    # Third, return the decoded binary string
    return s

# Encode/decode STDIN for self-test
if __name__ == '__main__':
    decode, TYPE = 0, BASE64
    for arg in sys.argv:
        if   arg.lower() == '-d': decode = 1
        elif arg.upper() == 'UU': TYPE = UU
        elif arg.upper() == 'BINHEX': TYPE = BINHEX
        elif arg.upper() == 'BASE64': TYPE = BASE64
    if decode:
        print ASCIIdecode(sys.stdin.read(), type=TYPE)
    else:
        print ASCIIencode(sys.stdin.read(), type=TYPE)

The example above does not attach any headers or delimit the encoded block (by design); for that, a wrapper like uu, mimify, or MimeWriter is a better choice, as is a custom wrapper around encode_binary.py.
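For instance, the uu module adds its framing automatically (a sketch, again with StringIO standing in for real files; the filename argument is arbitrary):

 import uu, StringIO
 out = StringIO.StringIO()
 uu.encode(StringIO.StringIO('some binary data'), out, 'myfile.bin')
 # out.getvalue() now holds a 'begin 666 myfile.bin' header line,
 # the uuencoded body, and a closing 'end' line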

2.1.6 Problem: Creating word or letter histograms

A histogram is an analysis of the relative occurrence frequency of each of a number of possible values. In terms of text processing, the occurrences in question are almost always either words or byte values. Creating histograms is quite simple using Python dictionaries, but the technique is not always immediately obvious to people thinking about it. The example below is quite general, provides several utility functions associated with histograms, and can be used in a command-line operation mode.
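The core idiom is nothing more than a dictionary of counters, for example:

 >>> hist = {}
 >>> for word in 'the cat saw the dog'.split():
 ...     hist[word] = hist.get(word, 0) + 1
 ...
 >>> hist['the'], hist['cat']
 (2, 1)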

histogram.py
# Create occurrence counts of words or characters
# A few utility functions for presenting results
# Avoids requirement of recent Python features
from string import split, maketrans, translate, punctuation, digits
import sys
from types import *

def word_histogram(source):
    """Create histogram of normalized words (no punct or digits)"""
    hist = {}
    trans = maketrans('', '')
    if type(source) in (StringType, UnicodeType):  # String-like src
        for word in split(source):
            word = translate(word, trans, punctuation+digits)
            if len(word) > 0:
                hist[word] = hist.get(word, 0) + 1
    elif hasattr(source, 'read'):                  # File-like src
        try:
            from xreadlines import xreadlines      # Check for module
            for line in xreadlines(source):
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word, 0) + 1
        except ImportError:                        # Older Python ver
            line = source.readline()       # Slow but mem-friendly
            while line:
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word, 0) + 1
                line = source.readline()
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def char_histogram(source, sizehint=1024*1024):
    hist = {}
    if type(source) in (StringType, UnicodeType):  # String-like src
        for char in source:
            hist[char] = hist.get(char, 0) + 1
    elif hasattr(source, 'read'):                  # File-like src
        chunk = source.read(sizehint)
        while chunk:
            for char in chunk:
                hist[char] = hist.get(char, 0) + 1
            chunk = source.read(sizehint)
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def most_common(hist, num=1):
    pairs = []
    for pair in hist.items():
        pairs.append((pair[1], pair[0]))
    pairs.sort()
    pairs.reverse()
    return pairs[:num]

def first_things(hist, num=1):
    pairs = []
    things = hist.keys()
    things.sort()
    for thing in things:
        pairs.append((thing, hist[thing]))
    pairs.sort()
    return pairs[:num]

if __name__ == '__main__':
    if len(sys.argv) > 1:
        hist = word_histogram(open(sys.argv[1]))
    else:
        hist = word_histogram(sys.stdin)
    print "Ten most common words:"
    for pair in most_common(hist, 10):
        print '\t', pair[1], pair[0]
    print "First ten words alphabetically:"
    for pair in first_things(hist, 10):
        print '\t', pair[0], pair[1]
    # a more practical command-line version might use:
    # for pair in most_common(hist, len(hist)):
    #     print pair[1], '\t', pair[0]

Several of the design choices are somewhat arbitrary. Words have all their punctuation stripped to identify "real" words. But on the other hand, words are still case-sensitive, which may not be what is desired. The sorting functions first_things() and most_common() only return an initial sublist. Perhaps it would be better to return the whole list, and let the user slice the result. It is simple to customize around these sorts of issues, though.
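For instance, if case-insensitive counting is wanted, folding case just before the tally is a one-line change inside word_histogram() (a sketch of the modification):

 # ...after punctuation and digits are stripped from the word:
 word = word.lower()   # so that 'The' and 'the' count together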

2.1.7 Problem: Reading a file backwards by record, line, or paragraph

Reading a file line by line is a common task in Python, or in most any language. Files like server logs, configuration files, structured text databases, and others frequently arrange information into logical records, one per line. Very often, the job of a program is to perform some calculation on each record in turn.

Python provides a number of convenient methods on file-like objects for such line-by-line reading. FILE.readlines() reads a whole file at once and returns a list of lines. The technique is very fast, but requires that the whole contents of the file be kept in memory. For very large files, this can be a problem. FILE.readline() is memory-friendly (it just reads a line at a time and can be called repeatedly until EOF is reached), but it is also much slower. The best solution for recent Python versions is xreadlines.xreadlines(), or FILE.xreadlines() in Python 2.1+. These techniques are memory-friendly, while still being fast and presenting a "virtual list" of lines (by way of Python's new generator/iterator interface).
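For instance, a typical line-by-line loop under Python 2.1+ looks like this (the filename and process() are hypothetical placeholders):

 for line in open('server.log').xreadlines():
     process(line)    # hypothetical per-record computation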

The above techniques work nicely for reading a file in its natural order, but what if you want to start at the end of a file and work backwards from there? This need is frequently encountered when you want to read log files that have records appended over time (and when you want to look at the most recent records first). It comes up in other situations also. There is a very easy technique if memory usage is not an issue:

 >>> open('lines','w').write('\n'.join([str(n) for n in range(100)]))
 >>> fp = open('lines')
 >>> lines = fp.readlines()
 >>> lines.reverse()
 >>> for line in lines[1:5]:
 ...     # Processing suite here
 ...     print line,
 ...
 98
 97
 96
 95

For large input files, however, this technique is not feasible. It would be nice to have something analogous to xreadlines here. The example below provides a good starting point (the example works equally well for file-like objects).

read_backwards.py
# Read blocks of a file from end to beginning.
# Blocks may be defined by any delimiter, but the
#  constants LINE and PARA are useful ones.
# Works much like the file object method '.readline()':
#  repeated calls continue to get "next" part, and
#  function returns empty string once BOF is reached.

# Define constants
from os import linesep
LINE = linesep
PARA = linesep*2
READSIZE = 1000

# Global variables
buffer = ''

def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
    """Read blocks of file backwards (return empty string when done)"""
    # Trick of mutable default argument to hold state between calls
    if not _init[0]:
        fp.seek(0, 2)
        _init[0] = 1
    # Find a block (using global buffer)
    global buffer
    while 1:
        # first check for block in buffer
        delim = buffer.rfind(mode)
        if delim <> -1:     # block is in buffer, return it
            block = buffer[delim+len(mode):]
            buffer = buffer[:delim]
            return block+mode
        #-- BOF reached, return remainder (or empty string)
        elif fp.tell() == 0:
            block = buffer
            buffer = ''
            return block
        else:           # Read some more data into the buffer
            readsize = min(fp.tell(), sizehint)
            fp.seek(-readsize, 1)
            buffer = fp.read(readsize) + buffer
            fp.seek(-readsize, 1)

#-- Self test of read_backwards()
if __name__ == '__main__':
    # Let's create a test file to read in backwards
    fp = open('lines', 'wb')
    fp.write(LINE.join(['--- %d ---' % n for n in range(15)]))
    # Now open for reading backwards
    fp = open('lines', 'rb')
    # Read the blocks in, one per call (block==line by default)
    block = read_backwards(fp)
    while block:
        print block,
        block = read_backwards(fp)

Notice that anything could serve as a block delimiter. The constants provided just happen to work for lines and block paragraphs (and for block paragraphs, only with the current OS's style of line breaks), but other delimiters could be used. It would not be immediately possible to read backwards word-by-word: A space delimiter would come close, but would not be quite right for other whitespace. However, reading a line (and maybe reversing its words) is generally good enough.
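For instance, to pull paragraphs rather than lines from the end of a file, only the mode argument changes ('report.txt' is a hypothetical file; the function itself is unchanged):

 fp = open('report.txt', 'rb')
 para = read_backwards(fp, mode=PARA)    # last paragraph comes first
 while para:
     # Processing suite here
     para = read_backwards(fp, mode=PARA)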

Another enhancement is possible with Python 2.2+. Using the new yield keyword, read_backwards() could be programmed as an iterator rather than as a multi-call function. The performance will not differ significantly, but the function might be expressed more clearly (and a "list-like" interface like FILE.readlines() makes the application's loop simpler).

QUESTIONS

1:

Write a generator-based version of read_backwards() that uses the yield keyword. Modify the self-test code to utilize the generator instead.

2:

Explore and explain some pitfalls with the use of a mutable default value as a function argument. Explain also how the style allows functions to encapsulate data and contrast with the encapsulation of class instances.


