2.3 Solving Problems

2.3.1 Exercise: Many ways to take out the garbage

DISCUSSION

Recall, if you will, the dictum in "The Zen of Python" that "There should be one and preferably only one obvious way to do it." As with most dictums, the real world sometimes fails our ideals. Also as with most dictums, this is not necessarily such a bad thing.

A discussion on the newsgroup <comp.lang.python> in 2001 posed an apparently rather simple problem. The immediate problem was that one might encounter telephone numbers with a variety of dividers and delimiters inside them. For example, (123) 456-7890, 123-456-7890, or 123/456-7890 might all represent the same telephone number, and all forms might be encountered in textual data sources (such as ones entered by users of a free-form entry field). For purposes of this problem, the canonical form of this number should be 1234567890.

The problem mentioned here can be generalized in some natural ways: Maybe we are interested in only some of the characters within a longer text field (in this case, the digits), and the rest is simply filler. So the general problem is how to extract the content from the filler.

The first and "obvious" approach might be a procedural loop through the initial string. One version of this approach might look like:

>>> s = '(123)/456-7890'
>>> result = ''
>>> for c in s:
...     if c in '0123456789':
...         result = result + c
...
>>> result
'1234567890'

This first approach works fine, but it might seem a bit bulky for what is, after all, basically a single action. And it might also seem odd that you need to loop through character-by-character rather than just transform the whole string.

One possibly simpler approach is to use a regular expression. For readers who have skipped ahead to the next chapter, or who already know regular expressions, this approach seems obvious:

>>> import re
>>> s = '(123)/456-7890'
>>> re.sub(r'\D', '', s)
'1234567890'

The actual work done (excluding defining the initial string and importing the re module) is just one short expression. Good enough, but one catch with regular expressions is that they are frequently far slower than basic string operations. This makes no difference for the tiny example presented, but for processing megabytes, it could start to matter.
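
Intuitions about such speed claims are easy to check with the standard timeit module (added in Python 2.3). The fragment below is only a rough sketch: the synthetic test string and repeat counts are illustrative, and actual results will vary by platform and Python version.

# Rough timing sketch using the standard timeit module; the test string,
# sizes, and repeat counts are illustrative only
import timeit

setup = "import re; s = '(123)/456-7890' * 10000"
loop = """\
result = ''
for c in s:
    if c in '0123456789':
        result = result + c
"""
regex = r"re.sub(r'\D', '', s)"

print 'loop :', timeit.Timer(loop, setup).timeit(100)
print 'regex:', timeit.Timer(regex, setup).timeit(100)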

Using a functional style of programming is one way to express the "filter" in question rather tersely, and perhaps more efficiently. For example:

>>> s = '(123)/456-7890'
>>> filter(lambda c: c.isdigit(), s)
'1234567890'

We also get something short, without needing to use regular expressions. Here is another technique that utilizes string object methods and list comprehensions, and also pins some hopes on the great efficiency of Python dictionaries:

>>> isdigit = {'0':1,'1':1,'2':1,'3':1,'4':1,
...            '5':1,'6':1,'7':1,'8':1,'9':1}.has_key
>>> ''.join([x for x in s if isdigit(x)])
'1234567890'

QUESTIONS

1:

Which content extraction technique seems most natural to you? Which would you prefer to use? Explain why.

2:

What intuitions do you have about the performance of these different techniques, if applied to large data sets? Are there differences in comparative efficiency of techniques between operating on one single large string input and operating on a large number of small string inputs?

3:

Construct a program to verify or refute your intuitions about performance of the constructs.

4:

Can you think of ways of combining these techniques to maximize efficiency? Are there any other techniques available that might be even better (hint: think about what string.translate() does)? Construct a faster technique, and demonstrate its efficiency.

5:

Are there reasons other than raw processing speed to prefer some of these techniques over others? Explain these reasons, if they exist.

2.3.2 Exercise: Making sure things are what they should be

DISCUSSION

The concept of a "digital signature" was introduced in Section 2.2.4. As was mentioned, the Python standard library does not include (directly) any support for digital signatures. One way to characterize a digital signature is as some information that proves or verifies that some other information really is what it purports to be. But this characterization actually applies to a broader set of things than just digital signatures. In the cryptology literature one is accustomed to talking about the "threat model" a crypto-system defends against. Let us look at a few.

Data may be altered by malicious tampering, but it may also be altered by packet loss, storage-media errors, or by program errors. The threat of accidental damage to data is the easiest threat to defend against. The standard technique is to use a hash of the correct data and send that also. The receiver of the data can simply calculate the hash of the data herself using the same algorithm and compare it with the hash sent. A very simple utility like the one below does this:

crc32.py
# Calculate CRC32 hash of input files or STDIN
# Incremental read for large input sources
# Usage:     python crc32.py [file1 [file2 [...]]]
#    or:     python crc32.py < STDIN
import binascii
import fileinput
filelist = []
crc = binascii.crc32('')
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    crc = binascii.crc32(line, crc)
print 'Files:', ' '.join(filelist)
print 'CRC32:', crc

A slightly faster version could use zlib.adler32() instead of binascii.crc32(). The chance that a randomly corrupted file would have the right CRC32 hash is approximately 2**-32, unlikely enough not to worry about most of the time.
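
Only the hashing calls themselves need to change for that variant; the following is a minimal sketch of crc32.py with zlib.adler32() swapped in (everything else stays the same).

# Sketch: crc32.py with the Adler-32 checksum from zlib swapped in
import zlib
import fileinput
filelist = []
crc = zlib.adler32('')                 # starting value for an empty string
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    crc = zlib.adler32(line, crc)      # roll the running checksum forward
print 'Files:', ' '.join(filelist)
print 'Adler32:', crc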

A CRC32 hash, however, is far too weak to be used cryptographically. While a random data error will almost surely not create a chance hash collision, a malicious tamperer (Mallory, in crypto-parlance) can find one relatively easily. Specifically, if the true message is M, Mallory can find an M' such that CRC32(M') equals CRC32(M). Moreover, even imposing the condition that M' appears plausible as a message to the receiver does not make Mallory's task particularly difficult.

To thwart fraudulent messages, it is necessary to use a cryptographically strong hash, such as SHA or MD5. The utility to do so is almost the same as the one above:

sha.py
# Calculate SHA hash of input files or STDIN
# Usage:     python sha.py [file1 [file2 [...]]]
#    or:     python sha.py < STDIN
import sha, fileinput, os, sys
filelist = []
sha = sha.sha()
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    sha.update(line[:-1] + os.linesep)   # same as binary read
sys.stderr.write('Files: ' + ' '.join(filelist) + '\nSHA: ')
print sha.hexdigest()

An SHA or MD5 hash cannot be forged practically, but if our threat model includes a malicious tamperer, we need to worry about whether the hash itself is authentic. Mallory, our tamperer, can produce a false SHA hash that matches her false message. With CRC32 hashes, a very common procedure is to attach the hash to the data message itself, for example, as the first or last line of the data file, or within some wrapper lines. This is called an "in band" or "in channel" transmission. One alternative is "out of band" or "off channel" transmission of cryptographic hashes. For example, a set of cryptographic hashes matching data files could be placed on a Web page. Merely transmitting the hash off channel does not guarantee security, but it does require Mallory to attack both channels effectively.

By using encryption, it is possible to transmit a secured hash in channel. The key here is to encrypt the hash and attach that encrypted version. If the hash is appended with some identifying information before the encryption, that can be recovered to prove identity. Otherwise, one could simply include both the hash and its encrypted version. For the encryption of the hash, an asymmetrical encryption algorithm is ideal; however, with the Python standard library, the best we can do is to use the (weak) symmetrical encryption in rotor. For example, we could use the utility below:

hash_rotor.py
#!/usr/bin/env python
# Encrypt hash on STDIN using sys.argv[1] as password
import rotor, sys, binascii
cipher = rotor.newrotor(sys.argv[1])
hexhash = sys.stdin.read()[:-1]  # no newline
print hexhash
hash = binascii.unhexlify(hexhash)
sys.stderr.write('Encryption: ')
print binascii.hexlify(cipher.encrypt(hash))

The utilities could then be used like:

% cat mary.txt
Mary had a little lamb
% python sha.py mary.txt | hash_rotor.py mypassword >> mary.txt
Files: mary.txt
SHA: Encryption:
% cat mary.txt
Mary had a little lamb
c49bf9a7840f6c07ab00b164413d7958e0945941
63a9d3a2f4493d957397178354f21915cb36f8f8

The penultimate line of the file now has its SHA hash, and the last line has an encryption of the hash. The password used will somehow need to be transmitted securely for the receiver to validate the appended document (obviously, the whole system makes more sense with longer and more proprietary documents than in the example).
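
On the receiving end, validation just reverses these steps: recompute the SHA hash of the document body, decrypt the appended line with the shared password, and compare. The sketch below assumes the exact layout produced above (plain hash on the penultimate line, encrypted hash on the last line); the script name verify.py and its argument handling are invented for illustration.

#!/usr/bin/env python
# Sketch: validate a document that has its hash and encrypted hash
# appended as its last two lines.  Usage: python verify.py file password
import rotor, sha, binascii, os, sys

lines = open(sys.argv[1]).readlines()
body, hexhash, hexcipher = lines[:-2], lines[-2].strip(), lines[-1].strip()

# Recompute the hash of the body the same way sha.py does
h = sha.sha()
for line in body:
    h.update(line[:-1] + os.linesep)

# Decrypt the appended hash using the shared password
cipher = rotor.newrotor(sys.argv[2])
decrypted = binascii.hexlify(cipher.decrypt(binascii.unhexlify(hexcipher)))

if h.hexdigest() == hexhash == decrypted:
    print 'Document verified'
else:
    print 'WARNING: hash or signature mismatch'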

QUESTIONS

1:

How would you wrap up the suggestions in the small utilities above into a more robust and complete "digital_signatures.py" utility or module? What concerns would come into play in a completed utility?

2:

Why is CRC32 not suitable for cryptographic purposes? What sets SHA and MD5 apart (you should not need to know the details of the algorithm for this answer)? Why is uniformity of coverage of hash results important for any hash algorithm?

3:

Explain in your own words why hashes serve to verify documents. If you were actually the malicious attacker in the scenarios above, how would you go about interfering with the crypto-systems outlined here? What lines of attack are left open by the system you sketched out or programmed in (1)?

4:

If messages are subject to corruptions, including accidental corruption, so are hashes. The short length of hashes may make problems in them less likely, but not impossible. How might you enhance the document verification systems above to detect corruption within a hash itself? How might you allow more accurate targeting of corrupt versus intact portions of a large document (it may be desirable to recover as much as possible from a corrupt document)?

5:

Advanced: The RSA public-key algorithm is actually quite simple; it just involves some modulo exponentiation operations and some large primes. An explanation can be found, among other places, at the author's Introduction to Cryptology Concepts II: <http://gnosis.cx/publish/programming/cryptology2.pdf>.

Try implementing an RSA public-key algorithm in Python, and use this to enrich the digital signature system you developed above.

2.3.3 Exercise: Finding needles in haystacks (full-text indexing)

DISCUSSION

Many texts you deal with are loosely structured and prose-like, rather than composed of well-ordered records. For documents of that sort, a very frequent question you want answered is, "What is (or isn't) in the documents?" at a more general level than the semantic richness you might obtain by actually reading the documents. In particular, you often want to check a large collection of documents to determine the (comparatively) small subset of them that are relevant to a given area of interest.

A certain category of questions about document collections has nothing much to do with text processing. For example, to locate all the files modified within a certain time period, and having a certain file size, some basic use of the os.path module suffices. Below is a sample utility to do such a search, which includes some typical argument parsing and help screens. The search itself is only a few lines of code:

findfile1.py
# Find files matching date and size
_usage = """
Usage:    python findfile1.py [-start=days_ago] [-end=days_ago]
                              [-small=min_size] [-large=max_size] [pattern]
Example:  python findfile1.py -start=10 -end=5 -small=1000 -large=5000 *.txt
"""
import os.path
import time
import glob
import sys

def parseargs(args):
    """Somewhat flexible argument parser for multiple platforms.

    Switches can start with - or /, keywords can end with = or :.
    No error checking for bad arguments is performed, however.
    """
    now = time.time()
    secs_in_day = 60*60*24
    start = 0           # start of epoch
    end = time.time()   # right now
    small = 0           # empty files
    large = sys.maxint  # max file size
    pat = '*'           # match all
    for arg in args:
        if arg[0] in '-/':
            if   arg[1:6]=='start': start = now-(secs_in_day*int(arg[7:]))
            elif arg[1:4]=='end':   end   = now-(secs_in_day*int(arg[5:]))
            elif arg[1:6]=='small': small = int(arg[7:])
            elif arg[1:6]=='large': large = int(arg[7:])
            elif arg[1] in 'h?':    print _usage
        else:
            pat = arg
    return (start, end, small, large, pat)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        (start, end, small, large, pat) = parseargs(sys.argv[1:])
        for fname in glob.glob(pat):
            if not os.path.isfile(fname):
                continue          # don't check directories
            modtime = os.path.getmtime(fname)
            size = os.path.getsize(fname)
            if small <= size <= large and start <= modtime <= end:
                print time.ctime(modtime), '%8d ' % size, fname
    else:
        print _usage

What about searching for text inside files? The string.find() function is good for locating contents quickly and could be used to search files for contents. But for large document collections, hits may be common. To make sense of search results, ranking the results by number of hits can help. The utility below ranks matching files by the density of hits, that is, occurrences relative to file size (for brevity, without the argument parsing of findfile1.py):

findfile2.py
# Find files that contain a word
_usage = "Usage: python findfile.py word"
import os.path
import glob
import sys

if len(sys.argv) == 2:
    search_word = sys.argv[1]
    results = []
    for fname in glob.glob('*'):
        if os.path.isfile(fname):   # don't check directories
            text = open(fname).read()
            fsize = len(text)
            hits = text.count(search_word)
            density = (fsize > 0) and float(hits)/fsize
            if density > 0:         # consider when density==0
                results.append((density, fname))
    results.sort()
    results.reverse()
    print 'RANKING  FILENAME'
    print '-------  --------------------'
    for match in results:
        print '%6d  ' % int(match[0]*1000000), match[1]
else:
    print _usage

Variations on these are, of course, possible. But generally you could build pretty sophisticated searches and rankings by adding new search options incrementally to findfile2.py. For example, adding some regular expression options could give the utility capabilities similar to the grep utility.
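
A sketch of that regular expression extension appears below; it is simply findfile2.py with a compiled pattern substituted for the literal word count (the name findgrep.py is invented here for illustration).

# findgrep.py -- sketch: findfile2.py with a regular expression pattern
_usage = "Usage: python findgrep.py pattern"
import os.path
import glob
import re
import sys

if len(sys.argv) == 2:
    pattern = re.compile(sys.argv[1])
    results = []
    for fname in glob.glob('*'):
        if os.path.isfile(fname):   # don't check directories
            text = open(fname).read()
            fsize = len(text)
            hits = len(pattern.findall(text))   # regex matches, not .count()
            density = (fsize > 0) and float(hits)/fsize
            if density > 0:
                results.append((density, fname))
    results.sort()
    results.reverse()
    print 'RANKING  FILENAME'
    print '-------  --------------------'
    for match in results:
        print '%6d  ' % int(match[0]*1000000), match[1]
else:
    print _usage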

The place where a word-search program like the one above falls terribly short is in the speed of locating documents in very large document collections. Even something as fast and well optimized as grep simply takes a while to search a lot of source text. Fortunately, it is possible to shortcut this search time, as well as add some additional capabilities.

A technique for rapid searching is to perform a generic search just once (or periodically) and create an index, that is, a database of those generic search results. A later search then need not really search contents, but only check the abstracted and structured index of possible searches. The utility indexer.py is a functional example of such a computed search index. The most current version may be downloaded from the book's Web site <http://gnosis.cx/TPiP/>.

The utility indexer.py allows very rapid searching for the simultaneous occurrence of multiple words within a file. For example, one might want to locate all the document files (or other text sources, such as VARCHAR database fields) that contain the words Python, index, and search. Supposing there are many thousands of candidate documents, searching them on an ad hoc basis could be slow. But indexer.py creates a comparatively compact collection of persistent dictionaries that provide answers to such inquiries.

The full source code to indexer.py is worth reading, but most of it deals with a variety of persistence mechanisms and with an object-oriented programming (OOP) framework for reuse. The underlying idea is simple, however. Create three dictionaries based on scanning a collection of documents:

*Indexer.fileids:  fileid --> filename
*Indexer.files:    filename --> (fileid, wordcount)
*Indexer.words:    word --> {fileid1:occurs, fileid2:occurs, ...}

The essential mapping is *Indexer.words. For each word, what files does it occur in and how often? The mappings *Indexer.fileids and *Indexer.files are ancillary. The first just allows shorter numeric aliases to be used instead of long filenames in the *Indexer.words mapping (a performance boost and storage saver). The second, *Indexer.files, also holds a total wordcount for each file. This allows a ranking of the importance of different matches. The thought is that a megabyte file with ten occurrences of Python is less focused on the topic of Python than is a kilobyte file with the same ten occurrences.

Both generating and utilizing the mappings above are straightforward. To search for multiple words, one simply needs the intersection of several values of the *Indexer.words dictionary, one value for each word key. Generating the mappings involves incrementing counts in the nested dictionary of *Indexer.words, but is not complicated.
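
The miniature sketch below captures that idea, stripped of the persistence layer and OOP framework of the real indexer.py. The names add_file() and find() are invented for this illustration, and the naive whitespace splitting glosses over exactly the "what is a word" issues raised in the questions below.

# Sketch of the core index structure and a multiword search
# (simplified; the real indexer.py adds persistence and reusable classes)
fileids = {}    # fileid --> filename
files   = {}    # filename --> (fileid, wordcount)
words   = {}    # word --> {fileid: occurrences, ...}

def add_file(fname, next_id=[0]):
    "Scan one file into the three dictionaries (naive word splitting)."
    text_words = open(fname).read().split()
    fileid = next_id[0]
    next_id[0] += 1
    fileids[fileid] = fname
    files[fname] = (fileid, len(text_words))
    for word in text_words:
        occurrences = words.setdefault(word, {})
        occurrences[fileid] = occurrences.get(fileid, 0) + 1

def find(wordlist):
    "Return filenames containing every word, densest matches first."
    hits = None
    for word in wordlist:
        in_files = words.get(word, {})
        if hits is None:
            hits = set(in_files)
        else:
            hits &= set(in_files)       # intersection across all words
    ranked = []
    for fileid in (hits or []):
        fname = fileids[fileid]
        wordcount = files[fname][1]
        occurs = 0
        for word in wordlist:
            occurs += words[word][fileid]
        ranked.append((float(occurs)/wordcount, fname))
    ranked.sort()
    ranked.reverse()
    return [fname for (density, fname) in ranked]

For example, after calling add_file() on each candidate document, find(['Python', 'index', 'search']) would return the matching filenames, best matches first.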

QUESTIONS

1:

One of the most significant and surprisingly subtle concerns in generating useful word indexes is figuring out just what a "word" is. What considerations would you bring to determine word identities? How might you handle capitalization? Punctuation? Whitespace? How might you disallow binary strings that are not "real" words? Try performing word-identification tests against real-world documents. How successful were you?

2:

Could other data structures be used to store word index information than those proposed above? If other data structures are used, what efficiency (speed) advantages or disadvantages do you expect to encounter? Are there other data structures that would allow for additional search capabilities than the multiword search of indexer.py? If so, what other indexed search capabilities would have the most practical benefit?

3:

Consider adding integrity guarantees to index results. What if an index falls out of synchronization with the underlying documents? How might you address referential integrity? Hint: consider binascii.crc32, sha, and md5. What changes to the data structures would be needed for integrity checks? Implement such an improvement.

4:

The utility indexer.py has some ad hoc exclusions of nontextual files from inclusion in an index, based simply on some file extensions. How might one perform accurate exclusion of nontextual data? What does it mean for a document to contain text? Try writing a utility istextual.py that will identify text and nontext real-world documents. Does it work to your satisfaction?

5:

Advanced: indexer.py implements several different persistence mechanisms. What other mechanisms might you use besides those implemented? Benchmark your mechanism. Does it do better than SlicedZPickleIndexer (the best variant included, in both speed and space)?


