2.3.1 Exercise: Many ways to take out the garbage

DISCUSSION

Recall, if you will, the dictum in "The Zen of Python" that "There should be one--and preferably only one--obvious way to do it." As with most dictums, the real world sometimes fails our ideals. Also as with most dictums, this is not necessarily such a bad thing.

A discussion on the newsgroup <comp.lang.python> in 2001 posed an apparently rather simple problem. The immediate problem was that one might encounter telephone numbers with a variety of dividers and delimiters inside them. For example, (123) 456-7890, 123-456-7890, or 123/456-7890 might all represent the same telephone number, and all forms might be encountered in textual data sources (such as ones entered by users of a free-form entry field). For purposes of this problem, the canonical form of this number should be 1234567890.

The problem mentioned here can be generalized in some natural ways: Maybe we are interested in only some of the characters within a longer text field (in this case, the digits), and the rest is simply filler. So the general problem is how to extract the content from the filler.

The first and "obvious" approach might be a procedural loop through the initial string. One version of this approach might look like:

  >>> s = '(123)/456-7890'
  >>> result = ''
  >>> for c in s:
  ...     if c in '0123456789':
  ...         result = result + c
  ...
  >>> result
  '1234567890'

This first approach works fine, but it might seem a bit bulky for what is, after all, basically a single action. And it might also seem odd that you need to loop through character-by-character rather than just transform the whole string. One possibly simpler approach is to use a regular expression.
For readers who have skipped to the next chapter, or who know regular expressions already, this approach seems obvious:

  >>> import re
  >>> s = '(123)/456-7890'
  >>> re.sub(r'\D', '', s)
  '1234567890'

The actual work done (excluding defining the initial string and importing the re module) is just one short expression. Good enough, but one catch with regular expressions is that they are frequently far slower than basic string operations. This makes no difference for the tiny example presented, but for processing megabytes, it could start to matter.

Using a functional style of programming is one way to express the "filter" in question rather tersely, and perhaps more efficiently. For example:

  >>> s = '(123)/456-7890'
  >>> filter(lambda c: c.isdigit(), s)
  '1234567890'

We also get something short, without needing to use regular expressions. Here is another technique that utilizes string object methods and list comprehensions, and also pins some hopes on the great efficiency of Python dictionaries:

  >>> isdigit = {'0':1,'1':1,'2':1,'3':1,'4':1,
  ...            '5':1,'6':1,'7':1,'8':1,'9':1}.has_key
  >>> ''.join([x for x in s if isdigit(x)])
  '1234567890'

QUESTIONS
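As an aside before moving on: the idioms above have shifted in modern Python, where filter() returns an iterator rather than a string and the dictionary-of-digits trick is unnecessary. A brief sketch of the same filters, for comparison (illustrative only; the book's listings target Python 2):

```python
import string

s = '(123)/456-7890'

# filter() now yields an iterator, so the result must be joined:
digits_filter = ''.join(filter(str.isdigit, s))

# str.translate() with a deletion table is the fast string-method route:
drop_filler = str.maketrans('', '', string.punctuation +
                            string.ascii_letters + string.whitespace)
digits_translate = s.translate(drop_filler)

# A generator expression is probably the most common spelling today:
digits_genexp = ''.join(c for c in s if c.isdigit())

assert digits_filter == digits_translate == digits_genexp == '1234567890'
```

All three produce the canonical form; which is fastest still depends on input size, so the timing caveats discussed above continue to apply.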
2.3.2 Exercise: Making sure things are what they should be

DISCUSSION

The concept of a "digital signature" was introduced in Section 2.2.4. As was mentioned there, the Python standard library does not include (directly) any support for digital signatures. One way to characterize a digital signature is as some information that proves or verifies that some other information really is what it purports to be. But this characterization actually applies to a broader set of things than just digital signatures. In cryptology literature one is accustomed to talk about the "threat model" a crypto-system defends against. Let us look at a few.

Data may be altered by malicious tampering, but it may also be altered by packet loss, storage-media errors, or program errors. The threat of accidental damage to data is the easiest to defend against. The standard technique is to compute a hash of the correct data and send that along as well. The receiver of the data can simply calculate the hash of the data herself, using the same algorithm, and compare it with the hash sent. A very simple utility like the one below does this:

  crc32.py

  # Calculate CRC32 hash of input files or STDIN
  # Incremental read for large input sources
  # Usage: python crc32.py [file1 [file2 [...]]]
  #    or: python crc32.py < STDIN
  import binascii
  import fileinput
  filelist = []
  crc = binascii.crc32('')
  for line in fileinput.input():
      if fileinput.isfirstline():
          if fileinput.isstdin():
              filelist.append('STDIN')
          else:
              filelist.append(fileinput.filename())
      crc = binascii.crc32(line, crc)
  print 'Files:', ' '.join(filelist)
  print 'CRC32:', crc

A slightly faster version could use zlib.adler32() instead of binascii.crc32(). The chance that a randomly corrupted file would have the right CRC32 hash is approximately 2**-32, which is unlikely enough not to worry about most of the time. A CRC32 hash, however, is far too weak to be used cryptographically.
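The incremental-update pattern used by crc32.py is worth isolating: feeding chunks into the hash one at a time produces the same result as hashing everything at once. A minimal sketch in modern Python, using zlib.crc32() (the same function binascii.crc32() exposes):

```python
import zlib

def crc32_of_chunks(chunks):
    """Incrementally compute a CRC32 over an iterable of byte chunks,
    passing the running value back in as the second argument."""
    crc = 0  # zlib.crc32 starts from 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF  # mask to an unsigned 32-bit value

data = [b'Mary had ', b'a little lamb\n']
one_shot = zlib.crc32(b''.join(data)) & 0xFFFFFFFF
assert crc32_of_chunks(data) == one_shot  # incremental == one-shot
```

The same incremental pattern carries over directly to the cryptographic hashes discussed next, via their .update() method.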
While a random data error will almost surely not create a chance hash collision, a malicious tamperer (Mallory, in crypto-parlance) can find one relatively easily. Specifically, supposing the true message is M, Mallory can find an M' such that CRC32(M) equals CRC32(M'). Moreover, even imposing the condition that M' appears plausible as a message to the receiver does not make Mallory's task particularly difficult. To thwart fraudulent messages, it is necessary to use a cryptographically strong hash, such as SHA or MD5. The utility to do so is almost the same as the one above:

  sha.py

  # Calculate SHA hash of input files or STDIN
  # Usage: python sha.py [file1 [file2 [...]]]
  #    or: python sha.py < STDIN
  import sha, fileinput, os, sys
  filelist = []
  sha = sha.sha()
  for line in fileinput.input():
      if fileinput.isfirstline():
          if fileinput.isstdin():
              filelist.append('STDIN')
          else:
              filelist.append(fileinput.filename())
      sha.update(line[:-1]+os.linesep)   # same as binary read
  sys.stderr.write('Files: '+' '.join(filelist)+'\nSHA: ')
  print sha.hexdigest()

An SHA or MD5 hash cannot practically be forged, but if our threat model includes a malicious tamperer, we need to worry about whether the hash itself is authentic. Mallory, our tamperer, can produce a false SHA hash that matches her false message. With CRC32 hashes, a very common procedure is to attach the hash to the data message itself, for example, as the first or last line of the data file, or within some wrapper lines. This is called an "in band" or "in channel" transmission.

One alternative is "out of band" or "off channel" transmission of cryptographic hashes. For example, a set of cryptographic hashes matching data files could be placed on a Web page. Merely transmitting the hash off channel does not guarantee security, but it does require Mallory to attack both channels effectively.

By using encryption, it is possible to transmit a secured hash in channel. The key here is to encrypt the hash and attach that encrypted version.
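As an aside: in later versions of Python the sha module was folded into hashlib, and the standard-library way to bind a hash to a shared secret is a keyed hash (HMAC) rather than encrypting the digest by hand. A purely illustrative sketch (the password and message here are just the example values):

```python
import hashlib
import hmac

secret = b'mypassword'                  # shared key, sent out of band
message = b'Mary had a little lamb\n'

# Plain digest: anyone, including Mallory, can recompute this.
plain = hashlib.sha1(message).hexdigest()

# Keyed digest: only holders of the secret can produce or verify it.
tag = hmac.new(secret, message, hashlib.sha1).hexdigest()

# Verification on the receiving side; compare_digest avoids timing leaks.
recomputed = hmac.new(secret, message, hashlib.sha1).hexdigest()
assert hmac.compare_digest(tag, recomputed)
```

This addresses the same threat as the encrypt-the-hash scheme developed below, but the password still has to reach the receiver securely, exactly as discussed.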
If the hash is appended with some identifying information before the encryption, that information can be recovered to prove identity. Otherwise, one could simply include both the hash and its encrypted version. For the encryption of the hash, an asymmetrical encryption algorithm is ideal; however, with the Python standard library, the best we can do is to use the (weak) symmetrical encryption in rotor. For example, we could use the utility below:

  hash_rotor.py

  #!/usr/bin/env python
  # Encrypt hash on STDIN using sys.argv[1] as password
  import rotor, sys, binascii
  cipher = rotor.newrotor(sys.argv[1])
  hexhash = sys.stdin.read()[:-1]   # no newline
  print hexhash
  hash = binascii.unhexlify(hexhash)
  sys.stderr.write('Encryption: ')
  print binascii.hexlify(cipher.encrypt(hash))

The utilities could then be used like:

  % cat mary.txt
  Mary had a little lamb
  % python sha.py mary.txt | hash_rotor.py mypassword >> mary.txt
  Files: mary.txt
  SHA: Encryption:
  % cat mary.txt
  Mary had a little lamb
  c49bf9a7840f6c07ab00b164413d7958e0945941
  63a9d3a2f4493d957397178354f21915cb36f8f8

The penultimate line of the file now holds its SHA hash, and the last line holds an encryption of that hash. The password used will somehow need to be transmitted securely for the receiver to validate the appended document (obviously, the whole system makes more sense with longer and more proprietary documents than the one in the example).

QUESTIONS
Try implementing an RSA public-key algorithm in Python, and use it to enrich the digital signature system you developed above.

2.3.3 Exercise: Finding needles in haystacks (full-text indexing)

DISCUSSION

Many texts you deal with are loosely structured and prose-like, rather than composed of well-ordered records. For documents of that sort, a very frequent question you want answered is, "What is (or isn't) in the documents?"--at a more general level than the semantic richness you might obtain by actually reading the documents. In particular, you often want to check a large collection of documents to determine the (comparatively) small subset of them that is relevant to a given area of interest.

A certain category of questions about document collections has nothing much to do with text processing. For example, to locate all the files modified within a certain time period and having a certain file size, some basic use of the os.path module suffices. Below is a sample utility to do such a search, which includes some typical argument parsing and help screens. The search itself is only a few lines of code:

  findfile1.py

  # Find files matching date and size
  _usage = """
  Usage: python findfile1.py [-start=days_ago] [-end=days_ago]
                             [-small=min_size] [-large=max_size] [pattern]
  Example: python findfile1.py -start=10 -end=5 -small=1000 -large=5000 *.txt
  """
  import os.path
  import time
  import glob
  import sys
  def parseargs(args):
      """Somewhat flexible argument parser for multiple platforms.

      Switches can start with - or /, keywords can end with = or :.
      No error checking for bad arguments is performed, however.
""" now = time.time() secs_in_day = 60*60*24 start = 0 # start of epoch end = time.time() # right now small = 0 # empty files large = sys.maxint # max file size pat = '*' # match all for arg in args: if arg[0] in '-/': if arg[1:6]=='start': start = now-(secs_in_day*int(arg[7:])) elif arg[1:4]=='end': end = now-(secs_in_day*int(arg[5:])) elif arg[1:6]=='small': small = int(arg[7:]) elif arg[1:6]=='large': large = int(arg[7:]) elif arg[1] in 'h?': print _usage else: pat = arg return (start,end,small,large,pat) if __name__ == '__main__': if len(sys.argv) > 1: (start,end,small,large,pat) = parseargs(sys.argv[1:]) for fname in glob.glob(pat): if not os.path.isfile(fname): continue # don't check directories modtime = os.path.getmtime(fname) size = os.path.getsize(fname) if small <= size <= large and start <= modtime <= end: print time.ctime(modtime),'%8d '%size,fname else: print _usage What about searching for text inside files? The string.find() function is good for locating contents quickly and could be used to search files for contents. But for large document collections, hits may be common. To make sense of search results, ranking the results by number of hits can help. The utility below performs a match-accuracy ranking (for brevity, without the argument parsing of findfile1.py): findfile2.py# Find files that contain a word _usage = "Usage: python findfile.py word" import os.path import glob import sys if len(sys.argv) == 2: search_word = sys.argv[1] results = [] for fname in glob.glob('*'): if os.path.isfile(fname): # don't check directories text = open(fname).read() fsize = len(text) hits = text.count(search_word) density = (fsize > 0) and float(hits)/(fsize) if density > 0: # consider when density==0 results.append((density,fname)) results.sort() results.reverse() print 'RANKING FILENAME' print '------- -------------------- for match in results: print '%6d '%int(match[0] *1000000), match[1] else: print _usage Variations on these are, of course, possible. 
But generally you could build quite sophisticated searches and rankings by adding new search options incrementally to findfile2.py. For example, adding some regular expression options could give the utility capabilities similar to those of the grep utility.

The place where a word search program like the one above falls terribly short is in speed of locating documents in very large document collections. Even something as fast and well optimized as grep simply takes a while to search a lot of source text. Fortunately, it is possible to shortcut this search time, as well as add some additional capabilities.

A technique for rapid searching is to perform a generic search just once (or periodically) and create an index, i.e., a database, of those generic search results. Performing a later search need not really search contents, but only check the abstracted and structured index of possible searches. The utility indexer.py is a functional example of such a computed search index. The most current version may be downloaded from the book's Web site <http://gnosis.cx/TPiP/>.

The utility indexer.py allows very rapid searching for the simultaneous occurrence of multiple words within a file. For example, one might want to locate all the document files (or other text sources, such as VARCHAR database fields) that contain the words Python, index, and search. Supposing there are many thousands of candidate documents, searching them on an ad hoc basis could be slow. But indexer.py creates a comparatively compact collection of persistent dictionaries that provide answers to such inquiries.

The full source code to indexer.py is worth reading, but most of it deals with a variety of persistence mechanisms and with an object-oriented programming (OOP) framework for reuse. The underlying idea is simple, however.
Create three dictionaries based on scanning a collection of documents:

  *Indexer.fileids:  fileid --> filename
  *Indexer.files:    filename --> (fileid, wordcount)
  *Indexer.words:    word --> {fileid1:occurs, fileid2:occurs, ...}

The essential mapping is *Indexer.words. For each word, what files does it occur in, and how often? The mappings *Indexer.fileids and *Indexer.files are ancillary. The first just allows shorter numeric aliases to be used instead of long filenames in the *Indexer.words mapping (a performance boost and storage saver). The second, *Indexer.files, also holds a total wordcount for each file. This allows a ranking of the importance of different matches: The thought is that a megabyte file with ten occurrences of Python is less focused on the topic of Python than is a kilobyte file with the same ten occurrences.

Both generating and utilizing the mappings above is straightforward. To search multiple words, one basically needs the intersection of the results of several values of the *Indexer.words dictionary, one value for each word key. Generating the mappings involves incrementing counts in the nested dictionary of *Indexer.words, but is not complicated.

QUESTIONS
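Before tackling the questions, it may help to see the three-mapping scheme stripped to its core. The sketch below is illustrative only, not the actual indexer.py code: the names mirror the text (fileids, files, words), while build_index() and search() are hypothetical helpers, with no persistence or OOP framework:

```python
def build_index(docs):
    """docs maps filename -> text. Returns (fileids, files, words)."""
    fileids, files, words = {}, {}, {}
    for fid, (fname, text) in enumerate(docs.items()):
        tokens = text.lower().split()
        fileids[fid] = fname                  # fileid --> filename
        files[fname] = (fid, len(tokens))     # filename --> (fileid, wordcount)
        for tok in tokens:                    # word --> {fileid: occurrences}
            words.setdefault(tok, {})
            words[tok][fid] = words[tok].get(fid, 0) + 1
    return fileids, files, words

def search(index, *terms):
    """Files containing every term: intersect the per-word fileid sets."""
    fileids, files, words = index
    hit_sets = [set(words.get(t, {})) for t in terms]
    common = set.intersection(*hit_sets) if hit_sets else set()
    return sorted(fileids[fid] for fid in common)

idx = build_index({'a.txt': 'python index search python',
                   'b.txt': 'python only here'})
assert search(idx, 'python', 'index', 'search') == ['a.txt']
```

The wordcounts stored in files are what the real utility uses to rank the intersected matches by density, as described above.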