21.3. String Method Utilities

Python's string methods include a variety of text-processing utilities that go above and beyond string expression operators. For instance, given an instance str of the built-in string object type:
This list is representative but partial, and some of these methods take additional optional arguments. For the full list of string methods, run a dir(str) call at the Python interactive prompt and run help(str.method) on any method for some quick documentation. The Python library manual also includes an exhaustive list.

Moreover, in Python today, Unicode (wide) strings fully support all normal string methods, and most of the older string module's functions are also now available as string object methods. For instance, in Python 2.0 and later, the following two expressions are equivalent:

    string.find(aString, substr)      # original module
    aString.find(substr)              # methods new in 2.0

However, the second form does not require callers to import the string module first. As of this third edition of the book, the method call form is used everywhere, since it has been the recommended best-practice pattern for some time. If you see older code based on the module call pattern, it is a simple mapping to the newer method-based call form. The original string module still contains predefined constants (e.g., string.uppercase), as well as the new Template substitution interface in 2.4, and so remains useful in some contexts apart from method calls.

21.3.1. Templating with Replacements and Formats

Speaking of templates, as we saw when coding the web page migration scripts in Part II of this book, the string replace method is often adequate as a string templating tool: we can compute values and insert them at fixed positions in a string with one replace call per value:

    >>> template = '---$target1---$target2---'
    >>> val1 = 'Spam'
    >>> val2 = 'shrubbery'
    >>> template = template.replace('$target1', val1)
    >>> template = template.replace('$target2', val2)
    >>> template
    '---Spam---shrubbery---'

As we also saw when generating HTML code in our Common Gateway Interface (CGI) scripts in Part III of this book, the string % formatting operator is also a powerful templating tool: simply fill out a dictionary with values and apply substitutions to the HTML string all at once:

    >>> template = """
    ... ---
    ... ---%(key1)s---
    ... ---%(key2)s---
    ... """
    >>>
    >>> vals = {}
    >>> vals['key1'] = 'Spam'
    >>> vals['key2'] = 'shrubbery'
    >>> print template % vals

    ---
    ---Spam---
    ---shrubbery---

The 2.4 string module's Template feature is essentially a simplified variation of the dictionary-based format scheme, but it allows some additional call patterns:

    >>> vals
    {'key2': 'shrubbery', 'key1': 'Spam'}
    >>> import string
    >>> template = string.Template('---$key1---$key2---')
    >>> template.substitute(vals)
    '---Spam---shrubbery---'
    >>> template.substitute(key1='Brian', key2='Loretta')
    '---Brian---Loretta---'

See the library manual for more on this extension. Although the string datatype does not itself support the pattern-directed text processing that we'll meet later in this chapter, its tools are powerful enough for many tasks.

21.3.2. Parsing with Splits and Joins

In terms of this chapter's main focus, Python's built-in tools for splitting and joining strings around tokens turn out to be especially useful when it comes to parsing text:
These two are among the most powerful of string methods. As we saw earlier in Chapter 3, split chops a string into a list of substrings and join puts them back together:
    >>> 'A B C D'.split( )
    ['A', 'B', 'C', 'D']
    >>> 'A+B+C+D'.split('+')
    ['A', 'B', 'C', 'D']
    >>> '--'.join(['a', 'b', 'c'])
    'a--b--c'

Despite their simplicity, they can handle surprisingly complex text-parsing tasks. Moreover, string method calls are very fast because they are implemented in C language code. For instance, to quickly replace all tabs in a file with four periods, pipe the file into a script that looks like this:

    from sys import *
    stdout.write( ('.' * 4).join( stdin.read( ).split('\t') ) )

The split call here divides input around tabs, and the join puts it back together with periods where tabs had been. The combination of the two calls is equivalent to using the global replacement string method call as follows:

    stdout.write( stdin.read( ).replace('\t', '.'*4) )

As we'll see in the next sections, splitting strings is sufficient for many text-parsing goals.

21.3.3. Summing Columns in a File

Let's look at a couple of practical applications of string splits and joins. In many domains, scanning files by columns is a fairly common task. For instance, suppose you have a file containing columns of numbers output by another system, and you need to sum each column's numbers. In Python, string splitting does the job, as demonstrated by Example 21-1. As an added bonus, it's easy to make the solution a reusable tool in Python.

Example 21-1. PP3E\Lang\summer.py
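The listing for Example 21-1 is not reproduced in this text. As a stand-in, here is a minimal sketch of what a column summer along these lines might look like. It is not the book's actual file: it simply assumes whitespace-separated columns, converts each column string with eval, and is written so that it also runs under Python 3. Because eval runs arbitrary expressions, a script like this should be given trusted files only.

```python
# Sketch only: a stand-in for the missing listing, not the book's summer.py.
import sys

def summer(numCols, fileName):
    sums = [0] * numCols                      # one running total per column
    with open(fileName) as lines:
        for line in lines:                    # file iterator: one line at a time
            cols = line.split()               # split around blanks and tabs
            for i in range(numCols):          # nested loop over the columns
                sums[i] += eval(cols[i])      # eval: trusted input only
    return sums

if __name__ == '__main__':
    # assumed usage: python summer.py <numCols> <fileName>
    if len(sys.argv) == 3:
        print(summer(int(sys.argv[1]), sys.argv[2]))
```

Keeping the logic in a function, with the command-line handling under the __name__ test, is what lets the same file serve as both an importable module and a shell tool.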
Notice that we use file iterators here to read line by line, instead of calling the file readlines method explicitly (recall from Chapter 4 that iterators avoid loading the entire file into memory all at once). As usual, you can both import this module and call its function, and run it as a shell tool from the command line. The summer.py script calls split to make a list of strings representing the line's columns, and eval to convert column strings to numbers. Here's an input file that uses both blanks and tabs to separate columns:

    C:\...\PP3E\Lang>type table1.txt
    1 5 10 2 1.0
    2 10 20 4 2.0
    3 15 30 8 3
    4 20 40 16 4.0

    C:\...\PP3E\Lang>python summer.py 5 table1.txt
    [10, 50, 100, 30, 10.0]

Also notice that because the summer script uses eval to convert file text to numbers, you could really store arbitrary Python expressions in the file. Here, for example, it's run on a file of Python code snippets:

    C:\...\PP3E\Lang>type table2.txt
    2 1+1 1<<1 eval("2")
    16 2*2*2*2 pow(2,4) 16.0
    3 len('abc') [1,2,3][2] {'spam':3}['spam']

    C:\...\PP3E\Lang>python summer.py 4 table2.txt
    [21, 21, 21, 21.0]

We'll revisit eval later in this chapter, when we explore expression evaluators. Sometimes eval's generality is more than we want: if we can't be sure that the strings that we run this way won't contain malicious code, for instance, it may be necessary to run them with limited machine access or use more restrictive conversion tools. Consider the following recoding of the summer function:

    def summer(numCols, fileName):
        sums = [0] * numCols
        for line in open(fileName):               # use file iterators
            cols = line.split(',')                # assume comma-delimited
            nums = [int(x) for x in cols]         # use limited converter
            both = zip(sums, nums)                # avoid nested for loop
            sums = [x + y for (x, y) in both]
        return sums

This version uses int for its conversions from strings to support only numbers, and not arbitrary and possibly unsafe expressions. Note that int accepts only integer text, so a value such as 1.0 would need float instead.
Although the first four lines of this coding are similar to the original, for variety this version also assumes the data is separated by commas rather than whitespace and runs list comprehensions and zip to avoid the nested for loop statement. This version is also substantially trickier than the original and so might be less desirable from a maintenance perspective. If its code is confusing, try adding print statements after each step to trace the results of each operation.

For related examples, also see the grid examples in Chapter 10 for another case of eval table magic at work. The summer script here is a much simpler version of that chapter's column sum logic. To remove the need to pass in a number-of-columns value, see also the more advanced floating-point column summer example in the "Other Uses for Dictionaries" sidebar in Chapter 2; it works by making the column number a key rather than an offset.

21.3.4. Parsing and Unparsing Rule Strings

Splitting comes in handy for dividing text into columns, but it can also be used as a more general parsing tool: by splitting more than once on different delimiters, we can pick apart more complex text. Although such parsing can also be achieved with more powerful tools such as the regular expressions we'll meet later in this chapter, split-based parsing is simpler to code, and may run quicker.

For instance, Example 21-2 demonstrates one way that splitting and joining strings can be used to parse sentences in a simple language. It is taken from a rule-based expert system shell (holmes) that is written in Python and included in this book's examples distribution (see the top-level Ai examples directory). Rule strings in holmes take the form:

    "rule <id> if <test1>, <test2>... then <conclusion1>, <conclusion2>..."

Tests and conclusions are conjunctions of terms ("," means "and"). Each term is a list of words or variables separated by spaces; variables start with ?.
To use a rule, it is translated to an internal form: a dictionary with nested lists. To display a rule, it is translated back to the string form. For instance, given the call:

    rules.internal_rule('rule x if a ?x, b then c, d ?x')

the conversion in function internal_rule proceeds as follows:

    string = 'rule x if a ?x, b then c, d ?x'
    i = ['rule x', 'a ?x, b then c, d ?x']
    t = ['a ?x, b', 'c, d ?x']
    r = ['', 'x']
    result = {'rule':'x', 'if':[['a','?x'], ['b']], 'then':[['c'], ['d','?x']]}

We first split around the if, then around the then, and finally around rule. The result is the three substrings that were separated by the keywords. Test and conclusion substrings are split around "," first and spaces last. join is used to convert back (unparse) to the original string for display. Example 21-2 is the concrete implementation of this scheme.

Example 21-2. PP3E\Lang\rules.py
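The listing for Example 21-2 is not reproduced in this text. The following is a minimal sketch reconstructed from the conversion steps traced above and from the function names that appear in the interactive session (internal, internal_rule, external_rule). The helper name external and all coding details are assumptions, not the book's actual file.

```python
# Sketch only: reconstructed from the steps described in the text,
# not the book's rules.py.

def internal(conjunct):
    res = []                                   # 'a ?x, b'
    for clause in conjunct.split(','):         # -> ['a ?x', ' b']
        res.append(clause.split())             # -> [['a', '?x'], ['b']]
    return res

def external(conjunct):                        # helper name is an assumption
    strs = []
    for clause in conjunct:                    # [['a', '?x'], ['b']]
        strs.append(' '.join(clause))          # -> ['a ?x', 'b']
    return ', '.join(strs)                     # -> 'a ?x, b'

def internal_rule(string):
    i = string.split(' if ')                   # ['rule x', '<tests> then <concls>']
    t = i[1].split(' then ')                   # ['<tests>', '<concls>']
    r = i[0].split('rule ')                    # ['', 'x']
    return {'rule': r[1].strip(),
            'if':   internal(t[0]),
            'then': internal(t[1])}

def external_rule(rule):
    return ('rule '  + rule['rule'] +
            ' if '   + external(rule['if']) +
            ' then ' + external(rule['then']) + '.')
```

The intermediate names i, t, and r mirror the trace shown earlier, so each split can be checked against it by hand.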
Notice that we could use newer list comprehensions to gain some conciseness here. The internal function, for instance, could be recoded to simply:

    return [clause.split( ) for clause in conjunct.split(',')]

to produce the desired nested lists by combining two steps into one. This form might run faster; we'll leave it to the reader to decide whether it is more difficult to understand. As usual, we can test components of this module interactively:

    >>> import rules
    >>> rules.internal('a ?x, b')
    [['a', '?x'], ['b']]
    >>> rules.internal_rule('rule x if a ?x, b then c, d ?x')
    {'if': [['a', '?x'], ['b']], 'rule': 'x', 'then': [['c'], ['d', '?x']]}
    >>> r = rules.internal_rule('rule x if a ?x, b then c, d ?x')
    >>> rules.external_rule(r)
    'rule x if a ?x, b then c, d ?x.'

Parsing by splitting strings around tokens like this takes you only so far. There is no direct support for recursive nesting of components, and syntax errors are not handled very gracefully. But for simple language tasks like this, string splitting might be enough, at least for prototyping systems. You can always add a more robust rule parser later or reimplement rules as embedded Python code or classes.
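The point about syntax errors can be made concrete. A parser that splits and then indexes the result tends to fail with a bare IndexError when a rule string lacks its "if" or "then" part. The sketch below shows one way a slightly more forgiving parser might report such problems. The function name parse_rule and its error messages are hypothetical, and it relies on str.partition, which requires Python 2.5 or later.

```python
def parse_rule(text):
    """Parse 'rule <id> if <tests> then <conclusions>' into a dictionary.

    Hypothetical helper: raises ValueError with a readable message
    rather than an IndexError when a keyword is missing.
    """
    head, sep, rest = text.partition(' if ')
    if not sep:
        raise ValueError("missing ' if ' in rule: %r" % text)
    tests, sep, concls = rest.partition(' then ')
    if not sep:
        raise ValueError("missing ' then ' in rule: %r" % text)
    if not head.startswith('rule '):
        raise ValueError("rule must start with 'rule <id>': %r" % text)
    name = head[len('rule '):].strip()
    if not name:
        raise ValueError("missing rule id: %r" % text)

    def terms(conjunct):
        # split around commas first, then around whitespace
        return [clause.split() for clause in conjunct.split(',')]

    return {'rule': name, 'if': terms(tests), 'then': terms(concls)}
```

This still does not handle nested components; it only turns the most common mistakes into clearer messages, which may be all a prototype needs.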
21.3.5. More on the holmes Expert System Shell

So how are these rules actually used? As mentioned, the rule parser we just met is part of the Python-coded holmes expert system shell. This book does not cover holmes in detail due to lack of space; see the PP3E\Ai\ExpertSystem directory in this book's examples distribution for its code and documentation. But by way of introduction, holmes is an inference engine that performs forward and backward chaining deduction on rules that you supply. For example, the rule:

    rule pylike if ?X likes coding, ?X likes spam then ?X likes Python

can be used both to prove whether someone likes Python (backward, from "then" to "if"), and to deduce that someone likes Python from a set of known facts (forward, from "if" to "then"). Deductions may span multiple rules, and rules that name the same conclusion represent alternatives. holmes also performs simple pattern-matching along the way to assign the variables that appear in rules (e.g., ?X), and it is able to explain its work.

To make all of this more concrete, let's step through a simple holmes session. The += interactive command adds a new rule to the rule base by running the rule parser, and @@ prints the current rule base:

    C:..\PP3E\Ai\ExpertSystem\holmes\holmes>python holmes.py
    -Holmes inference engine-
    holmes> += rule pylike if ?X likes coding, ?X likes spam then ?X likes Python
    holmes> @@
    rule pylike if ?X likes coding, ?X likes spam then ?X likes Python.

Now, to kick off a backward-chaining proof of a goal, use the ?- command. A proof explanation is shown here; holmes can also tell you why it is asking a question. Holmes pattern variables can show up in both rules and queries; in rules, variables provide generalization; in a query, they provide an answer:

    holmes> ?- mel likes Python
    is this true: "mel likes coding" ? y
    is this true: "mel likes spam" ? y
    yes: (no variables)
    show proof ? yes
    "mel likes Python" by rule pylike
    "mel likes coding" by your answer
    "mel likes spam" by your answer
    more solutions? n

    holmes> ?- ann likes ?X
    is this true: "ann likes coding" ? y
    is this true: "ann likes spam" ? y
    yes: ann likes Python

Forward chaining from a set of facts to conclusions is started with a +- command. Here, the same rule is being applied but in a different way:

    holmes> +- chris likes spam, chris likes coding
    I deduced these facts...
    chris likes Python
    I started with these facts...
    chris likes spam
    chris likes coding
    time: 0.0

More interestingly, deductions chain through multiple rules when part of a rule's "if" is mentioned in another rule's "then":

    holmes> += rule 1 if thinks ?x then human ?x
    holmes> += rule 2 if human ?x then mortal ?x
    holmes> ?- mortal bob
    is this true: "thinks bob" ? y
    yes: (no variables)

    holmes> +- thinks bob
    I deduced these facts...
    human bob
    mortal bob
    I started with these facts...
    thinks bob
    time: 0.0

Finally, the @= command is used to load files of rules that implement more sophisticated knowledge bases; the rule parser is run on each rule in the file. Here is a file that encodes animal classification rules (other example files are available in the book's examples distribution, if you'd like to experiment):

    holmes> @= ..\kbases\zoo.kb
    holmes> ?- it is a penguin
    is this true: "has feathers" ? why
    to prove "it is a penguin" by rule 17
    this was part of your original query.
    is this true: "has feathers" ? y
    is this true: "able to fly" ? n
    is this true: "black color" ? y
    yes: (no variables)

Type stop to end a session and help for a full commands list; see the text files in the holmes directories for more details. Holmes is an old system written before Python 1.0 (and around 1993), but it still works unchanged on all platforms.
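To give a feel for what forward chaining over such rules involves, here is a toy sketch. It is not holmes's code or its algorithm: the function names (match, match_all, substitute, forward_chain) are hypothetical, facts are represented as plain lists of words, and rules use the dictionary-with-nested-lists form produced by the rule parser described earlier.

```python
def match(pattern, fact, bindings):
    """Match one term against one fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)                  # copy: don't disturb the caller
    for p, f in zip(pattern, fact):
        if p.startswith('?'):                  # variable: bind it or check it
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:                           # literal word must be equal
            return None
    return bindings

def match_all(tests, facts, bindings):
    """Yield every set of bindings that satisfies all tests."""
    if not tests:
        yield bindings
        return
    for fact in facts:
        extended = match(tests[0], fact, bindings)
        if extended is not None:
            for result in match_all(tests[1:], facts, extended):
                yield result

def substitute(term, bindings):
    """Replace bound variables in a term with their values."""
    return [bindings.get(word, word) for word in term]

def forward_chain(rules, facts):
    """Apply rules repeatedly until no new facts can be deduced."""
    facts = [list(fact) for fact in facts]
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # materialize matches first, since facts grows inside the loop
            for bindings in list(match_all(rule['if'], facts, {})):
                for term in rule['then']:
                    new = substitute(term, bindings)
                    if new not in facts:
                        facts.append(new)
                        changed = True
    return facts

rules = [
    {'rule': '1', 'if': [['thinks', '?x']], 'then': [['human', '?x']]},
    {'rule': '2', 'if': [['human', '?x']], 'then': [['mortal', '?x']]},
]
facts = forward_chain(rules, [['thinks', 'bob']])
print([' '.join(fact) for fact in facts])
# prints ['thinks bob', 'human bob', 'mortal bob']
```

The outer loop repeats until a full pass adds nothing, which is what lets a conclusion of one rule feed the "if" part of another, as in the two-rule session above.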