21.3. String Method Utilities

Python's string methods include a variety of text-processing utilities that go above and beyond string expression operators. For instance, given an instance str of the built-in string object type:

str.find(substr): Performs substring searches
str.replace(old, new): Performs substring substitutions
str.split(delim): Chops up a string around delimiters
str.join(seq): Puts substrings together with delimiters between
str.strip( ): Removes leading and trailing whitespace
str.rstrip( ): Removes trailing whitespace only, if any
str.rjust(width): Right-justifies a string in a fixed-width field
str.upper( ): Converts to uppercase
str.isupper( ): Tests whether the string is uppercase
str.isdigit( ): Tests whether the string is all digit characters
str.endswith(substr): Tests for a substring at the end
str.startswith(substr): Tests for a substring at the front

This list is representative but partial, and some of these methods take additional optional arguments. For the full list of string methods, run a dir(str) call at the Python interactive prompt and run help(str.method) on any method for some quick documentation. The Python library manual also includes an exhaustive list.
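As a quick illustration of the calls listed above, here is a minimal sketch. The sample strings are invented for this example rather than taken from the book's examples, and the code is written to run on current Python 3 releases.

```python
# Sample data is illustrative only.
text = '  spam, eggs, ham  '

stripped = text.strip()                        # 'spam, eggs, ham'
right_only = text.rstrip()                     # '  spam, eggs, ham'
position = stripped.find('eggs')               # 6 (offset of match; -1 if absent)
swapped = stripped.replace('eggs', 'toast')    # 'spam, toast, ham'
parts = stripped.split(', ')                   # ['spam', 'eggs', 'ham']
rejoined = '/'.join(parts)                     # 'spam/eggs/ham'
padded = 'ham'.rjust(6)                        # '   ham'
shouted = stripped.upper()                     # 'SPAM, EGGS, HAM'

checks = (shouted.isupper(),                   # True
          '2024'.isdigit(),                    # True
          stripped.startswith('spam'),         # True
          stripped.endswith('ham'))            # True
```

Note that none of these calls changes the original string; each returns a new object, because strings are immutable.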

Moreover, in Python today, Unicode (wide) strings fully support all normal string methods, and most of the older string module's functions are also now available as string object methods. For instance, in Python 2.0 and later, the following two expressions are equivalent:

    string.find(aString, substr)      # original module
    aString.find(substr)              # methods new in 2.0

However, the second form does not require callers to import the string module first.

As of this third edition of the book, the method call form is used everywhere, since it has been the recommended best-practice pattern for some time. If you see older code based on the module call pattern, it is a simple mapping to the newer method-based call form. The original string module still contains predefined constants (e.g., string.uppercase), as well as the new Template substitution interface in 2.4, and so remains useful in some contexts apart from method calls.

21.3.1. Templating with Replacements and Formats

Speaking of templates, as we saw when coding the web page migration scripts in Part II of this book, the string replace method is often adequate as a string templating tool. We can compute values and insert them at fixed positions in a string with simple replace calls:

    >>> template = '---$target1---$target2---'
    >>> val1 = 'Spam'
    >>> val2 = 'shrubbery'
    >>> template = template.replace('$target1', val1)
    >>> template = template.replace('$target2', val2)
    >>> template
    '---Spam---shrubbery---'

As we also saw when generating HTML code in our Common Gateway Interface (CGI) scripts in Part III of this book, the string % formatting operator is also a powerful templating tool. Simply fill out a dictionary with values and apply substitutions to the HTML string all at once:

    >>> template = """
    ... ---
    ... ---%(key1)s---
    ... ---%(key2)s---
    ... """
    >>>
    >>> vals = {}
    >>> vals['key1'] = 'Spam'
    >>> vals['key2'] = 'shrubbery'
    >>> print template % vals

    ---
    ---Spam---
    ---shrubbery---

The 2.4 string module's Template feature is essentially a simplified variation of the dictionary-based format scheme, but it allows some additional call patterns:

    >>> vals
    {'key2': 'shrubbery', 'key1': 'Spam'}
    >>> import string
    >>> template = string.Template('---$key1---$key2---')
    >>> template.substitute(vals)
    '---Spam---shrubbery---'
    >>> template.substitute(key1='Brian', key2='Loretta')
    '---Brian---Loretta---'

See the library manual for more on this extension. Although the string datatype does not itself support the pattern-directed text processing that we'll meet later in this chapter, its tools are powerful enough for many tasks.
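Among the additional call patterns Template offers, two are worth a brief sketch: safe_substitute, which leaves unknown placeholders in place instead of raising an exception, and the combination of a mapping with keyword arguments. The template text and values below are invented for illustration, and the code is written for current Python 3 releases.

```python
import string

template = string.Template('---$key1---$key2---')

# substitute insists on a value for every placeholder
try:
    template.substitute(key1='Spam')
    missing = None
except KeyError as exc:
    missing = exc.args[0]            # name of the placeholder lacking a value

# safe_substitute fills in what it can and leaves the rest untouched
partial = template.safe_substitute(key1='Spam')

# when a mapping and keywords overlap, the keyword arguments win
mixed = template.substitute({'key1': 'Spam', 'key2': 'shrubbery'}, key2='Ni')
```

Here partial keeps the literal $key2 text, which can be handy when a template is filled in over several passes.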

21.3.2. Parsing with Splits and Joins

In terms of this chapter's main focus, Python's built-in tools for splitting and joining strings around tokens turn out to be especially useful when it comes to parsing text:

str.split(delimiter?, maxsplits?): Splits a string into substrings, using either whitespace substrings (tabs, spaces, newlines) or an explicitly passed string as a delimiter. maxsplits limits the number of splits performed, if passed.
delimiter.join(sequence): Concatenates a sequence of substrings (e.g., list or tuple), adding the subject separator string between each.

These two are among the most powerful of string methods. As we saw earlier in Chapter 3, split chops a string into a list of substrings and join puts them back together:^[*]

^[*] Very early Python releases had similar tools called splitfields and joinfields; the more modern (and less verbose) split and join are the preferred way to spell these today.

    >>> 'A B C D'.split()
    ['A', 'B', 'C', 'D']
    >>> 'A+B+C+D'.split('+')
    ['A', 'B', 'C', 'D']
    >>> '--'.join(['a', 'b', 'c'])
    'a--b--c'
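The optional arguments of split change its behavior in ways that matter for parsing. The following sketch, using made-up sample strings and written for current Python 3 releases, contrasts whitespace splitting with explicit delimiters and shows the effect of a maximum split count.

```python
line = 'spam  eggs\tham\n'

# No argument: runs of any whitespace act as one delimiter, ends are ignored
words = line.split()                 # ['spam', 'eggs', 'ham']

# Explicit delimiter: every occurrence splits, so empty strings can appear
fields = 'a,,b'.split(',')           # ['a', '', 'b']

# A maximum split count leaves the remainder intact in the last item
key, value = 'path=/usr/bin:/bin'.split('=', 1)
head = 'a b c d'.split(None, 2)      # ['a', 'b', 'c d']

# join restores the text when the same explicit delimiter is used
restored = ','.join(fields)          # 'a,,b'
```

Passing None as the delimiter, as in the head example, requests the default whitespace behavior while still allowing a split limit.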

Despite their simplicity, they can handle surprisingly complex text-parsing tasks. Moreover, string method calls are very fast because they are implemented in C language code. For instance, to quickly replace all tabs in a file with four periods, pipe the file into a script that looks like this:

    from sys import *
    stdout.write(('.' * 4).join(stdin.read().split('\t')))

The split call here divides input around tabs, and the join puts it back together with periods where tabs had been. The combination of the two calls is equivalent to using the global replacement string method call as follows:

 stdout.write( stdin.read( ).replace('\t', '.'*4) )
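To check the equivalence without piping a file through a script, the two forms can be wrapped in functions and compared on a string. This is only a sketch: the function names and the sample text are hypothetical, and the code targets current Python 3 releases.

```python
def tabs_to_dots_split(text, width=4):
    # split around tabs, then rejoin with a run of periods
    return ('.' * width).join(text.split('\t'))

def tabs_to_dots_replace(text, width=4):
    # one global replacement call
    return text.replace('\t', '.' * width)

sample = 'a\tb\t\tc\n\tindented\n'
via_split = tabs_to_dots_split(sample)
via_replace = tabs_to_dots_replace(sample)
```

Both results are 'a....b........c\n....indented\n'; adjacent tabs produce an empty substring in the split list, which join then surrounds with two separator strings.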

As we'll see in the next sections, splitting strings is sufficient for many text-parsing goals.

21.3.3. Summing Columns in a File

Let's look at a couple of practical applications of string splits and joins. In many domains, scanning files by columns is a fairly common task. For instance, suppose you have a file containing columns of numbers output by another system, and you need to sum each column's numbers. In Python, string splitting does the job, as demonstrated by Example 21-1. As an added bonus, it's easy to make the solution a reusable tool in Python.

Example 21-1. PP3E\Lang\summer.py

    #!/usr/local/bin/python

    def summer(numCols, fileName):
        sums = [0] * numCols                             # make list of zeros
        for line in open(fileName):                      # scan file's lines
            cols = line.split()                          # split up columns
            for i in range(numCols):                     # around blanks/tabs
                sums[i] += eval(cols[i])                 # add numbers to sums
        return sums

    if __name__ == '__main__':
        import sys
        print summer(eval(sys.argv[1]), sys.argv[2])     # '% summer.py cols file'

Notice that we use file iterators here to read line by line, instead of calling the file readlines method explicitly (recall from Chapter 4 that iterators avoid loading the entire file into memory all at once).

As usual, you can both import this module and call its function, and run it as a shell tool from the command line. The summer.py script calls split to make a list of strings representing the line's columns, and eval to convert column strings to numbers. Here's an input file that uses both blanks and tabs to separate columns:

    C:\...\PP3E\Lang>type table1.txt
    1       5       10    2   1.0
    2       10      20    4   2.0
    3       15      30    8    3
    4       20      40   16   4.0

    C:\...\PP3E\Lang>python summer.py 5 table1.txt
    [10, 50, 100, 30, 10.0]

Also notice that because the summer script uses eval to convert file text to numbers, you could really store arbitrary Python expressions in the file. Here, for example, it's run on a file of Python code snippets:

    C:\...\PP3E\Lang>type table2.txt
    2     1+1          1<<1           eval("2")
    16    2*2*2*2      pow(2,4)       16.0
    3     len('abc')   [1,2,3][2]     {'spam':3}['spam']

    C:\...\PP3E\Lang>python summer.py 4 table2.txt
    [21, 21, 21, 21.0]

We'll revisit eval later in this chapter, when we explore expression evaluators. Sometimes this generality is more than we want: if we can't be sure that the strings that we run this way won't contain malicious code, for instance, it may be necessary to run them with limited machine access or use more restrictive conversion tools. Consider the following recoding of the summer function:

    def summer(numCols, fileName):
        sums = [0] * numCols
        for line in open(fileName):                     # use file iterators
            cols = line.split(',')                      # assume comma-delimited
            nums = [int(x) for x in cols]               # use limited converter
            both = zip(sums, nums)                      # avoid nested for loop
            sums = [x + y for (x, y) in both]
        return sums

This version uses int for its conversions from strings to support only numbers, and not arbitrary and possibly unsafe expressions. Although the first four lines of this coding are similar to the original, for variety this version also assumes the data is separated by commas rather than whitespace and runs list comprehensions and zip to avoid the nested for loop statement. This version is also substantially trickier than the original and so might be less desirable from a maintenance perspective. If its code is confusing, try adding print statements after each step to trace the results of each operation.
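The tracing suggestion can be sketched by running the loop body once on a single line and displaying each intermediate result. The input line here is invented, and the code is written for current Python 3 releases, where zip returns an iterator and is therefore wrapped in list for display.

```python
sums = [0, 0, 0]                         # running totals before this line
line = '1,5,10\n'                        # one comma-delimited input line

cols = line.split(',')                   # ['1', '5', '10\n']
nums = [int(x) for x in cols]            # [1, 5, 10]; int tolerates the '\n'
both = list(zip(sums, nums))             # [(0, 1), (0, 5), (0, 10)]
sums = [x + y for (x, y) in both]        # [1, 5, 10]

# show what each step produced
for name, value in [('cols', cols), ('nums', nums),
                    ('both', both), ('sums', sums)]:
    print('%s => %r' % (name, value))
```

One behavior worth knowing: zip stops at the shorter of its arguments, so a row with fewer columns than expected would silently shorten sums rather than raise an error.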

For related examples, see the grid examples in Chapter 10 for another case of eval table magic at work. The summer script here is a much simpler version of that chapter's column sum logic. To remove the need to pass in a number-columns value, see also the more advanced floating-point column summer example in the "Other Uses for Dictionaries" sidebar in Chapter 2; it works by making the column number a key rather than an offset.
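The idea of keying sums by column number can be sketched as follows. This is not the sidebar's code; the function name summer_by_key, its delimiter parameter, and the sample rows are assumptions made for illustration, and the code targets current Python 3 releases.

```python
def summer_by_key(lines, delimiter=None):
    """Sum columns without knowing their count in advance (hypothetical helper)."""
    sums = {}                                         # column number -> total
    for line in lines:
        for colnum, text in enumerate(line.split(delimiter)):
            sums[colnum] = sums.get(colnum, 0) + float(text)
    return [sums[colnum] for colnum in sorted(sums)]  # totals in column order

rows = ['1 5 10', '2 10 20 4', '3 15']                # rows of differing widths
totals = summer_by_key(rows)                          # [6.0, 30.0, 30.0, 4.0]
```

Because the dictionary grows on demand, rows of different lengths are handled naturally. Any iterable of lines works as the first argument, including an open file object.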

21.3.4. Parsing and Unparsing Rule Strings

Splitting comes in handy for dividing text into columns, but it can also be used as a more general parsing tool: by splitting more than once on different delimiters, we can pick apart more complex text. Although such parsing can also be achieved with more powerful tools such as the regular expressions we'll meet later in this chapter, split-based parsing is simpler to code, and may run quicker.

For instance, Example 21-2 demonstrates one way that splitting and joining strings can be used to parse sentences in a simple language. It is taken from a rule-based expert system shell (holmes) that is written in Python and included in this book's examples distribution (see the top-level Ai examples directory). Rule strings in holmes take the form:

 "rule <id> if <test1>, <test2>... then <conclusion1>, <conclusion2>..."

Tests and conclusions are conjunctions of terms ("," means "and"). Each term is a list of words or variables separated by spaces; variables start with ?. To use a rule, it is translated to an internal form: a dictionary with nested lists. To display a rule, it is translated back to the string form. For instance, given the call:

 rules.internal_rule('rule x if a ?x, b then c, d ?x')

the conversion in function internal_rule proceeds as follows:

    string = 'rule x if a ?x, b then c, d ?x'
    i = ['rule x', 'a ?x, b then c, d ?x']
    t = ['a ?x, b', 'c, d ?x']
    r = ['', 'x']
    result = {'rule':'x', 'if':[['a','?x'], ['b']], 'then':[['c'], ['d','?x']]}

We first split around the if, then around the then, and finally around rule. The result is the three substrings that were separated by the keywords. Test and conclusion substrings are split around "," first and spaces last. join is used to convert back (unparse) to the original string for display. Example 21-2 is the concrete implementation of this scheme.

Example 21-2. PP3E\Lang\rules.py

    def internal_rule(string):
        i = string.split(' if ')
        t = i[1].split(' then ')
        r = i[0].split('rule ')
        return {'rule': r[1].strip(), 'if': internal(t[0]), 'then': internal(t[1])}

    def external_rule(rule):
        return ('rule '    + rule['rule']           +
                ' if '     + external(rule['if'])   +
                ' then '   + external(rule['then']) + '.')

    def internal(conjunct):
        res = []                                    # 'a b, c d'
        for clause in conjunct.split(','):          # -> ['a b', ' c d']
            res.append(clause.split())              # -> [['a','b'], ['c','d']]
        return res

    def external(conjunct):
        strs = []
        for clause in conjunct:                     # [['a','b'], ['c','d']]
            strs.append(' '.join(clause))           # -> ['a b', 'c d']
        return ', '.join(strs)                      # -> 'a b, c d'

Notice that we could use newer list comprehensions to gain some conciseness here. The internal function, for instance, could be recoded to simply:

 return [clause.split( ) for clause in conjunct.split(',')]

to produce the desired nested lists by combining two steps into one. This form might run faster; we'll leave it to the reader to decide whether it is more difficult to understand. As usual, we can test components of this module interactively:

    >>> import rules
    >>> rules.internal('a ?x, b')
    [['a', '?x'], ['b']]
    >>> rules.internal_rule('rule x if a ?x, b then c, d ?x')
    {'if': [['a', '?x'], ['b']], 'rule': 'x', 'then': [['c'], ['d', '?x']]}
    >>> r = rules.internal_rule('rule x if a ?x, b then c, d ?x')
    >>> rules.external_rule(r)
    'rule x if a ?x, b then c, d ?x.'

Parsing by splitting strings around tokens like this takes you only so far. There is no direct support for recursive nesting of components, and syntax errors are not handled very gracefully. But for simple language tasks like this, string splitting might be enough, at least for prototyping systems. You can always add a more robust rule parser later or reimplement rules as embedded Python code or classes.
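As one example of what a slightly more robust parser might look like, the following sketch reports malformed rules with ValueError instead of failing with an obscure IndexError. It is not part of holmes: the name checked_internal_rule and its error messages are hypothetical, it uses the partition string method (available in Python 2.5 and later), and it is written for current Python 3 releases.

```python
def internal(conjunct):
    # 'a b, c d' -> [['a', 'b'], ['c', 'd']]
    return [clause.split() for clause in conjunct.split(',')]

def checked_internal_rule(string):
    """Parse a rule string like internal_rule, but validate its keywords."""
    head, sep, rest = string.partition(' if ')         # split at first ' if '
    if not sep:
        raise ValueError("missing ' if ' in rule: %r" % string)
    tests, sep, conclusions = rest.partition(' then ')
    if not sep:
        raise ValueError("missing ' then ' in rule: %r" % string)
    if not head.startswith('rule '):
        raise ValueError("rule must start with 'rule <id>': %r" % string)
    rule_id = head[len('rule '):].strip()
    if not rule_id:
        raise ValueError("empty rule id: %r" % string)
    return {'rule': rule_id,
            'if': internal(tests),
            'then': internal(conclusions)}

parsed = checked_internal_rule('rule x if a ?x, b then c, d ?x')
```

For well-formed input the result matches the dictionary built by the original function; the difference shows up only when a keyword is missing or the rule id is empty.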

Lesson 1: Prototype and Migrate

As a rule of thumb, use the string object's methods rather than things such as regular expressions whenever you can. Although this can vary from release to release, some string methods may be faster because they have less work to do.
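Rather than take relative speed on faith, it can be measured with the standard timeit module. The sketch below compares a string method with an equivalent regular expression substitution on invented sample text; the function names are hypothetical, the timings will vary by machine and Python release, and no particular outcome is claimed here.

```python
import re
import timeit

text = 'spam\teggs\tham\n' * 1000        # illustrative test data
tab = re.compile('\t')                   # precompiled pattern for fairness

def with_method():
    return text.replace('\t', '....')

def with_regex():
    return tab.sub('....', text)

# both approaches must agree before their timings mean anything
assert with_method() == with_regex()

method_time = timeit.timeit(with_method, number=200)
regex_time = timeit.timeit(with_regex, number=200)
print('str.replace: %.4f sec' % method_time)
print('re.sub:      %.4f sec' % regex_time)
```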

In fact, the original implementation of these operations in the string module became substantially faster when they were moved to the C language. When you imported string, it internally replaced most of its content with functions imported from the strop C extension module; strop methods were reportedly 100 to 1,000 times faster than their Python-coded equivalents at the time (though Python has been heavily optimized since then).

The string module was originally written in Python but demands for string efficiency prompted recoding it in C. The result was dramatically faster performance for string client programs without impacting the interface. That is, string module clients became instantly faster without having to be modified for the new C-based module. A similar migration was applied to the pickle module we met in Chapter 19: the later cPickle recoding is compatible but much faster.

This is a great lesson about Python development: modules can be coded quickly in Python at first and translated to C later for efficiency if required. Because the interface to Python and C extension modules is identical (both are imported), C translations of modules are backward compatible with their Python prototypes. The only impact of the translation of such modules on clients is an improvement in performance.

There is usually no need to move every module to C for delivery of an application: you can pick and choose performance-critical modules (such as string and pickle) for translation and leave others coded in Python. Use the timing and profiling techniques discussed in Chapter 20 to isolate which modules will give the most improvement when translated to C. C-based extension modules are introduced in Part VI of this book.

Actually, in Python 2.0, the string module changed its implementation again: it is now a frontend to new string methods, which are able to also handle Unicode strings. As mentioned, most string functions are also available as object methods in 2.0. For instance, string.split(X) is now simply a synonym for X.split( ); both forms are still supported, but the latter is more prevalent and preferred today (and may be the only option in the future). Either way, clients of the original string module are not affected by this change: yet another lesson!

21.3.5. More on the holmes Expert System Shell

So how are these rules actually used? As mentioned, the rule parser we just met is part of the Python-coded holmes expert system shell. This book does not cover holmes in detail due to lack of space; see the PP3E\AI\ExpertSystem directory in this book's examples distribution for its code and documentation. But by way of introduction, holmes is an inference engine that performs forward and backward chaining deduction on rules that you supply. For example, the rule:

 rule pylike if ?X likes coding, ?X likes spam then ?X likes Python

can be used both to prove whether someone likes Python (backward, from "then" to "if"), and to deduce that someone likes Python from a set of known facts (forward, from "if" to "then"). Deductions may span multiple rules, and rules that name the same conclusion represent alternatives. holmes also performs simple pattern-matching along the way to assign the variables that appear in rules (e.g., ?X), and it is able to explain its work.

To make all of this more concrete, let's step through a simple holmes session. The += interactive command adds a new rule to the rule base by running the rule parser, and @@ prints the current rule base:

    C:..\PP3E\Ai\ExpertSystem\holmes\holmes>python holmes.py
    -Holmes inference engine-
    holmes> += rule pylike if ?X likes coding, ?X likes spam then ?X likes Python
    holmes> @@
    rule pylike if ?X likes coding, ?X likes spam then ?X likes Python.

Now, to kick off a backward-chaining proof of a goal, use the ?- command. A proof explanation is shown here; holmes can also tell you why it is asking a question. Holmes pattern variables can show up in both rules and queries; in rules, variables provide generalization; in a query, they provide an answer:

    holmes> ?- mel likes Python
    is this true: "mel likes coding" ? y
    is this true: "mel likes spam" ? y
    yes: (no variables)
    show proof ? yes
      "mel likes Python" by rule pylike
          "mel likes coding" by your answer
          "mel likes spam" by your answer
    more solutions? n
    holmes> ?- ann likes ?X
    is this true: "ann likes coding" ? y
    is this true: "ann likes spam" ? y
    yes: ann likes Python

Forward chaining from a set of facts to conclusions is started with a +- command. Here, the same rule is being applied but in a different way:

    holmes> +- chris likes spam, chris likes coding
    I deduced these facts...
        chris likes Python
    I started with these facts...
        chris likes spam
        chris likes coding
    time: 0.0

More interestingly, deductions chain through multiple rules when part of a rule's "if" is mentioned in another rule's "then":

    holmes> += rule 1 if thinks ?x then human ?x
    holmes> += rule 2 if human ?x then mortal ?x
    holmes> ?- mortal bob
    is this true: "thinks bob" ? y
    yes: (no variables)
    holmes> +- thinks bob
    I deduced these facts...
        human bob
        mortal bob
    I started with these facts...
        thinks bob
    time: 0.0
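To give a feel for how forward chaining over such rules might work, here is a deliberately tiny sketch that operates on the dictionary form produced by the rule parser. It is not the holmes implementation, whose code is not shown in this section: the functions match, match_all, and forward_chain are hypothetical, facts are represented as tuples of words by assumption, and the code targets current Python 3 releases.

```python
def match(pattern, fact, bindings):
    """Match one term against one fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)                 # don't mutate the caller's copy
    for pat, word in zip(pattern, fact):
        if pat.startswith('?'):               # variable: bind or check binding
            if bindings.setdefault(pat, word) != word:
                return None
        elif pat != word:                     # literal words must be equal
            return None
    return bindings

def match_all(patterns, facts, bindings):
    """Yield every set of bindings that satisfies all patterns."""
    if not patterns:
        yield bindings
        return
    for fact in facts:
        extended = match(patterns[0], fact, bindings)
        if extended is not None:
            for result in match_all(patterns[1:], facts, extended):
                yield result

def forward_chain(rules, facts):
    """Apply rules until no new facts appear; return deduced facts in order."""
    known = [tuple(fact) for fact in facts]
    deduced = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # collect matches first, since known grows inside the loop
            for bindings in list(match_all(rule['if'], known, {})):
                for term in rule['then']:
                    new = tuple(bindings.get(word, word) for word in term)
                    if new not in known:
                        known.append(new)
                        deduced.append(new)
                        changed = True
    return deduced

rules = [
    {'rule': '1', 'if': [['thinks', '?x']], 'then': [['human', '?x']]},
    {'rule': '2', 'if': [['human', '?x']], 'then': [['mortal', '?x']]},
]
deduced = forward_chain(rules, [['thinks', 'bob']])
```

With these two rules and the single starting fact, deduced holds ('human', 'bob') followed by ('mortal', 'bob'), mirroring the session's chained deduction. A real engine would also need backward chaining, user queries, and proof explanations, all of which this sketch omits.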

Finally, the @= command is used to load files of rules that implement more sophisticated knowledge bases; the rule parser is run on each rule in the file. Here is a file that encodes animal classification rules (other example files are available in the book's examples distribution, if you'd like to experiment):

    holmes> @= ..\kbases\zoo.kb
    holmes> ?- it is a penguin
    is this true: "has feathers" ? why
    to prove "it is a penguin" by rule 17
    this was part of your original query.
    is this true: "has feathers" ? y
    is this true: "able to fly" ? n
    is this true: "black color" ? y
    yes: (no variables)

Type stop to end a session and help for a full commands list; see the text files in the holmes directories for more details. Holmes is an old system, written around 1993 and before Python 1.0, but it still works unchanged on all platforms.