String Module Utilities | Text and Language

Table of contents:

Lesson 1 Prototype and Migrate

Python's string module includes a variety of text-processing utilities that go above and beyond string expression operators. For instance:

string.find performs substring searches.
string.atoi converts strings to integers.
string.strip removes leading and trailing whitespace.
string.upper converts to uppercase.
string.replace performs substring substitutions.

The Python library manual includes an exhaustive list of available tools. Moreover, as of Python 2.0, Unicode (wide) strings are fully supported by Python string tools, and most of the string module's functions are also now available as string object methods. For instance, in Python 2.0, the following two expressions are equivalent:

string.find(str, substr) # traditional
str.find(substr) # new in 2.0

except that the second form does not require callers to import the string module first. As usual, you should consult the library manuals and Appendix A, for late-breaking news on the string tools front.

In terms of this chapter's main focus, though, Python's built-in tools for splitting and joining strings around tokens turn out to be especially useful when it comes to parsing text:

string.split

Splits a string into substrings, using either whitespace (tabs, spaces, newlines) or an explicitly passed string as a delimiter.

string.join

Concatenates a list or tuple of substrings, adding a space or an explicitly passed separator string between each.

As we saw earlier in this book, split chops a string into a list of substrings, and join puts them back together:[1]

[1] Earlier Python releases had similar tools called spitfields and joinfields; the more modern (and less verbose) split and join are the preferred way to spell these today.

>>> import string
>>> string.split('A B C D')
['A', 'B', 'C', 'D']
>>> string.split('A+B+C+D', '+')
['A', 'B', 'C', 'D']
>>> string.join(['a', 'b', 'c'], '--')
'a--b--c'

Despite their simplicity, they can handle surprisingly complex text-parsing tasks. Moreover, the string module is very fast because it has been migrated to C. For instance, to quickly replace all tabs in a file with four periods, pipe the file into a script that looks like this:

from sys import *
from string import *
stdout.write( join( split(stdin.read( ), '	'), '.'*4) )

The split call here divides input around tabs, and the join puts it back together with periods where tabs had been. The combination of the two calls is equivalent to using the global replacement call in the string module as follows:

stdout.write( replace(stdin.read( ), '	', '.'*4) )

18.3.1 Summing Columns in a File

Let's look at a couple of practical applications of string splits and joins. In many domains, scanning files by columns is a fairly common task. For instance, suppose you have a file containing columns of numbers output by another system, and you need to sum each column's numbers. In Python, string splitting does the job (see Example 18-1). As an added bonus, it's easy to make the solution a reusable tool in Python.

Example 18-1. PP2ELangsummer.py

#!/usr/local/bin/python
import string

def summer(numCols, fileName):
 sums = [0] * numCols # make list of zeros
 for line in open(fileName, 'r').readlines( ): # scan file's lines
 cols = string.split(line) # split up columns
 for i in range(numCols): # around blanks/tabs
 sums[i] = sums[i] + eval(cols[i]) # add numbers to sums
 return sums

if __name__ == '__main__':
 import sys
 print summer(eval(sys.argv[1]), sys.argv[2]) # '% summer.py cols file'

As usual, you can both import this module and call its function, and run it as a shell tool from the command line. The summer calls split to make a list of strings representing the line's columns, and eval to convert column strings to numbers. Here's an input file that uses both blanks and tabs to separate columns:

C:...PP2ELang>type table1.txt
1 5 10 2 1.0
2 10 20 4 2.0
3 15 30 8 3
4 20 40 16 4.0
C:...PP2ELang>python summer.py 5 table1.txt
[10, 50, 100, 30, 10.0]

Notice that because the summer script uses eval to convert file text to numbers, you could really store arbitrary Python expressions in the file. Here, for example, it's run on a file of Python code snippets:

C:...PP2ELang>type table2.txt
2 1+1 1<<1 eval("2")
16 2*2*2*2 pow(2,4) 16.0
3 len('abc') [1,2,3][2] {'spam':3}['spam']

C:...PP2ELang>python summer.py 4 table2.txt
[21, 21, 21, 21.0]

We'll revisit eval later in this chapter when we explore expression evaluators.[2]

[2] Also see the grid examples in Chapter 8, for another example of eval table magic at work. The summer script here is a much simpler version of that chapter's column sum logic.

18.3.2 Parsing and Unparsing Rule Strings

Example 18-2 demonstrates one way that splitting and joining strings can be used to parse sentences in a simple language. It is taken from a rule-based expert system shell (holmes) that is written in Python and included at http://examples.oreilly.com/python2 (see the top-level Ai examples directory). Rule strings in holmes take the form:

"rule  if , ... then , ..."

Tests and conclusions are conjunctions of terms ("," means "and"). Each term is a list of words or variables separated by spaces; variables start with ?. To use a rule, it is translated to an internal form -- a dictionary with nested lists. To display a rule, it is translated back to the string form. For instance, given a call:

rules.internal_rule('rule x if a ?x, b then c, d ?x')

the conversion in function internal_rule proceeds as follows:

string = 'rule x if a ?x, b then c, d ?x'
i = ['rule x', 'a ?x, b then c, d ?x']
t = ['a ?x, b', 'c, d ?x']
r = ['', 'x'] 
result = {'rule':'x', 'if':[['a','?x'], ['b']], 'then':[['c'], ['d','?x']]}

It first splits around the if, then around the then, and finally around rule. The result is the three substrings that were separated by the keywords. Test and conclusion substrings are split around "," and spaces last. join is used to convert back (unparse) to the original string for display. Example 18-2 is the concrete implementation of this scheme.

Example 18-2. PP2ELang ules.py

from string import split, join, strip

def internal_rule(string): 
 i = split(string, ' if ') 
 t = split(i[1], ' then ') 
 r = split(i[0], 'rule ') 
 return {'rule':strip(r[1]), 'if':internal(t[0]), 'then':internal(t[1])}

def external_rule(rule):
 return ('rule ' + rule['rule'] + 
 ' if ' + external(rule['if']) + 
 ' then ' + external(rule['then']) + '.')

def internal(conjunct):
 res = [] # 'a b, c d'
 for clause in split(conjunct, ','): # -> ['a b', ' c d']
 res.append(split(clause)) # -> [['a','b'], ['c','d']]
 return res

def external(conjunct): 
 strs = []
 for clause in conjunct: # [['a','b'], ['c','d']] 
 strs.append(join(clause)) # -> ['a b', 'c d'] 
 return join(strs, ', ') # -> 'a b, c d'

As usual, we can test components of this module interactively:

>>> import rules
>>> rules.internal('a ?x, b')
[['a', '?x'], ['b']]

>>> rules.internal_rule('rule x if a ?x, b then c, d ?x')
{'if': [['a', '?x'], ['b']], 'rule': 'x', 'then': [['c'], ['d', '?x']]}

>>> r = rules.internal_rule('rule x if a ?x, b then c, d ?x')
>>> rules.external_rule(r)
'rule x if a ?x, b then c, d ?x.'

Parsing by splitting strings around tokens like this only takes you so far: there is no direct support for recursive nesting of components, and syntax errors are not handled very gracefully. But for simple language tasks like this, string splitting might be enough, at least for prototyping systems. You can always add a more robust rule parser later or reimplement rules as embedded Python code or classes .

Lesson 1 Prototype and Migrate

As a rule of thumb, use the string module's functions instead of things like regular expressions whenever you can. They tend to be much faster, because they've been moved to a C language implementation. When you import string, it internally replaces most of its content with functions imported from the strop C extension module; strop methods are reportedly 100-1000 times faster than their Python-coded equivalents. [a]

The string module was originally written in Python but demands for string efficiency prompted recoding it in C. The result was dramatically faster performance for string client programs without impacting the interface. That is, string module clients became instantly faster without having to be modified for the new C-based module. A similar migration was applied to the pickle module we met in Chapter 16 -- the newer cPickle recoding is compatible but much faster.

Which is a great lesson about Python development: modules can be coded quickly in Python at first, and translated to C later for efficiency if required. Because the interface to Python and C extension modules is identical (both are imported), C translations of modules are backward compatible with their Python prototypes. The only impact of the translation of such modules on clients is an improvement in performance.

There is usually no need to move every module to C for delivery of an application: you can pick and choose performance-critical modules (like string and pickle) for translation, and leave others coded in Python. Use the timing and profiling techniques of the prior chapter to isolate which modules will give the most improvement when translated to C. C-based extension modules are introduced in the next part of this book.

Actually, in Python 2.0, the string module has changed its implementation again: it is now a frontend to new string methods, which are able to also handle Unicode strings. As mentioned, most string functions are also available as object methods in 2.0. For instance, string.split(X) is now simply a synonym for X.split( ); both forms are supported, but the latter may become prevalent over time. Either way, clients of the original string module are not affected by this change -- yet another lesson!

[a] Actually, in Python 2.0, the string module has changed its implementation again: it is now a frontend to new strng methods, which are able to also handle Unicode strings. As mentioned, most string functions are also available as object methods in 2.0. For instance, string.split (X) is now simply a synonym for X.split( ); both forms are supported, but the latter may become prevalent over time. Either way, clients of the original string module are not affected by this changeyet another lesson!

18.3.2.1 More on the holmes expert system shell

So how are these rules actually used? As mentioned, the rule parser we just met is part of the Python-coded holmes expert system shell. This book does not cover holmes in detail due to lack of space; see the PP2EAIExpertSystem directory on the book CD (see http://examples.oreilly.com/python2) for its code and documentation. But by way of introduction, holmes is an inference engine that performs forward and backward chaining deduction on rules that you supply. For example, a rule:

rule pylike if ?X likes coding, ?X likes spam then ?X likes Python

can be used both to prove whether someone likes Python (backward, from "then" to "if"), and to deduce that someone likes Python from a set of known facts (forward, from "if" to "then"). Deductions may span multiple rules, and rules that name the same conclusion represent alternatives. Holmes also performs simple pattern-matching along the way to assign the variables that appear in rules (e.g., ?X), and is able to explain its work.

To make this all more concrete, let's step through a simple holmes session. The += interactive command adds a new rule to the rule base by running the rule parser, and @@ prints the current rule base:

C:..PP2EAiExpertSystemholmesholmes>python holmes.py
-Holmes inference engine-
holmes> += rule pylike if ?X likes coding, ?X likes spam then ?X likes Python
holmes> @@
rule pylike if ?X likes coding, ?X likes spam then ?X likes Python.

Now, to kick off a backward-chaining proof of a goal, use the ?- command. A proof explanation is shown here; holmes can also tell you why it is asking a question. Holmes pattern variables can show up in both rules and queries; in rules, variables provide generalization; in a query, they provide an answer:

holmes> ?- mel likes Python
is this true: "mel likes coding" ? y
is this true: "mel likes spam" ? y
yes: (no variables)

show proof ? yes
 "mel likes Python" by rule pylike
 "mel likes coding" by your answer
 "mel likes spam" by your answer
more solutions? n

holmes> ?- linda likes ?X
is this true: "linda likes coding" ? y
is this true: "linda likes spam" ? y
yes: linda likes Python

Forward-chaining from a set of facts to conclusions is started with a +- command. Here, the same rule is being applied but in a different way:

holmes> +- chris likes spam, chris likes coding
I deduced these facts...
 chris likes Python
I started with these facts...
 chris likes spam
 chris likes coding
time: 0.0

More interestingly, deductions chain through multiple rules when part of a rule's "if" is mentioned in another rule's "then":

holmes> += rule 1 if thinks ?x then human ?x
holmes> += rule 2 if human ?x then mortal ?x
holmes> ?- mortal bob
is this true: "thinks bob" ? y
yes: (no variables)

holmes> +- thinks bob
I deduced these facts...
 human bob
 mortal bob
I started with these facts...
 thinks bob
time: 0.0

Finally, the @= command is used to load files of rules that implement more sophisticated knowledgebases; the rule parser is run on each rule in the file. Here is a file that encodes animal classification rules; other example files are available on the CD (see http://examples.oreilly.com/python2) if you'd like to experiment:

holmes> @= ..¸baseszoo.kb
holmes> ?- it is a penguin
is this true: "has feathers" ? why
to prove "it is a penguin" by rule 17
this was part of your original query.
is this true: "has feathers" ? y
is this true: "able to fly" ? n
is this true: "black color" ? y
yes: (no variables)

Type "stop" to end a session and "help" for a full commands list, and see the text files in the holmes directories for more details. Holmes is an old system written before Python 1.0 (and around 1993), but still works unchanged on all platforms under Python 1.5.2.

Introducing Python

Part I: System Interfaces