21.4. Regular Expression Pattern Matching
Splitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks, Python provides regular expression matching utilities. Regular expressions are simply strings that define patterns to be matched against other strings. You supply a pattern and a string and ask whether the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, a match not only gives a yes/no answer, but can also pick out substrings.
Regular expression pattern strings can be complicated (let's be honest: they can be downright gross to look at). But once you get the hang of them, they can replace larger handcoded string search routines; a single pattern string generally does the work of dozens of lines of manual string scanning code and may run much faster. They are a concise way to encode the expected structure of text and extract portions of it.
In Python, regular expressions are not part of the syntax of the Python language itself, but they are supported by extension modules that you must import to use. The modules define functions for compiling pattern strings into pattern objects, matching these objects against strings and fetching matched substrings after a match. They also provide tools for pattern-based splitting, replacing, and so on.
Beyond those generalities, Python's regular expression story is complicated a little by history:
When it was first made available, re was generally slower than regex, so you had to choose between speed and Perl-like regular expression syntax. Today, though, re has been optimized to the extent that regex no longer offers any clear advantages. Moreover, re supports a richer pattern syntax and matching of Unicode strings (strings with 16-bit-wide or wider characters for representing large character sets).
Because of this migration, I've recoded regular expression examples in this text to use the new re module rather than regex. The old regex-based versions are still available in the book's examples distribution in the directory PP3E\lang\old-regex. If you find yourself having to migrate old regex code, you can also find a document describing the translation steps needed at http://www.python.org. Both modules' interfaces are similar, but re introduces a match object and changes pattern syntax in minor ways.
Having said that, I also want to warn you that regular expressions are a complex topic that cannot be covered in depth here; in particular, we won't be able to go far into pattern construction. If this area sparks your interest, the text Mastering Regular Expressions, written by Jeffrey E. F. Friedl (O'Reilly), is a good next step to take.
Once you learn how to code patterns, though, the top-level interface for performing matches is straightforward. In fact, it is so easy to use that we'll jump right into an example before getting into more details.
21.4.1. First Examples
There are two basic ways to kick off matches: through top-level function calls and via methods of precompiled pattern objects. The latter precompiled form is quicker if you will be applying the same pattern more than once (to all lines in a text file, for instance). To demonstrate, let's do some matching on the following strings:
>>> text1 = 'Hello spam...World'
>>> text2 = 'Hello spam...other'
The match performed in the following code does not precompile: it executes an immediate match to look for all the characters between the words Hello and World in our text strings:
>>> import re
>>> matchobj = re.match('Hello(.*)World', text2)
>>> print matchobj
None
When a match fails as it does here (the text2 string doesn't end in World), we get back the None object, which is Boolean false if tested in an if statement.
In the pattern string we're using here (the first argument to re.match), the words Hello and World match themselves, and (.*) means any character (.) repeated zero or more times (*). The fact that it is enclosed in parentheses tells Python to save away the part of the string matched by that part of the pattern as a group (a matched substring). To see how, we need to make a match work:
>>> matchobj = re.match('Hello(.*)World', text1)
>>> print matchobj
<_sre.SRE_Match object at 0x009D6520>
>>> matchobj.group(1)
' spam...'
When a match succeeds, we get back a match object, which has interfaces for extracting matched substrings: the group(1) call returns the portion of the string matched by the first, leftmost, parenthesized portion of the pattern (our (.*)). In other words, matching is not just a yes/no answer (as already mentioned); by enclosing parts of the pattern in parentheses, it is also a way to extract matched substrings.
The interface for precompiling is similar, but the pattern is implied in the pattern object we get back from the compile call:
>>> pattobj = re.compile('Hello(.*)World')
>>> matchobj = pattobj.match(text1)
>>> matchobj.group(1)
' spam...'
Again, you should precompile for speed if you will run the pattern multiple times. Here's something a bit more complex that hints at the generality of patterns. This one allows for zero or more blanks or tabs at the front ([ \t]*), skips one or more after the word Hello ([ \t]+), and allows the final word to begin with an upper- or lowercase letter ([Ww]); as you can see, patterns can handle wide variations in data:
>>> patt = '[ \t]*Hello[ \t]+(.*)[Ww]orld'
>>> line = ' Hello spamworld'
>>> mobj = re.match(patt, line)
>>> mobj.group(1)
'spam'
In addition to the tools these examples demonstrate, there are methods for scanning ahead to find a match (search), splitting and replacing on patterns, and so on. All have analogous module and precompiled call forms. Let's dig into a few details of the module before we get to more code.
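To make the difference between match and search concrete, here is a minimal sketch. The input string is made up for illustration, and the code uses the print() call form so that it also runs on current Python releases; it is not one of the book's examples.

```python
import re

text = 'xxHello spam...World'    # made-up input: the greeting starts at offset 2

# match is anchored at the front of the string, so it fails here
at_front = re.match('Hello(.*)World', text)

# search scans ahead until the pattern fits somewhere in the string
anywhere = re.search('Hello(.*)World', text)

# the same two calls, as methods of a precompiled pattern object
pattobj = re.compile('Hello(.*)World')
at_front_compiled = pattobj.match(text)
anywhere_compiled = pattobj.search(text)

print(at_front)                     # None
print(repr(anywhere.group(1)))      # ' spam...'
print(anywhere.start())             # 2
```

The module-level and precompiled forms give the same answers; only the place where the pattern string is supplied differs.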
21.4.2. Using the re Module
The Python re module comes with functions that can search for patterns right away or make compiled pattern objects for running matches later. Pattern objects (and module search calls) in turn generate match objects, which contain information about successful matches and matched substrings. The next few sections describe the module's interfaces and some of the operators you can use to code patterns.
21.4.2.1. Module functions
The top level of the module provides functions for matching, substitution, precompiling, and so on:
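As a rough illustration of a few of these top-level calls, the sketch below uses re.split, re.sub, re.subn, re.findall, and re.compile, all of which exist in the standard re module. The sample line and the separator pattern are assumptions chosen only for the demonstration.

```python
import re

line = 'spam, eggs;ham  toast'      # made-up sample text
separators = '[,; ]+'               # one or more commas, semicolons, or blanks

# split breaks the string apart wherever the pattern matches
fields = re.split(separators, line)

# sub replaces every match; subn also reports how many replacements it made
dashed = re.sub(separators, '-', line)
dashed_again, count = re.subn(separators, '-', line)

# findall collects every nonoverlapping match in a list
words = re.findall('[a-z]+', line)

# compile builds a pattern object with the same operations as methods
pattobj = re.compile(separators)
fields_again = pattobj.split(line)

print(fields)     # ['spam', 'eggs', 'ham', 'toast']
print(dashed)     # spam-eggs-ham-toast
print(count)      # 3
```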
21.4.2.2. Compiled pattern objects
At the next level, pattern objects provide similar attributes, but the pattern string is implied. The re.compile function in the previous section is useful to optimize patterns that may be matched more than once (compiled patterns match faster). Pattern objects returned by re.compile have these sorts of attributes.
match(string [, pos] [, endpos])
search(string [, pos] [, endpos])
split(string [, maxsplit])
sub(repl, string [, count])
subn(repl, string [, count])
findall(string [, pos [, endpos]])
finditer(string [, pos [, endpos]])
Same as the re functions, but the pattern is implied, and pos and endpos give start/end string indexes for the match.
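The pos and endpos arguments are easiest to see in a small sketch. The text and offsets below are invented for the illustration; the method signatures are those of compiled pattern objects in the standard re module.

```python
import re

pattobj = re.compile('[a-z]+')
text = 'spam eggs ham'              # 'eggs' occupies offsets 5 through 8

first = pattobj.search(text)        # scans from offset 0
second = pattobj.search(text, 4)    # scans from offset 4 (the blank) onward
clipped = pattobj.search(text, 5, 7)  # only offsets 5 and 6 are visible

# match is anchored at pos rather than at the front of the string
anchored = pattobj.match(text, 5)
blocked = pattobj.match(text, 4)    # offset 4 is a blank, so no match

print(first.group())      # spam
print(second.group())     # eggs
print(clipped.group())    # eg
print(anchored.group())   # eggs
print(blocked)            # None
```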
21.4.2.3. Match objects
Finally, when a match or search function or method is successful, you get back a match object (None comes back on failed matches). Match objects export a set of attributes of their own, including:
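The following sketch exercises a handful of match object interfaces that the standard re module is known to provide (group, groups, start, end, and span). It is not a complete list, and the sample text is an assumption made up for the demonstration.

```python
import re

text = 'set size=42 now'            # made-up sample text
mobj = re.search('([a-z]+)=([0-9]+)', text)

if mobj:                            # None (false) would mean no match
    whole = mobj.group()            # entire matched text, same as group(0)
    name = mobj.group(1)            # first parenthesized group
    value = mobj.group(2)           # second parenthesized group
    both = mobj.groups()            # all groups as a tuple
    where = mobj.span()             # (start, end) offsets of the whole match
    value_start = mobj.start(2)     # offset where group 2 begins

    print(whole)          # size=42
    print(both)           # ('size', '42')
    print(where)          # (4, 11)
    print(value_start)    # 9
```

Testing the result before calling its methods guards against the None that comes back when nothing matches.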
21.4.2.4. Regular expression patterns
Regular expression strings are built up by concatenating single-character regular expression forms, shown in Table 21-1. The longest-matching string is usually matched by each form, except for the nongreedy operators. In the table, R means any regular expression form, C is a character, and N denotes a digit.
Within patterns, ranges and selections can be combined. For instance, [a-zA-Z0-9_]+ matches the longest possible string of one or more letters, digits, or underscores. Special characters can be escaped as usual in Python strings: [\t ]* matches zero or more tabs and spaces (i.e., it skips whitespace).
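The longest-match rule and its nongreedy exception can be sketched as follows. The tagged text is a made-up input, and *? is the standard nongreedy form of * in the re module.

```python
import re

text = '<b>spam</b> and <b>eggs</b>'    # made-up sample text

# greedy: (.*) grabs as much as it can, running through to the last </b>
greedy = re.findall('<b>(.*)</b>', text)

# nongreedy: (.*?) stops at the first point where the rest of the pattern fits
nongreedy = re.findall('<b>(.*?)</b>', text)

# combined ranges match the longest run of letters, digits, or underscores
name = re.match('[a-zA-Z0-9_]+', 'spam_42 = value').group()

print(greedy)       # ['spam</b> and <b>eggs']
print(nongreedy)    # ['spam', 'eggs']
print(name)         # spam_42
```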
The parenthesized grouping construct, (R), lets you extract matched substrings after a successful match. The portion of the string matched by the expression in parentheses is retained in a numbered register. It's available through the group method of a match object after a successful match.
In addition to the entries in this table, special sequences in Table 21-2 can be used in patterns too. Due to Python string rules, you sometimes must double up on backslashes (\\) or use Python raw strings (r'...') to retain backslashes in the pattern. In normal strings, Python leaves a backslash in place if the character following it is not recognized as an escape code, so such backslashes reach the pattern intact; recognized escapes such as \b, however, are converted by Python before the pattern is ever seen.
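A short sketch shows why the backslash rules matter. It uses the \b word-boundary sequence, which collides with Python's own \b (backspace) string escape; the sample text is invented for the illustration.

```python
import re

text = 'eggs spam ham'              # made-up sample text

# In a normal string, '\b' is a single backspace character
plain_length = len('\b')            # 1
raw_length = len(r'\b')             # 2: a backslash and the letter b

# This pattern looks for literal backspace characters, so it finds nothing
plain = re.search('\bspam\b', text)

# Doubling the backslashes, or using a raw string, passes \b through to re
doubled = re.search('\\bspam\\b', text)
raw = re.search(r'\bspam\b', text)

print(plain)             # None
print(doubled.span())    # (5, 9)
print(raw.span())        # (5, 9)
```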
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser: \a, \b, \f, \n, \r, \t, \v, \x, and \\.
The Python library manual gives additional details. But to demonstrate how the re pattern syntax is typically used, let's go back to writing some code.
21.4.3. Basic Patterns
To illustrate how to combine regular expression operators, we'll turn to a few short test files that match simple pattern forms. Comments in Example 21-3 describe the operations exercised; check Table 21-1 to see which operators are used in these patterns.
Example 21-3. PP3E\lang\re-basics.py
Notice again that there are different ways to kick off a match with re: by calling module search functions and by making compiled pattern objects. In either event, you can hang on to the resulting match object or not. All the print statements in this script show a result of 2, the offset where the pattern was found in the string. In the first test, for example, A.C. matches the ABCD at offset 2 in the search string (i.e., after the first xx):
C:\...\PP3E\Lang>python re-basic.py
2
2
2
2
2
2
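The example file itself is not reproduced here, so the following is only a sketch in the same spirit, not the book's script. The search string and the last pattern are assumptions; each test reports offset 2, where ABCD begins.

```python
import re

target = 'xxABCDxx'                 # assumed search string: ABCD sits at offset 2

# module-level search, using the match object right away
offset1 = re.search('A.C.', target).start()

# compiled pattern object, hanging on to the match object
pattobj = re.compile('A.C.')
mobj = pattobj.search(target)
offset2 = mobj.start() if mobj else None

# a character range with repetition finds the same spot
offset3 = re.search('[A-C]+D', target).start()

print(offset1)    # 2
print(offset2)    # 2
print(offset3)    # 2
```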
In Example 21-4, parts of the pattern strings enclosed in parentheses delimit groups; the parts of the string they matched are available after the match.
Example 21-4. PP3E\lang\re-groups.py
In the first test here, for instance, the three (.) groups each match a single character, but they retain the character matched; calling group pulls out the bits matched. The second test's (.*) groups match and retain any number of characters. The last test matches C #define lines; more on this later.
C:\...\PP3E\Lang>python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
('spam', '1 + 2 + 3')
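Since the example listing is not shown here, the sketch below is a guess at inputs and patterns that would yield the same groups as the output above. Only the final test string appears elsewhere in this section; everything else is an assumption.

```python
import re

# three (.) groups, each retaining one character
mobj = re.match('A(.)B(.)C(.)', 'A0B1C2')
singles = (mobj.group(1), mobj.group(2), mobj.group(3))

# (.*) groups retain any number of characters
runs = re.match('A(.*)B(.*)C(.*)', 'A000B111C222').groups()

# alternatives inside groups: each group records the choice that matched
picks = re.search('(A|X)(B|Y)(C|Z)D', '..AYCD..').groups()

# a C #define line: name in group 1, the rest of the line in group 2
patt = re.compile('[ \t]*#[ \t]*define[ \t]*([a-z0-9_]*)[ \t]*(.*)')
define = patt.search(' # define spam 1 + 2 + 3').groups()

print(singles)    # ('0', '1', '2')
print(runs)       # ('000', '111', '222')
print(picks)      # ('A', 'Y', 'C')
print(define)     # ('spam', '1 + 2 + 3')
```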
Finally, besides matches and substring extraction, re also includes tools for string replacement or substitution (see Example 21-5).
Example 21-5. PP3E\lang\re-subst.py
In the first test, all characters in the set are replaced; in the second, they must be followed by an underscore:
C:\...\PP3E\Lang>python re-subst.py
X*X*X*X*X*X*
XA-X*XB-X*XC-X*
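For reference, here is one pair of re.sub calls that produces exactly these two results. The input strings are assumptions reconstructed from the output, so the book's script may differ.

```python
import re

# every character in the set [ABC] is replaced
first = re.sub('[ABC]', '*', 'XAXAXBXBXCXC')

# a character in the set is replaced only when an underscore follows it,
# and the underscore is consumed along with it
second = re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')

print(first)     # X*X*X*X*X*X*
print(second)    # XA-X*XB-X*XC-X*
```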
21.4.4. Scanning C Header Files for Patterns
On to some realistic examples: the script in Example 21-6 puts these pattern operators to more practical use. It uses regular expressions to find #define and #include lines in C header files and extract their components. The generality of the patterns makes them detect a variety of line formats; pattern groups (the parts in parentheses) are used to extract matched substrings from a line after a match.
Example 21-6. PP3E\Lang\cheader.py
To test, let's run this script on the text file in Example 21-7.
Example 21-7. PP3E\Lang\test.h
Notice the spaces after # in some of these lines; regular expressions are flexible enough to account for such departures from the norm. Here is the script at work, picking out #include and #define lines and their parts. For each matched line, it prints the line number, the line type, and any matched substrings:
C:\...\PP3E\Lang>python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg
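The general technique can be sketched in a few lines. This is not the cheader.py script: the patterns, the scan function, and the sample lines are all hypothetical, written only to show how groups pull the parts out of matching lines.

```python
import re

# Hypothetical patterns; the script's own patterns may differ in detail
patt_define = re.compile(r'^#[ \t]*define[ \t]+(\w+)[ \t]*(.*)')
patt_include = re.compile(r'^#[ \t]*include[ \t]+[<"]([\w/.]+)')

def scan(lines):
    """Return (line number, kind, matched parts...) for each recognized line."""
    results = []
    for number, line in enumerate(lines, 1):
        mobj = patt_define.match(line)
        if mobj:
            results.append((number, 'defined', mobj.group(1), mobj.group(2)))
            continue
        mobj = patt_include.match(line)
        if mobj:
            results.append((number, 'include', mobj.group(1)))
    return results

# made-up header text, including blanks after the # character
sample = [
    '#ifndef TEST_H',
    '#define TEST_H',
    '# include <stdio.h>',
    '#include "lib/spam.h"',
    '#  define SPAM 1234',
    '#define ADDER(arg) 123 + arg',
]

found = scan(sample)
for item in found:
    print(item)
```

Lines that match neither pattern, such as the #ifndef line, are simply skipped.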
21.4.5. A File Pattern Search Utility
The next script searches for patterns in a set of files, much like the grep command-line program. We wrote file and directory searchers earlier in Chapter 7. Here, the file searches look for patterns rather than simple strings (see Example 21-8). The patterns are typed interactively, separated by a space, and the files to be searched are specified by an input pattern for Python's glob.glob filename expansion tool that we studied earlier.
Example 21-8. PP3E\Lang\pygrep1.py
Here's what a typical run of this script looks like, scanning old versions of some of the source files in this chapter; it searches all Python files in the current directory for two different patterns, compiled for speed. Notice that files are named by a pattern too; Python's glob module also uses re internally:
C:\...\PP3E\Lang>python pygrep1.py
patterns? >import.*string spam
files? >*.py
[cheader.py]
[finder2.py]
0002) import string, glob, os, sys
[patterns.py]
0048) mobj = patt.search(" # define spam 1 + 2 + 3")
[pygrep1.py]
[rules.py]
[summer.py]
0002) import string
[__init__.py]
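The core of such a search can be sketched as a single function. This is not the pygrep1.py script: grep_files and its return format are hypothetical, and the code is written for current Python releases rather than the interactive prompts shown above.

```python
import glob
import re

def grep_files(pattern_strings, file_glob):
    """Return (filename, line number, line) for lines matching any pattern."""
    # compile each pattern once, since it is applied to every line
    patterns = [re.compile(text) for text in pattern_strings]
    hits = []
    for filename in sorted(glob.glob(file_glob)):
        with open(filename) as handle:
            for number, line in enumerate(handle, 1):
                # search, not match: the pattern may appear anywhere in the line
                if any(patt.search(line) for patt in patterns):
                    hits.append((filename, number, line.rstrip()))
    return hits
```

A call such as grep_files(['import.*string', 'spam'], '*.py') would then return the matching lines for the Python files in the current directory, ready to be printed in whatever format is preferred.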