21.4. Regular Expression Pattern Matching

Splitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks, Python provides regular expression matching utilities. Regular expressions are simply strings that define patterns to be matched against other strings. Supply a pattern and a string and ask whether the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, a match not only gives a yes/no answer, but can also pick out substrings.

Regular expression pattern strings can be complicated (let's be honest: they can be downright gross to look at). But once you get the hang of them, they can replace larger handcoded string search routines; a single pattern string generally does the work of dozens of lines of manual string scanning code and may run much faster. They are a concise way to encode the expected structure of text and extract portions of it.

In Python, regular expressions are not part of the syntax of the Python language itself, but they are supported by extension modules that you must import to use. The modules define functions for compiling pattern strings into pattern objects, matching these objects against strings and fetching matched substrings after a match. They also provide tools for pattern-based splitting, replacing, and so on.

Beyond those generalities, Python's regular expression story is complicated a little by history:


The regex module (old)

In earlier Python releases, a module called regex was the standard (and only) regular expression module. It was fast and supported patterns coded in awk, grep, and emacs styles, but it is now deprecated. (It generates a deprecation warning when imported today, though it will likely still be available for some time to come.)


The re module (new)

Today, you should use re, a new regular expression module for Python that was introduced sometime around Python release 1.5. This module provides a much richer regular expression pattern syntax that tries to be close to that used to code patterns in the Perl language (yes, regular expressions are a feature of Perl worth emulating). For instance, re supports the notions of named groups, character classes, and non-greedy matches: regular expression pattern operators that match as few characters as possible (other regular expression pattern operators always match the longest possible substring).

When it was first made available, re was generally slower than regex, so you had to choose between speed and Perl-like regular expression syntax. Today, though, re has been optimized to the extent that regex no longer offers any clear advantages. Moreover, re supports a richer pattern syntax and matching of Unicode strings (strings with 16-bit-wide or wider characters for representing large character sets).
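To make the non-greedy and named-group notions concrete, here is a minimal sketch. The sample text and the group name word are made up for illustration, and the code uses print(...) on a single argument so that it should run under both the Python 2 releases this text uses and Python 3:

```python
import re

text = '<b>spam</b> and <b>eggs</b>'

# Greedy: .* runs on to the last '</b>' in the string
greedy = re.search('<b>(.*)</b>', text).group(1)

# Nongreedy: .*? stops at the first '</b>' it can reach
lazy = re.search('<b>(.*?)</b>', text).group(1)

# Named group: fetch the matched substring by name instead of by number
named = re.search('<b>(?P<word>.*?)</b>', text).group('word')

print(greedy)    # spam</b> and <b>eggs
print(lazy)      # spam
print(named)     # spam
```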

Because of this migration, I've recoded regular expression examples in this text to use the new re module rather than regex. The old regex-based versions are still available in the book's examples distribution in the directory PP3E\lang\old-regex. If you find yourself having to migrate old regex code, you can also find a document describing the translation steps needed at http://www.python.org. Both modules' interfaces are similar, but re introduces a match object and changes pattern syntax in minor ways.

Having said that, I also want to warn you that regular expressions are a complex topic that cannot be covered in depth here. If this area sparks your interest, the text Mastering Regular Expressions, written by Jeffrey E. F. Friedl (O'Reilly), is a good next step to take.

Once you learn how to code patterns, though, the top-level interface for performing matches is straightforward. In fact, it is so easy to use that we'll jump right into an example before getting into more details.

21.4.1. First Examples

There are two basic ways to kick off matches: through top-level function calls and via methods of precompiled pattern objects. The latter precompiled form is quicker if you will be applying the same pattern more than once: to all lines in a text file, for instance. To demonstrate, let's do some matching on the following strings:

    >>> text1 = 'Hello spam...World'
    >>> text2 = 'Hello spam...other'

The match performed in the following code does not precompile: it executes an immediate match to look for all the characters between the words Hello and World in our text strings:

    >>> import re
    >>> matchobj = re.match('Hello(.*)World', text2)
    >>> print matchobj
    None

When a match fails as it does here (the text2 string doesn't end in World), we get back the None object, which is Boolean false if tested in an if statement.

In the pattern string we're using here (the first argument to re.match), the words Hello and World match themselves, and (.*) means any character (.) repeated zero or more times (*). The fact that it is enclosed in parentheses tells Python to save away the part of the string matched by that part of the pattern as a group: a matched substring. To see how, we need to make a match work:

    >>> matchobj = re.match('Hello(.*)World', text1)
    >>> print matchobj
    <_sre.SRE_Match object at 0x009D6520>
    >>> matchobj.group(1)
    ' spam...'

When a match succeeds, we get back a match object, which has interfaces for extracting matched substrings: the group(1) call returns the portion of the string matched by the first, leftmost, parenthesized portion of the pattern (our (.*)). In other words, matching is not just a yes/no answer (as already mentioned); by enclosing parts of the pattern in parentheses, it is also a way to extract matched substrings.
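Because a failed match returns None and a successful one returns a match object, the result can be tested directly in an if statement. A small sketch of that idiom follows; the helper name between is hypothetical, not part of re:

```python
import re

def between(text):
    """Return the text between Hello and World, or None if there is no match."""
    matchobj = re.match('Hello(.*)World', text)
    if matchobj:                      # None is false, a match object is true
        return matchobj.group(1)
    return None

print(repr(between('Hello spam...World')))   # ' spam...'
print(repr(between('Hello spam...other')))   # None
```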

The interface for precompiling is similar, but the pattern is implied in the pattern object we get back from the compile call:

    >>> pattobj  = re.compile('Hello(.*)World')
    >>> matchobj = pattobj.match(text1)
    >>> matchobj.group(1)
    ' spam...'

Again, you should precompile for speed if you will run the pattern multiple times. Here's something a bit more complex that hints at the generality of patterns. This one allows for zero or more blanks or tabs at the front ([ \t]*), skips one or more after the word Hello ([ \t]+), and allows the final word to begin with an upper- or lowercase letter ([Ww]); as you can see, patterns can handle wide variations in data:

    >>> patt = '[ \t]*Hello[ \t]+(.*)[Ww]orld'
    >>> line = ' Hello   spamworld'
    >>> mobj = re.match(patt, line)
    >>> mobj.group(1)
    'spam'

In addition to the tools these examples demonstrate, there are methods for scanning ahead to find a match (search), splitting and replacing on patterns, and so on. All have analogous module and precompiled call forms. Let's dig into a few details of the module before we get to more code.
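As a quick illustration of the difference between match and search, and of the module and precompiled call forms, consider this sketch. The sample line is invented; the pattern is a trimmed version of the one just shown:

```python
import re

line = 'xx Hello   spamworld'
patt = 'Hello[ \t]+(.*)[Ww]orld'

anchored = re.match(patt, line)              # match only tries the front of the string
scanned = re.search(patt, line)              # search scans ahead for a location

pattobj = re.compile(patt)                   # same calls, as pattern object methods
compiled = pattobj.search(line)

print(anchored)                              # None
print(scanned.group(1))                      # spam
print(compiled.group(1))                     # spam
```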

21.4.2. Using the re Module

The Python re module comes with functions that can search for patterns right away or make compiled pattern objects for running matches later. Pattern objects (and module search calls) in turn generate match objects, which contain information about successful matches and matched substrings. The next few sections describe the module's interfaces and some of the operators you can use to code patterns.

21.4.2.1. Module functions

The top level of the module provides functions for matching, substitution, precompiling, and so on:


compile(pattern [, flags])

Compile a regular expression pattern string into a regular expression pattern object, for later matching. See the reference manual for the flags argument's meaning.


match(pattern, string [, flags])

If zero or more characters at the start of string match the pattern string, return a corresponding match object, or None if no match is found. Roughly like a search for a pattern that begins with the ^ operator.


search(pattern, string [, flags])

Scan through string for a location matching pattern, and return a corresponding match object, or None if no match is found.


split(pattern, string [, maxsplit])

Split string by occurrences of pattern. If capturing parentheses (( )) are used in the pattern, occurrences of patterns or subpatterns are also returned.


sub(pattern, repl, string [, count])

Return the string obtained by replacing the (first count) leftmost nonoverlapping occurrences of pattern (a string or a pattern object) in string by repl (which may be a string or a function that is passed a single match object).


subn(pattern, repl, string [, count])

Same as sub, but returns a tuple: (new-string, number-of-substitutions-made).


findall(pattern, string [, flags])

Return a list of strings giving all nonoverlapping matches of pattern in string; if there are any groups in patterns, returns a list of groups.


finditer(pattern, string [, flags])

Return iterator over all nonoverlapping matches of pattern in string.


escape(string)

Return string with all nonalphanumeric characters backslashed, such that they can be compiled as a string literal.
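The following sketch exercises several of these module functions on a made-up string. The data and variable names are illustrative only:

```python
import re

text = 'spam, eggs;ham  toast'

parts   = re.split('[,; ]+', text)            # split on runs of delimiters
kept    = re.split('([,;])', 'a,b;c')         # a group keeps the delimiters too
joined  = re.sub('[,; ]+', '-', text)         # replace every occurrence
counted = re.subn('[,; ]+', '-', text)        # same, plus a substitution count
words   = re.findall('[a-z]+', text)          # all matches, as strings
pairs   = re.findall('([a-z])([a-z]+)', 'spam eggs')   # groups: list of tuples
starts  = [m.start() for m in re.finditer('[a-z]+', text)]
escaped = re.escape('a.b*c')                  # backslash the special characters

print(parts)      # ['spam', 'eggs', 'ham', 'toast']
print(kept)       # ['a', ',', 'b', ';', 'c']
print(joined)     # spam-eggs-ham-toast
print(counted)    # ('spam-eggs-ham-toast', 3)
print(words)      # ['spam', 'eggs', 'ham', 'toast']
print(pairs)      # [('s', 'pam'), ('e', 'ggs')]
print(starts)     # [0, 6, 11, 16]
print(escaped)    # a\.b\*c
```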

21.4.2.2. Compiled pattern objects

At the next level, pattern objects provide similar attributes, but the pattern string is implied. The re.compile function in the previous section is useful to optimize patterns that may be matched more than once (compiled patterns match faster). Pattern objects returned by re.compile have these sorts of attributes.

match(string [, pos] [, endpos])
search(string [, pos] [, endpos])
split(string [, maxsplit])
sub(repl, string [, count])
subn(repl, string [, count])
findall(string [, pos [, endpos]])
finditer(string [, pos [, endpos]])

Same as the re functions, but the pattern is implied, and pos and endpos give start/end string indexes for the match.
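Here is a short sketch of how the pos and endpos arguments narrow the part of the string a compiled pattern looks at. The sample text is made up:

```python
import re

pattobj = re.compile('[a-z]+')
text = 'spam eggs ham'

first  = pattobj.search(text).group()        # scan from the front
second = pattobj.search(text, 5).group()     # scanning starts at index 5
third  = pattobj.match(text, 10).group()     # match is anchored at pos
early  = pattobj.findall(text, 0, 9)         # only text[0:9] is considered

print(first)     # spam
print(second)    # eggs
print(third)     # ham
print(early)     # ['spam', 'eggs']
```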

21.4.2.3. Match objects

Finally, when a match or search function or method is successful, you get back a match object (None comes back on failed matches). Match objects export a set of attributes of their own, including:


group(g)
group([g1, g2, ...])

Return the substring that matched a parenthesized group (or groups) in the pattern. Accept group numbers or names. Group numbers start at 1; group 0 is the entire string matched by the pattern.


groups( )

Returns a tuple of all groups' substrings of the match.


groupdict( )

Returns a dictionary containing all named groups of the match.


start([group])
end([group])

Indices of the start and end of the substring matched by group (or the entire matched string, if no group).


span([group])

Returns the two-item tuple: (start(group),end(group)).


expand(template)

Performs backslash group substitutions; see the Python library manual.
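To see these match object attributes side by side, here is a minimal sketch run against an invented string. The group names key and val are assumptions chosen for the example:

```python
import re

mobj = re.search(r'(?P<key>\w+)=(?P<val>\d+)', 'set size=42 now')

whole    = mobj.group(0)                     # entire matched text
by_num   = mobj.group(1)                     # group by number
by_name  = mobj.group('val')                 # group by name
several  = mobj.group(1, 2)                  # more than one group: a tuple
everyone = mobj.groups()                     # all groups
named    = sorted(mobj.groupdict().items())  # named groups, sorted for display
offsets  = (mobj.start(), mobj.end())        # offsets of the whole match
val_span = mobj.span(2)                      # offsets of group 2
expanded = mobj.expand(r'\g<val> is \1')     # backslash group substitutions

print(whole)      # size=42
print(by_num)     # size
print(by_name)    # 42
print(several)    # ('size', '42')
print(everyone)   # ('size', '42')
print(named)      # [('key', 'size'), ('val', '42')]
print(offsets)    # (4, 11)
print(val_span)   # (9, 11)
print(expanded)   # 42 is size
```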

21.4.2.4. Regular expression patterns

Regular expression strings are built up by concatenating single-character regular expression forms, shown in Table 21-1. The longest-matching string is usually matched by each form, except for the nongreedy operators. In the table, R means any regular expression form, C is a character, and N denotes a digit.

Table 21-1. re pattern syntax

Operator                            Interpretation
.                                   Matches any character (including newline if DOTALL flag is specified)
^                                   Matches start of the string (of every line in MULTILINE mode)
$                                   Matches end of the string (of every line in MULTILINE mode)
C                                   Any nonspecial character matches itself
R*                                  Zero or more of preceding regular expression R (as many as possible)
R+                                  One or more of preceding regular expression R (as many as possible)
R?                                  Zero or one occurrence of preceding regular expression R
R{m}                                Matches exactly m copies of preceding R: a{5} matches 'aaaaa'
R{m,n}                              Matches from m to n repetitions of preceding regular expression R
R*?, R+?, R??, R{m,n}?              Same as *, +, ?, and {m,n}, but match as few characters/times as possible; these are known as nongreedy match operators
[...]                               Defines character set: e.g., [a-zA-Z] to match all letters
[^...]                              Defines complemented character set: matches if char is not in set
\                                   Escapes special chars (e.g., *?+|()) and introduces special sequences
\\                                  Matches a literal \ (write as \\\\ in pattern, or r'\\')
\number                             Matches the contents of the group of the same number: (.+) \1 matches "42 42"
R|R                                 Alternative: matches left or right R
RR                                  Concatenation: match both Rs
(R)                                 Matches any regular expression inside ( ), and delimits a group (retains matched substring)
(?:R)                               Same but doesn't delimit a group
(?=R)                               Look-ahead assertion: matches if R matches next, but doesn't consume any of the string (e.g., X(?=Y) matches X only if followed by Y)
(?!R)                               Matches if R doesn't match next; negative of (?=R)
(?P<name>R)                         Matches any regular expression inside ( ), and delimits a named group
(?P=name)                           Matches whatever text was matched by the earlier group named name
(?#...)                             A comment; ignored
(?letter)                           Set mode flag; letter is one of i, L, m, s, u, x (see the library manual)
(?<=R)                              Look-behind assertion: matches if the current position in the string is preceded by a match of R that ends at the current position
(?<!R)                              Matches if the current position in the string is not preceded by a match for R; negative of (?<=R)
(?(id/name)yespattern|nopattern)    Will try to match with yespattern if the group with given id or name exists, else with optional nopattern


Within patterns, ranges and selections can be combined. For instance, [a-zA-Z0-9_]+ matches the longest possible string of one or more letters, digits, or underscores. Special characters can be escaped as usual in Python strings: [\t ]* matches zero or more tabs and spaces (i.e., it skips whitespace).

The parenthesized grouping construct, (R), lets you extract matched substrings after a successful match. The portion of the string matched by the expression in parentheses is retained in a numbered register. It's available through the group method of a match object after a successful match.
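A few of the less obvious operators in Table 21-1 are easier to absorb by example. The strings below are invented for this sketch:

```python
import re

# \number: the text saved by group 1 must appear again
doubled = re.search(r'(\w+) \1', 'it is is here').group()

# (?=R) look-ahead: words followed by a colon; the colon is not consumed
keys = re.findall(r'\w+(?=:)', 'name: spam, size: 42')

# (?<=R) look-behind: digits preceded by a dollar sign
prices = re.findall(r'(?<=\$)\d+', 'cost $15, qty 3')

# (?:R) groups without saving; only (c) is retained
saved = re.match(r'(?:ab)+(c)', 'ababc').groups()

# R{m,n}: at most three of the four a's are consumed
run = re.match(r'a{2,3}', 'aaaa').group()

print(doubled)   # is is
print(keys)      # ['name', 'size']
print(prices)    # ['15']
print(saved)     # ('c',)
print(run)       # aaa
```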

In addition to the entries in this table, special sequences in Table 21-2 can be used in patterns too. Due to Python string rules, you sometimes must double up on backslashes (\\) or use Python raw strings (r'...') to retain backslashes in the pattern. Python leaves a backslash in a normal string alone if the character following it is not recognized as an escape code, but it is safer not to rely on that.
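The \b sequence is the classic case where the backslash rules matter: in a normal string Python turns it into a backspace character before re ever sees it. A small sketch, with an invented search string:

```python
import re

normal = '\bspam'       # backspace character followed by spam
raw    = r'\bspam'      # backslash, b, spam: a word boundary to re
doubled = '\\bspam'     # same pattern as the raw string

print(len(normal))                                  # 5
print(len(raw))                                     # 6
print(raw == doubled)                               # True
print(re.search(raw, 'a spam') is not None)         # True
print(re.search(normal, 'a spam') is not None)      # False
```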

Table 21-2. re special sequences

Sequence    Interpretation
\number     Matches text of group number (numbered from 1)
\A          Matches only at the start of the string
\b          Empty string at word boundaries
\B          Empty string not at word boundaries
\d          Any decimal digit character (such as [0-9])
\D          Any nondigit character (such as [^0-9])
\s          Any whitespace character (such as [ \t\n\r\f\v])
\S          Any nonwhitespace character (such as [^ \t\n\r\f\v])
\w          Any alphanumeric character or underscore (uses LOCALE flag)
\W          Any character other than an alphanumeric or underscore (uses LOCALE flag)
\Z          Matches only at the end of the string


Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser: \a, \b, \f, \n, \r, \t, \v, \x, and \\.
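The special sequences in Table 21-2 can be tried out quickly with a sketch like the following, run on made-up text. Raw strings are used throughout to keep the backslashes intact:

```python
import re

line = 'spam_1  costs 42 dollars'

digits   = re.findall(r'\d+', line)          # runs of digits
words    = re.findall(r'\w+', line)          # \w includes the underscore
fields   = re.split(r'\s+', line)            # split on runs of whitespace
s_words  = re.findall(r'\bs\w*', 'spam eggs sausage')   # words starting with s
at_start = re.search(r'\Adollars', line)     # only at the start of the string
at_end   = re.search(r'dollars\Z', line)     # only at the end of the string

print(digits)           # ['1', '42']
print(words)            # ['spam_1', 'costs', '42', 'dollars']
print(fields)           # ['spam_1', 'costs', '42', 'dollars']
print(s_words)          # ['spam', 'sausage']
print(at_start)         # None
print(at_end.group())   # dollars
```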

The Python library manual gives additional details. But to demonstrate how the re pattern syntax is typically used, let's go back to writing some code.

21.4.3. Basic Patterns

To illustrate how to combine regular expression operators, we'll turn to a few short test files that match simple pattern forms. Comments in Example 21-3 describe the operations exercised; check Table 21-1 to see which operators are used in these patterns.

Example 21-3. PP3E\lang\re-basics.py

    # literals, sets, ranges   (all print 2 = offset where pattern found)

    import re                                  # the one to use today

    pattern, string = "A.C.", "xxABCDxx"       # nonspecial chars match themselves
    matchobj = re.search(pattern, string)      # '.' means any one char
    if matchobj:                               # search returns match object or None
        print matchobj.start()                 # start is index where matched

    pattobj  = re.compile("A.*C.*")            # 'R*' means zero or more Rs
    matchobj = pattobj.search("xxABCDxx")      # compile returns pattern obj
    if matchobj:                               # patt.search returns match obj
        print matchobj.start()

    # selection sets
    print re.search(" *A.C[DE][D-F][^G-ZE]G\t+ ?", "..ABCDEFG\t..").start()

    # alternatives
    print re.search("A|XB|YC|ZD", "..AYCD..").start()  # R1|R2 means R1 or R2

    # word boundaries
    print re.search(r"\bABCD", "..ABCD ").start()      # \b means word boundary
    print re.search(r"ABCD\b", "..ABCD ").start()      # use r'...' to escape '\'

Notice again that there are different ways to kick off a match with re: by calling module search functions and by making compiled pattern objects. In either event, you can hang on to the resulting match object or not. All the print statements in this script show a result of 2, the offset where the pattern was found in the string. In the first test, for example, A.C. matches the ABCD at offset 2 in the search string (i.e., after the first xx):

    C:\...\PP3E\Lang>python re-basics.py
    2
    2
    2
    2
    2
    2

In Example 21-4, parts of the pattern strings enclosed in parentheses delimit groups; the parts of the string they matched are available after the match.

Example 21-4. PP3E\lang\re-groups.py

    # groups (extract substrings matched by REs in '( )' parts)

    import re

    patt = re.compile("A(.)B(.)C(.)")                  # saves 3 substrings
    mobj = patt.match("A0B1C2")                        # each '( )' is a group, 1..n
    print mobj.group(1), mobj.group(2), mobj.group(3)  # group() gives substring

    patt = re.compile("A(.*)B(.*)C(.*)")               # saves 3 substrings
    mobj = patt.match("A000B111C222")                  # groups() gives all groups
    print mobj.groups()

    print re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups()

    patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)")
    mobj = patt.search(" # define  spam  1 + 2 + 3")   # parts of C #define
    print mobj.groups()                                # \s is whitespace

In the first test here, for instance, the three (.) groups each match a single character, but they retain the character matched; calling group pulls out the bits matched. The second test's (.*) groups match and retain any number of characters. The last test matches C #define lines; more on this later.

    C:\...\PP3E\Lang>python re-groups.py
    0 1 2
    ('000', '111', '222')
    ('A', 'Y', 'C')
    ('spam', '1 + 2 + 3')
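As a variation on the last test, the same pattern can be written with named groups so that the parts are fetched by name rather than position. This is only a sketch; the group names name and body are my own choices, not part of the example file:

```python
import re

patt = re.compile(r'[\t ]*#\s*define\s*(?P<name>[a-z0-9_]*)\s*(?P<body>.*)')
mobj = patt.search(' # define  spam  1 + 2 + 3')

macro = mobj.group('name')        # same substring as mobj.group(1)
value = mobj.group('body')        # same substring as mobj.group(2)

print(macro)     # spam
print(value)     # 1 + 2 + 3
```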

Finally, besides matches and substring extraction, re also includes tools for string replacement or substitution (see Example 21-5).

Example 21-5. PP3E\lang\re-subst.py

    # substitutions (replace occurrences of patt with repl in string)

    import re
    print re.sub('[ABC]', '*', 'XAXAXBXBXCXC')
    print re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')

In the first test, all characters in the set are replaced; in the second, they must be followed by an underscore:

    C:\...\PP3E\Lang>python re-subst.py
    X*X*X*X*X*X*
    XA-X*XB-X*XC-X*
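Substitution has a few more options worth a quick look: a count limit, group references in the replacement string, a function as the replacement, and subn for counting. The sketch below reuses the strings from Example 21-5; the helper name to_lower is hypothetical:

```python
import re

# count limits how many leftmost occurrences are replaced
limited = re.sub('[ABC]', '*', 'XAXAXBXBXCXC', count=2)

# \1 in the replacement refers to the text saved by group 1
tagged = re.sub('([ABC])_', r'<\1>', 'XA-XA_XB-XB_')

# the replacement may be a function that is passed each match object
def to_lower(matchobj):
    return matchobj.group(0).lower()

lowered = re.sub('[ABC]', to_lower, 'XAXBXC')

# subn also reports how many substitutions were made
counted = re.subn('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')

print(limited)   # X*X*XBXBXCXC
print(tagged)    # XA-X<A>XB-X<B>
print(lowered)   # XaXbXc
print(counted)   # ('XA-X*XB-X*XC-X*', 3)
```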

21.4.4. Scanning C Header Files for Patterns

On to some realistic examples: the script in Example 21-6 puts these pattern operators to more practical use. It uses regular expressions to find #define and #include lines in C header files and extract their components. The generality of the patterns makes them detect a variety of line formats; pattern groups (the parts in parentheses) are used to extract matched substrings from a line after a match.

Example 21-6. PP3E\Lang\cheader.py

    #! /usr/local/bin/python
    import sys, re

    pattDefine = re.compile(                               # compile to pattobj
        '^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')   # "# define xxx yyy..."

    pattInclude = re.compile(
        '^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')     # "# include <xxx>..."

    def scan(file):
        count = 0
        while 1:                                     # scan line-by-line
            line = file.readline()
            if not line: break
            count += 1
            matchobj = pattDefine.match(line)        # None if match fails
            if matchobj:
                name = matchobj.group(1)             # substrings for (...) parts
                body = matchobj.group(2)
                print count, 'defined', name, '=', body.strip()
                continue
            matchobj = pattInclude.match(line)
            if matchobj:
                start, stop = matchobj.span(1)       # start/stop indexes of (...)
                filename = line[start:stop]          # slice out of line
                print count, 'include', filename     # same as matchobj.group(1)

    if len(sys.argv) == 1:
        scan(sys.stdin)                    # no args: read stdin
    else:
        scan(open(sys.argv[1], 'r'))       # arg: input filename

To test, let's run this script on the text file in Example 21-7.

Example 21-7. PP3E\Lang\test.h

    #ifndef TEST_H
    #define TEST_H

    #include <stdio.h>
    #include <lib/spam.h>
    #  include   "Python.h"

    #define DEBUG
    #define HELLO 'hello regex world'
    #  define SPAM    1234

    #define EGGS sunny + side + up
    #define  ADDER(arg) 123 + arg
    #endif

Notice the spaces after # in some of these lines; regular expressions are flexible enough to account for such departures from the norm. Here is the script at work, picking out #include and #define lines and their parts. For each matched line, it prints the line number, the line type, and any matched substrings:

    C:\...\PP3E\Lang>python cheader.py test.h
    2 defined TEST_H =
    4 include stdio.h
    5 include lib/spam.h
    6 include Python.h
    8 defined DEBUG =
    9 defined HELLO = 'hello regex world'
    10 defined SPAM = 1234
    12 defined EGGS = sunny + side + up
    13 defined ADDER = (arg) 123 + arg
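An alternative to reading line by line is to load the whole text and let the MULTILINE flag make ^ match at the start of every line. The following is only a sketch of that idea, not a replacement for cheader.py: it folds both line types into one pattern of my own (looser than the two originals, since it allows / and . in #define names) and scans an abbreviated copy of the header held in a string:

```python
import re

header = '''#ifndef TEST_H
#define TEST_H

#include <stdio.h>
#  include   "Python.h"
#  define SPAM    1234
#endif
'''

# One pattern for both line types; ^ matches at each line start in MULTILINE mode
patt = re.compile(
    r'^#[\t ]*(define|include)[\t ]+[<"]?([a-zA-Z0-9_/.]+)', re.MULTILINE)

found = [(m.group(1), m.group(2)) for m in patt.finditer(header)]
for kind, name in found:
    print('%s %s' % (kind, name))
```

Run on this text, the loop reports define TEST_H, include stdio.h, include Python.h, and define SPAM, in that order. Line numbers are not tracked here, which is one reason the line-by-line scan may still be the better fit for this job.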

21.4.5. A File Pattern Search Utility

The next script searches for patterns in a set of files, much like the grep command-line program. We wrote file and directory searchers earlier in Chapter 7. Here, the file searches look for patterns rather than simple strings (see Example 21-8). The patterns are typed interactively, separated by a space, and the files to be searched are specified by an input pattern for Python's glob.glob filename expansion tool that we studied earlier.

Example 21-8. PP3E\Lang\pygrep1.py

    #!/usr/local/bin/python
    import sys, re, glob

    help_string = """
    Usage options.
    interactive:  % pygrep1.py
    """

    def getargs():
        if len(sys.argv) == 1:
            return raw_input("patterns? >").split(), raw_input("files? >")
        else:
            try:
                # split so each pattern is compiled whole, as in interactive mode
                return sys.argv[1].split(), sys.argv[2]
            except:
                print help_string
                sys.exit(1)

    def compile_patterns(patterns):
        res = []
        for pattstr in patterns:
            try:
                res.append(re.compile(pattstr))           # make re patt object
            except:                                       # or use re.match
                print 'pattern ignored:', pattstr
        return res

    def searcher(pattfile, srchfiles):
        patts = compile_patterns(pattfile)                  # compile for speed
        for file in glob.glob(srchfiles):                   # all matching files
            lineno = 1                                      # glob uses re too
            print '\n[%s]' % file
            for line in open(file, 'r').readlines():        # all lines in file
                for patt in patts:
                    if patt.search(line):                   # try all patterns
                        print '%04d)' % lineno, line,       # match if not None
                        break
                lineno = lineno+1

    if __name__ == '__main__':
        searcher(*getargs())                                # was apply(func, args)

Here's what a typical run of this script looks like, scanning old versions of some of the source files in this chapter; it searches all Python files in the current directory for two different patterns, compiled for speed. Notice that files are named by a pattern too; Python's glob module also uses re internally:

    C:\...\PP3E\Lang>python pygrep1.py
    patterns? >import.*string spam
    files? >*.py

    [cheader.py]

    [finder2.py]
    0002) import string, glob, os, sys

    [patterns.py]
    0048) mobj = patt.search(" # define  spam  1 + 2 + 3")

    [pygrep1.py]

    [rules.py]

    [summer.py]
    0002) import string

    [__init__.py]
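Since the script only needs to know whether any one of the patterns matches a line, another option is to join them into a single alternation with the | operator and search once per line. This is a sketch of that variation, not how pygrep1.py works; the helper name any_pattern and the sample lines are made up:

```python
import re

def any_pattern(patterns):
    """Join pattern strings into one compiled alternation."""
    # (?:...) keeps each pattern intact without adding numbered groups
    return re.compile('|'.join('(?:%s)' % p for p in patterns))

patt = any_pattern(['import.*string', 'spam'])
lines = ['import string, glob', 'x = 1', 'print spam']

hits = [(num, line) for num, line in enumerate(lines, 1) if patt.search(line)]
for num, line in hits:
    print('%04d) %s' % (num, line))
```

One trade-off: if any single pattern string is invalid, the combined compile fails as a whole, whereas the original script can skip just the bad pattern.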




Programming Python
ISBN: 0596009259
EAN: 2147483647
Year: 2004
Pages: 270
Authors: Mark Lutz
