Appendix E. Regular Expressions

CONTENTS
  •  A Simple Example
  •  Pattern Characteristics
  •  Regular Expression Functions and Error and Flags Properties
  •  re Object Methods and Properties
  •  Match Object Methods and Properties
  •  Metacharacters
  •  Putting Things Together

Jaysen Lorenzen

Regular expressions are patterns that match groups of characters. We can use them to find and return (or replace) characters, words, and lines. Python's standard-distribution re module derives from Perl's regular expressions, which, like Perl itself, combine features from UNIX utilities such as Awk, Sed, and Grep, and programs like vi and GNU Emacs. Though the syntax and metacharacters may be slightly different, expressions written for any of these tools can be written in Jython.

A Simple Example

Let's say that you're looking for a job in the newspaper and you've decided to narrow your focus to ads that say, "Money is no object." This is the pattern you're searching for, and of course you're looking in the employment section, which is the group of characters you're searching in. You don't care where you find your pattern, only that it's there.

If you were writing a Python program to help you find a job, it would start out looking like this:

from re import *
mino = compile("money is no object")
for ad in classifieds:
    if mino.search(ad):
        print ad

Let's work through this example interactively.

Import the re module.

>>> from re import *

This brings the names defined in the re module, such as compile() and search(), into the current namespace.

>>> classifieds = ["School Teacher; salary $20,000.","Engineer; salary money is no object"]

Make the variable classifieds a list of ads.

>>> mino = compile("money is no object")

Compile the search phrase, and store it in the mino ("money is no object") object.

>>> for ad in classifieds:
...     if(mino.search(ad)):
...         print ad
...

Compare the precompiled expression to each element in the list. If a match is found within an element, print it.

Engineer; salary money is no object
>>>
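The steps above can be collected into one short script. This is a minimal sketch, assuming a CPython-style re module; it uses import re with qualified names instead of from re import *, writes print with parentheses, and the helper name find_ads is hypothetical.

```python
import re

# Sample ads from the walkthrough above.
classifieds = [
    "School Teacher; salary $20,000.",
    "Engineer; salary money is no object",
]

def find_ads(pattern, ads):
    """Return the ads in which the pattern occurs anywhere (hypothetical helper)."""
    compiled = re.compile(pattern)  # compile once, reuse for every ad
    return [ad for ad in ads if compiled.search(ad)]

matches = find_ads("money is no object", classifieds)
for ad in matches:
    print(ad)
```

Compiling once outside the loop mirrors the mino object in the interactive session.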

Pattern Characteristics

A pattern can be as small as one character, matching only itself, or (as in the previous example) a longer string of almost any length. It can contain wildcard and special characters, subexpressions, and so forth, depending on system resources.

Our expression "money is no object" is a string of literal characters requiring an exact match in the exact order, but regular expressions can be much more powerful. Say, for example, that you're looking for a job whose salary figure is within a certain range, that is, with a certain number of digits to the left of the decimal point. To find all records with two digits before the comma and three digits after, you can write this expression:

>>> reStr = r'\$..,...\.'

Then you can compile and match it against a string.

>>> cre = compile(reStr)
>>> cre.search("salary: $90,000.00")

The figure $90,000 matches, and the statement returns a match object.

The Raw String Construct

Notice the r'...' construct in the code above. This denotes a raw string, that is, one with its backslash (\) escaped characters left intact. In this case the $ and the decimal point are special characters that must be escaped with a backslash to be matched literally rather than treated as metacharacters.

Actually, the pattern above would work without the raw string construct, but I wanted to introduce it early on because its absence can occasionally cause problems, as you can see in the following examples.

The following expression, with r'...', returns a match:

>>> STR = r'(\be[ a-z][ a-z])'
>>> cre = compile(STR)
>>> cre.search("the beg the end")

Its matched text is end, retrieved with group().

>>> cre.search("the beg the end").group()
'end'

This one, without r'...', doesn't get a match:

>>> STR = "(\be[ a-z][ a-z])"
>>> cre = compile(STR)
>>> cre.search("the beg the end")
>>>

A look at the two expressions' patterns shows why. Here's the first one:

>>> STR = r'(\be[ a-z][ a-z])'
>>> cre = compile(STR)
>>> cre.pattern

which returns

'(\\be[ a-z][ a-z])'

Here's the second one:

>>> STR = "(\be[ a-z][ a-z])"
>>> cre = compile(STR)
>>> cre.pattern

which returns

'(\be[ a-z][ a-z])'

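In the second case, Python itself interprets \b as a backspace character before the regular expression compiler ever sees it, so the pattern no longer contains the word-boundary sequence. The lesson is, use the raw string construct whenever a normal string might cause confusion. It can't hurt.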
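The difference between the two strings can be checked directly. The following is a small sketch, assuming a CPython-style re module; the variable names are made up for illustration.

```python
import re

raw = r'(\be[ a-z][ a-z])'    # backslash and "b" reach the regex compiler intact
cooked = "(\be[ a-z][ a-z])"  # Python turns \b into a single backspace character

# The raw string is one character longer: it holds "\" and "b" separately.
length_difference = len(raw) - len(cooked)

raw_match = re.search(raw, "the beg the end")
cooked_match = re.search(cooked, "the beg the end")

print(length_difference)
print(raw_match.group())
print(cooked_match)
```

The non-raw pattern looks for a literal backspace followed by "e", which the sample text does not contain.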

Wildcard Metacharacters

In regular expression patterns, the period is a wildcard, that is, a special character that by default matches any character except a newline (\n). In our example, '\$..,...\.' says in English, "A dollar sign followed by any two characters, except a newline, followed by a comma, then any three non-newline characters and a period." Thus, $XX,XXX. or $00,000. or even a dollar sign, two spaces, a comma, three spaces, and a period makes a match. Of course, this isn't very useful, so we have to narrow our search. One way to do this is with a character class instead of a metacharacter.

Character Classes

A character class is a group of characters enclosed in brackets, only one of which is required for the sequence to match. In our salary example, each period can be replaced with a class of numbers such as [0123456789], which gets the job done but is hard to read. Fortunately, a class can contain a range of characters expressed as [0-9], which makes for easier reading.

r'\$[0-9][0-9],[0-9][0-9][0-9]\.'

This is still a bit unwieldy. Read on.

Escaped Special Characters

A number of escaped characters have special meaning. The \d character, for example, matches any decimal digit and is equal to [0-9], so our expression can be written as

r'\$\d\d,\d\d\d\.'

Easier still, but wait, there's more.

Multiplier Characters

Multiplier characters modify, or multiply, the character to their immediate left. The most popular (and sometimes the most dangerous) is the asterisk (*), which requires zero or more of the characters to its left to produce a match. In our case, all of the digit characters are mandatory, so we need another multiplier, +, which matches one or more of the characters to its left. Now our expression can be written as

r'\$\d+,\d+\.'

This is shorter and easier, but we still have a problem. The + character matches one or more of the preceding characters, which means that a salary such as $9999,99999 would be caught in its net. To avoid this, we use the sequence {min,max}, which requires at least min and no more than max repetitions of the character to the immediate left to produce a match.

Our expression can now be written precisely as

r'\$\d{2,2},\d{3,3}\.'

The only remaining problem is that we'll get salaries as low as $00,000 and as high as $99,999, but we can fix this by converting part of our pattern back to a character class containing a more sensible range. Finally, our expression can be written as

r'\$[5-7]\d,\d{3,3}\.'

which allows salaries only in the range of $50,000 to $79,999.
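To see how each refinement narrows the net, the successive patterns can be run against a few sample strings. This is an illustrative sketch, assuming a CPython-style re module; the sample strings are invented for the comparison.

```python
import re

patterns = [
    r'\$..,...\.',           # wildcards: any characters
    r'\$\d+,\d+\.',          # one or more digits on each side of the comma
    r'\$\d{2,2},\d{3,3}\.',  # exactly two digits, then exactly three
    r'\$[5-7]\d,\d{3,3}\.',  # first digit limited to 5, 6, or 7
]
samples = ["$XX,XXX.", "$9999,99999.", "$20,000.", "$65,500."]

results = {}
for pattern in patterns:
    compiled = re.compile(pattern)
    # Keep only the samples in which this pattern is found.
    results[pattern] = [s for s in samples if compiled.search(s)]
    print("%s -> %s" % (pattern, results[pattern]))
```

Only the last pattern rejects everything except the figure that starts with a digit from 5 to 7.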

Grouping and Backreferences

A very useful feature of Python regular expressions is their ability to reference (or "backreference") text previously matched and reuse it. Suppose we're looking for a salary in the range of $XX,XXX,XXX. We can extend our expression by adding another set of \d{3,3}, but, since we already have this written, we can reuse it instead. To capture the pattern we use the (...) sequence.

r'\$[5-7]\d(,\d{3,3})\.'

The piece enclosed in parentheses is called a group or a subexpression. We can use it to add another three digits to our search.

Groups are referenced within the expression by their numeric position and escaped by a backslash. The group numbers are determined by a count of the open parentheses starting from the left: (1 )(2 )(3 (4 )). Now we can extend our expression like so:

r'\$[5-7]\d(,\d{3,3})\1\.'

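The \1 matches the same text that group 1 matched (here, a comma followed by three digits), so our expression now matches a figure such as $75,000,000., in which the second set repeats the first.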
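One point is worth checking by hand: a backreference matches the same text that the group matched, not merely the same pattern. A small sketch, assuming a CPython-style re module and using salary strings made up for illustration:

```python
import re

salary = re.compile(r'\$[5-7]\d(,\d{3,3})\1\.')

same = salary.search("salary $75,000,000.")       # both sets are ",000"
different = salary.search("salary $75,123,456.")  # ",456" differs from ",123"

print(same.group())
print(same.group(1))
print(different)
```

To accept two different sets of digits, the subexpression itself would have to be repeated, for example with a multiplier such as {2,2} after the group.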

The Match Object: A Brief Introduction

In Python, when a match is found a match object is returned. The match object and its properties will be dealt with later, but we'll give it a brief look here.

Here's our salary example compiled and matched against the list of classifieds, now with some added records and capturing the match object for later use.

>>> from re import *
>>> classifieds = ["School Teacher; salary $20,000.",
...     "Engineer; salary money is no object",
...     "Bicycle Racer; salary $75,000,000."]
>>> cre = compile(r'\$[5-7]\d(,\d{3,3})\1\.')
>>> for ad in classifieds:
...     mo = cre.search(ad)
...     if mo:
...         print ad
...         print mo.group(1)
...

The output is the ad in which the match was found on the first line and only the text matched by group 1 on the second line.

Bicycle Racer; salary $75,000,000.
,000

Alternation

Alternation, which requires the special, or meta-, sequence ...|..., works with any type of expression, whereas the character class operates only with literal characters or ranges. Within a pattern, characters, classes, groups, or special characters separated by a bar (|) need only one item from either side to match.

a|b|c|d matches a or b or c or d
abc|xyz matches abc or xyz
123|[123] matches 123 or 1 or 2 or 3

Within a longer expression, the alternation sequence is used like this:

"(thousands|tons) of copies"

which matches "thousands of copies" or "tons of copies."

We can use alternation to enhance our salary pattern by allowing a figure with two sets of three digits, such as $75,000,000, or with three sets, such as $75,000,000,000. All we have to do is replace the \1 with (\1|\1\1).

r'\$[5-7]\d(,\d{3,3})(\1|\1\1)\.'
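The alternation examples above can be tried out as follows. This is a brief sketch, assuming a CPython-style re module; the sample sentences are invented.

```python
import re

copies = re.compile("(thousands|tons) of copies")

first = copies.search("we sold thousands of copies")
second = copies.search("we sold tons of copies")
third = copies.search("we sold dozens of copies")  # neither alternative fits

# Alternatives are tried left to right at each position in the string.
pieces = re.findall("123|[123]", "123 2 31")

print(first.group(1))
print(second.group(1))
print(third)
print(pieces)
```

The parentheses limit the alternation to the two words; without them, the bar would split the whole expression into "thousands" and "tons of copies".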

Regular Expression Functions and Error and Flags Properties

The following sections describe the individual regular expression functions, including their syntax, use, arguments, and return values. Also described are the error and flag properties.

compile()

compile (pattern[, flags])

The compile() function compiles a regular expression pattern, as a string, into a regular expression object. It enables the object to operate like the search() and match() functions and to be reused for subsequent searches. If the same expression is applied to many searches within a program and/or will never change, precompiling it and reusing it goes faster.

compile()'s arguments are

  • pattern the pattern, as a string, to search for

  • flags one or more of the following variables: I, IGNORECASE; L, LOCALE; M, MULTILINE; S, DOTALL; X, VERBOSE

It returns a regular expression object.

>>> CompiledRE = compile("199[0-9].*")
>>> CompiledRE.search("party like it's 1999")

search()

search (pattern, string[, flags])

The search() function searches for pattern anywhere in string. It has the following arguments:

  • pattern the pattern, as a string, to search for

  • string the string to search in

  • flags one or more of the variables I, IGNORECASE; L, LOCALE; M, MULTILINE; S, DOTALL; X, VERBOSE

It returns a match object:

>>> search("199[0-9]$","party like it's 1999")

or it returns None, which means the search failed.

>>> search("199[0-9]$","party like it's 2009")

match()

match (pattern, string[, flags])

The match() function searches for a pattern at the beginning of a string and has the following arguments:

  • pattern the pattern, as a string, to search for

  • string the string to search in

  • flags one or more of the variables I, IGNORECASE; L, LOCALE; M, MULTILINE; S, DOTALL; X, VERBOSE

It returns a match object:

>>> match("199[0-9]","1999 like party")

or it returns None.

>>> match("199[0-9]","party like it's 1999")

split()

split (pattern, string[, maxsplit = 0])

The split() function splits a string into a list of strings based on occurrences of the regular expression pattern. The string portions that don't match the pattern are returned as a list of strings. If flags are required with the pattern, a regular expression object, obtained with compile(), must be used in place of the pattern string, or the (?iLmsx) sequences (which will be explained later) must be used within the expression. (The compiled regular expression's split() method will achieve the same results.)

split() has the following arguments:

  • pattern the pattern, as a string, to search for

  • string the string to search in

  • maxsplit the maximum number of splits to make (the default, 0, means no limit)

It returns a list of strings.

>>> from re import *
>>> alist = split(":","OS=Linux:Browser=Netscape:WM=Afterstep")
>>> for str in alist:
...     print str

The output is

OS=Linux
Browser=Netscape
WM=Afterstep
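The effect of maxsplit is easiest to see side by side with the default. A minimal sketch, assuming a CPython-style re module (where maxsplit may be passed by keyword); the dictionary step is an illustrative extra that is not part of split() itself.

```python
import re

settings = "OS=Linux:Browser=Netscape:WM=Afterstep"

all_parts = re.split(":", settings)               # maxsplit defaults to 0: no limit
first_only = re.split(":", settings, maxsplit=1)  # stop after the first split

# Break each "key=value" piece apart with an ordinary string split.
config = {}
for part in all_parts:
    key, value = part.split("=", 1)
    config[key] = value

print(first_only)
print(config["Browser"])
```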

sub()

sub (pattern, replacement,string[, count = 0])

The sub() function obtains a string with the non-overlapping occurrences of the expression, starting from the left, replaced by the replacement string. If the optional fourth argument, count, is supplied, only the requested number of occurrences is replaced. If flags are required with the pattern, a regular expression object obtained with compile() must be used in place of the pattern string. (The compiled regular expression's sub() method can achieve the same results.)

If a function is supplied in place of the replacement string, it's executed for each occurrence of the pattern, and a match object is passed as an argument.

sub()'s arguments are

  • pattern the pattern, as a string or a regular expression object, to search for

  • replacement the replacement, as a string or function

  • string the string to search in

  • count a non-negative number, the number of replacements to make (the default, 0, replaces all occurrences in the string)

It returns the modified string if a match is found, or the original argument if there are no matches.

from re import *
from string import *
def toUpper(matchObj):
    return capitalize(matchObj.group(0))
newStr = sub("monkey",toUpper,"monkey see monkey do, ,,")
print newStr

The output is

Monkey see Monkey do, ,,
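The count argument and the function form of the replacement can be combined. The sketch below assumes a CPython-style re module and uses the string method capitalize() in place of the string module's function; the name to_upper is just a stand-in for toUpper.

```python
import re

def to_upper(match_obj):
    # Called once per match; the return value replaces the matched text.
    return match_obj.group(0).capitalize()

text = "monkey see monkey do"

all_replaced = re.sub("monkey", to_upper, text)             # count defaults to 0: all
first_replaced = re.sub("monkey", to_upper, text, count=1)  # only the leftmost match
no_match = re.sub("gorilla", to_upper, text)                # nothing to replace

print(all_replaced)
print(first_replaced)
```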

subn()

subn (pattern, replacement,string[, count = 0])

The subn() function is basically the same as sub() except that it returns a tuple containing the modified or unmodified string and the number of replacements made. Its arguments are

  • pattern the pattern, as a string or a regular expression object, to search for

  • replacement the replacement, as a string or function

  • string the string to search in

  • count a non-negative number, the number of replacements to make (the default, 0, replaces all occurrences in the string)

subn() returns a tuple containing the modified string if a match is found (or the original string argument if there are no matches) and containing the count of replacements made, if any.

from re import *
from string import *
def toUpper(matchObj):
    return capitalize(matchObj.group(0))
tup = subn("monkey",toUpper,"monkey see monkey do, ,,")
print "In \" %s \" there were %s replacements" % tup

The output is

In " Monkey see Monkey do, ,, " there were 2 replacements
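Because subn() returns a two-element tuple, the result can be unpacked directly. A short sketch, assuming a CPython-style re module; the replacement word is arbitrary.

```python
import re

text = "monkey see monkey do"

# Unpack the (new string, number of replacements) tuple.
new_text, count = re.subn("monkey", "ape", text)
unchanged, zero = re.subn("gorilla", "ape", text)

print(new_text)
print(count)
```

A count of 0 is a convenient way to tell that the pattern never matched.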

escape()

escape(string)

escape() escapes all nonalphanumeric characters before passing the expression to the compiler, and returns a string with all metacharacters escaped.

Here's an example in which the compile() function fails because there are unmatched parentheses:

>>> from re import *
>>> strVar = "((group"
>>> cre = compile(strVar)
Traceback (innermost last):
  File "<console>", line 1, in ?
re.error: Unmatched parentheses.

In this example, the non-alphanumeric characters are escaped, so the compile() function works:

>>> cre = compile(escape(strVar))
>>> cre.search("((group 1)group 2)")
org.python.modules.MatchObject@80cceb1
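The same comparison can be written as a script. This is a sketch, assuming a CPython-style re module, in which the exact error message differs from the Jython message shown above; user_text is a hypothetical name for text that is meant to be matched literally.

```python
import re

user_text = "((group"  # unbalanced parentheses: invalid as a pattern

try:
    re.compile(user_text)
    compiled_unescaped = True
except re.error:
    compiled_unescaped = False

# After escaping, the parentheses are ordinary characters.
literal = re.compile(re.escape(user_text))
found = literal.search("((group 1)group 2)")

print(compiled_unescaped)
print(found.group())
```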

error

The error exception is raised when an illegal expression is passed to a function. It has the form

try:
...    match statement
except error,e:
...    handle the error

The optional argument to the except statement is a variable to catch the argument sent to the raise statement at the time of the error. This can be used in the handler statement to alert the user to a mistake.

>>> from re import *
>>> try:
...    search("[[[","This is a test")
... except error,e:
...    print e
...

The output is

Unmatched [] in expression.
>>>
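A handler like this is often wrapped in a small function so that a bad pattern does not stop the program. The sketch below assumes a CPython-style re module and the newer "except ... as" spelling; safe_search is a hypothetical helper, and the error text printed will vary by implementation.

```python
import re

def safe_search(pattern, text):
    """Return a match object, or None if the pattern is invalid or not found."""
    try:
        return re.search(pattern, text)
    except re.error as exc:
        # Report the problem instead of letting the exception propagate.
        print("bad pattern %r: %s" % (pattern, exc))
        return None

good = safe_search("test", "This is a test")
bad = safe_search("[abc", "This is a test")  # unterminated character class
```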

flags

The compile(), search(), and match() functions can take a number of options (or modes). These options are set via the functions' optional flags argument and are the same for all. A common example is IGNORECASE, which specifies the match as case-insensitive. Thus, if the ad list contains

classifieds = ["School Teacher; salary $20,000","Engineer; salary Money is NO OBJECT"]

the expression is written as

mino = compile("money is no object",I)

or

mino = compile("money is no object",IGNORECASE)

The flags available at compile time are listed and described in Table E-1.

Note that, as of this writing, the LOCALE and VERBOSE flags aren't fully implemented; that is, their values are set to 0, which is the same as not being set at all. You can get VERBOSE functionality by setting the flags integer to 32 instead of X or VERBOSE, but check your release first, since in CPython this value is 2.

Table E-1. Optional Flags for Regular Expression Functions

Flag: I (IGNORECASE)
Behavior: Ignore case when searching.
Example:
    search("^exam","Example",I) returns a match object
    search("^exam","Example") returns None

Flag: L (LOCALE)
Behavior: The metacharacters \w, \W, \b, and \B are made to fit the current locale.
Example: (none)

Flag: M (MULTILINE)
Behavior: The metacharacter ^ matches at the beginning of the string and just after each newline character. The $ metacharacter matches at the end of the string and just before each newline character (by default ^ matches only at the beginning of the string and $ matches only at the end of the string).
Example:
    var = "A\nB"
    search("^B",var,M) returns a match object
    search("^B",var) returns None

Flag: S (DOTALL)
Behavior: The "." metacharacter matches any character, including the newline. (The default is for "." to match anything but a newline.)
Example:
    var = "A\nB"
    search(".B",var,S) returns a match object
    search(".B",var) returns None

Flag: X (VERBOSE)
Behavior: Allows formatting of the regular expression. Whitespace within the expression is ignored, except when it's in a character class or preceded by an unescaped backslash. If a # character is inserted, all text from the leftmost # is ignored, except when the # is within a character class or preceded by an unescaped backslash.
Example:
    exp = """
    # Start of regex
    ^t # starts with "t"
    [a-z] # is a lcase char
    * # any qty of
    t$ # ends with "t"
    # end of expression
    """
    regex = compile(exp,X)
    regex.search("test")
    Returns a match object
    regex = compile(exp)
    regex.search("test")
    Returns None

re Object Methods and Properties

The sections that follow describe the syntax, use, and arguments of regular expression object methods and properties.

search()

<re object>.search (string[, pos][, endpos])

The search() method looks for the compiled regular expression anywhere in a string. If the optional parameter pos is supplied, the search starts at position pos. If the optional parameter endpos is supplied, the search continues only until endpos is reached. The default is from position 0 to the end of the string. The return value is a match object, or None if nothing matches.

search() has the following arguments:

  • string the string to search in

  • pos the position in the string to start the search

  • endpos the position in the string to stop the search

Here's an example:

cre = compile("199[0-9]$")
cre.search("party like it's 1999")

match()

<re object>.match (string[, pos][, endpos])

The match() method searches for a regular expression at the beginning of a string. If the optional parameter pos is supplied, the search starts at position pos, which is considered the start. If the optional parameter endpos is supplied, the search ends when endpos is encountered. The default is from position 0 to the end of the string. The return value is a match object, or None if there is no match.

Here's an example that doesn't match because the test isn't at the beginning of the string. Therefore, nothing is printed out.

>>> cre = compile("test")
>>> if (cre.match("This is a test")):
...      print "From beginning \n"

Here's an example in which the starting position is 10. Because position 10 is treated as the beginning of the string, and the expression is found exactly there, a match is returned and the message is printed.

>>> if (cre.match("This is a test",10)):
...      print "From pos 10 - \n"
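The roles of pos and endpos can be compared in a few lines. A minimal sketch, assuming a CPython-style re module; the variable names are chosen for illustration.

```python
import re

cre = re.compile("test")
text = "This is a test"

at_start = cre.match(text)           # "test" is not at position 0
at_ten = cre.match(text, 10)         # matching begins at position 10
anywhere = cre.search(text)          # search() scans the whole string
cut_short = cre.search(text, 0, 12)  # endpos stops before "test" is complete

print(at_start)
print(at_ten.span())
print(anywhere.start())
print(cut_short)
```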

split()

<re object>.split (string[, maxsplit = 0])

The split() method splits a "string" into a list of strings based on the occurrences of the compiled regular expression to which the string is being compared. The portions of the string that don't match are returned as a list of strings.

split()'s arguments are

  • string the string to search in

  • maxsplit the maximum number of splits to make (the default, 0, means no limit)

Here's an example:

>>> from re import *
>>> cre = compile(":")
>>> alist = cre.split("OS=Linux:Browser=Netscape:WM=Afterstep")
>>> for str in alist:
...      print str

The output is

OS=Linux
Browser=Netscape
WM=Afterstep

sub()

<re object>.sub (replacement,string[, count = 0])

The sub() method achieves the same result as the sub() function described previously and has the same arguments (except pattern). Like the sub() function, it returns the modified string if a match is found, or the original string argument if there are no matches.

>>> from re import *
>>> from string import *
>>> def toUpper(matchObj):
...      return capitalize(matchObj.group(0))
...
>>> cre = compile("monkey")
>>> newStr = cre.sub(toUpper,"monkey see monkey do, ,,")
>>> print newStr

The output is

Monkey see Monkey do, ,,

subn()

<re object>.subn (replacement,string[, count = 0])

The subn() method performs similarly to the subn() function described previously. As shown in the following example, it uses the same arguments (except for pattern) and returns a tuple.

from re import *
from string import *
def toUpper(matchObj):
    return capitalize(matchObj.group(0))
cre = compile("monkey")
tup = cre.subn(toUpper,"monkey see monkey do, ,,")
print "In \" %s \" there were %s replacements" % tup

The output is

In " Monkey see Monkey do, ,, " there were 2 replacements

flags

The flags property contains an integer that represents the sum of all flags passed to the compile() function when the object was created, or 0 if no flags were passed. It can be used to determine if a certain set of flags was passed and can take the place of flag arguments to subsequent compile(), search(), or match() function calls.

Finding a Flag Value

You can create a regular expression object with a single flag and check its flags property value (do this one flag at a time; if you try with more than one, the value becomes their sum). Or you can check the underlying object's flag property value, which is the only way to check VERBOSE, since compile("test",X).flags and compile("test",VERBOSE).flags both return 0.

To check the underlying values, import the Perl5Compiler class, and check the properties like this:

>>> from com.oroinc.text.regex import Perl5Compiler
>>> Perl5Compiler.CASE_INSENSITIVE_MASK
1
>>> Perl5Compiler.MULTILINE_MASK
8
>>> Perl5Compiler.SINGLELINE_MASK
16
>>> Perl5Compiler.EXTENDED_MASK
32

flags returns an integer that represents the sum of all flag values used at the time the object was created, or 0 if no flags were used.

>>> from re import *
>>> compile("^f",I).flags
1
>>> compile("^f",M).flags
8
>>> compile("^f",I|M).flags
9
>>> compile("^f",S).flags
16
>>> compile("^f",M|S).flags
24
>>>
>>> cre = compile("^f",M|S)
>>> compile("g$",cre.flags).flags
24
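The integer values shown above belong to the release described here and differ between implementations, so testing individual bits with & is more portable than comparing against literal numbers. A minimal sketch, assuming a CPython-style re module (which may add implementation-specific bits of its own to the flags value):

```python
import re

cre = re.compile("^f", re.M | re.S)

# Test each flag bit rather than comparing against a literal sum.
has_multiline = bool(cre.flags & re.M)
has_dotall = bool(cre.flags & re.S)
has_ignorecase = bool(cre.flags & re.I)

# Reuse the same flags for a second expression.
other = re.compile("g$", cre.flags)

print(has_multiline)
print(has_dotall)
print(has_ignorecase)
```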

groupindex

<re object>.groupindex

The groupindex property gets a dictionary of the group numbers of the expression that creates the regular expression object, with their symbolic group names as the key, if the groups were created with the (?P<key>...) construct. It returns such a dictionary or an empty dictionary.

>>> from re import *
>>> cre = compile(r'(?P<first>est)(?P<last>t)')
>>> dict = cre.groupindex
>>> dict["first"]
1
>>> dict["last"]
2

If the expression contains no named groups, the property is an empty dictionary:

{}
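One use of groupindex is to translate a group name into its number, since a match object accepts either. The sketch below assumes a CPython-style re module; it copies the property into a plain dict because some implementations return a read-only mapping.

```python
import re

cre = re.compile(r'(?P<first>t)(?P<last>est)')
index = dict(cre.groupindex)  # maps each group name to its number

mo = cre.search("test")
by_name = mo.group("last")
by_number = mo.group(index["last"])  # the same group, looked up by number

# An expression without named groups has nothing to report.
unnamed = dict(re.compile("(t)(est)").groupindex)

print(index["first"])
print(by_number)
```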

pattern

<re object>.pattern

The pattern property contains the expression, that is, the pattern string, with which the regular expression object was created. It returns a string containing a regular expression.

>>> from re import *
>>> cre = compile("^t.+t$")
>>> cre.pattern

The output is

'^t.+t$'

Match Object Methods and Properties

The following sections describe the methods and properties of the match object.

group()

<match object>.group([name|num][,name|num]...)

The group() method gets matches of each group of the expression, or all matches if no group is specified. If a single argument is passed, the return value is a string containing the group specified. If no arguments are passed, the return value is a string containing the entire match. If a group is in part of a pattern that doesn't match, the return value is None. If it's in part of the pattern that matches multiple times, the last match is returned.

>>> mo = search(r'(?P<first>t)(?P<last>est)',"test")
>>> mo.group()
'test'
>>> mo.group("first")
't'
>>> mo.group("last")
'est'
>>> mo.group(2)
'est'

If more than one argument is passed, the return value is a tuple.

>>> mo.group("first","last")
('t', 'est')
>>>

groups()

<match object>.groups()

The groups() method returns a tuple containing all matched text for all groups of the pattern, including those that don't match. Elements in the tuple representing groups that don't match have a value of None.

>>> mo = search(r'(?P<first>t)(?P<last>est)+(o)*',"test")
>>> mo.groups()
('t', 'est', None)
>>>

groupdict()

<match object>.groupdict()

The groupdict() method returns a dictionary containing all matched text for all groups of the pattern, including those that don't match. Elements in the dictionary representing unmatched groups have a value of None.

>>> mo = search(r'(?P<first>t)(?P<last>est)+(o)*',"test")
>>> dict = mo.groupdict()
>>> print dict["first"]

start() and end()

<match object>.start([name|num])
<match object>.end([name|num])

The start() and end() methods retrieve the starting and ending positions of the groups matched within the string being searched. The optional argument can be a group number (counting from the left of the expression) or a name if the group was named at the time of creation. If no arguments are passed, the start and end positions are the start and end of the entire matched text. The return value for both methods is an integer.

>>> mo = search(r'(?P<first>t)(?P<last>est)+(o)*',"test")
>>> mo.start(1)
0
>>> mo.start(2)
1
>>> mo.start("last")
1
>>> mo.end("last")
4
>>> mo.start(3)
-1
>>> mo.start()
0
>>> mo.end()
4

span()

<match object>.span([name|num])

The span() method returns the starting and ending positions of a matching group as a two-element tuple. The optional argument is a group number (starting from the left of the expression) or a group name. If no argument is passed, the tuple will contain the start and end positions of the entire matched text. If the group doesn't match at all, the tuple will contain -1, -1.

>>> mo = search(r'(?P<first>t)(?P<last>est)+(o)*',"test")
>>> mo.span(1)
(0, 1)
>>> mo.span(2)
(1, 4)
>>> mo.span(3)
(-1, -1)
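The positions returned by span() are ordinary slice indexes into the searched string. A short sketch, assuming a CPython-style re module; the sample text is invented so that the match does not start at position 0.

```python
import re

text = "a test string"
mo = re.search(r'(?P<first>t)(?P<last>est)+(o)*', text)

whole = mo.span()            # the entire matched text
start, end = mo.span("last")
piece = text[start:end]      # slicing with the span recovers the group's text
unmatched = mo.span(3)       # group 3 took no part in the match

print(whole)
print(piece)
print(unmatched)
```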

pos and endpos

<match object>.pos
<match object>.endpos

The pos property returns the pos value, as an integer, passed to the search() or match() method. The endpos property returns the endpos value, as an integer, passed to those methods.

string

<match object>.string

The string property returns the string that was passed to the search() or match() method, that is, the string that was searched.

re

<match object>.re

The re property gets a reference to the regular expression object from which the match was triggered. It returns a regular expression object.

>>> from re import *
>>> # from a standard re object
>>> cre = compile(r'(?P<first>t)(?P<last>est)+(o)*')
>>> mo = cre.search("test")
>>> mo.re.groupindex
{'last': 2, 'first': 1}
>>> # Retrieve the re object from the search function.
>>> mo = search(r'(?P<first>t)(?P<last>est)+(o)*',"test")
>>> mo.re
org.python.modules.RegexObject@80ce6f2
>>> mo.re.groupindex
{'last': 2, 'first': 1}
>>> mo.re.pattern
'(?P<first>t)(?P<last>est)+(o)*'

Metacharacters

The Python metacharacters, or special characters and sequences, are listed and described in Tables E-2 and E-3.

Table E-2. Python Metacharacters (Single Characters)

Metacharacter: "."
Behavior: Matches any character except a newline, by default. If the DOTALL flag has been specified, this matches any character, including a newline.
Example:
    search(".B","AB") returns a match object
    search(".B","A\nB") returns None

Metacharacter: "^"
Behavior: Matches at the beginning of a string and, if the MULTILINE flag is used, also matches immediately after each newline. Used as the first character of the expression or group to match at the beginning of a string.
Example:
    search("^cat","cats") returns a match object
    search("^cat","a cat") returns None

Metacharacter: "$"
Behavior: Matches at the end of a string and, if the MULTILINE flag is used, also matches immediately before each newline. Used as the last character of the expression or group to match at the end of a line.
Example:
    search("at$","cat") returns a match object
    search("at$","cats") returns None

Metacharacter: "*"
Behavior: Matches zero or more repetitions of the preceding character, group, or character class. None of the immediately preceding characters is required to match, and any number of the immediately preceding characters will match. A greedy multiplier, "*" will match as many repetitions as possible.
Example:
    cre = compile("at*")
    cre.search("a")
    cre.search("at")
    cre.search("att")
    cre = compile("a[td]*")
    cre.search("atdt")
    cre = compile("A(BC)*")
    cre.search("ABCBC")
    all return a match
    search("e[td]*","add") does not return a match

Metacharacter: "+"
Behavior: Matches one or more repetitions of the preceding character, group, or character class. At least one of the immediately preceding characters or groups of characters is required to match, and any number of them will match. A greedy multiplier, "+" will match as many repetitions as possible.
Example:
    cre = compile("at+")
    cre.search("at")
    cre.search("att")
    cre = compile("a[td]+")
    cre.search("atdt")
    cre = compile("A(BC)+")
    cre.search("ABCBC")
    all return a match
    cre = compile("at+")
    cre.search("a") does not return a match

Metacharacter: "?"
Behavior: Matches zero or one repetition of the preceding character, group, or character class. None of the immediately preceding characters or groups of characters is required to match, and at most one of them will match. A greedy multiplier, "?" will match one repetition if it can.
Example:
    var = compile("xa?bc")
    var.search("xbc") returns a match object
    var.search("xaabc") does not return a match
    var.search("xabc") returns a match object

Metacharacter: "*?"
Behavior: Matches zero or more repetitions of the preceding character, group, or character class. None of the immediately preceding characters is required to match, and any number of the immediately preceding characters will match. A nongreedy multiplier, "*?" will match as few repetitions as possible.
Example:
    cre = compile("(:.*:)")
    STR = ":one:a:two:"
    mo = cre.search(STR)
    mo.group(1) returns ':one:a:two:'
    cre = compile("(:.*?:)")
    mo = cre.search(STR)
    mo.group(1) returns ':one:'

Metacharacter: "+?"
Behavior: Matches one or more repetitions of the preceding character, group, or character class. One of the immediately preceding characters is required to match, and any number of the immediately preceding characters will match. A nongreedy multiplier, "+?" will match as few repetitions as possible.
Example:
    cre = compile("(o+)")
    STR = "Pool"
    mo = cre.search(STR)
    mo.group(1) returns 'oo'
    cre = compile("(o+?)")
    mo = cre.search(STR)
    mo.group(1) returns 'o'

Metacharacter: "??"
Behavior: Matches zero or one of the preceding character, group, or character class. None of the immediately preceding characters is required to match, and at most one of the immediately preceding characters will match. A nongreedy multiplier, "??" will match as few characters as possible.
Example:
    cre = compile("(:.?:)")
    mo = cre.search(":::")
    mo.group(1) returns ':::'
    cre = compile("(:.??:)")
    mo = cre.search(":::")
    mo.group(1) returns '::'
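The difference between the greedy and nongreedy multipliers is easiest to see when a delimiter occurs more than once. A small sketch, assuming a CPython-style re module and a made-up sample sentence:

```python
import re

sentence = 'say "one" and "two"'

# Greedy: .* runs on to the last quotation mark in the string.
greedy = re.findall(r'"(.*)"', sentence)

# Nongreedy: .*? stops at the first quotation mark that lets the match succeed.
lazy = re.findall(r'"(.*?)"', sentence)

print(greedy)
print(lazy)
```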

 

Table E-3. Special Escaped Characters
Character Behavior Example
\<No.> Matches the same text matched previously by the group of the same number. Groups are numbered 1 99 and refer to open parentheses counting from the left.
STR = "y(aba)d\1" cre = compile(STR) mo = cre.search("yabadabadoo") matches yabadaba so mo.group() returns 'yabadaba' mo.group(1) returns 'aba'
\A Matches at the start of the string and is equal to the ^ meta-character.
STR = "(\Athe [a-z])" cre = compile(STR) mo = cre.search("the beg the end") mo.group() returns 'the b'
\Z Matches only at the end of the string. This is similar to the $ metacharacter, although $ also matches just before a newline that ends the string.
STR = r"([a-z][a-z][a-z]\Z)"
cre = compile(STR)
mo = cre.search("the beg the end")
mo.group() returns 'end'
\b Matches an empty string at the beginning or ending of a word (that is, a sequence of alphanumeric characters terminated by whitespace or any non-alphanumeric character).
STR = r'(\be[ a-z][ a-z])'
cre = compile(STR)
mo = cre.search("the beg the end")
mo.group() returns 'end'
\B Matches an empty string as long as it's not at the beginning or ending of a word (that is, a position within a word but not between the first or last character and a space, period, etc.).

(Note: \B and \b are anchors. They do not match any literal characters, but match positions within a string.)

STR = r'(\Be[ a-z][ a-z])'
cre = compile(STR)
mo = cre.search("the beg the end")
mo.group() returns 'e b'
STR = r'(\Be\B[ a-z][ a-z])'
cre = compile(STR)
mo = cre.search("the beg the end")
mo.group() returns 'eg '
\d Matches any decimal digit character. This is equal to [0-9].
STR = r'(\B\d[a-z])'
cre = compile(STR)
mo = cre.search("3c59x")
mo.group() returns '9x'
\D Matches any nondigit character. This is equal to [^0-9].
STR = r'(\d\D\d)'
cre = compile(STR)
mo = cre.search("3c59x")
mo.group() returns '3c5'
\s Matches any whitespace character. This is equal to [ \t\n\r\f\v].
STR = r'(\D\sat)'
cre = compile(STR)
mo = cre.search("look at that")
mo.group() returns 'k at'
\S Matches any non-whitespace character. This is equal to [^ \t\n\r\f\v].
STR = r'(\D\Sat)'
cre = compile(STR)
mo = cre.search("look at that")
mo.group() returns 'that'
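\w Matches any alphanumeric character or the underscore. This is equal to [a-zA-Z0-9_].
STR = r'(\D\D\w\b)'
cre = compile(STR)
mo = cre.search("No. Calif")
mo.group() returns 'lif'
\W Matches any character that is neither alphanumeric nor the underscore. This is equal to [^a-zA-Z0-9_].
STR = r'\D\D\W'
cre = compile(STR)
mo = cre.search("No. Calif")
mo.group() returns 'No.'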
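Several of the escapes in Table E-3 are often combined in a single pattern. As a small sketch (the sentence being searched and the names doubled, found, word, and runs are made up for illustration), the following looks for an accidentally repeated word using \b, \w, \s, and a numbered backreference, and then splits the table's "3c59x" string into digit and nondigit runs.

```python
import re

# \b anchors at word edges, (\w+) captures a word, \s+ spans the gap,
# and \1 demands the same text that group 1 matched.
doubled = re.compile(r"\b(\w+)\s+\1\b")

mo = doubled.search("Paris in the the spring")
found = mo.group()    # the whole repeated pair
word = mo.group(1)    # the word that was repeated

# \d+ and \D+ alternate to split a string into digit and nondigit runs.
runs = re.findall(r"\d+|\D+", "3c59x")

print(found)   # the the
print(word)    # the
print(runs)    # ['3', 'c', '59', 'x']
```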

Sequences of Characters

The ...|... sequence is used for alternation and means "or." Its purpose is delimiting multiple subexpressions. The main expression matches if any of the subexpressions match. The | metacharacter has the form

<expression (<subexpression>|<subexpression>)>
<expression>|<expression>

The following expressions both return a match object:

>>> search("cat|dog","cat")
>>> search("a pet( cat| dog)*","I have a pet dog")

This expression also matches:

>>> search("a pet( cat| dog)*","I have a pet")

This expression returns ' dog' (the leading space belongs to the alternative that matched):

>>> mo = search("a pet( cat| dog)*","I have a pet dog")
>>> mo.group(1)

This expression returns 'a pet dog':

>>> mo.group()
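The following sketch walks through the same alternation once more to show exactly what the group holds. The second pattern is a variation that is not in the text above: it moves the space outside the capturing parentheses, using the non-capturing (?:...) form described later in this appendix. The variable names are illustrative.

```python
import re

# The alternative includes its leading space, so the group does too.
mo = re.search(r"a pet( cat| dog)*", "I have a pet dog")
with_space = mo.group(1)

# Variation: keep the space outside the capturing group so that only
# the animal name is saved.
mo2 = re.search(r"a pet(?: (cat|dog))*", "I have a pet dog")
animal = mo2.group(1)

# When the repeated group matches zero times, the overall match still
# succeeds, but the group itself is None.
mo3 = re.search(r"a pet( cat| dog)*", "I have a pet")
whole = mo3.group()
missing = mo3.group(1)

print(repr(with_space))   # ' dog'
print(animal)             # dog
print(whole)              # a pet
print(missing)            # None
```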

The (...) sequence has the form

<expression (<subexpression>)>

It returns a group if the subexpression within the parentheses matches. Any text matched by the group is available through the resulting match object's group() and groups() methods.

>>> STR = r'(t\w\w\w) and (t\w\w\w)'
>>> cre = compile(STR)
>>> mo = cre.search("this and that")
>>> mo.group(1)

returns

'this'

>>> mo.group(2)

returns

'that'

>>> mo.groups()

returns

('this', 'that')

The (?iLmsx) sequence is used as a way to include flags as part of a regular expression. It has the form

<expression (?iLmsx)>

This sequence is useful if you need to use the match() or search() function but have to pass flags with the expression (instead of passing a flag argument to the compile() function and then using its search() or match() methods). The flags can be one or more of the set i, L, m, s, and x. They correspond to the I, L, M, S, and X flags and apply to the entire expression. (Note that, even though the compile object's flags are passed as all uppercase, only the L is passed that way here.) The group containing the sequence itself matches an empty string; if it's sent as the only expression, the match object returned matches the empty string.

Here's an example in which a match object is returned:

>>> search("test(?i)","THIS IS A TEST")
<re.MatchObject instance at 80c87b0>

Here no match object is returned because test doesn't match in the string:

>>> search("test(?i)","THIS IS TEXT")
>>>

This example returns a match object that matches the empty string:

>>> search("(?i)","THIS IS TEST")
<re.MatchObject instance at 80ac768>
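As a rough comparison, the sketch below passes the same case-insensitive flag both ways, once as an argument and once inside the pattern. It places (?i) at the very start of the pattern rather than at the end: recent CPython releases reject global inline flags that appear anywhere else, so that position is the portable choice. The variable names are illustrative.

```python
import re

text = "THIS IS A TEST"

# Flag passed as an argument to search().
by_argument = re.search("test", text, re.I)

# Same flag written inside the pattern, placed first for portability.
by_inline = re.search("(?i)test", text)

# Without either form, the lowercase pattern finds nothing.
no_flag = re.search("test", text)

print(by_argument.group())   # TEST
print(by_inline.group())     # TEST
print(no_flag)               # None
```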

The (?:...) sequence has the form

<expression (?:<subexpression>)>

It treats the subexpression within the parentheses like a regular expression but doesn't return a group if matched. Any text matched by the group is unavailable through the match object returned.

>>> STR = r'(~\w+~)(?:\w+)(~\w+~)'
>>> cre = compile(STR)
>>> mo = cre.search("~one~two~three~")
>>> mo.group(1)

returns

'~one~'

>>> mo.group(2)

returns

'~three~'
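To make the effect on group numbering concrete, here is a short sketch that runs the same search twice, once with ordinary parentheses around the middle part and once with (?:...). The capturing variant is added only for comparison, and the variable names are illustrative.

```python
import re

text = "~one~two~three~"

# Ordinary parentheses: the middle word becomes group 2.
capturing = re.search(r"(~\w+~)(\w+)(~\w+~)", text)

# Non-capturing parentheses: the middle word is matched but not saved,
# so '~three~' moves up to group 2.
non_capturing = re.search(r"(~\w+~)(?:\w+)(~\w+~)", text)

print(capturing.groups())       # ('~one~', 'two', '~three~')
print(non_capturing.groups())   # ('~one~', '~three~')
print(non_capturing.group())    # ~one~two~three~
```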

The (?P<name>...) sequence has the form

<expression (?P<group name> subexpression)>

It matches the subexpression within the parentheses, saving the matched text as a group, just as in regular parentheses, but it allows the group to be given a name that can be referenced later.

This expression:

>>> mo = search(r'(?P<first>t)(?P<last>est)',"test")
>>> mo.group()

returns

'test'

This one:

>>> mo.group("first")

returns

't'

This one:

>>> mo.group("last")

returns

'est'

This one:

>>> mo.group(2)

returns

'est'
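Named groups are handy when a pattern picks apart structured text. The following sketch assumes a made-up "key=value" setting line; the pattern, the sample string, and the variable names are hypothetical. It also uses the match object's groupdict() method, which in CPython returns all named groups as a dictionary; whether your Jython release provides it is worth checking.

```python
import re

# Hypothetical "key=value" setting line, split into two named groups.
pattern = re.compile(r"(?P<key>\w+)=(?P<value>\w+)")
mo = pattern.search("timeout=30")

by_name = mo.group("key")      # look the group up by name
by_number = mo.group(2)        # named groups keep their numbers too
as_dict = mo.groupdict()       # every named group at once

print(by_name)     # timeout
print(by_number)   # 30
```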

The (?P=name) sequence has the form

<expression (?P=name)>

It matches the same text that was matched by the named group defined earlier in the expression.

This example:

>>> cre = compile(r'\w(?P<first>aba)\w(?P=first)doo')
>>> mo = cre.search("yabadabadoo")
>>> mo.group(1)

returns this proper match:

'aba'

This one:

>>> mo.group(2)

returns an error:

Traceback (innermost last):
  File "<console>", line 1, in ?
IndexError: group 2 is undefined

because there's actually only one group; (?P=first) refers back to it rather than defining a new one. The whole match is still available:

>>> mo.group()

returns

'yabadabadoo'
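A common use for a named backreference is requiring that a closing delimiter be the same as the opening one. The sketch below is an illustration only (the pattern, the sample sentence, and the names quoted, opener, and body are assumptions): it captures an opening quote character by name and insists on the same character to close.

```python
import re

# The opening quote is captured as "quote"; (?P=quote) then requires
# the very same character again. The body is matched nongreedily.
quoted = re.compile(r"""(?P<quote>['"])(?P<body>.*?)(?P=quote)""")

text = 'He said "it\'s late" and left'
mo = quoted.search(text)

opener = mo.group("quote")
body = mo.group("body")

print(opener)   # "
print(body)     # it's late
```

The apostrophe inside the quotation does not end the match, because the closing delimiter has to be the double quote that opened it.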

The (?#...) sequence has the form

<expression (?# comment text)>

It allows comments to be inserted in an expression without use of the VERBOSE flag. The contents of the group are ignored.

>>> cre = compile(r'(yaba)(?#silly example)(daba)doo')
>>> mo = cre.search("yabadabadoo")
>>> mo.group(1) returns 'yaba'
>>> mo.group(2) returns 'daba'
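For comparison, here is a sketch of the same pattern commented two ways: with an inline (?#...) group, and with the VERBOSE flag, under which whitespace in the pattern is ignored and # starts a comment. The comment text and variable names are illustrative.

```python
import re

# Inline comment group: works without any flag.
inline = re.compile(r"(yaba)(?#first half)(daba)doo")

# VERBOSE flag: whitespace is ignored and "#" begins a comment,
# so the pattern can be laid out over several lines.
verbose = re.compile(r"""
    (yaba)   # first half
    (daba)   # second half
    doo      # literal tail
""", re.VERBOSE)

inline_groups = inline.search("yabadabadoo").groups()
verbose_groups = verbose.search("yabadabadoo").groups()

print(inline_groups)    # ('yaba', 'daba')
print(verbose_groups)   # ('yaba', 'daba')
```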

The (?=...) sequence has the form

<expression (?=<subexpression>)>

It acts as a conditional lookahead and matches expression only if it's followed by subexpression. Any text matched by the (?=...) group is not saved and so is inaccessible via methods such as group() and groups().

This example doesn't return a match object:

>>> cre = compile(r'Jython (?=1\.1)')
>>> cre.search("Jython is stable")

This one does:

>>> cre.search("Jython 1.1 is stable")
org.python.modules.MatchObject@80ce02a
>>> mo = cre.search("Jython 1.1 is stable")
>>> mo.group()

and returns

'Jython '
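One way to see that a lookahead checks text without consuming it is to look at where the match ends. This sketch reuses the pattern above; the names matched, stopped_at, and rest are illustrative.

```python
import re

cre = re.compile(r"Jython (?=1\.1)")
text = "Jython 1.1 is stable"
mo = cre.search(text)

matched = mo.group()      # only the part outside the lookahead
stopped_at = mo.end()     # the match ends before "1.1"
rest = text[stopped_at:]  # so "1.1" is still there for later matching

print(repr(matched))   # 'Jython '
print(stopped_at)      # 7
print(rest)            # 1.1 is stable
```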

The (?!...) sequence has the form

<expression (?!<subexpression>)>

It's a negative conditional lookahead (the opposite of (?=...)) and matches expression only if not followed by subexpression. Any text matched by its group isn't saved and so isn't accessible by methods such as group() and groups().

This example:

>>> cre = compile(r'Jython (?!1\.1)')
>>> cre.search("Jython is stable")
org.python.modules.MatchObject@80cdb0d

returns a match object. This one:

>>> cre.search("Jython 1.1 is stable")

doesn't return a match object. Going back to the first string:

>>> mo = cre.search("Jython is stable")
>>> mo.group()

returns

'Jython '
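A negative lookahead works well as a filter. In the sketch below, the third string ("Jython 2.0 is planned") is made up for illustration, as are the names lines and kept; the pattern is the one used above.

```python
import re

cre = re.compile(r"Jython (?!1\.1)")

lines = [
    "Jython is stable",
    "Jython 1.1 is stable",
    "Jython 2.0 is planned",   # hypothetical extra line
]

# Keep only the lines where "Jython " is NOT followed by "1.1".
kept = [line for line in lines if cre.search(line)]

print(kept)   # ['Jython is stable', 'Jython 2.0 is planned']
```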

The {min,max} sequence has the form

<expression {<min>,<max>} >

It requires a minimum of min and allows a maximum of max repetitions of the immediately preceding character, group, or character class. A greedy multiplier, it matches as many repetitions as possible.

This example doesn't return a match object:

>>> cre = compile(r"\$\d{2,3}(,\d\d\d){2,5}")
>>> cre.search("salary: $9,000.00")

Nor does this one:

>>> cre.search("salary: $90,000.00")

This one does return a match object:

org.python.modules.MatchObject@80ce140
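The string that produced the match object above is not shown. As an illustration only, the sketch below uses a made-up salary string, "salary: $90,000,000.00", which satisfies the pattern: two or three digits after the dollar sign, followed by at least two comma-separated groups of three digits. The variable names are also assumptions.

```python
import re

cre = re.compile(r"\$\d{2,3}(,\d\d\d){2,5}")

# Only one digit after "$", so \d{2,3} fails.
too_few_digits = cre.search("salary: $9,000.00")

# Only one ",ddd" group, but {2,5} needs at least two.
too_few_groups = cre.search("salary: $90,000.00")

# Hypothetical string with two ",ddd" groups: this one matches.
mo = cre.search("salary: $90,000,000.00")

print(too_few_digits)   # None
print(too_few_groups)   # None
print(mo.group())       # $90,000,000
```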

The {min,max}? sequence has the form

<expression {<min>,<max>}?>

It's the same as {min,max}, but as a nongreedy multiplier, it matches as few repetitions as possible. Thus,

>>> cre = compile("(:.{3,10}:)")
>>> STR = ":one:a:two:"
>>> mo = cre.search(STR)
>>> mo.group(1)

returns

':one:a:two:'

>>> cre = compile("(:.{3,10}?:)")
>>> STR = ":one:a:two:"
>>> mo = cre.search(STR)
>>> mo.group(1)

returns

':one:'

The [...] sequence has the form

<expression [character class]>

It matches any one of the characters within the brackets, listed explicitly or as a range delimited by the - (hyphen) metacharacter (it understands [12345] and [1-5] as the same). Popular ranges are [0-9], [a-z], and [A-Z]. Multiple ranges, such as [a-zA-Z], can also be used.

The following examples all return a match object:

>>> search(r'[RTL]oad',"Toad")
>>> search(r'[RTL]oad',"Road")
>>> search(r'[RTL]oad',"Load")
>>> search(r'199[7-9a-z]',"1998")
>>> search(r'199[7-9a-z]',"1999")
>>> search(r'199[7-9a-z]',"199x")

The [^...] sequence has the form

<expression [^character class]>

The opposite of [...], it matches any character except those listed within the brackets. The excluded characters can be listed explicitly or as a - (hyphen) delimited range.

This example returns a match object:

>>> search(r'199[^1-7]',"1998")

This one doesn't:

>>> search(r'199[^1-7]',"1997")
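To compare a character class with its negated form, the following sketch runs both of the year patterns from this section over one small list. The list itself (including "1996") and the names in_class and outside are made up for illustration.

```python
import re

years = ["1996", "1997", "1998", "199x"]

# [7-9a-z] accepts a final 7, 8, 9, or any lowercase letter.
in_class = [y for y in years if re.search(r"199[7-9a-z]", y)]

# [^1-7] accepts any final character except the digits 1 through 7.
outside = [y for y in years if re.search(r"199[^1-7]", y)]

print(in_class)   # ['1997', '1998', '199x']
print(outside)    # ['1998', '199x']
```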

Putting Things Together

A third-party vendor I (Rick Hightower) worked with once had a problem reading the IDL files we were trying to integrate with its product. Specifically, it didn't like identifiers beginning with ::, as in ::com::mycompany::Employee, preferring com::mycompany::Employee. My job was to convert the IDL files to its liking, which I did using regular expressions. The code I wrote to do this is just four lines long (actually six, but four with a little cheating).

Here's the code to read in the file and write all the replacement text:

f=open(filename)
fstr=re.compile(r'\s((::)(\w+))\b').sub(r' \3', f.read())
f.close(); f=open(filename, "w")
f.write(fstr); f.close()

Here's the complete fixIDL() function:

import re
def fixIDL(filename):
     f=open(filename)
     fstr=re.compile(r'\s((::)(\w+))\b').sub(r' \3', f.read())
     f.close(); f=open(filename, "w")
     f.write(fstr); f.close()

The key is the regular expression r'\s((::)(\w+))\b'. Let's break it down into its most basic components:

  • \s whitespace

  • \w alphanumeric

  • \b word boundary

  • + one or more

Now let's see how these components are used:

  • (::)(\w+) find a sequence of characters that begins with ::, followed by one or more alphanumeric characters

  • \s((::)(\w+)) find a sequence of characters that begins with whitespace, followed by ::, followed by one or more alphanumeric characters

  • \s((::)(\w+))\b find a sequence of characters that begins with whitespace, followed by ::, followed by one or more alphanumeric characters, followed by a word boundary

re.compile

The re.compile call compiles the regular expression. It returns a regex object that has a sub() (substitute) method. The call to regex.sub(r' \3', f.read()) means replace the matched text with a space and the third group.

A group is defined by the parentheses in regex and read from left to right. Thus,

  • ((::)(\w+)) defines group 1

  • (::) defines group 2

  • (\w+) defines group 3

Group 0 is always the whole matched regex. Group 3 is important because it represents the text without the preceding ::.
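The group numbering can be checked directly on a single line of text. In this sketch the IDL-like line is made up for illustration (only the identifier ::com::mycompany::Employee comes from the text above), and the variable names are assumptions.

```python
import re

regex = re.compile(r"\s((::)(\w+))\b")

# Hypothetical line containing the identifier from the example above.
line = "typedef ::com::mycompany::Employee Emp;"
mo = regex.search(line)

whole = mo.group(0)    # whitespace plus "::com"
outer = mo.group(1)    # "::com"
colons = mo.group(2)   # "::"
name = mo.group(3)     # "com", the part worth keeping

# Replacing each match with a space and group 3 drops the leading "::".
fixed = regex.sub(r" \3", line)

print(repr(whole))   # ' ::com'
print(fixed)         # typedef com::mycompany::Employee Emp;
```

Only the leading :: is removed; the :: separators inside the identifier are not preceded by whitespace, so the pattern leaves them alone.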

Now we'll break down the code line by line, each line followed by its explanation.

import re

imports the re (regular expression) module.

def fixIDL(filename):

defines the fixIDL() function with an argument of filename.

f=open(filename)

opens a file corresponding to filename.

fstr=re.compile(r'\s((::)(\w+))\b').sub(r' \3', f.read())

does the following:

  • re.compile(r'\s((::)(\w+))\b') compiles the regular expression into a regex object

  • .sub(r' \3', f.read()) replaces every occurrence of the matched text with a space followed by the third group in the match, returning a string with the text replaced

  • f.read() reads the entire file into a string

f.close(); f=open(filename, "w")

closes the file (f.close()) and reopens it in write mode.

f.write(fstr); f.close()

writes the entire new text into the file and closes the file.

Here's another way to write the code, with one statement per line:

f=open(filename)
fstr=re.compile(r'\s((::)(\w+))\b').sub(r' \3', f.read())
f.close()
f=open(filename, "w")
f.write(fstr)
f.close()

But then I would have lost my bet that I could write this utility package in four lines.
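One side effect of the pattern is that whatever whitespace character precedes the :: (a tab or newline, say) is replaced by a single space. If that matters, a possible variation, which is not the code used for the vendor, is to capture the whitespace and put it back. The sketch below works on a string rather than a file so it can be tried in isolation; the function name strip_leading_scope and the sample IDL text are hypothetical.

```python
import re

# Capture the whitespace (group 1) so it can be restored unchanged,
# and the name after "::" (group 2).
_LEADING_SCOPE = re.compile(r"(\s)::(\w+)\b")

def strip_leading_scope(text):
    """Return text with the leading '::' removed from scoped names."""
    return _LEADING_SCOPE.sub(r"\1\2", text)

# Hypothetical IDL fragment with a tab before the identifier.
sample = "interface Payroll {\n\t::com::mycompany::Employee find();\n};"
fixed = strip_leading_scope(sample)

print(fixed)
```

The tab before the identifier survives, and the result reads com::mycompany::Employee as the vendor wanted.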

Note

This appendix summary was written by Rick Hightower.



Python Programming with the Java™ Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython
ISBN: 0201616165
EAN: 2147483647
Year: 2001
Pages: 25
