Regular Expressions | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

A regular expression is a string used for pattern matching. Regular expressions can be used to search for strings that match a certain pattern, and sometimes to manipulate those strings. Many UNIX System commands (including grep, vi, emacs, sed, and awk) use regular expressions for searching and for text manipulation.

The re module in Python gives you many powerful ways to use regular expressions in your scripts. Only some of the features of re will be covered here. For more information, see the documentation pages at http://docs.python.org/lib/module-re.html

Pattern Matching

In Python, a regular expression object is created with re.compile(). Regular expression objects have many methods for working with strings, including search(), match(), findall(), split(), and sub(). Here’s an example of using a pattern to match a string:

 import re
 maillist = ["alice@wonderland.gov", "mgardner@sciam.bk", "smullyan@puzzleland.bk"]
 emailre = re.compile(r"land")
 for email in maillist:
     if emailre.search(email):
         print email, "is a match."

This example will print the addresses alice@wonderland.gov and smullyan@puzzleland.bk, but not mgardner@sciam.bk. It uses re.compile(r"land") to create an object that can search for the string land. (The r is used in front of a regular expression string to prevent Python from interpreting any escape sequences it might contain.) This script then uses emailre.search(email) to search each e-mail address for land, and prints the ones that match.

You can also use the regular expression functions without first creating a regular expression object. For example, the call re.search(r"land", email) could be used in the if statement in the preceding example, in place of emailre.search(email). In short scripts it may be convenient to eliminate the extra step of calling re.compile(), but when the same pattern is used many times, a regular expression object (emailre, in this example) is generally more efficient, because the pattern does not have to be looked up or compiled again on each call.
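As a brief sketch of the equivalence just described, the following applies both forms to the same list. The names with_object and with_module are illustrative and do not come from the original example, and print() is called with a single argument so that the code should behave the same under Python 2 and Python 3.

```python
import re

maillist = ["alice@wonderland.gov", "mgardner@sciam.bk", "smullyan@puzzleland.bk"]

# Form 1: compile once, then reuse the regular expression object.
emailre = re.compile(r"land")
with_object = [email for email in maillist if emailre.search(email)]

# Form 2: pass the pattern string to the module-level function each time.
with_module = [email for email in maillist if re.search(r"land", email)]

# Both forms select the same addresses.
print(with_object == with_module)
print(with_object)
```

Either form is fine for a handful of strings; the compiled object mainly pays off inside loops over large inputs.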

The method match() is just like search(), except that it only looks for the pattern at the beginning of the string. For example,

 regexp = re.compile(r'kn', re.I)
 for element in ["Knight", "knave", "normal"]:
     if regexp.match(element):
         print regexp.match(element).group()

will find strings that start with “kn”. The re.I option in re.compile(r'kn', re.I) causes the match to ignore case, so this example will also find strings starting with “KN”. The method group() returns the part of the string that matched. The output from this example would look like

 Kn
 kn
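To make the difference between match() and search() concrete, here is a small sketch that runs both on a word that starts with “kn” and on one that only contains it. The word “unknown” and the results dictionary are assumptions added for illustration.

```python
import re

regexp = re.compile(r"kn", re.I)

results = {}
for element in ["Knight", "unknown"]:
    at_start = regexp.match(element)    # succeeds only at the start of the string
    anywhere = regexp.search(element)   # scans the whole string
    results[element] = (at_start is not None, anywhere is not None)
    print("%s: match=%s search=%s" % (element, results[element][0], results[element][1]))
```

Both methods return None when nothing matches, which is why the results can be tested directly in an if statement.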

Constructing Patterns

As you have seen, a string by itself is a regular expression. It matches any string that contains it. For example, venture matches “Adventures”. However, you can create far more interesting regular expressions.

Certain characters have special meanings in regular expressions. Table 23–1 lists these characters, with examples of how they might be used.

Table 23–1: Python Regular Expressions
Char	Definition	Example	Matches
.	Matches any single character.	th.nk	think, thank, thunk, etc.
\	Quotes the following character.	script\.py	script.py
*	Previous item may occur zero or more times in a row.	.*	any string, including the empty string
+	Previous item occurs at least once, and maybe more.	\*+	*, ****, etc.
?	Previous item may or may not occur.	web\.html?	web.htm, web.html
{n,m}	Previous item must occur at least n times but no more than m times.	\*{3,5}	***, ****, *****
( )	Group a portion of the pattern.	script(\.pl)?	script, script.pl
|	Matches either the pattern before or the pattern after the |.	(R|r)af	Raf, raf
[ ]	Matches any one of the characters inside. Frequently used with ranges.	[QqXx]	Q, q, X, or x
[^ ]	Matches any character not inside the brackets.	[^A-Za-z]	any nonalphabetic character, such as 2
\n	Matches whatever was matched by the nth set of parentheses.	(croquet)\1	croquetcroquet
\s	Matches any white space character.	\s	space, tab, newline
\S	Matches any non-white space character.	the\S	then, they, etc. (but not the)
\d	Matches any digit.	\d*	0110, 27, 9876, etc. (and the empty string)
\D	Matches anything that’s not a digit.	\D+	same as [^0-9]+
\w	Matches any letter, digit, or underscore.	\w+	t, AL1c3, Q_of_H, etc.
\W	Matches anything that \w doesn’t match.	\W+	&#$%,* etc.
\b	Matches the beginning or end of a word.	\bcat\b	cat, but not catenary or concatenate
^	Anchor the pattern to the beginning of a string.	^If	any string beginning with If
$	Anchor the pattern to the end of the string.	\.$	any string ending in a period

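Remember that it is usually a good idea to add the character r in front of a regular expression string. Otherwise, Python may interpret backslash sequences in the string before the re module sees them. For example, without the r, "\b" becomes a backspace character instead of the word-boundary pattern.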
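The following sketch tries a few of the table entries and shows what the r prefix changes. The test strings and variable names (plain, raw, whole_word_hits) are made up for illustration.

```python
import re

# Without the r prefix, "\b" is a backspace character, not a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"
print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: both backslashes are kept

# \b anchors the pattern to word boundaries.
word = re.compile(raw)
whole_word_hits = [s for s in ["the cat sat", "concatenate"] if word.search(s)]
print(whole_word_hits)

# "." stands for any single character, so th.nk finds three of these four words.
print(re.findall(r"th.nk", "think thank thunk thick"))

# \1 repeats whatever the first group matched.
print(bool(re.search(r"(croquet)\1", "croquetcroquet")))
```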

Saving Matches

One use of regular expressions is to parse strings by saving the portions of the string that match your pattern. For example, suppose you have an e-mail address, and you want to get just the username part of the address:

 email = 'alice@wonderland.gov'
 parsemail = re.compile(r"(.*)@(.*)")
 (username, domain) = parsemail.search(email).groups()
 print "Username:", username, "Domain:", domain

This example uses the regular expression pattern “(.*)@(.*)” to match the e-mail address. The pattern contains two groups enclosed in parentheses. One group is the set of characters before the @, and the other is the set of characters following the @. The method groups() returns a tuple of the strings that match each of these groups. In this example, those strings are alice and wonderland.gov.
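One thing the example above does not handle is a string with no @ in it: search() then returns None, and calling groups() on None raises an error. A minimal sketch of a guard, using a hypothetical helper named split_address that is not part of the original text, might look like this:

```python
import re

parsemail = re.compile(r"(.*)@(.*)")

def split_address(email):
    """Return (username, domain), or None if the string contains no @."""
    m = parsemail.search(email)
    if m is None:
        return None
    # group(1) and group(2) are the same strings that groups() returns.
    return m.group(1), m.group(2)

print(split_address("alice@wonderland.gov"))
print(split_address("not-an-address"))
```

Because .* is greedy, a string with more than one @ is divided at the last one.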

Finding a List of Matches

In some cases, you may want to find and save a list of all the matches for an expression. For example,

 regexp = re.compile(r"ap*le")
 matchlist = regexp.findall(inputline)

searches for all the substrings of inputline that match the expression "ap*le". This includes strings like ale or apple. If you also want to match capitalized words like Apple, you could use the regular expression

 regexp = re.compile(r"ap*le", re.I)

instead.
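Here is a short sketch of both versions run on a made-up value for inputline (the sample sentence is an assumption, chosen so that the two results differ):

```python
import re

inputline = "Apple pie, ginger ale, and an aple tart"  # made-up sample text

case_sensitive = re.compile(r"ap*le").findall(inputline)
ignore_case = re.compile(r"ap*le", re.I).findall(inputline)

print(case_sensitive)
print(ignore_case)
```

findall() returns the matches in the order they occur, scanning the string from left to right.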

One common use of findall() is to divide a line into sections. For example, the sample program in the earlier section “Variable Scope” used

 splitline = re.findall(r"\w+", line.lower())

to get a list of all the words in line.lower().

Splitting a String

The split() method breaks a string at each occurrence of a certain pattern.

Consider the following line from the file /etc/passwd:

 line = "lewisc:x:3943:100:L. Carroll:/home/lewisc:/bin/bash"

We can use split() to turn the fields from this line into a list:

 passre = re.compile(r":")
 passlist = passre.split(line)
 # passlist = ['lewisc', 'x', '3943', '100', 'L. Carroll', '/home/lewisc', '/bin/bash']

Better yet, we can assign a variable name to each field:

 (logname, passwd, uid, gid, gcos, home, shell) = re.split(r":", line)
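Two details are worth a quick sketch here. First, every field that split() returns is a string, so numeric fields such as the user ID need an explicit conversion. Second, the delimiter can be a real pattern rather than a single character. The guests string below is an invented example.

```python
import re

line = "lewisc:x:3943:100:L. Carroll:/home/lewisc:/bin/bash"
(logname, passwd, uid, gid, gcos, home, shell) = re.split(r":", line)

# The fields come back as strings; convert the numeric ones explicitly.
uid = int(uid)
gid = int(gid)
print(uid + gid)

# The delimiter can be a pattern: here, a comma with optional white space around it.
guests = re.split(r"\s*,\s*", "hatter ,  hare,dormouse")
print(guests)
```

For a fixed one-character delimiter like the colon, the ordinary string method line.split(":") gives the same list; re.split() is mainly useful when the delimiter varies.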

Substitutions

Regular expressions can also be used to substitute text for the part of the string matched by the pattern. In this example, the string “Hello, world” is transformed into “Hello, sailor”:

 hello = "Hello, world"
 hellore = re.compile(r"world")
 newhello = hellore.sub("sailor", hello)

This could also be written as

 hello = "Hello, world"
 newhello = re.sub(r"world", "sailor", hello)

Here's a slightly more interesting example of a substitution. This will replace all the digits in the input with the letter X:

 import re, fileinput
 matchdigit = re.compile(r"\d")
 for line in fileinput.input():
     # The trailing comma keeps print from adding a second newline.
     print matchdigit.sub('X', line),
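The same substitution can be tried without any input files by using a list of strings. The sketch below also shows two related features of sub(): the optional count argument, which limits the number of replacements, and group references such as \1 in the replacement text. The sample strings and variable names are made up for illustration.

```python
import re

matchdigit = re.compile(r"\d")
sample = ["Call 555-0100 before 5 pm", "No digits here"]  # made-up input lines

# Replace every digit with X.
masked = [matchdigit.sub("X", line) for line in sample]
print(masked)

# count limits how many matches are replaced.
first_two = matchdigit.sub("X", "555-0100", count=2)
print(first_two)

# \1 and \2 in the replacement refer to the groups in the pattern.
spoken = re.sub(r"(.*)@(.*)", r"\1 at \2", "alice@wonderland.gov")
print(spoken)
```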