15.2. Special Symbols and Characters

We will now introduce the most popular of the metacharacters, special characters and symbols, which give regular expressions their power and flexibility. You will find the most common of these symbols and characters in Table 15.1.

Table 15.1. Common Regular Expression Symbols and Special Characters

Notation      Example RE                Description

Symbols
literal       foo                       Match literal string value literal
re1|re2       foo|bar                   Match regular expressions re1 or re2
.             b.b                       Match any character (except NEWLINE)
^             ^Dear                     Match start of string
$             /bin/*sh$                 Match end of string
*             [A-Za-z0-9]*              Match 0 or more occurrences of preceding RE
+             [a-z]+\.com               Match 1 or more occurrences of preceding RE
?             goo?                      Match 0 or 1 occurrence(s) of preceding RE
{N}           [0-9]{3}                  Match N occurrences of preceding RE
{M,N}         [0-9]{5,9}                Match from M to N occurrences of preceding RE
[...]         [aeiou]                   Match any single character from character class
[..x-y..]     [0-9], [A-Za-z]           Match any single character in the range from x to y
[^...]        [^aeiou], [^A-Za-z0-9_]   Do not match any character from character class, including any ranges, if present
(*|+|?|{})?   .*?[a-z]                  Apply "non-greedy" versions of above occurrence/repetition symbols (*, +, ?, {})
(...)         ([0-9]{3})?, f(oo|u)bar   Match enclosed RE and save as subgroup

Special Characters
\d            data\d+.txt               Match any decimal digit, same as [0-9] (\D is inverse of \d: do not match any numeric digit)
\w            [A-Za-z_]\w+              Match any alphanumeric character, same as [A-Za-z0-9_] (\W is inverse of \w)
\s            of\sthe                   Match any whitespace character, same as [ \n\t\r\v\f] (\S is inverse of \s)
\b            \bThe\b                   Match any word boundary (\B is inverse of \b)
\nn           price: \16                Match saved subgroup nn (see (...) above)
\c            \., \\, \*                Match any special character c verbatim (i.e., without its special meaning, literal)
\A (\Z)       \ADear                    Match start (end) of string (also see ^ and $ above)


15.2.1. Matching More Than One RE Pattern with Alternation ( | )

The pipe symbol ( | ), a vertical bar on your keyboard, indicates an alternation operation, meaning that it is used to choose from one of the different regular expressions, which are separated by the pipe symbol. For example, below are some patterns that employ alternation, along with the strings they match:

RE Pattern    Strings Matched
at|home       at, home
r2d2|c3po     r2d2, c3po
bat|bet|bit   bat, bet, bit


With this one symbol, we have just increased the flexibility of our regular expressions, enabling the matching of more than just one string. Alternation is also sometimes called union or logical OR.
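
As a minimal sketch of alternation in Python's re module (the candidate strings below are made up for illustration, and fullmatch() assumes Python 3.4 or later), we can check which strings each pattern accepts:

```python
import re

# Patterns are taken from the table above; the candidate strings are
# illustrative assumptions.
alternation = re.compile(r"bat|bet|bit")

# fullmatch() succeeds only when the entire string matches one alternative.
results = {word: alternation.fullmatch(word) is not None
           for word in ("bat", "bet", "bit", "but")}
print(results)

# Each alternative is a complete RE in its own right, not a single character.
droid = re.compile(r"r2d2|c3po")
print(droid.fullmatch("c3po") is not None)   # one of the two alternatives
print(droid.fullmatch("r2po") is not None)   # a mix of both is rejected
```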

15.2.2. Matching Any Single Character ( . )

The dot or period ( . ) symbol matches any single character except for NEWLINE. (Python REs have a compilation flag, S or DOTALL, which overrides this so that the dot matches NEWLINEs as well.) Whether letter, number, whitespace (not including "\n"), printable, non-printable, or a symbol, the dot can match them all.

RE Pattern  Strings Matched
f.o         Any character between "f" and "o", e.g., fao, f9o, f#o, etc.
..          Any pair of characters
.end        Any character before the string end


Q: What if I want to match the dot or period character?

A: In order to specify a dot character explicitly, you must escape its functionality with a backslash, as in "\.".
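
The behavior of the dot, the DOTALL flag, and the escaped dot can be tried out with a short sketch like the following. The test strings are our own assumptions, not part of the original examples:

```python
import re

plain = re.match(r"f.o", "f#o")                # dot matches "#"
newline = re.match(r"f.o", "f\no")             # dot skips NEWLINE by default
dotall = re.match(r"f.o", "f\no", re.DOTALL)   # DOTALL (alias re.S) lifts that limit
literal = re.match(r"3\.14", "3.14")           # "\." matches a real period...
not_literal = re.match(r"3\.14", "3x14")       # ...and nothing else

for name, m in [("plain", plain), ("newline", newline), ("dotall", dotall),
                ("literal", literal), ("not_literal", not_literal)]:
    print(name, m is not None)
```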

15.2.3. Matching from the Beginning or End of Strings or Word Boundaries ( ^ / $ / \b / \B )

There are also symbols and related special characters to specify searching for patterns at the beginning and ending of strings. To match a pattern starting from the beginning, you must use the caret symbol ( ^ ) or the special character \A (backslash-capital "A"). The latter is primarily for keyboards that do not have the caret symbol, i.e., international. Similarly, the dollar sign ( $ ) or \Z will match a pattern from the end of a string. In Python, the two forms are not fully interchangeable: with the M (MULTILINE) flag, ^ and $ also match at the start and end of each line, and $ also matches just before a trailing NEWLINE, whereas \A and \Z match only at the very start and end of the string.

Patterns that use these symbols differ from most of the others we describe in this chapter since they dictate location or position. In the Core Note above, we noted that a distinction is made between "matching," attempting matches of entire strings starting at the beginning, and "searching," attempting matches from anywhere within a string. With that said, here are some examples of "edge-bound" RE search patterns:

RE Pattern      Strings Matched
^From           Any string that starts with From
/bin/tcsh$      Any string that ends with /bin/tcsh
^Subject: hi$   Any string consisting solely of the string Subject: hi


Again, if you want to match either (or both) of these characters verbatim, you must use an escaping backslash. For example, if you wanted to match any string that ended with a dollar sign, one possible RE solution would be the pattern ".*\$$".
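
Here is a small sketch of the edge-bound patterns above, using re.search() with sample strings of our own choosing. The last two lines illustrate one way ^ and \A differ in Python: under the MULTILINE flag, ^ matches at the start of every line, while \A still matches only at the start of the string.

```python
import re

starts = re.search(r"^From", "From: guido")             # "From" at the very start
not_start = re.search(r"^From", "Reply From: guido")    # "From" is not at the start
ends = re.search(r"/bin/tcsh$", "shell=/bin/tcsh")      # string ends with /bin/tcsh
whole = re.search(r"^Subject: hi$", "Subject: hi")      # nothing before or after
dollar = re.search(r".*\$$", "total: 5$")               # "\$" is a literal dollar sign

text = "one\ntwo"
per_line = re.findall(r"^\w+", text, re.MULTILINE)      # start of each line
string_start = re.findall(r"\A\w+", text, re.MULTILINE) # start of the string only

print(per_line, string_start)
```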

The \b and \B special characters pertain to word boundary matches. \b matches at a word boundary, i.e., the position between a word character and a non-word character (or the start or end of the string). Placed at the front of a pattern, it requires the pattern to begin at the start of a word, whether there are any characters in front of it (word in the middle of a string) or not (word at the beginning of a line); placed at the end of a pattern, it requires the pattern to stop at the end of a word. Likewise, \B matches only where there is no word boundary, so a pattern that begins with \B will match only if it appears starting in the middle of a word. Here are some examples:

RE Pattern  Strings Matched
the         Any string containing the
\bthe       Any word that starts with the
\bthe\b     Matches only the word the
\Bthe       Any occurrence of the that does not start at a word boundary, e.g., the "the" in bathe
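
To see where each of these four patterns matches, the following sketch records the starting offsets of every match in one sample sentence (the sentence is an assumption chosen so that "the" appears as a word, as a prefix, and in the middle of a word):

```python
import re

text = "the theory of bathe"

def starts(pattern):
    """Return the starting offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

anywhere = starts(r"the")        # the, THEory, baTHE
word_start = starts(r"\bthe")    # the, THEory
whole_word = starts(r"\bthe\b")  # the
mid_word = starts(r"\Bthe")      # baTHE

print(anywhere, word_start, whole_word, mid_word)
```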


15.2.4. Creating Character Classes ( [ ] )

While the dot is good for allowing matches of any symbols, there may be occasions where there are specific characters you want to match. For this reason, the bracket symbols ( [ ] ) were invented. The regular expression will match any of the enclosed characters. Here are some examples:

RE Pattern        Strings Matched
b[aeiu]t          bat, bet, bit, but
[cr][23][dp][o2]  A string of 4 characters: first is "r" or "c," then "2" or "3," followed by "d" or "p," and finally, either "o" or "2," e.g., c2do, r3p2, r2d2, c3po, etc.


One side note regarding the RE "[cr][23][dp][o2]": a more restrictive version of this RE would be required to allow only "r2d2" or "c3po" as valid strings. Because brackets merely imply "logical OR" functionality among single characters, it is not possible to use brackets to enforce such a requirement. The only solution is to use the pipe, as in "r2d2|c3po".

For single-character REs, though, the pipe and brackets are equivalent. For example, let's start with the regular expression "ab," which matches only the string with an "a" followed by a "b". If we wanted either one-letter string, i.e., either "a" or "b," we could use the RE "[ab]." Because "a" and "b" are individual strings, we can also choose the RE "a|b". However, if we wanted to match either the string "ab" or the string "cd," we cannot use the brackets because they work only for single characters. In this case, the only solution is "ab|cd," similar to the "r2d2|c3po" problem just mentioned.
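
The difference between the bracketed version and the pipe version can be checked with a short sketch. The list of candidate strings is an assumption taken from the examples in the table above:

```python
import re

loose = re.compile(r"[cr][23][dp][o2]")   # any combination, position by position
strict = re.compile(r"r2d2|c3po")         # exactly one of the two strings

candidates = ["r2d2", "c3po", "c2do", "r3p2"]
loose_hits = [s for s in candidates if loose.fullmatch(s)]
strict_hits = [s for s in candidates if strict.fullmatch(s)]
print(loose_hits)
print(strict_hits)

# For single characters, brackets and the pipe accept the same strings.
same = all(
    bool(re.fullmatch(r"[ab]", s)) == bool(re.fullmatch(r"a|b", s))
    for s in ("a", "b", "ab", "c")
)
print(same)
```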

15.2.5. Denoting Ranges ( - ) and Negation ( ^ )

In addition to single characters, the brackets also support ranges of characters. A hyphen between a pair of symbols enclosed in brackets is used to indicate a range of characters, e.g., A-Z, a-z, or 0-9 for uppercase letters, lowercase letters, and numeric digits, respectively. This is a lexicographic range, so you are not restricted to using just alphanumeric characters. Additionally, if a caret ( ^ ) is the first character immediately inside the open left bracket, this symbolizes a directive not to match any of the characters in the given character set.

RE Pattern         Strings Matched
z.[0-9]            "z" followed by any character then followed by a single digit
[r-u][env-y][us]   "r," "s," "t," or "u" followed by "e," "n," "v," "w," "x," or "y" followed by "u" or "s"
[^aeiou]           A non-vowel character (Exercise: Why do we say "non-vowels" rather than "consonants"?)
[^\t\n]            Not a TAB or NEWLINE
["-a]              In an ASCII system, all characters that fall between '"' and "a," i.e., between ordinals 34 and 97
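
A brief sketch of ranges and negation follows. The input strings are illustrative assumptions; ord() is used only to confirm the ordinals quoted in the last row of the table.

```python
import re

digit_after_any = re.fullmatch(r"z.[0-9]", "z#5") is not None
three_classes = re.fullmatch(r"[r-u][env-y][us]", "tvs") is not None
outside_class = re.fullmatch(r"[r-u][env-y][us]", "tas") is not None  # "a" not in [env-y]

# A negated class matches runs of anything except TAB and NEWLINE.
fields = re.findall(r"[^\t\n]+", "a b\tc\nd")

# The range runs by character ordinal, from '"' (34) to "a" (97).
ordinals = (ord('"'), ord("a"))
in_range = re.fullmatch(r'["-a]', "A") is not None     # ord("A") == 65
past_range = re.fullmatch(r'["-a]', "b") is not None   # ord("b") == 98

print(digit_after_any, three_classes, outside_class)
print(fields, ordinals, in_range, past_range)
```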


15.2.6. Multiple Occurrence/Repetition Using Closure Operators ( *, +, ?, { } )

We will now introduce the most common RE notations, namely, the special symbols *, +, and ?, all of which can be used to match single, multiple, or no occurrences of string patterns. The asterisk or star operator ( * ) will match zero or more occurrences of the RE immediately to its left (in language and compiler theory, this operation is known as the Kleene Closure). The plus operator ( + ) will match one or more occurrences of an RE (known as Positive Closure), and the question mark operator ( ? ) will match exactly 0 or 1 occurrences of an RE.

There are also brace operators ( { } ) with either a single value or a comma-separated pair of values. These indicate a match of exactly N occurrences (for {N}) or a range of occurrences, i.e., {M,N} will match from M to N occurrences. These symbols may also be escaped with the backslash, i.e., "\*" matches the asterisk, etc.

In Table 15.1, notice that the question mark is overloaded: on its own it means matching 0 or 1 occurrences, but when it follows any of the closure operators, it directs the regular expression engine to match as few repetitions as possible.

What does that last part mean, "as few ... as possible"? When pattern-matching is employed using the closure operators, the regular expression engine will try to "absorb" as many characters as possible that match the pattern. This is known as being greedy. The question mark tells the engine to lay off and, if possible, take as few characters as possible in the current match, leaving the rest for the pattern that follows (if applicable). We will show you a great example where non-greediness is required toward the end of the chapter. For now, let us continue to look at the closure operators:

RE Pattern                     Strings Matched
[dn]ot?                        "d" or "n," followed by an "o" and, at most, one "t" after that, i.e., do, no, dot, not
0?[1-9]                        Any nonzero digit, possibly prepended with a "0," e.g., the set of numeric representations of the months January to September, whether single- or double-digits
[0-9]{15,16}                   Fifteen or sixteen digits, e.g., credit card numbers
</?[^>]+>                      Strings that match all valid (and invalid) HTML tags
[KQRBNP][a-h][1-8]-[a-h][1-8]  Legal chess move in "long algebraic" notation (move only, no capture, check, etc.), i.e., strings which start with any of "K," "Q," "R," "B," "N," or "P" followed by a hyphenated-pair of chess board grid locations from "a1" to "h8" (and everything in between), with the first coordinate indicating the former position and the second being the new position.
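
The closure operators, and the difference between greedy and non-greedy matching, can be sketched as follows. The sample inputs (word list, month strings, HTML fragment, chess move) are assumptions chosen for illustration:

```python
import re

words = ["do", "no", "dot", "not", "dott", "o"]
short_words = [w for w in words if re.fullmatch(r"[dn]ot?", w)]

months = [s for s in ["1", "09", "0", "10"] if re.fullmatch(r"0?[1-9]", s)]

card = re.fullmatch(r"[0-9]{15,16}", "4" * 16) is not None
too_short = re.fullmatch(r"[0-9]{15,16}", "4" * 14) is not None

html = "<b>bold</b>"
greedy = re.match(r"<.+>", html).group()       # absorbs as much as possible
non_greedy = re.match(r"<.+?>", html).group()  # stops at the first ">"
tags = re.findall(r"</?[^>]+>", html)          # each tag separately

move = re.fullmatch(r"[KQRBNP][a-h][1-8]-[a-h][1-8]", "Ng1-f3") is not None

print(short_words, months, card, too_short)
print(greedy, non_greedy, tags, move)
```

Note how the greedy pattern swallows the entire fragment, while the non-greedy version stops after the opening tag.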


15.2.7. Special Characters Representing Character Sets

We also mentioned that there are special characters that may represent character sets. Rather than using a range of "0-9," you may simply use "\d" to indicate the match of any decimal digit. Another special character, "\w", can be used to denote the entire alphanumeric character class, serving as a shortcut for "A-Za-z0-9_", and "\s" stands for whitespace characters. Uppercase versions of these strings symbolize non-matches, i.e., "\D" matches any character that is not a decimal digit (same as "[^0-9]"), etc.

Using these shortcuts, we will present a few more complex examples:

RE Pattern         Strings Matched
\w+-\d+            Alphanumeric string and number separated by a hyphen
[A-Za-z]\w*        Alphabetic first character, additional characters (if present) can be alphanumeric (almost equivalent to the set of valid Python identifiers [see exercises])
\d{3}-\d{3}-\d{4}  (American) telephone numbers with an area code prefix, as in 800-555-1212
\w+@\w+\.com       Simple e-mail addresses of the form XXX@YYY.com
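
The shortcut classes can be exercised with a sketch like the one below. The sample strings are assumptions. The last few lines rely on Python 3 behavior: for str patterns, "\d" and "\w" match Unicode digits and word characters by default, and the re.ASCII flag restricts them to the ASCII sets described above.

```python
import re

part = re.fullmatch(r"\w+-\d+", "abc-123") is not None
phone = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "800-555-1212") is not None
email = re.fullmatch(r"\w+@\w+\.com", "nobody@example.com") is not None
# "\w" does not match ".", so a host with a subdomain is rejected.
subdomain = re.fullmatch(r"\w+@\w+\.com", "nobody@mail.example.com") is not None

name_ok = re.fullmatch(r"[A-Za-z]\w*", "x_1") is not None
name_bad = re.fullmatch(r"[A-Za-z]\w*", "_x") is not None   # leading "_" rejected

non_digits = re.findall(r"\D", "a1b2")   # the inverse of "\d"

arabic_three = "\u0663"                  # ARABIC-INDIC DIGIT THREE
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

print(part, phone, email, subdomain, name_ok, name_bad)
print(non_digits, unicode_digit, ascii_digit)
```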


15.2.8. Designating Groups with Parentheses ( ( ) )

Now, perhaps we have achieved the goal of matching a string and discarding non-matches, but in some cases, we may also be more interested in the data that we did match. Not only do we want to know whether the entire string matched our criteria, but also whether we can extract any specific strings or substrings that were part of a successful match. The answer is yes. To accomplish this, surround any RE with a pair of parentheses.

A pair of parentheses ( ( ) ) can accomplish either (or both) of the below when used with regular expressions:

  • Grouping regular expressions

  • Matching subgroups

One good example for wanting to group regular expressions is when you have two different REs with which you want to compare a string. Another reason is to group an RE in order to use a repetition operator on the entire RE (as opposed to an individual character or character class).

One side effect of using parentheses is that the substring that matched the pattern is saved for future use. These subgroups can be recalled for the same match or search, or extracted for post-processing. You will see some examples of pulling out subgroups at the end of Section 15.3.9.

Why are matches of subgroups important? The main reason is that there are times where you want to extract the patterns you match, in addition to making a match. For example, what if we decided to match the pattern "\w+-\d+" but wanted to save the alphabetic first part and the numeric second part individually? This may be desired because with any successful match, we may want to see just what those strings were that matched our RE patterns.

If we add parentheses to both subpatterns, i.e., "(\w+)-(\d+)," then we can access each of the matched subgroups individually. Subgrouping is preferred because the alternative is to write code to determine we have a match, then execute another separate routine (which we also had to create) to parse the entire match just to extract both parts. Why not let Python do it, since it is a supported feature of the re module, instead of reinventing the wheel?

RE Pattern                          Strings Matched
\d+(\.\d*)?                         Strings representing simple floating point number, that is, any number of digits followed optionally by a single decimal point and zero or more numeric digits, as in "0.004," "2," "75.," etc.
(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+  First name and last name, with a restricted first name (must start with uppercase; lowercase only for remaining letters, if any), the full name prepended by an optional title of "Mr.," "Mrs.," "Ms.," or "M.," and a flexible last name, allowing for multiple words, dashes, and uppercase letters
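
As a minimal sketch of subgroups, the following uses the group() method of match objects from the re module to pull out the saved pieces. The input strings are our own assumptions; the third example also shows a saved subgroup being recalled inside the same pattern with "\1".

```python
import re

m = re.match(r"(\w+)-(\d+)", "abc-123")
whole, alpha, number = m.group(), m.group(1), m.group(2)

# An optional group that did not take part in the match is reported as None.
floats = {s: re.fullmatch(r"\d+(\.\d*)?", s).group(1)
          for s in ("0.004", "2", "75.")}

# "\1" recalls whatever subgroup 1 matched, here to find a doubled word.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)

name = re.fullmatch(r"(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+",
                    "Mrs. Ada Lovelace-King")
title = name.group(1)

print(whole, alpha, number)
print(floats, repeated, repr(title))
```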




Core Python Programming (2nd Edition)
ISBN: 0132269937
EAN: 2147483647
Year: 2004
Pages: 334
Authors: Wesley J Chun
