15.3. REs and Python

Now that we know all about regular expressions, we can examine how Python currently supports regular expressions through the re module. The re module was introduced to Python in version 1.5. If you are using an older version of Python, you will have to use the now-obsolete regex and regsub modules; these older modules are more Emacs-flavored, are not as full-featured, and are in many ways incompatible with the current re module. Both modules were removed from Python in 2.5, and importing either of them in 2.5 or later raises an ImportError exception.

However, regular expressions are still regular expressions, so most of the basic concepts from this section can be used with the old regex and regsub software. In contrast, the new re module supports the more powerful and regular Perl-style (Perl5) REs, allows multiple threads to share the same compiled RE objects, and supports named subgroups. In addition, there is a transition module called reconvert to help developers move from regex/regsub to re. However, be aware that although there are different flavors of regular expressions, we will primarily focus on the current incarnation for Python.

The re engine was rewritten in 1.6 for performance enhancements as well as adding Unicode support. The interface was not changed, hence the reason the module name was left alone. The new re engine, known internally as sre, thus replaces the existing 1.5 engine, internally called pcre.

15.3.1. `re` Module: Core Functions and Methods

Table 15.2 lists the more popular functions and methods from the re module. Many of these functions are also available as methods of compiled regular expression objects ("regex objects") and RE "match objects." In this subsection, we will look at the two main functions/methods, match() and search(), as well as the compile() function. We will introduce several more in the next section, but for more information on all these and the others that we do not cover, we refer you to the Python documentation.

Table 15.2. Common Regular Expression Functions and Methods
Function/Method	Description
`re` Module Function Only
`compile(pattern, flags=0)`	Compile RE `pattern` with any optional `flags` and return a regex object
`re` Module Functions and regex Object Methods
`match(pattern, string, flags=0)`	Attempt to match RE `pattern` to `string` with optional `flags`; return match object on success, `None` on failure
`search(pattern, string, flags=0)`	Search for first occurrence of RE `pattern` within `string` with optional `flags`; return match object on success, `None` on failure
`findall(pattern, string[,flags])`^[a]	Look for all (non-overlapping) occurrences of `pattern` in `string`; return a list of matches
`finditer(pattern, string[, flags])`^[b]	Same as `findall()` except returns an iterator instead of a list; for each match, the iterator returns a match object
`split(pattern, string, maxsplit=0)`	Split `string` into a list according to RE `pattern` delimiter and return the list of resulting substrings, splitting at most `maxsplit` times (splitting at all occurrences is the default)
`sub(pattern, repl, string, count=0)`	Replace occurrences of the RE `pattern` in `string` with `repl`, substituting all occurrences unless `count` is provided (also see `subn()`, which, in addition, returns the number of substitutions made)
Match Object Methods
`group(num=0)`	Return entire match (or specific subgroup `num`)
`groups()`	Return all matching subgroups in a tuple (empty if there weren't any)

^[a] New in Python 1.5.2; flags parameter added in 2.4.

^[b] New in Python 2.2; flags parameter added in 2.4.

Core Note: RE compilation (to compile or not to compile?)

In Chapter 14, we described how Python code is eventually compiled into bytecode, which is then executed by the interpreter. In particular, we mentioned that calling eval() or exec with a code object rather than a string provides a significant performance improvement due to the fact that the compilation process does not have to be performed. In other words, using precompiled code objects is faster than using strings because the interpreter will have to compile it into a code object (anyway) before execution.

The same concept applies to REs: regular expression patterns must be compiled into regex objects before any pattern matching can occur. For REs that are compared many times during the course of execution, we highly recommend precompiling them because, again, REs have to be compiled anyway, so doing it ahead of time is prudent for performance reasons. re.compile() provides this functionality.

The module functions do cache the compiled objects, though, so it's not as if every search() and match() with the same RE pattern requires compilation. Still, you save the cache lookups and do not have to make function calls with the same string over and over. In Python 1.5.2, this cache held up to 20 compiled RE objects, but in 1.6, due to the additional overhead of Unicode awareness, the compilation engine is a bit slower, so the cache has been extended to 100 compiled regex objects.
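As a minimal sketch of the advice above, the following compiles a pattern once and then reuses the resulting regex object for several strings. The pattern and the sample data (`word_dash_num`, `samples`) are made up for illustration, and the code is written so that it runs under Python 3 (the single-argument print() call also works in Python 2).

```python
import re

# Hypothetical pattern and data, chosen only for illustration.
word_dash_num = re.compile(r'\w+-\d+')     # compile once, up front
samples = ['abc-123', 'abc-xyz', 'x-9 trailing text']

matched = []
for s in samples:
    m = word_dash_num.match(s)             # regex object method; pattern already compiled
    if m is not None:
        matched.append(m.group())

print(matched)
```

Whether the saving is noticeable depends on how often the pattern is used; for a handful of calls the module-level functions and their cache are usually adequate.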

15.3.2. Compiling REs with `compile()`

Almost all of the re module functions we will be describing shortly are available as methods for regex objects. Remember, even with our recommendation, precompilation is not required. If you compile, you will use methods; if you don't, you will just use functions. The good news is that either way, the names are the same whether a function or a method. (This is the reason why there are module functions and methods that are identical, e.g., search(), match(), etc., in case you were wondering.) Since it saves one small step for most of our examples, we will use strings instead. We will throw in a few with compilation, though, just so you know how it is done.

Optional flags may be given as arguments for specialized compilation. These flags allow for case-insensitive matching, using system locale settings for matching alphanumeric characters, etc. Please refer to the documentation for more details. These flags, some of which have been briefly mentioned (e.g., DOTALL, LOCALE), may also be given to the module versions of match() and search() for a specific pattern match attempt. The flags mainly affect compilation, which is why they can be passed to the module versions of match() and search(): those functions compile the RE pattern themselves. If you want to use these flags with the methods, they must already be integrated into the compiled regex objects.

In addition to the methods below, regex objects also have some data attributes, two of which include any compilation flags given as well as the regular expression pattern compiled.
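The sketch below shows a flag supplied in both ways, once at compile time and once to the module-level function, and then inspects the two data attributes mentioned above (`pattern` and `flags`). The strings are illustrative only, and the code assumes Python 3.

```python
import re

# Flag given at compile time; it becomes part of the regex object.
regex = re.compile(r'foo', re.IGNORECASE)
m1 = regex.match('FOOD on the table')

# Same flag given to the module-level function for a one-off attempt.
m2 = re.match(r'foo', 'FOOD on the table', re.IGNORECASE)

print(m1.group())
print(m2.group())
print(regex.pattern)     # the pattern string that was compiled

# flags is an integer bit mask; test it with & rather than ==,
# since the interpreter may set additional bits of its own.
has_ignorecase = bool(regex.flags & re.IGNORECASE)
print(has_ignorecase)
```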

15.3.3. Match Objects and the `group()` and `groups()` Methods

There is another object type in addition to the regex object when dealing with regular expressions, the match object. These are the objects returned on successful calls to match() or search(). Match objects have two primary methods, group() and groups().

group() will either return the entire match, or a specific subgroup, if requested. groups() will simply return a tuple consisting of only/all the subgroups. If there are no subgroups requested, then groups() returns an empty tuple while group() still returns the entire match.

Python REs also allow for named matches, which are beyond the scope of this introductory section on REs. We refer you to the complete re module documentation regarding all the more advanced details we have omitted here.

15.3.4. Matching Strings with `match()`

match() is the first re module function and RE object (regex object) method we will look at. The match() function attempts to match the pattern to the string, starting at the beginning. If the match is successful, a match object is returned, but on failure, None is returned. The group() method of a match object can be used to show the successful match. Here is an example of how to use match() [and group()]:

    >>> m = re.match('foo', 'foo')    # pattern matches string
    >>> if m is not None:             # show match if successful
    ...     m.group()
    ...
    'foo'

The pattern "foo" matches exactly the string "foo." We can also confirm that m is an example of a match object from within the interactive interpreter:

    >>> m                  # confirm match object returned
    <re.MatchObject instance at 80ebf48>

Here is an example of a failed match where None is returned:

    >>> m = re.match('foo', 'bar')    # pattern does not match string
    >>> if m is not None: m.group()   # (1-line version of if clause)
    ...
    >>>

The match above fails, thus None is assigned to m, and no action is taken due to the way we constructed our if statement. For the remaining examples, we will try to leave out the if check for brevity, if possible, but in practice it is a good idea to have it there to prevent AttributeError exceptions (None is returned on failures, which does not have a group() attribute [method].)

A match will still succeed even if the string is longer than the pattern as long as the pattern matches from the beginning of the string. For example, the pattern "foo" will find a match in the string "food on the table" because it matches the pattern from the beginning:

    >>> m = re.match('foo', 'food on the table') # match succeeds
    >>> m.group()
    'foo'

As you can see, although the string is longer than the pattern, a successful match was made from the beginning of the string. The substring "foo" represents the match, which was extracted from the larger string.

We can even sometimes bypass saving the result altogether, taking advantage of Python's object-oriented nature:

    >>> re.match('foo', 'food on the table').group()
    'foo'

Note from a few paragraphs above that an AttributeError will be generated on a non-match.
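One way to keep the None check without repeating it everywhere is to wrap it in a small helper. The function name `matched_text` is hypothetical, introduced here only for illustration; the last part of the sketch shows the AttributeError raised by the chained form when the match fails. The code assumes Python 3.

```python
import re

def matched_text(pattern, string):
    """Return the text matched at the start of string, or None on failure."""
    m = re.match(pattern, string)
    return m.group() if m is not None else None

print(matched_text('foo', 'food on the table'))
print(matched_text('foo', 'bar'))

# The chained form fails when there is no match, because None has no group().
try:
    re.match('foo', 'bar').group()
    chained_failed = False
except AttributeError:
    chained_failed = True
print(chained_failed)
```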

15.3.5. Looking for a Pattern within a String with `search()` (Searching versus Matching)

The chances are greater that the pattern you seek is somewhere in the middle of a string, rather than at the beginning. This is where search() comes in handy. It works in exactly the same way as match() except that it searches for the first occurrence of the given RE pattern anywhere within its string argument. Again, a match object is returned on success and None otherwise.

We will now illustrate the difference between match() and search(). Let us try a longer string match attempt. This time, we will try to match our string "foo" to "seafood":

    >>> m = re.match('foo', 'seafood')     # no match
    >>> if m is not None: m.group()
    ...
    >>>

As you can see, there is no match here. match() attempts to match the pattern to the string from the beginning, i.e., the "f" in the pattern is matched against the "s" in the string, which fails immediately. However, the string "foo" does appear (elsewhere) in "seafood," so how do we get Python to say "yes"? The answer is by using the search() function. Rather than attempting a match, search() looks for the first occurrence of the pattern within the string. search() searches strictly from left to right.

    >>> m = re.search('foo', 'seafood')   # use search() instead
    >>> if m is not None: m.group()
    ...
    'foo'                   # search succeeds where match failed
    >>>
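The difference can also be seen side by side. The sketch below runs both functions over the two strings used so far and records where search() found the pattern; start() is a match object method not otherwise covered in this section, used here only to show the position. The code assumes Python 3.

```python
import re

pattern = 'foo'
results = []
for text in ['food on the table', 'seafood']:
    at_start = re.match(pattern, text)     # only succeeds at index 0
    anywhere = re.search(pattern, text)    # first occurrence, scanning left to right
    results.append((text, at_start is not None, anywhere.start()))

for row in results:
    print(row)
```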

We will be using the match() and search() regex object methods and the group() and groups() match object methods for the remainder of this subsection, exhibiting a broad range of examples of how to use regular expressions with Python. We will be using almost all of the special characters and symbols that are part of the regular expression syntax.

15.3.6. Matching More than One String ( `|` )

In Section 15.2, we used the pipe in the RE "bat|bet|bit." Here is how we would use that RE with Python:

    >>> bt = 'bat|bet|bit'                  # RE pattern: bat, bet, bit
    >>> m = re.match(bt, 'bat')             # 'bat' is a match
    >>> if m is not None: m.group()
    ...
    'bat'
    >>> m = re.match(bt, 'blt')             # no match for 'blt'
    >>> if m is not None: m.group()
    ...
    >>> m = re.match(bt, 'He bit me!')      # does not match string
    >>> if m is not None: m.group()
    ...
    >>> m = re.search(bt, 'He bit me!')     # found 'bit' via search
    >>> if m is not None: m.group()
    ...
    'bit'

15.3.7. Matching Any Single Character ( `.` )

In the examples below, we show that a dot cannot match a NEWLINE or a non-character, i.e., the empty string:

    >>> anyend = '.end'
    >>> m = re.match(anyend, 'bend')        # dot matches 'b'
    >>> if m is not None: m.group()
    ...
    'bend'
    >>> m = re.match(anyend, 'end')         # no char to match
    >>> if m is not None: m.group()
    ...
    >>> m = re.match(anyend, '\nend')       # any char except \n
    >>> if m is not None: m.group()
    ...
    >>> m = re.search('.end', 'The end.')   # matches ' ' in search
    >>> if m is not None: m.group()
    ...
    ' end'
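The DOTALL flag mentioned earlier changes this behavior so that the dot also matches a NEWLINE. The following is a minimal sketch of that one difference, reusing the same pattern; it assumes Python 3.

```python
import re

anyend = '.end'

default_result = re.match(anyend, '\nend')            # dot does not match \n by default
dotall_result = re.match(anyend, '\nend', re.DOTALL)  # with DOTALL, dot matches \n too

print(default_result)
print(repr(dotall_result.group()))
```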

The following is an example of searching for a real dot (decimal point) in a regular expression where we escape its functionality with a backslash:

    >>> patt314 = '3.14'                    # RE dot
    >>> pi_patt = '3\.14'                   # literal dot (dec. point)
    >>> m = re.match(pi_patt, '3.14')       # exact match
    >>> if m is not None: m.group()
    ...
    '3.14'
    >>> m = re.match(patt314, '3014')       # dot matches '0'
    >>> if m is not None: m.group()
    ...
    '3014'
    >>> m = re.match(patt314, '3.14')       # dot matches '.'
    >>> if m is not None: m.group()
    ...
    '3.14'

15.3.8. Creating Character Classes ( `[ ]` )

Earlier, we had a long discussion about "[cr][23][dp][o2]" and how it differs from "r2d2|c3po." With the examples below, we will show that "r2d2|c3po" is more restrictive than "[cr][23][dp][o2]":

    >>> m = re.match('[cr][23][dp][o2]', 'c3po') # matches 'c3po'
    >>> if m is not None: m.group()
    ...
    'c3po'
    >>> m = re.match('[cr][23][dp][o2]', 'c2do') # matches 'c2do'
    >>> if m is not None: m.group()
    ...
    'c2do'
    >>> m = re.match('r2d2|c3po', 'c2do') # does not match 'c2do'
    >>> if m is not None: m.group()
    ...
    >>> m = re.match('r2d2|c3po', 'r2d2') # matches 'r2d2'
    >>> if m is not None: m.group()
    ...
    'r2d2'

15.3.9. Repetition, Special Characters, and Grouping

The most common aspects of REs involve the use of special characters, multiple occurrences of RE patterns, and using parentheses to group and extract submatch patterns. One particular RE we looked at related to simple e-mail addresses ("\w+@\w+\.com"). Perhaps we want to match more e-mail addresses than this RE allows. In order to support an additional hostname in front of the domain, i.e., "www.xxx.com" as opposed to accepting only "xxx.com" as the entire domain, we have to modify our existing RE. To indicate that the hostname is optional, we create a pattern that matches the hostname (followed by a dot), use the ? operator indicating zero or one copy of this pattern, and insert the optional RE into our previous RE as follows: "\w+@(\w+\.)?\w+\.com." As you can see from the examples below, either one or two names are now accepted in front of the ".com":

    >>> patt = '\w+@(\w+\.)?\w+\.com'
    >>> re.match(patt, 'nobody@xxx.com').group()
    'nobody@xxx.com'
    >>> re.match(patt, 'nobody@www.xxx.com').group()
    'nobody@www.xxx.com'

Furthermore, we can even extend our example to allow any number of intermediate subdomain names with the pattern below. Take special note of our slight change from using ? to *: "\w+@(\w+\.)*\w+\.com":

    >>> patt = '\w+@(\w+\.)*\w+\.com'
    >>> re.match(patt, 'nobody@www.xxx.yyy.zzz.com').group()
    'nobody@www.xxx.yyy.zzz.com'

However, we must add the disclaimer that using solely alphanumeric characters does not match all the possible characters that may make up e-mail addresses. The above RE patterns would not match a domain such as "xxx-yyy.com" or other domains with "\W" characters.
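To make the limitation concrete, the sketch below first shows the pattern failing on a hyphenated domain and then tries one possible extension that admits hyphens in each name after the "@". The name `patt_hyphen` and its pattern are assumptions made for illustration, not a complete e-mail address validator. Raw strings are used for the patterns, and the code assumes Python 3.

```python
import re

patt = r'\w+@(\w+\.)*\w+\.com'
plain_result = re.match(patt, 'nobody@xxx-yyy.com')   # fails: '-' is not matched by \w
print(plain_result)

# Hypothetical extension: allow hyphens inside each name after the "@".
patt_hyphen = r'\w+@([\w-]+\.)*[\w-]+\.com'
hyphen_result = re.match(patt_hyphen, 'nobody@xxx-yyy.com')
print(hyphen_result.group())
print(re.match(patt_hyphen, 'nobody@www.xxx.com').group())
```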

Earlier, we discussed the merits of using parentheses to match and save subgroups for further processing rather than coding a separate routine to manually parse a string after an RE match had been determined. In particular, we discussed a simple RE pattern of an alphanumeric string and a number separated by a hyphen, "\w+-\d+," and how adding subgrouping to form a new RE, "(\w+)-(\d+)," would do the job. Here is how the original RE works:

    >>> m = re.match('\w\w\w-\d\d\d', 'abc-123')
    >>> if m is not None: m.group()
    ...
    'abc-123'
    >>> m = re.match('\w\w\w-\d\d\d', 'abc-xyz')
    >>> if m is not None: m.group()
    ...
    >>>

In the above code, we created an RE to recognize three alphanumeric characters followed by a hyphen and three digits. Testing this RE on "abc-123," we obtained positive results while "abc-xyz" fails. We will now modify our RE as discussed before to be able to extract the alphanumeric string and number. Note how we can now use the group() method to access individual subgroups or the groups() method to obtain a tuple of all the subgroups matched:

    >>> m = re.match('(\w\w\w)-(\d\d\d)', 'abc-123')
    >>> m.group()                      # entire match
    'abc-123'
    >>> m.group(1)                     # subgroup 1
    'abc'
    >>> m.group(2)                     # subgroup 2
    '123'
    >>> m.groups()                     # all subgroups
    ('abc', '123')

As you can see, group() is used in the normal way to show the entire match, but can also be used to grab individual subgroup matches. We can also use the groups() method to obtain a tuple of all the substring matches.

Here is a simpler example showing different group permutations, which will hopefully make things even more clear:

    >>> m = re.match('ab', 'ab')       # no subgroups
    >>> m.group()                      # entire match
    'ab'
    >>> m.groups()                     # all subgroups
    ()
    >>>
    >>> m = re.match('(ab)', 'ab')     # one subgroup
    >>> m.group()                      # entire match
    'ab'
    >>> m.group(1)                     # subgroup 1
    'ab'
    >>> m.groups()                     # all subgroups
    ('ab',)
    >>>
    >>> m = re.match('(a)(b)', 'ab')   # two subgroups
    >>> m.group()                      # entire match
    'ab'
    >>> m.group(1)                     # subgroup 1
    'a'
    >>> m.group(2)                     # subgroup 2
    'b'
    >>> m.groups()                     # all subgroups
    ('a', 'b')
    >>>
    >>> m = re.match('(a(b))', 'ab')   # two subgroups
    >>> m.group()                      # entire match
    'ab'
    >>> m.group(1)                     # subgroup 1
    'ab'
    >>> m.group(2)                     # subgroup 2
    'b'
    >>> m.groups()                     # all subgroups
    ('ab', 'b')

15.3.10. Matching from the Beginning and End of Strings and on Word Boundaries

The following examples highlight the positional RE operators. These apply more for searching than matching because match() always starts at the beginning of a string.

    >>> m = re.search('^The', 'The end.')       # match
    >>> if m is not None: m.group()
    ...
    'The'
    >>> m = re.search('^The', 'end. The')       # not at beginning
    >>> if m is not None: m.group()
    ...
    >>> m = re.search(r'\bthe', 'bite the dog') # at a boundary
    >>> if m is not None: m.group()
    ...
    'the'
    >>> m = re.search(r'\bthe', 'bitethe dog')  # no boundary
    >>> if m is not None: m.group()
    ...
    >>> m = re.search(r'\Bthe', 'bitethe dog')  # no boundary
    >>> if m is not None: m.group()
    ...
    'the'

You will notice the appearance of raw strings here. You may want to take a look at the Core Note toward the end of the chapter for clarification on why they are here. In general, it is a good idea to use raw strings with regular expressions.
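The examples above anchor at the beginning of a string and at the start of a word. As a complementary sketch, the following anchors at the end of a string with $ and requires a boundary on both sides of a word. The test strings are made up for illustration, and the code assumes Python 3.

```python
import re

end_hit = re.search(r'end\.$', 'The end.')           # 'end.' sits at the end of the string
end_miss = re.search(r'end\.$', 'The end. Not yet')  # text follows 'end.', so no match
word_hit = re.search(r'\bthe\b', 'bite the dog')     # 'the' as a whole word
word_miss = re.search(r'\bthe\b', 'bite them')       # no boundary after 'the'

print(end_hit.group())
print(end_miss)
print(word_hit.group())
print(word_miss)
```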

There are four other re module functions and regex object methods we think you should be aware of: findall(), sub(), subn(), and split().

15.3.11. Finding Every Occurrence with `findall()`

findall() is new to Python as of version 1.5.2. It looks for all non-overlapping occurrences of an RE pattern in a string. It is similar to search() in that it performs a string search, but it differs from match() and search() in that findall() always returns a list. The list will be empty if no occurrences are found but if successful, the list will consist of all matches found (grouped in left-to-right order of occurrence).

    >>> re.findall('car', 'car')
    ['car']
    >>> re.findall('car', 'scary')
    ['car']
    >>> re.findall('car', 'carry the barcardi to the car')
    ['car', 'car', 'car']

Subgroup searches result in a more complex list returned, and that makes sense, because subgroups are a mechanism that allow you to extract specific patterns from within your single regular expression, such as matching an area code that is part of a complete telephone number, or a login name that is part of an entire e-mail address.

If the pattern contains exactly one subgroup, findall() returns a list of the strings matched by that subgroup, one per successful match. If the pattern contains more than one subgroup, each successful match contributes a tuple of its subgroup matches, and such tuples (one for each successful match) are the elements of the resulting list. This part may sound confusing at first, but if you try different examples, it will help clarify things.
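A few calls illustrate how the shape of the returned list depends on the number of subgroups in the pattern. The sample text is made up for illustration, and the code assumes Python 3.

```python
import re

text = 'abc-123 and xyz-789'    # made-up sample data

no_groups = re.findall(r'\w+-\d+', text)        # no subgroups: whole matches
one_group = re.findall(r'(\w+)-\d+', text)      # one subgroup: list of strings
two_groups = re.findall(r'(\w+)-(\d+)', text)   # two subgroups: list of tuples
single_match = re.findall(r'(\w+)-(\d+)', 'abc-123')

print(no_groups)
print(one_group)
print(two_groups)
print(single_match)
```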

15.3.12. Searching and Replacing with `sub()` [and `subn()`]

There are two functions/methods for search-and-replace functionality: sub() and subn(). They are almost identical and replace all matched occurrences of the RE pattern in a string with some sort of replacement. The replacement is usually a string, but it can also be a function that returns a replacement string. subn() is exactly the same as sub(), but it also returns the total number of substitutions made; both the newly substituted string and the substitution count are returned as a 2-tuple.

    >>> re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
    'attn: Mr. Smith\012\012Dear Mr. Smith,\012'
    >>>
    >>> re.subn('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
    ('attn: Mr. Smith\012\012Dear Mr. Smith,\012', 2)
    >>>
    >>> print re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
    attn: Mr. Smith

    Dear Mr. Smith,

    >>> re.sub('[ae]', 'X', 'abcdef')
    'XbcdXf'
    >>> re.subn('[ae]', 'X', 'abcdef')
    ('XbcdXf', 2)
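The replacement can also be a function, as noted above. In the sketch below, the callback `double` (a hypothetical name chosen for illustration) receives each match object and returns the string to substitute in its place. The code assumes Python 3.

```python
import re

def double(match):
    """Replacement callback: receives a match object, returns a string."""
    return str(int(match.group()) * 2)

order = 'order 3 widgets and 12 bolts'    # made-up sample text
doubled = re.sub(r'\d+', double, order)
doubled_with_count = re.subn(r'\d+', double, order)

print(doubled)
print(doubled_with_count)
```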

15.3.13. Splitting (on Delimiting Pattern) with `split()`

The re module function and regex object method split() work similarly to their string counterpart, but rather than splitting on a fixed string, they split a string based on an RE pattern, adding some significant power to string splitting capabilities. If you do not want the string split for every occurrence of the pattern, you can specify the maximum number of splits by setting a value (other than zero) for the maxsplit argument.

If the delimiter given is not a regular expression that uses special symbols to match multiple patterns, then re.split() works in exactly the same manner as string.split(), as illustrated in the example below (which splits on a single colon):

    >>> re.split(':', 'str1:str2:str3')
    ['str1', 'str2', 'str3']
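Two short variations may help here: limiting the number of splits, and splitting on a delimiter that only an RE can describe. In the re module the limit is passed as the `maxsplit` argument; the second sample string and its delimiter pattern are made up for illustration. The code assumes Python 3.

```python
import re

# Stop after the first split; the remainder is returned unsplit.
limited = re.split(':', 'str1:str2:str3', maxsplit=1)

# Split on a colon, semicolon, or comma followed by optional whitespace.
mixed = re.split(r'[:;,]\s*', 'a: b;c,   d')

print(limited)
print(mixed)
```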

But with regular expressions involved, we have an even more powerful tool. Take, for example, the output from the Unix who command, which lists all the users logged into a system:

    % who
    wesc      console      Jun 20 20:33
    wesc      pts/9        Jun 22 01:38    (192.168.0.6)
    wesc      pts/1        Jun 20 20:33    (:0.0)
    wesc      pts/2        Jun 20 20:33    (:0.0)
    wesc      pts/4        Jun 20 20:33    (:0.0)
    wesc      pts/3        Jun 20 20:33    (:0.0)
    wesc      pts/5        Jun 20 20:33    (:0.0)
    wesc      pts/6        Jun 20 20:33    (:0.0)
    wesc      pts/7        Jun 20 20:33    (:0.0)
    wesc      pts/8        Jun 20 20:33    (:0.0)

Perhaps we want to save some user login information such as login name, teletype they logged in at, when they logged in, and from where. Using string.split() on the above would not be effective, since the spacing is erratic and inconsistent. The other problem is that there is a space between the month, day, and time for the login timestamps. We would probably want to keep these fields together.

You need some way to describe a pattern such as, "split on two or more spaces." This is easily done with regular expressions. In no time, we whip up the RE pattern "\s\s+," which does mean at least two whitespace characters. Let's create a program called rewho.py that reads the output of the who command, presumably saved into a file called whodata.txt. Our rewho.py script initially looks something like this:

    import re
    f = open('whodata.txt', 'r')
    for eachLine in f.readlines():
        print re.split('\s\s+', eachLine)
    f.close()

We will now execute the who command, saving the output into whodata.txt, and then call rewho.py and take a look at the results:

    % who > whodata.txt
    % rewho.py
    ['wesc', 'console', 'Jun 20 20:33\012']
    ['wesc', 'pts/9', 'Jun 22 01:38\011(192.168.0.6)\012']
    ['wesc', 'pts/1', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/2', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/4', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/3', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/5', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/6', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/7', 'Jun 20 20:33\011(:0.0)\012']
    ['wesc', 'pts/8', 'Jun 20 20:33\011(:0.0)\012']

It was a good first try, but not quite correct. For one thing, we did not anticipate a single TAB (ASCII \011) as part of the output (which looked like at least two spaces, right?), and perhaps we aren't really keen on saving the NEWLINE (ASCII \012), which terminates each line. We are now going to fix those problems as well as improve the overall quality of our application by making a few more changes.

First, we would rather run the who command from within the script, instead of doing it externally and saving the output to a whodata.txt file; doing this repeatedly gets tiring rather quickly. To invoke another program from within ours, we call upon the os.popen() command, discussed briefly in Section 14.5.2. Although the who command is available only on Unix systems, the point is to illustrate the functionality of re.split(), which is available on all platforms.

We get rid of the trailing NEWLINEs and add the detection of a single TAB as an additional, alternative re.split() delimiter. Presented in Example 15.1 is the final version of our rewho.py script:

Example 15.1. Split Output of Unix `who` Command (`rewho.py`)

This script calls the who command and parses the input by splitting up its data along various types of whitespace characters.

    #!/usr/bin/env python

    from os import popen
    from re import split

    f = popen('who', 'r')
    for eachLine in f.readlines():
        print split('\s\s+|\t', eachLine.strip())
    f.close()

Running this script, we now get the following (correct) output:

    % rewho.py
    ['wesc', 'console', 'Jun 20 20:33']
    ['wesc', 'pts/9', 'Jun 22 01:38', '(192.168.0.6)']
    ['wesc', 'pts/1', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/2', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/4', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/3', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/5', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/6', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/7', 'Jun 20 20:33', '(:0.0)']
    ['wesc', 'pts/8', 'Jun 20 20:33', '(:0.0)']

A similar exercise can be achieved in a DOS/Windows environment using the dir command in place of who.
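For readers without the who command at hand, the same split can be tried on a couple of hard-coded lines. The sketch below is not a replacement for rewho.py; the two sample lines are modeled on the who output shown earlier (the second with a TAB before the host field), and the code assumes Python 3.

```python
import re

# Sample lines in the style of the who output; the second one contains
# a TAB before the host field, as in the captured data.
lines = [
    'wesc      console      Jun 20 20:33\n',
    'wesc      pts/9        Jun 22 01:38\t(192.168.0.6)\n',
]

parsed = []
for line in lines:
    # Strip the trailing NEWLINE, then split on two or more whitespace
    # characters or on a single TAB.
    parsed.append(re.split(r'\s\s+|\t', line.strip()))

for fields in parsed:
    print(fields)
```

Note that the single spaces inside the timestamp are left alone, which keeps the month, day, and time together in one field.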

While the subject of ASCII characters is still warm, we would like to note that there can be confusion between regular expression special characters and special ASCII symbols. We may use \n to represent an ASCII NEWLINE character, but we may use \d meaning a regular expression match of a single numeric digit. Problems may occur if there is a symbol used by both ASCII and regular expressions, so in the Core Note below, we recommend the use of Python raw strings to prevent any problems. One more caution: the "\w" and "\W" alphanumeric character sets are affected by the L or LOCALE compilation flag and, starting with Python 1.6/2.0, by the Unicode flag (U or UNICODE).

Core Note: Use of Python raw strings

You may have seen the use of raw strings in some of the examples above. Regular expressions were a strong motivation for the advent of raw strings. The reason is the conflict between ASCII characters and regular expression special characters. As a special symbol, "\b" represents the ASCII character for backspace, but "\b" is also a regular expression special symbol, meaning "match on a word boundary." In order for the RE compiler to see the two characters "\b" as your string and not a (single) backspace, you need to escape the backslash in the string by using another backslash, resulting in "\\b."

This can get messy, especially if you have a lot of special characters in your string, adding to the confusion. We were introduced to raw strings back in Chapter 6, and they can be (and are often) used to help keep REs looking somewhat manageable. In fact, many Python programmers swear by these and only use raw strings when defining regular expressions.

Here are some examples of differentiating between the backspace "\b" and the regular expression "\b," with and without raw strings:

    >>> m = re.match('\bblow', 'blow')    # backspace, no match
    >>> if m is not None: m.group()
    ...
    >>> m = re.match('\\bblow', 'blow')   # escaped \, now it works
    >>> if m is not None: m.group()
    ...
    'blow'
    >>> m = re.match(r'\bblow', 'blow')   # use raw string instead
    >>> if m is not None: m.group()
    ...
    'blow'

You may recall that we had no trouble using "\d" in our regular expressions without using raw strings. That is because there is no ASCII equivalent special character, so the regular expression compiler already knew you meant a decimal digit.
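Checking string lengths is a quick way to see what the RE compiler actually receives in each case. The sketch below is only an illustration of that check and assumes Python 3.

```python
import re

backspace = '\b'      # one character: ASCII backspace
escaped = '\\b'       # two characters: a backslash followed by 'b'
raw = r'\b'           # the same two characters, written as a raw string

print(len(backspace))
print(len(escaped))
print(len(raw))
print(escaped == raw)

# Only the two-character forms reach the RE compiler as a word boundary.
print(re.match(backspace + 'blow', 'blow'))
print(re.match(raw + 'blow', 'blow').group())
```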
