Manipulating text/data is a big thing. If you don't believe me, look very
at what computers primarily do today. Word processing, "fill-out-form" Web pages, streams of information coming from a database dump, stock quote information, news feedsthe list goes on and on. Because we may not know the exact text or data that we have programmed our machines to process, it becomes advantageous to be able to express this text or data in patterns that a machine can recognize and take action upon.
If I were running an electronic mail (e-mail) archiving company, and you were one of my customers who
all his or her e-mail sent and received last February, for example, it would be nice if I could set a computer program to
and forward that information to you, rather than having a human being read through your e-mail and process your request manually. You would be horrified (and infuriated) that someone would be rummaging through your messages, even if his or her eyes were
to be looking only at time-stamp. Another example request might be to look for a subject line like "ILOVEYOU" indicating a virus-infected message and remove those e-mail messages from your personal archive. So this begs the question of how we can program machines with the ability to look for patterns in text.
Regular expressions (REs) provide such an infrastructure for advanced text pattern matching, extraction, and/or
functionality. REs are simply strings that use special symbols and
pattern repetition or to represent multiple characters so that they can "match" a set of strings with similar characteristics described by the pattern (Figure 15-1). In other words, they enable matching of multiple stringsan RE pattern that matched only one string would be rather boring and
, wouldn't you say?
Figure 15-1. You can use regular expressions, such as the one here, which recognizes valid Python identifiers. "
" means the first character should be alphabetic, i.e., either A-Z or a-z, followed by at least one (+)
character (\w). In our filter, notice how many strings go into the filter, but the only ones to come out are the ones we asked for via the RE. One example that did not make it was "4xZ" because it starts with a number.
Python supports REs through the standard library
module. In this introductory subsection, we will give you a brief and
introduction. Due to its brevity, only the most common aspects of REs used in everyday Python programming will be covered. Your experience will, of course, vary. We highly recommend reading any of the official supporting documentation as well as external
on this interesting subject. You will never look at strings the same way again!
Note: Searching versus matching
Throughout this chapter, you will find references to searching and matching. When we are
discussing regular expressions with respect to patterns in strings, we will say "matching," referring to the
pattern-matching. In Python terminology, there are two main ways to accomplish
: searching, i.e., looking for a pattern match in any part of a string, and matching, i.e., attempting to match a pattern to an entire string (starting from the beginning). Searches are accomplished using the
function or method, and matching is done with the
function or method. In summary, we keep the term "matching" universal when referencing patterns, and we differentiate between "searching" and "matching" in terms of how Python accomplishes pattern-matching.
15.1.1. Your First Regular Expression
As we mentioned above, REs are strings containing text and special characters that describe a pattern with which to recognize multiple strings. We also
discussed a regular expression
and for general text, the alphabet used for regular expressions is the set of all uppercase and lowercase
plus numeric digits. Specialized alphabets are also possible, for instance, one consisting of only the characters "0" and "1". The set of all strings over this alphabet describes all binary strings, i.e., "0," "1," "00," "01," "10," "11," "100," etc.
Let us look at the most basic of regular expressions now to show you that although REs are sometimes
an "advanced topic," they can also be rather simplistic. Using the standard alphabet for general text, we present some simple REs and the strings that their patterns describe. The following regular expressions are the most basic, "true
," as it were. They simply consist of a string pattern that matches only one string, the string defined by the regular expression. We now present the REs followed by the strings that match them:
The first regular expression pattern from the above chart is "foo." This pattern has no special symbols to match any other symbol other than those described, so the only string that matches this pattern is the string "foo." The same thing applies to "Python" and "abc123." The power of regular expressions comes in when special characters are used to define character sets,
matching, and pattern repetition. It is these special symbols that allow an RE to match a set of strings rather than a single one.