15.2. Special Symbols and Characters
We will now introduce the most popular of the metacharacters, special characters and symbols, which give regular expressions their power and flexibility. You will find the most common of these symbols and characters in Table 15.1.
15.2.1. Matching More Than One RE Pattern with Alternation ( | )
The pipe symbol ( | ), a vertical bar on your keyboard, indicates an alternation operation: it chooses among the different regular expressions that are separated by the pipe symbol. For example, below are some patterns that employ alternation, along with the strings they match:
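As a minimal sketch of alternation using Python's re module, the patterns and test strings below are illustrative choices of ours. We use fullmatch() so that a string is accepted only if it equals one of the alternatives in its entirety.

```python
import re

# Each pattern lists its alternatives separated by the pipe symbol.
greeting = re.compile(r"at|home")
droid = re.compile(r"r2d2|c3po")
word = re.compile(r"bat|bet|bit")

# fullmatch() succeeds only when the whole string is one of the alternatives.
at_ok = greeting.fullmatch("at") is not None
home_ok = greeting.fullmatch("home") is not None
droid_ok = droid.fullmatch("c3po") is not None
but_ok = word.fullmatch("but") is not None  # "but" is not among the alternatives

print(at_ok, home_ok, droid_ok, but_ok)
```

Only "but" is rejected, because it is not one of the three listed alternatives.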
With this one symbol, we have just increased the flexibility of our regular expressions, enabling the matching of more than just one string. Alternation is also sometimes called union or logical OR.
15.2.2. Matching Any Single Character ( . )
The dot or period ( . ) symbol matches any single character except for NEWLINE. (Python REs have a compilation flag, S or DOTALL, which overrides this so that the dot also matches NEWLINEs.) Whether letter, number, whitespace other than "\n", printable, non-printable, or a symbol, the dot can match them all.
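The behavior of the dot can be sketched as follows. The pattern "f.o" and the candidate strings are our own illustrative choices, while re.DOTALL is the flag mentioned above.

```python
import re

candidates = ["foo", "f#o", "f9o", "fo", "f\no"]

# The dot stands for exactly one character, so "fo" is too short,
# and by default it does not match a NEWLINE.
dot_matches = [s for s in candidates if re.fullmatch(r"f.o", s)]

# With re.DOTALL (also spelled re.S), the dot matches NEWLINE as well.
dotall_matches = [s for s in candidates if re.fullmatch(r"f.o", s, re.DOTALL)]

print(dot_matches)
print(dotall_matches)
```

The only difference between the two result lists is the string containing the NEWLINE.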
Q: What if I want to match the dot or period character?
A: In order to specify a dot character explicitly, you must escape its functionality with a backslash, as in "\.".
15.2.3. Matching from the Beginning or End of Strings or Word Boundaries ( ^ / $ / \b / \B )
There are also symbols and related special characters to specify searching for patterns at the beginning and ending of strings. To match a pattern starting from the beginning, you must use the caret symbol ( ^ ) or the special character \A (backslash-capital "A"). The latter is primarily for keyboards that do not have the caret symbol, e.g., some international keyboards. Similarly, the dollar sign ( $ ) or \Z will match a pattern at the end of a string. In Python, the two forms differ when the MULTILINE flag is set: ^ and $ then also match at the start and end of each line, whereas \A and \Z still match only at the very start and end of the string.
Patterns that use these symbols differ from most of the others we describe in this chapter since they dictate location or position. In the Core Note above, we noted that a distinction is made between "matching," attempting matches of entire strings starting at the beginning, and "searching," attempting matches from anywhere within a string. With that said, here are some examples of "edge-bound" RE search patterns:
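A short sketch of edge-bound searches with re.search() follows. The sample strings are assumptions made for illustration. The last two lines show how ^ and \A behave differently under re.MULTILINE.

```python
import re

# ^ anchors the pattern at the start of the string.
starts = re.search(r"^From", "From here to there")
no_start = re.search(r"^From", "Sent From home")

# $ (or \Z) anchors the pattern at the end of the string.
ends = re.search(r"/bin/tcsh$", "shell is /bin/tcsh")
ends_z = re.search(r"tcsh\Z", "shell is /bin/tcsh")

# Using both anchors requires the entire string to be the pattern.
whole = re.search(r"^Subject: hi$", "Subject: hi")

# Under re.MULTILINE, ^ also matches just after a NEWLINE; \A does not.
text = "first line\nFrom the second line"
caret_multiline = re.search(r"^From", text, re.MULTILINE)
backslash_a = re.search(r"\AFrom", text, re.MULTILINE)

print(bool(starts), bool(no_start), bool(ends), bool(whole))
print(bool(caret_multiline), bool(backslash_a))
```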
Again, if you want to match either (or both) of these characters verbatim, you must use an escaping backslash. For example, if you wanted to match any string that ended with a dollar sign, one possible RE solution would be the pattern ".*\$$".
The \b and \B special characters pertain to word boundary matches. \b matches at a word boundary, so a pattern that begins with \b must start at the beginning of a word, whether there are any characters in front of it (word in the middle of a string) or not (word at the beginning of a line). A \b at the end of a pattern likewise requires the match to end where a word ends. Conversely, \B will match a pattern only if it appears starting in the middle of a word (i.e., not at a word boundary). Here are some examples:
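The following sketch exercises \b, \B, and an escaped dollar sign. The sample sentences are our own and serve only as illustration.

```python
import re

sentence = "the theory of bathe, the end"

# \bthe : "the" must begin at the start of a word.
starts_word = re.findall(r"\bthe\w*", sentence)

# \bthe\b : "the" must be a whole word on its own.
whole_word = re.findall(r"\bthe\b", sentence)

# \Bthe : "the" must NOT begin at a word boundary.
inside_word = re.search(r"\Bthe", "bathe")
not_inside = re.search(r"\Bthe", "the end")

# An escaped dollar sign followed by the end-of-string anchor.
ends_with_dollar = re.search(r".*\$$", "cost: 5$")
no_dollar_at_end = re.search(r".*\$$", "$5")

print(starts_word, whole_word)
```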
15.2.4. Creating Character Classes ( [ ] )
While the dot is good for allowing matches of any symbols, there may be occasions where there are specific characters you want to match. For this reason, the bracket symbols ( [ ] ) were invented. A bracketed regular expression matches any one of the enclosed characters. Here are some examples:
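Here is a small sketch of character classes in Python. The word lists are illustrative assumptions. Note that each pair of brackets matches exactly one character, chosen independently of the other pairs.

```python
import re

# One bracket pair: a single vowel (a, e, i, or u) between "b" and "t".
vowel_word = re.compile(r"b[aeiu]t")
words = ["bat", "bet", "bit", "but", "bot", "baet"]
vowel_matches = [w for w in words if vowel_word.fullmatch(w)]

# Four bracket pairs: each position is chosen independently,
# so mixed strings such as "c2do" are accepted too.
loose_droid = re.compile(r"[cr][23][dp][o2]")
names = ["r2d2", "c3po", "c2do", "r3p2", "r2d3"]
loose_matches = [n for n in names if loose_droid.fullmatch(n)]

print(vowel_matches)
print(loose_matches)
```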
One side note regarding the RE "[cr][23][dp][o2]": a more restrictive version of this RE would be required to allow only "r2d2" or "c3po" as valid strings. Because brackets merely imply "logical OR" functionality for a single character position, it is not possible to use brackets to enforce such a requirement. The only solution is to use the pipe, as in "r2d2|c3po".
For single-character REs, though, the pipe and brackets are equivalent. For example, let's start with the regular expression "ab", which matches only the string with an "a" followed by a "b". If we wanted either one-letter string, i.e., either "a" or "b", we could use the RE "[ab]". Because "a" and "b" are individual strings, we can also choose the RE "a|b". However, if we wanted to match either the string "ab" or the string "cd", we cannot use the brackets because they work only for single characters. In this case, the only solution is "ab|cd", similar to the "r2d2|c3po" problem just mentioned.
15.2.5. Denoting Ranges ( - ) and Negation ( ^ )
In addition to single characters, the brackets also support ranges of characters. A hyphen between a pair of symbols enclosed in brackets is used to indicate a range of characters, e.g., A-Z, a-z, or 0-9 for uppercase letters, lowercase letters, and numeric digits, respectively. This is a lexicographic range (ordered by character code), so you are not restricted to using just alphanumeric characters. Additionally, if a caret ( ^ ) is the first character immediately inside the open left bracket, this symbolizes a directive not to match any of the characters in the given character set.
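The points above can be sketched in a few lines of Python. All sample strings are illustrative, and the range "[!-/]" is simply one example of ours showing a range over non-alphanumeric characters.

```python
import re

# The pipe admits exactly the listed strings, unlike a series of brackets.
strict_droid = re.compile(r"r2d2|c3po")
strict_matches = [n for n in ("r2d2", "c3po", "c2do") if strict_droid.fullmatch(n)]

# Ranges: a lowercase letter, any one character, then a digit.
ranged = re.fullmatch(r"[a-z].[0-9]", "x-7")

# Several ranges may share one bracket pair, e.g. hexadecimal digits.
hex_ok = re.fullmatch(r"[0-9a-fA-F]+", "c0ffee")
hex_bad = re.fullmatch(r"[0-9a-fA-F]+", "coffee")

# A range over punctuation: "!" through "/" in character-code order.
punctuation = re.findall(r"[!-/]", "a+b/c")

# Negation: runs of characters that are NOT lowercase vowels.
non_vowels = re.findall(r"[^aeiou]+", "regular")

print(strict_matches, punctuation, non_vowels)
```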
15.2.6. Multiple Occurrence/Repetition Using Closure Operators ( *, +, ?, { } )
We will now introduce the most common RE notations, namely, the special symbols *, +, and ?, all of which can be used to match single, multiple, or no occurrences of string patterns. The asterisk or star operator ( * ) will match zero or more occurrences of the RE immediately to its left (in language and compiler theory, this operation is known as the Kleene Closure). The plus operator ( + ) will match one or more occurrences of an RE (known as Positive Closure), and the question mark operator ( ? ) will match exactly 0 or 1 occurrences of an RE.
There are also brace operators ( { } ) with either a single value or a comma-separated pair of values. These indicate a match of exactly N occurrences (for {N}) or a range of occurrences, i.e., {M,N} will match from M to N occurrences. These symbols may also be escaped with the backslash, i.e., "\*" matches the asterisk, etc.
In the table above, we notice the question mark is used more than once (overloaded). It means either matching 0 or 1 occurrences, or, if it follows any match using the closure operators, it directs the regular expression engine to match as few repetitions as possible.
What does that last part mean, "as few ... as possible"? When pattern-matching is employed using the closure operators, the regular expression engine will try to "absorb" as many characters as possible that match the pattern. This is known as being greedy. The question mark tells the engine to lay off and, if possible, take as few characters as possible in the current match, leaving as many of the succeeding characters as possible for the next pattern (if applicable). We will show you a great example where non-greediness is required toward the end of the chapter. For now, let us continue to look at the closure operators:
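A brief sketch of the closure operators and of greedy versus non-greedy matching follows. The patterns and the small markup string are our own illustrative choices, not examples taken from later in the chapter.

```python
import re

samples = ["", "a", "aaa"]

# * allows zero or more occurrences; + requires at least one.
star = [s for s in samples if re.fullmatch(r"a*", s)]
plus = [s for s in samples if re.fullmatch(r"a+", s)]

# ? makes the preceding RE optional (0 or 1 occurrences).
optional = [s for s in ("color", "colour", "colouur")
            if re.fullmatch(r"colou?r", s)]

# {N} is an exact count; {M,N} is an inclusive range of counts.
exact = re.fullmatch(r"[0-9]{3}", "123")
counted = [s for s in ("1", "12", "1234", "12345")
           if re.fullmatch(r"[0-9]{2,4}", s)]

# Greedy: .+ absorbs as much as it can while still allowing a match.
text = "<b>bold</b>"
greedy = re.match(r"<.+>", text).group()

# Non-greedy: .+? takes as few characters as possible.
lazy = re.match(r"<.+?>", text).group()

print(greedy)
print(lazy)
```

The greedy pattern consumes the whole string up to the final ">", while the non-greedy one stops at the first ">" it can.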
15.2.7. Special Characters Representing Character Sets
We also mentioned that there are special characters that may represent character sets. Rather than using a range of "0-9", you may simply use "\d" to indicate the match of any decimal digit. Another special character, "\w", can be used to denote the entire alphanumeric character class, serving as a shortcut for "[A-Za-z0-9_]", and "\s" stands for whitespace characters. Uppercase versions of these strings symbolize non-matches, i.e., "\D" matches any character that is not a decimal digit (same as "[^0-9]"), etc. Using these shortcuts, we will present a few more complex examples:
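The shortcuts can be tried out as in the sketch below. The phone number and other strings are made up for illustration. One caveat we add for Python 3: with str patterns, "\d", "\w", and "\s" also match non-ASCII Unicode characters unless the re.ASCII flag is given, so the "[A-Za-z0-9_]" equivalence holds exactly only in ASCII mode.

```python
import re

# \d : a phone-number-like pattern built from digit shortcuts.
phone = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "800-555-1212")

# \w : runs of alphanumeric characters and underscores.
words = re.findall(r"\w+", "spam_1 and eggs-2")

# \s : split on any run of whitespace (spaces, tabs, NEWLINEs).
fields = re.split(r"\s+", "a  b\tc\nd")

# \D : remove every character that is not a decimal digit.
digits_only = re.sub(r"\D", "", "(800) 555-1212")

# In Python 3, \w matches Unicode letters unless re.ASCII is set.
unicode_word = re.fullmatch(r"\w+", "na\u00efve")
ascii_word = re.fullmatch(r"\w+", "na\u00efve", re.ASCII)

print(words, fields, digits_only)
```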
15.2.8. Designating Groups with Parentheses ( ( ) )
Now, perhaps we have achieved the goal of matching a string and discarding non-matches, but in some cases, we may also be more interested in the data that we did match. Not only do we want to know whether the entire string matched our criteria, but we may also want to extract specific strings or substrings that were part of a successful match. To accomplish this, surround any RE with a pair of parentheses. A pair of parentheses ( ( ) ) can accomplish either (or both) of the following when used with regular expressions: grouping regular expressions and matching subgroups.
One good example for wanting to group regular expressions is when you have two different REs with which you want to compare a string. Another reason is to group an RE in order to use a repetition operator on the entire RE (as opposed to an individual character or character class).
One side effect of using parentheses is that the substring that matched the pattern is saved for future use. These subgroups can be recalled for the same match or search, or extracted for post-processing. You will see some examples of pulling out subgroups at the end of Section 15.3.9.
Why are matches of subgroups important? The main reason is that there are times when you want to extract the patterns you match, in addition to making a match. For example, what if we decided to match the pattern "\w+-\d+" but wanted to save the alphanumeric first part and the numeric second part individually? This may be desired because with any successful match, we may want to see just what those strings were that matched our RE patterns.
If we add parentheses to both subpatterns, i.e., "(\w+)-(\d+)", then we can access each of the matched subgroups individually. Subgrouping is preferred because the alternative is to write code to determine we have a match, then execute another separate routine (which we also had to create) to parse the entire match just to extract both parts. Why not let Python do it, since it is a supported feature of the re module, instead of reinventing the wheel?
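As a minimal sketch of both uses of parentheses, the example below applies the pattern "(\w+)-(\d+)" from the text to a sample string of our own choosing, and then shows grouping for repetition and for alternation. It relies only on the match object's group() and groups() methods.

```python
import re

# Subgroups: save the two halves of the match separately.
m = re.match(r"(\w+)-(\d+)", "abc-123")
whole = m.group()      # the entire match
first = m.group(1)     # what the first pair of parentheses matched
second = m.group(2)    # what the second pair of parentheses matched
both = m.groups()      # all subgroups as a tuple

# Grouping for repetition: + applies to "ab" as a unit.
repeated = re.fullmatch(r"(ab)+", "ababab")
not_repeated = re.fullmatch(r"(ab)+", "abba")

# Grouping for alternation: the pipe is confined to the parentheses.
title = re.fullmatch(r"(Mr|Mrs)\. Smith", "Mrs. Smith")

print(whole, first, second, both)
print(title.group(1))
```

Without the parentheses, separating the two halves would require a second parsing step after the match.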