The "expression" in "regular expression" is there because regular expressions are constructed and parsed using grammatical rules that are similar to those used for arithmetic expressions. Although regular expressions serve a very different purpose, understanding the similarities between the two will help you write better regular expressions, and hence better Perl.
Regular expressions in Perl are made up of atoms. Atoms are connected by operators like repetition, sequence, and alternation. Most regular expression atoms are single-character matches. For example:
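The listing that followed here is not preserved, so the sketch below offers a few single-character atoms as a stand-in. The helper name `matches_one_char` and the sample characters are illustrative assumptions, not taken from the original.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $atom, used as the whole pattern,
# matches the single character $char and nothing more.
sub matches_one_char {
    my ($atom, $char) = @_;
    return $char =~ /\A(?:$atom)\z/ ? 1 : 0;
}

my @examples = (
    [ 'a',     'a'  ],   # a literal letter
    [ '\$',    '$'  ],   # an escaped metacharacter
    [ '\n',    "\n" ],   # a newline escape
    [ '[a-z]', 'q'  ],   # a character class
    [ '.',     '#'  ],   # any character except newline
);

for my $pair (@examples) {
    my ($atom, $char) = @$pair;
    printf "%-6s %s\n", $atom,
        matches_one_char($atom, $char) ? "matches" : "fails";
}
```

Each of these atoms consumes exactly one character of the target string, however many characters it takes to write the atom itself.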
There are also special "zero-width" atoms. For example:
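As a rough illustration (the sample text is an assumption chosen for this sketch), `\b` and `^` constrain where a match may occur without consuming any characters themselves:

```perl
use strict;
use warnings;

my $text = "cat concatenate";

my @bounded = $text =~ /\bcat\b/g;   # "cat" as a whole word only
my @loose   = $text =~ /cat/g;       # also the "cat" inside "concatenate"

# A zero-width atom on its own matches without consuming any characters.
my $width;
if ("abc" =~ /^/) {
    $width = $+[0] - $-[0];          # end offset minus start offset
}

print scalar(@bounded), " bounded, ", scalar(@loose),
      " loose, width $width\n";
```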
Atoms are modified and/or joined together by regular expression operators. As in arithmetic expressions, there is an order of precedence among these operators:
Highest: grouping, such as ( ) and (?: )
Next: repetition, such as ? + * and {m,n}
Next: sequence, that is, atoms written side by side
Lowest: alternation, written |
Fortunately, there are only four precedence levels. Imagine if there were as many as there are for arithmetic expressions! Parentheses and the other grouping operators [1] have the highest precedence.
A repetition operator binds tightly to its argument, which is either a single atom or a grouping operator:
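A small sketch of the difference, using patterns made up for this illustration: in the first pattern the star applies only to the atom `b`, and in the second it applies to the whole group.

```perl
use strict;
use warnings;

my $star_on_atom  = qr/^ab*$/;       # * applies to b only: a, ab, abb, ...
my $star_on_group = qr/^(?:ab)*$/;   # * applies to the group: "", ab, abab, ...

for my $s ("abbb", "abab") {
    printf "%-5s atom:%s group:%s\n", $s,
        ($s =~ $star_on_atom  ? "yes" : "no"),
        ($s =~ $star_on_group ? "yes" : "no");
}
```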
Placing two atoms side by side is called sequence. Sequence is a kind of operator, even though it is written without punctuation. This is similar to the invisible multiplication operator in a mathematical expression like y = ax + b. To illustrate this, let's suppose that sequence were actually represented with a visible character, "•". Then the above examples would look like:
The last entry in the precedence chart is alternation. Let's continue to use the "•" notation for a moment:
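In ordinary Perl syntax, the low precedence of alternation can be checked directly. The sketch below is only an illustration: the patterns and the helper `matched_text` are assumptions introduced here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text a pattern matched, or undef.
sub matched_text {
    my ($re, $s) = @_;
    return $s =~ $re ? substr($s, $-[0], $+[0] - $-[0]) : undef;
}

my $loose   = qr/gra|ey/;        # alternation binds last: "gra" or "ey"
my $grouped = qr/gr(?:a|e)y/;    # grouping limits the alternation to a|e

print matched_text($loose,   "grey"), "\n";   # only the "ey" part matched
print matched_text($grouped, "grey"), "\n";   # the whole word matched
```

Because sequence binds more tightly than alternation, `/gra|ey/` is a choice between the sequences `gra` and `ey`, not `gr` followed by a choice of `a` or `e` followed by `y`.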
The zero-width atoms, for example ^ and \b, group in the same way as other atoms:
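A sketch with made-up patterns: because `\b` is sequenced with its neighbors like any other atom, each `\b` below belongs to only one branch of the alternation unless a group says otherwise.

```perl
use strict;
use warnings;

my $loose = qr/\bcat|dog\b/;        # (\bcat) or (dog\b)
my $tight = qr/\b(?:cat|dog)\b/;    # the whole word "cat" or "dog"

for my $word ("catalog", "bulldog", "dog") {
    printf "%-8s loose:%s tight:%s\n", $word,
        ($word =~ $loose ? "yes" : "no"),
        ($word =~ $tight ? "yes" : "no");
}
```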
It's easy to forget about precedence. Removing excess parentheses is a noble pursuit, especially within regular expressions, but be careful not to remove too many:
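The original listing is missing here. The pattern in the sketch below is reconstructed from the parenthesized form quoted in the surrounding text, and the header lines and the helper `try_header` are assumptions made for illustration.

```perl
use strict;
use warnings;

# Too few parentheses: this reads as (^Sender) or (From:\s+(.*)).
my $loose = qr/^Sender|From:\s+(.*)/;

# Hypothetical helper: describe what a pattern did with a header line.
sub try_header {
    my ($re, $line) = @_;
    return "no match" unless $line =~ $re;
    return defined $1 ? "captured: $1" : "matched, nothing captured";
}

for my $line ('Sender: alice@example.com',
              'X-Resent-From: bob@example.com') {
    print "$line => ", try_header($loose, $line), "\n";
}
```

The first line matches without capturing the address, and the second matches even though `From:` is not at the start of the line.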
The pattern was meant to match Sender: and From: lines in a mail header, but it actually matches something somewhat different. Here it is with some parentheses added to clarify the precedence:
/(^Sender)|(From:\s+(.*))/;
Adding a pair of parentheses, or perhaps memory-free parentheses (?:), fixes the problem:
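One way the repaired pattern might look, using memory-free parentheses so that the address is still the first capture. The sample header lines are assumptions for this sketch.

```perl
use strict;
use warnings;

# The group confines the alternation to the header name, so both the
# anchor and the rest of the sequence apply to either branch.
my $fixed = qr/^(?:Sender|From):\s+(.*)/;

my ($sender) = 'Sender: alice@example.com'      =~ $fixed;
my ($from)   = 'From: carol@example.com'        =~ $fixed;
my ($resent) = 'X-Resent-From: bob@example.com' =~ $fixed;   # no match

print "sender: $sender\n";
print "from:   $from\n";
print "resent: ", (defined $resent ? $resent : "(no match)"), "\n";
```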
Double-quote interpolation
Perl regular expressions are subject to the same kind of interpolation that double-quoted strings are. [2] Interpolated variables and string escapes like \U and \Q are not regular expression atoms and are never seen by the regular expression parser. Interpolation takes place in a single pass that occurs before a regular expression is parsed:
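A minimal sketch of the two phases, with a variable name and text chosen for illustration. By the time the pattern is parsed, `\U$word\E` has already been replaced by the uppercased contents of `$word`.

```perl
use strict;
use warnings;

my $word = "perl";

# Interpolation runs first, so the parser sees the text "PERL rocks".
my $re = qr/\U$word\E rocks/;

print "pattern as compiled: $re\n";
print "uppercase text matches\n"       if "PERL rocks" =~ $re;
print "lowercase text does not match\n" unless "perl rocks" =~ $re;
```

Printing the compiled pattern shows the interpolated text rather than the variable or the `\U` escape; the exact wrapper around it depends on the Perl version.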
Double-quote interpolation and the separate regular expression parsing phase combine to produce a number of common "gotchas." For example, here's what can happen if you forget that an interpolated variable is not an atom:
Read a pattern into $pat and match two consecutive occurrences of it.
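The original listing is not preserved. The sketch below shows three patterns of the kind the text describes, with a fixed string standing in for user input. The braces in `${pat}` are a choice made here to keep the variable name visibly separate from the quantifier.

```perl
use strict;
use warnings;

my $pat = "bob";   # stands in for a pattern read from the user

my $wrong = qr/${pat}{2}/;   # becomes /bob{2}/: the {2} applies to "b" only
my $right = qr/($pat){2}/;   # the group makes {2} apply to all of $pat
my $brute = qr/$pat$pat/;    # writes the pattern out twice

for my $s ("bobb", "bobbob") {
    printf "%-7s wrong:%s right:%s brute:%s\n", $s,
        ($s =~ $wrong ? "yes" : "no"),
        ($s =~ $right ? "yes" : "no"),
        ($s =~ $brute ? "yes" : "no");
}
```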
In this example, if the user types in bob, the first regular expression will match bobb, because the contents of $pat are expanded before the regular expression is interpreted.
All three regular expressions in this example have another potential pitfall. Suppose the user types in the string "hello :-)". This will generate a fatal run-time error. The result of interpolating this string into /($pat){2}/ is /(hello :-)){2}/, which, aside from being nonsense, has unbalanced parentheses.
If you don't want special characters like parentheses, asterisks, periods, and so forth interpreted as regular expression metacharacters, use the quotemeta operator or the quotemeta escape, \Q. Both quotemeta and \Q put a backslash in front of any character that isn't a letter, digit, or underscore:
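A short sketch of both forms, again with a fixed string standing in for user input. The unquoted attempt is wrapped in `eval` only so that the fatal error can be shown without ending the program.

```perl
use strict;
use warnings;

my $input = "hello :-)";   # stands in for text typed by a user

# Interpolated raw, the ")" unbalances the pattern. The error is fatal
# at run time, so it is trapped with eval here.
my $raw_ok = eval { my $m = "hello" =~ /($input){2}/; 1 };
print "raw pattern failed: $@" unless $raw_ok;

# quotemeta backslashes every character that is not a letter, digit,
# or underscore.
my $quoted = quotemeta($input);
print "quoted: $quoted\n";

# \Q ... \E does the same job inside the pattern itself.
my $twice = "hello :-)hello :-)" =~ /(\Q$input\E){2}/ ? 1 : 0;
print "two copies matched\n" if $twice;
```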
As with seemingly everything else pertaining to regular expressions, tiny errors in quoting metacharacters can result in strange bugs:
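The original example is not preserved. As one illustration of the kind of bug meant here (an example made up for this sketch, not the author's), forgetting `\E` lets `\Q` quote the metacharacters that were supposed to stay special:

```perl
use strict;
use warnings;

my $prefix = "a+b";   # hypothetical literal text to match at the start

my $buggy = qr/^\Q$prefix.*end/;     # \Q also quotes the ".*"
my $fixed = qr/^\Q$prefix\E.*end/;   # \E stops the quoting in time

for my $s ("a+b and then the end", "a+b.*end") {
    printf "%-22s buggy:%s fixed:%s\n", $s,
        ($s =~ $buggy ? "yes" : "no"),
        ($s =~ $fixed ? "yes" : "no");
}
```

The buggy pattern matches only text containing a literal `.*`, which is easy to miss because the pattern compiles without complaint.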