Item 15: Know the precedence of regular expression operators.

The "expression" in "regular expression" is there because regular expressions are constructed and parsed using grammatical rules that are similar to those used for arithmetic expressions. Although regular expressions serve a very different purpose, understanding the similarities between them will help you write better regular expressions, and hence better Perl.

Regular expressions in Perl are made up of atoms. Atoms are connected by operators like repetition, sequence, and alternation. Most regular expression atoms are single-character matches. For example:

`a`	Matches the letter `a`.
`\$`	Matches the character `$`; backslash escapes metacharacters.
`\n`	Matches newline.
`[a-z]`	Matches a lowercase letter.
`.`	Matches any character except `\n`.
`\1`	Matches contents of first memory; arbitrary length.

There are also special "zero-width" atoms. For example:

`\b`	Word boundary: transition from `\w` to `\W` (or the reverse).
`^`	Matches start of a string.
`\Z`	Matches end of a string or before newline at end.
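
As a quick illustration of the atoms listed above, here is a minimal sketch that exercises a backreference and two zero-width atoms. The sample strings and variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample text; note the trailing newline.
my $text = "cat concat\n";

# \b is zero-width: it matches only where a word starts or ends,
# so the "cat" inside "concat" is not counted.
my @whole_words = $text =~ /\bcat\b/g;

# \Z matches at the very end of the string or just before a final newline.
my $at_end = $text =~ /concat\Z/ ? 1 : 0;

# \1 matches whatever the first pair of parentheses matched.
my ($doubled) = "hello" =~ /(.)\1/;

print scalar(@whole_words), " $at_end $doubled\n";   # prints "1 1 l"
```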

Atoms are modified and/or joined together by regular expression operators. As in arithmetic expressions, there is an order of precedence among these operators:

Regular expression operator precedence
Precedence	Operator	Description
Highest	`()`, `(?:)`, etc.	Parentheses and other grouping operators
	`?`, `+`, `*`, `{m,n}`, `+?`, etc.	Repetition
	`^abc`	Sequence (see below)
Lowest	`|`	Alternation

Fortunately, there are only four precedence levels; imagine if there were as many as there are for arithmetic expressions! Parentheses and the other grouping operators ^[1] have the highest precedence.

^[1] A multitude of new grouping operators were introduced in Perl 5.

A repetition operator binds tightly to its argument, which is either a single atom or a grouping operator:

`ab*c`	Matches `ac`, `abc`, `abbc`, `abbbc`, etc.
`abc*`	Matches `ab`, `abc`, `abcc`, `abccc`, etc.
`ab(c)*`	Same thing, and memorizes the `c` actually matched.
`ab(?:c)*`	Same thing, but doesn't memorize the `c`.
`abc{2,4}`	Matches `abcc`, `abccc`, `abcccc`.
`(abc)*`	Matches empty string, `abc`, `abcabc`, etc.; memorizes `abc`.
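
One way to check how tightly repetition binds is to test a few strings against anchored versions of these patterns. The sketch below is only an illustration; `full_match` is a hypothetical helper, not something from the text.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $re matches the whole of $string.
# A qr// object interpolates as a group, so the anchors apply to all of it.
sub full_match {
    my ($re, $string) = @_;
    return $string =~ /\A$re\z/ ? 1 : 0;
}

print full_match(qr/ab*c/,     'abbbc'),  "\n";   # 1: * applies to b only
print full_match(qr/abc*/,     'abccc'),  "\n";   # 1: * applies to c only
print full_match(qr/abc*/,     'abcabc'), "\n";   # 0: * does not repeat abc
print full_match(qr/(abc)*/,   'abcabc'), "\n";   # 1: parentheses make abc the argument
print full_match(qr/abc{2,4}/, 'abc'),    "\n";   # 0: needs at least two c's
```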

Placing two atoms side by side is called sequence. Sequence is a kind of operator, even though it is written without punctuation. This is similar to the invisible multiplication operator in a mathematical expression like y = ax + b. To illustrate this, let's suppose that sequence were actually represented with the character "•". Then the above examples would look like:

`a•b*•c`	Matches `ac`, `abc`, `abbc`, `abbbc`, etc.
`a•b•c*`	Matches `ab`, `abc`, `abcc`, `abccc`, etc.
`a•b•(c)*`	Same thing, and memorizes the `c` actually matched.
`a•b•(?:c)*`	Same thing, but doesn't memorize the `c`.
`a•b•c{2,4}`	Matches `abcc`, `abccc`, `abcccc`.
`(a•b•c)*`	Matches empty string, `abc`, `abcabc`, etc.; memorizes `abc`.

The last entry in the precedence chart is alternation, written `|`. Let's continue to use the "•" notation for sequence for a moment:

`e•d|j•o`	Matches `ed` or `jo`.
`(e•d)|(j•o)`	Same thing.
`e•(d|j)•o`	Matches `edo` or `ejo`.
`e•d|j•o{1,3}`	Matches `ed`, `jo`, `joo`, `jooo`.

The zero-width atoms, for example, `^` and `\b`, group in the same way as other atoms:

`^ed|jo$`	Matches `ed` at beginning, `jo` at end.
`^(ed|jo)$`	Matches exactly `ed` or `jo`.
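
The difference between the anchored forms is easy to see by filtering a small word list. This is a minimal sketch; the word lists are invented for illustration, and `(?:)` is used instead of `()` only to avoid capturing.

```perl
use strict;
use warnings;

my @words = qw(ed jo eddy banjo joe);

# Alternation binds loosest: this is (^ed) or (jo$).
my @loose  = grep { /^ed|jo$/ } @words;

# Grouping first, then anchoring: exactly "ed" or "jo".
my @strict = grep { /^(?:ed|jo)$/ } @words;

# Grouping inside a sequence: "edo" or "ejo".
my @middle = grep { /^e(?:d|j)o$/ } qw(edo ejo ed jo);

print "@loose\n";    # ed jo eddy banjo
print "@strict\n";   # ed jo
print "@middle\n";   # edo ejo
```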

It's easy to forget about precedence. Removing excess parentheses is a noble pursuit, especially within regular expressions, but be careful not to remove too many:

 /^Sender|From:\s+(.*)/;         # WRONG: would match:
                                 #   X-Not-Really-From: faker
                                 #   Senderella is misspelled

The pattern was meant to match Sender: and From: lines in a mail header, but it actually matches something somewhat different. Here it is with some parentheses added to clarify the precedence:

 /(^Sender)|(From:\s+(.*))/;

Adding a pair of parentheses, or perhaps memory-free parentheses `(?:)`, fixes the problem:

 /^(Sender|From):\s+(.*)/;       # $1 contains Sender or From.
                                 # $2 has the data.

 /^(?:Sender|From):\s+(.*)/;     # $1 contains the data.
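
To see the effect on real input, the following sketch runs both the wrong and the corrected pattern over a few header-like lines. The sample lines and addresses are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical header lines, including the two false positives.
my @header = (
    'Sender: list@example.com',
    'From: alice@example.com',
    'X-Not-Really-From: faker',
    'Senderella is misspelled',
);

# Without grouping, the pattern means (^Sender) or (From:\s+(.*)).
my @wrong = grep { /^Sender|From:\s+(.*)/ } @header;

# With grouping, only genuine Sender: and From: lines match.
my %field;
for my $line (@header) {
    if ($line =~ /^(Sender|From):\s+(.*)/) {
        $field{$1} = $2;
    }
}

print scalar(@wrong), "\n";                 # 4: every line matched
print join(',', sort keys %field), "\n";    # From,Sender
```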

Double-quote interpolation

Perl regular expressions are subject to the same kind of interpolation that double-quoted strings are. ^[2] Interpolated variables and string escapes like \U and \Q are not regular expression atoms and are never seen by the regular expression parser. Interpolation takes place in a single pass that occurs before a regular expression is parsed:

^[2] Well, more or less. The $ anchor receives special treatment so that it is not always interpreted as a scalar variable prefix.

 /te(st)/;       # Matches test in $_.
 /\Ute(st)/;     # Matches TEST.
 /\Qte(st)/;     # Matches te(st).

 $x = 'test';
 /$x*/;          # Matches tes, test, testt, etc.
 /test*/;        # Same thing as /$x*/.
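
The following sketch checks these claims directly. It assumes nothing beyond core Perl; the variable names are illustrative.

```perl
use strict;
use warnings;

my $x = 'test';

# /$x*/ is parsed as /test*/, so * applies only to the final t.
my $tes      = 'tes'      =~ /\A$x*\z/     ? 1 : 0;   # 1
my $testtest = 'testtest' =~ /\A$x*\z/     ? 1 : 0;   # 0
my $grouped  = 'testtest' =~ /\A(?:$x)*\z/ ? 1 : 0;   # 1

# \U and \Q are applied during interpolation, before the pattern is parsed.
my $upper_hit   = 'TEST'   =~ /\Ute(st)/ ? 1 : 0;     # 1
my $upper_miss  = 'test'   =~ /\Ute(st)/ ? 1 : 0;     # 0
my $quoted_hit  = 'te(st)' =~ /\Qte(st)/ ? 1 : 0;     # 1
my $quoted_miss = 'test'   =~ /\Qte(st)/ ? 1 : 0;     # 0

print "$tes $testtest $grouped\n";
print "$upper_hit $upper_miss $quoted_hit $quoted_miss\n";
```

Wrapping the variable in `(?:...)` is what turns its contents into a single argument for `*`.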

Double-quote interpolation and the separate regular expression parsing phase combine to produce a number of common "gotchas." For example, here's what can happen if you forget that an interpolated variable is not an atom:

 # Read a pattern into $pat and match two consecutive occurrences of it.
 chomp($pat = <STDIN>);                  # For example, bob.

 print "matched\n" if /$pat{2}/;         # WRONG: /bob{2}/.

 print "matched\n" if /($pat){2}/;       # RIGHT: /(bob){2}/.
 print "matched\n" if /$pat$pat/;        # Brute force way.

In this example, if the user types in `bob`, the first regular expression will match `bobb`, because the contents of `$pat` are expanded before the regular expression is interpreted.
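
A minimal sketch of the same behavior, with the pattern hard-coded rather than read from STDIN so that it can be run as is (the anchors `\A` and `\z` are added here only to make the results unambiguous):

```perl
use strict;
use warnings;

my $pat = 'bob';    # stands in for the line read from STDIN

# Parsed as /bob{2}/: the {2} applies only to the last b.
my $wrong_bobb   = 'bobb'   =~ /\A$pat{2}\z/     ? 1 : 0;   # 1
my $wrong_bobbob = 'bobbob' =~ /\A$pat{2}\z/     ? 1 : 0;   # 0

# Grouping makes the whole interpolated text the argument of {2}.
my $right_bobbob = 'bobbob' =~ /\A(?:$pat){2}\z/ ? 1 : 0;   # 1

# Brute force: write the variable twice.
my $brute_bobbob = 'bobbob' =~ /\A$pat$pat\z/    ? 1 : 0;   # 1

print "$wrong_bobb $wrong_bobbob $right_bobbob $brute_bobbob\n";
```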

All three regular expressions in this example have another potential pitfall. Suppose the user types in the string "hello :-)". This will generate a fatal run-time error. The result of interpolating this string into `/($pat){2}/` is `/(hello :-)){2}/`, which, aside from being nonsense, has unbalanced parentheses.

If you don't want special characters like parentheses, asterisks, periods, and so forth interpreted as regular expression metacharacters, use the `quotemeta` operator or the quotemeta escape, `\Q`. Both `quotemeta` and `\Q` put a backslash in front of any character that isn't a letter, digit, or underscore:

 chomp($pat = <STDIN>);                       # For example, hello :-).
 $quoted = quotemeta $pat;                    # Now hello\ \:\-\).

 print "matched\n" if /($quoted){2}/;         # "Safe" to match now.
 print "matched\n" if /(\Q$pat\E){2}/;        # Another approach.
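
The sketch below puts the failure and both fixes side by side. It hard-codes the input and uses `eval` to trap the run-time error; those choices, and the variable names, are assumptions made for the sake of a runnable example.

```perl
use strict;
use warnings;

my $pat    = 'hello :-)';       # stands in for user input
my $quoted = quotemeta $pat;    # hello\ \:\-\)

# Unquoted, the stray ")" makes the pattern fail to compile at run time.
my $compiled       = eval { qr/($pat){2}/ };
my $unquoted_fails = defined $compiled ? 0 : 1;

# Quoted either way, the text matches itself literally.
my $twice         = "$pat$pat";
my $via_quotemeta = $twice =~ /($quoted){2}/  ? 1 : 0;
my $via_Q         = $twice =~ /(\Q$pat\E){2}/ ? 1 : 0;

print "$quoted\n";
print "$unquoted_fails $via_quotemeta $via_Q\n";   # 1 1 1
```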

As with seemingly everything else pertaining to regular expressions, tiny errors in quoting metacharacters can result in strange bugs:

 print "matched\n" if /(\Q$pat){2}/;          # WRONG: no \E, so ){2} is quoted too;
                                              # means /(hello\ \:\-\)\)\{2\}/.
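
Because double-quoted strings go through the same kind of interpolation, one way to inspect what the regular expression parser will receive is to build the equivalent string and print it. This is a sketch for illustration only:

```perl
use strict;
use warnings;

my $pat = 'hello :-)';

# \Q stays in effect until \E or the end of the string.
my $without_E = "(\Q$pat){2}";
my $with_E    = "(\Q$pat\E){2}";

print "$without_E\n";   # (hello\ \:\-\)\)\{2\}
print "$with_E\n";      # (hello\ \:\-\)){2}
```

Without `\E`, the closing parenthesis and the `{2}` are quoted along with the data, leaving an unmatched opening parenthesis.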