Item 15: Know the precedence of regular expression operators.

The "expression" in "regular expression" is there because regular expressions are constructed and parsed using grammatical rules that are similar to those used for arithmetic expressions. Although regular expressions serve a very different purpose, understanding the similarities between the two will help you write better regular expressions, and hence better Perl.

Regular expressions in Perl are made up of atoms. Atoms are connected by operators like repetition, sequence, and alternation. Most regular expression atoms are single-character matches. For example:

 a        Matches the letter a.
 \$       Matches the character $ (a backslash escapes metacharacters).
 \n       Matches newline.
 [a-z]    Matches a lowercase letter.
 .        Matches any character except \n.
 \1       Matches contents of first memory (arbitrary length).

There are also special "zero-width" atoms. For example:

 \b       Word boundary: transition from \w to \W.
 ^        Matches start of a string.
 \Z       Matches end of a string or before newline at end.

Atoms are modified and/or joined together by regular expression operators. As in arithmetic expressions, there is an order of precedence among these operators:

Regular expression operator precedence

 Precedence   Operator                     Description
 Highest      (), (?:), etc.               Parentheses and other grouping operators
              ?, +, *, {m,n}, +?, etc.     Repetition
              ^abc                         Sequence (see below)
 Lowest       |                            Alternation

Fortunately, there are only four precedence levels. Imagine if there were as many as there are for arithmetic expressions! Parentheses and the other grouping operators [1] have the highest precedence.

[1] A multitude of new grouping operators were introduced in Perl 5.

A repetition operator binds tightly to its argument, which is either a single atom or a grouping operator:

 `ab*c`         Matches ac, abc, abbc, abbbc, etc.
 `abc*`         Matches ab, abc, abcc, abccc, etc.
 `ab(c)*`       Same thing, and memorizes the c actually matched.
 `ab(?:c)*`     Same thing, but doesn't memorize the c.
 `abc{2,4}`     Matches abcc, abccc, abcccc.
 `(abc)*`       Matches empty string, abc, abcabc, etc.; memorizes abc.
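The binding rule above can be checked directly. The following is a minimal sketch; `full_match` is a hypothetical helper introduced only for this illustration, and it anchors the pattern so that a partial match does not hide the difference.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original text): returns 1 if $str
# matches $re from start to end, 0 otherwise.
sub full_match {
    my ($re, $str) = @_;
    return $str =~ /\A(?:$re)\z/ ? 1 : 0;
}

# In ab*c the star applies only to b.
print full_match(qr/ab*c/, 'abbbc'),  "\n";
print full_match(qr/ab*c/, 'abcabc'), "\n";

# In (abc)* the star applies to the whole group.
print full_match(qr/(abc)*/, 'abcabc'), "\n";
print full_match(qr/(abc)*/, ''),       "\n";

# In abc{2,4} the count applies only to c.
print full_match(qr/abc{2,4}/, 'abccc'),  "\n";
print full_match(qr/abc{2,4}/, 'abcabc'), "\n";
```

Each pair prints 1 and then 0 or 1 according to whether the repetition operator reaches the whole string or only the atom next to it.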

Placing two atoms side by side is called sequence. Sequence is a kind of operator, even though it is written without punctuation. This is similar to the invisible multiplication operator in a mathematical expression like y = ax + b. To illustrate this, let's suppose that sequence were actually represented with the character "•". Then the above examples would look like:

 `a•b*•c`         Matches ac, abc, abbc, abbbc, etc.
 `a•b•c*`         Matches ab, abc, abcc, abccc, etc.
 `a•b•(c)*`       Same thing, and memorizes the c actually matched.
 `a•b•(?:c)*`     Same thing, but doesn't memorize the c.
 `a•b•c{2,4}`     Matches abcc, abccc, abcccc.
 `(a•b•c)*`       Matches empty string, abc, abcabc, etc.; memorizes abc.

The last entry in the precedence chart is alternation, written with the character "|". Let's continue to use the "•" notation for sequence for a moment:

 `e•d|j•o`          Matches ed or jo.
 `(e•d)|(j•o)`      Same thing.
 `e•(d|j)•o`        Matches edo or ejo.
 `e•d|j•o{1,3}`     Matches ed, jo, joo, jooo.

The zero-width atoms, for example, ^ and \b, group in the same way as other atoms:

 `^ed|jo$`       Matches ed at beginning, jo at end.
 `^(ed|jo)$`     Matches exactly ed or jo.
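To see how anchors combine with alternation, here is a small sketch that runs an ungrouped and a grouped pattern against a few words. The variable names and the word list are chosen only for illustration, and the grouped pattern uses (?:) rather than capturing parentheses.

```perl
use strict;
use warnings;

my $loose = qr/^ed|jo$/;      # parsed as (^ed)|(jo$)
my $tight = qr/^(?:ed|jo)$/;  # both anchors apply to either alternative

for my $word (qw(ed jo eddy mojo)) {
    printf "%-5s loose=%d tight=%d\n", $word,
        ($word =~ $loose ? 1 : 0),
        ($word =~ $tight ? 1 : 0);
}
```

Both patterns accept ed and jo, but only the ungrouped one also accepts eddy (it starts with ed) and mojo (it ends with jo).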

It's easy to forget about precedence. Removing excess parentheses is a noble pursuit, especially within regular expressions, but be careful not to remove too many:

 `/^Sender|From:\s+(.*)/;`     WRONG: would match "X-Not-Really-From: faker" and "Senderella is misspelled".

The pattern was meant to match Sender: and From: lines in a mail header, but it actually matches something somewhat different. Here it is with some parentheses added to clarify the precedence:

 `/(^Sender)|(From:\s+(.*))/;`

Adding a pair of parentheses, or perhaps memory-free parentheses (?:), fixes the problem:

 `/^(Sender|From):\s+(.*)/;`       $1 contains Sender or From. $2 has the data.
 `/^(?:Sender|From):\s+(.*)/;`     $1 contains the data.
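The difference between the broken and the repaired pattern can be sketched as follows. The header lines are invented sample data, and the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Invented sample lines; the last two should NOT be treated as headers.
my @headers = (
    'From: alice@example.com',
    'Sender: bob@example.com',
    'X-Not-Really-From: faker',
    'Senderella is misspelled',
);

my $wrong = qr/^Sender|From:\s+(.*)/;      # alternation splits the whole pattern
my $right = qr/^(?:Sender|From):\s+(.*)/;  # alternation confined to the group

for my $line (@headers) {
    my $w = $line =~ $wrong ? 'match' : 'no match';
    my $r = $line =~ $right ? "data='$1'" : 'no match';
    print "$line\n  wrong: $w\n  right: $r\n";
}
```

The ungrouped pattern reports a match on all four lines, while the grouped one matches only the two real headers and leaves the address in $1.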

Double-quote interpolation

Perl regular expressions are subject to the same kind of interpolation that double-quoted strings are. [2] Interpolated variables and string escapes like \U and \Q are not regular expression atoms and are never seen by the regular expression parser. Interpolation takes place in a single pass that occurs before a regular expression is parsed:

[2] Well, more or less. The $ anchor receives special treatment so that it is not always interpreted as a scalar variable prefix.

 `/te(st)/;`        Matches test in $_.
 `/\Ute(st)/;`      Matches TEST.
 `/\Qte(st)/;`      Matches te(st).
 `$x = 'test';`
 `/$x*/;`           Matches tes, test, testt, etc.
 `/test*/;`         Same thing as /$x*/.
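A short sketch can confirm that interpolation happens before parsing. The result variable names are assumptions for this illustration, and the \A and \z anchors are added so that a partial match does not mask the effect of the star.

```perl
use strict;
use warnings;

my $x = 'test';

# \U and \Q rewrite the pattern text before the regex parser runs.
my $upper_hits  = 'TEST'   =~ /\Ute(st)/ ? 1 : 0;  # pattern text is TE(ST)
my $quoted_hits = 'te(st)' =~ /\Qte(st)/ ? 1 : 0;  # pattern text is te\(st\)
my $plain_hits  = 'te(st)' =~ /te(st)/   ? 1 : 0;  # parentheses group here

# $x is pasted in as text, so * binds only to the final t of "test".
my $star_on_t    = 'tes'      =~ /\A$x*\z/     ? 1 : 0;
my $star_on_word = 'testtest' =~ /\A$x*\z/     ? 1 : 0;
my $star_grouped = 'testtest' =~ /\A(?:$x)*\z/ ? 1 : 0;

print "$upper_hits $quoted_hits $plain_hits\n";
print "$star_on_t $star_on_word $star_grouped\n";
```

The third value on the second line shows that wrapping the variable in (?:) is what makes the star apply to the whole interpolated word.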

Double-quote interpolation and the separate regular expression parsing phase combine to produce a number of common "gotchas." For example, here's what can happen if you forget that an interpolated variable is not an atom:

Read a pattern into $pat and match two consecutive occurrences of it.

 `chomp($pat = <STDIN>);`                For example, bob.
 `print "matched\n" if /$pat{2}/;`       WRONG: /bob{2}/.
 `print "matched\n" if /($pat){2}/;`     RIGHT: /(bob){2}/.
 `print "matched\n" if /$pat$pat/;`      Brute force way.

In this example, if the user types in bob, the first regular expression will match bobb, because the contents of $pat are expanded before the regular expression is interpreted.
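The effect can be reproduced without reading from standard input. In this sketch the pattern text is assembled by plain string concatenation so that the result of interpolation is explicit; the hard-coded value bob and the variable names are assumptions made for the illustration, and (?:) is used in place of capturing parentheses.

```perl
use strict;
use warnings;

my $pat = 'bob';   # stands in for the line the user would type

# Assemble the pattern text by hand, mirroring what interpolation produces.
my $wrong = $pat . '{2}';         # the text bob{2}
my $right = "(?:$pat)" . '{2}';   # the text (?:bob){2}

for my $str (qw(bobb bobbob)) {
    my $w = $str =~ /\A$wrong\z/ ? 1 : 0;
    my $r = $str =~ /\A$right\z/ ? 1 : 0;
    print "$str: wrong=$w right=$r\n";
}
```

The ungrouped text accepts bobb and rejects bobbob; the grouped text does the opposite, which is what was intended.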

All three regular expressions in this example have another potential pitfall. Suppose the user types in the string "hello :-)". This will generate a fatal run-time error. The result of interpolating this string into /($pat){2}/ is /(hello :-)){2}/, which, aside from being nonsense, has unbalanced parentheses.

If you don't want special characters like parentheses, asterisks, periods, and so forth interpreted as regular expression metacharacters, use the quotemeta operator or the quotemeta escape, \Q. Both quotemeta and \Q put a backslash in front of any character that isn't a letter, digit, or underscore:

 `chomp($pat = <STDIN>);`                     For example, hello :-).
 `$quoted = quotemeta $pat;`                  Now hello\ \:\-\).
 `print "matched\n" if /($quoted){2}/;`       "Safe" to match now.
 `print "matched\n" if /(\Q$pat\E){2}/;`      Another approach.
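The following sketch exercises both the failure and the two fixes. The input is hard-coded instead of read from standard input, the fatal error is trapped with eval so that the script keeps running, and all variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $pat    = 'hello :-)';          # stands in for the line the user would type
my $quoted = quotemeta $pat;
print "$quoted\n";

my $text = 'hello :-)hello :-)';

# The raw string yields unbalanced parentheses; eval traps the fatal error.
my $raw_ok = eval { my $m = $text =~ /($pat){2}/; 1 } ? 1 : 0;
print "raw pattern compiled: $raw_ok\n";

# Both quoting approaches treat the input as literal text.
my $via_quotemeta = $text =~ /($quoted){2}/   ? 1 : 0;
my $via_escape    = $text =~ /(\Q$pat\E){2}/  ? 1 : 0;
print "quotemeta: $via_quotemeta, \\Q...\\E: $via_escape\n";
```

Trapping the error with eval is only a convenience for the demonstration; quoting the input is what actually removes the problem.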

As with seemingly everything else pertaining to regular expressions, tiny errors in quoting metacharacters can result in strange bugs:

 `print "matched\n" if /(\Q$pat){2}/;`     WRONG: no \E, so the quoting runs to the end of the pattern ... means /(hello\ \:\-\)\)\{2\}/.

Effective Perl Programming: Writing Better Programs with Perl
ISBN: 0201419750
EAN: 2147483647
Year: 1996
Pages: 116