Item 15: Know the precedence of regular expression operators.


The "expression" in "regular expression" is there because regular expressions are constructed and parsed using grammatical rules that are similar to those used for arithmetic expressions. Although regular expressions serve a greatly different purpose, understanding the similarities between them will help you write better regular expressions, and hence better Perl.

Regular expressions in Perl are made up of atoms. Atoms are connected by operators like repetition, sequence, and alternation. Most regular expression atoms are single-character matches. For example:

 a          Matches the letter a.
 \$         Matches the character $; a backslash escapes metacharacters.
 \n         Matches newline.
 [a-z]      Matches a lowercase letter.
 .          Matches any character except \n.
 \1         Matches the contents of the first memory (of arbitrary length).

There are also special "zero-width" atoms. For example:

 \b         Word boundary: a transition between \w and \W.
 ^          Matches start of a string.
 \Z         Matches end of a string or before newline at end.
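A few of these atoms can be tried out directly. The following is a minimal sketch; the sample strings and the `@cases` table are made up for illustration and are not part of the original text.

```perl
use strict;
use warnings;

# Each case: a pattern built from one of the atoms above, a sample
# string (hypothetical), and a note on what is being tested.
my @cases = (
    [ qr/\$/,      'cost: $5', 'backslash makes $ a literal character' ],
    [ qr/[a-z]/,   'ABC',      'no lowercase letter present'           ],
    [ qr/a.c/,     "a\nc",     '. does not match a newline'            ],
    [ qr/(\w)\1/,  'book',     '\1 repeats whatever (\w) matched'      ],
    [ qr/\bcat\b/, 'concat',   'no word boundary in front of cat'      ],
    [ qr/^end\Z/,  "end\n",    '\Z allows a newline at the very end'   ],
);

for my $case (@cases) {
    my ($re, $string, $note) = @$case;
    printf "%-3s %s\n", ($string =~ $re ? 'yes' : 'no'), $note;
}
```

The first, fourth, and last cases match; the other three do not.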

Atoms are modified and/or joined together by regular expression operators. As in arithmetic expressions, there is an order of precedence among these operators:

Regular expression operator precedence

 Precedence   Operator                          Description
 Highest      () , (?:) , etc.                  Parentheses and other grouping operators
              ? , + , * , {m,n} , +? , etc.     Repetition
              ^abc                              Sequence (see below)
 Lowest       |                                 Alternation

Fortunately, there are only four precedence levels; imagine if there were as many as there are for arithmetic expressions! Parentheses and the other grouping operators [1] have the highest precedence.

[1] A multitude of new grouping operators were introduced in Perl 5.

A repetition operator binds tightly to its argument, which is either a single atom or a grouping operator:

 ab*c          Matches ac, abc, abbc, abbbc, etc.
 abc*          Matches ab, abc, abcc, abccc, etc.
 ab(c)*        Same thing, and memorizes the c actually matched.
 ab(?:c)*      Same thing, but doesn't memorize the c.
 abc{2,4}      Matches abcc, abccc, abcccc.
 (abc)*        Matches empty string, abc, abcabc, etc.; memorizes abc.
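The binding of the repetition operators can be checked with a short script. This is only a sketch: the `^` and `$` anchors are added here (they are not in the examples above) so that each pattern has to account for the whole test string, and the test strings themselves are illustrative.

```perl
use strict;
use warnings;

# Pattern / test string pairs; anchors force a whole-string match.
my @tests = (
    [ qr/^ab*c$/,     'abbbc'  ],   # * applies to b only
    [ qr/^abc*$/,     'ab'     ],   # * applies to c only, so zero c's is fine
    [ qr/^abc*$/,     'abcabc' ],   # fails: abc is not repeated as a unit
    [ qr/^(abc)*$/,   'abcabc' ],   # parentheses make abc the repeated unit
    [ qr/^abc{2,4}$/, 'abccc'  ],   # {2,4} applies to c only
);

for my $t (@tests) {
    my ($re, $string) = @$t;
    printf "%-8s %s\n", $string, ($string =~ $re ? 'match' : 'no match');
}
```

Only the third pair fails to match, because without parentheses the `*` never sees more than the single atom in front of it.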

Placing two atoms side by side is called sequence. Sequence is a kind of operator, even though it is written without punctuation. This is similar to the invisible multiplication operator in a mathematical expression like y = ax + b. To illustrate this, let's suppose that sequence were actually represented with the character "•". Then the above examples would look like:

 a•b*•c          Matches ac, abc, abbc, abbbc, etc.
 a•b•c*          Matches ab, abc, abcc, abccc, etc.
 a•b•(c)*        Same thing, and memorizes the c actually matched.
 a•b•(?:c)*      Same thing, but doesn't memorize the c.
 a•b•c{2,4}      Matches abcc, abccc, abcccc.
 (a•b•c)*        Matches empty string, abc, abcabc, etc.; memorizes abc.

The last entry in the precedence chart is alternation. Let's continue to use the "•" notation for a moment:

 e•d|j•o         Matches ed or jo.
 (e•d)|(j•o)     Same thing.
 e•(d|j)•o       Matches edo or ejo.
 e•d|j•o{1,3}    Matches ed, jo, joo, jooo.

The zero-width atoms, for example, ^ and \b, group in the same way as other atoms:

 ^ed|jo$         Matches ed at beginning, jo at end.
 ^(ed|jo)$       Matches exactly ed or jo.
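The difference between the two anchored patterns shows up as soon as they are run against a few words. A minimal sketch follows; the variable names `$loose` and `$tight` and the word list are chosen here for illustration.

```perl
use strict;
use warnings;

my $loose = qr/^ed|jo$/;     # parsed as (^ed) | (jo$)
my $tight = qr/^(ed|jo)$/;   # both anchors apply to both alternatives

for my $word (qw(ed jo edit mojo)) {
    printf "%-5s loose=%d tight=%d\n", $word,
        ($word =~ $loose ? 1 : 0),
        ($word =~ $tight ? 1 : 0);
}
```

All four words match the loose pattern, but only `ed` and `jo` match the parenthesized one.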

It's easy to forget about precedence. Removing excess parentheses is a noble pursuit, especially within regular expressions, but be careful not to remove too many:

 /^Sender|From:\s+(.*)/;         # WRONG: would match:
                                 #   X-Not-Really-From: faker
                                 #   Senderella is misspelled

The pattern was meant to match Sender: and From: lines in a mail header, but it actually matches something somewhat different. Here it is with some parentheses added to clarify the precedence:

 /(^Sender)|(From:\s+(.*))/;

Adding a pair of parentheses, or perhaps memory-free parentheses (?:), fixes the problem:

 /^(Sender|From):\s+(.*)/;       # $1 contains Sender or From;
                                 # $2 has the data.
 /^(?:Sender|From):\s+(.*)/;     # $1 contains the data.
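To see the effect on real input, the unparenthesized and fixed patterns can be run side by side. This is a sketch only: the helper `header_value` and the two well-formed sample headers (with example.com addresses) are hypothetical, while the two misleading lines are the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the data from a Sender: or From: line,
# or undef for any other line.
sub header_value {
    my ($line) = @_;
    return $line =~ /^(?:Sender|From):\s+(.*)/ ? $1 : undef;
}

for my $line ('From: alice@example.com',
              'Sender: bob@example.com',
              'X-Not-Really-From: faker',
              'Senderella is misspelled') {
    # The pattern without grouping parentheses, for comparison.
    my $loose = $line =~ /^Sender|From:\s+(.*)/ ? 'matches' : 'no match';
    my $value = header_value($line);
    printf "%-26s unparenthesized: %-9s fixed: %s\n",
        $line, $loose, defined $value ? "data '$value'" : 'no match';
}
```

The unparenthesized pattern accepts all four lines; the fixed pattern accepts only the first two and leaves the header data in `$1`.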

Double-quote interpolation

Perl regular expressions are subject to the same kind of interpolation that double-quoted strings are. [2] Interpolated variables and string escapes like \U and \Q are not regular expression atoms and are never seen by the regular expression parser. Interpolation takes place in a single pass that occurs before a regular expression is parsed:

[2] Well, more or less. The $ anchor receives special treatment so that it is not always interpreted as a scalar variable prefix.

 /te(st)/;        # Matches test in $_.
 /\Ute(st)/;      # Matches TEST.
 /\Qte(st)/;      # Matches te(st).

 $x = 'test';
 /$x*/;           # Matches tes, test, testt, etc.
 /test*/;         # Same thing as /$x*/.
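One way to look at the result of the interpolation pass is to build the pattern with `qr//`, which goes through the same interpolation, and print it. This is a sketch under that assumption; the exact wrapper printed around each pattern (typically something like `(?^:...)` on recent perls) depends on the Perl version.

```perl
use strict;
use warnings;

my $x = 'test';

# Printing a qr// object shows the pattern text that the regular
# expression parser receives after interpolation has finished.
print qr/\Ute(st)/, "\n";   # \U has already uppercased the text
print qr/\Qte(st)/, "\n";   # \Q has already backslashed the parentheses
print qr/$x*/,      "\n";   # $x is gone; * follows the final t of test
```

In each case the escape or variable has disappeared by the time the pattern is parsed, which is why `*` in the last line applies only to the final `t`.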

Double-quote interpolation and the separate regular expression parsing phase combine to produce a number of common "gotchas." For example, here's what can happen if you forget that an interpolated variable is not an atom:

 # Read a pattern into $pat and match two consecutive occurrences of it.
 chomp($pat = <STDIN>);                  # For example, bob.
 print "matched\n" if /$pat{2}/;         # WRONG: /bob{2}/.
 print "matched\n" if /($pat){2}/;       # RIGHT: /(bob){2}/.
 print "matched\n" if /$pat$pat/;        # Brute-force way.

In this example, if the user types in bob , the first regular expression will match bobb , because the contents of $pat are expanded before the regular expression is interpreted.
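The difference is easier to see with the input fixed in the script. In this sketch `$pat` is assigned directly rather than read from STDIN, and `^` and `$` anchors are added (they are not in the patterns above) so that a partial match cannot hide the problem.

```perl
use strict;
use warnings;

my $pat = 'bob';   # stands in for the line read from STDIN

for my $input (qw(bobb bobbob)) {
    # After interpolation these are /^bob{2}$/ and /^(bob){2}$/.
    my $wrong = $input =~ /^$pat{2}$/   ? 'match' : 'no match';
    my $right = $input =~ /^($pat){2}$/ ? 'match' : 'no match';
    printf "%-7s \$pat{2}: %-9s (\$pat){2}: %s\n", $input, $wrong, $right;
}
```

Without the parentheses the `{2}` repeats only the final `b`, so `bobb` matches and `bobbob` does not; with them the results are reversed.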

All three regular expressions in the $pat example above have another potential pitfall. Suppose the user types in the string "hello :-)". This will generate a fatal run-time error. The result of interpolating this string into /($pat){2}/ is /(hello :-)){2}/, which, aside from being nonsense, has unbalanced parentheses.

If you don't want special characters like parentheses, asterisks, periods, and so forth interpreted as regular expression metacharacters, use the quotemeta operator or the quotemeta escape, \Q. Both quotemeta and \Q put a backslash in front of any character that isn't a letter, digit, or underscore:

 chomp($pat = <STDIN>);                        # For example, hello :-).
 $quoted = quotemeta $pat;                     # Now hello\ \:\-\).
 print "matched\n" if /($quoted){2}/;          # "Safe" to match now.
 print "matched\n" if /(\Q$pat\E){2}/;         # Another approach.
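Both forms can be exercised without typing anything at the terminal. In this sketch `$pat` is assigned directly instead of being read from STDIN, and `$text` is a made-up string containing two consecutive copies of it.

```perl
use strict;
use warnings;

my $pat    = 'hello :-)';        # stands in for the line read from STDIN
my $quoted = quotemeta $pat;     # every non-word character gets a backslash

print "$quoted\n";

my $text = 'hello :-)hello :-)'; # two consecutive occurrences of $pat
print "quotemeta: matched\n" if $text =~ /($quoted){2}/;
print "\\Q and \\E: matched\n" if $text =~ /(\Q$pat\E){2}/;
```

Both patterns match, because the parenthesis, colon, hyphen, and space in `$pat` are now treated as literal characters.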

As with seemingly everything else pertaining to regular expressions, tiny errors in quoting metacharacters can result in strange bugs:

 print "matched\n" if /(\Q$pat){2}/;     # WRONG: no \E, so this means
                                         # /(hello\ \:\-\)\)\{2\}/.
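What happens without the \E can be observed safely by wrapping the match in `eval`. That wrapper is an addition made for this sketch, not something the example above uses, and `$pat` is again assigned directly rather than read from STDIN.

```perl
use strict;
use warnings;

my $pat = 'hello :-)';   # stands in for the line read from STDIN

# Without \E, \Q also quotes the ")" and "{2}" that follow $pat, so the
# opening "(" is left unmatched. The pattern is compiled at run time,
# and eval traps the resulting fatal error.
my $ok = eval {
    my $matched = ('hello :-)hello :-)' =~ /(\Q$pat){2}/);
    1;
};
my $err = $@;

print $ok ? "pattern compiled\n" : "pattern failed: $err";
```

Here the mistake is at least loud. With a pattern that has no opening parenthesis before the \Q, the same slip would silently turn the quantifier into literal text.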



Effective Perl Programming: Writing Better Programs with Perl
ISBN: 0201419750
Year: 1996
Pages: 116
