Item 17: Avoid greed when parsimony is best. | Effective Perl Programming: Writing Better Programs with Perl

One of the stickier problems you may encounter in dealing with regular expressions is greed .

Greed isn't about money, at least where regular expressions are concerned . It's the term used to describe the matching behavior of most regular expression engines, Perl's included. A general rule ^[3] is that a Perl regular expression will return the longest match it can find, at the first position in a string at which it can find a match of any length. Repetition operators like * and + "gobble up" characters in the string until matching more characters causes the match to fail:

^[3] But not strictly accurate, as you will see.

$_ = "Greetings, planet Earth!\n";	Some data to match.
/\w+/; /\w*/;	Matches `Greetings` . Matches `Greetings` .
/n[et]*/; /n[et]+/;	Matches `n` in `Greetings` . Matches `net` in `planet` .
/G.*t/;	Matches `Greetings, planet Eart` .

This is normally a desirable behavior. But not always. Be especially careful when using greedy regular expressions to match delimited patterns like quoted strings and C comments:

Don't use greedy regular expressions with delimiters.

These examples illustrate incorrect patterns for matching text enclosed by delimiters in this case single-quoted strings and C comments.
$_ = "This 'test' isn't successful?";	Hoping to match `'test'` .
($str) = /('.*')/;	Matches `'test' isn'` .
$_ = "/* temp / x = 10; / too much? */";	Hoping to match `/* temp */` .
s#(/\.\*/)##;	OOPSerases the whole string!

In these examples, Perl keeps matching beyond what appears to be the end of the pattern. But the match operator hasn't run amok: ' , / , and * are all matched by . , and the match ends at the last occurrence of ' , or */ . We can fix the single-quoted string example by excluding single quotes from the characters allowed inside the string:

 $_ = "This 'test' isn't successful?";  ($str) = /('[^']*')/;

Matches 'test' now.

Straightening out the regular expression for C comments is more troublesome . I will bet confidently that when you write your first regular expression that you believe matches C comments, it will not work. Here is one of many possibilitiesit seems reasonable at first:

 s#/\*([^*]\*[^/])*\*/##g;

ALMOST works.

Do you see the problem with it? It fails on the following input:

 /***/

The reason is that there is no way for it to match an asterisk inside the comment that isn't followed by exactly one other character, thus an odd number of asterisks fails to match. It has other problems, too, but this one is enough. The real answer looks like this: ^[4]

^[4] As long as it isn't necessary to deal with things like comments embedded in strings, anyway.

 s#/\*[^*]*\*+([^/*][^*]*\*+)*/##g;

CORRECT

You are not likely to understand the how and why of this without recourse to a diagram of the underlying state machine:

graphics/04fig01.gif

And if you haven't suffered through a class in discrete math or compiler construction, this may not help you either. In fact, a good many people will go through their lives using regular expressions like the above without knowing why they work. This isn't necessarily bad; it's just not ideal.

Now, if it's this hard to construct a regular expression for something as simple as C comments, imagine what it could be like to try to write one for something more complex, like HTML comments, or strings with character escapes . Pretty scary.

Fortunately, Perl 5 has non-greedy repetition operators. This is a powerful and enormously helpful new feature that allows you to write simple regular expressions for cases that previously required complex or even impossibly difficult regular expressions.

You can make any repetition operator ( * , + , { m , n } ) non-greedy by following it with a question mark. The operator will now match the shortest string that results in a pattern match, rather than the longest. This makes the examples above trivially simple:

Do use non-greedy regular expressions with delimiters.

These examples illustrate patterns that correctly match text enclosed by delimiters.
$_ = "This 'test' isn't successful?"; ($str) = /('.*?')/;	Matches `'test'` .
$_ = "/* temp / x = 10; / too much? /"; s#(/\.?\/)##;	Deletes `/* temp */` .

You can now attempt more ambitious things, like a double-quoted string with character escapes (let's support \" , \\ , and \123 ):

 $_ = 'a "double-q \"string2"';  ($str) = /("(\["\]\\d{1,3}.)*?")/;  print $str;

"double-q \"string\042"

The only problem with non-greedy matching is that it can be slower than greedy matching. Don't use non-greedy operators unnecessarily. But do use non-greedy operators to avoid having to write complex regular expressions that might or might not be correct.

Procedural regular expressions versus deterministic finite automatons (DFAs)

Perl and most other tools with robust pattern-matching capabilities that include features like backreferences use what I call a procedural approach to regular expression pattern matching. When Perl encounters a regular expression in a program, it compiles it into a treelike structure and saves it away. When that regular expression is used to match a string, Perl looks for the match by "executing" the compiled regular expression. Consider a simple regular expression and target string:

 $_ = 'testing';  /t(ees)./;  print "matched: $&\n";

matched: tes

If Perl could talk, it might describe the matching process something like this:

"OK, start at first character position. Looking for a t . Got one.

"Now, an alternation , first one is e . Looking for e . Got one.

"OK, the alternation matched. Next thing is a dot. Need one char to match the dot. Got an s .

"Anything else? Nope. Guess we're done."

If you have no background experience with tools like lex or flex , or if this is the only kind of regular expression you have ever known, you probably don't see anything unusual with this interpretation of this regular expression. On the other hand, if you are familiar with flex , you might be thinking, "Hmm, why didn't that match test instead?"

Well, you could get it to match test by rewriting it:

 $_ = 'testing';  /t(ese)./;  print "matched: $&\n";

matched: test

This illustrates the differenceoutwardly, at leastbetween procedural regular expressions and state machine or DFA regular expressions. ^[5] Tools like flex go far beyond parsing regular expressions. They go through an involved process that generates a deterministic state machine from the regular expression, rendering it into what is basically a table of numbers . If you use flex on the regular expression above, it will generate an internal state machine that looks something like this:

^[5] The Hip Owls book uses the term " NFA regular expressions" to refer to what I call "procedural" matching.

graphics/04fig02.gif

The bold circles represent "accepting states" in which a complete match has been found. This view of the regular expression is somewhat different from the one that Perl has. For one thing, this DFA would have matched test , not tes , from the string testing . For another, the process of creating a DFA from a regular expression discards the syntactic structure of the original regular expression. In general, DFA-based tools always find the longest match for a regular expression, regardless of the order of alternations or grouping of repetition operators. On the other hand, as you can see, the arrangement of the parts of a Perl regular expression is significant. The flexibility to change what a pattern matches by rearranging its parts helps make Perl regular expressions particularly powerful and expressive.

If you are more familiar with the DFA view of regular expressions than the procedural view, you should take some time to think about procedural regular expressions. Experiment with them. You can do things with procedural regular expressions that are very difficult or impossible to do with DFA regular expressions, especially if you make use of Perl's non-greedy repetition operators.

The down side to procedural regular expressions is that they generally run slower than DFA regular expressions. But they are not that much slower, and they do compile into an internal represention significantly faster.