Recipe 22.4. Choosing Greedy or Nongreedy Matches


22.4.1. Problem

You want your pattern to match the smallest possible string instead of the largest.

22.4.2. Solution

Place a ? after a quantifier to alter that portion of the pattern, as in Example 22-6.

Making a quantifier match as few characters as possible

<?php
// find all <em>emphasized</em> sections
preg_match_all('@<em>.+?</em>@', $html, $matches);
?>

Or use the U pattern modifier to invert all quantifiers from greedy ("match as many characters as possible") to nongreedy ("match as few characters as possible"). The code in Example 22-7 does the same thing as the code in Example 22-6.

Making a quantifier match as few characters as possible

<?php
// find all <em>emphasized</em> sections
preg_match_all('@<em>.+</em>@U', $html, $matches);
?>

22.4.3. Discussion

By default, all regular expression quantifiers in PHP are greedy. For example, consider the pattern <em>.+</em>, which means "<em>, one or more characters, </em>", matched against the string I simply <em>love</em> your <em>work</em>. A greedy regular expression finds one match, because after it matches the opening <em>, its .+ slurps up as much as possible, finally grinding to a halt at the final </em>. The .+ matches love</em> your <em>work.

A nongreedy regular expression, on the other hand, finds a pair of matches. The first <em> is matched as before, but then .+ stops as soon as it can, only matching love. A second match then goes ahead: the next .+ matches work.

Example 22-8 shows the greedy and nongreedy patterns at work.

Greedy versus nongreedy matching

<?php
$html = 'I simply <em>love</em> your <em>work</em>';

// Greedy
$matchCount = preg_match_all('@<em>.+</em>@', $html, $matches);
print "Greedy count: " . $matchCount . "\n";

// Nongreedy
$matchCount = preg_match_all('@<em>.+?</em>@', $html, $matches);
print "First non-greedy count: " . $matchCount . "\n";

// Nongreedy
$matchCount = preg_match_all('@<em>.+</em>@U', $html, $matches);
print "Second non-greedy count: " . $matchCount . "\n";
?>

Example 22-8 prints:

Greedy count: 1
First non-greedy count: 2
Second non-greedy count: 2
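The match counts alone do not show what each pattern actually consumed. As a minimal sketch, the following reuses the sample string from Example 22-8 and adds a capturing group around the quantifier (the parentheses and the variable names $greedy and $lazy are additions for illustration, not part of the original example) so the captured text can be inspected directly.

```php
<?php
// Sample string from the discussion above.
$html = 'I simply <em>love</em> your <em>work</em>';

// Greedy: .+ runs on to the last </em> in the string.
preg_match_all('@<em>(.+)</em>@', $html, $greedy);

// Nongreedy: .+? stops at the first </em> it can reach.
preg_match_all('@<em>(.+?)</em>@', $html, $lazy);

// Index 1 holds the text captured by the first parenthesized group.
print_r($greedy[1]); // one element: love</em> your <em>work
print_r($lazy[1]);   // two elements: love, work
```

The greedy capture contains a closing and an opening tag in the middle, which is usually the sign that a nongreedy quantifier was wanted.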

Greedy matching is also known as maximal matching, and nongreedy matching can be called minimal matching, because these methods match either the maximum or minimum number of characters possible.

The ereg( ) and ereg_replace( ) functions are always greedy. Being able to choose between greedy and nongreedy matching is another reason to use the PCRE functions instead.

While nongreedy matching is useful for simplistic HTML parsing, it can break down if your markup isn't 100 percent valid and there are, for example, stray <em> tags lying around.[*] If your goal is just to remove all (or some) HTML tags from a block of text, you're better off not using a regular expression. Instead, use the built-in function strip_tags( ); it's faster and it works correctly. See Recipe 13.14 for more details.

[*] It's possible to have valid HTML and still get into trouble; for instance, if you have bold tags inside a comment. A true HTML parser would ignore them, but our pattern won't.
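To make the stray-tag problem concrete, here is a small sketch. The markup in $html is a made-up example with an unclosed <em>, and the capturing group is added for illustration. It shows the nongreedy pattern capturing more than intended, followed by what strip_tags( ) returns for the same input.

```php
<?php
// Hypothetical markup: an unclosed <em> appears before the real one.
$html = 'A stray <em> tag, then <em>real</em> emphasis';

// The match begins at the first <em>, so even a nongreedy .+? has to
// run past the second <em> before it reaches a closing </em>.
preg_match_all('@<em>(.+?)</em>@', $html, $matches);
$captured = $matches[1][0];
print $captured . "\n";

// strip_tags() removes the tags and keeps the surrounding text.
$plain = strip_tags($html);
print $plain . "\n";
```

A nongreedy quantifier only limits how far the match extends to the right; it has no effect on where the match starts, which is why the stray opening tag is still picked up.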

Finally, even though the idea of nongreedy matching comes from Perl, the U modifier is not compatible with Perl; it is a feature of the PCRE library behind PHP's Perl-compatible regular expression functions. It inverts all quantifiers, turning them from greedy to nongreedy and also the reverse. So to get a greedy quantifier inside a pattern that uses the U modifier, just add a ? after the quantifier, the same way you would normally turn a greedy quantifier into a nongreedy one.
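The inversion can be checked with a short sketch that reuses the sample string from Example 22-8. The variable names $lazyCount and $greedyCount are chosen here for illustration.

```php
<?php
$html = 'I simply <em>love</em> your <em>work</em>';

// Under the U modifier, a bare .+ is nongreedy: two separate matches.
$lazyCount = preg_match_all('@<em>.+</em>@U', $html, $m1);

// Under the U modifier, .+? flips back to greedy: one long match.
$greedyCount = preg_match_all('@<em>.+?</em>@U', $html, $m2);

print "U with .+ : " . $lazyCount . "\n";
print "U with .+?: " . $greedyCount . "\n";
```

Because the U modifier changes the meaning of every quantifier in the pattern, marking individual quantifiers with ? and leaving the modifier off is often easier to read.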

22.4.4. See Also

Recipe 22.6 for more on capturing text inside HTML tags; Recipe 13.14 for more on stripping HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445