Recipe 13.5. Creating a Nongreedy Pattern


Problem

You're using a regular expression but not seeing the correct results. The pattern is acting greedy, matching more characters than you want it to.

Solution

Replace the regex with a nongreedy version to match the smallest amount of characters possible.

Discussion

Whenever you create a pattern that includes matches for character repetition using the * and + metacharacters or the { n , m } metasequence, the pattern is greedy. A greedy pattern is one that tries to consume as much text as possible, matching the largest substring it can. Patterns are greedy because of the underlying code in the regular expression engine, and understanding how that engine works allows you to create more precise regexes.

Consider that you want to remove HTML tags in a string, as described in Recipe 13.4. Every HTML tag starts with an opening < and ends in a closing >. In between the angle brackets, there could be a wide variety of characters, such a numbers, letters, quotes, the equal sign, etc. For the sake of simplicity, instead of creating a character class for everything that could appear between the < and >, a .* will match "any character, any number of times." So, you construct the regular expression /<.*>/g and try running it against a string that contains HTML tags:

var example:String = "<b>hello</b>, world!"; // Displays: <b>hello</b> trace( example.match( /<.*>/g ) );

You might have expected that the regular expression would produce two matches, one for <b> and another for </b>. From the preceding code block, you can see that only one match was produced, the entire string <b>hello</b>. This is an example of a greedy pattern in action.

The first character in the expression is a <, which is taken literally. It matches in the string at the very first position so the engine moves on in the pattern and keeps processing. The next character in the pattern is a ., which matches everything except a newline, followed by the *, which repeats the . zero or more times. This is where the pattern becomes greedy.

The . matches the b because it falls under the "any character" umbrella. After the b in the string is a >, which again falls under "any character" andas you guessed itis matched by the .. The engine keeps repeating the ., matching hello</b>, world! up until the end of the string. At the end of the string, the . fails because there is nothing left; however, the pattern still needs to match a closing >. Because the engine knows that the * repetition has been fulfilled in matching the ., the engine backtracks and starts looking for a match to the > to complete the pattern. Moving backward in the string from the end, the ! is not a match to the > required by the pattern, so the engine moves backward yet again knowing that the .* has been fulfilled. It keeps moving backward through the d, l, r, o, etc. until it meets the > character just before the comma. The angle bracket at this location matches the closing angle bracket in the pattern, and the engine considers the pattern to have been matched successfully, returning a match of <b>hello</b>.

The RegExp engine wants to match a pattern as quickly as possible. Because a match has been made at this point, there is no need to continue to backtrack and try to find a smaller match. Therefore, greedy patterns always return the longest match.

There are two possible solutions to greedy patterns. The first is to make a pattern nongreedy, sometimes called a lazy pattern. The second is to creatively use a character class.

You make a pattern lazy by removing greedy repetition. Lazy patterns match the smallest possible substring. Creating a lazy pattern is as simple as adding the ? metacharacter to the end of the repetition expression, as described in Table 13-5.

Table 13-5. Lazy patterns
ExpressionLazily matches the preceding character or group
                                                              ??

Zero or one time
                               *?

Zero or more times
                               +?

One or more times
                               {n,}?

At least n times
                               {n,m}?

At least n but no more than m times


The greedy /<.*>/g pattern used earlier to match HTML tags can be replaced with the lazy pattern of /<.*?>/g, giving the desired effect of matching all of the HTML tags:

var example:String = "<b>hello</b>, world!"; // Displays: <b>,</b> trace( example.match( /<.*?>/g ) );

By making the pattern lazy, the underlying RegExp engine treats the pattern differently than it would a greedy pattern. The < is matched just like before, but this time the . is followed by a lazy * (as signified by the ?), and the engine knows to match as few characters as possible. At this point, the requirement of "any character, zero or more times" has already been met since zero characters have been consumed and the engine starts looking for the closing >. The b in the string is encountered, which doesn't match >. The engine backtracks just like before, but in doing so forces the lazy * to consume more text. The .* is expanded to match the b, and the engine again looks to match the >. The next character in the string is >, which matches successfully and completes the pattern. The engine therefore returns <b> as a match, which is exactly the desired result.

Instead of making a lazy pattern, you can replace the . with a character class that includes every character except the >, resulting in the /<[^>]*>/g pattern:

var example:String = "<b>hello</b>, world!"; // Displays: <b>,</b> trace( example.match( /<[^>]*>/g ) );

In this case, a character class might be the better option because it eliminates the backtracking done by the engine. The engine processes the < as normal, and then matches characters up until the > in the string since it's not a part of the character class. It then moves on to the > in the pattern and reports a successful match because > is found in the string.

See Also

Recipes 13.1, 13.3, and 13.4




ActionScript 3. 0 Cookbook
ActionScript 3.0 Cookbook: Solutions for Flash Platform and Flex Application Developers
ISBN: 0596526954
EAN: 2147483647
Year: 2007
Pages: 351

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net