6.5. Common Optimizations

A smart regex implementation has many ways to optimize how quickly it produces the results you ask of it. Optimizations usually fall into two classes:

  • Doing something faster: Some types of operations, such as \d+, are so common that the engine might have special-case handling set up to execute them faster than the general engine mechanics would.

  • Avoiding work: If the engine can decide that some particular operation is unneeded in producing a correct result, or perhaps that some operation can be applied to less text than originally thought, skipping those operations can result in a time savings. For example, a regex beginning with \A (start-of-string) can match only when started at the beginning of the string, so if no match is found there, the transmission need not bother checking from other positions.
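The \A case is easy to see from the user's side; a quick illustration (Python's re module used here purely for demonstration):

```python
import re

# A pattern anchored with \A can match only at the very start of the
# string, so an engine never needs to retry it at later positions.
pat = re.compile(r'\Acat')

assert pat.search('catalog') is not None    # matches at position 0
assert pat.search('concatenate') is None    # 'cat' occurs, but not at the start
```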

Over the next dozen or so pages, we'll look at many of the different and ingenious optimizations that I've seen. No one language or tool has them all, or even the same mix as another language or tool, and I'm sure that there are plenty of other optimizations that I've not yet seen, but this chapter should leave you much more empowered to take advantage of whatever optimizations your tool offers.

6.5.1. No Free Lunch

Optimizations often result in a savings, but not always. There's a benefit only if the amount of time saved is more than the extra time spent checking to see whether the optimization is applicable in the first place. In fact, if the engine checks to see if an optimization is applicable and the answer is "no," the overall result is slower because it includes the fruitless check on top of the subsequent normal application of the regex. So, there's a balance among how much time an optimization takes, how much time it saves, and importantly, how likely it is to be invoked.

Let's look at an example. The expression \b\B (a word boundary at the same location as a non-word boundary) can't possibly match. If an engine were to realize that a regex contained \b\B in such a way that it was required for any match, the engine would know that the overall regex could never match, and hence would never have to actually apply that regex. Rather, it could always immediately report failure. If applied to long strings, the savings could be substantial.

Yet, no engine that I know of actually uses this optimization. Why not? Well, first of all, it's not necessarily easy to decide whether it would apply to a particular regex. It's certainly possible for a regex to have \b\B somewhere in it, yet still match, so the engine has to do extra work ahead of time to be absolutely certain. Still, the savings could be truly substantial, so it could be worth doing the extra work if \b\B was expected to be common.

But, it's not common (it's silly!) [ ] so even though the savings could be huge in the rare case, the added overhead is not worth slowing down the common case.

[ ] Actually, in times past, I would use \b\B to force one part of a larger expression to fail during testing. For example, I might insert it into the first alternative of (this|that|other) to guarantee failure of that alternative. These days, when I need a "must fail" component, I use (?!). You can see an interesting Perl-specific example of this on page 333.
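Both never-matching constructs are easy to verify from the user's side; a quick Python check (illustrative only):

```python
import re

# \b\B demands a word boundary and a non-boundary at the same position;
# (?!) is a negative lookahead of the empty pattern. Neither can ever match.
never_a = re.compile(r'\b\B')
never_b = re.compile(r'(?!)')

for text in ['', 'word', 'two words', '123 abc!']:
    assert never_a.search(text) is None
    assert never_b.search(text) is None
```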

6.5.2. Everyone's Lunch is Different

Keep this in mind when looking at the various kinds of optimizations that this chapter discusses. Even though I've tried to pick simple, clean names for each one, it may well be that every engine that implements it does so in a different way. A seemingly innocuous change in a regex can cause it to become substantially faster with one implementation, but substantially slower with another.

6.5.3. The Mechanics of Regex Application

Before looking at how advanced systems optimize their regex performance, and how we can take advantage of those optimizations, it's important to first understand the basics of regex application. We've already covered the details about backtracking, but in this short section, we'll look at the broader picture.

Here are the main steps taken in applying a regular expression to a target string:

  • 1. Regex Compilation: The regex is inspected for errors, and if valid, compiled into an internal form.

  • 2. Transmission Begins: The transmission "positions" the engine at the start of the target string.

  • 3. Component Tests: The engine works through the regex and the text, moving from component to component in the regex, as described in Chapter 4. We've already covered backtracking for NFAs in great detail, but there are a few additional points to mention:

    • With components next to each other, as with the S, u, b, j, e, ... of Subject, each component is tried in turn, stopping only if one fails.

    • With quantifiers, control jumps between the quantifier (to see whether the quantifier should continue to make additional attempts) and the component quantified (to test whether it matches).

    • There is some overhead when control enters or exits a set of capturing parentheses. The actual text matched by the parentheses must be remembered so that $1 and the like are supported. Since a set of parentheses may be "backtracked out of," the state of the parentheses is part of the states used for backtracking, so entering and exiting capturing parentheses requires some modification of that state.

  • 4. Finding a Match: If a match is found, a Traditional NFA "locks in" the current state and reports overall success. On the other hand, a POSIX NFA merely remembers the possible match if it is the longest seen so far, and continues with any saved states still available. Once no more states are left, the longest match that was seen is the one reported.

  • 5. Transmission Bump-Along: If no match is found, the transmission bumps the engine along to the next character in the text, and the engine applies the regex all over again (going back to step 3).

  • 6. Overall Failure: If no match is found after having applied the engine at every character in the target string (and after the last character as well), overall failure must be reported.

The next few sections discuss the many ways this work can be reduced by smart implementations, and taken advantage of by smart users.

6.5.4. Pre-Application Optimizations

A good regex engine implementation can reduce the amount of work that needs to be done before the actual application of a regex, and sometimes can even decide quickly beforehand that the regex can never match, thereby avoiding the need to even apply the regex in the first place.

Compile caching

Recall the mini mail program from Chapter 2 (page 57). The skeleton of the main loop, which processes every line of the header, looks like:

  while (…) {
      if ($line =~ m/^\s*$/) …
      if ($line =~ m/^Subject: (.*)/) …
      if ($line =~ m/^Date: (.*)/) …
      if ($line =~ m/^Reply-To: (\S+)/) …
      if ($line =~ m/^From: (\S+) \(([^()]*)\)/) …
  }

The first thing that must be done before a regular expression can be used is that it must be inspected for errors, and compiled into an internal form. Once compiled, that internal form can be used in checking many strings, but will it? It would certainly be a waste of time to recompile each regex each time through the loop. Rather, it is much more time efficient (at the cost of some memory) to save, or cache , the internal form after it's first compiled, and then use that same internal form for each subsequent application during the loop.

The extent to which this can be done depends on the type of regular-expression handling the application offers. As described starting on page 93, the three types of handling are integrated, procedural, and object-oriented.

Compile caching in the integrated approach

An integrated approach, like Perl's and awk's, allows compile caching to be done with ease. Internally, each regex is associated with a particular part of the code, and the compiled form can be associated with the code the first time it's executed, and merely referenced subsequent times. This provides for the maximum optimization of speed at the cost of the memory needed to hold all the cached expressions.

DFAs, Tcl, and Hand-Tuning Regular Expressions

For the most part, the optimizations described in this chapter simply don't apply to DFAs. The compile caching optimization, discussed on page 242, does apply to all types of engines, but none of the techniques for hand-tuning discussed throughout this chapter apply to DFAs. As Chapter 4 makes clear (page 157), expressions that are logically equivalent, such as this|that and th(is|at), are one and the same to a DFA. It's because they're not necessarily equivalent to an NFA that this chapter exists.

But what about Tcl, which has a hybrid DFA/NFA engine? Tcl's regex engine was custom built for Tcl by regular-expression legend Henry Spencer (page 88), who has done a fantastic job blending the best of both the DFA and NFA worlds. As Henry himself noted in an April 2000 Usenet posting:

In general, the Tcl RE-matching engine is much less sensitive to the exact form of the RE than traditional matching engines. Things that it does quickly will be fast no matter how you write them; things that it does slowly will be slow no matter how you write them. The old folklore about hand-optimizing your REs simply does not apply.

Henry's Tcl regex engine is an important step forward. If this technology were more widespread, much of this chapter would not be needed.

The ability to interpolate variables into the regex operand (that is, use the contents of a variable as part of the regular expression) throws somewhat of a monkey wrench into the caching plan. When variables are interpolated, as with something like m/^Subject: \Q$DesiredSubject\E\s*$/, the actual regular expression may change from iteration to iteration because it depends on the value in the variable, which can change from iteration to iteration. If it changes every time, the regex must be compiled every time, so nothing can be reused.

Well, the regular expression might change with each iteration, but that doesn't mean it needs to be recompiled each time. An intermediate optimization is to check the results of the interpolation (the actual value to be used as the regular expression), and recompile only if it's different from the previous time. If the value actually changes each time, there's no optimization, as the regex indeed must be recompiled each time. But, if it changes only sporadically, the regular expression need only be checked (but not compiled) most times, yielding a handsome optimization.

Compile caching in the procedural approach

With an integrated approach, regex use is associated with a particular location in a program, so the compiled version of the regex can be cached and used the next time that location in the program is executed. But, with a procedural approach, there is just a general "apply this regex" function that is called as needed. This means that there's no location in a program with which to associate the compiled form, so the next time the function is called, the regex must be compiled from scratch again. That's how it works in theory, but in practice, it's much too inefficient to abandon all attempts at caching. Rather, what's often done is that a mapping of recently used regex patterns is maintained, linking each pattern to its resulting compiled form.

When the apply-this-regex function is called, it compares the pattern argument with those in the cache of saved regular expressions, and uses the cached version if it's there. If it's not, it goes ahead and compiles the regex, saving it to the cache (and perhaps flushing an old one out, if the cache has a size limit). When the cache has become full and a compiled form must be thrown out, it's usually the least recently used one.

GNU Emacs keeps a cache of up to 20 expressions. Tcl keeps up to 30. PHP keeps more than 4,000. The .NET Framework by default keeps up to only 15 expressions cached, although that can be changed on the fly or even disabled (page 432).

A large cache size can be important because if more regular expressions are used within a loop than the size of the cache, by the time the loop restarts, the first regex will have been flushed from the cache, guaranteeing that every expression will have to be compiled from scratch every time.

Compile caching in the object-oriented approach

The object-oriented approach puts control of when a regex is compiled directly into the programmer's hands. Compilation of the regex is exposed to the user via object constructors such as new Regex, re.compile, and Pattern.compile (which are from .NET, Python, and java.util.regex, respectively). In the simple examples from Chapter 3 where these are introduced (starting on page 95), the compilation is done just before the regex is actually used, but there's no reason that it can't be done earlier (such as sometime before a loop, or even at program initialization) and the result then used freely as needed. This is done in the benchmarking examples on pages 235, 237, and 238.
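In Python, for instance, that just means calling re.compile ahead of the loop and reusing the returned object (a minimal sketch):

```python
import re

# Compile once, up front; the loop then reuses the compiled object with
# no per-iteration compilation or cache lookup at all.
subject_re = re.compile(r'^Subject: (.*)')

headers = ['Date: today', 'Subject: hello', 'From: me']
subjects = []
for line in headers:
    m = subject_re.match(line)
    if m:
        subjects.append(m.group(1))

assert subjects == ['hello']
```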

The object-oriented approach also affords the programmer control over when a compiled form is thrown away, via the object's destructor. Being able to immediately throw away compiled forms that will no longer be needed saves memory.

Pre-check of required character/substring optimization

Searching a string for a particular character (or perhaps some literal substring) is a much "lighter" operation than applying a full NFA regular expression, so some systems do extra analysis of the regex during compilation to determine if there are any characters or substrings that are required to exist in the target for a possible match. Then, before actually applying the regex to a string, the string is quickly checked for the required character or string; if it's not found, the entire application of the regex can be bypassed.

For example, with ^Subject: (.*), the string 'Subject: ' is required. A program can look for the entire string, perhaps using the Boyer-Moore search algorithm (which is a fast way to search for literal strings within text; the longer the literal string, the more efficient the search). A program not wishing to implement the Boyer-Moore algorithm can still gain a benefit by picking a required character and just checking every character in the target text. Picking a character less likely to be found in the target (such as picking ':' over 't' from our 'Subject: ' example) is likely to yield better results.
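A user-level sketch of the same idea, in Python (find_subject is an illustrative name; str.find plays the role of the fast literal-string search):

```python
import re

subject_re = re.compile(r'^Subject: (.*)')

def find_subject(line):
    # Cheap required-substring check: if 'Subject: ' appears nowhere in
    # the line, the regex engine never has to start up at all.
    if line.find('Subject: ') == -1:
        return None
    return subject_re.match(line)

assert find_subject('Date: today') is None
m = find_subject('Subject: regex news')
assert m is not None and m.group(1) == 'regex news'
```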

While it's trivial for a regex engine to realize what part of ^Subject: (.*) is a fixed literal string required for any match, it's more work to recognize that 'th' is required for any match of this|that|other, and most don't do it. It's not exactly black and white: an implementation not realizing that 'th' is required may well still be able to easily realize that 'h' and 't' are required, and so at least do a one-character check.

There is a great variety in how well different applications can recognize required characters and strings. Most are thwarted by the use of alternation. With such systems, using th(is|at) can provide an improvement over this|that. Also, be sure to see the related section "Initial character/class/substring discrimination" on page 247.

Length-cognizance optimization

^Subject: (.*) can match arbitrarily long text, but any match is certainly at least nine characters long. Therefore, the engine need not be started up and applied to strings shorter than that length. Of course, the benefit is more pronounced with a regex with a longer required length, such as :\d{79}: (81 characters in any match).
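The same check is trivial to apply by hand; a small Python sketch (match_subject and MIN_LEN are illustrative names):

```python
import re

subject_re = re.compile(r'^Subject: (.*)')
MIN_LEN = 9      # len('Subject: '): the shortest possible match

def match_subject(line):
    if len(line) < MIN_LEN:
        return None              # too short to ever match; skip the engine
    return subject_re.match(line)

assert match_subject('Subj') is None
assert match_subject('Subject: x').group(1) == 'x'
```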

Also see the length-cognizance transmission optimization on page 247.

6.5.5. Optimizations with the Transmission

If the regex engine can't decide ahead of time that a particular string can never match, it may still be able to reduce the number of locations at which the transmission actually has to apply the regex.

Start of string/line anchor optimization

This optimization recognizes that any regex that begins with ^ can match only when applied where ^ can match, and so need be applied at those locations only.

The comments in the "Pre-check of required character/substring" section on the previous page about the ability of the regex engine to derive just when the optimization is applicable to a regex are also valid here. Any implementation attempting this optimization should be able to recognize that ^(this|that) can match starting only at locations where ^ can match, but many won't come to the same realization with ^this|^that. In such situations, writing ^(this|that) or (even better) ^(?:this|that) can allow a match to be performed much faster.
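The two forms are logically identical, which is easy to confirm (Python shown here; whether there is a speed difference depends on the implementation):

```python
import re

slow = re.compile(r'^this|^that')     # anchor buried in each alternative
fast = re.compile(r'^(?:this|that)')  # anchor hoisted out where it's obvious

for text in ['this one', 'that one', 'other', 'not this', '']:
    a, b = slow.search(text), fast.search(text)
    assert (a is None) == (b is None)
    if a is not None:
        assert a.group(0) == b.group(0)
```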

Similar optimizations involve \A, and for repeated matches, \G.

Implicit-anchor optimization

An engine with this optimization realizes that if a regex begins with .* or .+ , and has no global alternation, an implicit ^ can be prepended to the regex. This allows the start of string/line anchor optimization of the previous section to be used, which can provide a lot of savings.

More advanced systems may realize that the same optimization can also be applied when the leading .* or .+ is within parentheses, but care must be taken when the parentheses are capturing. For example, the regex (.+)X\1 finds locations where a string is repeated on either side of 'X', and an implicit leading ^ would improperly cause it to miss matches in which the repeated string doesn't begin at the start of the target. [ ]
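A concrete case, sketched in Python: the repeated text here sits at position 1, so an engine that blindly prepended ^ would wrongly report failure.

```python
import re

# A correct engine finds 'bc' repeated around the 'X', starting mid-string.
m = re.search(r'(.+)X\1', 'abcXbc')
assert m is not None and m.group(0) == 'bcXbc'

# An explicit leading anchor shows what the buggy over-optimization
# would have produced: no match at all.
assert re.search(r'^(.+)X\1', 'abcXbc') is None
```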

[ ] It's interesting to note that Perl had this over-optimization bug unnoticed for over 10 years until Perl developer Jeff Pinyan discovered (and fixed) it in early 2002. Apparently, regular expressions like (.+)X\1 aren't used often, or the bug would have been discovered sooner.

End of string/line anchor optimization

This optimization recognizes that some regexes ending with $ or other end anchors (page 129) have matches that start within a certain number of characters from the end of the string. For example, with regex(es)?$, any match must start no more than eight [ ] characters from the end of the string, so the transmission can jump directly there, potentially bypassing much of the target string.
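A user-level sketch of the transmission's shortcut, in Python (tail_search is an illustrative name): searching only the last eight characters must give the same answer as searching the whole string.

```python
import re

pat = re.compile(r'regex(es)?$')

def tail_search(text):
    # Any match starts within eight characters of the end, so start the
    # engine there instead of at the beginning of the string.
    return pat.search(text, max(0, len(text) - 8))

for text in ['I love regex', 'so many regexes', 'regex in front', 'regexes\n']:
    full, tail = pat.search(text), tail_search(text)
    assert (full is None) == (tail is None)
    if full is not None:
        assert full.span() == tail.span()
```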

[ ] I say eight characters rather than seven because in many flavors, $ can match before a string-ending newline (page 129).

Initial character/class/substring discrimination optimization

A more generalized version of the pre-check of required character/string optimization, this optimization uses the same information (that any match by the regex must begin with a specific character or literal substring) to let the transmission use a fast substring check so that it need apply the regex only at appropriate spots in the string. For example, this|that|other can match only at locations beginning with [ot], so having the transmission pre-check each character in the string and apply the regex only at matching positions can afford a huge savings. The longer the substring that can be pre-checked, the fewer "false starts" are likely.

Embedded literal string check optimization

This is almost exactly like the initial string discrimination optimization, but is more advanced in that it works for literal strings embedded a known distance into any match. \b(?:perl|java)\.regex\.info\b, for example, has '.regex.info' four characters into any match, so a smart transmission can use a fast Boyer-Moore literal-string check to find '.regex.info', and then actually apply the regex starting four characters before.
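The technique can be sketched at the user level in Python (embedded_literal_search is an illustrative name; str.find stands in for the Boyer-Moore scan):

```python
import re

pat = re.compile(r'\b(?:perl|java)\.regex\.info\b')

def embedded_literal_search(text):
    start = 0
    while True:
        i = text.find('.regex.info', start)    # fast literal scan
        if i == -1:
            return None                        # literal absent: no match possible
        m = pat.match(text, max(0, i - 4))     # apply regex 4 chars before
        if m is not None:
            return m
        start = i + 1                          # false start; keep scanning

m = embedded_literal_search('see java.regex.info for details')
assert m is not None and m.group(0) == 'java.regex.info'
assert embedded_literal_search('see x.regex.info for details') is None
```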

In general, this works only when the literal string is embedded a fixed distance into any match. It doesn't apply to \b(?:vb|java)\.regex\.info\b, which does have a literal string, but one that's embedded either two or four characters into any match. It also doesn't apply to \b(\w+)\.regex\.info\b, whose literal string is embedded any number of characters into any match.

Length-cognizance transmission optimization

Directly related to the Length-cognizance optimization on page 245, this optimization allows the transmission to abandon the attempt if it's gotten too close to the end of the string for a match to be possible.

6.5.6. Optimizations of the Regex Itself

Literal string concatenation optimization

Perhaps the most basic optimization is that abc can be treated by the engine as "one part," rather than the three parts "a then b then c." If this is done, the one part can be applied by one iteration of the engine mechanics, avoiding the overhead of three separate iterations.

Simple quantifier optimization

Uses of star, plus, and friends that apply to simple items, such as literal characters and character classes, are often optimized such that much of the step-by-step overhead of a normal NFA engine is removed. The main control loop inside a regex engine must be general enough to deal with all the constructs the engine supports. In programming, "general" often means "slow," so this important optimization makes simple quantifiers like .* into one "part," replacing the general engine mechanics of quantifier processing with fast, specialized processing. Thus, the general engine is short-circuited for these tests.
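The effect is easy to probe, though the numbers you get will vary by implementation and version; a hedged Python sketch:

```python
import re
import timeit

text = 'x' * 10000
plain = re.compile(r'.*')
wrapped = re.compile(r'(?:.)*')

# Logically identical: both match the same span...
assert plain.match(text).span() == wrapped.match(text).span()

# ...but they may exercise very different amounts of engine machinery.
t_plain = timeit.timeit(lambda: plain.match(text), number=200)
t_wrapped = timeit.timeit(lambda: wrapped.match(text), number=200)
print('.*     :', t_plain)
print('(?:.)* :', t_wrapped)
```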

For example, .* and (?:.)* are logically identical, but for systems with this optimization, the simple .* is substantially faster. A few examples: it's only about 10 percent faster in java.util.regex, but with Ruby and the .NET languages, it's about two and a half times faster. With Python, it's about 50 times faster, and with PHP/PCRE, it's about 150 times faster. Because Perl has the optimization discussed in the next section, both .* and (?:.)* are the same speed. (Be sure to see the sidebar on the facing page for more on how to interpret these numbers.)

Needless parentheses elimination

If an implementation can realize that (?:.)* is exactly the same as .*, it opens up the latter to the previous optimization.

Needless character class elimination

A character class with a single character in it is a bit silly because it invokes the processing overhead of a character class, but without any benefits of one. So, a smarter implementation internally converts something like [.] to \..

Character following lazy quantifier optimization

With a lazy quantifier, as in "(.*?)", the engine normally must jump between checking what the quantifier controls (the dot) and checking what comes after (the closing "). For this and other reasons, lazy quantifiers are generally much slower than greedy ones, especially greedy ones that are optimized with the simple quantifier optimization discussed two sections ago. Another factor is that if the lazy quantifier is inside capturing parentheses, control must repeatedly jump in and out of the capturing, which causes additional overhead.

So, this optimization involves the realization that if a literal character follows the lazy quantifier, the lazy quantifier can act like a normal greedy quantifier so long as the engine is not at that literal character. Thus, implementations with this optimization switch to specialized lazy quantifier processing for use in these situations, which quickly checks the target text for the literal character, bypassing the normal "skip this attempt" if the target text is not at that special literal character.

Variations on this optimization might include the ability to pre-check for a class of characters, rather than just a specific literal character (for instance, a pre-check for ['"] with ['"](.*?)["'] , which is similar to the initial character discrimination optimization discussed on the previous page).

Understanding Benchmarks in This Chapter

For the most part, benchmarks in this chapter are reported as relative ratios for a given language. For example, on page 248, I note that a certain optimized construct is 10 percent faster than the unoptimized construct, at least with Sun's Java regex package. In the .NET Framework, the optimized and unoptimized constructs differ by a factor of two and a half, but in PCRE, it's a whopping factor of about 150. In Perl, it's a factor of one (i.e., they are the same speed; no difference).

From this, what can you infer about the speed of one language compared to another? Absolutely nothing. The 150x speedup for the optimization in PCRE may mean that the optimization has been implemented particularly well, relative to the other languages, or it may mean that the unoptimized version is particularly slow. For the most part, I report very little timing information about how languages compare against each other, since that's of interest mostly for bragging rights among language developers.

But, for what it's worth, it may be interesting to see the details behind such different results as Java's 10 percent speedup and PCRE's 150x speedup. It turns out that PCRE's unoptimized (?:.)* is about 11 times slower than Java's, but its optimized .* is about 13 times faster. Java's and Ruby's optimized versions are about the same speed, but Ruby's unoptimized version is about 2.5 times slower than Java's unoptimized version. Ruby's unoptimized version is only about 10 percent slower than Python's unoptimized version, but Python's optimized version is about 20 times faster than Ruby's optimized version.

All of these are slower than Perl's. Both Perl's optimized and unoptimized versions are 10 percent faster than Python's fastest. Note that each language has its own strong points, and these numbers are for only one specific test case.

"Excessive" backtracking detection

The problem revealed with the "Reality Check" on page 226 is that certain combinations of quantifiers, such as (.+)*, can create an exponential amount of backtracking. One simple way to avoid this is to keep a count of the backtracking, and abort the match when there's "too much." This is certainly useful in the reality-check situation, but it puts an artificial limit on the amount of text that some regular expressions can be used with.

For example, if the limit is 10,000 backtracks, .*? can never match text longer than 10,000 characters, since each character matched involves a backtrack. Working with these amounts of text is not all that uncommon, particularly when working with, say, web pages, so the limitation is unfortunate.

For different reasons, some implementations have a limit on the size of the backtrack stack (on how many saved states there can be at any one time). For example, Python allows at most 10,000. Like a backtrack limit, this limits the length of text some regular expressions can work with.

This issue made constructing some of the benchmarks used while researching this book rather difficult. To get the best results, the timed portion of a benchmark should do as much of the target work as possible, so I created huge strings and compared the time it took to execute, say, "(.*)", "(.)*", "(.)*?", and "([^"])*?". To keep meaningful results, I had to limit the length of the strings so as not to trip the backtrack-count or stack-size limitations. You can see an example on page 239.

Exponential (a.k.a., super-linear) short-circuiting

A better solution to avoid matching forever on an exponential match is to detect when the match attempt has gone super-linear. You can then make the extra effort to keep track of the position at which each quantifier's subexpression has been attempted, and short-circuit repeat attempts.

It's actually fairly easy to detect when a match has gone super-linear. A quantifier should rarely "iterate" (loop) more times than there are characters in the target string. If it does, it's a good sign that it may be an exponential match. Having been given this clue that matching may go on forever, it's a more complex issue to detect and eliminate redundant matches, but since the alternative is matching for a very, very long time, it's probably a good investment.

One negative side effect of detecting a super-linear match and returning a quick failure is that a truly inefficient regex now has its inefficiency mostly hidden. Even with exponential short-circuiting, these matches are much slower than they need to be, but no longer slow enough to be easily detected by a human (instead of finishing long after the sun has gone dormant, it may take 1/100 of a second; quick to us, but still an eternity in computer time).

Still, the overall benefit is probably worth it. There are many people who don't care about regex efficiency; they're scared of regular expressions and just want the thing to work, and don't care how. (You may have been this way before, but I hope reading this book emboldens you, like the title says, to master the use of regular expressions.)

State-suppression with possessive quantifiers

After something with a normal quantifier has matched, a number of "try the nonmatch option" states have been created (one state per iteration of the quantifier). Possessive quantifiers (page 142) don't leave those states around. This can be accomplished by removing the extra states after the quantifier has run its course, or, it can be done more efficiently by removing the previous iteration's state while adding the current iteration's. (During the matching, one state is always required so that the regex can continue once the quantified item can no longer match.)

The reason the on-the-fly removal is more efficient is because it takes less memory. Applying .* leaves one state per character matched, which could consume a vast amount of memory if the string is long.

Automatic "Possessification"

Recall the example from Chapter 4 (page 171) where ^\w+: is applied to 'Subject'. Once \w+ matches to the end of the string, the subsequent colon can't match, and the engine must waste the effort of trying : at each position where backtracking forces \w+ to give up a character. The example then concluded that we could have the engine avoid that extra work by using atomic grouping, ^(?>\w+):, or a possessive quantifier, ^\w++:.

A smart implementation should be able to do this for you. When the regex is first compiled, the engine can see that what follows the quantifier can't be matched by what is quantified, so the quantifier can be automatically turned into a possessive one.
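Until engines do this automatically, the effect can be had by hand. Newer engines expose atomic grouping directly (Python's re gained (?>...) and possessive quantifiers in version 3.11); on older ones, a capturing lookahead plus a backreference gives the same behavior, since a lookahead, once satisfied, is never re-entered. A sketch:

```python
import re

naive = re.compile(r'^\w+:')
# The lookahead grabs \w+ greedily and captures it; \1 then consumes that
# text as one fixed block that backtracking cannot nibble away at.
atomic_like = re.compile(r'^(?=(\w+))\1:')

for text in ['Subject', 'Subject: hi', 'Date: now']:
    a, b = naive.match(text), atomic_like.match(text)
    assert (a is None) == (b is None)
    if a is not None:
        assert a.group(0) == b.group(0)
```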

Although I know of no system that currently has this optimization, I include it here to encourage developers to consider it, for I believe it can have a substantial positive impact.

Small quantifier equivalence

Some people like to write \d\d\d\d directly, while some like to use a small quantifier and write \d{4}. Is one more efficient than the other? For an NFA, the answer is almost certainly yes, but which is faster depends on the tool. If the tool's quantifier has been optimized, the \d{4} version is likely faster unless the version without the quantifier can somehow be optimized more. Sound a bit confusing? It is.

My tests show that with Perl, Python, PHP/PCRE, and .NET, \d{4} is faster by as much as 20 percent. On the other hand, with Ruby and Sun's Java regex package, \d\d\d\d is faster, sometimes several times faster. So, this seems to make it clear that the small quantifier is better for some, but worse for others. But, it can be more complex than that.
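Since what's faster varies so much between tools, the only reliable guide is measuring in your own environment; a hedged Python sketch:

```python
import re
import timeit

expanded = re.compile(r'\d\d\d\d')
quantified = re.compile(r'\d{4}')

# Equivalent in what they match...
sample = 'order 20240115 shipped'
assert expanded.search(sample).group(0) == quantified.search(sample).group(0)

# ...but possibly not in speed; measure rather than guess.
text = 'abc123 ' * 5000 + '2024'
for name, pat in ((r'\d\d\d\d', expanded), (r'\d{4}', quantified)):
    print(name, timeit.timeit(lambda: pat.search(text), number=100))
```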

Compare ==== with ={4}. This is a quite different example because this time, the subject of the repetition is a literal character, and perhaps using ==== directly makes it easier for the regex engine to recognize the literal substring. If it can, the highly effective initial character/substring discrimination optimization (page 247) can kick in, if supported. This is exactly the case for Python and Sun's Java regex package, for which the ==== version can be up to 100x faster than ={4}.

More advanced still, Perl, Ruby, and .NET recognize this optimization with either ==== or ={4}, and as such, both are equally fast (and in either case, can be hundreds or thousands of times faster than the \d\d\d\d and \d{4} counterparts).

Need cognizance

One simple optimization is that if the engine realizes that some aspect of the match result isn't needed (say, the capturing aspect of capturing parentheses), it can eliminate the work to support it. The ability to detect such a thing is very language dependent, but this optimization can be gained as easily as allowing an extra match-time option to disable various high-cost features.

One example of a system that has this optimization is Tcl. Its capturing parentheses don't actually capture unless you explicitly ask. Conversely, the .NET Framework regular expressions have an option that allows the programmer to indicate that capturing parentheses shouldn't capture.

Mastering Regular Expressions
ISBN: 0596528124