10.6. PHP Efficiency Issues

PHP's preg routines use PCRE, an optimized NFA regular-expression engine, so many of the techniques discussed in Chapters 4 through 6 apply directly. This includes benchmarking critical sections of code to understand in practice, and not just in theory, what is fast and what is not. Chapter 6 shows an example of benchmarking in PHP (☞ 234).
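As a rough illustration of that kind of benchmarking, here is a minimal sketch. The helper name time_regex, the sample subject, and the iteration count are assumptions made for this example, not taken from Chapter 6, and the timings it prints will differ from machine to machine.

```php
<?php
// Hypothetical helper (an assumption for this sketch): apply $pattern to
// $subject $iterations times and return the elapsed wall-clock seconds.
function time_regex(string $pattern, string $subject, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;
}

// Sample data: a long run of filler text with a single tag at the very end.
$subject = str_repeat('lorem ipsum dolor sit amet ', 2000) . '<b>';

$elapsed = time_regex('/<(\w+)/', $subject);
printf("1000 iterations took %.6f seconds\n", $elapsed);
```

Running the same helper against two candidate patterns and comparing the numbers is usually more informative than reasoning about either one in isolation.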

For particularly time-critical code, remember that a callback is generally faster than using the e pattern modifier (☞ 465), and that named capture with very long strings can result in a lot of extra data copying.
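A minimal sketch of the callback approach follows. The temperature-conversion task and the sample text are assumptions chosen only to have something to replace. PHP 7 and later no longer support the e pattern modifier, so there the callback form is the only option.

```php
<?php
// Convert every Celsius reading such as "21C" to Fahrenheit with a callback.
$text = 'Low 10C, high 25C';

$result = preg_replace_callback(
    '/(\d+)C\b/',
    function (array $m): string {
        // $m[1] holds the digits captured by (\d+).
        $fahrenheit = (int) round($m[1] * 9 / 5 + 32);
        return $fahrenheit . 'F';
    },
    $text
);

echo $result, "\n";   // Low 50F, high 77F
```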

Regular expressions are compiled as they're encountered at runtime, but PHP has a huge 4,096-entry cache (☞ 242), so in practice, a particular pattern string is compiled only the first time it is encountered.

The S pattern modifier deserves special coverage: it "studies" a regex to try to achieve a faster match. (This is unrelated, by the way, to Perl's study function, which works with target text rather than a regular expression; ☞ 359.)

10.6.1. The S Pattern Modifier: "Study"

Using the S pattern modifier instructs the preg engine to spend a little extra time [*] studying the regular expression prior to its application, with the hope that the extra time spent increases match speed enough to justify it. It may well be that no extra speed is achieved by doing this, but in some situations the speedup is measured in orders of magnitude.

[*] Really, a very little extra time. For very long and complex regexes on an average CPU, the extra time taken by the S pattern modifier is less than one hundred-thousandth of a second.

Currently, the situations where study can and can't help are fairly well defined: it enhances what Chapter 6 calls the initial class discrimination optimization (☞ 247).

I'll start by noting that unless you intend to apply a regex to a lot of text, there's probably not much time to save in the first place. You need to be concerned with the S pattern modifier only when applying the same regex to large chunks of text, or to many small chunks.
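The "many small chunks" case might look like the following sketch, where one pattern string carrying the S modifier is applied to a whole array of subjects. The subject lines are made-up sample data.

```php
<?php
// Apply one studied pattern (note the trailing S) to many small strings.
$subjects = ['Re: lunch', 'hello', 're: budget', 'Regarding X'];
$pattern  = '/[Rr]e:/S';

// preg_grep returns the matching elements and preserves their keys.
$hits = preg_grep($pattern, $subjects);

print_r($hits);
```

Because the pattern string is identical on every application, any study cost is paid once and then shared by all the subjects.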

10.6.1.1. Standard optimizations, without the S pattern modifier

Consider a simple expression such as <(\w+). Due to the nature of this regex, we know that every match must begin with the '<' character. A regex engine can (and in the preg suite's case, does) take advantage of that by presearching the target string for '<' and applying the full regular expression at those locations only (since a match must begin with <, applying it starting at any other character is pointless).
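To make the idea concrete, the presearch can be imitated in userland. This is an illustration only: the real optimization happens inside the regex engine, and the function name find_tags and the sample text are assumptions made for this sketch.

```php
<?php
// Illustration only: imitate the engine's presearch by hand.
function find_tags(string $text): array
{
    $names = [];
    $pos = 0;
    // Jump straight to each '<' and skip everything in between.
    while (($pos = strpos($text, '<', $pos)) !== false) {
        // \G anchors the attempt at the given offset, so the full regex is
        // tried only at positions where a match could possibly begin.
        if (preg_match('/\G<(\w+)/', $text, $m, 0, $pos)) {
            $names[] = $m[1];
        }
        $pos++;
    }
    return $names;
}

print_r(find_tags('a < b, <em>x</em> and <b>y</b>'));
```

Every character that is not '<' is passed over by strpos without the regex ever being applied there, which is the saving the text describes.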

This simple presearch can be much faster than a full regex application, and therein lies the optimization. In particular, the less frequently the character in question appears in the target text, the greater the optimization. Also, the more work a regex engine must do to detect a first-character failure, the greater the benefit of the optimization. This optimization helps <i>|</i>|<b>|</b> more than <(\w+) because in the first case, the regex engine would otherwise have to attempt four different alternatives before moving on to the next attempt. That's a lot of work to avoid.

10.6.1.2. Enhancing the optimization with the S pattern modifier

The preg engine is smart enough to apply this optimization to most expressions that have only a single character that must start any match, as in the previous examples. However, the S pattern modifier tells the engine to preanalyze the expression to enable this optimization for expressions whose possible matches have multiple starting characters.

Here are several sample expressions, some of which we've already seen in this chapter, that require the S pattern modifier to be optimized in this way:

Regex                 Possible Starting Characters

<(\w+)|&(\w+);        < &
[Rr]e:                R r
(Jan|Feb|…|Dec)\b     A D F J M N O S
(Re:\S*)?SPAM         R S
\s*,\s*               \x09 \x0A \x0C \x0D \x20 ,
[&<">]                & < " >
\r?\n\r?\n            \r \n

10.6.1.3. When the S pattern modifier can't help

It's instructive to look at the type of expressions that don't benefit from the S pattern modifier:

  • Expressions that have a leading anchor (e.g., ^ or \b), or an anchor leading a global-level alternative. For \b, this is a restriction of the current implementation that could, in theory, be removed in some future version.

  • Expressions that can match nothingness, such as \S*.

  • Expressions that can match starting at any character (or most any character), such as (?: [^()]++ | \((?R)\) )*, seen in an example on page 475. This expression can start on any character except ')', so a precheck is not likely to eliminate many starting positions.

  • Expressions that have only one possible starting character, because they are already optimized.
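One of these cases is easy to probe from code: a pattern that succeeds against the empty string can match nothingness. The sketch below uses a hypothetical helper, can_match_empty. It is only a rough check, since a pattern might match emptily in some contexts but not against an empty subject, and it says nothing about the other cases in the list.

```php
<?php
// Hypothetical rough check: does $pattern match the empty string?
function can_match_empty(string $pattern): bool
{
    return preg_match($pattern, '') === 1;
}

var_dump(can_match_empty('/\S*/'));      // bool(true)
var_dump(can_match_empty('/[Rr]e:/'));   // bool(false)
```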

10.6.1.4. Suggested use

It doesn't take long for the preg engine to do the extra analysis invoked by the S pattern modifier, so if you'll be applying a regex to relatively large chunks of text, it doesn't hurt to use it. If you think there's any chance it might apply, the potential benefit makes it worthwhile.
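Whether the modifier pays off for a given pattern and data set is best settled by measuring. The following sketch times the same expression with and without S. The helper name count_matches and the sample text are assumptions, and the timings depend on the PHP version and PCRE build, so on some installations the two runs may show no measurable difference.

```php
<?php
// Hypothetical helper: count all matches of $pattern in $text.
function count_matches(string $pattern, string $text): int
{
    return (int) preg_match_all($pattern, $text);
}

// Sample data: lots of filler with no '<' or '&', then three matches.
$text = str_repeat('plain filler text without markup ', 5000) . '<a> &amp; <b>';

$plain   = '/<(\w+)|&(\w+);/';
$studied = '/<(\w+)|&(\w+);/S';

foreach ([$plain, $studied] as $pattern) {
    $start = microtime(true);
    $count = count_matches($pattern, $text);
    $elapsed = microtime(true) - $start;
    printf("%-22s %d matches in %.6f s\n", $pattern, $count, $elapsed);
}
```

Both runs must report the same number of matches; the modifier is meant to affect only speed, never results.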



Mastering Regular Expressions
ISBN: 0596528124
Year: 2004
Pages: 113