6.3. A Global View of Backtracking
On a local level, backtracking is simply the return to attempt an untried option. That's simple enough to understand, but the global implications of backtracking are not as easily grasped. In this section, we'll take an explicit look at the details of backtracking, both during a match and during a non-match, and we'll try to make some sense out of the patterns we see emerge.
Let's start by looking closely at some examples from the previous chapters. From page 165, if we apply ".*" to
we can visualize the matching action as shown in Figure 6-3.
Figure 6-3. Successful match of ".* "
The regex is attempted starting at each string position in
then matches to the end of the string, where the dot is unable to match the
This means that we try to match the closing quote at the end of the string. Well, aquote can match nothingness no better than dot, so this fails too. The engine backtracks again, this time trying to match the closing quote at ‹ Japanes e , which also fails.
If this is a Traditional NFA, the remaining unused states are simply discarded and the successful match is
6.3.1. More Work for a POSIX NFA
For POSIX NFA, the match noted earlier is remembered as "the longest match we've seen so far," but all remaining states must still be explored to see whether they could come up with a longer match. We know this won't happen in this case, but the regex engine must find that out for itself.
So, the states are tried and immediately discarded except for the remaining two situations where there is a quote in the string available to match the final quote. Thus, the sequences D - E - F and F - G - H are similar to B - C - D , except the matches at F and H are discarded as being shorter than a previously found match at D
, the only remaining backtrack is the "bump along and
6.3.2. Work Required During a Non-Match
We still need to look at what happens when there is no match. Let's look at ".*" . We know this wont match our example text, but it comes close on a number of occasions throughout the match attempt. As we'll see, that results in much more work.
Figure 6-4 illustrates this work. The A - I sequence looks similar to that in Figure 6-3. One difference is that this time it does not match at point D (because the ending exclamation point can't match). Another difference is that the entire sequence in Figure 6-4 applies to both Traditional and POSIX NFA engines: finding no match, the Traditional NFA must try as many possibilities as the POSIX NFAall of them.
Figure 6-4. Failing attempt to match ".* "!
Since there is no match from the overall attempt starting at
and ending at
, the transmission bumps along to retry the match. Attempts eventually starting at points
6.3.3. Being More Specific
As a comparison, let's replace the dot with
. As discussed in the previous chapter, this gives less surprising results because it is more specific, and the end result is that with it, the new regex is more efficient to boot. With
cant get past the closing quote, eliminating much matching and
Figure 6-5 shows the failing attempt (compare to Figure 6-4). As you can see, much less backtracking is needed. If the different results suit your needs, the reduced backtracking is a welcome side effect.
Figure 6-5. Failing attempt to match "[^"]* "!