Section 4.3. Regex-Directed Versus Text-Directed

4.3. Regex-Directed Versus Text-Directed

The two basic engine types reflect a fundamental difference in algorithms available for applying a regular expression to a string. I call the gasoline-driven NFA engine "regex-directed," and the electric-driven DFA "text-directed."

4.3.1. NFA Engine: Regex-Directed

Let's consider one way an engine might match to(nite knight night) against the text ' ‹tonight‹ '. Starting with the t , the regular expression is examined one component at a time, and the "current text is checked to see whether it is matched by the current component of the regex. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved.

Quiz Answer

Answer to the question on page 153 .

The desire is to get the last whole number, but it doesn't work. As before, .* is forced to relinquish some of what it had matched because the subsequent [0-9]+ requires a match to be successful. In this example, that means unmatching the final period and ' 3 ', which then allows [0-9] to match. Thats governed by + , so matching just once fulfills its minimum, and now facing '.' in the string, it finds nothing else to match.

Unlike before, though, there's then nothing further that must match, so .* is not forced to give up the or any other digits it might have matched. Were .* to do so, the [0-9]+ would certainly be a grateful and greedy recipient, but nope, first come first served . Greedy constructs give up something theyve matched only when forced. In the end, $1 gets only ' 3 '.

If this feels counter-intuitive, realize that [0-9]+ is at most one match away from [0-9] , which is in the same league as .* . Substituting that into ^.*([0-9]+) , we get ^.*() as our regex, which looks suspiciously like the ^Subject: (.*) example from page 152, where the second .* was guaranteed to match nothing.

With the to(niteknightnight) example, the first component is t , which repeatedly fails until a ' t ' is reached in the target string. Once that happens, the o is checked against the next character, and if it matches, control moves to the next component. In this case, the "next component is (niteknightnight) which really means " nite or knight or night ." Faced with three possibilities, the engine just tries each in turn . We ( humans with advanced neural nets between our ears) can see that if were matching tonight , the third alternative is the one that leads to a match. Despite their brainy origins (˜ 85), a regex-directed engine can't come to that conclusion until actually going through the motions to check.

Attempting the first alternative, nite , involves the same component-at-a-time treatment as before: "Try to match n , then i , then t , and finally e . " If this fails, as it eventually does, the engine tries another alternative, and so on until it achieves a match or must report failure. Control moves within the regex from component to component, so I call it "regex-directed."

4.3.1.1. The control benefits of an NFA engine

In essence, each subexpression of a regex in a regex-directed match is checked independently of the others. Other than backreferences , there's no interrelation among subexpressions , except for the relation implied by virtue of being thrown together to make a larger expression. The layout of the subexpressions and regex control structures (e.g., alternation , parentheses, and quantifiers) controls an engine's overall movement through a match.

Since the regex directs the NFA engine, the driver (the writer of the regular expression) has considerable opportunity to craft just what he or she wants to happen. (Chapters 5 and 6 show how to put this to use to get a job done correctly and efficiently .) What this really means may seem vague now, but it will all be spelled out soon.

4.3.2. DFA Engine: Text-Directed

Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches "currently in the works." In the tonight example, the moment the engine hits t , it adds a potential match to its list of those currently in progress:

in string	in regex
after ‹ `t` `onight` ‹	possible matches: `t o(niteknightnight)`

Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes

in string	in regex
after ‹ `toni` `ght` ‹	possible matches: `to(ni teknightni ght)`

with two possible matches in the works (and one alternative, knight , ruled out). With the g that follows , only the third alternative remains viable . Once the h and t are scanned as well, the engine realizes it has a complete match and can return success.

I call this "text-directed" matching because each character scanned from the text controls the engine. As in the example, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a "partial match in progress" is also a full match. If the regex were to(‹)? , for example, the parenthesized expression becomes optional, but its still greedy, so it's always attempted. All the time that a partial match is in progress inside those parentheses, a full match (of ' to ') is already confirmed and in reserve in case the longer matches don't pan out.

If the engine reaches a character in the text that invalidates all the matches in the works, it must revert to one of the full matches in reserve. If there are none, it must declare that there are no matches at the current attempt's starting point.

4.3.3. First Thoughts: NFA and DFA in Comparison

If you compare these two engines based only on what I've mentioned so far, you might conclude that the text-directed DFA engine is generally faster. The regex-directed NFA engine might waste time attempting to match different subexpressions against the same text (such as the three alternatives in the example).

You would be right. During the course of an NFA match, the same character of the target might be checked by many different parts of the regex (or even by the same part, over and over). Even if a subexpression can match, it might have to be applied again (and again and again) as it works in concert with the rest of the regex to find a match. A local subexpression can fail or match, but you just never know about the overall match until you eventually work your way to the end of the regex. (If I could find a way to include "It's not over until the fat lady sings." in this paragraph, I would.) On the other hand, a DFA engine is deterministic each character in the target is checked once (at most). When a character matches, you don't know yet if it will be part of the final match (it could be part of a possible match that doesn't pan out), but since the engine keeps track of all possible matches in parallel, it needs to be checked only once, period.

The two basic technologies behind regular-expression engines have the somewhat imposing names Nondeterministic Finite Automaton (NFA) and Deterministic Finite Automaton (DFA). With mouthfuls like this, you see why I stick to just "NFA" and "DFA." We won't be seeing these phrases spelled out again. ^{[ ]}

^{[ ]} I suppose I could explain the underlying theory that goes into these names, if I only knew it! As I hinted, the word deterministic is pretty important, but for the most part the theory is not relevant, so long as we understand the practical effects. By the end of this chapter, we will.

4.3.3.1. Consequences to us as users

Because of the regex-directed nature of an NFA, the details of how the engine attempts a match are very important. As I said before, the writer can exercise a fair amount of control simply by changing how the regex is written. With the tonight example, perhaps less work would have been wasted had the regex been written differently, such as in one of the following ways:

to ( ni(ghtte) knight )
tonitetoknighttonight
to(k?nightnite)

With any given text, these all end up matching exactly the same thing, but in doing so direct the engine in different ways. At this point, we don't know enough to judge which of these, if any, are better than the others, but that's coming soon.

It's the exact opposite with a DFAsince the engine keeps track of all matches simultaneously, none of these differences in representation matter so long as in the end they all represent the same set of possible matches. There could be a hundred different ways to achieve the same result, but since the DFA keeps track of them all simultaneously (almost magicallymore on this later), it doesn't matter which form the regex takes. To a pure DFA, even expressions that appear as different as abc and [aa-a](bb{1}b)c are utterly indistinguishable.

Three things come to my mind when describing a DFA engine:

DFA matching is very fast.
DFA matching is very consistent.
Talking about DFA matching is very boring.

I'll eventually expand on all these points.

The regex-directed nature of an NFA makes it interesting to talk about. NFAs provide plenty of room for creative juices to flow. There are great benefits in crafting an expression well, and even greater penalties for doing it poorly. A gasoline engine is not the only engine that can stall and conk out completely. To get to the bottom of this, we need to look at the essence of an NFA engine: backtracking .