4.1 An Introduction to Parsers | Text Processing in Python

4.1.1 When Data Becomes Deep and Texts Become Stateful

Regular expressions can match quite complicated patterns, but they fall short when it comes to matching arbitrarily nested subpatterns. Such nested subpatterns occur quite often in programming languages and textual markup languages (and occasionally elsewhere). For example, in HTML documents, you can find lists or tables nested inside each other. For that matter, character-level markup is also allowed to nest arbitrarily; the following defines a valid HTML fragment:

 >>> s = '''<p>Plain text, <i>italicized phrase,
            <i>italicized subphrase</i>, <b>bold
            subphrase</b></i>, <i>other italic
            phrase</i></p>'''

The problem with this fragment is that most any regular expression will match either less or more than a desired <i> element body. For example:

 >>> import re
 >>> ital = r'''(?sx)<i>.+</i>'''
 >>> for phrs in re.findall(ital, s):
 ...     print(phrs + '\n-----')
 ...
 <i>italicized phrase,
            <i>italicized subphrase</i>, <b>bold
            subphrase</b></i>, <i>other italic
            phrase</i>
 -----
 >>> ital2 = r'''(?sx)<i>.+?</i>'''
 >>> for phrs in re.findall(ital2, s):
 ...     print(phrs + '\n-----')
 ...
 <i>italicized phrase,
            <i>italicized subphrase</i>
 -----
 <i>other italic
            phrase</i>
 -----

What is missing in the proposed regular expressions is a concept of state. If you imagine reading through a string character-by-character (which a regular expression match must do within the underlying regex engine), it would be useful to keep track of "How many layers of italics tags am I in?" With such a count of nesting depth, it would be possible to figure out which opening tag <i> a given closing tag </i> was meant to match. But regular expressions are not stateful in the right way to do this.
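The nesting-depth idea can be sketched in a few lines of Python 3. This is only an illustration, not a general HTML tool: the function name `italic_spans` is hypothetical, and the sketch assumes the tags are balanced and spelled exactly as `<i>` and `</i>`. A regular expression is still used to find the individual tags, but the state (the stack of open tags) lives in ordinary Python code.

```python
import re

def italic_spans(text):
    """Return (depth, body) pairs for every balanced <i>...</i> element."""
    stack = []    # offsets where the bodies of still-open <i> tags begin
    spans = []
    for m in re.finditer(r'</?i>', text):
        if m.group() == '<i>':
            stack.append(m.end())      # the body starts right after the tag
        elif stack:
            start = stack.pop()        # innermost open tag owns this close tag
            spans.append((len(stack) + 1, text[start:m.start()]))
    return spans

s = ('<p>Plain text, <i>italicized phrase, <i>italicized subphrase</i>, '
     '<b>bold subphrase</b></i>, <i>other italic phrase</i></p>')
for depth, body in italic_spans(s):
    print(depth, repr(body))
```

Each closing tag is paired with the most recently opened tag, so the inner subphrase is reported at depth 2 and the two outer phrases at depth 1.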

You encounter a similar nesting in most programming languages. For example, suppose we have a hypothetical (somewhat BASIC-like) language with an IF/THEN/END structure. To simplify, suppose that every condition is spelled to match the regex cond\d+, and every action matches act\d+. But the wrinkle is that IF/THEN/END structures can nest within each other also. So for example, let us define the following three top-level structures:

 >>> s = '''
 IF cond1 THEN act1 END
 -----
 IF cond2 THEN
   IF cond3 THEN act3 END
 END
 -----
 IF cond4 THEN
   act4
 END
 '''

As with the markup example, you might first try to identify the three structures using a regular expression like:

 >>> pat = r'''(?sx) IF \s+ cond\d+ \s+ THEN \s+ act\d+ \s+ END'''
 >>> for stmt in re.findall(pat, s):
 ...     print(stmt + '\n-----')
 ...
 IF cond1 THEN act1 END
 -----
 IF cond3 THEN act3 END
 -----
 IF cond4 THEN
   act4
 END
 -----

This indeed finds three structures, but the wrong three. The second top-level structure should be the compound statement that used cond2, not its child using cond3. It is not too difficult to allow a nested IF/THEN/END structure to optionally substitute for a simple action; for example:

 >>> pat2 = r'''(?sx)(
 IF \s+ cond\d+ \s+ THEN \s+
 (  (IF \s+ cond\d+ \s+ THEN \s+ act\d+ \s+ END)
  | (act\d+)
 )
 \s+ END )'''
 >>> for stmt in re.findall(pat2, s):
 ...     print(stmt[0] + '\n-----')
 ...
 IF cond1 THEN act1 END
 -----
 IF cond2 THEN
   IF cond3 THEN act3 END
 END
 -----
 IF cond4 THEN
   act4
 END
 -----

By manually nesting a "first order" IF/THEN/END structure as an alternative to a simple action, we can indeed match the example in the desired fashion. But we have assumed that nesting of IF/THEN/END structures goes only one level deep. What if a "second order" structure is nested inside a "third order" structure and so on, ad infinitum? What we would like is a means of describing arbitrarily nested structures in a text, in a manner similar to, but more general than, what regular expressions can describe.
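To see the limitation concretely, the following Python 3 sketch applies the same one-level pattern (written here as a raw string) to a structure nested three levels deep. The variable names `deep` and `found` are hypothetical.

```python
import re

# The one-level pattern from the text: a simple action or ONE nested IF.
pat2 = r'''(?sx)(
    IF \s+ cond\d+ \s+ THEN \s+
    (  (IF \s+ cond\d+ \s+ THEN \s+ act\d+ \s+ END)
     | (act\d+)
    )
    \s+ END )'''

deep = 'IF cond1 THEN IF cond2 THEN IF cond3 THEN act3 END END END'

# findall() returns one tuple of groups per match; group 0 is the outer one.
found = [groups[0] for groups in re.findall(pat2, deep)]
print(found)
```

The only match reported is the two-level structure that begins at cond2; the outermost structure using cond1 is never recognized as a unit. Adding another hand-written level of nesting to the pattern would only move the problem one level deeper.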

4.1.2 What Is a Grammar?

In order to parse nested structures in a text, you usually use something called a "grammar." A grammar is a specification of a set of "nodes" (also called "productions") arranged into a strictly hierarchical "tree" data structure. A node can have a name and perhaps some other properties and it can also have an ordered collection of child nodes. When a document is parsed under a grammar, no resultant node can ever be a descendent of itself; this is another way of saying that a grammar produces a tree rather than a graph.

In many actual implementations, such as the famous C-based tools lex and yacc, a grammar is expressed at two layers. At the first layer, a "lexer" (or "tokenizer") produces a stream of "tokens" for a "parser" to operate on. Such tokens are frequently what you might think of as words or fields, but in principle they can split the text differently than does our normal idea of a "word." In any case tokens are nonoverlapping subsequences of the original text. Depending on the specific tool and specification used, some subsequences may be dropped from the token stream. A "zero-case" lexer is one that simply treats the actual input bytes as the tokens a parser operates on (some modules discussed do this, without losing generality).
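As a rough illustration of the lexer layer, here is a minimal Python 3 tokenizer for the IF/THEN/END language from the previous section. It is a sketch, not the interface of lex or of any module discussed in this chapter: the token category names and the `tokenize` function are assumptions made for the example. Whitespace is recognized but dropped from the token stream, as the paragraph above describes.

```python
import re

# Hypothetical token categories; the names are illustrative only.
TOKEN_SPEC = [
    ('KEYWORD', r'IF|THEN|END'),
    ('COND',    r'cond\d+'),
    ('ACTION',  r'act\d+'),
    ('WS',      r'\s+'),
]
TOKEN_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, lexeme) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('unexpected character at offset %d' % pos)
        if m.lastgroup != 'WS':        # whitespace never reaches the parser
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()                  # tokens are nonoverlapping
    return tokens

print(tokenize('IF cond1 THEN\n  act1\nEND'))
```

A parser built on top of this would see only five tokens for the sample input, regardless of how the source text was laid out.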

The second layer of a grammar is the actual parser. A parser reads a stream or sequence of tokens and generates a "parse tree" out of it. Or rather, a tree is generated under the assumption that the underlying input text is "well-formed" according to the grammar; that is, there is a way to consume the tokens within the grammar specification. With most parser tools, a grammar is specified using a variant of EBNF (Extended Backus-Naur Form).

An EBNF grammar consists of a set of rule declarations, where each rule allows quantification and alternation similar to that in regular expressions. Different tools use slightly different syntax for specifying grammars, and different tools also differ in expressivity and available quantifiers. But almost all tools have a fairly similar feel in their grammar specifications. Even the DTDs used in XML dialect specifications (see Chapter 5) have a very similar syntax to other grammar languages, which makes sense since an XML dialect is a particular grammar. A DTD entry looks like:

 <!ELEMENT body  ((example-column | image-column)?, text-column) >

In brief, under the sample DTD, a <body> element may contain either one or zero occurrences of a "first thing," that first thing being either an <example-column> or an <image-column>. Following the optional first component, exactly one <text-column> must occur. Of course, we would need to see the rest of the DTD to see what can go in a <text-column>, or to see what other element(s) a <body> might be contained in. But each such rule is similar in form.
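To make that reading of the content model concrete, the following Python 3 sketch checks a sequence of child element names against the same rule. This is not how an XML validator works; the names `BODY_MODEL` and `valid_body` are hypothetical, and the content model is simply transcribed by hand into a regular expression over space-separated element names.

```python
import re

# ((example-column | image-column)?, text-column) over a list of child names.
BODY_MODEL = re.compile(r'(?:(?:example-column|image-column) )?text-column')

def valid_body(children):
    """True if the child element names satisfy the sample <body> rule."""
    return BODY_MODEL.fullmatch(' '.join(children)) is not None

print(valid_body(['text-column']))                    # optional part omitted
print(valid_body(['image-column', 'text-column']))    # optional part present
print(valid_body(['example-column', 'image-column', 'text-column']))
```

The first two calls succeed and the third fails, because the optional first component may occur at most once.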

A familiar EBNF grammar to Python programmers is the grammar for Python itself. On many Python installations, this grammar as a single file can be found at a disk location like [...]/Python22/Doc/ref/grammar.txt. The online and downloadable Python Language Reference excerpts from the grammar at various points. As an example, a floating point number in Python is identified by the specification:

EBNF-style description of Python floating point

 floatnumber   ::= pointfloat | exponentfloat
 pointfloat    ::= [intpart] fraction | intpart "."
 exponentfloat ::= (intpart | pointfloat) exponent
 intpart       ::= digit+
 fraction      ::= "." digit+
 exponent      ::= ("e" | "E") ["+" | "-"] digit+
 digit         ::= "0"..."9"

The Python grammar is given in an EBNF variant that allows considerable expressivity. Most of the tools this chapter discusses are comparatively limited (but are still ultimately capable of expressing just as general grammars, albeit more verbosely). Both literal strings and character ranges may be specified as part of a production. Alternation is expressed with "|". Quantifications with both "+" and "*" are used. These features are very similar to those in regular expression syntax. Additionally, optional groups are indicated with square brackets ("[" and "]"), and mandatory groups with parentheses. Conceptually the former is the same as the regex "?" quantifier.

Where an EBNF grammar goes beyond a regular expression pattern is in its use of named terms as parts of patterns. At first glance, it might appear possible simply to substitute regular expression patterns for named subexpressions. In fact, in the floating point pattern presented, we could simply do this as:

Regular expression to identify a floating point

 pat = r'''(?x)
       (                   # exponentfloat
         (                 # intpart or pointfloat
           (               # pointfloat
             (\d+)?[.]\d+  # optional intpart with fraction
             |
             \d+[.]        # intpart with period
           )               # end pointfloat
           |
           \d+             # intpart
         )                 # end intpart or pointfloat
         [eE][+-]?\d+      # exponent
       )                   # end exponentfloat
       |
       (                   # pointfloat
         (\d+)?[.]\d+      # optional intpart with fraction
         |
         \d+[.]            # intpart with period
       )                   # end pointfloat
       '''

As a regular expression, the description is harder to read, even with the documentation added to a verbose regex. The EBNF grammar is more or less self-documenting. Moreover, some care had to be taken about the order of the regular expression: the exponentfloat alternative is required to be listed before the pointfloat alternative, since the latter can form a subsequence of the former. But aside from the need for a little tweaking and documentation, the regular expression above is exactly as general as, and exactly equivalent to, the Python grammar for a floating point number.
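The substitution of patterns for named terms can also be carried out mechanically. The Python 3 sketch below builds one regular expression from fragments named after the EBNF productions; the variable names mirror the grammar, while the helper `is_float` is a hypothetical name introduced for the example. It works only because no fragment refers to itself.

```python
import re

# Regex fragments named after the EBNF productions they stand for.
digit = r'[0-9]'
intpart = r'(?:%s+)' % digit
fraction = r'(?:\.%s+)' % digit
exponent = r'(?:[eE][+-]?%s+)' % digit
pointfloat = r'(?:%s?%s|%s\.)' % (intpart, fraction, intpart)
exponentfloat = r'(?:(?:%s|%s)%s)' % (pointfloat, intpart, exponent)
# exponentfloat comes first, which matters for search() and findall();
# fullmatch() below would accept the same strings in either order.
floatnumber = re.compile(r'(?:%s|%s)' % (exponentfloat, pointfloat))

def is_float(text):
    """True if the whole string is a floatnumber under the grammar."""
    return floatnumber.fullmatch(text) is not None

for candidate in ['3.14', '10.', '.001', '1e100', '3.14e-10', '1', '1e']:
    print(candidate, is_float(candidate))
```

Note that a bare integer such as 1 is rejected, since the grammar requires either a decimal point or an exponent.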

You might wonder, therefore, what the point of a grammar is. It turns out that a floating point number is an unusually simple structure in one very specific respect. A floatnumber requires no recursion or self-reference in its definition. Everything that makes up a floatnumber is something simpler, and everything that makes up one of those simpler components is itself made up of still simpler ones. You reach a bottom in defining a Python floating point number.

In the general case, structures can recursively contain themselves, either directly or by containing other structures that in turn contain the first structures. It is not even entirely absurd to imagine floating point numbers with such a grammar (whatever language had them would not be Python, however). For example, the famous number "googol" was defined in 1938 by Edward Kasner as 10 to the 100th power (otherwise called "10 dotrigintillion"). As a Python floating point, you could write this as 1e100. Kasner also defined a "googolplex" as 10 to the googol power (a number much larger than anyone needs for any practical reason). While you can create a Python expression to name a googolplex (for example, 10**1e100), it is not difficult to conceive of a programming language that allowed the term 1e1e100 as a name for a googolplex. By the way: If you try to actually compute a googolplex in Python (or any other programming language), you will be in for disappointment; expect a frozen computer and/or some sort of crash or overflow. The numbers you can express in most language grammars are quite a bit more numerous than those your computer can actually do anything with.

Suppose that you wanted to allow these new "extended" floating point terms in a language. In terms of the grammar, you could just change a line of the EBNF description:

 exponent ::= ("e" | "E") ["+" | "-"] (floatnumber | digit+)

In the regular expression, the change is a problem. A portion of the regular expression identifies the (optional) exponent:

 [eE][+-]?\d+      # exponent

In this case, an exponent is just a series of digit characters. But for "extended" floating point terms, the regular expression would need to substitute the entire pat regular expression in place of \d+. Unfortunately, this is impossible, since each replacement would still contain the insufficient \d+ description, which would again require substitution. The sequence of substitutions continues ad infinitum, until the regular expression is infinitely long.
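What the regular expression cannot do by textual substitution, a recursive function can do directly. The Python 3 sketch below recognizes the "extended" floating point terms; the names `end_of_float` and `is_extended_float` are hypothetical. It assumes that an exponent may be either a plain run of digits or a nested floatnumber, so that 1e100 itself remains valid.

```python
import re

# Non-recursive part: group 1 is set when the mantissa is a pointfloat.
MANTISSA = re.compile(r'(\d+\.\d*|\.\d+)|\d+')

def end_of_float(text, pos=0, allow_int=False):
    """Return the offset just past an extended float at pos, or -1."""
    m = MANTISSA.match(text, pos)
    if m is None:
        return -1
    has_point = m.group(1) is not None
    pos = m.end()
    if pos < len(text) and text[pos] in 'eE':
        nxt = pos + 1
        if nxt < len(text) and text[nxt] in '+-':
            nxt += 1
        # The exponent is itself a number: this is the recursive step.
        end = end_of_float(text, nxt, allow_int=True)
        if end != -1:
            return end
    # Without an exponent, bare digits count only inside an exponent.
    return pos if (has_point or allow_int) else -1

def is_extended_float(text):
    return end_of_float(text) == len(text)

for candidate in ['1e100', '1e1e100', '2e-3.5e+2', '100', '1e1e']:
    print(candidate, is_extended_float(candidate))
```

The function calls itself once per level of exponent, so the depth of nesting is limited only by Python's recursion limit rather than by the length of a pattern.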

4.1.3 An EBNF Grammar for IF/THEN/END Structures

The IF/THEN/END language structure presented above is a more typical and realistic example of nestable grammatical structures than are our "extended" floating point numbers. In fact, Python, along with almost every other programming language, allows precisely such if statements inside other if statements. It is worthwhile to look at how we might describe our hypothetical simplified IF/THEN/END structure in the same EBNF variant used for Python's grammar.

Recall first our simplified rules for allowable structures: The keywords are IF, THEN, and END, and they always occur in that order within a completed structure. Keywords in this language are always in all capitals. Any whitespace in a source text is insignificant, except that each term is separated from others by at least some whitespace. Every condition is spelled to match the regular expression cond\d+. Every IF "body" either contains an action that matches the regular expression act\d+, or it contains another IF/THEN/END structure. In our example, we created three IF/THEN/END structures, one of which contained a nested structure:

 IF cond1 THEN act1 END
 -----
 IF cond2 THEN
   IF cond3 THEN act3 END
 END
 -----
 IF cond4 THEN
   act4
 END

Let us try a grammar:

EBNF grammar for IF/THEN/END structures

 if_expr   ::= "IF" ws cond ws "THEN" ws action ws "END"
 whitechar ::= " " | "\t" | "\n" | "\r" | "\f" | "\v"
 ws        ::= whitechar+
 digit     ::= "0"..."9"
 number    ::= digit+
 cond      ::= "cond" number
 action    ::= simpleact | if_expr
 simpleact ::= "act" number

This grammar is fairly easy to follow. It defines a few "convenience" productions like ws and number that consist of repetitions of simpler productions. whitechar is defined as an explicit alternation of individual characters, as is digit for a continuous range. Taken to the extreme, every production could actually be included in a much more verbose if_expr production; you would just substitute all the right-hand sides of nested productions for the names in the if_expr production. But as given, the grammar is much easier to read. The most notable aspect of this grammar is the action production, since an action can itself recursively contain an if_expr.
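One way to see the recursion in the action production at work is a small hand-written recursive descent parser. The Python 3 sketch below is an illustration only, not one of the parser tools this chapter discusses; the function names and the tuple shape of the resulting tree are assumptions. It uses `str.split()` to stand in for the ws production, which also tolerates leading and trailing whitespace that the grammar itself does not mention.

```python
import re

def expect(tokens, pos, pattern):
    """Return tokens[pos] if it fully matches pattern, else raise."""
    if pos >= len(tokens) or not re.fullmatch(pattern, tokens[pos]):
        raise ValueError('expected %s at token %d' % (pattern, pos))
    return tokens[pos]

def parse_if_expr(tokens, pos=0):
    """Parse one if_expr at tokens[pos]; return (tree, position after END)."""
    expect(tokens, pos, 'IF')
    cond = expect(tokens, pos + 1, r'cond[0-9]+')
    expect(tokens, pos + 2, 'THEN')
    pos += 3
    if pos < len(tokens) and tokens[pos] == 'IF':
        body, pos = parse_if_expr(tokens, pos)    # action ::= if_expr
    else:
        body = expect(tokens, pos, r'act[0-9]+')  # action ::= simpleact
        pos += 1
    expect(tokens, pos, 'END')
    return ('if', cond, body), pos + 1

def parse(text):
    tokens = text.split()        # whitespace only separates the terms
    tree, pos = parse_if_expr(tokens)
    if pos != len(tokens):
        raise ValueError('unexpected text after END')
    return tree

print(parse('IF cond2 THEN\n  IF cond3 THEN act3 END\nEND'))
```

Because `parse_if_expr` calls itself wherever the grammar allows a nested if_expr, the same few lines handle any depth of nesting.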

For this problem, the reader is encouraged to develop grammars for some more robust variations on the very simple IF/THEN/END language we have looked at. As is evident, it is difficult to actually do much with this language by itself, even if its actions and conditions are given semantic meaning outside the structure. Readers can invent their own variations, but a few are proposed below.

4.1.4 Pencil-and-Paper Parsing

To test a grammar at this point, just try to expand each successive character into some production that is allowed at that point in the parent production, using pencil and paper. Think of the text of test cases as a tape: Each symbol either completes a production (if so, write the satisfied production down next to the subsequence), or the symbol is added to the "unsatisfied register." There is one more rule to follow with pencil and paper, however: It is better to satisfy a production with a longer subsequence than a shorter one. If a parent production consists of child productions, the children must be satisfied in the specified order (and in the quantity required). For now, assume only one character of lookahead in trying to follow this rule. For example, suppose you find the following sequence in a test case:

 "IF   cond1..."

Your steps with the pencil would be something like this:

1. Read the "I"; no production is satisfied.
2. Read the "F"; unsatisfied becomes "I"-"F". Note that "I"-"F" matches the literal term in if_expr (a literal is considered a production). Since the literal term contains no quantifiers or alternates, write down the "IF" production. Unsatisfied becomes empty.
3. Read the space; unsatisfied becomes simply a space. Space satisfies the production ws, but hold off for a character since ws contains a quantifier that allows a longer substring to satisfy it.
4. Read the second space; unsatisfied becomes space-space. Space-space satisfies the production ws. But again hold off for a character.
5. Read the third space; unsatisfied becomes space-space-space. This again satisfies the production ws. But keep holding off for the next character.
6. Read the "c"; unsatisfied becomes "space-space-space-c". This does not satisfy any production, so revert to the production in step 5. Unsatisfied becomes "c".
7. Et cetera.

If you get to the last character, and everything fits into some production, the test case is valid under the grammar. Otherwise, the test case is nongrammatical. Try a few IF/THEN/END structures that you think are and are not valid against the provided grammar.
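The pencil-and-paper procedure can be imitated in Python 3 for the leaf productions of the grammar. The sketch below is deliberately limited: it handles only literals, ws, cond, and simpleact, and it does not check how those pieces nest inside if_expr. The names `PRODUCTIONS`, `satisfied`, and `trace` are hypothetical. It peeks one character ahead rather than reading a character and then reverting, but the effect is the same preference for the longer subsequence.

```python
import re

# Leaf productions only, each paired with an equivalent regular expression.
PRODUCTIONS = [
    ('"IF"', r'IF'), ('"THEN"', r'THEN'), ('"END"', r'END'),
    ('ws', r'[ \t\n\r\f\v]+'),
    ('cond', r'cond[0-9]+'),
    ('simpleact', r'act[0-9]+'),
]

def satisfied(text):
    """Name of the production that text satisfies exactly, or None."""
    for name, pattern in PRODUCTIONS:
        if re.fullmatch(pattern, text):
            return name
    return None

def trace(text):
    """Return (production, subsequence) pairs in the order written down."""
    steps = []
    pending = ''                       # the "unsatisfied register"
    for i, ch in enumerate(text):
        pending += ch
        name = satisfied(pending)
        lookahead = text[i + 1:i + 2]  # one character, or '' at the end
        # Hold off while one more character would still satisfy something.
        if name and not (lookahead and satisfied(pending + lookahead)):
            steps.append((name, pending))
            pending = ''
    if pending:
        raise ValueError('unsatisfied at end of text: %r' % pending)
    return steps

for step in trace('IF   cond1 THEN act1 END'):
    print(step)
```

For the sample text, the three spaces after IF are written down as a single ws production, just as in the steps above.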

4.1.5 Exercise: Some variations on the language

1. Create and test an IF/THEN/END grammar that allows multiple actions to occur between the THEN and the END. For example, the following structures are valid under this variation:
```
 IF cond1 THEN act1 act2 act3 END
 -----
 IF cond2 THEN
   IF cond3 THEN act3 END
   IF cond4 THEN act4 END
 END
 -----
 IF cond5 THEN IF cond6 THEN act6 act7 END act8 END
```
2. Create and test an IF/THEN/END grammar that allows for arithmetic comparisons of numbers as conditions (as an enhancement of variation 1, if you wish). Specifically, a comparison consists of two numbers with one of "<", ">", or "=" between them. There might or might not be any whitespace between a comparison symbol and surrounding numbers. Use your judgment about what a number consists of (the Python floating point grammar might provide an example, but yours could be simpler).
3. Create and test an IF/THEN/END grammar that includes a loop expression as a valid action. A loop consists of the keyword LOOP, followed by a positive integer, followed by action(s), and terminated by the END keyword. Loops should be considered actions, and therefore ifs and loops can be contained inside one another; for example:
```
 IF cond1 THEN
   LOOP 100
     IF cond2 THEN
       act2
     END
   END
 END
```
You can make this LOOP-enhanced grammar an enhancement of whichever variant you wish.
4. Create and test an IF/THEN/END grammar that includes an optional ELSE keyword. If an ELSE occurs, it is within an IF body, but ELSE might not occur. An ELSE has its own body that can contain action(s). For example (assuming variant 1):
```
 IF cond1 THEN
   act1
   act2
 ELSE
   act3
   act4
 END
```
5. Create and test an IF/THEN/END grammar that may include zero actions inside an IF, ELSE, or LOOP body. For example, the following structures are valid under this variant:
```
 IF cond1 THEN
 ELSE act2
 END
 -*-
 IF cond1 THEN
   LOOP 100 END
 ELSE
 END
```