2.4 The Parser Hierarchy

	Building Parsers with Java By Steven John Metsker
	Table of Contents

	Chapter 2. The Elements of a Parser

Content

A parser is an object that recognizes strings. The examples in this book always wrap a string to be recognized in an Assembly object, as an assembly of either characters or tokens. Figure 2.6 shows the public interface of Parser objects that the examples use.

Figure 2.6. `Parser` class. A parser is an object that can recognize a string (wrapped in an `Assembly` object) and assemble a result.

graphics/02fig06.gif

The Parser class in sjm.parse recognizes assemblies and returns assemblies. Often, you want to see whether a parser can recognize an entire string. If the parser can recognize the assembly you provide, its completeMatch() method returns a new assembly with its index at the end. If the parser does not recognize an assembly, it returns null . If you want a parser only to match as much of a string as it can, bestMatch() returns an assembly with its index moved as far forward as possible through the supplied assembly. The Parser class also has methods for matching collections ( Vectors ) of assemblies and for associating assemblers with the parser.

2.4.1 The Composition of a Parser

The parsers in this book follow the composite pattern [Gamma et al.]. This means that some classes in the Parser hierarchy define aggregations of other parsers. Other classes in the hierarchy are terminals, which are parsers that can match an assembly without the aid of other parsers. They get their name from the fact that they terminate the recursion inherent in the idea that parsers are compositions of parsers. Because there are many types of terminals, there are many ways a parser can look for characters or tokens to appear in an assembly. This also means that there are many terminal classes.

Three types of nonterminals ”parsers that are compositions of other parsers ”are sufficient to build an infinite variety of parsers. Figure 2.7 shows the top of the hierarchy of parser classes. The three types of composition are repetition, sequence, and alternation . Many of the examples that lie ahead show how to use these compositions. In brief,

Figure 2.7. Parser hierarchy. The hierarchy of parsers contains three concrete classes that contain other parsers ” `Repetition` , `Alternation` , and `Sequence` ”and two that do not: `Terminal` and `Empty` .

graphics/02fig07.gif

A repetition matches its underlying parser repeatedly against an assembly.
A sequence is a collection of parsers, all of which must in turn match against an assembly for the sequence parser to successfully match.
An alternation is a collection of parsers, any one of which can successfully match against an assembly.

It is also helpful, if not strictly necessary, to have an empty parser:

An empty parser reports a successful match without consuming any elements from the assembly.

As Figure 2.7 shows, repetitions, alternations , and sequences are compositions of other parsers. The overall composition of a parser can be arbitrarily deep. For example, a parser for a programming language might itself be a sequence of declaration statements followed by executable statements. The parser for executable statements is typically an alternation of different types of statements. Each of these alternatives might be a sequence, and so on.

The composite nature of parsers implies that you must create simple parsers before you make complex ones. The simplest parsers are terminals, which match string assemblies without using other parsers.

Top

Figure 2.6. Parser class. A parser is an object that can recognize a string (wrapped in an Assembly object) and assemble a result.

2.4.1 The Composition of a Parser

Figure 2.7. Parser hierarchy. The hierarchy of parsers contains three concrete classes that contain other parsers ” Repetition , Alternation , and Sequence ”and two that do not: Terminal and Empty .

Figure 2.6. `Parser` class. A parser is an object that can recognize a string (wrapped in an `Assembly` object) and assemble a result.

Figure 2.7. Parser hierarchy. The hierarchy of parsers contains three concrete classes that contain other parsers ” `Repetition` , `Alternation` , and `Sequence` ”and two that do not: `Terminal` and `Empty` .