2.6 Composite Parsers


 
Building Parsers with Java
By Steven  John  Metsker

Table of Contents
Chapter  2.   The Elements of a Parser

    Content

Every parser is either a terminal or a composite. A terminal is a parser that recognizes all or part of a string without using other parsers. For example, a terminal might check that the next word in a string is "hello" . A composite parser is a parser that accomplishes its task by using other parsers. The three fundamental ways to compose parsers are repetition, sequence, and alternation .

2.6.1 Repetition

Like a tree that is composed of other trees and leaves , a composite typically contains several subcomponents. A repetition contains only one component: another parser that the repetition applies repeatedly against an assembly to recognize. Figure 2.9 shows this structure.

Figure 2.9. The Repetition class. Composites usually contain one or more subparsers, but a Repetition object contains only one.

graphics/02fig09.gif

The repetition can match its underlying parser zero times, one time, two times, or n times, where n can be arbitrarily large. When a repetition matches an assembly, many versions of the assembly are possible, with the resulting assemblies' indexes at different positions .

For example, consider the phrase "steaming hot coffee" . If you match words from the beginning of this phrase, you can match zero, one, two, or three words. You could mark these relative states of progress as follows :

 "^steaming hot coffee"  "steaming^ hot coffee" "steaming hot^ coffee" "steaming hot coffee^" 

The power of the idea of repetition is that it allows you to describe a variable number of language elements. The challenge this creates for parsers is that a repetition does not know the right number of times to match. It may appear that the right answer is for a repetition to match as many times as possible. However, it turns out that this approach is not always sufficient. For example, you could define a language as a repetition of words, followed by the specific word "coffee" . In this case, the "right" number of words for a repetition object to match is two, stopping before the word "coffee" .

To avoid the problem of deciding how many times to match, a simple approach is to return a set of all possible matches. This is the approach taken by class Repetition in package sjm.parse . In fact, each of the composite parsers in the code for this book takes this approach. The subclasses of Parser implement match() , which accepts a Vector of Assembly objects and returns a Vector of Assembly objects. To comply with Java 1.1.6 code, the sample code uses Vector instead of Set .

To see match() and Repetition in action, consider the following program:

 package sjm.examples.introduction;  import java.util.*; import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show that a <code>Repetition</code> object creates  * multiple interpretations.  */ public class ShowRepetition { public static void main(String[] args) {     String s = "steaming hot coffee";     Assembly a = new TokenAssembly(s);     Parser p = new Repetition(new Word());     Vector v = new Vector();     v.addElement(a);     System.out.println(p.match(v)); } } 

This prints the following (with some added whitespace):

 [   []^steaming/hot/coffee,  [steaming]steaming^hot/coffee,  [steaming, hot]steaming/hot^coffee,  [steaming, hot, coffee]steaming/hot/coffee^ ] 

Note that this sample program uses match() instead of bestMatch() . The match() method accepts a Vector of assemblies and returns a Vector of assemblies. This reflects the underlying mechanics of the parsers in sjm.parse . Chapter 10, "Matching Mechanics," describes in detail how matching works.

To show the results of a Repetition object, the example prints the results of the underlying match() method. When the code constructs a new Repetition object, it passes in the parser to repeat. Because a Repetition always repeats a single parser, the only constructor for this class requires its caller to supply the parser to repeat. The parser p creates a set (or Vector ) of possible matches against the input. The output shows the outermost pair of brackets that the Vector object prints. Inside the vector are four assemblies, with four states of possible matching, including zero, one, two, or three word matches.

2.6.2 Alternation and Sequence

An alternation is a composite of other parsers, any of which may match a provided assembly. A sequence is a composite of other parsers, each of which must match in turn . Figure 2.10 shows the structure of these classes.

Figure 2.10. The Alternation and Sequence classes. Alternation objects and Sequence objects are compositions of other parsers.

graphics/02fig10.gif

2.6.3 Composing a Parser

Creating a new parser is a matter of composing it from an existing selection of terminal and composite parsers. A small example that uses each of the composite parsers is as follows. Consider this string:

 "hot hot steaming hot coffee" 

Let us say that this string is a member of a language that, in general, contains strings with any combination of "hot" and "steaming" adjectives, followed by "coffee" . In other words, this language contains all sentences that are

  • A sequence of adjectives followed by the literal "coffee"

where adjectives are

  • A repetition of an adjective

and an adjective is:

  • An alternation, a choice between "hot" and "steaming"

Having described this language fairly precisely, we can create a parser object directly from the language description, as the following program shows.

 package sjm.examples.introduction;  import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show how to create a composite parser.  */ public class ShowComposite { public static void main(String[] args) {     Alternation adjective = new Alternation();     adjective.add(new Literal("steaming"));     adjective.add(new Literal("hot"));     Sequence good = new Sequence();     good.add(new Repetition(adjective));     good.add(new Literal("coffee"));     String s = "hot hot steaming hot coffee";     Assembly a = new TokenAssembly(s);     System.out.println(good.bestMatch(a)); } } 

Running this class prints the following:

 [hot, hot, steaming, hot, coffee]  hot/hot/steaming/hot/coffee^ 

The result of sending bestMatch() to good is an assembly with all five words placed on its stack and with its index at the end of its tokens.

Figure 2.11 shows an object diagram of good .

Figure 2.11. Good coffee. This object diagram shows the composition of a good parser, which recognizes a description of a good cup of coffee.

graphics/02fig11.gif

In Figure 2.11, the box that contains good:Sequence means that good is an object of type Sequence . The underline means that good is an object, and the colon separates the object from its type. Some of the objects, such as :Repetition , are nameless in the diagram, as they are in the sample program. The literals in the diagram indicate their literal values, although this does not imply that the variable literal is publicly accessible.

The leaf nodes of this object diagram are all literals, which are terminals. A parser's composition always terminates with terminals, hence the name "terminal."

2.6.4 The Empty Parser

Sometimes it is convenient to have a parser that returns successfully, having matched nothing at all. For example, you might want to match a list of equipment, recognizing strings such as

 "[die_bonder_2, oven_7, wire_bonder_3, mold_1]"  "[]" "[mold_1]" 

The contents of the list may be empty, or it may contain a single name or a series of names separated by commas. To match such strings, you could say that the content of a list is either empty or an actual list. An actual list is a name followed by a repetition of (comma, word) sequences. Here is an example that uses Empty to help match these lists:

 package sjm.examples.introduction;  import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show how to put the <code>Empty</code> class to good use.  */ public class ShowEmpty { public static void main(String args[]) {     Parser empty, commaTerm, actualList, contents, list;     empty = new Empty();     commaTerm = new Sequence()         .add(new Symbol(',').discard())         .add(new Word());     actualList = new Sequence()         .add(new Word())         .add(new Repetition(commaTerm));     contents = new Alternation()         .add(empty)         .add(actualList);     list = new Sequence()         .add(new Symbol('[').discard())         .add(contents)         .add(new Symbol(']').discard());     String test[] = new String[]{         "[die_bonder_2, oven_7, wire_bonder_3, mold_1]",         "[]",         "[mold_1]"};     for (int i = 0; i < test.length; i++) {         TokenAssembly a = new TokenAssembly(test[i]);         System.out.println(             list.completeMatch(a).getStack());     } } } 

This prints the following:

 [die_bonder_2, oven_7, wire_bonder_3, mold_1]  [] [mold_1] 

The brackets in the output are part of the way a Stack object represents itself; they are not the brackets from the input. The output shows that the list parser recognizes lists with many elements, zero elements, or one element.

2.6.5 Parser Summary

In general, a parser is an object that recognizes a string. The code in this book wraps the string to be recognized in either a CharacterAssembly object or a TokenAssembly object. This approach simplifies the parser's job by giving the parser a place to work as it begins to recognize the contents of the string. In the next section we cover assemblers, which are objects that work on an assembly as the parser recognizes it.

You can create simple parsers by creating instances of subclasses of Terminal . You can create more-complex parsers by composing new parsers as sequences, alternations , or repetitions of other parsers. This method scales up, so you can create advanced parsers, building them from other terminal and composite parsers.


   
Top


Building Parsers with Java
Building Parsers With Javaв„ў
ISBN: 0201719622
EAN: 2147483647
Year: 2000
Pages: 169

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net