3.3 Designing Assemblers

	Building Parsers with Java By Steven John Metsker
	Table of Contents

	Chapter 3. Building a Parser

Content

One way to get a grip on the design of your parser is to think about how you will build the result you want from the text you recognize. So one way to get started with the design of a new parser is to start designing your assemblers.

3.3.1 The Collaboration of Parsers, Assemblers, and Assemblies

An assembly provides both a stack and a target object for the parser's assemblers to work on. The target object is like a sculpture, taking form as the parser recognizes the input text. Figure 3.1 shows the Parser , Assembler , and Assembly classes, which collaborate to sculpt text into a result.

Figure 3.1. The `Parser` , `Assembler` , and `Assembly` classes. Each parser in a parser composite tries to match against the assembly, and each may use an assembler to work on the assembly after a successful match.

graphics/03fig01.gif

3.3.2 Using an Assembly's Stack

Assembly objects contain two work areas: a stack and a target. By default, subclasses of Terminal , such as Word and Num , place on an assembly's stack the object they recognize from the assembly object's text. (You can prevent this by sending the Terminal object a discard() message.) The following code shows a repetition of a Num parser that recognizes and stacks a series of numbers .

 package sjm.examples.design;  import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show how to use an assembly's stack.  */ public class ShowStack { public static void main(String args[]) {     Parser p = new Repetition(new Num());     Assembly a = p.completeMatch(new TokenAssembly("2 4 6 8"));     System.out.println(a); } }

This code creates a TokenAssembly around the string "2 4 6 8" and passes it to the parser p . The result of sending completeMatch() to p is a new TokenAssembly . The completeMatch() method returns the result as an abstract Assembly object, which we could cast to a TokenAssembly if we needed to.

Running this class prints

 [2.0, 4.0, 6.0, 8.0]2.0/4.0/6.0/8.0^

When the assembly a prints itself, it first shows its stack, which is [2.0, 4.0, 6.0, 8.0] . This demonstrates the nature of Num , which treats every number as a double and places on the assembly's stack the tokens it finds. The output assembly also shows its tokenized input text, separating the tokens with slashes . Finally, the output assembly shows the location of its index at the end of the tokens.

3.3.3 Assemblers Plug In to Parser Composites

When a parser recognizes text, it knows nothing about the big picture in which it executes. For example, consider the Num parser in sjm.parse.tokens . A Num object might be recognizing one of a series of numbers in a string such as "1.2 2.3 3.4" , or it might be recognizing the price of a pound of coffee. The Num parser simply puts its number on the assembly's stack, leaving any further work to other assemblers. Typically, Num parsers and other terminals are not stand-alone parsers but rather are part of a composite. After a terminal places an object on an assembly's stack, another parser higher in the composite can find this object and do other work on the assembly.

3.3.4 A Language to Plug In To: Minimath

To see how assemblers plug in to a parser composite, consider a minimal arithmetic parser that recognizes only the " - " operator. That is, you want to recognize a language that contains numbers, all differences of two numbers, differences of three numbers, and so on. For example, the language contains the following:

 {"0.0",   "1.1  2.2",  "1.1  2.2  3.3",  "1.1  2.2  3.3  4.4", ...}

Let's call this language Minimath. You can describe the contents of Minimath with the following rules:

 expression = Num minusNum*;  minusNum   = '-' Num;

These rules are shorthand for describing a language. They are also shorthand for describing the composition of a parser. Section 3.4 explains the rules for this shorthand in detail. You can abbreviate these rules to

 e = Num m*;  m = '-' Num;

The first rule means that e is a number followed by zero or more occurrences of m . For example, "25 “ 16 - 9" is the number " 25 " followed by "- 16" and "- 9" . The first rule uses the capitalized word Num to mean a terminal. In fact, you will use the class Num in sjm.parse.tokens when you build the e parser. The e rule uses the noncapitalized word m to refer to another rule, and it uses an asterisk (" * ") to indicate repetition of m . The m rule indicates a pattern that has a minus sign (" - ") followed by a number.

These rules describe the patterns of strings that make up the Minimath language, and they give you a formula for composing a parser to match Minimath. For example, the parser e recognizes a string such as "25 “ 16 - 9" as an element of the Minimath language. If all you want is to recognize elements of Minimath and not compute their value, you can build the e parser as follows :

 package sjm.examples.minimath;  import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show how to build a parser to recognize elements  * of the language "Minimath".  */ public class MinimathRecognize { public static void main(String args[]) {     Sequence e = new Sequence();     e.add(new Num());     Sequence m = new Sequence();     m.add(new Symbol('-'));     m.add(new Num());     e.add(new Repetition(m));     System.out.println(e.completeMatch(new TokenAssembly("25 - 16 - 9"))); } }

This code prints the following:

 [25.0, -, 16.0, -, 9.0]25.0/-/16.0/-/9.0^

This shows that the parser e recognizes the text "25 “ 16 - 9" . Of course, the point of having a Minimath parser is not only to recognize text but also to build the value that the text represents.

3.3.5 Calculating a Minimath Result

You want the parser e to calculate a difference and leave it as a Double object on an assembly's stack. To accomplish this, you need two assemblers: one to handle numbers as the Num subparser finds them, and a second one to handle subtraction.

When a Num parser recognizes a number, it places an sjm.parse.tokens.Token object on the assembly's stack. To calculate an arithmetic sum, your Num parser needs an assembler to replace this Token object with a Double value that corresponds to the token's value. You can describe the design of the assembler you need as, "Pop the token at the top of the assembly's stack and push a corresponding number." In this example, you can reuse an assembler from sjm.examples.arithmetic . The code for NumAssembler is as follows:

 package sjm.examples.arithmetic;  import sjm.parse.*; import sjm.parse.tokens.*; public class NumAssembler extends Assembler { /**  * Replace the top token in the stack with the token's  * Double value.  */ public void workOn(Assembly a) {     Token t = (Token) a.pop();     a.push(new Double(t.nval())); } }

This method assumes that the top of the input assembly's stack is a Token object. In practice, this assumption is safe as long as the assembler plugs in to a parser that stacks a token. Assemblers are small, non-reusable classes that plug behavior in to a particular parser.

The other assembler that a Minimath parser needs is one to handle subtraction. You can design the assembler without addressing which subparser it belongs to. Let us assume that, at some point, a Minimath composite parser places two numbers on the stack of the assembly it is matching. At that point, you can describe the design of the needed assembler as, "Pop the top two numbers and push their difference." Again, you can reuse an assembler from sjm.examples.arithmetic . The code for MinusAssembler is as follows:

 package sjm.examples.arithmetic;  import sjm.parse.*; public class MinusAssembler extends Assembler { /**  * Pop two numbers from the stack and push the result of  * subtracting the top number from the one below it.  */ public void workOn(Assembly a) {     Double d1 = (Double) a.pop();     Double d2 = (Double) a.pop();     Double d3 =         new Double(d2.doubleValue() - d1.doubleValue());     a.push(d3); } }

The only remaining design question is where this assembler belongs. Note from the rules that an expression e is a number followed by one or more occurrences of m . An effective strategy is to associate a NumAssembler with each Num parser, and a MinusAssembler with the m parser. The code looks like this:

 package sjm.examples.minimath;  import sjm.parse.*; import sjm.parse.tokens.*; import sjm.examples.arithmetic.*; /**  * ...  * This class shows, in a minimal example, where assemblers  * plug in to a parser composite.  */ public class MinimathCompute { public static void main(String args[]) {     Sequence e = new Sequence();     Num n = new Num();     n.setAssembler(new NumAssembler());     e.add(n);     Sequence m = new Sequence();     m.add(new Symbol('-').discard());     m.add(n);     m.setAssembler(new MinusAssembler());     e.add(new Repetition(m));     TokenAssembly t = new TokenAssembly("25 - 16 - 9");     Assembly out = e.completeMatch(t);     System.out.println(out.pop()); } }

The main() method creates a composite parser e for Minimath and plugs in assemblers to assemble a result for an arithmetic string. The code discards the minus signs because they serve their purpose in guiding the parser and need not appear on the stack. The method asks the parser for a complete match against a tokenized string and prints the top of the stack of the resulting assembly. Running this class prints the conventional answer:

0.0

3.3.6 The Minimath Parser as an Object

Figure 3.2 shows an object diagram for the parser e . The parser e is a sequence of two subparsers: a Num and a Repetition . Each time the Num parser in e recognizes a number token, it uses a NumAssembler object to replace the number token on the stack with a corresponding Double . Two of the subparsers ” n and m ”use assemblers to work on the assembly they match against. There is only one n subparser, although this object appears twice in the diagram.

Figure 3.2. Minimath. This object diagram shows the structure of a parser composite that matches Minimath expressions such as `"3 - 2 - 1"` .

graphics/03fig02.gif

3.3.7 Building a Target

In addition to being unusually small, Minimath is unusual in that you can build a parser for it that uses only the assembly's stack while creating a useful result. Usually, you want a parser to build some kind of domain object. The Assembly class maintains a target attribute and provides methods for manipulating it, as Figure 3.1 shows. These methods let you build any kind of object from input text, with one restriction: The target of an Assembly object must be publicly cloneable (a topic addressed in the following section).

Consider designing a program that calculates the average length of words in a string. Let us take the approach of first creating a RunningAverage class that accepts word lengths and keeps track of the total number of words and their total length. With such a class available, you can create an Assembler class that works by updating a target RunningAverage object. Specifically, you can create an AverageAssembler class that pops a word from the stack of an Assembly object and updates a target RunningAverage object with the length of the popped string. Here is RunningAverage.java :

 package sjm.examples.design;  /**  * Objects of this class maintain a running average. Each  * number that is added with the <code>add</code> method  * increases the count by 1, and the total by the amount  * added.  */ public class RunningAverage     implements sjm.utensil.PubliclyCloneable {     protected double count = 0;     protected double total = 0; /**  * Add a value to the running average, increasing the count  * by 1 and the total by the given value.  */ public void add(double d) {     count++;     total += d; } /**  * Return the average so far.  */ public double average() {     return total / count; } /**  * Return a copy of this object.  */ public Object clone() {     try {         return super.clone();     } catch (CloneNotSupportedException e) {         // this shouldn't happen, since we are Cloneable         throw new InternalError();     } } }

This class makes it easy to keep a running average. In your design, you can use a RunningAverage object as the target of an assembly. You can write an AverageAssembler class that expects this target. Here is the code for AverageAssembler.java :

 package sjm.examples.design;  import sjm.parse.*; import sjm.parse.tokens.*; import sjm.engine.*; public class AverageAssembler extends Assembler { /**  * Increases a running average, by the length of the string  * on the stack.  */ public void workOn(Assembly a) {     Token t = (Token) a.pop();     String s = t.sval();     RunningAverage avg = (RunningAverage) a.getTarget();     avg.add(s.length()); } }

The AverageAssembler class updates a RunningAverage target object by the length of whatever string is on the input assembly's stack. Now you have the pieces you need to create a program that calculates the average length of words in a string. You will use a RunningAverage object as the target object of an assembly. You will use a Word object to recognize words, and you will plug an AverageAssembler object in to it. Then you can use a Repetition of the Word object to match a string of words. Figure 3.3 shows the objects you need.

Figure 3.3. Object diagram for calculating a running average. An input assembly and an output assembly both have `RunningAverage` objects as their targets. The parser p is a repetition of a Word object that uses an AverageAssembler object to update a running average.

graphics/03fig03.gif

The parser p in Figure 3.3 matches an input assembly and creates an output assembly. The output has an updated clone of the RunningAverage object that reflects the average length of words in an input string. Here is a program that shows the objects in action:

 package sjm.examples.design;  import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show how to use an assembler. The example shows how to  * calculate the average length of words in a string.  */ public class ShowAssembler { public static void main(String args[]) {     // As Polonius says, in "Hamlet"...     String quote = "Brevity is the soul of wit";     Assembly in = new TokenAssembly(quote);     in.setTarget(new RunningAverage());     Word w = new Word();     w.setAssembler(new AverageAssembler());     Parser p = new Repetition(w);     Assembly out = p.completeMatch(in);     RunningAverage avg = (RunningAverage) out.getTarget();     System.out.println("Average word length: " + avg.average()); } }

The main() method in this class constructs the objects in Figure 3.3. This method wraps an input string in a TokenAssembly and sets the assembly's target to be a RunningAverage object. The method creates a parser that is a repetition of a Word object that uses an AverageAssembler to update the running average. The method creates an output assembly by matching the parser against the input assembly. Finally, the method shows the results of the parse. Running this class prints:

 Average word length: 3.5

This example shows a typical collaboration of assemblies, assemblers, targets, and parsers:

An input target plugs in to an input assembly.
A parser establishes an assembler to work on the target. (In a composite parser, each subparser can have its own assembler.)
The parser matches the input assembly, creating an output assembly.
The output assembly contains an updated target.

This approach relies on the ability to clone an assembly as the parser pursues a match. Because the target is a part of the assembly, targets must also be cloneable.

3.3.8 Making a Target Cloneable

Assembly objects know almost nothing about the targets that they hold, but one message that an assembly sends to its target is clone() . This springs from the way that the parsers in sjm.parse model the nondeterminism inherent in recognizing most languages. These parsers use a backtracking mechanism, and this means that they must clone an assembly and its target each time they consume a token.

When you design your parser to create a target object from input text, your target class must have a public clone() method. To enforce this, the package sjm.utensil includes the interface PubliclyCloneable , whose code is as follows:

 package sjm.utensil;  /**  * Defines a type of object that anybody can clone.  */ public interface PubliclyCloneable extends Cloneable { public Object clone(); }

Providing a public clone method without implementing PubliclyCloneable is insufficient. The Assembly.setTarget() method must know that the object it receives is an instance of a class that implements a public clone() . It insists on this by receiving its input as datatype PubliclyCloneable .

To clone an object means to make a copy of the object. To make a class cloneable by any object, write a clone() method and declare that the class implements PubliclyCloneable . A typical cloneable class method looks like this:

 package sjm.examples.cloning;  import sjm.utensil.*; /**  * This class shows a typical clone() method.  */ public class Course implements PubliclyCloneable {     protected Professor professor;     protected Textbook textbook; // gets and sets... /**  * Return a copy of this object.  */ public Object clone() {     try {         Course copy = (Course) super.clone();         copy.setProfessor((Professor) professor.clone());         copy.setTextbook((Textbook) textbook.clone());         return copy;     } catch (CloneNotSupportedException e) {         // this shouldn't happen, since we are Cloneable         throw new InternalError();     } } }

The sample method calls super.clone() , referring to the method clone() of class Object . This method creates a new object of the same class as the object that receives the clone() message; then it initializes each of the new object's fields by assigning them the same values as the corresponding fields in the copied object. This is a shallow copy as opposed to a deep copy, which would also make copies of each of an object's fields. You might think of Object.clone() as Object.newObjectSameFields() . In your clone() method, you must create a clone of each attribute in your class that is not a primitive type or a string. (Strings are immutable, so there is no need to clone them.)

The Object.clone() method throws CloneNotSupportedException , which you must handle. Surround your call to super.clone() in a try/catch block that throws InternalError . The InternalError exception is arguably the wrong exception to throw, because this error indicates a problem in the virtual machine. However, Vector and other important classes in Java throw an InternalError in this situation, and this book follows that precedent.

With an understanding of assemblers, assemblies, and parsers, you can begin to create meaningful new languages. Before you begin to code, however, it will prove helpful to have a way to work with parsers at a design level.

Top