2.3 Assemblies

	Building Parsers with Java By Steven John Metsker

	Chapter 2. The Elements of a Parser


In practice, we demand more of parsers than simply saying whether a string is a valid member of a language. As a parser recognizes a string, it is useful for the parser to react to the contents of the string and build something. The parser also needs to keep an index of how much of the string it has recognized. An Assembly object wraps a string with an index and with work areas for a parser, providing a stack and a target object to work on. For example, a parser that recognizes a data language for coffee might set the target to be a basic Coffee object. This parser could work on the target as it recognizes an input string, informing the Coffee object of the coffee attributes the string indicates.

If a parser needs to build something based on the content of the string it is recognizing, you can think of the string as a set of instructions. As the parser makes progress recognizing the string, the assembly's index moves forward and the assembly's target begins to take shape. Figure 2.2 shows a partially recognized assembly.

Figure 2.2. An assembly example. An assembly wraps a string and provides a work area for a parser and its assemblers.

graphics/02fig02.gif

The figure indicates that a parser is working to recognize this string:

 "place carrier_18 on die_bonder_3"

An internal index records the fact that the parser has already recognized two words. On seeing the word "place", the parser set the assembly's target to be a PlaceCommand object. Now the parser has seen "carrier_18" and presumably is about to inform the PlaceCommand object which carrier to place.
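The state the figure describes can be sketched in a few lines of plain Java. This is a minimal illustration and not the book's Assembly class: SketchAssembly, SketchPlaceCommand, and every method name below are assumptions made for the example, and the input is simply split on blanks.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for an assembly: input, index, stack, and target. */
class SketchAssembly {
    private final String[] words;                            // wrapped input, split on blanks
    private int index = 0;                                   // number of words recognized so far
    private final Deque<Object> stack = new ArrayDeque<>();  // work area for the parser
    private Object target;                                   // object being built

    SketchAssembly(String input) { words = input.trim().split("\\s+"); }

    boolean hasMoreElements() { return index < words.length; }
    String nextElement()      { return words[index++]; }
    int elementsConsumed()    { return index; }
    void push(Object o)       { stack.push(o); }
    Object pop()              { return stack.pop(); }
    Object getTarget()        { return target; }
    void setTarget(Object t)  { target = t; }
}

/** Hypothetical target; its single field is invented for this sketch. */
class SketchPlaceCommand {
    String carrier;
    @Override public String toString() { return "place " + carrier; }
}

class ShowSketchAssembly {
    /** Recognize the first two words, reaching the state the figure describes. */
    static SketchAssembly partiallyRecognize() {
        SketchAssembly a = new SketchAssembly("place carrier_18 on die_bonder_3");
        if (a.nextElement().equals("place")) {
            a.setTarget(new SketchPlaceCommand());  // "place" decides the target
        }
        a.push(a.nextElement());                    // "carrier_18" waits on the stack
        return a;
    }

    public static void main(String[] args) {
        SketchAssembly a = partiallyRecognize();
        System.out.println(a.elementsConsumed() + " words recognized");
        // The presumed next step: pop the carrier name and hand it to the target.
        SketchPlaceCommand cmd = (SketchPlaceCommand) a.getTarget();
        cmd.carrier = (String) a.pop();
        System.out.println(cmd);
    }
}
```

The sketch keeps the three ideas separate: the index tracks progress, the stack holds intermediate results, and the target is the object that finally takes shape.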

2.3.1 The Assembly Class Interfaces

An assembly records the progress of a parser's recognition of an input string. Because the recognition may proceed along different paths, a parser may create multiple copies of an assembly as it tries to determine which is the right path. There are other approaches to modeling the nondeterminism inherent in parsing text, but the parsers in this book consistently use copying. To support this copying and to provide an enumeration interface, the Assembly class in sjm.parse implements the interfaces PubliclyCloneable and Enumeration. Figure 2.3 shows a partial class diagram for Assembly.

Figure 2.3. The `Assembly` class. This class implements interfaces that declare that assemblies are cloneable and offer the `Enumeration` methods `hasMoreElements()` and `nextElement()`.

graphics/02fig03.gif

PubliclyCloneable is an interface in sjm.utensil. This interface declares that its implementers must implement a public version of the clone() method. The clone() method on java.lang.Object is protected, meaning that unrelated objects cannot use the clone() method. Because Assembly implements PubliclyCloneable, any object can request a clone of an assembly object.

For an assembly to make a clone of itself, it must in turn clone its target, so a target object must itself implement the PubliclyCloneable interface. In fact, the only requirement for a target object is that it implement this interface, and so Assembly declares its target to be of this type. Chapter 3, "Building a Parser," describes how to make a target cloneable (see Section 3.3.8).
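A minimal sketch of this cloning arrangement follows. The interface shape is an assumption based on the description above, and SketchPubliclyCloneable, SketchCoffee, and SketchCloneableAssembly are hypothetical names, not classes from sjm.utensil or sjm.parse.

```java
/** Assumed shape of the interface: it makes clone() public. */
interface SketchPubliclyCloneable extends Cloneable {
    Object clone();
}

/** Hypothetical target with two invented attributes. */
class SketchCoffee implements SketchPubliclyCloneable {
    String name;
    String roast;

    @Override public SketchCoffee clone() {
        try {
            return (SketchCoffee) super.clone();   // field-by-field copy is enough here
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);           // unreachable: the class is Cloneable
        }
    }
}

/** Hypothetical assembly reduced to an index and a target. */
class SketchCloneableAssembly implements SketchPubliclyCloneable {
    int index;
    SketchPubliclyCloneable target;

    @Override public SketchCloneableAssembly clone() {
        try {
            SketchCloneableAssembly copy = (SketchCloneableAssembly) super.clone();
            if (target != null) {
                // Clone the target too, so the copy can be changed independently.
                copy.target = (SketchPubliclyCloneable) target.clone();
            }
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

class ShowSketchClone {
    public static void main(String[] args) {
        SketchCoffee coffee = new SketchCoffee();
        coffee.name = "Brimful";
        SketchCloneableAssembly original = new SketchCloneableAssembly();
        original.target = coffee;

        // Try one path on a copy; the original stays available for another path.
        SketchCloneableAssembly attempt = original.clone();
        ((SketchCoffee) attempt.target).roast = "French";

        System.out.println(((SketchCoffee) original.target).roast);  // still unset
        System.out.println(((SketchCoffee) attempt.target).roast);
    }
}
```

Cloning the target along with the assembly is what keeps the paths independent: a change made while exploring one path never leaks into the copy used for another.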

Implementing Enumeration commits Assembly to providing the two methods hasMoreElements() and nextElement(). The Assembly class itself does not implement these methods, leaving their implementation to its subclasses. There are two types of assemblies: assemblies of tokens and assemblies of characters.

2.3.2 Token and Character Assemblies

We have said that an assembly is a wrapper around a string. In practice, there are two choices for how to regard the composition of a string. A normal Java String object is a string of characters. For example,

 "hello, world"

contains 12 characters, including the blank and the comma. For many parsers, it is far more convenient to treat such a string as a string of tokens, where a token can be a word, a number, or a punctuation mark. For example, the string "hello, world" can be seen as a string of three tokens:

 "hello"  ',' "world"

To allow parsing text as strings of tokens, Assembly has two subclasses: CharacterAssembly and TokenAssembly. Figure 2.4 shows these subclasses and the packages they lie in.

Figure 2.4. The `Assembly` hierarchy. Token and character assemblies implement the `Assembly` methods related to progress in recognizing input.

graphics/02fig04.gif

Package sjm.parse.chars contains classes that support character-based recognition. Package sjm.parse.tokens contains classes that support token-based recognition. Essentially, a CharacterAssembly object manipulates an array of characters, and a TokenAssembly object manipulates an array of tokens.

The methods consumed() and remainder() show the amount of input consumed and the amount that remains. The defaultDelimiter() method allows the Assembly subclasses to decide how to separate their elements. The TokenAssembly class places a slash between each token, whereas the CharacterAssembly places nothing (an empty string) between characters when showing elements consumed or remaining. The remaining methods of Assembly let a calling class request the next element (a character or token) or peek at the next element without removing it.
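The division of labor described above can be sketched as an abstract class with two subclasses. This is an illustration under stated assumptions, not the sjm.parse source: the class names are hypothetical, consumed() and remainder() take no arguments here, and the token subclass receives tokens that are already split.

```java
import java.util.Enumeration;
import java.util.NoSuchElementException;

/** Hypothetical base class: shared progress logic, element access left to subclasses. */
abstract class SketchElements implements Enumeration<Object> {
    protected int index = 0;

    abstract int length();
    abstract Object elementAt(int i);
    abstract String defaultDelimiter();

    public boolean hasMoreElements() { return index < length(); }

    public Object nextElement() {
        if (!hasMoreElements()) throw new NoSuchElementException();
        return elementAt(index++);
    }

    /** Look at the next element without consuming it; null at the end. */
    Object peek() { return hasMoreElements() ? elementAt(index) : null; }

    String consumed()  { return join(0, index); }
    String remainder() { return join(index, length()); }

    private String join(int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            if (i > from) sb.append(defaultDelimiter());
            sb.append(elementAt(i));
        }
        return sb.toString();
    }
}

/** Elements are characters; nothing is placed between them. */
class SketchChars extends SketchElements {
    private final String s;
    SketchChars(String s) { this.s = s; }
    int length() { return s.length(); }
    Object elementAt(int i) { return s.charAt(i); }
    String defaultDelimiter() { return ""; }
}

/** Elements are tokens (already split, for this sketch); a slash separates them. */
class SketchTokens extends SketchElements {
    private final String[] tokens;
    SketchTokens(String... tokens) { this.tokens = tokens.clone(); }
    int length() { return tokens.length; }
    Object elementAt(int i) { return tokens[i]; }
    String defaultDelimiter() { return "/"; }
}

class ShowSketchElements {
    public static void main(String[] args) {
        SketchTokens t = new SketchTokens("hello", ",", "world");
        t.nextElement();
        System.out.println(t.consumed() + " | " + t.remainder());

        SketchChars c = new SketchChars("hello");
        c.nextElement();
        c.nextElement();
        System.out.println(c.consumed() + " | " + c.remainder());
    }
}
```

Only the delimiter and the element access differ between the two subclasses; the bookkeeping of what has been consumed and what remains lives in one place.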

2.3.3 Tokenizing

Tokenizing a string means breaking the string into logical chunks, primarily words, numbers, and punctuation. This dissection of text is sometimes called lexical analysis. Chapter 9, "Advanced Tokenizing," covers tokenization in depth. The point of tokenization is that it can make the task of recognition much easier. It can be much simpler and much more appropriate to describe text as a series of tokens than as a series of characters. For example, consider the string

 int i = 3;

A human reader, especially a Java programmer, will read this as a line of Java. The string contains a data type, a variable name, an equal sign, a number, and a semicolon. It would be accurate, but strange, to describe this string as an "i" followed by an "n" followed by a "t" followed by a blank, another "i", and so on. This string is a pattern of tokens and not a pattern of characters. On the other hand, consider the string

 "Ja.*"

In the right context, this string might describe all words beginning with the letters "J" and "a". Here, the string is best understood and most easily recognized as a "J" followed by an "a" followed by a dot and an asterisk.
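The two views of a string can be tried out with the standard library alone. The sketch below uses java.io.StreamTokenizer, which is a JDK class and not the tokenizer used in this book, so its rules differ in detail; the class name ShowTwoViews and the "word:", "number:", and "symbol:" labels are inventions of this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ShowTwoViews {
    /** View a string as tokens, using the JDK's default StreamTokenizer rules. */
    static List<String> tokens(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word:" + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number:" + st.nval);
                        break;
                    default:
                        out.add("symbol:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // not expected when reading from a String
        }
        return out;
    }

    /** View a string as a plain sequence of characters. */
    static List<Character> characters(String s) {
        List<Character> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("int i = 3;"));
        System.out.println(characters("Ja.*"));
    }
}
```

Running this prints [word:int, word:i, symbol:=, number:3.0, symbol:;] for the line of Java and [J, a, ., *] for the pattern, which matches the two readings described above.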

The tokenizer used in this book stores tokens in a Token class, which is in sjm.parse.tokens. Figure 2.5 shows the Token class.

Figure 2.5. The `Token` class. Typically, a `Token` is a receptacle for the results of reading a small amount of text, such as a word or a number.

graphics/02fig05.gif

You will most likely encounter tokens in practice when you need to retrieve a token that a terminal has stacked. If a token contains a string, you can retrieve its string value using the sval() method. If a token contains a number, you can retrieve the number using the nval() method. You can also construct tokens from a string or number, or from a single character that the Token constructor converts into a string.
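As a rough illustration of such a receptacle, here is a small value class with the three kinds of constructor just described. Only the method names sval() and nval() come from the text; the class name SketchToken, the isNumber() helper, and the internal representation are assumptions of this sketch and may differ from the real Token class.

```java
/** Hypothetical token: holds either a string value or a numeric value. */
final class SketchToken {
    private final String sval;     // null when the token is a number
    private final double nval;     // 0 when the token is a string
    private final boolean number;

    SketchToken(String s) {
        this.sval = s;
        this.nval = 0;
        this.number = false;
    }

    SketchToken(double n) {
        this.sval = null;
        this.nval = n;
        this.number = true;
    }

    /** A single character is converted into a one-character string. */
    SketchToken(char c) {
        this(String.valueOf(c));
    }

    boolean isNumber() { return number; }
    String sval()      { return sval; }
    double nval()      { return nval; }

    @Override public String toString() {
        return number ? String.valueOf(nval) : sval;
    }
}

class ShowSketchToken {
    public static void main(String[] args) {
        SketchToken word = new SketchToken("Colorado");
        SketchToken year = new SketchToken(1876);
        SketchToken comma = new SketchToken(',');
        System.out.println(word.sval() + " " + year.nval() + " " + comma.sval());
    }
}
```

Because the numeric value is stored as a double, a number such as 1876 prints as 1876.0 in this sketch.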

2.3.4 Default and Custom Tokenization

It is imprecise to say that tokenizing breaks a string into "words, numbers, and punctuation." We can find examples that challenge exactly what is and is not a separate token. For example, is an underscore part of a word, or is it a punctuation mark? Does the string ">=" contain one token or two?

To begin writing parsers, you need to know what to expect from the class TokenAssembly. When you construct a TokenAssembly from a string, TokenAssembly breaks the string into tokens, relying primarily on the services of another class, Tokenizer.

Class Tokenizer provides a good set of default rules for how to divide text into tokens. For example, a default Tokenizer object properly tokenizes this string:

 "Let's 'rock and roll'!"

The Tokenizer object treats the apostrophe in "Let's" as part of the word, but it treats the single quotation marks around "rock and roll" as the delimiters of a quoted string, so the whole phrase becomes one token. Here is code that shows this:

 package sjm.examples.introduction;

 import sjm.parse.tokens.*;
 import sjm.utensil.*;

 /**
  * Show that apostrophes can be parts of words and can
  * contain quoted strings.
  */
 public class ShowApostrophe {

     public static void main(String[] args) {
         String s = "Let's 'rock and roll'!";
         TokenAssembly a = new TokenAssembly(s);
         while (a.hasMoreElements()) {
             System.out.println(a.nextElement());
         }
     }
 }

Running this class prints the following:

 Let's
 'rock and roll'
 !

You may find that the default tokenization does not fit the purposes of your language. For example, you may need to allow blanks to appear inside words. For this and other types of customization, consult Chapter 9, "Advanced Tokenizing."

2.3.5 Assembly Appearance

The preceding example shows the effect of printing one element at a time from an assembly. If you print an entire assembly, it shows its stack, all its elements, and the position of its index. For example:

 package sjm.examples.introduction;

 import sjm.parse.tokens.*;

 /**
  * Show how an assembly prints itself.
  */
 public class ShowAssemblyAppearance {

     public static void main(String[] args) {
         String s1 = "Congress admitted Colorado in 1876.";
         System.out.println(new TokenAssembly(s1));

         String s2 = "admitted(colorado, 1876)";
         System.out.println(new TokenAssembly(s2));
     }
 }

Running this class prints the two TokenAssembly objects:

 []^Congress/admitted/Colorado/in/1876.0
 []^admitted/(/colorado/,/1876.0/)

Both assemblies print their stacks, which are empty and appear as a pair of brackets. These stacks can gain contents only when a parser parses the assembly. Both assemblies show all their elements, separated by slashes. The caret symbolizes the amount of progress a parser has made in recognizing the assembly. Because this example has no parser, both indexes are at the beginning. Note that assemblies include no description of their target when they print. When you want a target to print, you retrieve the target from the assembly and print the target.
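The printed form can be imitated with a short sketch: the stack, then the consumed elements, then a caret, then the remaining elements. SketchPrintableAssembly is a hypothetical class, its elements are supplied already tokenized, and the exact text it prints once elements have been consumed is this sketch's own rendering; only the starting state is checked against the output shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Stack;

/** Hypothetical assembly that prints as stack + consumed + "^" + remainder. */
class SketchPrintableAssembly {
    private final List<String> elements;
    private final Stack<Object> stack = new Stack<>();
    private int index = 0;

    SketchPrintableAssembly(String... elements) {
        this.elements = Arrays.asList(elements.clone());
    }

    String nextElement() { return elements.get(index++); }
    void push(Object o)  { stack.push(o); }

    @Override public String toString() {
        String consumed = String.join("/", elements.subList(0, index));
        String remainder = String.join("/", elements.subList(index, elements.size()));
        // An empty java.util.Stack prints as "[]", like the empty stacks above.
        return stack + consumed + "^" + remainder;
    }
}

class ShowSketchAppearance {
    public static void main(String[] args) {
        SketchPrintableAssembly a =
            new SketchPrintableAssembly("admitted", "(", "colorado", ",", "1876.0", ")");
        System.out.println(a);      // nothing consumed yet: caret at the start

        a.push(a.nextElement());    // consume one element and stack it
        System.out.println(a);      // caret has moved past the first element
    }
}
```

The target is deliberately absent from toString() in this sketch, mirroring the point that an assembly does not describe its target when it prints.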

2.3.6 Assembly Summary

The Assembly classes wrap a string and provide a work area for a parser to record progress in recognizing the string and building a corresponding object. The assembly may tokenize the string, and that simplifies the parser's job of recognition.
