10.8 Terminal Matching

Building Parsers with Java
By Steven  John  Metsker

Table of Contents
Chapter  10.   Matching Mechanics


The parsers discussed to this point have been compositions of other parsers. Each of them matches a collection of assemblies by delegating the matching job to its subparsers. This structural composition and runtime delegation ends with Terminal and its subclasses. Terminal objects are the leaves of a parser composition and thus "terminate" the composition. They also terminate the matching process by deciding themselves whether or not they match a given assembly. Figure 10.6 shows the Terminal class.

Figure 10.6. The Terminal class. A Terminal object is a parser that is not a composition of other parsers.


Any Terminal object can decide whether it matches any assembly, so the job of matching a collection of assemblies reduces to the job of matching each assembly individually. Subclasses of Terminal typically override matchOneAssembly() or qualifies() , but they need not override match() . Here is the code for Terminal.match() :

 public Vector match(Vector in) {      Vector out = new Vector();     Enumeration e = in.elements();     while (e.hasMoreElements()) {         Assembly a = (Assembly) e.nextElement();         Assembly b = matchOneAssembly(a);         if (b != null) {             out.addElement(b);         }     }     return out; } 

This code gathers into an output collection the results of matching the Terminal object against each assembly in the input collection. Here is the code for matchOneAssembly() :

 protected Assembly matchOneAssembly(Assembly in) {      if (!in.hasMoreElements()) {         return null;     }     if (qualifies(in.peek())) {         Assembly out = (Assembly) in.clone();         Object o = out.nextElement();         if (!discard) {             out.push(o);         }         return out;     }     return null; } 

This code peeks at the next element of the input assembly to see whether it "qualifies." The Terminal class has a default implementation of qualifies() , but subclasses typically override it. The qualifies() method determines whether the terminal matches the next element.

Terminal objects by default push the element they match from an assembly onto the assembly's stack. You can prevent this by sending the Terminal object a discard() message. In practice, you usually want to prevent the pushing of symbols, literals, and any other terminals when you know the exact value of the match in advance.

10.8.1 Token Terminals

One result of tokenization is that when a Terminal parser must determine whether it matches the next chunk of text, a tokenizer has already made decisions about how to compose the next token. For example, a tokenizer decides whether scientific notation is allowable in numbers .

A Terminal parser can make further decisions about what it matches, but as a practical matter it cannot undo decisions the tokenizer makes. A good strategy is to keep tokenization simple, adding specialization in Terminal subclasses.

For example, this book has a single token type for word tokens and has three subclasses of Terminal that match against word tokens. Classes Word , Literal , and CaselessLiteral are all Terminal subclasses that match against a token of type Token.TT_WORD . Objects of class Word successfully match any word token. Objects of class Literal recognize only a specific word; a Literal object fails to match any token that does not contain the same string the object receives in its constructor. The CaselessLiteral subclass of Literal relaxes this selectivity, matching a specific string but ignoring case. Figure 10.7 shows the subclasses of Terminal that match word tokens.

Figure 10.7. Token terminals. Typical token terminals simply say whether a given object qualifies as the type of terminal sought.


These classes must say whether they match a given object. In other words, they must say whether a given object qualifies as the specific type of terminal. For example, here is Literal.qualifies() :

 /**   * Returns true if the literal this object represents equals  * an assembly's next element.  */ protected boolean qualifies (Object o) {     return literal.equals((Token) o); } 

The Literal class accepts a string in its constructor but builds a Token from this string and saves it in its literal instance variable. The Literal.qualifies() method determines that a token qualifies as a match if the token contains the same string.

CaselessLiteral is a subclass of Literal that allows the desired string to match regardless of case, as its qualifies() method shows:

 /**   * Returns true if the literal this object represents equals  * an assembly's next element, disregarding case.  */ protected boolean qualifies(Object o) {     return literal.equalsIgnoreCase((Token) o); } 

The qualifies() methods for Num , QuotedString , and Word use the is methods of Token . For example, here is Word.qualifies() :

 /**   * Returns true if an assembly's next element is a word.  */ protected boolean qualifies(Object o) {     Token t = (Token) o;     return t.isWord(); } 

The Symbol class typically recognizes a single character, such as " < ". However, class Tokenizer in sjm.parse.tokens also provides for multicharacter symbols, such as " <= " to appear as single tokens. Here is Symbol.qualifies() :

 /**   * Returns true if the symbol this object represents equals  * an assembly's next element.  */ protected boolean qualifies (Object o) {     return symbol.equals((Token) o); } 

This code is nearly identical to the qualifies() method for Literal , but it matches symbol tokens rather than word tokens. The difference between literals and symbols lies mainly in the tokenizer. For example, a default Tokenizer object tokenizes the string: "<<<" as three tokens of type Token.TT_SYMBOL . No Literal object could match this string, but the following object would:

 new Repetition(new Symbol('<')) 

10.8.2 Character Terminals

Classes for the character-based terminals in sjm.parse.chars are as simple as those of the token terminals. Any terminal has the primary task of saying whether it matches the next element of an assembly. For example, here is Digit.qualifies() :

 /**   * Returns true if an assembly's next element is a digit.  */ public boolean qualifies(Object o) {     Character c = (Character) o;     return Character.isDigit(c.charValue()); } 

10.8.3 Terminals Summary

Creating new Terminal subclasses is usually easy. The mechanics of matching belong to the Terminal class, and the new subclass needs only to decide which tokens or characters qualify as a match.


Building Parsers with Java
Building Parsers With Javaв„ў
ISBN: 0201719622
EAN: 2147483647
Year: 2000
Pages: 169

Similar book on Amazon

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net