11.3 New Token Types

	Building Parsers with Java By Steven John Metsker
	Table of Contents

	Chapter 11. Extending the Parser Toolkit

Content

A tokenizer is responsible for breaking text into logical elements. Your idea of what a "logical" element is may differ from the defaults provided by class Tokenizer in sjm.parse.tokens . For example, you may want to distinguish real numbers from imaginary numbers , or HTML tags from other kinds of text, or reserved and nonreserved words. Often, when you create a new type of token, you will create one or more new Terminal types to recognize it.

One important reason for changing a tokenizer is to avoid ambiguity. Section 4.2.2, "Terminal Ambiguity," shows the ambiguity of the following grammar:

 query  = (Word  volume)* '?';  volume = "cups"  "gallon"  "liter";

This grammar is ambiguous because a word such as "cups" is both a Word and a volume . To prevent this ambiguity without modifying the tokenizer, you can introduce new terminals. You could create new subclasses of Terminal , perhaps ReservedWord and NonReservedWord , having each one keep a list of known reserved words. ReservedWord would succeed in matching a word only if the word were on the list; NonReservedWord would succeed only for words not on the reserved word list. This design has the disadvantage of your having to invent a way of passing around a list of reserved words. An alternative is to push the job of distinguishing types of words down to the tokenizer.

To provide for unambiguous parsing of the query grammar, you can create a new token type ”say, TT_RESERVED ”and have the tokenizer return each word as either a TT_WORD token or a TT_RESERVED token. Figure 11.1 shows a new tokenizer state and a new terminal type that can work together to tokenize text, distinguishing reserved words from nonreserved words.

Figure 11.1. The `reserved` package. A new class for tokenizing, `WordOrReservedState` , allows an input string to produce elements that are either plain words or reserved words. A new terminal class, `ReservedWord` , recognizes the reserved tokens that the tokenizer produces.

graphics/11fig01.gif

The default behavior of class Tokenizer in sjm.parse.tokens is to transfer control to a WordState object when the input stream starts with a word character. By default, a Tokenizer constructor sets words to begin with characters using the following messages:

 //...  setCharacterState( 'a',   'z', wordState()); setCharacterState( 'A',   'Z', wordState()); setCharacterState(0xc0,  0xff, wordState());

The last line here allows words to begin with characters that do not typically appear in English. Figure 9.5 in Chapter 9, "Advanced Tokenizing," shows the values of characters.

To craft a tokenizer that returns reserved words as distinct from nonreserved words, you can replace the WordState tokenizer state with an object of the new class WordOrReservedState . This class subclasses WordState , introduces a static variable TT_RESERVED , and overrides nextToken() to consult a list of known reserved words. Here is WordOrReservedState.java :

 package sjm.examples.reserved;  import java.io.*; import java.util.*; import sjm.parse.*; import sjm.parse.tokens.*; /**  * Override WordState to return known reserved words as  * tokens of type TT_RESERVED.  */ public class WordOrReservedState extends WordState {     Vector reserved = new Vector();     /**      * A constant indicating that a token is a reserved word.      */     public static final TokenType TT_RESERVED =         new TokenType("reserved"); /**  * Adds the specified string as a known reserved word.  */ public void addReservedWord(String word) {     reserved.addElement(word); } /**  * Return all the known reserved words.  */ public Vector getReservedWords() {     return reserved; } /**  * Return a reserved token or a word token from a reader.  */ public Token nextToken(PushbackReader r, int c, Tokenizer t)     throws IOException {     Token tok = super.nextToken(r, c, t);     if (reserved.contains(tok.sval())) {         return new Token(TT_RESERVED, tok.sval(), 0);     }     return tok; } }

The WordOrReservedState class maintains a list of known reserved words and checks this list after the superclass builds a word from the reader. To use this state, you will create a Tokenizer object and direct it to use a WordOrReservedState object wherever it would have used a WordState object. First, to recognize the new type of token that the tokenizer will return, you need the class ReservedWord :

 package sjm.examples.reserved;  import java.util.*; import sjm.parse.*; import sjm.parse.tokens.*; /**  * A Word matches a word from a token assembly.  */ public class ReservedWord extends Terminal { /**  * Returns true if an assembly's next element is a reserved  * word.  */ protected boolean qualifies (Object o) {     Token t = (Token) o;     return t.ttype() == WordOrReservedState.TT_RESERVED; } }

This class is similar to the Word class, whose notion of whether an element "qualifies" depends entirely on the token type. Here is a demonstration of this class, and WordOrReservedState , in action:

 package sjm.examples.reserved;  import java.io.*; import java.util.*; import sjm.parse.*; import sjm.parse.tokens.*; /**  * This class shows the use of a customized tokenizer and  * the use of a terminal that looks for the new token type.  */ public class ShowReserved { /**  * Return a customized tokenizer that uses  * WordOrReservedState in place of WordState.  */ public static Tokenizer tokenizer() {     Tokenizer t = new Tokenizer();     WordOrReservedState wors = new WordOrReservedState();     wors.addReservedWord("cups");     wors.addReservedWord("gallon");     wors.addReservedWord("liter");     t.setCharacterState( 'a',   'z', wors);     t.setCharacterState( 'A',   'Z', wors);     t.setCharacterState(0xc0,  0xff, wors);     return t; } public static void main(String[] args) {     // volume = "cups"  "gallon"  "liter";     Parser volume = new ReservedWord();     // an anonymous Assembler subclass notes volume matches     volume.setAssembler(         new Assembler() {             public void workOn(Assembly a) {                 Object o = a.pop();                 a.push("VOL(" + o + ")");             }         });     // query = (Word  volume)* '?';     Parser wordOrVolume = new Alternation()         .add(new Word())         .add(volume);     Parser query = new Sequence()         .add(new Repetition(wordOrVolume))         .add(new Symbol('?'));     Tokenizer t = tokenizer();     t.setString("How many cups are in a gallon?");     Vector v = new Vector();     v.addElement(new TokenAssembly(t));     System.out.println(query.match(v)); } }

This sample class contains the static method tokenizer() , which returns a special tokenizer that looks for reserved words. The sample hard-codes three reserved words: "cups" , "gallon" , and "liter" . A more comprehensive example would have the names of perhaps 100 to 200 units. The tokenizer() method changed the default Tokenizer object to use the WordOrReservedState object for elements that begin with a letter.

The main() method in the example builds a parser for the grammar:

 query  = (Word  volume)* '?';  volume = "cups"  "gallon"  "liter";

The example uses a ReservedWord terminal for volume . Because the main() method uses the class's customized tokenizer, a word such as "cups" is a token of type TT_RESERVED and not TT_WORD . A Word terminal does not recognize "cups" as a word, but a ReservedWord terminal recognizes "cups" as a reserved word. Section 4.2.2, "Terminal Ambiguity," shows a sample program that prints four parses of the input "How many cups are in a gallon?" . Now, running ShowReserved prints the following:

 [   [How, many, VOL(cups), are, in, a, VOL(gallon), ?]  How/many/cups/are/in/a/gallon/?^  ]

The anonymous assemblers in ShowReserved.main() wrap the characters "VOL()" around the reserved words "cups" and "gallon" . Parsing the query is no longer ambiguous. This means that there is only one way to parse the input phrase.

The connection between tokens and terminals is tight. A tokenizer changes input text into a series of tokens, and a terminal either matches the whole token or it does not. This implies that for every token there must be at least one terminal in a parser that can match it, or else the parser can never match an input string. When you create a new token type, as in this section, you generally must create one or more terminals to match the new type.

Top

Figure 11.1. The reserved package. A new class for tokenizing, WordOrReservedState , allows an input string to produce elements that are either plain words or reserved words. A new terminal class, ReservedWord , recognizes the reserved tokens that the tokenizer produces.

Figure 11.1. The `reserved` package. A new class for tokenizing, `WordOrReservedState` , allows an input string to produce elements that are either plain words or reserved words. A new terminal class, `ReservedWord` , recognizes the reserved tokens that the tokenizer produces.