9.5 A Tokenizer Class

Building Parsers with Java
By Steven John Metsker

Chapter 9.  Advanced Tokenizing


The Tokenizer class in sjm.parse.tokens uses a set of states to recognize different types of tokens. Each state is a subclass of TokenizerState, a class in the same package. A Tokenizer object reads a character from the input string and uses that character to decide which state should find the next token. The design of Tokenizer in sjm.parse.tokens is as follows:

  1. Read a character and use it to look up which TokenizerState object to use.

  2. Send the TokenizerState object the initial character, and ask the TokenizerState to return a Token. The TokenizerState reads as many characters as it needs to produce a Token.

  3. Repeat until there are no more characters.
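The three steps above can be sketched in a few lines of Java. This is a minimal illustration, not the code in sjm.parse.tokens: the names SketchState, SketchTokenizer, and readRun are hypothetical, tokens are plain strings rather than Token objects, and the sketch has no whitespace state, so a space would come back as a one-character symbol token.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for TokenizerState; tokens are plain strings here.
interface SketchState {
    String nextToken(PushbackReader r, int first) throws IOException;
}

class SketchTokenizer {
    // Lookup table from character code to the state that handles it.
    private final SketchState[] characterState = new SketchState[256];
    // Returns the initial character itself as a one-character token.
    private final SketchState symbol = (r, first) -> String.valueOf((char) first);

    SketchTokenizer() {
        SketchState number = (r, first) -> readRun(r, first, Character::isDigit);
        SketchState word = (r, first) -> readRun(r, first, Character::isLetter);
        for (int c = 0; c < characterState.length; c++) {
            characterState[c] = Character.isDigit(c) ? number
                    : Character.isLetter(c) ? word : symbol;
        }
    }

    // Consume characters while they satisfy 'keep', then push back the first
    // character that does not belong to the run.
    private static String readRun(PushbackReader r, int first, IntPredicate keep)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        int c = first;
        while (c >= 0 && keep.test(c)) {
            sb.append((char) c);
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c);
        }
        return sb.toString();
    }

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        try (PushbackReader r = new PushbackReader(new StringReader(input))) {
            int c;
            while ((c = r.read()) >= 0) {                 // step 1: read a character
                SketchState state = c < 256 ? characterState[c] : symbol;
                tokens.add(state.nextToken(r, c));        // step 2: the state builds the token
            }                                             // step 3: repeat until input ends
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

Calling new SketchTokenizer().tokenize("123abc+") yields the tokens 123, abc, and +. The pushback reader matters: a state learns that its token has ended only by reading one character too many, so it must hand that character back for the next lookup.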

Figure 9.3 shows a state diagram of the classes in sjm.parse.tokens.

Figure 9.3. Tokenizer state transitions. A Tokenizer object changes from its start state into a TokenizerState object that returns a token.


The tokenizer state classes follow the state pattern [Gamma et al.], providing different implementations of the nextToken() method depending on the tokenizer's state. Tokenizer states generally consume text, produce a token, and return the token. For example, if the string to tokenize is "123 blastoff", the tokenizer sees the "1" and transfers control to a NumberState object. The NumberState object consumes all three characters of the number and then returns a token that represents the number 123.
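To make the "123 blastoff" example concrete, here is a small sketch of the state pattern applied to tokenizing. The classes DemoToken, DemoState, DemoNumberState, DemoWordState, and StatePatternDemo are hypothetical and much simpler than the real classes in sjm.parse.tokens; in particular, the number state here handles only unsigned decimal integers, and anything that does not start with a digit is treated as a word.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical token: either a number (nval) or a string (sval).
final class DemoToken {
    final boolean isNumber;
    final double nval;
    final String sval;

    DemoToken(double nval) { this.isNumber = true; this.nval = nval; this.sval = null; }
    DemoToken(String sval) { this.isNumber = false; this.nval = 0; this.sval = sval; }
}

// Each state supplies its own implementation of nextToken().
abstract class DemoState {
    abstract DemoToken nextToken(PushbackReader r, int first) throws IOException;
}

class DemoNumberState extends DemoState {
    @Override
    DemoToken nextToken(PushbackReader r, int first) throws IOException {
        double value = 0;
        int c = first;
        while (c >= '0' && c <= '9') {        // consume every digit of the number
            value = value * 10 + (c - '0');
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c);                      // leave the rest for another state
        }
        return new DemoToken(value);
    }
}

class DemoWordState extends DemoState {
    @Override
    DemoToken nextToken(PushbackReader r, int first) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c = first;
        while (c >= 0 && !Character.isWhitespace(c)) {
            sb.append((char) c);
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c);
        }
        return new DemoToken(sb.toString());
    }
}

class StatePatternDemo {
    // Returns the first token of the input, or null if the input is empty.
    static DemoToken firstToken(String input) {
        try (PushbackReader r = new PushbackReader(new StringReader(input))) {
            int c = r.read();
            if (c < 0) {
                return null;
            }
            // The initial character alone decides which state takes control.
            DemoState state = (c >= '0' && c <= '9')
                    ? new DemoNumberState() : new DemoWordState();
            return state.nextToken(r, c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, StatePatternDemo.firstToken("123 blastoff") returns a number token whose nval is 123.0, and the space and "blastoff" remain unread for whichever state handles them next.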

Some states cannot produce a token themselves but rather have the role of ignoring input. All objects of class WhitespaceState, SlashStarState, and SlashSlashState discard some sequence of characters and then ask the tokenizer to return the next token. Every state, including these, shares one job: producing the next token in support of the Tokenizer class's nextToken() method. Figure 9.4 shows the Tokenizer class.
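A state that ignores input can still honor the nextToken() contract by discarding its characters and then delegating back to the tokenizer. The following sketch shows that idea for whitespace only. MiniTokenizer, MiniState, MiniWhitespaceState, and MiniWordState are hypothetical names, tokens are plain strings, and null signals the end of input; the real classes may differ in all of these details.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical state interface: the tokenizer passes itself in so that a
// state with nothing to return can ask it for the following token.
interface MiniState {
    String nextToken(PushbackReader r, int first, MiniTokenizer t) throws IOException;
}

class MiniWhitespaceState implements MiniState {
    @Override
    public String nextToken(PushbackReader r, int first, MiniTokenizer t)
            throws IOException {
        int c = first;
        while (c >= 0 && Character.isWhitespace(c)) {   // discard the whole run
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c);
        }
        return t.nextToken();   // produced nothing itself, so delegate
    }
}

class MiniWordState implements MiniState {
    @Override
    public String nextToken(PushbackReader r, int first, MiniTokenizer t)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        int c = first;
        while (c >= 0 && !Character.isWhitespace(c)) {
            sb.append((char) c);
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c);
        }
        return sb.toString();
    }
}

class MiniTokenizer {
    private final PushbackReader reader;
    private final MiniState whitespace = new MiniWhitespaceState();
    private final MiniState word = new MiniWordState();

    MiniTokenizer(String input) {
        this.reader = new PushbackReader(new StringReader(input));
    }

    // Returns the next token, or null when the input is exhausted.
    String nextToken() {
        try {
            int c = reader.read();
            if (c < 0) {
                return null;
            }
            MiniState state = Character.isWhitespace(c) ? whitespace : word;
            return state.nextToken(reader, c, this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the input "  123   blastoff ", successive calls to nextToken() return "123", then "blastoff", then null. The caller never sees the whitespace state at work, which is the point of having it delegate rather than return a placeholder.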

Figure 9.4. The Tokenizer class. A Tokenizer returns a series of tokens using various TokenizerState objects to build different types of tokens.


The Tokenizer class creates a default set of states, makes them accessible, and plugs them into its lookup table, which it calls characterState. You can create new states and plug them into the lookup table, giving you complete control over how the tokenizer works.
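The sketch below illustrates what plugging a new state into the lookup table might look like. It is an assumption-laden model, not the sjm.parse.tokens API: PlugState, PluggableTokenizer, the setCharacterState(from, to, state) method shown here, and the variableState() helper are all hypothetical names chosen for this example, and tokens are plain strings.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TokenizerState.
interface PlugState {
    String nextToken(PushbackReader r, int first) throws IOException;
}

class PluggableTokenizer {
    private final PlugState[] characterState = new PlugState[256];
    private final PlugState symbol = (r, first) -> String.valueOf((char) first);

    PluggableTokenizer() {
        // Default states: letters start words, everything else is a symbol.
        PlugState word = (r, first) -> readLetters(r, first, new StringBuilder());
        setCharacterState(0, 255, symbol);
        setCharacterState('a', 'z', word);
        setCharacterState('A', 'Z', word);
    }

    // Assumed hook: route every character in [from, to] to the given state.
    void setCharacterState(int from, int to, PlugState state) {
        for (int c = from; c <= to; c++) {
            characterState[c] = state;
        }
    }

    // A custom state: '$' followed by letters becomes one token, such as "$ab".
    static PlugState variableState() {
        return (r, first) ->
                readLetters(r, r.read(), new StringBuilder().append((char) first));
    }

    // Append letters starting at c to sb, pushing back the first non-letter.
    private static String readLetters(PushbackReader r, int c, StringBuilder sb)
            throws IOException {
        while (c >= 0 && Character.isLetter(c)) {
            sb.append((char) c);
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c);
        }
        return sb.toString();
    }

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        try (PushbackReader r = new PushbackReader(new StringReader(input))) {
            int c;
            while ((c = r.read()) >= 0) {
                PlugState state = c < 256 ? characterState[c] : symbol;
                tokens.add(state.nextToken(r, c));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

With the defaults, tokenize("$ab+c") produces $, ab, +, and c. After calling setCharacterState('$', '$', PluggableTokenizer.variableState()), the same input produces $ab, +, and c. Only the table entry for '$' changed; the tokenizing loop itself was not touched.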


Building Parsers with Java™
ISBN: 0201719622
Year: 2000
Pages: 169
