9.3 Tokenizers in Standard Java

	Building Parsers with Java By Steven John Metsker

	Chapter 9. Advanced Tokenizing


The standard Java libraries include two tokenizers: StringTokenizer in java.util and StreamTokenizer in java.io.

The StringTokenizer class does not parse numbers, and it allows little customization. This tokenizer is suitable only for simple tokenization, and this book does not discuss it further.

The StreamTokenizer class is more customizable than StringTokenizer but lacks some desirable features. In particular, StreamTokenizer in Java 1.1.7 does not provide:

A Token class to encapsulate token results
Customization of how to recognize numbers
The ability to define new token types
Differentiation of allowable characters for the start of a word from allowable characters within a word
Handling of multicharacter symbols such as "<="

For example, the following program uses StreamTokenizer from java.io to tokenize the "blast-off" line from the preceding section:

    package sjm.examples.tokens;

    import java.io.*;

    /**
     * Show a <code>StreamTokenizer</code> object at work
     */
    public class ShowTokenizer2 {

        public static void main(String args[]) throws IOException {

            String s =
                "\"It's 123 blast-off!\", she said, // watch out!\n" +
                "and <= 3 'ticks' later /* wince */ , it's blast-off!";

            System.out.println(s);
            System.out.println();

            StreamTokenizer t =
                new StreamTokenizer(new StringReader(s));
            t.ordinaryChar('/');
            t.slashSlashComments(true);
            t.slashStarComments(true);

            boolean done = false;
            while (!done) {
                t.nextToken();
                switch (t.ttype) {
                    case StreamTokenizer.TT_EOF :
                        done = true;
                        break;
                    case StreamTokenizer.TT_WORD :
                    case '\"' :
                    case '\'' :
                        System.out.println("(" + t.sval + ")");
                        break;
                    case StreamTokenizer.TT_NUMBER :
                        System.out.println("(" + t.nval + ")");
                        break;
                    default :
                        System.out.println(
                            "(" + (char) t.ttype + ")");
                        break;
                }
            }
        }
    }

To initialize a StreamTokenizer object to ignore Java-style comments, this code uses these statements:

    StreamTokenizer t =
        new StreamTokenizer(new StringReader(s));
    t.ordinaryChar('/');
    t.slashSlashComments(true);
    t.slashStarComments(true);

By default, StreamTokenizer treats a slash as a comment character and ignores the rest of the line after it, so this code first makes the slash an "ordinary" character. The two calls that follow then turn on recognition of // and /* */ comments.
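The effect of the default slash handling can be seen in isolation. The following is a minimal sketch, not code from the book: the class name SlashDemo, its tokens() helper, and the input "a / b" are made up for illustration. It tokenizes the same text with and without the call to ordinaryChar('/').

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only StreamTokenizer itself comes from java.io.
class SlashDemo {

    // Return the tokens of the input as strings. When slashIsOrdinary is
    // true, the slash loses its default role as a comment character.
    static List<String> tokens(String input, boolean slashIsOrdinary) {
        StreamTokenizer t = new StreamTokenizer(new StringReader(input));
        if (slashIsOrdinary) {
            t.ordinaryChar('/');
        }
        List<String> out = new ArrayList<>();
        try {
            while (t.nextToken() != StreamTokenizer.TT_EOF) {
                if (t.ttype == StreamTokenizer.TT_WORD) {
                    out.add(t.sval);
                } else if (t.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(t.nval));
                } else {
                    out.add(String.valueOf((char) t.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a / b", false)); // prints [a]
        System.out.println(tokens("a / b", true));  // prints [a, /, b]
    }
}
```

With the default settings, everything from the slash to the end of the line disappears, which is why a tokenizer meant to return "/" as a symbol needs the ordinaryChar('/') call.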

The code in main() also shows how to handle quoted strings and symbols with StreamTokenizer. When a StreamTokenizer object finds a quoted string, it places the quote character in its ttype attribute and places the contents of the string in its sval attribute. In the example, the case labels for words and for both quote characters share one output statement, so words and quoted strings print the same way.

In its default branch, the switch statement in the example shows as a symbol any token that is not a word, quoted string, or number. In this case, the StreamTokenizer object stores the symbol character in its ttype attribute. The sample code casts ttype to a char, so symbols print as characters rather than as numeric values. Running ShowTokenizer2 prints the following:

    "It's 123 blast-off!", she said, // watch out!
    and <= 3 'ticks' later /* wince */ , it's blast-off!

    (It's 123 blast-off!)
    (,)
    (she)
    (said)
    (,)
    (and)
    (<)
    (=)
    (3.0)
    (ticks)
    (later)
    (,)
    (it)
    (s blast-off!)

This is similar to the earlier output except that StreamTokenizer does not include the quote character as part of the quoted string. The output also shows that StreamTokenizer divides the <= symbol into two tokens. This makes it more difficult to write a grammar because the grammar must comprehend that some comparisons use one symbol and other comparisons use two symbols. This also means that your language will allow whitespace to appear inside a two-character symbol. One solution to this problem is to wrap StreamTokenizer with a class that looks ahead and combines the two symbols "<" and "=" into one.
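One way such a wrapper might look is sketched below. This is an illustration under stated assumptions, not the book's solution: the class SymbolJoiningTokenizer and its next() and all() methods are hypothetical names, and the sketch returns tokens as plain strings for brevity. It relies only on the nextToken() and pushBack() methods of StreamTokenizer.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper; not a class from the book or from java.io.
class SymbolJoiningTokenizer {
    private final StreamTokenizer t;

    SymbolJoiningTokenizer(String input) {
        t = new StreamTokenizer(new StringReader(input));
    }

    // Return the next token as text, or null at end of input.
    String next() throws IOException {
        int type = t.nextToken();
        switch (type) {
            case StreamTokenizer.TT_EOF:
                return null;
            case StreamTokenizer.TT_WORD:
            case '"':
            case '\'':
                return t.sval;
            case StreamTokenizer.TT_NUMBER:
                return String.valueOf(t.nval);
            case '<':
            case '>':
                // Look ahead one token and join it if it is '='.
                if (t.nextToken() == '=') {
                    return (char) type + "=";
                }
                t.pushBack(); // not '=', so the next call sees it again
                return String.valueOf((char) type);
            default:
                return String.valueOf((char) type);
        }
    }

    // Convenience helper: collect every token of the input.
    static List<String> all(String input) {
        SymbolJoiningTokenizer w = new SymbolJoiningTokenizer(input);
        List<String> out = new ArrayList<>();
        try {
            for (String s = w.next(); s != null; s = w.next()) {
                out.add(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(all("and <= 3")); // prints [and, <=, 3.0]
        System.out.println(all("a < b"));    // prints [a, <, b]
    }
}
```

This sketch hands the grammar a single "<=" token, but it does not address the whitespace issue: because StreamTokenizer skips whitespace before the wrapper sees anything, the input "< =" is also joined into "<=".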

Another problem with this example is that the tokenizer mistakes the apostrophe in "it's" for the beginning of a quoted string. There is no corresponding mate for this character, so the tokenizer consumes the rest of the input, returning it as a quoted string. You could ask the tokenizer to treat the single quote as a word character, but then the tokenizer would return 'ticks' as a word with the quotes embedded in it. Worse, if the input included 'wee ticks', this approach would tokenize 'wee as one word and ticks' as the next. The problem is that you want an apostrophe to occur inside a word as part of that word, and you want a single quote after some whitespace to mean the beginning of a quoted string. To achieve this, you need to separate the event of entering a tokenizer state from the mechanics of how that state builds a token. This is a primary motivation for writing a new Tokenizer class.
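The trade-off can be reproduced with a few lines. The following sketch is illustrative only: the class ApostropheDemo, its words() helper, and the sample input are assumptions made here. It clears the quote role of the single quote with ordinaryChar() and then registers it as a word character with wordChars().

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class showing the single quote as a word character.
class ApostropheDemo {

    static List<String> words(String input) {
        StreamTokenizer t = new StreamTokenizer(new StringReader(input));
        t.ordinaryChar('\'');    // drop the default quote behavior
        t.wordChars('\'', '\''); // then allow the character inside words
        List<String> out = new ArrayList<>();
        try {
            while (t.nextToken() != StreamTokenizer.TT_EOF) {
                if (t.ttype == StreamTokenizer.TT_WORD) {
                    out.add(t.sval);
                } else if (t.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(t.nval));
                } else {
                    out.add(String.valueOf((char) t.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [it's, 'wee, ticks', later]
        System.out.println(words("it's 'wee ticks' later"));
    }
}
```

The apostrophe in "it's" now survives, but the quoted phrase is split at the space into two words that each carry a stray quote.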

This example is longer than the preceding example, primarily because of the printing logic that handles the different token types. This illustrates the advantage of introducing a separate Token class that knows how to display itself.
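To make the point concrete, here is a rough sketch of a token type that knows how to display itself. It is not the Token class that the book develops: SimpleTokenDemo, SimpleToken, Kind, from(), and render() are hypothetical names chosen for this illustration. Once the token object carries its own toString(), the printing loop no longer needs a switch.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo; the nested SimpleToken is an illustration only.
class SimpleTokenDemo {

    static final class SimpleToken {
        enum Kind { WORD, QUOTED, NUMBER, SYMBOL }

        final Kind kind;
        final String text;   // used for WORD, QUOTED, and SYMBOL
        final double number; // used for NUMBER

        private SimpleToken(Kind kind, String text, double number) {
            this.kind = kind;
            this.text = text;
            this.number = number;
        }

        // Build a token from the tokenizer's current ttype, sval, and nval.
        static SimpleToken from(StreamTokenizer t) {
            switch (t.ttype) {
                case StreamTokenizer.TT_WORD:
                    return new SimpleToken(Kind.WORD, t.sval, 0);
                case '"':
                case '\'':
                    return new SimpleToken(Kind.QUOTED, t.sval, 0);
                case StreamTokenizer.TT_NUMBER:
                    return new SimpleToken(Kind.NUMBER, "", t.nval);
                default:
                    return new SimpleToken(
                        Kind.SYMBOL, String.valueOf((char) t.ttype), 0);
            }
        }

        // The token decides how it is displayed.
        @Override
        public String toString() {
            return kind == Kind.NUMBER ? String.valueOf(number) : text;
        }
    }

    // The display loop shrinks to one line per token.
    static String render(String input) {
        StreamTokenizer t = new StreamTokenizer(new StringReader(input));
        StringBuilder out = new StringBuilder();
        try {
            while (t.nextToken() != StreamTokenizer.TT_EOF) {
                out.append("(").append(SimpleToken.from(t)).append(")");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints (and)(<)(=)(3.0)(ticks)
        System.out.println(render("and <= 3 \"ticks\""));
    }
}
```

The type-dependent logic has not vanished; it has moved into the token, where every client of the tokenizer can reuse it.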
