9.1 The Role of a Tokenizer | Building Parsers With Javaв„ў

	Building Parsers with Java By Steven John Metsker
	Table of Contents

	Chapter 9. Advanced Tokenizing

Content

Most languages are easier to describe as patterns of tokens than as patterns of characters . A token represents a logical piece of a string. For example, a typical tokenizer would divide the string "1.23 <= 12.3" into three tokens: the number 1.23 , a less-than -or-equal symbol, and the number 12.3 . A token is a receptacle; it relies on a tokenizer to decide precisely how to divide a string into tokens. In addition to building up numbers from the characters of a string, a tokenizer provides other services that divide a string into tokens. A tokenizer typically does the following:

Parse numbers.
Build up "words" from letters and potentially other characters.
Treat characters such as " < " as one-character symbols.
Allow multicharacter symbols, such as " <= " and " =:= ".
Treat whitespace as a token separator.
Respect quoted strings, "like this" .
Strip out comments.

For example, here is a short program that exercises most of these features in a tokenizer. This program uses the Tokenizer class from sjm.parse.tokens , which we later compare to the tokenizers in the standard Java libraries.

 package sjm.examples.tokens;  import java.io.*; import sjm.parse.tokens.*; /**  * Show a default <code>Tokenizer</code> object at work.  */ public class ShowTokenizer { public static void main(String args[]) throws IOException {     String s =     "\"It's 123 blast-off!\", she said, // watch out!\n" +     "and <= 3 'ticks' later /* wince */ , it's blast-off!";     System.out.println(s);     System.out.println();     Tokenizer t = new Tokenizer(s);     while (true) {         Token tok = t.nextToken();         if (tok.equals(Token.EOF)) {             break;         }         System.out.println("(" + tok + ")");     } } }

Running this class prints the following:

 "It's 123 blast-off!", she said, // watch out!  and <= 3 "ticks" later /* wince */ , it's blast-off! ("It's 123 blast-off!") (,) (she) (said) (,) (and) (<=) (3.0) ('ticks') (later) (,) (it's) (blast-off) (!)

The tokenizer respects the quoted strings, ignores Java-style comments, and otherwise gathers characters into roughly the same chunks as would a human reader, separating words, numbers, and punctuation.

Top