9.1 The Role of a Tokenizer

Building Parsers with Java
By Steven John Metsker

Chapter 9.  Advanced Tokenizing


Most languages are easier to describe as patterns of tokens than as patterns of characters. A token represents a logical piece of a string. For example, a typical tokenizer would divide the string "1.23 <= 12.3" into three tokens: the number 1.23, a less-than-or-equal symbol, and the number 12.3. A token is a receptacle; it relies on a tokenizer to decide precisely how to divide a string into tokens. In addition to building up numbers from the characters of a string, a tokenizer provides other services that divide a string into tokens. A tokenizer typically does the following:

  • Parse numbers.

  • Build up "words" from letters and potentially other characters.

  • Treat characters such as "<" as one-character symbols.

  • Allow multicharacter symbols, such as "<=" and "=:=".

  • Treat whitespace as a token separator.

  • Respect quoted strings, "like this".

  • Strip out comments.
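
To make the duties listed above concrete, here is a minimal sketch of a hand-rolled tokenizer. It is not the Tokenizer class from sjm.parse.tokens; the class name MiniTokenizer, the list of multicharacter symbols, and the rule that hyphens and apostrophes may continue a word are all assumptions made for illustration. It returns tokens as plain strings, so numbers are kept as text rather than converted to values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal tokenizer; not the sjm.parse.tokens.Tokenizer class. */
class MiniTokenizer {

    // Assumed set of multicharacter symbols, tried before one-character symbols.
    private static final String[] MULTI_CHAR_SYMBOLS = {"=:=", "<=", ">=", "!="};

    // Assumption: letters, digits, hyphens, and apostrophes may continue a word.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '\'';
    }

    // Return the index just past the run of digits that starts at 'from'.
    private static int skipDigits(String s, int from) {
        int j = from;
        while (j < s.length() && Character.isDigit(s.charAt(j))) {
            j++;
        }
        return j;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int end;
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace only separates tokens
                continue;
            } else if (s.startsWith("//", i)) {        // line comment: skip to end of line
                end = s.indexOf('\n', i);
                i = (end < 0) ? s.length() : end;
                continue;
            } else if (s.startsWith("/*", i)) {        // block comment: skip past "*/"
                end = s.indexOf("*/", i + 2);
                i = (end < 0) ? s.length() : end + 2;
                continue;
            }
            if (c == '"' || c == '\'') {               // quoted string, quotes included
                end = s.indexOf(c, i + 1);
                end = (end < 0) ? s.length() : end + 1;
            } else if (Character.isDigit(c)) {         // number with optional fraction
                end = skipDigits(s, i);
                if (end + 1 < s.length() && s.charAt(end) == '.'
                        && Character.isDigit(s.charAt(end + 1))) {
                    end = skipDigits(s, end + 1);
                }
            } else if (Character.isLetter(c)) {        // word
                end = i;
                while (end < s.length() && isWordChar(s.charAt(end))) {
                    end++;
                }
            } else {                                   // symbol: longest known match, else one char
                end = i + 1;
                for (String sym : MULTI_CHAR_SYMBOLS) {
                    if (s.startsWith(sym, i)) {
                        end = i + sym.length();
                        break;
                    }
                }
            }
            tokens.add(s.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1.23 <= 12.3")); // prints [1.23, <=, 12.3]
    }
}
```

The order of the checks matters in this sketch: comments are tested before symbols so that "/" is not emitted as a one-character symbol, and multicharacter symbols are tested before falling back to a single character so that "<=" is not split into "<" and "=".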

For example, here is a short program that exercises most of these features in a tokenizer. This program uses the Tokenizer class from sjm.parse.tokens , which we later compare to the tokenizers in the standard Java libraries.

    package sjm.examples.tokens;

    import java.io.*;
    import sjm.parse.tokens.*;

    /**
     * Show a default <code>Tokenizer</code> object at work.
     */
    public class ShowTokenizer {

        public static void main(String args[]) throws IOException {
            String s =
                "\"It's 123 blast-off!\", she said, // watch out!\n" +
                "and <= 3 'ticks' later /* wince */ , it's blast-off!";

            System.out.println(s);
            System.out.println();

            Tokenizer t = new Tokenizer(s);
            while (true) {
                Token tok = t.nextToken();
                if (tok.equals(Token.EOF)) {
                    break;
                }
                System.out.println("(" + tok + ")");
            }
        }
    }

Running this class prints the following:

 "It's 123 blast-off!", she said, // watch out!  and <= 3 "ticks" later /* wince */ , it's blast-off! ("It's 123 blast-off!") (,) (she) (said) (,) (and) (<=) (3.0) ('ticks') (later) (,) (it's) (blast-off) (!) 

The tokenizer respects the quoted strings, ignores Java-style comments, and otherwise gathers characters into roughly the same chunks as would a human reader, separating words, numbers, and punctuation.
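
For contrast with the standard Java library tokenizers mentioned above, the following sketch runs java.io.StreamTokenizer with its default settings over the earlier string "1.23 <= 12.3". The class name StreamTokenizerDemo and the choice to collect tokens as strings are assumptions made for illustration. StreamTokenizer has no notion of multicharacter symbols, so "<" and "=" come back as two separate one-character tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class: collects the tokens StreamTokenizer reports by default. */
class StreamTokenizerDemo {

    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(Double.toString(st.nval));       // numbers arrive as doubles
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);
                } else if (st.ttype == '"' || st.ttype == '\'') {
                    tokens.add(st.sval);                        // body of a quoted string
                } else {
                    tokens.add(String.valueOf((char) st.ttype)); // ordinary single character
                }
            }
        } catch (IOException e) {
            // Not expected when reading from a String; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1.23 <= 12.3")); // prints [1.23, <, =, 12.3]
    }
}
```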


ISBN: 0201719622
Year: 2000
Pages: 169
