Most languages are easier to describe as patterns of tokens than as patterns of characters. A token represents a logical piece of a string. For example, a typical tokenizer would divide the string "1.23 <= 12.3" into three tokens: the number 1.23, a less-than-or-equal symbol, and the number 12.3. A token is a receptacle; it relies on a tokenizer to decide precisely how to divide a string into tokens. In addition to building up numbers from the characters of a string, a tokenizer provides other services that divide a string into tokens. A tokenizer typically does the following:

  • Parse numbers.

  • Build up "words" from letters and potentially other characters.

  • Treat characters such as "<" as one-character symbols.

  • Allow multicharacter symbols, such as "<=" and "=:=".

  • Treat whitespace as a token separator.

  • Respect quoted strings, "like this".

  • Strip out comments.
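To make several of these services concrete, here is a minimal sketch of a hand-written tokenizer. It is not the Tokenizer class from sjm.parse.tokens; the class name MiniTokenizer, the method tokenize, and the fixed table of multicharacter symbols are all assumptions made for illustration. The sketch parses numbers, builds words from letters only, recognizes one-character and multicharacter symbols, skips whitespace, and keeps double-quoted strings whole. It leaves out comment stripping and single-quoted strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only; not part of sjm.parse.tokens.
class MiniTokenizer {

    // Multicharacter symbols this sketch knows about (an assumed list).
    // Longer symbols come first so the longest match wins.
    private static final String[] MULTI = {"=:=", "<=", ">=", "!="};

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                      // whitespace only separates tokens
                continue;
            }
            int start = i;
            if (Character.isDigit(c)) {
                // Number: digits, optionally followed by '.' and more digits.
                while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                if (i + 1 < s.length() && s.charAt(i) == '.'
                        && Character.isDigit(s.charAt(i + 1))) {
                    i++;
                    while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                }
                double value = Double.parseDouble(s.substring(start, i));
                tokens.add(String.valueOf(value));
            } else if (Character.isLetter(c)) {
                // Word: a run of letters (a real tokenizer may allow more).
                while (i < s.length() && Character.isLetter(s.charAt(i))) i++;
                tokens.add(s.substring(start, i));
            } else if (c == '"') {
                // Quoted string: everything up to the closing quote, if any.
                i++;
                while (i < s.length() && s.charAt(i) != '"') i++;
                if (i < s.length()) i++;  // consume the closing quote
                tokens.add(s.substring(start, i));
            } else {
                // Symbol: prefer a multicharacter symbol, else one character.
                String symbol = String.valueOf(c);
                for (String m : MULTI) {
                    if (s.startsWith(m, i)) {
                        symbol = m;
                        break;
                    }
                }
                i += symbol.length();
                tokens.add(symbol);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String tok : tokenize("1.23 <= 12.3")) {
            System.out.println("(" + tok + ")");
        }
    }
}
```

Run on the string "1.23 <= 12.3", this sketch prints (1.23), (<=), and (12.3), one per line. Numbers are stored as doubles, so an input of 3 would print as 3.0.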

For example, here is a short program that exercises most of these features in a tokenizer. This program uses the Tokenizer class from sjm.parse.tokens, which we later compare to the tokenizers in the standard Java libraries.

    package sjm.examples.tokens;

    import java.io.*;
    import sjm.parse.tokens.*;

    /**
     * Show a default <code>Tokenizer</code> object at work.
     */
    public class ShowTokenizer {

        public static void main(String args[]) throws IOException {
            String s =
                "\"It's 123 blast-off!\", she said, // watch out!\n" +
                "and <= 3 'ticks' later /* wince */ , it's blast-off!";

            // Echo the input, followed by a blank line.
            System.out.println(s);
            System.out.println();

            // Print each token in parentheses until the end of the input.
            Tokenizer t = new Tokenizer(s);
            while (true) {
                Token tok = t.nextToken();
                if (tok.equals(Token.EOF)) {
                    break;
                }
                System.out.println("(" + tok + ")");
            }
        }
    }

Running this class prints the following:

 "It's 123 blast-off!", she said, // watch out!  and <= 3 "ticks" later /* wince */ , it's blast-off! ("It's 123 blast-off!") (,) (she) (said) (,) (and) (<=) (3.0) ('ticks') (later) (,) (it's) (blast-off) (!) 

The tokenizer respects the quoted strings, ignores Java-style comments, and otherwise gathers characters into roughly the same chunks as would a human reader, separating words, numbers, and punctuation.
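As a rough point of comparison with the standard Java libraries, the sketch below runs java.io.StreamTokenizer, in its default configuration, on the simple string "1.23 <= 12.3" from the start of this section. The class name StreamTokenizerSketch and the helper method tokenize are hypothetical names chosen for this illustration. Only the StreamTokenizer calls themselves come from the standard library.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
class StreamTokenizerSketch {

    // Collect the tokens that a default StreamTokenizer finds in s.
    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(String.valueOf(st.nval));   // numeric value
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);                   // word text
                } else if (st.ttype == '"' || st.ttype == '\'') {
                    tokens.add(st.sval);                   // body of a quoted string
                } else {
                    // Any other character is reported one character at a time.
                    tokens.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String tok : tokenize("1.23 <= 12.3")) {
            System.out.println("(" + tok + ")");
        }
    }
}
```

With these defaults the program prints (1.23), (<), (=), and (12.3), one per line. The less-than-or-equal symbol arrives as two one-character tokens, because a default StreamTokenizer reports each ordinary character separately rather than combining characters into multicharacter symbols.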


