Most languages are easier to describe as patterns of tokens than as patterns of characters. A token represents a logical piece of a string. For example, a typical tokenizer would divide the string "1.23 <= 12.3" into three tokens: the number 1.23, a less-than-or-equal symbol, and the number 12.3. A token is a receptacle; it relies on a tokenizer to decide precisely how to divide a string into tokens. In addition to building up numbers from the characters of a string, a tokenizer provides other services as it divides a string into tokens. A tokenizer typically does the following:

- Parse numbers.
- Build up "words" from letters and potentially other characters.
- Treat characters such as "<" as one-character symbols.
- Allow multicharacter symbols, such as "<=" and "=:=".
- Treat whitespace as a token separator.
- Respect quoted strings, "like this".
- Strip out comments.

For example, here is a short program that exercises most of these features in a tokenizer. This program uses the Tokenizer class from sjm.parse.tokens, which we later compare to the tokenizers in the standard Java libraries.

    package sjm.examples.tokens;

    import java.io.*;
    import sjm.parse.tokens.*;

    /**
     * Show a default <code>Tokenizer</code> object at work.
     */
    public class ShowTokenizer {

        public static void main(String args[]) throws IOException {

            String s =
                "\"It's 123 blast-off!\", she said, // watch out!\n" +
                "and <= 3 'ticks' later /* wince */ , it's blast-off!";

            System.out.println(s);
            System.out.println();

            Tokenizer t = new Tokenizer(s);

            while (true) {
                Token tok = t.nextToken();
                if (tok.equals(Token.EOF)) {
                    break;
                }
                System.out.println("(" + tok + ")");
            }
        }
    }

Running this class prints the following:

    "It's 123 blast-off!", she said, // watch out!
    and <= 3 'ticks' later /* wince */ , it's blast-off!

    ("It's 123 blast-off!")
    (,)
    (she)
    (said)
    (,)
    (and)
    (<=)
    (3.0)
    ('ticks')
    (later)
    (,)
    (it's)
    (blast-off)
    (!)

The tokenizer respects the quoted strings, ignores Java-style comments, and otherwise gathers characters into roughly the same chunks as would a human reader, separating words, numbers, and punctuation.
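To make the services listed above concrete, the following is a minimal sketch of a hand-rolled tokenizer that uses only the standard Java libraries. It is not the implementation of the Tokenizer class in sjm.parse.tokens. The class name MiniTokenizer, the tokenize method, the helper isWordChar, and the particular set of multicharacter symbols are all assumptions made for illustration. For brevity, the sketch returns tokens as plain strings rather than as Token objects, and it assumes that words may contain digits, hyphens, underscores, and apostrophes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mini-tokenizer for illustration; not the sjm.parse.tokens code. */
class MiniTokenizer {

    // Assumed set of multicharacter symbols; anything else is a one-character symbol.
    private static final String[] MULTI = { "=:=", "<=", ">=", "!=" };

    // Assumption: words may contain letters, digits, '-', '_', and apostrophes.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_' || c == '\'';
    }

    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                    // whitespace separates tokens
            } else if (s.startsWith("//", i)) {
                int end = s.indexOf('\n', i);           // line comment: skip to end of line
                i = (end < 0) ? s.length() : end;
            } else if (s.startsWith("/*", i)) {
                int end = s.indexOf("*/", i + 2);       // block comment: skip past "*/"
                i = (end < 0) ? s.length() : end + 2;
            } else if (c == '"' || c == '\'') {
                int end = s.indexOf(c, i + 1);          // quoted string, quotes retained
                end = (end < 0) ? s.length() : end + 1;
                out.add(s.substring(i, end));
                i = end;
            } else if (Character.isDigit(c)) {
                int j = i;                              // number: digits, optional fraction
                while (j < s.length() && Character.isDigit(s.charAt(j))) j++;
                if (j + 1 < s.length() && s.charAt(j) == '.'
                        && Character.isDigit(s.charAt(j + 1))) {
                    j++;
                    while (j < s.length() && Character.isDigit(s.charAt(j))) j++;
                }
                // Store the numeric value, so "3" is reported as "3.0".
                out.add(String.valueOf(Double.parseDouble(s.substring(i, j))));
                i = j;
            } else if (Character.isLetter(c)) {
                int j = i;                              // word: starts with a letter
                while (j < s.length() && isWordChar(s.charAt(j))) j++;
                out.add(s.substring(i, j));
                i = j;
            } else {
                String sym = String.valueOf(c);         // default: one-character symbol
                for (String m : MULTI) {
                    if (s.startsWith(m, i)) { sym = m; break; }
                }
                out.add(sym);
                i += sym.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String tok : tokenize("1.23 <= 12.3")) {
            System.out.println("(" + tok + ")");
        }
    }
}
```

For the string "1.23 <= 12.3", this sketch returns the three tokens 1.23, <=, and 12.3. The order of the branches matters: comments and quoted strings are checked before symbols so that "/" and quote characters inside them are never treated as tokens of their own.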