An early design decision is whether to treat your language as a pattern of characters or as a pattern of tokens. Most commonly, you will not want a tokenizer for languages that let a user specify patterns of characters to match against. Chapter 8, "Parsing Regular Expressions," gives an example of parsing without a tokenizer.

Tokens are composed of characters, so every language that is a pattern of tokens is also a pattern of characters. In theory, then, tokenizers are never necessary. In practice, however, it is usually simpler to tokenize the text and to specify a grammar for a language in terms of token terminals. Consider a robot control language that allows this command:

    move robot 7.1 meters from base

If you do not plan to tokenize, your parser must recognize every character, including the whitespace between words. You must also ensure that you properly gather characters into words, and you must build the number value yourself. All of this is work that a tokenizer will happily perform for you. Chapter 9, "Advanced Tokenizing," discusses how to customize a tokenizer. While you are learning to design new languages, you may want to limit your languages to those that can benefit from the default behavior of class Tokenizer in package sjm.parse.tokens.
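To make the point concrete, here is a minimal sketch of what a tokenizer does with the robot command. It uses the JDK's java.io.StreamTokenizer rather than the sjm.parse.tokens.Tokenizer class this book uses (whose details appear in later chapters); the idea is the same: the tokenizer skips the whitespace, gathers characters into words, and builds the number 7.1 for you.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenizeDemo {
    public static void main(String[] args) throws IOException {
        StreamTokenizer t = new StreamTokenizer(
            new StringReader("move robot 7.1 meters from base"));

        // nextToken() consumes characters until it has one complete token
        while (t.nextToken() != StreamTokenizer.TT_EOF) {
            if (t.ttype == StreamTokenizer.TT_WORD) {
                System.out.println("word:   " + t.sval);   // e.g. "move"
            } else if (t.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.println("number: " + t.nval);   // 7.1 as a double
            }
        }
    }
}
```

A parser written against these tokens sees six terminals instead of thirty-odd characters, and never has to mention whitespace at all.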