5.4 A Tokenizing Problem

Building Parsers with Java
By Steven John Metsker

Chapter 5. Parsing Data Languages


The coffee grammar accepts coffee names, roasts, and countries as Word terminals. This creates a problem if any of these "words" contains a blank. For example, a coffee might be named "Toasty Rita" and come from Costa Rica. By default, the class Tokenizer in sjm.parse.tokens treats a blank as the end of a word. When tokenizing the text

 Toasty Rita, Italian, Costa Rica, 9.95 

a default tokenizer would return Toasty as a Word, followed by Rita as a Word. After the first word, the grammar expects a comma rather than another word, so a parser generated from the grammar will fail to match the input text.
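The splitting described above can be reproduced with a minimal sketch. It assumes the standard library's java.io.StreamTokenizer as a stand-in for sjm.parse.tokens.Tokenizer, which is not used here; the class name DefaultTokenizingDemo and the tokenize() helper are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DefaultTokenizingDemo {

    // Returns a readable description of each token found in the text.
    // Assumption: StreamTokenizer stands in for the book's Tokenizer class.
    static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("Word(" + st.sval + ")");
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("Number(" + st.nval + ")");
                } else {
                    // Ordinary characters such as ',' come back as their own type.
                    tokens.add("Symbol(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With default settings a blank ends a word, so "Toasty Rita"
        // arrives as two separate Word tokens.
        for (String token : tokenize("Toasty Rita, Italian, Costa Rica, 9.95")) {
            System.out.println(token);
        }
    }
}
```

With default settings, Toasty and Rita come back as two consecutive words with no comma between them, which is the sequence the coffee grammar cannot match.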

One solution is to ask the tokenizer to allow blanks to occur inside words. The following code snippet creates such a tokenizer:

 Tokenizer t = new Tokenizer();
 t.wordState().setWordChars(' ', ' ', true);

This code says that a blank is a legal part of a word. The CoffeeParser class that appears later in this chapter provides a tokenizer() method that returns a tokenizer that accepts blanks inside words. The ShowCoffee class, which also appears later in this chapter, uses this tokenizer when parsing an input file of coffee types. Chapter 9, "Advanced Tokenizing," covers tokenizing in detail.
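As a rough illustration of the effect, the following sketch applies the same idea with the standard library's java.io.StreamTokenizer, whose wordChars(low, hi) method plays a role similar to setWordChars. This is an assumption made so the example runs without the sjm packages; it is not the CoffeeParser implementation, and the class name BlankInWordsDemo and its tokenizer() and tokenize() helpers are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BlankInWordsDemo {

    // Hypothetical factory: returns a tokenizer that accepts blanks inside words.
    static StreamTokenizer tokenizer(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        // Declare the single-character range ' '..' ' to be word characters.
        // In StreamTokenizer the blank also keeps its whitespace role, so
        // blanks that precede a token are still skipped.
        st.wordChars(' ', ' ');
        return st;
    }

    // Returns a readable description of each token found in the text.
    static List<String> tokenize(String text) {
        StreamTokenizer st = tokenizer(text);
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("Word(" + st.sval + ")");
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("Number(" + st.nval + ")");
                } else {
                    tokens.add("Symbol(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Toasty Rita, Italian, Costa Rica, 9.95")) {
            System.out.println(token);
        }
    }
}
```

In this sketch "Toasty Rita" and "Costa Rica" each arrive as a single word, so every word is followed by a comma, as the grammar expects.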


ISBN: 0201719622
Year: 2000
Pages: 169
