9.9 Customizing a Tokenizer


 
Building Parsers with Java
By Steven John Metsker

Chapter 9.  Advanced Tokenizing


You can customize a tokenizer in three ways: by customizing one of the tokenizer's states, by changing which state the tokenizer enters given an initial character, or by adding an entirely new state.

9.9.1 Customizing a State

The preceding section shows how the CoffeeParser class creates a special tokenizer that allows spaces to appear in words. The tokenizer() method of this class retrieves a WordState object from a tokenizer t and updates it:

 t.wordState().setWordChars(' ', ' ', true); 
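
The effect of such a call can be pictured with a small stand-alone model. The sketch below is not the sjm WordState source; the class name WordCharsSketch, its readWord() helper, the default character ranges, and the sample text are all assumptions made for illustration. It keeps one boolean per character and treats a word as the longest run of characters whose flag is set.

```java
// Hypothetical model of a word state's character table; not the sjm WordState source.
final class WordCharsSketch {
    private final boolean[] wordChar = new boolean[256];

    WordCharsSketch() {
        // Assumed defaults, chosen only for illustration.
        setWordChars('a', 'z', true);
        setWordChars('A', 'Z', true);
        setWordChars('0', '9', true);
    }

    /** Allows or disallows every character in [from, to] inside a word. */
    void setWordChars(int from, int to, boolean allowed) {
        for (int c = Math.max(from, 0); c <= to && c < wordChar.length; c++) {
            wordChar[c] = allowed;
        }
    }

    /** Returns the longest run of word characters beginning at index start. */
    String readWord(String text, int start) {
        int end = start;
        while (end < text.length()
                && text.charAt(end) < wordChar.length
                && wordChar[text.charAt(end)]) {
            end++;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        WordCharsSketch w = new WordCharsSketch();
        System.out.println(w.readWord("Double latte, hot", 0)); // stops at the space
        w.setWordChars(' ', ' ', true);
        System.out.println(w.readWord("Double latte, hot", 0)); // stops at the comma
    }
}
```

In this model, the first call returns "Double" and, once the space is marked as a word character, the second returns "Double latte".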

9.9.2 Changing Which State the Tokenizer Enters

The example in Section 9.7.1 makes the tokenizer enter a quote state when it sees a "#" character. It uses this line:

 t.setCharacterState('#', '#', t.quoteState()); 
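
One way to think about this call is as an update to a lookup table that maps each possible first character to the state that will consume the token. The following is a minimal sketch of that idea, not the sjm Tokenizer source: the class StateTableSketch, the State enum, the stateFor() method, and the default assignments are hypothetical and exist only to illustrate the mechanism.

```java
import java.util.Arrays;

// Hypothetical model of a character-to-state table; not the sjm Tokenizer source.
final class StateTableSketch {
    enum State { SYMBOL, WHITESPACE, WORD, NUMBER, QUOTE }

    private final State[] table = new State[256];

    StateTableSketch() {
        // Assumed defaults, chosen only for illustration.
        Arrays.fill(table, State.SYMBOL);
        setCharacterState(0, ' ', State.WHITESPACE);
        setCharacterState('a', 'z', State.WORD);
        setCharacterState('A', 'Z', State.WORD);
        setCharacterState('0', '9', State.NUMBER);
        setCharacterState('"', '"', State.QUOTE);
    }

    /** Routes every character in [from, to] to the given state. */
    void setCharacterState(int from, int to, State state) {
        for (int c = Math.max(from, 0); c <= to && c < table.length; c++) {
            table[c] = state;
        }
    }

    /** Returns the state that would consume a token starting with c. */
    State stateFor(char c) {
        return c < table.length ? table[c] : State.SYMBOL;
    }

    public static void main(String[] args) {
        StateTableSketch t = new StateTableSketch();
        System.out.println(t.stateFor('#')); // SYMBOL under the assumed defaults
        t.setCharacterState('#', '#', State.QUOTE);
        System.out.println(t.stateFor('#')); // QUOTE after the change
    }
}
```

Because the table is consulted only for the first character of each token, changing one entry redirects every token that begins with that character without touching the states themselves.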

9.9.3 Adding a State

As an example of creating and using a new tokenizer state, consider how to change the default tokenizer to recognize scientific notation. Your language might allow exponential notation for numbers, so that, given a line of text such as:

 "2.998e8 meters per second" 

your language might recognize the number as 299,800,000, or 2.998 times 10 to the eighth power. To achieve this, you must change how you tokenize numbers. Figure 9.15 shows a ScientificNumberState class. This class subclasses NumberState and reuses some of its superclass's logic.

Figure 9.15. The ScientificNumberState class. The ScientificNumberState class overrides some of the methods in NumberState to allow for exponential notation.
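
The core of such a state is the decision about whether an "e" that follows the digits really begins an exponent. The sketch below shows one way to make that decision with a java.io.PushbackReader. It is a stand-alone illustration, not the book's ScientificNumberState: the class ScientificNumberSketch and its methods are hypothetical, it assumes the number starts with a digit, and it delegates the final conversion to Double.parseDouble rather than reusing any NumberState logic.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-alone sketch; not the sjm ScientificNumberState source.
final class ScientificNumberSketch {

    /** Appends consecutive digits from r to text and returns how many were read. */
    static int absorbDigits(PushbackReader r, StringBuilder text) throws IOException {
        int count = 0;
        int c = r.read();
        while (c >= '0' && c <= '9') {
            text.append((char) c);
            count++;
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c); // leave the first non-digit for whoever reads next
        }
        return count;
    }

    /**
     * Reads digits, an optional fraction, and an optional exponent such as
     * e8 or e-3. Assumes the input starts with a digit; otherwise
     * Double.parseDouble throws NumberFormatException. The reader needs a
     * pushback buffer of at least 3 characters.
     */
    static double readNumber(PushbackReader r) throws IOException {
        StringBuilder text = new StringBuilder();
        absorbDigits(r, text);
        int c = r.read();
        if (c == '.') {
            text.append('.');
            absorbDigits(r, text);
            c = r.read();
        }
        if (c == 'e' || c == 'E') {
            int sign = r.read();
            boolean signed = (sign == '-' || sign == '+');
            if (!signed && sign >= 0) {
                r.unread(sign); // not a sign, so it may be the first exponent digit
            }
            StringBuilder exponent = new StringBuilder();
            if (absorbDigits(r, exponent) > 0) {
                text.append('e');
                if (signed) {
                    text.append((char) sign);
                }
                text.append(exponent);
                return Double.parseDouble(text.toString());
            }
            // No digits followed, so the 'e' belongs to the next token.
            if (signed) {
                r.unread(sign);
            }
        }
        if (c >= 0) {
            r.unread(c);
        }
        return Double.parseDouble(text.toString());
    }

    /** Convenience for trying the sketch: returns {number, unread remainder}. */
    static String[] split(String s) {
        try {
            PushbackReader r = new PushbackReader(new StringReader(s), 4);
            String number = String.valueOf(readNumber(r));
            StringBuilder rest = new StringBuilder();
            for (int c = r.read(); c >= 0; c = r.read()) {
                rest.append((char) c);
            }
            return new String[] { number, rest.toString() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, split("2.998e8 meters per second") separates the number from " meters per second", while split("3eggs") yields 3.0 and leaves "eggs" unread, because the characters consumed while looking for an exponent are pushed back when no digits follow the "e".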


To use the scientific notation tokenizer with a parser, follow these steps:

  1. Create the tokenizer.

  2. Feed the tokenizer a string to tokenize.

  3. Create a token assembly from the tokenizer.

  4. Ask the parser to parse this assembly.

Here's an example:

 package sjm.examples.tokens;

 import sjm.parse.*;
 import sjm.parse.tokens.*;
 import sjm.examples.arithmetic.*;

 /**
  * This class shows how to use a tokenizer that accepts
  * scientific notation with an arithmetic parser.
  */
 public class ShowScientific {

 public static void main(String[] args) throws Exception {
     Tokenizer t = new Tokenizer();
     ScientificNumberState sns = new ScientificNumberState();
     t.setCharacterState('0', '9', sns);
     t.setCharacterState('.', '.', sns);
     t.setCharacterState('-', '-', sns);
     t.setString("1e2 + 1e1 + 1e0 + 1e-1 + 1e-2 + 1e-3");
     Parser p = ArithmeticParser.start();
     Assembly a = p.bestMatch(new TokenAssembly(t));
     System.out.println(a.pop());
 }
 }

This example evaluates

 "1e2 + 1e1 + 1e0 + 1e-1 + 1e-2 + 1e-3" 

Running the class prints the correct answer:

 111.111 

   
ISBN: 0201719622
Year: 2000
Pages: 169
