2.5 Terminal Parsers

	Building Parsers with Java By Steven John Metsker
	Chapter 2. The Elements of a Parser

The examples in this book wrap strings either as assemblies of characters or as assemblies of tokens. Each terminal must be geared toward one of these two types of assemblies. Each time a terminal asks an assembly for its nextElement(), the terminal must anticipate receiving either a character or a token; the terminal decides whether the character or token it receives is a match. Figure 2.8 shows the hierarchy of terminals that work with tokens. The subclasses of Terminal in Figure 2.8 are members of the package sjm.parse.tokens.

Figure 2.8. Token terminals. The subclasses of Terminal shown here expect an assembly's elements to be complete tokens.

2.5.1 Using Terminals

For an example of how to use a terminal, consider the following program:

    package sjm.examples.introduction;

    import sjm.parse.*;
    import sjm.parse.tokens.*;

    /**
     * Show how to recognize terminals in a string.
     */
    public class ShowTerminal {
        public static void main(String[] args) {
            String s = "steaming hot coffee";
            Assembly a = new TokenAssembly(s);
            Parser p = new Word();
            while (true) {
                a = p.bestMatch(a);
                if (a == null) {
                    break;
                }
                System.out.println(a);
            }
        }
    }

The parser p is a single Word object. The assembly a is a TokenAssembly built from the string "steaming hot coffee". The code asks the parser to return its best match against this assembly. Because the parser matches any word, its best match will be a new assembly with the same string and an index moved forward by one word. The first line the program prints is

 [steaming]steaming^hot/coffee

This output demonstrates how an assembly represents itself as a string. The assembly first shows the contents of its stack, [steaming]. The default behavior of all Terminal objects is to push whatever they recognize onto the assembly's stack. You can prevent this pushing by sending a Terminal object a discard() message.

After the stack, the assembly shows its tokens, separated by slashes, and marks the location of the index with a caret (^).

The while loop of the program asks the parser to return its best match against the revised assembly. In this pass, the program prints

 [steaming, hot]steaming/hot^coffee

Now the stack contains two words, and the index has moved past two words. In the next pass, the program prints

 [steaming, hot, coffee]steaming/hot/coffee^

Now the stack has all three words, and the index is at the end. In the next pass, the while loop will again ask the parser for its best match. Because the index of the assembly is at the end, there is no match; the bestMatch() method returns null, and the logic breaks out of the loop.
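The walk-through above can be imitated without the sjm library. The following is a minimal sketch only: MiniAssembly, matchWord(), and trace() are hypothetical stand-ins invented for illustration, not the classes in sjm.parse, and the sketch assumes that tokens are separated by whitespace and that a word consists only of letters. It also models the discard() behavior as a simple boolean flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the sjm.parse classes.
class TerminalSketch {

    /** An immutable snapshot of tokens, an index, and a stack of recognized elements. */
    static final class MiniAssembly {
        final List<String> tokens;
        final int index;
        final List<String> stack;

        MiniAssembly(List<String> tokens, int index, List<String> stack) {
            this.tokens = tokens;
            this.index = index;
            this.stack = stack;
        }

        /** Assumption: tokens are simply separated by whitespace. */
        static MiniAssembly of(String s) {
            return new MiniAssembly(Arrays.asList(s.trim().split("\\s+")), 0, new ArrayList<>());
        }

        /** Formats as [stack]consumed^remaining, with tokens separated by slashes. */
        @Override
        public String toString() {
            String consumed = String.join("/", tokens.subList(0, index));
            String remaining = String.join("/", tokens.subList(index, tokens.size()));
            return stack + consumed + "^" + remaining;
        }
    }

    /** Returns a new assembly advanced past one word, or null if no word comes next. */
    static MiniAssembly matchWord(MiniAssembly a, boolean discard) {
        if (a.index >= a.tokens.size()) {
            return null; // index is at the end, so there is no match
        }
        String token = a.tokens.get(a.index);
        if (!token.chars().allMatch(Character::isLetter)) {
            return null; // assumption: a word is letters only
        }
        List<String> stack = new ArrayList<>(a.stack);
        if (!discard) {
            stack.add(token); // default behavior: push what was recognized
        }
        return new MiniAssembly(a.tokens, a.index + 1, stack);
    }

    /** Applies the word terminal repeatedly and collects each intermediate assembly. */
    static List<String> trace(String s, boolean discard) {
        List<String> lines = new ArrayList<>();
        MiniAssembly a = matchWord(MiniAssembly.of(s), discard);
        while (a != null) {
            lines.add(a.toString());
            a = matchWord(a, discard);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints the same three lines as ShowTerminal.
        trace("steaming hot coffee", false).forEach(System.out::println);
    }
}
```

Calling trace("steaming hot coffee", true) instead leaves the stack empty on every pass, which is the effect the text attributes to discard(): the index still advances, but nothing is pushed.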

2.5.2 Word Terminals

In the preceding example, the parser p recognizes the set of all words. To be more specific, the language that p recognizes is the set of all strings that class Tokenizer in package sjm.parse.tokens recognizes as words.

2.5.3 Num Terminals

As a language designer, you can decide what constitutes a number. The default instance of Tokenizer will find that the string

 "12 12.34 .1234"

contains the numbers 12.0, 12.34, and 0.1234. The default tokenizer will not recognize exponential notation or anything beyond digits, a decimal point, and more digits. Here is a program that shows the default tokenization of numbers:

    package sjm.examples.introduction;

    import sjm.parse.tokens.*;
    import sjm.utensil.*;

    /**
     * Show what counts as a number.
     */
    public class ShowNums {
        public static void main(String[] args) {
            String s = "12 12.34 .1234 1234e-2";
            TokenAssembly a = new TokenAssembly(s);
            while (a.hasMoreElements()) {
                System.out.println(a.nextElement());
            }
        }
    }

Running this class prints the following:

 12.0
 12.34
 0.1234
 1234.0
 e-2

Note that by default the tokenizer does not comprehend the exponential notation of the last number: it returns 1234.0 and e-2 as separate tokens. Chapter 9, "Advanced Tokenizing," explains how to change the tokenization of a string to allow for exponential notation. In your own languages, you can use the default number recognition in Tokenizer and simply disallow exponential notation. Alternatively, you can customize a Tokenizer object to allow exponential, imaginary, and other types of notation for numbers.
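To make the rule "digits, a decimal point, and more digits" concrete, here is a small sketch that approximates it with a regular expression. This is an assumption made for illustration: the class PlainNumberSketch and its methods are hypothetical, and nothing here claims that Tokenizer is implemented with regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not how the sjm.parse.tokens Tokenizer is written.
class PlainNumberSketch {
    // Digits with an optional fraction, or a leading decimal point followed by digits.
    private static final Pattern PLAIN_NUMBER = Pattern.compile("\\d+(\\.\\d+)?|\\.\\d+");

    /** True if the whole string is a number with no exponent or other notation. */
    static boolean isPlainNumber(String s) {
        return PLAIN_NUMBER.matcher(s).matches();
    }

    /** Splits text into its leading plain number and the leftover; the number is "" if none. */
    static String[] splitNumericPrefix(String s) {
        Matcher m = PLAIN_NUMBER.matcher(s);
        int end = m.lookingAt() ? m.end() : 0;
        return new String[] { s.substring(0, end), s.substring(end) };
    }

    public static void main(String[] args) {
        for (String s : "12 12.34 .1234 1234e-2".split(" ")) {
            String[] parts = splitNumericPrefix(s);
            System.out.println(s + " -> number \"" + parts[0] + "\", leftover \"" + parts[1] + "\"");
        }
    }
}
```

Under this rule, 1234e-2 is not a number as a whole; its numeric prefix is 1234 and the leftover is e-2, which mirrors the two tokens in the output above.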

2.5.4 Literals

A literal is a specific string. Consider the following declaration:

 "int iq = 177;"

In this declaration, the word "int" must be a specific, literal value, whereas the variable name that follows it can be any word. To create a Literal parser, specify in a constructor the string the parser needs to match, as in

 Literal intType = new Literal("int");

2.5.5 Caseless Literals

Sometimes you want to let the people using your language enter specific values without worrying about capitalization. For example, in a coffee markup language, you might establish a roast parameter that looks for strings such as

 <roast>French</roast>

In building the parser to recognize this parameter, you might include a literal value for the roast parameter using the following object:

 new Literal("roast")

It would be more flexible to allow your language user to type "Roast" or "ROAST" in addition to the all-lowercase "roast". To achieve this, replace the normal literal with a caseless literal:

 new CaselessLiteral("roast")
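The difference between the two kinds of literal comes down to how the expected string is compared with the incoming token. The sketch below illustrates that comparison only; the class LiteralSketch and its methods are hypothetical and are not the Literal and CaselessLiteral classes themselves.

```java
// Hypothetical illustration of the comparison behind each kind of literal.
class LiteralSketch {
    /** A normal literal accepts only the exact string. */
    static boolean matchesLiteral(String expected, String token) {
        return expected.equals(token);
    }

    /** A caseless literal accepts the string in any capitalization. */
    static boolean matchesCaselessLiteral(String expected, String token) {
        return expected.equalsIgnoreCase(token);
    }

    public static void main(String[] args) {
        for (String token : new String[] { "roast", "Roast", "ROAST" }) {
            System.out.println(token
                + ": literal=" + matchesLiteral("roast", token)
                + ", caseless=" + matchesCaselessLiteral("roast", token));
        }
    }
}
```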

2.5.6 Symbols

A symbol is generally a character that stands alone as its own token. For example, semicolons, equal signs, and parentheses are all characters that a typical tokenizer treats as symbols. In particular, both StreamTokenizer in java.io and Tokenizer in sjm.parse.tokens treat such characters as symbol tokens. The default instance of Tokenizer treats the following characters as symbols:

 ! # $ % & ( ) * + , : ; < = > ? @ ` [ \ ] ^ _ {  } ~

In addition, the default instance of Tokenizer treats the following multicharacter sequences as symbols:

 !=    <=    >=    :-

These symbols commonly represent "not equal," "less than or equal," "greater than or equal," and "if." The "if" symbol is common in logic languages (see Chapter 13, "Logic Programming"). The Tokenizer class gives you complete control over which characters and multicharacter sequences a Tokenizer object returns as symbols.
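One way to picture how single-character and multicharacter symbols coexist is a scanner that tries the longer sequences first. The following is a sketch under that assumption; the class SymbolSketch is hypothetical, its symbol tables are copied from the two lists above, and it does not reproduce how Tokenizer is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the symbol tables are copied from the lists in the text.
class SymbolSketch {
    private static final List<String> MULTI = Arrays.asList("!=", "<=", ">=", ":-");
    private static final String SINGLE = "!#$%&()*+,:;<=>?@`[\\]^_{}~";

    /** Returns the symbol starting at position i, preferring a multicharacter one, or null. */
    static String symbolAt(String text, int i) {
        for (String m : MULTI) {
            if (text.startsWith(m, i)) {
                return m; // the longer sequence wins over its first character
            }
        }
        if (i < text.length() && SINGLE.indexOf(text.charAt(i)) >= 0) {
            return String.valueOf(text.charAt(i));
        }
        return null;
    }

    /** Collects every symbol in the text, skipping characters that are not symbols. */
    static List<String> symbols(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String s = symbolAt(text, i);
            if (s == null) {
                i++;
            } else {
                found.add(s);
                i += s.length();
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(symbols("a != b; c <= d"));
        System.out.println(symbols("x :- y, z"));
    }
}
```

Trying the multicharacter table first is what keeps != from being split into ! followed by =.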

2.5.7 Quoted Strings

You may want to allow users of your language to enter quoted strings in some contexts, for example, when you want to allow a string value to contain a blank. The following program accepts a secret identity as a quoted string:

    package sjm.examples.introduction;

    import sjm.parse.*;
    import sjm.parse.tokens.*;

    /**
     * Show how to recognize a quoted string.
     */
    public class ShowQuotedString {
        public static void main(String[] args) {
            Parser p = new QuotedString();
            String id = "\"Clark Kent\"";
            System.out.println(p.bestMatch(new TokenAssembly(id)));
        }
    }

This program creates and applies a parser that recognizes a quoted string. Running this program prints

 ["Clark Kent"]"Clark Kent"^

The output shows that the parser matches the entire string, moving the ^ index past the token and stacking the token. Note that "Clark Kent" is a single token even though it contains a blank. Also note that the token contains the quote symbols themselves.
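The two observations above, that the blank stays inside the token and that the quote symbols are kept, can be illustrated with a short scanning routine. This is a sketch only: QuotedStringSketch and quotedTokenAt() are hypothetical names, and returning null for an unterminated quote is an assumption of the sketch, not a statement about how Tokenizer behaves.

```java
// Hypothetical illustration of scanning one quoted-string token.
class QuotedStringSketch {
    /** Returns the quoted token starting at start, quotes included, or null if none. */
    static String quotedTokenAt(String text, int start) {
        if (start >= text.length() || text.charAt(start) != '"') {
            return null; // no opening quote here
        }
        int close = text.indexOf('"', start + 1);
        if (close < 0) {
            return null; // assumption: an unterminated quote is not a token
        }
        return text.substring(start, close + 1); // keep both quote symbols
    }

    public static void main(String[] args) {
        String id = "\"Clark Kent\"";
        System.out.println(quotedTokenAt(id, 0));
    }
}
```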
