11.4 New Parser Features

	Building Parsers with Java By Steven John Metsker
	Table of Contents

	Chapter 11. Extending the Parser Toolkit

Content

With the classes Sequence , Alternation , and Repetition , along with a hierarchy of Terminal classes, you can build an infinite variety of parsers. These classes form a complete toolkit, and so you could argue that they need never be extended. History has shown, however, that no toolkit is ever complete in the sense that it cannot profitably be extended. Toolkit developers should strive both for comprehensiveness and extensibility.

Software that is extensible has the following features:

Classes have short methods that perform a single service.
Methods and instance variables are generally protected rather than private , allowing subclasses more flexibility in how they extend a class.
Documentation is accurate and complete at the class level and at the design and architectural levels.

The software in the sjm packages that come with this book generally meets these criteria. An exception is that classes that begin with "Show" are not designed for reuse; they just implement examples. All the classes in sjm.parse , sjm.parse.tokens , and sjm.parse.chars are designed with the goal of being extensible. Although these classes provide a complete kit for composing an unbounded assortment of new parsers, you can easily add to or extend these tools.

There are many reasons that you might want to extend the sjm.parse toolkit. For example, you might want to create subclasses that execute more quickly, parsers that have associated visual components , or parsers that provide helpful information when they fail to match. In each of these examples, the new behaviors follow a different line of thought than merely matching input text. Although the toolkit is complete in meeting its original objectives, it is wide open to enhancements in meeting new objectives.

11.4.1 An Error-Handling Parser

This section gives an example of how to extend the basic toolset with a new type of parser that helps explain why a match failed. The context is that you have written a parser, and your user is attempting to enter a language element that your parser will match. For example, imagine that you have written a parser that recognizes lists according to this grammar:

 list       = '(' contents ')';  contents   = Empty  actualList; actualList = Word (',' Word)*;

Suppose your user enters

 (burgle, filch, pilfer, pinch,, purloin, steal, thieve)

Your parser fails to match this input because there is a double comma after pinch . The tools in sjm.parse note that a list parser cannot match the input, but they say nothing about where the match failed.

To create a parser that is more helpful when matches fail, you can create a subclass of Sequence that throws an exception at the moment of failure. For example, an object of this class could throw an exception upon seeing a second comma after pinch .

A subtlety here is that it is usually acceptable for a Sequence object to fail entirely, but not after the sequence begins. For example, consider the contents rule in the preceding grammar. This rule allows a list to be either empty or an actual list. When an input list is empty, the actualList sequence fails entirely, and that is what you want. On the other hand, it is usually not acceptable for a Sequence object to begin and not complete. For example, the list sequence begins matching input that starts with an open parenthesis. If the input does not contain a corresponding closing parenthesis, then the input is not a member of the grammar's language, and we can say the input is in error.

There are two sequences in the list grammar that, once started, must complete for the input to be valid. The first sequence is list itself, which must complete after seeing an opening parenthesis. The second place is the sequence ( ',' Word ) that appears in actualList . After seeing a comma for this sequence, the input must contain a following word. For these two sequences, it is useful to have a Track class that throws an exception if a sequence begins but does not complete. Figure 11.2 shows this class.

Figure 11.2. The `Track` class. A `Track` object is like a `Sequence` object, except that a track throws an exception if its first subparser matches without all of its subparsers also matching.

graphics/11fig02.gif

The class Track overrides the match() method it inherits from Sequence . Recall the pseudocode for Sequence.match() :

 out = in;  for each subparser p[i] {     out = p[i].match(out); } return out;

A Track object notes whether its first subparser matches the input state vector. If it does, all the remaining subparsers must match for the Track object to succeed. If any subparser after the first one fails to match, the Track object throws an exception. Here is the pseudocode for Track.match() :

 inTrack = false;  out = in; for each subparser p[i]{     out = p[i].match(out);     if (out isEmpty && inTrack)         throw exception;     inTrack = true; } return out;

The code for Track.match() follows almost directly from the design:

 public Vector match(Vector in) {      boolean inTrack = false;     Vector last = in;     Vector out = in;     Enumeration e = subparsers.elements();     while (e.hasMoreElements()) {         Parser p = (Parser) e.nextElement();         out = p.matchAndAssemble(last);         if (out.isEmpty()) {             if (inTrack) {                 throwTrackException(last, p);             }             return out;         }         inTrack = true;         last = out;     }     return out; }

If the match() method does not throw an exception, it must return a complete state. This includes a responsibility to apply a subparser's assemblers to the results of a match, and that is why this method calls matchAndAssemble() for each subparser.

If any subparser after the first one fails to match, the match() method throws an exception. The method keeps a copy of the last successful match state to help create an informative message if the track fails. Here is the code for throwTrackException() :

 protected void throwTrackException(      Vector previousState, Parser p) {     Assembly best = best(previousState);     String after = best.consumed(" ");     if (after.equals("")) {         after = "-nothing-";     }     String expected = p.toString();     Object next = best.peek();     String found =         (next == null) ? "-nothing-" : next.toString();     throw new TrackException(after, expected, found); }

This code prepares an error message that explains where the parser was, what it was looking for, and what it found instead. To explain where the parser was, this method picks the most advanced (the "best") assembly in the last successful matching set. For example, a Track object might be matching a list of synonyms for steal, and the input string might be "(pilfer, pinch,, purloin)" . In throwTrackException() , the best assembly would be "(pilfer, pinch , " . The parser this method receives is the parser that was trying to match the last set. In this case, the parser might be a Word parser. An error message can say that you "expected" a Word . To show what the parser found instead of what it sought, the throwTrackException() method peeks at the next element in the best assembly. In this example, the next element is an extra comma. The TrackException object uses its after , expected , and found strings to compose a message string such as

 After   : ( pilfer , pinch ,  Expected: Word Found   : ,

The TrackException class includes newline ( \n ) characters in its message, something that is unusual for an exception. Figure 11.3 shows the TrackException class.

Figure 11.3. The `TrackException` class. A `TrackException` object contains the reasons that a track started but did not complete.

graphics/11fig03.gif

The constructor for TrackException requires three input strings: after , expected , and found . The after string is an indication of which text was parsed. The expected string is an indication of the kind of text that was expected, such as a term . The found string is the textual element the thrower actually found. The get- methods return the values set in the constructor.

The TrackException class, a subclass of RuntimeException , allows methods that call throwTrackException() not to declare that they may throw this exception. This is important because Track.match() calls throwTrackException() , and thus it may throw a TrackException . If TrackException were not a RuntimeException , Track.match() would have a different signature than does the method in its superclass, Sequence.match() . Java does not allow this. In general, a method that overrides a superclass method cannot introduce a new exception unless the new exception is a subclass of RuntimeException .

11.4.2 Tracks in Action

The following sample class has a static method that returns a list parser. The list grammar is:

 list       = '(' contents ')';  contents   = Empty  actualList; actualList = Word (',' Word)*;

The sample code uses two Track objects: one to parse the sequence ( ',' Word ) in actualList , and a second one to parse the list sequence. The example applies the list parser to attempt to parse a mixture of valid and invalid inputs.

 package sjm.examples.track;  import sjm.parse.*; import sjm.parse.tokens.*; /**  * Show some examples of using a <code>Track</code>.  */ public class ShowTrack {  public static Parser list() {     Parser empty, commaWord, actualList, contents, list;     empty = new Empty();     commaWord = new Track()         .add(new Symbol(',').discard())         .add(new Word());     actualList = new Sequence()         .add(new Word())         .add(new Repetition(commaWord));     contents = new Alternation()         .add(empty)         .add(actualList);     list = new Track()         .add(new Symbol('(').discard())         .add(contents)         .add(new Symbol(')').discard());     return list; } public static void main(String args[]) {     Parser list = list();     String test[] = new String[] {         "()",         "(pilfer)",         "(pilfer, pinch)",         "(pilfer, pinch, purloin)",         "(pilfer, pinch,, purloin)",         "(",         "(pilfer",         "(pilfer, ",         "(, pinch, purloin)",         "pilfer, pinch"};     System.out.println("Using parser: " + list);     for (int i = 0; i < test.length; i++) {         System.out.println("---\ntesting: " + test[i]);         TokenAssembly a = new TokenAssembly(test[i]);         try {             Assembly out = list.completeMatch(a);             if (out == null) {                 System.out.println(                     "list.completeMatch() returns null");             }             else {                 Object s = list.completeMatch(a).getStack();                 System.out.println("Ok, stack is: " + s);             }         }catch (TrackException e) {             System.out.println(e.getMessage());         }     } } }

The main() method uses list to parse an assortment of input strings. If list can consume the entire input, the main() method prints "Ok..." . If parsing the input causes a Track object to throw a TrackException , main() catches it and prints the exception's message.

Running this class initially prints a string representation of list :

 Using parser: <(< empty <Word<,Word>*>>)>

Lists that are empty, or that contain one, two, or three words are all acceptable:

 ---  testing: () Ok, stack is: [] --- testing: (pilfer) Ok, stack is: [pilfer] --- testing: (pilfer, pinch) Ok, stack is: [pilfer, pinch] --- testing: (pilfer, pinch, purloin) Ok, stack is: [pilfer, pinch, purloin]

After a comma, list expects a Word and not another comma:

 ---  testing: (pilfer, pinch,, purloin) After   : ( pilfer , pinch , Expected: Word Found   : ,

If the input starts with an opening parenthesis but ends before the list track finds contents and a closing parenthesis, list throws an exception saying what it expected and what it found:

 ---  testing: ( After   : ( Expected: ) Found   : -nothing- --- testing: (pilfer After   : ( pilfer Expected: ) Found   : -nothing- --- testing: (pilfer, After   : ( pilfer , Expected: Word Found   : -nothing-

After an opening parenthesis, a list expects contents, which may be empty. If actual contents do not follow the parenthesis, the list must be empty, which is acceptable, but then list expects a closing parenthesis:

 ---  testing: (, pinch, purloin) After   : ( Expected: ) Found   : ,

Without an opening parenthesis, the match fails entirely, the list track never begins, and it does not throw an exception. The result is that list.completeMatch() returns null :

 ---  testing: pilfer, pinch list.completeMatch() returns null

The Track class helps a language user determine what is wrong with an input string a parser cannot recognize. As a language developer, you'll find that it is easy to overlook the importance of helping your user stalk input errors. In practice, new language users spend most of their energy trying to enter text that a parser will accept. The parsers for the larger languages in this book (Jaql, Logikus, and Sling) use Track objects to help users produce proper programs.

Top

11.4.1 An Error-Handling Parser

Figure 11.2. The Track class. A Track object is like a Sequence object, except that a track throws an exception if its first subparser matches without all of its subparsers also matching.

Figure 11.3. The TrackException class. A TrackException object contains the reasons that a track started but did not complete.

11.4.2 Tracks in Action

Figure 11.2. The `Track` class. A `Track` object is like a `Sequence` object, except that a track throws an exception if its first subparser matches without all of its subparsers also matching.

Figure 11.3. The `TrackException` class. A `TrackException` object contains the reasons that a track started but did not complete.