Recipe 10.4 Scanning a File with StreamTokenizer

Problem

You need to scan a file with more fine-grained resolution than the readLine( ) method of the BufferedReader class and its subclasses (discussed in Recipe 10.14).

Solution

Use a StreamTokenizer; readLine( ) together with a StringTokenizer; regular expressions (Chapter 4); or one of several scanner-generating tools, such as ANTLR or JavaCC. On JDK 1.5, use the Scanner class (see Recipe 10.5).

Discussion

While you could, in theory, read a file one character at a time and analyze each character, that is a pretty low-level approach. The read( ) method in the Reader class is defined to return int so that it can use the time-honored value -1 (defined as EOF in Unix <stdio.h> for years) to indicate that you have read to the end of the file:

void doFile(Reader is) throws IOException {
    int c;
    // read( ) returns -1 at end of file, so c must be an int
    while ((c = is.read( )) != -1) {
        System.out.print((char)c);
    }
}

The cast to char is interesting. The program compiles fine without it, but does not print correctly because c is declared as int (which it must be, to be able to compare against the end-of-file value -1). For example, the integer value corresponding to capital A treated as an int prints as 65, while (char) prints the character A.
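The effect of the cast can be seen in a small sketch. The class and method names here (CharCastDemo, copyChars, copyInts) are made up for illustration, and the Reader is wrapped around an in-memory string rather than a file. Appending the int directly produces the decimal character codes, while casting to char first produces the characters themselves.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CharCastDemo {
    /** Copy the input one character at a time, casting each int back to char. */
    static String copyChars(Reader is) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;  // int, so that -1 (end of file) can be told apart from real characters
            while ((c = is.read()) != -1) {
                sb.append((char) c);   // appends the character, e.g. A
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Same loop without the cast: each value is appended as a decimal number. */
    static String copyInts(Reader is) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = is.read()) != -1) {
                sb.append(c);          // appends the int, e.g. 65
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(copyChars(new StringReader("AB")));  // prints AB
        System.out.println(copyInts(new StringReader("AB")));   // prints 6566
    }
}
```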

We discussed the StringTokenizer class extensively in Recipe 3.2. The combination of readLine( ) and StringTokenizer provides a simple means of scanning a file. Suppose you need to read a file in which each line consists of a name like user@host.domain, and you want to split the lines into users and host addresses. You could use this:

// ScanStringTok.java
protected void process(LineNumberReader is) {
    String s = null;
    try {
        while ((s = is.readLine( )) != null) {
            StringTokenizer st = new StringTokenizer(s, "@", true);
            String user = (String)st.nextElement( );
            st.nextElement( );
            String host = (String)st.nextElement( );
            System.out.println("User name: " + user +
                "; host part: " + host);

            // Presumably you would now do something
            // with the user and host parts...
        }
    } catch (NoSuchElementException ix) {
        System.err.println("Line " + is.getLineNumber( ) +
            ": Invalid input " + s);
    } catch (IOException e) {
        System.err.println(e);
    }
}

The StreamTokenizer class in java.io provides slightly more capabilities for scanning a file. It reads characters and assembles them into words, or tokens. It returns these tokens to you along with a type code describing the kind of token it found. This type code is one of four predefined types (StreamTokenizer.TT_WORD, TT_NUMBER, TT_EOF, or TT_EOL for the end-of-line), or the char value of an ordinary character (such as 40 for a left parenthesis). Methods such as ordinaryChar( ) allow you to specify how to categorize characters, while others such as slashSlashComments( ) allow you to enable or disable features.
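To see the type codes in action, here is a minimal sketch that tokenizes a short string and labels each token. The class name TokenTypesDemo, the describe( ) helper, and the sample input are assumptions chosen for illustration; only the StreamTokenizer calls themselves come from java.io.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenTypesDemo {
    /** Tokenize the input and return one descriptive string per token. */
    static List<String> describe(String input) {
        StreamTokenizer tf = new StreamTokenizer(new StringReader(input));
        tf.eolIsSignificant(true);  // report newlines as TT_EOL instead of skipping them
        List<String> result = new ArrayList<>();
        try {
            int type;
            while ((type = tf.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                case StreamTokenizer.TT_WORD:
                    result.add("WORD " + tf.sval);     // word text is in sval
                    break;
                case StreamTokenizer.TT_NUMBER:
                    result.add("NUMBER " + tf.nval);   // numeric value is in nval (a double)
                    break;
                case StreamTokenizer.TT_EOL:
                    result.add("EOL");
                    break;
                default:
                    // An ordinary character: the type code is the char value itself
                    result.add("CHAR " + (char) type);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : describe("width = 42 + x\n")) {
            System.out.println(token);
        }
    }
}
```

With the default settings, spaces are treated as whitespace and skipped, so they never show up as tokens; the = and + characters come back as ordinary characters.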

Example 10-3 shows a StreamTokenizer used to implement a simple immediate-mode stack-based calculator:

2 2 + =
4.0
22 7 / =
3.142857142857143

I read tokens as they arrive from the StreamTokenizer. Numbers are put on the stack. The four operators (+, -, *, and /) are immediately performed on the two elements at the top of the stack, and the result is put back on the top of the stack. The = operator causes the top element to be printed, but it is left on the stack so that you can say:

4 5 * = 2 / =
20.0
10.0

Example 10-3. Simple calculator using StreamTokenizer

import java.io.*;
import java.util.Stack;

/**
 * SimpleCalc -- simple calculator to show StreamTokenizer
 *
 * @author    Ian Darwin, http://www.darwinsys.com/
 * @version    $Id: ch10.xml,v 1.5 2004/05/04 20:12:12 ian Exp $
 */
public class SimpleCalcStreamTok {
    /** The StreamTokenizer Input */
    protected  StreamTokenizer tf;
    /** The Output File */
    protected PrintWriter out = new PrintWriter(System.out, true);
    /** The variable name (not used in this version) */
    protected String variable;
    /** The operand stack */
    protected Stack s;

    /* Driver - main program */
    public static void main(String[] av) throws IOException {
        if (av.length == 0)
            new SimpleCalcStreamTok(
                new InputStreamReader(System.in)).doCalc( );
        else
            for (int i=0; i<av.length; i++)
                new SimpleCalcStreamTok(av[i]).doCalc( );
    }

    /** Construct by filename */
    public SimpleCalcStreamTok(String fileName) throws IOException {
        this(new FileReader(fileName));
    }

    /** Construct from an existing Reader */
    public SimpleCalcStreamTok(Reader rdr) throws IOException {
        tf = new StreamTokenizer(rdr);
        // Control the input character set:
        tf.slashSlashComments(true);    // treat "//" as comments
        tf.ordinaryChar('-');        // used for subtraction
        tf.ordinaryChar('/');    // used for division

        s = new Stack( );
    }

    /** Construct from a Reader and a PrintWriter
     */
    public SimpleCalcStreamTok(Reader in, PrintWriter out) throws IOException {
        this(in);
        setOutput(out);
    }

    /**
     * Change the output destination.
     */
    public void setOutput(PrintWriter out) {
        this.out = out;
    }

    protected void doCalc( ) throws IOException {
        int iType;
        double tmp;

        while ((iType = tf.nextToken( )) != StreamTokenizer.TT_EOF) {
            switch(iType) {
            case StreamTokenizer.TT_NUMBER: // Found a number, push value to stack
                push(tf.nval);
                break;
            case StreamTokenizer.TT_WORD:
                // Found a variable, save its name. Not used here.
                variable = tf.sval;
                break;
            case '+':
                // + operator is commutative.
                push(pop( ) + pop( ));
                break;
            case '-':
                // - operator: order matters.
                tmp = pop( );
                push(pop( ) - tmp);
                break;
            case '*':
                // Multiply is commutative
                push(pop( ) * pop( ));
                break;
            case '/':
                // Handle division carefully: order matters!
                tmp = pop( );
                push(pop( ) / tmp);
                break;
            case '=':
                out.println(peek( ));
                break;
            default:
                out.println("What's this? iType = " + iType);
            }
        }
    }

    void push(double val) {
        s.push(new Double(val));
    }

    double pop( ) {
        return ((Double)s.pop( )).doubleValue( );
    }

    double peek( ) {
        return ((Double)s.peek( )).doubleValue( );
    }

    void clearStack( ) {
        s.removeAllElements( );
    }
}

While StreamTokenizer is useful, it knows only a limited number of tokens and has no way of specifying that the tokens must appear in a particular order. To do more advanced scanning, you need some special-purpose scanning tools. Such tools have been used for a long time in the Unix realm. The best-known examples are yacc and lex (discussed in the O'Reilly text lex & yacc). These tools let you specify the lexical structure of your input using regular expressions (see Chapter 4). For example, you might say that an email address consists of a series of alphanumerics, followed by an at sign (@), followed by a series of alphanumerics with periods embedded, as:

name:    [A-Za-z0-9]+@[A-Za-z0-9.]+

The tool then writes code that recognizes the characters you have described. These tools also have a grammatical specification, which says, for example, that the keyword ADDRESS must appear, followed by a colon, followed by a "name" token, as previously defined.
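A scanner generator produces the recognizer for you, but the same idea can be tried by hand with java.util.regex. The sketch below is not what any of these tools generate; it only illustrates a lexical rule (the name token) combined with a small grammatical rule (the keyword ADDRESS, a colon, then a name). It assumes the name pattern is meant to be [A-Za-z0-9]+@[A-Za-z0-9.]+, and the class name NameTokenDemo, the helper methods, and the tolerance for spaces around the colon are all illustrative choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameTokenDemo {
    /** Lexical rule: alphanumerics, an at sign, then alphanumerics with embedded periods. */
    static final Pattern NAME = Pattern.compile("[A-Za-z0-9]+@[A-Za-z0-9.]+");

    /** Grammatical rule (hand-written): ADDRESS, a colon, then a name token. */
    static final Pattern ADDRESS =
        Pattern.compile("ADDRESS\\s*:\\s*(" + NAME.pattern() + ")");

    /** True if the whole string is a single name token. */
    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    /** Return the name from an ADDRESS line, or null if the line does not fit the rule. */
    static String parseAddress(String line) {
        Matcher m = ADDRESS.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(isName("user@host.domain"));                 // prints true
        System.out.println(isName("no-at-sign"));                       // prints false
        System.out.println(parseAddress("ADDRESS: user@host.domain"));  // prints user@host.domain
    }
}
```

A hand-written check like this stops scaling once the grammar has more than a few rules, which is where the generator tools earn their keep.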

Two widely used scanning tools for Java are ANTLR and JavaCC. Terence Parr is the author and maintainer of ANTLR, which can be downloaded from http://www.antlr.org/. JavaCC is an open source project on java.net (https://javacc.dev.java.net/). These "compiler generators" can be used to write grammars for a wide variety of programs, from simple calculators such as the one earlier in this recipe through HTML and CORBA/IDL, up to full Java and C/C++ compilers. Examples of these are included with the downloads. Unfortunately, the learning curve for parsers in general precludes providing a simple and comprehensive example here. Please refer to the documentation and the numerous examples provided with the distributions.

Java offers simple line-at-a-time scanners using StringTokenizer, fancier token-based scanners using StreamTokenizer, and grammar-based scanners based on JavaCC and similar tools. In addition to these, JDK 1.5 provides an easier way to scan simple tokens (see Recipe 10.5).