Recipe 10.4 Scanning a File with StreamTokenizer


Problem

You need to scan a file with more fine-grained resolution than the readLine( ) method of the BufferedReader class and its subclasses (discussed in Recipe 10.14).

Solution

Use a StreamTokenizer; readLine( ) together with a StringTokenizer; regular expressions (Chapter 4); or one of several scanner-generating tools, such as ANTLR or JavaCC. On JDK 1.5, use the Scanner class (see Recipe 10.5).

Discussion

While you could, in theory, read a file one character at a time and analyze each character, that is a pretty low-level approach. The read( ) method in the Reader class is defined to return int so that it can use the time-honored value -1 (defined as EOF in Unix <stdio.h> for years) to indicate that you have read to the end of the file:

void doFile(Reader is) throws IOException {
    int c;
    // read( ) returns -1 at end of file
    while ((c = is.read( )) != -1) {
        System.out.print((char)c);
    }
}

The cast to char is interesting. The program compiles fine without it, but does not print correctly because c is declared as int (which it must be, to be able to compare against the end-of-file value -1). For example, the value corresponding to capital A prints as 65 when treated as an int, while with the (char) cast it prints as the character A.
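The difference the cast makes can be seen in a minimal sketch like the following. The class name CharCastDemo and its two helper methods are hypothetical and exist only for illustration; they collect what would be printed so the two loops can be compared side by side.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical demo class; not part of the original recipe. */
class CharCastDemo {

    /** Reads to end of input, appending each value as an int (no cast). */
    static String readAsInts(Reader is) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = is.read()) != -1) {
            sb.append(c).append(' ');   // appends the decimal code, e.g. 65
        }
        return sb.toString().trim();
    }

    /** Same loop, but casts each value back to char before appending. */
    static String readAsChars(Reader is) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = is.read()) != -1) {
            sb.append((char) c);        // appends the character, e.g. A
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAsInts(new StringReader("AB")));   // 65 66
        System.out.println(readAsChars(new StringReader("AB")));  // AB
    }
}
```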

We discussed the StringTokenizer class extensively in Recipe 3.2. The combination of readLine( ) and StringTokenizer provides a simple means of scanning a file. Suppose you need to read a file in which each line consists of a name like user@host.domain, and you want to split the lines into users and host addresses. You could use this:

// ScanStringTok.java
protected void process(LineNumberReader is) {
    String s = null;
    try {
        while ((s = is.readLine( )) != null) {
            StringTokenizer st = new StringTokenizer(s, "@", true);
            String user = (String)st.nextElement( );
            st.nextElement( );
            String host = (String)st.nextElement( );
            System.out.println("User name: " + user +
                "; host part: " + host);
            // Presumably you would now do something
            // with the user and host parts...
        }
    } catch (NoSuchElementException ix) {
        System.err.println("Line " + is.getLineNumber( ) +
            ": Invalid input " + s);
    } catch (IOException e) {
        System.err.println(e);
    }
}

The StreamTokenizer class in java.io provides slightly more capabilities for scanning a file. It reads characters and assembles them into words, or tokens. It returns these tokens to you along with a type code describing the kind of token it found. This type code is one of four predefined types (StreamTokenizer.TT_WORD, TT_NUMBER, TT_EOF, or TT_EOL for the end-of-line), or the char value of an ordinary character (such as 32 for the space character). Methods such as ordinaryChar( ) allow you to specify how to categorize characters, while others such as slashSlashComments( ) allow you to enable or disable features.
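To make the type codes concrete, here is a small sketch that labels each token as it arrives. The class TokenTypeDemo and its describe( ) method are hypothetical names used only for this illustration; the StreamTokenizer calls themselves (eolIsSignificant, ordinaryChar, nextToken, and the sval and nval fields) are the standard java.io API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class; not part of the original recipe. */
class TokenTypeDemo {

    /** Returns one label per token read from the given Reader. */
    static List<String> describe(Reader rdr) throws IOException {
        StreamTokenizer tf = new StreamTokenizer(rdr);
        tf.eolIsSignificant(true);  // report TT_EOL rather than skipping newlines
        tf.ordinaryChar('-');       // treat '-' as its own token, not part of a number

        List<String> out = new ArrayList<>();
        int type;
        while ((type = tf.nextToken()) != StreamTokenizer.TT_EOF) {
            switch (type) {
            case StreamTokenizer.TT_WORD:
                out.add("WORD " + tf.sval);     // word text is in sval
                break;
            case StreamTokenizer.TT_NUMBER:
                out.add("NUMBER " + tf.nval);   // numeric value is in nval (a double)
                break;
            case StreamTokenizer.TT_EOL:
                out.add("EOL");
                break;
            default:
                // Any other type code is the char value of an ordinary character
                out.add("CHAR " + (char) type);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String label : describe(new StringReader("total 42 + x\n"))) {
            System.out.println(label);
        }
    }
}
```

Note that nval is always a double, which is why the number in the sample input comes back as 42.0.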

Example 10-3 shows a StreamTokenizer used to implement a simple immediate-mode stack-based calculator:

2 2 + =
4.0
22 7 / =
3.142857142857143

I read tokens as they arrive from the StreamTokenizer. Numbers are put on the stack. The four operators (+, -, *, and /) are immediately performed on the two elements at the top of the stack, and the result is put back on the top of the stack. The = operator prints the top element but leaves it on the stack, so that you can say:

4 5 * = 2 / =
20.0
10.0

Example 10-3. Simple calculator using StreamTokenizer
import java.io.*;
import java.util.Stack;

/**
 * SimpleCalc -- simple calculator to show StreamTokenizer
 *
 * @author    Ian Darwin, http://www.darwinsys.com/
 * @version    $Id: ch10.xml,v 1.5 2004/05/04 20:12:12 ian Exp $
 */
public class SimpleCalcStreamTok {
    /** The StreamTokenizer Input */
    protected  StreamTokenizer tf;
    /** The Output File */
    protected PrintWriter out = new PrintWriter(System.out, true);
    /** The variable name (not used in this version) */
    protected String variable;
    /** The operand stack */
    protected Stack s;

    /* Driver - main program */
    public static void main(String[] av) throws IOException {
        if (av.length == 0)
            new SimpleCalcStreamTok(
                new InputStreamReader(System.in)).doCalc( );
        else
            for (int i=0; i<av.length; i++)
                new SimpleCalcStreamTok(av[i]).doCalc( );
    }

    /** Construct by filename */
    public SimpleCalcStreamTok(String fileName) throws IOException {
        this(new FileReader(fileName));
    }

    /** Construct from an existing Reader */
    public SimpleCalcStreamTok(Reader rdr) throws IOException {
        tf = new StreamTokenizer(rdr);
        // Control the input character set:
        tf.slashSlashComments(true);    // treat "//" as comments
        tf.ordinaryChar('-');        // used for subtraction
        tf.ordinaryChar('/');    // used for division
        s = new Stack( );
    }

    /** Construct from a Reader and a PrintWriter
     */
    public SimpleCalcStreamTok(Reader in, PrintWriter out) throws IOException {
        this(in);
        setOutput(out);
    }

    /**
     * Change the output destination.
     */
    public void setOutput(PrintWriter out) {
        this.out = out;
    }

    protected void doCalc( ) throws IOException {
        int iType;
        double tmp;

        while ((iType = tf.nextToken( )) != StreamTokenizer.TT_EOF) {
            switch(iType) {
            case StreamTokenizer.TT_NUMBER: // Found a number, push value to stack
                push(tf.nval);
                break;
            case StreamTokenizer.TT_WORD:
                // Found a variable, save its name. Not used here.
                variable = tf.sval;
                break;
            case '+':
                // + operator is commutative.
                push(pop( ) + pop( ));
                break;
            case '-':
                // - operator: order matters.
                tmp = pop( );
                push(pop( ) - tmp);
                break;
            case '*':
                // Multiply is commutative
                push(pop( ) * pop( ));
                break;
            case '/':
                // Handle division carefully: order matters!
                tmp = pop( );
                push(pop( ) / tmp);
                break;
            case '=':
                out.println(peek( ));
                break;
            default:
                out.println("What's this? iType = " + iType);
            }
        }
    }

    void push(double val) {
        s.push(new Double(val));
    }

    double pop( ) {
        return ((Double)s.pop( )).doubleValue( );
    }

    double peek( ) {
        return ((Double)s.peek( )).doubleValue( );
    }

    void clearStack( ) {
        s.removeAllElements( );
    }
}

While StreamTokenizer is useful, it knows only a limited number of tokens and has no way of specifying that the tokens must appear in a particular order. To do more advanced scanning, you need some special-purpose scanning tools. Such tools have been used for a long time in the Unix realm. The best-known examples are yacc and lex (discussed in the O'Reilly text lex & yacc). These tools let you specify the lexical structure of your input using regular expressions (see Chapter 4). For example, you might say that an email address consists of a series of alphanumerics, followed by an at sign (@), followed by a series of alphanumerics with periods embedded, as:

name:    [A-Za-z0-9]+@[A-Za-z0-9.]+

The tool then writes code that recognizes the characters you have described. These tools also have a grammatical specification, which says, for example, that the keyword ADDRESS must appear, followed by a colon, followed by a "name" token, as previously defined.
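Before reaching for a scanner generator, the lexical rule for a "name" token can be tried out with java.util.regex (see Chapter 4). The sketch below is only an illustration: the class NameTokenDemo and its split( ) method are hypothetical, and the pattern is an assumed rendering of the rule described above (alphanumerics, an at sign, then alphanumerics with embedded periods), with capturing groups added to pull out the two parts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class; not part of the original recipe. */
class NameTokenDemo {

    // Assumed pattern: group 1 is the user part, group 2 the host part
    static final Pattern NAME =
        Pattern.compile("([A-Za-z0-9]+)@([A-Za-z0-9.]+)");

    /** Returns {user, host}, or null if the whole line is not a "name" token. */
    static String[] split(String line) {
        Matcher m = NAME.matcher(line);
        if (!m.matches()) {     // matches() requires the entire line to fit
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("user@host.domain");
        System.out.println("User name: " + parts[0] + "; host part: " + parts[1]);
        System.out.println(split("no-at-sign") == null);  // true: not a name token
    }
}
```

A regular expression covers only the lexical half of the job; it cannot express the grammatical rule that a keyword and colon must precede the name, which is where the generated parsers come in.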

Two widely used scanning tools for Java are ANTLR and JavaCC. Terence Parr is the author and maintainer of ANTLR, which can be downloaded from http://www.antlr.org/. JavaCC is an open source project on java.net (https://javacc.dev.java.net/). These "compiler generators" can be used to write grammars for a wide variety of programs, from simple calculators such as the one earlier in this recipe through HTML and CORBA/IDL, up to full Java and C/C++ compilers. Examples of these are included with the downloads. Unfortunately, the learning curve for parsers in general precludes providing a simple and comprehensive example here. Please refer to the documentation and the numerous examples provided with the distributions.

Java offers simple line-at-a-time scanners using StringTokenizer, fancier token-based scanners using StreamTokenizer, and grammar-based scanners based on JavaCC and similar tools. In addition to these, JDK 1.5 provides an easier way to scan simple tokens (see Recipe 10.5).
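As a brief taste of the JDK 1.5 approach, the sketch below uses java.util.Scanner to separate numbers from other tokens in calculator-style input. It is not the code from Recipe 10.5; the class ScannerSketch and its tokens( ) method are hypothetical names, and fixing the locale to Locale.US is an assumption made so that number parsing does not depend on the machine's settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

/** Hypothetical demo class; not the code from Recipe 10.5. */
class ScannerSketch {

    /** Labels each whitespace-separated token as a number or something else. */
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        // Assumption: fix the locale so "." is always the decimal separator
        Scanner sc = new Scanner(input).useLocale(Locale.US);
        while (sc.hasNext()) {
            if (sc.hasNextDouble()) {
                out.add("NUMBER " + sc.nextDouble());
            } else {
                out.add("OTHER " + sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String label : tokens("22 7 / =")) {
            System.out.println(label);
        }
    }
}
```

Unlike StreamTokenizer, Scanner splits on whitespace by default, so an operator must be separated from its operands by a space to come back as its own token.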



Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin
