Problem

You need to scan a file with more fine-grained resolution than the readLine() method of the BufferedReader class and its subclasses (discussed in Recipe 10.14).

Solution

Use a StreamTokenizer; readLine() and a StringTokenizer; regular expressions (Chapter 4); or one of several scanner-generating tools, such as ANTLR or JavaCC. On JDK 1.5, use the Scanner class (see Recipe 10.5).

Discussion

While you could, in theory, read a file one character at a time and analyze each character, that is a pretty low-level approach. The read() method in the Reader class is defined to return int so that it can use the time-honored value -1 (defined as EOF in Unix <stdio.h> for years) to indicate that you have read to the end of the file:

    void doFile(Reader is) throws IOException {
        int c;
        while ((c = is.read()) != -1) {
            System.out.print((char)c);
        }
    }

The cast to char is interesting. The program compiles fine without it, but it does not print correctly, because c is declared as int (which it must be, to be comparable against the end-of-file value -1). For example, the integer value corresponding to capital A prints as 65 when treated as an int, but (char)c prints as the character A.

We discussed the StringTokenizer class extensively in Recipe 3.2. The combination of readLine() and StringTokenizer provides a simple means of scanning a file. Suppose you need to read a file in which each line consists of a name like user@host.domain, and you want to split the lines into users and host addresses. You could use this:

    // ScanStringTok.java
    protected void process(LineNumberReader is) {
        String s = null;
        try {
            while ((s = is.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(s, "@", true);
                String user = st.nextToken();
                st.nextToken();        // skip the "@" delimiter itself
                String host = st.nextToken();
                System.out.println("User name: " + user +
                    "; host part: " + host);
                // Presumably you would now do something
                // with the user and host parts...
            }
        } catch (NoSuchElementException ix) {
            System.err.println("Line " + is.getLineNumber() +
                ": Invalid input " + s);
        } catch (IOException e) {
            System.err.println(e);
        }
    }

The StreamTokenizer class in java.io provides slightly more capability for scanning a file. It reads characters and assembles them into words, or tokens. It returns these tokens to you along with a type code describing the kind of token it found. This type code is one of four predefined types (StreamTokenizer.TT_WORD, TT_NUMBER, TT_EOF, or TT_EOL for the end-of-line), or the char value of an ordinary character (such as 32 for the space character). Methods such as ordinaryChar() allow you to specify how to categorize characters, while others such as slashSlashComments() allow you to enable or disable features.

Example 10-3 shows a StreamTokenizer used to implement a simple immediate-mode stack-based calculator:

    2 2 + =
    4.0
    22 7 / =
    3.142857142857143

I read tokens as they arrive from the StreamTokenizer. Numbers are put on the stack. The four operators (+, -, *, and /) are immediately performed on the two elements at the top of the stack, and the result is put back on the top of the stack. The = operator causes the top element to be printed, but it is left on the stack so that you can say:

    4 5 * =
    20.0
    2 / =
    10.0

Example 10-3.
Simple calculator using StreamTokenizer

    import java.io.*;
    import java.util.Stack;

    /**
     * SimpleCalc -- simple calculator to show StreamTokenizer
     *
     * @author Ian Darwin, http://www.darwinsys.com/
     * @version $Id: ch10.xml,v 1.5 2004/05/04 20:12:12 ian Exp $
     */
    public class SimpleCalcStreamTok {
        /** The StreamTokenizer Input */
        protected StreamTokenizer tf;
        /** The Output File */
        protected PrintWriter out = new PrintWriter(System.out, true);
        /** The variable name (not used in this version) */
        protected String variable;
        /** The operand stack */
        protected Stack s;

        /* Driver - main program */
        public static void main(String[] av) throws IOException {
            if (av.length == 0)
                new SimpleCalcStreamTok(
                    new InputStreamReader(System.in)).doCalc();
            else
                for (int i = 0; i < av.length; i++)
                    new SimpleCalcStreamTok(av[i]).doCalc();
        }

        /** Construct by filename */
        public SimpleCalcStreamTok(String fileName) throws IOException {
            this(new FileReader(fileName));
        }

        /** Construct from an existing Reader */
        public SimpleCalcStreamTok(Reader rdr) throws IOException {
            tf = new StreamTokenizer(rdr);
            // Control the input character set:
            tf.slashSlashComments(true); // treat "//" as comments
            tf.ordinaryChar('-');        // used for subtraction
            tf.ordinaryChar('/');        // used for division
            s = new Stack();
        }

        /** Construct from a Reader and a PrintWriter */
        public SimpleCalcStreamTok(Reader in, PrintWriter out)
                throws IOException {
            this(in);
            setOutput(out);
        }

        /**
         * Change the output destination.
         */
        public void setOutput(PrintWriter out) {
            this.out = out;
        }

        protected void doCalc() throws IOException {
            int iType;
            double tmp;
            while ((iType = tf.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (iType) {
                case StreamTokenizer.TT_NUMBER:
                    // Found a number, push value to stack
                    push(tf.nval);
                    break;
                case StreamTokenizer.TT_WORD:
                    // Found a variable, save its name. Not used here.
                    variable = tf.sval;
                    break;
                case '+':   // + operator is commutative.
                    push(pop() + pop());
                    break;
                case '-':   // - operator: order matters.
                    tmp = pop();
                    push(pop() - tmp);
                    break;
                case '*':   // Multiply is commutative.
                    push(pop() * pop());
                    break;
                case '/':   // Handle division carefully: order matters!
                    tmp = pop();
                    push(pop() / tmp);
                    break;
                case '=':
                    out.println(peek());
                    break;
                default:
                    out.println("What's this? iType = " + iType);
                }
            }
        }

        void push(double val) {
            s.push(new Double(val));
        }

        double pop() {
            return ((Double)s.pop()).doubleValue();
        }

        double peek() {
            return ((Double)s.peek()).doubleValue();
        }

        void clearStack() {
            s.removeAllElements();
        }
    }

While StreamTokenizer is useful, it knows only a limited number of tokens and has no way of specifying that the tokens must appear in a particular order. To do more advanced scanning, you need some special-purpose scanning tools. Such tools have been used for a long time in the Unix realm. The best-known examples are yacc and lex (discussed in the O'Reilly text lex & yacc). These tools let you specify the lexical structure of your input using regular expressions (see Chapter 4). For example, you might say that an email address consists of a series of alphanumerics, followed by an at sign (@), followed by a series of alphanumerics with periods embedded, as:

    name: [A-Za-z0-9]+@[A-Za-z0-9.]+

The tool then writes code that recognizes the characters you have described. These tools also have a grammatical specification, which says, for example, that the keyword ADDRESS must appear, followed by a colon, followed by a "name" token, as previously defined.

Two widely used scanning tools for Java are ANTLR and JavaCC. Terence Parr is the author and maintainer of ANTLR, which can be downloaded from http://www.antlr.org/. JavaCC is an open source project on java.net (https://javacc.dev.java.net/).
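The regular-expression alternative mentioned in the Solution can also be used directly from java.util.regex (available since JDK 1.4), without any generator tool. Here is a minimal sketch of splitting user@host lines that way; the class name RegexScanDemo, the split() helper, and the sample address are my own illustration, not part of the recipe's code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScanDemo {
    /** Alphanumerics, an "@", then alphanumerics with embedded periods. */
    private static final Pattern ADDRESS =
        Pattern.compile("([A-Za-z0-9]+)@([A-Za-z0-9.]+)");

    /** Split "user@host" into its two parts; null if the line doesn't match. */
    static String[] split(String line) {
        Matcher m = ADDRESS.matcher(line);
        if (!m.matches())
            return null;
        // group(1) is the user part, group(2) the host part
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("ian@darwinsys.com");
        System.out.println("User name: " + parts[0] +
            "; host part: " + parts[1]);
    }
}
```

Unlike the StringTokenizer version, a non-matching line is detected up front (matches() returns false) rather than by catching NoSuchElementException.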
These "compiler generators" can be used to write grammars for a wide variety of programs, from simple calculators such as the one earlier in this recipe, through HTML and CORBA/IDL, up to full Java and C/C++ compilers. Examples of these are included with the downloads. Unfortunately, the learning curve for parsers in general precludes providing a simple and comprehensive example here. Please refer to the documentation and the numerous examples provided with the distributions.

Java offers simple line-at-a-time scanning using StringTokenizer, fancier token-based scanning using StreamTokenizer, and grammar-based scanners built with JavaCC and similar tools. In addition to these, JDK 1.5 provides an easier way to scan simple tokens (see Recipe 10.5).
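As a quick preview of that JDK 1.5 approach, java.util.Scanner reads whitespace-separated tokens and can test each token's type before consuming it. The sketch below sums the integers scattered through a line of text; the class and method names (ScannerDemo, sumInts) are my own illustration, not from Recipe 10.5:

```java
import java.util.Scanner;

public class ScannerDemo {
    /** Sum every integer token found in the given input string. */
    static int sumInts(String input) {
        int sum = 0;
        Scanner sc = new Scanner(input);
        while (sc.hasNext()) {
            if (sc.hasNextInt())
                sum += sc.nextInt();   // next token is an integer: consume it
            else
                sc.next();             // skip a non-numeric token
        }
        sc.close();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("2 apples and 3 pears"));  // prints 5
    }
}
```

The hasNextInt()/nextInt() pairing replaces the type-code switch that StreamTokenizer requires, which is what makes Scanner the easier tool for simple token streams.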