Recipe 3.2 Taking Strings Apart with StringTokenizer


Problem

You need to take a string apart into words or tokens.

Solution

Construct a StringTokenizer around your string and call its methods hasMoreTokens( ) and nextToken( ). Or, use Regular Expressions (see Chapter 4).

The StringTokenizer methods implement the Iterator design pattern (see Recipe 7.4):

// StrTokDemo.java
StringTokenizer st = new StringTokenizer("Hello World of Java");
while (st.hasMoreTokens( ))
    System.out.println("Token: " + st.nextToken( ));

StringTokenizer also implements the Enumeration interface directly (also in Recipe 7.4), but if you use the methods thereof you need to cast the results to String.
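As a minimal sketch of that cast, assuming a Java version with generics and the diamond operator (Java 7 or later): the class name StrTokEnumDemo and its tokens( ) helper are hypothetical names chosen for illustration, not part of the book's source code.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical demo: walk a StringTokenizer through the Enumeration interface. */
class StrTokEnumDemo {

    /** Collects the tokens using hasMoreElements()/nextElement(). */
    static List<String> tokens(String input) {
        // StringTokenizer is declared as implementing Enumeration<Object>
        Enumeration<Object> e = new StringTokenizer(input);
        List<String> result = new ArrayList<>();
        while (e.hasMoreElements()) {
            // nextElement() returns Object, so a cast is needed to get a String
            String token = (String) e.nextElement();
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String t : tokens("Hello World of Java"))
            System.out.println("Token: " + t);
    }
}
```

The tokens are the same ones that hasMoreTokens( ) and nextToken( ) would return; only the declared return type differs, which is why the first form is usually more convenient.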

A StringTokenizer normally breaks the String into tokens at whitespace (space, tab, newline, carriage return, and form feed), which is roughly what we would think of as "word boundaries" in European languages. Sometimes you want to break at some other character. No problem. When you construct your StringTokenizer, in addition to passing in the string to be tokenized, pass in a second string that lists the "break characters." For example:

// StrTokDemo2.java
StringTokenizer st = new StringTokenizer("Hello, World|of|Java", ", |");
while (st.hasMoreElements( ))
    System.out.println("Token: " + st.nextElement( ));
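To see what the second argument changes, the following sketch tokenizes the same string twice, once with the default delimiters and once with the break characters ", |". The class StrTokDelimDemo and its split( ) helper are hypothetical names introduced here for illustration, and the code assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical demo: default delimiters versus an explicit delimiter list. */
class StrTokDelimDemo {

    /** Tokenizes input; a null delims means "use StringTokenizer's defaults". */
    static List<String> split(String input, String delims) {
        StringTokenizer st = (delims == null)
            ? new StringTokenizer(input)           // whitespace only
            : new StringTokenizer(input, delims);  // caller-supplied break characters
        List<String> result = new ArrayList<>();
        while (st.hasMoreTokens())
            result.add(st.nextToken());
        return result;
    }

    public static void main(String[] args) {
        String input = "Hello, World|of|Java";
        // Default delimiters: two tokens, "Hello," and "World|of|Java"
        for (String t : split(input, null))
            System.out.println("Default: " + t);
        // Break characters ", |": four tokens, Hello, World, of, Java
        for (String t : split(input, ", |"))
            System.out.println("Custom:  " + t);
    }
}
```

With the defaults, the comma stays attached to "Hello" and the vertical bars are treated as ordinary characters; with the explicit list, each of the three characters in ", |" acts as a separator on its own.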

But wait, there's more! What if you are reading lines like:

FirstName|LastName|Company|PhoneNumber

and your dear old Aunt Begonia hasn't been employed for the last 38 years? Her "Company" field will in all probability be blank.[3] If you look very closely at the previous code example, you'll see that it has two delimiters together (the comma and the space), but if you run it, there are no "extra" tokens. That is, the StringTokenizer normally discards adjacent consecutive delimiters. For cases like the phone list, where you need to preserve null fields, there is good news and bad news. The good news is you can do it: you simply add a third argument of true when constructing the StringTokenizer, meaning that you wish to see the delimiters as tokens. The bad news is that you now get to see the delimiters as tokens, so you have to do the arithmetic yourself. Want to see it? Run this program:

[3] Unless, perhaps, you're as slow at updating personal records as I am.

// StrTokDemo3.java
StringTokenizer st =
    new StringTokenizer("Hello, World|of|Java", ", |", true);
while (st.hasMoreElements( ))
    System.out.println("Token: " + st.nextElement( ));

and you get this output:

C:\javasrc>java StrTokDemo3
Token: Hello
Token: ,
Token:  
Token: World
Token: |
Token: of
Token: |
Token: Java

This isn't how you'd like StringTokenizer to behave, ideally, but it is serviceable enough most of the time. Example 3-1 asks for the delimiters as tokens and uses them only to count field positions, so that consecutive delimiters leave empty (null) slots; it returns the results as an array of Strings.

Example 3-1. StrTokDemo4.java (StringTokenizer)
import java.util.*;

/** Show using a StringTokenizer including getting the delimiters back */
public class StrTokDemo4 {
    public final static int MAXFIELDS = 5;
    public final static String DELIM = "|";

    /** Processes one String; returns it as an array of Strings */
    public static String[] process(String line) {
        String[] results = new String[MAXFIELDS];

        // Unless you ask StringTokenizer to give you the delimiters,
        // it silently discards multiple null tokens.
        StringTokenizer st = new StringTokenizer(line, DELIM, true);

        int i = 0;
        // stuff each token into the current slot in the array
        while (st.hasMoreTokens( )) {
            String s = st.nextToken( );
            if (s.equals(DELIM)) {
                // Pre-increment, so that an extra field is rejected here
                // rather than overflowing results[] on the next token.
                if (++i >= MAXFIELDS)
                    // This is messy: See StrTokDemo4b which uses
                    // a Vector to allow any number of fields.
                    throw new IllegalArgumentException("Input line " +
                        line + " has too many fields");
                continue;
            }
            results[i] = s;
        }
        return results;
    }

    public static void printResults(String input, String[] outputs) {
        System.out.println("Input: " + input);
        for (int i=0; i<outputs.length; i++)
            System.out.println("Output " + i + " was: " + outputs[i]);
    }

    public static void main(String[] a) {
        printResults("A|B|C|D", process("A|B|C|D"));
        printResults("A||C|D", process("A||C|D"));
        printResults("A|||D|E", process("A|||D|E"));
    }
}

When you run this, you will see that A is always in the first field (Output 0), B (if present) is in the second (Output 1), and so on. In other words, the null fields are being handled properly:

Input: A|B|C|D
Output 0 was: A
Output 1 was: B
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A||C|D
Output 0 was: A
Output 1 was: null
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A|||D|E
Output 0 was: A
Output 1 was: null
Output 2 was: null
Output 3 was: D
Output 4 was: E
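The fixed MAXFIELDS limit can be avoided by collecting the fields in a List. What follows is only a sketch of that idea, assuming Java 7 or later; it is not the book's StrTokDemo4b (which is not shown here), and the class name StrTokListDemo and the sample phone-list line are made up for illustration. It keeps the same convention as Example 3-1: an empty field is stored as null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical variant of Example 3-1 that allows any number of fields. */
class StrTokListDemo {
    static final String DELIM = "|";

    /** Splits one line into fields; an empty field is represented by null. */
    static List<String> process(String line) {
        List<String> fields = new ArrayList<>();
        // true = return the delimiters as tokens, so empty fields can be detected
        StringTokenizer st = new StringTokenizer(line, DELIM, true);
        String current = null;   // text of the field being read; null if none seen yet
        while (st.hasMoreTokens()) {
            String s = st.nextToken();
            if (s.equals(DELIM)) {
                fields.add(current);   // a delimiter closes the current field
                current = null;
            } else {
                current = s;
            }
        }
        fields.add(current);   // the last field has no delimiter after it
        return fields;
    }

    public static void main(String[] args) {
        // FirstName|LastName|Company|PhoneNumber, with a blank Company
        List<String> fields = process("Begonia|Smith||555-1234");
        for (int i = 0; i < fields.size(); i++)
            System.out.println("Output " + i + " was: " + fields.get(i));
    }
}
```

For the sample line, the Company slot (index 2) comes back as null and the phone number stays at index 3. Note one design consequence: a line that ends with a delimiter is reported as having a final empty field.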

See Also

Now that Java includes Regular Expressions (as of JDK 1.4), many occurrences of StringTokenizer can be replaced with Regular Expressions (see Chapter 4) with considerably more flexibility. For example, to extract all the numbers from a String, you can use this code:

Matcher toke = Pattern.compile("\\d+").matcher(inputString);
while (toke.find( )) {
    String courseString = toke.group(0);
    int courseNumber = Integer.parseInt(courseString);
    ...

This allows user input to be more flexible than you could easily handle with a StringTokenizer. Assuming that the numbers represent course numbers at some educational institution, the inputs "471,472,570" or "Courses 471 and 472, 570" or just "471 472 570" should all give the same results.
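A self-contained version of this idea might look like the sketch below, which runs the three sample inputs through the same pattern. The class name CourseNumbers and the extract( ) method are hypothetical names introduced for this illustration, and the code assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: pull every run of digits out of a line of user input. */
class CourseNumbers {
    // One or more consecutive digits
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the numbers found in inputString, in order of appearance. */
    static List<Integer> extract(String inputString) {
        List<Integer> courses = new ArrayList<>();
        Matcher toke = DIGITS.matcher(inputString);
        while (toke.find()) {
            // group(0) is the whole match; parseInt throws NumberFormatException
            // if a digit run is too large to fit in an int
            courses.add(Integer.parseInt(toke.group(0)));
        }
        return courses;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "471,472,570",
            "Courses 471 and 472, 570",
            "471 472 570",
        };
        for (String input : inputs)
            System.out.println(input + " -> " + extract(input));
    }
}
```

Each of the three inputs yields the list [471, 472, 570], because the pattern ignores everything that is not a digit instead of depending on a particular delimiter.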



Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin