Input


By far the hardest part of this or any similar problem is parsing the non-XML input data. Everything else pales by comparison. Unlike parsing XML, you generally cannot rely on a library to do the hard work for you. You have to do it yourself. And also unlike XML, there's little guarantee that the data is well-formed. More likely than not, you will encounter incorrectly formatted data.

In this case, because the records are separated into lines, I'll read each line, one at a time, using the readLine() method of java.io.BufferedReader. This method works well enough as long as the data is in a file, although it's potentially buggy when the data is served over a network socket.
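A minimal sketch of that read loop, assuming the data has been saved to a local file (the file name here is only a placeholder), looks like the following; the parse() method in Example 4.1 below uses the same pattern:

import java.io.*;
import java.util.*;

// Sketch: read the budget data line by line.
// "budauth.txt" is a placeholder; substitute the actual data file.
public class ReadLines {

  public static void main(String[] args) throws IOException {

    InputStream src = new FileInputStream("budauth.txt");
    // The data as published by the OMB is encoded in Latin-1
    BufferedReader in
     = new BufferedReader(new InputStreamReader(src, "8859_1"));

    List lines = new ArrayList();
    String line;
    // readLine() does not return until it sees a line terminator
    // or the end of the stream
    while ((line = in.readLine()) != null) {
      lines.add(line);
    }
    in.close();

    System.out.println(lines.size() + " records read");
  }

}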

Each line is dissected into its component fields inside the splitLine() method. Each record is stored in its own map. The keys for the map are read from a constant array, because the fields are always in the same position in each record.

Caution

For parsing the data out of each line, a lot of Java developers immediately reach for the java.util.StringTokenizer or java.io.StreamTokenizer classes. Don't. These classes are very strangely designed and rarely do what developers expect them to do. For example, if StreamTokenizer encounters a \n inside a string literal, it will convert it to a linefeed. This makes sense when parsing Java source code, but in most other environments \n is just another two characters with no special meaning. Java's tokenizer classes are designed for and suited to parsing Java source code. They are not suitable for reading tab- or comma-delimited data. If you want to design your program around a tokenization function, you should write one yourself that behaves appropriately for your data format.
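If you do write such a function yourself, it doesn't need to be complicated. Here's a minimal sketch of a quote-aware tokenizer for comma-delimited records; the class and method names are invented for this illustration, and it is not the code used in Example 4.1. Unlike StreamTokenizer, it gives no special meaning to backslashes, so a \n in the input remains exactly two characters.

import java.util.*;

public class CSVTokenizer {  // hypothetical class name

  // Splits one record on commas, ignoring commas that fall inside
  // double-quoted strings. The quotes themselves are dropped.
  public static List tokenize(String record) {

    List fields = new ArrayList();
    StringBuffer field = new StringBuffer();
    boolean inQuotes = false;

    for (int i = 0; i < record.length(); i++) {
      char c = record.charAt(i);
      if (c == '"') {
        inQuotes = !inQuotes;          // toggle quoting
      }
      else if (c == ',' && !inQuotes) {
        fields.add(field.toString());  // an unquoted comma ends a field
        field = new StringBuffer();
      }
      else {
        field.append(c);
      }
    }
    fields.add(field.toString());      // end of string ends the last field
    return fields;
  }

}

The splitLine() method in Example 4.1 follows the same pattern, but assigns each field to a named key as it goes.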


Example 4.1 shows the input code. To use this, open an input stream to the file containing the budget data and pass that stream as an argument to the parse() method. You'll get back a List containing the parsed data. Each object in this list is a Map containing the data for one line item. Both keys and values for this map are strings. Because the keys are constant, they're stored in a final static array named keys. At various times I plan to use the keys as XML element names, XML attribute names, or SQL field names. Therefore, each key must begin with a letter. Thus the keys for the fiscal year fields are named FY1976, FY1977, FY1978, and so forth, instead of just 1976, 1977, 1978, and so forth. This rules out storing the fiscal year keys as simple ints, but that wouldn't have worked anyway, because one of the year fields is the transitional quarter in 1976, which does not represent a full year and does not have a numeric name.
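Assuming the budget data has been saved in a local file (the file name below is only a placeholder), a short sketch of driving the parser looks like this:

import java.io.*;
import java.util.*;

public class ParseBudget {

  public static void main(String[] args) throws IOException {

    // "budauth.txt" is a placeholder; substitute the actual data file
    InputStream in = new FileInputStream("budauth.txt");
    List records = BudgetData.parse(in);
    in.close();

    // Each record is a Map whose keys come from BudgetData.keys
    Iterator iterator = records.iterator();
    while (iterator.hasNext()) {
      Map record = (Map) iterator.next();
      System.out.println(record.get("AgencyName") + ": "
       + record.get("FY2001"));
    }
  }

}

Each pass through the loop pulls fields out of one record by key, here the agency name and the fiscal year 2001 amount.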

Caution

In 1976 the government's fiscal year shifted forward one quarter. As a result, the 1977 fiscal year started in October, a quarter after the 1976 fiscal year ended. There was a transitional quarter from July through September of that year; therefore, some of the data represents less than a whole year. Here the special case comes from the data itself, so it can't simply be cleaned away; it has to be handled with extra code, which makes the examples less clean than they would otherwise be.

This sort of funky data (a year with only three months in it that can easily be confused with another year) is exactly the sort of thing you have to watch out for when processing legacy data. The real world does not always fit into neatly typed categories. There's almost always some outlier data that just doesn't fit the schema. All too often it's been forced into the existing system by some manager or data entry clerk in ways the original designers never intended. This happens all the time. You cannot assume the data actually adheres to its schema, either implicit or explicit.


The code to parse each line of input is hidden inside the private splitLine() method. This code is relatively complex. It iterates through the record looking for comma delimiter characters but has to ignore commas that appear inside quoted strings. Furthermore, it must recognize that the end of the string delimits the last token. Even so, this method is not very robust. It will throw an uncaught exception if any quotes are omitted, or if there are too few fields. It will not notice and report the error if a record contains too many fields.

Example 4.1 A Class That Parses Comma-Separated Values into a List of HashMaps
import java.io.*;
import java.util.*;

public class BudgetData {

  public static List parse(InputStream src) throws IOException {

    // The document as published by the OMB is encoded in Latin-1
    InputStreamReader isr = new InputStreamReader(src, "8859_1");
    BufferedReader in = new BufferedReader(isr);

    List records = new ArrayList();

    String lineItem;
    while ((lineItem = in.readLine()) != null) {
      records.add(splitLine(lineItem));
    }

    return records;
  }

  // the field names in order
  public final static String[] keys = {
    "AgencyCode",
    "AgencyName",
    "BureauCode",
    "BureauName",
    "AccountCode",
    "AccountName",
    "TreasuryAgencyCode",
    "SubfunctionCode",
    "SubfunctionTitle",
    "BEACategory",
    "On-Off-BudgetIndicator",
    "FY1976", "TransitionQuarter", "FY1977", "FY1978", "FY1979",
    "FY1980", "FY1981", "FY1982", "FY1983", "FY1984", "FY1985",
    "FY1986", "FY1987", "FY1988", "FY1989", "FY1990", "FY1991",
    "FY1992", "FY1993", "FY1994", "FY1995", "FY1996", "FY1997",
    "FY1998", "FY1999", "FY2000", "FY2001", "FY2002", "FY2003",
    "FY2004", "FY2005", "FY2006"
  };

  private static Map splitLine(String record) {

    record = record.trim();
    int index = 0;
    Map result = new HashMap();

    for (int i = 0; i < keys.length; i++) {
      // find the next comma
      StringBuffer sb = new StringBuffer();
      char c;
      boolean inString = false;
      while (true) {
        c = record.charAt(index);
        if (!inString && c == '"') inString = true;
        else if (inString && c == '"') inString = false;
        else if (!inString && c == ',') break;
        else sb.append(c);
        index++;
        if (index == record.length()) break;
      }
      String s = sb.toString().trim();
      result.put(keys[i], s);
      index++;
    }

    return result;
  }

}
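As noted earlier, splitLine() trusts its input: omitted quotes or too few fields produce an uncaught runtime exception, and extra fields are silently ignored. One possible safeguard, sketched here with an invented helper that is not part of the BudgetData class, is to count the unquoted commas in each line before splitting it:

public class RecordChecker {  // hypothetical class name

  // Returns the number of comma-separated fields in one record,
  // ignoring commas that fall inside double-quoted strings
  public static int countFields(String record) {

    int fields = 1;
    boolean inString = false;

    for (int i = 0; i < record.length(); i++) {
      char c = record.charAt(i);
      if (c == '"') inString = !inString;
      else if (c == ',' && !inString) fields++;
    }
    return fields;
  }

}

The parse() method could call countFields() on each lineItem and report or skip any record whose count does not equal keys.length before handing it to splitLine().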

