Recipe 3.13 Parsing Comma-Separated Data


Problem

You have a string or a file of lines containing comma-separated values (CSV) that you need to read. Many Windows-based spreadsheets and some databases use CSV to export data.

Solution

Use my CSV class or a regular expression (see Chapter 4).

Discussion

CSV is deceptive. It looks simple at first glance, but the values may be quoted or unquoted. If quoted, they may further contain escaped quotes. This far exceeds the capabilities of the StringTokenizer class (Recipe 3.2). Either considerable Java coding or the use of regular expressions is required. I'll show both ways.
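To make the StringTokenizer limitation concrete, here is a minimal sketch that splits a line on commas with no awareness of quoting. The class name TokenizerVsCsv, the method naiveSplit( ), and the sample input are illustrative assumptions, not part of the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Illustrative only: shows how StringTokenizer treats quoted CSV input. */
public class TokenizerVsCsv {

    /** Split a line on commas with StringTokenizer, ignoring quoting entirely. */
    public static List<String> naiveSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical input: two fields, the first with a comma inside the quotes.
        String line = "\"Darwin, Ian\",Java";
        for (String field : naiveSplit(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

For this input the tokenizer returns three tokens rather than two, because it splits at the comma inside the quoted field and leaves the quote characters in place. It also skips empty fields silently, so a line such as a,,b yields only two tokens.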

First, a Java program. Assume for now that we have a class called CSV that has a no-argument constructor and a method called parse( ) that takes a string representing one line of the input file. The parse( ) method returns a list of fields. For flexibility, the fields are returned as a List, from which you can obtain an Iterator (see Recipe 7.4). I simply use the Iterator's hasNext( ) method to control the loop and its next( ) method to get the next object:

import java.util.*;

/* Simple demo of CSV parser class.
 */
public class CSVSimple {
    public static void main(String[] args) {
        CSV parser = new CSV( );
        List list = parser.parse(
            "\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625");
        Iterator it = list.iterator( );
        while (it.hasNext( )) {
            System.out.println(it.next( ));
        }
    }
}

Once the compiler has processed the backslash-escaped quotes in the string literal, the string being parsed is actually the following:

"LU",86.25,"11/4/1998","2:19PM",+4.0625

Running CSVSimple yields the following output:

> java CSVSimple
LU
86.25
11/4/1998
2:19PM
+4.0625
>

But what about the CSV class itself? The code in Example 3-10 started as a translation of a CSV program written in C++ by Brian W. Kernighan and Rob Pike that appeared in their book The Practice of Programming (Addison Wesley). Their version commingled the input processing with the parsing; my CSV class does only the parsing, since the input could be coming from any of a variety of sources. It has also been substantially rewritten over time. The main work is done in parse( ), which delegates handling of individual fields to advQuoted( ) in cases where the field begins with a quote; otherwise, to advPlain( ).

Example 3-10. CSV.java
import java.util.*;

import com.darwinsys.util.Debug;

/** Parse comma-separated values (CSV), a common Windows file format.
 * Sample input: "LU",86.25,"11/4/1998","2:19PM",+4.0625
 * <p>
 * Inner logic adapted from a C++ original that was
 * Copyright (C) 1999 Lucent Technologies
 * Excerpted from 'The Practice of Programming'
 * by Brian W. Kernighan and Rob Pike.
 * <p>
 * Included by permission of the http://tpop.awl.com/ web site,
 * which says:
 * "You may use this code for any purpose, as long as you leave
 * the copyright notice and book citation attached." I have done so.
 * @author Brian W. Kernighan and Rob Pike (C++ original)
 * @author Ian F. Darwin (translation into Java and removal of I/O)
 * @author Ben Ballard (rewrote advQuoted to handle '""' and for readability)
 */
public class CSV {

    public static final char DEFAULT_SEP = ',';

    /** Construct a CSV parser, with the default separator (','). */
    public CSV( ) {
        this(DEFAULT_SEP);
    }

    /** Construct a CSV parser with a given separator.
     * @param sep The single char for the separator (not a list of
     * separator characters)
     */
    public CSV(char sep) {
        fieldSep = sep;
    }

    /** The fields in the current String */
    protected List list = new ArrayList( );

    /** the separator char for this parser */
    protected char fieldSep;

    /** parse: break the input String into fields
     * @return java.util.List containing each field
     * from the original as a String, in order.
     */
    public List parse(String line)
    {
        StringBuffer sb = new StringBuffer( );
        list.clear( );            // recycle to initial state
        int i = 0;

        if (line.length( ) == 0) {
            list.add(line);
            return list;
        }

        do {
            sb.setLength(0);
            if (i < line.length( ) && line.charAt(i) == '"')
                i = advQuoted(line, sb, ++i);    // skip quote
            else
                i = advPlain(line, sb, i);
            list.add(sb.toString( ));
            Debug.println("csv", sb.toString( ));
            i++;
        } while (i < line.length( ));

        return list;
    }

    /** advQuoted: quoted field; return index of next separator */
    protected int advQuoted(String s, StringBuffer sb, int i)
    {
        int j;
        int len= s.length( );
        for (j=i; j<len; j++) {
            if (s.charAt(j) == '"' && j+1 < len) {
                if (s.charAt(j+1) == '"') {
                    j++; // skip escape char
                } else if (s.charAt(j+1) == fieldSep) { //next delimiter
                    j++; // skip end quotes
                    break;
                }
            } else if (s.charAt(j) == '"' && j+1 == len) { // end quotes at end of line
                break; //done
            }
            sb.append(s.charAt(j));    // regular character.
        }
        return j;
    }

    /** advPlain: unquoted field; return index of next separator */
    protected int advPlain(String s, StringBuffer sb, int i)
    {
        int j;

        j = s.indexOf(fieldSep, i); // look for separator
        Debug.println("csv", "i = " + i + ", j = " + j);
        if (j == -1) {                   // none found
            sb.append(s.substring(i));
            return s.length( );
        } else {
            sb.append(s.substring(i, j));
            return j;
        }
    }
}

In the online source directory, you'll find CSVFile.java, which reads a text file and runs it through parse( ). You'll also find Kernighan and Pike's original C++ program.
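CSVFile.java is not reproduced here. As a rough idea of what such a driver involves, the following sketch reads lines from any Reader and hands each one to a line parser. The class name CSVFileSketch, the method readAll( ), and the stand-in comma splitter in main( ) are all assumptions made for illustration; they are not the contents of the real CSVFile.java.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative driver: feeds each input line to a supplied line parser. */
public class CSVFileSketch {

    /** Read every line from the reader and collect the parsed fields per row. */
    public static List<List<String>> readAll(
            Reader in, Function<String, List<String>> lineParser) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            // Copy the result: a parser like CSV above clears and reuses
            // one internal list on every call to parse( ).
            rows.add(new ArrayList<>(lineParser.apply(line)));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in parser (plain comma split) so the sketch runs by itself.
        Function<String, List<String>> naive = s -> Arrays.asList(s.split(","));
        List<List<String>> rows = readAll(new StringReader("a,b\nc,d\n"), naive);
        for (List<String> row : rows) {
            System.out.println(row);
        }
    }
}
```

The copy inside the loop matters with the CSV class shown in Example 3-10, because parse( ) recycles the same List between calls; storing the returned reference directly would leave every row pointing at the fields of the last line read.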

We haven't discussed regular expressions yet (we will in Chapter 4). However, many readers are familiar with regexes in a general way, so the following example demonstrates the power of regexes, as well as providing code for you to reuse. Note that this program replaces all the code[4] in both CSV.java and CSVFile.java. The key to understanding regexes is that a little specification can match a lot of data.

[4] With the caveat that it doesn't handle different delimiters; this could be added using GetOpt and constructing the pattern around the delimiter.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Simple demo of CSV matching using Regular Expressions.
 * Does NOT use the "CSV" class defined in the Java CookBook, but uses
 * a regex pattern simplified from Chapter 7 of <em>Mastering Regular
 * Expressions</em> (p. 205, first edn.)
 * @version $Id: ch03.xml,v 1.3 2004/05/04 18:03:14 ian Exp $
 */
public class CSVRE {
    /** The rather involved pattern used to match CSVs consists of three
     * alternations: the first matches a quoted field, the second unquoted,
     * the third a null field.
     */
    public static final String CSV_PATTERN = "\"([^\"]+?)\",?|([^,]+),?|,";

    private static Pattern csvRE;

    public static void main(String[] argv) throws IOException {
        System.out.println(CSV_PATTERN);
        new CSVRE( ).process(new BufferedReader(new InputStreamReader(System.in)));
    }

    /** Construct a regex-based CSV parser. */
    public CSVRE( ) {
        csvRE = Pattern.compile(CSV_PATTERN);
    }

    /** Process one file. Delegates to parse( ) a line at a time */
    public void process(BufferedReader in) throws IOException {
        String line;

        // For each line...
        while ((line = in.readLine( )) != null) {
            System.out.println("line = `" + line + "'");
            List l = parse(line);
            System.out.println("Found " + l.size( ) + " items.");
            for (int i = 0; i < l.size( ); i++) {
                System.out.print(l.get(i) + ",");
            }
            System.out.println( );
        }
    }

    /** Parse one line.
     * @return List of Strings, minus their double quotes
     */
    public List parse(String line) {
        List list = new ArrayList( );
        Matcher m = csvRE.matcher(line);
        // For each field
        while (m.find( )) {
            System.out.println(m.groupCount( ));
            String match = m.group( );
            if (match == null)
                break;
            if (match.endsWith(",")) {// trim trailing ,
                match = match.substring(0, match.length( ) - 1);
            }
            if (match.startsWith("\"")) { // assume also ends with
                match = match.substring(1, match.length( ) - 1);
            }
            if (match.length( ) == 0)
                match = null;
            list.add(match);
        }
        return list;
    }
}
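The footnote's idea of constructing the pattern around the delimiter might look something like the sketch below. It leaves GetOpt out and simply takes the separator as a parameter. The class name CSVREDelim and the methods literal( ), patternFor( ), and parse( ) are hypothetical names introduced here, and the sketch reads fields from the capture groups rather than trimming the whole match as CSVRE does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: the CSV_PATTERN idea rebuilt around an arbitrary separator. */
public class CSVREDelim {

    /** Make one character literal in a regex, inside or outside a character class. */
    static String literal(char c) {
        // A backslash before a non-alphabetic, non-digit character is a plain escape.
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    /** Build the three-alternative pattern (quoted, unquoted, empty) for sep. */
    public static Pattern patternFor(char sep) {
        if (sep == '"') {
            throw new IllegalArgumentException("separator cannot be the quote character");
        }
        String d = literal(sep);
        return Pattern.compile(
            "\"([^\"]+?)\"" + d + "?|([^" + d + "]+)" + d + "?|" + d);
    }

    /** Parse one line; an empty field is reported as null, as in CSVRE. */
    public static List<String> parse(String line, char sep) {
        List<String> fields = new ArrayList<>();
        Matcher m = patternFor(sep).matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                fields.add(m.group(1));      // quoted field, quotes already excluded
            } else {
                fields.add(m.group(2));      // unquoted field, or null for an empty one
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"LU\"|86.25|\"11/4/1998\"", '|'));
    }
}
```

Escaping the separator matters because characters such as | or . have special meaning in a regex. Like the original pattern, this sketch does not handle doubled quotes inside a quoted field.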

It is sometimes "downright scary" how much mundane code you can eliminate with a single, well-formulated regular expression.



Java Cookbook
Java Cookbook, Second Edition
ISBN: 0596007019
EAN: 2147483647
Year: 2003
Pages: 409
Authors: Ian F Darwin
