Parsing Non-XML Documents | Processing XML with Javaв„ў: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

One of the neatest tricks that a SAX filter can perform is presenting something that isn't an XML document as if it were one. The client application doesn't need to know that what it's parsing is really a tab-delimited text file or a relational database if the filter hides all of those details and just presents a view of that source as a well- formed XML document.

There are several advantages to using this approach over actually converting the source. For one, it's a lot more dynamically adaptable to changing data. For another, it's easier to integrate into existing XML processing chains. A small advantage is that you don't need to worry about escaping illegal characters like < and & because these would normally be unescaped before the parser passed them back to the client application.

For example, JDBC makes it easy to write a SAX filter that puts an XML face on an SQL table. The parse() method reads from a JDBC ResultSet which it annotates with tags and attributes by firing the appropriate events. The entire ResultSet becomes a single root table element. Each record is presented in a record element, which contains one field element for each field. Each field element contains the value of the field as text. It also has an xsi:type attribute that identifies the type of the field, and a name attribute that identifies the name of the field. An alternate approach would use the SQL field names as the element names rather than the generic field with a name attribute. Example 8.13 demonstrates .

Example 8.13 Accessing Databases through SAX

 import org.xml.sax.*; import org.xml.sax.helpers.*; import java.sql.*; import java.io.*; public class DatabaseFilter extends XMLFilterImpl {   private Connection connection;   // The string passed to the constructor must be a JDBC URL that   // contains all necessary information for connecting to the   // database such as host, port, username, password, and   // database name. For example,   // jdbc:mysql://host:port]/dbname?user=username&password=pass   // The driver should have been loaded before this method is   // called   public DatabaseFilter(String jdbcURL) throws SQLException {     connection = DriverManager.getConnection(jdbcURL);   }   public void parse(InputSource in) throws SAXException,    IOException, UnsupportedOperationException {     throw new UnsupportedOperationException(      "Can't read a database from an InputStream or Reader"     );   }   public void parse(String selectQuery)    throws SAXException, IOException {     try {       Statement statement = connection.createStatement();       ResultSet data = statement.executeQuery(selectQuery);       ResultSetMetaData metadata = data.getMetaData();       int numFields = metadata.getColumnCount();       Attributes emptyAttributes = new AttributesImpl();       startElement("", "table", "table", emptyAttributes);       while (data.next()) {         startElement("", "record", "record", emptyAttributes);         for (int field = 1; field <= numFields; field++) {           AttributesImpl fieldAtts = new AttributesImpl();           int type = metadata.getColumnType(field);           String typeName = getSchemaType(type);           fieldAtts.addAttribute(            "http://www.w3.org/2001/XMLSchema-instance", "type",            "xsi:type", "NMTOKEN", typeName);           String name = metadata.getColumnName(field);           fieldAtts.addAttribute(            "", "name", "name", "NMTOKEN", name);           // Convert nulls to empty elements with xsi:nil="true"           Object value = data.getObject(field);           if (value == null) {// null value in database             fieldAtts.addAttribute(              "http://www.w3.org/2001/XMLSchema-instance", "nil",              "xsi:nil", "NMTOKEN", "true");             startElement("", "field", "field", fieldAtts);             endElement("", "field", "field");           }           else {// non-null value             startElement("", "field", "field", fieldAtts);             convertToXML(data, field, type);             endElement("", "field", "field");           }         }         endElement("", "record", "record");       }       endElement("", "table", "table");       statement.close();     }     catch (SQLException e) { // convert exception type       throw new SAXException(e);     }   }   // I want the XML document to store values in the standard W3C   // XML Schema Language forms. This requires certain conversions   // depending on the type of the data   private void convertToXML(ResultSet data, int field, int type)    throws SQLException, SAXException {     switch (type) {       case Types.BINARY:       case Types.VARBINARY:       case Types.LONGVARBINARY:         hexEncode(data.getBinaryStream(field));         break;       case Types.BLOB:         Blob blob = data.getBlob(field);         hexEncode(blob.getBinaryStream());         break;       case Types.CLOB:         Clob clob = data.getClob(field);         Reader r = clob.getCharacterStream();         char[] text = new char[1024];         int numRead;         try {           while ((numRead = r.read(text, 0, 1024)) != -1) {             escapeText(text, 0, numRead);             characters(text, 0, numRead);           }         }         catch (IOException e) {           throw new SAXException("Read from CLOB failed", e);         }         break;       case Types.ARRAY:         Array array = data.getArray(field);         writeArray(array);         break;       default: // All other types can be handled as strings         Object o = data.getObject(field);         if (o == null) return;         String s = o.toString();         char[] value = s.toCharArray();         escapeText(value, 0, value.length);         characters(value, 0, value.length);     }   }   private void hexEncode(InputStream in)    throws SQLException, SAXException {     try {       int octet;       while ((octet = in.read()) != -1) {         StringWriter out = new StringWriter(2);         if (octet < 16) out.write('0');         out.write(Integer.toHexString(octet));         char[] text = out.toString().toCharArray();         characters(text, 0, 2);       }     }     catch (IOException e) {       throw new SAXException(e);     }   }   // String types may contain C0 control characters that are   // not legal in XML. I convert these to the Unicode replacement   // character 0xFFFD   private void escapeText(char[] text, int start, int length) {     for (int i = start; i < length; i++) {         text[i] = escapeChar(text[i]);     }   }   private char escapeChar(char c) {     if (c >= 0x20) return c;     else if (c == '\n') return c;     else if (c == '\r') return c;     else if (c == '\t') return c;     return '\uFFFD';   }   private void writeArray(Array array)    throws SQLException, SAXException {     ResultSet data = array.getResultSet();     int type = array.getBaseType();     String typeName = getSchemaType(type);     while (data.next()) {       AttributesImpl fieldAtts = new AttributesImpl();       fieldAtts.addAttribute(         "http://www.w3.org/2001/XMLSchema-instance", "type",         "xsi:type", "NMTOKEN", typeName);       startElement("", "component", "component", fieldAtts);       convertToXML(data, 2, type);       endElement("", "component", "component");     }   }   public static String getSchemaType(int type) {     switch (type) {       case Types.ARRAY:         return "array";       case Types.BIGINT:        return "xsd:long";       case Types.BINARY:        return "xsd:hexBinary";       case Types.BIT:           return "xsd:boolean";       case Types.BLOB:          return "xsd:hexBinary";       case Types.CHAR:          return "xsd:string";       case Types.CLOB:          return "xsd:string";       case Types.DATE:          return "xsd:date";       case Types.DECIMAL:       return "xsd:decimal";       case Types.DOUBLE:        return "xsd:double";       case Types.FLOAT:         return "xsd:decimal";       case Types.INTEGER:       return "xsd:int";       case Types.JAVA_OBJECT:   return "xsd:string";       case Types.LONGVARBINARY: return "xsd:hexBinary";       case Types.LONGVARCHAR:   return "xsd:string";       case Types.NUMERIC:       return "xsd:decimal";       case Types.REAL:          return "xsd:float";       case Types.REF:           return "xsd:IDREF";       case Types.SMALLINT:      return "xsd:short";       case Types.STRUCT:        return "struct";       case Types.TIME:          return "xsd:time";       case Types.TIMESTAMP:     return "xsd:dateTime";       case Types.TINYINT:       return "xsd:byte";       case Types.VARBINARY:     return "xsd:hexBinary";                                 // most general type       default:                  return "xsd:string";     }   }   // Warn clients that this filter does not receive its events   // from another XML parser   public void setParent(XMLReader parent)    throws UnsupportedOperationException {     throw new UnsupportedOperationException(      "A DatabaseFilter reads from an underlying SQL database;"      + " not an underlying XML parser"     );   } }

The trickiest part of this design was not the XML output. It was figuring out how to pass in the database connection parameters and the SQL queries. Sun has defined a JDBC URL, but such a URL only indicates the database, username, and password to access. It does not go all the way down to the level of the SQL query, so it could not be used as a system ID. I chose instead to pass in the database connection parameters through constructors, and the SQL query as a system ID string passed to the parse() method. This is not the customary URI system ID, but because this will only be used in this program and not directly in XML documents, this doesn't cause any major problems. Nonetheless, I changed the name of the formal argument to the parse() method from systemID to SQLQuery to try to make this more obvious. This filter cannot parse InputSource objects because it's not really possible to read a database from a stream. Thus the parse() method that takes an InputSource as an argument throws an UnsupportedOperationException .

A large part of the logic in this filter involves converting JDBC results into text. Most of the JDBC/SQL types have natural string representations that conveniently match W3C XML Schema Language primitive types. You can retrieve these by calling toString() on the corresponding Java object returned by the JDBC getObject() method. But first the binary types (BINARY, VARBINARY, LONGVARBINARY, and BLOB) must be hex encoded. ^[3] CLOBs don't have to be hex encoded, but they aren't available as a single string and thus have to be read from a stream too. Finally, arrays and structs require more detailed treatment as a complex type rather than a simple type.

^[3] Base64 encoding would be more efficient, but I didn't want to introduce another class library just to do base64 encoding.

The sample driver program for this filter, shown in Example 8.14, is very similar to earlier driver programs, except that it requires the user to specify two arguments on the command line: the JDBC URL giving the connection parameters and the SQL query to execute. Because SQL queries normally contain white space, it must be enclosed in double quotes. A more serious example would reuse the same driver (or filter) for multiple queries.

Example 8.14 A Very Simple User Interface for Extracting XML Data from a Relational Database

 import org.xml.sax.*; import org.xml.sax.helpers.XMLReaderFactory; import java.io.*; import com.megginson.sax.DataWriter; import java.sql.SQLException; public class SQLDriver {   public static void main(String[] args) {     if (args.length < 2) {       System.out.println(        "Usage: java SQLDriver URL query driverClass");       return;     }     String url = args[0];     String query = args[1];     String driverClass = "org.gjt.mm.mysql.Driver"; // MySQL     if (args.length >= 3) driverClass = args[2];     try {       // Load JDBC driver       Class.forName(driverClass).newInstance();       // Technically, the newInstance() call isn't needed,       // but the MM.MySQL documentation suggests this to       // "work around some broken JVMs"       XMLFilter filter = new DatabaseFilter(url);       DataWriter out = new DataWriter();       out.forceNSDecl(        "http://www.w3.org/2001/XMLSchema-instance", "xsi");       out.forceNSDecl("http://www.w3.org/2001/XMLSchema", "xsd");       out.setIndentStep(2);       filter.setContentHandler(out);       filter.parse(query);       out.flush();     }     catch (InstantiationException e) {       System.out.println(driverClass        + " could not be instantiated");     }     catch (ClassNotFoundException e) {       System.out.println(driverClass + " could not be found");     }     catch (Exception e) {// SQL, SAX, and IO       e.printStackTrace();       System.out.println(e);     }   } }

Instead of XMLWriter , this driver uses David Megginson's other public-domain writer program, DataWriter , which is designed for record-oriented data of this sort and inserts extra white space to prettify the output. Following is some of the output from when I ran SQLDriver against one of my databases:

 %  java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader  SQLDriver   'jdbc:mysql://luna.oit.unc.edu/NYWC?user=elharo&password=Password'   "SELECT LastName, FirstName FROM composers   WHERE LastName like 'A%'"  <table xmlns:xsd="http://www.w3.org/2001/XMLSchema"        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">   <record>     <field xsi:type="xsd:string" name="LastName">Anderson</field>     <field xsi:type="xsd:string" name="FirstName">Beth</field>   </record>   <record>     <field xsi:type="xsd:string" name="LastName">Austin</field>     <field xsi:type="xsd:string" name="FirstName">Dorothea</field>   </record>   <record>     <field xsi:type="xsd:string" name="LastName">Austin</field>     <field xsi:type="xsd:string" name="FirstName">Elizabeth</field>   </record>   <record>     <field xsi:type="xsd:string" name="LastName">Ayers</field>     <field xsi:type="xsd:string" name="FirstName">Lydia</field>   </record> </table>

Warning

Depending on the platform, inputting the above arguments can be a little tricky. The SQL query contains white space, and the JDBC URL contains characters that are important to the shell, such as & . You may need to use quote marks to prevent some of these arguments from being interpreted by the shell. Details vary from platform to platform. Of course, this is just a quick hack to demonstrate the filter. A real program would provide a GUI, making these points moot.