Java Implementation | Using XML with Legacy Business Applications

The Java implementation is composed of the following source files:

XMLToCSVBasic.java : the main routine for the Java application
CSVRowWriter.java : a class with write and formatRow methods
SAXErrorHandler.java : a class that handles exceptions while parsing the input XML document

The formatRow method is not very relevant to the main thrust of the book, so I won't discuss it further here other than to make one comment. The formatRow method uses only a comma as a field delimiter and always encloses fields with quotes. In Chapter 7 we'll develop a more flexible approach.
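The formatRow listing isn't shown here, so the following is only a minimal sketch of the behavior just described: a comma as the only delimiter and quotes around every field. The class name CsvFormatSketch, the method signature, the 1-based column array, the treatment of missing columns as empty fields, and the doubling of embedded quotes are all assumptions, not the book's code.

```java
// Hypothetical sketch; not the actual CSVRowWriter.formatRow.
class CsvFormatSketch {

    // Formats columns 1..highestColumn as one CSV row.
    // Assumption: index 0 is unused and null entries become empty fields.
    static String formatRow(String[] columns, int highestColumn) {
        StringBuilder row = new StringBuilder();
        for (int i = 1; i <= highestColumn; i++) {
            if (i > 1) {
                row.append(',');               // comma is the only delimiter
            }
            String value = (columns[i] == null) ? "" : columns[i];
            // Assumption: embedded quotes are doubled so the field stays readable.
            row.append('"').append(value.replace("\"", "\"\"")).append('"');
        }
        return row.toString();
    }

    public static void main(String[] args) {
        String[] columns = {null, "Widget", null, "12.50"};
        System.out.println(formatRow(columns, 3));   // prints "Widget","","12.50"
    }
}
```

Always quoting keeps the routine trivial, at the cost of quoting numeric fields that some importers would rather see bare.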

main in XMLToCSVBasic.java

Let's start with this bit of pseudocode:

    Set up DOM XML environment (dependent on implementation)
    Load input XML Document (dependent on implementation)

In JAXP, we build a DOM Document with the DocumentBuilder class. However, we can't just declare a DocumentBuilder; we must get one from a DocumentBuilderFactory. A real-world analogy would be a metal worker who wants to build a barbecue for his backyard. He needs tools to build it, and the ultimate source of those tools is a tool and die maker. So, we declare a tool and die maker (the DocumentBuilderFactory), who then makes us a tool (the DocumentBuilder) that we can use to build our barbecue (the Document). Here's how the Java code looks.

From XMLToCSVBasic.java

    //  Set up DOM XML environment
    DocumentBuilderFactory Factory =
        DocumentBuilderFactory.newInstance();
    //  Create the new document builder
    DocumentBuilder Builder = Factory.newDocumentBuilder();
    //  Set the error handler
    Builder.setErrorHandler(new SAXErrorHandler());
    //  Load input XML Document (dependent on implementation)
    //  Use our DocumentBuilder's parse method on the
    //  disk file passed on the command line.
    docInput = Builder.parse(new File(sInputXMLName));

The Builder.parse call is where we start making the barbecue. Also note the Builder.setErrorHandler call. We'll talk more about that soon.
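To try the factory, builder, and document chain without a disk file, here is a small self-contained sketch. The class name DomLoadSketch and the inline XML string are made up for illustration, and it parses from a string rather than from the command-line file the real program uses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch of the factory -> builder -> document chain.
class DomLoadSketch {

    static Document load(String xml) throws Exception {
        // The tool and die maker
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The tool
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The barbecue; an InputSource lets us parse from memory instead of a File
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("<Table><Row><Column01>A</Column01></Row></Table>");
        System.out.println(doc.getDocumentElement().getNodeName());   // prints Table
    }
}
```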

The rest of the DOM-related code is pretty straightforward. Here's the Java code with the pseudocode embedded as comments.

    //  NodeList of Rows <- Call Document's
    //    getElementsByTagName for all elements named Row
    RowList = docInput.getElementsByTagName("Row");
    //  DO until Rows NodeList.item[index] is null
    //    Call CSVRowWriter write method, passing
    //        NodeList.item[index]
    //    Increment index
    //  ENDDO
    while (RowList.item(iRows) != null)
    {
      RowWriter.write(RowList.item(iRows));
      iRows++;
    }
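The loop relies on NodeList.item returning null once the index runs past the end of the list. A sketch of that walk next to the equivalent getLength-bounded loop follows; the class name RowListSketch, its helper methods, and the sample XML are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: two equivalent ways to walk the Row NodeList.
class RowListSketch {

    static NodeList rowsOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("Row");
    }

    // The null-terminated walk: item() returns null for an out-of-range index.
    static int countWithNullCheck(NodeList rows) {
        int index = 0;
        while (rows.item(index) != null) {
            index++;
        }
        return index;
    }

    // The same walk bounded by getLength(), collecting each Row node.
    static List<Node> collectWithLength(NodeList rows) {
        List<Node> collected = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            collected.add(rows.item(i));
        }
        return collected;
    }

    public static void main(String[] args) throws Exception {
        NodeList rows = rowsOf("<Table><Row/><Row/><Row/></Table>");
        System.out.println(countWithNullCheck(rows));          // prints 3
        System.out.println(collectWithLength(rows).size());    // prints 3
    }
}
```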

write in CSVRowWriter.java

As I said earlier, this is where most of the work gets done. Again, here's the DOM-relevant snippet of code with the pseudocode as comments. The passed Row Element is referred to as nRow. All the DOM work gets done in the DO loop.

From CSVRowWriter.java: write

    //  Columns NodeList <- Get Row's childNodes attribute
    ColumnList = nRow.getChildNodes();
    //  DO until Columns NodeList.item[index] is null
    while (ColumnList.item(iRowChildren) != null)
    {
      // Skip the Row's Text nodes
      if ( ColumnList.item(iRowChildren).getNodeType()
        != Node.ELEMENT_NODE)
      {
        iRowChildren++;
        continue;
      }

      // Get a shorthand name for this guy
      nColumn = ColumnList.item(iRowChildren);
      //  Column Name <- get NodeName attribute
      sColumnName = nColumn.getNodeName();
      //  Column Number <- Derive from Column Name
      //  (the digits that follow the six characters of "Column")
      iColumnNumber = Integer.parseInt(sColumnName.substring(6));
      //  IF Column Number > Highest Column
      //    Highest Column <- Column Number
      //  ENDIF
      if (iColumnNumber > iHighestColumn)
        iHighestColumn = iColumnNumber;
      //  Column Array [Column Number] <- get nodeValue of
      //    item[index] firstChild Node
      sColumnArray[iColumnNumber] =
        nColumn.getFirstChild().getNodeValue();
      //  Increment index
      iRowChildren++;
    //  ENDDO
    }
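To exercise this loop on its own, here is a self-contained sketch of the same column-collecting logic. The class name ColumnCollectSketch, the 99-column limit, and the null check for empty column elements are assumptions added for illustration; the snippet above does not guard against an empty column.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the column-collecting loop in write.
class ColumnCollectSketch {

    static final int MAX_COLUMNS = 99;   // assumption: ColumnXX has a two-digit suffix

    // Returns an array indexed by column number; unused slots stay null.
    static String[] collectColumns(Node row) {
        String[] columns = new String[MAX_COLUMNS + 1];
        NodeList children = row.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;                               // skip the Row's Text nodes
            }
            // "Column" is six characters long, so the number starts at offset 6
            int number = Integer.parseInt(child.getNodeName().substring(6));
            Node text = child.getFirstChild();
            // Assumption: an empty element such as <Column02/> yields an empty string
            columns[number] = (text == null) ? "" : text.getNodeValue();
        }
        return columns;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Row>\n\t<Column01>A</Column01>\n\t<Column03>C</Column03>\n</Row>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String[] columns = collectColumns(doc.getDocumentElement());
        System.out.println(columns[1] + " " + columns[2] + " " + columns[3]);   // prints A null C
    }
}
```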

WARNING

Watch out for unexpected Text Nodes with whitespace!

I need to point out one thing here. At the top of the loop we have these few lines:

    // Skip the Row's Text nodes
    if ( ColumnList.item(iRowChildren).getNodeType()
      != Node.ELEMENT_NODE)
    {
      iRowChildren++;
      continue;
    }

This is because the Row's NodeList had an unexpected Node preceding each ColumnXX Element Node. We have to skip them since the only children we want to process are the ColumnXX Element Nodes. This behavior was peculiar to JAXP and Xerces. I did not observe it with MSXML.

Interestingly, I observed this behavior only when processing a file that was "pretty printed," that is, each tag started on a new line with indentation. If I processed a file with no whitespace between the tags I didn't observe it. Some debugging statements in the code verified that these were indeed Text Nodes with contents of a new line and tab characters. (Thank you, XMLSPY!)

Another interesting thing is that calling the Row Element's normalize method did not make the Text Nodes go away or consolidate them into a single Text Node. However, if you think about it, a normalize call should not affect this behavior since, technically speaking, these are not adjacent Text Nodes. They occur between the Column Nodes, not next to each other.
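A quick way to see both observations for yourself is sketched below: a pretty-printed Row picks up whitespace Text Nodes, a compact one does not, and normalize leaves the count unchanged. The class name WhitespaceNodesSketch and the sample documents are made up for illustration, and the counts in the comments assume the JAXP parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: whitespace Text Nodes between ColumnXX elements.
class WhitespaceNodesSketch {

    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Counts the direct children of parent that are Text Nodes.
    static int countTextChildren(Node parent) {
        NodeList children = parent.getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.TEXT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Element pretty = parseRoot(
                "<Row>\n\t<Column01>A</Column01>\n\t<Column02>B</Column02>\n</Row>");
        System.out.println(countTextChildren(pretty));    // prints 3

        // normalize merges adjacent Text Nodes; these are separated by Elements
        pretty.normalize();
        System.out.println(countTextChildren(pretty));    // still prints 3

        Element compact = parseRoot(
                "<Row><Column01>A</Column01><Column02>B</Column02></Row>");
        System.out.println(countTextChildren(compact));   // prints 0
    }
}
```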

To eliminate these Text Nodes, the DocumentBuilderFactory class has a method called "setIgnoringElementContentWhitespace". (I love these verbose Java method names!) This method strips out the "ignorable" whitespace from the content of Elements that have only child Elements and no Text content. However, the parser only knows of such elements from a DTD or a schema, so you have to be validating the input for this method to have any effect. We haven't included validation yet in our code, so the setIgnoringElementContentWhitespace method wouldn't have helped us here. We will use this method later when we do validation in Chapter 5.
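As a rough illustration of that dependency on validation, the sketch below parses the same pretty-printed Row twice, once as our code does now and once with DTD validation and setIgnoringElementContentWhitespace both turned on. The class name IgnoreWhitespaceSketch and the small internal DTD are assumptions invented for this example, and the counts in the comments assume the JAXP parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: ignorable whitespace is dropped only when the parser
// has a DTD telling it that Row contains nothing but child Elements.
class IgnoreWhitespaceSketch {

    static final String XML =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE Row [\n"
          + "  <!ELEMENT Row (Column01, Column02)>\n"
          + "  <!ELEMENT Column01 (#PCDATA)>\n"
          + "  <!ELEMENT Column02 (#PCDATA)>\n"
          + "]>\n"
          + "<Row>\n\t<Column01>A</Column01>\n\t<Column02>B</Column02>\n</Row>";

    // Returns the number of child Nodes of the Row element.
    static int countRowChildren(boolean validateAndIgnore) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validateAndIgnore);
        factory.setIgnoringElementContentWhitespace(validateAndIgnore);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countRowChildren(false));   // prints 5: three Text Nodes, two Elements
        System.out.println(countRowChildren(true));    // prints 2: only the ColumnXX Elements
    }
}
```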

I suppose that the Xerces and MSXML developers could have long arguments about which parser behaves correctly, but it is of little interest to me. The bottom line is to know how your parser behaves and to code your programs accordingly. Files intended to be used for business application import/export or exchange with trading partners rarely have such "mixed content" where an Element can have both data and child Elements. However, beware of unexpected Text Nodes and be prepared to handle them. Don't always assume that a Node List has only Elements.

Error Handling

Both JAXP and the Xerces DOM implementation follow the basic Java architecture for exception handling. Consistent with that architecture, most of the code in the main method lies within a try block that is followed by various catch blocks. JAXP implements an interesting overall parsing architecture in that it allows you to use a different underlying XML parser than the default Xerces implementation. One way JAXP does this is by requiring such pluggable parsers to use the SAX classes related to exception handling. So, whether or not your parser is actually a SAX parser under the hood, you need a SAX exception handler if you want to handle parsing exceptions. The JAXP documentation (in Javadoc format) has this to say about declaring an exception handler for your DocumentBuilder:

If an application does not register an ErrorHandler, XML parsing errors will go unreported and bizarre behavior may result.

Given that, I tend to think it's a good idea to declare an exception handler. However, SAX parsing errors are just one kind of error, even if they tend to be the most involved to handle. Here are some other errors we want to catch:

SAX parsing exceptions : warnings, errors, fatal errors
DOM exceptions of all types : for example, attempting to reference nonexistent DOM nodes and using an out-of-range index into a NodeList
All other Java exceptions : most likely I/O exceptions

To catch SAX parsing exceptions we declare an error handler for our DocumentBuilder as follows:

 dbBuilder.setErrorHandler(new SAXErrorHandler());

The SAXErrorHandler class will be reused in all our Java utilities that read XML documents, so it is worth looking at in a bit of detail. It has four methods, the first three of which are required by the SAX ErrorHandler interface it implements.

warning : Under normal circumstances this will just print exception information to the standard system error stream and continue. However, it may itself also throw another SAXException if it encounters anything really bizarre.
error : This gets SAX-specific exception information and throws a new SAXException that carries it.
fatalError : This is virtually the same as the error method. It gets SAX-specific exception information and throws a new SAXException that carries it.
getSAXExceptionInfo : This utility formats interesting SAX exception information into a string.

Here's the code:

SAXErrorHandler.java

    // Standard libraries
    import java.io.*;

    // JAXP packages
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class SAXErrorHandler implements ErrorHandler
    {
      //  The first three methods are required by the SAX
      //  ErrorHandler interface. They deal with warnings, errors,
      //  and fatalErrors, respectively.
      public void warning(SAXParseException spe)
        throws SAXException
      {
        System.err.println("\nWarning during parsing " +
          getSAXExceptionInfo(spe));
      }

      public void error(SAXParseException spe)
        throws SAXException
      {
        String sInfo = "\nError during parsing " +
          getSAXExceptionInfo(spe);
        throw new SAXException(sInfo);
      }

      public void fatalError(SAXParseException spe)
        throws SAXException
      {
        String sInfo = "\nFatal error during parsing " +
          getSAXExceptionInfo(spe);
        throw new SAXException(sInfo);
      }

      //  This utility method gets SAX specific information about
      //  the exception
      private String getSAXExceptionInfo(SAXParseException spe)
      {
        String sInfo = "\n" +
          "URL    = " + spe.getSystemId() + "\n" +
          "Entity = " + spe.getPublicId() + "\n" +
          "Line = " + spe.getLineNumber() + "\n" +
          "Column = " + spe.getColumnNumber() + "\n" +
          "Text = " + spe.getMessage() + "\n";
        return sInfo;
      }
    }

The use of URL is not as goofy as it looks on first take. We can report not only parsing errors in the instance document we're working with but also errors in parsing any associated schema files. So, the URL may actually give us some useful information.

Our general strategy is to continue if we encounter a warning and bail out if we encounter anything more severe. When we bail out, the exception propagates all the way back out of the try block in the main routine, and its catch block executes.
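The continue-on-warning, bail-on-error strategy can be seen in miniature in the sketch below. It uses a cut-down stand-in handler rather than the SAXErrorHandler class itself; the names ErrorStrategySketch, StrictHandler, and tryParse, the message wording, and the sample documents are all assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch: warnings are reported, anything worse stops the parse.
class ErrorStrategySketch {

    static class StrictHandler implements ErrorHandler {
        @Override
        public void warning(SAXParseException spe) {
            System.err.println("Warning during parsing: " + spe.getMessage());
        }

        @Override
        public void error(SAXParseException spe) throws SAXException {
            throw new SAXException("Error during parsing, line "
                    + spe.getLineNumber() + ": " + spe.getMessage());
        }

        @Override
        public void fatalError(SAXParseException spe) throws SAXException {
            throw new SAXException("Fatal error during parsing, line "
                    + spe.getLineNumber() + ": " + spe.getMessage());
        }
    }

    // Returns null if the document parsed cleanly, otherwise the failure message.
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new StrictHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (Exception e) {
            // The handler's SAXException lands here, outside the parse call
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("<Table><Row/></Table>"));   // prints null
        System.out.println(tryParse("<Table><Row></Table>"));    // a "Fatal error during parsing" message
    }
}
```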

In keeping with the strategy of fairly simple error handling, the catch block for the main routine is short. We want it to handle DOM exceptions and anything else we might have tripped over, such as I/O errors. The DOMException class extends java.lang.RuntimeException and doesn't have any special methods associated with it. So, for DOM exceptions and everything else throwable we just print the message and a stack trace to the standard system error stream, then set an exit status to an error condition.
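A rough sketch of that try/catch shape follows. It is not the actual main routine: the class name MainCatchSketch, the run method, the inline XML, and the exit status values 0 and 1 are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch of the catch blocks in the main routine.
class MainCatchSketch {

    // Returns the exit status: 0 on success, 1 on any failure (assumed values).
    static int run(String xml) {
        int exitStatus = 0;
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            System.out.println("Root element: "
                    + doc.getDocumentElement().getNodeName());
        } catch (DOMException e) {
            // DOMException is a RuntimeException; the message is all we report
            System.err.println("DOM exception: " + e.getMessage());
            e.printStackTrace();
            exitStatus = 1;
        } catch (Throwable t) {
            // Parsing failures, I/O errors, and anything else we tripped over
            System.err.println("Exception: " + t.getMessage());
            t.printStackTrace();
            exitStatus = 1;
        }
        return exitStatus;
    }

    public static void main(String[] args) {
        System.exit(run("<Table><Row/></Table>"));
    }
}
```

Returning the status from run and calling System.exit only in main keeps the error path easy to exercise from a test.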