Java Implementation | Using XML with Legacy Business Applications

In validation we again encounter one of the areas where it takes us many more lines of code to do something with Java and JAXP than it does with C++ and MSXML. This is partially because the JAXP and Xerces implementations give us many more configuration options for our parser. The other reason is that the current versions of the JAXP and Xerces implementations don't give us a handy way to validate a DOM Document in memory before we write it to disk. We have to do a bit of Java magic and create our own method.

Let's start with input validation because we have to build on these techniques when doing output validation.

Input Validation in XMLToCSVBasic.java

JAXP performs validation in the DocumentBuilder's parse method. There is no special "validate" method as such. We tell JAXP that we want validation by specifying the type of DocumentBuilder the DocumentBuilderFactory should make. This is kind of like knowing that you want to build a barbecue out of titanium instead of stainless steel and calling the toolmaker to have him make tools to work with the former rather than the latter.

So, what options do we want to set on the DocumentBuilderFactory? Aside from the obvious one of telling it we want to validate, there are a few others. Here's the relevant code added to XMLToCSVBasic.java.

Validation Code in XMLToCSVBasic.java

    //  Set up DOM XML environment
    DocumentBuilderFactory Factory =
      DocumentBuilderFactory.newInstance();
    //  Set the factory to create a Document Builder that
    //    is:
    //  Namespace aware - necessary for schema validation
    Factory.setNamespaceAware(true);
    //  Ignores whitespace on Element only nodes
    Factory.setIgnoringElementContentWhitespace(true);
    //  Ignores comments
    Factory.setIgnoringComments(true);
    //  Set the schema language - these attributes are
    //  specific to Xerces2
    Factory.setAttribute(JAXPConstants.JAXP_SCHEMA_LANGUAGE,
              JAXPConstants.W3C_XML_SCHEMA);
    //  Validating, if requested
    if (boValidate)
    {
      Factory.setValidating(true);
    }
    //  Create the new document builder
    DocumentBuilder Builder = Factory.newDocumentBuilder();

We set the options on the Factory after creating a new instance. We first set the Factory to be aware of namespaces. To validate an instance document against a schema the DocumentBuilder must at least be able to handle the XMLSchema-instance namespace where the noNamespaceSchemaLocation Attribute lives.

The next two properties don't directly affect validation, but we're interested in them anyway. As we mentioned in Chapter 2, Xerces returns "ignorable whitespace" as Text Nodes unless you specifically tell it not to. We're telling it not to. We're also telling it to ignore comments since we don't care about them.

The setAttribute method on the DocumentBuilderFactory is the next method that directly deals with validation. JAXP provides this mechanism for the purpose of setting parameters that govern the behavior of the underlying parser. In this case we're telling Xerces to use the W3C XML Schema language as its schema language when doing validation. These constants are defined in the org.apache.xerces.jaxp.JAXPConstants interface, and I have chosen to use them from that source. However, literal values are also accepted. JAXP_SCHEMA_LANGUAGE has a value of:

http://java.sun.com/xml/jaxp/properties/schemaLanguage

W3C_XML_SCHEMA has a value of:

http://www.w3.org/2001/XMLSchema

If we don't set this attribute of the Factory, we get validation against a DTD.
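As a rough illustration of the literal values at work, here is a minimal, self-contained sketch rather than the code from XMLToCSVBasic.java. It makes several assumptions that are not part of the chapter's program: the class name InputValidationSketch, the tiny Row/Qty schema, and the use of the JAXP schemaSource attribute to hand the parser a schema from memory (the chapter's documents name their schema through noNamespaceSchemaLocation instead). The anonymous ErrorHandler is only a stand-in that collects messages.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class InputValidationSketch {
    // Literal values of the two constants discussed in the text
    static final String SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
    // Assumption: JAXP 1.2 property for supplying the schema directly
    static final String SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    // Hypothetical schema used only for this illustration
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Row'><xs:complexType><xs:sequence>"
        + "<xs:element name='Qty' type='xs:int'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Parses the text with schema validation on and returns any problems found
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            factory.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            factory.setAttribute(SCHEMA_SOURCE,
                new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8)));
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Collect validation errors instead of letting the default handler print them
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { problems.add(e.getMessage()); }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            problems.add(String.valueOf(e.getMessage()));
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate("<Row><Qty>3</Qty></Row>"));
        System.out.println(validate("<Row><Qty>three</Qty></Row>"));
    }
}
```

The first call should report no problems, while the second should report that "three" is not a valid xs:int. The exact message text depends on the parser in use.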

The last option we set configures the Factory to build a validating parser via the setValidating method. Note that if we configure a validating parser and an instance document does not specify a schema through either the noNamespaceSchemaLocation or the schemaLocation Attributes, the parser throws an exception.

So, even though there is a bit of configuration that we have to do, input validation in Java with JAXP and Xerces is not very involved. Validation errors are handled in the same fashion as general parsing errors from instance documents that aren't well formed. They throw SAX exceptions that we catch and report using the SAXExceptionHandler.
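The catch-and-report pattern can be sketched roughly as follows. The SAXExceptionHandler source is not shown in this section, so the class below (ParseErrorReportSketch, a hypothetical name) is only an approximation of the idea: an ErrorHandler rethrows whatever the parser reports, and the caller turns the resulting SAXParseException into a one-line message. Validation is left off here to keep the sketch short, so it shows only the well-formedness case. With validation on, validity errors would arrive through the error method and take the same path.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ParseErrorReportSketch {

    // Parses the text and returns "ok" or a one-line description of the failure
    static String report(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow everything except warnings so the caller sees one exception type
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return "ok";
        } catch (SAXParseException e) {
            // SAXParseException carries the position of the problem in the input
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        } catch (Exception e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(report("<Row><Qty>3</Qty></Row>"));
        System.out.println(report("<Row><Qty>3</Row>"));
    }
}
```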

Output Validation in CSVToXMLBasic.java

Input validation was fairly simple once we performed the setup. On the other hand, validating output is not so easy with the current versions of JAXP and Xerces. This may have changed by the time you read this, so it will be worthwhile to check the JAXP and Xerces Javadocs to see what is available. However, right now we have only a few options.

We could do DOM Level 3 validation. The Xerces dom3 packages have some preliminary validation methods based on the DOM Level 3 drafts. However, the current Level 3 validation draft doesn't quite offer us exactly what we're looking for, which is a single method that validates a complete DOM Document. In addition, the Xerces dom3 packages come with warnings about instability. We don't want to use anything unstable, do we?

We could use the internal Xerces implementation classes and try to do validation the same way that Xerces does. There are such overwhelming disadvantages to this approach that I won't even discuss the benefits. Even if we could comfortably rely on these classes and methods staying stable between major releases, we are still tying ourselves more than I would like to one particular API implementation. The other obvious disadvantage is that we would need to figure out how to use these classes. Certainly doable, but not necessarily something I want to tackle within the scope of this book.

So, where does that leave us? The only mechanism that JAXP offers for validation is to validate while parsing. It may not be very elegant or efficient, but that is exactly what we'll do. We'll serialize our DOM Document to an in-memory I/O stream, then parse it from that stream to check for validation failures. Serializing the document an additional time (even though it is in memory rather than to disk) and parsing it is not exactly the most efficient approach. But, given our priorities and the tools we have to work with, it is the most appropriate approach. Remember, keeping things simple is more important to us than performance.

To implement this approach we create a DOMValidator class. The source code is in DOMValidator.java. This class has three private properties set up in the constructor.

ValOutput : a ByteArrayOutputStream to which we serialize our input DOM Document
ValSerializer : an XMLSerializer (such as was used in Chapter 3), configured to use ValOutput
Builder : a DocumentBuilder configured to validate against W3C schemas

The validation work is done in the validate method.

The validate Method from DOMValidator.java

    public void validate(Document docInput)
      throws SAXException, java.io.IOException
    {
      //  Serialize the document to the byte array output stream
      ValSerializer.serialize(docInput);
      ValOutput.flush();
      //  Set up the input byte stream for reading it back
      ByteArrayInputStream ValInput =
        new ByteArrayInputStream(ValOutput.toByteArray());
      //  Validate the document while parsing it from the input
      //  byte stream
      Builder.parse(ValInput);
    }

The code is really pretty simple. We serialize the DOM Document to the byte array output stream. Then we convert that to a byte array, which is passed via the constructor to a byte array input stream. Finally, we parse the serialized document by reading it from the byte array input stream. Because we are dealing with input and output streams we can throw an IOException, and because we are parsing we can throw a SAXException.
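For readers who want to experiment with the serialize-then-parse idea without the Xerces serializer on the classpath, the following is a minimal sketch under stated assumptions, not the DOMValidator.java source. It swaps the XMLSerializer for an identity Transformer from javax.xml.transform, supplies a hypothetical Row/Qty schema through the JAXP schemaSource attribute instead of relying on noNamespaceSchemaLocation, and uses an anonymous ErrorHandler as a stand-in for the chapter's handler. The names DomValidatorSketch and isValidQty are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DomValidatorSketch {
    // Hypothetical schema used only for this illustration
    private static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Row'><xs:complexType><xs:sequence>"
        + "<xs:element name='Qty' type='xs:int'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    private final ByteArrayOutputStream valOutput = new ByteArrayOutputStream();
    private final Transformer valSerializer;   // stands in for the XMLSerializer
    private final DocumentBuilder builder;     // configured to validate

    DomValidatorSketch() throws Exception {
        valSerializer = TransformerFactory.newInstance().newTransformer();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
            "http://www.w3.org/2001/XMLSchema");
        // Assumption: schema handed over in memory rather than named by the document
        factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaSource",
            new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8)));
        builder = factory.newDocumentBuilder();
        // Turn reported validation errors into thrown exceptions
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
    }

    void validate(Document docInput)
            throws SAXException, IOException, TransformerException {
        // Serialize the document to the in-memory output stream
        valSerializer.transform(new DOMSource(docInput), new StreamResult(valOutput));
        valOutput.flush();
        // Read the bytes back and validate while parsing
        ByteArrayInputStream valInput = new ByteArrayInputStream(valOutput.toByteArray());
        builder.parse(valInput);
    }

    // Builds a small DOM in memory and reports whether it passes validation
    static boolean isValidQty(String qtyText) {
        try {
            Document doc =
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element row = doc.createElement("Row");
            Element qty = doc.createElement("Qty");
            qty.setTextContent(qtyText);
            row.appendChild(qty);
            doc.appendChild(row);
            new DomValidatorSketch().validate(doc);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidQty("3"));
        System.out.println(isValidQty("three"));
    }
}
```

One detail worth noting: a ByteArrayOutputStream keeps accumulating bytes, so a validator instance that is used for more than one document would want to call reset() on the stream before serializing again. The sketch creates a fresh instance per document, so the question does not arise.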

We call this method from our main method as follows.

    if (boValidate)
    {
      Validator = new DOMValidator();
      Validator.validate(docOutput);
    }

If JAXP and Xerces eventually come up with something that provides the same functionality, we can just replace these calls and throw away DOMValidator. Until then, though kind of a kludge, this approach does the job well enough.