Additional DOM Processing Considerations and Strategies | Using XML with Legacy Business Applications

We talked in Chapters 2 and 3 about some basic DOM programming techniques. We're starting to deal with more complex documents, so it is appropriate to discuss some more advanced topics and potholes on the road to creating robust DOM programs that are easy to maintain.

A primary consideration is whether we want to retrieve data from a DOM document by name or whether we want to traverse the document tree in some particular order and process things as we encounter them. To access data by name we can use a few basic methods. The ones listed below are sufficient for the points I want to make, although other methods are also available.

getElementsByTagName : retrieves all descendant Elements that have the given tag name, returning them as a NodeList in document order; offered by the Document and Element interfaces
getAttribute : retrieves an Attribute value by name, returning the text value of the Attribute; offered by the Element interface
getNamedItem : retrieves a Node by name from a NamedNodeMap; offered by the NamedNodeMap interface
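As a rough illustration of these three methods, here is a minimal Java sketch using the DOM binding that ships with the JDK. The sample document, the class name ByNameAccess, and the helper method names are all assumptions made up for this example; only the DOM calls themselves come from the API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ByNameAccess {
    // Hypothetical sample; these element and attribute names are illustrative only.
    static final String SAMPLE =
        "<Invoice number=\"1001\"><Item/><Item/></Invoice>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // getElementsByTagName: counts the matching descendant Elements.
    static int countByName(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }

    // getAttribute: returns the Attribute's text value directly.
    static String valueViaElement(Element element, String attrName) {
        return element.getAttribute(attrName);
    }

    // getNamedItem: fetches the Attribute Node from the NamedNodeMap, then reads its value.
    static String valueViaMap(Element element, String attrName) {
        NamedNodeMap attributes = element.getAttributes();
        Node attribute = attributes.getNamedItem(attrName);
        return attribute == null ? null : attribute.getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element root = doc.getDocumentElement();
        System.out.println(countByName(doc, "Item"));
        System.out.println(valueViaElement(root, "number"));
        System.out.println(valueViaMap(root, "number"));
    }
}
```

The two attribute helpers should return the same text for an Attribute that exists. They differ when it is missing: in the Java binding getAttribute returns an empty string, while getNamedItem returns null.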

On the other hand, to traverse a document in a particular order, several attributes (and get methods) offered by the Node interface allow us to walk the Document tree in nearly any fashion we might want.

parentNode : the Node's parent, as a Node (usually an Element, or the Document itself for the root Element)
childNodes : a NodeList of all child nodes
firstChild : the first child Node of the Node; same as childNodes.item(0)
lastChild : the last child Node of the Node; same as childNodes.item(childNodes.length - 1)
nextSibling : the next sibling of the selected Node; corresponds to parentNode.childNodes.item(index of this Node + 1)
previousSibling : the previous sibling of the selected Node; corresponds to parentNode.childNodes.item(index of this Node - 1)
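To make the correspondences in this list concrete, here is a small Java sketch that visits the same children two ways: once with firstChild and nextSibling, and once through the childNodes NodeList. The sample document and the names SiblingWalk, namesBySiblings, and namesByList are hypothetical, chosen only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SiblingWalk {
    // Hypothetical sample with no whitespace between tags, so every child is an Element.
    static final String SAMPLE = "<A><B/><C/><D/></A>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Walks the children with firstChild and nextSibling.
    static List<String> namesBySiblings(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            names.add(child.getNodeName());
        }
        return names;
    }

    // Visits the same children through the childNodes NodeList and its item method.
    static List<String> namesByList(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            names.add(children.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Node root = parse(SAMPLE).getDocumentElement();
        System.out.println(namesBySiblings(root));
        System.out.println(namesByList(root));
        // lastChild and previousSibling walk the same list from the other end.
        System.out.println(root.getLastChild().getPreviousSibling().getNodeName());
    }
}
```

Both helpers should produce the same list of names, which is the point of the equivalences given above.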

In addition, several methods return a NodeList, and we can select individual Nodes from the list using its item method.

The hitch here is that we have to be a bit careful when mixing these two approaches because when we have retrieved an Element using one of these Node attributes, even though we might know it is an Element, according to the DOM it is a Node, which is the Element interface's base class. To get access to the Element interface's methods and attributes we have to cast the retrieved Node as an Element or copy it to an Element (the exact technique depends on the programming language). For example, in Visual C++ using MSXML we can do the following:

    IXMLDOMDocument2Ptr spMyDoc;
    IXMLDOMElementPtr spMyElement;
    IXMLDOMNodeListPtr spNodeList;
    ...
    spNodeList = spMyDoc->getElementsByTagName("TheName");
    spMyElement = spNodeList->item(0);
    MyValue = spMyElement->getAttribute("MyAttribute");

In Java using Xerces we could do the equivalent operations:

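    Document MyDocument;    // assumed to be loaded by the parser before use
    Element MyElement;
    NodeList MyNodeList;
    ...
    // Call the method on the Document instance, not on the Document interface
    MyNodeList = MyDocument.getElementsByTagName("TheName");
    // item is a method, so it takes parentheses; it returns a Node, hence the cast
    MyElement = (Element) MyNodeList.item(0);
    MyValue = MyElement.getAttribute("MyAttribute");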

There are two things to bear in mind here. The first is that casting a base class object to a derived class object, or otherwise creating a derived class object from a base class object, can be risky programming practice. If the base class object isn't really a member of the target derived class, you can trigger a runtime exception (a ClassCastException in Java, for example). You want to be absolutely sure about what you're doing. One way to avoid exceptions is to check the Node type before doing a cast or setting a derived class item equal to a base class item.
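Here is one way the type check might look in Java. This is only a sketch: the class SafeCast, the helper asElement, and the sample document are hypothetical names I've introduced for the illustration, and returning an Optional is just one of several reasonable ways to signal "not an Element."

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SafeCast {
    // Hypothetical sample: the root's first child is a Comment, its last child an Element.
    static final String SAMPLE = "<A><!-- a comment --><B id=\"1\"/></A>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Returns the Node as an Element only when its node type says it really is one.
    static Optional<Element> asElement(Node node) {
        if (node != null && node.getNodeType() == Node.ELEMENT_NODE) {
            return Optional.of((Element) node);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Node root = parse(SAMPLE).getDocumentElement();
        Node comment = root.getFirstChild();
        Node b = root.getLastChild();
        // The Comment is rejected without any exception being thrown.
        System.out.println(asElement(comment).isPresent());
        // The Element is cast safely, so the Element interface's methods are available.
        System.out.println(asElement(b).map(e -> e.getAttribute("id")).orElse("(not an Element)"));
    }
}
```

Casting the Comment Node directly with (Element) would fail at runtime, which is exactly the situation the check is meant to prevent.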

Casting Nodes back to Elements lets you use the getAttribute and getElementsByTagName methods of the Element interface, which can be very convenient. The second thing to bear in mind is that, aside from the runtime exceptions, you need to be aware of the types of the various interface attributes and the types returned by various methods. Compilers are usually very helpful (or annoying, depending on your perspective) in this regard, so you're probably not going to miss doing a cast when one is required.

However, if you want to completely avoid the potential problems, there are other approaches for retrieving items by their names. I don't use them very much in this book because they require a lot more code. However, here are the outlines for two basic options, depending on whether you want to get a child Element or an Attribute.

Element : Walk the list of the Element's child Nodes using either the childNodes NodeList or firstChild followed by calls to nextSibling. At each Node, check the nodeName to see whether you have the Element you are seeking.
Attribute : Use the Node's attributes attribute, which is a NamedNodeMap of the Element's Attributes. You can then use the getNamedItem method in the NamedNodeMap interface, passing the Attribute name. This returns the Attribute Node, and you can then call that Node's getNodeValue method to return the text of the Attribute.
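The two outlines above might be sketched in Java as follows, using only the Node and NamedNodeMap interfaces so that no cast is needed. The class NoCastLookup, its two helper methods, and the sample document are hypothetical. I've also added a node type check alongside the nodeName comparison, since a Processing Instruction's nodeName is its target and could coincide with an Element name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NoCastLookup {
    // Hypothetical sample; these element and attribute names are illustrative only.
    static final String SAMPLE =
        "<Record><Name>x</Name><Field type=\"date\">y</Field></Record>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Element option: walk the children and compare each nodeName; null if no match.
    static Node findChildByName(Node parent, String name) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE && name.equals(child.getNodeName())) {
                return child;
            }
        }
        return null;
    }

    // Attribute option: go through the attributes NamedNodeMap; null if not found.
    static String attributeValue(Node node, String attrName) {
        NamedNodeMap attributes = node.getAttributes(); // null unless the Node is an Element
        if (attributes == null) {
            return null;
        }
        Node attribute = attributes.getNamedItem(attrName);
        return attribute == null ? null : attribute.getNodeValue();
    }

    public static void main(String[] args) {
        Node root = parse(SAMPLE).getDocumentElement();
        Node field = findChildByName(root, "Field");
        System.out.println(field.getNodeName());
        System.out.println(attributeValue(field, "type"));
    }
}
```

Note that findChildByName looks only at direct children, whereas getElementsByTagName searches all descendants, so the two are not interchangeable in every document.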

You can easily see that getting an Element this way requires a lot more code, and probably more CPU cycles, than the getElementsByTagName method. Getting an Attribute doesn't necessarily visibly invoke a lot more processing in your code (though we can't be absolutely sure what the DOM API is doing for us), but it certainly is a lot more verbose than the Element's getAttribute method.

The bottom line here is that I recommend using whatever navigation or selection method seems appropriate to what you want to accomplish. If it involves casting a Node back to an Element, just be careful about it.

There is one other thing to be aware of when traversing part of a document tree using the Node attributes. Comments, Processing Instructions, and whitespace Text Nodes can all appear in a childNodes NodeList or be returned by nextSibling. (A NodeList created with the getElementsByTagName method contains only Elements.) If you want to process only Elements you will need to check for the Node type. You'll find several places in the Java and C++ code where I skip over non-Element Nodes.
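Skipping non-Element Nodes can be wrapped in a small helper. The following Java sketch shows one possible shape. The names ElementOnlyWalk, firstElementFrom, and childElementNames are hypothetical, as is the sample document, which deliberately mixes whitespace, a Comment, and a Processing Instruction in among two Elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementOnlyWalk {
    // Hypothetical sample: two Elements surrounded by whitespace, a Comment, and a PI.
    static final String SAMPLE =
        "<Root>\n  <!-- note -->\n  <A/>\n  <?target data?>\n  <B/>\n</Root>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Returns the given Node if it is an Element, otherwise the next sibling that is one.
    // Returns null when the sibling chain runs out.
    static Element firstElementFrom(Node node) {
        while (node != null && node.getNodeType() != Node.ELEMENT_NODE) {
            node = node.getNextSibling();
        }
        return (Element) node; // safe: node is either null or a checked Element
    }

    // Collects the tag names of the child Elements only, in document order.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Element e = firstElementFrom(parent.getFirstChild()); e != null;
                e = firstElementFrom(e.getNextSibling())) {
            names.add(e.getTagName());
        }
        return names;
    }

    public static void main(String[] args) {
        Node root = parse(SAMPLE).getDocumentElement();
        // The raw child count includes the Text, Comment, and PI Nodes.
        System.out.println(root.getChildNodes().getLength());
        System.out.println(childElementNames(root));
    }
}
```

Because the type check happens inside the helper, the cast in its return statement cannot fail.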

Let's look now at how I apply these considerations to the instance documents processed and created by our utilities and at how we define and process the file description documents.

As I said above, these are general purpose utilities that don't inherently know anything about the XML documents that correspond to our legacy formats. They know about the legacy formats only in abstract terms. For example, the CSV to XML utility doesn't know anything about an invoice file in CSV format or the individual XML format invoices it produces. It knows only about the abstract, general characteristics of CSV files. At runtime it gets a lot of information about the target XML format from the CSV Source file description document. But since it needs to be general purpose and process other types of business documents, we need to code for the common characteristics. So, as I said earlier, we just use Elements to represent the application data. This means that except for Elements that convey structure information, such as record and group description Elements, we treat all Elements with equal semantic significance (or insignificance). In practical terms this means that we don't have any reason to retrieve or process Elements by name when converting from XML to a legacy format. We instead just walk the document tree and process Elements according to their characteristics as described in the file description documents.

We have a somewhat different situation when processing our file description documents. When initializing the converters we want to retrieve specific information about the legacy formats. For example, we need to know the record terminators. If we want to include a schema reference in the target XML document, we need to retrieve the schema URL from the file description document. Finally, we need to retrieve the Grammar Element. In these cases we retrieve Elements and Attributes by name. However, when processing the grammar we take a somewhat different approach. In almost all cases we walk through the document's subtree containing the grammar representation on a Node-by-Node basis. This is because the way we navigate the grammar must match the way we process the source and target instance documents and files. We process legacy formats sequentially, which corresponds to walking the grammar subtree sequentially (a preorder traversal) except when we backtrack and repeat a group or record. The same logic applies to our source instance documents. We process them sequentially (as a preorder traversal of the DOM document tree), so we do the same thing with the grammar.
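The combination described here, fetching the Grammar Element by name and then walking its subtree in preorder, might look roughly like this in Java. Treat it as a sketch under stated assumptions: the text names the Grammar and RecordTerminator Elements, but the FileDescription, Group, and Record names, the name Attribute, and the overall layout of the sample are invented for the illustration, as are the class and method names. It also leaves out the backtracking needed to repeat a group or record.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class GrammarWalk {
    // Hypothetical file description document; only Grammar and RecordTerminator
    // are names taken from the text, the rest is assumed.
    static final String SAMPLE =
        "<FileDescription>"
      + "<RecordTerminator>CRLF</RecordTerminator>"
      + "<Grammar>"
      +   "<Group name=\"Invoice\">"
      +     "<Record name=\"Header\"/>"
      +     "<Record name=\"Detail\"/>"
      +   "</Group>"
      +   "<Record name=\"Trailer\"/>"
      + "</Grammar>"
      + "</FileDescription>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Preorder traversal: record the Node itself first, then each child subtree in order.
    static void preorder(Node node, List<String> visited) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            String name = element.getAttribute("name"); // empty string when absent
            visited.add(name.isEmpty() ? element.getTagName() : name);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            preorder(child, visited);
        }
    }

    // Retrieves the Grammar Element by name, then walks its subtree Node by Node.
    static List<String> walkGrammar(Document doc) {
        List<String> visited = new ArrayList<>();
        Node grammar = doc.getElementsByTagName("Grammar").item(0);
        if (grammar != null) {
            preorder(grammar, visited);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(walkGrammar(parse(SAMPLE)));
    }
}
```

The visiting order matches the order in which the records would appear in a sequentially processed file, which is why the grammar walk and the instance document walk can stay in step.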

I want to make one final, general point about DOM programming before moving on. Several of the DOM methods we call can return a null pointer if the requested object doesn't exist. If we then try to call a method of the returned object and the reference is null, we throw an exception. For example, if a document doesn't have a RecordTerminator Element, the getElementsByTagName method returns an empty NodeList, and calling item(0) on that NodeList returns a null pointer instead of a Node. We would throw an exception if we then tried to call a method on that null reference. In many scenarios safe programming would dictate that we check for such null pointers before making such method calls. However, because our file description document is validated against a schema, by the time we get to bits of code like this we know how the DOM Document in memory is going to look. So, we can dispense with most of our checks for null pointers. This is another argument in favor of performing schema validation if you can.
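For situations where the document has not been validated, a defensive lookup might be sketched as follows in Java. In the Java binding, getElementsByTagName hands back an empty NodeList when nothing matches, and it is item(0) that returns null, so that is the reference to test. The class NullSafeLookup, the helper firstText, the FileDescription wrapper, and the CRLF content are hypothetical.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NullSafeLookup {
    // Hypothetical samples: one with a RecordTerminator Element, one without.
    static final String WITH =
        "<FileDescription><RecordTerminator>CRLF</RecordTerminator></FileDescription>";
    static final String WITHOUT = "<FileDescription/>";

    // Parses an XML string, wrapping checked exceptions to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Returns the text of the first matching Element, or an empty Optional if there is none.
    static Optional<String> firstText(Document doc, String tagName) {
        NodeList matches = doc.getElementsByTagName(tagName); // empty, not null, on no match
        Node first = matches.item(0);                         // null when the list is empty
        if (first == null) {
            return Optional.empty();
        }
        return Optional.of(first.getTextContent());
    }

    public static void main(String[] args) {
        System.out.println(firstText(parse(WITH), "RecordTerminator").orElse("(none)"));
        System.out.println(firstText(parse(WITHOUT), "RecordTerminator").orElse("(none)"));
    }
}
```

With a schema-validated document the null branch should never be taken, which is why most of the code in this book omits it.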

To summarize, I think the main thing to get from this discussion and previous points is that there is no one-size-fits-all solution. There is no approach that will be best for all types of documents and processing situations. In addition, even for the same circumstances, different people are going to have different opinions about the "best" solution because they value differently the underlying requirements such as efficiency or maintainability. I hope that walking through some of these decisions I've made and the tradeoffs I've considered will help you deal with similar problems.