Other Approaches and APIs | Using XML with Legacy Business Applications

I've used the DOM for all the XML processing in the Java and C++ code for this book. If you've followed me up to this point I think you can probably understand why that was an appropriate choice. However, it may not be the most appropriate choice for all circumstances, given your own particular situation and requirements. I said in Chapter 1 that there were other approaches. It's time to talk about them a bit more.

Simple API for XML

The Simple API for XML (SAX) isn't a "real" standard in that it hasn't been blessed by any standards body, but SAX implementations probably conform to SAX about as closely as DOM implementations conform to the W3C Recommendation. Again, SAX is based on an event-driven model that uses callbacks for each type of XML construct it encounters in an input stream. Unlike the DOM, SAX just reads a data stream in serially and triggers events. It doesn't build a tree (or anything, for that matter) in memory. Thus it can be quite appropriate if you have to process very large XML documents that would consume considerable memory if processed using the DOM.
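To make the callback model concrete, here is a minimal sketch in Java using the JAXP SAX parser that ships with the JDK. The class name SaxElementCounter and the sample document are hypothetical and are not taken from this book's code. The handler simply counts start tags as the parser reports them, and no tree is built.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of the book's utilities.
class SaxElementCounter {

    // The parser calls startElement() once for every start tag it reads.
    static class CountingHandler extends DefaultHandler {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            counts.merge(qName, 1, Integer::sum);
        }
    }

    // Parses the document serially and returns element name -> occurrences.
    static Map<String, Integer> countElements(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><item>widget</item><item>gadget</item></order>";
        System.out.println(countElements(xml));
    }
}
```

The only state kept is whatever the handler chooses to keep, which is why memory use stays flat even for very large inputs.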

Several API libraries for Java and C++ support SAX, so it is widely available. However, a major downside with SAX, as I said in Chapter 1, is that there isn't a standard way to create documents using SAX. The Apache Xerces distribution has SAX-based classes for creating XML output. So does MSXML. However, they're different.
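For Java only, one possible route is the JAXP transformation API, which can accept SAX events and serialize them. The sketch below assumes the default TransformerFactory supports SAX (the JDK's built-in one does) and uses hypothetical names (SaxWriterSketch, buildOrder, and the order and item elements). It is not part of SAX itself, and it doesn't help on the C++ side.

```java
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration class; names are assumptions, not the book's.
class SaxWriterSketch {

    // Emits SAX events by hand and lets the TransformerHandler serialize them.
    static String buildOrder(String id, String item) throws Exception {
        // Assumption: the default factory is a SAXTransformerFactory.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl orderAttrs = new AttributesImpl();
        orderAttrs.addAttribute("", "id", "id", "CDATA", id);

        handler.startDocument();
        handler.startElement("", "order", "order", orderAttrs);
        handler.startElement("", "item", "item", new AttributesImpl());
        char[] text = item.toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "item", "item");
        handler.endElement("", "order", "order");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildOrder("A-100", "widget"));
    }
}
```

Note that you are responsible for calling the start and end events in a well-formed order, since nothing in this approach checks the nesting for you.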

Still, the choice of SAX over DOM is subject to several nonfunctional considerations, not the least of which is the programming paradigm. Are you more comfortable with event-based programming, or do you prefer working with trees? Take your pick.

Generated Class Bindings

An increasing number of products and freely available tools will generate Java or C++ classes for you if given a schema as input. Here are a few of those products.

The Castor project, described as "an open source data binding framework for Java" and "a mapping framework between Java objects, XML documents, SQL & OQL databases and LDAP directories." For details, visit www.exolab.org.
The Java Architecture for XML Binding (JAXB), which "provides an API and tools that automate the mapping between XML documents and Java objects." JAXB is part of the Java Web Services Developer Pack. More information and free downloads are available at http://java.sun.com/xml/jaxb.
XMLSPY, Version 5, Enterprise Edition, a proprietary offering in this space. According to the vendor, this product "includes a built-in Code generator which can generate program code bindings of XML Schema components in Java, C++, or Microsoft C#." You can find more information at www.xmlspy.com.

The promise of tools like these is that if you give them a schema, they'll generate all the code necessary to let you access an XML document just like you would any other C++ or Java object. There is no complicated DOM, SAX, or other lower-level XML-specific code to write. This solution may be superior to DOM programming for many situations and is probably worth your consideration. However, despite all its benefits we need to keep in mind some of the potential drawbacks.

Cost is one drawback. For tools like Castor and JAXB, cost isn't much of a consideration (though using open source software itself might be an issue for some organizations). For a tool like the Enterprise Edition of XMLSPY, cost could be a consideration.
There isn't a standard for generating Java or C++ classes from schemas, but if you think about it, a standard probably isn't applicable.
You don't want to modify the generated code. If the schema is changed you'll have to feed it to the tool again and generate new code, wiping out your changes.
Tools may favor some styles of schema design over others. I've not yet worked with any of these tools. However, intuitively and from discussions with colleagues who have, it seems that the specific code that gets generated may be dependent on the particular approach to schema design. Different schema features may yield different depictions in code. The tools may steer you toward certain schema design styles in order to optimize the generated code. The schema styles may or may not be desirable when considered from perspectives other than code generation. For example, a schema design style that yields very nice Java classes may not necessarily be the most understandable from data analysis or reusability perspectives. In addition, when you are working with schemas that others have created, the generated code may not be the most desirable from a programming language perspective. However, you may have no way to modify what gets generated.
Most of the tools I'm aware of make it very easy to use XML documents as input, but they may or may not do much for you when you need to serialize a document to disk. Pay careful attention to the features.
Determine which APIs the tool uses in the generated code. If it uses a standard API like Xerces or MSXML in the generated code you're probably pretty safe. If instead it calls proprietary APIs, do thorough testing. It may be too much of a black box.
If you need to perform schema validation, check how well the tool complies with the W3C XML Schema Recommendation.
The generated code may or may not be very efficient.
In the worst case, the generated code may not be bug free with all inputs.

I've been around long enough to remember some early code generation products and to remember that they never caught on despite the promised benefits. Do a thorough assessment. The tools may make processing small, simple documents very easy. However, for larger, more complex documents with many optional Elements and Attributes, most of your program logic may deal more with content than with the particular APIs. A code generator may or may not save you significant effort over DOM programming.
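To give a feel for what calling code looks like with a binding, here is a hand-written stand-in for the kind of class such a tool might generate. Everything in it is hypothetical: the PurchaseOrder class, its unmarshal method, and the element and attribute names are assumptions for illustration, not the output of Castor, JAXB, or XMLSPY. Internally it uses the DOM, as a generated class might. The point is only that the caller sees ordinary getters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for a generated binding class.
class PurchaseOrder {
    private final String customer;
    private final List<String> items = new ArrayList<>();

    private PurchaseOrder(String customer) {
        this.customer = customer;
    }

    String getCustomer() { return customer; }

    List<String> getItems() { return items; }

    // Hides the DOM calls that the application would otherwise write itself.
    static PurchaseOrder unmarshal(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        PurchaseOrder po = new PurchaseOrder(root.getAttribute("customer"));
        NodeList nodes = root.getElementsByTagName("Item");
        for (int i = 0; i < nodes.getLength(); i++) {
            po.items.add(nodes.item(i).getTextContent());
        }
        return po;
    }

    public static void main(String[] args) throws Exception {
        PurchaseOrder po = unmarshal(
                "<PurchaseOrder customer=\"Acme\"><Item>bolt</Item></PurchaseOrder>");
        System.out.println(po.getCustomer() + " ordered " + po.getItems());
    }
}
```

Even in this tiny case, the logic for deciding what to do with a missing customer or an empty item list still belongs to the application, which is the part a generator can't write for you.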

I'm sure that 40 years ago similar concerns were raised by old assembly language programmers warning about the drawbacks of third-generation languages like COBOL and FORTRAN. So, call me an old (or new) fuddy-duddy if you like. The best advice I can offer is to do a thorough evaluation, including testing with a wide variety of inputs, before you commit to a particular tool. Despite the potential drawbacks, I do need to say that these tools get one thing right. They start with the data model.

Options for Procedural Languages

It's not that you don't have any options with strictly procedural languages like C, COBOL, and FORTRAN. It's just that your options are somewhat limited and nonstandard. I'll discuss three basic options here.

The first option involves linking routines written in an object-oriented language that supports XML, probably C++, with your application. Digital Equipment (later Compaq and now part of Hewlett-Packard) as early as 1978 supported a calling and linking standard that allowed modules written in any language on the VMS operating system to call modules in any other language. Even today not all operating systems and development environments offer this support or offer it as transparently. However, many do. If you happen to be fortunate enough to have this option, it's probably the easiest route to adding XML support to your application. Develop all your XML handling modules in C++, design so that you're sure you can pass the data back and forth, and you have the job done.

The second option is specialized API libraries, software packages, or compilers that provide XML APIs directly to these procedural languages. Several open source and proprietary alternatives are available. Here are a few examples.

LT XML for C, open source: www.ltg.ed.ac.uk/software/xml/
Libxml in C, XML C library for the Gnome project, open source: http://xmlsoft.org
expat, James Clark's XML Parser Toolkit in C, open source: www.jclark.com/xml/expat.html, with version 2.0 work being done at http://expat.sourceforge.net
XML4cobol for COBOL, proprietary: www.xml4cobol.com
Fujitsu NetCOBOL compiler, proprietary: www.netcobol.com

APIs like these can certainly provide native XML support to an existing application. However, there are several issues to consider. As I said earlier, they generally don't provide native support for standard APIs like SAX and the DOM. Another issue is whether or not they support schema validation. Some older tools may not even support XML namespaces. Find out! Other issues to consider are the same as for any other development tool. The cost of the tool, its quality, its support, and the stability of the vendor (or the breadth and depth of the open source community) are usually of the highest importance.

The third option to consider is reassessing whether or not you really need native XML support. If your application is coded entirely in C, COBOL, or FORTRAN, you probably still do a lot of processing in batch mode. Unless you have particular requirements for real-time behavior (perhaps something similar to a CICS ^[*] transaction processing monitor), consider auxiliary stand-alone conversion utilities like those developed in this book.
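As a rough reminder of what such a stand-alone utility does, here is a minimal sketch in Java that turns a simple row-oriented XML document into CSV text that a batch program could read. The class name XmlToCsv and the assumed document shape (a root Element containing Row Elements, each with one child Element per field) are hypothetical. The utilities developed in this book handle far more than this sketch does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; the Row layout is an assumption.
class XmlToCsv {

    // Quotes a field only when it contains a comma, a quote, or a newline.
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Writes one CSV line per Row Element, one field per child Element.
    static String convert(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList rows = doc.getDocumentElement().getElementsByTagName("Row");
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < rows.getLength(); i++) {
            List<String> fields = new ArrayList<>();
            for (Node n = rows.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    fields.add(quote(n.getTextContent()));
                }
            }
            csv.append(String.join(",", fields)).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(convert(
                "<Table><Row><Id>1</Id><Name>Smith, J.</Name></Row></Table>"));
    }
}
```

The legacy application then reads the CSV file exactly as it reads any other batch input and never needs to know that XML was involved.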

^[*] Customer Information Control System (CICS) is IBM's venerable mainframe-based transaction processing control system.