Enhancements and Alternatives | Using XML with Legacy Business Applications

So, what else would we like to do with the CSV to XML Converter? Some of the enhancements discussed in the previous chapter also apply to this utility. Here are a few others.

Validation of the Output Document

With both implementations we are confident of creating well-formed XML. However, wouldn't it be nice in some cases to be able to validate the instance document against a schema, too? This might be particularly important if you were going to send the document directly to a consuming application, especially if that application belongs not to you but to another organization. Chapter 5 will show how to do this.
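Chapter 5 covers validation properly. Purely as a preview, here is a minimal sketch of what validating an output document might look like with the standard javax.xml.validation API. The inline schema, the class name ValidateOutputSketch, and the isValid helper are all made up for illustration and are not part of the utilities developed in this chapter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical sketch: validates a small instance document against an inline schema.
class ValidateOutputSketch {

    // Assumed schema for illustration: a Row must contain Column01,
    // optionally followed by Column02.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Row'><xs:complexType><xs:sequence>"
      + "<xs:element name='Column01' type='xs:string'/>"
      + "<xs:element name='Column02' type='xs:string' minOccurs='0'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the document is valid against SCHEMA, false otherwise.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(SCHEMA)));
        } catch (SAXException e) {
            // A failure here means the schema is broken, not the instance document.
            throw new IllegalStateException("The schema itself could not be loaded", e);
        }
        try {
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, the first validation error is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Row><Column01>Smith</Column01></Row>"));
        System.out.println(isValid("<Row><Column02>Sue</Column02></Row>"));
    }
}
```

The second document is rejected because the assumed schema requires Column01. A real utility would load the schema from a file and report the validation message rather than just returning a flag.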

Advanced Functionality

As was the case with our XML to CSV Converter in Chapter 2, the CSV to XML Converter provides only some very basic functionality. There are several other things we might want such a converter to do.

Reading several input XML documents to create a single physical output CSV file. (In the conventional way of doing things, an XML document is stored on disk as one physical file. However, most batch-oriented import facilities take several logical documents in as one physical file.)
Using Element names that are more semantically meaningful for specific applications (e.g., LastName instead of Column01).
Performing data type conversions such as conversion from the W3C XML Schema ISO 8601 date formats to MM/DD/YY, DD/MM/YY, or other date formats.
Supporting characters other than a comma for the column delimiter.
Supporting characters other than a quotation mark for the text delimiter.

These form the start of the requirements list for the more capable utility we'll build in Chapter 7.
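To give a feel for the two delimiter items in the list above, here is a minimal sketch of splitting one input line when the column delimiter and the text delimiter are parameters instead of being hard-coded. The class and method names are hypothetical, and the sketch deliberately ignores escaped text delimiters inside a field, so it is not a complete CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one-line splitter with configurable delimiters.
class DelimiterSketch {

    // Splits a line into fields. Column delimiters that appear between a pair
    // of text delimiters are kept as data. The text delimiters themselves are
    // dropped from the result.
    static List<String> splitLine(String line, char columnDelim, char textDelim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inText = false;

        for (char c : line.toCharArray()) {
            if (c == textDelim) {
                inText = !inText;               // entering or leaving delimited text
            } else if (c == columnDelim && !inText) {
                fields.add(field.toString());   // end of the current column
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());           // last column has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // Semicolon as the column delimiter, apostrophe as the text delimiter.
        System.out.println(splitLine("'Smith; Sue';TX;;79852", ';', '\''));
    }
}
```

Passing the delimiters as arguments keeps the parsing loop unchanged; in a fuller utility they would more likely come from a parameter file or the command line.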

Some Observations

In reviewing the pseudocode and the implementations, it should be apparent that creating an XML instance document using the DOM is fairly simple. It requires a series of create operations on the appropriate Node type, followed by an appendChild method call to attach the Node to the appropriate parent. Other types of XML applications might require more complex processing, but a fairly serial conversion of one document format to another requires only these simple operations. There are some minor complexities involved in setting up the XML environment and dealing with the different save operations. However, even with error handling these still take fewer than 50 or so lines of code. The main complexity comes in keeping track of the Node to which we want to append the Node we're creating. If we take EDI as a rough indication, the majority of business documents shouldn't have trees more than about six nodes deep. With reasonably designed parsing and processing algorithms, dealing even with these types of documents shouldn't be very hard.
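The create-then-append pattern described above can be condensed into a short sketch using the standard JAXP and W3C DOM interfaces. This is not the chapter's converter: the class name, the rowToXml helper, and the Table root Element name are assumptions made for illustration, and the save step here simply serializes to a string.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: builds one Row from column values and serializes it.
class DomBuildSketch {

    static String rowToXml(String[] columns) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // Create each Node, then append it to the parent we are tracking.
            Element table = doc.createElement("Table"); // root name is an assumption
            doc.appendChild(table);
            Element row = doc.createElement("Row");
            table.appendChild(row);

            for (int i = 0; i < columns.length; i++) {
                if (columns[i] == null || columns[i].isEmpty()) {
                    continue; // this sketch writes no Element for an empty column
                }
                Element column = doc.createElement(String.format("Column%02d", i + 1));
                column.appendChild(doc.createTextNode(columns[i]));
                row.appendChild(column);
            }

            // Save operation: serialize the tree with an identity transform.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("XML environment setup or save failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rowToXml(new String[] {"Smith", "Sue", "", "Terlingua"}));
    }
}
```

Note that the only state carried through the loop is the row Element, which is the "Node to which we want to append" for every column. Deeper documents need a little more bookkeeping of that kind, but the operations stay the same.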

So, in addition to converting an XML format to a non-XML format, we've also now looked at the basic mechanics of converting from a non-XML format to an XML format. It should be clear that the DOM APIs are extremely useful and save an awful lot of coding. If you consider the previously discussed complexities of reliably parsing a simple CSV file, the idea of writing code to parse an instance document, parse one or more schemas, validate the schemas, and validate the instance document against the schemas should give you the heebie-jeebies! Use the APIs. Don't try to write your own code to read or create raw XML.

In addition to the obvious things we've explored so far, the APIs do one other thing that hasn't been obvious from our sample data. They also handle predefined entities for us. These entities represent special syntax characters declared in the basic XML 1.0 recommendation. Consider the following input row, mangled to include some of these special characters (note that I left out a quotation mark because my CSV parser can't handle it!).

 "<Smith","Sue>","Highway & 118",,"Ter'lingua","TX&","79852", ,,"desertrat@aol.com"

Running it through the Java utility produces the following output.

 <Row>
   <Column01>&lt;Smith</Column01>
   <Column02>Sue&gt;</Column02>
   <Column03>Highway &amp; 118</Column03>
   <Column05>Ter&apos;lingua</Column05>
   <Column06>TX&amp;</Column06>
   <Column07>79852</Column07>
   <Column10>desertrat@aol.com</Column10>
 </Row>

Running it through the C++ utility produces the output below.

 <Row>
   <Column01>&lt;Smith</Column01>
   <Column02>Sue&gt;</Column02>
   <Column03>Highway &amp; 118</Column03>
   <Column05>Ter'lingua</Column05>
   <Column06>TX&amp;</Column06>
   <Column07>79852</Column07>
   <Column10>desertrat@aol.com</Column10>
 </Row>

Since the <, >, &, ", and ' characters all have special meaning in XML syntax, the APIs can write them as the predefined entity references &lt;, &gt;, &amp;, &quot;, and &apos;, which don't give parsers problems. The reverse process is performed when we read an input XML instance document and convert it to another format; the real characters are restored.

As another implementation difference, note that Xerces converts the apostrophe to an entity, while MSXML doesn't. Both outputs are well-formed, since a literal apostrophe is legal in element content.
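The round trip is easy to see in isolation. The following minimal sketch writes a single Element containing special characters and then parses the result back, using only the standard JAXP interfaces. The class and its write and read helpers are hypothetical names for illustration. As the outputs above suggest, whether the apostrophe is written as an entity depends on the serializer in use, so the sketch makes no claim about that character's written form.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: shows predefined entities on write and restoration on read.
class EntityRoundTripSketch {

    // Wraps the raw text in a Column03 Element and returns the serialized XML.
    static String write(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element column = doc.createElement("Column03");
            column.appendChild(doc.createTextNode(text)); // raw text, no manual escaping
            doc.appendChild(column);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML and returns the text content of the root Element.
    static String read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Highway & 118 <Ter'lingua>";
        String xml = write(original);
        System.out.println(xml);        // the serialized form, with entities
        System.out.println(read(xml));  // the original characters, restored
    }
}
```

At no point does the calling code escape or unescape anything itself; that work belongs entirely to the API.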

Some of the other complexities you might encounter in parsing XML include whitespace handling, support for different character encodings (including multibyte characters), entities other than the predefined entities, and embedded comments. And these situations have only to do with parsing; validation is an entirely different endeavor. I try very hard in this book not to give unqualified recommendations because it is very rare that one size fits all. However, this is one case where I'll make an exception. Don't try to write your own code to parse XML if there's any way you can avoid it. The standard APIs will save you more time and work in the long run than you can imagine.

This fairly short chapter shows how creating XML using the DOM is actually pretty easy. Having laid the foundation with these basic techniques, we can proceed to address utilities that deal with more complex legacy file formats and situations. Much of what we will be concerned with in later chapters has more to do with these legacy formats than XML. However, these chapters will be of interest to XML-oriented developers who may not have previously dealt with these types of requirements. We will also look at some of the key XML technologies that support our overall architecture. We start in Chapter 4 with the W3C XML Schema language.