So, having built the basic functionality and seen how it works, what more can we do?
Validation of the Input Document
The most obvious enhancement we can make to this program is to validate the input document before we try to process it. As I previously pointed out, we can run into unexpected output, no output, or abnormal termination if the input document doesn't conform to the expected structure. There are two ways we can handle this.
The first way is to add code to the program in several places to deal with unexpected conditions. For example, in the main method we would check that there was at least one Row Node in the NodeList. We would do the same when checking for ColumnXX children of the Row Nodes in the CSVRowWriter write method. In addition, we would want to verify that the Row Node's children were Element Nodes and that their names started with Column before we tried to process them as columns. This is a lot of code to add, and you can easily imagine how much more you would need in more complex applications.
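To give a feel for how much checking this involves, here is a minimal Java sketch of the hand-coded approach. The class name StructureChecker and its methods are hypothetical, not part of the utility's source. The sketch also assumes that whitespace-only text between elements should be tolerated and that Row elements can be located with getElementsByTagName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the utility's actual code.
class StructureChecker {

    // Returns the problems found; an empty list means the document
    // has the Row/ColumnXX structure the converter expects.
    static List<String> check(Document doc) {
        List<String> problems = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("Row");
        if (rows.getLength() == 0) {
            problems.add("No Row elements found");
        }
        for (int i = 0; i < rows.getLength(); i++) {
            NodeList children = rows.item(i).getChildNodes();
            int columns = 0;
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    // Assumption: whitespace-only text is harmless.
                    boolean blankText = child.getNodeType() == Node.TEXT_NODE
                            && child.getNodeValue().trim().isEmpty();
                    if (!blankText) {
                        problems.add("Row " + (i + 1) + ": unexpected non-element child");
                    }
                } else if (!child.getNodeName().startsWith("Column")) {
                    problems.add("Row " + (i + 1) + ": unexpected element "
                            + child.getNodeName());
                } else {
                    columns++;
                }
            }
            if (columns == 0) {
                problems.add("Row " + (i + 1) + ": no ColumnXX children");
            }
        }
        return problems;
    }

    // Convenience for experiments: parse a document held in a String.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Even this small sketch ignores questions such as duplicate column numbers or non-numeric suffixes, which is part of why hand-coded checking grows so quickly.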
Fortunately, there's an easy way out. The second alternative is to validate the document against a DTD or schema before we process it, letting the DOM API do the heavy lifting. In both the Java and C++ implementations we have to add only a line or two of code. Having given you this teaser, I'll wait until Chapter 5 to actually show you the code. There are, however, other considerations that bear on whether we should validate against a schema. I'll discuss these in Chapter 12.
You should now ask, "So what do I validate against?" Ah, there's the question! In Chapter 4 I'll introduce the W3C XML Schema Recommendation and in Chapter 5 the specific schema that can be used for this program and the one developed in the next chapter.
Using a ColumnNumber Attribute
When presenting the chosen document design, I discussed some alternative approaches for naming the column Elements. Let's look now at how the code would change if we used a generic Element name (Column) and an Attribute indicating the column number (that is, approach 2, instead of my chosen ColumnXX approach 3).
In the CSVRowWriter write method, two steps (getting the column's NodeName and parsing the ColumnXX name to extract the number) are replaced by a single call to the Element Node's getAttribute method. This is shown in the pseudocode snippet below.
Column Number <- call Column Element's getAttribute to get value of ColumnNumber Attribute
Intuitively, this seems as if it would be a simpler approach since we would save one line of pseudocode and a few lines in Java and C++. From a programmer's perspective your intuition would be right. However, as noted earlier there are perspectives to consider other than just that of the person doing the coding. We need to consider the impact on the end users, and this approach makes their jobs harder. When given a choice, I'll always make things easier for the end user rather than the programmer.
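To make the comparison concrete, the following Java sketch shows both ways of obtaining the column number. The class ColumnNumbers and its method names are hypothetical. The sketch assumes the ColumnXX suffix and the ColumnNumber Attribute value are plain decimal digits.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the utility's actual code.
class ColumnNumbers {

    // Approach 3 (the chosen design): the number is part of the
    // Element name, for example Column07.
    static int fromName(Element column) {
        String name = column.getNodeName();
        return Integer.parseInt(name.substring("Column".length()));
    }

    // Approach 2: a generic Column Element carries the number in a
    // ColumnNumber Attribute. getAttribute returns an empty string
    // when the Attribute is absent, so parseInt would then throw.
    static int fromAttribute(Element column) {
        return Integer.parseInt(column.getAttribute("ColumnNumber"));
    }

    // Convenience for experiments: parse a fragment and return its root.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

For the same cell, the two designs would look like <Column07>x</Column07> and <Column ColumnNumber="7">x</Column>. The difference in code is small, which is why the end user's convenience can reasonably decide the matter.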
A Recursive Algorithm
Now let's talk about a more radical alternative approach to designing this utility. We have outlined in this program the basic strategy for walking through the tree that represents a DOM Document. For each selected Node we get a list of the Node's children. Then for each of the children we get a list of children and process them however we want. In keeping with my "clarity over cleverness" orientation, I've designed specialized routines that deal specifically with the expected structure of the input XML document. However, it isn't too hard to see how this approach can be generalized into a recursive algorithm that walks the complete tree and takes actions that are appropriate to each type or name of Node encountered. Listed below is the pseudocode for a recursive approach that processes Document, Row, Column, and Text Nodes.
Logic for the Main Routine: Recursive Implementation
Parse input arguments from command line
IF help option is specified
  display help message and exit
ENDIF
Set up DOM XML environment (dependent on implementation)
Load input XML Document (dependent on implementation)
Open output file
Initialize CSVRowWriter object
Call processNode, passing Document object and null pointers for Column Array and Column Text
Close input and output files
The main routine is quite similar to the original main routine. We see the differences primarily in the processNode method.
Logic for the CSVRowWriter.processNode Routine: Recursive Implementation
NodeList <- Get passed Node's childNodes attribute
DO CASE of Node's nodeName
  Document:
    // Process the Document's Row children
    DO for each of the Node's children
      Call processNode, passing child Node and null pointers for Column Array and Column Text
    ENDDO
  Row:
    // Process the Row's ColumnXX children
    Initialize Column Array
    DO for each of the Node's children
      Call processNode, passing child Node and pointer to Column Array, null pointer to Column Text
    ENDDO
    Output Buffer <- Call formatRow, passing Column Array
    Write Output Buffer
  ColumnXX:
    // Process the Column's #Text children
    Column Number <- Derive from Column Name
    DO for each of the Node's children
      Call processNode, passing child Node and null pointer to Column Array, pointer to Column Text
    ENDDO
    Column Array[Column Number] <- Column Text
  #Text:
    // Get the text
    Column Text <- get nodeValue
  Default:
    No operation
ENDDO
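The pseudocode above might be rendered in Java roughly as follows. This is a sketch under several assumptions, not the utility's actual code. The class name RecursiveCsvWriter, the MAX_COLUMNS limit, column numbering that starts at 1, and the simplified formatRow (no quoting or escaping) are all hypothetical. A StringBuilder stands in for the "pointer to Column Text" because Java strings are immutable. One adaptation: in the DOM, the Document Node's element child is the root Element, so the sketch descends through the root Element as well as the Document Node to reach the Rows.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the recursive design; assumes valid input.
class RecursiveCsvWriter {
    static final int MAX_COLUMNS = 99;          // assumption: two-digit XX
    private final StringBuilder out = new StringBuilder();

    void processNode(Node node, String[] columns, StringBuilder text) {
        NodeList children = node.getChildNodes();
        String name = node.getNodeName();
        Node parent = node.getParentNode();
        boolean isRoot = node.getNodeType() == Node.ELEMENT_NODE
                && parent != null && parent.getNodeType() == Node.DOCUMENT_NODE;

        if (node.getNodeType() == Node.DOCUMENT_NODE || isRoot) {
            // Descend toward the Row children.
            for (int i = 0; i < children.getLength(); i++) {
                processNode(children.item(i), null, null);
            }
        } else if (name.equals("Row")) {
            String[] rowColumns = new String[MAX_COLUMNS + 1];
            for (int i = 0; i < children.getLength(); i++) {
                processNode(children.item(i), rowColumns, null);
            }
            out.append(formatRow(rowColumns)).append('\n');
        } else if (name.startsWith("Column") && columns != null) {
            int number = Integer.parseInt(name.substring("Column".length()));
            StringBuilder columnText = new StringBuilder();
            for (int i = 0; i < children.getLength(); i++) {
                processNode(children.item(i), null, columnText);
            }
            columns[number] = columnText.toString();
        } else if (node.getNodeType() == Node.TEXT_NODE && text != null) {
            text.append(node.getNodeValue());
        }
        // Default: any other Node is ignored.
    }

    // Simplified stand-in for formatRow: joins columns 1..last with
    // commas and leaves missing columns empty. No quoting is done.
    static String formatRow(String[] columns) {
        int last = 0;
        for (int i = 1; i < columns.length; i++) {
            if (columns[i] != null) last = i;
        }
        StringBuilder row = new StringBuilder();
        for (int i = 1; i <= last; i++) {
            if (i > 1) row.append(',');
            if (columns[i] != null) row.append(columns[i]);
        }
        return row.toString();
    }

    // Convenience for experiments: convert a document held in a String.
    static String convert(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            RecursiveCsvWriter writer = new RecursiveCsvWriter();
            writer.processNode(doc, null, null);
            return writer.out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not convert XML", e);
        }
    }
}
```

Notice that every call carries both the columns and text arguments even though each branch uses at most one of them. The null checks in the Column and Text branches exist only to guard against a Node turning up where those arguments were not supplied.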
If you understand how recursive programming works, this routine should be easy enough to figure out. However, we need to ask this question: Just because we can solve this problem recursively, should we? If all we wanted to do was to walk the document tree and print out Node names and values, a recursive approach might be very appropriate. However, in this case we have a different kind of problem to solve. We do different types of processing for each type of Node we encounter. The different types of processing require that we pass arguments to the routine that are used only in specific cases and not in others. Overall, we need to ask whether or not this approach contributes to our goals of simplicity, understandability, and maintainability. You may run into some kinds of XML processing problems where a recursive algorithm contributes to those goals. In this case I don't think it does.
The one advantage to this type of recursive approach is that we have a Default case to handle unexpected Nodes. However, if we validate the input XML document, this advantage is moot. Even though I don't think that a recursive algorithm has any advantages for this particular utility, I did want to point it out to you in case you might find it applicable to a problem you need to solve. In Chapters 8 and 9 I present two cases where a recursive approach does make sense.
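For contrast, here is a sketch of the kind of task where recursion fits naturally: walking the whole tree and printing each Node's name and value. Every Node gets identical treatment, so no case-specific arguments are needed. The class name TreeDumper and the two-space indentation are hypothetical choices made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the utility's actual code.
class TreeDumper {

    // Appends one line per Node, indented by depth, then recurses
    // into the Node's children.
    static void dump(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        if (node.getNodeValue() != null) {
            out.append(" = ").append(node.getNodeValue());
        }
        out.append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            dump(children.item(i), depth + 1, out);
        }
    }

    // Convenience for experiments: dump a document held in a String.
    static String dump(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            dump(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling TreeDumper.dump("<Row><Column01>a</Column01></Row>") yields four lines: #document, Row, Column01, and #text = a, each indented one level deeper than the last.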
As useful as this XML to CSV utility is, it's probably well suited only for a limited range of applications. A different XML to CSV converter would be helpful for dealing with the following real-life situations.
These form the start of the requirements list for the more capable utility we'll build in Chapter 7.