Having discussed the rationale for the file organization and element naming decisions in Chapter 2, we focus here on the basics of creating and writing out an XML document using the DOM.
Here's the pseudocode for the main routine. In the previous chapter I created a CSVRowWriter class to handle walking through the Column Elements of a single Row Element and writing them to the output stream. This routine uses a CSVRowReader to parse an input CSV row and transform it into a Row and child ColumnXX elements.
Logic for the Main Routine
Parse input arguments from command line
IF help option is specified
  display help message and exit
ENDIF
Open input file
Set up DOM XML environment (dependent on implementation)
Create the Output XML Document (dependent on implementation)
Root Element <- Call Document's createElement method, with tagName of SimpleCSV
Document <- Call Document's appendChild to append Root Element to Document
Initialize CSVRowReader object using the Output Document, and the Root Element
Read first record from file
DO until all records are read
  Increment Row Count
  Call CSVRowReader parse method, passing the record read from the input file
  Call CSVRowReader write method
  Read next record from input file
ENDDO
Save Output XML Document (dependent on implementation)
Close input file
The main part of the work is done by the CSVRowReader class. However, before we move on let's look a bit more at a few lines relating to the DOM.
Create the Output XML Document (dependent on implementation)
This is another case where the DOM recommendation (at least through Level 2 of the DOM) is silent, so we can expect there will be variations among the different DOM implementations. The one thing that both the JAXP and MSXML implementations have in common is that they both create a pointer to the Document Node. The Document Node, basically, has no content whatsoever. It just gives us a hook from which we can start hanging things.
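As a rough illustration of what "creating the Output XML Document" can look like on the Java side, here is a minimal sketch using the standard JAXP bootstrap classes that ship with the JDK. The class name EmptyDocumentSketch and the helper newEmptyDocument are hypothetical names chosen for this example, and other DOM implementations will do this step differently.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical example class: shows one way to obtain an empty Document Node.
class EmptyDocumentSketch {

    // Asks JAXP for a DocumentBuilder and uses it to create an empty Document.
    static Document newEmptyDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.newDocument();
        } catch (ParserConfigurationException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new IllegalStateException("No usable JAXP parser configuration", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newEmptyDocument();
        // A freshly created Document has no root Element and no children.
        System.out.println("Has root element: " + (doc.getDocumentElement() != null));
        System.out.println("Has child nodes:  " + doc.hasChildNodes());
    }
}
```

The point of the sketch is only that the new Document is an empty hook: nothing exists under it until we create Nodes and append them.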
Root Element <- Call Document's createElement method, with tagName of SimpleCSV
Document <- Call Document's appendChild to append Root Element to Document
The first thing we need to do is to create the root Element of the document, calling the Document interface's createElement method. This takes a single argument: the name of the Element to be created. In our example output above, this is the SimpleCSV Element that has all of the Row Elements as children. Note for future reference that any time we want to create a DOM Node within a document, we call the appropriate create method from the Document interface, which is where the create methods for Elements, Text Nodes, and the other Node types live. (Attributes are the one exception: they can also be set directly through the Element interface's setAttribute method.) Another important thing to note is that until we tie a new Node to something, it is just floating off by itself in space. We must use the appendChild method of the appropriate parent Node to hang it on the tree. In this case, we want the root Element to be a child of the Document Node, so we call the Document interface's appendChild method.
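The two pseudocode lines above might be sketched in Java as follows. This assumes the JAXP classes bundled with the JDK; the class name RootElementSketch and the helper newDocumentWithRoot are hypothetical names used only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: creates a Document and hangs a root Element on it.
class RootElementSketch {

    static Document newDocumentWithRoot(String rootName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable JAXP parser configuration", e);
        }
        // createElement only manufactures the Node; it is not yet part of the tree.
        Element root = doc.createElement(rootName);
        // appendChild on the Document Node makes the new Element the document's root.
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = newDocumentWithRoot("SimpleCSV");
        System.out.println("Root element: " + doc.getDocumentElement().getTagName());
    }
}
```

Until the appendChild call runs, the Element returned by createElement has no parent, which is the "floating off by itself in space" state described above.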
In the DO loop we read a record, parse it, and then write it out to the DOM tree. We'll look at these methods shortly, but the thing to note here is that we are separating the reading, parsing, and output functions. This is consistent with good programming practice: Each routine performs a single logical function, that is, it has cohesion. It is also consistent with the design of most compilers. In the big picture, what the utilities in this book do is very similar to what compilers do.
Save Output XML Document (dependent on implementation)
Again, we'll see implementation differences. Both Xerces and MSXML implement the semantics of the DOM Level 3 save operation. In this case, we'll see that the actual Microsoft code looks closer to the standard than the Xerces code does.
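For comparison with the standard itself, here is a minimal sketch of a save using the DOM Level 3 Load and Save interfaces (org.w3c.dom.ls) as exposed by the JDK's bundled DOM implementation. This is not the Xerces or MSXML code discussed in this chapter; it assumes an implementation that reports the "LS" feature, and the names SaveSketch, sampleDocument, and serialize are hypothetical. It serializes to a String to stay self-contained; writing to a file would go through an LSOutput instead.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical example class: serializes a small DOM tree with DOM Level 3 LS.
class SaveSketch {

    // Builds SimpleCSV/Row/Column01 with the text "a" so there is something to save.
    static Document sampleDocument() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable JAXP parser configuration", e);
        }
        Element root = doc.createElement("SimpleCSV");
        doc.appendChild(root);
        Element row = doc.createElement("Row");
        root.appendChild(row);
        Element column = doc.createElement("Column01");
        row.appendChild(column);
        column.appendChild(doc.createTextNode("a"));
        return doc;
    }

    // Asks the DOM implementation for its Load and Save support and serializes the tree.
    static String serialize(Document doc) {
        Object feature = doc.getImplementation().getFeature("LS", "3.0");
        if (feature == null) {
            throw new IllegalStateException("This DOM implementation does not support LS 3.0");
        }
        LSSerializer serializer = ((DOMImplementationLS) feature).createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```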
The CSVRowReader parse and write methods do most of the work in this utility. In creating the CSVRowReader class, I use the following class attributes in addition to the parse and write methods:
The Output Document and root Element are initialized in the class constructor, where the Column Array is also set up.
In both of our implementations the parse method deals strictly with standard Java or C++ things and has nothing to do with XML per se. However, the logic of the parse method is of interest since getting the parsing algorithm correct is very important when dealing with CSV files (as well as when dealing with EDI in Chapter 9). Let's look at the pseudocode for the parse routine, then focus on the DO loop where the actual parsing occurs.
Logic for the CSVRowReader parse Method
Inputs - InputRecord
Local - ColumnNumber, CurrentPosition, QuoteOpen, ColumnActive
ColumnNumber <- 1
CurrentPosition <- 0
ColumnActive, QuoteOpen <- false
DO until CurrentPosition = Length of InputRecord
  DO CASE of InputRecord[CurrentPosition]
    Quote:
      IF QuoteOpen is true
        QuoteOpen <- false
      ELSE
        QuoteOpen <- true
      ENDIF
      BREAK
    Comma:
      IF QuoteOpen is false
        Increment ColumnNumber
        ColumnActive <- false
        BREAK
      ENDIF
      NOTE: We fall through to Other for commas within Quoted strings
    Other:
      IF ColumnActive is false
        ColumnActive <- true
      ENDIF
      ColumnArray[ColumnNumber] <- ColumnArray[ColumnNumber] + InputRecord[CurrentPosition]
  ENDDO CASE
  Increment CurrentPosition
ENDDO
In the DO loop we are performing character-by-character parsing. I have set this up to handle column text that may or may not be delimited by quotation marks. We can't do simple string token operations looking for commas because a comma might appear within a quoted string of text. That's the main reason we use quotation marks in the first place. In "compiler speak," we are doing simple left-to-right parsing with no lookahead.
There are many ways to implement such parsers. To start out simple in this book I'm using a DO CASE strategy. The general strategy allows us to recognize quotation marks as delimiters for text and not move them to the output. A comma indicates the end of a column unless it falls between an opening quotation mark and its closing quotation mark, in which case the comma is part of a text string. The first character in a column flags that the column is started. We then save that and all other characters, including the commas that have been "escaped" by the enclosing quotation marks, to the entry in the column array. Note that this algorithm can't deal with quotation marks that appear in column text, that is, that aren't used as delimiters. I didn't see a significant requirement to deal with that circumstance. Handling it would add more complexity to the algorithm than it was worth.
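To make the DO CASE strategy concrete, here is a minimal Java sketch of the same character-by-character scan, using a switch statement with a deliberate fall-through for quoted commas. It is a simplification under stated assumptions: the class name CsvParseSketch is hypothetical, the columns are collected in a growable list rather than a fixed column array, and the ColumnActive flag from the pseudocode is omitted because nothing in this sketch reads it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: left-to-right CSV row scan with no lookahead.
class CsvParseSketch {

    static List<String> parse(String record) {
        List<StringBuilder> columns = new ArrayList<>();
        columns.add(new StringBuilder());   // the first column always exists
        boolean quoteOpen = false;

        for (int pos = 0; pos < record.length(); pos++) {
            char c = record.charAt(pos);
            switch (c) {
                case '"':
                    // Quotation marks only toggle state; they are not copied to the output.
                    quoteOpen = !quoteOpen;
                    break;
                case ',':
                    if (!quoteOpen) {
                        // An unquoted comma ends the current column and starts the next.
                        columns.add(new StringBuilder());
                        break;
                    }
                    // A comma inside quotes falls through and is kept as text.
                default:
                    columns.get(columns.size() - 1).append(c);
            }
        }

        List<String> result = new ArrayList<>();
        for (StringBuilder column : columns) {
            result.add(column.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String column : parse("1001,\"Smith, John\",Dallas")) {
            System.out.println("[" + column + "]");
        }
    }
}
```

As with the pseudocode, a quotation mark that is meant as literal column text is not handled; it simply toggles the quote state.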
As noted, this is a fairly simple parsing algorithm appropriate for fairly simple input. EDI, on the other hand, while it has many similarities to CSV formats, has a much more complex grammar that calls for more sophisticated parsing techniques. Parsing non-XML input will get a lot more interesting later in the book!
The write method does the XML DOM work.
Logic for the CSVRowReader write Method
Inputs - None
Local - ColumnNumber, Column Name, Row Element, Column Element, Text Node
Row Element <- Call Document's createElement method with "Row" as the tagName
Root Element <- Call root Element's appendChild to add Row Element as child
DO for all ColumnNumber from 1 to MaxColumns
  IF ColumnArray[ColumnNumber] has content
    Column Name <- "Column" and ColumnNumber
    Column Element <- Call Document's createElement method with Column Name as the tagName
    Row Element <- Call Row Element's appendChild to add Column Element as child
    Text Node <- Call Document's createTextNode method with text from ColumnArray[ColumnNumber]
    Column Element <- Call Column Element's appendChild to add Text Node as child
  ENDIF
  Increment ColumnNumber
ENDDO
The first thing we do in this routine, similar to the main routine, is to create a Node from which we can hang everything else. We create the Row element and append it to the root Element of the Document Node. Note that in the main method we created the root Element and appended it to the Document Node itself. We then loop through the column array. For every column that has text we build the ColumnXX Element name, create and append the Column Element to the Row Element, then create and append a Text Node to the Column Element. The createTextNode method takes as an argument the text to be inserted into the Node.
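The write logic just described might look roughly like this in Java. It is a sketch, not the implementation presented later: the names RowWriteSketch, newDocumentWithRoot, and writeRow are hypothetical, the column text is passed in as a plain String array, and the two-digit zero-padded suffix (Column01, Column02, and so on) is an assumption about how the ColumnXX names are formed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: appends one Row with ColumnXX children to the root.
class RowWriteSketch {

    static Document newDocumentWithRoot(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable JAXP parser configuration", e);
        }
    }

    static Element writeRow(Document doc, Element root, String[] columns) {
        // Create the Row Element and hang it on the root Element.
        Element row = doc.createElement("Row");
        root.appendChild(row);

        for (int i = 0; i < columns.length; i++) {
            String text = columns[i];
            if (text == null || text.isEmpty()) {
                continue;   // columns without content produce no Element
            }
            // Assumed naming scheme: "Column" plus a two-digit, 1-based column number.
            String columnName = String.format("Column%02d", i + 1);
            Element column = doc.createElement(columnName);
            row.appendChild(column);
            // The column's text lives in a Text Node that is a child of the Column Element.
            column.appendChild(doc.createTextNode(text));
        }
        return row;
    }

    public static void main(String[] args) {
        Document doc = newDocumentWithRoot("SimpleCSV");
        Element row = writeRow(doc, doc.getDocumentElement(), new String[] {"a", "", "c"});
        System.out.println("Columns written: " + row.getChildNodes().getLength());
    }
}
```

Note that both the Element and the Text Node come from the Document's create methods, and each one is attached with appendChild on its intended parent.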