Flat File to XML: Detail Design | Using XML with Legacy Business Applications

Chapter 7, in dealing with CSV files, gave a good introduction to how the architecture is implemented. There are only a few significant differences between processing flat files and processing CSV files. So, while I'll present the pseudocode that shows the logic of the Java and C++ implementations for flat file conversion, the discussion will focus on the aspects that are different from CSV processing.

Main Program

As discussed in Chapter 6, in order to develop reusable modules that can be linked together into programs and not just cobbled together in scripts, we put the main processing logic in a routine that is called from a shell main program. The shell main program's functions, described next, are essentially identical to those of the CSV main routine.

Logic for the Shell Main Routine for Flat to XML

    Arguments:
      Input File Name
      Output Directory
      File Description XML Document
    Options:
      Validate output
      Help
    Validate and process command line arguments
    IF help option specified
      Display help message
      Exit
    ENDIF
    Create new FlatSourceConverter object, passing:
        Validation option
        Output Directory
        File Description XML Document
    Call FlatSourceConverter processFile method, passing
        the input file name
    Display completion message

FlatSourceConverter Class (Extends SourceConverter)

Overview

The FlatSourceConverter is the main driver for the actual conversion. It inherits all the attributes and methods of its base classes, the SourceConverter and Converter classes (see Chapter 6).

Attributes:

FlatRecordReader Object
Array of Strings Partner Array
Integer Partner Break Field Offset
Integer Partner Break Field Length
String Saved Partner ID

Methods:

Constructor
processFile
processGroup (added to SourceConverter base class)
testPartnerBreak

Methods

Constructor

The constructor method for our FlatSourceConverter object sets up that object as well as the FlatRecordReader object.

Logic for the FlatSourceConverter Constructor Method

    Arguments:
      Boolean Validation option
      String Output Directory
      String File Description Document Name
    Call base class constructor
    Initialize Partner Array
    Call loadFileDescriptionDocument from passed File Description
        Document Name
    Schema Location URL <- Call File Description Document's
        getElementsByTagName on "SchemaLocationURL", then
        getAttribute on "value"
    IF Schema Location URL is null and Validation is true
      Throw Exception
    ENDIF
    Partner Break Field Offset <- 0
    Partner Break Field Length <- 0
    NodeList Temp <- call File Description Document's
        getElementsByTagName for "PartnerBreak"
    IF (Temp length = 1)
      Partner Break Element <- Temp NodeList item(0)
      Partner Break Field Offset <- call Partner Break Element's
          getAttribute for "Offset", and convert to integer
      Partner Break Field Length <- call Partner Break Element's
          getAttribute for "Length", and convert to integer
    ENDIF
    Initialize Saved Partner ID
    Create FlatRecordReader object, passing:
        File Description Document
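As a rough illustration of the PartnerBreak lookup in the constructor, here is a minimal Java sketch that uses the standard DOM API. The element name "PartnerBreak" and the attribute names "Offset" and "Length" come from the logic above; the PartnerBreakConfig class, its parse helper, and the root element name used in the usage example are assumptions made for this sketch, not part of the utility.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical holder for the two partner break settings.
class PartnerBreakConfig {
    final int offset;
    final int length;

    PartnerBreakConfig(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    // Defaults to 0/0 unless exactly one PartnerBreak Element is present,
    // mirroring the "IF (Temp length = 1)" test in the pseudocode.
    static PartnerBreakConfig fromDocument(Document fileDescription) {
        NodeList temp = fileDescription.getElementsByTagName("PartnerBreak");
        if (temp.getLength() != 1) {
            return new PartnerBreakConfig(0, 0);
        }
        Element partnerBreak = (Element) temp.item(0);
        return new PartnerBreakConfig(
                Integer.parseInt(partnerBreak.getAttribute("Offset")),
                Integer.parseInt(partnerBreak.getAttribute("Length")));
    }

    // Helper for the sketch only: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse file description", e);
        }
    }
}
```

For example, parsing `<FileDescription><PartnerBreak Offset="2" Length="5"/></FileDescription>` (a made-up document) and passing it to fromDocument yields an offset of 2 and a length of 5, while a document with no PartnerBreak Element leaves both at zero, which later disables partner break testing.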

processFile

The main processing is driven by the FlatSourceConverter's processFile method. This method converts one input flat file into one or more output XML documents based on the input parameters.

Logic for the FlatSourceConverter processFile Method

    Arguments:
      String Input File Name
    Returns:
      Status or throws exception
    Output Directory Path <- Base Directory + directory separator
    Initialize Output Document to null
    Initialize Sequence Number
    Header Record Tag <- Get Grammar Element's "TagValue" Attribute
    Open input file
    Call FlatRecordReader's setInputStream method
    Record Length <- Call FlatRecordReader's readRecord method
    Record Tag <- Call FlatRecordReader's getRecordType method
    IF (Record Tag != Header Record Tag)
      Return error or throw exception
    ENDIF
    DO while Record Length >= 0
      Partner Break <- Call testPartnerBreak
      IF Partner Break = true
        Output Directory Path <- Base Directory +
            Partner ID + directory separator
        Lookup Partner in Partner Array
        IF Partner is not in Array
          Create output directory from Output Directory Path
          Partner Array <- Add Partner ID
        ENDIF
      ENDIF
      Create new Output Document
      Increment Sequence Number, and pad with leading zeroes
          to three digits
      Output File Path <- Output Directory Path + Root Element
          Name + Sequence Number + ".xml"
      Call FlatRecordReader's setOutputDocument method for new
          Output Document
      Record Length <- Call processGroup to process the document,
          passing the Root Element and the Grammar Element
      IF (Record Length > 0)
        Record Tag <- Call FlatRecordReader's getRecordType method
        IF (Record Tag != Header Record Tag)
          Return error or throw exception
        ENDIF
      ENDIF
      Call saveDocument
    ENDDO
    Close input file
    Display completion message with number of documents processed

There are several similarities between this processFile method and the CSVSourceConverter's processFile method. We do very similar processing for partner lookup, directory and file management, and saving documents. However, instead of processing records individually, we process a document as a whole using the processGroup method.
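The directory and file naming steps in processFile can be sketched in a few lines of Java. This is only an illustration: the OutputPaths class and its method names are hypothetical, and the sketch assumes, as the pseudocode suggests, that the base directory string already ends with a directory separator.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper illustrating the path construction in processFile.
class OutputPaths {
    // Base Directory + Partner ID + directory separator
    static String partnerDirectoryPath(String baseDirectory, String partnerId) {
        return baseDirectory + partnerId + File.separator;
    }

    // Output Directory Path + Root Element Name + Sequence Number + ".xml",
    // with the sequence number padded with leading zeroes to three digits.
    static String outputFilePath(String outputDirectoryPath,
                                 String rootElementName,
                                 int sequenceNumber) {
        String padded = String.format(Locale.ROOT, "%03d", sequenceNumber);
        return outputDirectoryPath + rootElementName + padded + ".xml";
    }
}
```

With a root Element named Invoice and a sequence number of 7, the file name portion comes out as Invoice007.xml.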

processGroup (Base Class SourceConverter Method)

We noted in the discussion of flat file grammars that the recursive definition for the group production lends itself to processing by a recursive algorithm. I also noted near the end of Chapter 2 that while we can often process XML using recursive algorithms, doing so doesn't always offer any advantages over nonrecursive approaches. However, for our purposes in this utility a recursive approach, implemented with the processGroup method, is quite appropriate and very powerful. It processes the first record in a group, then all the other records. If one of the records starts another logical group, the processGroup method calls itself.

Since both flat files and EDI files have the same logical group structures (at least when processing their XML representations in our architecture) we can use the same method for processing them. So, although we introduce the processGroup method in this chapter, we're actually adding it to the SourceConverter base class.

Logic for the SourceConverter processGroup Method

    Arguments:
      DOM Node Parent
      DOM Element Group Grammar
    Returns:
      Record Length of last input record read, or EOF
    Group Element Name <- Group Grammar getAttributeValue
        for "ElementName"
    Group Element <- createElement using Group Element Name
    Parent Element <- appendNode Group Element
    Grammar Element Name <- Group Grammar getNodeName
    IF Grammar Element Name = "Grammar"
        and Schema Location URL is not NULL
      Create namespace Attribute for SchemaInstance and append
          to Group Element
      Create noNamespaceSchemaLocation Attribute and append
          to Group Element
    ENDIF
    Child Node <- Get Group Grammar's firstChild
    DO while Child Node is not an Element Node
      Child Node <- nextSibling
    ENDDO
    Record Grammar Element <- Child Node
    Record Element Name <- Record Grammar Element's
        getAttributeValue for "ElementName"
    //  Process the first record in the group
    Call RecordReader's parseRecord
    Call RecordReader's toXML
    Call RecordReader's writeRecord
    //  This advance in the grammar makes sure that we don't repeat
    //  the starting record of the group
    Record Grammar Element <- Get next Record Element from
        Group Grammar
    Grammar Record Tag <- Record Grammar getAttribute for
        "TagValue"
    Record Length <- call RecordReader's readRecord
    DO until end of file
      Record Tag <- call RecordReader's getRecordType
      DO until Grammar Record Tag = Record Tag
        Record Grammar Element <- Get next Record Element from
            Group Grammar
        IF Record Grammar Element is NULL
          Return Record Length  //  This record is not part of the group
        ENDIF
        Grammar Record Tag <- Record Grammar getAttribute for
            "TagValue"
      ENDDO
      Grammar Element Name <- Record Grammar getNodeName
      IF Grammar Element Name = "GroupDescription"
        Record Length <- recursive call of processGroup
      ELSE
        Record Element Name <- Record Grammar getAttributeValue for
            "ElementName"
        Call RecordReader's parseRecord
        Call RecordReader's toXML
        Call RecordReader's writeRecord
        Record Length <- call RecordReader's readRecord
      ENDIF
    ENDDO
    Return Record Length

The logic is mostly straightforward, but it may help to review the recursive and termination cases. The first time processGroup is executed it is called from processFile after we have read the header record for a logical document. We pass processGroup the Document as the parent Node to which to append all the record Elements we create in processGroup. We also pass the complete Grammar Element from the file description document as the grammar for the group. After we have processed the header record, we advance to the next Element in the grammar. We then read the next record from the file. We advance Elements in the grammar until we match the record identifier of the record that we read from the file. If it is a normal record (that is, it doesn't start another group), we process it. However, if the matching grammar Element indicates that it starts a group (indicated by an Element name of "GroupDescription" instead of "RecordDescription"), we execute the recursive call. In this circumstance we pass the Document's root Element as the parent Element and the group's grammar Element (the GroupDescription Element) as the grammar. We thus start a new instance of processGroup and proceed as before.

We reach the termination case when we have read a record that is not part of the grammar of the current group. This case is recognized when we advance through all the grammar Elements that are children of the current group grammar and don't find a TagValue Attribute that matches the record's identifier. If we're processing a lower-level group in the logical document hierarchy (and a higher execution point on the stack), we exit processGroup and return to the previous invocation of processGroup. If the record is part of the group being processed by that invocation, we resume processing. However, if it isn't part of that group, we again exit. The same thing happens when the record we have read starts another iteration of the same type of record group: because the grammar has already advanced past the group's starting record, the current invocation exits and its caller handles the record. This continues until we finally exit back to processFile. If the current record is a header record, we save the current Document to disk and begin a new iteration of the while loop. However, if the record is not a header record, we have encountered a record that is either not defined for this type of document or is not where we have said it would be in the grammar. In that case we force an abnormal termination.
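To make the recursive and termination cases concrete, here is a much simplified Java sketch of the same control flow. It is not the utility's implementation: the Grammar and Node classes are hypothetical in-memory stand-ins for the grammar Elements and the output DOM, input records are reduced to their tag values, and the sketch tests the current grammar entry before advancing so that a repeated group can match again. It also assumes the first child of every group grammar is a plain record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupSketch {
    // Stand-in for a RecordDescription (no children) or GroupDescription.
    static final class Grammar {
        final String name;            // ElementName
        final String tag;             // TagValue (for a group, its first record's tag)
        final List<Grammar> children;
        private Grammar(String name, String tag, List<Grammar> children) {
            this.name = name; this.tag = tag; this.children = children;
        }
        static Grammar record(String name, String tag) {
            return new Grammar(name, tag, new ArrayList<Grammar>());
        }
        static Grammar group(String name, String tag, Grammar... children) {
            return new Grammar(name, tag, Arrays.asList(children));
        }
        boolean isGroup() { return !children.isEmpty(); }
    }

    // Stand-in for an output Element.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<Node>();
        Node(String name) { this.name = name; }
        String render() {
            if (children.isEmpty()) return name;
            StringBuilder sb = new StringBuilder(name).append('[');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).render());
            }
            return sb.append(']').toString();
        }
    }

    private final List<String> recordTags; // the "input file"
    private int pos = 0;                   // index of the current record

    GroupSketch(List<String> recordTags) { this.recordTags = recordTags; }

    // Returns true if a current record remains unprocessed, false at end of input.
    boolean processGroup(Node parent, Grammar group) {
        Node groupNode = new Node(group.name);
        parent.children.add(groupNode);
        // Process the first record in the group, then advance the grammar
        // so the starting record is not matched again in this invocation.
        groupNode.children.add(new Node(group.children.get(0).name));
        pos++;
        int g = 1;
        while (pos < recordTags.size()) {
            String tag = recordTags.get(pos);
            while (g < group.children.size() && !group.children.get(g).tag.equals(tag)) {
                g++;
            }
            if (g == group.children.size()) {
                return true; // termination case: record is not part of this group
            }
            Grammar item = group.children.get(g);
            if (item.isGroup()) {
                if (!processGroup(groupNode, item)) return false; // recursive case
            } else {
                groupNode.children.add(new Node(item.name));
                pos++;
            }
        }
        return false;
    }
}
```

Given a grammar of Header (H), a LineGroup of LineItem (L) and LineNote (N), and Trailer (T), the tag sequence H, L, N, L, T, H produces Invoice[Header,LineGroup[LineItem,LineNote],LineGroup[LineItem],Trailer] and returns with the second H still current, which is the point where processFile would start the next document.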

testPartnerBreak

This testPartnerBreak method serves the same purpose as the one in the CSVSourceConverter class, but the processing is a bit different. In addition to a few other minor differences, we trim trailing whitespace through the call to the FlatRecordReader's getFieldValue method.

Logic for the FlatSourceConverter testPartnerBreak Method

    Arguments:
      None
    Returns:
      Boolean - true if new partner and false if not
    IF Partner Break Field Length is zero
      Return false
    ENDIF
    Partner ID <- Call FlatRecordReader's getFieldValue method
        passing the Partner Break Field Offset and
        Partner Break Field Length
    IF Partner ID = Saved Partner ID
      Return false
    ENDIF
    Saved Partner ID <- Partner ID
    Return true
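The partner break test is small enough to show in full. The following Java sketch is illustrative only: the PartnerBreakSketch class is hypothetical, the record buffer is passed in as a String rather than read from a FlatRecordReader, offsets are assumed to be zero-based, and the inline extraction is a simplified stand-in for getFieldValue.

```java
// Hypothetical, self-contained version of the partner break test.
class PartnerBreakSketch {
    private final int breakOffset;
    private final int breakLength;
    private String savedPartnerId = "";

    PartnerBreakSketch(int breakOffset, int breakLength) {
        this.breakOffset = breakOffset;
        this.breakLength = breakLength;
    }

    // Returns true when the record's partner ID differs from the saved one.
    boolean testPartnerBreak(String recordBuffer) {
        if (breakLength == 0) {
            return false; // no partner break configured
        }
        // Clamp to the buffer so a short record does not throw.
        int start = Math.min(breakOffset, recordBuffer.length());
        int end = Math.min(breakOffset + breakLength, recordBuffer.length());
        String partnerId = recordBuffer.substring(start, end).replaceAll("\\s+$", "");
        if (partnerId.equals(savedPartnerId)) {
            return false;
        }
        savedPartnerId = partnerId;
        return true;
    }
}
```

With an offset of 2 and a length of 4, two consecutive records carrying ACME in that field report a break only on the first, and a following record carrying BOLT reports a break again.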

FlatRecordReader Class (Extends RecordReader)

Overview

The FlatRecordReader introduces only a few new attributes but has some new and overridden methods. It inherits several attributes and methods from its RecordReader and RecordHandler base classes (see Chapter 6).

Attributes:

Integer Fixed Record Length
Integer Record ID Field Offset
Integer Record ID Field Length

Methods:

Constructor
getFieldValue
getRecordType
parseRecord
readRecord
readRecordFixedLength

Methods

Constructor

In the FlatRecordReader constructor method, we primarily retrieve values from the file description document that the reader needs for processing. The logic should look familiar by now.

Logic for the FlatRecordReader Constructor Method

    Arguments:
      DOM Document File Description Document
    Call RecordReader base class constructor, passing File
        Description Document
    Record Format Element <- Get RecordFormat Element from File
        Description Document
    Child Element <- Get first childNode from Record Format Element,
        advancing over non-Element Nodes
    Fixed Record Length <- 0
    IF Child Element NodeName = "Fixed"
      Fixed Record Length <- Call Child Element's getAttribute for
          "Length"
    ELSE
      Record Terminator <- Call Child Element's getAttribute for
          "RecordTerminator"
      Call setTerminator to set the Record Terminators
    ENDIF
    Tag Info Element <- Get "RecordTagInfo" Element from File
        Description Document
    Record ID Field Offset <- Call Tag Info's getAttribute for
        "Offset"
    Record ID Field Length <- Call Tag Info's getAttribute for
        "Length"

getFieldValue

This method gets the complete value of a record's field with a single call. It is declared and implemented only in the FlatRecordReader class and isn't used in any of the other classes derived from the RecordReader class. The reason for this is that the other legacy formats (CSV and EDI) parse input records on a character-by-character basis. They save a field's characters to a DataCell object as they parse the record. In contrast, with the flat file format we parse records on a field-by-field basis, extracting fields from the record buffer based on offsets and lengths. This method is called from the FlatRecordReader's parseRecord method to perform that extraction. In addition, this method is designed to be used for extracting the record identifier, and until we have it we don't know how to parse the full record and load the DataCell Array.

The method is also called from the FlatSourceConverter's testPartnerBreak method and the FlatRecordReader's getRecordType method. We call it for these special cases and not as a general purpose method for extracting field values from the input record buffer since in Java and C++ that operation takes only a couple of lines of code. In addition to extracting the value, getFieldValue trims trailing whitespace and returns an error if the field is empty.

Logic for the FlatRecordReader getFieldValue Method

    Arguments:
      Integer Field Offset
      Integer Field Length
    Returns:
      Field value; throws exception or returns status
    IF Field Offset > Record Buffer Length
      Return error
    ENDIF
    IF Field Offset + Field Length > Record Buffer Length
      Field Length <- Record Buffer Length - Field Offset
    ENDIF
    Field Value <- Extract from Record Buffer according to
        passed Field Offset and adjusted Field Length
    Field Value <- Trim trailing whitespace (<= ASCII space
        character)
    IF length of Field Value = 0
      Return error
    ENDIF
    Return Field Value
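A minimal Java sketch of this logic follows. It assumes zero-based offsets and a record buffer held as a String, and it signals errors with an unchecked exception; the FieldExtractor class name and those choices are assumptions for illustration rather than details of the actual implementation.

```java
// Hypothetical static version of getFieldValue operating on a String buffer.
class FieldExtractor {
    static String getFieldValue(String recordBuffer, int fieldOffset, int fieldLength) {
        if (fieldOffset > recordBuffer.length()) {
            throw new IllegalArgumentException("Field offset is beyond the end of the record");
        }
        // Adjust the length when the field runs past the end of the buffer.
        int end = Math.min(fieldOffset + fieldLength, recordBuffer.length());
        String value = recordBuffer.substring(fieldOffset, end);
        // Trim trailing characters at or below the ASCII space character.
        int last = value.length();
        while (last > 0 && value.charAt(last - 1) <= ' ') {
            last--;
        }
        value = value.substring(0, last);
        if (value.isEmpty()) {
            throw new IllegalArgumentException("Field is empty");
        }
        return value;
    }
}
```

For the buffer "INV  12345", a call with offset 0 and length 5 returns "INV", and a call with offset 5 and length 20 returns "12345" because the length is cut back to the end of the buffer.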

getRecordType

This method extracts the value of the Record ID field from the input record buffer, trimming trailing whitespace. In the current implementation the field value is returned in its raw form, interpreted as alphanumeric string data.

Logic for the FlatRecordReader getRecordType Method

    Arguments:
      None
    Returns:
      Record ID tag value; throws exception or returns status
    Record Tag Value <- call getFieldValue, passing Record ID Field
        Offset and Record ID Field Length
    Return Record Tag Value or status

parseRecord

This method is not as interesting as the one in the CSVRecordReader because the grammar is not as complex. Essentially, we walk the FieldDescription Elements of the RecordDescription Element, extract the field contents according to their defined offsets and lengths, and create new DataCell objects.

Logic for the FlatRecordReader parseRecord Method

    Arguments:
      DOM Element Record Grammar
    Returns:
      Error status or throw exception
    Field Grammar NodeList <- call Record Grammar's
        getElementsByTagName for "FieldDescription"
    DO until end of Field Grammar NodeList
      Field Grammar Element <- Next item in Field Grammar NodeList
      Field Number <- call Field Grammar Element's getAttribute on
          "FieldNumber"
      Field Offset <- call Field Grammar Element's getAttribute on
          "Offset"
      Field Length <- call Field Grammar Element's getAttribute on
          "Length"
      IF Field Offset + 1 > Record Buffer Length
        Return success  //  End of record
      ENDIF
      IF Field Offset + Field Length > Record Buffer Length
        Field Length <- Record Buffer Length - Field Offset
      ENDIF
      Field Value <- Extract from Record Buffer according to
          Field Offset and adjusted Field Length
      New Cell <- call RecordHandler's createDataCell method, passing
          Field Number and Field Grammar Element
      Call New Cell's putFieldValue method, passing Field Value
    ENDDO
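The field-by-field extraction can be sketched in Java as follows. To keep the example self-contained, the field descriptions are held in a hypothetical FieldSpec class instead of FieldDescription Elements, and a map keyed by field number stands in for the DataCell Array; offsets are assumed to be zero-based.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlatParseSketch {
    // Stand-in for the FieldNumber, Offset, and Length Attributes.
    static final class FieldSpec {
        final int number;
        final int offset;
        final int length;
        FieldSpec(int number, int offset, int length) {
            this.number = number;
            this.offset = offset;
            this.length = length;
        }
    }

    // Returns field number -> raw field value, in grammar order.
    static Map<Integer, String> parseRecord(String recordBuffer, List<FieldSpec> fields) {
        Map<Integer, String> cells = new LinkedHashMap<Integer, String>();
        for (FieldSpec field : fields) {
            if (field.offset + 1 > recordBuffer.length()) {
                break; // end of record: remaining fields are absent
            }
            int length = field.length;
            if (field.offset + length > recordBuffer.length()) {
                length = recordBuffer.length() - field.offset; // truncated last field
            }
            cells.put(field.number,
                    recordBuffer.substring(field.offset, field.offset + length));
        }
        return cells;
    }
}
```

For a ten-character record "INV00042AB" and fields at offsets 0, 3, 8, and 20 with lengths 3, 5, 4, and 2, the first three fields come back as "INV", "00042", and "AB" (the third is truncated at the end of the buffer), and the fourth is skipped because it starts past the end of the record.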

readRecord

This convenience method provides a way to protect calling routines from the variations in how we physically process records for flat files. It performs either a fixed or variable length read, depending on the physical characteristics gleaned from the file description document.

Logic for the FlatRecordReader readRecord Method

    Arguments:
      None
    Returns:
      Record Length or EOF
    IF Fixed Record Length > 0
      Return call to readRecordFixedLength
    ENDIF
    Return call to base RecordReader's readRecordVariableLength

readRecordFixedLength

This method calls the language-specific routines for reading a fixed length record from a flat file.

Logic for the FlatRecordReader readRecordFixedLength Method

    Arguments:
      None
    Returns:
      Record Length or EOF
    Clear Record Buffer
    Record Buffer <- Call language-specific routine to read a fixed
        number of bytes from the input record stream
    IF not EOF
      Return Fixed Record Length
    ENDIF
    Return EOF
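In Java, the "language-specific routine" could be as simple as a loop over InputStream.read. The sketch below is one possible shape, not the book's code: the FixedLengthReader class, the EOF value of -1, and the decision to treat a short final record as EOF are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical reader for fixed length records.
class FixedLengthReader {
    static final int EOF = -1;

    private final InputStream in;
    private final byte[] recordBuffer;

    FixedLengthReader(InputStream in, int fixedRecordLength) {
        this.in = in;
        this.recordBuffer = new byte[fixedRecordLength];
    }

    // Returns the fixed record length, or EOF if a full record could not be read.
    int readRecordFixedLength() {
        Arrays.fill(recordBuffer, (byte) 0); // clear the record buffer
        int total = 0;
        try {
            // read() may return fewer bytes than requested, so loop until full.
            while (total < recordBuffer.length) {
                int n = in.read(recordBuffer, total, recordBuffer.length - total);
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total == recordBuffer.length ? recordBuffer.length : EOF;
    }

    // Current record as text, decoded one byte per character.
    String record() {
        return new String(recordBuffer, StandardCharsets.ISO_8859_1);
    }
}
```

Reading the ten bytes "AAAABBBBCC" with a record length of 4 returns two full records and then EOF, since the trailing two bytes do not make up a complete record.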