Chapter 7, in dealing with CSV files, gave a good introduction to how the architecture is implemented. There are only a few significant differences between processing flat files and processing CSV files. So, while I'll present the pseudocode that shows the logic of the Java and C++ implementations for flat file conversion, the discussion will focus on the aspects that are different from CSV processing.
As discussed in Chapter 6, in order to develop reusable modules that can be linked together into programs and not just cobbled together in scripts, we put the main processing logic in a routine that is called from a shell main program. The shell main program's functions, described next, are essentially identical to those of the CSV main routine.
Logic for the Shell Main Routine for Flat to XML
Arguments:
    Input File Name
    Output Directory
    File Description XML Document
Options:
    Validate output
    Help
Validate and process command line arguments
IF help option specified
    Display help message
    Exit
ENDIF
Create new FlatSourceConverter object, passing:
    Validation option
    Output Directory
    File Description XML Document
Call FlatSourceConverter processFile method, passing the input file name
Display completion message
FlatSourceConverter Class (Extends SourceConverter)
The FlatSourceConverter is the main driver for the actual conversion. It inherits all the attributes and methods of its base classes, the SourceConverter and Converter classes (see Chapter 6).
The constructor method for our FlatSourceConverter object sets up that object as well as the FlatRecordReader object.
Logic for the FlatSourceConverter Constructor Method
Arguments:
    Boolean Validation option
    String Output Directory
    String File Description Document Name
Call base class constructor
Initialize Partner Array
Call loadFileDescriptionDocument from passed File Description Document Name
Schema Location URL <- Call File Description Document's getElementsByTagName on "SchemaLocationURL", then getAttribute on "value"
IF Schema Location URL is null and Validation is true
    Throw Exception
ENDIF
Partner Break Offset <- 0
Partner Break Length <- 0
NodeList Temp <- call File Description Document's getElementsByTagName for "PartnerBreak"
IF (Temp length = 1)
    Partner Break Element <- Temp NodeList item(0)
    Partner Break Offset <- call Partner Break Element's getAttribute for "Offset", and convert to integer
    Partner Break Length <- call Partner Break Element's getAttribute for "Length", and convert to integer
ENDIF
Initialize Saved Partner ID
Create FlatRecordReader object, passing:
    File Description Document
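As a rough illustration of the PartnerBreak lookup in the constructor, here is a minimal Java sketch using the standard DOM API. The class name PartnerBreakInfo, its parse helper, and the FileDescription root Element used in the usage note are assumptions made for this example; only the "PartnerBreak" Element and its "Offset" and "Length" Attributes come from the logic above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; the name is not taken from the utility itself.
final class PartnerBreakInfo {
    final int offset;
    final int length;

    PartnerBreakInfo(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    // Both values stay 0 unless exactly one PartnerBreak Element is present.
    static PartnerBreakInfo fromDescription(Document description) {
        NodeList temp = description.getElementsByTagName("PartnerBreak");
        if (temp.getLength() != 1) {
            return new PartnerBreakInfo(0, 0);
        }
        Element partnerBreak = (Element) temp.item(0);
        int offset = Integer.parseInt(partnerBreak.getAttribute("Offset"));
        int length = Integer.parseInt(partnerBreak.getAttribute("Length"));
        return new PartnerBreakInfo(offset, length);
    }

    // Parses a description held in a String (assumed here for brevity;
    // a real converter would load it from a file). Checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse file description", e);
        }
    }
}
```

For example, a description containing `<PartnerBreak Offset="2" Length="8"/>` would yield an offset of 2 and a length of 8, while a description with no PartnerBreak Element leaves both at zero, which is the condition testPartnerBreak later treats as "no partner breaks."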
The main processing is driven by the FlatSourceConverter's processFile method. This method converts one input flat file into one or more output XML documents based on the input parameters.
Logic for the FlatSourceConverter processFile Method
Arguments:
    String Input File Name
Returns:
    Status or throws exception
Output Directory Path <- Base Directory + directory separator
Initialize Output Document to null
Initialize Sequence Number
Header Record Tag <- Get Grammar Element's "TagValue" Attribute
Open input file
Call FlatRecordReader's setInputStream method
Record Length <- Call FlatRecordReader's readRecord method
Record Tag <- Call FlatRecordReader's getRecordType method
IF (Record Tag != Header Record Tag)
    Return error or throw exception
ENDIF
DO while Record Length >= 0
    Partner Break <- Call testPartnerBreak
    IF Partner Break = true
        Output Directory Path <- Base Directory + Partner ID + directory separator
        Lookup Partner in Partner Array
        IF Partner is not in Array
            Create output directory from Output Directory Path
            Partner Array <- Add Partner ID
        ENDIF
    ENDIF
    Create new Output Document
    Increment Sequence Number, and pad with leading zeroes to three digits
    Output File Path <- Output Directory Path + Root Element Name + Sequence Number + ".xml"
    Call FlatRecordReader's setOutputDocument method for new Output Document
    Record Length <- Call processGroup to process the document, passing the Root Element and the Grammar Element
    IF (Record Length > 0)
        Record Tag <- Call FlatRecordReader's getRecordType method
        IF (Record Tag != Header Record Tag)
            Return error or throw exception
        ENDIF
    ENDIF
    Call saveDocument
ENDDO
Close input file
Display completion message with number of documents processed
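The directory and file naming steps in processFile can be sketched in Java as follows. This is only an illustration: the class name OutputNaming and both method names are invented for the example, and the caller is assumed to supply a base directory that already ends with a separator.

```java
import java.io.File;

// Hypothetical helper; the names here are illustrative only.
final class OutputNaming {
    // Output Directory Path + Root Element Name + zero-padded Sequence Number + ".xml"
    static String outputFilePath(String outputDirectoryPath, String rootElementName, int sequenceNumber) {
        // %03d pads with leading zeroes to at least three digits
        return outputDirectoryPath + rootElementName + String.format("%03d", sequenceNumber) + ".xml";
    }

    // Base Directory + Partner ID + directory separator
    static String partnerDirectoryPath(String baseDirectory, String partnerId) {
        return baseDirectory + partnerId + File.separator;
    }
}
```

With these assumptions, `OutputNaming.outputFilePath("out/", "Invoice", 7)` produces `out/Invoice007.xml`. Sequence numbers above 999 are not truncated; they simply use more digits.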
There are several similarities between this processFile method and the CSVSourceConverter's processFile method. We do very similar processing for partner lookup, directory and file management, and saving documents. However, instead of processing records individually, we process a document as a whole using the processGroup method.
processGroup (Base Class SourceConverter Method)
We noted in the discussion of flat file grammars that the recursive definition for the group production lends itself to processing by a recursive algorithm. I also noted near the end of Chapter 2 that while we can often process XML using recursive algorithms, doing so doesn't always offer any advantages over nonrecursive approaches. However, for our purposes in this utility a recursive approach, implemented with the processGroup method, is quite appropriate and very powerful. It processes the first record in a group, then all the other records. If one of the records starts another logical group, the processGroup method calls itself.
Since both flat files and EDI files have the same logical group structures (at least when processing their XML representations in our architecture) we can use the same method for processing them. So, although we introduce the processGroup method in this chapter, we're actually adding it to the SourceConverter base class.
Logic for the SourceConverter processGroup Method
Arguments:
    DOM Node Parent
    DOM Element Group Grammar
Returns:
    Record Length of last input record read, or EOF
Group Element Name <- Group Grammar getAttributeValue for "ElementName"
Group Element <- createElement using Group Element Name
Append Group Element to Parent Node
Grammar Element Name <- Group Grammar getNodeName
IF Grammar Element Name = "Grammar" and Schema Location URL is not NULL
    Create namespace Attribute for SchemaInstance and append to Group Element
    Create noNamespaceSchemaLocation Attribute and append to Group Element
ENDIF
Child Node <- Get Group Grammar's firstChild
DO while Child Node is not an Element Node
    Child Node <- nextSibling
ENDDO
Record Grammar Element <- Child Node
Record Element Name <- Record Grammar Element's getAttributeValue for "ElementName"
// Process the first record in the group
Call RecordReader's parseRecord
Call RecordReader's toXML
Call RecordReader's writeRecord
// This advance in the grammar makes sure that we don't repeat
// the starting record of the group
Record Grammar Element <- Get next Record Element from Group Grammar
Grammar Record Tag <- Record Grammar getAttribute for "TagValue"
Record Length <- call RecordReader's readRecord
DO until end of file
    Record Tag <- call RecordReader's getRecordType
    DO until Grammar Record Tag = Record Tag
        Record Grammar Element <- Get next Record Element from Group Grammar
        IF Record Grammar Element is NULL
            Return Record Length    // This record is not part of the group
        ENDIF
        Grammar Record Tag <- Record Grammar getAttribute for "TagValue"
    ENDDO
    Grammar Element Name <- Record Grammar getNodeName
    IF Grammar Element Name = "GroupDescription"
        Record Length <- Do recursive call of processGroup
    ELSE
        Record Element Name <- Record Grammar getAttributeValue for "ElementName"
        Call RecordReader's parseRecord
        Call RecordReader's toXML
        Call RecordReader's writeRecord
        Record Length <- call RecordReader's readRecord
    ENDIF
ENDDO
Return Record Length
The logic is mostly straightforward, but it may help to review the recursive and termination cases. The first time processGroup is executed it is called from processFile after we have read the header record for a logical document. We pass processGroup the Document as the parent Node to which to append all the record Elements we create in processGroup. We also pass the complete Grammar Element from the file description document as the grammar for the group. After we have processed the header record, we advance to the next Element in the grammar. We then read the next record from the file. We advance Elements in the grammar until we match the record identifier of the record that we read from the file. If it is a normal record (that is, it doesn't start another group), we process it. However, if the matching grammar Element indicates that it starts a group (indicated by an Element name of "GroupDescription" instead of "RecordDescription"), we execute the recursive call. In this circumstance we pass the Document's root Element as the parent Element and the group's grammar Element (the GroupDescription Element) as the grammar. We thus start a new instance of processGroup and proceed as before.
We reach the termination case when we have read a record that is not part of the grammar of the current group. This case is recognized when we advance through all the grammar Elements that are children of the current group grammar and don't find a TagValue Attribute that matches the record's identifier. If we're processing a lower-level group in the logical document hierarchy (and a higher execution point on the stack), we exit processGroup and return to the previous iteration of processGroup. If the record is part of the group being processed by that iteration, we resume processing. However, if it isn't part of that group, we again exit. This holds true if the record we have read starts another iteration of the same type of record group. This continues until we finally exit back to processFile. If the current record is a header record, we save the current Document to disk and begin a new iteration of the while loop. However, if the record is not a header record, we have encountered a record that is either not defined for this type of document or is not where we have said it would be in the grammar. In that case we force an abnormal termination.
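To make the recursive and termination cases more concrete, here is a deliberately simplified Java sketch. It is not the utility's implementation: it replaces the DOM grammar with a small in-memory Grammar class, represents the input file as a list of record tags, and emits empty Elements into a StringBuilder instead of building a Document. All class and method names (GroupSketch, Grammar, convert) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for processGroup; names and data model are illustrative only.
final class GroupSketch {
    // A grammar node is either a record (no children) or a group whose first child is its starting record.
    static final class Grammar {
        final String name;             // Element name to emit
        final String tag;              // record tag; for a group, the tag of its starting record
        final List<Grammar> children;

        private Grammar(String name, String tag, List<Grammar> children) {
            this.name = name;
            this.tag = tag;
            this.children = children;
        }
        static Grammar record(String name, String tag) {
            return new Grammar(name, tag, Collections.<Grammar>emptyList());
        }
        static Grammar group(String name, Grammar... children) {
            return new Grammar(name, children[0].tag, Arrays.asList(children));
        }
        boolean isGroup() { return !children.isEmpty(); }
    }

    private final List<String> tags;   // record tags in file order
    private int pos = 0;               // index of the current (already read) record

    private GroupSketch(List<String> tags) { this.tags = tags; }

    static String convert(List<String> tags, Grammar root) {
        StringBuilder out = new StringBuilder();
        new GroupSketch(tags).processGroup(out, root);
        return out.toString();
    }

    private void processGroup(StringBuilder out, Grammar group) {
        out.append('<').append(group.name).append('>');
        // Process the first record in the group, then read the next record.
        out.append('<').append(group.children.get(0).name).append("/>");
        pos++;
        int g = 1;  // advance in the grammar so the starting record is not matched again
        while (pos < tags.size()) {
            String tag = tags.get(pos);
            // Advance through the grammar until the record tag matches.
            while (g < group.children.size() && !group.children.get(g).tag.equals(tag)) {
                g++;
            }
            if (g == group.children.size()) {
                break;  // termination case: the record is not part of this group
            }
            Grammar child = group.children.get(g);
            if (child.isGroup()) {
                processGroup(out, child);  // recursive case: the record starts a nested group
            } else {
                out.append('<').append(child.name).append("/>");
                pos++;
            }
        }
        out.append("</").append(group.name).append('>');
    }
}
```

Because the grammar position is never reset inside a group, a record that starts a second occurrence of the same nested group fails to match in the inner call, which returns to its caller; the caller is still positioned on that group's grammar entry and so starts a fresh recursive call for it.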
The testPartnerBreak method serves the same purpose as the one in the CSVSourceConverter class, but the processing is a bit different. In addition to a few other minor differences, we trim trailing whitespace through the call to the FlatRecordReader's getFieldValue method.
Logic for the FlatSourceConverter testPartnerBreak Method
Arguments:
    None
Returns:
    Boolean - true if new partner and false if not
IF Partner Break Length is zero
    Return false
ENDIF
Partner ID <- Call FlatRecordReader's getFieldValue method, passing the Partner Break Offset and Partner Break Length
IF Partner ID = Saved Partner ID
    Return false
ENDIF
Saved Partner ID <- Partner ID
Return true
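The partner break test might look something like the following in Java. This sketch is standalone rather than a method of FlatSourceConverter: the class name PartnerBreakDetector is invented, the record buffer is passed in as a String, the Saved Partner ID is assumed to start out empty, and the field extraction is inlined instead of calling the reader's getFieldValue.

```java
// Hypothetical standalone version of the partner break test; names are illustrative.
final class PartnerBreakDetector {
    private final int partnerBreakOffset;
    private final int partnerBreakLength;
    private String savedPartnerId = "";  // assumed initial value

    PartnerBreakDetector(int partnerBreakOffset, int partnerBreakLength) {
        this.partnerBreakOffset = partnerBreakOffset;
        this.partnerBreakLength = partnerBreakLength;
    }

    // Returns true only when the current record belongs to a different partner.
    boolean testPartnerBreak(String recordBuffer) {
        if (partnerBreakLength == 0) {
            return false;  // no partner break field was defined
        }
        String partnerId = extractTrimmed(recordBuffer);
        if (partnerId.equals(savedPartnerId)) {
            return false;
        }
        savedPartnerId = partnerId;
        return true;
    }

    // Extracts the partner field and trims trailing characters <= ASCII space.
    private String extractTrimmed(String recordBuffer) {
        int start = Math.min(partnerBreakOffset, recordBuffer.length());
        int end = Math.min(partnerBreakOffset + partnerBreakLength, recordBuffer.length());
        while (end > start && recordBuffer.charAt(end - 1) <= ' ') {
            end--;
        }
        return recordBuffer.substring(start, end);
    }
}
```

Trimming before the comparison means that a partner ID padded to the full field width in one record and cut short at the end of another record still compares as equal.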
FlatRecordReader Class (Extends RecordReader)
The FlatRecordReader introduces only a few new attributes but has some new and overridden methods. It inherits several attributes and methods from its RecordReader and RecordHandler base classes (see Chapter 6).
In the FlatRecordReader constructor method, we primarily retrieve values from the file description document that the reader needs for processing. The logic should look familiar by now.
Logic for the FlatRecordReader Constructor Method
Arguments:
    DOM Document File Description Document
Call RecordReader base class constructor, passing File Description Document
Record Format Element <- Get RecordFormat Element from File Description Document
Child Element <- Get first childNode from Record Format Element, advancing over non-Element Nodes
Fixed Record Length <- 0
IF Child Element NodeName = "Fixed"
    Fixed Record Length <- Call Child Element's getAttribute for "Length"
ELSE
    Record Terminator <- Call Child Element's getAttribute for "RecordTerminator"
    Call setTerminator to set the Record Terminators
ENDIF
Tag Info Element <- Get "RecordTagInfo" Element from File Description Document
Record ID Field Offset <- Call Tag Info's getAttribute for "Offset"
Record ID Field Length <- Call Tag Info's getAttribute for "Length"
The getFieldValue method gets the complete value of a record's field with a single call. It is declared and implemented only in the FlatRecordReader class and isn't used in any of the other classes derived from the RecordReader class. The reason for this is that the other legacy formats (CSV and EDI) parse input records on a character-by-character basis. They save a field's characters to a DataCell object as they parse the record. In contrast, with the flat file format we parse records on a field-by-field basis, extracting fields from the record buffer based on offsets and lengths. This method is called from the FlatRecordReader's parseRecord method to perform that extraction. It is also designed for extracting the record identifier, since until we have the identifier we don't know how to parse the full record and load the DataCell Array.
The method is also called from the FlatSourceConverter's testPartnerBreak method and the FlatRecordReader's getRecordType method. We call it for these special cases and not as a general purpose method for extracting field values from the input record buffer, since in Java and C++ that operation takes only a couple of lines of code. In addition to extracting the value, getFieldValue trims trailing whitespace and returns an error if the field is empty.
Logic for the FlatRecordReader getFieldValue Method
Arguments:
    Integer Field Offset
    Integer Field Length
Returns:
    Field value; throws exception or returns status
IF Field Offset > Record Buffer Length
    Return error
ENDIF
IF Field Offset + Field Length > Record Buffer Length
    Field Length <- Record Buffer Length - Field Offset
ENDIF
Field Value <- Extract from Record Buffer according to passed Field Offset and adjusted Field Length
Field Value <- Trim trailing whitespace (<= ASCII space character)
IF length of Field Value = 0
    Return error
ENDIF
Return Field Value
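A minimal Java rendering of this logic is shown below. It assumes the record buffer is available as a String and reports errors by throwing IllegalArgumentException; the class name FlatFields and the choice of exception are assumptions for the example, not details of the actual implementation.

```java
// Hypothetical static version of getFieldValue; the enclosing class name is illustrative.
final class FlatFields {
    static String getFieldValue(String recordBuffer, int fieldOffset, int fieldLength) {
        if (fieldOffset < 0 || fieldOffset > recordBuffer.length()) {
            throw new IllegalArgumentException("Field offset is outside the record buffer");
        }
        // Adjust the length when the field runs past the end of the record.
        int end = Math.min(fieldOffset + fieldLength, recordBuffer.length());
        // Trim trailing whitespace: any character <= ASCII space.
        while (end > fieldOffset && recordBuffer.charAt(end - 1) <= ' ') {
            end--;
        }
        if (end == fieldOffset) {
            throw new IllegalArgumentException("Field is empty");
        }
        return recordBuffer.substring(fieldOffset, end);
    }
}
```

For instance, `FlatFields.getFieldValue("HDR  ACME      ", 5, 10)` returns `"ACME"`, and a field that contains only spaces raises the error rather than returning an empty string.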
The getRecordType method extracts the value of the Record ID field from the input record buffer, trimming trailing whitespace. In the current implementation the field value is returned in its raw form, interpreted as alphanumeric string data.
Logic for the FlatRecordReader getRecordType Method
Arguments:
    None
Returns:
    Record ID tag value; throws exception or returns status
Record Tag Value <- call getFieldValue, passing Record ID Field Offset and Record ID Field Length
Return Record Tag Value or status
The parseRecord method is not as interesting as the one in the CSVRecordReader because the grammar is not as complex. Essentially, we walk the FieldDescription Elements of the RecordDescription Element, extract the field contents according to their defined offsets and lengths, and create new DataCell objects.
Logic for the FlatRecordReader parseRecord Method
Arguments:
    DOM Element Record Grammar
Returns:
    Error status or throw exception
Field Grammar NodeList <- call Record Grammar's getElementsByTagName for "FieldDescription"
DO until end of Field Grammar NodeList
    Field Grammar Element <- Next item in Field Grammar NodeList
    Field Number <- call Field Grammar Element's getAttribute on "FieldNumber"
    Field Offset <- call Field Grammar Element's getAttribute on "Offset"
    Field Length <- call Field Grammar Element's getAttribute on "Length"
    IF Field Offset + 1 > Record Buffer Length
        Return success    // End of record
    ENDIF
    IF Field Offset + Field Length > Record Buffer Length
        Field Length <- Record Buffer Length - Field Offset
    ENDIF
    Field Value <- Extract from Record Buffer according to Field Offset and adjusted Field Length
    New Cell <- call RecordHandler's createDataCell method, passing Field Number and Field Grammar Element
    Call New Cell's putFieldValue method, passing Field Value
ENDDO
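The field-by-field extraction can be sketched in Java as follows. To keep the example self-contained, it replaces the FieldDescription Elements with a small FieldSpec class and the DataCell objects with a Map keyed by field number; FlatRecordSlicer, FieldSpec, and the Map result are all assumptions made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for parseRecord; it slices fields by offset and length.
final class FlatRecordSlicer {
    // Plays the role of one FieldDescription Element.
    static final class FieldSpec {
        final int number;
        final int offset;
        final int length;

        FieldSpec(int number, int offset, int length) {
            this.number = number;
            this.offset = offset;
            this.length = length;
        }
    }

    // Returns field number -> raw field value, in grammar order.
    static Map<Integer, String> parseRecord(String recordBuffer, FieldSpec... fields) {
        Map<Integer, String> cells = new LinkedHashMap<>();
        for (FieldSpec field : fields) {
            if (field.offset + 1 > recordBuffer.length()) {
                return cells;  // end of record: remaining fields are absent
            }
            // Shorten the last field when the record ends before the field does.
            int end = Math.min(field.offset + field.length, recordBuffer.length());
            cells.put(field.number, recordBuffer.substring(field.offset, end));
        }
        return cells;
    }
}
```

Given the record `"LIN0001WIDGET"` and fields at offsets 0, 3, 7, and 20 with lengths 3, 4, 10, and 5, the sketch yields three values ("LIN", "0001", and "WIDGET"); the third field is shortened to fit the record and the fourth is skipped because the record ends before it starts.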
The readRecord convenience method protects calling routines from the variations in how we physically process records for flat files. It performs either a fixed or variable length read, depending on the physical characteristics gleaned from the file description document.
Logic for the FlatRecordReader readRecord Method
Arguments:
    None
Returns:
    Record Length or EOF
IF Fixed Record Length > 0
    Return call to readRecordFixedLength
ENDIF
Return call to base RecordReader's readRecordVariableLength
The readRecordFixedLength method calls the language-specific routines for reading a fixed length record from a flat file.
Logic for the FlatRecordReader readRecordFixedLength Method
Arguments:
    None
Returns:
    Record Length or EOF
Clear Record Buffer
Record Buffer <- Call language-specific routine to read a fixed number of bytes from the input record stream
IF not EOF
    Return Fixed Record Length
ENDIF
Return EOF
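In Java, the "language-specific routine" could be a loop over InputStream.read, as in the sketch below. The class name FixedLengthReader, the EOF value of -1, and the decision to treat a short trailing record as end of file are assumptions for this example; the actual implementation may handle a partial final record differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical fixed length record reader; names and EOF handling are illustrative.
final class FixedLengthReader {
    static final int EOF = -1;  // assumed sentinel value

    private final InputStream input;
    private final byte[] recordBuffer;

    FixedLengthReader(InputStream input, int fixedRecordLength) {
        if (fixedRecordLength <= 0) {
            throw new IllegalArgumentException("Fixed record length must be positive");
        }
        this.input = input;
        this.recordBuffer = new byte[fixedRecordLength];
    }

    // Returns the fixed record length, or EOF when a full record cannot be read.
    int readRecordFixedLength() {
        Arrays.fill(recordBuffer, (byte) 0);  // clear the record buffer
        int total = 0;
        try {
            // read() may return fewer bytes than requested, so loop until the record is full.
            while (total < recordBuffer.length) {
                int n = input.read(recordBuffer, total, recordBuffer.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total == recordBuffer.length ? recordBuffer.length : EOF;
    }

    // Decodes the buffer one byte per character, which suits single-byte flat file data.
    String record() {
        return new String(recordBuffer, StandardCharsets.ISO_8859_1);
    }
}
```

Looping matters because a single read call is allowed to return fewer bytes than requested even when more data is on the way, which would otherwise split a record across two buffers.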