Describing the File Formats | Using XML with Legacy Business Applications

Describing the CSV file format in Chapter 7 was fairly simple due to the restrictions we placed on it. The most significant of these was that every row has the same logical format. We're going to allow more variation in our flat file formats. Applications that use flat files to import or export data typically support several different logical record formats and group these records into repeating units. We'll need to specify more information about the grammar of our flat files than we did with CSV files. In addition to data types and other characteristics of fields, we'll need to specify the details of all the record types as well as how the records are grouped together.

As with the CSV format, the flat file format's file description document has three major sections, each represented by an Element that is an immediate child of the root Element.

PhysicalCharacteristics: the flat file characteristics
XMLOutputCharacteristics: XML output characteristics, required only when converting to XML
Grammar: flat file grammar

Figure 8.2. Sample Output Flat File (PurchaseOrders.Dat)

 10        20        30        40        50        60        70        80  90        100       110       120       130
HDRBQ003               AZ999345            2002110120021115
SHPYazoo Grocers - NE Distribution Center  12 Industrial Parkway, NW  Portland            ME 04101
LINHCVAN                       1200000002590000000000
DSCInstant Hot Cocoa Mix - Vanilla flavor
LINHCMIN                       2400000002530000000000
DSCInstant Hot Cocoa Mix - Mint flavor
HDRBQ003               AW999346            2002110120021115
SHPYazoo Grocers - SE Distribution Center  Dock 37                       3975 Hwy 75  Atoka               OK 74525
LINHCVAN                       3600000002590000000000
DSCInstant Hot Cocoa Mix - Vanilla flavor
LINHCMIN                       7200000002530000000000
DSCInstant Hot Cocoa Mix - Mint flavor
HDRAY001               2002-0967           2002110920021114
SHPCorner Drug and Sundries                14 Main Street  Wichita             KS 67201
LINHCVAN                       2400000002590000000000
DSCInstant Hot Cocoa Mix - Vanilla flavor

Flat File Physical Characteristics

The PhysicalCharacteristics Element describes the file's physical characteristics. This Element is required for both the source and target conversion utilities.

Table 8.3 shows the child Elements of the PhysicalCharacteristics Element. All are required unless otherwise noted.

XML Output Characteristics

Characteristics governing the output XML documents are described in the XMLOutputCharacteristics Element. This Element is used only when converting from flat files to XML.

Table 8.4 shows the child Elements of the XMLOutputCharacteristics Element. All are required unless otherwise noted.

Flat File Grammar

The grammar of a flat file is described in the Grammar Element. Although the XML representation of groups of records in flat files may be fairly intuitive, a few diagrams might help make it clearer.

Figure 8.3 shows a typical stream of records in a flat file, using our cocoa invoice as an example. For brevity only the record tags appear in the figure.

Figure 8.3. Record Stream in the Invoice File

If we look only at the records, we can't deduce much about the logical structure of a document with any certainty. We would probably suspect that the HDR record starts a new document and that the LIN and DSC records form a repeating group. However, we can't know for certain just by looking at the file; we must verify our suspicions by consulting the file specification or the application designer. For our purposes, we use Table 8.1 as our specification. This allows us to interpret the stream as shown in Figure 8.4.

Figure 8.4. Record Stream in the Invoice File, with Groups Added


Figure 8.4, in essence, shows what is known as a syntax tree. Figure 8.5 converts the brackets into nodes in the tree. I show siblings at the same level in the diagram to make relationships more obvious.

Figure 8.5. Syntax Tree for the Invoice File


The logical structure in Figure 8.5 now finally starts to look like something we might see in XML. All we have to do to make the transformation complete is to change the text from record identifiers and descriptions to XML Element names (Figure 8.6).

Figure 8.6. Invoice Document in XML


Table 8.3. Child Elements of the PhysicalCharacteristics Element

Child Element	Nested Child Element	Attribute	Schema Data Type	Description	Allowable Values, Restrictions, or Comments
RecordFormat				Specifies the physical format of the record	Only one of Fixed or Variable is allowed.
	Fixed	Length	positiveInteger	Specifies the physical record length	Maximum value reflects restriction on record length as noted in restrictions list in text.
	Variable	RecordTerminator	union of U, W, and hexBinary	Designates a UNIX-style line feed, Windows-style carriage return and line feed pair, or a hexadecimal value	U, W, or a two-character hexadecimal number from 00 through FF representing a single byte.
TagInfo				Specifies the location of the record identifier within the record	The tag contents will be interpreted as an alpha-numeric string, with leading and trailing white-space removed. Must be the same offset and length for every record type.
		Offset	nonNegativeInteger	Specifies the offset from zero in bytes for the first position of the tag	Maximum value reflects restriction on record length as noted in restrictions list in text.
		Length	positiveInteger	Specifies the length of the tag in bytes	Maximum value reflects restriction on field length as noted in restrictions list in text.
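The following Python sketch illustrates how the RecordFormat and TagInfo settings in Table 8.3 might be applied when reading a file. It is not the code of the conversion utilities; the function names terminator_bytes, split_records, and record_tag are hypothetical, and the sample bytes are invented for the illustration.

```python
def terminator_bytes(code: str) -> bytes:
    """Translate a RecordTerminator value (U, W, or two hex digits) to bytes."""
    if code == "U":
        return b"\n"          # UNIX-style line feed
    if code == "W":
        return b"\r\n"        # Windows-style carriage return and line feed
    return bytes.fromhex(code)  # e.g. "1E" -> b"\x1e"


def split_records(data: bytes, fixed_length=None, terminator=None):
    """Split raw file contents into physical records.

    Exactly one of fixed_length (Fixed/@Length) or terminator
    (Variable/@RecordTerminator) must be given, mirroring RecordFormat.
    """
    if (fixed_length is None) == (terminator is None):
        raise ValueError("specify exactly one of fixed_length or terminator")
    if fixed_length is not None:
        return [data[i:i + fixed_length]
                for i in range(0, len(data), fixed_length)]
    records = data.split(terminator_bytes(terminator))
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final terminator
    return records


def record_tag(record: bytes, offset: int, length: int) -> str:
    """Return the record identifier located by TagInfo, whitespace-trimmed."""
    return record[offset:offset + length].decode("ascii").strip()


# Invented sample: three variable-length records with Windows terminators.
sample = b"HDRBQ003\r\nSHPYazoo Grocers\r\nLINHCVAN\r\n"
tags = [record_tag(r, 0, 3) for r in split_records(sample, terminator="W")]
print(tags)
```

Because the tag must sit at the same offset and length in every record type, the identifier can be read before anything else about the record is known.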

Table 8.4. Child Elements of the XMLOutputCharacteristics Element

Child Element	Attribute	Schema Data Type	Description	Allowable Values, Restrictions, or Comments
SchemaLocationURL	value	anyURI	URL of the schema file for the output document. Will be written as the value of the root Element's noNamespaceSchemaLocation Attribute.	Optional. If not specified the noNamespaceSchemaLocation Attribute will not be written. An error will occur if output validation is requested and this Element is not present.
PartnerBreak			Information about a field that dictates a different trading partner when its content changes (for example, a customer number in the first field of the invoice).	Optional. Field contents are interpreted as an alphanumeric string and must be valid as a directory name for the operating system. If not specified, all output documents will be created in the output directory instead of creating a separate subdirectory for each trading partner.
	Offset	nonNegativeInteger	Offset from zero in bytes for the first position of the field.	Maximum value reflects restriction on record length as noted in the restrictions list in the text.
	Length	positiveInteger	Length of the field in bytes.	Maximum value reflects restriction on field length as noted in the restrictions list in the text.
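As a rough illustration of the PartnerBreak behavior described in Table 8.4, the Python sketch below picks an output directory from the field at the given offset and length. The function name partner_output_dir and the directory name "out" are assumptions made for the example, and the header record is built by hand from the values in Figure 8.2.

```python
from pathlib import Path


def partner_output_dir(header: str, output_dir: str, offset=None, length=None) -> Path:
    """Choose the directory for one output document.

    With no PartnerBreak (offset/length omitted) every document goes to the
    output directory itself; otherwise the trimmed field content names a
    subdirectory for the trading partner.
    """
    base = Path(output_dir)
    if offset is None or length is None:
        return base
    partner = header[offset:offset + length].strip()
    if not partner:
        raise ValueError("PartnerBreak field is empty")
    return base / partner


# Header record assembled from Figure 8.2: tag, customer number, order number, dates.
header = "HDR" + "BQ003".ljust(20) + "AZ999345".ljust(20) + "20021101" + "20021115"
target = partner_output_dir(header, "out", offset=3, length=20)
print(target)
```

The sketch does not check that the field content is a legal directory name; a real implementation would need to, since the table makes that a requirement.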

Now the transformation is complete. However, one other diagram may be helpful in fully understanding the file description documents and how the utilities use them. The logical structure of the grammar of our invoice file exactly matches the structure of the XML representation of the invoice document (Figure 8.7). The Element names in the file description document are shown in boldface type, while the invoice Elements they specify are shown in italics. Note that we define each Element in the invoice document only once and don't repeat the GroupDescription for each occurrence of the LineItemGroup Element.

Figure 8.7. Grammar Description of the Invoice Document

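To show how a single grammar description can drive the grouping of a whole record stream, here is a minimal Python sketch. The tuple-based grammar, the function names, and the sample stream of tags are all assumptions for illustration; the Element names come from the invoice file description, and the matching rules (each record at most once per parent, groups repeating while their first tag recurs) are a simplification, not the utilities' actual algorithm.

```python
# Hypothetical in-memory form of the Grammar Element. Each entry is either
# ("record", element_name, tag) or ("group", element_name, tag, children).
INVOICE_GRAMMAR = [
    ("record", "Header", "HDR"),
    ("record", "ShipTo", "SHP"),
    ("group", "LineItemGroup", "LIN", [
        ("record", "LineItem", "LIN"),
        ("record", "ItemDescription", "DSC"),
    ]),
    ("record", "Summary", "SUM"),
]


def parse_entries(entries, tags, pos):
    """Match tags[pos:] against grammar entries; return (nodes, new_pos)."""
    nodes = []
    for entry in entries:
        if entry[0] == "record":
            _, name, tag = entry
            if pos < len(tags) and tags[pos] == tag:
                nodes.append(name)
                pos += 1
        else:
            _, name, tag, children = entry
            # One GroupDescription covers every occurrence of the group.
            # The group's tag is that of its first record, so each pass
            # through the loop consumes at least one record.
            while pos < len(tags) and tags[pos] == tag:
                child_nodes, pos = parse_entries(children, tags, pos)
                nodes.append((name, child_nodes))
    return nodes, pos


def parse_documents(tags, root_name="FlatInvoice"):
    """Split a stream of record tags into one syntax tree per document."""
    docs, pos = [], 0
    while pos < len(tags):
        nodes, new_pos = parse_entries(INVOICE_GRAMMAR, tags, pos)
        if new_pos == pos:
            raise ValueError(f"unexpected record {tags[pos]!r} at index {pos}")
        docs.append((root_name, nodes))
        pos = new_pos
    return docs


def show(node, depth=0):
    """Print a tree with one Element name per line, indented by depth."""
    if isinstance(node, tuple):
        name, children = node
        print("  " * depth + name)
        for child in children:
            show(child, depth + 1)
    else:
        print("  " * depth + node)


# Invented stream: two invoices, the first with two line item groups.
STREAM = ["HDR", "SHP", "LIN", "DSC", "LIN", "DSC", "SUM",
          "HDR", "SHP", "LIN", "DSC", "SUM"]
docs = parse_documents(STREAM)
for doc in docs:
    show(doc)
```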

For a more detailed discussion of the analysis of flat file grammars, refer to the High-Level Design Considerations section. Table 8.5 shows the details of the Grammar Element and its child Nodes. All are required unless noted. The indentation in the Element column shows the approximate hierarchical relationships. The Allowable Child Elements column lists the specific details of the hierarchy.

Table 8.6 shows the data types supported for the flat file format. To those we developed for the CSV file format in Chapter 7 we add a new numeric and a new date data type.

For all types, a runtime error occurs if Truncatable is false and the length of the XML Element contents exceeds the field length.

I should make a note here about truncating versus rounding fractional digits. In these utilities I always truncate and never round. I've had enough bad experiences with floating point arithmetic that I'm taking the easy way out and just truncating. If you need to round fractional digits, you can use an XSLT transformation or whatever means you use to put the data into the proper XML source format. Or, if you want to modify the source code, you can take an approach similar to the one I discuss in the Enhancements and Alternatives section at the end of the chapter.
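For readers who want to see the difference concretely, this small Python sketch contrasts truncating with rounding by using decimal arithmetic instead of floating point. It relies on Python's standard decimal module and is only an illustration of the two behaviors, not the approach taken in the utilities' source code; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def truncate_fraction(value: str, places: int) -> Decimal:
    """Drop fractional digits beyond `places` without rounding."""
    quantum = Decimal(10) ** -places  # places=2 -> Decimal("0.01")
    return Decimal(value).quantize(quantum, rounding=ROUND_DOWN)


def round_fraction(value: str, places: int) -> Decimal:
    """Round half up to `places` fractional digits."""
    quantum = Decimal(10) ** -places
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


print(truncate_fraction("2.599", 2))  # 2.59
print(round_fraction("2.599", 2))     # 2.60
```

Working from the decimal string avoids the binary floating point representation entirely, which is what makes both results predictable.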

Table 8.5. Flat File Grammar Characteristics in the Grammar Element

Element	Allowable Child Elements	Attribute	Schema Language Data Type	Description	Allowable Values, Restrictions, or Comments
Grammar	RecordDescription, GroupDescription			Describes the grammar of both the flat file and the corresponding XML representation.	The first child Element of the Grammar Element must be a RecordDescription Element. It may be followed by any combination of RecordDescription or GroupDescription Elements.
		ElementName	NMTOKEN	Specifies the name of the document's root Element.	When creating XML documents, the specified name is assigned to the document's root Element. When creating a flat file, the input XML document's root Element must match this name. Maximum length reflects restriction on length of Element names.
		TagValue	token	The value of the Header record's record identifier field.	Maximum length reflects restriction on field length. Do not include trailing spaces if the tag length is less than the length specified in the TagInfo Element.
GroupDescription	RecordDescription, GroupDescription			Describes the grammar of a group of records.	Any combination of RecordDescription or GroupDescription Elements can follow the first RecordDescription Element.
		ElementName	NMTOKEN	Specifies the name of the Element representing the group.	Maximum length reflects restriction on length of Element names.
		TagValue	token	The value of the record identifier field described for the first record in the group.	Maximum length reflects restriction on field length. Do not include trailing spaces if the tag length is less than the length specified in the TagInfo Element.
RecordDescription	FieldDescription			Describes the grammar of an individual record and the corresponding XML representation.	A RecordDescription Element is required for each unique record type in the file. If a record type may appear at different positions, a RecordDescription is required for each position.
		ElementName	NMTOKEN	Specifies the name of the Element representing a row.	Maximum length reflects restriction on length of Element names.
		TagValue	token	The value of the record identifier field described by the TagInfo Element above.	Maximum length reflects restriction on field length. Do not include trailing spaces if the tag length is less than the length specified in the TagInfo Element.
FieldDescription	None			Describes the characteristics of a field in the flat file and the corresponding XML representation.	One FieldDescription Element is required for each field in the flat file record. If a range of characters within the record is not covered by a field description, they will be ignored for flat file source conversions and space filled for flat file target conversions.
		ElementName	NMTOKEN	Specifies the name of the Element representing the field.	Maximum length reflects restriction on length of Element names.
		FieldNumber	positiveInteger	Specifies the number of the field, starting at one.	Maximum value reflects restriction on the number of fields per record.
		DataType	token	Specifies the data type of the field in the flat file.	The supported data types developed in this chapter are shown in Table 8.6. The Grammar data type code values are used.
		Offset	nonNegativeInteger	Specifies the offset from zero in bytes for the first position of the field.	Maximum value reflects restriction on record length.
		Length	positiveInteger	Specifies the length of the field in bytes.	Maximum value reflects restriction on field length.
		Truncatable	boolean	Indicates whether or not truncation is permitted. See comments regarding truncation in Table 8.6.	Optional, defaults to false.
		FillCharacter	union of single character string and hexBinary	When converting to flat files as the target, the field will be padded with this character if the source XML Element content is missing or shorter than the field length.	Optional, defaults to an ASCII space character. A single literal character or a two-character hexadecimal number from 00 through FF representing a single byte may be specified.
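The Python sketch below suggests how the Offset, Length, and DataType Attributes of FieldDescription might be used when a flat file is the source, applying the source-side actions listed in Table 8.6. It is an illustration under assumptions, not the utilities' code: the function names and the tuple layout are hypothetical, the record is invented, and the "s" in DMMsDDsYYYY is taken to be a slash.

```python
from datetime import datetime
from decimal import Decimal

# Characters with an integer value less than or equal to a space.
TRIM_CHARS = "".join(chr(code) for code in range(33))


def convert_source(raw: str, data_type: str) -> str:
    """Convert one flat file field to XML Element content."""
    text = raw.strip(TRIM_CHARS)
    if data_type == "AN":
        return text
    if data_type == "R":
        # Decimal drops leading zeroes and a leading plus sign.
        return format(Decimal(text), "f")
    if data_type == "DYYYYMMDD":
        return datetime.strptime(text, "%Y%m%d").date().isoformat()
    if data_type == "DMMsDDsYYYY":
        # strptime accepts one- or two-digit month and day values.
        return datetime.strptime(text, "%m/%d/%Y").date().isoformat()
    if data_type.startswith("N"):
        places = int(data_type[1:])  # implied decimal places
        return format(Decimal(text).scaleb(-places), "f")
    raise ValueError(f"unsupported data type {data_type!r}")


# (ElementName, DataType, Offset, Length) for the invoice LIN record.
LINE_ITEM_FIELDS = [
    ("RecordID", "AN", 0, 3),
    ("ItemID", "AN", 3, 20),
    ("ItemQuantity", "R", 23, 10),
    ("UnitPrice", "N2", 33, 10),
    ("ExtendedPrice", "N2", 43, 10),
]


def extract_fields(record: str, fields) -> dict:
    """Slice each described field out of the record and convert it."""
    return {
        name: convert_source(record[offset:offset + length], data_type)
        for name, data_type, offset, length in fields
    }


# Invented record: 12 units at 2.59 each.
record = "LIN" + "HCVAN".ljust(20) + "12".rjust(10) + "0000000259" + "0000003108"
line_item = extract_fields(record, LINE_ITEM_FIELDS)
print(line_item)
```

Any characters in the record that fall outside the described ranges are never sliced, which matches the rule that uncovered ranges are ignored for source conversions.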

Table 8.6. Flat File Data Types

Flat File Data Type	Grammar Data Type Code	Schema Data Type	Actions with Flat File as Source	Actions with Flat File as Target	Actions with Flat File as Target if Truncatable Is True
Alphanumeric	AN	string	Leading and trailing white-space (any character with an integer value less than or equal to a space character) is trimmed. All other white-space within the string is preserved.	If the source is shorter than the field length, the field is left-justified and filled to the right with the fill character.	The string is right-truncated to the field length.
Real number	R	decimal	Leading zeroes and leading plus signs are removed. All whitespace is trimmed.	The number is right-justified within the field. Leading characters are set according to the fill character. If the fill character is a zero, the minus sign if present is placed in the left-most position. For all other fill characters the minus sign immediately precedes the most significant digit.	Fractional digits to the right of the decimal point are truncated until the length of the Element contents equals the field length. An error occurs if the digits to the left of the decimal point exceed the field length.
Implied decimal number	Nx, where x represents the number of implied decimal places	decimal	Leading zeroes and leading plus signs are removed. All whitespace is trimmed.	The number is right-justified within the field. If the number of fractional digits in the source decimal number exceeds x, the number is right-truncated to x fractional digits. Zeroes are added as fractional digits if the source number has fewer than x fractional digits. Leading characters are set according to the fill character. If the fill character is a zero, the sign character is placed in the left-most position. For all other fill characters the sign character immediately precedes the most significant digit.	Ignored, not truncatable.
Date in YYYYMMDD format	DYYYYMMDD	date	N/A	The date is left-justified within the field and filled with the specified fill character if the field is longer than 8 characters.	Ignored, not truncatable.
Date in MM/DD/YYYY format	DMMsDDsYYYY	date	Month and day may be either one or two digits each.	The date is left-justified within the field and filled with the specified fill character if the field is longer than 10 characters.	Ignored, not truncatable.
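The target-side rules in Table 8.6 are easier to follow with a worked example. The Python sketch below covers only the Alphanumeric and Implied decimal types; the function names are hypothetical, only a minus sign is written for the sign character, and the behavior is my reading of the table, not the utilities' actual implementation.

```python
from decimal import Decimal, ROUND_DOWN


def format_alphanumeric(value: str, length: int, fill: str = " ",
                        truncatable: bool = False) -> str:
    """Left-justify and pad; right-truncate only if Truncatable is true."""
    if len(value) > length:
        if not truncatable:
            raise ValueError(f"{value!r} exceeds field length {length}")
        value = value[:length]
    return value.ljust(length, fill)


def format_implied_decimal(value: str, length: int, places: int,
                           fill: str = " ") -> str:
    """Write a decimal as digits with `places` implied fractional digits."""
    # Truncate (never round) extra fractional digits, or add zeroes.
    scaled = Decimal(value).quantize(Decimal(10) ** -places, rounding=ROUND_DOWN)
    digits = str(int(abs(scaled).scaleb(places)))
    sign = "-" if scaled < 0 else ""
    if len(sign) + len(digits) > length:
        raise ValueError(f"{value!r} does not fit in {length} characters")
    if fill == "0":
        # Zero fill: the sign goes in the left-most position.
        return sign + digits.rjust(length - len(sign), "0")
    # Any other fill: the sign immediately precedes the first digit.
    return (sign + digits).rjust(length, fill)


print(repr(format_alphanumeric("Portland", 20)))
print(format_implied_decimal("2.59", 10, 2, fill="0"))   # 0000000259
print(format_implied_decimal("-2.5", 10, 2, fill="0"))   # -000000250
print(repr(format_implied_decimal("-2.5", 10, 2)))
```

The second call reproduces the unit price field of the first LIN record in Figure 8.2.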

Example File Description Documents

Here are the file description documents for the flat file invoice and purchase order examples.

Sample InvoiceFlatSourceDescription.xml

<?xml version="1.0" encoding="UTF-8"?>
<FlatSourceFileDescription
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="FlatSourceFileDescription.xsd">
  <PhysicalCharacteristics>
    <RecordFormat>
      <Variable RecordTerminator="W"/>
    </RecordFormat>
    <TagInfo Offset="0" Length="3"/>
  </PhysicalCharacteristics>
  <XMLOutputCharacteristics>
    <SchemaLocationURL value="FlatInvoice.xsd"/>
    <PartnerBreak Offset="3" Length="20"/>
  </XMLOutputCharacteristics>
  <Grammar ElementName="FlatInvoice" TagValue="HDR">
    <RecordDescription ElementName="Header" TagValue="HDR">
      <FieldDescription FieldNumber="1"
          ElementName="RecordID" DataType="AN"
          Offset="0" Length="3"/>
      <FieldDescription FieldNumber="2"
          ElementName="CustomerNumber" DataType="AN"
          Offset="3" Length="20"/>
      <FieldDescription FieldNumber="3"
          ElementName="InvoiceNumber" DataType="AN"
          Offset="23" Length="20"/>
      <FieldDescription FieldNumber="4"
          ElementName="InvoiceDate" DataType="DYYYYMMDD"
          Offset="43" Length="8"/>
      <FieldDescription FieldNumber="5"
          ElementName="PONumber" DataType="AN"
          Offset="51" Length="20"/>
      <FieldDescription FieldNumber="6"
          ElementName="DueDate" DataType="DYYYYMMDD"
          Offset="71" Length="8"/>
    </RecordDescription>
    <RecordDescription ElementName="ShipTo" TagValue="SHP">
      <FieldDescription FieldNumber="1"
          ElementName="RecordID" DataType="AN"
          Offset="0" Length="3"/>
      <FieldDescription FieldNumber="2"
          ElementName="ShipToName" DataType="AN"
          Offset="3" Length="40"/>
      <FieldDescription FieldNumber="3"
          ElementName="ShipToStreet1" DataType="AN"
          Offset="43" Length="30"/>
      <FieldDescription FieldNumber="4"
          ElementName="ShipToStreet2" DataType="AN"
          Offset="73" Length="30"/>
      <FieldDescription FieldNumber="5"
          ElementName="ShipToCity" DataType="AN"
          Offset="103" Length="20"/>
      <FieldDescription FieldNumber="6"
          ElementName="ShipToStateOrProvince"
          DataType="AN" Offset="123" Length="3"/>
      <FieldDescription FieldNumber="7"
          ElementName="ShipToPostalCode" DataType="AN"
          Offset="126" Length="10"/>
      <FieldDescription FieldNumber="8"
          ElementName="ShipToCountry"
          DataType="AN" Offset="136" Length="3"/>
    </RecordDescription>
    <GroupDescription ElementName="LineItemGroup"
        TagValue="LIN">
      <RecordDescription ElementName="LineItem" TagValue="LIN">
        <FieldDescription FieldNumber="1"
            ElementName="RecordID" DataType="AN"
            Offset="0" Length="3"/>
        <FieldDescription FieldNumber="2"
            ElementName="ItemID" DataType="AN"
            Offset="3" Length="20"/>
        <FieldDescription FieldNumber="3"
            ElementName="ItemQuantity" DataType="R"
            Offset="23" Length="10"/>
        <FieldDescription FieldNumber="4"
            ElementName="UnitPrice" DataType="N2"
            Offset="33" Length="10"/>
        <FieldDescription FieldNumber="5"
            ElementName="ExtendedPrice" DataType="N2"
            Offset="43" Length="10"/>
      </RecordDescription>
      <RecordDescription ElementName="ItemDescription"
          TagValue="DSC">
        <FieldDescription FieldNumber="1"
            ElementName="RecordID" DataType="AN"
            Offset="0" Length="3"/>
        <FieldDescription FieldNumber="2"
            ElementName="Description" DataType="AN"
            Offset="3" Length="80"/>
      </RecordDescription>
    </GroupDescription>
    <RecordDescription ElementName="Summary" TagValue="SUM">
      <FieldDescription FieldNumber="1"
          ElementName="RecordID" DataType="AN"
          Offset="0" Length="3"/>
      <FieldDescription FieldNumber="2"
          ElementName="TotalAmount" DataType="N2"
          Offset="3" Length="10"/>
      <FieldDescription FieldNumber="3"
          ElementName="NumberOfLines" DataType="N0"
          Offset="13" Length="10"/>
    </RecordDescription>
  </Grammar>
</FlatSourceFileDescription>

Sample PurchaseOrderFlatTargetDescription.xml

<?xml version="1.0" encoding="UTF-8"?>
<FlatTargetFileDescription
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation=
        "FlatTargetFileDescription.xsd">
  <PhysicalCharacteristics>
    <RecordFormat>
      <Variable RecordTerminator="W"/>
    </RecordFormat>
    <TagInfo Offset="0" Length="3"/>
  </PhysicalCharacteristics>
  <Grammar ElementName="PurchaseOrder" TagValue="HDR">
    <RecordDescription TagValue="HDR" ElementName="POHeader">
      <FieldDescription FieldNumber="1"
          ElementName="RecordID" DataType="AN"
          Offset="0" Length="3"/>
      <FieldDescription FieldNumber="2"
          ElementName="CustomerNumber" DataType="AN"
          Offset="3" Length="20"/>
      <FieldDescription FieldNumber="3"
          ElementName="PONumber" DataType="AN"
          Offset="23" Length="20"/>
      <FieldDescription FieldNumber="4"
          ElementName="PODate" DataType="DYYYYMMDD"
          Offset="43" Length="8"/>
      <FieldDescription FieldNumber="5"
          ElementName="RequestedDeliveryDate" DataType="DYYYYMMDD"
          Offset="51" Length="8"/>
    </RecordDescription>
    <RecordDescription TagValue="SHP" ElementName="ShipTo">
      <FieldDescription FieldNumber="1"
          ElementName="RecordID" DataType="AN"
          Offset="0" Length="3"/>
      <FieldDescription FieldNumber="2"
          ElementName="ShipToName" DataType="AN"
          Offset="3" Length="40" Truncatable="true"/>
      <FieldDescription FieldNumber="3"
          ElementName="ShipToStreet1" DataType="AN"
          Offset="43" Length="30" Truncatable="true"/>
      <FieldDescription FieldNumber="4"
          ElementName="ShipToStreet2" DataType="AN"
          Offset="73" Length="30" Truncatable="true"/>
      <FieldDescription FieldNumber="5"
          ElementName="ShipToCity" DataType="AN"
          Offset="103" Length="20"/>
      <FieldDescription FieldNumber="6"
          ElementName="ShipToStateOrProvince" DataType="AN"
          Offset="123" Length="3"/>
      <FieldDescription FieldNumber="7"
          ElementName="ShipToPostalCode" DataType="AN"
          Offset="126" Length="10"/>
      <FieldDescription FieldNumber="8"
          ElementName="ShipToCountry" DataType="AN"
          Offset="136" Length="3"/>
    </RecordDescription>
    <GroupDescription ElementName="LineItem" TagValue="LIN">
      <RecordDescription TagValue="LIN" ElementName="Item">
        <FieldDescription FieldNumber="1"
            ElementName="RecordID" DataType="AN"
            Offset="0" Length="3"/>
        <FieldDescription FieldNumber="2"
            ElementName="ItemID" DataType="AN"
            Offset="3" Length="20"/>
        <FieldDescription FieldNumber="3"
            ElementName="OrderedQty" DataType="R"
            Offset="23" Length="10" FillCharacter=" "/>
        <FieldDescription FieldNumber="4"
            ElementName="UnitPrice" DataType="N2"
            Offset="33" Length="10" FillCharacter="0"/>
        <FieldDescription FieldNumber="5"
            ElementName="ExtendedAmount" DataType="N2"
            Offset="43" Length="10" FillCharacter="0"/>
      </RecordDescription>
      <RecordDescription TagValue="DSC"
          ElementName="ItemDescription">
        <FieldDescription FieldNumber="1"
            ElementName="RecordID" DataType="AN"
            Offset="0" Length="3"/>
        <FieldDescription FieldNumber="2"
            ElementName="Description" DataType="AN"
            Offset="3" Length="80" Truncatable="true"/>
      </RecordDescription>
    </GroupDescription>
  </Grammar>
</FlatTargetFileDescription>
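Because the offsets and lengths in a file description document are maintained by hand, a quick consistency check can be worthwhile. The following Python sketch, which is not part of the utilities, uses the standard xml.etree.ElementTree module to report fields within a record whose byte ranges overlap. The function name find_overlaps and the shortened Grammar fragment are made up for the example, and the second fragment contains a deliberate, hypothetical typo.

```python
import xml.etree.ElementTree as ET

GOOD = """\
<Grammar ElementName="PurchaseOrder" TagValue="HDR">
  <RecordDescription TagValue="HDR" ElementName="POHeader">
    <FieldDescription FieldNumber="1" ElementName="RecordID"
        DataType="AN" Offset="0" Length="3"/>
    <FieldDescription FieldNumber="2" ElementName="CustomerNumber"
        DataType="AN" Offset="3" Length="20"/>
    <FieldDescription FieldNumber="3" ElementName="PONumber"
        DataType="AN" Offset="23" Length="20"/>
  </RecordDescription>
</Grammar>
"""

# Hypothetical typo: CustomerNumber now runs into PONumber.
BAD = GOOD.replace('Length="20"', 'Length="23"', 1)


def find_overlaps(xml_text: str):
    """Return (record, earlier field, later field) for each overlapping pair."""
    problems = []
    grammar = ET.fromstring(xml_text)
    # iter() also reaches records nested inside GroupDescription Elements.
    for record in grammar.iter("RecordDescription"):
        fields = sorted(record.findall("FieldDescription"),
                        key=lambda f: int(f.get("Offset")))
        for prev, cur in zip(fields, fields[1:]):
            prev_end = int(prev.get("Offset")) + int(prev.get("Length"))
            if int(cur.get("Offset")) < prev_end:
                problems.append((record.get("ElementName"),
                                 prev.get("ElementName"),
                                 cur.get("ElementName")))
    return problems


print(find_overlaps(GOOD))
print(find_overlaps(BAD))
```

Gaps between fields are deliberately not reported, since uncovered ranges are legal and are simply ignored or space filled.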