High-Level Design Considerations | Using XML with Legacy Business Applications

Our approach to grammar analysis is now going to be a bit more formal than that presented in Chapters 2 and 3. I'll start the design discussion in this and the next two chapters by examining the grammar of the legacy file format. I'll present a logical analysis, followed by the schema for the file description document. In the previous section I presented the basic content of that document, but here I'll discuss its relation to the grammar. The grammar analysis will also lay a foundation for the parsing algorithms presented later in this chapter.

Grammar Analysis and Description

The grammar of our legacy file formats can be broken down into two separate grammars: (1) the grammar of records and groups of records within the file, and (2) the grammar of the fields within a record. In the case of CSV files, we have imposed the restriction that each row has the same format. The CSV file grammar can be expressed rather simply with the following BNF production.

CSV File Grammar

 CSVFile ::= row+

The plus sign (+) indicates one or more. So, this simply says that a CSV file contains one or more rows. For CSV files, the row grammar is the interesting part. Remember, here we don't want to describe the grammar of a specific CSV file; instead we want to abstract the essential features of CSV files and develop a grammar that describes the whole class of CSV files. So, here's the CSV row grammar. Note that we follow through with the complete grammar by using the row nonterminal symbol from the file grammar.

CSV Row Grammar

 row ::= column (column_delimiter column?)* |
         (column_delimiter column?)+
 column ::= column_characters_A+ |
            text_delimiter column_characters_B+ text_delimiter
 column_characters_A ::= All allowed characters except
                         column_delimiter
 column_characters_B ::= All allowed characters except
                         text_delimiter

Again, the plus sign (+) indicates one or more occurrences and the vertical bar or pipe (|) indicates an exclusive OR choice. The asterisk (*) indicates zero or more occurrences, the question mark (?) indicates optionality, and the parentheses are used to establish groupings, the same way they are used in mathematical equations. I've taken a bit of a shortcut in the last two productions by falling back to text rather than terminal and nonterminal symbols, but I think it is clearer than trying to enumerate the full set of allowed characters.

Bear in mind that any number of BNF productions can describe this grammar. Let's not get confused about whether or not this is the most elegant way to express the grammar; let's just focus on the one I present.

So, what does this grammar tell us? The first production tells us that a row can have:

A single column with nothing after it
A single column followed by any combination of empty and filled columns
An empty first column followed by any combination of empty and filled columns

This production tells us that all the following rows are legal, assuming we use a comma as the column delimiter.

Mary,had,a,little,lamb
Mary,had
Mary,,a,,lamb
Mary,,a,,,
,had,a,little,lamb
,had,a,,
,,,
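As a quick check of the rows above, the row and column productions can be transcribed almost symbol for symbol into a regular expression. This is only a sketch for experimenting with the grammar, not the parsing algorithm used later in the chapter; the function names build_row_pattern and is_legal_row are hypothetical, and single-character delimiters are assumed.

```python
import re


def build_row_pattern(column_delimiter=",", text_delimiter='"'):
    """Compile a regex transcribed from the CSV row grammar (illustrative only)."""
    d = re.escape(column_delimiter)
    t = re.escape(text_delimiter)
    # column ::= column_characters_A+ |
    #            text_delimiter column_characters_B+ text_delimiter
    column = f"(?:[^{d}]+|{t}[^{t}]+{t})"
    # row ::= column (column_delimiter column?)* |
    #         (column_delimiter column?)+
    row = f"(?:{column}(?:{d}(?:{column})?)*|(?:{d}(?:{column})?)+)"
    return re.compile(row)


def is_legal_row(row, column_delimiter=",", text_delimiter='"'):
    """Return True if the whole row matches the grammar."""
    pattern = build_row_pattern(column_delimiter, text_delimiter)
    return pattern.fullmatch(row) is not None


EXAMPLE_ROWS = [
    "Mary,had,a,little,lamb",
    "Mary,had",
    "Mary,,a,,lamb",
    "Mary,,a,,,",
    ",had,a,little,lamb",
    ",had,a,,",
    ",,,",
]

for example in EXAMPLE_ROWS:
    print(example, is_legal_row(example))
```

Every example row matches, while a zero-length row does not, because both alternatives of the row production require at least one column or one column delimiter.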

Note that the grammar allows a row to end with empty columns or to consist entirely of empty columns. Many applications won't produce such rows. However, since we're aiming to accommodate the widest possible class of CSV files I saw no reason to impose this restriction. This approach also makes the parsing algorithm a bit easier. But even though we'll be able to parse rows ending with empty columns, we're not going to create such rows. As discussed in Chapter 9, the grammar of EDI records (segments) doesn't allow empty fields at the end.

The last three productions basically tell us that a column either may have any character other than the selected column delimiter or, if it is delimited by the text delimiter in the first and last positions, may include the column delimiter. Again, this allows a much wider range of variations than are usually permitted by any particular application. Many delimit only alphanumeric columns that contain the column delimiter (usually a comma), but some delimit all columns regardless of type. Our approach allows us to accommodate nearly all cases. As we'll shortly see, this also keeps the parsing algorithm from getting too complex since we don't need to be concerned with the data type of the column as we are parsing. Note, however, that when we convert from XML to CSV, we use the DelimitText Attribute of the column's grammar to determine whether or not we should delimit the column content with the text delimiter character.
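To make the column productions concrete, here is a minimal sketch of a row splitter that follows them. It is not the parsing algorithm presented later in the chapter; the function name split_row is hypothetical, single-character delimiters are assumed, and the sketch is slightly more lenient than the grammar in that it also accepts a text-delimited column with nothing between the delimiters.

```python
def split_row(row, column_delimiter=",", text_delimiter='"'):
    """Split one CSV row into column strings; an empty column becomes ""."""
    if row == "":
        # Both alternatives of the row production need at least one symbol.
        raise ValueError("the grammar requires at least one column or delimiter")
    columns = []
    pos = 0
    while True:
        if row.startswith(text_delimiter, pos):
            # text_delimiter column_characters_B+ text_delimiter
            close = row.find(text_delimiter, pos + 1)
            if close == -1:
                raise ValueError("missing closing text delimiter")
            columns.append(row[pos + 1:close])
            pos = close + 1
            if pos < len(row) and row[pos] != column_delimiter:
                raise ValueError("expected a column delimiter after the text delimiter")
        else:
            # column_characters_A+ (or nothing at all, for an empty column)
            end = row.find(column_delimiter, pos)
            if end == -1:
                end = len(row)
            columns.append(row[pos:end])
            pos = end
        if pos >= len(row):
            return columns
        pos += 1  # step over the column delimiter


print(split_row("Mary,,a,,lamb"))
print(split_row('"Smith, John",42'))
```

Note that the splitter never looks at the data type of a column. It only needs to know whether the column starts with the text delimiter, which is what keeps it short.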

File Description Document Schemas

In designing the schemas for the invoice and purchase order I was fairly safe in letting a tool do most of the work. However, for the file description documents I start from the ground up. Much of the rationale was discussed in Chapter 6, so I'll point out here only things that I didn't mention there. We'll also discuss other issues related to schema design in Chapter 12.

There are four schemas presented in this subsection.

CSVSourceFileDescription.xsd: This is the schema for file description documents that describe conversions in which the source format is a CSV file.
CSVTargetFileDescription.xsd: This schema is for file description documents that describe conversions in which the target format is a CSV file.
CSVCommonFileDescription.xsd: This type library schema is used by the two previous schemas.
BBCommonFileDescription.xsd: This schema specifies types used in all the Babel Blaster conversions. It primarily defines the type for the supported Babel Blaster data types (corresponding to the DataCell derived classes) and the enumeration of the codes for that type.

So, here are the schemas, with a few more comments interspersed where appropriate. If you recall the lessons of Chapter 4 and review the comments in Chapter 6 about the union data type, you should be able to read these schemas fairly well.

CSV Source File Description Schema (CSVSourceFileDescription.xsd)

 <?xml version="1.0" encoding="UTF-8"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
     elementFormDefault="unqualified"
     attributeFormDefault="unqualified">
   <xs:include schemaLocation="CSVCommonFileDescription.xsd"/>
   <xs:element name="CSVSourceFileDescription">
     <xs:annotation>
       <xs:documentation>
         This schema specifies the format of File Description
         Documents when converting from CSV files as source to XML
         documents as targets
       </xs:documentation>
     </xs:annotation>
     <xs:complexType mixed="false">
       <xs:sequence>
         <xs:element name="PhysicalCharacteristics"
             type="CSVPhysicalCharacteristicsType"/>
         <xs:element name="XMLOutputCharacteristics"
             type="CSVXMLOutputCharacteristicsType"/>
         <xs:element name="Grammar"
             type="CSVGrammarType"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
 </xs:schema>

CSV Target File Description Schema (CSVTargetFileDescription.xsd)

 <?xml version="1.0" encoding="UTF-8"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
     elementFormDefault="unqualified"
     attributeFormDefault="unqualified">
   <xs:include schemaLocation="CSVCommonFileDescription.xsd"/>
   <xs:element name="CSVTargetFileDescription">
     <xs:annotation>
       <xs:documentation>
         This schema specifies the format of File Description
         Documents when converting from XML documents as source to
         CSV files as targets
       </xs:documentation>
     </xs:annotation>
     <xs:complexType mixed="false">
       <xs:sequence>
         <xs:element name="PhysicalCharacteristics"
             type="CSVPhysicalCharacteristicsType"/>
         <xs:element name="Grammar"
             type="CSVGrammarType"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
 </xs:schema>

You'll notice that these two schemas are nearly identical. Aside from the difference in the root document Element name, the source file schema has a required XMLOutputCharacteristics Element that does not appear in the target file schema.

CSV File Description Common Schema (CSVCommonFileDescription.xsd)

 <?xml version="1.0" encoding="UTF-8"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
     elementFormDefault="unqualified"
     attributeFormDefault="unqualified">
   <xs:include schemaLocation="BBCommonFileDescription.xsd"/>
   <xs:complexType name="CSVPhysicalCharacteristicsType"
       mixed="false">
     <xs:annotation>
       <xs:documentation>
         Describes the CSV physical record organization
       </xs:documentation>
     </xs:annotation>
     <xs:sequence>
       <xs:element name="RecordTerminator">
         <xs:complexType>
           <xs:complexContent>
             <xs:extension base="EmptyType">
               <xs:attribute name="value"
                   type="RecordTerminatorValueType"
                   use="required"/>
             </xs:extension>
           </xs:complexContent>
         </xs:complexType>
       </xs:element>
       <xs:element name="ColumnDelimiter" type="DelimiterType"/>
       <xs:element name="TextDelimiter" type="DelimiterType"/>
     </xs:sequence>
   </xs:complexType>
   <xs:complexType name="CSVXMLOutputCharacteristicsType"
       mixed="false">
     <xs:annotation>
       <xs:documentation>
         Describes characteristics of the output XML document
       </xs:documentation>
     </xs:annotation>
     <xs:sequence>
       <xs:element name="DocumentBreakColumn">
         <xs:complexType mixed="false">
           <xs:complexContent mixed="false">
             <xs:extension base="EmptyType">
               <xs:attribute name="value" type="BreakColumnType"
                   use="required"/>
             </xs:extension>
           </xs:complexContent>
         </xs:complexType>
       </xs:element>
       <xs:element name="PartnerBreakColumn">
         <xs:complexType mixed="false">
           <xs:complexContent mixed="false">
             <xs:extension base="EmptyType">
               <xs:attribute name="value" type="BreakColumnType"
                   use="required"/>
             </xs:extension>
           </xs:complexContent>
         </xs:complexType>
       </xs:element>
       <xs:element name="SchemaLocationURL" minOccurs="0">
         <xs:complexType>
           <xs:complexContent>
             <xs:extension base="EmptyType">
               <xs:attribute name="value" type="anyURI127"
                   use="required"/>
             </xs:extension>
           </xs:complexContent>
         </xs:complexType>
       </xs:element>
     </xs:sequence>
   </xs:complexType>
   <xs:complexType name="CSVGrammarType">
     <xs:annotation>
       <xs:documentation>
         Describes the grammar of the CSV file
       </xs:documentation>
     </xs:annotation>
     <xs:sequence>
       <xs:element name="RowDescription">
         <xs:annotation>
           <xs:documentation>
             Describes a row in the CSV file. Currently, all rows
             must have the same format so we only allow a single
             one of these Elements.
           </xs:documentation>
         </xs:annotation>
         <xs:complexType>
           <xs:sequence>
             <xs:element name="ColumnDescription" maxOccurs="100">
               <xs:annotation>
                 <xs:documentation>
                   Describes a column in the row. The current
                   design limits us to one hundred columns per
                   row.
                 </xs:documentation>
               </xs:annotation>
               <xs:complexType>
                 <xs:complexContent>
                   <xs:extension base="FieldGrammarType">
                     <xs:attribute name="DelimitText"
                         type="xs:boolean" use="optional"
                         default="false"/>
                   </xs:extension>
                 </xs:complexContent>
               </xs:complexType>
             </xs:element>
           </xs:sequence>
           <xs:attribute name="ElementName" type="NMToken127"
               use="required"/>
         </xs:complexType>
       </xs:element>
     </xs:sequence>
     <xs:attribute name="ElementName" type="NMToken127"
         use="required"/>
   </xs:complexType>
   <xs:simpleType name="BreakColumnType">
     <xs:annotation>
       <xs:documentation>
         Enforces restrictions on column number for partner and
         document break
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:nonNegativeInteger">
       <xs:maxExclusive value="100"/>
     </xs:restriction>
   </xs:simpleType>
 </xs:schema>

Babel Blaster File Description Common Schema (BBCommonFileDescription.xsd)

 <?xml version="1.0" encoding="UTF-8"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
     elementFormDefault="unqualified"
     attributeFormDefault="unqualified">
   <xs:annotation>
     <xs:documentation>
       This schema specifies types common to all Babel Blaster
       conversion utilities.
     </xs:documentation>
   </xs:annotation>
   <xs:annotation>
     <xs:documentation>
       These complex types define reused types with Attributes
       only and no Element children.
     </xs:documentation>
   </xs:annotation>
   <xs:complexType name="EmptyType">
     <xs:annotation>
       <xs:documentation>
         Base Type for empty types
       </xs:documentation>
     </xs:annotation>
   </xs:complexType>
   <xs:complexType name="DelimiterType">
     <xs:annotation>
       <xs:documentation>
         Base type for defining delimiters
       </xs:documentation>
     </xs:annotation>
     <xs:complexContent>
       <xs:extension base="EmptyType">
         <xs:attribute name="value" type="DelimiterValueType"
             use="required"/>
       </xs:extension>
     </xs:complexContent>
   </xs:complexType>
   <xs:complexType name="FieldGrammarType" mixed="false">
     <xs:annotation>
       <xs:documentation>
         Base type for defining field grammars
       </xs:documentation>
     </xs:annotation>
     <xs:complexContent mixed="false">
       <xs:extension base="EmptyType">
         <xs:attribute name="FieldNumber" type="FieldNumberType"
             use="required"/>
         <xs:attribute name="ElementName" type="NMToken127"
             use="required"/>
         <xs:attribute name="DataType" type="BBDataType"
             use="required"/>
       </xs:extension>
     </xs:complexContent>
   </xs:complexType>
   <xs:annotation>
     <xs:documentation>
       These simple types define unions of other simple types.
     </xs:documentation>
   </xs:annotation>
   <xs:simpleType name="DelimiterValueType">
     <xs:annotation>
       <xs:documentation>
         Type for column and text delimiters. A union of a
         single character (as a token of length one) or a
         single-byte hex value (two hex digits).
       </xs:documentation>
     </xs:annotation>
     <xs:union memberTypes="Token1 HexBinary1"/>
   </xs:simpleType>
   <xs:simpleType name="RecordTerminatorValueType">
     <xs:annotation>
       <xs:documentation>
         Type for value attribute of RecordTerminator - union of
         U, W, and single-byte hex (two hex digits).
       </xs:documentation>
     </xs:annotation>
     <xs:union memberTypes="OSTerminatorType HexBinary1"/>
   </xs:simpleType>
   <xs:annotation>
     <xs:documentation>
       These simple types specify restrictions on built-in schema
       data types.
     </xs:documentation>
   </xs:annotation>
   <xs:simpleType name="OSTerminatorType">
     <xs:annotation>
       <xs:documentation>
         Enumerations for OS terminator values
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:token">
       <xs:enumeration value="U"/>
       <xs:enumeration value="W"/>
     </xs:restriction>
   </xs:simpleType>
   <xs:simpleType name="HexBinary1">
     <xs:annotation>
       <xs:documentation>
         Type for a single-byte hex number
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:hexBinary">
       <xs:length value="1"/>
     </xs:restriction>
   </xs:simpleType>
   <xs:simpleType name="Token1">
     <xs:annotation>
       <xs:documentation>
         Token with length of 1
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:token">
       <xs:length value="1"/>
     </xs:restriction>
   </xs:simpleType>
   <xs:simpleType name="BBDataType">
     <xs:annotation>
       <xs:documentation>
         These are the supported native Babel Blaster data types
         for CSV files. Add an enumeration element when adding a
         new data type.
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:token">
       <xs:enumeration value="AN"/>
       <xs:enumeration value="R"/>
       <xs:enumeration value="DMMsDDsYYYY"/>
     </xs:restriction>
   </xs:simpleType>
   <xs:simpleType name="NMToken127">
     <xs:annotation>
       <xs:documentation>
         Data type for Element names. Restricted to 127 characters
         since C++ char arrays are 128.
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:NMTOKEN">
       <xs:maxLength value="127"/>
     </xs:restriction>
   </xs:simpleType>
   <xs:simpleType name="FieldNumberType">
     <xs:annotation>
       <xs:documentation>
         Enforces restriction on maximum number of fields
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:positiveInteger">
       <xs:maxExclusive value="100"/>
     </xs:restriction>
   </xs:simpleType>
   <xs:simpleType name="anyURI127">
     <xs:annotation>
       <xs:documentation>
         Enforces restriction on maximum length of URI
       </xs:documentation>
     </xs:annotation>
     <xs:restriction base="xs:anyURI">
       <xs:maxLength value="127"/>
     </xs:restriction>
   </xs:simpleType>
 </xs:schema>
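The two union types, DelimiterValueType and RecordTerminatorValueType, mean that a program reading a file description document has to decide which member type a value belongs to before it can use it. The sketch below shows one way that decision might look. The function names are hypothetical, and the mapping of U to a line feed and W to a carriage return plus line feed is an assumption on my part, since the schema itself says only that these are OS terminator values.

```python
import string

# Assumption: "U" stands for the UNIX terminator and "W" for the Windows one.
OS_TERMINATORS = {"U": "\n", "W": "\r\n"}


def decode_hex_byte(value):
    """Decode a HexBinary1 value: exactly two hex digits, that is, one byte."""
    if len(value) != 2 or any(ch not in string.hexdigits for ch in value):
        raise ValueError(f"not a single-byte hex value: {value!r}")
    return chr(int(value, 16))


def decode_delimiter(value):
    """Decode a DelimiterValueType value (union of Token1 and HexBinary1)."""
    if len(value) == 1:
        return value  # Token1: the character stands for itself
    return decode_hex_byte(value)


def decode_record_terminator(value):
    """Decode a RecordTerminatorValueType value (OSTerminatorType or HexBinary1)."""
    if value in OS_TERMINATORS:
        return OS_TERMINATORS[value]
    return decode_hex_byte(value)


print(repr(decode_delimiter(",")))
print(repr(decode_delimiter("09")))
print(repr(decode_record_terminator("W")))
```

The hex form is what makes nonprinting delimiters such as a tab (09) expressible in an Attribute value.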

The schema representation of the grammar is actually very simple. The schema is somewhat of an abstraction of the grammar in that we can ignore physical characteristics such as column and text delimiters. It basically reduces us to saying that the grammar, as expressed in the Grammar Element, is composed of a single row description in the RowDescription Element. A row is composed of one or more columns as described in the ColumnDescription Element. The schema, in effect, reduces the description of the grammar to the following productions.

CSV File Grammar in Schema

 CSVFileGrammar ::= rowGrammar
 rowGrammar ::= columnGrammar+

Again, we'll see the more complex version used when we discuss the parsing algorithm.
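To see how the reduced productions map onto an actual file description document, here is a small sketch that reads the Grammar Element with Python's standard xml.etree.ElementTree module. The sample document is hypothetical: its Element names and Attribute names follow the schemas above, but the particular values (Invoices, InvoiceLine, the three columns, and the break column numbers) are made up for illustration. The function name read_row_grammar is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; structure follows CSVSourceFileDescription.xsd.
SAMPLE = """\
<CSVSourceFileDescription>
  <PhysicalCharacteristics>
    <RecordTerminator value="W"/>
    <ColumnDelimiter value=","/>
    <TextDelimiter value='"'/>
  </PhysicalCharacteristics>
  <XMLOutputCharacteristics>
    <DocumentBreakColumn value="1"/>
    <PartnerBreakColumn value="2"/>
  </XMLOutputCharacteristics>
  <Grammar ElementName="Invoices">
    <RowDescription ElementName="InvoiceLine">
      <ColumnDescription FieldNumber="1" ElementName="InvoiceNumber"
          DataType="AN"/>
      <ColumnDescription FieldNumber="2" ElementName="CustomerName"
          DataType="AN" DelimitText="true"/>
      <ColumnDescription FieldNumber="3" ElementName="Amount"
          DataType="R"/>
    </RowDescription>
  </Grammar>
</CSVSourceFileDescription>
"""


def read_row_grammar(xml_text):
    """Return (row Element name, list of column descriptions)."""
    root = ET.fromstring(xml_text)
    # CSVFileGrammar ::= rowGrammar
    row = root.find("Grammar/RowDescription")
    columns = []
    # rowGrammar ::= columnGrammar+
    for col in row.findall("ColumnDescription"):
        columns.append({
            "field_number": int(col.get("FieldNumber")),
            "element_name": col.get("ElementName"),
            "data_type": col.get("DataType"),
            # ElementTree does not apply schema defaults, so supply "false".
            "delimit_text": col.get("DelimitText", "false") in ("true", "1"),
        })
    return row.get("ElementName"), columns


row_name, columns = read_row_grammar(SAMPLE)
print(row_name)
for column in columns:
    print(column)
```

Because ElementTree does not validate against a schema, the sketch applies the DelimitText default itself. A validating parser would be needed to enforce the other constraints, such as the limits on field numbers.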