Grammar Analysis and Description
While our discussion of CSV grammars focused strictly on the grammar of a row, our consideration of flat file grammars will focus on the logical structure of the file. Flat file record formats are usually not complex (at least in the way we will deal with them).
Before we look at the BNF, let's look at a few examples. Our sample invoice and purchase order files have the following record organization (except that the purchase order doesn't have a summary record). To focus on the structure, I use just the record IDs, and indentation shows the groupings.
HDR SHP LIN DSC SUM
The following example uses somewhat arbitrary record tags.
A010 B020 C030 D040 E050 F060 G070 H080 I090 J100
What flat file formats usually have in common, with a few exceptions, is that each record in the hierarchy has a unique identifier. Individual record formats are usually not used in more than one location in the logical hierarchy of the file. Given this characteristic, it would be quite easy to just use the record identifier as an index into the record grammar. However, doing so would require in all cases that each record format appear at only one location in the hierarchy. This might make the utility unusable for those few odd designs that do repeat the same record at different places in the hierarchy. In addition, if possible, we would like to generalize our approach so that it might apply to other formats that use groups, such as EDI. (EDI does not observe this unique record ID restriction.)
If we abstract the essential features of flat file organization, we see that the following BNF adequately describes all our cases.
Flat File Grammar
FlatFile ::= document+
document ::= group
group    ::= record (record | group)*
What this says, basically, is that a flat file contains one or more logical documents. Each logical document is a group of records. A group has a start record and contains zero or more records or other groups. The group production is a fairly simple recursive definition, but it is the key to describing flat file hierarchies. It has a great deal of descriptive power. All the examples we have covered so far can be described by this simple grammar. As we'll see in Chapter 9, most EDI formats also can be described by this grammar. The most interesting result of this analysis is that, as a recursive definition, it suggests that we might use a recursive algorithm for processing. We'll see that in later sections on detail design.
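The recursive shape of the group production can be sketched in Python as follows. This is only an illustration, not the design developed later in the chapter: the `Group` class, the `parse_group` and `parse_file` functions, and the nesting chosen for the sample invoice records are all assumptions made for the example. The sketch also ignores ordering and cardinality rules, which a real grammar description would need to carry.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Group:
    """Hypothetical grammar node: a start record tag, then member records or subgroups."""
    start: str
    members: List[Union[str, "Group"]] = field(default_factory=list)


def parse_group(grammar: Group, tags: List[str], pos: int = 0) -> Tuple[list, int]:
    """Consume tags from pos according to grammar; return (nested list, next position)."""
    if pos >= len(tags) or tags[pos] != grammar.start:
        raise ValueError(f"expected {grammar.start!r} at position {pos}")
    tree: list = [tags[pos]]
    pos += 1
    while pos < len(tags):
        tag = tags[pos]
        for member in grammar.members:
            if isinstance(member, Group):
                if member.start == tag:
                    # group ::= record (record | group)*  -> recurse into the subgroup
                    subtree, pos = parse_group(member, tags, pos)
                    tree.append(subtree)
                    break
            elif member == tag:
                tree.append(tag)
                pos += 1
                break
        else:
            break  # tag does not belong to this group; hand it back to the caller
    return tree, pos


def parse_file(grammar: Group, tags: List[str]) -> List[list]:
    """FlatFile ::= document+, where each document is one top-level group."""
    documents, pos = [], 0
    while pos < len(tags):
        document, pos = parse_group(grammar, tags, pos)
        documents.append(document)
    return documents


# Assumed nesting for the sample invoice: LIN starts a line item group that contains DSC.
INVOICE = Group("HDR", ["SHP", Group("LIN", ["DSC"]), "SUM"])
```

Calling `parse_file(INVOICE, tags)` with a list of record IDs read from a file returns one nested list per logical document. Note that the record ID alone decides where a record belongs here, so this sketch still relies on the unique record ID assumption discussed earlier.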
We complete the BNF by describing the grammar of a flat file record.
Flat File Record Grammar
record   ::= (field* recordID field+) | (field+ recordID field*)
recordID ::= field, with an enumerated set of unique values and the same offset and length for all records in the file
field    ::= a string of bytes with a specified offset and length
These three productions basically say that a record contains two or more fields, a field is described by its offset and length, and one of the fields must contain an identifier for the record. The record identifier field must have the same offset and length in all the file's record types. Again, for the final productions strict BNF fails us and we resort to natural language. But I think you get the idea.
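To make the record productions concrete, here is a minimal Python sketch of slicing a fixed-format record by offset and length. Everything specific in it is hypothetical: the `Field` type, the three-character record ID at offset zero, and the field names and widths in `LAYOUTS` are invented for illustration and are not taken from the sample files. It also works on text strings rather than raw bytes to keep the example short.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    name: str
    offset: int  # zero-based position of the first character
    length: int


# Assumption: the record ID has the same offset and length in every record type.
RECORD_ID = Field("RecordID", 0, 3)

# Hypothetical field layouts, keyed by record ID value.
LAYOUTS: Dict[str, List[Field]] = {
    "HDR": [Field("InvoiceNumber", 3, 8), Field("InvoiceDate", 11, 8)],
    "LIN": [Field("ItemID", 3, 6), Field("Quantity", 9, 4)],
}


def slice_field(line: str, f: Field) -> str:
    """Return the raw contents of one field."""
    return line[f.offset:f.offset + f.length]


def parse_record(line: str) -> Dict[str, str]:
    """Read the record ID first, then use it to select the field layout."""
    record_id = slice_field(line, RECORD_ID)
    if record_id not in LAYOUTS:
        raise ValueError(f"unknown record ID {record_id!r}")
    fields = {RECORD_ID.name: record_id}
    for f in LAYOUTS[record_id]:
        fields[f.name] = slice_field(line, f).strip()  # drop padding spaces
    return fields
```

Because the record ID sits at a fixed position, it can be read before anything else is known about the record, which is what lets it select the rest of the layout.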
Of course, the grammar of any particular flat file type is more complicated than these abstractions. However, in order to build general purpose converters we must find an abstract grammar that is general enough to handle most, if not all, of our cases. I think you'll find that this grammar is sufficiently general.
File Description Document Schemas
As with CSV formats, there are four schemas involved in describing flat files.
To prevent further cluttering of this chapter with XML code listings, I'll just discuss the schemas in general terms and list only some particularly interesting snippets. Again, the schemas are available on the Web.
The schemas for the source and target file description documents are again basic shells. The common type library for flat files is a bit more interesting. It uses a few schema language features that, as I noted, aren't normally used in schemas for common business documents. Two of these are shown in the xs:complexType Element that defines the FlatGrammarType type.
FlatGrammarType complexType in FlatCommonFileDescription.xsd
<xs:complexType name="FlatGrammarType">
  <xs:annotation>
    <xs:documentation>
      Describes the grammar of the flat file
    </xs:documentation>
  </xs:annotation>
  <xs:sequence>
    <xs:element name="RecordDescription" type="FlatRecordType"/>
    <xs:choice maxOccurs="unbounded">
      <xs:element name="RecordDescription" type="FlatRecordType"/>
      <xs:element name="GroupDescription" type="FlatGroupType"/>
    </xs:choice>
  </xs:sequence>
  <xs:attributeGroup ref="ElementAndTagType"/>
</xs:complexType>
The first schema language feature to note is the xs:choice Element, representing a choice content model. We use it because it maps very well to the BNF production for our group grammar:
group ::= record (record | group)*
Again, this production says that a group consists of a record followed by zero or more records or groups. The "record or group" concept is logically a choice. In BNF we use the vertical bar symbol (|) to represent it, and it maps directly to the choice content model in schema language. While the choice content model is used very seldom in business document schemas, its use here is entirely appropriate.
The second feature to note is the use of the xs:attributeGroup Element. We use the ElementName and TagValue Attributes on three different Elements in the schema, so it makes sense just to code them as a reusable Attribute group. The definition is shown below.
ElementAndTagType attributeGroup in FlatCommonFileDescription.xsd
<xs:attributeGroup name="ElementAndTagType">
  <xs:annotation>
    <xs:documentation>
      ElementName and TagValue
    </xs:documentation>
  </xs:annotation>
  <xs:attribute name="ElementName" type="xs:NMTOKEN" use="required"/>
  <xs:attribute name="TagValue" type="TagValueType" use="required"/>
</xs:attributeGroup>
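As a rough illustration of how a program might walk a grammar description built from these types, here is a short Python sketch using the standard library's ElementTree module. The instance document is invented: the root Element name `Grammar`, the attribute values, the assumption that RecordDescription and GroupDescription both carry the ElementName and TagValue Attributes, and the assumption that FlatGroupType has the same content model as FlatGrammarType are mine, not taken from the schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical file description fragment; names and values are for illustration only.
SAMPLE = """
<Grammar ElementName="Invoice" TagValue="HDR">
  <RecordDescription ElementName="Header" TagValue="HDR"/>
  <RecordDescription ElementName="ShipTo" TagValue="SHP"/>
  <GroupDescription ElementName="LineItemGroup" TagValue="LIN">
    <RecordDescription ElementName="LineItem" TagValue="LIN"/>
    <RecordDescription ElementName="Description" TagValue="DSC"/>
  </GroupDescription>
  <RecordDescription ElementName="Summary" TagValue="SUM"/>
</Grammar>
"""


def outline(node, depth=0):
    """Return one indented line per grammar, group, or record description."""
    lines = ["  " * depth + f"{node.get('ElementName')} ({node.get('TagValue')})"]
    for child in node:
        # The choice content model means each child is either a record or a group.
        if child.tag in ("RecordDescription", "GroupDescription"):
            lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(SAMPLE))))
```

The recursion in `outline` mirrors the recursion in the group production: a group description may contain further group descriptions to any depth.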