High-Level Design Considerations | Using XML with Legacy Business Applications

Grammar Analysis and Description

I have mentioned that EDI grammar is the most complicated grammar we'll deal with in this book because we must be concerned with both the overall file structure and the grammar of an individual record (or EDI segment). In this section I will finally prove it to you.

It takes some fairly complicated processing to handle EDI grammar. I prefer to tackle this task only once by developing a generalized approach that can be used for most classes of EDI grammars. For that reason we'll consider grammar features that aren't necessarily part of version 004010 of X12 but may be features of later versions or other syntaxes such as UN/EDIFACT's ISO 9735 syntax. We're not going to concern ourselves with interchange and group structure in this analysis since they are fairly fixed for each syntax and sometimes require special processing. However, where applicable we'll apply the segment grammar analysis to our processing of control segments.

As usual, we'll start with the overall structure of a logical document. In X12 this is a transaction set and in UN/EDIFACT a message. In looking at the overall structure, as is the case with segment grammar, we are concerned not with the definition of a document standard, per se, but with the grammar of an instance of the document as it appears in a data stream.

EDI Document Grammar

 document ::= (segment | segment_group)*
 group ::= segment (segment | segment_group)*

Note that the first production differs from its counterpart in the flat file analysis. In the flat file analysis we defined a document as a group. Header and trailer control segments are generally considered part of the published document standards, and if we included them, the group production would be applicable. However, we have omitted them, and we have a different production for this document because the first segment in a document following the header control segment is not necessarily mandatory. In X12 transaction sets it is customary for a mandatory segment such as the BEG or BIG in our examples to be defined immediately following the ST segment. However, this is not part of X12.6 and the X12 Design Guidelines specifically state that transaction sets are not required to have a unique beginning segment. I've not thoroughly reviewed ISO 9735, the UN/EDIFACT design rules and guidelines, or the message definitions. However, knowing that the X12 syntax has this caveat is sufficient grounds for expressing the first production as I have.

If I had included the header control segment in the grammar then, like the flat file grammar, the first production could equate a document with a group. There are tradeoffs with both approaches. Had header and trailer control segments been included, the processDocument and processFile methods could be developed very simply. Like those in the flat file classes, they could primarily just call processGroup. However, while making things easier for the developer, this approach makes things harder for the user in a few ways. The user would have to include the header and trailer control segments in the Grammar Element of all file description documents. When converting from XML to X12, the user would also have to provide their XML representations. This would require code in every component that produces such XML documents, such as XSLT style sheets. When given a choice between making things harder for users and making things harder for developers, I generally favor users. Hence, we use the first production as written above.

The definition of a segment group in most EDI syntaxes is fairly specific, and the generalized grammar in the second production should cover all of them. It is worth noting that X12.6 specifically states that the first segment of a segment group shall not appear in any other position within the group. The production accommodates this restriction. On the other hand, UN/EDIFACT is not so restrictive. There is at least one message, BAPLIE, in which the starting segment of a segment group may appear later within the group. Parsing ambiguity is prevented by making mandatory at least one segment that precedes the second usage and one segment that succeeds it. Our grammar is general enough to also accommodate this model.
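
To make the two document productions concrete, here is a minimal Python sketch of a recursive walk over the segment identifiers of one document. It is not the processing developed in this book's utilities. The description format (a list whose items are either a segment identifier or a nested list that starts with a group's first segment), the function names, and the sample layout are all hypothetical. The sketch assumes the X12.6 rule that a group's starting segment appears nowhere else in the group, and it ignores segment order, repeat counts, and mandatory or optional status.

```python
def parse_children(ids, pos, children):
    """Consume segments that match the child descriptions, starting at pos.

    children holds segment identifiers (str) and group descriptions (lists
    whose first item is the group's starting segment identifier).
    Returns (parsed_items, new_pos).
    """
    parsed = []
    while pos < len(ids):
        seg_id = ids[pos]
        match = next((c for c in children
                      if (c if isinstance(c, str) else c[0]) == seg_id), None)
        if match is None:
            break                          # segment belongs to an enclosing level
        if isinstance(match, str):
            parsed.append(seg_id)          # a plain segment
            pos += 1
        else:
            group, pos = parse_group(ids, pos, match)
            parsed.append(group)           # one iteration of a segment group
    return parsed, pos


def parse_group(ids, pos, description):
    """group ::= segment (segment | segment_group)*"""
    start = ids[pos]                       # caller has matched description[0]
    body, pos = parse_children(ids, pos + 1, description[1:])
    return [start] + body, pos


def parse_document(ids, description):
    """document ::= (segment | segment_group)*"""
    parsed, pos = parse_children(ids, 0, description)
    if pos != len(ids):
        raise ValueError(f"unexpected segment {ids[pos]!r} at position {pos}")
    return parsed


# Hypothetical layout: BEG, an N1 group, a PO1 group holding a PID group, CTT.
layout = ["BEG", ["N1", "N3", "N4"], ["PO1", ["PID"]], "CTT"]
segments = ["BEG", "N1", "N3", "N4", "N1", "N4", "PO1", "PID", "PO1", "CTT"]
print(parse_document(segments, layout))
```

Because the document production allows zero or more items, an empty list of segments parses successfully, which mirrors the point that no particular first segment is mandatory. Handling the BAPLIE case would require the description to carry mandatory and optional information, which this sketch leaves out.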

We complete the BNF by describing the grammar of an EDI segment. These productions accommodate both repeating data elements and the release character used in UN/EDIFACT's ISO 9735 syntax. Note that the terminology in the productions doesn't follow X12 exactly in all cases since we're trying to describe an abstract grammar that also applies to other EDI syntaxes. However, the concepts are the same.

EDI Segment Grammar

 segment ::= segment_identifier
             (element_separator data_element?)*
             element_separator data_element segment_terminator
 segment_identifier ::= char char char?
 data_element ::= simple_element | composite_structure |
                  repeating_element
 composite_structure ::=
     (component_element? component_separator)* component_element
 component_element ::= simple_element
 repeating_element ::=
     ((simple_element? repetition_separator)* simple_element) |
     ((composite_structure? repetition_separator)*
         composite_structure)
 simple_element ::= ((release_char special_char) | char)+
 delimiters ::= segment_terminator | element_separator |
                component_separator | repetition_separator |
                release_char
 segment_terminator ::= special_char
 element_separator ::= special_char
 component_separator ::= special_char
 repetition_separator ::= special_char
 release_char ::= special_char
 special_char ::= a character from the approved character set that
     doesn't appear in the data stream (except for the
     release_char) and that is not used for any of the other
     delimiters
 char ::= a character from the approved character set, excluding
     delimiters unless preceded by the release_char

Though this is not nearly as complicated as even the basic XML 1.0 grammar, it is still a somewhat complicated grammar, if only for the number of productions. Let's see what it actually says.

The segment production says that a segment begins with a segment identifier and ends with an element separator, a data element, and a segment terminator. Between the segment identifier and the final element separator there may be zero or more data elements or empty data elements, each preceded by an element separator. This production specifically says that trailing element separators are not allowed.
The segment identifier production says that it is a two- or three-character string.
The data element production amplifies how the data element is used in the segment production, saying that it can be a simple data element, a composite data structure, or a repeating data element.
The composite structure production says that a composite data structure must end with a simple data element. The final simple data element may be preceded by zero or more simple data elements or empty data elements, each followed by a component data element separator.
The repeating element production says that it may be composed of either simple data elements or composite data structures, but not both. It must end with a data element that may be preceded by zero or more data elements or empty data elements, each followed by the repetition separator.
The remaining productions clarify simple data elements and delimiters.
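
The productions described above can be sketched as a small Python parser for a single segment. This is only an illustration of the grammar, not the parseRecord algorithm developed later in the chapter. The Delimiters class, its default values, and the function names are assumptions made for the example. The sketch enforces the segment identifier length and the ban on trailing element separators, but it does not check that a composite structure ends with a non-empty component or that a repeating element holds only one kind of data element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiters:
    """Hypothetical delimiter set; the defaults are illustrative only."""
    segment_terminator: str = "'"
    element_separator: str = "+"
    component_separator: str = ":"
    repetition_separator: str = "*"
    release_char: str = "?"          # use "" for a syntax with no release character


def split_released(text, separator, release_char):
    """Split on separator, skipping any separator preceded by release_char.

    Release sequences stay in the pieces so that nested splits still see them.
    """
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if release_char and ch == release_char and i + 1 < len(text):
            current.append(text[i:i + 2])    # release_char plus the released char
            i += 2
        elif ch == separator:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unrelease(text, release_char):
    """Drop release characters from a simple element, keeping released chars."""
    out, i = [], 0
    while i < len(text):
        if release_char and text[i] == release_char and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_segment(raw, d=Delimiters()):
    """Return (segment_identifier, elements) for one terminated segment.

    Each element is a list of repetitions, and each repetition is a list of
    component strings. An empty data element comes back as [[""]].
    """
    if not raw.endswith(d.segment_terminator):
        raise ValueError("missing segment terminator")
    body = raw[:-len(d.segment_terminator)]
    fields = split_released(body, d.element_separator, d.release_char)
    seg_id, fields = fields[0], fields[1:]
    if len(seg_id) not in (2, 3):
        raise ValueError("segment identifier must be two or three characters")
    if not fields or fields[-1] == "":
        raise ValueError("segment must end with a non-empty data element")
    elements = []
    for field in fields:
        reps = split_released(field, d.repetition_separator, d.release_char)
        elements.append([
            [unrelease(c, d.release_char)
             for c in split_released(rep, d.component_separator, d.release_char)]
            for rep in reps
        ])
    return seg_id, elements


# A made-up segment: the second element repeats a composite and releases a "+".
print(parse_segment("NAD+BY+123?+45:ZZ*678:ZZ++ACME'"))
```

Splitting from the outside in (elements, then repetitions, then components) while leaving release sequences in place until the innermost level is one simple way to keep a released delimiter from being mistaken for a real one at any level.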

The grammar of control segments is a subset of this grammar. Repeating data elements are not allowed. X12 control segments do not contain composite data structures, and the release character is not supported in X12. I have also heard it asserted that the release character is not allowed in UN/EDIFACT service segments. I've been unable to confirm this assertion in either version 2 or 4 of ISO 9735. However, even if the release character is permitted in a UN/EDIFACT service segment, I've never seen one actually used. I think it highly unlikely that people would use a character needed for releasing in any of the identifiers used in these segments. For these reasons we can often use somewhat simpler approaches (such as string token routines) for processing control segments.
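
As a rough illustration of the string token approach, the following Python sketch splits an X12 control segment with nothing more than a string split. The function name and the sample GS values are made up for the example, and the delimiters are passed in as assumptions rather than read from the ISA segment as a real utility would need to do.

```python
def tokenize_control_segment(raw, element_separator="*", segment_terminator="~"):
    """Split a control segment into its identifier and simple data elements.

    No composite, repetition, or release handling is attempted, on the
    assumption that control segments do not need it.
    """
    if not raw.endswith(segment_terminator):
        raise ValueError("missing segment terminator")
    tokens = raw[:-len(segment_terminator)].split(element_separator)
    return tokens[0], tokens[1:]


# Sample functional group header with invented sender and receiver codes.
seg_id, elements = tokenize_control_segment(
    "GS*PO*SENDER*RECEIVER*20030101*1200*1*X*004010~")
print(seg_id)        # GS
print(elements[7])   # 004010
```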

So, if you weren't convinced before that EDI grammars can be fairly complicated, I hope you are by now! I frequently encounter people who think that XSLT is an appropriate tool for parsing raw EDI. I still on occasion field questions from people who want to write their own programs to process raw EDI rather than buying a translator or using something like the utilities developed in this chapter. A lot of pain, suffering, and avoidable expense would be saved if such people reviewed this or a similar analysis before contemplating such dubious endeavors.

This analysis prepares the ground for the parsing algorithm we'll develop for the EDIRecordReader's parseRecord method. It will also help make clearer some of the processing performed in the EDIRecordWriter's writeRecord method.
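
On the output side, the grammar implies two things that a writer has to respect: delimiters appearing in data must be released, and a segment may not end with trailing element separators. The Python sketch below shows one way to honor both. It is not the writeRecord method itself. The function names, the default delimiters, and the representation of elements (a string for a simple element, a list of strings for a composite structure) are assumptions, and repeating elements are left out to keep it short.

```python
def release(text, delimiters, release_char):
    """Prefix each delimiter or release character in text with release_char."""
    if not release_char:
        return text                      # syntax has no release character
    special = set(delimiters) | {release_char}
    return "".join(release_char + ch if ch in special else ch for ch in text)


def write_segment(seg_id, elements, element_separator="+",
                  component_separator=":", segment_terminator="'",
                  release_char="?"):
    """Serialize one segment from simple (str) and composite (list) elements."""
    delimiters = element_separator + component_separator + segment_terminator
    fields = []
    for element in elements:
        components = [element] if isinstance(element, str) else list(element)
        comps = [release(c, delimiters, release_char) for c in components]
        while comps and comps[-1] == "":
            comps.pop()                  # a composite must end with a component
        fields.append(component_separator.join(comps))
    while fields and fields[-1] == "":
        fields.pop()                     # no trailing element separators
    if not fields:
        raise ValueError("segment needs at least one non-empty data element")
    return (seg_id + element_separator
            + element_separator.join(fields) + segment_terminator)


print(write_segment("NAD", ["BY", ["123+45", "ZZ"], "", "ACME", ""]))
# NAD+BY+123?+45:ZZ++ACME'
```

With an empty release_char, as for X12, the sketch writes data as given, so the caller would need to make sure no delimiter appears in the data.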

File Description Document Schemas

As with our other formats, there are four schemas involved in describing X12 interchanges and their XML representations.

X12SourceFileDescription.xsd : This schema is for file description documents that describe conversions in which the source is an X12 interchange.
X12TargetFileDescription.xsd : This schema is for file description documents that describe conversions in which the target format is an X12 interchange.
X12CommonFileDescription.xsd : This type library schema is used by the two previous schemas.
BBCommonFileDescription.xsd : Again, this is our common type library.

The schemas introduced in this chapter are very similar to those developed in Chapters 7 and 8. The main difference in our X12 schemas is that instead of using the FieldGrammarType from our common schema, we define an ElementGrammarType. The main reason we do this is that our X12 utilities don't support all the other data types we have defined. We create a new X12DataType simple type for the DataType Attribute instead of using the BBDataType.

Note that the SegmentType complex type, unlike the flat file's FlatRecordType and the CSV's RowDescription Element, allows two different types of child Elements in its sequence. We can have a mix of SimpleElementDescription Elements and CompositeStructureDescription Elements. We indicate this with a choice content model within the sequence.