Representing the Legacy Non-XML Grammars in XML

In the next three chapters we'll take a detailed look at each of the languages we'll use to describe the grammars of our legacy file formats. First, though, let's look at the overall approach, since it will help clarify some of the classes and methods discussed later in this chapter. There are four major areas we want to address here:

The XML instance documents that correspond to our legacy file formats
The file description documents that describe the grammar and other characteristics of our legacy file formats
The schemas for our file description documents
The schemas that are optionally used to validate our instance documents

Instance Document Design

As discussed in Chapter 4, the place to start is how we want the instance document to look. The bullet items below outline the major decisions involved. In the next subsection we'll make the same decisions in regard to the file description documents that describe our legacy non-XML formats.

Elements and Attributes : I use Elements only for application data, mainly because the general purpose utilities we're discussing are designed to handle a wide variety of logical documents. The legacy formats in and of themselves offer little basis for deciding what should be an Attribute instead of an Element. For example, when dealing with a generic CSV row all the columns have the same significance. Building in the capability to designate some data fields as Attributes of others that are Elements adds complexity both for the end user and for the developer. The rationale is discussed in a bit more detail in the section on DOM processing considerations later in this chapter.
Naming conventions : The utilities are designed to enable users to assign Element names of their choice. I recommend simple, common business terms set in upper camel case. You are free to use whatever approach makes sense to you. You will probably only see and use these names when setting up the conversion utilities and in the XSLT stylesheets that convert to and from your chosen XML format.
Specific names versus qualifier/value pairs : This is something of a moot point since our legacy formats already reflect the decision. Your XML document is going to look like your flat file or EDI document.
Structure : The structure will match the logical hierarchy of the legacy format, not necessarily its physical hierarchy. For example, our CSV format has only one record type, so the hierarchy is flat. Flat files and EDI formats can have groups nested within groups, yielding hierarchies that are sometimes several levels deep.
Namespaces : We're not using them; we're going to use the unnamed default target namespace. Named namespaces, versus the default unnamed namespace, are useful in many circumstances. For example, they can disambiguate Element and Attribute names that would otherwise be duplicates by associating them with different namespaces. They can also make it more explicit that certain names come from other domains. However, in the case of our legacy file converters, the XML instance documents that we read and write have relevance only to our utilities. They most often will serve as intermediate formats for XSLT transformations and won't be used directly by any other applications. Named namespaces, particularly if we require Elements to have namespace prefixes, add no clarity and add complexity to our code. We're following the principle of KISS, remember?
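As a rough illustration of these decisions, the following Python sketch turns CSV rows into an instance document that uses Elements only, upper camel case names, a flat structure, and no namespaces. The function name csv_to_instance and the Element names (OrderLines, OrderLine, ItemNumber, and so on) are invented for the example; they are assumptions, not names taken from the utilities described in this book.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_instance(text, root_name, row_name, column_names):
    """Convert CSV text to an element-only XML tree with no namespaces."""
    root = ET.Element(root_name)
    for row in csv.reader(io.StringIO(text)):
        record = ET.SubElement(root, row_name)
        for name, value in zip(column_names, row):
            # Every column becomes a child Element; no Attributes are used.
            ET.SubElement(record, name).text = value
    return root


# Hypothetical column names chosen by the user, in upper camel case.
doc = csv_to_instance(
    "1001,Widget,25\n1002,Gadget,4\n",
    root_name="OrderLines",
    row_name="OrderLine",
    column_names=["ItemNumber", "Description", "Quantity"],
)
print(ET.tostring(doc, encoding="unicode"))
```

Because the CSV grammar has only one record type, the resulting tree is just a root, a row Element per record, and a child Element per column.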

File Description Document Design

General Considerations

As I alluded to previously, in addition to specifying the grammars of our legacy non-XML formats, the file description documents also describe physical characteristics of the input files, such as record terminators, and certain characteristics of the output XML documents, such as schema URLs. Again, let's look first at the general choices we need to make about the design of these documents. In the previous section we made decisions about how our instance documents should look. Here we consider how the file description documents should look. We allowed a bit more flexibility in the instance document design than we will here, largely due to the variation in legacy file formats. Also, we don't want to impose restrictions where they aren't absolutely required. However, our processing code is more tightly linked to the design style of the file description documents, so we'll impose a bit more order to make our coding easier.

Naming conventions : I have chosen relatively intuitive, descriptive names for the Elements and Attributes, with no formal analysis behind them. I have also chosen to use upper camel case.
Specific names versus qualifier/value pairs : This is not much of a consideration for the file description documents. To the extent that it applies, I use specific names.
Structure : The structure follows a logical breakdown of the file and grammar characteristics. Where we have a flat grammar, as in CSV files, the XML representation is pretty flat with only three levels. More complex grammars such as EDI can have several levels.
Namespaces : Again, for most of the considerations noted in the previous section, we're not using them.

The other major choice that needs to be made about the general format of the file description documents is how we use Elements and Attributes. This one deserves a bit more discussion than just a bullet point. Reviewing my thought processes and what I ended up doing may help you with similar decisions in the future.

My initial inclination was to use only Elements, matching the appearance and processing approach used for the instance documents that correspond to our legacy formats. However, when dealing with the grammars, particularly when considering the grammars of flat files and EDI documents, I wanted to use some Attributes. I felt the grammars would be easier to work with if the nonterminal symbols, such as groups, records, and fields, were represented as Elements. There were a few properties of groups and records that I also needed to represent, but I didn't want to depict those as Elements on a peer level with the Elements representing the grammar symbols. For example, I didn't want a flat file's record identifier tag to be a sibling Element to those representing the fields in the record's grammar. That led to creating a few selected Attributes for such things. Then, as I began to design the code, it became apparent that certain field characteristics, such as column number and data type, might be more easily accessed as Attributes. This was leading me to an inconsistent design in which some parts of the document used only Elements and other parts, in particular the grammar, used only Attributes. In the end I decided to go completely the other way and just use Elements for structure and Attributes to depict all values. Basically, I decided to adopt the approach used in W3C Schema language. I felt consistency was important within the document, that it would make processing the document easier. More importantly, I felt it would be easier for end users of the utilities if I followed a consistent approach for all parts of the document.

Major Sections and Elements

The grammars of our legacy formats are described using the following basic items. The exact Element names may vary depending on the legacy grammar being described.

Grammar Element : In the file description document, this is the root Element of the subtree that describes the grammar. Depending on the legacy file grammar, it may have RecordDescription or GroupDescription Elements as children.
GroupDescription Element : This Element describes a set of records (and perhaps other groups). Its first child Element is always a RecordDescription. It may have one or more other RecordDescription or GroupDescription Elements as children. The GroupDescription Element is not used in our CSV file grammar since we support only one record type and therefore have no groups.
RecordDescription Element : This Element describes the structure and overall characteristics of a record. Its child Elements are FieldDescription Elements.
FieldDescription Element : This Element has a number of Attributes that describe characteristics of the field in the legacy format. It always specifies a name, a data type, and a number. Depending on the legacy format it may also specify information such as length, minimum or maximum length, offset, and fill character.

EDI grammars add another layer, but we'll talk about that in Chapter 9.
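To make the "Elements for structure, Attributes for values" style concrete, here is a minimal Python sketch that reads a hypothetical grammar fragment. The Element names come from the list above, but the Attribute names (FieldNumber, FieldName, DataType), the data type codes, and the function name are assumptions made only for illustration; the actual file description languages are covered in the next three chapters.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar fragment. Attribute names and type codes are
# illustrative assumptions, not the actual file description vocabulary.
FILE_DESCRIPTION = """\
<Grammar>
  <RecordDescription>
    <FieldDescription FieldNumber="1" FieldName="ItemNumber" DataType="N"/>
    <FieldDescription FieldNumber="2" FieldName="Description" DataType="AN"/>
    <FieldDescription FieldNumber="3" FieldName="Quantity" DataType="N"/>
  </RecordDescription>
</Grammar>
"""


def read_field_descriptions(xml_text):
    """Return one dict per FieldDescription, ordered by field number."""
    grammar = ET.fromstring(xml_text)
    fields = [
        {
            # All values are read from Attributes of otherwise empty Elements.
            "number": int(field.get("FieldNumber")),
            "name": field.get("FieldName"),
            "data_type": field.get("DataType"),
        }
        for field in grammar.iter("FieldDescription")
    ]
    return sorted(fields, key=lambda f: f["number"])


fields = read_field_descriptions(FILE_DESCRIPTION)
for field in fields:
    print(field["number"], field["name"], field["data_type"])
```

Note how the processing code never has to distinguish "value" child Elements from "grammar" child Elements: every child is a grammar symbol, and every value is an Attribute.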

Schemas for File Description Documents

General Schema Design Considerations

I develop schemas for each of the file description documents and validate the documents against the schemas. Using schema validation enforces some predictability about the documents and makes writing the code to process them a bit easier.

As discussed in Chapter 4, several issues must be considered when designing schemas. We'll discuss schema design for application files in the next subsection and in more detail in Chapter 12, but here is a summary of some of the major design decisions regarding the schemas for the file description documents.

Naming conventions : We've discussed most of them already, but I do want to note that I adhere to a convention of appending "Type" to named types.
Structuring : One schema specifies the file description document when the legacy format is the source of the conversion; another is used when the legacy format is the target. Because the two schemas have very similar data, I declare most of the named types in a common schema for that legacy format. The common schema is then included in the source and target schemas. In addition, a project-wide (Babel Blaster common) schema specifies types used in all the conversions. Figure 6.1 shows the organization.

Figure 6.1. Structuring of File Description Document Schemas
Global Elements versus named types and local Elements : I use named types and local Elements since, due to my programming background, this approach seems to come more naturally. It is very similar to declaring classes, extending them into subclasses, and creating objects of a class.
Named versus anonymous types : If a type is reused, it is named. This holds true for all the types in the include schema. The main exception to this overall approach is in the source and target schemas. The root Element declaration in each has an in-line anonymous type declaration in which its child Elements are declared as a sequence.
Namespaces : As discussed in Chapter 4, we could develop an approach that uses named namespaces in schemas but still avoids using them in instance documents. However, our overall schema architecture is simple enough that segmenting it into two or more namespaces won't add much clarity or other value. To keep things simple, except for the things we use from the schema language namespace, we're not going to use named namespaces in our schemas. We will stay with the approach of using the unnamed default target namespace.

Common Schema for File Description Documents

The common schema, BBCommonFileDescription.xsd, defines several types that are reused in the other schemas. In doing so, it uses a few schema features that deserve comment.

Since all the data values are conveyed as Attributes of empty Elements, some special things needed to be done. The Primer of the XML Schema Recommendation gives one example of such an empty Element. It defines an anonymous complexType with complexContent but defines no child Elements. It defines only Attributes. This approach works well, except that it derives the complexType by restriction from the anyType data type. This means that in an instance document you may add other Attributes. (It seems that the schema for the W3C XML Schema language takes this approach.) However, I wanted validation to be a bit tighter than that, so I created an EmptyType complexType with no child Elements or Attributes and used it as the restriction base for all types used for Elements.

Setting up the DelimiterType for Attributes declaring delimiter values was interesting since there were several choices. The basic requirement for most delimiters, such as column delimiters in a CSV file, was that the user has the ability to declare them as literal characters or as hexadecimal values. The latter is important for EDI files and for future support of EBCDIC-encoded data. My first approach was to declare separate Elements for the character and hex values, but this seemed a bit cumbersome for both the user and for my coding.

A second approach was to declare the DelimiterType as a token with a length of one and have users enter hex values with the character reference syntax (&#x, followed by a hexadecimal number and a semicolon). There are two problems with this approach. The first is that the hex value represents not the actual hexadecimal value in the data stream but the hexadecimal value of the character's "code point" in the ISO/IEC 10646 standard. The concept here is that a number of different characters in different character sets, each having different code points, might all be encoded in a data stream with the same bit pattern. I suspect that most of the potential users of the utilities might be much more familiar with actual encoded hex values than with code points, so this approach didn't seem appropriate. The second problem is that, although I didn't fully investigate this, it appeared to me from the code point ranges listed that it might not be possible to use this syntax to specify control characters encoded with hex values below x20. These control characters are sometimes used as delimiters in EDI syntaxes.
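Both problems can be seen in a few lines of Python. This is only a sketch: the Delimiter Element and Value Attribute are hypothetical names, EBCDIC code page 037 is picked as one example encoding, and the parser behavior shown is that of Python's expat-based ElementTree, which I assume is representative of conforming XML 1.0 parsers.

```python
import xml.etree.ElementTree as ET

comma = ","

# One character, one code point, but different bytes in different encodings.
code_point = ord(comma)               # the number a character reference would use
ascii_byte = comma.encode("ascii")    # byte value in an ASCII data stream
ebcdic_byte = comma.encode("cp037")   # byte value in an EBCDIC (code page 037) stream
print(hex(code_point), ascii_byte.hex(), ebcdic_byte.hex())

# A character reference always names the code point, not the encoded byte.
elem = ET.fromstring('<Delimiter Value="&#x2C;"/>')
print(repr(elem.get("Value")))


def parses(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A reference to the line feed character is accepted, but a reference to a
# control character such as x1D is rejected by the parser.
print(parses('<Delimiter Value="&#x0A;"/>'))
print(parses('<Delimiter Value="&#x1D;"/>'))
```

A user who knows that the comma in an EBCDIC file is the byte x6B would have to enter the code point x2C instead, and a delimiter such as x1D could not be entered this way at all.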

This led me to the third approach, which I adopted: to declare the DelimiterType as a union of a token with a length of one and a hexBinary with a length of one. The latter restriction hung me up briefly since the length is not exactly intuitive. The length facet on hexBinary applies to the number of binary octets that the hexBinary value represents, not the string length of the hexBinary value itself. So, to get a single byte, such as the line feed character, we use a length of one rather than the length of its two-digit hexadecimal representation (0A). It's worth noting, too, that to use delimiters that have special meaning in XML the corresponding predefined entities must be used. For example, use &quot; instead of a literal quotation mark. Defining the RecordTerminatorType was a similar process, except it is a union of a one-byte hexBinary and a token with enumerations of W and U, representing Windows-style and UNIX-style physical record terminators.
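Once a value has passed schema validation, the processing code still has to decide which member of the union it holds. The Python sketch below shows one way that might be done; it is not the code used by the utilities. The function names are hypothetical, the default ASCII encoding for literal characters is an assumption, and the mapping of W to carriage return plus line feed and U to line feed reflects the usual conventions for those platforms.

```python
import re

# A hexBinary value with a length of one octet is exactly two hex digits.
_HEX_OCTET = re.compile(r"[0-9A-Fa-f]{2}")


def resolve_delimiter(value, encoding="ascii"):
    """Return the delimiter bytes for a one-character token or one-octet hexBinary."""
    if len(value) == 1:
        # Literal character; the encoding of the legacy file is an assumption.
        return value.encode(encoding)
    if _HEX_OCTET.fullmatch(value):
        return bytes.fromhex(value)
    raise ValueError(f"not a valid delimiter value: {value!r}")


def resolve_record_terminator(value):
    """Return the terminator bytes for W, U, or a one-octet hexBinary."""
    if value == "W":
        return b"\r\n"   # Windows-style terminator
    if value == "U":
        return b"\n"     # UNIX-style terminator
    if _HEX_OCTET.fullmatch(value):
        return bytes.fromhex(value)
    raise ValueError(f"not a valid record terminator value: {value!r}")


print(resolve_delimiter(","))           # literal character
print(resolve_delimiter("1D"))          # hex value of a control character
print(resolve_record_terminator("W"))
```

The two members of the DelimiterType union can be told apart by string length alone: a literal is one character, while a one-octet hexBinary is always two hex digits.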

NOTE Adding New Data Types

When modifying the utilities to add a new data type, add the enumeration for the type's coded value to the BBDataType type in this schema.

Schemas for Source and Target Documents

How We'll Use Schemas

Schemas for the source and target XML representations of our legacy formats play an important but optional role in this approach. We decided that we'll use our own XML-based languages for describing the grammars of our non-XML formats. In addition, we decided that our utilities, in and of themselves, are not going to perform full validation against business constraints when converting but will perform only the validation necessary for converting our non-XML formats to and from XML. So, what role do schemas play in the architecture?

Schemas play an important role by providing the primary means by which business constraints are enforced. We will validate our XML formats against schemas that define the XML representations of our non-XML formats. Rather than writing a lot of our own code to validate business constraints, we're going to take advantage of the schema validation offered by the XML APIs.

For example, we won't check the raw X12 850 Purchase Order for compliance with the X12 standard (or implementation guideline). We will instead support schema validation of the XML instance document produced from the X12 850 by our EDI to XML utility. Our common strategy in all of these conversion utilities will be to create a one-for-one correspondence (or, in stricter terms, a one-to-one and "onto," or isomorphic, correspondence) between the XML and the non-XML formats. So, if the data satisfies the appropriate business constraints in the XML representation, as defined in a schema, changing the syntax to a non-XML representation should make no difference in satisfying the constraints. The same is true in reverse. Using XML as the format in which our data content is validated further cements XML's central place in our architecture.
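The one-for-one correspondence can be illustrated with a small round trip. This Python sketch uses a CSV format rather than X12 to stay short, and the column names, Element names, and function names are all hypothetical. It simply checks that converting to XML and back reproduces the original data, which is the property that lets validation of the XML stand in for validation of the legacy format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names for a single-record-type CSV file.
COLUMNS = ["ItemNumber", "Description", "Quantity"]


def to_xml(csv_text):
    """Map each CSV row to a record Element and each column to a child Element."""
    root = ET.Element("OrderLines")
    for row in csv.reader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "OrderLine")
        for name, value in zip(COLUMNS, row):
            ET.SubElement(record, name).text = value
    return root


def to_csv(root):
    """Map each record Element back to a CSV row, in column order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in root:
        writer.writerow([record.findtext(name, default="") for name in COLUMNS])
    return out.getvalue()


original = "1001,Widget,25\n1002,Gadget,4\n"
round_tripped = to_csv(to_xml(original))
print(round_tripped == original)
```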

Creating the Schemas

Although I recommend that users at least be able to read schemas, I want to avoid any requirement that they be fluent in schema language. The design approach I present here keeps that consideration paramount. KISS comes first. I propose that users start with a simple approach to designing schemas: use a tool such as XMLSPY or TurboXML, feed it an instance document, and have it generate a schema. Users must review the generated schema and will almost certainly need to clean up and correct a few things such as the assigned data types, minOccurs, and maxOccurs. (This is discussed in more detail in Chapter 7.) Even so, for most users this approach is much easier than hand-coding a schema from scratch every time. If such a tool isn't available, users can take the examples from the next three chapters and hack them up in a text editor or XML Notepad to suit their needs. However, I highly recommend a full-featured XML tool.

If a user is fluent with schema language, he or she could do something as ambitious as developing type libraries for the segments and data elements in the full UN/EDIFACT standard. I'll discuss some considerations and approaches in more detail in Chapter 12. But if he or she just wants to load a document into a tool like XMLSPY and click on Create Schema, that's okay too. The important thing to note is that users have considerable freedom in developing schemas that meet their needs.