Analyzing the Legacy Non-XML Grammars | Using XML with Legacy Business Applications

As I noted when developing the CSV to XML utility, as complicated as XML-based grammars can be, the grammars of other formats aren't necessarily dumb and simple. We can get into trouble and ask for processing problems if we don't analyze grammars correctly and process them with the appropriate algorithms. While I'm not going to go heavily into language analysis and parsing algorithms, I am going to borrow a few techniques from compiler construction and the formal study of languages. If you've studied computer science you'll probably already be familiar with most of these concepts. However, if you haven't or if it has been a few years , this discussion will help build a foundation.

Let's start by reviewing and clarifying some basic terminology. The main terms we're concerned about are "language" and "grammar." We'll use these terms as they are conventionally used in the formal study of languages in computer science. In this sense language is defined as "a set of strings involving symbols from some alphabet" [Martin 1991]. This very broad definition allows us to consider not only English as a language but also a set of seemingly random strings of ones and zeroes as a language. A grammar is a set of rules that describes a language. For example, various rules describe proper English grammar, such as "A complete sentence must have a noun and a verb." The languages we are concerned with have similar grammar rules. If a string of symbols conforms to the rules of the grammar, it is said to be "in the language."

With these definitions in hand, let's make sure that we understand the scope of the problem. We'll start by using XML itself as an example. XML is by itself a fairly simple language. Its grammar is defined in the XML 1.0 Recommendation. This standard specifies Unicode as the language's alphabet, and it covers such things as how to use special characters (such as angle brackets and ampersands) to specify tags and entities. XML's main use comes from its extensibility, that is, the ability to use it to define other markup languages. In essence, a new markup language is not defined by, for example, ANSI ASC X12, the Open Applications Group , RosettaNet, or Commerce One, but by every individual document specification produced by such groups. Each DTD or schema defines a specific language. In the strictest terms, OASIS's Universal Business Language is not a language, but its Order document is. XML 1.0 defines the general characteristics of these languages, and as such it defines a class of languages. The XML 1.0 DTD or the W3C Schema language is used to describe each of these specific languages in very precise terms.

We face a similar situation in developing utilities that handle specific non-XML formats. Just as languages based on XML 1.0 correspond to a class of grammars, each of the formats we want to support corresponds to a separate class of grammars. For example, the X12 EDI syntax, that is, its basic grammar, is defined in the X12.6 standard. However, each X12 Transaction Set standard, such as the 850 Purchase Order, through its grammar defines a specific language based on the X12.6 grammar. Our challenge in developing generalized utilities is that we don't want solely to support specific grammars such as a UN/EDIFACT ORDER; instead we want to support classes of grammars such as UN/EDIFACT (based on ISO 9735).

In supporting classes of grammars we need to consider the overall characteristics of the grammars. Where the class of grammar is defined by a standard, as is the case with ANSI X12 and UN/EDIFACT EDI, most of the work is done for us already. However, where the class of grammar is not defined formally , as is the case with CSV or flat files, we'll have to abstract and generalize the essential features in order to come up with a definition of the grammar. In addition, even the EDI grammars will need some "tweaking" to be easily processed in our XML world.

All of the grammars with which we are concerned fall into a special class of grammars called "context-free grammars." Various notations can be used to describe such grammars, but the one most commonly used is Backus-Naur form ( BNF ). There are various flavors of BNF. X12.6 defines its own and uses it to describe the X12.6 grammar. The W3C also has something called Extended Backus-Naur Form ( EBNF ). EBNF is defined in the XML 1.0 Recommendation. Since we're working in an XML world we'll use EBNF as the base for our notation. I say "base" because in describing classes of grammars rather than specific grammars even EBNF has some limitations.

In BNF an individual grammar rule is known as a production . A set of productions makes up the complete grammar. An individual production is similar to a mathematical equation, with a single symbol on the left-hand side and an expression on the right. An expression is composed of one or more symbols combined in various ways, such as concatenation or choice. Symbols are further classified as being terminal (composed of one or more characters in the alphabet that aren't further reduced into expressions) or nonterminal (defined only in terms of the expressions to which they are equated).

As an example, let's consider a language composed of all strings of three zeroes or ones, in any combination of zeroes and ones, in any order. We can define this language using the following grammar expressed in EBNF notation.

The Language of Three Zeroes or Ones

 Sentence ::= zero_or_one zero_or_one zero_or_one zero_or_one ::= "0"  "1"

Sentence is the starting nonterminal symbol in the grammar. zero_or_one is the other nonterminal. The first production shows a concatenation of three of these symbols. "0" and "1" are terminal symbols in the grammar, shown in quotes as literals from the alphabet. The second production shows a choice of one or the other of these two terminals. The strings 000, 010, and 101 are in the language.

Getting a bit more specific, most of the grammars we are dealing with are not only context-free grammars but also members of a special class of grammars called "LR(0) grammars." This is a term from compiler construction that describes a set of languages that can be parsed from left to right with zero characters of lookahead . It can also be derived from terminal symbols to nonterminal symbols in a bottom-up fashion starting at the right side of a syntax tree. It is beyond the scope of this book to really explain what all of that means. The important thing to understand from this concept is that for most of these file formats we can fully identify a character from an input file, that is, understand its meaning and place in the grammar, simply by considering what we have already read from the file. We don't need to read ahead beyond the character to understand what we should do with it.

Understanding the types of grammars we are using is also very important because it gives us guidance about parsing algorithms that might be appropriate for processing the grammar. We can look to the study of grammars and parsing algorithms in compiler construction for some ideas about how to best process these non-XML files as input. (I should note that some subsets of our legacy grammars also fall into an even narrower class of grammars called "regular expressions." I won't delve too much further into this area, though, because BNF with its orientation toward context-free grammars has become the de facto standard syntax for expressing language grammars. However, the fact that some are regular expressions is significant because this points us to yet other parsing algorithms, as we'll see in Chapters 7 and 9 on CSV and EDI, respectively.)

We'll start our grammar analysis in the next chapter by analyzing the grammar of CSV files. This will be fairly simple since we're restricting the types of CSV files we're supporting to those in which every row has the same format, that is, the type of data in each column is the same for every row, similar to a spreadsheet with column headings. In this case we're concerned primarily with the grammar of a row. In dealing with flat files we'll pay more attention to the grammar of the file structure since we need to understand the organization and sequence of the various types of records in the file. Finally, in dealing with EDI we'll be concerned with both the grammar of the file (or EDI document) and the grammar of a record (or EDI segment) in the file.

A final point, directly related to my previous comments about tweaking EDI grammars, is how we're going to use our grammars. Our utilities are primarily concerned with converting legacy non-XML formats to and from XML. As such, whether or not the legacy files strictly comply with their full grammars is not a consideration for us. We only need to understand and specify enough of the grammar to enable us to convert the format to and from XML. For example, in developing the utilities we're not going to be concerned with whether or not all of the required segments and elements are present in an X12 EDI transaction set. We just want to be able to understand enough of its structure to accurately convert it to and from XML. Doing full grammar checking would add a huge amount of complexity to our code, and that would be in conflict with several of our requirements.

So, you ask, if we're not going to do full compliance checking in the utilities, how can we be sure that the non-XML formats comply with our business constraints? How can we be sure we're sending or receiving valid EDI? Saying we're not going to build validation of the non-XML formats in the utilities doesn't mean we're not going to support full (or almost full) grammar and content validation. We'll do it with schemas. We'll talk about how in just a minute, but let's finish talking about the non-XML grammars.