Describing the Legacy Non-XML Grammars | Using XML with Legacy Business Applications

We've decided that we need to approach grammar analysis of our non-XML formats in a more rigorous fashion and that we're going to use EBNF as one of our primary tools. However, the machine still needs to know about the grammars. How are we going to inform the machine about the grammars? We could use EBNF or something similar such as Prolog, but that would involve building the appropriate interpreter into our utilities. Scratch that idea! So what other choices do we have? We'll need to devise a representation for the grammar, and my intuition tells me that the representation may in part be dependent on the format we use to inform the machine. So, what choices do we have about format?

We could encode the grammar somehow in a database. This is a reasonable idea in many ways. Databases can be secure, reliable ways to store data. Databases also give us a way to enforce some constraints on the data. However, adding database support to our utilities comes with quite a bit of baggage. To ensure portability with various database engines, we would probably want to use Open DataBase Connectivity (ODBC) or Java Database Connectivity (JDBC). Database support gives us another technology to deal with and adds complexity. We would also have to give the end user a mechanism to load the grammar into the database, that is, a user interface. Again, this adds complexity. Finally, using a database to encode the grammar imposes a runtime requirement on the user to have an appropriate database engine installed. The disadvantages outweigh the advantages.

We could encode the grammar in flat files or CSV files. Although not quite as secure or reliable as databases, these formats would certainly be adequate. For reading these formats we could use techniques already developed for reading flat and CSV files. Flat files can be created with a text editor, and CSV files can be exported from a spreadsheet. So, we have at least a basic user interface. Also, the users won't need a database engine at runtime. Despite these advantages there are a few disadvantages. Creating flat files with a text editor is very vulnerable to user data entry error. We can put code into the utilities to detect many of these types of errors, but again this adds complexity to the code. Also, some types of errors may be very hard to detect. Getting a field name or position off by one character in the grammar file can lead to errors that may not be detected by the utilities and may only be detected by looking for incorrect output. That is probably not acceptable. Finally, although grammar descriptions could be created with a text editor or a spreadsheet, these are not exactly the friendliest methods from a user perspective. So, flat files or CSV files aren't a very good choice.
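To see why an off-by-one position in a hand-edited grammar file can slip past the utilities, consider the following minimal Java sketch. The record layout, the sample data, and the `extract` helper are all hypothetical and invented purely for illustration. Both the correct and the incorrect offset yield a seven-digit string, so a simple "is it numeric?" check would accept either one.

```java
final class OffByOneSketch {
    // Extracts a field from a fixed-width record. The offset is 1-based, as a
    // hand-written grammar file might specify it (an assumption of this sketch).
    static String extract(String record, int offset, int length) {
        int start = offset - 1;
        int end = Math.min(start + length, record.length());
        return record.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical layout: customer ID in columns 1-5, amount in columns 6-12.
        String record = "001230004575";

        // Offset as intended by the grammar author.
        System.out.println(extract(record, 6, 7)); // prints 0004575

        // Offset mistyped as 5: still seven digits, but the wrong amount.
        System.out.println(extract(record, 5, 7)); // prints 3000457
    }
}
```

Nothing in the second result looks malformed; the mistake would surface only when someone compared the output against the expected amounts.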

That leaves us with XML, which is a pretty good choice since we're working in an XML world. Right? We have two choices in XML representation: (1) we could create our own XML-based languages for describing our grammars, or (2) we could use W3C XML Schema language. Let's look at the latter option first.

I find the idea of using schema language for describing non-XML grammars very enticing. The user would only have to develop a single description to cover both the XML and non-XML formats of a business document. Tools such as XMLSPY and TurboXML could be used to develop these schemas. In the utility that converts XML to the non-XML format we could use default Attributes to provide the information necessary for conversion. For example, we can get the information about file organization and record sequence, that is, the file's grammar, from sequence definitions in complex types. In addition, a schema could define DataFormat, Offset, and Length Attributes with default values on a Field Element representing a field in a flat file record. The utility could then access this information from the DOM tree and create the output accordingly. There might be additional information we could not "shoehorn" into the standard schema language features, but there is a mechanism for specifying this information. Most schema language Elements allow users to add their own Attributes provided they are from a namespace other than the schema language namespace.
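The default-Attribute idea can be sketched in Java as follows. This is only an illustration under stated assumptions: the schema, the default values, and the helper names `parseWithDefaults` and `toFixedWidth` are hypothetical, and the sketch relies on the `javax.xml.validation` API that current JAXP versions provide, which may not match the parser setup used elsewhere in this book. When a schema is attached to the `DocumentBuilderFactory`, the validating parser can add missing default Attributes to the DOM tree, where the utility can read them.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class SchemaDefaultsSketch {
    // Hypothetical schema: a Field Element whose DataFormat, Offset, and Length
    // Attributes carry default values describing a flat file field.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + " <xs:element name='Field'>"
      + "  <xs:complexType>"
      + "   <xs:simpleContent>"
      + "    <xs:extension base='xs:string'>"
      + "     <xs:attribute name='DataFormat' type='xs:string' default='AN'/>"
      + "     <xs:attribute name='Offset' type='xs:int' default='1'/>"
      + "     <xs:attribute name='Length' type='xs:int' default='10'/>"
      + "    </xs:extension>"
      + "   </xs:simpleContent>"
      + "  </xs:complexType>"
      + " </xs:element>"
      + "</xs:schema>";

    // Parses an instance document with the schema attached, so that default
    // Attributes missing from the instance are supplied in the DOM tree.
    static Element parseWithDefaults(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse instance document", e);
        }
    }

    // Pads (or truncates) the Element's text to the width given by Length.
    static String toFixedWidth(Element field) {
        int length = Integer.parseInt(field.getAttribute("Length"));
        String text = field.getTextContent();
        if (text.length() > length) {
            return text.substring(0, length);
        }
        return String.format("%-" + length + "s", text);
    }

    public static void main(String[] args) {
        Element field = parseWithDefaults("<Field>ACME</Field>");
        System.out.println("[" + toFixedWidth(field) + "]");
    }
}
```

An instance document that supplies its own Length, such as `<Field Length='6'>ACME</Field>`, overrides the default, so the schema author sets the usual layout once and exceptions stay local to the instance.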

However, there are some distinct disadvantages to this schema-based approach. To understand the grammar of non-XML input files we would have to interpret the schema. Schema documents can be very complex, and we would really like to avoid writing all the code required to interpret them. Tools can help us understand and use schemas, but such tools aren't without their drawbacks. MSXML offers something called the Schema Object Model (SOM). It provides a way to parse and give programs access to schemas in much the same way that the DOM enables processing of instance documents. In the Java world, the open source Castor framework developed by exolab.org also offers an object model with very similar functionality [exolab 2002]. However, the SOM is not a standard. Unlike the DOM APIs offered by JAXP and MSXML, Castor's and MSXML's SOM APIs are not logically equivalent implementations of the same model. There are some significant differences in the interfaces, methods, and properties offered by these two implementations. We would not be able to develop one approach that would work for both of our implementation environments as we did with the DOM.

Also, for practical reasons we would have to set some constraints on how the schemas are written. The main reason for this is that schema language, as we noted in Chapter 4, is incredibly flexible. There are a nearly infinite number of schemas that could be written to describe the same instance document. Handling all the different approaches to schema design would make our schema processing code very complex. It would be simpler to code and maintain if we restricted schemas to a certain style. For example, we might want to have all types declared anonymously, in-line, rather than being declared as named types, perhaps in an imported schema. We would then need to make sure that these constraints were observed. One approach for doing this would be to write a schema for schemas that enforced these constraints, then validate our schemas against this schema at runtime. It would cover the same ground as the W3C's DTD or schema for schemas but would be a much more restrictive model. However, this approach might be just as complex as writing the required validation code in the utility. In addition, writing schemas to support description of non-XML legacy grammars may make it harder in some ways to write schemas for validation, that is, enforcement of business constraints.
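To give a feel for what "validation code in the utility" might involve, here is a minimal Java sketch of a single style check: it reports any named type declarations in a schema, on the assumption that our style rule requires all types to be anonymous and in-line. The class and method names are hypothetical, and a real utility would need many more checks than this one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SchemaStyleCheckSketch {
    static final String XS = "http://www.w3.org/2001/XMLSchema";

    // Returns the names of all named complexType and simpleType declarations.
    // An empty list means the schema follows the "anonymous types only" rule.
    static List<String> findNamedTypes(String xsd) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xsd)));
            List<String> names = new ArrayList<>();
            for (String kind : new String[] {"complexType", "simpleType"}) {
                NodeList types = doc.getElementsByTagNameNS(XS, kind);
                for (int i = 0; i < types.getLength(); i++) {
                    Element type = (Element) types.item(i);
                    if (type.hasAttribute("name")) {
                        names.add(type.getAttribute("name"));
                    }
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema", e);
        }
    }

    public static void main(String[] args) {
        String schema =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:complexType name='RecordType'><xs:sequence/></xs:complexType>"
          + "<xs:element name='Record' type='RecordType'/>"
          + "</xs:schema>";
        System.out.println(findNamedTypes(schema)); // prints [RecordType]
    }
}
```

Even this one rule takes a fair amount of code, and it says nothing yet about imported schemas, groups, or the other design variations we would have to rule out.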

Finally, the schema-based approach would require all end users to become reasonably proficient in writing schema language. I feel comfortable encouraging people to become proficient in reading schemas, but I don't necessarily feel so comfortable when it comes to writing schemas. Also, end users would have to write schemas not only in schema language but also in a specific style, with customized extensions required by our conversion utilities.

So, as enticing as I find the approach of using schema language, I must regretfully give up on it, at least for this phase of the project. That leaves us with using XML instance documents.

Our choice is to develop XML-based languages to describe the grammars of each of our non-XML formats. This approach has many advantages. We can make the languages fairly simple. We can write schemas to validate that the instance documents conform to the languages and adequately describe the grammars. We can use our DOM techniques to read the instance documents that describe the grammars. These instance documents can be created using tools as simple as Microsoft's XML Notepad or as sophisticated as XMLSPY. They could even be created with a text editor, although I don't recommend it. The main disadvantage of this approach is that if end users want to do full validation against business constraints, they also have to develop a schema for the XML representation of the business document. That means they would have to develop two metadata documents instead of one. However, they are free to develop that schema in whatever fashion they desire, and they don't need to develop it if they don't care about validating business constraints.

So, we'll develop XML-based languages to describe our non-XML grammars. We'll call the documents that specify these languages our "file description documents" since we'll be specifying a few details beyond just the grammars. The approach will become clearer when we look at the first case, CSV files, in the next chapter. Another advantage of this approach will become clear when we look at the grammar of flat files and EDI messages. If we construct our grammars properly, we can use the DOM as our internal data structure for storing the grammar rather than having to create a different mechanism. This extremely powerful feature will become more evident when we look at the pseudocode. It saves us from having to develop our own data structure for describing the grammar.
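As a rough preview of the idea of using the DOM itself as the grammar's data structure, the following Java sketch loads a small description document and walks it directly to split a flat file record. The Element and Attribute names (`FileDescription`, `Record`, `Field`, `Name`, `Offset`, `Length`) and the sample record are hypothetical placeholders, not the file description languages developed in the following chapters.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FileDescriptionSketch {
    // Hypothetical description of one fixed-width record type.
    static final String DESCRIPTION =
        "<FileDescription>"
      + " <Record Name='Invoice'>"
      + "  <Field Name='CustomerID' Offset='1' Length='5'/>"
      + "  <Field Name='Amount' Offset='6' Length='7'/>"
      + " </Record>"
      + "</FileDescription>";

    // Loads the description into a DOM tree; the tree is the grammar store.
    static Document load(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse description", e);
        }
    }

    // Splits one record line into named fields by walking the DOM tree.
    // Offsets are 1-based in this sketch.
    static Map<String, String> splitRecord(Document grammar, String recordName,
                                           String line) {
        Map<String, String> values = new LinkedHashMap<>();
        NodeList records = grammar.getElementsByTagName("Record");
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            if (!recordName.equals(record.getAttribute("Name"))) {
                continue;
            }
            NodeList fields = record.getElementsByTagName("Field");
            for (int j = 0; j < fields.getLength(); j++) {
                Element field = (Element) fields.item(j);
                int start = Integer.parseInt(field.getAttribute("Offset")) - 1;
                int end = start + Integer.parseInt(field.getAttribute("Length"));
                values.put(field.getAttribute("Name"), line.substring(start, end));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Document grammar = load(DESCRIPTION);
        // prints {CustomerID=00123, Amount=0004575}
        System.out.println(splitRecord(grammar, "Invoice", "001230004575"));
    }
}
```

No separate grammar classes or lookup tables are needed here: the parsed description document is queried each time the utility needs to know what comes next in the file.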