Parsing EDIFACT

   

We already know how to transform documents from one format into the other. Recall from Chapter 5 that the three steps are as follows:

  1. Read the input document.

  2. Convert between the two structures, which can involve grouping or splitting EDIFACT segments into one or several XML elements. It also requires transforming XML codes (such as ISBN) into their EDIFACT equivalents (IB).

  3. Write the resulting document.
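As a rough sketch of step 2, the conversion can be delegated to an XSLT processor through the standard javax.xml.transform (JAXP) API. The style sheet below is made up for illustration: the Id element and its qualifier attribute are hypothetical and are not the XML-ization of EDIFACT used in Chapter 5. It maps the code in the EDIFACT-to-XML direction (IB to ISBN).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ConversionPipelineSketch {
    // Hypothetical one-template style sheet: turns an Id element qualified
    // with the EDIFACT code IB into an ISBN element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='Id[@qualifier=\"IB\"]'>"
      + "<ISBN><xsl:value-of select='.'/></ISBN>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Step 1 (read) and step 3 (write) are reduced to strings here;
    // step 2 (convert) is handed to the XSLT processor.
    static String convert(String inputXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(inputXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so that callers of this sketch need no checked exception.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<Id qualifier=\"IB\">0672320541</Id>";
        System.out.println(convert(input, STYLESHEET));
    }
}
```

Note that the input to convert() must already be XML, which is why the rest of this chapter is concerned with producing that XML from EDIFACT.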

Again, the XSLT processor takes care of the conversion step, provided we feed it an XML document. The processor cannot read EDIFACT, but once it has its XML input, it can write the resulting XML document.

As the previous chapter demonstrated, much can be gained from moving as much of the transformation as possible into XSLT. Some of the advantages include the following:

  • Writing and maintaining style sheets is faster and easier than writing the equivalent Java code.

  • Style sheets are declarative, so it is easy to discuss them with non-developers.

  • The XSLT processor is optimized for transformation, and as it improves, so does our application.

The only issue is that an XSLT processor chokes on EDIFACT. So, we need to roll up our sleeves and write our own parser for EDIFACT. The parser will turn the EDIFACT document into the XML-ization of EDIFACT we used in Chapter 5.

In other words, this chapter is the mirror of Chapter 5!

Warning

As in Chapter 5, we will limit ourselves to a reasonable subset of the EDIFACT syntax (technically a subset of the EDIFACT syntax version 3). You might need to extend the parser to recognize the most advanced (but seldom used) options of EDIFACT. The goal remains to illustrate a useful technique (importing non-XML documents into XML), not to compete with commercial products.


Architecture of the Parser

The typical parser is composed of two modules: a tokenizer (also called a lexer) and the parser itself. The tokenizer breaks the input file into its constituents. In particular, it separates special characters (+, :, ', and ?) from regular text.
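To make the idea concrete, here is a minimal tokenizer sketch. It is not the EdifactTokenizer class of this chapter: the class, enum, and method names are invented for illustration. It also assumes the default special characters and resolves the ? release character inside the tokenizer, so that the character following ? is kept as plain text.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizerSketch {
    // Hypothetical token types: regular text plus the three separators.
    enum Type { TEXT, PLUS, COLON, QUOTE }

    static final class Token {
        final Type type;
        final String text;

        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '?' && i + 1 < input.length()) {
                // Release character: the next character is literal text.
                text.append(input.charAt(++i));
            } else if (c == '+' || c == ':' || c == '\'') {
                // A separator ends the pending text and becomes its own token.
                flush(text, tokens);
                Type type = (c == '+') ? Type.PLUS
                          : (c == ':') ? Type.COLON
                          : Type.QUOTE;
                tokens.add(new Token(type, String.valueOf(c)));
            } else {
                text.append(c);
            }
        }
        flush(text, tokens);
        return tokens;
    }

    // Emits the accumulated text, if any, as a TEXT token.
    private static void flush(StringBuilder text, List<Token> tokens) {
        if (text.length() > 0) {
            tokens.add(new Token(Type.TEXT, text.toString()));
            text.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Illustrative segment, not taken from the chapter.
        System.out.println(tokenize("NAD+BY+5412345000013::9'"));
    }
}
```

For the illustrative segment in main(), the sketch produces nine tokens: four text tokens, two + tokens, two : tokens, and the closing ' token. It makes no attempt to skip line breaks between segments, which a fuller tokenizer would probably need to do.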

The parser receives the pre-digested input, as tokens, from the tokenizer and assembles them into higher-level constructs.
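The following sketch shows one way the assembling could work. It assumes a simplified token model (text, +, :, and ') and a simplified grammar in which a segment is a tag followed by data elements, each made of one or more components. All names are hypothetical, and IllegalStateException merely stands in for a dedicated error class such as UnexpectedTokenException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParserSketch {
    enum Kind { TEXT, PLUS, COLON, QUOTE }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // A segment: its tag plus data elements, each a list of components.
    static final class Segment {
        final String tag;
        final List<List<String>> elements = new ArrayList<>();

        Segment(String tag) {
            this.tag = tag;
        }
    }

    static List<Segment> parse(List<Token> tokens) {
        List<Segment> segments = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token tag = tokens.get(i++);
            if (tag.kind != Kind.TEXT) {
                throw new IllegalStateException("Expected a segment tag, found " + tag.kind);
            }
            Segment segment = new Segment(tag.text);
            // Each + opens a new data element.
            while (i < tokens.size() && tokens.get(i).kind == Kind.PLUS) {
                i++;
                List<String> components = new ArrayList<>();
                StringBuilder current = new StringBuilder();
                while (i < tokens.size()
                        && (tokens.get(i).kind == Kind.TEXT || tokens.get(i).kind == Kind.COLON)) {
                    Token t = tokens.get(i++);
                    if (t.kind == Kind.TEXT) {
                        current.append(t.text);
                    } else {
                        // A : closes the current component (possibly empty).
                        components.add(current.toString());
                        current.setLength(0);
                    }
                }
                components.add(current.toString());
                segment.elements.add(components);
            }
            // Every segment must end with the ' terminator.
            if (i >= tokens.size() || tokens.get(i).kind != Kind.QUOTE) {
                throw new IllegalStateException("Expected the segment terminator");
            }
            i++;
            segments.add(segment);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Tokens for the illustrative segment NAD+BY+5412345000013::9'
        List<Token> tokens = Arrays.asList(
            new Token(Kind.TEXT, "NAD"), new Token(Kind.PLUS, "+"),
            new Token(Kind.TEXT, "BY"), new Token(Kind.PLUS, "+"),
            new Token(Kind.TEXT, "5412345000013"), new Token(Kind.COLON, ":"),
            new Token(Kind.COLON, ":"), new Token(Kind.TEXT, "9"),
            new Token(Kind.QUOTE, "'"));
        for (Segment s : parse(tokens)) {
            System.out.println(s.tag + " " + s.elements);
        }
    }
}
```

Because the parser only ever sees tokens, it does not need to know which characters act as separators in the input.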

Figure 6.1 illustrates how the two modules cooperate. The tokenizer breaks the segment into its constituents: text and special characters. Each of these becomes a separate token. The parser then reads through the tokens and assembles the higher-level constructs, such as data elements and segments.

Figure 6.1. The tokenizer interacts with the parser through tokens.


The separation between tokenizer and parser results in more manageable code. For example, one option in the EDIFACT syntax replaces the +, :, ', and ? characters with other characters. The parser introduced here does not support this option (if only because it is seldom used), but you could easily add it by changing the tokenizer. Note that such changes are limited to the tokenizer; they do not impact the parser itself.
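As a hedged illustration of how such an extension might start: in EDIFACT, alternative characters are announced by the UNA service string advice, a fixed nine-character header made of the letters UNA followed by the component separator, the data element separator, the decimal mark, the release character, a reserved character, and the segment terminator. The class below is hypothetical and only reads that header. A tokenizer would then compare input characters against these fields instead of hard-coded literals.

```java
class SeparatorsSketch {
    final char component;   // default ':'
    final char element;     // default '+'
    final char release;     // default '?'
    final char terminator;  // default '\''

    SeparatorsSketch(char component, char element, char release, char terminator) {
        this.component = component;
        this.element = element;
        this.release = release;
        this.terminator = terminator;
    }

    static SeparatorsSketch defaults() {
        return new SeparatorsSketch(':', '+', '?', '\'');
    }

    // Reads the separators from a leading UNA header, if there is one.
    // Positions 5 (decimal mark) and 7 (reserved) are ignored in this sketch.
    static SeparatorsSketch fromMessage(String message) {
        if (message.startsWith("UNA") && message.length() >= 9) {
            return new SeparatorsSketch(
                message.charAt(3),   // component separator
                message.charAt(4),   // data element separator
                message.charAt(6),   // release character
                message.charAt(8));  // segment terminator
        }
        return defaults();
    }

    public static void main(String[] args) {
        SeparatorsSketch s = fromMessage("UNA|^.\\ ~UNB^UNOA|1~");
        System.out.println(s.component + " " + s.element + " "
            + s.release + " " + s.terminator);
    }
}
```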

Classes in the Parser

Figure 6.2 illustrates a UML class diagram of the parser. The various classes are as follows:

  • EdifactTokenizer: Breaks the input stream into tokens

  • EdifactParser: Interacts with the tokenizer to decode the stream

  • EdifactStructure: Is a helper class for the parser

  • UnexpectedTokenException: Signals a parsing error

  • Extensions: Implements extensions to XSL

  • Edifact2XML: Is the application's main program

Figure 6.2. The architecture of the application.


Note

For some languages, a so-called compiler-compiler can simplify the coding. We will review this option in the section "Additional Resources" at the end of this chapter.


   


Applied XML Solutions
ISBN: 0672320541
Year: 1999
Pages: 142
