We already know how to transform documents from one format into the other. Recall from Chapter 5 that the three steps are as follows:
Again, the XSLT processor takes care of the conversion step, provided we feed it an XML document. Although the processor cannot read EDIFACT, it can write the XML document. As the previous chapter clearly demonstrated, much can be gained from moving as much of the transformation as possible into XSLT. Some of the advantages include the following:
The only issue is that an XSLT processor chokes on EDIFACT. So, we need to roll up our sleeves and write our own parser for EDIFACT. The parser will turn the EDIFACT document into the XML-ization of EDIFACT we used in Chapter 5. In other words, this chapter is the mirror of Chapter 5!

Warning: As in Chapter 5, we will limit ourselves to a reasonable subset of the EDIFACT syntax (technically a subset of the EDIFACT syntax version 3). You might need to extend the parser to recognize the more advanced (but seldom used) options of EDIFACT. The goal remains to illustrate a useful technique (importing non-XML documents into XML), not to compete with commercial products.

Architecture of the Parser

The typical parser is composed of two modules: a tokenizer (also called a lexer) and the parser itself. The tokenizer breaks the input file into its constituents. In particular, it separates special characters (+, :, ', and ?) from regular text. The parser receives the pre-digested input, as tokens, from the tokenizer and assembles them into higher-level constructs.

Figure 6.1 illustrates how it works. The tokenizer breaks the segment into its constituents: text and special characters. Each of these becomes a different token. The parser then reads through the tokens and builds the higher-level constructs, such as data elements and segments.

Figure 6.1. The tokenizer interacts with the parser through tokens.
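The division of labor described above can be sketched in a few lines. This is a minimal illustration in Python, not the parser developed in this chapter: the token names (TEXT, PLUS, COLON, QUOTE), the function names, and the nested-list result are all assumptions made for the example. It also assumes that line breaks between segments can be ignored and that ? escapes the character that follows it.

```python
from typing import Iterable, Iterator, List, Tuple

# Token kinds; hypothetical names chosen for this sketch.
TEXT, PLUS, COLON, QUOTE = "TEXT", "PLUS", "COLON", "QUOTE"
SPECIALS = {"+": PLUS, ":": COLON, "'": QUOTE}
RELEASE = "?"  # escapes the character that follows it

Token = Tuple[str, str]


def tokenize(data: str) -> Iterator[Token]:
    """Break the input into text tokens and special-character tokens."""
    buffer: List[str] = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == RELEASE and i + 1 < len(data):
            # The escaped character is treated as regular text.
            buffer.append(data[i + 1])
            i += 2
            continue
        if ch in SPECIALS:
            if buffer:
                yield (TEXT, "".join(buffer))
                buffer.clear()
            yield (SPECIALS[ch], ch)
        elif ch not in "\r\n":  # assumption: line breaks are not significant
            buffer.append(ch)
        i += 1
    if buffer:
        yield (TEXT, "".join(buffer))


def parse(tokens: Iterable[Token]) -> List[List[List[str]]]:
    """Assemble tokens into segments, data elements, and components."""
    segments: List[List[List[str]]] = []
    elements: List[List[str]] = [[""]]
    for kind, value in tokens:
        if kind == TEXT:
            elements[-1][-1] += value   # fill the current component
        elif kind == COLON:
            elements[-1].append("")     # next component, same data element
        elif kind == PLUS:
            elements.append([""])       # next data element
        elif kind == QUOTE:
            segments.append(elements)   # segment is complete
            elements = [[""]]
    return segments
```

For example, `parse(tokenize("NAD+BY+12?+34::9'"))` returns `[[["NAD"], ["BY"], ["12+34", "", "9"]]]`: one segment with three data elements, the last of which has three components. Note that the parser never looks at individual characters; it only reacts to token kinds.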
The separation between tokenizer and parser results in more manageable code. For example, one option in the EDIFACT syntax replaces the +, :, ', and ? characters with other characters. The parser introduced here does not support this option (if only because it is seldom used), but you could easily add it by changing the tokenizer. Note that such changes are limited to the tokenizer; they do not impact the parser itself.

Classes in the Parser

Figure 6.2 illustrates a UML class diagram of the parser. The various classes are as follows:
Figure 6.2. The architecture of the application.
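To make the earlier point about replaceable special characters concrete, here is a hedged sketch of a tokenizer that takes its special characters as a parameter. It is written in Python and is not part of the parser built in this chapter; the `Separators` class, the `read_una` helper, and the token names are hypothetical. The sketch assumes the alternative characters are announced by a nine-character UNA header at the start of the interchange (UNA, then component separator, element separator, decimal mark, release character, a reserved position, and segment terminator); check this layout against the EDIFACT syntax rules before relying on it.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Separators:
    """The four special characters; defaults are the ones used in this chapter."""
    element: str = "+"
    component: str = ":"
    terminator: str = "'"
    release: str = "?"


def read_una(data: str) -> Tuple[Separators, str]:
    """Return the separators in effect and the input that follows the header.

    Assumption: an optional 9-character UNA header declares the characters.
    """
    if data.startswith("UNA") and len(data) >= 9:
        seps = Separators(component=data[3], element=data[4],
                          release=data[6], terminator=data[8])
        return seps, data[9:]
    return Separators(), data


def tokenize(data: str, seps: Separators = Separators()) -> Iterator[Tuple[str, str]]:
    """Emit the same token kinds whatever characters the input uses."""
    kinds = {seps.element: "ELEMENT_SEP",
             seps.component: "COMPONENT_SEP",
             seps.terminator: "TERMINATOR"}
    text: List[str] = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == seps.release and i + 1 < len(data):
            text.append(data[i + 1])  # escaped character is plain text
            i += 2
            continue
        if ch in kinds:
            if text:
                yield ("TEXT", "".join(text))
                text.clear()
            yield (kinds[ch], ch)
        else:
            text.append(ch)
        i += 1
    if text:
        yield ("TEXT", "".join(text))
```

Because the token kinds do not depend on the characters that produced them, a parser that consumes these tokens sees the same sequence for `NAD+BY:1'` with the default characters as for `NAD*BY#1~` with the header `UNA#*.! ~`. That is the sense in which the change stays confined to the tokenizer.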
Note: For some languages, a so-called compiler-compiler can simplify the coding. We will review this option in the section "Additional Resources" at the end of this chapter.