The EDIFACT Parser


This parser recognizes a useful subset of the EDIFACT syntax. It translates the input into the intermediate XML format introduced in Chapter 5. You will recall that the segment

 PRI+AAA:24.99::SRP' 

becomes the following, in our XML-ized EDIFACT format:

 <Segment tag="PRI">      <Composite>         <Simple>AAA</Simple>         <Simple>24.99</Simple>         <Simple/>         <Simple>SRP</Simple>      </Composite>   </Segment> 

In this chapter, we will use the same purchase order introduced in Chapter 5. It is reproduced in Listing 6.1.

Listing 6.1 orders.edi
 UNH+1+ORDERS:D:96A:UN'
 BGM+220+AGL153+9+AB'
 DTM+137:20000310:102'
 DTM+61:20000410:102'
 NAD+BY+++PLAYFIELD BOOKS+34 FOUNTAIN SQUARE PLAZA+CINCINNATI+OH+45202+US'
 NAD+SE+++QUE+201 WEST 103RD STREET+INDIANAPOLIS+IN+46290+US'
 LIN+1'
 PIA+5+0789722429:IB'
 QTY+21:5'
 PRI+AAA:24.99::SRP'
 LIN+2'
 PIA+5+0789724308:IB'
 QTY+21:10'
 PRI+AAA:42.50::SRP'
 UNS+S'
 CNT+3:2'
 UNT+17+1'

Writing the Tokenizer

Let's start with the tokenizer. Fortunately, the EDIFACT syntax has only four special characters:

  • +, :, and ' (separators between fields in a segment, between fields in a composite data element, and between segments, respectively)

  • ? (the escape character)

The rest of the message is made up of the fields themselves. We could try to differentiate tags (three letters only), codes, and regular text, but it is easier to tokenize everything as text and sort it out in the parser itself.
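As a preview of what the tokenizer produces, the PRI segment from the beginning of this chapter breaks down into the following token stream (the constants are defined in Listing 6.2; the values in parentheses are what getCurrentToken() would return):

 TK_DATA("PRI")  TK_PLUS  TK_DATA("AAA")  TK_COLON  TK_DATA("24.99")
 TK_COLON  TK_COLON  TK_DATA("SRP")  TK_APOSTROPHE

Note that the empty field between the two colons produces no token of its own, only two consecutive TK_COLON tokens; it is the parser that later turns this into an empty simple data element.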

Listing 6.2 is EdifactTokenizer.java. It recognizes the separators in the input stream, resolves escape characters, and returns text fields as strings.

Listing 6.2 EdifactTokenizer.java
 package com.psol.xsledi;

 import java.io.*;

 public class EdifactTokenizer
 {
    public static final int TK_EOF = 0,
                            TK_APOSTROPHE = 1,
                            TK_PLUS = 2,
                            TK_COLON = 3,
                            TK_DATA = 4;

    // allocate a 10K buffer
    // for larger messages recompile with a larger buffer
    protected static final int bufferSize = 1024 * 10;
    protected char[] buffer = new char[bufferSize],
                     token = new char[bufferSize / 100];
    protected int bPos,
                  bLen,
                  tPos;

    protected void putc()
    {
       bPos--;
    }

    protected int getc()
    {
       if(bPos < bLen)
          return buffer[bPos++];
       else
          return -1;
    }

    public int nextToken()
    {
       tPos = 0;
       int c = getc();
       switch(c)
       {
          case -1:
             return TK_EOF;
          case '+':
             return TK_PLUS;
          case '\'':
             return TK_APOSTROPHE;
          case ':':
             return TK_COLON;
          case '?':
             c = getc();
             if(c == -1)
                return TK_EOF;
          default:
             token[tPos++] = (char)c;
       }
       for(;;)
       {
          c = getc();
          switch(c)
          {
             case -1:
                return TK_EOF;
             case '+':
             case '\'':
             case ':':
                putc();
                return TK_DATA;
             case '?':
                c = getc();
                if(c == -1)
                   // could return TK_DATA but a single question
                   // mark is a syntax error
                   return TK_EOF;
             default:
                token[tPos++] = (char)c;
                break;
          }
       }
    }

    public String getCurrentToken()
    {
       return new String(token,0,tPos);
    }

    public static String toString(int token)
    {
       switch(token)
       {
          case TK_EOF:
             return "end of file";
          case TK_APOSTROPHE:
             return "'";
          case TK_PLUS:
             return "+";
          case TK_COLON:
             return ":";
          case TK_DATA:
             return "data";
          default:
             throw new IllegalArgumentException();
       }
    }

    public void tokenize(InputStream in)
       throws IOException
    {
       Reader reader = new InputStreamReader(in,"ISO-8859-1");
       bLen = reader.read(buffer);
       if(bLen == buffer.length)
          throw new IOException("buffer is too small");
       if(bLen == -1)
          throw new EOFException();
       bPos = 0;
    }
 }

Let's take a closer look at Listing 6.2. The tokenizer declares constants for the various tokens. TK_EOF stands for the end of file, whereas TK_DATA signifies a textual field. The other constants are for separators. You will notice that no constant exists for the escape character because the tokenizer resolves it transparently:

 public static final int TK_EOF = 0,
                         TK_APOSTROPHE = 1,
                         TK_PLUS = 2,
                         TK_COLON = 3,
                         TK_DATA = 4;

The tokenizer also allocates various buffers and defines two methods, getc() and putc(), to read a character from the buffer or push the last character back into it. We will see how useful they are when reading data fields:

 // allocate a 10K buffer
 // for larger messages recompile with a larger buffer
 protected static final int bufferSize = 1024 * 10;
 protected char[] buffer = new char[bufferSize],
                  token = new char[bufferSize / 100];
 protected int bPos,
               bLen,
               tPos;

 protected void putc()
 {
    bPos--;
 }

 protected int getc()
 {
    if(bPos < bLen)
       return buffer[bPos++];
    else
       return -1;
 }

Warning

Note that the tokenizer assumes messages are smaller than 10KB. For larger messages, you need to either allocate a larger buffer or rewrite putc() and getc() to support more efficient buffering.
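A minimal sketch of such a rewrite, offered here as an assumption rather than as part of the book's code: it replaces the fixed array with a java.io.PushbackReader. The price is that getc() and putc() (which now takes the character to push back) must declare IOException, a change that ripples into nextToken():

 // inside EdifactTokenizer (java.io.* is already imported);
 // a sketch, not the book's implementation
 protected PushbackReader reader;     // replaces buffer/bPos/bLen

 public void tokenize(InputStream in)
    throws IOException
 {
    // one character of pushback is all putc() ever needs
    reader = new PushbackReader(
       new BufferedReader(new InputStreamReader(in,"ISO-8859-1")));
 }

 protected int getc()
    throws IOException
 {
    return reader.read();             // -1 at end of stream
 }

 protected void putc(int c)
    throws IOException
 {
    if(c != -1)
       reader.unread(c);              // push the separator back
 }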


The heart of the tokenizer is the nextToken() method. It reads from the buffer and recognizes data and separators. The parser calls nextToken() repeatedly until it reaches the end of file.

nextToken() starts by testing whether the current character is a separator. If it is, it returns the appropriate token constant.

However, if the current character is not a separator, it must be part of a field, so nextToken() loops until it reaches a separator. During the loop, it fills the token array. Then, when it hits a separator, nextToken() pushes the separator back into the input (through putc()) so that it is available for the next call to nextToken():

 public int nextToken()
 {
    tPos = 0;
    int c = getc();
    switch(c)
    {
       case -1:
          return TK_EOF;
       case '+':
          return TK_PLUS;
       case '\'':
          return TK_APOSTROPHE;
       case ':':
          return TK_COLON;
       case '?':
          c = getc();
          if(c == -1)
             return TK_EOF;
       default:
          token[tPos++] = (char)c;
    }
    for(;;)
    {
       c = getc();
       switch(c)
       {
          case -1:
             return TK_EOF;
          case '+':
          case '\'':
          case ':':
             putc();
             return TK_DATA;
          case '?':
             c = getc();
             if(c == -1)
                // could return TK_DATA but a single question
                // mark is a syntax error
                return TK_EOF;
          default:
             token[tPos++] = (char)c;
             break;
       }
    }
 }

Notice that nextToken() takes special care to resolve escape characters: When it hits a question mark, it immediately reads the next character and discards the question mark.
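To watch the escape resolution at work, a small test driver can print the token stream for a segment containing an escaped plus sign. The driver is hypothetical (it is not part of the book's code), and the FTX free-text segment is used only as an illustration:

 package com.psol.xsledi;

 import java.io.*;

 // Hypothetical test driver: prints the token stream for a
 // segment that contains an escaped plus sign (?+)
 public class TokenizerTest
 {
    public static void main(String[] args)
       throws IOException
    {
       String segment = "FTX+AAI+++Ship as soon as possible?+thanks'";
       EdifactTokenizer tokenizer = new EdifactTokenizer();
       tokenizer.tokenize(
          new ByteArrayInputStream(segment.getBytes("ISO-8859-1")));
       int token;
       while((token = tokenizer.nextToken()) !=
             EdifactTokenizer.TK_EOF)
       {
          System.out.print(EdifactTokenizer.toString(token));
          if(token == EdifactTokenizer.TK_DATA)
             System.out.print(": " + tokenizer.getCurrentToken());
          System.out.println();
       }
    }
 }

The ?+ sequence comes out as a plain + inside the data token ("Ship as soon as possible+thanks"), exactly as the EDIFACT escape rule requires.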

Writing the Parser

The parser is not complicated either. It assembles the various tokens into higher-level elements: simple data elements, composite data elements, and segments.

  • The various elements in the EDIFACT syntax were introduced in Chapter 5, in the section "Meet EDIFACT."

The only potential pitfall stems from composite and simple data elements. Consider the following segment:

 BGM+220+AGL153+9+AB' 

Is the first field (220) a simple data element or a composite data element? You can't tell from the segment, can you? You must turn to the EDIFACT definition of the segment. (It is a composite data element.)

However, the parser must differentiate simple and composite data elements. The correct XML-ized document for the BGM segment is as follows:

 <Segment tag="BGM">      <Composite>         <Simple>220</Simple>      </Composite>      <Simple>AGL153</Simple>      <Simple>AB</Simple>   </Segment> 

The following is incorrect:

 <Segment tag="BGM">      <Simple>220</Simple>      <Simple>AGL153</Simple>      <Simple>AB</Simple>   </Segment> 

In other words, we need to provide the parser with segment definitions. One solution is Listing 6.3, which is essentially an XML document listing the segments used in the order. For each segment, it describes the content with a simple code: S stands for a simple field, C for a composite field.

Listing 6.3 edifactstructure.xml
 <?xml version="1.0"?> <Structure>    <Segment tag="BGM" content="CSSSS"/>    <Segment tag="CNT" content="C"/>    <Segment tag="DTM" content="C"/>    <Segment tag="LIN" content="SSCCSS"/>    <Segment tag="NAD" content="SCCCCSSSS"/>    <Segment tag="PIA" content="SCCCCC"/>    <Segment tag="PRI" content="CS"/>    <Segment tag="QTY" content="C"/>    <Segment tag="RFF" content="C"/>    <Segment tag="UNH" content="SCSC"/>    <Segment tag="UNS" content="S"/>    <Segment tag="UNT" content="SS"/> </Structure> 

Note

Listing 6.3 does not break composite data elements down into their simple data elements. The parser does not know whether a given composite data element should contain three or five simple data elements.

This is not a problem, though, because we are not too concerned about the EDIFACT message itself. In fact, we don't even care whether a composite has the right number of simple data elements, as long as we can transform it into valid XML. Validating the order in XML, using a validating parser, is easier than trying to validate the original EDIFACT document.

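For reference, a minimal DTD for the XML-ized format might look like the following. This is a sketch derived from the examples in this chapter, not a DTD taken from the book:

 <!ELEMENT Message   (Segment+)>
 <!ELEMENT Segment   (Composite | Simple)*>
 <!ATTLIST Segment   tag CDATA #REQUIRED>
 <!ELEMENT Composite (Simple+)>
 <!ELEMENT Simple    (#PCDATA)>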

Listing 6.4 is EdifactStructure, a helper class that reads Listing 6.3. The EDIFACT parser calls setSegment() to select a segment (for example, BGM) and repeatedly calls nextContentType() to retrieve the type of each field in the segment. For BGM, successive calls to nextContentType() return C, S, S, S, and S, as specified in Listing 6.3.

Listing 6.4 EdifactStructure.java
 package com.psol.xsledi;

 import java.util.*;
 import org.xml.sax.*;

 class EdifactStructure
    extends HandlerBase
 {
    protected Dictionary dictionary = new Hashtable();
    protected String content = null;
    protected int cPos;

    public void startElement(String name,AttributeList atts)
       throws SAXException
    {
       if(name.equals("Segment"))
       {
          String tag = atts.getValue("tag"),
                 content = atts.getValue("content");
          if(null != tag && null != content)
             dictionary.put(tag,content);
          else
             throw new SAXException("Missing attribute in Segment");
       }
    }

    public void setSegment(String segment)
    {
       content = (String)dictionary.get(segment);
       cPos = 0;
       if(null == content)
          throw new NullPointerException("unknown: " + segment);
    }

    public char nextContentType()
    {
       if(cPos < content.length())
          return content.charAt(cPos++);
       else
          return content.charAt(content.length() - 1);
    }
 }
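One subtlety is worth noting: when the parser requests more content types than the definition declares, nextContentType() keeps returning the last declared type, so any extra fields inherit it. A minimal illustration (assuming structure has already been populated from Listing 6.3):

 structure.setSegment("PRI");                // content is "CS"
 char first  = structure.nextContentType();  // 'C'
 char second = structure.nextContentType();  // 'S'
 char third  = structure.nextContentType();  // 'S' again: the last
                                             // type repeats for any
                                             // extra fields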

The parser itself appears in Listing 6.5. It builds on the EdifactTokenizer and EdifactStructure classes.

Listing 6.5 EdifactParser.java
 package com.psol.xsledi;

 import java.io.*;
 import java.util.*;

 public class EdifactParser
 {
    protected EdifactStructure structure;
    protected Writer writer = null;
    protected EdifactTokenizer tokenizer =
       new EdifactTokenizer();

    public EdifactParser(EdifactStructure structure)
    {
       this.structure = structure;
    }

    public void setWriter(Writer writer)
    {
       this.writer = writer;
    }

    protected void match(int token)
       throws UnexpectedTokenException
    {
       int t = tokenizer.nextToken();
       if(t != token)
          throw new UnexpectedTokenException(token,t);
    }

    public void parse(String filename)
       throws UnexpectedTokenException, IOException
    {
       tokenizer.tokenize(new FileInputStream(filename));
       writer.write("<?xml version='1.0'?>");
       writer.write("<Message>");
       while(nextSegment() != EdifactTokenizer.TK_EOF)
          ;
       writer.write("</Message>");
       writer.flush();
    }

    protected int nextSegment()
       throws UnexpectedTokenException, IOException
    {
       int token = tokenizer.nextToken();
       if(token == EdifactTokenizer.TK_EOF)
          return EdifactTokenizer.TK_EOF;
       else if(token != EdifactTokenizer.TK_DATA)
          throw new UnexpectedTokenException(token,
                                   EdifactTokenizer.TK_DATA);
       String tag = tokenizer.getCurrentToken();
       writer.write("<Segment tag='");
       writeEscape(tag);
       writer.write("'>");
       match(EdifactTokenizer.TK_PLUS);
       structure.setSegment(tag);
       while(token != EdifactTokenizer.TK_EOF &&
             token != EdifactTokenizer.TK_APOSTROPHE)
       {
          switch(structure.nextContentType())
          {
             case 'C':
                token = nextComposite();
                break;
             case 'S':
                token = nextSimple();
                break;
             default:
                throw new IllegalStateException();
          }
       }
       writer.write("</Segment>");
       return token;
    }

    protected int nextComposite()
       throws UnexpectedTokenException, IOException
    {
       writer.write("<Composite>");
       int token = nextSimple();
       while(token == EdifactTokenizer.TK_COLON)
          token = nextSimple();
       writer.write("</Composite>");
       return token;
    }

    protected int nextSimple()
       throws UnexpectedTokenException, IOException
    {
       int token = tokenizer.nextToken();
       switch(token)
       {
          case EdifactTokenizer.TK_DATA:
             writer.write("<Simple>");
             writeEscape(tokenizer.getCurrentToken());
             writer.write("</Simple>");
             int t = tokenizer.nextToken();
             return t;
          case EdifactTokenizer.TK_PLUS:
          case EdifactTokenizer.TK_COLON:
          case EdifactTokenizer.TK_APOSTROPHE:
             writer.write("<Simple/>");
             return token;
          default:
             throw new UnexpectedTokenException(token);
       }
    }

    protected void writeEscape(String data)
       throws IOException
    {
       // assumes a Unicode encoding since
       // it does not escape non-ASCII characters
       for(int i = 0;i < data.length();i++)
       {
          char c = data.charAt(i);
          switch(c)
          {
             case '<':
                writer.write("&lt;");
                break;
             case '&':
                writer.write("&amp;");
                break;
             case '\'':
                writer.write("&apos;");
                break;
             default:
                writer.write(c);
          }
       }
    }
 }

The parse() method is the starting point. It uses the tokenizer to read the file. Next, it creates the root of the XML document (Message) and iterates over all the segments by repeatedly calling nextSegment() until it finds the TK_EOF token:

 public void parse(String filename)
    throws UnexpectedTokenException, IOException
 {
    tokenizer.tokenize(new FileInputStream(filename));
    writer.write("<?xml version='1.0'?>");
    writer.write("<Message>");
    while(nextSegment() != EdifactTokenizer.TK_EOF)
       ;
    writer.write("</Message>");
    writer.flush();
 }
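The excerpt above does not show how the pieces are wired together, so here is a hypothetical driver; the class name, the SAX parser class, and the file names are assumptions. It loads Listing 6.3 with a SAX1 parser (EdifactStructure extends HandlerBase, so it plugs in as a document handler) and then converts the order. It lives in com.psol.xsledi because EdifactStructure has package-private access:

 package com.psol.xsledi;

 import java.io.*;
 import org.xml.sax.*;
 import org.xml.sax.helpers.ParserFactory;

 // Hypothetical driver: wires EdifactStructure and EdifactParser
 // together; parser class name and file names are assumptions
 public class EdifactDriver
 {
    public static void main(String[] args)
       throws Exception
    {
       // load the segment definitions from Listing 6.3
       EdifactStructure structure = new EdifactStructure();
       Parser saxParser =
          ParserFactory.makeParser("org.apache.xerces.parsers.SAXParser");
       saxParser.setDocumentHandler(structure);
       saxParser.parse("edifactstructure.xml");
       // convert the EDIFACT message to the XML-ized format
       EdifactParser parser = new EdifactParser(structure);
       Writer writer = new FileWriter("orders.xml");
       parser.setWriter(writer);
       parser.parse("orders.edi");
       writer.close();
    }
 }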

After reading the tag and writing the Segment element, nextSegment() iterates over the various fields in the segment, calling nextComposite() or nextSimple() according to the segment structure made available by EdifactStructure:

 protected int nextSegment()
    throws UnexpectedTokenException, IOException
 {
    int token = tokenizer.nextToken();
    if(token == EdifactTokenizer.TK_EOF)
       return EdifactTokenizer.TK_EOF;
    else if(token != EdifactTokenizer.TK_DATA)
       throw new UnexpectedTokenException(token,
                                EdifactTokenizer.TK_DATA);
    String tag = tokenizer.getCurrentToken();
    writer.write("<Segment tag='");
    writeEscape(tag);
    writer.write("'>");
    match(EdifactTokenizer.TK_PLUS);
    structure.setSegment(tag);
    while(token != EdifactTokenizer.TK_EOF &&
          token != EdifactTokenizer.TK_APOSTROPHE)
    {
       switch(structure.nextContentType())
       {
          case 'C':
             token = nextComposite();
             break;
          case 'S':
             token = nextSimple();
             break;
          default:
             throw new IllegalStateException();
       }
    }
    writer.write("</Segment>");
    return token;
 }

nextComposite() and nextSimple() are even simpler. They read as much data as possible until they reach a separator:

 protected int nextComposite()
    throws UnexpectedTokenException, IOException
 {
    writer.write("<Composite>");
    int token = nextSimple();
    while(token == EdifactTokenizer.TK_COLON)
       token = nextSimple();
    writer.write("</Composite>");
    return token;
 }
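To see why the empty field in PRI+AAA:24.99::SRP' comes out as <Simple/>, here is a hand trace of nextComposite() over that composite, following the code above:

 nextSimple()  -> TK_DATA "AAA"    writes <Simple>AAA</Simple>,
                                   reads TK_COLON and returns it
 nextSimple()  -> TK_DATA "24.99"  writes <Simple>24.99</Simple>,
                                   returns TK_COLON
 nextSimple()  -> TK_COLON         the empty field: writes <Simple/>
                                   and returns TK_COLON
 nextSimple()  -> TK_DATA "SRP"    writes <Simple>SRP</Simple>,
                                   returns TK_APOSTROPHE, which
                                   ends the composite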

Listing 6.6 is UnexpectedTokenException, which the parser uses to report errors.

Listing 6.6 UnexpectedTokenException.java
 package com.psol.xsledi;

 public class UnexpectedTokenException
    extends Exception
 {
    public UnexpectedTokenException(int foundToken)
    {
       super("unexpected " +
             EdifactTokenizer.toString(foundToken) +
             " token found");
    }

    public UnexpectedTokenException(int expectedToken,
                                    int foundToken)
    {
       super("unexpected " +
             EdifactTokenizer.toString(foundToken) +
             " token found, was expecting " +
             EdifactTokenizer.toString(expectedToken));
    }
 }