Unfortunately, it is not possible to convert between two formats unless you are familiar with the input and output formats. I assume you are familiar with XML, but you might not be familiar with EDIFACT. This section is a crash course in EDIFACT. If you are in a hurry and want to jump straight into the code, I suggest you read at least the section "EDIFACT Segments."
EDIFACT, or UN/EDIFACT as it is formally known, is short for Electronic Data Interchange For Administration, Commerce, and Transport. It is a comprehensive e-commerce solution developed under the auspices of the United Nations (hence, the UN part of the name).
E-commerce is a commonly used term that has several meanings. When people think of e-commerce, though, they usually think of Amazon.com or other online shops. Other popular and older forms of e-commerce do exist, however.
Online shops cater primarily to the business-to-consumer (B2C) side of e-commerce. The other side is business-to-business (B2B) e-commerce, or the buying, selling, and other commercial transactions that take place between businesses.
Business-to-business commerce isn't as well known as the consumer-oriented side. This is mainly because it is less visible and more abstract: we shop in various stores (online and offline) every day, but few of us really care where the stores buy their goods.
This is business-to-business commerce: stores (businesses) buying goods from their suppliers (other businesses). What might surprise you is that it accounts for a very large volume because behind the supplier is another supplier, and another, and another.
Let's look at an example. Say you have bought Applied XML Solutions at a bookstore. The bookstore bought the same book from a distributor, who bought it from Sams. Sams, in turn, had the book manufactured by a printer. To manufacture the book, the printer bought paper and ink. You get the idea.
So, for a single consumer-oriented transaction (you buying the book), several business-to-business transactions must occur. These transactions have a multiplying effect, which means that business-to-business commerce, and consequently business-to-business e-commerce, is destined to outnumber consumer activities by a wide margin.
Electronic Data Interchange
One of the oldest forms of e-commerce is Electronic Data Interchange (EDI). EDI is concerned solely with business-to-business e-commerce. The idea behind EDI is very simple: To conduct business, companies have to exchange an enormous amount of paperwork. Let's replace the paperwork with electronic files.
For example, if my company decides to buy goods from yours, we'll issue a purchase order. We also expect the goods to come with an invoice. To pay the invoice, we might cut a check.
Do we write these documents with a pen and paper? This is unlikely because, like most companies, we use some sort of accounting software (by accounting software, I mean anything from QuickBooks to SAP) that tracks orders, invoices, and payments.
Go through your incoming mail and you'll find that most documents were printed by a computer (incidentally, you'll understand why Intuit makes so much money selling checkbooks). Follow the paper trail and you'll find the same documents are being routed to your own accounting software!
So, the process is to print commercial documents, send them by postal mail, and key them in at the receiving end. The paperwork and all the manual processing it requires are just a small annoyance for small corporations such as mine, but they are a major expense for larger organizations.
More than 20 years ago, some companies realized they could simplify things by building a more direct link between the two accounting packages. Instead of spitting out a paper purchase order, my computer produces a file. I then can email you the file and you can feed it straight into your accounting package. No paper or postal mail is required, and it's better than regular email because the commercial documents are automatically imported.
Some of the benefits of EDI include the following:
How big is EDI? According to Forrester Research, business-to-business e-commerce was valued at $671 billion in 1998. So, why don't we hear more about it? One of the reasons might be that most transactions take place on private networks, not the Internet. In fact, Internet transactions represented only $92 billion.
Most transactions taking place on private networks are not based on XML. Instead, they use the EDI-specific formats, such as UN/EDIFACT and ANSI X12.
However, it would be a mistake to discount XML in that space. The same study expects business-to-business e-commerce to grow to $1.3 trillion (that's trillion, not billion) within three years. And guess where most of the growth will take place? On the Internet, of course. Now guess which format will dominate on the Internet. If you chose XML, you're right again.
To summarize, business-to-business e-commerce is very important. It is several times larger than consumer-oriented activities and will remain so.
Currently, most of these transactions take place on private networks, using special formats. However, they are expected to migrate to the Internet and XML within the next three years. This is why it's important to build a bridge between the EDI formats and XML.
The Inner Workings of EDIFACT
The two dominant EDI formats are ANSI X12 and UN/EDIFACT. X12 was developed by ANSI and is used predominantly in the U.S. EDIFACT, on the other hand, enjoys a worldwide audience. Other popular formats include Odette (used in the automotive industry, including IAEG in the U.S.), Tradacoms (which is UK-based), and Swift (used in international banking).
Although they differ in details, the various EDI formats are based on the same principles.
The underlying idea is to develop electronic versions of most commercial documents. The list of documents is too long to detail here, but some examples include an electronic purchase order, electronic invoice, and electronic catalog. An electronic customs declaration (when importing or exporting goods), electronic financial transactions (to replace checks), and electronic tax and other tax-related forms have also been developed.
Finally, some industries have even developed documents specific to their needs: for example, electronic versions of insurance contracts, reinsurance claims, statistics forms, and more.
With EDIFACT, the electronic documents are called UNSMs, which is short for United Nations Standard Messages.
Because the messages are developed by international (EDIFACT) or national (X12, Tradacoms, and so on) bodies, they tend to be rather large. Imagine an invoice that satisfies the legal requirements of every country, every industry, and every company in the world! Large and unmanageable? You bet.
Therefore, users must simplify these documents before using them. For example, American companies must collect the sales tax, and European companies must collect the VAT (Value-Added Tax). The worldwide invoice does both, though. So, an American company would need to simplify it to include only sales tax, whereas a European company would limit it to VAT.
Incidentally, this is one of the major criticisms of the EDI formats: Because they are all-encompassing, they are very complex. Furthermore, to bring them down to something manageable, users must spend significant effort in simplifying the messages (a process known, in EDI circles, as creating subsets).
A side effect is that this creates incompatibilities, which cause most of the benefits of standardized formats to be lost. In the example, one company simplified to remove VAT, and the other to remove sales tax. Now, what happens when the U.S. corporation sends a purchase order to the European one?
This problem has led a growing number of companies to look for alternatives to the EDI formats, and XML appears to be a very attractive alternative because of the following:
The last point is worth reviewing. As I said, the international documents are so complex that companies must simplify them. Yet, when you study EDIFACT, you find that it has not been designed to be simplified. No support exists in the standard for simplifying orders.
On the other hand, XML has namespaces, which are a mechanism to organize large documents into smaller, more manageable subsets. Look at how XSL is divided into XSLT and XSLFO for a good example of how namespaces help simplify large standards. The standard is literally divided into two parts that can be used independently or combined at will. XML could bring that sort of benefit to EDI. For example, sales tax and VAT elements could be developed independently from the purchase order and then combined at will.
What do EDI messages look like? Listing 5.1 is an EDIFACT purchase order in which the bookstore, Playfield Books, is ordering books from Que.
For simplicity, the purchase order in Listing 5.1 is minimalist. It has all the required information but little extra.
Listing 5.1 orders.edi
UNH+1+ORDERS:D:96A:UN'
BGM+220+AGL153+9+AB'
DTM+137:20000310:102'
DTM+61:20000410:102'
NAD+BY+++PLAYFIELD BOOKS+34 FOUNTAIN SQUARE PLAZA+CINCINNATI+OH+45202+US'
NAD+SE+++QUE+201 WEST 103RD STREET+INDIANAPOLIS+IN+46290+US'
LIN+1'
PIA+5+0789722429:IB'
QTY+21:5'
PRI+AAA:24.99::SRP'
LIN+2'
PIA+5+0789724308:IB'
QTY+21:10'
PRI+AAA:42.50::SRP'
UNS+S'
CNT+3:2'
UNT+17+1'
We will look at this purchase order in more detail in the next section. This section serves as a crash course in EDIFACT syntax. Don't worry if you don't remember everything; this is not "Applied EDIFACT Solutions." However, to convert between any format and XML, you need to know the fundamentals of the non-XML format.
The building block for the EDIFACT message is the segment. Segments start with a tag followed by a set of data. They end with the ' character. The tag identifies the segment. For example, the following is a price segment, recognizable by its PRI tag:
PRI+AAA:42.50::SRP'
Within a segment, the fields are delineated by the + or : character. The fields have no tags but are identified by their position. For example, in the PRI segment, the price is always the second field, which is $42.50 here.
The first and fourth fields are coded fields, which means their value is a code or an alphanumeric identifier for a value. For example, in the first field, the code AAA means net price. Codes are similar to enumerated parameter values in XML and are used for the same purposes.
If you are curious, SRP in the fourth field means suggested retail price. The meaning of the codes is specified by the EDIFACT standard.
You'll notice that the third field is empty, which means it has no value. However, because fields are identified by their position, the empty third field cannot be omitted. Indeed, I cannot write
PRI+AAA:42.50:SRP'
or SRP would be in the third field instead of the fourth field. The third field has a different meaning (it is reserved for the price type) from the fourth field.
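To see why the empty field matters, here is a minimal Python sketch that splits a segment into its tag and positional fields. The function name split_fields is my own, and the sketch deliberately treats + and : alike and ignores escape characters, so it is an illustration rather than a real EDIFACT parser.

```python
import re

def split_fields(segment):
    """Return the tag followed by every field, in positional order."""
    body = segment.rstrip("'")        # drop the segment terminator
    return re.split(r"[+:]", body)    # + and : both separate fields here

full = split_fields("PRI+AAA:42.50::SRP'")
short = split_fields("PRI+AAA:42.50:SRP'")

print(full)    # ['PRI', 'AAA', '42.50', '', 'SRP']
print(short)   # ['PRI', 'AAA', '42.50', 'SRP']
```

Index 0 holds the tag, so the fourth field sits at index 4. In the first result SRP is at index 4, where it belongs; in the second it has slid to index 3, which is the third field.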
When EDIFACT was originally conceived, bandwidth was more expensive than it is today. Therefore, a lot of effort was directed toward achieving the smallest file possible.
If you are curious, compare Listing 5.1 with Listing 5.2, which is the same purchase order in XML. EDIFACT is clearly the winner in terms of size.
What about the + and : characters? Fields in a segment can be either simple fields or composite fields and are separated by + characters. A composite field is a list of simple fields separated by : characters.
Therefore, the PRI segment
PRI+AAA:42.50::SRP'
contains one composite field, which is made up of four simple fields (AAA, 42.50, empty, and SRP). Compare this with the PIA segment (product identifier):
PIA+5+0789722429:IB'
PIA starts with a simple data element (5) followed by a composite data element (0789722429:IB) for the ISBN number. The composite data element has two simple data elements (0789722429 and IB).
Note that ISBN stands for International Standard Book Number. It is a worldwide identifier for books. The ISBN appears on the back of the book with the bar code, and each book has a unique one. For example, this book has been assigned ISBN 0-7897-2430-8.
Because each book has a different ISBN, using only the ISBN suffices when ordering books. In fact, less risk of confusion is involved when ordering books by ISBN than by the title or author's name. It's easy to confuse two books with the same title, but it is impossible to confuse two books' ISBNs.
It's not always obvious why some elements become simple data elements while others end up as composite data elements. You should refer to the EDIFACT documentation to decide which is which.
In theory, when two or more simple data elements are often used together, they have been grouped in a composite data element.
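The distinction between simple and composite data elements maps naturally onto a two-level split. The following Python sketch assumes a segment with no escape characters; parse_segment is a hypothetical helper name, and a simple data element is represented as a composite with a single component to keep the result uniform.

```python
def parse_segment(segment):
    """Split a segment into its tag and a list of data elements.

    Each data element is a list of components: one component for a
    simple data element, several for a composite data element.
    """
    body = segment.rstrip("'")                 # drop the segment terminator
    tag, *elements = body.split("+")           # + separates data elements
    return tag, [element.split(":") for element in elements]

pri = parse_segment("PRI+AAA:42.50::SRP'")
pia = parse_segment("PIA+5+0789722429:IB'")

print(pri)   # ('PRI', [['AAA', '42.50', '', 'SRP']])
print(pia)   # ('PIA', [['5'], ['0789722429', 'IB']])
```

PRI yields a single composite with four components, one of them empty, whereas PIA yields a simple element followed by a two-component composite.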
You are now familiar with the basics of EDIFACT. However, we should consider two important rules that we have not yet encountered: compression and escape characters.
I explained that empty fields must be present so as not to shift the field positions. Thanks to the so-called compression mechanism, you can remove empty fields when no risk of confusion is involved. For example, in the PIA segment, the ISBN can repeat up to five times, so it could look similar to the following:
PIA+5+0789722429:IB++++'
But, because the four empty composite data elements are also the last elements in the segment, no risk of confusion exists, so you must write the following segment:
PIA+5+0789722429:IB'
The same rule applies at the end of composite data elements. The definition for the BGM (beginning of message) segment states that the first composite data element has four fields. However, if it looks like this
BGM+220:::+AGL153+9+AB'
the compression rule states that if the last three fields of the composite data elements are empty, they need not appear in the segment. Therefore, we must write
BGM+220+AGL153+9+AB'
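The compression rule can be sketched in a few lines of Python. The compress function below is a hypothetical helper that takes a tag and a list of data elements (each a list of components), drops trailing empty components and trailing empty data elements, and writes the segment. It does not escape special characters, which a real writer would need to do.

```python
def compress(tag, elements):
    """Write a segment, omitting trailing empty components and elements."""
    parts = []
    for components in elements:
        components = list(components)
        while components and components[-1] == "":
            components.pop()                   # trailing empty components
        parts.append(":".join(components))
    while parts and parts[-1] == "":
        parts.pop()                            # trailing empty data elements
    return "+".join([tag] + parts) + "'"

pia = compress("PIA", [["5"], ["0789722429", "IB"], [""], [""], [""], [""]])
bgm = compress("BGM", [["220", "", "", ""], ["AGL153"], ["9"], ["AB"]])
pri = compress("PRI", [["AAA", "42.50", "", "SRP"]])

print(pia)   # PIA+5+0789722429:IB'
print(bgm)   # BGM+220+AGL153+9+AB'
print(pri)   # PRI+AAA:42.50::SRP'
```

Note that the empty third component in PRI survives: it is not at the end of the composite, so removing it would shift SRP into the wrong position.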
The last syntactical rule is concerned with escape characters. Because +, :, and ' have a special role in segments, they cannot appear in data. This is similar to the < and & characters in XML, which cannot appear in data, either.
EDIFACT's solution is to escape these characters with the ? character; therefore, we would not write
NAD+BY+++PLAYFIELD BELGIUM+43 RUE DE L'OUVRAGE+NAMUR++5000+BE'
because the ' would be confused with the end of the segment. Instead, we'd write the following:
NAD+BY+++PLAYFIELD BELGIUM+43 RUE DE L?'OUVRAGE+NAMUR++5000+BE'
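A parser therefore cannot simply split on + or '; it has to skip any separator that follows a ?. The Python sketch below shows one way to do it. The helper names are my own. It also escapes the ? character itself (written ??), which the text above does not mention but which follows from ? having a special role of its own.

```python
SPECIAL = "+:'?"   # separators, terminator, and the escape character itself

def escape(value):
    """Prefix every special character in a data value with ?."""
    return "".join("?" + ch if ch in SPECIAL else ch for ch in value)

def unescape(value):
    """Remove each ? prefix, keeping the character that follows it."""
    out, i = [], 0
    while i < len(value):
        if value[i] == "?" and i + 1 < len(value):
            i += 1                     # skip the ?, keep the next character
        out.append(value[i])
        i += 1
    return "".join(out)

def split_unescaped(text, separator):
    """Split text on separator, ignoring separators preceded by ?."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "?" and i + 1 < len(text):
            current.append(text[i:i + 2])   # keep the escape pair intact
            i += 2
        elif text[i] == separator:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts

street = "43 RUE DE L'OUVRAGE"
segment = "NAD+BY+++PLAYFIELD BELGIUM+" + escape(street) + "+NAMUR++5000+BE'"
fields = split_unescaped(segment[:-1], "+")

print(segment)
print(unescape(fields[5]))
```

The escaped apostrophe stays inside the street field when splitting, and unescape restores the original text afterward.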
The Message in Detail
A message is a list of segments. The meaning of the segments, their positions in the message, the acceptable codes for coded data elements, and more are specified in the EDIFACT standard. To decode a message, you must look up its definition in the EDIFACT standard.
The standard is conveniently available online at http://www.unece.org; follow the links for UN/CEFACT and then UN/EDIFACT. You can search by message and drill down to the list of segments. From the segments, you then can zoom to the data elements and code lists. See Figures 5.1 and 5.2 for examples.
To save you this rather tedious task, here are the secrets of Listing 5.1, segment by segment.
UNH+1+ORDERS:D:96A:UN'
The UNH segment marks the document as an EDIFACT document and identifies the type of document, which in this case is an order (ORDERS). The 1 in the first field is a message identifier; D, 96A, and UN in the last fields identify a specific revision of the ORDERS message.
For completeness, note that EDIFACT groups messages in interchanges. The beginning and end of interchanges are indicated through more segments. For simplicity, interchanges are not discussed in this chapter.
Figure 5.1. Looking up the list of segments on the UN/ECE Web site.
Figure 5.2. Zooming in on one segment in the invoice shows the fields.
BGM+220+AGL153+9+AB'
BGM stands for beginning of message. The code 220 confirms that the document is indeed an order. Next is the purchase order number, AGL153.
The 9 in the next field is a code that says this message is the original purchase order (other codes exist for duplicates). The last field, AB, means we want the recipient to acknowledge receipt.
DTM+137:20000310:102'
The DTM segment in the previous line is the date (actually it's the Date and Time, hence the trailing M). The code 137 says this is the purchase order date. The actual date is next. The final code, 102, means that the date is in ISO format, 10 March 2000, in this case.
When EDIFACT was originally conceived, other date formats were commonly used (including the dreadful two-digit years such as 99). Lately, it seems everybody uses the ISO date format, so 102 is becoming some sort of a constant for dates.
DTM+61:20000410:102'
The next segment is another date. This one has the code 61 in the first field, meaning it is the last date for delivery. If the seller cannot deliver within a month (by 10 April 2000), he can forget the order.
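Decoding a DTM segment is a matter of splitting the composite and reading the date according to its format code. The Python sketch below handles only format 102 (year, month, and day as eight digits) and rejects anything else; parse_dtm is a hypothetical helper name, and the sketch assumes the segment contains no escape characters.

```python
from datetime import date, datetime

def parse_dtm(segment):
    """Return (qualifier, date) for a DTM segment that uses format 102."""
    composite = segment.rstrip("'").split("+")[1]
    qualifier, value, fmt = composite.split(":")
    if fmt != "102":
        raise ValueError("unsupported date format code: " + fmt)
    return qualifier, datetime.strptime(value, "%Y%m%d").date()

order_date = parse_dtm("DTM+137:20000310:102'")
deadline = parse_dtm("DTM+61:20000410:102'")

print(order_date[0], order_date[1].isoformat())   # 137 2000-03-10
print(deadline[0], deadline[1].isoformat())       # 61 2000-04-10
```

The qualifier is returned as a string so that the caller can decide what each date means (137 for the order date, 61 for the delivery deadline).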
NAD+BY+++PLAYFIELD BOOKS+34 FOUNTAIN SQUARE PLAZA+CINCINNATI+OH+45202+US'
The next segment is an NAD, meaning name and address. The first field is a code (BY) to indicate this is the buyer's address. After two unused fields, we find, in order of appearance, the name of the buyer (PLAYFIELD BOOKS), the street address (34 FOUNTAIN SQUARE PLAZA), the city (CINCINNATI), the state (OH), the ZIP code or postal code (45202), and, finally, the country (US).
NAD+SE+++QUE+201 WEST 103RD STREET+INDIANAPOLIS+IN+46290+US'
A second NAD contains the seller's address (code SE).
LIN+1'
Next is the first line of the order, identified by a LIN segment and the line number (1 in this case).
PIA+5+0789722429:IB'
The PIA segment that follows contains the product identifier. The code 5, in the first field, specifies that the product identifier is related to an order line. The identifier itself follows as ISBN 0-7897-2242-9. The last code (IB) marks the identifier as an ISBN.
Why do we need a code 5 to specify that the product identifier applies to an order line? Isn't it obvious by reading the order message that this must be an order line?
Yes and no. One of the issues with EDIFACT is that it uses a very flat data structure. Essentially, a message is a list of segments. With large messages, the placement of segments doesn't always indicate what's what. Codes such as this 5 identify relationships between segments ("this is not any product identifier; it's the product identifier for the current product line"). These special codes are known as qualifiers.
To be complete, note that EDIFACT has a notion of groups of segments, in which a group is a set of related segments. However, groups have no special syntax, so they are not easy to recognize in a message!
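In code, qualifiers usually end up as keys in a lookup table. The Python sketch below uses only the codes that appear in Listing 5.1, with the meanings given in this section; the table and the describe function are illustrative names of my own, and a real application would load the code lists from the EDIFACT directories instead.

```python
# (segment tag, leading code) -> meaning, limited to codes from Listing 5.1
QUALIFIERS = {
    ("DTM", "137"): "purchase order date",
    ("DTM", "61"): "last date for delivery",
    ("NAD", "BY"): "buyer",
    ("NAD", "SE"): "seller",
    ("PIA", "5"): "product identifier for the order line",
    ("QTY", "21"): "quantity for the order line",
}

def describe(segment):
    """Look up the meaning of the code that opens a segment's data."""
    tag, first, *_ = segment.rstrip("'").split("+")
    code = first.split(":")[0]        # the code is the first component
    return QUALIFIERS.get((tag, code), "unknown code")

print(describe("QTY+21:5'"))
print(describe("DTM+61:20000410:102'"))
print(describe("NAD+SE+++QUE+201 WEST 103RD STREET+INDIANAPOLIS+IN+46290+US'"))
```

Keying the table on both the tag and the code matters because the same code can mean different things in different segments.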
QTY+21:5'
The QTY segment indicates we are buying five books (in the last field). The 21 code (it's a qualifier again) states that the quantity applies to an order line.
PRI+AAA:24.99::SRP'
This is the product price in a PRI segment. The first field is a code (AAA) meaning net price. It is followed by the price itself (24.99), an empty field, and a code (SRP for suggested retail price).
LIN+2'
PIA+5+0789724308:IB'
QTY+21:10'
PRI+AAA:42.50::SRP'
The second line is for the order of 10 books (ISBN 0-7897-2430-8) at a suggested retail price of $42.50.
UNS+S'
The next segment, UNS, means that the following segments are a summary of the message.
CNT+3:2'
In this case, the summary consists of only one CNT (count) segment. The code 3 in the first field indicates it counts the number of order lines, which is 2 in this case.
UNT+17+1'
The last segment is a UNT with two fields. The first field counts the total number of segments in the message (17). The second field, on the other hand, repeats the message identifier from the UNH segment (1 in this case).
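The counts in the CNT and UNT segments give the receiver a cheap integrity check. The following Python sketch applies the three checks just described to the message of Listing 5.1. The function name and the result keys are my own, and the plain split on ' is safe only because this particular message contains no escaped characters.

```python
MESSAGE = (
    "UNH+1+ORDERS:D:96A:UN'"
    "BGM+220+AGL153+9+AB'"
    "DTM+137:20000310:102'"
    "DTM+61:20000410:102'"
    "NAD+BY+++PLAYFIELD BOOKS+34 FOUNTAIN SQUARE PLAZA+CINCINNATI+OH+45202+US'"
    "NAD+SE+++QUE+201 WEST 103RD STREET+INDIANAPOLIS+IN+46290+US'"
    "LIN+1'PIA+5+0789722429:IB'QTY+21:5'PRI+AAA:24.99::SRP'"
    "LIN+2'PIA+5+0789724308:IB'QTY+21:10'PRI+AAA:42.50::SRP'"
    "UNS+S'CNT+3:2'UNT+17+1'"
)

def check_trailer(message):
    """Compare the counts declared in CNT and UNT with the actual message."""
    segments = [s for s in message.split("'") if s]
    tags = [s.split("+")[0] for s in segments]
    unh = segments[0].split("+")
    unt = segments[-1].split("+")
    cnt = next(s for s in segments if s.startswith("CNT+"))
    control, value = cnt.split("+")[1].split(":")
    return {
        "segment_count_ok": int(unt[1]) == len(segments),   # UNH..UNT inclusive
        "reference_ok": unt[2] == unh[1],                   # same message id
        "line_count_ok": control == "3" and int(value) == tags.count("LIN"),
    }

result = check_trailer(MESSAGE)
print(result)
```

All three checks pass for Listing 5.1: the message has 17 segments, both UNH and UNT carry the identifier 1, and there are two LIN segments.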
The Message Structure
To summarize, the EDIFACT message follows the classic structure of a commercial document such as an order or an invoice: It starts with a reference number, the dates, and the names and addresses of the parties. Next are order lines. Each line contains a product identifier (ISBN), the quantity, and the price.
As has already been noted, this structure is not immediately apparent in EDIFACT because it is a rather flat list of segments. In contrast, XML elements nest, so the structure is immediately apparent.
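To make the contrast concrete, the Python sketch below walks the flat list of segments and rebuilds the nesting with the standard xml.etree.ElementTree module. The element and attribute names (Order, Buyer, Seller, Line, and so on) are invented for this illustration and are not those of Listing 5.2; the sketch also ignores escape characters and every segment it does not recognize.

```python
import xml.etree.ElementTree as ET

SEGMENTS = [
    "UNH+1+ORDERS:D:96A:UN'", "BGM+220+AGL153+9+AB'",
    "NAD+BY+++PLAYFIELD BOOKS+34 FOUNTAIN SQUARE PLAZA+CINCINNATI+OH+45202+US'",
    "NAD+SE+++QUE+201 WEST 103RD STREET+INDIANAPOLIS+IN+46290+US'",
    "LIN+1'", "PIA+5+0789722429:IB'", "QTY+21:5'", "PRI+AAA:24.99::SRP'",
    "LIN+2'", "PIA+5+0789724308:IB'", "QTY+21:10'", "PRI+AAA:42.50::SRP'",
    "UNS+S'", "CNT+3:2'", "UNT+17+1'",
]

def to_xml(segments):
    """Turn the flat segment list into a nested element tree."""
    order = ET.Element("Order")
    line = None                                  # the order line being filled
    for segment in segments:
        tag, *elements = segment.rstrip("'").split("+")
        fields = [element.split(":") for element in elements]
        if tag == "BGM":
            order.set("number", fields[1][0])
        elif tag == "NAD":
            name = "Buyer" if fields[0][0] == "BY" else "Seller"
            ET.SubElement(order, name).text = fields[3][0]
        elif tag == "LIN":
            line = ET.SubElement(order, "Line", number=fields[0][0])
        elif tag == "PIA" and line is not None:
            ET.SubElement(line, "ISBN").text = fields[1][0]
        elif tag == "QTY" and line is not None:
            ET.SubElement(line, "Quantity").text = fields[0][1]
        elif tag == "PRI" and line is not None:
            ET.SubElement(line, "Price").text = fields[0][1]
        elif tag == "UNS":
            line = None                          # the summary section begins
    return order

order = to_xml(SEGMENTS)
print(ET.tostring(order, encoding="unicode"))
```

The only thing that tells the converter that a PIA, QTY, or PRI segment belongs to a given line is that it follows a LIN segment, so the code must track the current line itself. In the resulting XML, that relationship is explicit in the nesting.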