7.5 Data Representations for Information Exchange | Modernizing Legacy Systems: Software Technologies, Engineering Processes, and Business Practices

The creation of standard data formats that can be used to share information across system and platform boundaries is an important innovation in data management. In this section, we describe two data representations for information exchange: electronic data interchange (EDI) and the eXtensible Markup Language (XML).

EDI

EDI is the computer-to-computer exchange of business data in standard formats between trading partners. EDI was first developed for the shipping and transportation industry more than 25 years ago to reduce paperwork burdens. Traditionally, individual trading partners implement EDI-shared definitions of document formats, including purchase orders, invoices, and shipping orders. The trading partners interface EDI to their existing systems via translation software rather than at the application level. EDI has been accepted by many industries, including health care, financial services, and government procurement. Such standards as ANSI ASC (Accredited Standards Committee) X12 and UN/EDIFACT (United Nations/Electronic Data Interchange for Administration, Commerce and Transport) enable the adoption of EDI by describing an agreed-on message format. Even though these standards reduce confusion about message formats, challenges in mapping message content persist.
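
As a rough illustration of what translation software does, the sketch below splits an X12-style message into segments and elements and maps each segment onto application field names. The delimiters (`~` and `*`), the sample data, and the field names are assumptions made for this example. Real interchanges declare their delimiters in the interchange header, and the mapping of message content is agreed between the trading partners.

```python
# Hypothetical sample: two purchase-order line items in an X12-like layout.
SAMPLE_MESSAGE = "PO1*1*10*EA*9.95~PO1*2*4*CA*120.00~"

# Hypothetical mapping from element position to the receiving
# application's field names.
LINE_ITEM_FIELDS = ["line_number", "quantity", "unit", "unit_price"]


def split_segments(message, segment_terminator="~", element_separator="*"):
    """Split a message into segments, each returned as a list of elements."""
    segments = []
    for raw in message.split(segment_terminator):
        raw = raw.strip()
        if raw:
            segments.append(raw.split(element_separator))
    return segments


def map_line_item(segment):
    """Map the elements of a PO1-style segment onto application field names."""
    segment_id, *elements = segment
    if segment_id != "PO1":
        raise ValueError(f"unexpected segment: {segment_id}")
    return dict(zip(LINE_ITEM_FIELDS, elements))


line_items = [map_line_item(s) for s in split_segments(SAMPLE_MESSAGE)]
print(line_items)
```

Splitting on delimiters is the easy part. The position-to-field table is where the mapping difficulties mentioned above tend to arise, because each partner may interpret or populate the elements differently.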

EDI enjoys wide industry acceptance: According to the International Data Corporation, more than three times as many B2B electronic transactions occurred via EDI in 2001 as over the Internet. Nonetheless, EDI is often criticized for its syntactic rigidity and its implementation costs. The rise of the document-centric Web and the notion of e-business have resulted in new avenues for exchanging data. Many EDI adopters have begun exploring new technologies to enhance systems relying on EDI technology. The leading technology is XML.

XML

XML is a markup language developed by the World Wide Web Consortium (W3C). XML is used to structure data to reduce ambiguity between applications sharing information. Structured data includes the content and the role the content plays.

A data object is an XML document if it is well formed, as defined in the XML specification. Being well formed requires that the elements are delimited by start tags and end tags and are nested properly. XML offers a universal syntax for describing and structuring data independent from application logic and is being used to define languages for specific industries and applications.
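
The well-formedness rule can be checked with any conforming parser. The following minimal sketch uses Python's standard `xml.etree.ElementTree` module. The helper name `is_well_formed` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


properly_nested = "<order><item>widget</item></order>"
badly_nested = "<order><item>widget</order></item>"
missing_end_tag = "<order><item>widget</item>"

for sample in (properly_nested, badly_nested, missing_end_tag):
    print(is_well_formed(sample), sample)
```

Only the first sample is an XML document in the sense of the specification. The other two fail because an end tag is misplaced or missing.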

It is important to note that XML is not a programming language but rather a markup language similar to HTML. Its design, however, was influenced by principles of good programming language design, including extensibility (allowing the introduction of new tags without breaking the existing document structure), platform independence, and support for internationalization, as it is based on Unicode. In fact, XML and HTML share a common ancestry, as they are both descendants of the Standard Generalized Markup Language (SGML), a 1986 ISO standard for structuring data commonly used in large technical documentation projects. Since its inception in 1996, XML has refined and focused SGML concepts into a simplified subset that is appropriate for use on the Web.

As XML has matured, several related standards have emerged. XSL (eXtensible Stylesheet Language) is the advanced language for expressing style sheets. XSL is based on XSLT (XSL Transformations), a transformation language used for rearranging, adding, and deleting tags and attributes.

DTD

Although an XML document is the data itself, the means to describe and validate the structure of the data is left to a document type definition (DTD). A DTD expresses constraints on XML documents by defining the allowable elements within a document and their content, order, and attributes. An XML document is valid if it has an associated DTD and complies with its constraints. In addition to validation, a DTD can define entities, define notations, and provide default values for attributes. The DTD enables heterogeneous applications to share data. Numerous specialized DTDs are used in specific industries and applications and have become standards.
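
The non-validating roles of a DTD can be seen even with a parser that does not validate. The sketch below assumes Python's standard `xml.etree.ElementTree` module, which reads the internal DTD subset but does not enforce its content models. The element names, the default currency, and the `supplier` entity are hypothetical.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the allowed structure, a default value
# for the "currency" attribute, and an entity named "supplier".
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE order [
  <!ELEMENT order (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item currency CDATA "USD">
  <!ENTITY supplier "Example Supply Co.">
]>
<order>
  <item>widget from &supplier;</item>
  <item currency="EUR">gadget</item>
</order>
"""

root = ET.fromstring(DOCUMENT)
items = root.findall("item")

for item in items:
    # The first item receives the default currency from the DTD;
    # the entity reference is replaced by its declared text.
    print(item.get("currency"), "|", item.text)
```

Checking the `order (item+)` content model itself requires a validating parser, such as the Java parsers listed under Products.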

XML Schemas

Schemas address several limitations of the DTD by providing a richer semantic encoding and by providing advances in describing document object models. Schemas are written in XML-instance-document syntax, using tags, elements, and attributes. Schemas can assign data types, such as integer and date, to elements and validate documents, based on not only the element structure but also the contents of the elements. DTDs lack an effective means to extend types and combine types from multiple namespaces, both of which are addressed by schemas.

XML Parsers

Because it is structured data in the form of plaintext with tags to delimit the data, XML can be manipulated simply. Programmatically, writing XML can be as simple as sending characters to a file output stream. On the other hand, reading XML is best accomplished via XML parsers, available as libraries from numerous vendors. Parsing is simply the process of reading an XML document and reporting its content to a client application while checking that the document is well formed.
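
A minimal sketch of the writing side follows, assuming Python and an in-memory text stream in place of a file. The function name `write_order` and the sample data are hypothetical. One caveat the sketch makes explicit is that characters such as `&` and `<` must be escaped, or the output will not be well formed.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def write_order(stream, customer, items):
    """Write a small order document to a text stream, escaping markup characters."""
    stream.write(f"<order customer={quoteattr(customer)}>\n")
    for item in items:
        stream.write(f"  <item>{escape(item)}</item>\n")
    stream.write("</order>\n")


buffer = io.StringIO()
write_order(buffer, "Smith & Sons", ["nuts & bolts", "size < 5 mm"])
document = buffer.getvalue()
print(document)

# Reading the text back with a parser confirms that it is well formed
# and that the escaped characters are restored.
root = ET.fromstring(document)
print(root.get("customer"), [item.text for item in root.findall("item")])
```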

The two common techniques for parsing XML are using the simple application programming interface for XML (Simple API for XML, or SAX) and using the document object model (DOM). SAX defines the API to read an XML file in sequence, from beginning to end, reporting each piece of content as it is encountered. SAX is based on two interfaces: the XMLReader interface, which represents the parser, and the ContentHandler interface, which is implemented to receive data from the parser. Callbacks of the event-oriented architecture of SAX notify the application's handler when element names and data are encountered. This technique is useful when processing large XML files and streaming data. The DOM is a standard set of function calls for manipulating XML and HTML files from a programming language in which the manipulation is not sequential but rather tree based. The DOM is most useful for programs needing to manipulate large portions of small documents.
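
The contrast between the two techniques can be sketched with Python's standard `xml.sax` and `xml.dom.minidom` modules, which follow the SAX and DOM designs. The sample document, the `ItemHandler` class, and the `checked` attribute are assumptions made for this example.

```python
import xml.sax
from xml.dom import minidom

DOCUMENT = "<order><item>widget</item><item>gadget</item></order>"


class ItemHandler(xml.sax.ContentHandler):
    """Collects the text of each <item> element as the parser reports events."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "item":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is accumulated.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._buffer = None


# SAX: the parser calls back into the handler as it reads the document in order.
handler = ItemHandler()
xml.sax.parseString(DOCUMENT.encode("utf-8"), handler)
sax_items = handler.items
print(sax_items)

# DOM: the whole document is loaded into a tree that can be navigated and changed.
dom = minidom.parseString(DOCUMENT)
for node in dom.getElementsByTagName("item"):
    node.setAttribute("checked", "yes")
dom_items = [node.firstChild.data for node in dom.getElementsByTagName("item")]
updated = dom.documentElement.toxml()
print(updated)
```

The SAX handler keeps only the state it needs, which is why the approach suits large or streamed input. The DOM version holds the entire tree in memory, which makes in-place changes straightforward.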

Standards

XML 1.0

Extensible Markup Language (XML) v1.0, a recommendation of the World Wide Web Consortium (W3C), is in its second edition. ^[11]

^[11] For the full text of the specification, go to http://www.sei.cmu.edu/cbs/mls/links.html#w3xml2000.

XML Schema

There are three W3C recommendations for the XML Schema: XML Schema Part 0: Primer; XML Schema Part 1: Structures; and XML Schema Part 2: Datatypes. The Primer, a non-normative document providing a readable description of the XML Schema facilities, is useful for quickly understanding how to create schemas using the XML Schema language. XML Schema Part 1: Structures specifies the XML Schema definition language, which offers facilities for describing the structure and constraining the contents of XML v1.0 documents. XML Schema Part 2: Datatypes defines facilities for defining datatypes to be used in XML Schemas, as well as other XML specifications. ^[12]

^[12] All three recommendations, along with additional information, can be found at http://www.sei.cmu.edu/cbs/mls/links.html#w3-xml-schema.

SAX

The Simple API for XML, originally a Java-only API, was the first widely adopted API for XML in Java and is a de facto standard. The current version is SAX v2.0, and there are versions for several programming language environments other than Java. ^[13]

^[13] For more information, see http://www.sei.cmu.edu/cbs/mls/links.html#saxproject.

Products

Apache Xerces

The Xerces Java Parser supports the XML v1.0 recommendation and contains advanced parser functionality, such as support for the W3C's XML Schema recommendation v1.0, DOM Level 2 v1.0, and SAX v2.0, in addition to supporting the industry-standard DOM Level 1 and SAX v1.0 APIs. ^[14]

^[14] More information can be found at http://www.sei.cmu.edu/cbs/mls/links.html#apache.

IBM XML4J

IBM's XML Parser for Java (XML4J) is a validating XML parser written in 100% Pure Java. XML4J incorporates support for the W3C XML Schema Recommendation v1.0, SAX v1.0 and SAX v2.0, DOM Level 1, DOM Level 2, some features of the DOM Level 3 Core Working Draft, and JAXP v1.1. IBM is a major contributor to Apache's Xerces-J code base. Version 1.4.2 of Xerces-J forms the basis for XML4J v3.2.1. ^[15]

^[15] For more information, see http://www.sei.cmu.edu/cbs/mls/links.html#alphaworks.

Sun JAXP

The Java API for XML Processing (JAXP) supports processing of XML documents using the DOM, SAX, and XSLT. JAXP enables applications to parse and transform XML documents independently of a particular XML processing implementation. Developers can swap between XML processors, such as high-performance versus memory-conservative parsers, without changing the application code. The JAXP reference implementation v1.1.3 includes a high-quality parser supporting both SAX and DOM and a transformation engine supporting XSLT. ^[16]

^[16] More information can be found at http://www.sei.cmu.edu/cbs/mls/links.html#sun.