Chapter 15. XML as a Data Format | XML in a Nutshell, 2nd Edition

CONTENTS

15.1 Why Use XML for Data?
15.2 Developing Data-Oriented XML Formats
15.3 Sharing Your XML Format

Despite its document roots, the most common applications of XML today involve the storage and transmission of information for use by different software applications and systems. New technologies and frameworks (such as Web Services) depend heavily on XML content to communicate and negotiate between dissimilar applications.

The appropriate techniques used to design, build, and maintain a data-centric XML application vary greatly, depending on the required functionality and intended audience. This chapter discusses the different concerns, techniques, and technologies that should be considered when designing a new data-centric XML application.

15.1 Why Use XML for Data?

Before XML, individual programmers had to determine how data would be formatted whenever they needed to store or transmit program data. In most cases, the data was never intended for use outside the original program, so programmers would store it in the most convenient format they could devise. A few de facto file formats evolved over the years (RTF, CSV, and the ubiquitous Windows .ini file format), but the data written by one program could usually be read only by that same program. In fact, it was often possible for only that specific version of the same program to read the data.

The rapid proliferation of XML and free XML tools throughout the programming community has given developers an obvious choice when the time comes to select a data-storage or transmission format for their application. For all but the most trivial applications, the benefits of using XML to store and retrieve data far outweigh the additional overhead of including an XML parser in your application. The unique strengths of using XML as a software data format include:

Simple syntax: Easy to generate and parse.
Support for nesting: Tags easily allow programs to represent structures with nested elements.
Easy to debug: Human-readable data format is easy to explore and create with a basic text editor.
Language and platform independent: XML and Unicode guarantee that your datafile will be portable across virtually every popular computer architecture and language combination in use today.
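
As a rough illustration of the first two points, the sketch below builds a small nested document and reads it back with Python's standard xml.etree.ElementTree module. The order vocabulary (order, customer, items, item) is invented for this example, and the xml_declaration argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical order record; element and attribute names are illustrative only.
order = ET.Element("order", id="1001")
customer = ET.SubElement(order, "customer")
ET.SubElement(customer, "name").text = "Zoë Müller"  # non-ASCII text
items = ET.SubElement(order, "items")
for sku, qty in [("A-17", 2), ("B-04", 1)]:
    ET.SubElement(items, "item", sku=sku, quantity=str(qty))

# Serialize to UTF-8 bytes with an XML declaration.
xml_bytes = ET.tostring(order, encoding="utf-8", xml_declaration=True)

# Parse the bytes back and walk the nested structure.
parsed = ET.fromstring(xml_bytes)
customer_name = parsed.findtext("customer/name")
quantities = {item.get("sku"): int(item.get("quantity"))
              for item in parsed.iter("item")}
```

The same bytes could be written to a file and read by a parser in any other language, which is the portability the list above describes.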

Building on these basic strengths, XML can make possible new types of applications that would have been previously impossible (or very costly) to implement.

There are a few technologies that seek to achieve similar cross-program compatibility but use binary formats. Abstract Syntax Notation One (ASN.1) is probably the most prominent of these. ISO and ITU-T are developing standards for working with XML and ASN.1 in various combinations; more information on these developments is available from http://asn1.elibel.tm.fr/en/xml/.

15.1.1 Mixed Environments

Modern enterprise applications often involve software running on different computer systems under various operating systems. Choosing a communication protocol involves finding the lowest common denominator available on each system. With the large number of XML parsers that can be freely integrated with your application, XML is becoming a popular format for enterprise data sharing.

Imagine a typical enterprise application that needs to display data from a mainframe to users connected to a corporate web site. In this case, XML acts as the "glue" to connect a web server with a legacy application on a mainframe. The simple XML interface application accepts requests from the web server, calls the legacy application, and converts the result to XML. Using a technology like XSLT, the web server can then transform the XML into a number of acceptable web formats. By adopting XML as the common language of your enterprise, it becomes easier to reuse existing data in new ways.
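
The "glue" role can be sketched in a few lines of Python. The fixed-width record layout, the field names, and the accountRecord element are all assumptions made for this example; a real legacy application would dictate its own layout.

```python
import xml.etree.ElementTree as ET

# Assumed fixed-width layout of the legacy record: (field name, start, end).
LAYOUT = [("account", 0, 8), ("name", 8, 28), ("balance", 28, 38)]

def legacy_record_to_xml(record: str) -> str:
    """Wrap each fixed-width field of a legacy record in its own element."""
    root = ET.Element("accountRecord")
    for field, start, end in LAYOUT:
        ET.SubElement(root, field).text = record[start:end].strip()
    return ET.tostring(root, encoding="unicode")

# A sample record as the hypothetical legacy application might return it.
sample = "00012345" + "Jane Smith".ljust(20) + "1523.75".rjust(10)
xml_text = legacy_record_to_xml(sample)
```

Once the data is in this form, the web server no longer needs to know anything about the column positions used on the mainframe.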

Even on smaller systems, XML can be useful for sharing information between applications written in different languages or running in different environments. If a Perl program and a Java program need to communicate, generating and processing XML can be simpler than the alternatives. The XML can also serve as a record of their communications or provide a gateway to other systems that need to join the conversation.

15.1.2 Communications

Building flexible communications protocols that link disparate systems has always been a difficult area in computing. With the proliferation of computer networking and the Internet, building distributed systems has become even more important.

While XML itself is only a data format, not a protocol, XML's flexibility and cross-platform usability have inspired some new developments on the protocol front. XML messaging started even before the XML specification was finished, and various forms of XML messaging have continued to evolve.

One of the earliest approaches, and still a common one, was transmitting XML over HTTP POST requests. The sender would assemble an XML document and send it much like HTML form data, and the recipient would process the XML and send back a response, also in XML. Some developers create custom vocabularies for these transactions, while others have moved to standardized vocabularies such as XML-RPC and SOAP.
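
A minimal sketch of the sending side, using only Python's standard library, might look like the following. The URL and the stockQuery vocabulary are placeholders, and the send function is defined but never called here, so nothing is transmitted.

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_xml_post(url: str, root: ET.Element) -> urllib.request.Request:
    """Package an XML document as the body of an HTTP POST request."""
    body = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )

# Hypothetical custom vocabulary for the request document.
query = ET.Element("stockQuery")
ET.SubElement(query, "symbol").text = "ORLY"
request = build_xml_post("http://example.com/quotes", query)

def send(req: urllib.request.Request) -> ET.Element:
    """Send the request and parse the XML document that comes back."""
    with urllib.request.urlopen(req) as response:
        return ET.fromstring(response.read())
```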

XML-RPC is a very simple protocol, which uses XML messages traveling on HTTP to represent client-server remote procedure calls (RPC). The XML messages identify methods, parameters, and the results of calling the methods. The XML documents use a simple but effective set of data types (including arrays and structs) to pass information between computers. For more information on XML-RPC, see http://www.xmlrpc.com/.
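
To see what these messages look like without setting up a server, Python's standard xmlrpc.client module can encode and decode them directly. The method name inventory.count and its parameters are hypothetical.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method with a string and a struct.
call_xml = xmlrpc.client.dumps(
    ("widget", {"warehouse": "east", "minimum": 5}),
    methodname="inventory.count",
)

# Decode the call, as the receiving side would.
params, method = xmlrpc.client.loads(call_xml)

# Encode and decode a response carrying a single integer result.
response_xml = xmlrpc.client.dumps((42,), methodresponse=True)
(result,), _ = xmlrpc.client.loads(response_xml)
```

Printing call_xml shows the methodCall document, with the Python dictionary rendered as an XML-RPC struct.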

SOAP offers much more flexibility than XML-RPC, but is much more complex as well. SOAP (formerly the Simple Object Access Protocol, but now an acronym without meaning) uses XML to encapsulate information being sent between programs. SOAP is no longer bound to an HTTP transport, but HTTP is commonly used. It offers both an RPC approach and a document-oriented approach and uses XML Schema data types (with some of its own extensions for things like arrays) to identify type information. SOAP is often grouped with Web Services Description Language (WSDL) and Universal Description, Discovery, and Integration (UDDI) in discussions of "Web Services." For information on SOAP and Web Services, see http://www.w3.org/2002/ws/.
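
The encapsulation idea can be sketched by hand for the document-oriented style. The envelope namespace below is the one defined by SOAP 1.1; the application namespace and the getOrderStatus payload are invented for this example, and a real deployment would normally rely on a SOAP toolkit rather than code like this.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
APP_NS = "http://example.com/orders"                    # hypothetical

ET.register_namespace("soap", SOAP_NS)

def wrap_in_envelope(payload: ET.Element) -> ET.Element:
    """Place an application document inside a SOAP Envelope and Body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return envelope

payload = ET.Element(f"{{{APP_NS}}}getOrderStatus")
ET.SubElement(payload, f"{{{APP_NS}}}orderId").text = "1001"
envelope_xml = ET.tostring(wrap_in_envelope(payload), encoding="unicode")

# The receiver unwraps the Body and hands the payload to the application.
received = ET.fromstring(envelope_xml)
first_child = received.find(f"{{{SOAP_NS}}}Body")[0]
```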

Some developers are promoting the use of HTTP-based alternatives to SOAP and XML-RPC, under the banner of Representational State Transfer (REST). For more information on this architectural approach and the perspective it offers, see http://internet.conveyor.com/RESTwiki/moin.cgi.

The Blocks Extensible Exchange Protocol (BEEP) takes a very different approach from SOAP and XML-RPC. Rather than building documents that travel over existing protocols, BEEP uses XML to build protocols on TCP sockets. BEEP supports HTTP-style message-and-reply, as well as more complex synchronous and asynchronous modes of communication. SOAP messages can be transmitted over BEEP, and so can a wide variety of other XML and binary information. More information on BEEP is available at http://www.beepcore.org.

15.1.3 Object Serialization

Like the issue of communications, the question of where and how to store the state of persistent objects has been answered in various ways over the years. With the popular adoption of object-oriented languages, such as C++ and Java, the language and runtime environment frequently handle object-serialization mechanics. Unfortunately, most of these technologies predate XML.

Most existing serialization methods are highly language- and architecture-specific. The serialized object is most often stored in a binary format that is not human readable. These files break easily if corrupted, and maintaining compatibility as the object's structure changes frequently requires custom work on the part of the programmer.

The features that make XML popular as a communications protocol also make it popular as a format for serializing object contents. Viewing the object's contents, making manual modifications, and even repairing damaged files is easy. XML's flexible nature allows the file format to expand ad infinitum while maintaining backward compatibility with older file versions. XML's labeled hierarchies are a clean fit for nested object structures, and conversions from objects to XML and back can be reasonably transparent. (Mapping arbitrary XML to object structures is a much harder problem.)
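
The following sketch shows one way an object-to-XML round trip might look in Python. The Polygon and Point classes and the element names are made up for the example; this is not the format produced by any particular serialization library.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Polygon:
    name: str
    points: list = field(default_factory=list)

def polygon_to_xml(polygon: Polygon) -> str:
    """Write the object as a polygon element with one point child per vertex."""
    root = ET.Element("polygon", name=polygon.name)
    for p in polygon.points:
        ET.SubElement(root, "point", x=repr(p.x), y=repr(p.y))
    return ET.tostring(root, encoding="unicode")

def polygon_from_xml(text: str) -> Polygon:
    """Rebuild the object, reading only the elements this version knows."""
    root = ET.fromstring(text)
    points = [Point(float(e.get("x")), float(e.get("y")))
              for e in root.findall("point")]
    return Polygon(root.get("name"), points)

triangle = Polygon("triangle", [Point(0.0, 0.0), Point(4.0, 0.0), Point(0.0, 3.0)])
restored = polygon_from_xml(polygon_to_xml(triangle))

# A file from a later version with an extra element still loads.
newer = '<polygon name="t"><color>red</color><point x="1.0" y="2.0"/></polygon>'
from_newer = polygon_from_xml(newer)
```

Because the reader looks up elements by name and ignores the ones it does not recognize, a later version can add a color element without breaking this older code.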

A number of tools serialize objects written in various environments as XML documents and can recreate the objects from the XML. Java 1.4, for example, adds an "API for Long-Term Persistence" to its java.beans package, giving developers an alternative to its existing (and still supported) opaque binary serialization format. The XML vocabulary looks a lot like Java and is clearly designed for use within a Java framework, though other environments may import and export the serialization. For more information on this API and the XML it produces, see http://java.sun.com/j2se/1.4/docs/guide/beans/changes14.html#ltp. Microsoft's .NET framework includes similar capabilities but uses an XML Schema-based approach.

15.1.4 Data Storage/Retrieval

The line between an XML file and a database can be blurred. Though XML documents are too verbose and searching is too inefficient for high-performance large-scale database applications, they may be used as a simple, self-contained data store for small sets of information.
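
For a small data set, the "database" can be nothing more than a parsed document and a few path expressions. The contacts vocabulary below is invented for the example, and the queries use the limited XPath subset that Python's ElementTree supports.

```python
import xml.etree.ElementTree as ET

# A small, self-contained data store (hypothetical vocabulary).
ADDRESS_BOOK = """\
<contacts>
  <contact id="c1"><name>Ada</name><city>London</city></contact>
  <contact id="c2"><name>Grace</name><city>Arlington</city></contact>
  <contact id="c3"><name>Alan</name><city>London</city></contact>
</contacts>
"""

store = ET.fromstring(ADDRESS_BOOK)

# Select contacts whose city child has the text "London".
londoners = [c.findtext("name")
             for c in store.findall("contact[city='London']")]

# Look up a single record by its id attribute.
grace = store.find("contact[@id='c2']")
```

Every query here scans the whole document, which is fine for a few hundred records and is exactly the inefficiency that rules this approach out for large data sets.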

XML can play a role in the communications between databases and other software, providing usable chunks of information in a form more easily reused than a typical query response. On the client side, XML data files can be used to offload some nontransactional data-search and -retrieval applications from busy web servers down to the desktop web browser. On the server side, XML can be used as an alternate delivery mechanism for query results.

XML is also finding use as a supplement to information stored in relational databases, and more and more relational databases include native support for XML both as a data-retrieval format and a data type. Native XML databases, which store XML documents and provide querying and retrieval tools, are also becoming more widely available. For more information on the wide variety of XML and data-management tools available, see http://www.rpbourret.com/xml/XMLDatabaseProds.htm.

15.2 Developing Data-Oriented XML Formats

Despite the mature status of most of XML's core technologies, XML application development is only now being recognized as a distinct discipline. Many architects and XML developers are attempting to turn existing design methodologies (like UML) and design patterns to the problem of constructing markup languages, but a widely accepted design process for creating XML applications still does not exist.

The term "XML application" is often used in XML contexts to describe an XML vocabulary for a particular domain rather than the software used to process it. This may seem a little strange to developers used to creating software applications, but it makes sense if you think about integrating a software application with an XML application, for instance.

XML applications can range in scope from a proprietary vocabulary used to store a single computer program's configuration settings to an industry-wide standard for storing consumer loan applications. Although the specifics and sometimes the sequence will vary, the basic steps involved in creating a new XML application are as follows:

Determine the requirements of the application.
Look for existing applications that might meet those requirements.
Choose a validation model.
Decide on a namespace structure.
Plan for expansion.
Consider the impact of the design on application developers.
Determine how old and new versions of the application will coexist.

The following sections explore each of these steps in greater depth.

15.2.1 Basic Application Requirements

The first step in designing a new XML application is like the first step in many design methodologies. Before the application can be designed, it is important to determine exactly what needs the application will fulfill. Some basic questions must be answered before proceeding.

15.2.1.1 Where and how will new documents be created?

Documents that will be created automatically by a software application or database server will be structured differently than those that need to be created by humans using an XML editor. While software wouldn't have a problem generating 100 elements with intricate attributes and cross-references, a human being probably would.

If you already have an application or a legacy format to which you're adding XML, you may already have data structures you need to map to the XML. Depending on the other requirements for the application, you may be able to base your XML format on the existing structures. If you're starting from scratch or need to share the information with other programs that don't share those structures, you probably need to look at the data itself and build the application creating the XML around the information.

15.2.1.2 How complex will the document be?

Obviously the complexity of the data that will be modeled by the XML document has some impact on how the application will be designed. A document containing a few, simple element types is much easier to describe than one that contains dozens of different elements with complex data type requirements. The complexity of an application will affect what type of validation should be used and how documents will be created and processed.

15.2.1.3 How will documents be consumed?

If the XML documents using this vocabulary will only pass between similar programs, it may make sense to model the XML documents directly on the internal structures of the programs without much concern for how easy or difficult that makes using the document for other programs or for humans. If there's a substantial chance that this information needs to be reused by other applications, read by humans (for debugging purposes or for direct access to information), or will be stored for unknown future use, it probably makes sense to ensure that the document is easy to read and process even if that makes creating the document a slightly more difficult task.

15.2.1.4 How widely will the resulting documents be distributed?

Generally, the audience of a new XML application is known in advance. Some documents are created and read by the same application without ever leaving a single system. Other documents will be used to transmit important business information between the IT systems of different organizations. Some documents are created for publication on the Web to be viewed by hundreds or even thousands of people around the world. XML formats that will be shared widely typically need comprehensive documentation made readily available to potential users. Formal validation models may also be more important for documents that are shared outside of a small community of trusted participants.

15.2.1.5 Will others need to incorporate this document structure into their own applications?

Some XML applications are never intended for use on their own and are only useful when incorporated into other XML applications. Others are useful standards on their own but are also suitable for inclusion in other applications. A few different methods might be used to incorporate markup from one application into another:

Simple inclusion: Markup from one application is included within a container element of another application. Embedding XHTML content in another document is a common example of this.
Mixed element inclusion: Markup from one application is mixed inline with content from another application. This can complicate validation and makes the including application sensitive to changes in the included application. The Global Document Annotation (GDA) Initiative application provides an example of this type of application (http://www.oasis-open.org/cover/gda.html).
Mixed attribute inclusion: Some XML applications are comprised of attributes that may be attached to elements from the host application. XML Linking (XLink) is a prime example of this type of application, defining only attributes that may be used in other vocabularies.
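
The sketch below shows how a consuming program might deal with two of these methods at once: XHTML embedded in a container element, and XLink attributes attached to a host element. The XHTML and XLink namespace URIs are the standard ones; the catalog vocabulary and the example.com link are invented.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical host vocabulary (catalog, product, description).
DOC = """\
<catalog xmlns:xlink="http://www.w3.org/1999/xlink">
  <product xlink:type="simple" xlink:href="http://example.com/products/17">
    <name>Widget</name>
    <description>
      <div xmlns="http://www.w3.org/1999/xhtml"><p>A <em>sturdy</em> widget.</p></div>
    </description>
  </product>
</catalog>
"""

root = ET.fromstring(DOC)
product = root.find("product")

# Attribute inclusion: the XLink attribute is addressed by its namespace.
href = product.get(f"{{{XLINK_NS}}}href")

# Simple inclusion: the XHTML fragment lives inside the description container.
embedded = product.find(f"description/{{{XHTML_NS}}}div")
emphasized = [e.text for e in embedded.iter(f"{{{XHTML_NS}}}em")]
```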

Answering these questions will provide a basic set of requirements to keep in mind when deciding whether to build a new application, acquire an existing application, or some combination of the two.

15.2.2 Investigating Available Options

Before committing to designing and implementing a new XML application, it is a good idea to take a few minutes to search the Internet for prior art. Since the first version of the XML recommendation was released in 1998, thousands of new XML applications have been developed and released around the world. Although the quality and completeness of these applications vary greatly, it is often more efficient to start with an existing DTD or schema (however imperfect) rather than starting from scratch. In some cases supporting software is already available, potentially saving software development work as well.

15.2.2.1 XML vocabulary development

It is also possible that the work your application needs to do may fit into an existing generic framework, such as XML-RPC or SOAP. If this is the case, you may or may not need to create your own XML vocabulary. XML-RPC only uses its own vocabulary, while different styles of SOAP may reduce the amount of work your vocabulary needs to perform.

There are several XML application registries available on the Internet, and a good "metadirectory" of DTD and schema directories can be found on O'Reilly's XML site, http://www.xml.com. These repositories list applications for various disciplines and topics with varying licensing requirements. The XML Cover Pages, at http://xml.coverpages.org, also provide information about a wide variety of XML-related vocabularies, software, and projects. The search for existing applications may also find potential collaborators, potentially helpful if the XML format is intended for use across multiple organizations.

15.2.3 Planning for Growth

Some applications may not need to evolve over time (a vocabulary describing basic DNA strands, for instance), but some thought should be given as to how users of the application would be able to extend it to meet their own needs. In DTD-based applications, this is done by providing parameter entity "hooks" into the document type definition, which could either be referenced or redefined by an instance document. Take the simple DTD shown in Example 15-1.

Example 15-1. extensible.dtd

<!ENTITY % varContent "EMPTY">
<!ELEMENT variable %varContent;>

This fragment is not a very interesting application by itself, but since it provides the capability for extension, the document author can make it more useful by providing an alternative entity declaration for the content of the variable element, as shown in Example 15-2.

Example 15-2. Document extending extensible.dtd

<?xml version="1.0"?>
<!DOCTYPE variable SYSTEM "extensible.dtd" [
  <!ENTITY % varContent "(#PCDATA)">
]>
<variable>Useful content.</variable>

The XML schema language provides more comprehensive and controlled support for extending markup using the extension, include, redefine, and import elements. These two mechanisms can be used in conjunction to create very powerful, customizable application frameworks.

15.2.4 Choosing a Validation Method

The first major implementation decision of designing a new XML application is what type of validation (if any) will be performed on instance documents. In many cases, prototyping a set of instance documents is the best way to determine what level of validation must be performed.

If your application is simply saving some internal program state between invocations (such as window positions or menu configurations within a GUI application), going to the trouble of building a schema and validating documents may not be necessary. Since these configuration documents will always be written and read by the same program, the structure is fixed by the program logic itself. The only conceivable purpose for validating a document like this would be to detect file corruption, which would be likely to generate a well-formedness error in any case.
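
In a case like this, the well-formedness check built into any parser may be all the protection the program needs. The sketch below assumes a hypothetical settings file with a window element and falls back to defaults when the file cannot be parsed.

```python
import xml.etree.ElementTree as ET

DEFAULTS = {"width": 800, "height": 600}

def load_window_settings(text: str) -> dict:
    """Read window settings, using defaults if the file is damaged."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # A truncated or corrupted file is not well-formed.
        return dict(DEFAULTS)
    window = root.find("window")
    if window is None:
        return dict(DEFAULTS)
    # Missing attributes fall back to their default values.
    return {key: int(window.get(key, default))
            for key, default in DEFAULTS.items()}

good = load_window_settings(
    '<settings><window width="1024" height="768"/></settings>')
truncated = load_window_settings('<settings><window width="1024" hei')
```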

An example of an application that would require some level of validation is where XML documents are exchanged between different related systems that are not maintained by the same development organization. In this case, a DTD or schema can serve as a definitive blueprint to ensure that all systems are sending and receiving information in the expected formats.

The most rigorous type of validation is required when developing a new XML standard that will be implemented independently by many different vendors without any explicit control or restrictions. For example, the XHTML 1.1 standard is enforced by a very explicit and well-documented DTD that is hosted by the W3C. This well-known public DTD allows tool and application vendors to ensure that their systems will interoperate as long as instance documents conform to the standard.

After determining the level of validation for a particular application, it must be decided what validation language will be used. The DTD mechanism of XML 1.0 is still the most widely supported standard, although it lacks the expressive power that is required by sophisticated data-oriented applications. The W3C XML schema recommendation provides very rich type and content model expression, but brings with it a commensurate level of complexity.

Developers can also provide both DTDs and XML schemas, or even combine them with other vocabularies for describing XML structures, notably RELAX NG (http://www.oasis-open.org/committees/relax-ng/) and Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html). RDDL, described in Chapter 14, provides a set of tools for supporting and explaining such combinations for formats that use namespaces.

15.2.5 Namespace Support

Virtually every XML application that will be shared with the public should include at least a basic level of namespace support. Even if there are no current plans to release a particular document application to the outside world, it is much simpler to implement namespaces from the ground up than it is to retrofit an existing application with a namespace.

Namespaces affect everything from how the document is validated to how it is transformed (using a stylesheet language such as XSLT). Here are a few namespace issues to consider before selecting a URI and starting work.

15.2.5.1 Will instance documents need to be validated using a DTD?

If so, some planning of how namespace prefixes will be assigned and incorporated into the DTD is necessary. DTDs are not namespace aware, so strategic use of parameter entities can make modification of prefixes much simpler down the road.

15.2.5.2 Will markup from this application need to be embedded in other applications?

If so, some thought needs to be given to potential tag-name collisions. The safest approach is to force every element from your application to be explicitly qualified with a namespace. This can be done within an XML schema by setting the elementFormDefault and attributeFormDefault attributes of the schema element to qualified.
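
To see why explicit qualification matters to the consumer of a document, consider two vocabularies that both define a title element. The namespace URIs and element names below are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOK_NS = "http://example.com/ns/book"      # hypothetical
PERSON_NS = "http://example.com/ns/person"  # hypothetical

DOC = """\
<bk:book xmlns:bk="http://example.com/ns/book"
         xmlns:p="http://example.com/ns/person">
  <bk:title>XML in a Nutshell</bk:title>
  <p:author>
    <p:title>Dr.</p:title>
    <p:name>A. Author</p:name>
  </p:author>
</bk:book>
"""

root = ET.fromstring(DOC)
ns = {"bk": BOOK_NS, "p": PERSON_NS}

# Each title is addressed through its own namespace, so they cannot collide.
book_title = root.findtext("bk:title", namespaces=ns)
honorific = root.findtext("p:author/p:title", namespaces=ns)

# Both elements have the local name "title" but different qualified names.
all_titles = [e.tag for e in root.iter() if e.tag.endswith("}title")]
```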

15.2.5.3 Are there legacy documents to support?

If an application will be used to validate existing XML documents, some thought should be given to the effort involved in migrating them. In most cases, simply adding a default namespace declaration will be sufficient. If the new application includes markup from different namespaces, however, some thought must be given to how to update old documents.
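
For the simple case, the migration can be automated. The sketch below moves every unqualified element of a legacy document into a new namespace and serializes the result with a default namespace declaration. The namespace URI and the inventory vocabulary are hypothetical, and registering an empty prefix relies on how Python's ElementTree serializer treats it.

```python
import xml.etree.ElementTree as ET

NEW_NS = "http://example.com/ns/inventory/1.0"  # hypothetical

def add_default_namespace(text: str, uri: str) -> str:
    """Move every unqualified element of a legacy document into uri."""
    root = ET.fromstring(text)
    for element in root.iter():
        if isinstance(element.tag, str) and not element.tag.startswith("{"):
            element.tag = f"{{{uri}}}{element.tag}"
    # An empty prefix makes the serializer emit xmlns="..." on the root.
    ET.register_namespace("", uri)
    return ET.tostring(root, encoding="unicode")

legacy = "<inventory><item sku='A-17'>Widget</item></inventory>"
migrated = add_default_namespace(legacy, NEW_NS)
```

Attributes are left unqualified, which matches the usual convention that a default namespace declaration applies to elements only.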

15.2.6 Maintaining Compatibility

Maintaining backward compatibility with existing documents is a primary concern for XML applications that are widely used by diverse audiences. The difficulties faced by standards organizations when dealing with the task of updating a popular application (such as HTML) are formidable. While most applications may not become as widespread as HTML, some thought should be given in advance as to how new versions of a schema or DTD will interact with existing documents.

One possible approach to maintaining backward compatibility is to create a new, distinct namespace that will be used to mark new element declarations or perhaps to change the namespace of the entire document to reflect a substantially changed version. Another possible strategy is only to extend existing applications without removing prior functionality. The most important thing is to ensure that each instance document for an application has some readily identifiable marker that associates it with a particular version of a DTD or schema. The good news is that the highly transformable nature of XML makes it very easy to migrate old documents to new document formats.
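
A version marker can be as simple as an attribute on the root element. In the sketch below, the staff vocabulary, the version attribute, and the change from fullName to givenName and familyName are all invented to illustrate the idea of upgrading old documents on load.

```python
import xml.etree.ElementTree as ET

def upgrade_v1_to_v2(root: ET.Element) -> ET.Element:
    """Hypothetical change: version 2 splits fullName into two elements."""
    for person in list(root.iter("person")):
        full = person.find("fullName")
        if full is not None:
            given, _, family = (full.text or "").partition(" ")
            person.remove(full)
            ET.SubElement(person, "givenName").text = given
            ET.SubElement(person, "familyName").text = family
    root.set("version", "2")
    return root

def load_staff(text: str) -> ET.Element:
    """Load any supported version and return it in the current format."""
    root = ET.fromstring(text)
    version = root.get("version", "1")  # no marker: assume the oldest format
    if version == "1":
        root = upgrade_v1_to_v2(root)
    elif version != "2":
        raise ValueError(f"unsupported staff format version: {version}")
    return root

old = '<staff version="1"><person><fullName>Ada Lovelace</fullName></person></staff>'
loaded = load_staff(old)
```

In practice a transformation like this might be written in XSLT instead; the important part is that the version marker lets the program decide which transformation, if any, to apply.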

Removing functionality is possible, but frequently difficult, once a format is widely used. Deprecating functionality (marking it as a likely target for removal a version or several before it is actually removed) is one approach. While deprecated features often linger in implementations long after they've been targeted for removal, they change the expectations of developers building new applications and make it possible, if slow, to remove functionality.

15.3 Sharing Your XML Format

Creating a data format is often only the first step in making it useful. If an XML vocabulary is only used for a particular process inside a software application, there may not be much reason to publish information about how it works, except for future developers who may work on that application. If, on the other hand, the data format is intended for widespread use by people or organizations who may not normally interact with each other beyond the exchange of messages, then it probably makes sense to provide much more support for the format.

There are a variety of different kinds of information about a data format that are frequently worth sharing:

Human-readable documentation, perhaps even in a variety of languages
Schemas and DTDs formally defining the structures and content
Stylesheets and transformations for presenting the data or converting it from one format to another
Code for processing the data, perhaps even in a variety of languages or environments

The first two, human-readable documentation and schemas, are typically the foundations. Formal definitions and rough understandings of what goes where often work for formats that are used by individual programmers or small groups, but sharing formats widely often requires further explanation. Stylesheets and code are additional options that may simplify adoption for developers.

The appropriate level of publicity for an XML vocabulary can vary widely, from no publicity at all to publishing a RDDL document or a support site, to registering the format in one of the XML application registries, or to creating a working group at some kind of standards body or consortium.
