16.1 Why Use XML for Data? | XML in a Nutshell, Third Edition

Before XML, individual programmers had to invent a new data format every time they needed to save a file or send a message. In most cases, the data was never intended for use outside the original program, so programmers would store it in the most convenient format they could devise, which was often very tightly coupled to the program's internal data structures. Indeed, the earliest versions of Microsoft Word wrote at least part of their files by dumping memory straight to disk, and then opened those files by reading the data back into memory. This made understanding the data format and loading it into any other program extremely difficult. A few de facto file formats evolved over the years (RTF, CSV, ASN.1, and the ubiquitous Windows .ini file format), but in too many cases, the data written by one program could be read only by that same program. In fact, it was often possible for only that specific version of the same program to read the data.

In recent years, however, XML has begun to solve this problem and make data a lot more portable. The rapid proliferation of free XML tools throughout the programming community has made XML the obvious choice when the time comes to select a data-storage or transmission format for an application. For all but the most trivial applications, the benefits of using XML to store and retrieve data far outweigh the additional overhead of including an XML parser in your application. The unique strengths of using XML as a software data format include:

Simple syntax: Easy to generate and parse.
Support for nesting: Nested elements allow programs to represent complex structures easily.
Easy to debug: Human-readable data format is easy to explore and create with a basic text editor.
Language- and platform-independent: XML and Unicode guarantee that your data files will be portable across virtually every popular computer architecture and language combination in use today.
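
As a rough illustration of the first two points, the following Python sketch generates a small nested document and reads it back using only the standard library. The element and attribute names (order, customer, item, sku) are invented for this example and do not come from any real vocabulary.

```python
import xml.etree.ElementTree as ET


def build_order():
    """Generate a small nested XML document and return it as a string."""
    order = ET.Element("order", id="1001")
    ET.SubElement(order, "customer").text = "Ada"
    items = ET.SubElement(order, "items")
    ET.SubElement(items, "item", sku="A-1").text = "Widget"
    ET.SubElement(items, "item", sku="B-2").text = "Gadget"
    return ET.tostring(order, encoding="unicode")


def read_skus(xml_text):
    """Parse the document and collect the sku attribute of every item."""
    root = ET.fromstring(xml_text)
    return [item.get("sku") for item in root.iter("item")]


if __name__ == "__main__":
    text = build_order()
    print(text)             # the serialized document, readable in any text editor
    print(read_skus(text))  # the nested items recovered by the parser
```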

Building on these basic strengths, XML makes possible new types of applications that would have been previously impossible (or very costly) to implement.

16.1.1 Mixed Environments

Modern enterprise applications often involve software running on different computer platforms with a variety of operating systems. Choosing a communication protocol involves finding the lowest common denominator available on each system. Thanks to the enormous number of XML parsers that can be freely integrated with applications in a wide variety of environments, XML has become a popular format for data sharing.

Imagine an application server that needs to display data from a mainframe to users connected to a corporate web site. In this case, XML acts as the "glue" to connect the web server with a legacy application on a mainframe. The web server can send an XML request to the application server. The application server converts the request to what the legacy server expects and calls the legacy application. In the reverse direction, the application server converts the legacy server's response to XML before passing it back to the web server. Using a technology like XSLT, the web server can then transform the XML into a number of acceptable web formats for distribution to clients. Adopting XML as the common language of your enterprise makes it easier to reuse existing data in new ways.

Even on smaller systems, XML can be useful for sharing information between applications written in different languages or running in different environments. If a Perl program and a Java program need to communicate, generating and processing XML can be simpler than creating a custom format for the conversation. The XML documents exchanged can also serve as a record of the communications. Most importantly, the XML format provides a gateway to additional systems or programs that need to join the conversation. Each new system only needs to understand how to read and write the common XML format, rather than understanding every different format used by other participants.

16.1.2 Communications Protocols

Building flexible communications protocols that link disparate systems has always been a difficult area in computing. With the proliferation of computer networking and the Internet, building distributed systems has become even more important.

While XML itself is only a data format, not a protocol, XML's flexibility and platform agnosticism have inspired some new developments on the protocol front. XML messaging started even before the XML specification was finished and has continued to evolve since then.

16.1.2.1 XML as a part of the Web: REST

One of the earliest approaches, and still one of the best, is transmitting XML over HTTP. The server assembles an XML document and sends it to a client just like it sends an HTML file or a GIF image. For example, suppose a developer is building a service that takes a U.S. Zip Code and returns current weather information such as temperature and barometric pressure. The browser or other client application can encode the Zip Code as a query, producing a URL like http://example.com/weatherNow.cgi?zip=95472.

It then sends a normal HTTP GET request to the server example.com requesting a representation of the resource /weatherNow.cgi?zip=95472. The server constructs an XML document representing the current weather for the Zip Code 95472, which might look something like Example 16-1.

Example 16-1. An XML document containing the weather in Sebastopol

    <?xml version="1.0" encoding="UTF-8"?>
    <weatherNow xmlns="http://example.com/weatherNow/">
      <temperature>57</temperature>
      <pressure>29.97</pressure>
      <pressureChange>rising</pressureChange>
    </weatherNow>

This simple web-based approach has been gathering supporters under the banner of Representational State Transfer (REST, http://rest.blueoxen.net/cgi-bin/wiki.pl). In the REST model, XML exchanges are treated in a very web-like way, using HTTP methods (GET, PUT, POST, DELETE) as verbs, XML documents as messages, and URIs to identify the services. REST doesn't have all the APIs, $200-an-hour consultants, and six-figure middleware products that more complex web service-based approaches like SOAP support. But that's because it really doesn't need them. REST is simple, straightforward, and gets the job done with minimal effort.
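
As a minimal sketch of the client side of this exchange, the Python below builds the query URL and parses a document shaped like Example 16-1. It works on a literal copy of the document rather than making a network request, since example.com does not actually host such a service. The function names weather_url and parse_weather are invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Prefix "w" is local to this script; only the namespace URI must match.
WEATHER_NS = {"w": "http://example.com/weatherNow/"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<weatherNow xmlns="http://example.com/weatherNow/">
  <temperature>57</temperature>
  <pressure>29.97</pressure>
  <pressureChange>rising</pressureChange>
</weatherNow>"""


def weather_url(zip_code):
    """Encode the Zip Code as a query string on the example URL."""
    return "http://example.com/weatherNow.cgi?" + urlencode({"zip": zip_code})


def parse_weather(xml_bytes):
    """Turn a weatherNow document into a plain dictionary."""
    root = ET.fromstring(xml_bytes)
    return {
        "temperature": int(root.findtext("w:temperature", namespaces=WEATHER_NS)),
        "pressure": float(root.findtext("w:pressure", namespaces=WEATHER_NS)),
        "pressureChange": root.findtext("w:pressureChange", namespaces=WEATHER_NS),
    }


if __name__ == "__main__":
    print(weather_url("95472"))
    print(parse_weather(SAMPLE))
```

Against a real service, the bytes passed to parse_weather would come from an ordinary HTTP GET, for example via urllib.request.urlopen.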

16.1.2.2 XML for procedure calls over HTTP: XML-RPC

Other developers have chosen to use XML with more traditional programming approaches, like remote procedure calls (RPC). XML-RPC (http://www.xmlrpc.com) is a very simple protocol that encodes the method name and arguments as an XML document and transmits it using HTTP POST. The remote server responds with another XML document encoding the method's return value or an error message. The XML-RPC vocabulary defines elements representing six primitive data types (plus arrays and structs) common in pre-object-oriented languages.

If our hypothetical weather service were implemented using XML-RPC, a client request might look like Example 16-2.

Example 16-2. An XML-RPC request for the weather in Sebastopol

    POST /weatherNow HTTP/1.0
    User-Agent: myXMLRPCClient/1.0
    Host: example.com
    Content-Type: text/xml
    Content-Length: 170

    <?xml version="1.0"?>
    <methodCall>
      <methodName>weatherNow</methodName>
      <params>
        <param>
          <value><string>95472</string></value>
        </param>
      </params>
    </methodCall>

Note that this example includes both an HTTP header and the XML document payload.
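
In practice a library usually produces this payload. As a sketch, Python's standard xmlrpc.client module can marshal the same call; the wrapper name build_request is ours, and the output differs from Example 16-2 only in whitespace and quoting.

```python
import xmlrpc.client


def build_request(zip_code):
    """Marshal the call weatherNow(zip_code) into an XML-RPC methodCall document."""
    # The params argument must be a tuple, even for a single parameter.
    return xmlrpc.client.dumps((zip_code,), methodname="weatherNow")


if __name__ == "__main__":
    print(build_request("95472"))
```

To send the request as well as build it, xmlrpc.client.ServerProxy wraps the HTTP POST; calling ServerProxy("http://example.com/weatherNow").weatherNow("95472") would issue it, assuming a server existed at that hypothetical address.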

The XML is designed to represent a method call of the form weatherNow("95472"). XML-RPC supports a variety of parameter types, but in this case, the method only requires one parameter, the Zip Code. Parameter order matters as it does in programming languages, although it's also possible (with a struct parameter) to send name-value pairs to the method. The reply from a service providing the weatherNow method might look like Example 16-3.

Example 16-3. An XML-RPC response containing the weather in Sebastopol

    HTTP/1.0 200 OK
    Date: Sat, 06 Oct 2001 23:20:04 GMT
    Server: Apache/1.3.31 (Unix)
    Connection: close
    Content-Type: text/xml
    Content-Length: 519

    <?xml version="1.0"?>
    <methodResponse>
      <params>
        <param>
          <value>
            <struct>
              <member>
                <name>temperature</name>
                <value><int>57</int></value>
              </member>
              <member>
                <name>pressure</name>
                <value><double>29.96</double></value>
              </member>
              <member>
                <name>pressureChange</name>
                <value><boolean>1</boolean></value>
              </member>
            </struct>
          </value>
        </param>
      </params>
    </methodResponse>

This response provides the temperature, pressure, and pressure change as a struct, a set of name-value pairs. The values are of different types: an int for the temperature, a double for the pressure, and a boolean to indicate whether the pressure is rising or falling. Responses are only allowed to include one param element (despite its enclosing params element), so a struct or an array will be necessary if a method needs to return more than a single value.
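
The mapping from XML-RPC types to native types can be seen by decoding such a response. The sketch below feeds the XML payload of Example 16-3 (without the HTTP headers) to Python's standard xmlrpc.client.loads; the helper name decode_weather is an assumption made for this example.

```python
import xmlrpc.client

RESPONSE = """<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value>
        <struct>
          <member>
            <name>temperature</name>
            <value><int>57</int></value>
          </member>
          <member>
            <name>pressure</name>
            <value><double>29.96</double></value>
          </member>
          <member>
            <name>pressureChange</name>
            <value><boolean>1</boolean></value>
          </member>
        </struct>
      </value>
    </param>
  </params>
</methodResponse>"""


def decode_weather(xml_text):
    """Unmarshal a methodResponse and return its single struct as a dict."""
    params, _method_name = xmlrpc.client.loads(xml_text)
    # A response carries exactly one param, so the struct is params[0].
    return params[0]


if __name__ == "__main__":
    for name, value in decode_weather(RESPONSE).items():
        print(name, type(value).__name__, value)
```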

XML-RPC is limited by its strict adherence to the procedure call metaphor and its non-extensible vocabulary, but the simplicity of that approach has meant that a lot of different implementations are available for a wide array of environments. Developers using XML-RPC will rarely, if ever, see the actual XML underlying their procedure calls.

16.1.2.3 XML envelopes and messages: SOAP

SOAP offers much more flexibility than XML-RPC, but it is much more complex as well. SOAP (formerly the Simple Object Access Protocol, but now an acronym without meaning) uses XML to encapsulate information being sent between programs. Like XML-RPC, SOAP started out using HTTP POST requests, and this is still the most common way to use SOAP, although other transport protocols are allowed.

This discussion focuses on SOAP 1.1. A later specification, SOAP 1.2, is now a W3C recommendation, but SOAP 1.1 still dominates in common use. You may also want to explore the WS-I Basic Profile at http://ws-i.org/Profiles/BasicProfile-1.0-2004-04-16.html, built on SOAP 1.1, for suggestions for maximizing SOAP interoperability.

SOAP provides three features that differentiate it from plain XML messaging. The first is a structure for messages containing a SOAP-ENV:Envelope, an optional SOAP-ENV:Header for metadata, and a SOAP-ENV:Body. The second, now largely deprecated, is "SOAP Encoding" (or "Section 5 Encoding") for RPC messages. It provides structure much like the XML-RPC format, though it leaves open the choice of element names. The last feature is an explicit vocabulary for error messages, which are called faults. A simple SOAP request for the Zip Code weather server might look like Example 16-4.

Example 16-4. A SOAP request for the weather in Sebastopol

    <?xml version="1.0" ?>
    <SOAP-ENV:Envelope
        xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <SOAP-ENV:Body xmlns="http://example.com/weatherNow/">
        <weatherForZip xsi:type="xsd:string">95472</weatherForZip>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

The response to this request could be encoded as in Example 16-5.

Example 16-5. A SOAP response containing the weather in Sebastopol

    <?xml version="1.0" ?>
    <SOAP-ENV:Envelope
        xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <SOAP-ENV:Body xmlns="http://example.com/weatherNow/">
        <weatherStatus>
          <temperature xsi:type="xsd:int">57</temperature>
          <pressure xsi:type="xsd:double">29.97</pressure>
          <pressureChange xsi:type="xsd:boolean">1</pressureChange>
        </weatherStatus>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

Most of what's gained in SOAP beyond ordinary XML is a wrapper structure that lets developers add their own details to the messages. The SOAP-ENV:Header element, which can appear as the first child element of SOAP-ENV:Envelope, may be used to add extra information to a request, appearing before the body. Headers are used for a variety of tasks, from routing messages to the proper recipient to ensuring that a recipient understands a particular request before attempting to process the message.
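
The wrapper structure is simple enough to assemble by hand. The following Python sketch builds a SOAP 1.1 envelope with an optional header placed before the body. The traceId header element and its namespace are hypothetical, chosen only to show where such metadata would go, and the xsi:type annotations are omitted for brevity.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
WEATHER = "http://example.com/weatherNow/"
TRACE = "http://example.com/trace/"  # hypothetical namespace for a header entry

# Ask ElementTree to serialize the envelope namespace with the usual prefix.
ET.register_namespace("SOAP-ENV", SOAP_ENV)


def build_envelope(zip_code, trace_id=None):
    """Return a SOAP 1.1 request as a string; the Header is added only if needed."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    if trace_id is not None:
        # The Header, when present, must be the first child of the Envelope.
        header = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
        ET.SubElement(header, f"{{{TRACE}}}traceId").text = trace_id
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    ET.SubElement(body, f"{{{WEATHER}}}weatherForZip").text = zip_code
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_envelope("95472", trace_id="abc-123"))
```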

When used in an HTTP environment, the request would typically be sent as a POST request from the client, generating the response from the server. SOAP can be used over a variety of other protocols, provided that all the senders and receivers understand both the protocol being used and as much of the SOAP messages as they need to process the request.

SOAP is built on XML and uses XML technologies like XML Schema, but in practice, very few developers actually work with the XML directly. Toolkits analyze existing objects or accept XML Schemas describing formats and then generate the markup automatically.

The Web Services Description Language (WSDL) can somewhat automate this process. A WSDL document is itself an XML document that describes a SOAP service. In many cases, it is easier to focus on the WSDL document and related XML Schemas for a service than to work with the SOAP messages themselves. A third vocabulary, Universal Description, Discovery, and Integration (UDDI), helps programs locate WSDL-described SOAP web services, although UDDI hasn't achieved as broad adoption as SOAP and WSDL.

A number of organizations are working on Web Service-based technologies, including the W3C (http://www.w3.org/2002/ws/), OASIS (http://oasis-open.org), and the Web Services Interoperability Organization (http://ws-i.org/). The field is developing rapidly, with vendors offering a wide variety of sometimes conflicting proposals.

16.1.2.4 Other options: BEEP and XMPP

Two other protocols, both from the Internet Engineering Task Force (IETF), may also be worth considering. The Blocks Extensible Exchange Protocol, or BEEP (http://www.beepcore.org), solves a different problem than SOAP, XML-RPC, and REST. Rather than building documents that travel over existing protocols, BEEP uses XML as a foundation for protocols built on TCP sockets. BEEP supports HTTP-style message-and-reply, as well as more complex synchronous and asynchronous modes of communication. SOAP messages can be transmitted over BEEP, and so can a wide variety of other XML and binary information.

The IETF is also home to the Extensible Messaging and Presence Protocol (XMPP), the protocol used by the Jabber instant messaging software. Jabber (http://jabber.org) has grown from its chat roots to a toolkit frequently used by developers to allow computers, rather than people, to talk to each other.

16.1.3 Object Serialization

Like the issue of communications, the question of where and how to store the state of persistent objects has been answered in various ways over the years. In many popular object-oriented languages, such as C++ and Java, the runtime environment frequently handles object-serialization mechanics. Unfortunately, most of these technologies predate XML.

Most existing serialization methods are highly language- and architecture-specific. The serialized object is most often stored in a binary format that is not human readable. These files break easily if corrupted, and maintaining compatibility as the object's structure changes frequently requires custom work on the part of the programmer.

The features that make XML popular as a communications protocol also make it popular as a format for serializing objects. Viewing the object's contents, making manual modifications, and even repairing damaged files are easy. XML's flexible nature allows the file format to expand ad infinitum while maintaining backward compatibility with older file versions. XML's labeled hierarchies are a clean fit for nested object structures, and conversions from objects to XML and back can be reasonably transparent. Mapping arbitrary XML to object structures is a harder problem, but hardly an insurmountable one.

A number of tools serialize objects written in various environments as XML documents and can recreate the objects from the XML. Java 1.4, for example, adds an API for Long-Term Persistence (http://java.sun.com/j2se/1.4/docs/guide/beans/changes14.html#ltp) to the java.beans package, giving developers an alternative to its existing (and still supported) opaque binary serialization format. Example 16-6 shows a simple applet persisted as XML.

Example 16-6. A Java frame serialized in XML

    <?xml version="1.0" encoding="UTF-8"?>
    <java version="1.4.2_03" class="java.beans.XMLDecoder">
     <object class="SwingCubScout">
      <void property="contentPane">
       <void method="add">
        <object class="javax.swing.JLabel">
         <void property="background">
          <object class="java.awt.Color">
           <int>255</int>
           <int>255</int>
           <int>0</int>
           <int>255</int>
          </object>
         </void>
         <void property="foreground">
          <object class="java.awt.Color">
           <int>0</int>
           <int>0</int>
           <int>255</int>
           <int>255</int>
          </object>
         </void>
         <void property="font">
          <object class="java.awt.Font">
           <string>Sans</string>
           <int>1</int>
           <int>24</int>
          </object>
         </void>
         <void property="text">
          <string>Cub Scouts!</string>
         </void>
        </object>
       </void>
      </void>
      <void property="name">
       <string>panel0</string>
      </void>
     </object>
    </java>

This XML vocabulary looks a lot like Java and is clearly designed for use within a Java framework, although other environments may import and export the serialization. Microsoft's .NET Framework includes similar capabilities but uses an XML Schema-based approach. There is an incredible number of options for this kind of serialization process, available from many different vendors. Some depend on XML Schema, while others have their own models or work directly from existing object structures.
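
To give a feel for the approach that works directly from existing object structures, here is a small Python sketch that writes an object's fields to XML and rebuilds the object from that XML. The Label class, the object and property element names, and the two lookup tables are all invented for this example; this is not the java.beans format shown in Example 16-6.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class Label:
    text: str
    font: str
    size: int


# Explicit registries: only known classes and types are ever reconstructed.
CLASSES = {"Label": Label}
CONVERTERS = {"str": str, "int": int}


def to_xml(obj):
    """Serialize a dataclass instance as one object element with property children."""
    root = ET.Element("object", {"class": type(obj).__name__})
    for field in fields(obj):
        value = getattr(obj, field.name)
        prop = ET.SubElement(root, "property", name=field.name,
                             type=type(value).__name__)
        prop.text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Rebuild the object described by to_xml's output."""
    root = ET.fromstring(xml_text)
    cls = CLASSES[root.get("class")]
    kwargs = {
        prop.get("name"): CONVERTERS[prop.get("type")](prop.text or "")
        for prop in root.findall("property")
    }
    return cls(**kwargs)


if __name__ == "__main__":
    xml_text = to_xml(Label("Cub Scouts!", "Sans", 24))
    print(xml_text)
    print(from_xml(xml_text))
```

Looking classes up in a fixed table, rather than instantiating whatever class name the file mentions, is a deliberate choice here: a serialized file is input like any other and should not be able to name arbitrary code to run.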

16.1.4 File Formats

Many single-user desktop applications open and save files. Games store the current state of the game. Word processors store text. Spreadsheets store numbers. Personal finance programs store monetary transactions. What unites these applications is that the data is read and written only at well-defined times, generally when the user selects Save or Open from the File menu. The formats designed for such storage are rarely a simple dump of the objects in memory. What's sensible for storage on disk is rarely what makes sense for in-memory manipulation. Instead, special code is written to load and save a custom format that represents the information to be saved.

Most such file formats should be based on XML. It is much easier to invent, define, and use an XML format for such files than to devise some custom binary format. The first advantage is simply the wide availability of tools to parse and write XML. Unlike a custom format, basing your own format on XML means you don't have to test and debug parsers and serializers. Just use one of the well-tested, well-established, and debugged standard tools like Xerces or MSXML. You write less code, which translates into fewer bugs and faster time to market.

A second advantage to choosing XML for the format is that the files will be more accessible to other tools and developers. They too can use standard parsers to read the files. It may not be immediately obvious to such third parties what all the elements and attributes mean, but it's a lot easier for them to reverse engineer XML than some undocumented, proprietary binary file format. If you include a schema or DTD for the format, then it's even easier for third parties to understand the format and write their own programs that can work with it. XML formats expand the universe of tools that can work with your files and make interoperability of independent software much easier to achieve.

The developers of OpenOffice.org have created a format that combines several different standards for interoperability. They use ZIP files as containers for XML and graphics files, making it easy to share compound documents as compact files.
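
The container idea can be sketched in a few lines of Python with the standard zipfile module. The member name content.xml and the document and para elements are used here purely for illustration; this is not the actual OpenOffice.org schema.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def save_document(paragraphs):
    """Pack paragraphs into an XML member inside an in-memory ZIP archive."""
    root = ET.Element("document")
    for text in paragraphs:
        ET.SubElement(root, "para").text = text
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Graphics or other resources could be added as further members.
        archive.writestr("content.xml", ET.tostring(root, encoding="utf-8"))
    return buffer.getvalue()


def load_document(data):
    """Read the paragraphs back out of the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return [para.text for para in root.findall("para")]


if __name__ == "__main__":
    packed = save_document(["Hello", "World"])
    print(len(packed), "bytes")
    print(load_document(packed))
```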

16.1.5 Databases

XML can play a role in the communications between databases and other software, providing information in an easily reusable form. On the client side, XML data files can be used to offload some nontransactional data search and retrieval applications from busy web servers down to the desktop web browser. On the server side, XML can be used as an alternate delivery mechanism for query results.
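
As one possible shape for delivering query results as XML on the server side, the Python sketch below runs a query against SQLite and wraps each row in an element. The results and row element names and the function name query_to_xml are assumptions made for this example, and the code assumes that every column name is also a valid XML element name.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_to_xml(conn, sql, params=()):
    """Run a query and return its rows as an XML string, one row element per row."""
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    root = ET.Element("results")
    for row in cursor:
        row_element = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            # NULL becomes an empty element; everything else is stringified.
            ET.SubElement(row_element, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (zip TEXT, temperature INTEGER)")
    conn.execute("INSERT INTO weather VALUES ('95472', 57)")
    print(query_to_xml(conn, "SELECT zip, temperature FROM weather"))
```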

XML is also finding use as a supplement to information stored in relational databases, and more and more relational databases include native support for XML, both as a data-retrieval format and a data type. Native XML databases, which store XML documents and provide querying and retrieval tools, are also becoming more widely available. These tools provide a more structured way of storing XML information than collecting documents in a filesystem.

For more information on the wide variety of XML and data-management tools available and ways to use XML with databases, see http://www.rpbourret.com/xml/XMLDatabaseProds.htm.

16.1.6 RDF

In certain cases, especially where the data contained in the documents is metadata describing other documents, you may want to look at the Resource Description Framework (RDF). RDF can be written in an XML syntax, but its data model is built around more generic graphs instead of XML's strictly hierarchical trees. When you process an XML document using XML tools, you get a tree: a collection of nested containers holding information. When you process an RDF document using RDF tools (even if the RDF is encoded as XML), you get a collection of "triples." In English, a triple takes the rather stilted form "Subject has a Property whose value is Object." For example, "W. Scott Means has an email address whose value is smeans@ewm.biz." However, to make the identification of subjects, properties, and objects less ambiguous, these are all named with URIs, so we'd actually write "http://www.oreillynet.com/cs/catalog/view/au/751?x-t=book.view&CMP=IL7015 has the property http://www.w3.org/2000/10/swap/pim/contact#mailbox, whose value is mailto:smeans@ewm.biz." In XML, this would be written as shown in Example 16-7.

Example 16-7. An RDF statement encoded in XML

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
      <contact:Person rdf:about="http://www.oreillynet.com/cs/catalog/view/au/751?x-t=book.view&amp;CMP=IL7015">
        <contact:mailbox rdf:resource="mailto:smeans@ewm.biz"/>
      </contact:Person>
    </rdf:RDF>
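
To make the tree-versus-triples distinction concrete, the following Python sketch pulls the mailbox triple out of a document like Example 16-7 using an ordinary XML parser. It is deliberately naive: it handles only nodes identified with rdf:about whose property elements carry rdf:resource, and it ignores the type statement implied by the contact:Person element. A real RDF parser covers far more of the syntax. The function name triples_from_rdf_xml is ours.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOCUMENT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.oreillynet.com/cs/catalog/view/au/751?x-t=book.view&amp;CMP=IL7015">
    <contact:mailbox rdf:resource="mailto:smeans@ewm.biz"/>
  </contact:Person>
</rdf:RDF>"""


def triples_from_rdf_xml(xml_text):
    """Return (subject, property, object) URI tuples for a tiny subset of RDF/XML."""
    root = ET.fromstring(xml_text)
    triples = []
    for node in root:
        subject = node.get(f"{{{RDF}}}about")
        for prop in node:
            # ElementTree tags look like "{namespace}local"; RDF property URIs
            # are formed by concatenating the namespace and the local name.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, prop.get(f"{{{RDF}}}resource")))
    return triples


if __name__ == "__main__":
    for triple in triples_from_rdf_xml(DOCUMENT):
        print(triple)
```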

The advantage to this rather opaque approach is that, like XML itself, RDF is much easier for computers to process than natural language. In particular, as long as all statements are written in this restricted Subject-Property-Object triple form, computers can reason about statements and infer new truths based on existing triples. For instance, by knowing that the book XML in a Nutshell has an author property with the value W. Scott Means, and that W. Scott Means has an email property with the value smeans@ewm.biz, an RDF inferencing engine can deduce that the email address of an author of this book is smeans@ewm.biz. When many such triples are available from many different sources with standardized URIs, RDF software should demonstrate knowledge (if not exactly intelligence) that is greater than the sum of its parts. At least that's the theory. Honestly, we're a little skeptical. RDF's approach does put an additional layer of abstraction between the serialization of the data and the internal structure of the data, and that layer is useful if you have data that is heavily self-referential or doesn't neatly fit into a nested container structure.
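
The author-email deduction described above amounts to joining two triples on a shared node. The Python sketch below hard-codes that single rule over plain tuples; it is not a general RDF reasoner. The BOOK, AUTHOR, and AUTHOR_EMAIL identifiers are hypothetical URIs made up for this example, while the person and mailbox URIs are the ones used in the text.

```python
BOOK = "urn:example:book:xml-in-a-nutshell"    # hypothetical URI for the book
AUTHOR = "urn:example:prop:author"             # hypothetical property URI
AUTHOR_EMAIL = "urn:example:prop:authorEmail"  # hypothetical derived property
PERSON = "http://www.oreillynet.com/cs/catalog/view/au/751?x-t=book.view&CMP=IL7015"
MAILBOX = "http://www.w3.org/2000/10/swap/pim/contact#mailbox"


def infer_author_emails(triples):
    """Derive (work, AUTHOR_EMAIL, mailbox) wherever an author has a mailbox."""
    triples = set(triples)
    inferred = set()
    for work, prop, person in triples:
        if prop != AUTHOR:
            continue
        for subject, other_prop, mailbox in triples:
            # Join condition: the author of the work is the mailbox's subject.
            if other_prop == MAILBOX and subject == person:
                inferred.add((work, AUTHOR_EMAIL, mailbox))
    return inferred


if __name__ == "__main__":
    known = [
        (BOOK, AUTHOR, PERSON),
        (PERSON, MAILBOX, "mailto:smeans@ewm.biz"),
    ]
    print(infer_author_emails(known))
```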

It's possible to create formats that can be processed either as XML or as RDF, giving consumers of the document flexibility about how they would prefer to process it. RSS 1.0 is such a format (although it does seem to be the least successful of several RSS variants). For a look at what is involved in mixing RDF into an XML environment, see http://www.xml.com/pub/a/2002/10/30/rdf-friendly.html.