XML as a Message Format | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

One of the major uses of XML is the exchange of data between heterogeneous systems. Given almost any collection of data, it's straightforward to design some XML markup that fits it. Because XML is natively supported on essentially any platform of interest, you can send data encoded in such an XML application from point A to point B without worrying about whether point A and point B agree on how many bytes there are in a float, whether ints are big-endian or little-endian, whether strings are null delimited or use an initial length byte, or any of the myriad of other issues that arise when moving data between systems. As long as both ends of the connection agree on the XML application used, they can exchange information regardless of what software produced the data. One side can use Perl and the other Java. One can use Windows and the other Unix. One can run on a mainframe and the other on a Mac. The document can be passed over HTTP, e-mail, NFS, BEEP, Jabber, or sneakernet. Everything except the XML document itself can be ignored.

The details of the XML markup used depend heavily on the information being exchanged. If you're exchanging financial data, you might use the Open Financial Exchange (OFX) [http://www.ofx.net/ofx/]. If you're exchanging genetic codes, you might use the Gene Expression Markup Language (GEML) [http://www.rosettabio.com/products/conductor/geml/]. If you're exchanging news articles in a syndication service, you might use NewsML [http://www.xmlnews.org/NewsML/]. And if no standard XML application exists that fits your needs, you'll probably invent your own. But whatever XML application you choose, certain features will crop up again and again that can benefit from standardization. These include the envelope used to pass the data and the representations of basic data types, such as integer and date.

Envelopes

An envelope may not be needed if (a) only two systems are involved, (b) they talk only to each other, and (c) they always send the same type of message. It's enough for one system to send the other the message in the agreed-upon XML format. However, when there are many dozens, hundreds, or even thousands of different systems exchanging many different kinds of messages in many different ways, it's useful to have some standards that are independent of the message content. This offers up some hope that when a message in an unrecognized format is received, it can still be processed in a reasonable fashion. For example, a system might receive a message ordering 1,000 "Frodo Lives" buttons but not know how to handle that order. However, it may be able to read enough information from the envelope to route the request to the program that does know how to process the order.

In XML-RPC, the envelope is essentially all the markup, and the data inside the envelope is all the text content. SOAP and RSS are a little more complex. For SOAP, the envelope is an XML document, and the data is too. In some ways RSS, especially RSS 1.0, is the most complex of all because it's based on the relatively complex RDF syntax. RDF mixes the envelope and the data together so that you can't point to any one element in the document and say, "That's the envelope," or "That element is the data." Instead, pieces of both the envelope and the data are intermingled throughout the complete document. In all three cases, however, it's straightforward to extract the data from the envelope for further processing.

Data Representation

Another area ripe for standardization is the proper representation of low-level data such as dates and numbers. Nobody really cares how many bytes there are in an int, as long as there are enough to hold all of the values they want to hold. Nobody really cares whether dates are written Day-Month-Year or Month-Day-Year, as long as it's easy to tell which is which. It doesn't really matter how this information is passed, as long as there's one standard way of doing it that everyone can agree on and process without excessive hassle.

In XML all data of any type must be passed as text, but the proper textual representation of simple data types such as integer and date is trickier than most developers initially assume. For example, integers can be represented straightforwardly in the form 42, -76, +34562, 0, and so forth. The normal base-10 representation with optional plus or minus signs is fully adequate for most needs. However, consider the number 28562476535, the dollar value of Bill Gates' Microsoft stock holdings alone as of July 24, 2002. This is a perfectly good integer, albeit a large one. However, it does not fit in a 32-bit signed integer, whose maximum value is 2147483647, so trying to use it in many applications will lead to a crash or some other form of error.
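
To make the range problem concrete, here is a minimal Java sketch. The class name `IntegerRangeDemo` and the helper `narrowestType` are invented for this illustration and are not part of any library. The sketch reports the narrowest Java type that can hold a decimal string and shows `Integer.parseInt` rejecting the stock-holdings figure.

```java
import java.math.BigInteger;

class IntegerRangeDemo {

    /** Returns "int", "long", or "BigInteger": the narrowest type that can hold the value. */
    static String narrowestType(String text) {
        BigInteger value = new BigInteger(text.trim());
        // bitLength() excludes the sign bit, so 31 bits fit an int and 63 bits fit a long.
        if (value.bitLength() <= 31) {
            return "int";
        }
        if (value.bitLength() <= 63) {
            return "long";
        }
        return "BigInteger";
    }

    public static void main(String[] args) {
        String holdings = "28562476535";
        System.out.println(holdings + " needs at least a " + narrowestType(holdings));
        try {
            Integer.parseInt(holdings);
        } catch (NumberFormatException ex) {
            // The value exceeds Integer.MAX_VALUE, so parsing as an int fails.
            System.out.println("Not an int: " + ex.getMessage());
        }
        System.out.println("As a long: " + Long.parseLong(holdings));
    }
}
```

A receiver that does not know in advance how large a value may be can always fall back on `java.math.BigInteger`, at some cost in speed.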

Floating-point numbers are even worse. Two different computers can look at an unambiguous string such as 65431987467.324345192 and interpret it as two different numbers, for instance because one stores it in four bytes and the other in eight. Dates cause problems even for humans. Is 07/04/01 the Fourth of July, 2001? The Fourth of July, 1901? The seventh of April, 2001? Some other date? These are all very real issues that cause real problems in systems today.
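
The floating-point problem can be sketched in a few lines of Java. The class and method names below (`FloatPrecisionDemo`, `asFloat`, `asDouble`) are hypothetical names chosen for this example. The sketch parses the same string as a `float` and as a `double`, then converts each result to a `BigDecimal` so that the exact stored value becomes visible.

```java
import java.math.BigDecimal;

class FloatPrecisionDemo {

    static final String TEXT = "65431987467.324345192";

    /** The exact value stored when the string is parsed as a four-byte float. */
    static BigDecimal asFloat(String text) {
        return new BigDecimal(Float.parseFloat(text));
    }

    /** The exact value stored when the string is parsed as an eight-byte double. */
    static BigDecimal asDouble(String text) {
        return new BigDecimal(Double.parseDouble(text));
    }

    public static void main(String[] args) {
        System.out.println("text:   " + TEXT);
        System.out.println("float:  " + asFloat(TEXT).toPlainString());
        System.out.println("double: " + asDouble(TEXT).toPlainString());
        // Neither binary type stores the decimal string exactly.
        System.out.println("float == double?  " + (asFloat(TEXT).compareTo(asDouble(TEXT)) == 0));
        System.out.println("double == text?   " + (asDouble(TEXT).compareTo(new BigDecimal(TEXT)) == 0));
    }
}
```

Only an arbitrary-precision type such as `java.math.BigDecimal` preserves every digit of the original string.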

XML itself doesn't standardize the text representation of data, but the W3C XML Schema Language does. In particular, schemas define the 44 simple data types shown in Table 2.1. By assigning these data types to particular elements, you can clearly state what a particular string means in a syntax everyone can understand. And if these data types aren't enough, the W3C XML Schema Language also lets you define new types that are combinations or restrictions of these basic types.

Table 2.1. Simple Data Types Defined in the W3C XML Schema Language

Data Type	Meaning
`xsd:string`	The schema equivalent of `#PCDATA` , any string of Unicode characters that may appear in an XML document.
`xsd:boolean`	One of the lowercase literals `true` or `false`, or equivalently `1` or `0`.
`xsd:decimal`	A decimal number, such as 44.145629 or -0.32, with an arbitrary size and precision; similar to the `java.math.BigDecimal` class.
`xsd:float`	The four-byte IEEE-754 floating point number that best approximates the specified decimal string; equivalent to Java's `float` type.
`xsd:double`	The eight-byte IEEE-754 floating point number that best approximates the specified decimal string; equivalent to Java's `double` type.
`xsd:integer`	An integer of arbitrary size; similar to the `java.math.BigInteger` class.
`xsd:positiveInteger`	An integer strictly greater than zero.
`xsd:nonPositiveInteger`	An integer less than or equal to zero.
`xsd:negativeInteger`	An integer strictly less than zero.
`xsd:nonNegativeInteger`	An integer greater than or equal to zero.
`xsd:long`	An integer between -9223372036854775808 and +9223372036854775807 inclusive; equivalent to Java's `long` primitive data type.
`xsd:int`	An integer between -2147483648 and 2147483647 inclusive; equivalent to Java's `int` primitive data type.
`xsd:short`	An integer between -32768 and 32767 inclusive; equivalent to Java's `short` primitive data type.
`xsd:byte`	An integer between -128 and 127 inclusive; equivalent to Java's `byte` primitive data type.
`xsd:unsignedLong`	An integer between 0 and 18446744073709551615.
`xsd:unsignedInt`	An integer between 0 and 4294967295.
`xsd:unsignedShort`	An integer between 0 and 65535.
`xsd:unsignedByte`	An integer between 0 and 255.
`xsd:duration`	A length of time given in the ISO 8601 extended format: `PnYnMnDTnHnMnS`. The number of seconds can be a decimal or an integer. All other values must be nonnegative integers. For example, `P1Y2M3DT4H5M6.7S` represents 1 year, 2 months, 3 days, 4 hours, 5 minutes, and 6.7 seconds.
`xsd:dateTime`	A particular moment of time on a particular day, to an arbitrary fraction of a second, in the ISO 8601 format `CCYY-MM-DDThh:mm:ss`. This can have a `Z` suffix to indicate Coordinated Universal Time (UTC) or an offset from UTC. For example, Neil Armstrong set foot on the moon at `1969-07-20T21:56:00-05:00` by the clock in Houston mission control (then on Central Daylight Time), alternately represented as `1969-07-21T02:56:00Z`.
`xsd:time`	A certain time of day on no particular day in the ISO 8601 format: `hh:mm:ss.sss` . A time zone specified as an offset from UTC is optional. For example, on most days I wake up about `07:00:00.000-05:00` and go to bed about `23:30:00.000-05:00` .
`xsd:date`	A particular date in history given in ISO 8601 format: `CCYY-MM-DD`, for example, `2001-07-06` or `1969-09-20`.
`xsd:gYearMonth`	A certain month in a certain year, for example, `2001-12` or `1999-03`.
`xsd:gYear`	A year in the Gregorian calendar, written with at least four digits: 0001, 2001, 9999, 10000, 10001, and beyond. Earlier dates can be represented as -0001, -0002, -0003, and so forth back to the big bang. There is no year zero, however.
`xsd:gMonthDay`	A specific day of a specific month in no particular year, in the form `--02-28`. For example, Christmas falls on `--12-25`.
`xsd:gDay`	A particular day of no particular month, in the form ---01, ---02, ---03, through ---31.
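`xsd:gMonth`	A particular month in no particular year, in the form --01, --02, --03, through --12. (The original 2001 Recommendation wrote these with trailing hyphens, as in --01--; a later erratum removed them.)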
`xsd:hexBinary`	Hexadecimal encoded binary data; each byte of the data is replaced by the two hexadecimal digits that represent its unsigned value.
`xsd:base64Binary`	Base-64 encoded binary data.
`xsd:anyURI`	An absolute or relative URL or a URN.
`xsd:QName`	An optionally prefixed XML name such as `SOAP-ENV:Body` or `Body` . Unprefixed names must be in the default namespace.
`xsd:NOTATION`	The name of a notation declared in the current schema.
`xsd:normalizedString`	A string in which carriage returns (\r), linefeeds (\n) and tab (\t) characters should be treated the same as spaces.
`xsd:token`	A string in which all runs of white space should be treated the same as a single space.
`xsd:language`	An RFC 1766 [http://www.ietf.org/rfc/rfc1766.txt] language identifier such as en, fr-CA, or i-klingon.
`xsd:NMTOKEN`	An XML name token.
`xsd:NMTOKENS`	A white-space-separated list of XML name tokens.
`xsd:Name`	An XML name.
`xsd:NCName`	An XML name that does not contain any colons; that is, an unprefixed name.
`xsd:ID`	An NCName that is unique among other things of ID type in the same document.
`xsd:IDREF`	An NCName used as an ID somewhere in the document.
`xsd:IDREFS`	A white-space-separated list of IDREFs.
`xsd:ENTITY`	An NCName that has been declared as an unparsed entity in the document's DTD.
`xsd:ENTITIES`	A white-space-separated list of ENTITY names.
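
Several of the lexical forms in the table can be tried out with the `javax.xml.datatype` package that ships with the JDK. The following is only a sketch: the class `SchemaLexicalDemo`, its helper methods, and the sample 2001 timestamps are made up for this illustration. It parses the `xsd:duration` example from the table and checks that two `xsd:dateTime` strings with different UTC offsets name the same instant.

```java
import java.math.BigDecimal;
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeConstants;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.Duration;
import javax.xml.datatype.XMLGregorianCalendar;

class SchemaLexicalDemo {

    private static DatatypeFactory factory() {
        try {
            return DatatypeFactory.newInstance();
        } catch (DatatypeConfigurationException ex) {
            throw new IllegalStateException("No DatatypeFactory available", ex);
        }
    }

    /** Parses an xsd:duration string such as P1Y2M3DT4H5M6.7S. */
    static Duration parseDuration(String text) {
        return factory().newDuration(text);
    }

    /** Returns the seconds field, including any fractional part. */
    static BigDecimal secondsField(Duration duration) {
        Number seconds = duration.getField(DatatypeConstants.SECONDS);
        return seconds == null ? BigDecimal.ZERO : new BigDecimal(seconds.toString());
    }

    /** True if two xsd:dateTime strings denote the same moment. */
    static boolean sameInstant(String first, String second) {
        XMLGregorianCalendar a = factory().newXMLGregorianCalendar(first);
        XMLGregorianCalendar b = factory().newXMLGregorianCalendar(second);
        return a.compare(b) == DatatypeConstants.EQUAL;
    }

    public static void main(String[] args) {
        Duration d = parseDuration("P1Y2M3DT4H5M6.7S");
        System.out.println(d.getYears() + " year, " + d.getMonths() + " months, "
                + d.getDays() + " days, " + d.getHours() + " hours, "
                + d.getMinutes() + " minutes, " + secondsField(d) + " seconds");
        System.out.println(sameInstant("2001-07-06T12:00:00-05:00", "2001-07-06T17:00:00Z"));
    }
}
```

Both helpers throw `IllegalArgumentException` when handed a string that is not in the expected lexical form, which is a cheap way to reject malformed values before they reach application code.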

Even without using schema validation or the full schema apparatus, you can use these data types in your own documents. Simply attach an xsi:type attribute to any element identifying the type of that element's content. The xsi prefix is mapped to the http://www.w3.org/2001/XMLSchema-instance namespace URI. Example 2.1 is an XML document that uses these data types to label different parts of an order document. Notice that some things that naively might be assumed to be numeric types are in fact strings.

Example 2.1 An XML Document That Labels Elements with Schema Simple Types

<?xml version="1.0" encoding="ISO-8859-1"?>
<Order xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Customer id="c32" xsi:type="xsd:string">Chez Fred</Customer>
  <Product>
    <Name xsi:type="xsd:string">Birdsong Clock</Name>
    <SKU xsi:type="xsd:string">244</SKU>
    <Quantity xsi:type="xsd:positiveInteger">12</Quantity>
    <Price currency="USD" xsi:type="xsd:decimal">21.95</Price>
    <ShipTo>
      <Street xsi:type="xsd:string">135 Airline Highway</Street>
      <City xsi:type="xsd:string">Narragansett</City>
      <State xsi:type="xsd:NMTOKEN">RI</State>
      <Zip xsi:type="xsd:string">02882</Zip>
    </ShipTo>
  </Product>
  <Product>
    <Name xsi:type="xsd:string">Brass Ship's Bell</Name>
    <SKU xsi:type="xsd:string">258</SKU>
    <Quantity xsi:type="xsd:positiveInteger">1</Quantity>
    <Price currency="USD" xsi:type="xsd:decimal">144.95</Price>
    <Discount xsi:type="xsd:decimal">.10</Discount>
    <ShipTo>
      <GiftRecipient xsi:type="xsd:string">
        Samuel Johnson
      </GiftRecipient>
      <Street xsi:type="xsd:string">271 Old Homestead Way</Street>
      <City xsi:type="xsd:string">Woonsocket</City>
      <State xsi:type="xsd:NMTOKEN">RI</State>
      <Zip xsi:type="xsd:string">02895</Zip>
    </ShipTo>
    <GiftMessage xsi:type="xsd:string">
      Happy Father's Day to a great Dad!
      Love,
      Sam and Beatrice
    </GiftMessage>
  </Product>
  <Subtotal currency='USD' xsi:type="xsd:decimal">
    393.85
  </Subtotal>
  <Tax rate="7.0"
       currency='USD' xsi:type="xsd:decimal">28.20</Tax>
  <Shipping method="USPS" currency='USD'
            xsi:type="xsd:decimal">8.95</Shipping>
  <Total currency='USD' xsi:type="xsd:decimal">431.00</Total>
</Order>
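
A receiving program might use these labels roughly as follows. This is a sketch rather than a complete solution: the class `XsiTypeDemo` and its methods are hypothetical, only two of the schema types are converted, and every other type is simply returned as a trimmed string. It uses the JAXP DOM parser to read the `xsi:type` attribute and resolves the prefix of the type name against the namespaces in scope instead of assuming the prefix is always `xsd`.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.math.BigInteger;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XsiTypeDemo {

    static final String XSD = "http://www.w3.org/2001/XMLSchema";
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    /** Parses a small XML string and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getAttributeNS to work
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    /** Returns the local name of the schema type, or "" if the element is not labeled. */
    static String schemaType(Element element) {
        String qname = element.getAttributeNS(XSI, "type");
        int colon = qname.indexOf(':');
        if (colon < 0) {
            return "";
        }
        String prefix = qname.substring(0, colon);
        return XSD.equals(element.lookupNamespaceURI(prefix)) ? qname.substring(colon + 1) : "";
    }

    /** Converts the element's text according to its xsi:type label. */
    static Object typedValue(Element element) {
        String text = element.getTextContent().trim();
        switch (schemaType(element)) {
            case "decimal":
                return new BigDecimal(text);
            case "positiveInteger":
                BigInteger value = new BigInteger(text);
                if (value.signum() <= 0) {
                    throw new IllegalArgumentException("Not positive: " + text);
                }
                return value;
            default:
                return text; // strings and every type this sketch does not handle
        }
    }

    public static void main(String[] args) {
        Element price = parse("<Price xmlns:xsd='" + XSD + "' xmlns:xsi='" + XSI
                + "' currency='USD' xsi:type='xsd:decimal'>21.95</Price>");
        Object value = typedValue(price);
        System.out.println(value + " is a " + value.getClass().getName());
    }
}
```

Because the ZIP code is labeled `xsd:string`, a value such as 02882 keeps its leading zero instead of being turned into the number 2882.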

Instead of labeling each element explicitly with xsi:type, a document can rely on a schema to indicate the types. However, at the time of this writing the APIs for such things aren't finished, so it's best to explicitly label elements when the types are important.

XML-RPC uses only the int, boolean, double, dateTime, and base64 types as well as a string type that's restricted to ASCII. Furthermore, it does not allow the NaN, Inf, and -Inf values for double. It does not use xsi:type attributes, relying instead on predefined semantics for particular elements. SOAP allows all 44 types and does use xsi:type attributes to label elements.