In the first chapter we gave XML a central role in our overall architecture as the "common denominator" format, along with logical conversions via XSLT (as opposed to the syntax conversions that the conversion utilities perform). It's time to expand on those concepts and get a bit more specific. Not only do we want XML to be the common format, but wherever possible we also want the data content to comply with the schema language's built-in data types. This means, for example, that a date in our common XML format is expressed as 2002-09-06, compliant with ISO 8601 and the schema date data type, instead of 06-Sep-2002 or 9/6/02. A dollar amount that might be expressed in a COBOL extract file or in X12 EDI as a number with two implied decimal places is represented in XML with an explicit decimal point, compliant with the built-in decimal data type.
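To make these two conversions concrete, here is a minimal sketch in Python. The function names, the input format strings, and the assumption that 9/6/02 means month/day/year are illustrative choices of mine, not part of the utilities described in this book.

```python
from datetime import datetime
from decimal import Decimal


def to_schema_date(text, input_format):
    """Convert a legacy date string to the ISO 8601 / schema date form.

    input_format is a strptime pattern describing the legacy layout
    (an assumption the caller must supply; it cannot be guessed safely).
    """
    return datetime.strptime(text, input_format).date().isoformat()


def to_schema_decimal(digits, implied_places=2):
    """Convert a digit string with implied decimal places to an explicit decimal.

    For example, a COBOL-style field "0001234550" with two implied
    places becomes "12345.50".
    """
    # scaleb shifts the exponent, so trailing zeros are preserved.
    return str(Decimal(int(digits)).scaleb(-implied_places))


print(to_schema_date("06-Sep-2002", "%d-%b-%Y"))  # 2002-09-06
print(to_schema_date("9/6/02", "%m/%d/%y"))       # 2002-09-06 (assuming month/day/year)
print(to_schema_decimal("0001234550"))            # 12345.50
```

Note that the month abbreviation pattern (`%b`) depends on the locale, so this sketch assumes English month names.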
We may not always be able to automatically use the XML data types as our common format. For example, if your application has a field for a local time and another for a time zone, there's no easy way to automatically convert those to the ISO 8601 format of a time followed by its offset from Coordinated Universal Time (UTC, also known as GMT or Zulu). However, we can save a lot of code in XSLT stylesheets if we convert data types where it is appropriate.
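The target format itself is simple once an offset is known; the hard part is deriving that offset from whatever the application stores in its time zone field. The following sketch assumes, hypothetically, that the offset has already been resolved to a whole number of hours. The function name and the input layout are my own illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def to_schema_datetime(local_text, offset_hours):
    """Combine a local date-time and a UTC offset into ISO 8601 form.

    local_text is assumed to look like "2002-09-06 14:30:00".
    offset_hours is the (already resolved) offset from UTC; mapping an
    application's time zone code to this number is not handled here.
    """
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S")
    zone = timezone(timedelta(hours=offset_hours))
    return local.replace(tzinfo=zone).isoformat()


print(to_schema_datetime("2002-09-06 14:30:00", -5))  # 2002-09-06T14:30:00-05:00
```

A time zone abbreviation such as EST does not always identify a single offset, and daylight saving rules change the offset during the year, which is why this step resists full automation.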
Why do we care about data representation at this level of detail? The main reason is that these built-in schema language data types are the data representations that will most likely be used natively by business applications using XML, whether you or your trading partners are employing those applications. Importantly, as we see standards emerge for common business documents such as purchase orders and invoices, these standards are predominantly using these built-in types instead of creating their own. If our utilities can put the data into these representations for us, then we save having to continually write the same code into our XSLT stylesheets.
The term "canonical XML" is often applied to XML used in this fashion. However, there are other definitions as well. See the sidebar for more details. Given the confusion about just what canonical XML is and the way some people seem to be misusing the term, I'm not going to use it except when I talk about it in the W3C sense (which will be rarely in this book).
What Is Canonical XML?
You know what XML is by now. But what does it mean for XML to be canonical? You have to crack a big dictionary to find an appropriate alternate definition of canonical. We're not looking for something having to do with church canon law, although there is an authoritative flavor about the term. The last alternate definition in my old Webster's defines canonical as "relating to various of the simplest and most significant forms or schemata to which general equations, statements, or expressions may be reduced without loss of generality." It also compares this to "normal form," which we should be familiar with from relational database design. My dictionary goes on to define "normal form" as, from logic, "a canonical or standard fundamental form of a statement to which others can be reduced." So, using this generic definition we get the idea that canonical XML might involve being in a common, general format to which all other data formats can be converted. Since we're using XML as the common format to and from which all data is converted, our use fits with this general definition. Many people seem to be using the term this way.
However, the W3C also has a Recommendation on Canonical XML [W3C 2000] that assigns a much more specialized meaning to the term. (There are other related Recommendations such as one on Exclusive XML Canonicalization, which deals with aspects of digital signatures in XML documents. But the one I'm discussing here is the first and primary Recommendation.) This Recommendation deals largely with determining whether or not two XML documents are logically equivalent by defining a canonical form to which both can be converted. Two instance documents may both comply with XML 1.0 and convey the same data, yet differ slightly in how they are written. For example, one might use a CDATA section to convey an ampersand while the other uses a predefined entity for the same purpose. In addition, the Attributes of an Element in one document might appear in a different order than the Attributes of the same Element in the other document. "Canonical" as defined in this Recommendation tends to agree more strictly with the definition of "canonical" that is used in the study of logic.
I don't know about you, but in most matters XML I'll defer to the W3C.
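The W3C sense of the term can be illustrated with a short sketch. It uses the `canonicalize` function in Python's standard `xml.etree.ElementTree` module (Python 3.8 and later), which implements Canonical XML version 2.0, a later version than the Recommendation discussed in the sidebar. The sample documents and the helper function name are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize

# Two documents that convey the same data but are written differently:
# one uses a CDATA section for the ampersand, the other a predefined
# entity, and the Attributes appear in a different order.
doc_a = '<item b="2" a="1"><![CDATA[AT&T]]></item>'
doc_b = "<item a='1' b='2'>AT&amp;T</item>"


def logically_equivalent(first, second):
    """Compare two XML documents by their canonical forms."""
    return canonicalize(first) == canonicalize(second)


print(canonicalize(doc_a))                 # <item a="1" b="2">AT&amp;T</item>
print(logically_equivalent(doc_a, doc_b))  # True
```

Both documents reduce to the same canonical string, so a simple string comparison is enough to show that they are logically equivalent.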