Canonical XML

Although infosets are a good idea, they are only abstract formulations of the information in an XML document. So without reducing an XML document to its infoset, how can you approach the goal of comparing XML documents byte by byte?

It turns out that there is a way: You can use canonical XML. Canonical XML is a companion standard to XML, and you can read all about it at www.w3.org/TR/xml-c14n. Essentially, canonical XML is a strict XML syntax; documents in canonical XML can be compared directly. The information included in the canonical XML version of a document is the same as would appear in its infoset.

As you can imagine, two XML documents that actually contain the same information can be arranged differently. They can differ in terms of their structure, attribute ordering, and even character encoding. That means that it's very hard to compare such documents. However, when you place these documents in canonical XML format, they can be compared on a byte-by-byte level. In the canonical XML syntax, logically equivalent documents are identical byte for byte.
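To make that concrete, here is a minimal sketch in Python, assuming version 3.8 or later, where the standard library provides xml.etree.ElementTree.canonicalize. Note that this function implements the newer Canonical XML 2.0 rules rather than the version at www.w3.org/TR/xml-c14n, and the two ITEM documents (with their id and unit attributes) are made up for illustration:

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical documents that carry the same information; attribute
# order, quote style, and whitespace inside the tags all differ.
doc_a = '<ITEM id="1" unit="lb"><PRODUCT>Tomatoes</PRODUCT></ITEM>'
doc_b = "<ITEM   unit='lb' id='1' ><PRODUCT>Tomatoes</PRODUCT></ITEM  >"

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

# The raw text differs, but the canonical forms match byte for byte.
print(doc_a == doc_b)                                    # False
print(canon_a.encode("utf-8") == canon_b.encode("utf-8"))  # True
```

The comparison is done on UTF-8 bytes because that is the only encoding canonical XML allows.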

The canonical XML syntax is very strict; for example, canonical XML uses UTF-8 character encoding only, carriage-return linefeed pairs are replaced with linefeeds, CDATA sections are replaced by their character content, all entity references must be expanded, and much more, as specified in www.w3.org/TR/xml-c14n. Because canonical XML is intended to be byte-by-byte correct, the upshot is that if you need a document in canonical form, you should use software to convert your XML documents to that form.
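A few of these rules can be watched in action with a short Python sketch. It assumes Python 3.8 or later and uses the standard library's xml.etree.ElementTree.canonicalize, which follows Canonical XML 2.0, so treat it as an illustration of the idea rather than of the exact specification cited above. The NOTE document is hypothetical:

```python
from xml.etree.ElementTree import canonicalize

# A hypothetical document with an XML declaration, CRLF line endings,
# single-quoted attributes, a CDATA section, and an empty-element tag.
source = (
    '<?xml version="1.0"?>\r\n'
    "<NOTE lang='en'>\r\n"
    "<TEXT><![CDATA[8 < 24]]></TEXT>\r\n"
    "<EMPTY/>\r\n"
    "</NOTE>"
)

canonical = canonicalize(source)
print(canonical)

# Canonical XML is always UTF-8, so encode before writing or comparing bytes.
canonical_bytes = canonical.encode("utf-8")
```

In the printed result, the XML declaration is gone, the attribute value is in double quotes, the CDATA section has become ordinary escaped text, and the empty element is written as a start tag and end tag pair.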

One such package that will convert valid XML documents to canonical form comes with the XML for Java software that you can get from IBM's AlphaWorks (the Web site is http://www.alphaworks.ibm.com/tech/xml4j). XML for Java comes with a Java program named DOMWriter that can convert documents to canonical XML form. To use this program, you need to make sure that your document is valid, which means giving it a DTD or schema to be checked against. I'll add a DTD to the example ch02_01.xml we've seen in this chapter (we'll see how to create DTDs in the next chapter):

Listing ch02_04.xml
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (CUSTOMER)*>
<!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
<!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
<!ELEMENT LAST_NAME (#PCDATA)>
<!ELEMENT FIRST_NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT ORDERS (ITEM)*>
<!ELEMENT ITEM (PRODUCT,NUMBER,PRICE)>
<!ELEMENT PRODUCT (#PCDATA)>
<!ELEMENT NUMBER (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
]>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Jones</LAST_NAME>
            <FIRST_NAME>Polly</FIRST_NAME>
        </NAME>
        <DATE>October 20, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Bread</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Apples</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

Now you can use the DOMWriter program with the special -c switch to convert this document to canonical form (the > canonical.xml part at the end sends the output of the program to a file named canonical.xml):

 %java dom.DOMWriter -c ch02_04.xml > canonical.xml 

Here's the result (note that DOMWriter has preserved all the whitespace in the document, and the &#10; character references stand for the character code of a linefeed; you can also give codes in hexadecimal if you include an x before the number, like this, for a linefeed: &#xA;):

 <DOCUMENT>&#10;    <CUSTOMER>&#10;        <NAME>&#10;            <LAST_NAME>Smith</LAST_NAME>&#10;            <FIRST_NAME>Sam</FIRST_NAME>&#10;        </NAME>&#10;        <DATE>October 15, 2003</DATE>&#10;        <ORDERS>&#10;            <ITEM>&#10;                <PRODUCT>Tomatoes</PRODUCT>&#10;                <NUMBER>8</NUMBER>&#10;                <PRICE>.25</PRICE>&#10;            </ITEM>&#10;            <ITEM>&#10;                <PRODUCT>Oranges</PRODUCT>&#10;                <NUMBER>24</NUMBER>&#10;                <PRICE>.98</PRICE>&#10;            </ITEM>&#10;        </ORDERS>&#10;    </CUSTOMER>&#10;    <CUSTOMER>&#10;        <NAME>&#10;            <LAST_NAME>Jones</LAST_NAME>&#10;            <FIRST_NAME>Polly</FIRST_NAME>&#10;        </NAME>&#10;        <DATE>October 20, 2003</DATE>&#10;        <ORDERS>&#10;            <ITEM>&#10;                <PRODUCT>Bread</PRODUCT>&#10;                <NUMBER>12</NUMBER>&#10;                <PRICE>.95</PRICE>&#10;            </ITEM>&#10;            <ITEM>&#10;                <PRODUCT>Apples</PRODUCT>&#10;                <NUMBER>6</NUMBER>&#10;                <PRICE>.50</PRICE>&#10;            </ITEM>&#10;        </ORDERS>&#10;    </CUSTOMER>&#10;    <CUSTOMER>&#10;        <NAME>&#10;            <LAST_NAME>Weber</LAST_NAME>&#10;            <FIRST_NAME>Bill</FIRST_NAME>&#10;        </NAME>&#10;        <DATE>October 25, 2003</DATE>&#10;        <ORDERS>&#10;            <ITEM>&#10;                <PRODUCT>Asparagus</PRODUCT>&#10;                <NUMBER>12</NUMBER>&#10;                <PRICE>.95</PRICE>&#10;            </ITEM>&#10;            <ITEM>&#10;                <PRODUCT>Lettuce</PRODUCT>&#10;                <NUMBER>6</NUMBER>&#10;                <PRICE>.50</PRICE>&#10;            </ITEM>&#10;        </ORDERS>&#10;    </CUSTOMER>&#10;</DOCUMENT> 
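If you want to convince yourself that the decimal and hexadecimal forms of a character reference mean the same thing, a quick check with Python's standard xml.etree.ElementTree parser will do. The LINE element here is a made-up example, not part of the listing above:

```python
import xml.etree.ElementTree as ET

# The same linefeed written as a decimal and as a hexadecimal reference.
decimal = ET.fromstring("<LINE>one&#10;two</LINE>").text
hexadecimal = ET.fromstring("<LINE>one&#xA;two</LINE>").text

print(decimal == hexadecimal)  # True
print(repr(decimal))           # 'one\ntwo'
print(ord(decimal[3]))         # 10
```

Both references are resolved by the parser to the single character with code 10, which is the linefeed.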

In their canonical form, documents can be compared directly, and any differences will be readily apparent.
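As a rough sketch of such a comparison, the following Python code canonicalizes two documents and reports the byte offset at which they first diverge. It assumes Python 3.8 or later (whose canonicalize function follows Canonical XML 2.0), and the first_difference helper and the two ITEM documents are hypothetical names introduced only for this example:

```python
from xml.etree.ElementTree import canonicalize


def first_difference(xml_a, xml_b):
    """Return the byte offset where the canonical forms diverge, or None."""
    bytes_a = canonicalize(xml_a).encode("utf-8")
    bytes_b = canonicalize(xml_b).encode("utf-8")
    if bytes_a == bytes_b:
        return None
    for offset, (a, b) in enumerate(zip(bytes_a, bytes_b)):
        if a != b:
            return offset
    # One canonical form is a prefix of the other.
    return min(len(bytes_a), len(bytes_b))


# Hypothetical orders that differ only in the NUMBER value.
order_a = "<ITEM><PRODUCT>Bread</PRODUCT><NUMBER>12</NUMBER></ITEM>"
order_b = "<ITEM><PRODUCT>Bread</PRODUCT><NUMBER>21</NUMBER></ITEM>"

print(first_difference(order_a, order_a))  # None
print(first_difference(order_a, order_b))
```

Because both inputs are reduced to canonical form first, any offset the helper reports points at a real difference in content, not at a difference in quoting or attribute order.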

This example is also useful because it shows exactly what a DTD looks like and provides us with the perfect starting point for the next chapter, which is where we're going to start writing DTDs ourselves and thus create valid XML documents.



Real World XML (2nd Edition)
ISBN: 0735712867
Year: 2005
Pages: 440
Authors: Steve Holzner
