CDATA Sections

CDATA sections are probably the most frequently abused drugs in the XML pharmacy. The normal reason for this abuse is to embed non-well-formed HTML inside an XML document. For example, the description element in a catalog entry might contain an entire web page for a product.

<Vehicle>
  <price>30000</price>
  <inStock>4</inStock>
  <color>black</color>
  <description><![CDATA[
    <html>
      <title>The G2 SUV</title>
      <body>
        <img src=g2suv.jpg height=100 width=100>
        The G2 SUV is one of our best-selling models.
        <p>
        It's built on a truck base for all the stability of a
        pickup driving down a bumpy country road.
        <p>
        It gets an astonishing eight miles to the liter.
        <p>
        <hr>
        <a href=G3SUV.html>Next Car</a>
      </body>
    </HTML>
  ]]></description>
</Vehicle>

Given this structure, it's temptingly easy to write code that extracts the contents of the description element and writes the raw text into a file or onto a network socket that expects to receive HTML.

Even worse is the case where the CDATA section is not the exclusive contents of an element but is instead one of several children, so that it becomes almost a pseudo-element. For example, imagine that the above catalog entry did not contain a separate description element child, just a CDATA section holding HTML.

<Vehicle>
  <price>30000</price>
  <inStock>4</inStock>
  <color>black</color>
  <![CDATA[
    <html>
      <title>The G2 SUV</title>
      <body>
        <img src=g2suv.jpg height=100 width=100>
        The G2 SUV is one of our best-selling models.
        <p>
        It's built on a truck base for all the stability of a
        pickup driving down a bumpy country road.
        <p>
        It gets an astonishing eight miles to the liter.
        <p>
        <hr>
        <a href=G3SUV.html>Next Car</a>
      </body>
    </HTML>
  ]]>
</Vehicle>

This kind of structure causes major problems for all sorts of XML tools. It severely limits the validation that can be performed with a DTD or a schema. It is extremely difficult to transform properly with XSLT. DOM parsers may or may not separate out the CDATA sections from the surrounding text, and SAX parsers might not even notice the CDATA sections.
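The difference between tools is easy to observe. The following is a minimal sketch using two parsers from Python's standard library; the shortened document in DOC is a hypothetical stand-in for the catalog entry above, and other DOM or tree libraries may behave differently from either one.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

# Hypothetical, shortened version of the mixed-content catalog entry.
DOC = "<Vehicle><color>black</color> text <![CDATA[<p>The G2 SUV]]></Vehicle>"

# minidom reports the CDATA section as a node of its own type,
# separate from the ordinary text node that precedes it.
dom = minidom.parseString(DOC)
node_types = [child.nodeType for child in dom.documentElement.childNodes]
has_cdata_node = Node.CDATA_SECTION_NODE in node_types

# ElementTree does not record the CDATA boundary at all: the section's
# content is merged into the text that follows the <color> element.
root = ET.fromstring(DOC)
tail_text = root.find("color").tail

print(has_cdata_node)
print(repr(tail_text))
```

Code that depends on finding the CDATA section as a distinct node works with the first parser and silently fails with the second, which is one reason not to give CDATA sections structural meaning.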

The solution in both cases is simple: Make the HTML well-formed and treat it as an html element rather than raw text:

<Vehicle>
  <price>30000</price>
  <inStock>4</inStock>
  <color>black</color>
  <html>
    <title>The G2 SUV</title>
    <body>
      <img src="g2suv.jpg" height="100" width="100" />
      <p>
        The G2 SUV is one of our best-selling models.
      </p>
      <p>
        It's built on a truck base for all the stability of a
        pickup driving down a bumpy country road.
      </p>
      <p>
        It gets an astonishing eight miles to the liter.
      </p>
      <hr />
      <a href="G3SUV.html">Next Car</a>
    </body>
  </html>
</Vehicle>

If you want to get the text from the HTML, you'll have to serialize the root html element, just as you'd serialize any other XML element. In DOM Level 3 you can use the DOMWriter interface from the Load and Save module (renamed LSSerializer in the final recommendation).
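Serializing the element takes only a few lines in most APIs. The sketch below uses Python's xml.etree.ElementTree rather than DOM, purely as an illustration; the abbreviated VEHICLE document and the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical version of the well-formed catalog entry.
VEHICLE = """<Vehicle>
  <price>30000</price>
  <html>
    <title>The G2 SUV</title>
    <body>
      <img src="g2suv.jpg" height="100" width="100" />
      <p>The G2 SUV is one of our best-selling models.</p>
    </body>
  </html>
</Vehicle>"""

root = ET.fromstring(VEHICLE)
html = root.find("html")

# ElementTree stores the whitespace after </html> as the element's tail
# and would serialize it too, so drop it before writing the page out.
html.tail = None

# Serialize the html element like any other XML element.
page = ET.tostring(html, encoding="unicode")
print(page)
```

The resulting string can then be written to a file or a socket, and because it came from a parsed tree it is guaranteed to be well-formed.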

The general rule for CDATA sections is that nothing should change if the CDATA section is replaced by its content text with all < and & characters suitably escaped. CDATA sections are meant as a convenience for human authors, especially those writing books about markup like the one you're reading right now. They are not meant to replace elements for indicating the structure and semantics of content, or to hide malformed markup inside an XML document.
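This rule can be checked mechanically. The following is a small sketch in Python, assuming a hypothetical code element and sample text; it builds the same content once with a CDATA section and once with escaped characters, then compares what the parser reports.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text containing both characters that must be escaped in XML.
raw = "if (a < b && b < c) return;"

# The same content, written once as a CDATA section and once escaped.
with_cdata = f"<code><![CDATA[{raw}]]></code>"
with_escapes = f"<code>{escape(raw)}</code>"

# A parser should report identical text for both forms.
same = ET.fromstring(with_cdata).text == ET.fromstring(with_escapes).text
print(same)
```

If a program behaves differently for the two forms, it is relying on the CDATA section for something other than authoring convenience.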



Effective XML: 50 Specific Ways to Improve Your XML
ISBN: 0321150406
Year: 2002
Pages: 144