XML 101

The history of XML is an interesting read, but for our purposes it is not relevant. In brief, XML was developed to solve many of the problems that people were starting to have with the Hypertext Markup Language (HTML).

Specifically, the World Wide Web Consortium (W3C) developed XML so that one could maintain context within a document.

Note

You can learn more about the W3C at www.w3.org/.

Tim Berners-Lee originally created HTML as a way to accurately describe scientific abstracts and documents. Even with this humble origin, it has proven to be a fairly robust language. After 10 years and billions of web pages, HTML has evolved from its original intent and specifications. Its element names (the "tags" in an HTML document) now bear little relationship to their current meaning. For example, an HTML <div> (division) element is essentially a container for other markup; the name has lost its significance, so it can represent practically anything.

Web pages marked up in HTML lack structure, context, and semantics. Without structure, there's no effective way to develop web pages because a program requires objects with known behaviors and characteristics to work properly. In other words, without structure, you don't have much context for any piece of information within a web page. You can attempt to "screen-scrape" the contents, but this requires a lot of information about the page's visual format and relies on similar HTML structures not existing elsewhere in the document. Often this involves writing more or less a custom parsing program that contains rules for parsing that specific version of HTML. As any developer who has written programs to work with flat files knows, this can be a tedious and frustrating experience, especially if the target output format (which might not be a visual medium, like HTML's intended output format) changes often.

XML borrows the tag structure and container/contained relationship of formal Standard Generalized Markup Language (SGML), which is the grandfather of both XML and HTML. But XML, unlike HTML, lets you design your own tag elements, attributes, and rules. By keeping the underlying language relatively simple, this approach nicely mimics the way that most "natural" data constructs work. Consider, for example, a shipto block from a purchase order written as XML:

 <shipto>
   <company>Department13</company>
   <contact>
     <title>VP</title>
     <name>Robi Sen</name>
   </contact>
   <street>122 UnterWasser</street>
   <city>Fort Collins</city>
   <state_province>Colorado</state_province>
   <zipcode>80526</zipcode>
   <country>USA</country>
 </shipto>

The same shipto structure could carry a different value in its <country> element and still be perfectly valid:

 <shipto>
   <company>Department13</company>
   <contact>
     <title>VP</title>
     <name>Robi Sen</name>
   </contact>
   <street>122 UnterWasser</street>
   <city>Fort Collins</city>
   <state_province>Colorado</state_province>
   <zipcode>80526</zipcode>
   <country>UK</country>
 </shipto>
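Because the element names carry the meaning, a program can pull out a field by name rather than by its position on a page. The following is a minimal sketch of that idea in Python, using the standard library's xml.etree.ElementTree module. The choice of Python and the variable names are assumptions made for illustration; the same lookups could be written in any language with an XML parser.

```python
import xml.etree.ElementTree as ET

# The shipto block from the text, held in a string for illustration.
SHIPTO_XML = """
<shipto>
  <company>Department13</company>
  <contact>
    <title>VP</title>
    <name>Robi Sen</name>
  </contact>
  <street>122 UnterWasser</street>
  <city>Fort Collins</city>
  <state_province>Colorado</state_province>
  <zipcode>80526</zipcode>
  <country>USA</country>
</shipto>
"""

shipto = ET.fromstring(SHIPTO_XML)

# findtext() takes a path of element names, so nesting is part of the lookup.
company = shipto.findtext("company")
contact_name = shipto.findtext("contact/name")
country = shipto.findtext("country")

print(company, contact_name, country)
```

Swapping USA for UK in the source string changes only the value returned for country; the lookup code stays the same.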

XML is generally used in one of two ways: as a method of marking up documents (its original SGML heritage), which is document-centric; or as data description or exchange, which is data-centric. As a ColdFusion application developer, you are already familiar with relational databases, so one way you can think about XML is from a database or data-centric viewpoint. For instance, to represent the preceding XML structures in a typical relational database, you might use three distinct tables: one for customer information, one for customer addresses, and perhaps another for regional codes. You might also need additional keys: primary ones for the two customer tables and a foreign key for the regional address codes. By making the structure/schema as close to third normal form as possible, you can handle the numerous exception conditions that often plague such databases. These exceptions occur largely because most data doesn't conveniently fit into a strict table-and-key relational model.

Note

XML does not follow a relational model like most databases, so although much of how XML works is analogous to database principles, there are points where the parallels break down. Currently, many vendors of object-oriented databases are seeing interest in their products because XML can be modeled much better as objects than as relational tables.

Metadata

One of the most important things about XML is its notion of metadata, which is a concept you will find in any data-modeling system or theory. Metadata is information about your system's data. For example, if you created a database table called Customer, you might have several different fields and data types (see Table 19.1).

Table 19.1. The Customer Table
Field Name	Data Type
`CustomerID`	AutoNumber
`CustomerFirstName`	Text
`CustomerLastName`	Text
`CustomerAddress`	Text
`CustomerCity`	Text
`CustomerStateID`	Number
`CustomerPhone`	Text
`CustomerEmail`	Text
`CustomerPassword`	Text
`CustomerNotification`	Text
This is the same table that is in the datasource for this book.

In the case of this table, the metadata is the column names, data types, and restrictions you put on the data, such as integer range, string length, binary, date field, and so on. Metadata, as you already know, makes relational databases useful in that once you have created the schema for your database, you have a formal context and relationships for all the data you store in that database. Metadata is also what makes XML useful; however, XML accomplishes this in a couple of ways. At its simplest (and least informative), an XML document could consist of a single pair of <exampledocument> tags, with the content being a superset of ASCII text:

 <?xml version="1.0"?>
 <hellokitty>
 This XML document contains almost no real structure.
 The XML parser only knows that it's an XML document.
 It also says Hello Kitty!
 </hellokitty>

It is important to note that although this XML document is ridiculously simple, it is a well-formed XML document (unlike the other examples so far, which have been XML document fragments).

Well-Formedness

As we pointed out, the preceding document was well-formed, but what does that mean? To be well-formed, an XML document must conform to three specific rules:

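The document starts with an XML declaration: <?xml version="1.0"?>. (Strictly speaking, the XML 1.0 specification recommends the declaration rather than requiring it, but it is good practice to always include it.)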
There is a root element in which all others are contained: for example, the <hellokitty> and </hellokitty> tags from the previous code.
All elements must be properly nested. No overlapping is permitted.
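One way to see these rules in action is to hand a document to a parser and watch whether it is accepted. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are hypothetical, chosen only to illustrate the nesting rule.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = '<?xml version="1.0"?><hellokitty>Hello Kitty!</hellokitty>'
# <b> is opened inside <a> but closed after it, so the elements overlap.
overlapping = '<?xml version="1.0"?><a><b></a></b>'

print(is_well_formed(good))
print(is_well_formed(overlapping))
```

The parser rejects the second string as soon as it meets the mismatched closing tag, which is why overlapping elements never make it into a parsed document.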

As you can see, creating a well-formed XML document is pretty straightforward. Later in this chapter we will go into more depth about well-formedness, but for now let's look at a more complex version of an XML document.

For now, imagine we have created an XML file from a database recordset containing purchase order information. (If you have used CFFILE to create text files, you already know most of what you need to create an XML file in ColdFusion.) Listing 19.1 presents an example of a more complex XML document.

Listing 19.1 Pricelist.xml

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE price-list SYSTEM "\pricelist.dtd">
 <price-list>
   <price-group>
     <name>ColdFusion MX</name>
     <price-element>
       <product-code>CFMX2002</product-code>
       <description>ColdFusion MX blah blah blah blah</description>
       <license type="Server">
         <quantity>String</quantity>
       </license>
       <list-price>1000</list-price>
     </price-element>
   </price-group>
   <price-group>
     <name>Kojac</name>
     <price-element>
       <product-code>CFS20023</product-code>
       <description>New New New</description>
       <license type="Server">
         <quantity>String</quantity>
       </license>
       <list-price>500</list-price>
     </price-element>
   </price-group>
   <price-group>
     <name>Flash MX</name>
     <price-element>
       <product-code>MACR2002F</product-code>
       <description>Macromedia Flash MX provides everything you need to create and deploy rich Web content and powerful applications. Whether you are designing motion graphics or building data-driven applications, Flash MX has the tools you need to produce great results and deliver the best user experiences across multiple platforms and devices.</description>
       <license type="Server">
         <quantity>String</quantity>
       </license>
       <list-price>500</list-price>
     </price-element>
   </price-group>
   <price-group>
     <name>Jrun</name>
     <price-element>
       <product-code>MACR2002JR</product-code>
       <description>JRun Server Professional Edition is a high-performance J2EE application server for deploying JSP and Servlet applications.</description>
       <license type="Server">
         <quantity>String</quantity>
       </license>
       <list-price>1000</list-price>
     </price-element>
   </price-group>
 </price-list>

Just by glancing at this document, you can probably see that it's a price list and that it contains each product's product code, description, license type, and list price, along with the price group it falls into. Using our previous database analogy, you can more or less see the schema as well as the literal values or records in it, something that would not be apparent from looking at a database recordset or table view. (When we talk about schema in this sense, we mean something like a database schema; later we will talk about an XML initiative called XML Schema that is replacing the use of DTDs. Both are used to store further metadata for the XML document.) For example, Figure 19.1 shows the output of a query against a database. As you can see, all that's easily discernible is the raw data and, to some extent, the column names.

Figure 19.1. The output of a database query.


One of the great things about XML is that not only is it human readable (like HTML), it is also machine readable. This means that a computer using an XML parser can "understand" what a shipto tag means in the context of a document or application without having to do any special programming.
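To make "machine readable" a little more concrete, here is a rough sketch of a program reading a price list by element name. It uses Python's standard xml.etree.ElementTree module and a shortened stand-in for Pricelist.xml (two price groups, with descriptions and licenses left out); the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# A shortened stand-in for Pricelist.xml with two of its price groups.
PRICELIST_XML = """<?xml version="1.0"?>
<price-list>
  <price-group>
    <name>ColdFusion MX</name>
    <price-element>
      <product-code>CFMX2002</product-code>
      <list-price>1000</list-price>
    </price-element>
  </price-group>
  <price-group>
    <name>Flash MX</name>
    <price-element>
      <product-code>MACR2002F</product-code>
      <list-price>500</list-price>
    </price-element>
  </price-group>
</price-list>
"""

root = ET.fromstring(PRICELIST_XML)

# Walk each price-group and collect its name and list price.
prices = {}
for group in root.findall("price-group"):
    name = group.findtext("name")
    price = int(group.findtext("price-element/list-price"))
    prices[name] = price

print(prices)
```

No page layout or column position is involved: the program asks for list-price inside price-element, and the parser does the rest.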

Let's break down the Pricelist.xml document's structure and see how it works. First, note this part of the first line:

 <?xml version="1.0"?>

The first line declares that the document is an XML document (version 1.0 specifically).

The next thing we notice is the opening tag of the root element for our document: <price-list>. If you quickly scan the document, you see that it is closed with a </price-list> tag. As we noted earlier, every starting tag needs a matching closing tag. Each such pair, together with its content, is called an element, and every XML document can have only one root element.

For example, you could have this code:

 <helloXML>
     <hello>
         <world />
         <kitty />
     </hello>
 </helloXML>

But not this code:

 <helloXML>
     <hello>
         <world />
         <kitty />
     </hello>
 </helloXML>
 <helloXML>
     <hello>
         <world />
         <kitty />
     </hello>
 </helloXML>

You can't have the second instance because <helloXML> is the root element and cannot be repeated in the document. One way to think of this is that the root element is much like a table name. If it's not unique within the document, the parser does not really have a starting point with which to work. All of the other elements (called subelements) can be repeated as many times as you want; the only restrictions are that every tag must start with a left-angle bracket (<) and end with a right-angle bracket (>), and every beginning tag must be closed with an ending tag unless it is an empty tag. Empty tags take the form <myemptytag />, such as <img src="/books/2/197/1/html/2/example.jpg" /> or <br /> (the XHTML equivalents of the HTML <img src="/books/2/197/1/html/2/example.jpg"> and <br> tags, respectively).
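The single-root rule and the empty-tag shorthand can both be checked with a parser. The following sketch uses Python's standard xml.etree.ElementTree module; the helper name parses and the constants are hypothetical, introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

ONE_ROOT = "<helloXML><hello><world /><kitty /></hello></helloXML>"
TWO_ROOTS = ONE_ROOT + ONE_ROOT  # a second top-level element follows the first

def parses(text):
    """Return True if the parser accepts text, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses(ONE_ROOT))
print(parses(TWO_ROOTS))

# An empty tag and an explicit open/close pair describe the same element.
short_form = ET.fromstring("<world />")
long_form = ET.fromstring("<world></world>")
print(short_form.tag == long_form.tag, len(short_form), len(long_form))
```

Repeating subelements is fine; it is only a second top-level element that the parser refuses.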

So, to extend our analogy of databases to XML, the root element is much like a table name, and the subelements are like column names. In databases, though, we can assign all sorts of conditions, attributes, and logic to a column. XML also lets you assign various attributes to elements; however, our analogy starts to get a little thin here. Let's take a look at the pricelist.xml document again, specifically the price-group element:

 <price-group>
   <name>Flash MX</name>
   <price-element>
     <product-code>MACR2002F</product-code>
     <description>Macromedia Flash MX provides everything you need to create and deploy rich Web content and powerful applications. Whether you are designing motion graphics or building data-driven applications, Flash MX has the tools you need to produce great results and deliver the best user experiences across multiple platforms and devices.</description>
     <license type="Server">
       <quantity>String</quantity>
     </license>
     <list-price>500</list-price>
   </price-element>
 </price-group>

In this block, price-group is clearly not an empty tag: it contains name, product-code, description, license, quantity, and list-price subelements, and the license element carries a type attribute. To keep pushing our database analogy, this element might look like Table 19.2 if we put it in tabular form.

Table 19.2. Products
Field Name	Data Type
`ProductID`	Number
`VendorID`	Number
`ProductDate`	Date/time
`CategoryID`	Number
`ProductName`	Text
`ProductDescription`	Memo
`ProductPricePerUnit`	Text
`ProductSpecifications`	Text
`ProductSearchTerms`	Text
`ProductDemoURL`	Text
`ProductDocumentationURL`	Text
`ProductEvaluationFile`	Text
`ProductCommercialFile`	Text
`ProductLicenseID`	Number
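The step from a price-group element to a table-like row can be sketched in code. The example below uses Python's standard xml.etree.ElementTree module. The description is shortened, and the row keys are only loosely borrowed from Table 19.2 (ProductCode and LicenseType are made up for this example), so treat the mapping as an assumption rather than the book's actual schema.

```python
import xml.etree.ElementTree as ET

# The Flash MX price-group from the text, with a shortened description.
PRICE_GROUP_XML = """
<price-group>
  <name>Flash MX</name>
  <price-element>
    <product-code>MACR2002F</product-code>
    <description>Macromedia Flash MX</description>
    <license type="Server">
      <quantity>String</quantity>
    </license>
    <list-price>500</list-price>
  </price-element>
</price-group>
"""

group = ET.fromstring(PRICE_GROUP_XML)
element = group.find("price-element")

# Subelement content comes from findtext(); attribute values come from get().
row = {
    "ProductName": group.findtext("name"),
    "ProductCode": element.findtext("product-code"),
    "ProductDescription": element.findtext("description"),
    "LicenseType": element.find("license").get("type"),
    "ProductPricePerUnit": element.findtext("list-price"),
}

print(row)
```

Note the difference in how the two kinds of data are reached: name and list-price are subelements, whereas type is an attribute on the license element.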

But remember that, in a database table, we can also assign table-level logic such as whether an attribute is required, needs to be of a certain length or specific data type, or can be repeated. (In some relational database servers, you can also assign complex database-level and application-level logic.) Our XML document, by itself, contains little that conveys this kind of information. However, look at the second line from Listing 19.1:

 <!DOCTYPE price-list SYSTEM "\pricelist.dtd">

This line is a declaration that tells the parser that there is a DTD associated with the XML document. So, what's a DTD? For our analogy and this example, the DTD is where we store all the metadata about data-level logic and other specifics you would normally define in a database. Why can't we define that information in the XML document directly? In a database, for instance, we can define it right in a table. Well, have you ever had to change your database schema and then change all those rules? It can be a major hassle, and often after you have changed your schema you will need to "scrub" your database's data to conform to the new data model. All in all, it takes time and specialized knowledge that is often product specific; essentially, it's why our friend the database administrator makes so much money. Instead, XML allows a document's creators to separate the document's logical structure and data (in this case, think table structure as well as records of data) from the metadata (the rules that govern that data) and from the eventual target output (how the XML is eventually displayed, rendered, and produced for consumption).
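The idea of keeping the rules apart from the data can be illustrated without a real DTD. The sketch below is not DTD validation: the parser in Python's standard library is non-validating, so a hand-written rules table stands in for what a validating parser would read from pricelist.dtd. The REQUIRED_CHILDREN table, the missing_children helper, and the rules themselves are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules, kept apart from the data: each element name maps to
# the child elements it is required to contain.
REQUIRED_CHILDREN = {
    "price-group": ["name", "price-element"],
    "price-element": ["product-code", "description", "list-price"],
}

def missing_children(root):
    """Return (element, missing child) pairs for every rule that is broken."""
    problems = []
    for node in root.iter():
        for child in REQUIRED_CHILDREN.get(node.tag, []):
            if node.find(child) is None:
                problems.append((node.tag, child))
    return problems

complete = ET.fromstring(
    "<price-group><name>Jrun</name><price-element>"
    "<product-code>MACR2002JR</product-code>"
    "<description>JRun Server</description>"
    "<list-price>1000</list-price>"
    "</price-element></price-group>"
)
incomplete = ET.fromstring(
    "<price-group><name>Jrun</name><price-element>"
    "<product-code>MACR2002JR</product-code>"
    "</price-element></price-group>"
)

print(missing_children(complete))
print(missing_children(incomplete))
```

Changing a rule means editing the rules table only; neither the documents nor the checking code needs to change, which is the kind of separation the DTD provides for XML.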
