XML versus HTML: An Example | The Guru's Guide to SQL Server™ Stored Procedures, XML, and HTML


I've mentioned that you can create your own tags in XML. This is such a powerful, vital part of XML that it bears further discussion. If you're used to working in HTML, this concept is probably very foreign to you because HTML does not allow you to define your own tags. Although various browser vendors have extended HTML with their own custom tags, the bottom line is that at some point you're stuck. You have to use the tags provided to you by your browser. You cannot make your own.

So how do you define a new tag in XML? The simplest answer is: You don't have to. You just use it. You can control which tags are valid in an XML document using DTD documents and XML schemas (we'll talk about each of these later), but the bottom line is that you simply use a tag to define it in XML. There is no typedef or similar construct.

To compare and contrast how HTML and XML represent data, let's look at the same data represented using each language. Here's some sample HTML that displays a recipe (Listing 12-1):

Listing 12-1 A basic HTML document.

 <!-- The original html recipe -->
 <HTML>
 <HEAD>
 <TITLE>Henderson's Hotter-than-Hell Habañero Sauce</TITLE>
 </HEAD>
 <BODY>
 <H3>Henderson's Hotter-than-Hell Habañero Sauce</H3>
 Homegrown from stuff in my garden (you don't want to know exactly what).
 <H4>Ingredients</H4>
 <TABLE BORDER="1">
 <TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
 <TR><TD>6</TD><TD>each</TD><TD>Habañero peppers</TD></TR>
 <TR><TD>12</TD><TD>each</TD><TD>Cowhorn peppers</TD></TR>
 <TR><TD>12</TD><TD>each</TD><TD>Jalapeño peppers</TD></TR>
 <TR><TD></TD><TD>dash</TD><TD>Tequila (optional)</TD></TR>
 </TABLE>
 <P>
 <H4>Instructions</H4>
 <OL>
 <LI>Chop up peppers, removing their stems, then grind to a liquid.</LI>
 <!-- and so forth -->
 </BODY>
 </HTML>

If you read through the HTML in Listing 12-1, you'll no doubt notice that the recipe's ingredients are stored in an HTML table. Figure 12-1 shows what the page looks like in a browser.

Figure 12-1. A simple HTML page containing some data.


There are several positive aspects of how HTML represents this data:

It's readable. If you look hard enough, you can tell what data the HTML contains.
It can be displayed by any browser, even nongraphical ones.
A cascading style sheet could be used to further control the formatting.

However, there's a really big negative aspect that outweighs the others insofar as data markup goes: There's nothing in the code to indicate the meaning of any of its elements. The data contained in the document has no context. A program could scan the document and pick out the items in the table, but it wouldn't know what they were. And although you could hard-code assumptions about the data (column 1 is Qty, column 2 is Units, and so on), if the format of the page were changed, your app would break.
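To make the fragility concrete, here is a minimal sketch in Python (my choice of language for illustration; it is not part of the original listings) that scrapes the ingredient table by column position. The names RowScraper and scrape_ingredients are hypothetical, and the table is a trimmed copy of Listing 12-1 with plain ASCII spellings.

```python
from html.parser import HTMLParser

# Trimmed copy of the ingredient table from Listing 12-1.
HTML_DOC = """
<TABLE BORDER="1">
<TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
<TR><TD>6</TD><TD>each</TD><TD>Habanero peppers</TD></TR>
<TR><TD></TD><TD>dash</TD><TD>Tequila (optional)</TD></TR>
</TABLE>
"""


class RowScraper(HTMLParser):
    """Collects the text of every TD cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_ingredients(html_text):
    scraper = RowScraper()
    scraper.feed(html_text)
    # Hard-coded assumption: column 0 is Qty, 1 is Units, 2 is Item.
    # The header row has no TD cells, so the length check skips it.
    return [
        {"qty": row[0], "units": row[1], "item": row[2]}
        for row in scraper.rows
        if len(row) == 3
    ]


ingredients = scrape_ingredients(HTML_DOC)

# Simulate a page redesign that swaps the Units and Item columns.
REORDERED = HTML_DOC.replace(
    "<TD>each</TD><TD>Habanero peppers</TD>",
    "<TD>Habanero peppers</TD><TD>each</TD>",
)
broken = scrape_ingredients(REORDERED)
print(ingredients[0])
print(broken[0])  # the same code now mislabels the data without any error
```

Nothing in the markup tells the scraper that the columns moved, so the second call silently reports "each" as the item.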

The problem is further exacerbated by attempting to extract the data and store it in a database. Because the semantic information about the data was stripped out when it was translated into HTML, we have to resupply this info in order to store it meaningfully in a database. In other words, we have to translate the data back out of HTML because HTML is not a suitable storage medium for semantic information.

Now let's take a look at the same data represented as XML. You'll notice that the markup has nothing to do with displaying the data; it is all about describing content. Here's the code (Listing 12-2):

Listing 12-2 The recipe data stored as XML.

 <?xml version="1.0" ?>
 <Recipe>
     <Name>Henderson&apos;s Hotter-than-Hell Habañero Sauce</Name>
     <Description> Homegrown from stuff in my garden (you don&apos;t want to know exactly what).</Description>
     <Ingredients>
         <Ingredient>
             <Qty unit="each">6</Qty>
             <Item>Habanero peppers</Item>
         </Ingredient>
         <Ingredient>
             <Qty unit="each">12</Qty>
             <Item>Cowhorn peppers</Item>
         </Ingredient>
         <Ingredient>
             <Qty unit="each">12</Qty>
             <Item>Jalapeno peppers</Item>
         </Ingredient>
         <Ingredient>
             <Qty unit="dash" />
             <Item optional="1">Tequila</Item>
         </Ingredient>
     </Ingredients>
     <Instructions>
         <Step> Chop up peppers, removing their stems, then grind to a liquid.</Step>
         <!-- and so forth... -->
     </Instructions>
 </Recipe>

See the difference? The tags in this data relate to recipes, not formatting. The file remains readable, so it retains the simplicity of the HTML format, but the data now has context. A program that parses this file will know exactly what a Jalapeno is: it's an Item in an Ingredient in a Recipe.
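As a rough illustration of what "the data now has context" buys you, the following sketch reads a shortened copy of Listing 12-2 with Python's standard xml.etree.ElementTree module and looks elements up by name instead of by position. Python and the function name load_ingredients are my own assumptions for the example, not something the listing prescribes.

```python
import xml.etree.ElementTree as ET

# Shortened copy of Listing 12-2 (two ingredients, no instructions).
RECIPE_XML = """<?xml version="1.0" ?>
<Recipe>
  <Name>Henderson&apos;s Hotter-than-Hell Habanero Sauce</Name>
  <Ingredients>
    <Ingredient>
      <Qty unit="each">6</Qty>
      <Item>Habanero peppers</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="dash" />
      <Item optional="1">Tequila</Item>
    </Ingredient>
  </Ingredients>
</Recipe>"""


def load_ingredients(xml_text):
    """Returns one dict per Ingredient, found by element name."""
    root = ET.fromstring(xml_text)
    result = []
    for ingredient in root.findall("./Ingredients/Ingredient"):
        qty = ingredient.find("Qty")
        item = ingredient.find("Item")
        result.append({
            "item": item.text,
            "qty": qty.text,  # None when the Qty element is empty
            "unit": qty.get("unit"),
            "optional": item.get("optional") == "1",
        })
    return result


recipe_name = ET.fromstring(RECIPE_XML).findtext("Name")
ingredients = load_ingredients(RECIPE_XML)
print(recipe_name)
for entry in ingredients:
    print(entry)
```

Because the lookups are by name, reordering Qty and Item inside an Ingredient would not change the result.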

And, regarding ease of use, I think you'll find that XML is actually more human readable, not less, than HTML. It accomplishes the goal of being at least as simple to use as HTML, yet it's orders of magnitude more powerful. It explains the information in a recipe in terms of recipes, not in terms of how to display recipes. We leave the display formatting for later and for tools better suited to it.

Notational Nuances

It's important to get some of the nomenclature straight before we get too far into our discussion of XML. Let's reexamine part of our XML document:

 <Item optional="1">Tequila</Item>

In this code:

Item is the tag name. As in HTML, tags mark the start of an element in XML. Elements are a key piece of the XML puzzle. XML documents consist mostly of elements and attributes.
optional is an attribute name. An attribute is a field that further describes an element. We could have called it something besides optional. The name we've come up with is entirely of our own choosing. Notice that the other elements in the document do not have this attribute.
"1" is the value of the optional attribute, and the portion from optional through "1" constitutes the attribute.
</Item> is the end tag of the Item element.
The portion from <Item through </Item> is the Item element.
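The same vocabulary shows up in most XML APIs. As a small sketch (assuming Python's standard xml.etree.ElementTree module, which is my choice for illustration), the parsed element exposes its tag name, attributes, and text separately:

```python
import xml.etree.ElementTree as ET

# Parse the single element discussed above.
item = ET.fromstring('<Item optional="1">Tequila</Item>')

tag_name = item.tag        # the tag name: "Item"
attributes = item.attrib   # attribute names mapped to values
text = item.text           # the element's text content

print(tag_name, attributes, text)
```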

XML tags do not always contain text. They can be empty or contain just attributes. For example, have a look at this excerpt:

 <Qty unit="dash" />

Here, Qty is the element name, and unit is its only attribute. The forward slash at the end of the tag indicates that the element itself is empty and therefore does not require a closing tag. It's shorthand for this:

 <Qty unit="dash"></Qty>

Empty tags may or may not have attributes.
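If you want to convince yourself that the two spellings are equivalent, a quick check along these lines should do it. This is a sketch using Python's standard xml.etree.ElementTree module (an assumption on my part; any conforming parser should treat the two forms alike), and the Flag element is made up for the example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<Qty unit="dash" />')
long_form = ET.fromstring('<Qty unit="dash"></Qty>')

# Both forms yield an element with no text and no children.
same_tag = short_form.tag == long_form.tag
same_attributes = short_form.attrib == long_form.attrib
both_empty = (
    short_form.text is None and long_form.text is None
    and len(short_form) == 0 and len(long_form) == 0
)

# An empty element with no attributes at all is also fine.
bare = ET.fromstring("<Flag/>")

print(same_tag, same_attributes, both_empty, bare.attrib)
```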

In addition to these basic structure rules, XML documents require stricter formatting than HTML. XML documents must be well formed in order for an XML parser to be able to process them. In mathematics, equations have particular forms they must follow in order to be logical. The ones that don't aren't well formed and aren't terribly useful for anything. XML has a similar requirement. In order for a parser to be able to parse an XML document, it must meet certain rules. The most important of these are the following:

Every document must have a root element that envelops the rest of the document. It need not be named "root." In our earlier example, Recipe is the root element.
All tags must have closing tags, either in the form of an end tag or via the empty tag symbol (/). HTML often doesn't enforce this rule. Browsers typically try to guess where a closing tag should go if it's missing.
All tags must be properly nested. If Qty is contained within Ingredient, you must close Qty before you close Ingredient. This is, again, not something that's rigorously enforced by HTML, but an XML parser will not parse tags that are improperly nested.
Unlike element text, attribute values must always be enclosed in single or double quotes.
The characters <, >, and " have special meaning in XML markup, so when they occur in your data you represent them with the character entities &lt;, &gt;, and &quot;, respectively. (Strictly speaking, a parser rejects only a literal < in content and a literal quote that matches the delimiter of the attribute value it appears in, but escaping all three is always safe.) A character entity is a string that begins with an ampersand (&) and ends with a semicolon and takes the place of a special symbol to avoid confusing the parser. There are two other predefined character entities: &amp; and &apos;. The &amp; entity takes the place of an ampersand. Because ampersands denote character entities in an XML document, a bare ampersand in your data is an error. Similarly, &apos; represents a single quote (an apostrophe). Because attribute values can be enclosed in single quotes, a stray apostrophe inside such a value can confuse the parser.
Unlike HTML, if you wish to use character entities other than the predefined five we just talked about, you must first declare them in a DTD. We'll discuss DTDs shortly.
Element and attribute names may not begin with the letters "XML" in any casing. XML reserves these for its own use.
XML is case sensitive. This means that an element named Customer is a different element than one named customer.
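Several of these rules are easy to see in action. The sketch below feeds a handful of small documents to Python's standard xml.etree.ElementTree parser and records which ones it accepts; the helper name is_well_formed and the sample documents are my own, chosen to break one rule each. It also uses xml.sax.saxutils.escape to turn special characters into their predefined entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Returns True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One sample per rule; only the first is well formed.
samples = {
    "good": "<Recipe><Qty unit='each'>6</Qty></Recipe>",
    "two roots": "<Recipe/><Recipe/>",
    "missing end tag": "<Recipe><Qty>6</Recipe>",
    "bad nesting": "<Ingredient><Qty>6</Ingredient></Qty>",
    "unquoted attribute": "<Qty unit=each>6</Qty>",
    "bare ampersand": "<Item>Salt & pepper</Item>",
    "undeclared entity": "<Item>Jalape&ntilde;o</Item>",
    "case mismatch": "<Customer></customer>",
}
results = {name: is_well_formed(doc) for name, doc in samples.items()}

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
raw_text = "Salt & pepper <fresh>"
escaped = escape(raw_text)
round_trip = ET.fromstring("<Item>" + escaped + "</Item>").text

for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
print(escaped)
```

The parser turns the entities back into the original characters when it reads the document, so the escaped text round-trips unchanged.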

There's a difference between a well-formed XML document and a valid one. A valid XML document is a well-formed document that has had additional validation criteria applied to it. Being well formed is only the beginning. Beyond being able to be parsed, an XML document will typically have certain data relationships and requirements that make it sensible. A document that breaks these rules, although well formed, is not valid. For example, consider this XML fragment (Listing 12-3):

Listing 12-3 A well-formed but invalid XML fragment.

 <Car Name="Mustang" Make="Ford" Model="1966" LicensePlate="OU812">
     <Engine Type="Cleveland">341</Engine>
     <Engine Type="Winchester">302</Engine>
 </Car>

Is it well formed? Yes. Is it valid? Perhaps not. Most cars don't have two engines. Consider this modified excerpt from our example document:

 <Ingredient>
     <Qty unit="each">12</Qty>
     <Qty unit="each">10</Qty>
     <Item>Jalapeno peppers</Item>
 </Ingredient>

Does it make sense for an ingredient to include two Qty specifications? No, probably not. Although the document is well formed, it's most likely invalid.
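To separate the two ideas, here is a sketch in which the fragment parses without complaint (it is well formed) and then fails a hand-written rule check. The rule itself, exactly one Qty and one Item per Ingredient, is my assumption about what a sensible recipe format would require, and validate_ingredient is a hypothetical helper standing in for a real validator.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<Ingredient>
  <Qty unit="each">12</Qty>
  <Qty unit="each">10</Qty>
  <Item>Jalapeno peppers</Item>
</Ingredient>"""


def validate_ingredient(ingredient):
    """Returns a list of rule violations; an empty list means valid."""
    problems = []
    # Assumed rule: each Ingredient has exactly one Qty and one Item.
    for child_name in ("Qty", "Item"):
        count = len(ingredient.findall(child_name))
        if count != 1:
            problems.append(
                f"expected exactly one {child_name}, found {count}"
            )
    return problems


element = ET.fromstring(FRAGMENT)  # no error: the fragment is well formed
problems = validate_ingredient(element)
print(problems)
```

Hand-coding rules like this does not scale, which is why validity is normally declared once, outside the program, and checked by a validating parser.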

How do you establish the validity rules for a document? Through DTDs and XML schemas. We'll discuss each of these in the sections that follow.
