The Basics | Fundamentals of SVG Programming: Concepts to Source Code (Graphics Series)

XML vs. HTML

XML goes beyond HTML primarily by allowing a document to define its own tags (similar to defining a data type or, more significantly, a record type). This capability allows a document to organize its data in a structured format. An XML document can also carry enough metadata (information about the data) that an application can reliably parse the document and extract its data.

In contrast, HTML is designed to describe documents in a format suitable for end-user viewing in a graphical browser. HTML documents do not contain information about the meaning of the data, nor are they structured in a way that makes it easy for a program to analyze. Therefore, an application may have a difficult time extracting relevant data from an HTML document.

A relatively simple example illustrates this point. Here is a portion of an HTML page that might be generated by an Internet book retailer. It informs an Internet browser how to represent the current contents of the shopping cart page to a potential purchaser.

<td bgcolor="#FFFFFF" width="51%">
  <a href="../81332713233407"> <em>Debt of Honor</em></a> <br>
  Tom Clancy; Paperback</b>
  <font size=2 face="Verdana, Helvetica, Courier" color=#000000>
  <NOBR>Price: <font color=#990>$6.99</font></b></NOBR><br>
</td>
<td bgcolor="#FFFFFF" width="51%">
  <a href="../81332713233407"> <em>The Hunt for Red October</em></a> <br>
  Tom Clancy; Hardcover</b>
  <font size=2 face="Verdana, Helvetica, Courier" color=#000000>
  <NOBR>Price: <font color=#990>$18.99</font></b></NOBR><br>
</td>

An Internet browser has no trouble understanding how to format and display this information to the end user. While viewing the page in a browser, the end user has no trouble understanding what the data means (the shopping cart contains two books, one for $6.99 and the other for $18.99).

How do we parse this document to extract the item number and other information, such as the price? In theory, we could use a trial-and-error approach to build a parser for this particular document. Perhaps we could fine-tune the following algorithm so that it can process the shopping cart HTML page and extract the price of a book:

Look for the string NOBR>Price:
Skip past the font declaration (<font ..>).
The characters before the next font declaration contain the price.
Ignore the currency symbol in the price character string.
Convert the price string into a numeric price variable.
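
The five steps above can be sketched in Python as follows. This is only an illustration of the trial-and-error approach, not a recommended parser: the function name extract_prices and the constant CART_HTML are hypothetical, and the fragment holds just the price portions of the shopping cart markup shown earlier.

```python
# Price portions of the two table cells from the shopping cart page.
CART_HTML = (
    '<NOBR>Price: <font color=#990>$6.99</font></b></NOBR><br> '
    '<NOBR>Price: <font color=#990>$18.99</font></b></NOBR><br>'
)


def extract_prices(html):
    """Return every price found by following the five steps literally."""
    prices = []
    marker = "NOBR>Price:"
    pos = html.find(marker)                    # Step 1: find the marker string
    while pos != -1:
        font_open = html.find("<font", pos)    # Step 2: skip past <font ..>
        start = html.find(">", font_open) + 1
        end = html.find("</font", start)       # Step 3: price ends at next font tag
        if font_open == -1 or end == -1:
            break                              # Layout is not what the steps assume
        price_text = html[start:end].strip()
        price_text = price_text.lstrip("$")    # Step 4: ignore the currency symbol
        prices.append(float(price_text))       # Step 5: convert to a number
        pos = html.find(marker, end)
    return prices


print(extract_prices(CART_HTML))
```

The sketch depends entirely on the exact label text and tag order. If the label were changed from "Price:" to "Our price:", for instance, the same call would find no marker and return an empty list.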

Unfortunately, we cannot guarantee that our parser will keep working if the vendor makes even minor changes to the Web site, or that it will not be confused by similar pages. More important, we have no guarantee that our parser will work with another vendor's HTML pages. Furthermore, important contextual information is hard to decipher. For example, what is the identifier (the order ID) for this shopping cart? Is it the number contained in the href attribute?

An XML document, on the other hand, contains information in a format that can be readily parsed by an application. An XML document might express a shopping cart using this type of syntax:

<Order orderNumber="81332713233407">
  <LineItem>
    <Title>Debt of Honor</Title>
    <Author>Tom Clancy</Author>
    <BookType>Paperback</BookType>
    <Price>$6.99</Price>
  </LineItem>
  <LineItem>
    <Title>The Hunt for Red October</Title>
    <Author>Tom Clancy</Author>
    <BookType>Hardcover</BookType>
    <Price>$18.99</Price>
  </LineItem>
</Order>

Clearly, this syntax is simpler to parse with a program and will produce more predictable results. An application can process and validate information from this document with confidence.
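
As a rough illustration of how little code is needed, the following Python sketch reads the order with the standard library's xml.etree.ElementTree module. The names ORDER_XML and read_order are hypothetical, and the choice of Decimal for prices is an assumption made here to avoid floating-point rounding when adding currency amounts.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

ORDER_XML = """\
<Order orderNumber="81332713233407">
  <LineItem>
    <Title>Debt of Honor</Title>
    <Author>Tom Clancy</Author>
    <BookType>Paperback</BookType>
    <Price>$6.99</Price>
  </LineItem>
  <LineItem>
    <Title>The Hunt for Red October</Title>
    <Author>Tom Clancy</Author>
    <BookType>Hardcover</BookType>
    <Price>$18.99</Price>
  </LineItem>
</Order>
"""


def read_order(xml_text):
    """Return the order number and a list of line items as dictionaries."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("LineItem"):
        items.append({
            "title": item.findtext("Title"),
            "author": item.findtext("Author"),
            "book_type": item.findtext("BookType"),
            # Strip the currency symbol before converting to a number.
            "price": Decimal(item.findtext("Price").lstrip("$")),
        })
    return root.get("orderNumber"), items


order_number, items = read_order(ORDER_XML)
total = sum(item["price"] for item in items)
print(order_number, len(items), total)
```

Each value is looked up by name rather than by its position relative to formatting tags, so the order ID that was ambiguous in the HTML version is simply the orderNumber attribute of the Order element.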

Notice that XML uses the begin tag...end tag construct in a manner similar to HTML. XML data is contained inside user-defined tags. For example, <Title> is the begin tag and </Title> is the matching end tag. Every XML document must conform to these and other requirements in order to be classified as well formed, or syntactically correct.
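
One way to see what well formed means in practice is to hand a parser a fragment whose end tag is missing. The sketch below is a minimal check using Python's xml.etree.ElementTree, which raises ParseError on input that is not well formed. The helper name is_well_formed and the two sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Every begin tag has a matching end tag.
GOOD = "<LineItem><Title>Debt of Honor</Title></LineItem>"

# The Title element is never closed, so the tags are mismatched.
BAD = "<LineItem><Title>Debt of Honor</LineItem>"

print(is_well_formed(GOOD), is_well_formed(BAD))
```

This check covers syntax only. It says nothing about whether the document uses the tags an application expects.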

Tags can be nested as data elements inside other tags. Observe how the LineItem tag in our example contains each of these tags: Title, Author, BookType, and Price.
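
The nesting can also be inspected from a program. In this short Python sketch (again using xml.etree.ElementTree; the variable names are illustrative only), iterating over the LineItem element yields its nested child elements in document order.

```python
import xml.etree.ElementTree as ET

LINE_ITEM_XML = """\
<LineItem>
  <Title>Debt of Honor</Title>
  <Author>Tom Clancy</Author>
  <BookType>Paperback</BookType>
  <Price>$6.99</Price>
</LineItem>
"""

line_item = ET.fromstring(LINE_ITEM_XML)

# Iterating over an element visits its direct children only.
child_tags = [child.tag for child in line_item]
print(child_tags)
```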