2.4 Contrasting XML against HTML

The fundamental distinction between HTML and XML is that HTML defines data presentation (or data rendering), whereas XML defines the meaning of the data. This is best illustrated with an example. Let us start with some sample HTML code that could be found within a document containing contact information for an individual:

 <p><b>Mr. Anura Guruge</b>
 <br>Principal
 <br>i-net guru
 <br>4 Varney Point Road, Left
 <br>Gilford, NH 03249
 <br>USA
 <br>anu@wownh.com</p>

In this HTML example the <p> tag represents the start of a new paragraph, the <b> tag calls for bold text, and the <br> tag specifies a line break. Note that, as mentioned earlier, HTML does not require or define a </br> tag. This type of asymmetrical tag usage would not be permissible in XML, with the exception of empty tags, which in reality include a closing / at the end of the tag (e.g., <br/>). In all other instances XML expects all open tags to be explicitly closed. This HTML code, when rendered by a Web browser, would show Mr. Anura Guruge in bold with the rest of the information underneath it, on separate lines per the breaks dictated by the <br> tags, as illustrated in Figure 2.6.

Figure 2.6: Example HTML code rendered by a Web browser.
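The difference in tag discipline can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the abbreviated fragments and the helper name is_well_formed are illustrative assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

# HTML-style fragment (abbreviated): the <br> tags are never closed.
html_style = "<p><b>Mr. Anura Guruge</b><br>Principal<br>i-net guru</p>"

# Same content with each empty element closed XML-style as <br/>.
xml_style = "<p><b>Mr. Anura Guruge</b><br/>Principal<br/>i-net guru</p>"


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html_style))  # False: </p> arrives while <br> is still open
print(is_well_formed(xml_style))   # True: every element is explicitly closed
```

A Web browser renders both fragments identically; only an XML parser objects to the first one.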

The problem with this HTML example is that it only describes the layout for the data contained in the document. There is no description as to what the data mean. A person could interpret what the data mean, but it would be difficult for an application to determine what all these fields mean unless it was explicitly programmed to look for contact information in this type of format. A good example of the potential for ambiguity with HTML documents can be seen by doing a search for a keyword such as bill using any of the popular Web search engines, such as Google.com. Such a search will return thousands of entries covering people's names (e.g., Bill Gates), laws (e.g., Bill of Rights), theater playlists, and monetary notes.

With XML, the contact information will be organized between tags that set out to describe the data. Thus, one possible XML representation for some of these data might look like:

 <contact_info>
   <name>
     <salutation>Mr.</salutation>
     <first-name>Anura</first-name>
     <last-name>Guruge</last-name>
   </name>
   <title>Principal</title>
   <company>i-net guru</company>
   <address>
     <street>4 Varney Point Road, Left</street>
     <city>Gilford</city>
     <state>NH</state>
     <zip>03249</zip>
   </address>
   <e-mail>anu@wownh.com</e-mail>
 </contact_info>

The first thing to note is that the XML representation shown here is totally arbitrary and just reflects my preferences and foibles. You could describe these data in other ways using tags with different names, true to the extensible nature of XML. In marked contrast to HTML, XML does not come with its own set of fixed tags and elements that has to be used by everybody to cover all applications. That would obviously be much too restrictive and impractical.
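To see what the descriptive tags buy an application, the contact record above can be read by element name rather than by position on the page. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative, and the element names are simply the ones chosen in this particular representation.

```python
import xml.etree.ElementTree as ET

contact_xml = """
<contact_info>
  <name>
    <salutation>Mr.</salutation>
    <first-name>Anura</first-name>
    <last-name>Guruge</last-name>
  </name>
  <title>Principal</title>
  <company>i-net guru</company>
  <address>
    <street>4 Varney Point Road, Left</street>
    <city>Gilford</city>
    <state>NH</state>
    <zip>03249</zip>
  </address>
  <e-mail>anu@wownh.com</e-mail>
</contact_info>
""".strip()

root = ET.fromstring(contact_xml)

# Each field is located by the name of its element, not by its layout.
full_name = f"{root.findtext('name/first-name')} {root.findtext('name/last-name')}"
city = root.findtext("address/city")
email = root.findtext("e-mail")

print(full_name)  # Anura Guruge
print(city)       # Gilford
print(email)      # anu@wownh.com
```

The lookups work only because the program and the document agree on the element names, which is exactly the agreement that a DTD or schema is meant to record.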

XML lets you define the elements you want as you go along to best fit what you are trying to achieve. Thus, with XML you can design your own customized markup languages for limitless different types of documents. You can have elements specific to organizations and industries. Consequently, there are already numerous discipline- and industry-specific variants of XML, such as Chemical Markup Language (CML) for representing molecular information; Mathematical Markup Language (MathML) for describing mathematical notation; and Human Markup Language (HumanML) to enable consistent description of human emotions, intentions, gestures, and so forth. The key thing to remember is that you will need a DTD or XML schema that describes what is expected and acceptable within a specific XML document, especially if you want that XML document to be processed by applications written by others. This is unfortunately the downside of not having a specific, predetermined set of markup tags à la HTML.

A quick example with MathML will help reinforce the need for a mechanism (such as a DTD), outside and independent of the original XML document per se, to describe what the tags are supposed to represent within that document and how they are intended to be structured. Let us look at a possible MathML depiction of the structure of a relatively simple mathematical expression: (a + b)². Given that it is the actual structure of the expression that is being articulated, it will be described as a base (a + b) with a superscript of 2. This can be realized with the following lines of MathML:

 <msup>
   <mfenced>
     <mrow>
       <mi>a</mi>
       <mo>+</mo>
       <mi>b</mi>
     </mrow>
   </mfenced>
   <mn>2</mn>
 </msup>

In this MathML example, the <msup> element indicates that a base and superscript notation is being expressed and the <mfenced> element denotes the use of parentheses (or brackets), whereas the <mrow> element signifies a horizontal row of characters. The <mi> and <mo> elements refer to identifiers and operators, respectively, whereas the <mn> element indicates a specific number, in this instance 2. Although the overall flow of MathML may be intuitive enough for some to determine what is being expressed, it should be clear by now that a total, unambiguous interpretation of MathML, like other dialects of XML, is only possible if the recipient is aware of what the elements are supposed to represent.

The application specificity of XML can be further highlighted by noting that the previous MathML representation is not the only way in which the (a + b)² expression could have been described using MathML. The MathML lines shown focused on the structure of the mathematical expression, that is, its presentation (or what it looks like). MathML also enables one to describe mathematical expressions in terms of their content rather than their structure. The content representation would look like this, where <apply> refers to an operation, which is typically represented by empty elements such as <power/> and <plus/>, applied to the elements that follow it.

 <apply>
   <power/>
   <apply>
     <plus/>
     <ci>a</ci>
     <ci>b</ci>
   </apply>
   <cn>2</cn>
 </apply>

Although this representation may be somewhat easier to follow than the previous scheme, it is still fair to note that unequivocal interpretation would only be possible with an explanation of what the elements are supposed to represent (e.g., that <ci> must in this instance represent identifiers).

In marked contrast, all it would take with HTML to convey (a + b)² would be:

 (a + b)<sup>2</sup>  <br>

where the <br>, in this instance at least, is optional. Note that the HTML representation is primarily visual; that is, the base (a + b) appears as just a string of text with no indication as to what it represents mathematically. This visual representation would make programmatic interpretation somewhat more difficult, and in effect would require the recipient application to include the type of logic found within a compiler (or interpreter) to parse and then evaluate the mathematical expression. In the end, it still boils down to the fact that HTML does not attempt to convey meaning, whereas XML's very raison d'être is to impart context and meaning to data.
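By way of comparison, the content-oriented MathML shown earlier can be evaluated with a short recursive walk over the element tree, with no expression parser required. The sketch below is illustrative only: the function name evaluate and the variable bindings are assumptions, it handles just the <plus/> and <power/> operators used in the example, and it ignores the XML namespace that a full MathML document would carry.

```python
import xml.etree.ElementTree as ET

content_mathml = """
<apply>
  <power/>
  <apply>
    <plus/>
    <ci>a</ci>
    <ci>b</ci>
  </apply>
  <cn>2</cn>
</apply>
""".strip()


def evaluate(node, variables):
    """Evaluate a tiny subset of content MathML against variable bindings."""
    if node.tag == "cn":                      # numeric literal
        return float(node.text)
    if node.tag == "ci":                      # identifier, looked up by name
        return variables[node.text.strip()]
    if node.tag == "apply":                   # first child is the operator
        operator, *operands = list(node)
        values = [evaluate(child, variables) for child in operands]
        if operator.tag == "plus":
            return sum(values)
        if operator.tag == "power":
            base, exponent = values
            return base ** exponent
        raise ValueError(f"unsupported operator: {operator.tag}")
    raise ValueError(f"unsupported element: {node.tag}")


result = evaluate(ET.fromstring(content_mathml), {"a": 1, "b": 2})
print(result)  # 9.0, i.e. (1 + 2) squared
```

Doing the same from the HTML string (a + b)<sup>2</sup> would first require tokenizing and parsing the text, which is the compiler-style logic referred to above.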

The lack of predefined markup in XML means that it is invariably possible to come up with different hierarchies and schemes to describe the same set of data. The availability of attributes adds another level of flexibility and extensibility. Consequently, the contact information example shown earlier could be represented in quite a few different XML hierarchies, with all of them, within reason, being equally germane and valid. Thus, another possible XML scheme to describe contact information could be:

 <contact_info>
   <entry name="Anura Guruge">
     <address>
       <street>4 Varney Point Road, Left</street>
       <city>Gilford</city>
       <state>NH</state>
       <zip>03249</zip>
     </address>
   </entry>
 </contact_info>

At this juncture, note that an XML attribute is made up of a name-value pair that is attached to an element's start tag. The name and value making up a particular name-value pair will be separated from each other by an equals sign and optional whitespace. The value of a name-value pair is enclosed in either single or double quotation marks. It is possible to have multiple attributes per element, with each attribute consisting of a valid name-value pair and separated from the next by whitespace, as in the following example:

 <wife name="Deanna Guruge" birthday="June 27" phone="555-2293"/>
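The attribute rules just described can be observed directly with a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names and the shortened test strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Attributes are name-value pairs in the start tag, separated by whitespace.
element = ET.fromstring(
    '<wife name="Deanna Guruge" birthday="June 27" phone="555-2293"/>'
)

# The parser exposes them as a mapping from attribute name to value.
print(element.attrib)
print(element.get("phone"))  # 555-2293

# Single quotation marks are equally acceptable around a value.
single_quoted = ET.fromstring("<wife name='Deanna Guruge'/>")

# A value with no quotation marks at all is rejected as not well formed.
try:
    ET.fromstring("<wife name=Deanna/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(unquoted_accepted)  # False
```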

When contrasting and comparing HTML with XML, another key factor that becomes immediately obvious is that XML does not provide any predefined formatting guidelines as to how data should be presented. Data presentation per se is not an issue if the recipient of an XML document is an application. In such an application-to-application programmatic scenario, all that would be required is that the data in the XML document, which would be in the form of a string of text, are structured correctly and are valid. However, as can be seen with even the simple XML examples used so far in this chapter, it certainly helps people if they can view an XML document with some amount of formatting to highlight the structure of the elements.

It should also be apparent by now that a standard browser would be able to interpret the XML structure and display the data portion of an XML document. This is indeed the case. Microsoft Internet Explorer 5 (and greater) and Netscape 4.x (and greater) can both display XML documents, albeit bereft of any significant esthetic formatting. The XML in Figure 2.3(b), which shows an Excel spreadsheet as an XML document, is, as can be seen, rendered using Internet Explorer (where the - symbols that appear at the very start of certain lines enable one to expand or contract the display of that element by clicking on it). Stylized and customized presentation of XML documents is not possible with this basic browser-based display mode. In order to display XML documents in visually pleasing formatted mode, one has to use either Cascading Stylesheets (CSS) or eXtensible Stylesheet Language (XSL), as discussed in Section 2.6. XHTML is not an option in this context, given that it is an XML-compliant version of HTML (i.e., HTML that adheres to XML's rules), as opposed to a formatting mechanism for XML.