7.1 XHTML | XML in a Nutshell, Third Edition

XHTML is an official W3C recommendation. It defines an XML-compatible version of HTML, or rather it redefines HTML as an XML application instead of as an SGML application. Just looking at an XHTML document, you might not even realize that there's anything different about it. It still uses the same <p>, <li>, <table>, <h1>, and other tags you're familiar with. Elements and attributes have the same, familiar names they have in HTML. The syntax is still basically the same.

The difference is not so much what's allowed but what's not allowed. <p> is a valid XHTML tag, but <P> is not. <table border="0" width="515"> is legal XHTML; <table border=0 width=515> is not. A paragraph prefixed with a <p> and suffixed with a </p> is legal XHTML, but a paragraph that omits the closing </p> tag is not. Most existing HTML documents require substantial editing before they become well-formed and valid XHTML documents. However, once they are valid XHTML documents, they are automatically valid XML documents that can be manipulated with the same editors, parsers, and other tools you use to work with any XML document.
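
As a rough illustration of what an XML parser accepts, the following sketch feeds three small fragments to Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the fragments are made up for this example. The check covers well-formedness only: an uppercase <P> tag would pass it, because rejecting that requires validation against an XHTML DTD, which this parser does not perform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Quoted attribute values and a closed paragraph: accepted by an XML parser.
legal = '<table border="0" width="515"><tr><td><p>Hello</p></td></tr></table>'

# Unquoted attribute values: tolerated in legacy HTML, rejected as XML.
unquoted = '<table border=0 width=515><tr><td><p>Hello</p></td></tr></table>'

# Missing </p> end-tag: also rejected as XML.
unclosed = '<table border="0" width="515"><tr><td><p>Hello</td></tr></table>'

for name, markup in [("legal", legal), ("unquoted", unquoted), ("unclosed", unclosed)]:
    print(name, is_well_formed(markup))
```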

7.1.1 Moving from HTML to XHTML

Most of the changes required to turn an existing HTML document into an XHTML document involve making the document well-formed. For instance, given a legacy HTML document, you'll probably have to make at least some of these changes to turn it into XHTML:

Add missing end-tags like </p> and </li>.
Rewrite elements so that they nest rather than overlap. For example, change <p><em>an emphasized paragraph</p></em> to <p><em>an emphasized paragraph</em></p>.
Put double or single quotes around attribute values. For example, change <p align=center> to <p align="center">.
Add values (which are the same as the name) to all minimized Boolean attributes. For example, change <input type="checkbox" checked> to <input type="checkbox" checked="checked">.
Replace any occurrences of & or < in character data or attribute values with &amp; and &lt;. For instance, change A & P to A &amp; P and <a href="http://www.google.com/search?client=googlet&q=Java%20XML"> to <a href="http://www.google.com/search?client=googlet&amp;q=Java%20XML">.
Make sure the document has a single root html element.
Change empty elements like <hr> to <hr /> or <hr></hr>.
Add hyphens to comments so that <! this is a comment> becomes <!-- this is a comment -->.
Encode the document in UTF-8 or UTF-16, or add an XML declaration that specifies in which character set it is encoded.
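
The escaping rule for & and < in the list above is easy to get wrong by hand. As a minimal sketch, Python's standard xml.sax.saxutils.escape function performs the replacement, and parsing the result shows that the original characters come back out. The surrounding <p> fragment and the link text "search" are invented for the example; the URL is the one used above.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw_text = "A & P"
raw_href = "http://www.google.com/search?client=googlet&q=Java%20XML"

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
# It does not touch quote characters, which is fine here because
# neither string contains any.
text = escape(raw_text)
href = escape(raw_href)

# Build a small fragment from the escaped pieces and parse it as XML.
fragment = f'<p>{text} <a href="{href}">search</a></p>'
element = ET.fromstring(fragment)

print(fragment)
print(element.find("a").get("href"))
```

The parser reports the unescaped values, so the escaping changes only the serialized form of the document, not the data it carries.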

XHTML doesn't merely require well-formedness; it also requires validity. In order to create a valid XHTML document, you'll need to make these changes as well:

Add a DOCTYPE declaration to the document pointing to one of the three XHTML DTDs.
Make all element and attribute names lowercase.
Adjust the markup so that the document validates against the DTD; for example, by eliminating nonstandard elements like marquee, adding required attributes like the alt attribute of img, or moving child elements out from inside elements where they're not allowed, such as a blockquote inside a p.

In addition, the XHTML specification imposes a couple of requirements that, strictly speaking, are not required for either well-formedness or validity. However, they do make parsing XHTML documents a little easier. These requirements are:

The root element of the document must be html.
There must be a DOCTYPE declaration that uses a PUBLIC ID to identify one of the three XHTML DTDs.

Finally, if you wish, you may (but do not have to) add an XML declaration or an xml-stylesheet processing instruction to the prolog of your document.

Example 7-1 shows an HTML document from the O'Reilly web site that exhibits many of the validity problems you'll find on the Web today. In fact, this is a much neater page than most. Nonetheless, not all attribute values are quoted. The noshade attribute of the HR element doesn't even have a value. There's no document type declaration. Tags are a mix of upper- and lowercase, mostly uppercase. The DD elements are missing end-tags, and there's some character data inside the second definition that's not part of a DT or a DD.

Example 7-1. A typical HTML document

 <HTML><HEAD>
   <TITLE>O'Reilly Shipping Information</TITLE>
 </HEAD>
 <BODY BGCOLOR="#ffffff" VLINK="#0000CC" LINK="#990000" TEXT="#000000">
 <table border=0 width=515>
 <tr>
 <td>
 <IMG SRC="/www/graphics_new/generic_ora_header_wide.gif" BORDER=0>
 <H2>U.S. Shipping Information </H2>
 <HR size="1" align=left noshade>
 <DL>
 <DT> <B>UPS Ground Service (Continental US only -- 5-7 business days):</B></DT>
 <DD>
 <PRE>
 $  5.95 - $ 49.99 ......................... $ 4.50
 $ 50.00 - $ 99.99 ......................... $ 6.50
 0.00 - 9.99 ......................... $ 8.50
 0.00 - 9.99 ......................... .50
 0.00 - 9.99 ......................... .50
 0.00 - 9.99 ......................... .50
 </PRE>
 <DT> <B>Federal Express:</B></DT>
 (Shipping within 24 hours of receipt of order by O'Reilly)
 <DD>
 <PRE>
 <EM>1 or 2 books</EM>:
 Economy 2-day ............................. $ 8.75
 Overnight Standard (Afternoon Delivery) ... .75
 Overnight Priority (Morning Delivery) ..... .50
 </PRE>
 </DL>
 <b>Alaska and Hawaii:</b> add  to Federal Express rates.
 <P>
 <A HREF="int-ship.html"><b>International Shipping Information</b></A>
 <P>
 <CENTER>
 <HR SIZE="1" NOSHADE>
 <FONT SIZE="1" FACE="Verdana, Arial, Helvetica">
 <A HREF="http://www.oreilly.com/"> <B>O'Reilly Home</B></A> <B>  </B>
 <A HREF="http://www.oreilly.com/sales/bookstores"> <B>O'Reilly Bookstores</B></A> <B>  </B>
 <A HREF="http://www.oreilly.com/order_new/"> <B>How to Order</B></A> <B>  </B>
 <A HREF="http://www.oreilly.com/oreilly/contact.html"> <B>O'Reilly Contacts<BR></B></A>
 <A HREF="http://www.oreilly.com/international/"> <B>International</B></A> <B>  </B>
 <A HREF="http://www.oreilly.com/oreilly/about.html"> <B>About O'Reilly</B></A> <B>  </B>
 <A HREF="http://www.oreilly.com/affiliates.html"> <B>Affiliated Companies</B></A><p>
 <EM>&copy; 2000, O'Reilly Media, Inc.</EM>
 </FONT>
 </CENTER>
 </td>
 </tr>
 </table>
 </BODY>
 </HTML>

Example 7-2 shows this document after it's been converted to XHTML. All the previously noted problems, and a few more besides, have been fixed. A number of deprecated presentational attributes, such as the size and noshade attributes of hr, had to be replaced with CSS styles. We've also added the necessary document type and namespace declarations. This document can now be read by both HTML and XML browsers and parsers.

Example 7-2. A valid XHTML document

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="generator" content="HTML Tidy, see www.w3.org" />
 <style type="text/css">
   body      {background-color: #FFFFFF; color: #000000}
   a:visited {color: #0000CC}
   a:link    {color: #990000}
 </style>
 <title>O'Reilly Shipping Information</title>
 </head>
 <body>
 <table border="0" width="515">
 <tr>
 <td><img src="/www/graphics_new/generic_ora_header_wide.gif" style="border-width: 0" alt="O'Reilly"/>
 <h2>U.S. Shipping Information</h2>
 <hr style="height: 1px; text-align: left"/>
 <dl>
 <dt><b>UPS Ground Service (Continental US only -- 5-7 business days):</b></dt>
 <dd>
 <pre>
 $  5.95 - $ 49.99 ......................... $ 4.50
 $ 50.00 - $ 99.99 ......................... $ 6.50
 0.00 - 9.99 ......................... $ 8.50
 0.00 - 9.99 ......................... .50
 0.00 - 9.99 ......................... .50
 0.00 - 9.99 ......................... .50
 </pre>
 </dd>
 <dt><b>Federal Express:</b></dt>
 <dd>(Shipping within 24 hours of receipt of order by O'Reilly)</dd>
 <dd>
 <pre>
 <em>1 or 2 books</em>:
 Economy 2-day ............................. $ 8.75
 Overnight Standard (Afternoon Delivery) ... .75
 Overnight Priority (Morning Delivery) ..... .50
 </pre>
 </dd>
 </dl>
 <b>Alaska and Hawaii:</b> add  to Federal Express rates.
 <p><a href="int-ship.html"><b>International Shipping Information</b></a></p>
 <div style="font-size: xx-small; font-family: Verdana, Arial, Helvetica;
             text-align: center">
 <hr style="height: 1px"/>
 <a href="http://www.oreilly.com/"><b>O'Reilly Home</b></a> <b></b>
 <a href="http://www.oreilly.com/sales/bookstores"><b>O'Reilly Bookstores</b></a> <b></b>
 <a href="http://www.oreilly.com/order_new/"><b>How to Order</b></a> <b></b>
 <a href="http://www.oreilly.com/oreilly/contact.html"><b> O'Reilly Contacts<br /> </b></a>
 <a href="http://www.oreilly.com/international/"><b> International</b></a> <b></b>
 <a href="http://www.oreilly.com/oreilly/about.html"><b>About O'Reilly</b></a> <b></b>
 <a href="http://www.oreilly.com/affiliates.html"><b>Affiliated Companies</b></a></div>
 <p style="font-size: xx-small;
           font-family: Verdana, Arial, Helvetica"><em>&copy; 2000, O'Reilly Media, Inc.</em></p>
 </td>
 </tr>
 </table>
 </body>
 </html>

Making all these changes can be quite tedious for large documents or collections of many documents. Fortunately, there's an open source tool that can do most of the work for you. Dave Raggett's Tidy, http://tidy.sourceforge.net, is a C program that has been ported to most major operating systems and can convert some pretty nasty HTML into valid XHTML. For example, to convert the file bad.html to good.xml, you would type:

 % tidy --output-xhtml yes bad.html > good.xml

Tidy fixes as much as it can and warns you about what it can't fix so you can fix it manually; for instance, it tells you that a required alt attribute is missing from an img element.

7.1.2 Three DTDs for XHTML

XHTML comes in three flavors, depending on which DTD you choose:

Strict

This is the W3C's recommended form of XHTML. This includes all the basic elements and attributes such as p and class. However, it does not include deprecated elements and attributes such as applet and center. It also forbids the use of presentational attributes such as the body element's bgcolor, vlink, link, and text. These capabilities are provided by CSS instead. Strict XHTML is identified with this DOCTYPE declaration:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Example 7-2 uses this DTD.

Transitional

This is a looser form of XHTML for when you can't easily do without deprecated elements and attributes, such as applet and bgcolor. It is identified with this DOCTYPE declaration:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Frameset

This is the same as the transitional DTD except that it also allows frame-related elements, such as frameset and frame. It is identified with this DOCTYPE declaration:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

All three DTDs use the same http://www.w3.org/1999/xhtml namespace. You should choose the strict DTD unless you've got a specific reason to use another one.
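
The root-element, DOCTYPE, and namespace requirements can be checked mechanically. The following is only a sketch using Python's standard xml.dom.minidom; the function name xhtml_flavor, the lookup table, and the sample document are assumptions made for this example. It reads the public identifier from the DOCTYPE but does not fetch the DTD or validate against it.

```python
from xml.dom import minidom

# Public identifiers of the three XHTML 1.0 DTDs, mapped to a short label.
XHTML_FLAVORS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "frameset",
}
XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"


def xhtml_flavor(markup: str):
    """Return 'strict', 'transitional', or 'frameset', or None if the
    document fails one of the checks made here."""
    doc = minidom.parseString(markup)
    root = doc.documentElement
    # The root element must be html in the XHTML namespace.
    if root.tagName != "html" or root.namespaceURI != XHTML_NAMESPACE:
        return None
    # There must be a DOCTYPE whose PUBLIC ID names one of the three DTDs.
    if doc.doctype is None:
        return None
    return XHTML_FLAVORS.get(doc.doctype.publicId)


sample = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test</title></head>
<body><p>Hello</p></body>
</html>"""

print(xhtml_flavor(sample))
```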

7.1.3 Browser Support for XHTML

Many web browsers, especially Internet Explorer 5.0 and earlier and Netscape 4.79 and earlier, deal inconsistently with XHTML. Certainly they don't require it, accepting as they do such a wide variety of malformed, invalid, and out-and-out mistaken HTML. However, beyond that they do have some problems when they encounter certain common XHTML constructs.

7.1.3.1 The XML declaration and processing instructions

Some browsers display processing instructions and the XML declaration inline. These should be omitted if possible.

Few, if any, browsers recognize or respect the encoding declaration in the XML declaration. Furthermore, many browsers won't automatically recognize UTF-8 or UCS-2 Unicode text. If you use a non-ASCII character set, you should also include a meta element in the header identifying the character set. For example:

 <meta http-equiv="Content-type"
       content='text/html; charset=UTF-8'></meta>
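
For contrast with browser behavior, an XML parser does honor the encoding declaration. The sketch below, which uses Python's standard xml.etree.ElementTree, parses the same paragraph three ways; the helper name paragraph_text and the sample text are invented for the example.

```python
from typing import Optional
import xml.etree.ElementTree as ET


def paragraph_text(data: bytes) -> Optional[str]:
    """Parse the bytes as XML and return the root element's text,
    or None if the parser rejects the document."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


body = "<p>caf\u00e9</p>"

# UTF-8 needs no declaration; an XML parser assumes it by default.
utf8_doc = body.encode("utf-8")

# Latin-1 bytes preceded by an XML declaration that names the encoding.
declared_doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("iso-8859-1")
)

# The same Latin-1 bytes with no declaration: read as UTF-8, they are malformed.
undeclared_doc = body.encode("iso-8859-1")

for name, doc in [("utf-8", utf8_doc), ("declared", declared_doc),
                  ("undeclared", undeclared_doc)]:
    print(name, paragraph_text(doc))
```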

7.1.3.2 Empty elements

Browsers deal inconsistently with both forms of empty element syntax. That is, some browsers understand <hr/> but not <hr></hr> (typically rendering it as two horizontal lines rather than one), while others recognize <hr></hr> but not <hr/> (typically omitting the horizontal line completely). The most consistent rendering seems to be achieved by using an empty-element tag with an optional attribute such as class or id; for example, <hr class="empty" />. There's no real reason for the class attribute here, except that its presence keeps browsers from choking on the />. Any other attribute the DTD allows would serve equally well.

On the other hand, if a particular instance of an element happens to be empty, but not all instances of the element have to be empty (for instance, a p that doesn't contain any text), you should use two tags like <p></p> rather than one empty-element tag <p/>.
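
To an XML parser the two forms of an empty element are the same thing, so the choice only matters when the document is written back out for browsers. The following sketch shows both points with Python's standard xml.etree.ElementTree; the variable names are invented, and short_empty_elements is a serializer option that is assumed to be available (Python 3.4 or later).

```python
import xml.etree.ElementTree as ET

# Both spellings produce an identical element once parsed.
single = ET.fromstring('<hr class="empty" />')
paired = ET.fromstring('<hr class="empty"></hr>')
same = (single.tag, single.attrib, single.text, len(single)) == (
    paired.tag, paired.attrib, paired.text, len(paired))

# By default the serializer writes an empty-element tag with a space
# before the slash, the form suggested above for elements like hr.
hr_markup = ET.tostring(paired, encoding="unicode")

# For an element that merely happens to be empty, such as a p with no
# text, short_empty_elements=False writes a start-tag and an end-tag.
p_markup = ET.tostring(
    ET.fromstring("<p/>"), encoding="unicode", short_empty_elements=False
)

print(same, hr_markup, p_markup)
```

Note that the option applies to the whole serialization call, so a real converter would have to treat always-empty elements such as hr and br separately from elements such as p.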

7.1.3.3 Entity references

Embedded scripts often contain reserved characters like & or <, so the document that contains them is not well-formed. However, most JavaScript and VBScript interpreters won't recognize &amp; or &lt; in place of the operators they represent. If the script can't be rewritten without these operators (for instance, by changing a less-than comparison to a greater-than comparison with the arguments flipped), then you should move to external scripts instead of embedded ones.
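
The effect of a bare < inside a script element can be seen by handing a few variants to an XML parser. This is a sketch using Python's standard xml.etree.ElementTree; the script bodies, the swap function they call, and the file name compare.js are all hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A bare < in the script makes the document malformed.
embedded = '<script type="text/javascript">if (a < b) { swap(a, b); }</script>'

# Flipping the arguments expresses the same test with >, which is
# allowed in character data.
flipped = '<script type="text/javascript">if (b > a) { swap(a, b); }</script>'

# Moving the code to an external file avoids the problem entirely.
external = '<script type="text/javascript" src="compare.js"></script>'

for name, markup in [("embedded", embedded), ("flipped", flipped),
                     ("external", external)]:
    print(name, parses(markup))
```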

Furthermore, most non-XML-aware browsers don't recognize the &apos; predefined entity reference. You should avoid this if possible and just use the literal ' character instead. The only place this might be a problem is inside attribute values that are enclosed in single quotes because they contain double quotes. However, most browsers do recognize the &quot; entity reference for the " character, so you can enclose the attribute value in double quotes and escape the double quotes that are part of the attribute value as &quot;.
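
One way to follow this advice when generating markup is always to emit double-quoted attribute values and escape any embedded double quotes. The sketch below builds on Python's standard xml.sax.saxutils.escape; the helper name double_quoted and the sample title are invented for the example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def double_quoted(value: str) -> str:
    """Return the value as a double-quoted attribute literal.

    &, <, and > are escaped as usual, " becomes &quot;, and a literal '
    is left alone so that &apos; never appears in the output.
    """
    return '"' + escape(value, {'"': "&quot;"}) + '"'


title = 'The "Nutshell" series isn\'t new'
attribute = double_quoted(title)
print(attribute)

# Round trip: an XML parser recovers the original string.
element = ET.fromstring(f"<a title={attribute}>link</a>")
print(element.get("title") == title)
```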

7.1.3.4 Other unsupported features

There are a few other subtle differences between how browsers handle XHTML and how XHTML expects to be handled. For instance, XHTML allows character references and CDATA sections although almost no current browsers understand these constructs. However, you're unlikely to encounter these when converting from HTML to XHTML, and you can generally do without them if you're writing XHTML from scratch.
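
For readers who have not met these constructs, the following sketch shows how an XML parser treats them, using Python's standard xml.etree.ElementTree. The two fragments are made up for the example: a numeric character reference stands for the character it names, and a CDATA section lets & and < appear unescaped.

```python
import xml.etree.ElementTree as ET

# &#169; is a character reference for the copyright sign.
with_reference = ET.fromstring("<p>&#169; 2000, O'Reilly Media, Inc.</p>")

# Inside a CDATA section, < and & are ordinary characters.
with_cdata = ET.fromstring("<p><![CDATA[if (a < b && c) { run(); }]]></p>")

print(with_reference.text)
print(with_cdata.text)
```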

Mozilla, Opera 5.0 and later, and Netscape 6.0 and later can parse and display valid XHTML without any difficulties and without requiring page authors to jump through these hoops. Safari and Internet Explorer 5.5 and later can mostly display it as long as the pages are mislabeled as text/html. However, both get confused if the pages are labeled with the correct MIME type application/xhtml+xml. Regardless, since many users have not upgraded their browsers to the level XHTML requires, user-friendly web designers will be jumping through these hoops for some time to come.