Anatomy of an XHTML Document | Speed Up Your Site: Web Site Optimization

The W3C has made learning XHTML easy for HTML authors. XHTML looks like HTML, but it's neater and has a few more mandatory tags thrown in due to its XML heritage. Let's first look at a minimal XHTML document, and then we'll break down the most important parts:

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
     <title>A Minimal XHTML Document</title>
 </head>
 <body>
     <p>Hello <a href="http://world.org/">World.org</a>.</p>
     <hr />
 </body>
 </html>

The first three statements are different from a minimal HTML document, but the rest looks just like garden-variety HTML in lowercase with all tags closed. The first three statements declare that this is an XML 1.0-based XHTML document encoded in UTF-8 (an 8-bit encoding of Unicode) that uses the XHTML 1.0 Strict DTD (the DOCTYPE is wrapped over three lines), belongs to the XHTML namespace, and is written in English.

The second (DOCTYPE) and third (html namespace) statements are mandatory in XHTML documents, while the first (prologue) is optional. The prologue would presumably be included using conditional SSI for modern XML-savvy browsers (see Chapter 17, "Server-Side Techniques"), or left out to save 39 bytes. The following sections explain each of these declarations.
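The "all tags closed" requirement is what makes the document well-formed XML. As a rough illustration, the sketch below feeds the minimal document to Python's standard-library XML parser, then feeds it a copy in which the `<hr />` has been replaced with an HTML-style `<hr>`. The helper name `is_well_formed` is made up for this example, and the check covers well-formedness only, not validity against the DTD.

```python
import xml.etree.ElementTree as ET

# The minimal XHTML document from the text, as UTF-8 bytes.
MINIMAL_XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>A Minimal XHTML Document</title>
</head>
<body>
    <p>Hello <a href="http://world.org/">World.org</a>.</p>
    <hr />
</body>
</html>
"""


def is_well_formed(markup: bytes) -> bool:
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# The same document with an unclosed, HTML-style <hr>.
UNCLOSED = MINIMAL_XHTML.replace(b"<hr />", b"<hr>")

print(is_well_formed(MINIMAL_XHTML))  # True
print(is_well_formed(UNCLOSED))       # False
```

The parser does not download the DTD here, so this says nothing about whether the document is valid Strict XHTML.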

Pull the Prologue

All XML documents begin with declarations that tell the browser how to interpret them. The XML declaration, or prologue, precedes the DOCTYPE and namespace declarations and defines a document's type or markup language. For example:

 <?xml version="1.0" encoding="UTF-8"?>

This funny-looking declaration tells the browser three things:

The type of document (XML)
Which version of XML the document uses (1.0)
The document's character encoding (UTF-8, an 8-bit encoding of Unicode)

Unfortunately, some older browsers choke on the <?xml prologue and display a blank page. Internet Explorer 4 and 5 for the Mac and Netscape Navigator 4 behave badly when they come across a page with an XML prologue because they don't recognize the <?xml syntax. The prologue can also occasionally cause trouble with server-side parsing engines, like PHP.
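One way to keep the prologue away from these browsers is to decide on the server whether to emit it. The following is only a sketch of that idea in Python rather than SSI: the function names are invented for illustration, and the User-Agent substrings are assumptions about how these browsers typically identified themselves, so they should be checked against real request logs before being relied on.

```python
# The prologue is 38 characters plus a newline.
XML_PROLOGUE = '<?xml version="1.0" encoding="UTF-8"?>\n'


def chokes_on_prologue(user_agent: str) -> bool:
    """Guess whether a browser is one of those named in the text.

    Hypothetical helper; the substring rules below are assumptions.
    """
    ua = user_agent.lower()
    # Internet Explorer 4 and 5 for the Mac.
    if "mac" in ua and ("msie 4" in ua or "msie 5" in ua):
        return True
    # Netscape Navigator 4: Mozilla/4.x without the "compatible" token
    # that other browsers add when they borrow the Mozilla name.
    if ua.startswith("mozilla/4") and "compatible" not in ua:
        return True
    return False


def render_page(xhtml: str, user_agent: str) -> str:
    """Prepend the prologue only for browsers assumed to handle it."""
    prologue = "" if chokes_on_prologue(user_agent) else XML_PROLOGUE
    return prologue + xhtml


page = "<!DOCTYPE html ...>\n<html>...</html>"
print(render_page(page, "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"))
```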

Wisely, the W3C has made the prologue optional. You can either omit the prologue from your XHTML pages or conditionally include it for newer browsers. But what if you want to use a character set other than the default UTF-8 or UTF-16, and you don't want to use a prologue? You can use a meta http-equiv tag instead:

 <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />

Even better, you can save some bandwidth by configuring your server to send the character set as part of the HTTP Content-Type header.
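How you set that header depends on your server. As one hedged illustration, here is a tiny WSGI application in Python that declares the character set in the response header instead of in the markup. The `app` function and the sample page are invented for this sketch; a production site would more likely set the header in its web server configuration.

```python
from wsgiref.util import setup_testing_defaults

# Sample page content (an assumption for this sketch): a short Japanese
# greeting, with no meta http-equiv tag and no XML prologue.
PAGE = "<p>\u3053\u3093\u306b\u3061\u306f</p>"


def app(environ, start_response):
    """Serve PAGE as EUC-JP and say so in the Content-Type header."""
    body = PAGE.encode("euc_jp")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=EUC-JP"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Call the application directly with a fake request to inspect the headers.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


environ = {}
setup_testing_defaults(environ)
chunks = app(environ, fake_start_response)
print(captured["headers"]["Content-Type"])  # text/html; charset=EUC-JP
```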

Now that we've got everybody talking the same language, let's look at how the DOCTYPE and namespace declarations work to define the grammar and vocabulary of the language of XML documents.

DOCTYPE Declaration

Both HTML and XHTML use Document Type Declarations (DOCTYPE) to supply markup declarations that provide a grammar for a class of documents. This grammar is the Document Type Definition (DTD). The DOCTYPE can point to an external definition of markup declarations, contain an internal one, or both. For example:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">

This statement declares that the root element of your document is html, as defined by the DTD whose public identifier is "-//W3C//DTD XHTML 1.0 Strict//EN". The browser either already knows this public DTD, or it can follow the URI to locate the DTD.
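To see the three parts of that declaration separately, you can ask an XML parser for them. This sketch uses Python's standard `xml.dom.minidom`; the surrounding document is a stand-in written for the example, and the parser only reads the identifiers without fetching the DTD.

```python
from xml.dom import minidom

# The DOCTYPE from the text, followed by a stand-in document body.
DOCUMENT = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>DOCTYPE parts</title></head>
<body><p>Hello</p></body>
</html>"""

doctype = minidom.parseString(DOCUMENT).doctype

print(doctype.name)      # root element name: html
print(doctype.publicId)  # public identifier: -//W3C//DTD XHTML 1.0 Strict//EN
print(doctype.systemId)  # system identifier (the DTD URI): DTD/xhtml1-strict.dtd
```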

Although it is optional in HTML, the DOCTYPE declaration must be included before the "root" HTML element in XHTML documents. Note that some validators give errors for relative DTD URIs. Use absolute URIs instead to ensure forward compatibility and portability. For example:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Keep in mind that relative DTD URIs stored locally would save at least 31 bytes, but would not be as portable as absolute URIs.

DOCTYPE Switching

As you learned in Chapter 4, "Advanced HTML Optimization," modern browsers switch their rendering behavior based on the DOCTYPE you specify. IE6 Win and NS6/Mozilla use DOCTYPE switching to choose between standards and quirks modes. The latest versions of Mozilla/Netscape add a third "almost standards" option to their switching arsenal. IE5 Mac switches into standards mode if you specify a DTD URI for any DOCTYPE: strict, transitional, or frameset. For strictly authored XHTML documents, use the full form of the DOCTYPE. For all others, use the abbreviated form or conditionally include the DTD URI for all but IE5 Mac. For more on almost standards mode, see http://www.mozilla.org/docs/web-developer/quirks/doctypes.html.

DTDs

Both HTML and XHTML use DTDs to define their constituent parts. The DTD defines rules that constrain the logical structure of a class of XML documents. The DTD lists all legal markup and specifies where and how that markup can be included in a document. This element syntax or grammar defines the semantics of the elements and their attributes. For example:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This DOCTYPE specifies that this document is authored to the XHTML 1.0 Transitional DTD, which allows deprecated elements.

Documents that match the constraints of the DTD are said to be valid, or error free. The three DTDs for XHTML correspond to the ones defined by HTML 4.01:

 <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

The only differences between the XHTML and HTML DTDs stem from the differences between SGML and XML; XML imposes stricter syntactical constraints. As in HTML, the first DTD is for strict adherence to the XHTML standard, without any deprecated presentational elements like font. The second DTD is transitional and includes deprecated elements and attributes for legacy HTML code. The third DTD is for XHTML documents that use frames.
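If you generate pages from templates, it can help to keep the three identifier pairs in one place so they are never mistyped. The lookup table and the `doctype_for` helper below are hypothetical names for a minimal sketch of that idea; the identifiers themselves are the ones listed above.

```python
# Public and system identifiers for the three XHTML 1.0 DTDs.
XHTML10_DTDS = {
    "strict": (
        "-//W3C//DTD XHTML 1.0 Strict//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    ),
    "transitional": (
        "-//W3C//DTD XHTML 1.0 Transitional//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
    ),
    "frameset": (
        "-//W3C//DTD XHTML 1.0 Frameset//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd",
    ),
}


def doctype_for(flavor: str) -> str:
    """Build the full-form DOCTYPE for 'strict', 'transitional', or 'frameset'."""
    public_id, system_id = XHTML10_DTDS[flavor]
    return f'<!DOCTYPE html PUBLIC "{public_id}" "{system_id}">'


print(doctype_for("transitional"))
```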

As I noted previously, using the strict DTD can be faster. Browsers choose among two or three rendering modes, based on their DOCTYPE switching criteria. The transitional parsers are necessarily more complex because they have to handle all of the deprecated tags and attributes in transitional XHTML. By using the strict DTD, you'll separate structure from presentation and behavior, and gain even more benefits from cached CSS files. You'll also be ready for future versions of XHTML that are based on strict XHTML 1.0.

Namespaces

An XML namespace is a collection of element types and attribute names, identified by a URI reference. Every conforming XHTML document must designate the XHTML namespace that it uses in the html root element with the xmlns attribute. Here's an example:

 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

In essence, this namespace defines the markup vocabulary for XHTML. The DTD defines the grammar of that vocabulary. Together they define the markup language used in your document so that browsers can more easily grok your code.
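A namespace-aware parser makes this concrete: it reports every element under a name that combines the namespace URI with the tag. The sketch below uses Python's standard `xml.etree.ElementTree`, which writes such names as `{uri}tag`; the short document is a stand-in written for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "<head><title>Namespaces</title></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)

# The default namespace on <html> applies to the element and its children.
print(root.tag)        # {http://www.w3.org/1999/xhtml}html
body = root[1]
print(body[0].tag)     # {http://www.w3.org/1999/xhtml}p

# lang has no namespace; xml:lang lives in the reserved XML namespace.
print(root.get("lang"))
print(root.get("{http://www.w3.org/XML/1998/namespace}lang"))
```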

Multiple Namespaces

What if you want to include elements and attributes from different document types? You can't combine multiple DTDs for a single document, but you can use multiple namespaces. Namespaces allow authors to use multiple "markup vocabularies" within the same document. Adding new sets of elements is as easy as pointing to another namespace. For example, if you want to include MathML inside an XHTML document, you can include an element with a namespace attribute, like this:

 <div xmlns="http://www.w3.org/1998/Math/MathML">
      <!-- math elements here -->
 </div>

An XML-compliant browser would use the http://www.w3.org/1998/Math/MathML namespace to find out that what follows is MathML, not XHTML. But what if you want to include multiple instances of MathML for multiple equations? Instead of including a namespace declaration for each equation, you can declare the MathML namespace at the beginning of your document and refer to its shorthand prefix later. So instead of this:

 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

Do this:

 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" xmlns:math="http://www.w3.org/1998/Math/MathML">

Then you can tag each equation div with the math namespace prefix, like this:

 <math:div>
      ...
 </math:div>

By assigning URIs to each element, namespaces disambiguate elements with the same name to help avoid element name collisions that would confuse XML applications. Namespaces codify your extensions to allow industry-wide data exchange.
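To illustrate how a parser keeps the two vocabularies apart, the sketch below declares the math prefix once on the root element and then pulls the XHTML paragraphs and the MathML equations out separately. It is an assumption-laden example: the document is invented, it uses the actual MathML elements `math` and `mi` rather than the `div` placeholder above, and the query prefixes `h` and `m` are arbitrary because only the namespace URIs matter.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# The MathML namespace is declared once, then reused via its prefix.
DOCUMENT = f"""<html xmlns="{XHTML_NS}" xmlns:math="{MATHML_NS}" xml:lang="en" lang="en">
<head><title>Two vocabularies</title></head>
<body>
  <p>First equation:</p>
  <math:math><math:mi>x</math:mi></math:math>
  <p>Second equation:</p>
  <math:math><math:mi>y</math:mi></math:math>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Prefixes used for searching; they need not match the document's prefixes.
ns = {"h": XHTML_NS, "m": MATHML_NS}

paragraphs = root.findall(".//h:p", ns)
equations = root.findall(".//m:math", ns)
identifiers = [mi.text for mi in root.findall(".//m:mi", ns)]

print(len(paragraphs), len(equations), identifiers)  # 2 2 ['x', 'y']
```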