The Syntactical Rules of XML | Speed Up Your Site: Web Site Optimization

XML imposes stricter rules on XHTML documents than HTML does. The main differences between XHTML and HTML documents are as follows:

XML documents must be well-formed.
Attribute values must be fully qualified and quoted.
Markup (element and attribute names) must be lowercase.
Script and style elements must conform to the stricter #PCDATA format.
Fragment identifiers must be of type id, not name.

Rule #1: Well-Formed Documents

XML documents are by definition well-formed. No, that's not a shapely set of tags; it means essentially that all elements must be closed and properly nested.

There are other requirements of well-formed XML documents, but these are the two that cause the most problems for authors and browsers. So instead of this:

 <p><a href="http://"><em>Extreme</a></em> HTML Optimization

Do this:

 <p><a href="http://"><em>Extreme</em></a> HTML Optimization</p>
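A quick way to see the nesting rule in action is to hand both versions to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this illustration, and the check covers only well-formedness, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if markup parses as a single well-formed XML element."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Overlapping tags: </a> arrives before the <em> inside it is closed,
# and the <p> is never closed at all.
bad = '<p><a href="http://"><em>Extreme</a></em> HTML Optimization'

# Properly nested, with every element closed.
good = '<p><a href="http://"><em>Extreme</em></a> HTML Optimization</p>'

print(is_well_formed(bad))   # False
print(is_well_formed(good))  # True
```

An HTML browser would quietly repair the first version; an XML parser stops at the first mismatched tag.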

Close All Tags

Unlike HTML, all elements in XHTML documents must be closed. Non-empty tags like <p></p> and <div></div> must have matching closing tags. So instead of this:

 <p>This paragraph has no end.
 <p>I'm invalid XHTML.

Do this:

 <p>This paragraph has some closure.</p>
 <p>This is valid XHTML.</p>

This need for closure extends to empty tags, which can lead a lonely existence. Empty tags like <br> and <img> should be closed by a forward slash like this <br/> or <img src="t.gif"/> to create a self-closing tag. The W3C recommends that you include a space before the slash to allow older browsers to ignore the slash, as shown here:

 <br />

 <img src="t.gif" width="1" height="1" alt="" />

To avoid compatibility problems, the W3C recommends that for empty elements you use the space/trailing slash method rather than a closing tag.

An empty paragraph could be written as <p /> and an empty data cell could be written as <td />. This abbreviated form can confuse HTML-based browsers, however. Either use <p> </p> or omit these spacing hacks entirely and use CSS. Here's a list of all the empty HTML tags expressed in XHTML transitional form (the starred elements [*] are not allowed in strict XHTML):

<area />
<base />
<basefont /> *
<br />
<col />
<frame /> *
<hr />
<img />
<input />
<isindex /> *
<link />
<meta />
<param />
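If you are converting existing HTML, the list above can drive a simple search-and-replace. The following is a minimal sketch, not a full converter: the function name close_empty_tags is hypothetical, it assumes tag names are already lowercase, and it assumes no attribute value contains a literal > character.

```python
import re

# Empty elements from the list above (XHTML transitional).
EMPTY_TAGS = ("area", "base", "basefont", "br", "col", "frame", "hr",
              "img", "input", "isindex", "link", "meta", "param")

# Match an empty-element tag with or without a trailing slash.
# \b keeps "col" from matching "colgroup" and "base" from matching "basefont".
_EMPTY_TAG_RE = re.compile(
    r"<(" + "|".join(EMPTY_TAGS) + r")\b([^<>]*?)\s*/?>"
)


def close_empty_tags(markup: str) -> str:
    """Rewrite <br>, <br/>, <img ...> and friends in the space/slash form."""
    return _EMPTY_TAG_RE.sub(r"<\1\2 />", markup)


print(close_empty_tags('<br>'))
print(close_empty_tags('<img src="t.gif" width="1" height="1" alt="">'))
```

Tags that are already in the recommended form pass through unchanged, so the function can be run more than once on the same file.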

Netscape 6: No Sloppy Code Allowed

Improperly nested or orphaned elements can cause rendering problems for some browsers, particularly Netscape 6.x. Standards-compliant browsers like Netscape 6 rely on properly nested and closed elements to apply CSS rules and JavaScript. When we updated our JavaScript news flipper for Webreference.com's front page, we found that Netscape 6 did not properly execute dynamically written external JavaScript with improperly nested code. ^[5]

^[5] Andrew King, "JavaScripting Netscape 6: No More Sloppy Code" [online], (Darien, CT: Jupitermedia Corporation, 2001), available from the Internet at http://www.webreference.com/programming/javascript/netscape6/.

Rule #2: Quote All Attributes

In HTML, you can omit quotation marks from certain attributes. In XHTML, all attribute values must be quoted. So instead of this:

 <img src="t.gif" width=1 height=1>

Do this:

 <img src="t.gif" width="1" height="1" alt="" />

Rule #3: Don't Minimize Those Attributes

In HTML, you can minimize certain Boolean-like attributes like this:

 <option selected>The Chosen One</option>

XML does not support attribute minimization. The value of all attribute-value pairs, like checked and compact, must be fully qualified. For example:

 <option selected="selected">The Properly Chosen One</option>

No dangling attributes are allowed. See how well-organized XHTML is? This well-formedness and lack of ambiguity are what give XML-based documents their power. If XHTML had a bedroom, it would be immaculate.

Note that some older browsers can choke on fully qualified attributes, although HTML 4-compliant browsers don't have this problem. One optimization technique is to use defaults whenever possible (that is, the first option in a select menu is selected by default), or omit attributes entirely. Here is a list of these self-referential attributes (the starred attributes [*] are not allowed in strict XHTML):

checked="checked"
compact="compact" *
declare="declare"
defer="defer"
disabled="disabled"
ismap="ismap"
nowrap="nowrap" *
multiple="multiple"
nohref="nohref"
noresize="noresize" *
noshade="noshade" *
readonly="readonly"
selected="selected"
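One way to expand minimized attributes in existing markup is to let an HTML parser read the tags and write them back out in full. The sketch below uses Python's standard html.parser module, which reports a minimized attribute with a value of None. The class and function names are invented for this example, and it rewrites start tags only; a real converter would also have to emit text, end tags, and the trailing slash on empty elements.

```python
from html import escape
from html.parser import HTMLParser


class StartTagRewriter(HTMLParser):
    """Collects start tags re-serialized with quoted, fully qualified attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # A minimized attribute such as "selected" arrives with value None;
            # XHTML wants the attribute name repeated as its value.
            if value is None:
                value = name
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.tags.append("<" + " ".join(parts) + ">")


def expand_start_tags(markup):
    """Return the start tags found in markup, rewritten in XHTML attribute style."""
    parser = StartTagRewriter()
    parser.feed(markup)
    parser.close()
    return parser.tags


print(expand_start_tags('<option selected>The Chosen One</option>'))
print(expand_start_tags('<select multiple disabled size=3>'))
```

Because the parser also lowercases tag and attribute names and the serializer always quotes values, the same pass takes care of unquoted values such as size=3.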

Rule #4: Higher Court Says Lowercase

XHTML tags and attributes must be lowercase because XML is case-sensitive. XHTML documents will not validate without lowercase markup. The character set of XML is ISO 10646, which makes a distinction between upper- and lowercase characters. Accordingly, all three XHTML DTDs define elements and attributes using lowercase letters. So instead of this:

 <TITLE>The New York Times on the Web</TITLE>

Do this:

 <title>The New York Times on the Web</title>

To be compliant, even your style sheet fragment identifiers should be lowercase. Also make your id values lowercase for consistency.
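To see what case sensitivity means in practice, the sketch below feeds a few variations to Python's standard xml.etree.ElementTree parser. It is only an illustration: this parser checks well-formedness and does not validate against the XHTML DTDs, and the XHTML namespace is left out to keep the lookups short.

```python
import xml.etree.ElementTree as ET

# Mixed case: to an XML parser, <TITLE> and </title> are two different names,
# so the element is never closed and the document is not well-formed.
try:
    ET.fromstring("<TITLE>The New York Times on the Web</title>")
    mixed_case_ok = True
except ET.ParseError:
    mixed_case_ok = False

# Consistent uppercase is well-formed, but the element is named "TITLE",
# so a lookup for the lowercase name defined in the DTDs finds nothing.
head = ET.fromstring("<head><TITLE>The New York Times on the Web</TITLE></head>")
upper_found = head.find("title") is not None

# Lowercase markup matches the name the DTDs define.
head = ET.fromstring("<head><title>The New York Times on the Web</title></head>")
lower_found = head.find("title") is not None

print(mixed_case_ok, upper_found, lower_found)  # False False True
```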

Rule #5: Handle Script and Style Differences with Care

Style sheets and especially scripts must be handled with care when you are switching from HTML to XHTML. XHTML changes the content type of script and style elements from unparsed character data in HTML to parsed character data (CDATA to #PCDATA). This seemingly minor change, caused by the stricter way XML handles files, can mean major changes to style sheets and especially scripts.

#PCDATA, or parsed character data, interprets the less-than sign (<) and ampersand (&) as the start of markup, instead of processing them unparsed as HTML does. The sequences (]]>) and (--) are also prohibited inside #PCDATA blocks. To make matters worse, XML parsers can also silently remove the contents of comments.
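Both effects are easy to reproduce with an XML parser. The sketch below uses Python's standard xml.etree.ElementTree, which in its default configuration discards comments; other XML processors may keep them, so treat the first result as one possible behavior rather than a guarantee. The helper name parses and the go() function inside the script strings are placeholders.

```python
import xml.etree.ElementTree as ET

# Old-style "hidden" style sheet: the rules sit inside a comment.
hidden = ET.fromstring(
    '<style type="text/css"><!-- code { color: red; } --></style>'
)
print(repr(hidden.text))  # None: the comment, and the rules in it, are gone

# The same rules without the comment wrapper survive parsing.
plain = ET.fromstring('<style type="text/css">code { color: red; }</style>')
print(repr(plain.text))


def parses(markup):
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A bare "<" or "&" in script content is read as the start of markup.
print(parses("<script>if (a < b) { go(); }</script>"))   # False
print(parses("<script>if (a && b) { go(); }</script>"))  # False
```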

Thus, the traditional practice of surrounding embedded style sheets or scripts with comments to make them backward compatible with very old browsers won't work with XHTML files. But fear not, gentle reader; there is a solution.

Embedded Style Sheets and XHTML

With style sheets, this restriction is usually not a problem. You can create most style sheets without using these characters, and embed them in your page like this:

 <style type="text/css">
 code {
     color: red;
     font-family: monospace;
     font-weight: bold;
 }
 </style>

Note that comments are not included here. For very old browsers, the text within style tags may display. You can do one of three things:

Conditionally include embedded style sheets
Ignore older browsers
Use external style sheets

The link element is the traditional way of associating an external style sheet with a document:

 <link rel="stylesheet" type="text/css" href="/global.css" title="global styles" />

Because Netscape 4 can choke on some newer CSS commands, the link element can sometimes be a problem. As a workaround, some authors import style sheets into their documents because Netscape 4 ignores this unrecognized CSS2 command. For example:

 <style type="text/css">@import url("/global.css");</style>

These "at rules" act like CSS2 filters, in the way dynamically written JavaScript 1.2 SRC statements do for some DHTML-compatibility techniques. Another option is to layer your styles with basic Netscape 4-friendly styles in a linked style sheet and more advanced styles in an imported style sheet.

External Style Sheets

External style sheets have some definite advantages. With one small sub-1K CSS style sheet, you can style and lay out your entire site, instead of embedding layout markup in the form of tables or embedded CSS within each file. The CSS file is cached after the first time it loads. Embedding your style sheet can, however, save one HTTP request for high-traffic pages. At Webreference.com, we use a hybrid approach by embedding a stripped-down style sheet within the home page and linking to an external style sheet everywhere else. For more details on linking to external style sheets see Chapter 7, "CSS Optimization."

Embedded Scripts in XHTML?

Embedding JavaScript inside XHTML files is another matter. In theory, you could rewrite your code to avoid the < and & symbols prohibited by XHTML's stricter #PCDATA requirement (for example, x < y becomes y > x, and (x && y) becomes !(!x || !y)), but this can be a hassle for larger scripts. To embed scripts with these special characters, the XHTML specification recommends that you either escape them (which HTML browsers then misinterpret) or wrap your script within a CDATA section, essentially forcing XHTML to behave like HTML:

 <script type="text/javascript">
 <![CDATA[
     ... unescaped script content ...
 ]]>
 </script>
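As a rough check of what an XML processor sees, the sketch below parses a CDATA-wrapped script and an entity-escaped one with Python's standard xml.etree.ElementTree. The script content is a made-up one-liner; the point is only that both forms deliver the same characters to the XML side.

```python
import xml.etree.ElementTree as ET

# Inside the CDATA section, "<" and "&" arrive as literal characters.
script = ET.fromstring(
    '<script type="text/javascript"><![CDATA[\n'
    'if (a < b && c) { go(); }\n'
    ']]></script>'
)
print(script.text.strip())  # if (a < b && c) { go(); }

# The other option: escape the characters as entities.
escaped = ET.fromstring(
    '<script type="text/javascript">'
    'if (a &lt; b &amp;&amp; c) { go(); }'
    '</script>'
)
print(escaped.text == script.text.strip())  # True
```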

These CDATA sections are recognized by XML processors, which act like HTML-based browsers and process the sections as unparsed data. HTML-based browsers such as Internet Explorer 5 for the Macintosh do not, however, recognize the CDATA command and will produce a script error. There are three solutions to XHTML's fussier handling of embedded scripts:

Conditionally include recoded JavaScript.
Ignore older browsers (not recommended now).
Use external JavaScripts.

You can recode your JavaScripts to exclude problem characters (<, &, ]]>, and --), and embed them without comments inside a script tag.

Or just do what most people do: link to an external JavaScript and avoid the problem entirely:

 <script src="/gomenu.js" type="text/javascript"></script>

JavaScript optimization techniques are covered in more detail in Chapter 9, "Optimizing JavaScript for Download Speed," and Chapter 10, "Optimizing JavaScript for Execution Speed."

Rule #6: Fragment Identifiers: name and id, Please

In XHTML, fragment identifiers are handled differently from HTML. In XHTML, the id attribute replaces the name attribute. URI references that end in fragment identifiers (such as #review) do not refer to elements with a name attribute; they refer instead to elements with an id attribute. In CSS, the id attribute is also used to refer to HTML elements. Some existing HTML browsers don't support the id attribute; therefore, you should supply both the id and name attributes to ensure forward and backward compatibility (for example, <a id="review" name="review">...</a>). Alternatively, you can ignore older browsers and use only id attributes for forward compatibility.

The data type of name and id attributes also has been changed from CDATA to NMTOKEN. To make your fragment identifiers backward compatible, use id and/or name attributes that match the pattern [A-Za-z][A-Za-z0-9:_.-]* (that is, a string that starts with a letter, "A" through "Z" or "a" through "z", followed by zero or more letters, digits, colons, underscores, periods, or hyphens).
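The pattern can be applied directly as a regular expression. Here is a small sketch using Python's re module; the function name is_safe_fragment_id and the sample values are invented for illustration.

```python
import re

# Pattern quoted in the text for backward-compatible fragment identifiers.
SAFE_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_safe_fragment_id(value: str) -> bool:
    """Return True if the whole string matches the backward-compatible pattern."""
    return SAFE_ID.fullmatch(value) is not None


for candidate in ("review", "sec-2.1", "2nd-review", "my review", ""):
    print(repr(candidate), is_safe_fragment_id(candidate))
```

Using fullmatch rather than match matters here: match would accept "my review" because the leading "my" satisfies the pattern on its own.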