Although many people think of XML as a data format, many of the important uses for XML are in layout. Of these, one of the most significant is XHTML, or the Extensible HyperText Markup Language. XHTML is the "XML-ized" version of HTML, cleaning up many of the sloppier features of HTML and creating a more standardized, more easily validated document format. Cascading Style Sheets (CSS), although not an XML format, is widely viewed as important for XHTML development. CSS is a formatting language that can be used with either HTML or XHTML. It is generally viewed as a cleaner replacement for the <font> tag and other similar devices that force a particular view. When used in combination with XHTML, the model is that the XHTML document carries all the content of the page, whereas CSS is used to format it. This chapter looks at these two sets of specifications, as well as some validation tools that help ensure your code is valid. In addition, this chapter looks at microformats, a relatively recent set of uses for both XHTML and CSS.
When people hear that XHTML is the XML version of HTML, the first question is usually, "Isn't HTML already XML?" or "What's wrong with HTML that it has to be XML-ized?" I hope that I'll be able to answer both these questions and more in this chapter. For those who are planning on skipping this chapter or who want the answers now, the answers are, "sort of, but not exactly" and "a few fairly major things."
Markup is information added to text to describe the text. In HTML and XHTML, these are the tags (for example, <b></b>) that are added around the text. However, markup isn't just HTML and its family. Rich Text Format (RTF) is another example of a markup language. The text, "This is bold, and this isn't" could be marked up in RTF as {\b\insrsid801189\charrsid801189 This is bold}{\insrsid13238650\charrsid13238650, and this isn't}. Other markup languages include TeX and ASN.1. Markup, therefore, is just a way of adding formatting and semantic information. Formatting information includes identifiers such as bold, italic, first level of heading, or beginning of a table. Semantic information includes identifiers such as beginning of a section, a list item or similar notations.
The idea of markup is quite old: separate the content from the description of that content. A number of implementations using this concept arose back in the stone ages of computing (the 1960s), including the work that led to Standard Generalized Markup Language (SGML). SGML was a strategy for defining markup. That is, you used SGML to define the tags and attributes that someone else could use to mark up a document. This notion was powerful, enabling the production of documents that could be rendered easily in a number of formats.
SGML begat HTML, and it was good. HTML was a markup language loosely defined on the concepts of SGML. It lifted the tagging concept, but simplified it greatly because HTML was intended solely as a means of displaying text on computer screens. Later versions attempted to increase the rigor of the standard, for example, creating a Document Type Definition (DTD, the format SGML used as the means of defining a markup language). HTML slowly evolved in a fairly organic fashion: first adding tags and then becoming a standard (4.01). Meanwhile, on an almost parallel track, SGML begat XML, and it was good. XML was an attempt to simplify SGML, creating a technology that provided many of the same capabilities of language definition. Although it wasn't necessarily inevitable, these two cousins decided to get together and produce an offspring, XHTML. XHTML has XML's eye for rigor: XHTML documents must be well-formed XML documents first, and rules around formatting are specific. However, XHTML still has HTML's looks and broad appeal.
Unfortunately, no one XHTML standard exists. In fact, there are currently six flavors or versions of XHTML:
- XHTML 1.0 Transitional: Intended to be a transitional move from HTML 4.01 to XHTML. This flavor included support for some of the newer features of XHTML, while retaining some of the older HTML features (such as <u>, <strike> or <applet>).
- XHTML 1.0 Frameset: Another transitional flavor that included support for HTML frameset tags.
- XHTML 1.0 Strict: The "real" XHTML. This version included strict rules (see the following section) for formatting the markup in a document.
- XHTML Basic: An attempt at creating the smallest possible implementation of XHTML. XHTML Basic is intended for mobile applications that are not capable of rendering complex documents or supporting the full extensibility of XHTML 1.1.
- XHTML 1.1: The current version of XHTML. This is an attempt at defining XHTML in a modular fashion, enabling the addition of new features through extension modules (for example, adding MathML or frameset support).
- XHTML 2.0: As of this writing, this is still a gleam in the eyes of the committee. It will likely end up being a major new version; it will also break compatibility with a number of XHTML documents. Because of this, I expect that it will be some time before it is in broad usage.
This chapter focuses mostly on XHTML 1.0 Strict and XHTML 1.1, primarily 1.1. The remaining current versions are primarily compatibility versions, meant to assist developers in migrating older code. XHTML 2.0 is still in the future, and even the planned broken compatibility may change before it becomes a standard.
The one main improvement of XHTML over HTML is in enforcement of what constitutes a valid document. XHTML requires that a document follow these rules:
- No overlapping elements: Although it was a horrid practice, some people wrote their HTML so that one element started before another was finished, or so that a tag closed before its child tag did. Even worse, some HTML editors created this kind of markup. The result was something that looked like the following:
<b>Bold<i>and italic</b></i>
As you can see, the bold tag (<b>) is closed before the child italics tag (<i>). Although most browsers were capable of interpreting this code, it did not lend itself to building a parse tree correctly. XHTML does not consider this valid.
- No unclosed elements: Some of the HTML elements, such as <br>, <hr>, and <img>, were generally used without closing tags. In XHTML, you must either add a close tag (such as </br>) or use the empty element form (<hr />) of the element. Note the space before the slash character. Although not absolutely necessary, it is highly recommended. For certain tags (such as a paragraph element or table cell) that are empty but that should contain information, do not use this form; instead include the close element, such as <p> </p>.
- All elements and attributes are written in lowercase: HTML is not case-sensitive regarding elements and attributes, therefore <table>, <TABLE> and <Table> are all equivalent. However, XML is case-sensitive, meaning that these three elements are different, and only one can be the real table element. Fortunately for my own personal style, all lowercase was defined as the standard, so the real element is <table>.
- All attributes are quoted: Another code formatting practice used by some authors and HTML editors is omitting the quotes around attribute values. The argument is that including quotes, as in <img src="http://some.url.com/image.png" />, adds two additional characters, bloating the document. Some users prefer the slightly less bandwidth-intensive <img src=http://some.url.com/image.png>. However, this practice (especially when the element is left unclosed, as shown here) makes it more difficult to parse the attribute correctly, and XHTML requires the quotes.
- All attributes require values: A few attributes are typically used standalone in HTML, such as the checked attribute for the Checkbox control or selected for options in a list.
<input type="checkbox" checked />
<select>
  <option selected>One</option>
  <option>Two</option>
</select>
In XHTML, attributes must have a value. Therefore, the correct way of writing these elements should be:
<input type="checkbox" checked="checked" />
<select>
  <option selected="selected">One</option>
  <option>Two</option>
</select>
- IDs are id: In later versions of HTML, two ways of naming elements co-existed; both id and name were used, and often both in the same document. This led to a great deal of confusion, because users thought each had a unique purpose or meaning. With XHTML, name is deprecated as a way of identifying elements such as a, img, form, and map (it remains valid on form controls such as input), and id should be used when naming elements (all lowercase).
- Script blocks should be wrapped: Because XHTML documents are primarily XML documents, normal XML rules apply to the content. Blocks such as CSS or JavaScript may include XML markers (such as <), possibly breaking the document. Because of this, these blocks should be wrapped in CDATA blocks (see Listing 3-1) to ensure they do not affect the validity of the XHTML. Better yet, use an external document and one of the tags that imports that file (see Listing 3-9 later in this chapter).
Listing 3-1: Using CDATA with embedded script
<script type="text/JavaScript">
<![CDATA[
//JavaScript content here
]]>
</script>
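Several of the rules above are plain XML well-formedness rules, so any XML parser can check them. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser; the helper name is_well_formed and the sample fragments are illustrative and not part of the chapter's code. It tests well-formedness only: rules that depend on the DTD (lowercase element names, id versus name) are not caught here and still need a validator.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments, one per rule discussed above.
SAMPLES = {
    "overlapping elements": "<p><b>Bold<i>and italic</b></i></p>",
    "unclosed element": "<p>line one<br>line two</p>",
    "unquoted attribute": "<p><img src=image.png /></p>",
    "attribute without value": '<p><input type="checkbox" checked /></p>',
    "script with bare <": (
        '<script type="text/javascript">if (a < b) { go(); }</script>'
    ),
    "script in CDATA": (
        '<script type="text/javascript">'
        "<![CDATA[ if (a < b) { go(); } ]]></script>"
    ),
    "corrected markup": (
        "<p><b>Bold<i>and italic</i></b><br />"
        '<input type="checkbox" checked="checked" /></p>'
    ),
}

for name, fragment in SAMPLES.items():
    verdict = "well-formed" if is_well_formed(fragment) else "rejected"
    print(f"{name}: {verdict}")
```

Only the CDATA-wrapped script and the corrected markup should pass; every other fragment breaks one of the rules and is rejected by the parser.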
The next major set of changes you need to make to convert your HTML pages to XHTML is to remove some of the deprecated HTML tags. XHTML 1.0 (especially the Transitional and Frameset varieties) still permits these elements, but they are invalid in future versions, including XHTML 1.1. (See the following table for more discussion of the deprecated elements.) Most of these elements were removed because they caused an intermixing of content and specific layout. The recommended method of adding layout is now with CSS, as you learn later in this chapter. See Listing 3-2 for a simple XHTML 1.1 file.
Deprecated Element | Replacement | Discussion |
---|---|---|
applet embed | object | Applet, object, and embed were all methods for including content such as Java Applets and ActiveX objects. Rather than maintain these three elements, the object element is used for embedding any external objects. |
dir menu | ul | Dir and Menu were little-used elements that provided much of the same functionality as unordered lists (ul). |
font basefont blockquote i strike center u | CSS | These elements enforced a particular view on the content of a page and merged the content with layout. This functionality is now superseded by CSS, and you should use that technology instead. Browsers (such as screen readers for the sight impaired) are free to ignore the CSS, if necessary, leaving the content usable. |
layer | CSS | A Netscape/Mozilla-specific tag that was used to create dynamic HTML pages. The functionality is roughly replaceable with div and span tags. |
isindex | input type= | This ancient tag (that I haven't seen for a while) was used to create a search field on a page. This should be replaced with a form containing search fields and "real" server-side search functionality. |
style (attribute) | CSS | With XHTML 1.1, the style attribute is also considered deprecated. Although it is not yet removed from the standard, it should be avoided. Instead, use id or class attributes and CSS to apply style to individual elements. |
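
When migrating a page, it can help to list which deprecated elements it still uses. The following is a rough sketch in Python using the standard library's ElementTree; the REPLACEMENTS dictionary covers only a subset of the table above, and the function name find_deprecated and the sample document are assumptions made for illustration. It expects input that is already well-formed XML.

```python
import xml.etree.ElementTree as ET

# A subset of the table above: deprecated element -> suggested replacement.
REPLACEMENTS = {
    "applet": "object",
    "embed": "object",
    "dir": "ul",
    "menu": "ul",
    "font": "CSS",
    "basefont": "CSS",
    "strike": "CSS",
    "center": "CSS",
    "u": "CSS",
    "layer": "CSS",
}


def find_deprecated(xhtml: str) -> list:
    """Return (element, replacement) pairs in document order."""
    found = []
    for element in ET.fromstring(xhtml).iter():
        # Drop the "{namespace}" prefix ElementTree adds to tag names.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in REPLACEMENTS:
            found.append((local_name, REPLACEMENTS[local_name]))
    return found


# Hypothetical sample page.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<center><font face="arial">Hello</font></center>'
    "<p>Plain paragraph</p>"
    "</body></html>"
)

for name, replacement in find_deprecated(SAMPLE):
    print(f"<{name}> should be replaced with {replacement}")
```

A report like this is only a starting point; deciding which CSS rule replaces each element still takes a human eye.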
Finally, to ensure your document is processed in the format you intend, you should include a reference to the DTD of the desired level of XHTML. This provides information to the browser or parser, which should then treat your document appropriately. The following table shows the expected DTD.
XHTML Level | DocType Declaration |
---|---|
XHTML 1.0 Transitional | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
XHTML 1.0 Frameset | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"> |
XHTML 1.0 Strict | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> |
XHTML Basic | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"> |
XHTML1.1 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> |
Listing 3-2: A simple XHTML 1.1 file
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Some Title</title>
  </head>
  <body>
    <p>Page Content</p>
  </body>
</html>
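
Because the DOCTYPE declares which level of XHTML a document claims to follow, a program can read it back. The sketch below assumes Python's standard xml.dom.minidom module, which exposes the declaration's public identifier without downloading the DTD; the function name xhtml_flavor and the lookup table are illustrative assumptions. It reports only what the document declares and does not validate the content against that DTD.

```python
from xml.dom import minidom

# Public identifiers from the DOCTYPE table above.
XHTML_FLAVORS = {
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "XHTML 1.0 Frameset",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML Basic 1.0//EN": "XHTML Basic",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}


def xhtml_flavor(document: bytes) -> str:
    """Report the XHTML level named in the document's DOCTYPE."""
    doctype = minidom.parseString(document).doctype
    if doctype is None:
        return "no DOCTYPE"
    return XHTML_FLAVORS.get(doctype.publicId, "unknown DOCTYPE")


# The document from Listing 3-2.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>Some Title</title></head>
  <body><p>Page Content</p></body>
</html>"""

print(xhtml_flavor(SAMPLE))
```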
How can you ensure your documents are valid? By validating them, of course. That seems to be a circular argument, doesn't it? A number of XHTML validation services and applications are available to ensure the documents you create are both well-formed and valid, most notably the W3C Validation service and Tidy.
Because the W3C is the standards body responsible for HTML and XHTML, it is fitting that it offers a service for validating XHTML documents. This service (see Figure 3-1) is available at http://www.validator.w3.org, and enables checking a document by URL, file upload, or text input.
Figure 3-1
Tidy is an application that was initially developed at the W3C, but later was taken over by the broader development community. It is a command-line application (see Figure 3-2 for some of the command-line arguments) that can validate a document, return a list of errors, or correct the errors. In addition, a number of wrappers are available that provide direct access to the functionality from the programming language of your choice.
Figure 3-2
The two most common uses for Tidy are to create new, compliant versions of your Web pages, and to clean up errors and formatting. Listing 3-3 shows an HTML file that contains a number of issues. Although this file would still render in a browser (see Figure 3-3), you can use Tidy to clean up its problems and convert the document to XHTML.
Figure 3-3
Listing 3-3: Not very valid HTML
<head>
<title>Lorem ipsum dolor sit amet, consectetuer adipiscing elit</title></head>
<body lang=EN-US BGCOLOR=white text=black link=blue vlink=purple>
<p><b><i>Lorem ipsum dolor sit amet</b></i>, consectetuer adipiscing elit. Suspendisse sit amet odio. Duis porta pulvinar arcu. Curabitur pellentesque, neque id hendrerit volutpat, ante nulla mattis lacus, sit amet varius augue orci a enim. Suspendisse ornare purus ac nunc. Maecenas cursus congue libero. Aliquam erat volutpat. Nulla interdum dui. Ut purus. Donec pellentesque lorem vitae purus. Pellentesque ultricies consectetuer nisl. Nulla facilisi. Etiam aliquam adipiscing sem. Nam metus ipsum, nonummy eget, vestibulum quis, fringilla non, nulla. Suspendisse placerat tempor tortor. Mauris tortor dolor, sollicitudin eget, gravida rhoncus, vestibulum vel, eros. Proin vitae nunc vel metus mattis viverra. Pellentesque at turpis vel quam laoreet dapibus. Maecenas interdum metus nec eros. Nam ut elit eu nisl ullamcorper tincidunt. Praesent faucibus pede in risus feugiat viverra.</p>
<hr>
<p><font face="arial" size=2>Integer vulputate nibh. Mauris convallis nisi vitae magna. Sed varius, velit eu pretium porta, enim tellus ornare ipsum, vel interdum nisi tellus vitae massa.</font></p>
<p>Maecenas imperdiet nunc sed ipsum.</p>
<li>Cras euismod, lorem et rhoncus placerat, felis nibh lobortis lorem, id eleifend felis eros rutrum dolor.
<li>Nunc euismod, nunc viverra porttitor imperdiet, nibh tellus convallis erat, sit amet laoreet neque nunc ac purus.</li>
</ul>
<Center>
<table border=1>
<tr>
<td width=197 valign=top style='width:2 padding:0in 5.4pt 0in 5.4pt'>
<p>Ut ut lectus</p>
<td width=197 valign=top style='width:2 border-left:none;padding:0in 5.4pt 0in 5.4pt'>
<p> Nunc velit dui, fermentum quis, condimentum viverra, adipiscing quis, nisl</p>
<td><p> Curabitur feugiat</p></tr><tr><td><p> Aliquam libero</p>
<td> <p> Maecenas at enim</p>
<td><p>Nunc non nulla a nulla molestie ornare©</p>
</table>
</CENTER>
</body>
In the preceding code, a number of errors are present in the HTML (such as a missing root html tag, missing close tags for the last tr, and so on). Also, a number of items that are valid HTML items are not valid in XHTML. For example, the hr tag is an empty tag; therefore, it should be written <hr />. In addition, many unquoted attributes are present, and the center tag is written in mixed case in one place and in all uppercase elsewhere.
Converting a document as shown in Listing 3-3 is not an uncommon task, but it can be quite difficult to do manually. HTML editing software and users have found just too many ways to hide bad code in Web pages. Running Tidy with the following command-line generates the list of warnings in Listing 3-4. As you can see, it detected many of the expected errors, as well as a few others.
tidy -o c:\temp\fixed.htm -f errors.txt -i -w 79 -c -b -asxhtml -utf8 Invalid.htm
Note: The options set are:
- -o c:\temp\fixed.htm: Output file is c:\temp\fixed.htm
- -f errors.txt: Send errors to errors.txt
- -i: Indent output
- -w 79: Wrap output to 79 characters or less per line
- -c: Replace deprecated font, center, and nobr tags with CSS
- -b: Strip out smart quotes, em dashes, and other formatting characters
- -asxhtml: Output should be XHTML
- -utf8: Output should be encoded as UTF-8
Many other command-line options exist. In addition, many other configuration settings alter the output of Tidy. See the documentation for more details. If you want a common set of parameters, it would be easier to create a configuration file for running Tidy. This is a text file, with the configuration elements listed one per line. With this in place, the previous command-line could be simplified to:
tidy -config myconfig.txt Invalid.htm
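
As a sketch of what such a configuration file might contain, the following Python snippet writes one equivalent to the earlier command line and builds the shortened command. The option names are the ones I recall from Tidy's configuration documentation and should be checked against the output of tidy -help-config for your version; the helper names render_config and tidy_command are illustrative, and the file is written to the system temporary directory only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

# Configuration options mirroring the flags used on the command line above.
# The option names are assumptions to verify with "tidy -help-config".
TIDY_OPTIONS = {
    "output-file": r"c:\temp\fixed.htm",
    "error-file": "errors.txt",
    "indent": "yes",
    "wrap": "79",
    "clean": "yes",
    "bare": "yes",
    "output-xhtml": "yes",
    "char-encoding": "utf8",
}


def render_config(options: dict) -> str:
    """Render options as one 'name: value' entry per line."""
    return "".join(f"{name}: {value}\n" for name, value in options.items())


def tidy_command(config_path, input_file: str) -> list:
    """Build the argument list for running Tidy with a configuration file."""
    return ["tidy", "-config", str(config_path), input_file]


config_text = render_config(TIDY_OPTIONS)
config_path = Path(tempfile.gettempdir()) / "myconfig.txt"
config_path.write_text(config_text, encoding="utf-8")

print(config_text)
print(" ".join(tidy_command(config_path, "Invalid.htm")))
```

The snippet only prints the command. Passing the same list to subprocess.run would execute it, assuming Tidy is installed and on the path.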
Listing 3-4 shows the result of running Tidy on the sample file.
Listing 3-4: Warnings generated
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 7 - Warning: replacing unexpected b by </b>
line 4 column 4 - Warning: replacing unexpected i by </i>
line 3 column 1 - Warning: <li> isn't allowed in <body> elements
line 21 column 2 - Warning: inserting implicit <ul>
line 25 column 1 - Warning: discarding unexpected </ul>
line 21 column 2 - Warning: missing </ul> before <center>
line 27 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML 4.01 Transitional
8 warnings, 0 errors were found!
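
If you process many files, you may want to summarize a report like this rather than read it by eye. The following is a small sketch in Python that assumes every message follows the "line N column M - Level: text" shape seen in Listing 3-4; other Tidy versions may format messages differently, and the names MESSAGE and parse_tidy_report are illustrative.

```python
import re
from collections import Counter

# Matches messages shaped like those in Listing 3-4.
MESSAGE = re.compile(r"^line (\d+) column (\d+) - (Warning|Error): (.*)$")


def parse_tidy_report(text: str) -> list:
    """Return (line, column, level, message) tuples; skip other lines."""
    entries = []
    for raw_line in text.splitlines():
        match = MESSAGE.match(raw_line.strip())
        if match:
            line_no, column, level, message = match.groups()
            entries.append((int(line_no), int(column), level, message))
    return entries


# Three messages and the Info line taken from Listing 3-4.
REPORT = """\
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 21 column 2 - Warning: inserting implicit <ul>
line 27 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML 4.01 Transitional
"""

entries = parse_tidy_report(REPORT)
counts = Counter(level for _, _, level, _ in entries)
print(counts["Warning"], "warnings;", counts["Error"], "errors")
for line_no, column, level, message in sorted(entries):
    print(f"{line_no}:{column} {level}: {message}")
```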
Although the cleaned document may not reflect all the intent of the original (an inappropriate change sometimes occurs), it should be much easier to clean up. Listing 3-5 shows the output of the previous code.
Listing 3-5: Cleaned XHTML output
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Windows (vers 1 September 2005), see www.w3.org" />
  <title>Lorem ipsum dolor sit amet, consectetuer adipiscing elit</title>
  <style type="text/css">
/*<![CDATA[*/
 body {
  background-color: white;
  color: black;
 }
 :link { color: blue }
 :visited { color: purple }
 div.c4 {text-align: center}
 td.c3 {width:2.05in;border:solid windowtext 1.0pt; border-left:none;padding:0in 5.4pt 0in 5.4pt}
 td.c2 {width:2.05in;border:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt}
 p.c1 {font-family: arial; font-size: 80%}
/*]]>*/
  </style>
</head>
<body lang="EN-US" xml:lang="EN-US">
  <p><b><i>Lorem ipsum dolor sit amet</i></b>, consectetuer adipiscing elit. Suspendisse sit amet odio. Duis porta pulvinar arcu. Curabitur pellentesque, neque id hendrerit volutpat, ante nulla mattis lacus, sit amet varius augue orci a enim. Suspendisse ornare purus ac nunc. Maecenas cursus congue libero. Aliquam erat volutpat. Nulla interdum dui. Ut purus. Donec pellentesque lorem vitae purus. Pellentesque ultricies consectetuer nisl. Nulla facilisi. Etiam aliquam adipiscing sem. Nam metus ipsum, nonummy eget, vestibulum quis, fringilla non, nulla. Suspendisse placerat tempor tortor. Mauris tortor dolor, sollicitudin eget, gravida rhoncus, vestibulum vel, eros. Proin vitae nunc vel metus mattis viverra. Pellentesque at turpis vel quam laoreet dapibus. Maecenas interdum metus nec eros. Nam ut elit eu nisl ullamcorper tincidunt. Praesent faucibus pede in risus feugiat viverra.</p>
  <hr />
  <p class="c1">Integer vulputate nibh. Mauris convallis nisi vitae magna. Sed varius, velit eu pretium porta, enim tellus ornare ipsum, vel interdum nisi tellus vitae massa.</p>
  <p>Maecenas imperdiet nunc sed ipsum.</p>
  <ul>
    <li>Cras euismod, lorem et rhoncus placerat, felis nibh lobortis lorem, id eleifend felis eros rutrum dolor.</li>
    <li>Nunc euismod, nunc viverra porttitor imperdiet, nibh tellus convallis erat, sit amet laoreet neque nunc ac purus.</li>
  </ul>
  <div class="c4">
    <table border="1">
      <tr>
        <td width="197" valign="top" class='c2'>
          <p>Ut ut lectus</p>
        </td>
        <td width="197" valign="top" class='c3'>
          <p> Nunc velit dui, fermentum quis, condimentum viverra, adipiscing quis, nisl</p>
        </td>
        <td>
          <p> Curabitur feugiat</p>
        </td>
      </tr>
      <tr>
        <td>
          <p> Aliquam libero</p>
        </td>
        <td>
          <p> Maecenas at enim</p>
        </td>
        <td>
          <p>Nunc non nulla a nulla molestie ornare(c)</p>
        </td>
      </tr>
    </table>
  </div>
</body>
</html>
For those less than comfortable with the command-line, Charles Reitzel created a Windows application to enable working visually with Tidy (see Figure 3-4). This is a handy utility if you have only a small amount of HTML to convert. For larger quantities, the command-line (or one of the code wrappers) is a better solution.
Figure 3-4
Just as with the command-line version, you can easily see the errors and warnings your document generates (see Figure 3-5). Double-clicking the warning or error selects the appropriate line in the edit window.
Figure 3-5
The functionality of Tidy has also been exposed through a number of language wrappers. This allows you to integrate the functionality into your own applications. Wrappers are available for COM, .NET, Java, Perl, Python, and many other languages. See the Tidy home page (http://www.tidy.sourceforge.net/) for the full list.
The included project is a simple text editor that includes the capability to run Tidy (using the .NET wrapper) on the content. It is intentionally simple, but shows how you can integrate the Tidy functionality directly in an application.
First, create a new Windows Forms project. The sample project contains three tabs. The first is an edit window, the second a read-only text box containing the tidied XHTML, and the last is a Web browser window for viewing the resulting content. Next, add a reference to the .NET wrapper (see Figure 3-6). If you receive an error while adding the reference, it may be because the TidyATL.dll is not registered (the .NET wrapper is actually a .NET wrapper of the COM wrapper). Register the TidyATL.dll file using the command-line regsvr32 tidyatl.dll and try adding the reference again.
Figure 3-6
Most of the code in the included project is involved in the menus and file handling. The only code that actually calls the Tidy wrapper is in the TidyText function (see Listing 3-6). This takes a block of HTML, processes it with Tidy, and returns the result (see Figure 3-7). Each of the command-line properties of Tidy is exposed in an enumeration (TidyOptionId). You use the SetOptBool, SetOptInt and SetOptValue methods to set the desired settings. Alternatively, you can load the settings from a configuration file. This file is simply a list containing one parameter per line, along with the value, in the format:
property: value
Figure 3-7
For Boolean values, yes/no, true/false or 1/0 can be used for the value. ParseString loads the HTML, and SaveString returns the cleaned XHTML. You could alternatively use ParseFile and SaveFile to process files on disk or CleanAndRepair to clean a file in place.
Listing 3-6: Using the .NET Tidy wrapper
Private Function TidyText(ByVal text As String) As String
    Dim result As String = String.Empty
    Dim t As New Tidy.Document

    With t
        'set options
        .SetOptBool(TidyOptionId.TidyIndentContent, 1)
        .SetOptBool(TidyOptionId.TidyXhtmlOut, 1)
        .SetOptBool(TidyOptionId.TidyMakeClean, 1)
        .SetOptInt(TidyOptionId.TidyIndentSpaces, 2)
        .SetOptValue(TidyOptionId.TidyCharEncoding, "utf8")
        'or .LoadConfig("tidyconfig.txt")

        'parse and return tidy'd html
        .ParseString(text)
        result = .SaveString()
    End With

    Return result
End Function
The functionality of Tidy and its availability for multiple languages and platforms means you never have an excuse for invalid XHTML pages. Try to develop the habit of running it regularly on your XHTML to ensure it conforms.
When I first learned HTML, it was a fairly primitive formatting tool. You had your choice of bold, italics, or one of six heading levels. ("And we liked it!") Going further than this meant using the <font> element. Using this element, you could change the look of your Web sites, getting them to look closer to a corporate or other brand, or to make them look more like offline documentation.
However, like a lot of other gifts of technology, things were waiting to bite in this Pandora's Box. Using the <font> element meant that you were hard-coding huge amounts of information directly in the page. Maintaining <font> information as it changed was a chore. In addition, this information was repeated frequently through the document, causing page bloat and slow response times. Fortunately, around the time of HTML 4, CSS came along. As you will soon see, CSS is a way of applying the same type of information as you could using the <font> element (and more), but in a better way. Therefore, the <font> tag has been deprecated, and support for it in browsers will eventually go the way of the <blink> and other extinct HTML elements.