Item 23. Reuse XHTML for Generic Narrative Content | Effective XML: 50 Specific Ways to Improve Your XML

Many XML applications are intended solely for machine processing. For instance, SOAP messages are almost never seen by a person. By contrast, most DocBook documents are edited by hand and are intended to be formatted and presented to people. In machine-oriented documents, mixed content is uncommon and order tends not to matter much. In narrative documents meant for human eyes, mixed content is extremely common and order matters a great deal. However, there's also a common middle ground of documents that are mostly intended for machine processing but may contain some portion of text meant for people.

For example, consider a bank or credit card statement. Mostly it's just a list of transactions. However, statements often also contain a significant amount of narrative for a person to read, as shown in Figure 23-1. There is nothing in this part of the statement that could not be written in standard XHTML.

Figure 23-1. The Narrative Fine Print from a Typical Bank Statement


For another example, imagine an invoice document. It probably contains a list of the products ordered, their prices, the delivery address, and so forth. This can all be represented in a straightforward, record-oriented fashion.

<?xml version="1.0"?>
<Invoice>
  <Customer>Jane's Electronics</Customer>
  <Product>
    <Name>Widget</Name>
    <SKU>324</SKU>
    <Quantity>10</Quantity>
    <Price currency="USD">2.95</Price>
  </Product>
  <Product>
    <Name>Gizmo</Name>
    <SKU>325</SKU>
    <Quantity>1</Quantity>
    <Price currency="USD">2344.95</Price>
  </Product>
  <ShipTo>
    <Street>135 Fremont Ave.</Street>
    <City>Santa Clara</City>
    <State>CA</State>
    <Zip>95054</Zip>
  </ShipTo>
  <Terms>Net-30</Terms>
</Invoice>

However, an invoice may also contain a paragraph of text thanking the customer for ordering the products, instructions for returning the product if necessary, and even ads for other products. All of these are traditional narrative text and need a more human-centered markup.

Most developers focus on the more record-like aspects of a document when designing an XML application. Developers are more comfortable with this sort of data, and its structure tends to be more closely tied to the business rules. The narrative content is often an afterthought, if it's included at all, and it's rarely very well thought out. Fortunately, even as an afterthought, it doesn't have to be hard to add sophisticated narrative structure to your documents. The trick is, instead of trying to invent from scratch a markup language that describes paragraphs, sections, titles, emphasis, and so on, to borrow an existing markup language. In particular, I recommend that you borrow XHTML. XHTML has a number of advantages, not least among them:

XHTML is simple but complete. It includes many features you probably need but may not have considered, such as accessibility, language identification, standard entity references, and more.
Many tools can process and display XHTML. For instance, you can render XHTML in your applications by using javax.swing.JEditorPane in Java, the Gecko engine in C++, the Internet Explorer engine in Windows, and many more.
Authors are very familiar with XHTML already. Including it requires very little extra training.
The DTD is modular, so it can be easily integrated into your application. You can pick and choose those parts you need and leave out the parts you don't.
XHTML uses namespaces, so it's easy to distinguish your elements from the HTML elements. This isn't true of some other formats for generic narrative documents such as DocBook and TEI.
The W3C licenses XHTML on extremely liberal terms, so you don't have to worry about intellectual property issues getting in the way of your data or software. This makes the lawyers happy.
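The namespace point in the list above can be illustrated with a short Python sketch that uses only the standard library's xml.etree.ElementTree. It assumes a hypothetical invoice in which a Thanks element (a name invented for this illustration, not part of any real invoice vocabulary) holds an XHTML paragraph, and it sorts element names into application names and XHTML names purely by namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical invoice fragment; the Thanks placeholder is an assumption
# made for this sketch.
INVOICE = """\
<Invoice>
  <Customer>Jane's Electronics</Customer>
  <Terms>Net-30</Terms>
  <Thanks>
    <p xmlns="http://www.w3.org/1999/xhtml">Thank you for your
      <em>prompt</em> payment.</p>
  </Thanks>
</Invoice>
"""


def split_by_namespace(xml_text):
    """Return (application element names, XHTML element names)."""
    app_names, xhtml_names = [], []
    prefix = "{" + XHTML_NS + "}"
    # iter() walks the tree in document order, root first.
    for element in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as {namespace-uri}local-name.
        if element.tag.startswith(prefix):
            xhtml_names.append(element.tag[len(prefix):])
        else:
            app_names.append(element.tag)
    return app_names, xhtml_names


app_names, xhtml_names = split_by_namespace(INVOICE)
print(app_names)    # ['Invoice', 'Customer', 'Terms', 'Thanks']
print(xhtml_names)  # ['p', 'em']
```

Because the test is made on the namespace URI rather than on the local name, an application element that happened to be called p or table would still land in the first list.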

There are two basic ways to integrate XHTML into other, more record-like documents.

Define a placeholder element that contains the XHTML markup (e.g., an AccountInformation element in a bank statement).
Include an entire XHTML document, starting with the root html element as a child of one of the domain-specific elements.

Both approaches have their advantages and disadvantages. The first often seems to flow more naturally with the document as a whole, while the second makes it much easier to extract and process the HTML using a separate process from the one that manipulates the records in the document. Perhaps the best approach is to combine them, that is, to insert a placeholder element that contains an html element. For example, here's a simplified bank statement that includes HTML account information.

<?xml version="1.0"?>
<!DOCTYPE Statement PUBLIC "-//MegaBank//DTD Statement//EN"
                           "statement.dtd">
<Statement xmlns="http://namespaces.megabank.com/">
  <Bank>MegaBank</Bank>
  <Account>
    <Number>00003145298</Number>
    <Type>Savings</Type>
    <Owner>John Doe</Owner>
  </Account>
  <Date>2003-30-02</Date>
  <OpeningBalance>5266.34</OpeningBalance>
  <Deposit>
    <Date>2003-02-07</Date>
    <Amount>300.00</Amount>
  </Deposit>
  <ClosingBalance>5566.34</ClosingBalance>
  <AccountInfo>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <h1>
          IMPORTANT INFORMATION ABOUT THIS ACCOUNT STATEMENT
          AND YOUR RIGHTS
        </h1>
        <ol>
          <li><strong>Review At Once:</strong>
              Notify the Bank in writing, within 14 days
              after we mail or make this statement available
              to you, of any irregularities, or you may lose
              valuable rights. See the brochure <cite>
              Information About Our Accounts and Services
              </cite> for details about this and other time
              limitations regarding notice or irregularities.
              (This paragraph does not apply to electronic
              funds or wire transfers.)
          </li>
          <li><strong>Electronic Funds Transfers Under
                      Regulation E:</strong>
              In case of...</li>
        </ol>
        ...
      </body>
    </html>
  </AccountInfo>
</Statement>
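To see why wrapping a complete html element inside a placeholder makes separate processing easy, consider the following minimal Python sketch. It works on a trimmed, well-formed version of the statement, uses only the standard library, and pulls the html element out as a standalone XHTML document. The function name extract_xhtml is an assumption of this sketch, not part of any MegaBank tooling.

```python
import xml.etree.ElementTree as ET

BANK_NS = "http://namespaces.megabank.com/"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize XHTML elements with a default namespace instead of a prefix.
ET.register_namespace("", XHTML_NS)

# Trimmed copy of the statement, shortened for illustration.
STATEMENT = """\
<Statement xmlns="http://namespaces.megabank.com/">
  <Bank>MegaBank</Bank>
  <ClosingBalance>5566.34</ClosingBalance>
  <AccountInfo>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <h1>IMPORTANT INFORMATION ABOUT THIS ACCOUNT STATEMENT</h1>
      </body>
    </html>
  </AccountInfo>
</Statement>
"""


def extract_xhtml(statement_text):
    """Return the embedded html element as a string, or None if absent."""
    root = ET.fromstring(statement_text)
    # Address the placeholder and its html child by namespace-qualified name.
    path = "{%s}AccountInfo/{%s}html" % (BANK_NS, XHTML_NS)
    html = root.find(path)
    if html is None:
        return None
    html.tail = None  # drop the whitespace that followed </html>
    return ET.tostring(html, encoding="unicode")


xhtml_text = extract_xhtml(STATEMENT)
print(xhtml_text)
```

The extracted string can be handed to any XHTML-aware tool without that tool knowing anything about the bank's own vocabulary, while the record-processing code can skip AccountInfo entirely.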

If you want to validate documents like this (and you don't always need to do that; sometimes just the markup is enough), you'll want to reference the XHTML DTD. This is not hard. You can load it with a parameter entity reference as discussed in Item 8 and demonstrated below.

<!ENTITY % xhtml PUBLIC "-//W3C//DTD XHTML 1.1//EN"
                        "xhtml11.dtd">
%xhtml;

You then simply include the html element in the content model of the AccountInfo element.

 <!ELEMENT AccountInfo (html)>

The only tricky part is ensuring that no elements in your application share names, such as p, div, body, or table, with standard HTML elements. This is probably a good idea anyway because the HTML elements are so familiar to so many people that using the same names for other things is likely to cause confusion. (Other schema languages do not have this problem because they're namespace aware, but the W3C XML Schema Language schema for XHTML has not been finished as of June 2003.)
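A quick check for such collisions can be scripted. The Python sketch below is only a rough aid under stated assumptions: it finds literal ELEMENT declarations with a regular expression (it does not expand parameter entities or skip comments), and it compares them against a hand-written subset of XHTML element names rather than the full DTD. The sample DTD text, including the content model of Statement and the table and row elements, is hypothetical.

```python
import re

# Illustrative subset of XHTML element names (an assumption of this sketch);
# extend it to match the XHTML variant you actually embed.
XHTML_NAMES = {
    "html", "head", "title", "body", "div", "p", "h1", "h2",
    "ol", "ul", "li", "table", "tr", "td", "a", "em", "strong",
    "cite", "img", "span",
}

# Matches the element name in a literal <!ELEMENT name ...> declaration.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([^\s>]+)")


def colliding_names(dtd_text):
    """Return the declared element names that clash with XHTML names."""
    declared = set(ELEMENT_DECL.findall(dtd_text))
    return sorted(declared & XHTML_NAMES)


# Hypothetical application DTD with one deliberate clash.
APP_DTD = """
<!ELEMENT Statement (Bank, Account, Date, AccountInfo)>
<!ELEMENT AccountInfo (html)>
<!ELEMENT table (row+)>
<!ELEMENT row (#PCDATA)>
"""

collisions = colliding_names(APP_DTD)
print(collisions)  # ['table']
```

The names are listed by hand because the modular XHTML DTDs declare their elements through parameter entities, which a simple regular expression over the DTD files would not resolve.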

In fact, you can choose from several variants of XHTML, depending on your needs. These include:

XHTML Basic, -//W3C//DTD XHTML Basic 1.0//EN (http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd): A minimal profile of XHTML 1.1 that includes headings, paragraphs, lists, links, basic forms, basic tables, images, and meta information. This is particularly well suited for embedding narrative content in other XML applications and is usually my first choice.
XHTML 1.1, -//W3C//DTD XHTML 1.1//EN (http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd): Complete, standard XHTML 1.1 including everything in XHTML Basic plus ruby text, image maps, events, scripting, revision marking, complete forms, complete tables, and more. This is normally overkill for simple, embedded narrative content.
XHTML 1.0 Transitional, -//W3C//DTD XHTML 1.0 Transitional//EN (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd): All the standard features of traditional HTML 4.0 except for frames. This includes deprecated style elements such as font, i, and b. This is useful for incorporating existing legacy content. It is not as customizable as the modular form of XHTML introduced in XHTML 1.1.
XHTML 1.0 Frameset, -//W3C//DTD XHTML 1.0 Frameset//EN (http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd): Transitional XHTML plus frames. This is rarely needed when embedding XHTML inside other XML documents.

The W3C has even published profiles of XHTML that integrate MathML, SVG, and/or VoiceXML support. If none of these suit you, you can use the modularization techniques built into XHTML 1.1 to customize your own. You can add and remove elements and attributes, select only some of the modules, and build almost exactly the language you need. For example, suppose you want to use XHTML Basic but remove forms support. You would simply redefine the xhtml-form.module entity to IGNORE before importing the XHTML Basic driver.

<!ENTITY % xhtml-form.module "IGNORE" >
<!ENTITY % xhtml-basic PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
           "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd" >
%xhtml-basic;

It is also possible to go the other way, that is, to mix your own vocabularies into XHTML. The difference is that in this case, the root element is html, and the main driver is HTML, not your own application. This is primarily useful for browser display, either with stylesheets or particular plug-ins. For instance, this is how SVG and MathML are added to web pages. However, this technique tends not to be as useful in custom, local applications.

You may or may not need to customize XHTML like this before mixing it with your own applications. Either way, it's a lot easier to borrow one of the XHTML DTDs and embed XHTML in your documents rather than invent an equivalent language from scratch. Reusing XHTML saves developer time, saves author time, and produces more robust and maintainable documents.