Item 27. Mark Up According to Meaning | Effective XML: 50 Specific Ways to Improve Your XML

One of the most fundamental principles of XML design is the separation of presentation and content. The names of all tags and attributes should reflect the information they contain rather than how they'll be presented to an end user. This is sometimes called semantic markup. Semantic markup has a number of advantages compared to more traditional presentational markup as practiced in HTML, TeX, and other languages.

Semantic markup provides much greater assistance to software that does anything other than present the content to an end user. It provides more reliable hooks to decipher the meaning of the data.
Semantic markup makes it much easier to attach alternate stylesheets and presentations to the document. This allows the content to more easily be displayed in diverse environments, including cell phones, Web TVs, billboards, printed materials, screen readers, Braille printers, and things you haven't even dreamed of.
Semantic markup is much easier for developers to decipher by reading the raw code.
Semantic markup is less likely to conflict with other XML applications with which it may need to be integrated.

However, there is one thing semantic markup does not do:

It does not magically give computers the ability to understand the content of documents, as some of the more excessive XML hype has occasionally claimed.

What semantic markup does do is provide stronger, more accessible hooks to which programs with localized knowledge of the domain and element names can more easily connect.

Presentation is an important use for many XML documents, but it is only one use. It should not overshadow all other uses to which the data may be put. If the formatting outweighs the data's own structure, it can establish connections that aren't really there while hiding those that are. For example, consider italics. These are customarily used to indicate the following:

Importance (Do not forget your gloves.)
Titles of books and magazines (Processing XML with Java, Software Development)
Foreign words (bonjour, ein, zwei, drei)
Names of vehicles (Challenger, Queen Elizabeth II)
Words that reproduce sounds (hmm, doh!)

If the markup indicates which words are italicized rather than what the types of the words are, it becomes impossible to easily tell whether a given italicized phrase is a citation, a particularly important point, a word from another language, an onomatopoetic invention, or something else. Yes, a human can almost always figure out which is which without any ambiguity and with very low chances of error. However, software just isn't as smart as people are, and it needs the extra help well-defined semantic markup gives it.
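The difference can be sketched in a few lines of Python. The element names cite, foreign, and em below are illustrative assumptions, not part of any particular vocabulary. The sketch only shows that a program can pick out book titles when the markup names them, and cannot when every phrase is just an i.

```python
import xml.etree.ElementTree as ET

# Presentational markup: every italicized phrase looks the same to software.
presentational = ET.fromstring(
    "<p>Read <i>Processing XML with Java</i>, say <i>bonjour</i>, "
    "and <i>do not</i> forget your gloves.</p>"
)

# Semantic markup (hypothetical element names): each phrase says what it is.
semantic = ET.fromstring(
    "<p>Read <cite>Processing XML with Java</cite>, say "
    "<foreign>bonjour</foreign>, and <em>do not</em> forget your gloves.</p>"
)

# All three phrases come back; nothing marks which one is the book title.
italic_phrases = [e.text for e in presentational.iter("i")]

# Only the book title comes back.
book_titles = [e.text for e in semantic.iter("cite")]

print(italic_phrases)
print(book_titles)
```

A stylesheet can still render all three semantic elements in italics, so nothing is lost on the presentation side.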

For example, suppose you're marking up a mutual fund prospectus. This is basically a narrative document intended for people to read. It may have very precise formatting requirements, perhaps established by law or securities regulations. Nonetheless, the markup should be designed to reflect the unique meaning and jargon of the mutual fund industry. You could write it in XHTML or DocBook, as shown in Example 27-1.

Example 27-1 A Partial Mutual Fund Prospectus in XHTML

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>California Gold Rush Fund</title>
  </head>
  <body>
    <h1>Fund Summary</h1>
    <p>
      This is a money market fund that seeks to preserve the value
      of your investment at $1.00 per share. However, the fund is not
      a bank and investments in the fund are not insured or guaranteed
      by the Federal Deposit Insurance Corporation or any other
      government agency. It is possible to lose money by investing in
      the fund.
    </p>
    <h2>Investment Summary</h2>
    <h3>Investment Objective</h3>
    <p>
      The California Gold Rush Fund seeks maximum current income
      exempt from both federal and California state income tax.
    </p>
    <h3>Principal Investment Strategies</h3>
    <p>The investment strategies of this fund include:</p>
    <ul>
      <li>Investing in municipal money market securities of California
          localities.</li>
      <li>Investing at least 75% of assets in municipal securities
          whose interest is exempt from both federal and California
          state income tax.</li>
    </ul>
    <h3>Principal Investment Risks</h3>
    <p>
      The California Gold Rush Fund
      is subject to the following principal investment risks:
    </p>
    <ul>
      <li>Legislative changes make tax-free munis less
          attractive.</li>
      <li>Orange County declares bankruptcy (again).</li>
      <li>California sinks into the ocean.</li>
    </ul>
  </body>
</html>

However, it is much better to use a vocabulary specific to the domain, as shown in Example 27-2.

Example 27-2 A Partial Mutual Fund Prospectus Using Semantic Markup

<Prospectus xmlns="http://www.bigfundco/prospectus/">
  <Name>California Gold Rush Fund</Name>
  <Summary>
    This is a money market fund that seeks to preserve
    the value of your investment at $1.00 per share.
    However, the fund is not a bank and investments
    in the fund are not insured or guaranteed by the Federal
    Deposit Insurance Corporation or any other government
    agency. It is possible to lose money by investing in
    the fund.
    <InvestmentSummary>
      <Objective>
        The California Gold Rush Fund seeks
        maximum current income exempt from both
        federal and California state income tax.
      </Objective>
      <Strategies>
        <Strategy>
          Investing in municipal money market securities of
          California localities.
        </Strategy>
        <Strategy>
          Investing at least 75% of assets in municipal
          securities whose interest is exempt from both federal
          and California state income tax.
        </Strategy>
      </Strategies>
      <Risks>
        <Risk>Legislative changes make tax-free munis less
              attractive.</Risk>
        <Risk>Orange County declares bankruptcy (again).</Risk>
        <Risk>California sinks into the ocean.</Risk>
      </Risks>
    </InvestmentSummary>
  </Summary>
</Prospectus>

This enables various processes ranging from print formatters to SEC enforcement programs to verify that all requirements are satisfied and to easily extract that subset of the information in which each process is interested. It is much harder to extract the securities meaning out of a more presentational format such as XHTML or DocBook than it is to convert semantic markup to presentational markup as necessary.
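As a rough illustration of such a process, here is a minimal Python sketch that pulls the risks out of an abbreviated copy of Example 27-2 and checks that certain elements are present. The function names and the list of required elements are hypothetical, chosen only for this example. They do not reflect any actual regulatory rule.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from Example 27-2; the prefix "p" is arbitrary.
NS = {"p": "http://www.bigfundco/prospectus/"}

# An abbreviated version of Example 27-2.
PROSPECTUS = """
<Prospectus xmlns="http://www.bigfundco/prospectus/">
  <Name>California Gold Rush Fund</Name>
  <Summary>
    <InvestmentSummary>
      <Objective>Maximum current income exempt from tax.</Objective>
      <Strategies>
        <Strategy>Investing in municipal money market securities.</Strategy>
      </Strategies>
      <Risks>
        <Risk>Legislative changes make tax-free munis less
              attractive.</Risk>
        <Risk>California sinks into the ocean.</Risk>
      </Risks>
    </InvestmentSummary>
  </Summary>
</Prospectus>
"""

def list_risks(root):
    """Return the text of every Risk element, whitespace-normalized."""
    return [" ".join((r.text or "").split())
            for r in root.iterfind(".//p:Risk", NS)]

def missing_sections(root, required=("Objective", "Strategy", "Risk")):
    """Return the required element names (hypothetical list) that are absent."""
    return [name for name in required
            if root.find(f".//p:{name}", NS) is None]

root = ET.fromstring(PROSPECTUS)
risks = list_risks(root)
missing = missing_sections(root)
print(risks)
print(missing)
```

The code asks for Risk elements by name. It does not need to know where on the page the risks appear or how they are formatted.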

Arguably, the semantically marked-up version contains more information than the presentational version. It is always possible to go from a document with more information to one with less. In essence, this is what an XSLT stylesheet that converted the second example into the first would do. However, going from a document with little information to one with more is far more difficult. It is not impossible, mind you. However, some external context would be necessary to provide the additional information, as would a lot more intelligence than most software programs have. Unlike some people in the markup community, I do believe that one day we will have computers that are smart enough to read a presentationally marked-up document such as Example 27-1 and infer all the necessary semantics, and I even believe this will happen within the next 20 years. However, it's clear we're not there yet, so in the meantime we smart humans have to help out the poor dumb computers by being more explicit about what we mean in our markup.
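The easy direction can be sketched briefly. The text describes this down-conversion as the job of an XSLT stylesheet. The sketch below substitutes a few lines of Python for the stylesheet, purely for illustration. The function name risks_to_xhtml is hypothetical, the heading text is borrowed from Example 27-1, and the output elements are left out of the XHTML namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from Example 27-2; the prefix "p" is arbitrary.
NS = {"p": "http://www.bigfundco/prospectus/"}

# An abbreviated version of Example 27-2.
SEMANTIC = """
<Prospectus xmlns="http://www.bigfundco/prospectus/">
  <Name>California Gold Rush Fund</Name>
  <Summary>
    <InvestmentSummary>
      <Risks>
        <Risk>Orange County declares bankruptcy (again).</Risk>
        <Risk>California sinks into the ocean.</Risk>
      </Risks>
    </InvestmentSummary>
  </Summary>
</Prospectus>
"""

def risks_to_xhtml(prospectus):
    """Render the Risk elements as a heading followed by a bulleted list."""
    body = ET.Element("body")
    ET.SubElement(body, "h3").text = "Principal Investment Risks"
    ul = ET.SubElement(body, "ul")
    for risk in prospectus.iterfind(".//p:Risk", NS):
        # Each Risk becomes an anonymous list item; its type is discarded.
        ET.SubElement(ul, "li").text = " ".join((risk.text or "").split())
    return body

body = risks_to_xhtml(ET.fromstring(SEMANTIC))
html = ET.tostring(body, encoding="unicode")
print(html)
```

Nothing in the resulting li elements says that they are risks. That is exactly the information a program working in the reverse direction would have to guess.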

Do not, however, get sucked in by the XML hype that says that just because you've used semantic markup, systems will automatically understand what you've done. A computer no more understands that a Risk element indicates a probability of a disadvantageous occurrence than a fish understands that a baited hook is more than a tasty meal. It is still necessary for software code to be written that keys off the semantic markup to take appropriate actions. However, it is much easier to write this code to operate on a neatly semantic document than to write code that screen-scrapes HTML.
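To see what screen-scraping involves, consider finding the risks in a document like Example 27-1. The following Python sketch is one possible heuristic, not a recommended technique: it assumes the risks are the first ul that follows an h3 with exactly the right heading text. The function name and the abbreviated page are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace in ElementTree's {uri}localname notation.
XHTML = "{http://www.w3.org/1999/xhtml}"

# An abbreviated version of Example 27-1.
PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h3>Principal Investment Strategies</h3>
    <ul>
      <li>Investing in municipal money market securities.</li>
    </ul>
    <h3>Principal Investment Risks</h3>
    <p>The fund is subject to the following principal investment risks:</p>
    <ul>
      <li>California sinks into the ocean.</li>
    </ul>
  </body>
</html>
"""

def scrape_risks(html_root, heading_text="Principal Investment Risks"):
    """Guess the risks: the first ul after an h3 with the given text."""
    in_risk_section = False
    for child in html_root.find(f"{XHTML}body"):
        if child.tag == f"{XHTML}h3":
            # Every h3 starts a new section; only one heading matches.
            in_risk_section = (child.text or "").strip() == heading_text
        elif in_risk_section and child.tag == f"{XHTML}ul":
            return [" ".join((li.text or "").split())
                    for li in child.iter(f"{XHTML}li")]
    return []

risks = scrape_risks(ET.fromstring(PAGE))
print(risks)
```

The heuristic depends on the heading's wording and level and on the order of the elements. If an author rewords the heading or uses an h4 instead, the function silently returns nothing, whereas a query for Risk elements in the semantic version would be unaffected by such changes.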

What's semantic and what's purely presentational varies depending on what the XML application is describing. Formatting languages like XSL-FO are the exceptions that prove the rule. Although superficially they appear to be nothing but presentation, in fact they are quite semantic. It's just that the domain they describe is the domain of page layout. The semantics are the semantics that would be familiar to any experienced printer or desktop publisher: widows, orphans, font families, columns, and so on. Most importantly, it is never intended that humans would author XSL-FO directly, even using GUI tools like FrameMaker. Instead, humans author in a markup vocabulary that's much closer to the content's semantics. Then an XSLT stylesheet is applied to transform the document into the expected semantics for page layout.

Whatever the semantics are, the markup should reflect those semantics. The element and attribute names should be words that identify what those elements and attributes contain. They should say what the elements and attributes are rather than what they look like.