One of the most fundamental principles of XML design is the separation of presentation and content. The names of all tags and attributes should reflect the information they contain rather than how they'll be presented to an end user. This is sometimes called semantic markup. Semantic markup has a number of advantages compared to more traditional presentational markup as practiced in HTML, TeX, and other languages.
However, there is one thing semantic markup does not do:
What semantic markup does do is provide stronger, more accessible hooks to which programs with localized knowledge of the domain and element names can more easily connect.
Presentation is an important use for many XML documents, but it is only one use. It should not overshadow all other uses to which the data may be put. If the formatting outweighs the data's own structure, it can establish connections that aren't really there while hiding those that are. For example, consider italics. These are customarily used to indicate the following:
If the markup indicates which words are italicized rather than what the types of the words are, it becomes impossible to easily tell whether a given italicized phrase is a citation, a particularly important point, a word from another language, an onomatopoetic invention, or something else. Yes, a human can almost always figure out which is which without any ambiguity and with very low chances of error. However, software just isn't as smart as people are, and it needs the extra help well-defined semantic markup gives it.
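The loss of information described here can be sketched in a few lines of Python. This is a minimal illustration, not part of any real vocabulary: the element names `citation`, `emphasis`, and `foreign`, and the sample sentence, are assumptions invented for the example. It shows that once every element is renamed to a generic `i`, a program can no longer single out the citations.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic element names, invented for this sketch.
SEMANTIC = (
    "<para>In <citation>Moby Dick</citation>, the whale is "
    "<emphasis>not</emphasis> a fish, whatever the "
    "<foreign>hoi polloi</foreign> may say.</para>"
)

# All three kinds of phrase are conventionally rendered in italics.
ITALICIZED = {"citation", "emphasis", "foreign"}


def to_presentational(root):
    """Rename every semantically distinct element to a generic <i>."""
    for elem in root.iter():
        if elem.tag in ITALICIZED:
            elem.tag = "i"
    return root


# With semantic markup, finding the citations is a direct lookup.
semantic_root = ET.fromstring(SEMANTIC)
citations = [e.text for e in semantic_root.iter("citation")]

# After the conversion, all that remains is "these phrases are italic".
presentational_root = to_presentational(ET.fromstring(SEMANTIC))
italics = [e.text for e in presentational_root.iter("i")]

print(citations)
print(italics)
```

The conversion in one direction is a trivial renaming. Nothing in the resulting `i` elements says which phrase was the citation, so the reverse direction would need outside knowledge.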
For example, suppose you're marking up a mutual fund prospectus. This is basically a narrative document intended for people to read. It may have very precise formatting requirements, perhaps established by law or securities regulations. Nonetheless, the markup should be designed to reflect the unique meaning and jargon of the mutual fund industry. For example, you could write it in XHTML or DocBook, as shown in Example 27-1.
Example 27-1 A Partial Mutual Fund Prospectus in XHTML
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>California Gold Rush Fund</title>
  </head>
  <body>
    <h1>Fund Summary</h1>
    <p>
      This is a money market fund that seeks to preserve the value
      of your investment at $1.00 per share. However, the fund is not
      a bank and investments in the fund are not insured or guaranteed
      by the Federal Deposit Insurance Corporation or any other
      government agency. It is possible to lose money by investing
      in the fund.
    </p>
    <h2>Investment Summary</h2>
    <h3>Investment Objective</h3>
    <p>The California Gold Rush Fund seeks maximum current income
      exempt from both federal and California state income tax.
    </p>
    <h3>Principal Investment Strategies</h3>
    <p>The investment strategies of this fund include:</p>
    <ul>
      <li>Investing in municipal money market securities of
          California localities.</li>
      <li>Investing at least 75% of assets in municipal securities
          whose interest is exempt from both federal and California
          state income tax.</li>
    </ul>
    <h3>Principal Investment Risks</h3>
    <p>The California Gold Rush Fund is subject to the following
      principal investment risks:
    </p>
    <ul>
      <li>Legislative changes make tax-free munis less attractive.</li>
      <li>Orange County declares bankruptcy (again).</li>
      <li>California sinks into the ocean.</li>
    </ul>
  </body>
</html>
However, a vocabulary specific to the domain is much preferable, as shown in Example 27-2.
Example 27-2 A Partial Mutual Fund Prospectus Using Semantic Markup
<Prospectus xmlns="http://www.bigfundco/prospectus/">
  <Name>California Gold Rush Fund</Name>
  <Summary>
    This is a money market fund that seeks to preserve the value
    of your investment at $1.00 per share. However, the fund is not
    a bank and investments in the fund are not insured or guaranteed
    by the Federal Deposit Insurance Corporation or any other
    government agency. It is possible to lose money by investing
    in the fund.
    <InvestmentSummary>
      <Objective>
        The California Gold Rush Fund seeks maximum current income
        exempt from both federal and California state income tax.
      </Objective>
      <Strategies>
        <Strategy>
          Investing in municipal money market securities of
          California localities.
        </Strategy>
        <Strategy>
          Investing at least 75% of assets in municipal securities
          whose interest is exempt from both federal and California
          state income tax.
        </Strategy>
      </Strategies>
      <Risks>
        <Risk>Legislative changes make tax-free munis less attractive.</Risk>
        <Risk>Orange County declares bankruptcy (again). </Risk>
        <Risk>California sinks into the ocean.</Risk>
      </Risks>
    </InvestmentSummary>
  </Summary>
</Prospectus>
This enables various processes ranging from print formatters to SEC enforcement programs to verify that all requirements are satisfied and to easily extract that subset of the information in which each process is interested. It is much harder to extract the securities meaning out of a more presentational format such as XHTML or DocBook than it is to convert semantic markup to presentational markup as necessary.
Arguably, the semantically marked-up version contains more information than the presentational version. It is always possible to go from a document with more information to one with less. In essence, this is what an XSLT stylesheet that converted the second example into the first would do. However, going from a document with little information to one with more is far more difficult. It is not impossible, mind you. However, some external context would be necessary to provide the additional information, as would a lot more intelligence than most software programs have. Unlike some people in the markup community, I do believe that one day we will have computers that are smart enough to read a presentationally marked-up document such as Example 27-1 and infer all the necessary semantics, and I even believe this will happen within the next 20 years. However, it's clear we're not there yet, so in the meantime we smart humans have to help out the poor dumb computers by being more explicit about what we mean in our markup.
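The easy direction, from more information to less, can be sketched as follows. The text describes an XSLT stylesheet for this job; the sketch below is a Python stand-in for such a stylesheet, not the stylesheet itself. It works on an abbreviated copy of Example 27-2, emits plain `html` elements without the XHTML namespace to stay short, and the function name `to_html` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from Example 27-2; the prefix "p" is arbitrary.
NS = {"p": "http://www.bigfundco/prospectus/"}

# Abbreviated copy of Example 27-2, shortened for this sketch.
SEMANTIC = """<Prospectus xmlns="http://www.bigfundco/prospectus/">
  <Name>California Gold Rush Fund</Name>
  <Summary>
    <InvestmentSummary>
      <Objective>Seeks maximum current income exempt from tax.</Objective>
      <Risks>
        <Risk>Legislative changes make tax-free munis less attractive.</Risk>
        <Risk>California sinks into the ocean.</Risk>
      </Risks>
    </InvestmentSummary>
  </Summary>
</Prospectus>"""


def to_html(prospectus):
    """Map semantic elements onto generic headings, paragraphs, and lists."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = prospectus.findtext("p:Name", namespaces=NS)
    body = ET.SubElement(html, "body")

    # The heading text is supplied by the converter; the semantic
    # document only says "this is the objective".
    ET.SubElement(body, "h3").text = "Investment Objective"
    objective = prospectus.findtext(".//p:Objective", namespaces=NS)
    ET.SubElement(body, "p").text = objective.strip()

    ET.SubElement(body, "h3").text = "Principal Investment Risks"
    ul = ET.SubElement(body, "ul")
    for risk in prospectus.iterfind(".//p:Risk", NS):
        ET.SubElement(ul, "li").text = risk.text.strip()
    return html


html = to_html(ET.fromstring(SEMANTIC))
print(ET.tostring(html, encoding="unicode"))
```

Every step is a mechanical lookup followed by a renaming. In the output each `Risk` has become an anonymous `li`, which is exactly the information a reverse conversion would have to guess.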
Do not, however, get sucked in by the XML hype that says that just because you've used semantic markup, systems will automatically understand what you've done. A computer no more understands that a Risk element indicates a probability of a disadvantageous occurrence than a fish understands that a baited hook is more than a tasty meal. It is still necessary for software code to be written that keys off the semantic markup to take appropriate actions. However, it is much easier to write this code to operate on a neatly semantic document than to write code that screen-scrapes HTML.
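To make the contrast concrete, here is a rough sketch of code that lists the fund's risks from each kind of document. Both inputs are abbreviated copies of Examples 27-1 and 27-2, and the function names `risks_semantic` and `risks_scraped` are assumptions for illustration. Neither function understands what a risk is. The first simply keys off the `Risk` element name, while the second has to guess from heading text and document order.

```python
import xml.etree.ElementTree as ET

PNS = {"p": "http://www.bigfundco/prospectus/"}
XHTML = "{http://www.w3.org/1999/xhtml}"

# Abbreviated copies of the two example documents.
SEMANTIC_DOC = """<Prospectus xmlns="http://www.bigfundco/prospectus/">
  <Strategies><Strategy>Investing in municipal securities.</Strategy></Strategies>
  <Risks>
    <Risk>Orange County declares bankruptcy (again).</Risk>
    <Risk>California sinks into the ocean.</Risk>
  </Risks>
</Prospectus>"""

XHTML_DOC = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <h3>Principal Investment Strategies</h3>
  <ul><li>Investing in municipal securities.</li></ul>
  <h3>Principal Investment Risks</h3>
  <p>The fund is subject to the following principal investment risks:</p>
  <ul>
    <li>Orange County declares bankruptcy (again).</li>
    <li>California sinks into the ocean.</li>
  </ul>
</body></html>"""


def risks_semantic(root):
    """Key off the element name; no knowledge of layout is needed."""
    return [r.text.strip() for r in root.iterfind(".//p:Risk", PNS)]


def risks_scraped(root):
    """Fragile heuristic: take the first list after a heading that
    mentions 'Risks'. Breaks if the heading wording or level changes."""
    in_risk_section = False
    for child in root.find(XHTML + "body"):
        if child.tag == XHTML + "h3":
            in_risk_section = "Risks" in (child.text or "")
        elif in_risk_section and child.tag == XHTML + "ul":
            return [li.text.strip() for li in child.iter(XHTML + "li")]
    return []


semantic_risks = risks_semantic(ET.fromstring(SEMANTIC_DOC))
scraped_risks = risks_scraped(ET.fromstring(XHTML_DOC))
print(semantic_risks)
print(scraped_risks)
```

Both functions happen to return the same list here, but the scraping version depends on the exact heading text and on the list directly following it. A rewording of the heading or a change of heading level would silently break it.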
What's semantic and what's purely presentational varies depending on what the XML application is describing. Formatting languages like XSL-FO are the exceptions that prove the rule. Although superficially they appear to be nothing but presentation, they are in fact quite semantic. It's just that the domain they describe is the domain of page layout. The semantics are the semantics that would be familiar to any experienced printer or desktop publisher: widows, orphans, font families, columns, and so on. Most importantly, it is never intended that humans would author XSL-FO directly, even using GUI tools like FrameMaker. Instead, humans author in a markup vocabulary that's much closer to the content's semantics. Then an XSLT stylesheet is applied to transform the document into the expected semantics for page layout.
Whatever the semantics are, the markup should reflect those semantics. The element and attribute names should be words that identify what those elements and attributes contain. They should say what the elements and attributes are rather than what they look like.