XML is first and foremost a document format. It was always intended for web pages, books, scholarly articles, poems, short stories, reference manuals, tutorials, textbooks, legal pleadings, contracts, instruction sheets, and other documents that human beings would read. Its use as a syntax for computer data in applications such as order processing, object serialization, database exchange and backup, and electronic data interchange is mostly a happy accident.
Most computer programmers are better trained in working with the rigid structures one encounters in data-oriented applications than in the more free-form environment of an article or story. Most writers are more accustomed to the more free-form format of a book, story, or article. XML is perhaps unique in addressing the needs of both communities equally well. This chapter describes by both elucidation and example the structures encountered in documents that are meant to be read by people instead of computers. Subsequent chapters will look at web pages in particular, then address technologies such as XSLT, XLinks, and stylesheets that are primarily intended for use with documents that will be read by human beings. Once we've done that, we'll look at XML as a format for more or less transitory data meant to be read by computers, rather than semipermanent documents intended for human consumption.
XML is a simplified form of the Standard Generalized Markup Language (SGML). The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by several hundred people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves in much the same way as XML solves them. It was and is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.
SGML's biggest success was HTML, which was and is an SGML application. However, HTML is just one SGML application. It does not have anything close to the full power of SGML itself. SGML has also been used to define many other document formats, including DocBook and TEI, both of which we'll discuss shortly.
However, SGML is complicated: very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implement or rely on different subsets of SGML are often incompatible. The special feature that one program considers essential is all too often considered extraneous fluff and omitted by the next program. Nonetheless, experience with SGML taught developers a lot about the proper design, implementation, and use of markup languages for a wide variety of documents. Much of that general knowledge applies equally well to XML.
One thing all this should make clear is that XML documents aren't just used on the Web. XML can easily handle the needs of publishing in a variety of media, including books, magazines, journals, newspapers, and pamphlets. XML is particularly useful when you need to publish the same information in several of these formats. By applying different stylesheets to the same source document, you can produce web pages, speaker's notes, camera-ready copy for printing, and more.
All XML documents are trees. However, trees are very general-purpose data structures. If you've been formally trained in computer science (and very possibly even if you haven't been), you've encountered binary trees, red-black trees, balanced trees, B-trees, ordered trees, and more. However, when working with XML, it's highly unlikely that any given document matches any of these structures. Instead, XML documents are the most general sort of tree, with no particular restrictions on how nodes are ordered or how or which nodes are connected to which other nodes. Narrative XML documents are even less likely than data-oriented XML documents to have an identifiable structure beyond their mere treeness.
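To make the "general tree" point concrete, here is a minimal Python sketch that walks a small document and reports each element's depth and number of children. The sample document and the `outline` helper are assumptions made up for illustration; the only library used is the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the element names are illustrative only.
DOC = """<book>
  <title>Example</title>
  <chapter><title>One</title><para>Text.</para><para>More text.</para></chapter>
  <chapter><title>Two</title></chapter>
</book>"""


def outline(element, depth=0):
    """Return (depth, tag, child count) for every element, in document order."""
    rows = [(depth, element.tag, len(element))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


rows = outline(ET.fromstring(DOC))
for depth, tag, children in rows:
    # Indent each element by its depth in the tree.
    print("  " * depth + f"{tag} ({children} children)")
```

Nothing constrains how many children a node has or how deep a branch goes: the two chapter elements here have different shapes, which is typical of narrative documents.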
So what does a narrative-oriented XML document look like? Of course, there's a root element. All XML documents have one. Generally speaking, this root element represents the document itself. That is, if the document is a book, the root element is book. If the document is an article, the root element is article, and so on.
Beyond that, large documents are generally broken up into sections of some kind, perhaps chapters for a book, parts for an article, or claims for a legal brief. Most of the document consists of these primary sections. In some cases, there'll be several different kinds of sections; for instance, one for the table of contents, one for the index, and one for the chapters of a book.
Generally, the root element also contains one or more elements providing metainformation about the document, for example, the title of the work, the author of the document, the dates the document was written and last modified, and so forth. One common pattern is to place the metainformation in one child of the root element and the main content of the work in another. This is how HTML documents are written. The root element is html. The metainformation goes in a head element, and the main content goes in the body element. TEI and DocBook also follow this pattern.
Sections of the document can be further divided into subsections. The subsections themselves may be further divided. How many levels of subsection appear generally depends on how large the document is. An encyclopedia will have many levels of sectioning; a pamphlet or flier almost none. Each section and subsection normally has a title. It may also have elements or attributes that indicate metainformation about the section, such as the author or date it was last modified.
Up to this point, mixed content is mostly avoided. Elements contain child elements and whitespace, and that's likely all they contain. However, at some level it becomes necessary to insert the actual text of the document, that is, the words that people will read. In most Western languages these will probably be divided into paragraphs and other block-level elements like headlines, figures, sidebars, and footnotes. Generic document DTDs like DocBook won't be able to say more about these items than this.
The paragraphs and other block-level items will mostly contain words in a row, that is, text. Some of this text may be marked up with inline elements. For instance, you may wish to indicate that a particular string of text inside the block-level element is a date, a person, or simply important. However, most of the text will not be so annotated.
One area in which different XML applications diverge is the question of whether block-level items may contain other block-level items. For instance, can a paragraph contain a list? Or can a list item contain a paragraph? It's probably easier to work with more structured documents in which blocks can't contain other blocks (particularly other instances of the same kind). However, it's very often the case that a block has a very good reason to contain other blocks. For instance, a long list item or quotation may contain several paragraphs.
For the most part, this entire structure from the root down to the most deeply nested inline item tends to be quite linear; that is, you expect that a person will read the words in pretty much the same order they appear in the document. If all the markup were suddenly removed and you were left with nothing but the raw text, the result should be more or less legible. The markup can be used to index or format the document, but it's not a fundamental part of the content.
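The claim that stripping the markup leaves legible text is easy to check. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the sample paragraph and the `plain_text` helper are hypothetical, and the element names are not meant to match any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph with two inline elements.
PARA = (
    "<para>Much of the <abbr>LDP</abbr> corpus is written in "
    "<emphasis>DocBook</emphasis>.</para>"
)


def plain_text(xml_string):
    """Discard all tags and return the character data in document order."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


text = plain_text(PARA)
print(text)
```

Because the words already appear in reading order, no rearrangement is needed; the markup only annotates them.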
Another important point about these sorts of XML documents: not only are they composed of words in a row; they're composed of words. What they contain is text intended for human beings to read. They're not numbers or dates or money, except insofar as these things occur as part of the normal flow of the narrative. The #PCDATA content of the lowest-level elements of the tree mostly has one type: string. If anything has a real type beyond string, it's likely metainformation about the document (figure number, date last modified, and so on) rather than the content of the document itself.
This explains why DTDs don't provide strong (or really any) data typing. The documents for which SGML was designed didn't need it. XML documents that are doing jobs for which SGML wasn't designed, such as tracking inventories or census data, do need data typing; that's why various people and organizations have invented a plethora of schema languages. However, schemas really don't improve on DTDs for narrative documents.
Not all XML documents are like those we've described here. Not even all narrative-oriented XML documents are like this. However, a surprising number of narrative-oriented XML applications do follow this basic pattern, perhaps with a nip here or a tuck there. The reason is that this is the basic structure narratives follow, and that has proven its usefulness in the thousands of years since writing was invented. If you were to define your own DTDs for general narrative-oriented documents, you'd probably come up with something a lot like this. If you define your own DTDs for more specialized narrative-oriented documents, then the names of your elements may change to reflect your domain. For instance, if you were writing the next edition of the Boy Scout handbook, one of your subsections might be called MeritBadge. However, the basic hierarchy of document, metainformation, sections and subsections, block-level elements, and marked-up text would likely remain.
The Text Encoding Initiative (TEI, http://www.tei-c.org/) is an SGML application designed for the markup of classic literature, such as Virgil's Aeneid or the collected works of Thomas Jefferson. It's a prime example of a narrative-oriented DTD. Since TEI is designed for scholarly analysis of text rather than more casual reading or publishing, it includes elements not only for common document structures (chapter, scene, stanza, etc.) but also for typographical elements, grammatical structure, the position of illustrations on the page, and so forth. These aren't important to most readers, but they are important to TEI's intended audience of humanities scholars. For many academic purposes, one manuscript of the Aeneid is not necessarily the same as the next. Transcription errors and emendations made by various monks in the Middle Ages can be crucial.
TEI is an SGML application. It uses several features of SGML not found in XML, including the & connector and tag minimization. However, XML is clearly the wave of the future. Therefore, like most evolving SGML applications, TEI is moving toward XML. A light version of the TEI DTD is available for authors who prefer to work in pure XML. It's not exactly the same as the SGML version, but it's very close for many practical uses.
Example 6-1 shows a fairly simple TEI Lite document that uses the XML version of the TEI DTD. The content comes from the book you're reading now. Although a complete TEI-encoded copy of this manuscript would be much longer, this simple example demonstrates the basic features of most TEI documents that represent books. (As well as prose, TEI can also be used for plays, poems, missals, and essentially any written form of literature.)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE TEI.2 SYSTEM "xteilite.dtd">
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
      <publicationStmt><p></p></publicationStmt>
      <sourceDesc><p>Early manuscript draft</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text id="HarXMLi">
    <front>
      <div type='toc'>
        <head>Table Of Contents</head>
        <list>
          <item>Introducing XML</item>
          <item>XML as a Document Format</item>
          <item>XML on the Web</item>
        </list>
      </div>
    </front>
    <body>
      <div1 type="chapter">
        <head>Introducing XML</head>
        <p></p>
      </div1>
      <div1 type="chapter">
        <head>XML as a Document Format</head>
        <p>
          XML is first and foremost a document format. It was always intended for web pages, books, scholarly articles, poems, short stories, reference manuals, tutorials, texts, legal pleadings, contracts, instruction sheets, and other documents that human beings would read. Its use as a syntax for computer data in applications like syndication, order processing, object serialization, database exchange and backup, electronic data interchange, and so forth is mostly a happy accident.
        </p>
        <div2 type="section">
          <head>SGML's Legacy</head>
          <p></p>
        </div2>
        <div2 type="section">
          <head>TEI</head>
          <p></p>
        </div2>
        <div2 type="section">
          <head>DocBook</head>
          <p>
            DocBook (<hi>http://www.docbook.org/</hi>) is an SGML application designed for new documents, not old ones. It's especially common in computer documentation. Several O'Reilly books have been written in DocBook including <bibl><author>Norm Walsh</author>'s <title>DocBook: The Definitive Guide</title></bibl>. Much of the <abbr expan='Linux Documentation Project'>LDP</abbr> (<hi>http://www.linuxdoc.org/</hi>) corpus is written in DocBook.
          </p>
        </div2>
      </div1>
      <div1 type="chapter">
        <head>XML on the Web</head>
        <p></p>
      </div1>
    </body>
    <back>
      <div1 type="index">
        <list>
          <head>INDEX</head>
          <item>SGML, 8, 9, 91, 92, 94</item>
          <item>DocBook, 97-101</item>
          <item>TEI, 94-97, 101</item>
          <item>Text Encoding Initiative, See TEI</item>
        </list>
      </div1>
    </back>
  </text>
</TEI.2>
The root element of this and all TEI documents is TEI.2. This root element is always divided into two parts, a header represented by a teiHeader element and the main content of the document represented by a text element. The header contains information about the source document (for instance, exactly which medieval manuscript the text was copied from), the encoding of the document, some keywords describing the document, and so forth.
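The split between header and text makes it straightforward for a program to pull metainformation and content apart. Here is a minimal sketch using Python's standard `xml.etree.ElementTree`, applied to an abbreviated copy of the TEI document above. The abbreviation, and the omission of the DOCTYPE declaration so that no DTD file is needed, are assumptions made for the sake of a self-contained example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the TEI example; DOCTYPE omitted so no DTD is required.
TEI = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div1 type="chapter"><head>Introducing XML</head><p></p></div1>
      <div1 type="chapter"><head>XML as a Document Format</head><p></p></div1>
    </body>
  </text>
</TEI.2>"""

root = ET.fromstring(TEI)

# Metainformation lives under teiHeader.
title = root.findtext("teiHeader/fileDesc/titleStmt/title")
authors = [a.text for a in root.findall("teiHeader/fileDesc/titleStmt/author")]

# The work itself lives under text; here we list the chapter headings.
chapters = [
    div.findtext("head")
    for div in root.findall("text/body/div1[@type='chapter']")
]

print(title, authors, chapters)
```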
The text element is itself divided into three parts:
Front matter (the front element): The preface, table of contents, dedication page, pictures of the cover, and so forth. Each of these is represented by a div element with a type attribute whose value identifies the division as a table of contents, preface, title page, and so forth. Each of these divisions contains other elements laying out the content of that division.
Body (the body element): The individual chapters, acts, and so forth that make up the document. Each of these is represented by a div1 element with a type attribute that identifies this particular division as a volume, book, part, chapter, poem, act, and so forth. Each div1 element has a head child giving the title of the volume, book, part, chapter, etc.
Back matter (the back element): The index, glossary, etc.
The divisions may be further subdivided; div1s can contain div2s, div2s can contain div3s, div3s can contain div4s, and so on up to div7. However, for any given work, there is a smallest division. This division contains paragraphs represented by p elements for prose or stanzas represented by lg elements for poetry. Stanzas are further broken up into individual lines represented by l elements.
Both lines and paragraphs contain mixed content; that is, they contain plain text. However, parts of this text may be marked up further by elements indicating that particular words or characters are people's names (name), corrections (corr), illegible (unclear), misspellings (sic), and so on.
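Mixed content looks like this to a program: runs of plain text interleaved with child elements. The sketch below, which assumes Python's standard `xml.etree.ElementTree`, lists the pieces of one paragraph in document order. The sample sentence and the `mixed_content` helper are made up for illustration; only the element names name and unclear come from the discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph; the sentence is invented for this example.
P = "<p>Written by <name>Thomas Jefferson</name> in <unclear>1776</unclear>.</p>"


def mixed_content(element):
    """List an element's content in document order.

    Plain text runs are reported as ("text", string); child elements
    are reported as (tag, string content).
    """
    pieces = []
    if element.text:
        pieces.append(("text", element.text))
    for child in element:
        pieces.append((child.tag, "".join(child.itertext())))
        # ElementTree stores the text that follows a child in its tail.
        if child.tail:
            pieces.append(("text", child.tail))
    return pieces


pieces = mixed_content(ET.fromstring(P))
for kind, value in pieces:
    print(kind, repr(value))
```

Most of the pieces are plain text; only the name and the uncertain date carry extra annotation.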
This structure fairly closely reflects the structure of the actual documents that are being encoded in TEI. This is true of most narrative-oriented XML applications that need to handle fairly generic documents. TEI is a very representative example of typical XML document structure.
DocBook (http://www.docbook.org/) is an SGML application designed for new documents, not old ones. It's especially common in computer documentation. Several O'Reilly books have been written in DocBook, including Norm Walsh and Leonard Muellner's DocBook: The Definitive Guide. Much of the Linux Documentation Project (LDP, http://www.linuxdoc.org/) corpus is written in DocBook.
The current version of DocBook, 4.1.2, is available as both an SGML and an XML application. The XML version is not quite the same as the SGML version, but it's very close for most practical uses. The DocBook maintainers have announced plans to move to a single DTD that is completely compatible with both SGML and XML in version 5.0. Example 6-2 shows a simple DocBook XML document based on the book you're reading now. Needless to say, the full version of this document would be much longer.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBk XML V4.1.2//EN" "docbook/docbookx.dtd">
<book>
  <title>XML in a Nutshell</title>
  <bookinfo>
    <author>
      <firstname>Elliotte Rusty</firstname>
      <surname>Harold</surname>
    </author>
    <author>
      <firstname>W. Scott</firstname>
      <surname>Means</surname>
    </author>
  </bookinfo>
  <toc>
    <tocchap><tocentry>Introducing XML</tocentry></tocchap>
    <tocchap><tocentry>XML as a Document Format</tocentry></tocchap>
    <tocchap><tocentry>XML as a "better" HTML</tocentry></tocchap>
  </toc>
  <chapter>
    <title>Introducing XML</title>
    <para></para>
  </chapter>
  <chapter>
    <title>XML as a Document Format</title>
    <para>
      XML is first and foremost a document format. It was always intended for web pages, books, scholarly articles, poems, short stories, reference manuals, tutorials, texts, legal pleadings, contracts, instruction sheets, and other documents that human beings would read. Its use as a syntax for computer data in applications like syndication, order processing, object serialization, database exchange and backup, electronic data interchange, and so forth is mostly a happy accident.
    </para>
    <sect1>
      <title>SGML's Legacy</title>
      <para></para>
    </sect1>
    <sect1>
      <title>TEI</title>
      <para></para>
    </sect1>
    <sect1>
      <title>DocBook</title>
      <para>
        <ulink url="http://www.docbook.org/">DocBook</ulink> is an SGML application designed for new documents, not old ones. It's especially common in computer documentation. Several O'Reilly books have been written in DocBook including <citation>Norm Walsh and Leonard Muellner's <citetitle>DocBook: The Definitive Guide</citetitle></citation>. Much of the <ulink url="http://www.linuxdoc.org/">Linux Documentation Project (LDP)</ulink> corpus is written in DocBook.
      </para>
    </sect1>
  </chapter>
  <chapter>
    <title>XML on the Web</title>
    <para></para>
  </chapter>
  <index>
    <indexentry>
      <primaryie>SGML, 8, 9, 91, 92, 94</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>DocBook, 97-101</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>TEI, 94-97, 101</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>Text Encoding Initiative</primaryie>
      <seeie>TEI</seeie>
    </indexentry>
  </index>
</book>
DocBook offers many advantages to technical authors. First and foremost, it's open, nonproprietary, and can be created with any text editor. It would feel a little silly to write open source documentation for open source software with closed and proprietary tools like Microsoft Word (which is not to say this hasn't been done). If your documents are written in DocBook, they aren't tied to any one platform, vendor, or application software. They're portable across essentially any plausible environment you can imagine.
Not only is DocBook theoretically editable with basic text editors; it's simple enough that such editing is practical as well. Of course, if you'd like a little help, there are a number of free tools available, including an Emacs major mode (http://www.nwalsh.com/emacs/docbookide/index.html). Furthermore, like many good XML applications, DocBook is modular. You can use the pieces you need and ignore the rest. If you need tables, there's a very complete tables module. If you don't need tables, you don't need to know about or use this module. Other modules cover various entity sets and equations.
DocBook is an authoring format, not a format for finished presentation. Before a DocBook document is read by a person, it should be converted to any of several formats, including the following:
XSL Formatting Objects
Rich Text Format (RTF)
For example, if you want high-quality printed documentation for a program, you can convert a DocBook document to TeX, then use the standard TeX tools to convert the resulting TeX file to a DVI and/or PostScript file and print that. If you just want to read it on your computer, then you'd probably convert it to HTML and load it into your web browser. For other purposes, you'd pick something else. With DocBook all these formats come essentially for free. It's very easy to produce multiple output documents in different formats from a single DocBook source document. Indeed, this benefit isn't just limited to DocBook. Most well-thought-out XML input formats are just as easy to publish in other formats.
XML documents that are intended for computers to read are often transitory. For instance, if you create a SOAP document that represents a request to a Windows server running .NET, then that document exists for just as long as it takes the client to send it to the server and for the server to parse it into its internal data structures. After that's done, the document will be discarded. It probably won't be around for two minutes, much less two years. It's an ephemeral communication between two systems, with no more permanence than any of billions of other messages that computers exchange on a daily basis, most of which are never even written to disk, much less archived for posterity.
Some applications do store more permanent computer-oriented data in XML. For instance, XML is the native file format of the Gnumeric spreadsheet. On the other hand, this format is really only understood by Gnumeric and perhaps the other Gnome applications. It's designed to meet the specific needs of that one program. Exchanging data with other applications, including ones that haven't even been invented yet, is a secondary concern.
XML documents meant for humans tend to be more permanent and less software bound, however. If you encode the Declaration of Independence in XML, you want people to be able to read it in two, two hundred, or two thousand years. You also want them to be able to read it with any convenient tool, including ones not invented yet. These requirements have some important implications for both the XML applications you design to hold the data and the tools you use to read and write them.
The first rule is that the format should be very well documented. There should be a DTD, and that DTD should be very well commented. Furthermore, there should be a significant amount of prose documentation as well. Prose documentation can't substitute for the formal documentation of a DTD, but it's an invaluable asset in understanding the DTD.
Standard formats like DocBook and TEI should be preferred to custom, one-off XML applications. You should avoid proprietary DTDs that are owned by any one person or company and whose future may depend on the fortunes of that company or individual. Even DTDs that come from nonprofit consortia like OASIS or TEI should be licensed sufficiently liberally so that intellectual property restrictions won't let anyone throw up road blocks in your path. At least one DTD purveyor has gone so far as to file for patents on its DTDs. These DTDs should be avoided like the plague. Stick to DTDs that may be freely copied and shared and that can be retrieved from many different locations.
Once you've settled on a standard DTD, try to avoid modifying it if you possibly can. If you absolutely must modify it, then document your changes in excruciating, redundant detail. Include comments in both your DTDs and documents, explaining what you've done. Use the parameter entities built into the DTDs to add new element types or subtract old ones, rather than modifying the DTD files themselves.
Conversely, the format shouldn't be too hard to reverse engineer if the documentation is lost. Make sure full names are used throughout for element and attribute names. DocBook's para element is superior to TEI's p element. Paragraph would be better still.
All of the inherent structure of the document should be indicated by markup and markup alone. It should not be left for the user to infer, nor should it be encoded using whitespace or other separators. For instance, here's an example of what not to do from SVG:
<polygon style="fill: blue; stroke: green; stroke-width: 12" points="350,75 379,161 469,161 397,215 423,301 350,250 277,301 303,215 231,161 321,161" />
The style attribute contains three separate and barely related items. Understanding this element requires parsing the non-XML CSS format. The points attribute is even worse. It's a long list of numbers, but there's no information about what each number is. You can't, for instance, see which are the x and which are the y coordinates. An approach like this is preferable:
<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>
The attribute-based style syntax is actually allowed in SVG. However, the debate over which form to use for coordinates was quite heated in the W3C SVG working group. In the end the working group decided (wrongly, in our opinion) that the more verbose form would never be adopted because of its size, even though most members felt it was more in keeping with the spirit of XML. We think the working group overemphasized the importance of document size in an era of exponentially growing hard disks and network bandwidth, not to mention ignoring the ease with which the second format could be compressed for transport or storage.
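The practical cost of the compact form shows up in the code that has to read it. The following Python sketch, assuming only the standard `xml.etree.ElementTree` module, extracts the first three points from each form. The function names are hypothetical, and the attribute parser handles only the "x,y x,y" shape shown above, not every variation a real SVG processor would have to accept.

```python
import xml.etree.ElementTree as ET

COMPACT = '<polygon points="350,75 379,161 469,161"/>'

VERBOSE = """<polygon>
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
</polygon>"""


def points_from_attribute(polygon):
    """Needs a second, hand-written parser for the attribute value.

    That x comes before y is a convention the markup itself never states.
    """
    pairs = polygon.get("points").split()
    return [tuple(int(n) for n in pair.split(",")) for pair in pairs]


def points_from_children(polygon):
    """The XML parser has already done the work; each number is labeled."""
    return [(int(p.get("x")), int(p.get("y"))) for p in polygon.findall("point")]


a = points_from_attribute(ET.fromstring(COMPACT))
b = points_from_children(ET.fromstring(VERBOSE))
print(a == b, a)
```

Both functions return the same coordinates, but only the second relies on nothing beyond XML itself.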
Stylesheets are important. We're all familiar with the injunction to separate presentation from content. You've heard enough warnings about not including mere style information like italics and font choices in your XML documents. However, be careful not to go the other way and include content in your stylesheets either. Author names, titles, copyrights and other such information that changes from document to document belongs in the document, not the stylesheet, even if it's metainformation about the document rather than the actual content of the document.
Always keep in mind that you're not just writing for the next couple of months or years, but possibly for the next couple of thousand years. Have pity on the poor historians who are going to have to decipher your markup with limited tools to help them.
The markup in a typical XML document describes the document's structure, but it tends not to describe the document's presentation. That is, it says how the document is organized but not how it looks. Although XML documents are text, and a person could read them in native form if they really wanted to, much more commonly an XML document is rendered into some other format before being presented to a human audience. One of the key ideas of markup languages in general and XML in particular is that the input format need not be the same as the output format. To put it another way, what you see is not what you get, nor is it what you want to get. The input markup language is designed for the convenience of the writer. The output language is designed for the convenience of the reader.
Of course this requires a means of transforming the input format into the output format. Most XML documents undergo some kind of transformation before being presented to the reader. The transformation may be to a different XML vocabulary like XHTML or XSL-FO, or it may be to a non-XML format like PostScript or RTF.
XML's semiofficial transformation language is Extensible Stylesheet Language Transformations (XSLT). An XSLT document contains a list of template rules. Each template rule has a pattern noting which elements and other nodes it matches. An XSLT processor reads the input document. When it sees something in the input document that matches a template rule in the stylesheet, it outputs the template rule's template. Part of the template is normally an instruction that tells the processor to include content from the input in the output. This allows, for example, the text of the output document to be the same while all the markup is changed. For instance, you could write a stylesheet that would transform DocBook documents into TEI documents. XSLT will be discussed in much more detail in Chapter 8.
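The idea of changing the markup while keeping the text can be sketched without XSLT at all. The Python code below is not XSLT and does not use an XSLT processor; it is a deliberately simplified stand-in in which a dictionary plays the role of the template rules. The `RULES` table, the choice of HTML-like output names, and the `transform` function are all assumptions made for illustration, and attributes are simply dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: input element name -> output element name.
RULES = {"chapter": "div", "title": "h1", "para": "p", "emphasis": "em"}


def transform(element):
    """Copy the tree, renaming each element that matches a rule.

    Text is carried over unchanged; attributes are not copied.
    """
    out = ET.Element(RULES.get(element.tag, element.tag))
    out.text = element.text
    out.tail = element.tail
    for child in element:
        out.append(transform(child))
    return out


SOURCE = (
    "<chapter><title>TEI</title>"
    "<para>Some <emphasis>text</emphasis>.</para></chapter>"
)

result = ET.tostring(transform(ET.fromstring(SOURCE)), encoding="unicode")
print(result)
```

A real XSLT stylesheet expresses the same mapping declaratively and can match on far richer patterns than element names alone.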
However, XSLT is not the only transformation language you can use with your XML documents. Other stylesheet languages such as the Document Style Semantics and Specification Language (DSSSL, http://www.jclark.com/dsssl/) are also available. So are a variety of proprietary tools like OmniMark (http://www.omnimark.com/). Most of these have particular strengths and weaknesses for particular kinds of documents. Custom programs written in a variety of programming languages, such as Java, C++, Perl, and Python, can use a plethora of APIs, such as SAX, DOM, and JDOM, to transform documents. This is sometimes useful when you need something more than a mere transformation; for instance, interpreting certain elements as database queries and actually inserting the results of those queries into the output document, or asking the user to answer questions in the middle of the transformation. However, the biggest single factor when choosing which tool to use is simply which language and syntax you're most comfortable with. De linguis non disputandum est.
There are many different choices for the output format from a transformation. A PostScript file can be printed on paper, overhead transparencies, slides, or even T-shirts. A PDF document can be viewed in all these ways and shown on the screen as well. However, for screen display, PDF is vastly inferior to simple HTML, which has the advantages of being very broadly accessible across platforms and being very easy to generate via XSLT from source XML documents. Generating a PDF or a PostScript file normally requires an additional conversion step in which special software converts some custom XML output format like XSL-FO to what you actually want.
An alternative to a transformation-based presentation is to provide a descriptive stylesheet that simply states how each element in the original document should be formatted. This is the realm of Cascading Style Sheets (CSS). This works particularly well for narrative documents where all that's needed is a list of the fonts, styles, sizes, and so on to apply to the content of each element. The key is that when all markup is stripped from the document, what remains is more or less a plain-text version of what you want to see. No reordering or rearrangement is necessary. This approach works less well for data-oriented documents where the raw content may be nothing more than an undifferentiated mass of numbers, dates, or other information that's hard to understand without the context and annotations provided by the markup. However, in this case a combination of the two approaches works well. First a transformation can produce a new document containing rearranged and annotated information. Then a CSS stylesheet can apply style rules to the elements in this transformed document.