Section 20.2. XML and DocBook

20.2. XML and DocBook

XML (and its predecessor SGML) goes one step beyond earlier text markup languages. It imposes a hierarchical structure on the text that shows the relation of each element to the containing elements. This makes it possible to convert the text to a number of output formats, including PostScript and PDF (the Adobe Portable Document Format).

XML itself is just a framework for defining the structure of a document. A so-called Document Type Description (DTD) or schema then defines what kind of markup you are allowed to use in a document.

SGML, the Standard Generalized Markup Language, was the first of these document description languages to be standardized, but it has mostly fallen into oblivion these days. Its two descendantsHTML and XMLare famous, though, and even overly hyped. Essentially, HTML is an implementation of SGML with a fixed set of tags that is useful for formatting web pages. XML, the eXtended Markup Language, is a general solution like SGML, but minus some of its more difficult features. Both SGML and XML allow people to define any set of tags they like; the exact tags and their relationships are specified in the DTD or schema (which are optional in XML).

For each DTD or schema that you want to use, you need to have processing tools that convert the SGML or XML file to the desired output format. Historically, most free systems did this by means of a system called DSSSL (short for Document Style Semantics and Specification Language). XSLT (eXtended Stylesheet Language Template) is now much more popular for converting XML to other formats. As the author of an SGML or XML document, this is nothing you need to be concerned with, but if you are the one to set up the toolchain or want to change the way the output looks, you need to know how the processing is done.

In the field of computer documentation, the most commonly used DTD is DocBook. Among many other things, most of the freely available Linux documentation is written with DocBook, as well as this book. DocBook users include a huge range of companies and well-known organizations, such as Sun Microsystems, Microsoft, IBM, Hewlett-Packard, Boeing, and the U.S. State Department.

To give you an example of how DocBook text can look, here is a fragment of an article for a computer magazine:

 <!DOCTYPE Article  PUBLIC "-//OASIS//DTD DocBook V4.1.2//EN"> <article>   <artheader>     <title>Looping the Froz with Foobar</title>     <author>       <firstname>Helmer B.</firstname>       <surname>Technerd</surname>       <affiliation>         <orgname>Linux Hackers, Inc.</orgname>       </affiliation>     </author>   </artheader>   <abstract>     <para>This article describes a technique that you can employ to loop the Froz with the Foobar software package.</para>   </abstract>   <sect1>     <title>Motivation</title>     <para>Blah, blah, blah, ...     </para>   </sect1> </article>

The first line specifies the DTD to be used and the root element; in this case we are creating an article using the DocBook DTD. The rest of the source contains the article itself. If you are familiar with HTML, the markup language used for the World Wide Web (see the O'Reilly book HTML & XHTML: The Definitive Guide, by Chuck Musciano and Bill Kennedy), this should look a bit familiar. Tags are used to mark up the text logically.

Describing the whole DocBook DTD is well beyond the scope of this book, but if you are interested, check out DocBook: The Definitive Guide by Norman Walsh and Leonard Muellner (O'Reilly).

Once you have your article, documentation, or book written, you will want to transform it, of course, into a format that you can print or view on the screen. In order to do this, you need a complete XML setup, which, unfortunately, is not easy to achieve. In fact, you need so many pieces in place that we cannot describe this here. But there is hope: a number of distributions (including Red Hat, SUSE, and Debian) come with very good XML setups out of the box; just install their respective XML packages. If you have a working SGML or XML system, you should be able to transform the text shown previously to HTML (as one example) with a command like this:

 tigger$ db2html myarticle.xml input file was called   output will be in myarticle TMPDIR is db2html.C14157 working on /home/kalle/myarticle.xml about to copy cascading stylesheet and admon graphics to temp dir about to rename temporary directory to "myarticle"

The file myarticle/t1.html will contain the generated HTML. If you would like to generate PDF instead, use the following command:

 tigger db2pdf myarticle.xml tex output file name is /home/kalle/projects/rl5/test.tex tex file name is /home/kalle/projects/rl5/test.tex pdf file name is test.pdf This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) entering extended mode (/home/kalle/projects/rl5/test.tex JadeTeX 2003/04/27: 3.13 (/usr/share/texmf/tex/latex/psnfss/t1ptm.fd) Elements will be labelled Jade begin document sequence at 21 (./test.aux) (/usr/share/texmf/tex/latex/cyrillic/t2acmr.fd) (/usr/share/texmf/tex/latex/base/ts1cmr.fd) (/usr/share/texmf/tex/latex/hyperref/nameref.sty) (./test.out) (./test.out) (/usr/share/texmf/tex/latex/psnfss/t1phv.fd) [1.0.49{/var/lib/texmf/fonts/map/p dftex/updmap/pdftex.map}] [2.0.49] (./test.aux) ){/usr/share/texmf/fonts/enc/dv ips/psnfss/8r.enc}</usr/share/texmf/fonts/type1/urw/times/utmri8a.pfb></usr/sha re/texmf/fonts/type1/urw/times/utmr8a.pfb></usr/share/texmf/fonts/type1/urw/hel vetic/uhvb8a.pfb> Output written on test.pdf (2 pages, 35689 bytes). Transcript written on test.log.

As you can see, this command uses T_EX in the background, or more specifically a special version called Jade that is geared toward documents produced by DSSSL.

This is all nice and good, but if you want to change the way the output looks, you'll find DSSSL is quite cumbersome to use, not least because of the lack of available documentation. We will therefore briefly introduce you here to the more modern mechanism using XSLT and FOP. However, be prepared that this will almost invariably require quite some setup work on your side, including reading ample amounts of online documentation.

In an XSLT setup, the processing chain is as follows: First, the XML document that you have written, plus a so-called stylesheet written in XSL (eXtended Stylesheet Language), are run through an XSLT (eXtended Stylesheet Language Template) processor such as Saxon. XSL is yet another DTD; in other words, the stylesheets are XML documents themselves. They describe how each element in the document to be processed will be converted into other elements or body text. Naturally, the stylesheet needs to fit the DTD you have authored your document in. Also, depending on your output target, you will need to use different stylesheets.

If HTML is your target, you are already done at this point. Because HTML is itself XML-conforming, the stylesheet is able to convert your document into HTML that can be directly displayed in your web browser.

If your target is more remote from the XML input (e.g., PDF) you need to go another step. In this case, you do not generate PDF directly from your document, because PDF is not an XML format but rather a mixed binary/text format. Basic XSLT transformation would be very difficult, if not plain impossible. Instead, you use a stylesheet that generates XSL-FO instead, yet another acronym starting with X (eXtended Stylesheet Language Formatting Objects). The XSL-FO document is another XML document, but one where many of the logical instructions in the original document have been turned into physical instructions.

Next, an FO processor such as Apache FOP is run on the XSL-FO document and generates PDF (or other output, if the FO processor supports that).

Now that you have an idea of the general picture, lets look at what you need to set up. First of all, it may be a good idea to run your XML document through a document validator. This does not do any processing, but just checks whether your document conforms to the specified DTD. The advantage of a document validator is that it does this very fast. Because actual processing can be time-consuming, it is good if you can find out first whether your document is ill-formed and bail out quickly.

One such document validator is xmllint. xmllint is a part of libxml2, a library that was originally developed for the GNOME desktop, but is completely independent of it (and actually also used in the KDE desktop). You can find information about xmllint and download it from http://xmlsoft.org.

xmllint is used as follows:

 owl$ xmllint myarticle.xml > myarticle-included.xml

The reason that xmllint is writing the file to standard output is that it can also be used to process X-Includes. These are a technique to modularize XML files in an author-friendly way, and xmllint puts the pieces back together. You can find more information about X-Includes at http://www.w3.org/TR/xinclude.

In the next step, the stylesheet needs to be applied. Saxon is a good tool for this. It comes in a Java version and a C++ version. Often, it does not matter much which you install: the C++ one runs faster, but the Java one has a few additional features, such as automatic escaping of special characters in included program listings. You can find information and downloads for Saxon at http://saxon.sourceforge.net.

Of course, you also need a stylesheet (often, this is a huge set of files, but it is still usually referenced in the singular). For DocBook, nothing exceeds the DocBook-XSL package, which is maintained by luminaries of the DocBook world. You can find it at http://docbook.sourceforge.net/projects/xsl.

Assuming that you are using the Java version of Saxon, you would invoke it more or less as follows for generating XSL-FO:

 java com.icl.saxon.StyleSheet myarticle-included.xml docbook-xsl/fo/docbook.xsl > \      myarticle.fo

and as follows for generating HTML:

 java com.icl.saxon.StyleSheet myarticle-included.xml docbook-xsl/html/docbook.xsl > \      myarticle.html

Notice how only the choice of stylesheet determines the output format.

As was already described, for HTML you are done at this point. For PDF output, you still need to run a FOP processor such as Apache FOP, which you can get from http://xmlgraphics.apache.org/fop. FOP requires a configuration file; see the documentation for how to create one. Often you can just use the userconfig.xml file that ships with FOP. A canonical invocation would look like this, PDF being the standard output format:

 java org.apache.fop.apps.Fop -c configfile myarticle.fo myarticle.pdf

Now you know the general setup and which tools you can use; remember that there are many other similar tools available that might serve your purposes even better. You may ask where your own formatting requirements come in. At this point, all the formatting is determined by the DocBook-XSL stylesheets. And this is also where you can hook into the process. Instead of supplying the docbook.xsl file to Saxon, you can also specify your own file. Of course, you do not want to copy the tremendous amount of work that has gone into DocBook-XSL; instead, you should import the DocBook-XSL stylesheet into your stylesheet, and then overwrite some settings. Here is an example for a custom stylesheet:

 <?xml version='1.0'?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"                 xmlns:fo="http://www.w3.org/1999/XSL/Format"                 version='1.0'                 xmlns="http://www.w3.org/TR/xhtml1/transitional"                 exclude-result-prefixes="#default"> <xsl:import href="docbook-xsl/fo/docbook.xsl"/> <xsl:param name="paper.type" select="'B5'"/> <xsl:param name="shade.verbatim" select="1"/> <xsl:param name="chapter.autolabel" select="1"/> <xsl:attribute-set name="section.title.level1.properties">   <xsl:attribute name="color">     <xsl:value-of select="'#243689'"/>   </xsl:attribute> </xsl:attribute-set> </xsl:stylesheet>

What is happening here? After the boilerplate code at the beginning, the <xsl:import> element loads the default FOP-generating stylesheet (of course, you would use another stylesheet for HTML generation). Then we set a number of parameters; a lot of settings in DocBook-XSL are parametrized, and a <xsl:param> element is all that is needed. In this case, we select a certain output paper format, ask for verbatim blocks to be shaded, and ask for automatic generation of table-of-contents labels for chapters.

Finally, we make a change that cannot be done merely by setting parameters: changing the color of level 1 section titles. Here we overwrite an attribute set with a color attribute of our own. For more complex changes, it is sometimes even necessary to replace element definitions from DocBook-XSL completely. This is not an easy task to do, and you would be well advised to read the DocBook-XSL documentation thoroughly.

XML opens a whole new world of tools and techniques. A good starting point for getting inspired and reading up on this is the web site of the Linux Documentation Project , which, as mentioned before, uses XML/DocBook for all its documentation. You'll find the Linux Documentation Project at http://www.tlpd.org.