8.4. Documentation Environments

This section describes four of the most common file formats and associated documentation environments that are encountered in small to medium-sized software companies and in open source projects.

8.4.1. Raw Text

Text files are the most easily created and portable open document format, but they have many disadvantages. You can't change their formatting easily, since they usually have no formal structure beyond a title and some section headings. They have no hyperlinks, they don't include images (though ASCII art does have its own beauty), and non-ASCII characters are handled in different ways by different tools. Another disadvantage is that the end of a line is marked differently on Unix and Windows machines, though this is often well hidden from users. To improve the printed appearance of text documents, care has to be taken to keep line lengths below about 80 characters.
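The line-ending and line-length problems can be handled by a small script. The following is a minimal sketch in Python, not a tool mentioned in this chapter; the function name normalize_text and the default width of 78 are assumptions chosen for illustration.

```python
import textwrap


def normalize_text(raw: bytes, width: int = 78) -> str:
    """Return text with Unix line endings and lines wrapped to `width`."""
    # Decode as ASCII; anything outside the 7-bit range becomes U+FFFD.
    text = raw.decode("ascii", errors="replace")
    # Convert Windows (CRLF) and old Macintosh (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Reflow each blank-line-separated paragraph on its own.
    paragraphs = text.split("\n\n")
    wrapped = [textwrap.fill(p, width=width) for p in paragraphs]
    return "\n\n".join(wrapped) + "\n"
```

Note that this reflows every paragraph, so it would also flatten ASCII art or hand-aligned tables; a real script would need to skip those.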

Many IBM platforms use EBCDIC instead of ASCII for representing raw text, but fortunately conversion of text files to and from ASCII is not difficult. Some text files handle non-English characters by using the Unicode encoding standard (http://www.unicode.org).
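As an illustration of how simple the conversion can be, here is a sketch that uses the EBCDIC codecs shipped with Python. It assumes code page 037 (cp037); EBCDIC has many code pages, and which one a given IBM platform uses has to be checked. The function names are hypothetical.

```python
def ebcdic_to_ascii(data: bytes, codepage: str = "cp037") -> bytes:
    """Convert EBCDIC bytes to ASCII, replacing unmappable characters."""
    text = data.decode(codepage)
    return text.encode("ascii", errors="replace")


def ascii_to_ebcdic(data: bytes, codepage: str = "cp037") -> bytes:
    """Convert ASCII bytes to EBCDIC in the assumed code page."""
    return data.decode("ascii").encode(codepage)
```

Characters that exist in the EBCDIC code page but not in ASCII are replaced with a question mark on the way out, so the conversion is only lossless for plain ASCII content.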

If you are writing the basic documentation for open source projects, then raw text is probably still the most common format, at least for small files such as README files, change logs, and release notes. There is even an artist mode for Emacs that lets you create ASCII art freehand within your text. However, raw text files are unlikely to provide what you want for the documentation of larger projects.

8.4.2. FrameMaker

FrameMaker (http://www.adobe.com/products/framemaker) is a well-established commercial document editor from Adobe that starts at $449 per user. FrameMaker has a choice of two native file formats: the default, binary format (.fm files) and a text-based format called MIF that allows you to modify the file via command-line scripts and also saves space with SCM tools. The MIF file format is openly available at http://partners.adobe.com/public/developer/en/framemaker/MIF_Reference.pdf. Recent versions of FrameMaker (7.0 and later) also have built-in support for working with XML as a source file format.
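Because MIF is plain text, a script can edit document content without opening FrameMaker. The sketch below assumes the MIF convention of writing paragraph text as <String `...'> statements (opening backquote, closing apostrophe); the sample document is heavily abbreviated, and the function name is hypothetical. Check the MIF Reference before relying on it, since real files also contain escape sequences that this sketch ignores.

```python
import re

# Matches MIF String statements such as: <String `some text'>
STRING_STATEMENT = re.compile(r"<String `([^']*)'>")

SAMPLE = """<MIFFile 7.00>
<Para
 <PgfTag `Body'>
 <ParaLine
  <String `Version 1.0 of the Body tool'>
 >
>
"""


def replace_in_mif_strings(mif_text: str, old: str, new: str) -> str:
    """Replace `old` with `new` inside String statements only."""
    def swap(match):
        return "<String `" + match.group(1).replace(old, new) + "'>"
    return STRING_STATEMENT.sub(swap, mif_text)
```

Restricting the replacement to String statements means that tag names such as the paragraph format in PgfTag are left alone, which a plain search and replace would not guarantee.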

FrameMaker is currently available only on Windows and Solaris. GNU/Linux is not supported, and Adobe discontinued FrameMaker for Macintosh in 2004. The Solaris version comes with a tool named fmbatch that allows you to manipulate .fm files from the command line, including converting to and from MIF, and printing documents to PostScript. There is a similar tool for Windows named DZbatcher available for download from http://www.datazone.com/english/overview/download.html.

Conversion to PDF is commonly achieved by printing the FrameMaker book as PostScript and then using Adobe Distiller to produce PDF, which takes about one minute per hundred pages on an ordinary desktop machine. HTML can be produced dynamically by exporting the .fm file as XML and then using XSL transformations to produce the HTML for web server pages. Alternatively, tools such as "WebWorks Publisher Professional for FrameMaker" from Quadralay (http://www.quadralay.com) are commonly used to generate HTML, though some customization for your own templates will likely be necessary. The conversion process is illustrated in Figure 8-1.

Figure 8-1. Conversion of a FrameMaker document from .fm to PDF and HTML


Apart from the official support web site and knowledge base, there is also a vocal FrameMaker community based around http://www.freeframers.org. Thanks to the FrameMaker Developer Kit, there are any number of plug-ins available for FrameMaker.

Though Adobe doesn't seem to be encouraging use of FrameMaker as strongly as in the past, knowledge of FrameMaker is still considered to be a primary requirement for many technical publication positions, and it's still the most common documentation tool for computer-related companies, including Microsoft for its own technical documentation.

8.4.3. XML: DocBook and OpenOffice

The advantages of editing text-based documentation are many, but text needs some structure to make it more useful. The most popular way to add structure has been to use one of the "markup" file formats such as HTML or XML. HTML is adequate for displaying pages in web browsers but doesn't contain enough markup for creating other kinds of documentation. Browsers are also very tolerant of incorrect HTML, so they shouldn't be the only thing used to test that your HTML is correct.
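One way to test HTML without relying on a browser's tolerance is to check that tags are balanced. The following is a minimal sketch using Python's standard html.parser module. It is deliberately stricter than HTML itself (it expects closing tags that HTML allows you to omit, such as </p>), the list of void elements is partial, and it is no substitute for a real validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list).
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, line where it opened)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> need no balancing

    def handle_endtag(self, tag):
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.open_tags]:
            self.problems.append(f"line {line}: </{tag}> has no matching start tag")
            return
        # Pop back to the matching start tag, reporting anything left open.
        while self.open_tags:
            name, opened_at = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(f"line {opened_at}: <{name}> is never closed")


def check_html(source: str) -> list:
    """Return a list of tag-balance problems found in `source`."""
    checker = TagBalanceChecker()
    checker.feed(source)
    checker.close()
    problems = list(checker.problems)
    for name, opened_at in checker.open_tags:
        problems.append(f"line {opened_at}: <{name}> is never closed")
    return problems
```

A browser would display a page with an unclosed <b> element without complaint; this check reports it.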

XML is more flexible, and a number of different ways to represent documents in XML now exist. These different ways are defined by the DTDs or schemas that describe which elements go where in an XML document. The big promise of XML is that because it has a well-defined structure, you should be able to transform XML files to other formats by using XSL scripts and other such tools. In practice, this really is true, but is rarely quite as easy as it first sounds. Using an XML-based documentation environment in 2005 requires a technical publications group whose members are willing to dig a little to resolve the inevitable teething problems of being early adopters.

8.4.3.1. DocBook

DocBook (http://www.oasis-open.org/docbook) is another markup language definition that uses XML (or less commonly, SGML, the big brother of XML). The XML DTD for DocBook is defined by OASIS, a nonprofit standards body, and was created in the early 1990s. Though it was originally created as a standard for computer documentation, DocBook can be used for any kind of documentation. DocBook's strengths are that it is an open file format, is text-based so multiple people can work on each file at once, and can be automatically converted to many different release formats. It is not usually used in WYSIWYG editors (though they do exist) and it takes more effort to set up than other documentation environments. DocBook is the primary source-file format for several large open source projects including FreeBSD, Apache, Samba, GNOME and KDE, and the Linux Documentation Project.

The official documentation for DocBook is DocBook: The Definitive Guide, by Norman Walsh (O'Reilly), which is also available online at http://docbook.org/tdg. Another useful book about DocBook and using XSL to transform it to other formats is DocBook XSL: The Complete Guide, by Bob Stayton (Sagehill), freely available online as HTML at http://www.sagehill.net/docbookxsl.

Generating simple HTML files directly from DocBook XML files works well enough for many web sites, but for finer control over the released files' appearance, many DocBook-based environments use XSL-FO (XSL Formatting Objects) as an intermediate file format. XSL-FO is XML that describes how a document should appear, as opposed to DocBook XML, which describes the purpose of each part of the document. Using stylesheets from http://docbook.sourceforge.net, the DocBook XML is transformed into XSL-FO XML. From there, an XSL-FO tool can create PostScript, PDF, or a number of other formats. The overall process is shown in Figure 8-2.

Figure 8-2. Conversion of a DocBook document from .xml to PDF and HTML
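To make the distinction between purpose and appearance concrete, the sketch below maps a few DocBook elements to HTML. It is a stand-in written with Python's standard library, not the DocBook XSL stylesheets and not XSLT at all; it handles only chapter, title, para, and emphasis, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

DOCBOOK = """<chapter>
  <title>Documentation Environments</title>
  <para>Raw text is portable.</para>
  <para>DocBook adds <emphasis>structure</emphasis>.</para>
</chapter>"""


def inline_html(element) -> str:
    """Render an element's text and inline children as HTML."""
    parts = [escape(element.text or "")]
    for child in element:
        inner = inline_html(child)
        if child.tag == "emphasis":
            inner = f"<em>{inner}</em>"
        parts.append(inner)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def chapter_to_html(xml_text: str) -> str:
    """Convert a tiny DocBook chapter to HTML, one line per block."""
    chapter = ET.fromstring(xml_text)
    lines = []
    for child in chapter:
        if child.tag == "title":
            lines.append(f"<h1>{inline_html(child)}</h1>")
        elif child.tag == "para":
            lines.append(f"<p>{inline_html(child)}</p>")
    return "\n".join(lines)
```

The source says only that something is a title or is emphasized; the decision to render those as <h1> and <em> lives entirely in the conversion step, which is the role the stylesheets play in a real tool chain.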


The best-known open source FO processor is FOP (http://xml.apache.org/fop), from the Apache Project. While it works fairly well, the current version of FOP does not implement some parts of the FO specification; for example, keeping titles and their text on the same page. The next version intends to correct many of these problems.

Two commercial tools for working with FO are XEP (http://www.renderx.com) and the XSL Formatter (http://www.antennahouse.com). There is a long comparison of which parts of the XSL-FO specification are supported by different processors at http://www.antennahouse.com/xslfo/comparison-fo.htm. There is also a book about this whole process, XSL-FO: Making XML Look Good in Print, by Dave Pawson (O'Reilly). I also found the DocBook FAQ that he maintains at http://www.dpawson.co.uk/docbook to be a good online resource.

8.4.3.2. The tools used to write this book

O'Reilly uses FrameMaker 5.5.6 as the common file format for almost all of its books, including this one. Authors write their text in one format, which is usually Microsoft Word or OpenOffice but can also be DocBook or POD (for the first three, see Section 8.4.4, Section 8.4.3.3, and Section 8.4.3.1, respectively; POD is covered in Section 8.5, later in this chapter). The original is then converted to FrameMaker, either by importing the source files directly or by using scripts to convert the source files into XML that is suitable for importing by FrameMaker. The FrameMaker file is then copyedited, figures are added, and an index is created; finally, the book is sent to the printing company as a set of PDF files with the necessary cutting marks on the page edges.

This book was written using DocBook Lite (dblite), which is an O'Reilly-defined subset of DocBook available from ftp://ftp.ora.com/pub/dblite/dblite.tar.gz. The text was added to the XML files using Emacs and its PSGML mode, with one file per chapter and a single file named book.xml to bring all the chapters together. The ability to tidy up a paragraph with fill-region-as-paragraph made reading large amounts of marked-up text much easier.

Generating HTML was simple using the Perl script db2h that comes with the dblite package. This script uses xsltproc (http://xmlsoft.org), a useful tool for running XSL scripts that is available for Unix, Windows, and Mac OS X. The XSL script generates the HTML, with one web page per chapter and a basic table of contents. Alternatively, a different XSL script can produce one HTML file per section. Generating the HTML for this book took around 30 seconds on an ordinary desktop machine.

Generating PDF was harder work. The original tool chain for DocBook XML to PDF was to convert the XML to LaTeX, then generate a .dvi file from that, and then convert the .dvi file to PDF. Instead I used FOP from the Apache Project, which is an open source tool to convert files from the XML format named FO to PDF or PostScript. FOP hides the intermediate FO step nicely, and I was able to create a single PDF file, complete with bookmarks and internal hyperlinks, in under a minute on the same ordinary desktop machine. The precise steps for Red Hat Linux 8.0 were:

  1. Install a JVM and set your JAVA_HOME environment variable to it. I used Sun's J2SE 1.4.2.

  2. Download and unpack the file docbook-xsl-1.67.2 (gzip'd .tar or .zip, according to preference) from http://docbook.sourceforge.net into the same directory as your source DocBook XML files.

  3. Create a file named fo-stylesheet.xsl in your source directory. This file is what you use to customize the PDF output. Mine started off looking like:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:fo="http://www.w3.org/1999/XSL/Format"
                    xmlns:exsl="http://exslt.org/common"
                    extension-element-prefixes="exsl"
                    exclude-result-prefixes="exsl"
                    version='1.0'>
      <xsl:import href="docbook-xsl-1.67.2/fo/docbook.xsl"/>
      <xsl:param name="fop.extensions" select="1" />
      <xsl:param name="variablelist.as.blocks" select="1" />
    </xsl:stylesheet>

  4. Download and unpack the file fop-0.20.5.bin (gzip'd .tar or .zip) from http://xml.apache.org/fop/download.html#binary into a convenient directory, which should be added to your PATH or otherwise made available at the command line.

  5. In your source file directory, type:

    fop.sh -xml book.xml -xsl fo-stylesheet.xsl -pdf book.pdf

    You may get some warnings about things not yet implemented by FOP. These can be ignored. (I wish now that I'd sent a patch with a command-line argument to suppress such warnings.)

The generated PDF file is named book.pdf and includes bookmarks and a hyperlinked table of contents for books and manuals. If you need to change the locations of the various packages and files, use absolute names. There are a large number of parameters for the DocBook generation of an FO file, as documented at http://docbook.sourceforge.net/release/xsl/current/doc/fo.
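If the PDF is generated as part of a regular build, the fop.sh command from step 5 can be wrapped in a script. The following is a sketch only: the function names are hypothetical, the file names are the ones used in the steps above, and it assumes fop.sh is on your PATH. It resolves every file to an absolute name, following the advice about changed locations.

```python
import subprocess
from pathlib import Path


def fop_command(source_dir, fop="fop.sh"):
    """Build the FOP argument list, using absolute file names."""
    source_dir = Path(source_dir).resolve()
    return [
        fop,
        "-xml", str(source_dir / "book.xml"),
        "-xsl", str(source_dir / "fo-stylesheet.xsl"),
        "-pdf", str(source_dir / "book.pdf"),
    ]


def build_pdf(source_dir):
    """Run FOP and return its exit status; warnings go to stderr."""
    result = subprocess.run(fop_command(source_dir),
                            capture_output=True, text=True)
    return result.returncode
```

Keeping the argument list in its own function makes it easy to print the exact command for debugging before running it.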

Two things that I never got the FOP PDF generation to do well were keeping section headers and their text together with soft pagebreaks, and handling long URLs across linebreaks. The latter sometimes led to ugly justification of the surrounding text, with the words padded out too far apart.

I found that the XML validation script xwf that came with dblite didn't give me enough information about which line my errors were on, so I used xmllint instead, which comes with xsltproc. xmllint can be run on the whole book or on individual chapters, which helps when tracking down things like a subtly missing closing slash. Overall, I didn't find any open source XML tool that made it really easy to find errors in a large document made up of multiple XML files.
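The kind of line-numbered error report described here can be approximated with any XML parser that records positions. The sketch below uses Python's standard xml.etree.ElementTree module rather than xmllint, and the function name is hypothetical. It checks well-formedness only, not validity against the DocBook DTD, and it is meant to be run on one chapter file at a time.

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_text: str):
    """Return (line, column, message) for the first error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return line, column, str(error)
    return None
```

As with most parsers, the position reported is where the parser noticed the problem, which for a missing closing slash is usually the next closing tag rather than the line that needs fixing.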

8.4.3.3. OpenOffice

OpenOffice (http://www.openoffice.org) is a large open source office suite with a word processor, spreadsheet, and presentation editor, among other applications. It is intended to compete with Microsoft Office by running on not just Windows and Mac OS X, but also Linux and Solaris (and other Unixes are in progress). The two big advantages of OpenOffice are that it's available at no cost and it can read and write the file formats used by Microsoft Word, Excel, and PowerPoint, at least if their more complex features aren't used. The problems arise when you use complex Excel macros or newer features of Word. In this case, OpenOffice will usually ignore what it doesn't understand.

There is also a partly closed version of OpenOffice named StarOffice, which has better support for Asian fonts, more clip art, and a database. StarOffice is available from Sun (http://www.sun.com/software/star/staroffice) for around $80 per user.

The native file format for OpenOffice is a zip archive of XML files, but it uses a different set of DTDs and schemas than DocBook. The configuration of OpenOffice is also controlled by XML files, which has helped it to support many localized versions. Another of the strengths of OpenOffice is its ability to generate PDF directly from the source files, though without bookmarks and internal hyperlinks. Command-line tools are well supported by OpenOffice, but I recommend reading OpenOffice.org Writer, by Jean Hollis Weber (O'Reilly), if you intend to generate HTML and PDF automatically from OpenOffice as part of your documentation environment. There are a growing number of other OpenOffice books available; there's even one for dummies, it seems.
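Since the native files are compressed XML, their text can be read without starting OpenOffice. The sketch below rests on two assumptions that you should check against your own files: that the document is packaged as a ZIP archive, and that the body text lives in a member named content.xml. The function name is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def document_text(package) -> str:
    """Return the text of content.xml from a file name or file object."""
    with zipfile.ZipFile(package) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # Join every non-empty text node, ignoring the markup around it.
    pieces = (text.strip() for text in root.itertext())
    return " ".join(piece for piece in pieces if piece)


def make_sample_package() -> io.BytesIO:
    """Build a tiny stand-in archive in memory (simplified element names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml",
                         "<document><p>Hello</p><p>World</p></document>")
    buffer.seek(0)
    return buffer
```

The sample archive uses invented element names without namespaces; real OpenOffice content uses its own namespaced schema, but the extraction step works the same way because it ignores element names entirely.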

If you're not generating complex documents and you really need to be able to edit the source files on both Windows and GNU/Linux machines, then OpenOffice may work very well for you. New features are still being added with each yearly release, so this tool is definitely one to watch for further improvements.

8.4.4. Microsoft Word

Microsoft Word (http://www.microsoft.com/office/word) is part of the Microsoft Office suite of programs. Microsoft Office runs on Windows and Macintosh and retails for $399 per user (though installation on two machines is permitted); Word can also be purchased separately for somewhat less. Word is the most common word processing tool in many companies today and is relatively easy to use, at least for simple tasks. What you see on the screen while editing a document is, for the most part, an accurate rendition of what you will see when you print the document. Editing simple images is built into Word, though if you want to convert the images to a different format, you have to cut and paste them to another application such as Microsoft Paint. If you need to view and edit Word documents on platforms other than Windows or Macintosh, then OpenOffice (see the previous section, Section 8.4.3.3) or AbiWord (http://www.abiword.org) are both able to handle basic Word files.

The Microsoft Word file format, known casually as "doc" (from its default file extension, .doc) is a proprietary format that has changed substantially between major versions of Microsoft Word. Word files are large and quite complex since they can store macros, images, and previous versions of documents. Word provides its own tool for clearly showing different people's edits with change bars and color-coded lines below or through altered text.

There is a text-based, open file format named RTF (Rich Text Format) to which all Word files can be exported, though some formatting information is lost during this process. A better approach with more recent versions of Word is to use XML as a text-based export and import file format.

Each version of Word can import files from the previous major version, but this is not always true for versions that are older than that. With Word 2003, support for exporting files to XML is much improved, and if XML becomes a common choice for a source file format, then the upgrade problems may be more easily solved in the future. Other risks of using Word as a documentation environment are viruses disguised as Word macros and the fact that it is easy to accidentally leave information from previous versions inside a document where others may see it.

HTML files generated from Word have traditionally contained large amounts of Microsoft-specific HTML, along with a lot of directives to make the HTML resemble the printed page as closely as possible. More recent versions of Word have the option to generate "filtered" HTML, which is cleaner and smaller. Generating PDF from Word is possible with any number of small commercial converters, and there is also the open source GhostWord project (http://ghostword.sourceforge.net). The overall process is shown in Figure 8-3. For anyone putting together a documentation environment using Word, Word Hacks, by Andrew Savikas (O'Reilly), contains many of the mechanisms and scripts used by O'Reilly to produce PDF and HTML from the Word source files of most of its books.

Figure 8-3. Conversion of a Word document from .doc to PDF and HTML




Practical Development Environments
ISBN: 0596007965
Year: 2004
Pages: 150
