Section 8.3. File Formats for Documentation


This section describes some of the commonly used formats for documentation files and discusses the strengths and weaknesses of each file format. Some formats (such as XML) are more common as source formats, that is, the files where the raw content is added. Other formats (such as PostScript and PDF) are more often used as release formats, that is, the files that are available for customers to use. Only a few formats (such as raw text files) are used as both source and release file formats.

Current best practice is to provide documents as both HTML web pages (for fast access) and PDF (for downloading complete documentation packages and for printing). It's helpful to provide a single web page from which every format of your documentation can be downloaded, so that people only have to remember one URL. For example, offer http://www.example.com/myproduct/docs rather than separate pages such as http://www.example.com/myproduct/pdf/manual and http://www.example.com/myproduct/html/manual.


Some common requirements of a file format and the tools that support it are:

  • Typeset printing, often using different file formats, sizes, and layouts

  • Online viewing, often with hyperlinks

  • Images interleaved with text

  • Searching documents for text or formatting

  • Support for non-English languages and characters

  • Comments that can be mixed with text for reviews but that don't appear in the final product

  • Joining and splitting files

  • Generating lists of the differences between versions of the document, or diffability[1]

    [1] The term diffability is used here to suggest how easy it is to run a command such as diff on files in a particular format. In another context, diffability also refers to people with different abilities, rather than disabilities.

Text-based source file formats such as XML are considerably easier to modify from the command line with simple tools that already exist. Modifying files in binary formats generally requires more effort; usually you have to convert them to a text-based format, make the changes, and then convert the files back to the binary format.
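As a rough illustration of diffability, the following Python sketch uses the standard difflib module to list the differences between two versions of a small XML source file. The XML content, the function name list_differences, and the filenames old.xml and new.xml are all hypothetical and chosen only for this example.

```python
import difflib

# Two versions of a hypothetical XML source file, held as lists of lines.
old = [
    "<section>\n",
    "  <title>Installing</title>\n",
    "  <para>Run the installer.</para>\n",
    "</section>\n",
]
new = [
    "<section>\n",
    "  <title>Installing</title>\n",
    "  <para>Run the installer as root.</para>\n",
    "</section>\n",
]


def list_differences(before, after):
    """Return a unified diff of two lists of lines, as a list of lines."""
    return list(
        difflib.unified_diff(before, after, fromfile="old.xml", tofile="new.xml")
    )


if __name__ == "__main__":
    # Removed lines are prefixed with "-", added lines with "+".
    print("".join(list_differences(old, new)))
```

Because the source is plain text, the changed paragraph shows up as one removed line and one added line. The same comparison on a binary format would first need a conversion step.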

Another aspect to consider carefully when choosing a file format is that of closed and open formats. If your documents are stored in a proprietary, closed format, then your ability to convert the documents to a different format is limited to the tools available from official vendors, or tools that use a reverse-engineered understanding of the file format. Microsoft Word is the most common closed file format; the other formats covered in this chapter are open formats. A strongly biased discussion of open and closed file formats can be found at http://www.openformats.org. A partial list of open file formats can be found at http://directory.google.com/Top/Computers/Data_Formats/Open_Standards.

8.3.1. File Formats for Customers

The file format for most released documentation is either HTML or PDF, with Word sometimes being used as well. PostScript is sometimes still used for academic papers. Raw text is used for small documents, and its advantages and disadvantages are discussed in Section 8.4.1, later in this chapter. This section briefly describes the HTML, PostScript, and PDF file formats.

8.3.1.1. HTML

Various file formats exist that are derived from the ISO-standardized SGML, notably many of the formats whose names end in "ML" (which stands for "markup language"). HTML is the best-known one. It consists of text with added structure in the form of elements such as <h1> for a heading, <p> for a paragraph, and <a href="index.html"> for a hypertext link.
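To make the idea of "text with added structure" concrete, here is a minimal Python sketch that uses the standard html.parser module to list the elements in a short fragment. The class name ElementLister, the function list_elements, and the sample fragment are hypothetical; the fragment simply reuses the three elements mentioned above.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Collect the name and attributes of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs,
        # for example [("href", "index.html")].
        self.elements.append((tag, dict(attrs)))


def list_elements(html_text):
    """Return a list of (tag name, attribute dict) for each start tag."""
    parser = ElementLister()
    parser.feed(html_text)
    parser.close()
    return parser.elements


# A hypothetical fragment using the elements described in the text.
sample = '<h1>Manual</h1><p>See the <a href="index.html">index</a>.</p>'

if __name__ == "__main__":
    for name, attributes in list_elements(sample):
        print(name, attributes)
```

Run on the sample, this reports the h1, p, and a elements in document order, with the href attribute attached to the link.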

HTML files are, of course, what web browsers can display for online viewing. Some browsers support vendor-specific extensions to basic HTML, but thankfully this seems to be increasingly rare. More common problems nowadays are partial implementations of the HTML specification (http://www.w3.org/TR/html) or issues with other technologies such as JavaScript or Flash.

Drawbacks of HTML, or sometimes of the way it is used, include the following:

  • HTML is not designed for printing, so you may see text or images move around or trail off the edge of the paper.

  • Different web browsers treat the same HTML files differently, so manual testing using multiple browsers is still a necessary part of releasing HTML documentation. Common problems are graphics with some nearby text overlapping them, too much whitespace before and after paragraphs or whole pages, or even pages that just won't display at all in some browsers.

  • Books and manuals can be hard to read sequentially if you have to keep scrolling and then clicking links to go to the next paragraph. GNU manuals seem to be particularly prone to this, with a web page for each Info node.

  • Support for mathematical equations in HTML is still limited; MathML (http://www.w3.org/Math) is one effort to make this easier. The common solution of generating small images for each equation seems clumsy to me.

  • HTML comments can't be nested, which makes commenting out sections of HTML awkward. Also, they can't have a double dash in them, which makes adding ASCII art to them hard. To be fair, HTML was designed to be simple to parse, at the expense of such infrequent uses of comments.
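The comment restriction in the last item can be sketched as a small check. The following Python function is a hypothetical helper, not part of any library: it wraps text in an HTML comment only if the text contains no double dash. It applies the rule as stated above and does not try to cover every detail of the HTML specification.

```python
def comment_out(text):
    """Wrap text in an HTML comment, refusing text with a double dash.

    An existing comment inside the text is rejected as well, because
    both "<!--" and "-->" contain "--". This is why comments cannot
    be nested.
    """
    if "--" in text:
        raise ValueError("text contains a double dash; cannot comment it out safely")
    return "<!-- " + text + " -->"


if __name__ == "__main__":
    print(comment_out("<p>Old paragraph</p>"))
    try:
        comment_out("<p>Text</p> <!-- reviewer note -->")
    except ValueError as error:
        print("Rejected:", error)
```

The first call succeeds, while the second is rejected because the text already contains a comment.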

8.3.1.2. PostScript

PostScript is a language from Adobe for controlling printed output, the first to do so in a general way that worked well with printers from different manufacturers. Most printers still accept PostScript files directly, which was what made PostScript the default language for printed documents until PDF became more common toward the end of the 1990s.

The specification for the PostScript language is available at http://partners.adobe.com/public/developer/en/ps/PLRM.pdf, and a tutorial and cookbook are available at http://www-cdf.fnal.gov/offline/PostScript/BLUEBOOK.PDF.

8.3.1.3. PDF

PDF is derived from PostScript and was designed to avoid some of the problems of PostScript. It keeps PostScript's model for describing pages but is a document format rather than a full programming language. Images and fonts can be embedded directly into PDF files, and some of the processing that PostScript requires has already been done for PDF files. PDF also has better support for links, searching, and accessibility. The Acrobat Reader from Adobe is distributed at no cost for a large number of platforms, something that greatly encouraged the use of PDF after a slow start in the early 1990s. PDF is currently the most common format for distributing documents over the Internet that are intended for printing.

The specifications for PDF are available online at http://partners.adobe.com/public/developer/pdf/index_reference.html. There are tools from Adobe and other companies that can edit PDF directly, and there are also a few open source libraries for working with PDF files. More often, PDF is generated by an application, just as PostScript was generated. OpenOffice is one such application. One great place to start when looking for ideas for PDF tools is PDF Hacks, by Sid Steward (O'Reilly).



Practical Development Environments
ISBN: 0596007965
Year: 2004
Pages: 150
