8.3. File Formats for Documentation
This section describes some of the commonly used formats for documentation files and discusses the strengths and weaknesses of each file format. Some formats (such as XML) are more common as source formats, that is, the files where the raw content is added. Other formats (such as PostScript and PDF) are more often used as release formats, that is, the files that are available for customers to use. Only a few formats (such as raw text files) are used as both source and release file formats.
Some common requirements of a file format and the tools that support it are:
Text-based source file formats such as XML are considerably easier to modify from the command line with simple tools that already exist. Modifying files in binary formats almost always requires more effort; usually you have to convert them to a text-based format, make the changes, and then convert the files back to the binary format.
Another aspect to consider carefully when choosing a file format is whether the format is closed or open. If your documents are stored in a proprietary, closed format, then your ability to convert the documents to a different format is limited to the tools available from official vendors, or tools that use a reverse-engineered understanding of the file format. Microsoft Word is the most common closed file format; the other formats covered in this chapter are open formats. A strongly biased discussion of open and closed file formats can be found at http://www.openformats.org. A partial list of open file formats can be found at http://directory.google.com/Top/Computers/Data_Formats/Open_Standards.
8.3.1. File Formats for Customers
The file format for most released documentation is either HTML or PDF, with Word sometimes being used as well. PostScript is sometimes still used for academic papers. Raw text is used for small documents, and its advantages and disadvantages are discussed in Section 8.4.1, later in this chapter. This section briefly describes the HTML, PostScript, and PDF file formats.
8.3.1.1. HTML
Various file formats are derived from the ISO-standardized SGML, notably many of those whose names end in "ML" (which stands for "markup language"). HTML is the best-known one. It consists of text with added structure in the form of elements such as <h1> for a heading, <p> for a paragraph, and <a href="index.html"> for a hypertext link. HTML files are, of course, what web browsers can display for online viewing. Some browsers support vendor-specific extensions to basic HTML, but thankfully this seems to be increasingly rare.
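Because HTML is plain text with element markup, its structure can be inspected with very simple tools. The following is a minimal sketch in Python using the standard library's html.parser module; the class name ElementLister, the function list_elements, and the SAMPLE fragment are made up for illustration and do not come from any particular documentation toolchain.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Collects a (tag, attributes) pair for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples; store it as a dict.
        self.elements.append((tag, dict(attrs)))


# Hypothetical fragment using the three elements mentioned in the text.
SAMPLE = '<h1>Guide</h1><p>See the <a href="index.html">index</a>.</p>'


def list_elements(html_text):
    """Return the start tags found in html_text, in document order."""
    parser = ElementLister()
    parser.feed(html_text)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    for tag, attrs in list_elements(SAMPLE):
        print(tag, attrs)
```

A similar few lines would be much harder to write for a closed binary format, which is the practical point behind preferring text-based, open formats.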
Nowadays, more common problems than vendor-specific extensions are partial implementations of the HTML specification (http://www.w3.org/TR/html) or issues with other technologies such as JavaScript or Flash. Drawbacks of HTML, or sometimes of the way it is used, include the following:
8.3.1.2. PostScript
PostScript is a language from Adobe for controlling printed output, the first to do so in a general way that worked well with printers from different manufacturers. Most printers still accept PostScript files directly, which is what made PostScript the default language for printed documents until PDF became more common toward the end of the 1990s. The specification for the PostScript language is available at http://partners.adobe.com/public/developer/en/ps/PLRM.pdf, and a tutorial and cookbook are available at http://www-cdf.fnal.gov/offline/PostScript/BLUEBOOK.PDF.
8.3.1.3. PDF
PDF is derived from PostScript and was designed to avoid some of the problems of PostScript. Images and fonts can be embedded directly into PDF files, and some of the processing that PostScript requires has already been done for PDF files. PDF also has better support for links, searching, and accessibility. The Acrobat reader from Adobe is distributed at no cost for a large number of platforms, something that greatly encouraged the use of PDF after a slow start in the early 1990s. PDF is currently the most common format for distributing documents over the Internet that are intended for printing. The specifications for PDF are available online at http://partners.adobe.com/public/developer/pdf/index_reference.html.
There are tools from Adobe and other companies that can edit PDF directly, and there are also a few open source libraries for working with PDF files. More often, PDF is generated by an application, just as PostScript was generated. OpenOffice is one such application. One great place to start when looking for ideas for PDF tools is PDF Hacks, by Sid Steward (O'Reilly).
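To make the idea of an application generating PostScript more concrete, here is a minimal sketch in Python that builds the text of a one-page PostScript program. The function name make_postscript_page and its default position, font, and size are assumptions chosen for illustration; real applications emit far more elaborate output.

```python
def make_postscript_page(text, x=72, y=720, font="Helvetica", size=12):
    """Return a tiny PostScript program that prints `text` on one page.

    Coordinates are in points (72 per inch), measured from the
    bottom-left corner of the page.
    """
    # Backslashes and parentheses are special inside a PostScript
    # string literal, so escape them before embedding the text.
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    lines = [
        "%!PS",                                       # conventional first line
        f"/{font} findfont {size} scalefont setfont",  # select and scale a font
        f"{x} {y} moveto",                            # set the current point
        f"({escaped}) show",                          # paint the string
        "showpage",                                   # emit the finished page
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_postscript_page("Hello (world)"))
```

The result is ordinary text, so it can be saved to a file and handed to a PostScript printer or interpreter, or converted to PDF by a separate tool.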