Add a search feature to your print edition.

Creating a good document Index section is a difficult job performed by professionals. However, an automatically generated index can still be very helpful. Use automatic keywords [Hack #19] or select your own keywords. This hack will locate their pages, build a reference, and then create PDF pages that you can append to your document, as shown in Figure 5-5. It even uses your PDF's page labels (also known as logical page numbering) to ensure trouble-free lookup.

Figure 5-5. Turning document keywords into a PDF Index section

5.8.1 Tool Up

Download and install pdftotext [Hack #19], our kw_index [Hack #19], and pdftk [Hack #79]. You must also have enscript (Windows users visit http://gnuwin32.sf.net/packages/enscript.htm) and ps2pdf. ps2pdf comes with Ghostscript [Hack #39]. Our kw_index package includes the kw_catcher and page_refs programs (and source code) that we use in the following sections.

5.8.2 The Procedure

First, set your PDF's logical page numbering [Hack #62] to match your document's page numbering. Then, use pdftk to dump this information into a text file, like so:

    pdftk mydoc.pdf dump_data output mydoc.data.txt

Next, convert your PDF to plain text with pdftotext:

    pdftotext mydoc.pdf mydoc.txt

Create a keyword list [Hack #19] from mydoc.txt using kw_catcher, like so:

    kw_catcher 12 keywords_only mydoc.txt > mydoc.kw.txt

Edit mydoc.kw.txt to remove duds and add missing keywords. At present, only one keyword is allowed per line. If two or more keywords are adjacent in mydoc.txt, our page_refs program will assemble them into phrases.
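The manual cleanup of the keyword list can also be scripted when the same duds keep turning up. The following is a minimal sketch, not part of the kw_index package: clean_keywords is a hypothetical helper, and the duds and extras files are assumed to be plain-text lists you maintain yourself, one keyword per line. It drops blank lines, duds, duplicates, and any multi-word entries (since only one keyword is allowed per line), and it preserves the original order.

```shell
#!/bin/sh
# clean_keywords: hypothetical helper, not shipped with kw_index.
# usage: clean_keywords <keyword file> <duds file> <extras file>
# Writes the cleaned keyword list to standard output.
clean_keywords() {
    kw_file=$1
    duds_file=$2
    extras_file=$3

    # Extras are appended after the detected keywords, then both are filtered.
    cat "$kw_file" "$extras_file" |
    awk -v duds="$duds_file" '
        # Load the duds list into a lookup table.
        BEGIN { while ((getline line < duds) > 0) skip[line] = 1 }
        {
            sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")   # trim whitespace
            if ($0 == "") next                        # drop blank lines
            if (NF > 1) {                             # one keyword per line only
                print "skipping multi-word entry: " $0 > "/dev/stderr"
                next
            }
            if ($0 in skip || seen[$0]++) next        # drop duds and duplicates
            print
        }'
}
```

A typical call, assuming hypothetical files named duds.txt and extras.txt, would be: clean_keywords mydoc.kw.txt duds.txt extras.txt > mydoc.kw.clean.txt. Review the result by hand before passing it to page_refs.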
Now pull all these together to create a text index using page_refs:

    page_refs mydoc.txt mydoc.kw.txt mydoc.data.txt > mydoc.index.txt

Finally, create a PDF from mydoc.index.txt by piping the output of enscript into ps2pdf:

    enscript --columns 2 --font 'Times-Roman@10' \
      --header 'INDEX' --header-font 'Times-Bold@14' \
      --margins 54:54:36:54 --word-wrap --output - mydoc.index.txt \
    | ps2pdf - mydoc.index.pdf

5.8.3 The Code

Of course, the thing to do is to wrap this procedure into a tidy script. Copy the following Bourne shell script into a file named make_index.sh, and make it executable by applying chmod 700. Windows users can get a Bourne shell by installing MSYS [Hack #97].

    #!/bin/sh
    # make_index.sh, version 1.0
    # usage: make_index.sh <PDF filename> <page window>
    # requires: pdftk, kw_catcher, page_refs,
    # pdftotext, enscript, ps2pdf
    #
    # by Ross Presser, Imtek.com
    # adapted by Sid Steward
    # http://www.pdfhacks.com/kw_index/

    # $1 is the PDF filename; $2 is the kw_catcher window size
    fname=`basename "$1" .pdf`
    pdftk "${fname}.pdf" dump_data output "${fname}.data.txt" && \
    pdftotext "${fname}.pdf" "${fname}.txt" && \
    kw_catcher "$2" keywords_only "${fname}.txt" \
    | page_refs "${fname}.txt" - "${fname}.data.txt" \
    | enscript --columns 2 --font 'Times-Roman@10' \
      --header 'INDEX' --header-font 'Times-Bold@14' \
      --margins 54:54:36:54 --word-wrap --output - \
    | ps2pdf - "${fname}.index.pdf"

Because basename strips any leading directory, run the script from the directory that contains the PDF.

5.8.4 Running the Hack

Pass the name of your PDF document and the kw_catcher window size to make_index.sh like so:

    make_index.sh mydoc.pdf 12

The script will create a document index named mydoc.index.pdf. Review this index and append it to your PDF document [Hack #51] if you desire.

The script also creates two intermediate files: mydoc.data.txt and mydoc.txt. If the PDF index is faulty, review these intermediate files for problems. Delete them when you are satisfied with the PDF index.

The second argument to make_index.sh controls the keyword detection sensitivity.
Smaller numbers yield fewer keywords at the risk of omitting some keywords; larger numbers admit more keywords and also more noise. [Hack #19] discusses this parameter and the kw_catcher program that uses it.
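One way to choose a window size is to compare how many keywords several candidate values produce before building the index. The following is a minimal sketch, not part of the kw_index package: count_keywords_by_window is a hypothetical helper name, and it assumes that kw_catcher's keywords_only report prints one keyword per line, as the one-keyword-per-line file format suggests.

```shell
#!/bin/sh
# count_keywords_by_window: hypothetical helper, not shipped with kw_index.
# usage: count_keywords_by_window <text file> <window size>...
# Assumes "kw_catcher <window> keywords_only <file>" prints one keyword per line.
count_keywords_by_window() {
    txt=$1
    shift
    for w in "$@"; do
        # Count the lines of the keyword report for this window size.
        n=$(kw_catcher "$w" keywords_only "$txt" | wc -l)
        # $((n)) normalizes any padding that wc adds on some systems.
        echo "window $w: $((n)) keywords"
    done
}
```

For example, count_keywords_by_window mydoc.txt 8 10 12 14 prints one count per window size. The counts only show how the list grows; you still need to read the keywords themselves to judge where the noise begins.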