Hack 57 Create a Traditional Index Section from Keywords | PDF Hacks: 100 Industrial-Strength Tips & Tools


Add a search feature to your print edition.

Creating a good document Index section is a difficult job performed by professionals. However, an automatically generated index can still be very helpful. Use automatic keywords [Hack #19] or select your own keywords. This hack will locate their pages, build a reference, and then create PDF pages that you can append to your document, as shown in Figure 5-5. It even uses your PDF's page labels (also known as logical page numbering) to ensure trouble-free lookup.

Figure 5-5. Turning document keywords into a PDF Index section

5.8.1 Tool Up

Download and install pdftotext [Hack #19], our kw_index [Hack #19], and pdftk [Hack #79]. You must also have enscript (Windows users visit http://gnuwin32.sf.net/packages/enscript.htm) and ps2pdf. ps2pdf comes with Ghostscript [Hack #39]. Our kw_index package includes the kw_catcher and page_refs programs (and source code) that we use in the following sections.
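Before starting, it can help to confirm that each of these tools is reachable from your shell. The following is only a sketch and is not part of the kw_index package; the function name missing_tools is made up for illustration.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper, not part of kw_index.
# It prints the name of each listed command that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The tools this hack relies on.
missing=$(missing_tools pdftk pdftotext kw_catcher page_refs enscript ps2pdf)
if [ -n "$missing" ]; then
  # $missing is left unquoted so the names print on a single line.
  echo "Not found on PATH:" $missing
else
  echo "All tools found."
fi
```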

5.8.2 The Procedure

First, set your PDF's logical page numbering [Hack #62] to match your document's page numbering. Then, use pdftk to dump this information into a text file, like so:

  pdftk mydoc.pdf dump_data output mydoc.data.txt
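A quick look at the dump can show whether any page labels were recorded before you go further. This sketch assumes that pdftk writes page-label records as lines beginning with PageLabel; the helper name has_page_labels is hypothetical.

```shell
#!/bin/sh
# has_page_labels is a hypothetical helper, not part of pdftk or kw_index.
# Assumption: page-label records in the dump start with "PageLabel".
# Returns 0 if at least one such line exists in the given file.
has_page_labels() {
  grep -q '^PageLabel' "$1"
}

# Example:
#   has_page_labels mydoc.data.txt || echo "No page labels found; see [Hack #62]"
```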

Next, convert your PDF to plain text with pdftotext:

  pdftotext mydoc.pdf mydoc.txt

Create a keyword list [Hack #19] from mydoc.txt using kw_catcher, like so:

  kw_catcher 12 keywords_only mydoc.txt > mydoc.kw.txt

Edit mydoc.kw.txt to remove duds and add missing keywords. At present, only one keyword is allowed per line. If two or more keywords are adjacent in mydoc.txt, our page_refs program will assemble them into phrases.
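The mechanical part of this cleanup can be scripted. The sketch below is an illustration rather than part of kw_index: it trims surrounding whitespace, drops blank lines, and removes repeated keywords while keeping the original order. The function name clean_keywords and the output filename are assumptions, and spotting duds still takes a human eye.

```shell
#!/bin/sh
# clean_keywords is a hypothetical helper, not part of kw_index.
# It reads a keyword file (one keyword per line) and prints a tidied copy:
#   - leading and trailing spaces or tabs are removed
#   - empty lines are skipped
#   - only the first occurrence of each keyword is kept
clean_keywords() {
  awk '{ gsub(/^[ \t]+|[ \t]+$/, "") } length($0) > 0 && !seen[$0]++' "$1"
}

# Example:
#   clean_keywords mydoc.kw.txt > mydoc.kw.clean.txt
```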

Now pull all these together to create a text index using page_refs:

  page_refs mydoc.txt mydoc.kw.txt mydoc.data.txt > mydoc.index.txt

Finally, create a PDF from mydoc.index.txt using enscript and ps2pdf:

  enscript --columns 2 --font 'Times-Roman@10' \
    --header 'INDEX' --header-font 'Times-Bold@14' \
    --margins 54:54:36:54 --word-wrap --output - mydoc.index.txt \
  | ps2pdf - mydoc.index.pdf

5.8.3 The Code

Of course, the thing to do is to wrap this procedure into a tidy script. Copy the following Bourne shell script into a file named make_index.sh, and make it executable by applying chmod 700. Windows users can get a Bourne shell by installing MSYS [Hack #97].

  #!/bin/sh
  # make_index.sh, version 1.0
  # usage: make_index.sh <PDF filename> <page window>
  # requires: pdftk, kw_catcher, page_refs,
  #           pdftotext, enscript, ps2pdf
  #
  # by Ross Presser, Imtek.com
  # adapted by Sid Steward
  # http://www.pdfhacks.com/kw_index/

  # $1 is the PDF filename; strip its directory and .pdf extension
  fname=`basename "$1" .pdf`

  # dump page labels, extract the text, then pipe the keywords ($2 is the
  # kw_catcher window) through page_refs, enscript, and ps2pdf
  pdftk ${fname}.pdf dump_data output ${fname}.data.txt && \
  pdftotext ${fname}.pdf ${fname}.txt && \
  kw_catcher "$2" keywords_only ${fname}.txt \
  | page_refs ${fname}.txt - ${fname}.data.txt \
  | enscript --columns 2 --font 'Times-Roman@10' \
    --header 'INDEX' --header-font 'Times-Bold@14' \
    --margins 54:54:36:54 --word-wrap --output - \
  | ps2pdf - ${fname}.index.pdf
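The script does not check its arguments. If you want it to fail early with a readable message, a guard along the following lines could go ahead of the fname assignment. This is a sketch, not part of the original script: check_args is a hypothetical name, and the whole-number test assumes the window is always an integer such as 12.

```shell
#!/bin/sh
# check_args is a hypothetical helper, not part of the original make_index.sh.
# It returns 0 when given exactly two arguments: a readable file and a
# window made up only of digits. Otherwise it prints a message and returns 1.
check_args() {
  if [ $# -ne 2 ]; then
    echo "usage: make_index.sh <PDF filename> <page window>" >&2
    return 1
  fi
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 1
  fi
  case "$2" in
    ''|*[!0-9]*)
      echo "page window must be a whole number: $2" >&2
      return 1
      ;;
  esac
  return 0
}

# Inside make_index.sh this would be called as:
#   check_args "$@" || exit 1
```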

5.8.4 Running the Hack

Pass the name of your PDF document and the kw_catcher window size to make_index.sh like so:

  make_index.sh mydoc.pdf 12

The script will create a document index named mydoc.index.pdf. Review this index and append it to your PDF document [Hack #51] if you desire. The script also creates two intermediate files: mydoc.data.txt and mydoc.txt. If the PDF index is faulty, review these intermediate files for problems. Delete them when you are satisfied with the PDF index.
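Once the index looks right, these last two chores can be scripted as well. The sketch below shows one way to append the pages with pdftk's cat operation and then remove the intermediate files. Both helper names and the output filename with the .with_index.pdf suffix are assumptions made for illustration, and [Hack #51] may describe a different approach.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of kw_index. Each takes the document
# name without its .pdf extension, for example: mydoc

# Join the document and its index into a new PDF (output name is an assumption).
append_index() {
  pdftk "$1.pdf" "$1.index.pdf" cat output "$1.with_index.pdf"
}

# Remove the two intermediate files that make_index.sh leaves behind.
remove_intermediates() {
  rm -f "$1.data.txt" "$1.txt"
}

# Example, once the index has been reviewed:
#   append_index mydoc && remove_intermediates mydoc
```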

The second argument to make_index.sh controls the keyword detection sensitivity. Smaller numbers yield fewer keywords at the risk of omitting some keywords; larger numbers admit more keywords and also more noise. [Hack #19] discusses this parameter and the kw_catcher program that uses it.
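To get a feel for this parameter on your own document, you could count how many keywords each window size produces before building the full index. The following is a sketch under two assumptions: that kw_catcher in keywords_only mode prints one keyword per line, and that the text file already exists. The name compare_windows is hypothetical.

```shell
#!/bin/sh
# compare_windows is a hypothetical helper, not part of kw_index.
# Usage: compare_windows <text file> <window> [<window> ...]
# For each window size it prints the size followed by the number of
# lines (assumed to be keywords) that kw_catcher reports.
compare_windows() {
  text_file=$1
  shift
  for window in "$@"; do
    # tr removes the padding some wc implementations add
    count=$(kw_catcher "$window" keywords_only "$text_file" | wc -l | tr -d ' ')
    echo "$window $count"
  done
}

# Example:
#   compare_windows mydoc.txt 8 12 16
```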
