Recipe 9.6. Converting Source Documents to Web Pages


Problem

You need to produce clean, validly coded web pages out of documents created in a word processing or page layout program (without spending hours doing it).

Solution

Address the problem at both ends: when the documents are created and when you convert them.

  • Get involved in the creation of documents destined for your web site.

  • Explain what you need to the content creators.

  • Set up templates that generate consistent, web-friendly documents.

  • Use HTML processing utilities such as Dreamweaver, HTML Tidy, and/or Word Cleaner to develop methods for converting documents to web pages.

  • Automate the routines wherever possible.

  • Take a long lunch and go home early.

Discussion

The "Save as HTML" functions of many desktop applications, particularly the widely used Microsoft Office programs, are infamous for the bloated, non-standard code they generate. As a web site builder, you will almost certainly find yourself on the receiving end of these files as part of your site building or regular maintenance duties. These two strategies, especially when used together, can help make your job easier:

  1. Optimize the creation of the source documents to make conversion as smooth as possible. Then employ one or more conversion utilities to generate web page code that meets the same standards as pages you create yourself. Although it's certainly easier said (and written) than done, your best strategy for improving the quality of the source documents you receive for your web site is to contribute to their creation by defining your requirements for the creators. This may simply be a matter of taking a list of your most aggravating conversion problems to the document creator(s) to find common ground in an attempt to mitigate, if not solve, them.

  2. Take matters more firmly into your own hands: create (or modify) source document templates that are web-friendly, and train the creators of the source documents on how to use them. For example, Microsoft Word's default heading styles (Heading 1, Heading 2, etc.) map to standard HTML heading elements (<h1>, <h2>, etc.), so encourage their use over custom heading styles. Trying to change habits and enforce discipline may not make you many new friends, but you may find that you prefer some mild animosity from co-workers at times to the hours of mind-numbing tedium required to fix the offending code all by yourself.
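The heading-style mapping mentioned in the second strategy can be sketched in a few lines of Python. This is only an illustration, not part of any tool named in this recipe: it assumes that some hypothetical export step has already reduced the source document to (style name, text) pairs, and the function names are invented for the example.

```python
import re
from html import escape

# Assumption: Word's default heading styles are named "Heading 1" .. "Heading 6".
HEADING_STYLE = re.compile(r"^Heading ([1-6])$")


def paragraph_to_html(style_name, text):
    """Map a default heading style to <h1>..<h6>; anything else becomes <p>."""
    match = HEADING_STYLE.match(style_name)
    tag = f"h{match.group(1)}" if match else "p"
    return f"<{tag}>{escape(text)}</{tag}>"


def custom_heading_styles(style_names):
    """List style names that look like headings but are not Word defaults."""
    return sorted({
        name for name in style_names
        if "head" in name.lower() and not HEADING_STYLE.match(name)
    })
```

A report from something like custom_heading_styles() gives you a concrete list to bring to the document creators: every style it names is one that will not convert to a heading element without extra work.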

Some of the tedium can also be handed off to document conversion utilities or filters (see the list referenced in the See Also section of this Recipe). Dreamweaver offers a built-in "Clean Up Word HTML" function that removes most (but not all) of Word's wonky web page code. Other widely used applications, such as HTML Tidy and Word Cleaner, offer the same capabilities, as well as more configuration options and the ability (unlike Dreamweaver) to batch-process multiple files. I took all three for a test drive on a Word-generated web page and found that none did exactly what I wanted with the file out-of-the-box. With a little trial and error, though, I was able to improve the output to my satisfaction, and you should be able to do the same with minimal work.

The specific steps are not laid out here, since Word will often turn even two similar documents into wildly different HTML pages. You'll have to perform slightly different steps for each converted document.
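To give a feel for the kind of cleanup involved, here is a rough Python sketch that strips a few common pieces of Word-specific markup. It is not how Dreamweaver, HTML Tidy, or Word Cleaner work internally. The patterns are assumptions based on markup Word typically emits (conditional comments, Office-namespace tags such as <o:p>, Mso class names, and inline styles), and regular expressions are a fragile way to edit HTML, so treat it as a starting point to adapt for each document.

```python
import re


def strip_word_markup(html):
    """Remove some typical Word-generated clutter from an HTML string."""
    # Conditional comments: <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Office namespace tags such as <o:p>, </o:p>, <w:...>, <v:...>
    html = re.sub(r"</?[ovwm]:[^>]*>", "", html, flags=re.IGNORECASE)
    # Word class names, quoted or not: class=MsoNormal, class="MsoTitle"
    html = re.sub(r"""\s+class=(["']?)Mso\w+\1""", "", html,
                  flags=re.IGNORECASE)
    # Inline style attributes, which may contain the other kind of quote
    html = re.sub(r"""\s+style=(["']).*?\1""", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    return html
```

Run it on a copy of the file first and compare the result in a browser; anything the patterns miss can be added as another substitution.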


Given a set of documents that follow the web-friendly formatting rules you've established, these utilities can automate an otherwise arduous task.
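A batch run of this kind might look like the following Python sketch, which calls the HTML Tidy command-line program on every .html file in a folder. It assumes a tidy executable is installed and on your path, and the option set shown (word-2000, bare, clean, XHTML output, edit in place) is only one plausible configuration; check tidy -help for the options your version supports. The function names and the runner parameter are invented for the example.

```python
import subprocess
from pathlib import Path


def build_tidy_command(path):
    """Build the argument list for one in-place HTML Tidy run."""
    return [
        "tidy",
        "-quiet",             # suppress non-essential output
        "-modify",            # rewrite the file in place
        "-asxhtml",           # emit XHTML
        "--word-2000", "yes", # strip Word 2000 surplus markup
        "--bare", "yes",      # replace smart quotes and similar characters
        "--clean", "yes",     # replace presentational markup with style rules
        str(path),
    ]


def tidy_folder(folder, runner=subprocess.run):
    """Run Tidy on each .html file in folder; return {file name: exit code}."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        completed = runner(build_tidy_command(path))
        results[path.name] = completed.returncode
    return results
```

Because -modify overwrites the originals, run it on a copy of the converted files. Tidy exits with 0 when it finds nothing to report, 1 when it issued warnings, and 2 when it hit errors, so the returned dictionary shows which files need a closer look.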

See Also

Recipe 4.8 discusses other code manipulation utilities and a way to use Perl to remove unwanted code fragments from one file or a batch.

Two popular and customizable HTML clean-up utilities are HTML Tidy (http://tidy.sourceforge.net) and Word Cleaner (http://www.zapadoo.com/wordcleaner). Although HTML Tidy is a command-line utility, it is also built into the Windows HTML editor HomeSite (http://www.macromedia.com/software/homesite) and is available as a plug-in for BBEdit for Mac (http://www.barebones.com). The W3C maintains a comprehensive list of filters that will convert files from a variety of applications to HTML (http://www.w3.org/Tools/Word_proc_filters.html). Microsoft's HTML Filter (http://office.microsoft.com/downloads/2000/msohtmf2.aspx) extends the capabilities of Word's built-in "Save as web page" function.



Web Site Cookbook: Solutions & Examples for Building and Administering Your Web Site (Cookbooks (OReilly))
ISBN: 0596101090
Pages: 144
Authors: Doug Addison
