Problem
You need to produce clean, validly coded web pages from documents created in a word processing or page layout program (without spending hours doing it).
Solution
Address the problem at both ends: at the starting point, when documents are created, and at the point where you make the conversion.
Discussion
The "Save as HTML" functions of many desktop applications (particularly the widely used Microsoft Office programs) are infamous for the bloated, non-standard code they generate. As a web site builder, you will almost certainly find yourself on the receiving end of these files as part of your site building or regular maintenance duties. These two strategies, especially when used together, can help make your job easier:
Some of the tedium can also be handed off to document conversion utilities or filters (see the list referenced in the See Also section of this Recipe). Dreamweaver offers a built-in "Clean Up Word HTML" function that removes most (but not all) of Word's wonky web page code. Other widely used applications, such as HTML Tidy and Word Cleaner, offer the same capabilities, as well as more configuration options and the ability (unlike Dreamweaver) to batch-process multiple files. I took all three for a test drive on a Word-generated web page and found that none did exactly what I wanted with the file out of the box. With a little trial and error, though, I was able to improve the output to my satisfaction; you should be able to do the same with minimal work.
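To make the kind of cleanup these utilities perform more concrete, here is a minimal Python sketch that strips a few common Word-specific constructs from a page and applies the same treatment to a folder of files. It is not how Dreamweaver, HTML Tidy, or Word Cleaner work internally. The regular expressions are assumptions about typical Word output, the function names clean_word_html and clean_directory are hypothetical, and the cp1252 default encoding is a guess you may need to change for your own files.

```python
import re
from pathlib import Path

# Each pattern targets markup that Word's "Save as HTML" output commonly
# contains. The list is an illustrative assumption, not an exhaustive one.
WORD_PATTERNS = [
    # Conditional comments wrapping Office-only XML:
    # <!--[if gte mso 9]> ... <![endif]-->
    re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE),
    # Leftover markers such as <![if !supportLists]> and <![endif]>;
    # only the markers are removed, the text between them is kept.
    re.compile(r"<!\[(?:if[^\]]*|endif)\]>", re.IGNORECASE),
    # Namespaced Office tags such as <o:p> and </o:p>.
    re.compile(r"</?[a-z][a-z0-9]*:[^>]*>", re.IGNORECASE),
    # class="MsoNormal" and similar Word class names.
    re.compile(r"""\s+class\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso\w*)""",
               re.IGNORECASE),
    # Any style attribute containing an mso- declaration. This drops the
    # whole attribute, including non-Word declarations inside it.
    re.compile(r"""\s+style\s*=\s*(?:"[^"]*mso-[^"]*"|'[^']*mso-[^']*')""",
               re.IGNORECASE),
]


def clean_word_html(html):
    """Return html with the Word-specific constructs above removed."""
    for pattern in WORD_PATTERNS:
        html = pattern.sub("", html)
    return html


def clean_directory(src_dir, dst_dir, encoding="cp1252"):
    """Clean every .htm/.html file in src_dir and write copies to dst_dir.

    The originals are left untouched. The encoding default is an
    assumption about Word's output on Windows.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    cleaned_files = []
    for path in sorted(src.glob("*.htm*")):
        html = path.read_text(encoding=encoding)
        out_path = dst / path.name
        out_path.write_text(clean_word_html(html), encoding=encoding)
        cleaned_files.append(out_path)
    return cleaned_files
```

Calling clean_directory("word_exports", "cleaned") (both directory names are placeholders) would write cleaned copies of every exported page into a separate folder. Regular expressions are a blunt tool for HTML, so treat a script like this as a first pass and check the results, much as you would with the output of any of the utilities above.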
Given a set of documents that follow the web-friendly formatting rules you've established, these utilities can automate an otherwise arduous task.
See Also
Recipe 4.8 discusses other code manipulation utilities and a way to use Perl to remove unwanted code fragments from one file or a batch. Two popular and customizable HTML clean-up utilities are HTML Tidy (http://tidy.sourceforge.net) and Word Cleaner (http://www.zapadoo.com/wordcleaner). Although HTML Tidy is a command-line utility, it is also built into the Windows HTML editor HomeSite (http://www.macromedia.com/software/homesite) and is available as a plug-in for BBEdit for Mac (http://www.barebones.com). The W3C maintains a comprehensive list of filters that will convert files from a variety of applications to HTML (http://www.w3.org/Tools/Word_proc_filters.html). Microsoft's HTML Filter (http://office.microsoft.com/downloads/2000/msohtmf2.aspx) extends the capabilities of Word's built-in "Save as web page" function.