Cleaning Up HTML Documents with tidy

If you ever have to develop HTML documents, whether for a personal Web site, a class project, or Web pages on the job, the tidy utility can be a handy resource. If you're creating HTML pages by hand, you'll likely make occasional errors. These errors probably won't cause significant problems with using the pages, but they might make the pages harder to read, harder to maintain, and harder to subject to the scrutiny of your peers. Not to worry; tidy can help!

tidy is not usually included with Linux or Unix distributions, but you can download it (and install it, using the instructions in Chapter 14) from http://tidy.sourceforge.net.

To clean up HTML documents with tidy:

1. vi sampledoc.html
Use the editor of your choice to create an HTML document. Our sample document is called, well, sampledoc.html (Figure 17.1). Don't worry about getting the tagging or syntax exactly right; tidy will take care of the details. Save and close your document.
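If you'd rather not type a test file into an editor, a here-document can create one. The sketch below is a hypothetical stand-in for the document in Figure 17.1, not a copy of it: the text is borrowed from the tidy output shown in this section, while the specific flaws (no html, head, or body tags, unclosed p and li elements, and a stray closing ul tag) are assumptions chosen for illustration.

```shell
# Create a deliberately sloppy HTML file for tidy to clean up.
# The flaws here are illustrative assumptions, not the exact
# contents of the book's sampledoc.html.
cat > sampledoc.html <<'EOF'
<title>Jdoe's Home Page</title>
<h1>Making Unix Work, One Day at a Time</h1>
<p>Read these tips, when I get around to writing them, and weep.
<ul>
<li>To be written
<li>To be written later
<li>To be written next week
</ul>
</ul>
<address>jdoe@example.com</address>
EOF
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the document, so the markup lands in the file exactly as typed.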
Figure 17.1. Even a flawed HTML document, like this one, can be fixed by tidy.

2. tidy sampledoc.html
The tidy utility will apply HTML formatting rules and then output a massaged version of your document that is technically correct (Code Listing 17.1). Cool, huh?
Code Listing 17.1. The tidy command is handy for cleaning up HTML documents.

    [jdoe@frazz public_html]$ tidy sampledoc.html
    Tidy (vers 4th August 2000) Parsing "sampledoc.html"
    line 10 column 6 - Warning: discarding unexpected </ul>
    sampledoc.html: Document content looks like HTML 2.0
    1 warnings/errors were found!
    <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html>
    <head>
    <meta name="generator" content="HTML Tidy, see www.w3.org">
    <title>Jdoe's Home Page</title>
    </head>
    <body>
    <h1>Making Unix Work, One Day at a Time</h1>
    <p>Read these tips, when I get around to writing them, and weep.</p>
    <ul>
    <li>To be written</li>
    <li>To be written later</li>
    <li>To be written next week</li>
    </ul>
    <address>jdoe@example.com</address>
    </body>
    </html>
    HTML & CSS specifications are available from http://www.w3.org/
    To learn more about Tidy see http://www.w3.org/People/Raggett/tidy/
    Please send bug reports to Dave Raggett care of <html-tidy@w3.org>
    Lobby your company to join W3C, see http://www.w3.org/Consortium
    [jdoe@frazz public_html]$

3. tidy sampledoc.html > fixedupdoc.html
If you like the results, redirect the cleaned-up document to a new filename, as shown here, or use tidy -m sampledoc.html to replace the original document.
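Because the -m flag overwrites the original, it can be worth keeping a copy first. The following is a minimal sketch of that idea, not part of tidy itself: the function name tidy_in_place and the .orig suffix are hypothetical choices made for this example. It also assumes that tidy signals warnings or errors through a nonzero exit status, as current HTML Tidy releases do; an older build may behave differently.

```shell
# Sketch: back up a file, then let tidy modify it in place.
# tidy_in_place and the .orig suffix are illustrative names only.
tidy_in_place() {
    file=$1
    status=0

    # Keep a copy of the original (preserving timestamps) before touching it.
    cp -p "$file" "$file.orig" || return 1

    # -m tells tidy to write its output back to the input file.
    tidy -m "$file" || status=$?

    # A nonzero status is assumed to mean tidy had something to report.
    if [ "$status" -ne 0 ]; then
        echo "tidy exit status $status for $file; original saved as $file.orig" >&2
    fi
    return "$status"
}
```

After defining the function, tidy_in_place sampledoc.html would leave the cleaned-up page in sampledoc.html and the untouched original in sampledoc.html.orig.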
Tips

For even spiffier results, we like using tidy -indent -quiet --doctype loose -modify sampledoc.html, which suppresses the informative messages from tidy, makes the output an HTML 4 document, tidily indents the output, and replaces the original with the modified file (Code Listing 17.2). All that, and only one command.

Consider using tidy with the sed script (described in the next section) to do a lot of cleanup at once.

Code Listing 17.2. The tidy command, with the appropriate flags, performs miracles (almost).

    [jdoe@frazz public_html]$ tidy -indent -quiet --doctype loose sampledoc.html
    line 10 column 6 - Warning: discarding unexpected </ul>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
      <head>
        <meta name="generator" content="HTML Tidy, see www.w3.org">
        <title>
          Jdoe's Home Page
        </title>
      </head>
      <body>
        <h1>
          Making Unix Work, One Day at a Time
        </h1>
        <p>
          Read these tips, when I get around to writing them, and weep.
        </p>
        <ul>
          <li>
            To be written
          </li>
          <li>
            To be written later
          </li>
          <li>
            To be written next week
          </li>
        </ul>
        <address>
          jdoe@example.com
        </address>
      </body>
    </html>
    HTML & CSS specifications are available from http://www.w3.org/
    To learn more about Tidy see http://www.w3.org/People/Raggett/tidy/
    Please send bug reports to Dave Raggett care of <html-tidy@w3.org>
    Lobby your company to join W3C, see http://www.w3.org/Consortium
    [jdoe@frazz public_html]$
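The one-command tip above can be stretched to cover every page in a directory with a short loop. This is only a sketch under stated assumptions: the function name tidy_all is hypothetical, the flags are the ones from the tip, and the loop treats a nonzero exit status from tidy as a reason to take a second look at the file. Because -modify overwrites each page, try it on copies first.

```shell
# Sketch: run the tip's tidy command over every .html file in a directory.
# tidy_all is an illustrative name; the directory defaults to the current one.
tidy_all() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -f "$f" ] || continue

        # Same flags as in the tip: indent, stay quiet, emit the
        # transitional (loose) doctype, and replace the file in place.
        tidy -indent -quiet --doctype loose -modify "$f" ||
            echo "check $f: tidy exit status $?" >&2
    done
}
```

Calling tidy_all public_html would process each .html file in that directory and print a note on standard error for any file tidy complained about.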