You've got some HTML with malformed syntax that you'd like to clean up. Cleaning it up makes it easier to parse and helps ensure that the pages you produce are standards compliant.
Use PHP's Tidy extension. It relies on the popular, powerful HTML Tidy library to turn frightening piles of tag soup into well-formed, standards-compliant HTML or XHTML. Example 13-44 shows how to repair a file.
Example 13-44. Repairing an HTML file with Tidy
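A minimal sketch of this kind of repair, assuming the Tidy extension is loaded and that a file named bad.html sits in the current directory. The helper name repair_html_file() is hypothetical and introduced only for illustration; tidy_repair_file() and file_put_contents() are the PHP functions doing the actual work.

```php
<?php
// Hypothetical helper (not part of the Tidy extension): repairs the HTML in
// $source and writes the cleaned-up markup to $target.
function repair_html_file(string $source, string $target): bool
{
    // tidy_repair_file() returns the repaired markup as a string,
    // or false if the file could not be loaded.
    $fixed = tidy_repair_file($source);
    if ($fixed === false) {
        return false;
    }

    // file_put_contents() returns the number of bytes written, or false.
    return file_put_contents($target, $fixed) !== false;
}

// Assumed filenames, matching the ones used in the text.
if (is_readable('bad.html')) {
    repair_html_file('bad.html', 'good.html');
}
```

Checking the return value before writing keeps a failed repair from overwriting good.html with an empty file.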
The HTML Tidy library has a large number of rules and features built up over time that creatively handle a wide variety of HTML abominations. Fortunately, you don't have to care about what all those rules are to reap the benefits of Tidy. Just pass a filename to tidy_repair_file() and you get back a cleaned-up version. For example, if bad.html contains:
<img src="/books/3/131/1/html/2/monkey.jpg"> <b>I <em>love</b> monkeys</em>.
then Example 13-44 writes the following out to good.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<img src="/books/3/131/1/html/2/monkey.jpg"> <b>I <em>love</em> monkeys</b>.
</body>
</html>
Tidy has a large number of configuration options that affect the output it produces. You can read about them at http://tidy.sourceforge.net/docs/quickref.html. Pass configuration to tidy_repair_file() by providing a second argument that is an array of configuration options and values. Example 13-45 uses the output-xhtml option, which tells Tidy to produce valid XHTML.
Example 13-45. Production of XHTML with Tidy
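A minimal sketch of passing configuration, assuming the Tidy extension is loaded and that bad.html is in the current directory. The helper name repair_file_as_xhtml() is hypothetical; the output-xhtml option and the array-valued second argument are the ones described in the text.

```php
<?php
// Hypothetical helper (not part of the Tidy extension): returns the contents
// of $source repaired as XHTML, or false if the file could not be loaded.
function repair_file_as_xhtml(string $source)
{
    // Tidy configuration: option names are the keys, settings the values.
    $opts = array('output-xhtml' => true);

    return tidy_repair_file($source, $opts);
}

// Assumed filenames, matching the ones used in the text.
if (is_readable('bad.html')) {
    $fixed = repair_file_as_xhtml('bad.html');
    if ($fixed !== false) {
        file_put_contents('good.xhtml', $fixed);
    }
}
```

Additional Tidy options can be added to the same array as further key/value pairs.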
Example 13-45 writes the following to good.xhtml:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<img src="/books/3/131/1/html/2/monkey.jpg" /> <b>I <em>love</em> monkeys</b>.
</body>
</html>
If your source HTML is in a string instead of a file, use tidy_repair_string(). It expects a first argument that contains HTML, not a filename.
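A minimal sketch of repairing a string, assuming the Tidy extension is loaded. The sample markup and the variable names are illustrative only.

```php
<?php
// Malformed sample markup: the <b> and <em> tags are closed in the wrong order.
$html = '<b>I <em>love</b> monkeys</em>.';

// Only call Tidy when the extension is actually available.
if (extension_loaded('tidy')) {
    // tidy_repair_string() takes the markup itself, not a filename, and
    // returns the repaired document as a string (or false on failure).
    $fixed = tidy_repair_string($html);
    if ($fixed !== false) {
        print $fixed;
    }
}
```

Like tidy_repair_file(), tidy_repair_string() accepts an optional second argument holding an array of configuration options.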
13.10.4. See Also
Documentation on tidy_repair_file() at http://www.php.net/tidy_repair_file, on tidy_repair_string() at http://www.php.net/tidy_repair_string, and on Tidy configuration options at http://tidy.sourceforge.net/docs/quickref.html.