Recipe 13.10. Cleaning Up Broken or Nonstandard HTML


13.10.1. Problem

You've got some HTML with malformed syntax that you'd like to clean up. Cleaning it up makes the markup easier to parse and helps ensure that the pages you produce are standards compliant.

13.10.2. Solution

Use PHP's Tidy extension. It relies on the popular, powerful HTML Tidy library to turn frightening piles of tag soup into well-formed, standards-compliant HTML or XHTML. Example 13-44 shows how to repair a file.

Example 13-44. Repairing an HTML file with Tidy

<?php
$fixed = tidy_repair_file('bad.html');
file_put_contents('good.html', $fixed);
?>

13.10.3. Discussion

The HTML Tidy library has accumulated a large number of rules and features that creatively handle a wide variety of HTML abominations. Fortunately, you don't need to know what all those rules are to reap the benefits of Tidy. Just pass a filename to tidy_repair_file() and you get back a cleaned-up version as a string. For example, if bad.html contains:

<img src="/books/3/131/1/html/2/monkey.jpg">
<b>I <em>love</b> monkeys</em>.

then Example 13-44 writes the following out to good.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<img src="/books/3/131/1/html/2/monkey.jpg">
<b>I <em>love</em> monkeys</b>.
</body>
</html>
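Example 13-44 assumes that the repair always succeeds. As a minimal sketch of a more defensive variant, the helper below checks for failure before writing anything. The function name repair_html_file() is hypothetical and not part of the Tidy extension; the sketch relies only on tidy_repair_file() returning false when it cannot produce output.

```php
<?php
// Hypothetical helper (not part of the Tidy extension): repair $source and
// write the cleaned-up markup to $dest. Returns false if any step fails.
function repair_html_file(string $source, string $dest): bool
{
    // Bail out early so Tidy is never asked to load a missing file.
    if (!is_readable($source)) {
        return false;
    }

    // tidy_repair_file() returns the repaired markup, or false on failure.
    $fixed = tidy_repair_file($source);
    if ($fixed === false) {
        return false;
    }

    // file_put_contents() returns false if the write fails.
    return file_put_contents($dest, $fixed) !== false;
}

if (!repair_html_file('bad.html', 'good.html')) {
    error_log('Could not repair bad.html');
}
```

Checking the return value this way keeps a failed repair from silently overwriting good.html with an empty file.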

Tidy has a large number of configuration options that affect the output it produces. You can read about them at http://tidy.sourceforge.net/docs/quickref.html. Pass configuration to tidy_repair_file() as a second argument: an array that maps configuration option names to values. Example 13-45 uses the output-xhtml option, which tells Tidy to produce valid XHTML.

Example 13-45. Production of XHTML with Tidy

<?php
$config = array('output-xhtml' => true);
$fixed = tidy_repair_file('bad.html', $config);
file_put_contents('good.xhtml', $fixed);
?>

Example 13-45 writes the following to good.xhtml:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<img src="/books/3/131/1/html/2/monkey.jpg" />
<b>I <em>love</em> monkeys</b>.
</body>
</html>
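The configuration array can hold more than one option. The sketch below combines output-xhtml with the indent and wrap options from the Tidy quick reference and passes an input encoding as the third argument. The helper name repair_as_xhtml() is hypothetical, and the particular option values are an illustrative choice rather than a recommendation.

```php
<?php
// Hypothetical helper: repair a file as indented, unwrapped XHTML.
// Returns the repaired markup as a string, or false on failure.
function repair_as_xhtml(string $source)
{
    $config = array(
        'output-xhtml' => true, // emit XHTML instead of HTML
        'indent'       => true, // indent nested block-level elements
        'wrap'         => 0,    // 0 disables line wrapping
    );

    if (!is_readable($source)) {
        return false;
    }

    // The third argument names the encoding Tidy should assume for the input.
    return tidy_repair_file($source, $config, 'utf8');
}

$fixed = repair_as_xhtml('bad.html');
if ($fixed !== false) {
    file_put_contents('good.xhtml', $fixed);
}
```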

If your source HTML is in a string instead of a file, use tidy_repair_string(). Its first argument is the HTML itself, not a filename.
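As a rough sketch of the string-based variant, the following repairs a fragment held in a variable. The show-body-only option is an assumption about what you want here: it asks Tidy to return just the contents of the body rather than a complete document, which tends to suit fragments destined for insertion into an existing page.

```php
<?php
// Malformed fragment: the <b> and <em> tags are closed in the wrong order.
$html = '<b>I <em>love</b> monkeys</em>.';

$config = array(
    'output-xhtml'   => true, // emit XHTML-style markup
    'show-body-only' => true, // omit the doctype, <html>, <head>, and <body> wrappers
);

// The first argument is the markup itself; the third is the input encoding.
$fixed = tidy_repair_string($html, $config, 'utf8');

if ($fixed === false) {
    error_log('Tidy could not repair the fragment');
} else {
    echo $fixed;
}
```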

13.10.4. See Also

Documentation on tidy_repair_file() at http://www.php.net/tidy_repair_file, on tidy_repair_string() at http://www.php.net/tidy_repair_string, and on Tidy configuration options at http://tidy.sourceforge.net/docs/quickref.html.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
