Dealing with Broken HTML


Like it or not, someone is going to send you broken HTML at some point, and unless you have decided to strip all HTML tags from your feeds, it is going to adversely affect your site. Luckily, PHP and HTML Tidy make a great pair and make dealing with broken HTML a breeze.

There are two versions of the Tidy extension: 1.0 and 2.0. Version 1.0 is used with the 4.3.x tree of PHP, and the 2.0 release is used with the 5.x tree. You can check whether Tidy is installed with your version of PHP by calling the phpinfo() function. If it is present, the output will include a "tidy" section.
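
As an alternative to reading through the phpinfo() output, the check can be made from code. The following is a minimal sketch; tidyAvailable() is a hypothetical helper name, not something the Tidy extension provides.

```php
<?php
// Sketch: report whether the Tidy extension is available to this PHP build.
// tidyAvailable() is a hypothetical helper name, not part of PHP itself.
function tidyAvailable()
{
    return extension_loaded('tidy') && function_exists('tidy_parse_string');
}

if (tidyAvailable()) {
    // phpversion() with an extension name returns that extension's version string.
    echo 'Tidy is available, extension version ', phpversion('tidy'), "\n";
} else {
    echo "Tidy is not available in this PHP build.\n";
}
```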

Installing Tidy

Assuming you don't already have it, installing Tidy under PHP4 should be pretty easy. If your system includes PEAR, you can install the PECL package with one command (from a suitably privileged account):

 pear -v install tidy 

Failing that, you will need to download the package directly from the PECL repository at http://pecl.php.net/package/tidy.

Tidy support is built into PHP5. It just needs to be enabled at compile time or at runtime, depending on the host operating system. On Windows, getting Tidy to run just involves uncommenting the following line in your php.ini file:

 extension=php_tidy.dll 

Then restart your web server for the changes to take effect. You can confirm that Tidy is present by checking the output of phpinfo().

Installing Tidy on a Linux system will require that you (or your host) recompile PHP to include Tidy. This can be done with the --with-tidy configure option. Don't just type ./configure --with-tidy to get it to work, because chances are that several other configure options are already in use, and doing this will lose them. The phpinfo() function displays your current configure options; use these as a base and add --with-tidy to them.

If tidylib is not installed on the machine in question (you will know because configure returns an error telling you so), you will need to download and install it. You can get tidylib from http://tidy.sourceforge.net/. Grab the source package, not the compiled binary (it won't have the libraries PHP needs). Build from the source package as you normally would, then reconfigure and install PHP. Finally, restart your web server for the changes to take effect.

Cleaning Broken HTML

Take the following sample markup:

 <html>
 <head>
 <title>This is a horrible page</title>
 <body>
 <h1>This is a broken snippet
 <p>Notice the poor use of tags, leaving tags open, links left <a href="open.html">open
 <p>All in all, this is a horrible piece of <b>code!
 </html>

Although it is unlikely that anyone will ever provide you with a piece of code quite that bad, you need to be prepared for tags to be left open at the end of a feed. Viewing that markup in a browser yields a broken page, as shown in Figure 3-2.

Figure 3-2

Giving it a quick run through HTML Tidy results in the following code:

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
 <html>
   <head>
     <title>
       This is a horrible page
     </title>
   </head>
   <body>
     <h1>
       This is a broken snippet
     </h1>
     <p>
       Notice the poor use of tags, leaving tags open, links left <a href="open.html">open</a>
     </p>
     <p>
       All in all, this is a horrible piece of <b>code!</b>
     </p>
   </body>
 </html>

The code involved was as follows:

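 // Load the broken page into a string
 $brokenHTML = file_get_contents('./broken.html');
 // Tidy configuration options (described below)
 $config = array('indent' => TRUE,
                 'output-html' => TRUE,
                 'wrap' => 200,
                 'clean' => TRUE);
 // Parse the string with the given options, treating it as UTF-8
 $tidy = tidy_parse_string($brokenHTML, $config, 'UTF8');
 // Repair the parsed document, then print the cleaned markup
 tidy_clean_repair($tidy);
 echo tidy_get_output($tidy);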

The broken HTML page is loaded into a variable, and then a few configuration options are set (these are covered in greater detail in a moment). The broken HTML is given to Tidy to be parsed, along with the configuration options and the character encoding of the document. Finally, Tidy is asked to clean and repair the document and output the result. These few simple steps can save major headaches down the road, when your site design would otherwise be thrown out the window by a few unclosed tags floating around your displayed feeds (see Figure 3-3).

Figure 3-3
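
The load, parse, repair, and output steps described above could be wrapped in a single function so that every feed goes through the same path. This is only a sketch: repairHtml() is a hypothetical name, and returning false when Tidy is missing or parsing fails is an assumption about how a caller would want to degrade.

```php
<?php
// Sketch only: repairHtml() is a hypothetical wrapper, not part of the Tidy extension.
function repairHtml($html, array $config = array())
{
    if (!extension_loaded('tidy')) {
        return false; // Tidy is not installed; the caller decides how to degrade.
    }
    $defaults = array('indent' => true, 'output-html' => true, 'wrap' => 200);
    // Array union: caller-supplied options take precedence over the defaults.
    $tidy = tidy_parse_string($html, $config + $defaults, 'utf8');
    if ($tidy === false) {
        return false;
    }
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

$fixed = repairHtml('<p>An unclosed <b>tag');
echo $fixed === false ? "Could not repair the markup\n" : $fixed;
```

Passing the options as a second argument keeps the defaults in one place while still letting an individual feed override them.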

The available configuration options are quite extensive; one of particular interest when dealing with feeds is show-body-only. Using that option against the earlier example would yield the following:

 <h1>
   This is a broken snippet
 </h1>
 <p>
   Notice the poor use of tags, leaving tags open, links left <a href="open.html">open</a>
 </p>
 <p>
   All in all, this is a horrible piece of <b>code!</b>
 </p>
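
A minimal sketch of how several feed snippets might be cleaned with show-body-only and embedded in one page follows. The cleanFragment() helper, the sample items, and the feed-item class name are all hypothetical.

```php
<?php
// Sketch only: cleanFragment() and the sample items are hypothetical.
function cleanFragment($html)
{
    if (!extension_loaded('tidy')) {
        return false; // Tidy is missing; nothing can be repaired.
    }
    $config = array(
        'show-body-only' => true, // emit only the contents of <body>
        'output-html'    => true,
        'wrap'           => 200,
    );
    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        return false;
    }
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

$items = array(
    '<p>First item with an <i>open tag',
    '<h2>Second item <a href="more.html">read more',
);
foreach ($items as $item) {
    $fragment = cleanFragment($item);
    if ($fragment !== false) {
        // Each repaired fragment is safe to nest inside the page's own markup.
        echo '<div class="feed-item">', $fragment, "</div>\n";
    }
}
```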

Using show-body-only is necessary when displaying feeds, because otherwise a complete HTML document would be declared for every feed shown on your page. Configuration options of particular note are shown in the following table.

Option            Action

output-html       Specifies that the output should be HTML, in contrast to the following two options.
output-xml        Specifies that the output should be XML.
output-xhtml      Specifies that the output should be XHTML.
wrap              Specifies the maximum line length before Tidy wraps to the next line. Useful for keeping the code generated by your site consistent.
clean             Instructs Tidy to strip out surplus presentational tags and attributes (think of the code generated by nearly every automated tool out there), replacing them with style rules or structural markup as required.
hide-comments     Specifies whether Tidy should omit comments from the output.
css-prefix        The prefix Tidy uses for the CSS classes it generates. Keep in mind the CSS classes used in the rest of your site to avoid conflicts.
drop-empty-paras  Specifies whether empty paragraphs should be dropped entirely or replaced with <br> tags. The HTML 4 specification does not allow empty paragraph tags.
enclose-text      Tells Tidy to enclose any text in the body within a <p> element. Useful if you want all text enclosed for CSS reasons.
fix-backslash     Tells Tidy to replace backslashes in URLs with forward slashes (defaults to yes). Internet Explorer generally allows either, while backslashes confuse everything else (and rightly so).
indent            Instructs Tidy to properly indent the code, which helps keep it readable.
show-errors       Controls how many errors Tidy reports with the output; setting it to 0 suppresses them.
show-warnings     Specifies whether warnings should be displayed.
error-file        By default errors go to stderr; use this option to have them saved to a file.
force-output      Forces Tidy to produce some output in all circumstances. This is not recommended, because the attempts Tidy makes to produce output may give a very odd-looking result.
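
Several of these options can be combined in one configuration array, and the extension's error-reporting functions can be used to see what Tidy had to fix. The following is a sketch under assumptions: tidyReport() is a hypothetical helper, and the particular mix of options is just one plausible choice for feed content.

```php
<?php
// Sketch only: tidyReport() is a hypothetical helper that returns the
// repaired markup together with Tidy's diagnostics for the input.
function tidyReport($html)
{
    if (!extension_loaded('tidy')) {
        return false;
    }
    $config = array(
        'output-xhtml'   => true, // XHTML rather than HTML output
        'hide-comments'  => true, // drop comments from the output
        'enclose-text'   => true, // wrap loose body text in <p>
        'show-body-only' => true,
    );
    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        return false;
    }
    tidy_clean_repair($tidy);
    return array(
        'html'     => tidy_get_output($tidy),
        'errors'   => tidy_error_count($tidy),
        'warnings' => tidy_warning_count($tidy),
        'log'      => tidy_get_error_buffer($tidy), // human-readable messages
    );
}

$report = tidyReport('<p>Some text <!-- editor note --> <b>bold');
if ($report !== false) {
    echo $report['html'], "\n";
    echo $report['errors'], ' error(s), ', $report['warnings'], " warning(s)\n";
}
```

Logging the warning count per feed is one way to spot a source that regularly sends malformed markup.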

Generally I am a strong proponent of storing all data in a state as close to its original or provided state as possible, then making any necessary modifications when the page is generated. This allows formatting preferences and the like to change as needed. In this case, however, as a concession to performance, I would recommend cleaning up consumed feeds at the time of consumption. If you do want to record the original form of the data (escaping it for safe SQL entry, of course), do it in a separate table.
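
The approach of cleaning at consumption time while keeping the original in a separate table might look like the following sketch. Everything named here is an assumption made for illustration: the feed_items and feed_items_raw tables, their columns, the storeFeedItem() helper, and the in-memory SQLite database. It also uses PDO prepared statements in place of manual escaping, which is a substitution rather than the technique described above.

```php
<?php
// Sketch only: table names, column names, and storeFeedItem() are hypothetical.
function storeFeedItem(PDO $db, $feedUrl, $rawHtml, $cleanHtml)
{
    $db->beginTransaction();
    try {
        // The cleaned markup goes in the table the site reads from.
        $stmt = $db->prepare('INSERT INTO feed_items (feed_url, body) VALUES (?, ?)');
        $stmt->execute(array($feedUrl, $cleanHtml));
        $itemId = (int) $db->lastInsertId();
        // The untouched original goes in a separate table.
        $stmt = $db->prepare('INSERT INTO feed_items_raw (item_id, body) VALUES (?, ?)');
        $stmt->execute(array($itemId, $rawHtml));
        $db->commit();
    } catch (Exception $e) {
        $db->rollBack();
        throw $e;
    }
    return $itemId;
}

// Assumption: an in-memory SQLite database stands in for the real one.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE feed_items (id INTEGER PRIMARY KEY, feed_url TEXT, body TEXT)');
$db->exec('CREATE TABLE feed_items_raw (item_id INTEGER, body TEXT)');

$raw = '<p>Breaking news <b>today';
$clean = $raw; // fallback when Tidy is unavailable
if (extension_loaded('tidy')) {
    // Clean once, at consumption time, rather than on every page view.
    $tidy = tidy_parse_string($raw, array('show-body-only' => true), 'utf8');
    if ($tidy !== false) {
        tidy_clean_repair($tidy);
        $clean = tidy_get_output($tidy);
    }
}
$id = storeFeedItem($db, 'http://example.com/feed', $raw, $clean);
```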




Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds
ISBN: 764589547
EAN: N/A
Year: 2006
Pages: 130
