Like it or not, someone is going to send you broken HTML at some point, and unless you have decided to strip all HTML tags from the feeds, it is going to adversely affect your site. Luckily, PHP and HTML Tidy make a great pair and make dealing with broken HTML a breeze.
There are two versions of the Tidy extension: 1.0 and 2.0. Version 1.0 is used with the PHP 4.3.x tree, and version 2.0 is used with the 5.x tree. You can check whether Tidy is installed with your version of PHP by calling the phpinfo() function; if it is present, the output will include a "tidy" section.
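A script can also make this check itself instead of relying on someone reading the phpinfo() output. The following is a minimal sketch, assuming the extension registers itself under the name tidy; the helper name tidyIsAvailable is hypothetical.

```php
<?php
// Hypothetical helper: reports whether the Tidy extension is loaded
// in the running PHP build.
function tidyIsAvailable()
{
    return extension_loaded('tidy') && function_exists('tidy_parse_string');
}

if (tidyIsAvailable()) {
    echo "Tidy is available.\n";
} else {
    echo "Tidy is not available; it needs to be installed or enabled first.\n";
}
```

A check like this is useful as a guard at the top of any script that consumes feeds, so that a missing extension produces a clear message rather than an undefined-function error.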
Assuming you don't already have it, installing Tidy under PHP 4 should be pretty easy. If your system includes PEAR, you can download and install the PECL package with one command (from a suitable account):
pear -v install tidy
Failing that, you will need to download the package directly from the PECL repository at http://pecl.php.net/package/tidy.
Tidy support is bundled with PHP 5; it just needs to be enabled at either compile time or runtime, depending on the host operating system. On Windows, enabling Tidy just requires uncommenting the following line in your php.ini file:
extension=php_tidy.dll
Then restart your web server for the changes to take effect. You can confirm that Tidy is present by checking the output of phpinfo().
Installing Tidy on a Linux system requires that you (or your host) recompile PHP to include Tidy. This is done with the --with-tidy configure option. Don't just type ./configure --with-tidy, because chances are that several other configure options are already in use, and doing so will lose them. The phpinfo() output displays your current configure options; use them as a base and add --with-tidy.
If tidylib is not installed on the machine in question (you will know because the configure script returns an error telling you so), you will need to download and install it. You can get tidylib from http://tidy.sourceforge.net/. Grab the source package, not the compiled binary (it won't have the libraries PHP needs). Build from the source package as you normally would, then reconfigure and reinstall PHP. Finally, restart your web server for the changes to take effect.
Take the following sample of broken HTML:
    <html>
    <head>
    <title>This is a horrible page</title>
    <body>
    <h1>This is a broken snippet
    <p>Notice the poor use of tags, leaving tags open, links left <a href="open.html">open
    <p>All in all, this is a horrible piece of <b>code!
    </html>
Although it is unlikely that anyone will ever provide you with a piece of code quite that bad, you need to be prepared for tags to be left open at the termination of the feed. Viewing that in a browser yields a broken HTML sample, as shown in Figure 3-2.
Figure 3-2
Giving it a quick run through HTML Tidy results in the following code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
    <html>
      <head>
        <title>
          This is a horrible page
        </title>
      </head>
      <body>
        <h1>
          This is a broken snippet
        </h1>
        <p>
          Notice the poor use of tags, leaving tags open, links left <a href="open.html">open</a>
        </p>
        <p>
          All in all, this is a horrible piece of <b>code!</b>
        </p>
      </body>
    </html>
The code involved was as follows:
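    // Load the broken HTML from disk.
    $brokenHTML = file_get_contents('./broken.html');
    // Configuration options passed to Tidy.
    $config = array('indent' => TRUE, 'output-html' => TRUE, 'wrap' => 200, 'clean' => TRUE);
    // Parse the document, repair it, and print the cleaned result.
    $tidy = tidy_parse_string($brokenHTML, $config, 'UTF8');
    tidy_clean_repair($tidy);
    echo tidy_get_output($tidy);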
The broken HTML page is loaded into the $brokenHTML variable, and then a few configuration options are set (these are covered in greater detail in a moment). The broken HTML is given to Tidy to be parsed, along with the configuration options and the desired character encoding. Finally, Tidy is asked to clean and repair the document, and the result is output. These few simple steps can save major headaches down the road when your site design is thrown out the window by a few unclosed tags floating around your displayed feeds (see Figure 3-3).
Figure 3-3
The available configuration options are quite extensive; one of particular interest when dealing with feeds is show-body-only. Using that option against the earlier example would yield the following:
    <h1>
      This is a broken snippet
    </h1>
    <p>
      Notice the poor use of tags, leaving tags open, links left <a href="open.html">open</a>
    </p>
    <p>
      All in all, this is a horrible piece of <b>code!</b>
    </p>
This is clearly necessary when feeds are embedded in your own page; otherwise there would be a complete HTML document declared for every feed shown. Configuration options of particular note are shown in the following table.
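The show-body-only approach can be wrapped in a small reusable function. The following is a minimal sketch, assuming the Tidy 2.0 functions used earlier; the function name cleanFeedFragment and its exception handling are hypothetical choices, not part of Tidy.

```php
<?php
// Hypothetical helper: repairs a feed fragment and returns only the
// markup that belongs inside <body>, ready to embed in a larger page.
function cleanFeedFragment($html)
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The Tidy extension is not loaded.');
    }

    $config = array(
        'show-body-only' => true, // omit the doctype, <html>, <head>, and <body> wrappers
        'output-html'    => true,
        'wrap'           => 200,
    );

    $tidy = tidy_parse_string($html, $config, 'UTF8');
    if ($tidy === false) {
        throw new RuntimeException('Tidy could not parse the input.');
    }
    tidy_clean_repair($tidy);

    return tidy_get_output($tidy);
}

// Usage: repair a fragment with an unclosed <b> tag.
try {
    echo cleanFeedFragment('<p>An <b>unclosed tag'), "\n";
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Throwing an exception when the extension is missing is one design choice among several; a site could just as reasonably fall back to stripping tags entirely.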
Option | Action |
---|---|
output-html | Specifies that the output should be presented as HTML, in contrast to the following two options. |
output-xml | Specifies that the output should be XML. |
output-xhtml | Specifies that the output should be XHTML. |
wrap | Specifies the maximum line length before Tidy wraps to the next line. Worth setting for consistency across the code generated by your site. |
clean | Instructs Tidy to strip out surplus presentational tags and attributes (think of the code generated by nearly every automated tool out there), replacing them with style rules or structural markup as required. |
hide-comments | Specifies whether Tidy should strip comments from its output. |
css-prefix | The prefix Tidy uses for the CSS classes it generates. Keep in mind the CSS classes used in the rest of your site to avoid conflicts. |
drop-empty-paras | Specifies whether empty paragraphs should be dropped entirely or replaced with <br> tags. The HTML 4 specification does not allow empty paragraph tags. |
enclose-text | Tells Tidy to enclose any text found directly in the body within a <p> element. Useful if you want all text enclosed for CSS reasons. |
fix-backslash | Tells Tidy to replace backslashes in URLs with forward slashes; defaults to yes. Internet Explorer generally allows either, while backslashes confuse everything else (and rightly so). |
indent | Instructs Tidy to properly indent the code; helps keep it all readable. |
show-errors | The number of errors after which Tidy stops reporting further errors; set to 0 to show none. |
show-warnings | Whether warnings should be displayed. |
error-file | By default errors go to stderr; use this option to have them saved to a file instead. |
force-output | Forces Tidy to produce some output in all circumstances. This is not recommended, because Tidy's attempts to produce something may give a very odd-looking result. |
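When Tidy runs through the PHP extension, its diagnostics can also be read back inside the script. The following is a minimal sketch of passing a few of the options above and collecting the error and warning counts, assuming the Tidy 2.0 function API; the helper name repairWithReport and the keys of the returned array are hypothetical.

```php
<?php
// Hypothetical helper: repairs $html with the given Tidy options and
// returns the result together with Tidy's diagnostics.
function repairWithReport($html, array $config = array())
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The Tidy extension is not loaded.');
    }

    $tidy = tidy_parse_string($html, $config, 'UTF8');
    if ($tidy === false) {
        throw new RuntimeException('Tidy could not parse the input.');
    }
    tidy_clean_repair($tidy);

    return array(
        'html'     => tidy_get_output($tidy),
        'errors'   => tidy_error_count($tidy),
        'warnings' => tidy_warning_count($tidy),
        // The error buffer holds the text of the errors and warnings.
        'messages' => (string) tidy_get_error_buffer($tidy),
    );
}

// Usage: strip comments and keep only the body content.
try {
    $report = repairWithReport(
        '<p>Feed text<!-- editor note --> with an <b>unclosed tag',
        array('hide-comments' => true, 'show-body-only' => true, 'wrap' => 200)
    );
    echo $report['html'], "\n";
    echo $report['warnings'], ' warning(s), ', $report['errors'], " error(s)\n";
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Logging the messages for feeds that repeatedly produce warnings is one way to find out which providers are sending the broken markup.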
Generally I am a strong proponent of storing all data in a state as close to its original or provided state as possible, then making any necessary modifications at page-display time. This allows formatting preferences and the like to change as needed. In this case, however, as a concession to performance, I would recommend repairing the formatting of consumed feeds at the time of consumption. If you do want to record the original form of the data (escaping it for safe SQL entry, of course), do it in a separate table.
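The clean-at-consumption approach, with the original kept in a separate table, might be sketched as follows. Everything here is an assumption made for illustration: the table names feed_items and feed_items_raw, the function names, and the use of PDO with prepared statements in place of manual escaping.

```php
<?php
// Hypothetical schema: repaired markup in one table, originals in another.
function createFeedTables(PDO $db)
{
    $db->exec('CREATE TABLE feed_items (id INTEGER PRIMARY KEY, html TEXT NOT NULL)');
    $db->exec('CREATE TABLE feed_items_raw (item_id INTEGER NOT NULL, html TEXT NOT NULL)');
}

// Repairs the item once, at consumption time, and stores both forms.
// $cleaner is any callable that takes broken HTML and returns repaired HTML.
function storeFeedItem(PDO $db, $rawHtml, $cleaner)
{
    $cleanHtml = call_user_func($cleaner, $rawHtml);

    // Prepared statements bind the values, so no manual escaping is needed.
    $insert = $db->prepare('INSERT INTO feed_items (html) VALUES (?)');
    $insert->execute(array($cleanHtml));
    $itemId = (int) $db->lastInsertId();

    // Keep the untouched original in its own table, keyed to the cleaned row.
    $insertRaw = $db->prepare('INSERT INTO feed_items_raw (item_id, html) VALUES (?, ?)');
    $insertRaw->execute(array($itemId, $rawHtml));

    return $itemId;
}
```

To try it, open a PDO connection (for example, an in-memory SQLite database with the DSN sqlite::memory:), call createFeedTables() once, and then pass each incoming item to storeFeedItem() along with a cleaning callable such as 'tidy_repair_string'. Pages then read only from feed_items, so Tidy is never invoked at display time.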