The Tidy extension "cleans up" messy HTML and XML files into valid and pretty-looking documents. This feature is particularly useful when you're serving lots of externally generated content. For example, you may want to allow visitors to enter HTML-enabled messages without letting them create an invalid page. Manually checking each post is laborious, but with Tidy you can automate the process.

Alternatively, Tidy can be used to reformat documents, either to reduce their file size or to make them easier for humans to read. The first option saves you bandwidth, making your pages arrive more quickly and reducing your overall hosting costs. The second option simplifies your debugging process, as you're not tracking down stray closing tags.

The Tidy extension is bundled with PHP, but not enabled, because it requires you to install the Tidy library. Download the Tidy library from http://tidy.sourceforge.net/ and add --with-tidy=DIR when you configure PHP to turn on Tidy support.

9.2.1 Basics

Interacting with Tidy is a simple three-step process. You parse the file, then clean its contents, and finally print or save the repaired file.

Use tidy_parse_file( ) to read in a file for tidying:

    $tidy = tidy_parse_file('index.html');

When your data is in a string, use tidy_parse_string( ) instead:

    // This string is missing a closing </i> tag
    $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');

Transform the document using the tidy_clean_repair( ) command:

    $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');
    tidy_clean_repair($tidy);

The tidy_clean_repair( ) function takes a Tidy resource. It returns true if everything went okay, and false on an error. It does not return the tidied document.
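Because tidy_clean_repair( ) reports success or failure rather than returning the document, its return value can be checked before you go any further. The following is a minimal sketch of that check, not a required pattern. The error messages and the $repaired variable are made up for illustration, and it assumes tidy_parse_string( ) returns false when parsing fails.

```php
<?php
// Sketch: stop early if Tidy cannot parse or repair the input.
$tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');

// Assumption: a failed parse is reported as false.
if ($tidy === false) {
    exit("Tidy could not parse the input\n");
}

// true on success, false on error; the document itself is not returned.
$repaired = tidy_clean_repair($tidy);
if (!$repaired) {
    exit("Tidy could not repair the document\n");
}
```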
Use tidy_get_output( ) to retrieve the altered file:

    $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');
    tidy_clean_repair($tidy);
    print tidy_get_output($tidy);

This prints:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
    <html>
    <head>
    <title></title>
    </head>
    <body>
    I am <b>bold and I am <i>bold and italic</i></b>
    </body>
    </html>

Tidy has not only repaired the missing </i> tag, but also turned the string into a valid HTML 3.2 file.

9.2.2 Configuring Tidy

You can configure Tidy in many ways. Options can be set at parse time, either in an array or in a configuration file. For example, to make Tidy return only the body of a cleaned document, pass an options array as the second argument:

    $options = array('show-body-only' => true);
    $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>', $options);
    tidy_clean_repair($tidy);
    print tidy_get_output($tidy);

This prints:

    I am <b>bold and I am <i>bold and italic</i></b>

This is useful when you're cleaning document fragments, such as message board posts or HTML that is placed inside a template.

Alternatively, you can place this information in a file and pass the filename:

    show-body-only: true
    logical-emphasis: true

In the configuration file, place each option on its own line, and separate the option name from its value with a colon (:). Then, provide Tidy with the file's location:

    $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>', 'tidy.cnf');
    tidy_clean_repair($tidy);
    print tidy_get_output($tidy);

This prints:

    I am <strong>bold and I am <em>bold and italic</em></strong>

In addition to fixing the HTML, turning on logical-emphasis has switched your b and i tags to their logical equivalents.

Table 9-1 contains commonly used Tidy settings. A complete list of available options is at http://tidy.sourceforge.net/docs/quickref.html.

Table 9-1. Important Tidy configuration options
9.2.3 Optimize Files

Tidy provides options that can reduce your file size by stripping away extra whitespace and comments and by converting verbose <font> tags into CSS. Here is a sample configuration for Tidy that aggressively strips away as many unneeded characters as possible:

    $options = array(
        'clean' => true,
        'drop-proprietary-attributes' => true,
        'drop-empty-paras' => true,
        'hide-comments' => true,
        'hide-endtags' => true,
        'join-classes' => true,
        'join-styles' => true,
        'wrap' => 0,
    );

    $tidy = tidy_parse_file('http://www.example.org/', $options);
    tidy_clean_repair($tidy);
    print $tidy;

The overall effect of these options is to eliminate HTML that the browser doesn't use when rendering the page and to combine duplicated styles into one unified style. Each setting is detailed in Table 9-2.

Remember that even a small improvement in page size is multiplied by every single page your server delivers. The combined reduction in bandwidth may translate to serious cost savings on a high-traffic site.

Table 9-2. Optimizing Tidy configuration options
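One way to see what the size-reducing settings in this section buy you is to tidy the same markup with and without them and compare the lengths. The following is a minimal sketch rather than a benchmark. The sample markup and the variable names are made up for illustration, and it uses only two of the options from the sample configuration (hide-comments and wrap).

```php
<?php
// Illustrative input: it contains a comment that a browser ignores.
$html = "<p>Hello,   world</p>\n"
      . "<!-- navigation starts here; remove before launch -->\n";

// Two of the size-reducing options from the sample configuration.
$options = array('hide-comments' => true, 'wrap' => 0);

// Tidy the markup once with default settings...
$plain = tidy_parse_string($html);
tidy_clean_repair($plain);

// ...and once with the size-reducing settings.
$small = tidy_parse_string($html, $options);
tidy_clean_repair($small);

$before = strlen(tidy_get_output($plain));
$after  = strlen(tidy_get_output($small));

printf("default: %d bytes, optimized: %d bytes\n", $before, $after);
```

The exact byte counts depend on the installed Tidy library, so treat them as a rough guide to which options matter for your own pages.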
9.2.4 Object-Oriented Interface

Tidy also has an object-oriented interface:

    $tidy = new tidy;
    $tidy->parseString('I am <b>bold and I am <i>bold and italic</b>');
    $tidy->cleanRepair( );
    // Printing the object outputs the repaired document
    print $tidy;

Like other extensions with dual procedural and OO interfaces, Tidy's methods use studlyCaps instead of underscores. Additionally, you don't pass a Tidy resource to the methods, because the resource is stored in the object.
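The object-oriented interface is convenient to wrap in a small helper when you clean fragments such as message board posts. The following is a minimal sketch under stated assumptions: repair_fragment( ) is a hypothetical function name, not part of the Tidy extension, and the sketch assumes the object's parseString( ) method accepts the same options array as tidy_parse_string( ) and that casting the object to a string yields the repaired markup.

```php
<?php
// Hypothetical helper (not part of the Tidy extension): repairs an HTML
// fragment and returns only the body markup, without the surrounding
// <html>, <head>, and <body> elements.
function repair_fragment($html)
{
    $tidy = new tidy;

    // show-body-only keeps Tidy from wrapping the fragment in a full page.
    $tidy->parseString($html, array('show-body-only' => true));
    $tidy->cleanRepair();

    // Assumption: casting the object to a string gives the repaired markup.
    return (string) $tidy;
}

$fixed = repair_fragment('I am <b>bold and I am <i>bold and italic</b>');
print $fixed;
```

Keeping the options inside the helper means every fragment is cleaned with the same settings, whichever template it ends up in.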