9.2 Tidy


The Tidy extension "cleans up" messy HTML and XML files into valid and pretty-looking documents. This feature is particularly useful when you're serving lots of externally generated content.

For example, suppose you want to allow visitors to enter HTML-enabled messages, but you don't want them to be able to create an invalid page. Manually checking each post is laborious, but with Tidy you can automate the process.

Alternatively, Tidy can be used to reformat documents, either to reduce their file size or to make them easier for humans to read. The first option saves you bandwidth, making your pages arrive more quickly and reducing your overall hosting costs. The second option simplifies debugging, because you spend less time tracking down stray closing tags.

The Tidy extension is bundled with PHP, but it is not enabled by default because it requires you to install the Tidy library. Download the Tidy library from http://tidy.sourceforge.net/ and add --with-tidy=DIR when you configure PHP to turn on Tidy support.
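Before relying on Tidy, it can help to confirm that the extension was actually compiled into your PHP build. The following is a minimal sketch using PHP's standard extension_loaded( ) function; the helper name tidy_status_message( ) and the message wording are hypothetical, chosen only for this illustration:

```php
<?php
// Hypothetical helper: report whether the Tidy extension is available.
function tidy_status_message()
{
    if (extension_loaded('tidy')) {
        // tidy_get_release() returns the release date of the Tidy library.
        return 'Tidy is enabled (library release: ' . tidy_get_release() . ')';
    }
    return 'Tidy is not enabled; rebuild PHP with --with-tidy=DIR';
}

print tidy_status_message() . "\n";
```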

9.2.1 Basics

Interacting with Tidy is a simple three-step process. You parse the file, then clean its contents, and finally print or save the repaired file.

Use tidy_parse_file( ) to read in a file for tidying:

 $tidy = tidy_parse_file('index.html'); 

When your data is in a string, use tidy_parse_string( ) instead:

 // This string is missing a closing </i> tag
 $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');

Transform the document using the tidy_clean_repair( ) command:

 $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');
 tidy_clean_repair($tidy);

The tidy_clean_repair( ) function takes a Tidy resource. It returns true if everything went okay, and false on an error. It does not return the tidied document. Use tidy_get_output( ) to retrieve the altered file:

 $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>');
 tidy_clean_repair($tidy);
 print tidy_get_output($tidy);

This prints:

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
 <html>
 <head>
 <title></title>
 </head>
 <body>
 I am <b>bold and I am <i>bold and italic</i></b>
 </body>
 </html>

Tidy has not only repaired the missing </i> tag, but also turned the string into a valid HTML 3.2 file.
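Because each step can fail, it may be worth checking the return values along the way. The sketch below wraps the three steps in a single function; the name tidy_document( ) is a hypothetical helper, not part of the extension, and the code assumes the Tidy extension is enabled:

```php
<?php
// Hypothetical helper: parse, clean, and return a tidied document.
// Returns false if parsing or repairing fails.
function tidy_document($html)
{
    // Step 1: parse the markup.
    $tidy = tidy_parse_string($html);
    if (!$tidy) {
        return false;
    }

    // Step 2: clean and repair; this returns true or false, not the document.
    if (!tidy_clean_repair($tidy)) {
        return false;
    }

    // Step 3: retrieve the repaired document as a string.
    return tidy_get_output($tidy);
}

$clean = tidy_document('I am <b>bold and I am <i>bold and italic</b>');
if ($clean === false) {
    print "Tidy could not repair the document\n";
} else {
    print $clean;
}
```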

9.2.2 Configuring Tidy

You can configure Tidy in innumerable ways. These options can be set at parse time in an array or in a configuration file.

For example, to make Tidy return only the body of a cleaned document:

 $options = array('show-body-only' => true);
 $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>',
                           $options);
 tidy_clean_repair($tidy);
 print tidy_get_output($tidy);

 I am <b>bold and I am <i>bold and italic</i></b>

This is useful when you're cleaning document fragments, such as message board posts or HTML that is placed inside a template.
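For fragments such as message board posts, the parse, clean, and output steps can also be collapsed into one call with tidy_repair_string( ), which accepts the same options array. This is a sketch only; clean_post( ) is a hypothetical name, and the code assumes the Tidy extension is enabled:

```php
<?php
// Hypothetical helper for repairing a user-submitted HTML fragment.
function clean_post($post)
{
    // Return only the body, without the surrounding <html> and <head>.
    $options = array('show-body-only' => true);

    // tidy_repair_string() parses, repairs, and returns the result in one call.
    return tidy_repair_string($post, $options);
}

print clean_post('I am <b>bold and I am <i>bold and italic</b>');
```

Note that Tidy repairs markup; it does not filter out unwanted tags or scripts, so untrusted input still needs separate filtering.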

Alternatively, you can place this information in a file and pass the filename:

 show-body-only: true
 logical-emphasis: true

In the configuration file, place each option on its own line, and separate the option's name from its value with a colon (:). Then, provide Tidy with the file's location:

 $tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>',
                           'tidy.cnf');
 tidy_clean_repair($tidy);
 print tidy_get_output($tidy);

 I am <strong>bold and I am <em>bold and italic</em></strong>

In addition to fixing the HTML, turning on logical-emphasis has switched your b and i tags to their logical equivalents.
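To try the configuration-file approach without creating a file by hand, a script can write the settings to a temporary file first. The following sketch assumes the Tidy extension is enabled and that the system temporary directory is writable; the temporary filename is generated at runtime:

```php
<?php
// Write the two settings to a temporary configuration file.
$config = tempnam(sys_get_temp_dir(), 'tidy');
file_put_contents($config, "show-body-only: true\nlogical-emphasis: true\n");

// Pass the filename (a string) instead of an options array.
$tidy = tidy_parse_string('I am <b>bold and I am <i>bold and italic</b>',
                          $config);
tidy_clean_repair($tidy);
$output = tidy_get_output($tidy);
print $output;

// Remove the temporary file once Tidy has read it.
unlink($config);
```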

Table 9-1 contains commonly used Tidy settings. A complete list of available options is at http://tidy.sourceforge.net/docs/quickref.html.

Table 9-1. Important Tidy configuration options

 Name             Description                                   Values         Default
 ---------------  --------------------------------------------  -------------  -------
 clean            Should Tidy convert font tags to CSS?         Boolean        no
 hide-endtags     Omit ending tags?                             Boolean        no
 indent           Indent block-level tags?                      yes, no, auto  no
 indent-spaces    Number of spaces per indent                   Integer        2
 markup           Create a "Pretty Print" version of the file?  Boolean        yes
 output-xml       Output XML instead of HTML?                   Boolean        no
 output-xhtml     Output XHTML instead of HTML?                 Boolean        no
 show-body-only   Only return document body                     Boolean        no
 wrap             Length of line wrap                           Integer        68
 wrap-attributes  Wrap attribute values?                        Boolean        no
 wrap-php         Wrap PHP code?                                Boolean        no
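Default values may differ between versions of the Tidy library, so it can be useful to ask Tidy which values are actually in effect. This sketch uses tidy_getopt( ) to read settings back from a parsed document; the sample markup and the choice of options to inspect are arbitrary, and the code assumes the Tidy extension is enabled:

```php
<?php
// Parse a small document, overriding one option explicitly.
$tidy = tidy_parse_string('<p>Hello</p>', array('wrap' => 120));

// Read back a few settings; options not overridden show their defaults.
foreach (array('indent-spaces', 'wrap', 'show-body-only') as $name) {
    print $name . ' = ';
    var_dump(tidy_getopt($tidy, $name));
}
```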


9.2.3 Optimize Files

Tidy provides options that can reduce your file size by stripping away extra whitespace and comments and by converting verbose <font> tags into CSS.

Here is a sample configuration for Tidy that aggressively strips away as many unneeded characters as possible:

 $options = array(
     'clean' => true,
     'drop-proprietary-attributes' => true,
     'drop-empty-paras' => true,
     'hide-comments' => true,
     'hide-endtags' => true,
     'join-classes' => true,
     'join-styles' => true,
     'wrap' => 0,
 );

 $tidy = tidy_parse_file('http://www.example.org/', $options);
 tidy_clean_repair($tidy);
 print $tidy;

The overall effect of these options is to eliminate HTML that the browser doesn't use when rendering the page and to combine duplicated styles into one unified style. Each setting is detailed in Table 9-2.

Remember that even a small improvement in page size is multiplied by every single page your server delivers. The combined reduction in bandwidth may translate to serious cost savings on a high-traffic site.
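One way to judge whether these settings are worth it for your own pages is to compare the size of Tidy's output with and without them. The sketch below uses invented sample markup and only a subset of the options above; it assumes the Tidy extension is enabled, and the byte counts it prints will depend on your input and Tidy version:

```php
<?php
// Sample markup invented for this illustration.
$html = '<html><head><title>Demo</title></head><body>'
      . '<!-- internal note -->'
      . '<p>Hello,   world</p>'
      . '</body></html>';

// Baseline: Tidy with its default settings.
$plain = tidy_repair_string($html);

// Optimized: a subset of the size-reducing options from this section.
$options = array(
    'hide-comments'    => true,
    'drop-empty-paras' => true,
    'wrap'             => 0,
);
$small = tidy_repair_string($html, $options);

printf("default: %d bytes, optimized: %d bytes\n",
       strlen($plain), strlen($small));
```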

Table 9-2. Optimizing Tidy configuration options

 Name                         Description                                                                   Values   Default
 ---------------------------  ----------------------------------------------------------------------------  -------  -------
 clean                        Replace presentational tags, such as <center>, with style rules?              Boolean  no
 drop-proprietary-attributes  Eliminate proprietary attributes added by programs such as Microsoft Office?  Boolean  no
 drop-font-tags               Eliminate <font> tags when used with the clean option?                        Boolean  no
 drop-empty-paras             Eliminate empty <p> tags?                                                     Boolean  yes
 hide-comments                Remove HTML comments?                                                         Boolean  no
 hide-endtags                 Remove closing tags when possible according to the Document Type?             Boolean  no
 join-classes                 Merge related classes together?                                               Boolean  no
 join-styles                  Merge related styles together?                                                Boolean  yes
 wrap                         Width for line wrapping (a value of 0 disables wrapping).                     Integer  68


9.2.4 Object-Oriented Interface

Tidy also has an object-oriented interface:

 $tidy = new Tidy;
 $tidy->parseString('I am <b>bold and I am <i>bold and italic</b>');
 $tidy->cleanRepair( );
 // Printing the object converts it to a string holding the tidied document
 print $tidy;

Like other extensions with dual procedural and OO interfaces, Tidy's methods use studlyCaps instead of underscores. Additionally, you don't pass a Tidy resource to the methods, because the resource is stored in the object.
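Configuration options work the same way with the object-oriented interface. The following sketch passes an options array to parseString( ) and then reads the errorBuffer property, which holds the warnings Tidy generated while parsing; it assumes the Tidy extension is enabled, and the exact warning text depends on the Tidy version:

```php
<?php
$tidy = new tidy;

// The second argument takes the same options array as tidy_parse_string().
$tidy->parseString('I am <b>bold and I am <i>bold and italic</b>',
                   array('show-body-only' => true));
$tidy->cleanRepair();

// Converting the object to a string yields the tidied markup.
$output = (string) $tidy;
print $output . "\n";

// errorBuffer lists the problems Tidy found in the input.
print $tidy->errorBuffer . "\n";
```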



Upgrading to PHP 5
ISBN: 0596006365
Year: 2004
Pages: 144
