Hack 84. Spider Your Site

Use the HTTP_Client PEAR module to create a spider that walks all of the pages on your web site.

This hack demonstrates using PHP to write a spider that checks the pages on your site. It's ideal for testing, making it simple to ensure that all of the PHP and HTML on your site still responds properly after an update. The spider uses the HTTP_Client PEAR package; if it's not already installed, you can add it with pear install HTTP_Client.

8.7.1. The Code

Save the code in Example 8-12 as spider.php.

Example 8-12. A simple spider
<?php
require_once 'HTTP/Client.php';
require_once 'HTTP/Request/Listener.php';

$baseurl = "http://localhost/phphacks/spider/test/index.html";
$pages = array();

add_urls( $baseurl );
while( ( $page = next_page() ) != null )
{
  add_urls( $page );
}

// Find the first discovered-but-unvisited page, or null when done
function next_page()
{
  global $pages;
  foreach( array_keys( $pages ) as $page )
  {
    if ( $pages[ $page ] == null )
      return $page;
  }
  return null;
}

// Fetch a page, record its response time, and queue any new links
function add_urls( $page )
{
  global $pages;

  $start = microtime( true );
  $urls = get_urls( $page );
  $resptime = microtime( true ) - $start;

  print "$page...\n";

  $pages[ $page ] = array( 'resptime' => floor( $resptime * 1000 ),
    'url' => $page );

  foreach( $urls as $url )
  {
    if ( !array_key_exists( $url, $pages ) )
      $pages[ $url ] = null;
  }
}

// Retrieve a page and return the URLs of all local links on it
function get_urls( $page )
{
  // The base is the page URL with the trailing file name removed
  $base = preg_replace( "/\/([^\/]*?)$/", "/", $page );

  $client = new HTTP_Client();
  $client->get( $page );
  $resp = $client->currentResponse();
  $body = $resp['body'];

  $out = array();

  preg_match_all( "/(\<a.*?\>)/is", $body, $matches );
  foreach( $matches[0] as $match )
  {
    preg_match( "/href=(.*?)[\s|\>]/i", $match, $href );
    if ( $href != null )
    {
      $href = $href[1];
      $href = preg_replace( "/^\"/", "", $href );
      $href = preg_replace( "/\"$/", "", $href );
      if ( preg_match( "/^mailto:/", $href ) )
      {
        // Ignore mailto: links
      }
      elseif ( preg_match( "/^http:\/\//", $href ) )
      {
        // Follow absolute links only if they stay under the base URL
        if ( preg_match( '/^' . preg_quote( $base, '/' ) . '/', $href ) )
          $out []= $href;
      }
      else
      {
        // Relative links are resolved against the base URL
        $out []= $base.$href;
      }
    }
  }

  return $out;
}

ob_start();
?>
<html>
<head>
<title>Spider report</title>
</head>
<body>
<table width="600">
<tr>
<th>URL</th>
<th>Response Time (ms)</th>
</tr>
<?php foreach( array_values( $pages ) as $page ) { ?>
<tr>
<td><?php echo( $page['url'] ); ?></td>
<td><?php echo( $page['resptime'] ); ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>
<?php
$html = ob_get_clean();
$fh = fopen( "report.html", "w" );
fwrite( $fh, $html );
fclose( $fh );
?>

The spider code starts with a single URL and calls add_urls( ) on it. The add_urls( ) function retrieves the specified page, parses out all of the links, and adds any links it finds to the global $pages array. The main loop then calls next_page( ) to get the next unvisited entry in $pages, spidering each page in turn until none remain. Once all of the pages have been spidered, the second half of the script writes out an HTML report with the result of each page fetch on the site.
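The crawl itself is a simple worklist algorithm: an associative array maps each discovered URL to null until that page has been visited. A minimal, self-contained sketch of the same pattern (with a hard-coded site map standing in for the real HTTP fetch, so no PEAR packages are needed) might look like this:

```php
<?php
// A hypothetical stand-in for get_urls(): instead of fetching over HTTP,
// look up a page's links in a hard-coded site map.
function get_links( $page, $sitemap )
{
  return isset( $sitemap[ $page ] ) ? $sitemap[ $page ] : array();
}

function crawl( $start, $sitemap )
{
  // null marks a discovered-but-unvisited page, as in spider.php
  $pages = array( $start => null );
  while( true )
  {
    // next_page(): find the first unvisited entry
    $next = null;
    foreach( $pages as $url => $info )
    {
      if ( $info === null ) { $next = $url; break; }
    }
    if ( $next === null )
      break;

    // add_urls(): mark the page visited and queue any new links
    $pages[ $next ] = array( 'url' => $next );
    foreach( get_links( $next, $sitemap ) as $link )
    {
      if ( !array_key_exists( $link, $pages ) )
        $pages[ $link ] = null;
    }
  }
  return array_keys( $pages );
}

$sitemap = array(
  'index.html' => array( 'test1.html', 'test2.html' ),
  'test1.html' => array( 'index.html' ),  // a cycle back to the start
);
print implode( "\n", crawl( 'index.html', $sitemap ) ) . "\n";
?>
```

Because a link is only queued when it isn't already a key in $pages, cycles between pages can't trap the crawler, and each page is fetched exactly once.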

The rest of this hack's examples are test pages for example's sake. Save the first of these, shown in Example 8-13, as index.html.

Example 8-13. A sample starting page
<html><body>
<a href="test1.html">Test 1</a><br/>
<a href="test2.html">Test 2</a><br/>
<a href="test3.html">Test 3</a><br/>
</body></html>

Save the code in Example 8-14 as test1.html.

Example 8-14. A second sample page
<html><body>
<a href="http://www.cnn.com">CNN</a>
</body></html>

Save the code in Example 8-15 as test2.html.

Example 8-15. A third sample page
 <html><body> </body></html> 

Save the code in Example 8-16 as test3.html.

Example 8-16. A fourth sample page
 <html><body> </body></html> 

8.7.2. Running the Hack

Save the test files (index.html, test1.html, test2.html, and test3.html) in a test subdirectory on your server, matching the URL in spider.php. Run the spider using the PHP command-line interpreter:

% php spider.php
http://localhost/phphacks/spider/test/index.html...
http://localhost/phphacks/spider/test/test1.html...
http://localhost/phphacks/spider/test/test2.html...
http://localhost/phphacks/spider/test/test3.html...

The spider prints each URL it visits to the console as it walks the site. In addition, it creates an HTML report of what it spidered, as shown in Figure 8-6.

Figure 8-6. The report from the spider


This report shows the time required to fetch each page, in milliseconds. That figure includes both the network transit time and the time the server took to build the page. Because these are static HTML pages, and the web server and spider are both running locally, the fetch times are almost instantaneous; as a rule of thumb, any response under 200 ms is considered quick.
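The timing in add_urls( ) relies on microtime( true ) returning the current time in seconds as a float, so subtracting two readings and multiplying by 1,000 yields milliseconds. Here is the same pattern in isolation, with usleep( ) standing in for the HTTP fetch:

```php
<?php
// Timing pattern from add_urls(): microtime(true) gives float seconds,
// so (end - start) * 1000 is the elapsed time in milliseconds.
$start = microtime( true );
usleep( 50000 );  // stand-in for the page fetch (~50 ms)
$elapsed_ms = floor( ( microtime( true ) - $start ) * 1000 );
print "fetch took {$elapsed_ms} ms\n";
?>
```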

One thing to watch out for: this script will not find stranded pages. If users bookmark pages that aren't linked from your main site, those pages can break without the spider ever seeing or reporting them.
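One way to catch stranded pages is to compare the list of files the server can serve (gathered, for example, with glob( ) over the document root) against the URLs the spider actually reached. A hypothetical helper for that check, operating on plain filename arrays:

```php
<?php
// Hypothetical cross-check: anything the server can serve that the
// spider never reached is a stranded page.
function find_stranded( $served, $spidered )
{
  $stranded = array();
  foreach ( $served as $name )
  {
    if ( !in_array( $name, $spidered ) )
      $stranded[] = $name;
  }
  return $stranded;
}

// $served might come from glob() over the document root; these values
// are illustrative only.
$served   = array( 'index.html', 'test1.html', 'test2.html', 'old.html' );
$spidered = array( 'index.html', 'test1.html', 'test2.html' );
print "Stranded: " . implode( ', ', find_stranded( $served, $spidered ) ) . "\n";
?>
```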


8.7.3. See Also

  • "Test Your Application with Simulated Users" [Hack #82]

  • "Test Your Application with Robots" [Hack #83]



PHP Hacks: Tips & Tools For Creating Dynamic Websites
ISBN: 0596101392
Year: 2006
Pages: 163