Recipe 13.18. Program: Finding Fresh Links


Example 13-63 is a modification of the program in Example 13-62 that produces a list of links and their last-modified times. If the server on which a URL lives doesn't provide a last-modified time, the program prints only the status for that URL. If the program can't retrieve the URL successfully, it prints out the status code it got when it tried to retrieve the URL. Run the program by passing it a URL to scan for links:

% php fresh-links.php http://www.oreilly.com
https://epoch.oreilly.com/account/default.orm: MOVED: https://epoch.oreilly.com/lib/p_sso.orm?d=account
https://epoch.oreilly.com/shop/cart.orm: OK
http://www.oreilly.com/: OK; Last Modified: Mon, 08 May 2006 22:11:04 GMT
http://oreillynet.com/: OK
http://www.oreilly.com/store/: OK
http://safari.oreilly.com: OK
http://conferences.oreillynet.com/: OK
http://www.oreillylearning.com: OK
http://academic.oreilly.com: MOVED: http://academic.oreilly.com/index.csp
...

This output is from a run of the program at about 11:48 P.M. GMT on May 8, 2006. Most links aren't accompanied by a last-modified time; this means the server didn't provide one, so the page is probably dynamic. The link to http://www.oreilly.com/ shows that page being about 90 minutes old. The link to http://academic.oreilly.com shows that it has been moved elsewhere, as reported by the output of stale-links.php in Recipe 13.17.
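The age of a page can be worked out by comparing its Last-Modified header with the time of the run. The following is a minimal sketch of that arithmetic, not part of fresh-links.php. The helper name minutes_since_modified() is hypothetical, and the run time of exactly 23:48:00 GMT is an assumption, since the text gives only an approximate time.

```php
<?php
// Hypothetical helper (not part of fresh-links.php): whole minutes elapsed
// between a Last-Modified header value and a reference Unix timestamp.
function minutes_since_modified($lastModified, $now)
{
    $modified = strtotime($lastModified);
    if ($modified === false) {
        return null; // the header value could not be parsed as a date
    }
    return (int) floor(($now - $modified) / 60);
}

// Assumed run time: 23:48:00 GMT on May 8, 2006.
$now = gmmktime(23, 48, 0, 5, 8, 2006);
$age = minutes_since_modified('Mon, 08 May 2006 22:11:04 GMT', $now);
print "Age: $age minutes\n";
```

Under that assumed run time, the difference is 1 hour, 36 minutes, and 56 seconds, so the sketch reports 96 minutes, in line with the "about 90 minutes" figure.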

The program to find fresh links is conceptually almost identical to the program to find stale links. It uses the same techniques to pull links out of a page; however, it uses the HTTP_Request class instead of cURL to retrieve URLs. The code to get the base URL specified on the command line is inside a loop so that it can follow any redirects that are provided and easily return the final URL in a redirect chain.
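The redirect-following loop can be sketched on its own, separate from the HTTP library. The version below is an illustration under stated assumptions rather than the program's actual code: resolve_final_url() is a hypothetical name, and $fetch is an assumed callable that stands in for HTTP_Request and returns an array with 'code' and 'location' keys. It uses an anonymous function, so it assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: follow a redirect chain and return the last URL reached.
// $fetch is an assumed callable taking a URL and returning
// array('code' => int, 'location' => string); it stands in for HTTP_Request.
function resolve_final_url($url, $fetch, $maxRedirects = 10)
{
    while ($maxRedirects > 0) {
        $response = $fetch($url);
        $isRedirect = ($response['code'] >= 300) && ($response['code'] < 400);
        if (! ($isRedirect && strlen($response['location']))) {
            break; // not a redirect, or no Location header to follow
        }
        $url = $response['location'];
        $maxRedirects--;
    }
    return $url;
}

// Canned responses in place of real network requests (example.com URLs
// are placeholders).
$canned = array(
    'http://example.com/a' => array('code' => 301, 'location' => 'http://example.com/b'),
    'http://example.com/b' => array('code' => 302, 'location' => 'http://example.com/c'),
    'http://example.com/c' => array('code' => 200, 'location' => ''),
);
$fetch = function ($url) use ($canned) {
    return $canned[$url];
};

$final = resolve_final_url('http://example.com/a', $fetch);
print "$final\n";
```

Capping the number of redirects, as the program does with $max_redirects, keeps a loop of mutually redirecting URLs from running forever.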

Once a page has been retrieved, each linked URL is retrieved with the HEAD method. In addition to printing out a new location for moved links, the program prints out the Last-Modified header if it's available.

fresh-links.php

<?php
error_reporting(E_ALL);
require_once 'HTTP/Request.php';

if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}

$url = $_SERVER['argv'][1];

// Load the page
$r = load_with_http_request($url);
if (! strlen($r->getResponseBody())) {
    die("No page retrieved from $url");
}

// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($r->getResponseBody(), $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');

// Compute the Base URL for relative links.
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
    $URLParts = parse_url($r->_url->getURL());
    if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
        $basePath = '';
    } else {
        $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
    }
    // parse_url() reports credentials under the 'user' and 'pass' keys
    if (isset($URLParts['user']) || isset($URLParts['pass'])) {
        $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
        $auth .= ':';
        $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
        $auth .= '@';
    } else {
        $auth = '';
    }
    $baseURL = $URLParts['scheme'] . '://' .
               $auth . $URLParts['host'] .
               $basePath;
}

// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();

// Grab all links
$links = $xpath->query('//xhtml:a/@href');

foreach ($links as $node) {
    $link = $node->nodeValue;
    // Resolve relative links
    if (! preg_match('#^(http|https|mailto):#', $link)) {
        if (((strlen($link) == 0)) || ($link[0] != '/')) {
            $link = '/' . $link;
        }
        $link = $baseURL . $link;
    }
    // Skip this link if we've seen it already
    if (isset($seenLinks[$link])) {
        continue;
    }
    // Mark this link as seen
    $seenLinks[$link] = true;
    // Print the link we're visiting
    print $link.': ';
    flush();

    $r = load_with_http_request($link, 'HEAD');
    // Decide what to do based on the response code
    // 2xx response codes mean the page is OK
    if (($r->getResponseCode() >= 200) && ($r->getResponseCode() < 300)) {
        $status = 'OK';
    }
    // 3xx response codes mean redirection
    else if (($r->getResponseCode() >= 300) && ($r->getResponseCode() < 400)) {
        $status = 'MOVED';
        if (strlen($location = $r->getResponseHeader('location'))) {
            $status .= ": $location";
        }
    }
    // Other response codes mean errors
    else {
        $status = "ERROR: {$r->getResponseCode()}";
    }
    // Append the Last-Modified header when the server sent one
    if (strlen($lastModified = $r->getResponseHeader('last-modified'))) {
        $status .= "; Last Modified: $lastModified";
    }
    // Print what we know about the link
    print "$status\n";
}

function load_with_http_request($url, $method = 'GET') {
    if ($method == 'GET') {
        // Follow redirects, up to a fixed limit, to reach the final URL
        $done = false;
        $max_redirects = 10;
        while ((! $done) && ($max_redirects > 0)) {
            $r = new HTTP_Request($url);
            $r->sendRequest();
            $responseCode = $r->getResponseCode();
            if (($responseCode >= 300) && ($responseCode < 400) &&
                strlen($location = $r->getResponseHeader('location'))) {
                $url = $location;
                $max_redirects--;
            } else {
                $done = true;
            }
        }
    } else {
        $r = new HTTP_Request($url);
        $r->setMethod(HTTP_REQUEST_METHOD_HEAD);
        $r->sendRequest();
    }
    return $r;
}
?>




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
EAN: 2147483647
Year: 2006
Pages: 445
