Recipe 13.17. Program: Finding Stale Links


The stale-links.php program in Example 13-62 produces a list of links in a page and their status. It tells you if the links are okay, if they've been moved somewhere else, or if they're bad. Run the program by passing it a URL to scan for links:

% php stale-links.php http://www.oreilly.com
http://www.oreilly.com/: OK
http://oreillynet.com/: OK
http://www.oreilly.com/store/: OK
http://safari.oreilly.com: OK
http://conferences.oreillynet.com/: OK
http://www.oreillylearning.com: OK
http://academic.oreilly.com: MOVED: http://academic.oreilly.com/index.csp
http://www.oreilly.com/about/: OK
...

The stale-links.php program uses the cURL extension to retrieve web pages. First, it retrieves the URL specified on the command line. Once the page has been retrieved, the program uses the XPath technique from Recipe 13.11 to get a list of the links in the page. Then, after prepending a base URL to each relative link, it requests each link in turn. Because only the headers of these responses are needed, the program uses the HEAD method instead of GET by setting the CURLOPT_NOBODY option. Setting CURLOPT_HEADER tells curl_exec() to include the response headers in the string it returns. Redirects are followed only for the initial GET request, not for the HEAD requests, so a moved link shows up as a 3xx response. Based on the response code, the program prints the status of the link, along with its new location if it has been moved.
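The status decision described above can be tried without any network access. The following is a minimal sketch, not part of stale-links.php: the function name link_status() and the sample header block are made up for illustration, and the header match is made case-insensitive here, which the original listing does not do.

```php
<?php
// Hypothetical helper (not part of stale-links.php): turn an HTTP status
// code plus the raw response headers into the status text the program prints.
function link_status(int $httpCode, string $rawHeaders): string
{
    // 2xx: the link is fine
    if ($httpCode >= 200 && $httpCode < 300) {
        return 'OK';
    }
    // 3xx: the link has moved; report the Location header if there is one
    if ($httpCode >= 300 && $httpCode < 400) {
        // Header names are case-insensitive, hence the /i modifier
        if (preg_match('/^Location:\s*(.*)$/mi', $rawHeaders, $match)) {
            return 'MOVED: ' . trim($match[1]);
        }
        return 'MOVED';
    }
    // Anything else (including 0, which cURL reports when it got no
    // HTTP response at all) is treated as an error
    return "ERROR: $httpCode";
}

// Made-up header block in the shape curl_exec() returns with CURLOPT_HEADER set
$headers = "HTTP/1.1 301 Moved Permanently\r\n"
         . "Location: http://academic.oreilly.com/index.csp\r\n"
         . "\r\n";

print link_status(200, '') . "\n";       // OK
print link_status(301, $headers) . "\n"; // MOVED: http://academic.oreilly.com/index.csp
print link_status(404, '') . "\n";       // ERROR: 404
```

Keeping the decision in a function of its own makes it easy to check the three branches with canned input before pointing the program at a live site.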

Example 13-62. stale-links.php

<?php
if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}

$url = $_SERVER['argv'][1];

// Load the page
list($page,$pageInfo) = load_with_curl($url);

if (! strlen($page)) {
    die("No page retrieved from $url\n");
}

// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($page, $opts);

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');

// Compute the Base URL for relative links
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
    $URLParts = parse_url($pageInfo['url']);
    if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
        $basePath = '';
    } else {
        $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
    }
    // parse_url() reports credentials under the keys 'user' and 'pass'
    if (isset($URLParts['user']) || isset($URLParts['pass'])) {
        $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
        $auth .= ':';
        $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
        $auth .= '@';
    } else {
        $auth = '';
    }
    $baseURL = $URLParts['scheme'] . '://' .
               $auth . $URLParts['host'] .
               $basePath;
}

// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();

// Grab all links
$links = $xpath->query('//xhtml:a/@href');

foreach ($links as $node) {
    $link = $node->nodeValue;
    // resolve relative links
    if (! preg_match('#^(http|https|mailto):#', $link)) {
        if (((strlen($link) == 0)) || ($link[0] != '/')) {
            $link = '/' . $link;
        }
        $link = $baseURL . $link;
    }
    // Skip this link if we've seen it already
    if (isset($seenLinks[$link])) {
        continue;
    }
    // Mark this link as seen
    $seenLinks[$link] = true;
    // Print the link we're visiting
    print $link.': ';
    flush();

    list($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD');
    // Decide what to do based on the response code
    // 2xx response codes mean the page is OK
    if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) {
        $status = 'OK';
    }
    // 3xx response codes mean redirection
    else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) {
        $status = 'MOVED';
        if (preg_match('/^Location: (.*)$/m',$linkHeaders,$match)) {
            $status .= ': ' . trim($match[1]);
        }
    }
    // Other response codes mean errors
    else {
        $status = "ERROR: {$linkInfo['http_code']}";
    }
    // Print what we know about the link
    print "$status\n";
}

function load_with_curl($url, $method = 'GET') {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    if ($method == 'GET') {
        // Follow redirects only when fetching the page to scan
        curl_setopt($c,CURLOPT_FOLLOWLOCATION, true);
    }
    else if ($method == 'HEAD') {
        // Ask for headers only, and include them in the returned string
        curl_setopt($c, CURLOPT_NOBODY, true);
        curl_setopt($c, CURLOPT_HEADER, true);
    }
    $response = curl_exec($c);
    return array($response, curl_getinfo($c));
}
?>
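The link-gathering step can also be tried on its own against a literal HTML string. This is only a sketch under stated assumptions: it swaps the Tidy conversion used in the listing for DOMDocument::loadHTML(), which tolerates markup that is not well formed, and the function name extract_links() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper (not part of stale-links.php): return the distinct
// href values of all <a> elements, in document order.
function extract_links(string $html): array
{
    // loadHTML() rejects an empty string, so handle that case up front
    if ($html === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed; collect parser warnings
    // internally instead of printing them, then restore the old setting
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    // Elements parsed by loadHTML() are in no namespace, so the query
    // needs no prefix (unlike the xhtml: prefix used after Tidy)
    foreach ($xpath->query('//a/@href') as $node) {
        $links[] = $node->nodeValue;
    }
    // Drop duplicates and reindex from 0
    return array_values(array_unique($links));
}

// Made-up sample page with one repeated link
$html = '<html><body>'
      . '<a href="/store/">Store</a>'
      . '<a href="http://safari.oreilly.com">Safari</a>'
      . '<a href="/store/">Store again</a>'
      . '</body></html>';

print_r(extract_links($html));
```

For the sample markup, the function returns two entries, /store/ and http://safari.oreilly.com, because the repeated link is dropped. Relative links such as /store/ would still need a base URL prepended before they can be requested.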




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
