The stale-links.php program in Example 13-62 produces a list of links in a page and their status. It tells you if the links are okay, if they've been moved somewhere else, or if they're bad. Run the program by passing it a URL to scan for links: % php stale-links.php http://www.oreilly.com http://www.oreilly.com/: OK http://oreillynet.com/: OK http://www.oreilly.com/store/: OK http://safari.oreilly.com: OK http://conferences.oreillynet.com/: OK http://www.oreillylearning.com: OK http://academic.oreilly.com: MOVED: http://academic.oreilly.com/index.csp http://www.oreilly.com/about/: OK ... The stale-links.php program uses the cURL extension to retrieve web pages (see Example 13-62). First, it retrieves the URL specified on the command line. Once a page has been retrieved, the program uses the XPath technique from Recipe 13.11 to get a list of links in the page. Then, after prepending a base URL to each link if necessary, the link is retrieved. Because we need just the headers of these responses, we use the HEAD method instead of GET by setting the CURLOPT_NOBODY option. Setting CURLOPT_HEADER tells curl_exec( ) to include the response headers in the string it returns. Based on the response code, the status of the link is printed, along with its new location if it's been moved. stale-links.php <?php if (! isset($_SERVER['argv'][1])) { die("No URL provided.\n"); } $url = $_SERVER['argv'][1]; // Load the page list($page,$pageInfo) = load_with_curl($url); if (! strlen($page)) { die("No page retrieved from $url"); } // Convert to XML for easy parsing $opts = array('output-xhtml' => true, 'numeric-entities' => true); $xml = tidy_repair_string($page, $opts); $doc = new DOMDocument(); $doc->loadXML($xml); $xpath = new DOMXPath($doc); $xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml'); // Compute the Base URL for relative links $baseURL = ''; // Check if there is a <base href=""/> in the page $nodeList = $xpath->query('//xhtml:base/@href'); if ($nodeList->length == 1) { $baseURL = $nodeList->item(0)->nodeValue; } // No <base href=""/>, so build the Base URL from $url else { $URLParts = parse_url($pageInfo['url']); if (! (isset($URLParts['path']) && strlen($URLParts['path']))) { $basePath = ''; } else { $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']); } if (isset($URLParts['username']) || isset($URLParts['password'])) { $auth = isset($URLParts['username']) ? $URLParts['username'] : ''; $auth .= ':'; $auth .= isset($URLParts['password']) ? $URLParts['password'] : ''; $auth .= '@'; } else { $auth = ''; } $baseURL = $URLParts['scheme'] . '://' . $auth . $URLParts['host'] . $basePath; } // Keep track of the links we visit so we don't visit each more than once $seenLinks = array(); // Grab all links $links = $xpath->query('//xhtml:a/@href'); foreach ($links as $node) { $link = $node->nodeValue; // resolve relative links if (! preg_match('#^(http|https|mailto):#', $link)) { if (((strlen($link) == 0)) || ($link[0] != '/')) { $link = '/' . $link; } $link = $baseURL . $link; } // Skip this link if we've seen it already if (isset($seenLinks[$link])) { continue; } // Mark this link as seen $seenLinks[$link] = true; // Print the link we're visiting print $link.': '; flush(); list($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD'); // Decide what to do based on the response code // 2xx response codes mean the page is OK if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) { $status = 'OK'; } // 3xx response codes mean redirection else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) { $status = 'MOVED'; if (preg_match('/^Location: (.*)$/m',$linkHeaders,$match)) { $status .= ': ' . trim($match[1]); } } // Other response codes mean errors else { $status = "ERROR: {$linkInfo['http_code']}"; } // Print what we know about the link print "$status\n"; } function load_with_curl($url, $method = 'GET') { $c = curl_init($url); curl_setopt($c, CURLOPT_RETURNTRANSFER, true); if ($method == 'GET') { curl_setopt($c,CURLOPT_FOLLOWLOCATION, true); } else if ($method == 'HEAD') { curl_setopt($c, CURLOPT_NOBODY, true); curl_setopt($c, CURLOPT_HEADER, true); } $response = curl_exec($c); return array($response, curl_getinfo($c)); } ?> | |