11.10 Extracting All the URLs from a Web Page


You want to extract all URLs from a Web page.

Technique

Use a regular expression to extract the URLs as you read the file line by line:

 <?php
 // Open the remote page for reading (requires allow_url_fopen)
 $fp = fopen('http://www.yahoo.com/', 'r') or die('Cannot connect');

 $url_list = array();
 while ($line = fgets($fp, 1024)) {
     // Capture the href value of every <a> tag on this line
     if (preg_match_all('/<a[^>]+href\s*=\s*[\'"]([^\'"]+)[\'"]/i',
         $line, $matches)) {
         // $matches[0] holds the full tags; $matches[1] holds the URLs
         foreach ($matches[1] as $url) {
             $url_list[] = $url;
         }
     }
 }

 fclose($fp) or die('Cannot close file');
 ?>
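
Once the loop finishes, $url_list holds every href value that was found. For example, you could walk the array and print each URL:

 <?php
 // Example: print each collected URL on its own line
 foreach ($url_list as $url) {
     print "Link: $url\n";
 }
 ?>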

Comments

A simple way to get all the links from a Web page is to loop through the file line by line and match every link that appears on each line. After each call, preg_match_all() fills $matches with two sub-arrays: $matches[0] holds the complete <a> tags, which we are not interested in, while $matches[1] holds the URLs captured by the parenthesized part of the pattern. We therefore loop over $matches[1] and append each URL to our final array ($url_list).
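One caveat: because the page is read in 1,024-byte chunks, an <a> tag that happens to be split across two reads will not be matched. If that is a concern, here is a minimal sketch (assuming allow_url_fopen is enabled and the page fits comfortably in memory) that reads the whole document at once and runs the same pattern over it in a single call:

 <?php
 // Read the entire page into one string, then match all links at once
 $html = file_get_contents('http://www.yahoo.com/') or die('Cannot connect');

 preg_match_all('/<a[^>]+href\s*=\s*[\'"]([^\'"]+)[\'"]/i', $html, $matches);

 $url_list = $matches[1];   // every captured href value
 ?>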

To achieve the same effect, you can also use the Snoopy class, available from http://snoopy.sourceforge.net/. Snoopy extracts all the links on a Web page for you:

 <?php
 include_once 'Snoopy.class.inc';

 $snoopy = new Snoopy;

 // fetchlinks() retrieves the page and stores its links in $snoopy->results
 $snoopy->fetchlinks('http://www.internet.com/');

 foreach ($snoopy->results as $link) {
     print "Link: $link\n";
 }
 ?>

