Recipe 13.11. Extracting Links from an HTML File | PHP Cookbook: Solutions and Examples for PHP Programmers

13.11.1. Problem

You need to extract the URLs that are specified inside an HTML document.

13.11.2. Solution

Use Tidy to convert the document to XHTML, then use an XPath query to find all the links, as shown in Example 13-46.

Example 13-46. Extracting links with Tidy and XPath

<?php
$doc = new DOMDocument();
$opts = array('output-xhtml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_file('linklist.html',$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $link = $node->nodeValue;
    print $link . "\n";
}

If Tidy isn't available, use the pc_link_extractor( ) function shown in Example 13-47.

Example 13-47. Extracting links without Tidy

<?php
$html = file_get_contents('linklist.html');
$links = pc_link_extractor($html);
foreach ($links as $link) {
    print $link[0] . "\n";
}

function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                   $html,$matches,PREG_SET_ORDER);
    foreach($matches as $match) {
        $links[] = array($match[1],$match[2]);
    }
    return $links;
}

13.11.3. Discussion

The XHTML document that Tidy generates when the output-xhtml option is turned on may contain entities other than the five that are predefined by the base XML specification (&lt;, &gt;, &amp;, &quot;, and &apos;). DOMDocument would complain that those other entities are undefined, so turning on the numeric-entities option makes Tidy write them as numeric character references instead. An alternative is to leave out the numeric-entities option but set $doc->resolveExternals to true before calling loadXML( ). This tells DOMDocument to fetch any Document Type Definition referenced in the file it's loading and use that to resolve the entities. Tidy generates XHTML with an appropriate DOCTYPE declaration in it. The downside of this approach is that the DTD URL points to a resource on an external web server, so your program would have to download that resource each time it runs.
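To see why the entity handling matters, the following sketch feeds DOMDocument two hand-written fragments: one using named HTML entities and one using the equivalent numeric character references. The fragments and the try_load_xml( ) helper are illustrative assumptions, not part of the recipe; in practice the input would come from tidy_repair_file( ).

```php
<?php
// Hypothetical helper (not from the recipe): returns a DOMDocument on
// success, or false if the XML parser rejects the input.
function try_load_xml($xml) {
    // Collect parser errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $doc : false;
}

// Made-up sample fragments with no DTD attached
$named   = '<p>caf&eacute;&nbsp;menu</p>';
$numeric = '<p>caf&#233;&#160;menu</p>';

// Named HTML entities are undefined in plain XML, so the parse fails
var_dump(try_load_xml($named) === false);
// Numeric character references need no DTD, so the parse succeeds
var_dump(try_load_xml($numeric) instanceof DOMDocument);
```

Both checks should report true, which is the behavior the numeric-entities option relies on.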

XHTML is an XML application: a defined XML vocabulary for expressing HTML. As such, all of its elements (the familiar <a/>, <h1/>, and so on) live in a namespace. The URI for that namespace is http://www.w3.org/1999/xhtml. For XPath queries to work properly, the namespace has to be attached to a prefix (that's what the registerNamespace( ) method does) and then used in the XPath query.
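The effect of the namespace can be checked without Tidy. This minimal sketch assumes a small hand-written XHTML string (a stand-in for Tidy's output) and compares an unprefixed XPath query with one that uses the registered prefix. The prefix name xhtml is arbitrary; only the namespace URI has to match.

```php
<?php
// Made-up XHTML standing in for Tidy's output
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
       . '<a href="http://www.example.com/">Example</a>'
       . '</body></html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);

// Without a prefix, "a" means an <a/> element in no namespace
$unprefixed = $xpath->query('//a/@href');

// With the prefix bound to the XHTML namespace URI, the element is found
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//xhtml:a/@href');

print "Unprefixed matches: " . $unprefixed->length . "\n";
print "Prefixed matches: " . $prefixed->length . "\n";
```

The unprefixed query matches nothing because every element in the document belongs to the XHTML namespace.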

The pc_link_extractor( ) function is a useful alternative if Tidy isn't available. Its regular expression won't work on all links, such as those that are constructed with some hexadecimal escapes, but it should function on the majority of reasonably well-formed HTML. The function returns an array. Each element of that array is itself a two-element array. The first element is the target of the link, and the second element is the link anchor: the text that is linked.
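The shape of the returned array is easier to see on a small input. The sketch below repeats pc_link_extractor( ) from Example 13-47 so that it runs on its own, and applies it to three made-up links. The third link uses a hexadecimal character reference, which is one plausible reading of the limitation described above, not a case the recipe spells out.

```php
<?php
// pc_link_extractor() as defined in Example 13-47
function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                   $html,$matches,PREG_SET_ORDER);
    foreach($matches as $match) {
        $links[] = array($match[1],$match[2]);
    }
    return $links;
}

// Made-up sample markup, one link per line
$html = '<a href="http://www.example.com/">Example</a>' . "\n"
      . "<a class='nav' href='/about.html'>About <b>us</b></a>" . "\n"
      . '<a href="&#x2F;contact.html">Contact</a>';

$links = pc_link_extractor($html);
foreach ($links as $link) {
    // Element 0 is the link target, element 1 is the anchor
    list($target, $anchor) = $link;
    print "$anchor -> $target\n";
}
```

Two details are worth noticing. The anchor keeps any nested markup (here, the <b/> element), so a call such as strip_tags( ) may be needed if only the text is wanted. The escaped target comes back exactly as written rather than decoded; html_entity_decode( ) is one way to normalize it.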

The XPath expression in Example 13-46 only grabs links, not anchors. Example 13-48 shows an alternative that produces both links and anchors.

Example 13-48. Extracting links and anchors with Tidy and XPath

<?php
$doc = new DOMDocument();
$opts = array('output-xhtml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_file('linklist.html',$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a') as $node) {
    $anchor = trim($node->textContent);
    $link = $node->getAttribute('href');
    print "$anchor -> $link \n";
}

In Example 13-48, the XPath query finds all the <a/> element nodes. The textContent property of each node holds the anchor text, and the link is in the href attribute.
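Because the query matches every <a/> element, it also returns elements that have no href attribute, for which getAttribute( ) returns an empty string. The following sketch, which assumes a hand-written XHTML string in place of Tidy's output, shows that behavior and one possible way to narrow the query with an XPath predicate. The predicate is a suggested variation, not part of the recipe.

```php
<?php
// Made-up XHTML standing in for Tidy's output
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
       . '<a id="top"></a>'
       . '<a href="/about.html">About <b>us</b></a>'
       . '</body></html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');

// Every <a/> element, including the one with no href attribute
$all = $xpath->query('//xhtml:a');
// Only <a/> elements that carry an href attribute
$linked = $xpath->query('//xhtml:a[@href]');

foreach ($linked as $node) {
    // textContent concatenates the text of all descendant nodes
    $anchor = trim($node->textContent);
    $link = $node->getAttribute('href');
    print "$anchor -> $link\n";
}
```

Unlike the regular expression approach, textContent drops nested markup, so the anchor for the second element is plain text.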

13.11.4. See Also

Documentation on DOMDocument at http://www.php.net/DOM, on DOMXPath::query( ) at http://www.php.net/DOM_DOMXPath::query, on DOMXPath::registerNamespace( ) at http://www.php.net/DOM_DOMXPath::registerNamespace, on tidy_repair_file( ) at http://www.php.net/tidy_repair_file, and on preg_match_all( ) at http://www.php.net/preg_match_all; Recipe 13.10 has more information about Tidy; http://www.w3.org/TR/xpath describes XPath; http://www.w3.org/TR/xhtml1/ details XHTML.