Hack 25 A Quick Introduction to XPath


Sure, you've got your traditional HTML parsers of the tree and token variety, and you've got regular expressions that can be as innocent or convoluted as you wish. But if neither is a perfect fit for your scraping needs, consider XPath.

XPath is designed to locate and process items within properly formatted XML or HTML documents. At its simplest, XPath works similarly to how a pathname is used to locate a file, but instead of stepping through directories in a filesystem, it steps through elements in a document.

For example, to get the title of an HTML document, you could use /html/head/title to start at the root (/), step into the html element, then into the head, and finally the title. This is similar to tree-based parsers like HTML::TreeBuilder [Hack #19] but has a number of advantages and additional capabilities. Most useful of these is that an XPath statement can be a single powerful expression, as opposed to multiple lines of traversal code.
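To see that single-expression style in Perl, here's a minimal sketch using XML::LibXML (which this hack returns to below); the page and its title are invented purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# A tiny page, invented for illustration.
my $html = '<html><head><title>My Page</title></head><body></body></html>';

# Parse it, then evaluate the whole path in one expression.
my $parser = XML::LibXML->new();
my $doc    = $parser->parse_html_string($html);
my $title  = $doc->findvalue('/html/head/title');

print "$title\n";    # My Page
```

One findvalue call replaces what would otherwise be several lines of look-down traversal code.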

Like filesystems, there's a current location in the tree, and paths that don't start with / are relative to it. . and .. refer to the current node and parent node, respectively, just as they refer to a filesystem's current and parent directories. If the current node is /html/head, then title and ./title mean /html/head/title, and .. means /html.
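Relative paths work the same way from Perl: evaluate them against a context node rather than the document. A short sketch, again on an invented snippet, grabbing the head node and then stepping relative to it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Invented snippet, just to demonstrate relative paths.
my $doc = XML::LibXML->new->parse_html_string(
    '<html><head><title>T</title></head><body></body></html>'
);

# Grab the head node, then evaluate paths relative to it.
my ($head)  = $doc->findnodes('/html/head');
my ($title) = $head->findnodes('title');    # same as ./title
my ($up)    = $head->findnodes('..');       # the parent: html

print $title->textContent, "\n";    # T
print $up->nodeName, "\n";          # html
```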

That's as complex as filesystem paths usually get, but since XPath deals with XML (and HTML), it has to go further, and it goes a lot further. Luckily, for both our sanity and page count, we'll only scratch the surface in this introductory hack. If you want to know more, check out the book XPath and XPointer (http://www.oreilly.com/catalog/xpathpointer).

Directories can contain only one file with a particular name, but an element can contain any number of children with the same type name: paragraphs can contain multiple anchors, lists can contain multiple items, and so on. XPath, unlike filesystems, allows a step, and hence a path, to match any number of nodes; so ul/li means all the items of all the unordered lists in the current node.

You can distinguish between matched items by adding a number in square brackets, so that a[4] selects the fourth child anchor of a node (note that XPath counts from one, not zero). The fourth cell of each row of the third child table of the body is /html/body/table[3]/tr/td[4].
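The one-based indexing is easy to trip over if you're used to Perl arrays. A quick sketch with a made-up list:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# A made-up list with three items.
my $html = '<html><body><ul><li>one</li><li>two</li><li>three</li></ul></body></html>';
my $doc  = XML::LibXML->new->parse_html_string($html);

# XPath counts from one, so li[2] is the *second* item.
my $second = $doc->findvalue('/html/body/ul/li[2]');
print "$second\n";    # two
```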

Attributes are treated similarly to children, except that attribute names are prefixed with @, so that a/@href means the href attributes of the anchors. XPath also allows a path to be abbreviated with //, which matches any number of steps. //a/@href makes a list of every link in a page!
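Here's that link-listing trick as a hedged sketch with XML::LibXML, using an invented page with anchors at two different depths:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# An invented page with two links at different depths.
my $html = '<html><body><a href="/one">1</a><p><a href="/two">2</a></p></body></html>';
my $doc  = XML::LibXML->new->parse_html_string($html);

# //a/@href matches every anchor's href, however deeply nested.
my @hrefs = map { $_->value } $doc->findnodes('//a/@href');
print join(', ', @hrefs), "\n";    # /one, /two
```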

There's much, much more to XPath, but this discussion has given us enough background to do some useful things. When you want to use XPath within a Perl script, the preferred approach is XML::LibXML, which depends on the libxml2 (http://xmlsoft.org/) library being installed.

Using LibXML's xmllint

LibXML comes with a tool named xmllint , whose most interesting feature is a command-line shell that lets you navigate around a document's object tree using commands named after Unix tools. It's a good way to discover and try out paths interactively.

Let's see it in action on JungleScan (http://www.junglescan.com/), which tracks changing Amazon.com ranks for various products. We'll pick out some info from the "top ten winners" list and reformat them for our own purposes:

 % xmllint --shell --html http://junglescan.com/ 

Utilities with lint in their names are referred to as lint-pickers: they help clean up and report incorrect and crusty code. In this case, since JungleScan isn't valid HTML (due to unencoded ampersands), the previous command will generate many complaints, similar to this:

 http://junglescan.com:59: error: htmlParseEntityRef: expecting ';' 

None of that matters to us, though, as eventually we'll arrive at a prompt:

 / > 

Let's try some navigating:

 / > cd //title
 title > pwd
 /html/head/title
 title > cat
 <title>JungleScan.com</title>
 title > cd ..
 head > ls
 ---        1 style
 ---        1 title
 -a-        5 script

That -a- tells us the script element has at least one attribute:

 head > dir script/@*
 ATTRIBUTE language
   TEXT
     content=JavaScript

Okay, enough of thattime to find some data. Looking at the page in a browser shows that the first of today's top winners is an exam guide:

 / > grep Certification
 /html/body/table/tr/td[1]/font/form[2]/table[2]/tr[3]/td[3]/table/tr[1]/td : -a-        0 img
 t--       44     A+ All-In-One Certification Exam Gui...

Yep, there it is, and there's one of the paths that leads to it. The tables are nested three deep; I'm glad xmllint is keeping track of them for us. We now have the beginnings of our desired data, so let's grab the end (in this case, a camera):

 / > grep Camera
 /html/body/table/tr/td[1]/font/form[2]/table[2]/tr[12]/td[3]/table/tr[1]/td : -a-        0 img
 t--       63     Sony DSC-F717 5MP Digital Still Came...

Comparing the two paths, we can see that the middle table has a row for each product (the emphasized tr in the previous outputs); inside that is another table containing the product's name. Let's have a closer look at one of these middle table rows:

 / >  cd /html/body/table/tr/td[1]/font/form[2]/table[2]/tr[4]  

The other products are named in td[3]/table/tr[1]/td , and so should this one:

 tr > cat td[3]/table/tr[1]/td
 -------
 <td>
 <img alt="Book" src="/images/book.gif">
     LT's Theory of Pets [UNABRIDGED]&nbsp;</td>

Yes, that was the second product in the list. Conveniently, the image's alternate text tells us this is a book. Likewise, the second row of that inner table holds three supplementary links concerning this product (its Amazon.com page, a bulletin board link, and its current JungleScan stats):

 tr > cat td[3]/table/tr[2]
 -------
 <tr><td bgcolor="555555">
   ... etc ...
 </td></tr>

And the percentage this product rose is to the right of that, being the fourth cell:

 tr > cat td[4]
 -------
 <td><a href="http://1.junglescan.com/scan/details.php?asin=0743520041">+677%</a></td>

Now we know enough about the page's structure to write a script. In this example, we take a look at the top five products; our code will suck down JungleScan, issue some XPath statements, and spit out the results to the shell.

The Code

Save the following code to a file called junglescan.pl :

 #!/usr/bin/perl -w
 use strict;
 use utf8;
 use LWP::Simple;
 use XML::LibXML;
 use URI;

 # Set up the parser, and set it to recover
 # from errors so that it can handle broken HTML
 my $parser = XML::LibXML->new( );
 $parser->recover(1);

 # Parse the page into a DOM tree structure
 my $url  = 'http://junglescan.com/';
 my $data = get($url) or die $!;
 my $doc  = $parser->parse_html_string($data);

 # Extract the table rows (as an
 # array of references to DOM nodes)
 my @winners = $doc->findnodes(q{
     /html/body/table/tr/td[1]/font/form[2]/table[2]/tr
 });

 # The first two rows contain headings,
 # and we want only the top five, so slice.
 @winners = @winners[2..6];

 foreach my $product (@winners) {
     # Get the percentage change and type.
     # We use the find method, since we only need strings.
     my $change = $product->find('td[4]');
     my $type   = $product->find('td[3]//img/@alt');

     # Get the title. It has some annoying
     # whitespace, so we trim that off with regexes.
     my $title = $product->find('td[3]//tr[1]');
     $title =~ s/^\s*//; $title =~ s/\xa0$//;

     # Get the first link ("Visit Amazon.com page").
     # This is relative to the page's URL, so we make it absolute.
     my $relurl = $product->find('td[3]//a[1]/@href');
     my $absurl = URI->new($relurl)->abs($url);

     # Output. There isn't always a type, so we ignore it if there isn't.
     print "$change  $title";
     print " [$type]" if $type;
     print "\n       Amazon info: $absurl\n\n";
 }

Running the Hack

Invoke the script on the command line:

 % perl junglescan.pl
 +1540%  A+ All-In-One Certification Exam Guide [Book]
         Amazon info: http://junglescan.com/redirect.cfm?asin=0072126795
 +677%  LT's Theory of Pets [UNABRIDGED] [Book]
         Amazon info: http://junglescan.com/redirect.cfm?asin=0743520041
 +476%  The Wellstone [Book]
         Amazon info: http://junglescan.com/redirect.cfm?asin=0553584464
 +465%  Greywolf [DOWNLOAD: MICROSOFT READER] [Book]
         Amazon info: http://junglescan.com/redirect.cfm?asin=B000066U03
 +455%  VirusScan Home Edition 7.0 [Software]
         Amazon info: http://junglescan.com/redirect.cfm?asin=B00006J3FM

XPath is a powerful searching technology, and with the explorative capabilities of xmllint you can quickly get a direct pointer to the data you're looking for.

Daniel Biddle



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
