Hack #19. Scraping with HTML::TreeBuilder


One of many popular HTML parsers available in Perl, HTML::TreeBuilder approaches the art of HTML parsing as a tree of parent/child relationships.

Sometimes regular expressions [Hack #23] won't get you all the way to the data you want, and you'll need to use a real HTML parser. CPAN has a few of these, the main two being HTML::TreeBuilder and HTML::TokeParser [Hack #20], both of which are friendly façades for HTML::Parser. This hack covers the former.

The Tree in TreeBuilder represents a parsing ideology: trees are a good way to represent HTML. The <head> tag is a child of the <html> tag. The <title> and <meta> tags are children of the <head> tag.

TreeBuilder takes a stream of HTML, from a file or from a variable, and turns it into a tree of HTML::Element nodes. Each of these nodes can be queried for its parent, its siblings, or its children. Each node can also be asked for a list of children that fulfill certain requirements.
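Before we point it at a real page, here's a minimal, self-contained sketch of that interface; the sample HTML and the variable names are our own, not part of the hack:

 use HTML::TreeBuilder;

 my $tree = HTML::TreeBuilder->new_from_content(
     '<html><head><title>Demo</title></head>' .
     '<body><p>one</p><p>two</p></body></html>');

 my ($first_p) = $tree->look_down(_tag => 'p');  # first <p> node
 print $first_p->parent->tag, "\n";              # "body", its parent
 my $next = $first_p->right;                     # immediate right sibling
 print $next->as_text, "\n";                     # "two"
 print scalar $tree->content_list, "\n";         # 2 children: <head>, <body>
 $tree->delete;                                  # explicit cleanup (more below)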

We'll demonstrate this concept by writing a program that extracts a complete list of O'Reilly's books and then does some queries on that data. First, we have to fetch the page, easily done with LWP::Simple [Hack #9]. The script grabs all the content from O'Reilly's catalog page and constructs a new tree by feeding the content to the new_from_content method:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 use HTML::TreeBuilder;

 my $url  = 'http://www.oreilly.com/catalog/prdindex.html';
 my $page = get($url) or die $!;
 my $p    = HTML::TreeBuilder->new_from_content($page);

The look_down method starts from the top of the tree and works downward, seeing which nodes match the specified conditions. We specify that we want anchor tags whose URLs match a certain regular expression. This returns a list of matching nodes, which we put in @links:

 my @links = $p->look_down(
     _tag => 'a',
     href => qr{^ \Qhttp://www.oreilly.com/catalog/\E \w+ $}x,
 );

We could happily make a list of titles and URLs, but the page we're fetching has more: price, ISBN, and whether it's on O'Reilly's subscription-based Safari online library (http://safari.oreilly.com/). This information is contained in a bit of HTML code that looks like this:

 <tr bgcolor="#ffffff">
   <td valign="top">
     <a href="http://www.oreilly.com/catalog/googlehks">Google Hacks</a><br />
   </td>
   <td valign="top" nowrap="nowrap">0-596-00447-8</td>
   <td valign="top" align="right">$24.95</td>
   <td valign="top" nowrap="nowrap" align="center">&nbsp;
     <a href="http://safari.oreilly.com/0596004478">Read it on Safari</a>
   </td>
   <td valign="top" nowrap="nowrap">
     <a href="http://examples.oreilly.com/googlehks/">Get examples</a>
   </td>
 </tr>

Our previous look_down match leaves us positioned at the anchor (<a>) element inside the first table cell. To get at the rest of the data, we need to move upward in the HTML tree until we hit the <tr> element. The parent of our current location is a <td>, and the parent of that element is our desired <tr>. Thus, we get @rows:

 my @rows = map { $_->parent->parent } @links; 

We then loop over each of those rows, one book at a time, and find each of the <td> elements, giving us the table cells. The first cell holds the title, the second the ISBN, and the third the price. Since we want only the text of each cell, we use as_trimmed_text to return the text, minus any leading or trailing whitespace:

 my @books;
 for my $row (@rows) {
     my %book;
     my @cells = $row->look_down(_tag => 'td');
     $book{title} = $cells[0]->as_trimmed_text;
     $book{isbn}  = $cells[1]->as_trimmed_text;
     $book{price} = $cells[2]->as_trimmed_text;
     $book{price} =~ s/^\$//;   # strip the leading dollar sign

The URLs are slightly trickier. We want the first (and only) URL in each of the remaining cells, but one might not always exist. We add a new routine, get_url, which is given an HTML::Element node and works out the correct thing to do. As is typical in web scraping, there's a little cleaning to do: some URLs on the page have a trailing carriage return, so we get rid of it:

     $book{url}      = get_url($cells[0]);
     $book{safari}   = get_url($cells[3]);
     $book{examples} = get_url($cells[4]);
     push @books, \%book;
 }

 sub get_url {
     my $node  = shift;
     my @hrefs = $node->look_down(_tag => 'a');
     return unless @hrefs;
     my $url = $hrefs[0]->attr('href');
     $url =~ s/\s+$//;   # strip the trailing carriage return
     return $url;
 }

Finally, we delete the tree, because we no longer need it. Due to the cross-linking of its nodes, the tree has to be deleted explicitly; we can't rely on Perl's reference-counting garbage collector to clean it up. Failing to delete your trees leaves the orphaned nodes sitting in memory, taking up nothing but space.

 $p = $p->delete; # don't need it anymore 

We now have an array of books with all sorts of information about them. With this array, we can ask questions, such as: how many books have "Perl" in the title, which of them is cheapest, and how many more Java books than Perl books are there?

 {
     my $count = 1;
     my @perlbooks = sort { $a->{price} <=> $b->{price} }
                     grep { $_->{title} =~ /perl/i } @books;
     print $count++, "\t", $_->{price}, "\t", $_->{title}, "\n" for @perlbooks;
 }

 {
     my @perlbooks = grep { $_->{title} =~ /perl/i } @books;
     my @javabooks = grep { $_->{title} =~ /java/i } @books;
     my $diff = @javabooks - @perlbooks;
     print "There are " . @perlbooks . " Perl books and " . @javabooks .
           " Java books. $diff more Java than Perl.\n";
 }

Hacking the Hack

Say you want more information on each book: author and publication details, for example. We now have a URL for each book, so we can fetch the individual page for the book in question and scrape it the same way. As there are 453 titles (at least), it's probably a bad idea to fetch all of them.
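If you do fetch more than one, be polite about it and pause between requests. Here's a minimal sketch, reusing the @books array from earlier; the five-book slice and the two-second delay are arbitrary choices of ours:

 # Politely fetch a handful of book pages, pausing between requests.
 for my $book (@books[0 .. 4]) {
     next unless $book->{url};   # some rows may lack a catalog URL
     my $page = get($book->{url});
     # ... parse $page with HTML::TreeBuilder, as shown below ...
     sleep 2;                    # don't hammer the server
 }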

From there, we can do this (picking a single book, number 34, for the demonstration):

 for my $book ($books[34]) {
     my $url  = $book->{url};
     my $page = get($url);
     my $tree = HTML::TreeBuilder->new_from_content($page);

     my ($pubinfo) = $tree->look_down(_tag  => 'span',
                                      class => 'secondary2');
     my $html = $pubinfo->as_HTML;
     print $html;

Since as_HTML produces well-formed and regular HTML, you can easily extract the desired information with a set of regular expressions:

     my ($pages)   = $html =~ /(\d+) pages/;
     my ($edition) = $html =~ /(\d)(?:st|nd|rd|th) Edition/;
     my ($date)    = $html =~ /(\w+ (?:19|20)\d\d)/;
     print "\n$pages $edition $date\n";

Need the cover?

     my ($img_node) = $tree->look_down(_tag => 'img',
                                       src  => qr{^/catalog/covers/});
     my $img_url = 'http://www.oreilly.com' . $img_node->attr('src');
     my $cover   = get($img_url);
     # now save $cover to disk
 }
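The script stops short of actually writing the file. Here's a minimal sketch of that last step, which belongs where the "now save $cover to disk" comment sits; the cover.gif filename is our own assumption:

 # Save the raw image bytes; "cover.gif" is an arbitrary name.
 open my $fh, '>', 'cover.gif' or die "Can't write cover.gif: $!";
 binmode $fh;         # cover art is binary data
 print $fh $cover;
 close $fh;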

Iain Truskett


