Hack 19 Scraping with HTML::TreeBuilder
One of many popular HTML parsers available in Perl, HTML::TreeBuilder approaches the art of HTML parsing as a parent/child relationship.
Sometimes regular expressions [Hack #23] won't get you all the way to the data you want and you'll need to use a real HTML parser. CPAN has a few of these, the main two being HTML::TreeBuilder and HTML::TokeParser [Hack #20], both of which are friendly façades for HTML::Parser. This hack covers the former.
The Tree in TreeBuilder represents a parsing ideology: trees are a good way to represent HTML. The <head> tag is a child of the <html> tag. The <title> and <meta> tags are children of the <head> tag. TreeBuilder takes a stream of HTML, from a file or from a variable, and turns it into a tree of HTML::Element nodes. Each of these nodes can be queried for its parent, its siblings, or its children. Each node can also be asked for a list of children that fulfill certain requirements.
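To make the parent/child idea concrete, here is a minimal sketch that parses a tiny document and asks a node about its relatives. The sample HTML and the variable names are made up for illustration; only new_from_content, look_down, parent, tag, content_list, and delete are relied on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, not taken from any real page.
my $html = '<html><head><title>Sample</title>'
         . '<meta name="x" content="y"></head>'
         . '<body><p>Hello</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find the <title> node, then walk upward through its ancestors.
my ($title) = $tree->look_down(_tag => 'title');
my $head    = $title->parent;

my $parent_tag      = $head->tag;
my $grandparent_tag = $head->parent->tag;

# content_list returns child nodes; text children are plain strings,
# so keep only the references (the element nodes).
my @child_tags = map { $_->tag } grep { ref } $head->content_list;

print "title's parent:  $parent_tag\n";
print "head's parent:   $grandparent_tag\n";
print "head's children: @child_tags\n";

$tree = $tree->delete;    # free the tree explicitly
```

The tag names are copied into plain strings before the tree is deleted, since the nodes themselves should not be used afterward.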
We'll start by fetching the O'Reilly catalog index and parsing it into a tree:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
my $url = 'http://www.oreilly.com/catalog/prdindex.html';
my $page = get($url) or die "Couldn't fetch $url\n";
my $p = HTML::TreeBuilder->new_from_content($page);
The look_down method starts at the node it is called on (here, the top of the tree) and works downward, seeing which nodes match the specified conditions. We specify that we want anchor tags whose URL matches a certain regular expression. This returns a list of matching nodes, which we put in @links:
my @links = $p->look_down(_tag => 'a',
                          href => qr{^ \Qhttp://www.oreilly.com/catalog/\E \w+ $}x);
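As a rough illustration of how that pattern filters links, the sketch below runs the same look_down criteria against a few made-up anchors. The sample HTML is an assumption for the example; the criteria are the ones used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample links: only the first should satisfy the pattern.
my $html = <<'HTML';
<html><body>
<a href="http://www.oreilly.com/catalog/googlehks">Google Hacks</a>
<a href="http://www.oreilly.com/catalog/">Catalog home</a>
<a href="http://safari.oreilly.com/0596004478">Read it on Safari</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# _tag restricts the element name; a qr// value is matched
# against the named attribute (here, href).
my @links = $tree->look_down(
    _tag => 'a',
    href => qr{^ \Qhttp://www.oreilly.com/catalog/\E \w+ $}x,
);

my @found = map { $_->attr('href') } @links;
print "$_\n" for @found;

$tree = $tree->delete;
```

The bare catalog URL is rejected because \w+ needs at least one character after the final slash, and the Safari link fails the host prefix.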
We could happily make a list of titles and URLs, but the page we're fetching has more: price, ISBN, and whether it's on O'Reilly's subscription-based Safari online library (http://safari.oreilly.com/). This information is contained in a bit of HTML code that looks like this:
<tr bgcolor="#ffffff">
<td valign="top">
<a href="http://www.oreilly.com/catalog/googlehks">Google Hacks</a><br />
</td>
<td valign="top" nowrap="nowrap">0-596-00447-8</td>
<td valign="top" align="right">$24.95</td>
<td valign="top" nowrap="nowrap" align="center">
<a href="http://safari.oreilly.com/0596004478">Read it on Safari</a>
</td>
<td valign="top" nowrap="nowrap">
<a href="http://examples.oreilly.com/googlehks/">Get examples</a>
</td>
</tr>
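Our previous match with look_down places us at the <a> element. Its parent is the <td> cell, and the parent of that cell is the <tr> row holding everything we want, so we climb two levels from each link: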
my @rows = map { $_->parent->parent } @links;
We then loop over each of those rows, representing one book at a time. We find each of the <td> elements, giving us the table cells, and take the text of the first three:
my @books;
for my $row (@rows) {
    my %book;
    my @cells = $row->look_down(_tag => 'td');
    $book{title} = $cells[0]->as_trimmed_text;
    $book{isbn} = $cells[1]->as_trimmed_text;
    $book{price} = $cells[2]->as_trimmed_text;
    $book{price} =~ s/^\$//;   # strip the leading dollar sign
The URLs are slightly trickier. We want the first (only) URL in each of the cells, but one might not always exist. We add a new routine, get_url, which is given an HTML::Element node and works out the correct thing to do. As is typical in web scraping, there's a little cleanup to do: some URLs carry trailing whitespace, which we strip.
    $book{url} = get_url($cells[0]);
    $book{safari} = get_url($cells[3]);
    $book{examples} = get_url($cells[4]);
    push @books, \%book;
}
sub get_url {
    my $node = shift;
    my @hrefs = $node->look_down(_tag => 'a');
    return unless @hrefs;
    my $url = $hrefs[0]->attr('href');
    $url =~ s/\s+$//;
    return $url;
}
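The following sketch exercises get_url on two hypothetical cells, one with a link and one without, to show what the caller gets back in each case. The sample table is invented for the example; get_url itself is the routine shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Same routine as in the hack: first link in the node, or nothing.
sub get_url {
    my $node  = shift;
    my @hrefs = $node->look_down(_tag => 'a');
    return unless @hrefs;
    my $url = $hrefs[0]->attr('href');
    $url =~ s/\s+$//;    # drop any trailing whitespace
    return $url;
}

# Hypothetical row: the first cell has a link, the second does not.
my $html = '<table><tr>'
         . '<td><a href="http://examples.oreilly.com/googlehks/ ">Get examples</a></td>'
         . '<td>Not available</td>'
         . '</tr></table>';

my $tree  = HTML::TreeBuilder->new_from_content($html);
my @cells = $tree->look_down(_tag => 'td');

# In scalar context a bare "return" yields undef.
my $with    = get_url($cells[0]);
my $without = get_url($cells[1]);

print "with link:    ", $with, "\n";
print "without link: ", (defined $without ? $without : '(none)'), "\n";

$tree = $tree->delete;
```

Because the missing case comes back as undef, code that later prints or follows $book{safari} or $book{examples} should check that the value is defined first.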
Finally, we delete the tree, because it's not needed anymore. Due to the cross-linking of nodes, it has to be deleted explicitly; we can't let Perl try (and fail) to clean it up. Failing to delete your trees will leave unnecessary data in memory:
$p = $p->delete; # don't need it anymore
We now have an array of books with all sorts of information about them. With this array, we can now ask questions, such as how many books are there with "Perl" in the title, which is the cheapest, and how many more Java books can we expect to find:
{
    my $count = 1;
    my @perlbooks = sort { $a->{price} <=> $b->{price} }
                    grep { $_->{title} =~ /perl/i } @books;
    print $count++, "\t", $_->{price}, "\t", $_->{title}, "\n" for @perlbooks;
}
{
    my @perlbooks = grep { $_->{title} =~ /perl/i } @books;
    my @javabooks = grep { $_->{title} =~ /java/i } @books;
    my $diff = @javabooks - @perlbooks;
    print "There are ".@perlbooks." Perl books and ".@javabooks.
          " Java books. $diff more Java than Perl.\n";
}
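The same queries can be tried without touching the network. The sketch below uses a handful of made-up records in the same shape as @books (the titles and prices are hypothetical) and picks out the cheapest Perl title directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records shaped like the entries in @books;
# titles and prices are invented for this example.
my @books = (
    { title => 'Sample Perl Cookbook',   price => '39.95' },
    { title => 'Sample Java Handbook',   price => '44.95' },
    { title => 'Sample Perl Pocket Ref', price => '9.95'  },
    { title => 'Sample Java Primer',     price => '29.95' },
    { title => 'Sample Java Patterns',   price => '34.95' },
);

# Filter by title first, then sort numerically by price.
my @perlbooks = sort { $a->{price} <=> $b->{price} }
                grep { $_->{title} =~ /perl/i } @books;
my @javabooks = grep { $_->{title} =~ /java/i } @books;

# After the sort, the cheapest Perl book is the first element.
my $cheapest = $perlbooks[0];

# Arrays in numeric context give their element counts.
my $diff = @javabooks - @perlbooks;

printf "Cheapest Perl book: %s (%s)\n", $cheapest->{title}, $cheapest->{price};
print "$diff more Java than Perl.\n";
```

The numeric sort only works because the leading dollar sign was stripped from each price when the record was built.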
Hacking the Hack
Say you want more information on each book. We now have a list of URLs, one for each book. Want to collect author and publication information? First, fetch the individual page for the book in question. As there are 453 titles (at least), it's probably a bad idea to fetch all of them. From there, we can do this:
for my $book ($books[34]) {
    my $url = $book->{url};
    my $page = get($url) or die "Couldn't fetch $url\n";
    my $tree = HTML::TreeBuilder->new_from_content($page);
    my ($pubinfo) = $tree->look_down(_tag => 'span',
                                     class => 'secondary2');
    my $html = $pubinfo->as_HTML;
    print $html;
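Since as_HTML produces well-formed and regular HTML, you can easily extract the desired information with a set of regular expressions: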
    my ($pages) = $html =~ /(\d+) pages/;
    my ($edition) = $html =~ /(\d)(?:st|nd|rd|th) Edition/;
    my ($date) = $html =~ /(\w+ (?:19|20)\d\d)/;
    print "\n$pages $edition $date\n";
Need the cover?
    my ($img_node) = $tree->look_down(_tag => 'img',
                                      src => qr{^/catalog/covers/},);
    my $img_url = 'http://www.oreilly.com'.$img_node->attr('src');
    my $cover = get($img_url);
    # now save $cover to disk.
}
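The three publication-info patterns can be checked offline against a sample string. In the sketch below, parse_pubinfo is a hypothetical helper name and the input is an invented fragment, not the real markup of any catalog page; the patterns are the ones used in the loop above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull page count, edition number, and date
# out of a chunk of HTML. Fields that are not found stay undef.
sub parse_pubinfo {
    my $html = shift;
    my ($pages)   = $html =~ /(\d+) pages/;
    my ($edition) = $html =~ /(\d)(?:st|nd|rd|th) Edition/;
    my ($date)    = $html =~ /(\w+ (?:19|20)\d\d)/;
    return { pages => $pages, edition => $edition, date => $date };
}

# Invented sample input, assumed to resemble the span's contents.
my $sample = '<span class="secondary2">'
           . '1st Edition February 2003<br />352 pages</span>';

my $info = parse_pubinfo($sample);

for my $field (qw(pages edition date)) {
    my $value = defined $info->{$field} ? $info->{$field} : '(not found)';
    print "$field: $value\n";
}
```

Returning undef for a missing field lets the caller tell an absent value from a real one, which matters because not every book page is guaranteed to list all three.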
Iain Truskett