Hack 20 Parsing with HTML::TokeParser

HTML::TokeParser allows you to follow a path through HTML code, storing the contents of tags as you move nearer to the data you want.

One of the main limitations of HTML as a language for pages on the Web is its lack of separation between content and form. It's not possible for us to look solely at the information an HTML page has to offer; rather, we need to navigate through a mass of tags in order to programmatically split the content of a page from the markup used to specify how it should look.

One way to accomplish this is with the HTML::TokeParser module written by Gisle Aas. It allows us to easily model an HTML page as a stream of elements instead of an entire and complete tree, directing the parser to perform actions such as moving to the next tag with a given property and storing the content inside the tag.
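To make the stream idea concrete before we tackle a real site, here is a minimal sketch that parses a made-up HTML fragment held in a string. The fragment and the helper name links_in_stream are assumptions for illustration only; get_tag and get_trimmed_text are the two HTML::TokeParser methods this hack relies on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# A made-up fragment, used only for illustration.
my $html = '<p>Intro</p><a href="/one">First</a> and <a href="/two">Second</a>';

# Hypothetical helper: walk the stream and collect [href, text] pairs.
sub links_in_stream {
    my ($parser) = @_;
    my @links;

    # get_tag skips forward to the next <a> start tag and returns
    # an array reference whose second element is the attribute hash.
    while (my $tag = $parser->get_tag("a")) {
        my $href = $tag->[1]{href};

        # get_trimmed_text collects the text up to the closing </a>.
        my $text = $parser->get_trimmed_text("/a");
        push @links, [$href, $text];
    }
    return @links;
}

# A reference to a scalar is parsed as the document itself;
# a plain scalar would be treated as a filename instead.
my $stream = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

print "$_->[0] => $_->[1]\n" for links_in_stream($stream);
```

Nothing outside the <a> tags is ever stored: the parser simply moves past the paragraph and the stray text on its way to the next tag we ask for.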

For a demonstration, let's write a parser for the Echocloud site (http://www.echocloud.net/). Echocloud provides recommendations for music artists, based on the file lists found on popular peer-to-peer networks; if two artists are often found together in the music collections of different people sharing files, Echocloud assumes that people listening to the first artist would enjoy listening to the second, and in this way a list of similar artists is created.

The HTML::TokeParser modus operandi is typically something like this:

1. Download the page to be worked on.
2. Determine the structure of the HTML document by looking at the tags present. Is there always a particular tag or group of tags just before the content we're trying to save? Do tags that contain content have any special modifiers, such as a class attribute?
3. Model the structure in code, storing wanted content as it is found.

When we search for our favorite artist at Echocloud, the returned HTML for each similar artist looks something like the following mass of code:

 <TR bgcolor=#F2F2F2><TD class = "cf" nowrap WIDTH='300'>
 <A HREF='index.php?searchword=Autechre&option=asearch&nrows=40&cur=0&stype=2&order=0'
 class = "cf">&nbsp;Autechre</A></TD>
 <TD align=center><A HREF="http://www.amazon.com/exec/obidos/external-search?tag=echocloud-20&keyword=Autechre&mode=music"><img
 src="images/M_images/amazon_small.gif" border=0 align="center"></A><a
 href="http://www.insound.com/search.cfm?from=52208&searchby=artist&query=Autechre"><img
 src="images/M_images/insound.gif" border=0 align="center"></a>&nbsp;&nbsp;&nbsp;</TD>
 <TD><span class = newsflash>8.05 </span></TD><TD><span class =newsflash>0.50</span> </TD></TR>

Thankfully, the <A> tag of each result carries a class of cf, and no other <A> tag on the page does. We can therefore use that attribute to pick out the results as we travel through the document.
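As a quick offline check of that idea, the following sketch runs the class test against a simplified stand-in for the result rows, so nothing needs to be downloaded. The shortened fragment and the helper name cf_link_texts are assumptions for illustration; the second artist is borrowed from the sample output later in this hack.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Simplified stand-in for two result rows and one unrelated link.
my $html = <<'HTML';
<TR><TD class = "cf" nowrap><A HREF='index.php?searchword=Autechre' class = "cf">&nbsp;Autechre</A></TD>
<TD><A HREF="http://www.amazon.com/"><img src="amazon_small.gif"></A></TD></TR>
<TR><TD class = "cf" nowrap><A HREF='index.php?searchword=Beck' class = "cf">&nbsp;Beck</A></TD></TR>
HTML

# Hypothetical helper: return the text of every <a class="cf"> link.
sub cf_link_texts {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Could not create parser";
    my @names;

    while (my $tag = $parser->get_tag("a")) {
        # Ignore links that lack the class or use a different one.
        my $class = $tag->[1]{class};
        next unless defined $class and $class eq "cf";

        my $text = $parser->get_trimmed_text("/a");

        # The decoded &nbsp; (character 0xA0) may survive trimming,
        # so strip leading whitespace and nonbreaking spaces ourselves.
        $text =~ s/^[\s\xA0]+//;
        push @names, $text;
    }
    return @names;
}

print "$_\n" for cf_link_texts($html);
```

Because we only ever ask for <a> tags, the <TD> cells that also use class cf never reach the test, and the Amazon link is skipped because it has no class at all.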

The Code

Save the following code to a file called echocloud.pl :

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 use HTML::TokeParser;
 use URI::Escape;

 # The artist to search for should be given as an argument.
 my $artist = $ARGV[0];
 die "No artist specified\n" unless defined $artist and $artist ne '';

 # We use URI::Escape to convert the artist's name
 # into a form that can be encoded as part of a URL.
 my $search = uri_escape($artist);

 # 1.  Download the page to be worked on.
 #####################################

 my $content =
   get('http://www.echocloud.net/index.php?searchword='.
       $search.'&option=asearch&stype=2&order=0&nrows=6');
 die "Could not download the results page\n" unless defined $content;

 # Now that we have our content, initialize a
 # new HTML::TokeParser object with it. We pass a
 # reference so that the string is parsed as the document
 # itself; a plain scalar would be treated as a filename.
 my $stream = HTML::TokeParser->new(\$content)
   or die "Could not create parser\n";

 print "Artists liked by $artist listeners include:\n";

 # 2.  Determine the structure of the HTML document.
 # An HTML result looks like: <a href='index.php?searchword
 # =Beck&option=asearch' class="cf">&nbsp;Beck</a>
 #####################################

 # 3.  Model the structure in code.
 # Given that each <a class="cf"> contains our result, we:
 #   - Search for each <a> tag.
 #   - If it has a 'class' attribute, and
 #     the class attribute is "cf":
 #       - Save all the text from <a> to </a>.
 #   - Repeat.
 #
 # Of the methods used below, the two from TokeParser are:
 # get_tag:  Move the stream to the next occurrence of a tag.
 # get_trimmed_text:  Store text from the current location
 # of the stream to the tag given.
 #####################################

 # For each <a> tag
 while (my $tag = $stream->get_tag("a")) {

   # Is there a 'class' attribute?  Is it 'cf'?
   if ($tag->[1]{class} and $tag->[1]{class} eq "cf") {

       # Store everything from <a> to </a>.
       my $result = $stream->get_trimmed_text("/a");

       # Remove the leading '&nbsp;' character (0xA0 once
       # decoded), along with any other leading whitespace.
       $result =~ s/^[\s\xA0]+//;

       # Echocloud sometimes returns the artist we searched
       # for as one of the results.  Skip the current loop
       # if the string given matches one of the results.
       # \Q...\E stops characters in the artist's name from
       # being treated as regular expression syntax.
       next if $result =~ /\Q$artist\E/i;

       # And we can print our final result.
       print "  - $result\n";
   }
 }

Running the Hack

Here, I invoke the script, asking for artists associated with Aimee Mann:

 % perl echocloud.pl 'Aimee Mann'
 Artists liked by Aimee Mann listeners include:
   - Beck
   - Counting Crows
   - Bob Dylan
   - Radiohead
   - Blur

While this has been a simple example of the power of HTML::TokeParser, modeling more complex pages rarely involves more than increasing the number of get_tag calls and conditional checks on attributes and tags in code. For more complex interactions with sites, your TokeParser code can also be combined with WWW::Mechanize [Hack #22].
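As a rough sketch of what "more get_tag calls" can look like, the following variation also collects the two newsflash figures that follow each artist in a result row. The fragment is a shortened stand-in for the real page, parse_rows is a hypothetical helper, and since we make no claim about what the two numbers mean, the code simply calls them values.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened stand-in for one Echocloud result row.
my $html = <<'HTML';
<TR><TD class = "cf"><A HREF='index.php?searchword=Autechre' class = "cf">&nbsp;Autechre</A></TD>
<TD><span class = newsflash>8.05 </span></TD><TD><span class =newsflash>0.50</span> </TD></TR>
HTML

# Hypothetical helper: one hash per row, holding the artist's name
# and the text of the next two <span class=newsflash> tags.
sub parse_rows {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Could not create parser";
    my @rows;

    while (my $tag = $parser->get_tag("a")) {
        next unless ($tag->[1]{class} || '') eq 'cf';

        my $artist = $parser->get_trimmed_text("/a");
        $artist =~ s/^[\s\xA0]+//;    # drop the decoded &nbsp;

        # A second round of get_tag calls, this time for <span>.
        my @values;
        while (@values < 2) {
            my $span = $parser->get_tag("span") or last;
            next unless ($span->[1]{class} || '') eq 'newsflash';
            push @values, $parser->get_trimmed_text("/span");
        }
        push @rows, { artist => $artist, values => \@values };
    }
    return @rows;
}

for my $row (parse_rows($html)) {
    print "$row->{artist}: ", join(", ", @{ $row->{values} }), "\n";
}
```

This sketch assumes that every row really does contain two such spans; on a page where some rows have fewer, the inner loop would read ahead into the next row, so a sturdier version would stop at the closing </tr>.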
