Hack #24 Painless RSS with Template::Extract


Wouldn't it be nice if you could simply visualize what the data on a page looks like, describe it to Perl in template form, and do away with parsers, regular expressions, and other programmatic logic? That's exactly what Template::Extract helps you do.

One thing that I'd always wanted to do, but never got around to doing, was produce RSS files for all those news sites I read regularly that don't have their own RSS feeds. Maybe I'd read them more regularly if they notified me when something was new, instead of requiring me to remember to check.

One day, I was fiddling about with the Template Toolkit (http://www.template-toolkit.com) and it dawned on me that all these sites were, at some level, generated with some templating engine. The Template Toolkit takes a template and some data and produces HTML output. For instance, if I have the following Perl data structure:

 @news = (
     { date    => "2003-09-02", subject => "Some News!",
       content => "Something interesting happened today." },
     { date    => "2003-09-03", subject => "More News!",
       content => "I ran out of imagination today." },
 );

I can apply a template like so:

 <ul>
     [% FOREACH item = news %]
         <li> <i> [% item.date %] </i> - <b> [% item.subject %] </b>
             <p> [% item.content %] </p>
         </li>
     [% END %]
 </ul>

I'll end up with some HTML that looks like this:

 <ul>
     <li> <i> 2003-09-02 </i> - <b> Some News! </b>
         <p> Something interesting happened today. </p>
     </li>
     <li> <i> 2003-09-03 </i> - <b> More News! </b>
         <p> I ran out of imagination today. </p>
     </li>
 </ul>

Okay, you might think, very interesting, but how does this relate to scraping web pages for RSS? Well, we know what the HTML looks like, and we can make a reasonable guess at what the template ought to look like, but we want only the data. If only I could apply the Template Toolkit backward somehow. Taking HTML output and a template that could conceivably generate the output, I could retrieve the original data structure and, from then on, generating RSS from the data structure would be a piece of cake.

Like most brilliant ideas, this is hardly original, and an equally brilliant man named Autrijus Tang not only had the idea a long time before me, but (and this is the hard part) actually worked out how to implement it. His Template::Extract Perl module (http://search.cpan.org/author/AUTRIJUS/Template-Extract/) does precisely this: extract a data structure from its template and output.

I put it to work immediately to turn the blog of one of my favorite singers, Martyn Joseph (http://www.piperecords.co.uk/news/diary.asp), into an RSS feed. I'll use his blog for the example in this hack.

First, write a simple bit of Perl to grab the page, and tidy it up to avoid tripping over whitespace issues:

 #!/usr/bin/perl
 use LWP::Simple qw(get);

 my $page = get("http://www.piperecords.co.uk/news/diary.asp");
 exit unless $page;
 $page = join "\n", grep { /\S/ } split /\n/, $page;
 $page =~ s/\r//g;
 $page =~ s/^\s+//gm;

This removes blank lines, DOS line feeds, and leading spaces. Once you've done this, take a look at the structure of the page. You'll find that blog posts start with this line:
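To see the tidy-up in isolation, here's a self-contained sketch that applies the same three transformations to a made-up sample string; note the /m modifier on the final substitution so that leading whitespace is trimmed on every line, not just the first:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up page fragment with blank lines, DOS line feeds, and indentation
my $page = "<html>\r\n\r\n   <body>\r\n  <p>Hello</p>\r\n</html>\r\n";

$page = join "\n", grep { /\S/ } split /\n/, $page;  # drop blank lines
$page =~ s/\r//g;                                    # strip DOS line feeds
$page =~ s/^\s+//gm;                                 # trim leading spaces on each line

print $page, "\n";
```

The result is the same markup with all incidental whitespace gone, so the extraction template doesn't have to anticipate it.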

 <!--START OF ABSTRACT OF NEWSITEM--> 

and end with this one:

 <!--END OF ABSTRACT OF NEWSITEM--> 

The interesting bit of the diary starts after the close of an HTML comment:

 --> 

After a bit more observation, you can glean a template like this:

 -->
 [% FOR records %]
     <!--START OF ABSTRACT OF NEWSITEM-->
     [% ... %]
     <a href="[% url %]"><acronym title="Click here to read this article">
     [% title %]</acronym></a></strong> &nbsp; &nbsp; ([% date %]) <BR>
     [% ... %]<font size="2">[% content %]</font></font></div>
     [% ... %]
     <!--END OF ABSTRACT OF NEWSITEM-->
 [% END %]

The special [% ... %] template markup means "stuff": things we don't care about. It's the Template::Extract equivalent of a regular expression's .* . Now, feed your document and this template to Template::Extract :
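If you're more comfortable with regular expressions, the analogy can be made concrete. This is not how Template::Extract is implemented, just an illustration of the idea: each named [% url %]-style slot behaves like a capture group, and [% ... %] behaves like a non-greedy skip. The HTML sample below is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up fragment in the same shape as one blog entry
my $html = '<junk><a href="http://example.com/post1">First Post</a> (2003-09-02)</junk>';

my ($url, $title, $date);
# [% url %], [% title %], [% date %] become captures; [% ... %] becomes .*?
if ($html =~ m{<a href="(.*?)">(.*?)</a> \((.*?)\)}) {
    ($url, $title, $date) = ($1, $2, $3);
}
print "$url | $title | $date\n";
```

Template::Extract does this matching for you and, crucially, hands back the captures already organized into the data structure the template implies.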

 my $x = Template::Extract->new();
 my $data = $x->extract($template, $doc);

You end up with a data structure that looks like this:

 $data = {
     records => [
         { url => "...", title => "...", date => "...", content => "..." },
         { url => "...", title => "...", date => "...", content => "..." },
         ...
     ],
 };
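Working with that structure is plain hash-and-array dereferencing. Here's a short sketch with placeholder values standing in for the real extracted fields:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder values in the same shape Template::Extract returns
my $data = {
    records => [
        { url => "http://example.com/1", title => "First",
          date => "2003-09-02", content => "..." },
        { url => "http://example.com/2", title => "Second",
          date => "2003-09-03", content => "..." },
    ],
};

# $data->{records} is an array ref of hash refs, one per blog entry
for my $item (@{ $data->{records} }) {
    printf "%s (%s)\n", $item->{title}, $item->{url};
}
```

Each element of records maps one [% FOR records %] iteration back to the fields named in the template.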

The XML::RSS Perl module [Hack #94] can painlessly turn this data structure into a well-formed RSS feed:

 $rss = new XML::RSS;
 $rss->channel(
     title       => "Martyn's Diary",
     link        => "http://www.piperecords.co.uk/news/diary.asp",
     description => "Martyn Joseph's Diary",
 );

 for (@{ $data->{records} }) {
     $rss->add_item(
         title       => $_->{title},
         link        => $_->{url},
         description => $_->{content},
     );
 }

 print $rss->as_string;

Job done. Well, nearly.

You see, it's a shame to have solved such a generic problem (scraping a web page into an RSS feed) in such a specific way. Instead, what I really use is the following CGI driver, which allows me to specify all the details of the site and the RSS in a separate file:

 #!/usr/bin/perl -T
 use Template::Extract;
 use LWP::Simple qw(get);
 use XML::RSS;
 use CGI qw(:standard);

 print "Content-type: text/xml\n\n";
 my $x = Template::Extract->new();
 my %params;

 path_info() =~ /(\w+)/ or die "No file name given!";
 my $file = $1;
 open IN, "rss/$file" or die "Can't open $file: $!";
 while (<IN>) {
     /(\w+): (.*)/ and $params{$1} = $2;
     last if !/\S/;
 }
 my $template = do { local $/; <IN> };

 my $rss = new XML::RSS;
 $rss->channel(
     title       => $params{title},
     link        => $params{link},
     description => $params{description},
 );

 my $doc = join "\n", grep { /\S/ } split /\n/, get($params{link});
 $doc =~ s/\r//g;
 $doc =~ s/^\s+//gm;

 for (@{ $x->extract($template, $doc)->{records} }) {
     $rss->add_item(
         title       => $_->{title},
         link        => $_->{url},
         description => $_->{content},
     );
 }
 print $rss->as_string;

Now I can have a bunch of files that describe how to scrape sites:

 title: Martyn's Diary
 link: http://www.piperecords.co.uk/news/diary.asp
 description: Martyn Joseph's diary

 -->
 [% FOR records %]
     <!--START OF ABSTRACT OF NEWSITEM-->
     [% ... %]
     <a href="[% url %]"><acronym title="Click here to read this article">
     [% title %]</acronym></a></strong> &nbsp; &nbsp; ([% date %]) <BR>
     [% ... %]<font size="2">[% content %]</font></font></div>
     [% ... %]
     <!--END OF ABSTRACT OF NEWSITEM-->
 [% END %]
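The driver reads such a file in two stages: "key: value" headers up to the first blank line, then everything that remains as the extraction template. Here's a self-contained sketch of that parsing, reading from an in-memory string instead of a file on disk (the sample contents and URL are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up scraper-description file: headers, blank line, then template
my $file = "title: Martyn's Diary\n"
         . "link: http://example.com/diary.asp\n"
         . "\n"
         . "[% FOR records %]...[% END %]\n";
open my $in, '<', \$file or die $!;   # open a filehandle on the string

my %params;
while (<$in>) {
    /(\w+): (.*)/ and $params{$1} = $2;  # collect "key: value" headers
    last if !/\S/;                       # a blank line ends the headers
}
my $template = do { local $/; <$in> };   # slurp the rest as the template

print "$params{title}\n$template";
```

Splitting the two sections on the first blank line is what lets one flat file carry both the feed metadata and the extraction template.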

When I point my RSS aggregator at the CGI script (http://blog.simon-cozens.org/rssify.cgi/martynj), I have an instant scraper for all those wonderful web sites that haven't made it into the RSS age yet.

Template::Extract is a brilliant new way of doing data-directed screen scraping for structured documents, and it's especially handy for anyone who already uses the Template Toolkit to turn templates and data into HTML. Also look out for Autrijus's latest crazy idea, Template::Generate (http://search.cpan.org/author/AUTRIJUS/Template-Generate/), which provides the third side of the Template triangle, turning data and output into a template.

Simon Cozens



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
