Hack 86 Create RSS with XML::RSS

   


By using the popular syndication format known as RSS, you can use your newly scraped data in dozens of different aggregators, toolkits, and more.

Due to the incredible popularity of RSS, folks are starting to syndicate just about anything one might be interested in following: recipes, job listings, sports scores, and TV schedules, to name but a few.

Here's a simple RSS 0.91 file:

    <?xml version="1.0"?>
    <rss version="0.91">
      <channel>
        <title>ResearchBuzz</title>
        <link>http://www.researchbuzz.com</link>
        <description>News and information on search engines... </description>
        <language>en-us</language>
        <item>
          <title>Survey Maps of Scotland Towns</title>
          <link>http://researchbuzz.com/news/2003/jul31aug603.shtml</link>
        </item>
        <item>
          <title>Directory of Webrings</title>
          <link>http://researchbuzz.com/news/2003/jul31aug603.shtml</link>
        </item>
      </channel>
    </rss>

Thankfully, we don't have to worry about creating these files by hand. We can create RSS files with a simple Perl module called XML::RSS (http://search.cpan.org/author/KELLAN/XML-RSS/). Here's how (rss.pl):

    #!/usr/bin/perl -w
    use strict;
    use XML::RSS;

    my $rss = XML::RSS->new(version => '0.91');

    $rss->channel(
        title       => 'Research Buzz',
        link        => 'http://www.researchbuzz.com',
        description => 'News and information on search en...',
    );

    $rss->add_item(
        title       => 'Survey Maps of Scotland Towns',
        link        => 'http://researchbuzz.com/news/2003/etc/etc/etc#etc',
        description => 'An optional description can go here.'
    );

    $rss->add_item(
        title       => 'Directory of Webrings',
        link        => 'http://researchbuzz.com/news/2003/yadda/yadda#etc',
        description => 'Another optional description can go here.'
    );

    print $rss->as_string;

This code creates a channel to describe the ResearchBuzz web site, replete with some story entries, each created by a call to add_item. When we're done adding items, we print the final RSS to STDOUT. Alternatively, if we want to save our RSS to a file, we use the save method:

$rss->save("file.rss");

Saving your RSS to a file is very important. RSS is likely to be downloaded a lot by users checking for updates, so generating it on the fly each time will bog down your server unnecessarily. It's common for a mildly popular site to have its RSS feed downloaded six times a minute. You can automate the creation of your RSS files with a cron job [Hack #90].

Because the call to add_item always creates a new RSS entry with a title, link, and description, we can feed it from anything available, such as iterating over the results of a database search, matches from an HTML parser or regular expression, and so on. Or, we can do something much more interesting and hack it together with one of the existing scripts in this book.
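As a minimal sketch of that idea, the loop below feeds add_item from an array of hash references standing in for database rows or parser matches. The @stories data, the example.com URLs, and the build_feed helper name are invented for illustration; only new, channel, add_item, and as_string come from XML::RSS.

```perl
#!/usr/bin/perl -w
use strict;
use XML::RSS;

# Hypothetical records, standing in for database rows or scraper matches.
my @stories = (
    { title => 'First story',  link => 'http://www.example.com/1' },
    { title => 'Second story', link => 'http://www.example.com/2',
      description => 'Descriptions are optional.' },
);

# build_feed is an illustrative helper, not part of XML::RSS.
sub build_feed {
    my @records = @_;

    my $rss = XML::RSS->new(version => '0.91');
    $rss->channel(
        title       => 'Example Feed',
        link        => 'http://www.example.com/',
        description => 'Items generated from a list of records',
    );

    # One add_item call per record; only the fields present are passed on.
    for my $record (@records) {
        $rss->add_item(%$record);
    }

    return $rss->as_string;
}

print build_feed(@stories);
```

Because the feed is built inside a subroutine that returns a string, the same code could just as easily end with a call to save instead of print.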

In this example, we'll use an aggregated search engine from [Hack #85] in Spidering Hacks (O'Reilly) and repurpose its results into RSS instead of its normal format.

You'll need the XML::RSS module installed, as well as the code from [Hack #85], in Spidering Hacks, available from http://www.oreilly.com/catalog/spiderhks/. Note that most fields within an RSS feed are optional, so this code (agg2rss.pl) outputs only a title and link, not a description:

    #!/usr/bin/perl -w
    # agg2rss - aggregated search to RSS converter
    # This file distributed under the same licence as Perl itself
    # by rik - ora@rikrose.net
    use strict;
    use XML::RSS;

    # Build command line, and run the aggregated search engine.
    my (@currentPlugin, @plugins, $url, $desc, $plugin, %results);
    my $commandLine = "aggsearch " . join " ", @ARGV;
    open INPUT, "$commandLine |" or die $!;
    while (<INPUT>){
        chomp;
        @currentPlugin = split / /, $_;
        push @plugins, $currentPlugin[0];
        while (<INPUT>){
            chomp;
            last if length == 0;
            # Escape every angle bracket on the line, not just the first.
            s/</&lt;/g; s/>/&gt;/g;
            ($url, $desc) = split /: /, $_, 2;
            $url =~ s/^ //; $desc =~ s/^ //;
            $results{$currentPlugin[0]}{$url} = $desc;
        }
    }
    close INPUT;

    # Results are now in the @plugins,
    # %results pair. Put the results into RSS:
    my $rss = XML::RSS->new(version => '0.91');

    # Create the channel object.
    $rss->channel(
            title       => 'Aggregated Search Results',
            link        => 'http://www.example.com/cgi/make-new-search',
            description => 'Using plugins: ' . join ", ", @plugins
    );

    # Add data.
    for $plugin (@plugins){
        for $url (keys %{$results{$plugin}}){
            $rss->add_item(
                    title       => $results{$plugin}{$url},
                    link        => $url,
            );
        }
    }

    # Save it for later, in our RSS feed for our web site.
    $rss->save("/rss/index.rdf");

Okay, we've created the RSS and placed it on our web site so that others can consume it. What now? XML::RSS not only generates RSS files, but it can also parse them. In this example, we'll download the RSS feed for the front page of the BBC and print a nice, readable summary to STDOUT (most likely your screen):

    #!/usr/bin/perl -w
    # get-headlines - get the BBC headlines in RSS format, and print them
    # This file distributed under the same licence as Perl itself
    # by rik - ora@rikrose.net
    use strict;
    use LWP::UserAgent;
    use XML::RSS;

    my $url = "http://www.bbc.co.uk/syndication/feeds/".
              "news/ukfs_news/front_page/rss091.xml";

    # get data
    my $ua = LWP::UserAgent->new();
    my $response = $ua->get($url);
    die $response->status_line . "\n"
      unless $response->is_success;

    # parse it
    my $rss = XML::RSS->new;
    $rss->parse($response->content);

    # print each item
    foreach my $item (@{$rss->{'items'}}){
        print "title: $item->{'title'}\n";
        print "link: $item->{'link'}\n\n";
    }

This code is saved as bbc.pl. Its results look similar to this:

    % perl bbc.pl
    title: UK troops attacked in Basra
    link: http://news.bbc.co.uk/go/click/rss/0.91/public/-/1/hi/world/middle_east/3137779.stm

    title: 'More IVF' on the NHS
    link: http://news.bbc.co.uk/go/click/rss/0.91/public/-/1/hi/health/3137191.stm

    ...etc...
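To experiment with parsing without depending on a live feed, the same parse call can be given an RSS document held in a string. The sketch below is only an illustration: the feed content, the example.com URLs, and the summarize helper are made up, and it assumes XML::RSS (and the XML::Parser module it relies on) is installed.

```perl
#!/usr/bin/perl -w
use strict;
use XML::RSS;

# A small RSS 0.91 document held in a string; the content is invented.
my $xml = <<'END_RSS';
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example Feed</title>
    <link>http://www.example.com/</link>
    <description>A feed used only for testing</description>
    <language>en-us</language>
    <item>
      <title>First story</title>
      <link>http://www.example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/2</link>
    </item>
  </channel>
</rss>
END_RSS

# summarize is an illustrative helper, not part of XML::RSS.
sub summarize {
    my ($content) = @_;

    my $rss = XML::RSS->new;
    $rss->parse($content);

    # Channel fields are read back by passing a single key to channel.
    my $summary = "channel: " . $rss->channel('title') . "\n\n";

    # Parsed items are hash references in the order they appear in the feed.
    foreach my $item (@{ $rss->{'items'} }) {
        $summary .= "title: $item->{'title'}\n";
        $summary .= "link: $item->{'link'}\n\n";
    }
    return $summary;
}

print summarize($xml);
```

Keeping the parsing in a subroutine that takes the raw content means the string could come from an HTTP response, a saved file, or a test fixture without changing the rest of the script.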

6.8.1 See Also

Google Hacks (O'Reilly) for much more information on search engines.
Spidering Hacks (O'Reilly) for much more on scraping data from web sites.

Richard Rose



XML Hacks
XML Hacks: 100 Industrial-Strength Tips and Tools
ISBN: 0596007116
EAN: 2147483647
Year: 2006
Pages: 156
