Hack 94 Using XML::RSS to Repurpose Data

By using the popular syndication format known as RSS, you can use your newly scraped data in dozens of different aggregators, toolkits, and more.

At its simplest, RSS is an XML format for publishing summaries of data, with links to more information. The main use of RSS is to syndicate news headlines and weblog postings with titles, summaries, and links to full stories. A reader can then comb through the story listings, reading a full story if it interests him, rather than having to hop from site to site looking for what's new. You read RSS files with news aggregators, such as NewsMonster (http://www.newsmonster.org), AmphetaDesk (http://www.disobey.com/amphetadesk/), Radio UserLand (http://radio.userland.com), and many more. Due to the incredible popularity of RSS, folks are starting to syndicate just about anything one might be interested in following: recipes, job listings, sports scores, and TV schedules, to name but a few.

Here's a simple RSS file:

 <?xml version="1.0"?>
 <rss version="0.91">
   <channel>
     <title>ResearchBuzz</title>
     <link>http://www.researchbuzz.com</link>
     <description>News and information on search engines... </description>
     <language>en-us</language>
     <item>
       <title>Survey Maps of Scotland Towns</title>
       <link>http://researchbuzz.com/news/2003/jul31aug603.shtml</link>
     </item>
     <item>
       <title>Directory of Webrings</title>
       <link>http://researchbuzz.com/news/2003/jul31aug603.shtml</link>
     </item>
   </channel>
 </rss>

Thankfully, we don't have to worry about creating these files by hand. We can create RSS files with a simple Perl module called XML::RSS (http://search.cpan.org/author/KELLAN/XML-RSS/). Here's how:

 #!/usr/bin/perl -w
 use strict;
 use XML::RSS;

 # Create a new feed object that will emit RSS 0.91.
 my $rss = XML::RSS->new(version => '0.91');

 # Describe the channel (the site the feed belongs to).
 $rss->channel(
     title       => 'Research Buzz',
     link        => 'http://www.researchbuzz.com',
     description => 'News and information on search en...',
 );

 # Add one entry per story.
 $rss->add_item(
     title       => 'Survey Maps of Scotland Towns',
     link        => 'http://researchbuzz.com/news/2003/etc/etc/etc#etc',
     description => 'An optional description can go here.'
 );
 $rss->add_item(
     title       => 'Directory of Webrings',
     link        => 'http://researchbuzz.com/news/2003/yadda/yadda#etc',
     description => 'Another optional description can go here.'
 );

 # Print the finished feed to STDOUT.
 print $rss->as_string;

This code creates a channel to describe the ResearchBuzz web site, replete with some story entries, each created by a call to add_item. When we're done adding items, we print the final RSS to STDOUT. Alternatively, if we want to save our RSS to a file, we use the save method:

 $rss->save("file.rss"); 

Saving your RSS to a file is very important. RSS is likely to be downloaded a lot by users checking for updates, so generating it on the fly each time will bog down your server unnecessarily. It's common for a mildly popular site to have its RSS feed downloaded six times a minute. You can automate the creation of your RSS files with a cron job [Hack #90].
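If a cron job rewrites the feed while an aggregator is downloading it, the reader could see a half-written file. Here's a minimal sketch of one way around that, assuming the temporary file and the final file live on the same filesystem: save to a temporary name, then rename it into place. The save_feed helper and the index.rss filename are hypothetical names chosen for illustration, not part of XML::RSS.

```perl
#!/usr/bin/perl -w
use strict;
use XML::RSS;

# Hypothetical helper: write the feed to a temporary file first, then
# rename it over the real one, so readers never see a partial feed.
sub save_feed {
    my ($rss, $path) = @_;
    my $tmp = "$path.tmp.$$";    # $$ is our process ID
    $rss->save($tmp);
    rename($tmp, $path) or die "Cannot rename $tmp to $path: $!";
}

my $rss = XML::RSS->new(version => '0.91');
$rss->channel(
    title       => 'Research Buzz',
    link        => 'http://www.researchbuzz.com',
    description => 'News and information on search en...',
);
$rss->add_item(
    title => 'Survey Maps of Scotland Towns',
    link  => 'http://researchbuzz.com/news/2003/etc/etc/etc#etc',
);

# 'index.rss' is a placeholder; point it at wherever your web server
# serves static files from.
save_feed($rss, 'index.rss');
```

Run a script like this from cron as often as your data changes, and the web server only ever hands out a complete static file.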

Since the call to add_item always creates a new RSS entry with a title, link, and description, we can feed it from anything available, such as iterating over the results of a database search, matches from an HTML parser or regular expression, and so on. Or, we can do something much more interesting and hack it together with one of the existing scripts in this book.
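As a quick illustration of the regular expression route, here's a rough sketch that pulls anchor tags out of a chunk of HTML and hands each match to add_item. The sample HTML, the example.com URLs, and the extract_links helper are all made up for this example, and the pattern handles only simple, well-behaved anchors; a real page usually deserves a proper HTML parser.

```perl
#!/usr/bin/perl -w
use strict;
use XML::RSS;

# Made-up HTML standing in for a page you have already downloaded.
my $html = <<'HTML';
<a href="http://www.example.com/one">First story</a>
<a href="http://www.example.com/two">Second story</a>
HTML

# Hypothetical helper: return one { link, title } hash per simple anchor.
sub extract_links {
    my ($text) = @_;
    my @found;
    while ($text =~ m{<a\s+href="([^"]+)"[^>]*>([^<]+)</a>}gi) {
        push @found, { link => $1, title => $2 };
    }
    return @found;
}

my $rss = XML::RSS->new(version => '0.91');
$rss->channel(
    title       => 'Example Links',
    link        => 'http://www.example.com/',
    description => 'Links scraped from an example page',
);

# Each hash already has the keys add_item expects, so pass it straight in.
$rss->add_item(%$_) for extract_links($html);

print $rss->as_string;
```

The same loop works for database rows or parser output: anything you can turn into a title and a link can become an item.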

In the next example, we'll use our aggregated search engine [Hack #85] and repurpose its results into RSS instead of its normal format.

You'll need the XML::RSS module installed, as well as the code from [Hack #85]. Note that most fields within an RSS feed are optional, so this code outputs only a title and link, not a description:

 #!/usr/bin/perl -w
 # agg2rss - aggregated search to RSS converter
 # This file distributed under the same licence as Perl itself
 # by rik - ora@rikrose.net

 use strict;
 use XML::RSS;

 # Build command line, and run the aggregated search engine.
 my (@currentPlugin, @plugins, $url, $desc, $plugin, %results);
 my $commandLine = "aggsearch " . join " ", @ARGV;

 # The trailing pipe tells open to run the command and read its output.
 open INPUT, "$commandLine |" or die $!;
 while (<INPUT>){
     chomp;
     # The first line of each block names the plug-in.
     @currentPlugin = split / /, $_;
     push @plugins, $currentPlugin[0];
     # The following lines, up to a blank one, are "url: description" pairs.
     while (<INPUT>){
         chomp;
         last if length == 0;
         s/</&lt;/; s/>/&gt;/;
         ($url, $desc) = split /: /, $_, 2;
         $url =~ s/^ //; $desc =~ s/^ //;
         $results{$currentPlugin[0]}{$url} = $desc;
     }
 }
 close INPUT;

 # Results are now in the @plugins,
 # %results pair. Put the results into RSS:
 my $rss = XML::RSS->new(version => '0.91');

 # Create the channel object.
 $rss->channel(
         title       => 'Aggregated Search Results',
         link        => 'http://www.example.com/cgi/make-new-search',
         description => 'Using plugins: ' . join ", ", @plugins
 );

 # Add data.
 for $plugin (@plugins){
     for $url (keys %{$results{$plugin}}){
         $rss->add_item(
                 title       => $results{$plugin}{$url},
                 link        => $url,
         );
     }
 }

 # Save it for later, in our RSS feed for our web site.
 $rss->save("/rss/index.rdf");

Okay, we've created the RSS and placed it on our web site so that others can consume it. What now? XML::RSS not only generates RSS files, but it can also parse them. In this example, we'll download the RSS feed for the front page of the BBC and print a nice, readable summary to STDOUT (most likely to your screen):

 #!/usr/bin/perl -w
 # get-headlines - get the BBC headlines in RSS format, and print them
 # This file distributed under the same licence as Perl itself
 # by rik - ora@rikrose.net

 use strict;
 use LWP::UserAgent;
 use XML::RSS;

 my $url = "http://www.bbc.co.uk/syndication/feeds/".
           "news/ukfs_news/front_page/rss091.xml";

 # get data
 my $ua = LWP::UserAgent->new();
 my $response = $ua->get($url);
 die $response->status_line . "\n"
   unless $response->is_success;

 # parse it
 my $rss = XML::RSS->new;
 $rss->parse($response->content);

 # print each item
 foreach my $item (@{$rss->{'items'}}){
     print "title: $item->{'title'}\n";
     print "link: $item->{'link'}\n\n";
 }

Save this code as bbc.pl. Its results look similar to this:

 % perl bbc.pl
 title: UK troops attacked in Basra
 link: http://news.bbc.co.uk/go/click/rss/0.91/public/-
       /1/hi/world/middle_east/3137779.stm

 title: 'More IVF' on the NHS
 link: http://news.bbc.co.uk/go/click/rss/0.91/public/-
       /1/hi/health/3137191.stm

 ...etc...
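You don't need a network connection to experiment with the parser, since parse accepts any string of RSS. The sketch below feeds it a trimmed copy of the ResearchBuzz sample from the start of this hack and also reads the channel title, which lives in the same hash structure as the items. The summarize_feed helper is a name invented for this example.

```perl
#!/usr/bin/perl -w
use strict;
use XML::RSS;

# Hypothetical helper: parse a string of RSS and return printable lines,
# the channel title first, then one line per item.
sub summarize_feed {
    my ($xml) = @_;
    my $rss = XML::RSS->new;
    $rss->parse($xml);
    my @lines = ("channel: $rss->{'channel'}{'title'}");
    foreach my $item (@{$rss->{'items'}}) {
        push @lines, "  $item->{'title'} ($item->{'link'})";
    }
    return @lines;
}

# A trimmed copy of the sample feed shown earlier in this hack.
my $xml = <<'XML';
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>ResearchBuzz</title>
    <link>http://www.researchbuzz.com</link>
    <description>News and information on search engines...</description>
    <language>en-us</language>
    <item>
      <title>Survey Maps of Scotland Towns</title>
      <link>http://researchbuzz.com/news/2003/jul31aug603.shtml</link>
    </item>
  </channel>
</rss>
XML

print "$_\n" for summarize_feed($xml);
```

For a feed that's already saved on disk, XML::RSS also provides a parsefile method that takes a filename instead of a string.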

See Also

  • Content Syndication with XML and RSS (http://www.oreilly.com/catalog/consynrss/) by Ben Hammersley for a full explanation of the various RSS formats, as well as further information on parsing, creating, and adding your own elements via namespaces.

Richard Rose



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
