Hack 71 Finding Related RSS Feeds

If you're a regular reader of weblogs, you know that most syndicate their content in a format called RSS. By querying aggregated RSS databases, you can find related sites you may be interested in reading.

One of the fastest growing applications on the Web is the use of RSS feeds. Although there's some contention regarding what RSS stands for (one definition of the acronym calls it "Really Simple Syndication," another "Rich Site Summary"), RSS feeds are XML documents that provide a feed of headlines from a web site (commonly a weblog or news site) that can be processed easily by a piece of software called a news aggregator. News aggregators allow you to subscribe to content from a multitude of web sites, letting the program check for new content rather than requiring you to go out and look for it.
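If you've never looked inside one, here's a minimal, entirely hypothetical example of what such a feed looks like on the wire (real feeds carry more metadata and many more items):

 <?xml version="1.0"?>
 <rss version="0.91">
   <channel>
     <title>Example Weblog</title>
     <link>http://example.com/</link>
     <description>A hypothetical feed with one headline</description>
     <language>en-us</language>
     <item>
       <title>First headline</title>
       <link>http://example.com/archives/first-headline.html</link>
     </item>
   </channel>
 </rss>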

RSS feeds are like potato chips, though. Once you subscribe to one, you find yourself grabbing one after another. It would be nice if you could supply a list of feeds you already read to a robot and have it go out and find related feeds in which you might also be interested.

Filling Up the Toolbox

We're going to need a number of tools to get this script off the ground. Also, we'll be calling on a couple of web services, namely those at Syndic8 (http://www.syndic8.com) and Technorati (http://www.technorati.com).

Syndic8 is a catalog of feeds maintained by volunteers, and it contains quite a bit of information on each feed. It also catalogs feeds created for sites by people other than the site owners, so even if a site doesn't publish its own feed, Syndic8 might know of one anyway. Also, Syndic8 employs several categorization schemes; so, given one feed, we might be able to find others in its category. Since Syndic8 offers an XML-RPC web service, we can call upon this directory for help.

Technorati is a search engine and a spider of RSS feeds and weblogs. Among other things, it indexes links between weblogs and feeds, and it maps the relationships between sites. So, while we're looking for feeds, Technorati can tell us which sites link to each other. Since it supports a simple URL-based API that produces XML, we can integrate this into our script fairly easily.

Let's gather some tools and start the script:

 #!/usr/bin/perl -w
 use strict;

 use POSIX;
 use Memoize;
 use LWP::Simple;
 use XMLRPC::Lite;
 use XML::RSS;
 use HTML::RSSAutodiscovery;

 use constant SYNDIC8_ID => 'syndic8_id';
 use constant FEED_URL   => 'feed_url';
 use constant SITE_URL   => 'site_url';

This script starts off with some standard Perl safety features. The Memoize module is a useful tool we can use to cache the results of functions so that we aren't constantly rerequesting information from web services. LWP::Simple allows us to download content from the Web; XMLRPC::Lite allows us to call on XML-RPC web services; XML::RSS allows us to parse and extract information from RSS feeds themselves; and HTML::RSSAutodiscovery gives us a few tricks to locate a feed for a site when we don't know its location.

The rest of this preamble consists of a few constants we'll use later. Now, let's do some configuration:

 our $technorati_key = "your Technorati key";
 our $ta_url         = 'http://api.technorati.com';
 our $ta_cosmos_url  = "$ta_url/cosmos?key=$technorati_key&url=";

 our $syndic8_url         = 'http://www.syndic8.com/xmlrpc.php';
 our $syndic8_max_results = 10;

 my @feeds = qw(
   http://www.macslash.com/macslash.rdf
   http://www.wired.com/news_drop/netcenter/netcenter.rdf
   http://www.cert.org/channels/certcc.rdf
 );

Notice that, like many web services, the Technorati API requires you to sign up for an account and be assigned a key string in order to use it (http://www.technorati.com/members/apikey.html). You might also want to check out the informal documentation for this service (http://www.sifry.com/alerts/archives/000288.html). After we set our API key, we construct the URL we'll be using to call upon the service.

Next, we set up the URL for the Syndic8 XML-RPC service, as well as a limit we'll use later for restricting the number of feeds we want the robot to look for at once.

Finally, we set up a list of favorite RSS feeds to use in digging for more feeds. With configuration out of the way, we have another trick to use:

 map { memoize($_) } qw(
   get_ta_cosmos
   get_feed_info
   get_info_from_technorati
   get_info_from_rss
 );

This little map statement sets up the Memoize module for us so that the mentioned function names will have their results cached. This means that, if any of the four functions in the statement are called with the same parameters throughout the program, the results will not be recomputed but will be pulled from a cache in memory. This should save a little time and use of web services as we work.
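If you haven't met Memoize before, this tiny self-contained sketch (with a made-up slow_square function standing in for an expensive web service call) shows the effect:

 use Memoize;

 sub slow_square {
     my $n = shift;
     sleep 2;            # stand-in for a slow network request
     return $n * $n;
 }
 memoize('slow_square');

 print slow_square(4), "\n";   # takes ~2 seconds; result is computed and cached
 print slow_square(4), "\n";   # instant; same arguments, so the cache answers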

Next, here's the main driving code of the script:

 my $feed_records = [];
 for my $feed (@feeds) {
   my %feed_record = (url => $feed);
   $feed_record{info}    = get_feed_info(FEED_URL, $feed);
   $feed_record{similar} = collect_similar_feeds($feed_record{info});
   $feed_record{related} = collect_related_feeds($feed_record{info});
   push @$feed_records, \%feed_record;
 }

 print html_wrapper(join("<hr />\n",
                         map { format_feed_record($_) }
                         @$feed_records));

This loop runs through each of our favorite RSS feeds and gathers records for each one. Each record is a hash, whose primary keys are info, similar, and related. info will contain basic information about the feed itself; similar will contain records about feeds in the same category as this feed; and related will contain records about feeds that have linked to items from the current feed.
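To make that concrete, here's a rough sketch (as a comment, with made-up values) of the shape of a single feed record once the loop has run:

 # One entry in @$feed_records, approximately:
 # {
 #   url     => 'http://www.macslash.com/macslash.rdf',
 #   info    => { url => ..., rss => ..., syndic8 => ..., technorati => ... },
 #   similar => { 'Category name (scheme)' => { $feed_url => $feed_info, ... } },
 #   related => { $feed_url => $feed_info, ... },
 # }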

Now, let's implement the functions that this code needs.

Getting the Dirt on Feeds

The first thing we want to do is build a way to gather information about RSS feeds, using our chosen web services and the feeds themselves:

 sub get_feed_info {
   my ($type, $id) = @_;
   return {} if !$id;
   my ($rss, $s_info, $t_info, $feed_url, $site_url);

   if ($type eq SYNDIC8_ID) {
     $s_info   = get_info_from_syndic8($id) || {};
     $feed_url = $s_info->{dataurl};
   } elsif ($type eq FEED_URL) {
     $feed_url = $id;
   } elsif ($type eq SITE_URL) {
     my $rss_finder = new HTML::RSSAutodiscovery(  );
     eval {
       ($feed_url) = map { $_->{href} } @{$rss_finder->locate($id)};
     };
   }

   $rss      = get_info_from_rss($feed_url) || {};
   $s_info ||= get_info_from_syndic8($feed_url) || {};
   $site_url = $rss->{channel}{link} || $s_info->{dataurl};
   $t_info   = get_info_from_technorati($site_url);

   return {url=>$feed_url, rss=>$rss, syndic8=>$s_info, technorati=>$t_info};
 }

This function gathers basic information on a feed. It accepts several different forms of identification for a feed: the Syndic8 feed internal ID number, the URL of the RSS feed itself, and the URL of a site that might have a feed. The first parameter indicates which kind of identification the function should expect (using the constants we defined at the beginning of the script), and the second is the identification itself.

So, we must first figure out a URL to the feed from the identification given. With a Syndic8 feed ID, the function tries to grab the feed's record via the Syndic8 web service and then get the feed URL from that record. If a feed URL is given, great; use it. Otherwise, if a site URL is given, we use the HTML::RSSAutodiscovery module to look for a feed for this site.

Once we have the feed URL, we get and parse the feed, grab information from Syndic8 if we haven't already, and then get feed information from Technorati. All of this information is then collected into a hash and returned. You might want to check out the documentation for the Syndic8 and Technorati APIs to learn what information each service provides on a feed.

Moving on, let's see what it takes to get information from Syndic8:

 sub get_info_from_syndic8 {
   my $feed_url = shift;
   return {} if !$feed_url;
   my $result = {};
   eval {
     $result = XMLRPC::Lite->proxy($syndic8_url)
       ->call('syndic8.GetFeedInfo', $feed_url)->result(  ) || {};
   };
   return $result;
 }

Here, we expect a feed URL and return empty-handed if one isn't given. If a feed URL is given, we simply call the Syndic8 web service method syndic8.GetFeedInfo with the URL to our feed and catch the results. One thing to note is that we wrap this call in an eval statement, which prevents any otherwise fatal errors in the call or in the XML parsing from exiting the script. In the case of such an error, we simply return an empty record.
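The idiom is worth knowing on its own; in general form it looks like the following sketch (risky_call is made up). If you'd rather log failures than silently ignore them, the error text lands in Perl's $@ variable:

 my $result = {};
 eval {
     $result = risky_call() || {};   # anything in here that die()s is trapped
 };
 if ($@) {
     warn "lookup failed, using an empty record: $@";
 }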

Grabbing information from Technorati is a little more complex, if only because we'll be parsing the XML resulting from calls without the help of a convenience package such as XMLRPC::Lite. But let's get on with that:

 sub get_info_from_technorati {
   my $site_url = shift;
   return {} if !$site_url;

   my $xml  = get_ta_cosmos($site_url);
   my $info = {};
   if ($xml =~ m{<result>(.*?)</result>}mgis) {
     my $xml2 = $1;
     $info = extract_ta_bloginfo($xml2);
   }
   return ($info->{lastupdate} =~ /1970/) ? {} : $info;
 }

Here, we make a request to the web service's cosmos method with the site URL as a parameter. Using a regular expression, we look for the contents of a result tag in the response to our query and call upon a convenience function to extract the XML data into a hash. We also check to make sure the date doesn't contain 1970 (the start of the Unix epoch), a value that occurs when a record isn't found.
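For reference, here's an abbreviated (and partly guessed-at) sketch of the sort of XML a cosmos query returns; the real response carries more elements, but these are the pieces our regular expressions go after:

 <tapi version="1.0">
   <document>
     <result>
       <weblog>
         <name>Example Weblog</name>
         <url>http://example.com</url>
         <rssurl>http://example.com/index.rdf</rssurl>
         <lastupdate>2004-01-10 12:34:56 GMT</lastupdate>
       </weblog>
     </result>
     <item>
       <weblog>
         <!-- a weblog that links to the site, in the same form as above -->
       </weblog>
     </item>
   </document>
 </tapi>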

The implementation of our first convenience function goes like so:

 sub get_ta_cosmos {
   my $url = shift;
   return get($ta_cosmos_url.$url);
 }

This is just a simple wrapper around LWP::Simple's get function, done so that we can memoize it without interfering with other modules' use of the same function. Next, here's how to extract a hash from the XML data:

 sub extract_ta_bloginfo {
   my $xml = shift;
   my %info = (  );

   if ($xml =~ m{<weblog>(.*?)</weblog>}mgis) {
     my ($content) = ($1);
     while ($content =~ m{<(.*?)>(.*?)</\1>}mgis) {
       my ($name, $val) = ($1, $2);
       $info{$name} = $val;
     }
   }
   return \%info;
 }

With another couple of regular expressions, we look for the weblog tag in a given stream of XML and extract all of the tags it contains into a hash. Hash keys are tag names, and the values are the contents of those tags. The resulting hash contains basic information about a weblog cited in the Technorati results. We'll also use this in another function in a little bit.

We can extract information from both services, but how about the feeds themselves? A simple function handles that:

 sub get_info_from_rss {
   my $feed_url = shift;
   return {} if !$feed_url;
   my $rss = new XML::RSS(  );
   eval {
     $rss->parse(get($feed_url));
   };
   return $rss;
 }

Again, we expect a feed URL and return empty-handed if one is missing. If a feed URL is given, we download the contents of that URL and use the XML::RSS module to parse the data. Notice that we use another eval statement to wrap this processing so that parsing errors do not exit our script. If everything goes well, we return an instance of XML::RSS.

Our basic feed information-gathering machinery is in place now. The next thing to tackle is gathering feeds. Let's start with employing the Technorati API to find feeds that have referred to a given feed:

 sub collect_related_feeds {
   my $feed_info = shift;
   my $site_url  = $feed_info->{rss}{channel}{link} || $feed_info->{url};
   my %feeds = (  );

We start off by expecting a feed information record, as produced earlier by our get_feed_info function. From this record, we get the URL of the site the feed summarizes. We try two options: first, we check the RSS feed itself for the site link; failing that, we fall back to treating the feed URL itself as the site URL, so that we at least have something to go on.

With that, we call on the Technorati API to get a list of related feeds:

   my $xml = get_ta_cosmos($site_url);
   while ($xml =~ m{<item>(.*?)</item>}mgis) {
     my $xml2    = $1;
     my $ta_info = extract_ta_bloginfo($xml2);
     my $info = ($ta_info->{rssurl} ne '') ?
       get_feed_info(FEED_URL, $ta_info->{rssurl}) :
       get_feed_info(SITE_URL, $ta_info->{url});

With our previous call to the Technorati API, we were gathering information about a feed. This time, we're using the same call to gather information about related feeds. Thanks to Memoize, we should be able to reuse the results of a given API call for the same site URL over and over again, though we actually call upon the API only once.

So, we use a regular expression to iterate through item tags in the resulting data and extract weblog information from each result. Then, we check to see if a URL to this weblog's RSS feed was supplied. If so, we use it to get a feed record on this site; otherwise, we use the site URL and try to guess where the feed is.

After getting the record, we grab the rest of the information in the item tag:

     $info->{technorati} = $ta_info;
     while ($xml2 =~ m{<(.*?)>(.*?)</\1>}mgis) {
       my ($name, $val) = ($1, $2);
       next if $name eq 'weblog';
       $info->{technorati}{$name} = $val;
     }

Once more, we use a regular expression to convert from tag names and contents to a hash. The hash contains information about the weblog's relationship to the feed we're considering, among other things.

To finish up, let's add this record to a hash (to prevent duplicate records) and return that hash when we're all done:

     $feeds{$info->{url}} = $info;
   }
   return \%feeds;
 }

The returned hash will contain feed URLs as keys and feed records as values. Each of these feeds should be somewhat related to the original feed, if only because they linked to its content at one point.

Now, let's go on to use the Syndic8 API to find feeds in a category:

 sub collect_similar_feeds {
   my $feed_info = shift;
   my %feeds = (  );
   my $categories = $feed_info->{syndic8}{Categories} || {};
   for my $cat_scheme (keys %{$categories}) {
     my $cat_name = $categories->{$cat_scheme};

The first thing we do is expect a feed information record and try to grab a list of categories from it. This will be a hash whose keys are codes that identify categorization schemes and whose values identify category titles. We'll loop through each of these pairs and gather feeds in each category:

     my $feeds = XMLRPC::Lite->proxy($syndic8_url)
       ->call('syndic8.GetFeedsInCategory', $cat_scheme, $cat_name)
         ->result(  ) || [];

     # Limit the number of feeds handled in any one category
     $feeds = [ @{$feeds}[0 .. $syndic8_max_results - 1] ]
       if (scalar(@$feeds) > $syndic8_max_results);

Once we have a category scheme and title, we call on the Syndic8 API web service to give us a list of feeds in this category. This call returns a list of internal Syndic8 feed ID numbers, which is why we built in the ability to use them to locate feeds earlier, in our get_feed_info function. Also, we limit the number of results used, based on the configuration variable at the beginning of the script.

Next, let's gather information about the feeds we've found in this category:

     for my $feed (@$feeds) {
       my $feed_info = get_feed_info(SYNDIC8_ID, $feed);
       my $feed_url  = $feed_info->{syndic8}{dataurl};
       next if !$feed_url;
       $feeds{"$cat_name ($cat_scheme)"}{$feed_url} = $feed_info;
     }
   }
   return \%feeds;
 }

Using the Syndic8 feed ID returned for each feed, we get a record for each and add it to a hash whose keys are based on the category and the feed URL. This is an attempt to make sure there is a list of unique feeds for each category. Finally, we return the results of this process.

Reporting on Our Findings

At this point, we can gather information about feeds and use the Syndic8 and Technorati APIs to dig for feeds in similar categories and feeds related by linking. Now, let's produce an HTML page for what we find for each of our favorite feeds:

 sub html_wrapper {
   my $content = shift;
   return qq^
     <html>
       <head>
         <title>Digging for RSS feeds</title>
       </head>
       <body>
         $content
       </body>
     </html>
   ^;
 }

We just put together a simple HTML shell here to contain our results. It wraps whatever content it is given with a simple HTML skeleton. The next step, since our basic unit of results is the feed information record, is to come up with a means of formatting one:

 sub format_feed_info {
   my $info = shift;
   my ($feed_url, $feed_title, $feed_link) =
     ($info->{url}, feed_title($info), feed_link($info));
   return qq^<a href="$feed_link">$feed_title</a>
     (<a href="$feed_url">RSS</a>)^;
 }

This doesn't do much with the wealth of data contained in a feed information record, but for now we simply construct a link to the site and a link to the feed. We'll use this to format the results of our digging for a given feed:

 sub format_feed_record {
   my $record = shift;
   my $out = '';
   $out .= qq^<div class="record">\n^;
   $out .= qq^<h2 class="main_feed">^.
     format_feed_info($record->{info})."</h2>\n";

The first thing we do here is open a div tag to contain these particular record results. Then, we format the record that describes the favorite feed under investigation. Next, we format the results of looking for related feeds:

   my $related = $record->{related};
   if (keys %{$related}) {
     $out .= "<h3>Feeds related by links:</h3>\n<ul>\n";
     $out .= join
       ('',
        map { "<li>".format_feed_info($related->{$_})."</li>\n" }
        sort keys %{$related})."\n\n";
     $out .= "</ul>\n";
   }

This produces a bulleted list of the feeds discovered, as related by linking to our feed. Next, we include the feeds related by category:

   my $similar = $record->{similar};
   if (keys %{$similar}) {
     $out .= "<h3>Similar feeds by category:</h3>\n<ul>\n";
     for my $cat (sort keys %{$similar}) {
       $out .= "<li>$cat\n<ul>";
       $out .= join
         ('',
          map { "<li>".format_feed_info($similar->{$cat}{$_})."</li>\n" }
          sort keys %{$similar->{$cat}})."\n\n";
       $out .= "</ul>\n</li>\n";
     }
     $out .= "</ul>\n";
   }

A little bit more involved, this produces a set of nested lists, with the outer bullets describing categories and the inner bullets describing feeds belonging to the categories. Finally, let's wrap up our results:

   $out .= qq^</div>\n^;
   return $out;
 }

We now have just a few loose ends to tie up. Some feed titles have a bit of extra whitespace in them, so we'll need to tidy that:

 sub trim_space {
   my $val = shift;
   $val =~ s/^\s+//;
   $val =~ s/\s+$//g;
   return $val;
 }

And, since there's a lot of variability in our results as to where a feed's title is, we employ several options in grabbing it:

 sub feed_title {
   my $feed_info = shift;
   return trim_space
     (
      $feed_info->{rss}{channel}{title} ||
      $feed_info->{syndic8}{sitename}   ||
      $feed_info->{technorati}{name}    ||
      $feed_info->{url}                 ||
      '(untitled)'
     );
 }

As with the title, there are many places where a link to the feed can be found, so we do something similar with it:

 sub feed_link {
   my $feed_info = shift;
   return trim_space
     (
      $feed_info->{rss}{channel}{link} ||
      $feed_info->{syndic8}{siteurl}   ||
      $feed_info->{technorati}{url}    ||
      $feed_info->{url}                ||
      ''
     );
 }

Figure 4-7 shows a sample of the generated HTML results.

Figure 4-7. A sampling of possibly related sites

With the use of two web services, we have a pretty powerful robot with which to dig for more interesting feeds. This hack makes quite a few calls to those services, though, so although you might want to run it every now and then to find updates, go easy on it.

Hacking the Hack

A few things are left as exercises for the reader. Most notably, we don't make much use of all the information gathered into a feed information record. In our report, we simply display a link to a site and a link to its feed. In fact, this record also contains all the most recent headlines for a feed, as well as the wealth of information provided by the Syndic8 and Technorati APIs. With some homework, this tool could be expanded even further to make use of all of this additional information.
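For instance, since each record's info hash carries a parsed XML::RSS object, a small helper along these lines (format_feed_headlines is our own invention, not part of the hack as printed) could fold a feed's latest headlines into the report; XML::RSS stores a feed's entries as a list of hashes under its items key:

 sub format_feed_headlines {
   my $info  = shift;
   my $items = $info->{rss}{items} || [];   # empty when parsing failed
   return '' unless @$items;
   return "<ul>\n".
     join('',
          map { qq^<li><a href="$_->{link}">$_->{title}</a></li>\n^ }
          @$items).
     "</ul>\n";
 }

A call to this from format_feed_record, just after the h2 heading, would put recent headlines under each favorite feed.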

l.m.orchard


