Recipe 22.9 Reading and Writing RSS Files

22.9.1 Problem

You want to create a Rich Site Summary (RSS) file, or read one produced by another application.

22.9.2 Solution

Use the CPAN module XML::RSS to read an existing RSS file:

    use XML::RSS;

    my $rss = XML::RSS->new;
    $rss->parsefile($RSS_FILENAME);
    my @items = @{$rss->{items}};
    foreach my $item (@items) {
      print "title: $item->{'title'}\n";
      print "link: $item->{'link'}\n\n";
    }

To create an RSS file:

    use XML::RSS;

    my $rss = XML::RSS->new(version => $VERSION);
    $rss->channel(title       => $CHANNEL_TITLE,
                  link        => $CHANNEL_LINK,
                  description => $CHANNEL_DESC);
    $rss->add_item(title       => $ITEM_TITLE,
                   link        => $ITEM_LINK,
                   description => $ITEM_DESC,
                   name        => $ITEM_NAME);
    print $rss->as_string;

22.9.3 Discussion

There are at least four variations of RSS extant: 0.9, 0.91, 1.0, and 2.0. At the time of this writing, XML::RSS understood all but RSS 2.0. Each version has different capabilities, so methods and parameters depend on which version of RSS you're using. For example, RSS 1.0 supports RDF and uses the Dublin Core metadata (http://dublincore.org/). Consult the documentation for what you can and cannot call.

XML::RSS uses XML::Parser to parse the RSS. Unfortunately, not all RSS files are well-formed XML, let alone valid. The XML::RSSLite module on CPAN offers a looser approach to parsing RSS: it uses regular expressions and is much more forgiving of incorrect XML.
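
To see why a regex-based approach tolerates bad input, consider the following minimal sketch. It is an illustration only and is not how XML::RSSLite is actually implemented; the feed text, the URLs, and the extract_items helper are all made up for the example. The bare ampersand in the first title makes the document ill-formed, so a strict XML parser would reject it, yet the patterns still pull out both items.

```perl
#!/usr/bin/perl
# Illustration only: a regex-based item extractor in the spirit of a
# "forgiving" parser. This is NOT XML::RSSLite's actual implementation.
use strict;
use warnings;

# Hypothetical feed fragment; the bare "&" makes it ill-formed XML.
my $feed = <<'END_RSS';
<rss version="0.91">
<channel>
<title>Example Feed</title>
<item>
<title>Fish & chips prices rise</title>
<link>http://example.com/story/1</link>
</item>
<item>
<title>Perl 5.8 released</title>
<link>http://example.com/story/2</link>
</item>
</channel>
</rss>
END_RSS

# extract_items (hypothetical helper): return a list of hashrefs, one
# per <item> element, holding the text of its <title> and <link>.
sub extract_items {
    my ($xml) = @_;
    my @items;
    while ($xml =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $body = $1;
        my %item;
        for my $field (qw(title link)) {
            if ($body =~ m{<$field\b[^>]*>\s*(.*?)\s*</$field>}is) {
                $item{$field} = $1;
            }
        }
        push @items, \%item;
    }
    return @items;
}

my @items = extract_items($feed);
print "$_->{title}\n\t$_->{link}\n\n" for @items;
```

The trade-off is that patterns like these know nothing about entities, CDATA sections, or namespaces, so a forgiving parser can also silently accept input that a strict parser would rightly flag.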

Example 22-13 uses XML::RSSLite and LWP::Simple to download The Guardian's RSS feed and print out the items whose titles contain the keywords we're interested in.

Example 22-13. rss-parser
    #!/usr/bin/perl -w
    # guardian-list -- list Guardian articles matching keyword
    use XML::RSSLite;
    use LWP::Simple;
    use strict;

    # list of keywords we want
    my @keywords = qw(perl internet porn iraq bush);

    # get the RSS
    my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
    my $content = get($URL);
    die "Couldn't fetch $URL\n" unless defined $content;

    # parse the RSS
    my %result;
    parseRSS(\%result, \$content);

    # build the regex from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    # print report of matching items
    foreach my $item (@{ $result{items} }) {
      my $title = $item->{title};
      # collapse runs of whitespace, then trim both ends
      $title =~ s{\s+}{ }g;
      $title =~ s{^\s+}{};
      $title =~ s{\s+$}{};
      if ($title =~ /$re/) {
        print "$title\n\t$item->{link}\n\n";
      }
    }

The following is sample output from Example 22-13:

    UK troops to lead Iraq peace force
            http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss

    Shia cleric challenges Bush plan for Iraq
            http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00.html?=rss
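
The keyword-matching step in Example 22-13 can be tried on its own, without fetching a feed. The sketch below uses only core Perl; the helper names build_keyword_re and normalize_space and the sample titles are invented for illustration. It also passes each keyword through quotemeta, an addition not in the example, on the assumption that a keyword might someday contain a regex metacharacter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_keyword_re (hypothetical helper): turn a keyword list into one
# case-insensitive regex that matches any keyword as a whole word.
sub build_keyword_re {
    my @keywords = @_;
    # quotemeta guards against keywords containing regex metacharacters
    my $alternation = join "|", map { quotemeta($_) } @keywords;
    return qr/\b(?:$alternation)\b/i;
}

# normalize_space (hypothetical helper): collapse runs of whitespace
# into single spaces and trim both ends.
sub normalize_space {
    my ($text) = @_;
    $text =~ s{\s+}{ }g;
    $text =~ s{^\s+|\s+$}{}g;
    return $text;
}

my $re = build_keyword_re(qw(perl iraq bush));

# Made-up titles, not taken from any real feed
my @titles = (
    "  UK troops to lead\n Iraq peace force ",
    "Bushfire season starts early",
    "New Perl release announced",
);

for my $title (map { normalize_space($_) } @titles) {
    print "$title\n" if $title =~ $re;
}
```

Because of the \b anchors, the second title is skipped: "Bushfire" contains "bush" but not as a whole word. The other two titles match regardless of case.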

We can combine Example 22-13 with XML::RSS to generate a new RSS feed from the filtered items. It would be easier, of course, to do it all with XML::RSS, but this way you get to see both modules in action. Example 22-14 shows the finished program.

Example 22-14. rss-filter
    #!/usr/bin/perl -w
    # guardian-filter -- filter the Guardian's RSS feed by keyword
    use XML::RSSLite;
    use XML::RSS;
    use LWP::Simple;
    use strict;

    # list of keywords we want
    my @keywords = qw(perl internet porn iraq bush);

    # get the RSS
    my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
    my $content = get($URL);
    die "Couldn't fetch $URL\n" unless defined $content;

    # parse the RSS
    my %result;
    parseRSS(\%result, \$content);

    # build the regex from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    # make new RSS feed
    my $rss = XML::RSS->new(version => '0.91');
    $rss->channel(title       => $result{title},
                  link        => $result{link},
                  description => $result{description});

    foreach my $item (@{ $result{items} }) {
      my $title = $item->{title};
      # collapse runs of whitespace, then trim both ends
      $title =~ s{\s+}{ }g;
      $title =~ s{^\s+}{};
      $title =~ s{\s+$}{};
      if ($title =~ /$re/) {
        $rss->add_item(title => $title, link => $item->{link});
      }
    }
    print $rss->as_string;

Here's an example of the RSS feed it produces:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
                "http://my.netscape.com/publish/formats/rss-0.91.dtd">
    <rss version="0.91">
    <channel>
    <title>Guardian Unlimited</title>
    <link>http://www.guardian.co.uk</link>
    <description>Intelligent news and comment throughout the day from The Guardian newspaper</description>
    <item>
    <title>UK troops to lead Iraq peace force</title>
    <link>http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss</link>
    </item>
    <item>
    <title>Shia cleric challenges Bush plan for Iraq</title>
    <link>http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00.html?=rss</link>
    </item>
    </channel>
    </rss>

22.9.4 See Also

The documentation for the modules XML::RSS and XML::RSSLite



Perl Cookbook, Second Edition
ISBN: 0596003137
Year: 2003
Pages: 501
