Section 8.3. Parsing for Programming



The ability to display a feed on a web page is important, no doubt about it, but it's not going to really excite anyone. To do anything more interesting, you need to be able to parse feeds inside your own programs. In this section, we'll look at the two major alternatives, MagpieRSS and the Universal Feed Parser. Both parsers are libraries; both convert feeds into native data structures; and neither cares whether a feed is RSS 1.0, RSS 2.0, or Atom. That, really, is the final word with respect to the Great Battle of the Standards; most of the time, at a programmatic level, no one cares.

8.3.1. PHP: MagpieRSS

The most popular parser in PHP, and arguably the most popular in use on the Web right now, is Kellan Elliott-McCrea's MagpieRSS. As I write this, it stands at version 0.7, a low number indicative of modesty rather than product immaturity. MagpieRSS is a very refined product indeed.

To use MagpieRSS, first download the latest build from its web page at http://sourceforge.net/projects/magpierss/. There is also a weblog at http://laughingmeme.org/magpie_blog/.

Once downloaded, you're presented with a load of READMEs and example scripts, plus five include files:

  • rss_fetch.inc is the library you call from scripts. It deals with retrieving the feed, and marshals the other files into parsing it, before returning the results to your code.

  • rss_parse.inc deals with the nitty gritty of feed parsing. MagpieRSS is a liberal parser, which means it doesn't validate the feed it is given. It can also deal with any arbitrarily invented element as long as it follows the right sort of format, meaning that it is quite futureproof.

  • rss_cache.inc lets you make rss_fetch.inc cache feeds instead of continually requesting new ones.

  • rss_utils.inc currently contains only one internal function, which converts a W3CDTF standard date to Unix epoch time.

  • extlib/Snoopy.class.inc provides the network support for the other included functions.

To install these include files, place them in the same directory as the script that is going to use them.
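As an aside, the W3CDTF-to-epoch conversion that rss_utils.inc performs is easy to picture. Here is a rough Python sketch of the same idea (my own illustration, handling only the full-precision form of W3CDTF; this is not Magpie's actual code):

```python
import calendar
import re

def w3cdtf_to_epoch(date_str):
    # Match the full-precision W3CDTF form, e.g. 2004-05-31T13:20:00Z
    # or 2004-05-31T13:20:00+01:00. Shorter W3CDTF forms are ignored here.
    m = re.match(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})'
                 r'(Z|([+-])(\d{2}):(\d{2}))$', date_str)
    if not m:
        raise ValueError('not a recognised W3CDTF date: %r' % date_str)
    y, mo, d, h, mi, s = (int(m.group(i)) for i in range(1, 7))
    # Treat the stamp as if it were UTC first...
    epoch = calendar.timegm((y, mo, d, h, mi, s, 0, 0, 0))
    # ...then remove the stated timezone offset to reach real UTC.
    if m.group(7) != 'Z':
        offset = int(m.group(9)) * 3600 + int(m.group(10)) * 60
        epoch += -offset if m.group(8) == '+' else offset
    return epoch

print(w3cdtf_to_epoch('1970-01-01T01:00:00+01:00'))  # 0
```

The epoch value that comes back is exactly what the Python examples later in this section feed to time.gmtime() for display.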

8.3.1.1 Using MagpieRSS

MagpieRSS is simple to use and comes well-documented. Included in the distribution is an example script called magpie_simple.php. It looks like Example 8-3.

Example 8-3. magpie_simple.php
<?php

define('MAGPIE_DIR', '../');
require_once(MAGPIE_DIR.'rss_fetch.inc');

$url = $_GET['url'];

if ( $url ) {
    $rss = fetch_rss( $url );

    echo "Channel: " . $rss->channel['title'] . "<p>";
    echo "<ul>";
    foreach ($rss->items as $item) {
        $href = $item['link'];
        $title = $item['title'];
        echo "<li><a href=$href>$title</a></li>";
    }
    echo "</ul>";
}
?>
<form>
    RSS URL: <input type="text" size="30" name="url" value="<?php echo $url ?>"><br />
    <input type="submit" value="Parse RSS">
</form>

Running this on my own weblog's RSS 1.0 feed produces a page that looks like Figure 8-3.

Figure 8-3. A very basic display using MagpieRSS


As you can see, it's very straightforward. Taken line by line, the meat of the script goes like this:

define('MAGPIE_DIR', '../');
require_once(MAGPIE_DIR.'rss_fetch.inc');

$url = $_GET['url'];

Here, you tell PHP where Magpie's files are kept: in this case, in the parent directory of the script. You then load the rss_fetch.inc library and take the requested feed URL from the query string, placing it, as a string, into the variable $url:

if ( $url ) {
    $rss = fetch_rss( $url );

    echo "Channel: " . $rss->channel['title'] . "<p>";
    echo "<ul>";

If a URL was given, you pass it to fetch_rss( ), which retrieves and parses the feed, and then print out a headline for the web page, containing the <title> of the <channel> and the start of an HTML list. (The HTML in this example isn't very compliant, but no matter.)

As you can see, the rest is easy to follow. It simply sets up a loop to run down the feed document and creates a link within an HTML list element from what it finds in the feed. This method of looping through the 15 or so items in a feed is very typical.

foreach ($rss->items as $item) {
    $href = $item['link'];
    $title = $item['title'];
    echo "<li><a href=$href>$title</a></li>";
}

Once that's done, you can close off the list and get on with other things:

echo "</ul>";
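While we're here, the caching that rss_cache.inc bolts onto fetch_rss( ) is conceptually simple: keep a copy on disk and reuse it until it ages out. A stdlib-only Python sketch of the idea (the function name and the one-hour lifetime are my own invention for illustration, not Magpie's API):

```python
import os
import time

CACHE_AGE = 3600  # seconds; an arbitrary one-hour lifetime for this sketch

def fetch_with_cache(url, fetch, cache_dir):
    """Return the feed bytes for url, re-fetching only when the cached
    copy is older than CACHE_AGE. `fetch` is any callable url -> bytes."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, str(abs(hash(url))))
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_AGE:
        with open(path, 'rb') as f:
            return f.read()   # fresh enough: serve the cached copy
    data = fetch(url)         # stale or missing: fetch anew
    with open(path, 'wb') as f:
        f.write(data)
    return data
```

The payoff is politeness: aggregators that poll on a timer should never hammer a publisher's server for a feed that hasn't changed.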

8.3.2. Python: The Universal Feed Parser

Mark Pilgrim's Universal Feed Parser, hosted at http://sourceforge.net/projects/feedparser/, is perhaps the best feed application ever written. It is incredibly well-done and magnificently well-documented. Furthermore, it is released under the GPL and comes with over 2,000 unit tests. Those unit tests alone would spare anyone writing their own parser months of screaming; the question remains, why would you, when the UFP already exists?

It's well-documented, so the following sections will serve only to demonstrate its power.

8.3.2.1 A complete aggregator in 40 lines

To that end, here is a complete aggregator in only 40 lines, written by Jonas Galvez of http://jonasgalvez.com. Jonas released this under the GPL, so it is free to use within the usual GPL bounds. Many thanks to him for that.

The full code is listed later in Example 8-4, but let's step through it section by section. The program stores the list of URLs to fetch and parse in a text file, feeds.txt, so to start, you import the required modules, pull in the contents of feeds.txt, and define an array to hold the items once you parse them:

import time
import feedparser

sourceList = open('feeds.txt').readlines()
postList = []

Next, define the Entry class to act as a wrapper for the entry object the Universal Feed Parser will return. The modified_parsed property contains the entry date as a tuple of nine elements, of which the first six are the year, month, day, hour, minute, and second. This tuple can be converted to Unix epoch time with the time module's mktime():

class Entry:
    def __init__(self, data, blog):
        self.blog = blog
        self.title = data.title
        self.date = time.mktime(data.modified_parsed)
        self.link = data.link

    def __cmp__(self, other):
        return other.date - self.date

The __cmp__ method defines the standard comparison behavior of the class. Once you have a list of Entry instances and call sort(), the __cmp__ method determines the order.
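For what it's worth, __cmp__ is a Python 2 idiom; in Python 3 it is gone, and the same newest-first ordering is expressed with a sort key. A minimal sketch, using invented entries rather than real feed data:

```python
class Entry:
    def __init__(self, title, date):
        self.title = title
        self.date = date  # Unix epoch seconds

entries = [Entry('older', 100.0), Entry('newest', 300.0), Entry('middle', 200.0)]
# Sorting on the negated date reproduces the reverse-chronological
# order that the original __cmp__ method produced.
entries.sort(key=lambda e: -e.date)
print([e.title for e in entries])  # ['newest', 'middle', 'older']
```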

Here is where the UFP comes in. Since we want to show entries ordered by date, it's prudent to at least verify that each entry actually includes a date. With the UFP, you can also check for a "bozo bit" and refuse invalid feeds altogether; the package's documentation gives details on that:

for uri in sourceList:
    xml = feedparser.parse(uri.strip())
    blog = xml.feed.title
    for e in xml.entries[:10]:
        if not e.has_key('modified_parsed'):
            continue
        postList.append(Entry(e, blog))

postList.sort()
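The defensive move in that loop (skip any entry lacking the field you depend on) applies to any feed data, not just UFP results. A stdlib-only sketch of the same filtering, with invented entries standing in for parsed feed items:

```python
entries = [
    {'title': 'a', 'modified_parsed': (2004, 5, 1, 0, 0, 0, 0, 0, 0)},
    {'title': 'b'},  # no date at all; liberal parsers will hand you these
    {'title': 'c', 'modified_parsed': (2004, 6, 1, 0, 0, 0, 0, 0, 0)},
]
# Keep only the entries that actually carry a parsed date.
kept = [e for e in entries if 'modified_parsed' in e]
print([e['title'] for e in kept])  # ['a', 'c']
```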

To finish, print everything out as an XHTML list:

print 'Content-type: text/html\n'
print '<ul style="font-family: monospace;">'

for post in postList[:20]: # newest 20 items
    date = time.gmtime(post.date)
    date = time.strftime('%Y-%m-%d %H:%M:%S', date)
    item = '\t<li>[%s] <a href="%s">%s</a> (%s)</li>'
    print item % (date, post.link, post.title, post.blog)

print '</ul>'

Example 8-4 shows the entire aggregator.

Example 8-4. A 40-line aggregator in Python using the Universal Feed Parser
#!/usr/bin/python2.2

"""
License: GPL 2; share and enjoy!
Requires: Universal Feed Parser <http://feedparser.org>
Author: Jonas Galvez <http://jonasgalvez.com>
"""

import time
import feedparser

sourceList = open('feeds.txt').readlines()
postList = []

class Entry:
    def __init__(self, data, blog):
        self.blog = blog
        self.title = data.title
        self.date = time.mktime(data.modified_parsed)
        self.link = data.link

    def __cmp__(self, other):
        return other.date - self.date

for uri in sourceList:
    xml = feedparser.parse(uri.strip())
    blog = xml.feed.title
    for e in xml.entries[:10]:
        if not e.has_key('modified_parsed'):
            continue
        postList.append(Entry(e, blog))

postList.sort()

print 'Content-type: text/html\n'
print '<ul style="font-family: monospace;">'

for post in postList[:20]:
    date = time.gmtime(post.date)
    date = time.strftime('%Y-%m-%d %H:%M:%S', date)
    item = '\t<li>[%s] <a href="%s">%s</a> (%s)</li>'
    print item % (date, post.link, post.title, post.blog)

print '</ul>'

8.3.3. Perl: XML::Simple

Because of the all-conquering success of Magpie and the UFP, Perl programmers haven't really moved on with the evolution of their feed-parsing tools. The UFP package can be called from Perl if need be, and many people have used the UFP as an excuse to try to learn Python anyway.

Certainly, there is no all-encompassing module for Perl that can parse all the flavors of RSS 1.0, RSS 2.0, and Atom with as much aplomb as the other scripting languages.

XML::RSS provides basic RSS parsing, as does Timothy Appnel's XML::RAI module framework, but neither supports Atom. Ben Trott's XML::Atom is really designed for use with the Atom Publishing Protocol but can be used with the Syndication Format as well, once it is properly up to date. At the time of writing, it is lagging the specification somewhat; this situation should improve once both Atom standards are at version 1.0. Timothy Appnel has also written an Atom module, XML::Atom::Syndication, which is very promising indeed.

With this mishmash of options for parsing feeds, and the necessity of writing code to identify the feed's standard and pass it off to the correct functions, things can get too complicated too quickly in Perl. Let's take it back a notch, then, and resort to first principles. The following hasn't changed from the first edition of this book. I omit Atom to wait for the specification to settle down, but you will be able to see quite plainly how it would work with this structure.

8.3.3.1 Parsing RSS as simply as possible

The disadvantage of RSS's split into two separate but similar specifications is that we can never be sure which of the standards your desired feeds will arrive in. If you restrict yourself to using only RSS 2.0, it is very likely that the universe will conspire to make the most interesting stuff available solely in RSS 1.0, or vice versa. So, no matter what you want to do with the feed, your approach must be able to handle both standards with equal aplomb. With that in mind, simple parsing of RSS can be done with a standard general XML parser.

XML parsers are useful tools to have around when dealing with either RSS 2.0 or 1.0. While RSS 2.0 is quite a simple format, and using a full-fledged XML parser on it does sometimes seem to be overkill, the approach does have a distinct advantage over the other methods: futureproofing. Either way, for the majority of purposes, the simplest XML parsers are perfectly useful. The Perl module XML::Simple is a good example. Example 8-5 is a simple script that uses XML::Simple to parse both RSS 2.0 and RSS 1.0 feeds into XHTML that is ready for server-side inclusion.

Example 8-5. Using XML::Simple to parse RSS
#!/usr/local/bin/perl

use strict;
use warnings;

use LWP::Simple;
use XML::Simple;

my $url = $ARGV[0];

# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";

# Parse the XML
my $parser = XML::Simple->new();
my $rss = $parser->XMLin("$feed_to_parse");

# Decide on name for output file
my $outputfile = "$rss->{'channel'}->{'title'}.html";

# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;

# Open the output file
open (OUTPUTFILE, ">$outputfile");

# Print the channel title
print OUTPUTFILE '<div>'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";

# Print the channel items
print OUTPUTFILE '<div>'."\n"."<ul>";
print OUTPUTFILE "\n";

foreach my $item (@{$rss->{'channel'}->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}

foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}

print OUTPUTFILE "</ul>\n</div>\n";

# Close the output file
close (OUTPUTFILE);

This script highlights various issues regarding the parsing of RSS, so it is worth dissecting closely. Start with the opening statements:

#!/usr/local/bin/perl

use strict;
use warnings;

use LWP::Simple;
use XML::Simple;

my $url = $ARGV[0];

# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";

This is nice and standard Perl: the usual use strict; and use warnings; for good programming karma, followed by the two necessary modules: XML::Simple (which you've been introduced to) and LWP::Simple, which retrieves the RSS feed from the remote server. That retrieval is indeed what happens next: the script takes the command-line argument as the URL for the feed you want to parse and places the entire feed, as a string, in the scalar $feed_to_parse, ready for the next section of the script:

# Parse the XML
my $parser = XML::Simple->new();
my $rss = $parser->XMLin("$feed_to_parse");

This section fires up a new instance of the XML::Simple module and calls the newly initialized object $parser. It then reads the retrieved RSS feed and parses it into a tree, with the root of the tree called $rss. This tree is actually a set of hashes, with the element names as hash keys. In other words, you can do this:

# Decide on name for output file
my $outputfile = "$rss->{'channel'}->{'title'}.html";

# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;

# Open the output file
open (OUTPUTFILE, ">$outputfile");

Here, you take the value of the title element within the channel, add the string .html, and make it the value of $outputfile. This is done for a simple reason: I wanted to make the user interface to this script as simple as possible. You can change it to allow the user to input the output filename himself, but I like the script to work one out automatically from the title element. Of course, many title elements use spaces, which makes a nasty mess of filenames, so you can use a regular expression to replace spaces with underscores. Now open up the file handle, creating the file if necessary.

With a file ready for filling, and an RSS feed parsed in memory, let's fill in some of the rest:

# Print the channel title
print OUTPUTFILE '<div>'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";

Here, you start to make the XHTML version. Take the link and title elements from the channel and create a title that is a hyperlink to the destination of the feed. Assign it a div, so you can format it later with CSS, and include some new lines to make the XHTML source as pretty as can be:

# Print the channel items
print OUTPUTFILE '<div>'."\n"."<ul>";
print OUTPUTFILE "\n";

foreach my $item (@{$rss->{'channel'}->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}

foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}

print OUTPUTFILE "</ul>\n</div>\n";

# Close the output file
close (OUTPUTFILE);

The last section of the script deals with the biggest issue for all RSS parsing: the differences between RSS 2.0 and RSS 1.0. With XML::Simple, or any other tree-based parser, this is especially crucial, because the item appears in a different place in each specification. Remember: in RSS 2.0, item is a subelement of channel, but in RSS 1.0, they have equal weight.

So, in the preceding snippet you can see two foreach loops. The first one takes care of RSS 2.0 feeds, and the second covers RSS 1.0. Either way, the items are encased inside another div and made into an unordered list. The script finishes by closing the file handle. Our work is done.
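The two-loop trick is not Perl-specific; any tree parser needs it. Purely as an illustration, here is the same idea in Python's xml.etree, run over two toy documents (real RSS 1.0 involves RDF namespaces, which this sketch deliberately ignores):

```python
import xml.etree.ElementTree as ET

def item_titles(feed_xml):
    root = ET.fromstring(feed_xml)
    # RSS 2.0 puts <item> inside <channel>; RSS 1.0 puts <item>
    # alongside <channel>, directly under the root element.
    items = root.findall('channel/item') + root.findall('item')
    return [i.findtext('title') for i in items]

rss20 = '<rss><channel><item><title>one</title></item></channel></rss>'
rss10 = '<rdf><channel/><item><title>two</title></item></rdf>'
print(item_titles(rss20) + item_titles(rss10))  # ['one', 'two']
```

Checking both locations and concatenating the results means the caller never has to care which flavor of RSS arrived.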

Running this from the command line, with the RSS feed from http://rss.benhammersley.com/index.xml, produces the result shown in Example 8-6.

Example 8-6. Content_Syndication_with_RSS.html
<div>
<a href="http://rss.benhammersley.com/">Content Syndication with XML and RSS</a>
</div>
<div>
<ul>
<li><a href="http://rss.benhammersley.com/archives/001150.html">PHP parsing of RSS</a></li>
<li><a href="http://rss.benhammersley.com/archives/001146.html">RSS for Pocket PC</a></li>
<li><a href="http://rss.benhammersley.com/archives/001145.html">Syndic8 is One</a></li>
<li><a href="http://rss.benhammersley.com/archives/001141.html">RDF mod_events</a></li>
<li><a href="http://rss.benhammersley.com/archives/001140.html">RSS class for cocoa</a></li>
<li><a href="http://rss.benhammersley.com/archives/001131.html">Creative Commons RDF</a></li>
<li><a href="http://rss.benhammersley.com/archives/001129.html">RDF events in Outlook.</a></li>
<li><a href="http://rss.benhammersley.com/archives/001128.html">Reading Online News</a></li>
<li><a href="http://rss.benhammersley.com/archives/001115.html">Hep messaging server</a></li>
<li><a href="http://rss.benhammersley.com/archives/001109.html">mod_link</a></li>
<li><a href="http://rss.benhammersley.com/archives/001107.html">Individual Entries as RSS 1.0</a></li>
<li><a href="http://rss.benhammersley.com/archives/001105.html">RDFMap</a></li>
<li><a href="http://rss.benhammersley.com/archives/001104.html">They're Heeereeee</a></li>
<li><a href="http://rss.benhammersley.com/archives/001077.html">Burton Modules</a></li>
<li><a href="http://rss.benhammersley.com/archives/001076.html">RSS within XHTML documents UPDATED</a></li>
</ul>
</div>

You can then include this inside another page using server-side inclusion (described later in this chapter).

After all this detailing of additional elements, I hear you cry, where are they? Well, including extra elements in a script of this sort is rather simple. Here I've taken another look at the second foreach loop from the previous example. Notice the newly added sections handling dc:creator and description:

foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a>";
    if ($item->{'dc:creator'}) {
        print OUTPUTFILE '<span>Written by ';
        print OUTPUTFILE "$item->{'dc:creator'}";
        print OUTPUTFILE '</span>';
    }
    print OUTPUTFILE "<ol><blockquote>$item->{'description'}</blockquote></ol>";
    print OUTPUTFILE "\n</li>\n";
}

This section now looks inside the RSS feed for a dc:creator element and displays it if it finds one. It also retrieves the contents of the description element and displays it as a nested item in the list. You might want to change this formatting, obviously.

By repeating this pattern, it is easy to add support for different elements as you see fit, and it's also simple to give each new element its own div or span class to control the onscreen formatting. For example:

if ($item->{'dc:creator'}) {
    print OUTPUTFILE '<span>Written by ';
    print OUTPUTFILE "$item->{'dc:creator'}";
    print OUTPUTFILE '</span>';
}
if ($item->{'dc:date'}) {
    print OUTPUTFILE '<span>Date: ';
    print OUTPUTFILE "$item->{'dc:date'}";
    print OUTPUTFILE '</span>';
}
if ($item->{'annotate:reference'}) {
    print OUTPUTFILE '<span><a href="';
    print OUTPUTFILE "$item->{'annotate:reference'}->{'rdf:resource'}";
    print OUTPUTFILE '">Comment on this</a></span>';
}

Most XML parsers found in scripting languages (Perl, Python, etc.) are really interfaces for Expat, the powerful XML parsing library. They therefore require Expat to be installed. Expat is available from http://expat.sourceforge.net/ and is released under the MIT License.


As you can see, the final extension prints the contents of the annotate:reference element. This, as mentioned in Chapter 7, is a single rdf:resource attribute. Note the way I get XML::Simple to read the attribute: it treats the attribute as just another leaf on the tree, so you call it the same way you would a subelement. You can use the same syntax for any attribute-only element.
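Other tree parsers expose attribute-only elements just as readily, though usually through a dedicated accessor rather than the hash-like lookup XML::Simple uses. In Python's xml.etree, for instance (with colons replaced by underscores in this toy markup to sidestep namespace handling):

```python
import xml.etree.ElementTree as ET

item = ET.fromstring(
    '<item><annotate_reference rdf_resource="http://example.org/comment"/></item>')
# The attribute value comes back from .get(), much as a subelement would.
ref = item.find('annotate_reference').get('rdf_resource')
print(ref)  # http://example.org/comment
```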



    Developing Feeds with RSS and Atom
    ISBN: 0596008813