15.2. Working with RSS and Atom

Time-sensitive content on the Internet is spread through what we call web feeds, syndication feeds, or simply feeds. These are usually in some dialect of XML.

Probably the most common of these formats is RSS. The abbreviation has been expanded as Rich Site Summary, RDF Site Summary, or Really Simple Syndication, depending on the version of the standard (RDF itself stands for Resource Description Framework).

So many things on the Web are temporary or transient. Blog entries, zine articles, and many more things are recognized as being short-term in nature. A web feed is a natural way to distribute or syndicate such things.

The Atom format is another popular format that many people say is superior. The trend now is not to think in terms of "an RSS feed" or "an Atom feed" but to let everything be simply "a feed."

Let's look briefly at processing both RSS and Atom. The former can be done with a Ruby standard library, but the latter requires another library not included with Ruby.

15.2.1. The rss Standard Library

RSS is XML-based, so you could simply parse it as XML. However, the fact that it is slightly higher-level makes it appropriate to have a dedicated parser for the format. Furthermore, the messiness of the RSS standard is legendary, and it is not unusual at all for broken software to produce RSS that a parser may have great difficulty parsing.

This problem is compounded by incompatible versions of the standard; the common ones are 0.9, 1.0, and 2.0. The RSS versions, like the manufacturing of hot dogs, are something whose details you don't want to know unless you must.

Ruby has a standard RSS library that handles versions 0.9, 1.0, and 2.0 of the standard. The different versions, in fact, are handled seamlessly wherever possible; the library will detect the version of the input document if you don't specify it.

Let's look at an example. Here we take the feed from http://www.marstoday.com and print the titles of the items in the feed:

    require 'rss'
    require 'open-uri'

    URL = "http://www.marstoday.com/rss/mars.xml"

    # On Ruby 3.0 and later, call URI.open(URL) instead of open(URL)
    open(URL) do |h|
      resp = h.read
      result = RSS::Parser.parse(resp,false)
      puts "Channel: #{result.channel.title}"
      result.items.each_with_index do |item,i|
        i += 1
        puts "#{i}  #{item.title}"
      end
    end

Before going any further, let me talk about courtesy to feed providers. A program like the preceding one should be run with caution because it uses the provider's bandwidth. In any real application, such as an actual feed aggregator, caching should always be done. But that is beyond the scope of these simple examples.
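As a rough illustration of the caching idea, here is a minimal sketch of a file-based cache. The method name cached_fetch and its cache_dir and max_age parameters are hypothetical, not part of any library, and the block stands in for whatever code actually retrieves the feed; a real aggregator would more likely rely on HTTP conditional requests.

```ruby
require 'digest/sha2'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: returns the cached body for url if the cached copy
# is younger than max_age seconds; otherwise calls the block to fetch a
# fresh copy and stores it on disk.
def cached_fetch(url, cache_dir:, max_age: 3600)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA256.hexdigest(url))
  if File.exist?(path) && (Time.now - File.mtime(path)) < max_age
    return File.read(path)
  end
  body = yield(url)
  File.write(path, body)
  body
end

# Usage with a stand-in fetcher, so no network traffic occurs
fetch_count = 0
cache_dir = Dir.mktmpdir
2.times do
  cached_fetch("http://example.org/feed.xml", cache_dir: cache_dir) do |url|
    fetch_count += 1
    "<rss/>"   # placeholder body
  end
end
puts fetch_count   # the stand-in fetcher ran only once
```

The second call finds a fresh file on disk and never invokes the block, so the provider would be contacted only once per hour under these assumed settings.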

In the preceding code, we are using the open-uri library for convenience. This is explained in greater detail in Chapter 18, "Network Programming"; for now, just be aware that it enables us to use the open method on a URI much as if it were a simple file.

Note how the RSS parser retrieves the channel for the RSS feed; our code then prints the title associated with that channel. There is also a list of items (retrieved by the items accessor), which can be thought of as a list of articles. Our code retrieves the entire list and prints the title of each one.
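To experiment with the channel and items accessors without touching anyone's server, the parser can be given a string directly. The following is a sketch using a made-up feed (the SAMPLE_FEED constant and its contents are assumptions for illustration). The second argument to RSS::Parser.parse is false here, as in the earlier code, which turns off validation of the document.

```ruby
require 'rss'

# A made-up RSS 2.0 document used only for illustration
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example Channel</title>
      <link>http://example.org/</link>
      <description>A sample feed</description>
      <item>
        <title>First item</title>
        <link>http://example.org/1</link>
      </item>
      <item>
        <title>Second item</title>
        <link>http://example.org/2</link>
      </item>
    </channel>
  </rss>
XML

# Parse without validation and walk the channel and its items
sample = RSS::Parser.parse(SAMPLE_FEED, false)
puts "Channel: #{sample.channel.title}"
sample.items.each_with_index do |item, i|
  puts "#{i + 1}  #{item.title}"
end
```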

Of course, the output from this is highly time-sensitive; at the time I ran it, this was the partial output:

    Channel: Mars Today Top Stories
    1  NASA Mars Picture of the Day: Lava Levees
    2  NASA Mars Global Surveyor TES Dust And Temperature Maps 25 June - 2 July 2006
    3  Mars Institute Core Team Arrives at the HMP Research Station on Devon Island
    4  Assessment of NASA's Mars Architecture 2007-2016
    5  NASA Mars Picture of the Day: Rush Hour

It's also possible to generate RSS (see Listing 15.5). This procedure is basically the reverse of the process in the previous code fragment.

Listing 15.5. Creating an RSS Feed

    require 'rss'

    feed = RSS::Rss.new("2.0")

    # The channel; RSS 2.0 requires a title, link, and description
    chan = RSS::Rss::Channel.new
    chan.title = "Feed Your Head"    # placeholder title, added for validity
    chan.description = "Feed Your Head"
    chan.link = "http://nosuchplace.org/home/"

    # An optional image for the channel
    img = RSS::Rss::Channel::Image.new
    img.url = "http://nosuchplace.org/images/headshot.jpg"
    img.title = "Y.T."
    img.link = chan.link

    chan.image = img
    feed.channel = chan

    # The individual items (articles)
    i1 = RSS::Rss::Channel::Item.new
    i1.title = "Once again, here we are"
    i1.link = "http://nosuchplace.org/articles/once_again/"
    i1.description = "Don't you feel more like you do now than usual?"

    i2 = RSS::Rss::Channel::Item.new
    i2.title = "So long, and thanks for all the fiche"
    i2.link = "http://nosuchplace.org/articles/so_long_and_thanks/"
    i2.description = "I really miss the days of microfilm..."

    i3 = RSS::Rss::Channel::Item.new
    i3.title = "One hand clapping"
    i3.link = "http://nosuchplace.org/articles/one_hand_clapping/"
    i3.description = "Yesterday I went to an amputee convention..."

    feed.channel.items << i1 << i2 << i3

    puts feed

Most of the code in Listing 15.5 is intuitive. We create an empty RSS 2.0 feed (along with an empty channel and an empty image) and add data to these objects by means of accessors. The image is assigned to the channel, and the channel is assigned to the feed.

Finally we create a series of items and assign these to the feed. It is worth mentioning that the series of appends onto feed.channel.items is actually necessary. It's tempting to try the simple approach:

feed.channel.items = [i1,i2,i3]

However, this doesn't work; the Channel class, for whatever reason, does not have an items= accessor. We could say items[0] = i1 and so on, which would work well in a loop, the way we would do it in real life. There might be still other ways to accomplish this, but the technique used here works fine.
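In a loop, the same appending technique might look like the following sketch. The articles array is a hypothetical stand-in for whatever data source a real program would use, and the channel title is a placeholder; this version appends with << rather than assigning by index.

```ruby
require 'rss'

loop_feed = RSS::Rss.new("2.0")
chan = RSS::Rss::Channel.new
chan.title = "Feed Your Head"        # placeholder title (assumption)
chan.link = "http://nosuchplace.org/home/"
chan.description = "Feed Your Head"
loop_feed.channel = chan

# Title/link pairs standing in for a real data source
articles = [
  ["Once again, here we are",
   "http://nosuchplace.org/articles/once_again/"],
  ["So long, and thanks for all the fiche",
   "http://nosuchplace.org/articles/so_long_and_thanks/"],
  ["One hand clapping",
   "http://nosuchplace.org/articles/one_hand_clapping/"]
]

# Build one item per article and append it to the channel
articles.each do |title, link|
  item = RSS::Rss::Channel::Item.new
  item.title = title
  item.link = link
  loop_feed.channel.items << item
end

puts loop_feed
```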

The rss library has many other features, most of which are not yet well documented. If you can't find the features you're looking for, a last resort is to scan the source code to see how it works.

Many people prefer Atom to RSS. The rss library doesn't handle Atom, but the excellent (nonstandard) library feedtools does. We'll look at it in the upcoming section.

15.2.2. The feedtools Library

The feedtools library (available as a gem) is the work of Bob Aman. It works with RSS and Atom in a more or less seamless way, storing all feeds in a common internal format (based primarily on Atom). It also has its own URI-handling code, so you don't have to use net/http or open-uri explicitly.

Here is a simple example corresponding to the first example in the previous section:

    require 'feed_tools'

    URL = "http://www.marstoday.com/rss/mars.xml"
    feed = FeedTools::Feed.open(URL)
    puts "Description: #{feed.title}\n"
    feed.entries.each_with_index {|x,i| puts "#{i+1}  #{x.title}" }

This is arguably a little more concise and clear than the other example. Some things might be less than clear; for example, there is no explicit channel method for a feed object. However, you can call methods such as title and description directly on the feed object because a feed is a single channel.

Here's an example that retrieves an Atom feed instead:

    require 'feed_tools'

    URL = "http://www.atomenabled.org/atom.xml"
    feed = FeedTools::Feed.open(URL)
    puts "Description: #{feed.title}\n"
    feed.entries.each_with_index {|x,i| puts "#{i+1}  #{x.title}" }

Notice how the only line that changed is the URL itself! This is a good thing, in case you were wondering; it means that we can process feeds more or less independently of the format in which they are stored. The output, of course, looks similar to what we saw previously:

    Description: AtomEnabled.org
    1  AtomEnabled's Atom Feed
    2  Introduction to Atom
    3  Moving from Atom 0.3 to 1.0
    4  Atom 1.0 is Almost Final
    5  Socialtext Supports Atom

Once again, let me stress that you shouldn't waste any feed provider's bandwidth. If you are doing a real application, it should handle caching appropriately; and if you are only doing testing, it's best to use a feed of your own. The feedtools library supports fairly sophisticated database-driven caching which should be adequate in most cases.

Now let's take the previous example and add two lines to it:

    str = feed.build_xml("rss",2.0)
    puts str

What we've done here is "translate" the Atom feed to an RSS 2.0 feed. The same could be done for RSS 0.9 or RSS 1.0, of course; in fact, conversion can go the other direction also: We can read an RSS feed and produce an Atom feed. This is one of the strengths of this library.

At the time of this writing, feedtools was only at version 0.2.25; it will likely continue to change its API and feature set.
