A Simple Feed Aggregator

Credit: Rod Gaither

XML is the basis for many specialized languages. One of the most popular is RSS, an XML format often used to store lists of articles from web pages. With a tool called an aggregator, you can collect weblog entries and articles from several web sites' RSS feeds, and read all those web sites at once without having to skip from one to the other. Here, we'll create a simple aggregator in Ruby.

Before aggregating RSS feeds, let's start by reading a single one. Fortunately, we have several options for parsing RSS feeds into Ruby data structures. The Ruby standard library has built-in support for the three major versions of the RSS format (0.9, 1.0, and 2.0). This example uses the standard rss library to parse an RSS 2.0 feed and print out the titles of the items in the feed:

	require 'rss'
	require 'open-uri'

	url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'
	# URI.open comes from open-uri; on Ruby older than 2.5, use open(url).
	feed = RSS::Parser.parse(URI.open(url).read, false)
	puts "=== Channel: #{feed.channel.title} ==="
	feed.items.each do |item|
	  puts item.title
	  puts " (#{item.link})"
	  puts item.description
	end
	# === Channel: O'Reilly Network Articles ===
	# How to Make Your Sound Sing with Vocoders
	# (http://digitalmedia.oreilly.com/2006/03/29/vocoder-tutorial-and-tips.html)
	# …

Unfortunately, the standard rss library is a little out of date. There's a newer syndication format called Atom, which serves the same purpose as RSS, and the rss library doesn't support it. Any serious aggregator must support all the major syndication formats.

So instead, our aggregator will use Lucas Carlson's Simple RSS library, available as the simple-rss gem. This library supports the three main versions of RSS, plus Atom, and it does so in a relaxed way so that ill-formed feeds have a better chance of being read.

Here's the example above, rewritten to use Simple RSS. As you can see, only the name of the class is different:

	require 'simple-rss'
	require 'open-uri'

	url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'
	# SimpleRSS.parse accepts a string or any object that responds to read.
	feed = SimpleRSS.parse(URI.open(url))
	puts "=== Channel: #{feed.channel.title} ==="
	feed.items.each do |item|
	  puts item.title
	  puts " (#{item.link})"
	  puts item.description
	end

Now we have a general method of reading a single RSS or Atom feed. Time to work on some aggregation!

Although the aggregator will be a simple Ruby script, there's no reason not to use Ruby's object-oriented features. Our approach will be to create a class to encapsulate the aggregator's data and behavior, and then write a sample program to use the class.

The RSSAggregator class that follows is a bare-bones aggregator that reads from multiple syndication feeds when instantiated. It uses a few simple methods to expose the data it has read.

	# rss-aggregator.rb - Simple RSS and Atom Feed Aggregator

	require 'simple-rss'
	require 'open-uri'

	class RSSAggregator
	  def initialize(feed_urls)
	    @feed_urls = feed_urls
	    @feeds = []
	    read_feeds
	  end

	  # Download and parse every feed URL.
	  def read_feeds
	    @feed_urls.each { |url| @feeds.push(SimpleRSS.new(URI.open(url).read)) }
	  end

	  # Discard the parsed feeds and download them all again.
	  def refresh
	    @feeds.clear
	    read_feeds
	  end

	  # Print each channel's title and article count.
	  def channel_counts
	    @feeds.each_with_index do |feed, index|
	      channel = "Channel(#{index}): #{feed.channel.title}"
	      articles = "Articles: #{feed.items.size}"
	      puts channel + ', ' + articles
	    end
	  end

	  # Print the article titles for the channel at the given index.
	  def list_articles(id)
	    puts "=== Channel(#{id}): #{@feeds[id].channel.title} ==="
	    @feeds[id].items.each { |item| puts ' ' + item.title }
	  end

	  def list_all
	    @feeds.each_with_index { |f, i| list_articles(i) }
	  end
	end

Now we just need a few more lines of code to instantiate and use an RSSAggregator object:

	test = RSSAggregator.new(ARGV)
	test.channel_counts
	puts "\n"
	test.list_all

Here's the output from a run of the test program against a few feed URLs:

	$ ruby rss-aggregator.rb http://www.rubyriver.org/rss.xml 
	Channel(0): RubyRiver, Articles: 20
	Channel(1): Slashdot, Articles: 10
	Channel(2): O'Reilly Network Articles, Articles: 15
	Channel(3): O'Reilly Network Safari Bookshelf, Articles: 10
	=== Channel(0): RubyRiver ===
	 Mantis style isn't eas…
	 It's wonderful when tw…
	 Red tailed hawk

While a long way from a fully functional RSS aggregator, this program illustrates the basic requirements of any real aggregator. From this starting point, you can expand and refine the features of RSSAggregator.

One very important feature missing from the aggregator is support for the If-Modified-Since HTTP request header. When you call RSSAggregator#refresh, your aggregator downloads the specified feeds, even if it just grabbed the same feeds and none of them have changed since then. This wastes bandwidth.

Polite aggregators keep track of when they last grabbed a certain feed, and when they request it again they make a conditional request by supplying an HTTP request header called If-Modified-Since. The details are a little beyond our scope, but basically the web server serves the requested feed only if it has changed since the last time the RSSAggregator downloaded it.
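The conditional request described above might be sketched as follows, using only the standard net/http library. The helper names conditional_headers and fetch_if_modified are hypothetical and not part of RSSAggregator or Simple RSS; the sketch also assumes the server honors If-Modified-Since and does not follow redirects.

```ruby
require 'net/http'
require 'time'
require 'uri'

# Build the extra request headers for a conditional GET.
# last_fetch is the Time of the previous successful download, or nil
# if the feed has never been fetched. (Hypothetical helper.)
def conditional_headers(last_fetch)
  last_fetch ? { 'If-Modified-Since' => last_fetch.httpdate } : {}
end

# Fetch url and return the feed body, or nil if the server answers
# 304 Not Modified. Redirects are not followed in this sketch.
def fetch_if_modified(url, last_fetch = nil)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  conditional_headers(last_fetch).each { |name, value| request[name] = value }

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end

  case response
  when Net::HTTPNotModified then nil            # nothing new; keep the old copy
  when Net::HTTPSuccess     then response.body  # fresh feed content
  else raise "Unexpected response: #{response.code} #{response.message}"
  end
end
```

An aggregator built this way would remember the time of each successful fetch per URL and reparse a feed only when fetch_if_modified returns a body rather than nil.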

Another important feature our RSSAggregator is missing is the ability to store the articles it fetches. A real aggregator would store articles on disk or in a database to keep track of which stories are new since the last fetch, and to keep articles available even after they become old news and drop out of the feed.
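One way to approach storage is sketched below, assuming a single JSON file on disk is enough. ArticleStore is a hypothetical class, not part of Simple RSS, and it identifies an article by its link, which is an assumption that may not hold for every feed.

```ruby
require 'json'

# ArticleStore is a hypothetical helper. It remembers every article it
# has been given in a JSON file, keyed by the article's link.
class ArticleStore
  def initialize(path)
    @path = path
    @articles = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  # Record the given articles and return only the ones not seen before.
  # Each article is a hash with 'title' and 'link' keys.
  def add(articles)
    fresh = articles.select do |article|
      next false if @articles.key?(article['link'])
      @articles[article['link']] = article
      true
    end
    save unless fresh.empty?
    fresh
  end

  # Number of articles stored so far, including ones no longer in any feed.
  def size
    @articles.size
  end

  private

  def save
    File.write(@path, JSON.pretty_generate(@articles))
  end
end
```

After each fetch, the aggregator could convert a feed's items to hashes, for example with feed.items.map { |item| { 'title' => item.title, 'link' => item.link } }, and pass them to add; the return value is the list of stories that are new since the last fetch.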

Our simple aggregator counts the articles and lists their titles for review, but it doesn't actually provide access to the article detail. As seen in the first example, each feed item has a link attribute containing the URL for the article, and a description attribute containing the (possibly HTML) body of the article. A real aggregator might generate a list of articles in HTML format for use in a browser, or convert the body of each article to text for output to a terminal.
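As a rough illustration of the HTML option, the sketch below renders a channel's articles as a linked list. The Article struct is a stand-in for a parsed feed item, and articles_to_html and h are hypothetical helper names. The description is deliberately left out, since feed-supplied HTML would need sanitizing before being embedded in a page.

```ruby
require 'erb'

# Article is a stand-in for a parsed feed item; any object that responds
# to title and link would work the same way.
Article = Struct.new(:title, :link)

# Escape text for safe inclusion in HTML.
def h(text)
  ERB::Util.html_escape(text)
end

# Render a channel's articles as an HTML fragment: a heading followed
# by an unordered list of links.
def articles_to_html(channel_title, articles)
  rows = articles.map do |article|
    %(  <li><a href="#{h(article.link)}">#{h(article.title)}</a></li>)
  end
  ["<h2>#{h(channel_title)}</h2>", '<ul>', *rows, '</ul>'].join("\n")
end

articles = [Article.new('Tips & Tricks', 'http://example.com/tips?a=1&b=2')]
puts articles_to_html('Example Channel', articles)
```

Escaping the title and link matters because feed content is untrusted input; an unescaped ampersand or angle bracket in a title would otherwise corrupt the generated page.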

See Also

  • Recipe 14.1, "Grabbing the Contents of a Web Page"
  • Recipe 14.3, "Customizing HTTP Request Headers"
  • Recipe 11.15, "Converting HTML Documents from the Web into Text"
  • A good comparison of the RSS and Atom formats (http://www.intertwingly.net/wiki/pie/Rss20AndAtom10Compared)
  • Details on the Simple RSS project (http://simple-rss.rubyforge.org/)
  • The FeedTools project has a more sophisticated aggregator library that supports caching and If-Modified-Since; see http://sporkmonger.com/projects/feedtools/ for details
  • "HTTP Conditional Get for RSS Hackers" is a readable introduction to If-Modified-Since (http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers)



Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696