Reading RSS and Atom | Professional XML (Programmer to Programmer)

Reading RSS basically breaks down into two main activities: parsing the channel information and parsing the items. Generally, it is the items that are more important. If you are parsing only one or two feeds, or feeds from the same source, it can be a fairly easy process. However, if you are trying to create a generic RSS reader, or even just a reader that works with both RSS 1.0 and 2.0 feeds, it is a different matter. Some feeds omit the date entirely; others omit pubDate and add a Dublin Core date (dc:date) instead. Similarly, in some feeds the guid element holds a URN, whereas in others it holds a URL. Some put the entire post into the description, whereas others include a content:encoded element. In short, writing a good, general-purpose RSS parser takes care, and the lack of a DTD or XML Schema doesn't make things easier. It is not impossible or even really difficult, but you must be aware of the variability when writing the parser and test it on multiple feeds.

In addition to the general RSS variability, RSS 1.0 and 2.0 have radically different structures. If you are given only the URL of a feed, you should first determine which of the two you have, using either the MIME type or the document itself. The MIME type of an RSS 1.0 document should be application/rdf+xml, whereas the MIME type of an RSS 2.0 document should be application/rss+xml. (Atom documents should be application/atom+xml, for those who don't want to extrapolate.) However, many feeds are actually served with the MIME type text/xml, so you can't use MIME type alone to differentiate the feed type. The root node identifies most documents: RSS 2.0 uses rss as the root node, RSS 1.0 uses RDF, and Atom uses feed. Further differentiating the feed type usually isn't necessary, as valid 0.91 (Userland) and 0.92 feeds should also be valid RSS 2.0 feeds.
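The root-node check described above can be sketched in a few lines of Java. This is only a minimal illustration, not a complete detector: the class name FeedSniffer, the method names, and the returned labels are all invented for this example. It uses the standard StAX pull parser to read only as far as the root element and ignores the MIME type entirely.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: guesses the feed format from the root element alone.
final class FeedSniffer {
    static final String ATOM_10_NS = "http://www.w3.org/2005/Atom";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Reads up to the first start element and classifies the document. */
    static String detect(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Feeds come from untrusted sources, so do not resolve external entities.
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return classify(reader.getLocalName(), reader.getNamespaceURI());
                    }
                }
                return "unknown";
            } finally {
                reader.close();
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable XML document", e);
        }
    }

    /** Maps a root element name and namespace to a format label. */
    static String classify(String localName, String namespace) {
        if ("rss".equals(localName)) {
            return "RSS 2.0"; // 0.91 and 0.92 feeds share this root element
        }
        if ("RDF".equals(localName) && RDF_NS.equals(namespace)) {
            return "RSS 1.0";
        }
        if ("feed".equals(localName)) {
            return ATOM_10_NS.equals(namespace) ? "Atom 1.0" : "Atom (other version)";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(detect("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(detect("<feed xmlns=\"" + ATOM_10_NS + "\"/>"));
    }
}
```

In a real reader you would pass the response stream rather than a string, and you might still consult the MIME type as a first hint before falling back to the root element.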

Reading Atom is similar to reading RSS, except that it is a much more predictable affair. Because the Atom specification is much easier to interpret than the RSS 2.0 specification, developers are more likely to get it right. The one major cause of errors is that many feeds are still in Atom 0.3 format, the version that was available before ratification of the standard. That format is mostly compatible with Atom 1.0, but a few notable differences exist. The following table outlines these differences.

Item	Notes
namespace	In 0.3, the namespace was http://purl.org/atom/ns#, whereas for 1.0, it is http://www.w3.org/2005/Atom.
version	0.3 required the addition of a version attribute (as in RSS 2.0). This requirement has been removed.
subtitle	This element was named `tagline` in 0.3.
rights	This element was named `copyright` in 0.3.
updated	This element was named `modified` in 0.3.
published	This element was named `issued` in 0.3.
category	This element did not exist in 0.3.
icon	This element did not exist in 0.3.
logo	This element did not exist in 0.3.
source	This element did not exist in 0.3.
id	This element was optional for the feed in 0.3; now it is required for both feed and entry elements.
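One way to cope with the renamed elements is to normalize Atom 0.3 names to their 1.0 equivalents as you read them, so the rest of your code only deals with one vocabulary. The following is a minimal sketch of that idea; the class AtomNames and its method are invented for this example, and the map covers only the four renames listed in the table.

```java
import java.util.Map;

// Hypothetical helper: translates Atom 0.3 element names to Atom 1.0 names.
final class AtomNames {
    // Only the renames listed in the table above; everything else passes through.
    private static final Map<String, String> RENAMED = Map.of(
        "tagline", "subtitle",
        "copyright", "rights",
        "modified", "updated",
        "issued", "published");

    /** Returns the Atom 1.0 name for a 0.3 element, or the name unchanged. */
    static String toAtom10(String localName) {
        return RENAMED.getOrDefault(localName, localName);
    }

    public static void main(String[] args) {
        System.out.println(toAtom10("tagline")); // subtitle
        System.out.println(toAtom10("title"));   // title
    }
}
```

A parser would apply toAtom10 to each element's local name after checking which of the two namespaces the feed uses.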

Reading with .NET

Either of the two .NET idioms for dealing with XML (XmlReader and XmlDocument) can be used for reading RSS and Atom documents. The benefits and consequences of each of these technologies are discussed in the following sections.

It might be tempting to make assumptions about the structure of the document, such as assuming that the title element is always the first child of an item. However, unless you have created all of the feeds you are processing, this will likely cause a break, quickly, and at the moment that puts you in the worst light (like during a demo to your CEO). Either use XPath statements to retrieve the appropriate elements or use conditional logic to retrieve the correct elements.
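As a rough illustration of looking elements up by name rather than by position, the following Java sketch uses the standard XPath API to fetch a field of the first item in an RSS 2.0 feed, trying several candidate elements in order. The class ItemFields and the method firstItemField are invented for this example, and the sketch assumes the feed's own elements are in no namespace (as in RSS 2.0) while the Dublin Core date uses the usual dc namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: looks up item fields by name, never by position.
final class ItemFields {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Maps the "dc" prefix used in the XPath expressions to the Dublin Core namespace.
    private static final NamespaceContext NAMESPACES = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return DC_NS.equals(uri) ? "dc" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    };

    /** Returns the text of the first candidate element found in the first item, or "". */
    static String firstItemField(String rssXml, String... candidates) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so dc:date is matched by namespace
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(NAMESPACES);
            for (String candidate : candidates) {
                String text = xpath.evaluate("/rss/channel/item[1]/" + candidate, doc).trim();
                if (!text.isEmpty()) {
                    return text;
                }
            }
            return "";
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read the feed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss version=\"2.0\" xmlns:dc=\"" + DC_NS + "\"><channel><item>"
            + "<link>http://example.com/1</link><title>Hello</title>"
            + "<dc:date>2006-01-02T03:04:05Z</dc:date></item></channel></rss>";
        System.out.println(firstItemField(xml, "title"));
        System.out.println(firstItemField(xml, "pubDate", "dc:date"));
    }
}
```

Because the lookup is by name, it does not matter that title is the second child of the item here, and the date falls back to dc:date when pubDate is missing.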

XmlDocument

XmlDocument is probably the one many people start with, loading the document into a DOM for processing. It certainly has the benefit of ease of use and familiarity. After the document is loaded, you can use the SelectNodes and SelectSingleNode methods to extract the nodes you'd like. Alternatively, you can walk the DOM, processing the feed as needed.

Although XmlDocument is simple to use, it is often overkill for reading a feed. It takes time and memory to build up the DOM structure in memory. In addition, the DOM is best when you need bi-directional access to the content. If you intend to go forward only through the feed, you are probably better served by the XmlReader. Listing 18-4 shows a simple Console application that uses XmlDocument to display information from an Atom 1.0 feed.

Listing 18-4: Using XmlDocument to read Atom 1.0

      using System;
      using System.Xml;

      class Reader {
          [STAThread]
          static void Main(string[] args) {
              if (args.Length < 1) {
                  Console.WriteLine("Simple Atom Reader");
                  Console.WriteLine("Usage: simplereader <URL to Atom 1.0 feed>");
                  Console.WriteLine("\tsimplereader http://www.oreillynet.com/pub/feed/20");
              } else {
                  XmlDocument doc = new XmlDocument();
                  doc.Load(args[0]);
                  XmlNamespaceManager mgr = new XmlNamespaceManager(doc.NameTable);
                  mgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
                  XmlElement root = doc.DocumentElement;

                  //display some items from the feed
                  Console.WriteLine("Feed Information");
                  Console.WriteLine("Title:\t\t{0}",
                     GetText(root, "atom:title", mgr));
                  Console.WriteLine("Subtitle:\t{0}",
                     GetText(root, "atom:subtitle", mgr));
                  Console.WriteLine("URL:\t\t{0}",
                     GetText(root, "atom:link[@rel='self']/@href", mgr));
                  Console.WriteLine("Items");
                  foreach (XmlNode item in doc.SelectNodes("//atom:entry", mgr)) {
                     Console.WriteLine("\tTitle:\t\t{0}",
                         GetText(item, "atom:title", mgr));
                     Console.WriteLine("\tLink:\t\t{0}",
                         GetText(item, "atom:link/@href", mgr));
                     XmlNode contentNode = item.SelectSingleNode("atom:summary", mgr);
                     //default to showing the summary,
                     // but show the content if it is available
                     if (null == contentNode) {
                         contentNode = item.SelectSingleNode("atom:content", mgr);
                     }
                     if (null != contentNode) {
                         Console.WriteLine(contentNode.InnerText);
                     }
                     Console.WriteLine();
                  }
                  Console.WriteLine("Press Enter to end program");
                  Console.ReadLine();
              }
          }

          //returns the text of the first node matching the XPath,
          // or an empty string if the feed omits that node
          static string GetText(XmlNode context, string xpath,
              XmlNamespaceManager mgr) {
              XmlNode node = context.SelectSingleNode(xpath, mgr);
              return (null == node) ? String.Empty : node.InnerText;
          }
      }

Because Atom actually declares a namespace, it's best to use an XmlNamespaceManager when working with Atom documents through the DOM. The XmlNamespaceManager makes it easier to identify elements when you are using the SelectNodes and SelectSingleNode methods.

Displaying safe RSS

As RSS and Atom become more popular and because it is so easy to simply display the existing feed content on Web pages, it is becoming more and more important to ensure what you display is safe. Many HTML tags can be used to hijack pages, and these should be stripped before displaying the page. Although it is not absolutely necessary when redisplaying feeds you trust (ones you've created and no others), it is incredibly important to remove tags that might allow someone to alter or break your pages. There are three main solutions. I'll call them Safest, Safer, and Safish.

The safest solution is to remove all HTML tags from the content. However, this is hardly a useful suggestion in most cases because the tags (especially links) are the most useful part of the post. The next (Safer) solution is to wrap the displayed content in an IFrame with the security="restricted" attribute. This limits the capabilities of the code, preventing script from running and the display of new browser windows. However, this attribute has limited availability (only in Internet Explorer 6.0 SP1); therefore, it is not a valid solution except for intranet scenarios. That leaves the Safish solution as likely the best option in most scenarios. Before displaying HTML from an arbitrary feed, you should strip out the elements and attributes listed in the following table.

Element/Attribute	Reason
script	Running any form of script from unknown sources is a request to have your Web site hacked.
object embed applet	These tags add ActiveX, Java, or other non-HTML elements to the page. ActiveX objects, in particular, are dangerous because they have full access to the client machine; however, any of these items can perform nasty tricks by being embedded in a browser.
frame iframe frameset	These tags allow for the addition of sub pages to a Web page. Apart from likely breaking the layout of your page, they can be used by an unscrupulous feed provider to execute code or otherwise manipulate the client.
on{something}	Similar to script removal, action attributes (such as onclick, onblur, or others) can be used to execute script.
meta link	These tags should only appear in the head of HTML pages and not in RSS feeds. In addition, browsers should not interpret them if they are in the body of a page (as they would be if they are from an RSS or Atom feed). However, to be safe, they should be stripped.
style	Both style elements and attributes can be used to import graphics and other items that can have a detrimental effect on your pages. Although most added style information is harmless, it is still better to be safe. Even more or less harmless style information can have a detrimental effect if it overrides a global style. For example, adding "`a {color: #fff;}`" to the styles can cause all anchor tags to change in appearance.
img a	Although these tags can both be used to execute mischief on your Web pages, their removal is only optional. Likely, these tags are the main reasons you're thinking of displaying the RSS feed(s). Therefore, I'll leave this decision up to you. (Personally, I'd leave them in, but only link to RSS feeds I trusted.)
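The Safish approach can be sketched with the standard Java DOM classes. Treat the following as a rough illustration under several assumptions rather than a production sanitizer: it assumes the feed content is well-formed XHTML (real-world HTML needs a tolerant HTML parser), the class name FeedHtmlCleaner is invented for this example, and it does not inspect attribute values, so something like a javascript: URL in an href would pass through. It removes the elements in the table (leaving img and a in place) along with any style attribute and any attribute whose name begins with on.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: strips the elements and attributes listed in the table.
final class FeedHtmlCleaner {
    private static final Set<String> BLOCKED = Set.of(
        "script", "object", "embed", "applet",
        "frame", "iframe", "frameset", "meta", "link", "style");

    /** Returns the fragment, wrapped in a div, with blocked markup removed. */
    static String clean(String xhtmlFragment) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Wrap in a div so a fragment with several top-level nodes still parses.
            Document doc = factory.newDocumentBuilder().parse(
                new InputSource(new StringReader("<div>" + xhtmlFragment + "</div>")));
            scrub(doc.getDocumentElement());
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Content is not well-formed XHTML", e);
        }
    }

    private static void scrub(Element element) {
        // Walk attributes backwards because the map shrinks as attributes are removed.
        NamedNodeMap attributes = element.getAttributes();
        for (int i = attributes.getLength() - 1; i >= 0; i--) {
            String name = attributes.item(i).getNodeName();
            String lower = name.toLowerCase(Locale.ROOT);
            if (lower.startsWith("on") || lower.equals("style")) {
                element.removeAttribute(name);
            }
        }
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child instanceof Element) {
                if (BLOCKED.contains(child.getNodeName().toLowerCase(Locale.ROOT))) {
                    element.removeChild(child);
                } else {
                    scrub((Element) child);
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(clean("<p onclick=\"evil()\">Hello <a href=\"http://example.com/\">"
            + "link</a><script>alert(1)</script></p>"));
    }
}
```

A list of allowed elements and attributes is generally a safer design than a list of blocked ones, because anything you forgot to block gets through; the blocked list is used here only because it mirrors the table.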

XmlReader

XmlReader, as you saw in Chapter 16, is the low-level pull parser in .NET. You use the methods of the XmlReader to pull elements and attributes from the XML. XmlReader is a forward-only parser, meaning that you cannot go backwards through the XML. Because only a small fraction of the content is in memory at any time, XmlReader excels in both speed and memory usage when working with large documents. Listing 18-5 shows using XmlReader to read RSS 2.0 content into the RssFeed class. (To save space, not all properties are shown.)

Listing 18-5: Reading RSS 2.0 with XmlReader

      using System;
      using System.Collections.Generic;
      using System.Text;
      using System.Xml;
      using System.IO;

      namespace Wrox.ProXml {
          public class RssFeed {
              #region Private Members
              private Dictionary<String, String> _properties =
                  new Dictionary<String, String>();
              private List<RssEntry> _entries = new List<RssEntry>();
              #endregion

              #region C'tors
              public RssFeed() {
                  //set up default properties
                  this.Title = String.Empty;
                  this.Link = String.Empty;
                  this.Description = String.Empty;
                  this.Language = "en-us";
                  this.Copyright = "copyright 2006";
                  this.ManagingEditor = string.Empty;
                  this.PubDate = DateTime.Now.ToString("R");
              }
              #endregion

              #region Properties
              public Dictionary<String, String> Properties {
                  get { return _properties; }
              }
              public List<RssEntry> Entries {
                  get { return _entries; }
              }
              public string Title {
                  get { return this.Properties["title"]; }
                  set { this.Properties["title"] = value; }
              }
              public string Link {
                  get { return this.Properties["link"]; }
                  set { this.Properties["link"] = value; }
              }
              public string Description {
                  get { return this.Properties["description"]; }
                  set { this.Properties["description"] = value; }
              }
              public string Language {
                  get { return this.Properties["language"]; }
                  set { this.Properties["language"] = value; }
              }
              public string Copyright {
                  get { return this.Properties["copyright"]; }
                  set { this.Properties["copyright"] = value; }
              }
              public string ManagingEditor {
                  get { return this.Properties["managingEditor"]; }
                  set { this.Properties["managingEditor"] = value; }
              }
              public string PubDate {
                  get { return this.Properties["pubDate"]; }
                  set { this.Properties["pubDate"] = value; }
              }
              #endregion

              #region Read XML
              public void Load(String filename) {
                  XmlReaderSettings settings = new XmlReaderSettings();
                  settings.CheckCharacters = true;
                  settings.CloseInput = true;
                  settings.IgnoreWhitespace = true;
                  //dispose of the reader (and its input) when parsing is done
                  using (XmlReader reader = XmlReader.Create(filename, settings)) {
                      this.Load(reader);
                  }
              }

              public void Load(Stream inputStream) {
                  XmlReaderSettings settings = new XmlReaderSettings();
                  settings.CheckCharacters = true;
                  settings.CloseInput = true;
                  settings.IgnoreWhitespace = true;
                  using (XmlReader reader = XmlReader.Create(inputStream, settings)) {
                      this.Load(reader);
                  }
              }

              public void Load(XmlReader inputReader) {
                  inputReader.MoveToContent();
                  //move into the channel
                  while (inputReader.Read()) {
                      if (inputReader.IsStartElement() && !inputReader.IsEmptyElement) {
                          switch (inputReader.LocalName.ToLower()) {
                              case "channel":
                                  //do nothing in this case
                                  break;
                              case "item":
                                  //delegate parsing to the RssEntry class
                                  RssEntry entry = new RssEntry();
                                  entry.Load(inputReader);
                                  this.Entries.Add(entry);
                                  break;
                              default:
                                  //store every other element by name; nested channel
                                  // elements (such as image) are not handled specially
                                  string field = inputReader.LocalName;
                                  this.Properties[field] = inputReader.ReadString();
                                  break;
                          }
                      }
                  }
              }
              #endregion
          }
      }

The class uses a Dictionary to track the properties and a generic List for the child items. The Dictionary allows the class to grow to store whatever properties are needed, including elements from additional namespaces if they are added to the RSS. To make the class friendlier to work with, named properties are also provided for title, link, description, and other elements of the RSS.

The Load(XmlReader) method processes the RSS to populate the Dictionary. The other two Load methods are provided to make it easier for end users to create the XmlReader that processes the RSS. The processing is fairly basic: the content of each child element of the channel element is copied as is into the properties Dictionary. Parsing of the item elements is delegated to the RssEntry class, as shown in Listing 18-6.

Listing 18-6: Reading RSS items with XmlReader

      using System;
      using System.Collections.Generic;
      using System.Text;
      using System.Xml;
      using System.IO;

      namespace Wrox.ProXml {
          public class RssEntry {
              private Dictionary<String, String> _properties =
                  new Dictionary<String, String>();

              public RssEntry() {
                  //set up default properties
                  this.Title = String.Empty;
                  this.Link = String.Empty;
                  this.Description = String.Empty;
              }

              #region Properties
              public Dictionary<String, String> Properties {
                  get { return this._properties; }
              }
              public String Title {
                  get { return this.Properties["title"]; }
                  set { this.Properties["title"] = value; }
              }
              public String Link {
                  get { return this.Properties["link"]; }
                  set { this.Properties["link"] = value; }
              }
              public String Description {
                  get { return this.Properties["description"]; }
                  set { this.Properties["description"] = value; }
              }
              #endregion

              public void Load(System.Xml.XmlReader inputReader) {
                  while (inputReader.Read()) {
                     //stop at the end of the current item
                     if (inputReader.Name == "item" &&
                         inputReader.NodeType == XmlNodeType.EndElement) {
                         break;
                     }
                     if (inputReader.IsStartElement() && !inputReader.IsEmptyElement) {
                         String field = inputReader.LocalName;
                         this.Properties[field] = inputReader.ReadString();
                     }
                  }
              }
          }
      }

As with the RssFeed, a Dictionary is used to store the values of the RSS elements.

After you have the RSS feed loaded into the RssFeed and RssEntry classes, displaying it becomes easy, as Listing 18-7 shows.

Listing 18-7: Using the RssFeed and RssEntry classes

      RssFeed feed = new RssFeed();
      feed.Load(this.UrlField.Text);
      MessageBox.Show(String.Format("Title: {0}\nItems: {1}",
          feed.Title, feed.Entries.Count.ToString()),
          "Feed Information");
      foreach (RssEntry entry in feed.Entries) {
          MessageBox.Show(entry.Description, entry.Title);
      }

Reading RSS with Java

Reading RSS with Java is basically the same as with .NET. One method of reading RSS and Atom popular with Java developers (that is not available as part of the standard .NET class library) is Simple API for XML (SAX). This is an event-based parser for XML. The SAX code reads the XML and calls methods in your code when it encounters new elements, text, or errors in the document. This mechanism makes parsing small documents (like RSS feeds) fast and requires low memory overhead. See Chapter 14 for more details on SAX. Listing 18-8 shows reading RSS 1.0 documents with SAX.

Listing 18-8: Reading RSS 1.0 with SAX

      package com.wrox.proxml;

      import java.io.*;
      import java.util.Stack;
      import javax.xml.parsers.SAXParser;
      import javax.xml.parsers.SAXParserFactory;
      import org.xml.sax.*;
      import org.xml.sax.helpers.DefaultHandler;

      public class RSSReader extends DefaultHandler {
          static private Writer out;
          static String lineEnd = System.getProperty("line.separator");
          // names of the elements that are currently open, innermost on top
          Stack<String> stack = new Stack<String>();
          // text collected for the element most recently opened
          StringBuffer value = null;

          public static void main(String args []) {
              // create an instance of RSSReader
              DefaultHandler handler = new RSSReader();
              try {
                  // Set up output stream
                  out = new OutputStreamWriter(System.out, "UTF-8");
                  // get a SAX parser from the factory
                  SAXParserFactory factory = SAXParserFactory.newInstance();
                  SAXParser saxParser = factory.newSAXParser();
                  // parse the document from the parameter
                  emit("Feed information for " + args[0] + lineEnd);
                  saxParser.parse(args[0], handler);
              } catch (Exception t) {
                  System.err.println(t.getClass().getName());
                  t.printStackTrace(System.err);
              }
          }

          public void startElement(String namespaceURI, String sName,
              String qName, Attributes attrs) throws SAXException {
              String eName = sName; // element name
              if ("".equals(eName)) eName = qName; // namespaceAware = false
              stack.push(eName);
              value = new StringBuffer();
          }

          public void endElement(String namespaceURI, String sName, String qName)
              throws SAXException {
              String element = null;
              String section = null;
              // the element that is ending
              if (!stack.empty()) {
                  element = stack.pop();
              }
              // its parent, which tells us where in the document we are
              if (!stack.empty()) {
                  section = stack.peek();
              }
              if (null != element && null != section) {
                  if (section.equalsIgnoreCase("channel")) {
                      if (element.equalsIgnoreCase("title")) {
                          emit("Title:\t" + value + lineEnd);
                      } else if (element.equalsIgnoreCase("link")) {
                          emit("Link:\t" + value + lineEnd);
                      }
                  } else if (section.equalsIgnoreCase("item")) {
                      if (element.equalsIgnoreCase("title")) {
                          emit("\tTitle:\t" + value + lineEnd);
                      } else if (element.equalsIgnoreCase("description")) {
                          emit("\t" + value + lineEnd);
                      }
                  }
              }
          }

          public void characters(char buf [], int offset, int len)
              throws SAXException {
              String s = new String(buf, offset, len);
              value.append(s);
          }

          private static void emit(String s) throws SAXException {
              try {
                  out.write(s);
                  out.flush();
              } catch (IOException e) {
                  throw new SAXException("I/O error", e);
              }
          }
      }

The code sets up the handlers for startElement and endElement and parses the RSS 1.0 document. At the start of each element, the new element is added to a Stack, and a new StringBuffer is created to hold the contents of the element. Most of the processing takes place in the endElement handler. This pops the element name that is ending off of the stack and determines where in the document the current element is located.