The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming

15.1. Parsing XML with REXML

XML (which "looks like" HTML or SGML) has been popular since the 1990s. It does, in fact, have some good qualities that make it preferable to fixed-column data storage: the fields are given specific names, the overall design makes a hierarchical structure possible, and, most of all, it allows variable-length data.

Four decades ago, of course, memory constraints would have rendered XML largely impractical. But imagine it had been introduced then. The infamous "Y2K" problem gained much press in 1999 (though it turned out to be more of a nuisance than a problem), but with XML, it would not have shown up on anyone's radar. There was a Y2K issue solely because most of our legacy data was stored and manipulated in a fixed-length format. So, for all its shortcomings, XML has its uses. In Ruby, the most common way to manipulate XML is with the REXML library by Sean Russell. Since 2002, REXML (usually pronounced as three syllables, "rex-m-l") has been part of the standard Ruby distribution.

I should point out here that REXML is relatively slow. Whether it is fast enough for your application, only you can say. You may find, however, that as you scale up, you need to use the libxml2 binding (not covered in this book). This binding is of course very fast (being written in C) but is arguably not quite as Ruby-like.

REXML is a pure-Ruby XML processor conforming to the XML 1.0 standard. It is a nonvalidating processor, passing all of the OASIS nonvalidating conformance tests.
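"Nonvalidating" means that REXML does not check a document against a DTD; it still rejects input that is not well-formed. The following is a minimal sketch of that behavior. The helper name well_formed? is made up for this illustration and is not part of REXML.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): returns true if the string
# parses as well-formed XML, false if the parser raises an error.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

p well_formed?("<a><b>text</b></a>")   # => true
p well_formed?("<a><b>text</a>")       # => false (mismatched end tag)
```

Note that a document can pass this check and still violate its DTD; that is precisely what a nonvalidating processor does not detect.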

REXML has multiple APIs. This, of course, is intended for flexibility, not confusion. The two classic APIs may be classified as DOM-like and SAX-like (or tree-based and stream-based). The first of these is a technique wherein the entire file is read into memory and stored in a hierarchical (tree-based) form. The second is a "parse as you go" technique, useful when your documents are large or you have memory limitations; it parses the file as it reads it from disk, and the entire file is never stored in memory. Let's look at these before mentioning the other APIs.

For all our XML code examples, let's use the same simple XML file (shown in Listing 15.1). It represents part of a private library of books.

Listing 15.1. The books.xml File

<library shelf="Recent Acquisitions">
  <section name="Ruby">
    <book isbn="0672328844">
      <title>The Ruby Way</title>
      <author>Hal Fulton</author>
      <description>Second edition. The book you are now reading.
                   Ain't recursion grand?
      </description>
    </book>
  </section>
  <section name="Space">
    <book isbn="0684835509">
      <title>The Case for Mars</title>
      <author>Robert Zubrin</author>
      <description>Pushing toward a second home for the human
                   race.
      </description>
    </book>
    <book isbn="074325631X">
      <title>First Man: The Life of Neil A. Armstrong</title>
      <author>James R. Hansen</author>
      <description>Definitive biography of the first man on
                   the moon.
      </description>
    </book>
  </section>
</library>

15.1.1. Tree Parsing

Let's first parse our XML data in tree fashion. We begin by requiring the rexml/document library; often we do an include REXML to import into the top-level namespace for convenience. Listing 15.2 illustrates a few simple techniques.

Listing 15.2. DOM-like Parsing

require 'rexml/document'
include REXML

input = File.new("books.xml")
doc = Document.new(input)

root = doc.root
puts root.attributes["shelf"]      # Recent Acquisitions

doc.elements.each("library/section") { |e| puts e.attributes["name"] }
# Output:
#   Ruby
#   Space

doc.elements.each("*/section/book") { |e| puts e.attributes["isbn"] }
# Output:
#   0672328844
#   0684835509
#   074325631X

sec2 = root.elements[2]
author = sec2.elements[1].elements["author"].text       # Robert Zubrin

Notice in Listing 15.2 how attributes are represented as a hash. Elements can be accessed via a pathlike notation or by an integer. If you index by integer, remember that the index is 1-based (following XPath), not 0-based as Ruby arrays are.
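To make the indexing rule concrete, here is a small sketch that uses an inline string instead of a file. The constant INDEX_SAMPLE and the helpers section_names and title_at are names invented for this illustration; the document is a cut-down stand-in for books.xml.

```ruby
require 'rexml/document'

# Cut-down sample data for illustration only.
INDEX_SAMPLE = <<~XML
  <library shelf="Recent Acquisitions">
    <section name="Ruby">
      <book isbn="0672328844"><title>The Ruby Way</title></book>
    </section>
    <section name="Space">
      <book isbn="0684835509"><title>The Case for Mars</title></book>
    </section>
  </library>
XML

# Attribute access is hash-like: element.attributes["name"].
def section_names(xml)
  doc = REXML::Document.new(xml)
  names = []
  doc.elements.each("library/section") { |e| names << e.attributes["name"] }
  names
end

# Integer indexes count child elements starting from 1, not 0.
def title_at(xml, section_index, book_index)
  root = REXML::Document.new(xml).root
  root.elements[section_index].elements[book_index].elements["title"].text
end

p section_names(INDEX_SAMPLE)       # => ["Ruby", "Space"]
puts title_at(INDEX_SAMPLE, 1, 1)   # The Ruby Way
puts title_at(INDEX_SAMPLE, 2, 1)   # The Case for Mars
```

Integer indexes count only child elements, so the whitespace text nodes between tags do not shift the numbering.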

15.1.2. Stream Parsing

Suppose that we want to process this same data file in a stream-oriented way. (We probably wouldn't do that in reality because this file is small.) There are variations on this concept, but Listing 15.3 shows one way. The trick is to define a listener class whose methods will be the target of callbacks from the parser.

Listing 15.3. SAX-like Parsing

require 'rexml/document'
require 'rexml/streamlistener'
include REXML

class MyListener
  include REXML::StreamListener

  # Called for each opening tag with the tag name and an attribute hash
  def tag_start(*args)
    puts "tag_start: #{args.map {|x| x.inspect}.join(', ')}"
  end

  # Called for each run of text between tags
  def text(data)
    return if data =~ /\A\s*\z/     # whitespace only
    abbrev = data[0..40] + (data.length > 40 ? "..." : "")
    puts "  text   :   #{abbrev.inspect}"
  end
end

list = MyListener.new
source = File.new "books.xml"
Document.parse_stream(source, list)

The module StreamListener assists in this; basically it provides stubbed or empty callback methods. Any methods you define will override these. When the parser encounters an opening tag, it calls the tag_start method. You can think of this as behaving something like method_missing, with the tag name passed in as a parameter (and all the attributes in a hash). The text method acts similarly; for others, refer to detailed documentation at http://ruby-doc.org or elsewhere.
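A listener usually needs to remember where it is in the document, because each callback sees only one event. The sketch below pairs tag_start with the tag_end callback (also stubbed by StreamListener) to collect the text of every title element. The class TitleCollector, the helper collect_titles, and the sample data are invented for this illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Cut-down sample data for illustration only.
STREAM_SAMPLE = <<~XML
  <library>
    <section name="Ruby">
      <book><title>The Ruby Way</title><author>Hal Fulton</author></book>
    </section>
    <section name="Space">
      <book><title>The Case for Mars</title><author>Robert Zubrin</author></book>
    </section>
  </library>
XML

# Hypothetical listener: gathers the text inside each <title> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false   # true only between <title> and </title>
  end

  def tag_start(name, attrs)
    return unless name == "title"
    @in_title = true
    @titles << String.new   # start a fresh, mutable buffer
  end

  def text(data)
    @titles.last << data if @in_title
  end

  def tag_end(name)
    @in_title = false if name == "title"
  end
end

def collect_titles(xml)
  listener = TitleCollector.new
  REXML::Document.parse_stream(xml, listener)
  listener.titles
end

p collect_titles(STREAM_SAMPLE)   # => ["The Ruby Way", "The Case for Mars"]
```

Only the current flag and the collected strings are kept in memory, which is the point of the stream-based approach.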

In Listing 15.3, which is somewhat contrived, every tag is logged when it is opened, and the enclosed text is logged in the same way. (For simplicity, the text is abbreviated.) The output looks like Listing 15.4.

Listing 15.4. Output from Stream Parsing Example

tag_start: "library", {"shelf"=>"Recent Acquisitions"}
tag_start: "section", {"name"=>"Ruby"}
tag_start: "book", {"isbn"=>"0672328844"}
tag_start: "title", {}
  text   :   "The Ruby Way"
tag_start: "author", {}
  text   :   "Hal Fulton"
tag_start: "description", {}
  text   :   "Second edition. The book you are now read..."
tag_start: "section", {"name"=>"Space"}
tag_start: "book", {"isbn"=>"0684835509"}
tag_start: "title", {}
  text   :   "The Case for Mars"
tag_start: "author", {}
  text   :   "Robert Zubrin"
tag_start: "description", {}
  text   :   "Pushing toward a second home for the huma..."
tag_start: "book", {"isbn"=>"074325631X"}
tag_start: "title", {}
  text   :   "First Man: The Life of Neil A. Armstrong"
tag_start: "author", {}
  text   :   "James R. Hansen"
tag_start: "description", {}
  text   :   "Definitive biography of the first man on ..."

15.1.3. XPath and More

An alternative way to view XML is XPath. This is a kind of pseudo-language that describes how to locate specific elements and attributes in an XML document, treating that document as a logical ordered tree.

REXML has XPath support via the XPath class. It assumes tree-based parsing (document object model) as we saw in Listing 15.2. Refer to the following code:

# (setup omitted)

book1 = XPath.first(doc, "//book")    # Info for first book found
p book1

# Print out all titles
XPath.each(doc, "//title") { |e| puts e.text }

# Get an array of all of the "author" elements in the document.
names = XPath.match(doc, "//author").map {|x| x.text }
p names

The output from the preceding code looks like this:

<book isbn='0672328844'> ... </>
The Ruby Way
The Case for Mars
First Man: The Life of Neil A. Armstrong
["Hal Fulton", "Robert Zubrin", "James R. Hansen"]
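XPath expressions can also filter on attributes by means of a predicate in square brackets. Here is a brief sketch along those lines. The constant XPATH_SAMPLE and the helpers title_for_isbn and titles_in_section are invented for this illustration, and the string interpolation assumes the values contain no quote characters.

```ruby
require 'rexml/document'

# Cut-down sample data for illustration only.
XPATH_SAMPLE = <<~XML
  <library shelf="Recent Acquisitions">
    <section name="Ruby">
      <book isbn="0672328844"><title>The Ruby Way</title></book>
    </section>
    <section name="Space">
      <book isbn="0684835509"><title>The Case for Mars</title></book>
      <book isbn="074325631X"><title>First Man: The Life of Neil A. Armstrong</title></book>
    </section>
  </library>
XML

# Find the title of the book with the given ISBN; nil if there is none.
# Assumes isbn contains no single quotes (it is pasted into the path).
def title_for_isbn(doc, isbn)
  node = REXML::XPath.first(doc, "//book[@isbn='#{isbn}']/title")
  node && node.text
end

# All titles within the section of the given name, in document order.
def titles_in_section(doc, name)
  REXML::XPath.match(doc, "//section[@name='#{name}']/book/title").map { |e| e.text }
end

doc = REXML::Document.new(XPATH_SAMPLE)
puts title_for_isbn(doc, "0684835509")   # The Case for Mars
p titles_in_section(doc, "Space")
# => ["The Case for Mars", "First Man: The Life of Neil A. Armstrong"]
```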

REXML also has an enhanced SAX2 style API (a superset with some Ruby-like additions of its own) and an experimental pull-parser. These are not covered in this book; refer to http://ruby-doc.org or any comparable resource.