Ruby | Professional XML (Programmer to Programmer)

Ruby is an interpreted scripting language, designed with object-orientation and simplicity in mind. It was written by Yukihiro "Matz" Matsumoto in 1995. He wanted to create a language that used the best parts of his favorite languages (Perl, Smalltalk, Eiffel, Ada, and Lisp). At the same time, he wanted to create a language that was expressive. That is, one that was simple to use and understand, but with a great deal of power. The name is a slight tribute to Perl, as Matz decided that keeping with the name of a precious gem would be appropriate.

Ruby has many of the features of Perl and Python, as well as features of more academic languages. It shares the text-processing capabilities of Perl and Python, as well as the dynamic nature of these languages. From the more academic languages, Ruby obtained lambda expressions (powerful inline functions) as well as strict object-orientation. It is this last feature that truly distinguishes Ruby from the previous two languages. While both Perl and Python have some aspects of object-orientation, they are more or less recent additions to those languages. Ruby, on the other hand, was designed around the concepts of object-oriented programming. Recently, Ruby has grown in popularity, partly because of its completeness, but also due to the increasing popularity of a Web framework written in it: Ruby on Rails.

Reading and Writing XML

Ruby supports both reading and writing XML via the REXML library, which is part of the base Ruby class library. REXML was originally modeled on Electric XML, a Java library written by The Mind Electric. However, the API is designed to fit the Ruby way of doing things: "Keep the common case simple, and the uncommon, possible." As such, the API does not completely follow standards such as the W3C DOM. Instead, it provides a model that is close to the DOM but easier to develop against.

Reading XML

Ruby has two main methods for reading XML content:

- A tree-based method that is similar, but not identical, to working with the DOM. While it maintains the XML file in a tree-based memory structure, it does not provide all the methods required of a W3C DOM implementation. Although this makes code that uses the W3C model more difficult to port to Ruby, it means that the resulting API is closer to the natural way of working with Ruby.
- A stream-based method based on SAX2. This method more closely follows the SAX model, as described in Chapter 13. The core parser handles three events: one that occurs at the beginning of each element, one at the end, and one for text nodes. You provide the parser with a file or block of XML to process, and it calls the corresponding methods in document order. Although SAX processing is much faster than DOM processing, it is inherently a forward-only pass through the XML. In Ruby, you create a listener for the stream parser by including the StreamListener module in a class and overriding the appropriate methods, as described later in this section.

While some differences between the Ruby tree-based model and the W3C DOM are minor, they can trip you up. The following table lists some of the most commonly encountered differences between the two.


W3C DOM method	Ruby equivalent	Description
`documentElement`	`root` or `root_node`	Returns the root element of the XML file.
`appendChild`	`<<` or `add`	Adds a new node to the document tree.
`childNodes`	`get_elements`	Returns the collection of elements below the current element.
`attributes`	`attributes`	Returns the collection of attributes for the selected element. Attributes are identified by name (for example, attributes['id']).
`firstChild`	`elements[1]`	The elements method returns the collection of child elements of the selected element. Children can be identified by numerical index (starting with 1), string name, or via an XPath statement.
`nextSibling`	`next_element`	Returns the next sibling element when iterating through a set of elements.
`getElementsByTagName`	`get_elements`	Both return the collection of matching elements. The DOM method takes a tag name, whereas `get_elements` takes an XPath statement.
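The mappings in the table can be tried on a small in-memory document. The following is a minimal sketch, assuming a made-up two-customer sample rather than the chapter's customers.xml file; the element names and ids are illustrative only.

```ruby
require "rexml/document"

# Hypothetical inline sample (not the chapter's customers.xml)
xml = <<~XML
  <customers>
    <customer id="A1"><city>Berlin</city></customer>
    <customer id="B2"><city>Lyon</city></customer>
  </customers>
XML

doc = REXML::Document.new(xml)

root   = doc.root                        # DOM: documentElement
first  = root.elements[1]                # DOM: firstChild (element children, 1-based)
second = first.next_element              # DOM: nextSibling (skips whitespace text nodes)
cities = root.get_elements("//city")     # XPath-based lookup

root << REXML::Element.new("customer")   # DOM: appendChild

puts root.name
puts first.attributes["id"]
puts second.attributes["id"]
puts cities.map(&:text).join(", ")
puts root.elements.size
```

Note that next_element and elements[1] consider only element nodes, so the whitespace between tags does not get in the way as it can with firstChild and nextSibling in the DOM.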

Listing 17-17 shows some of the methods of REXML in use on the customers.xml file shown earlier (Listing 17-2).

Listing 17-17: Reading XML with Ruby

      require "rexml/document"
      include REXML

      # Load the file and build the document tree in memory
      doc = Document.new(File.new('customers.xml', 'r'))

      puts ">>Print the full first element"
      puts doc.root.elements[1]
      puts

      puts ">>Print the id of the first customer"
      puts doc.root.elements['customer'].attributes['id']
      puts

      puts ">>Select an element via an XPath and display it"
      puts doc.elements["//customer[@id='HUNGC']"]
      puts

      puts ">>Iterate over child elements"
      el = doc.elements["//customer[@id='ANTON']"]
      puts el.elements["company"].text
      el.elements["contact"].each_element { |e| puts e.name + ": " + e.text }
      puts

      puts ">>Select elements via XPath and display child elements"
      doc.each_element("//customer/address[country='Canada']/city") { |e| puts e.text }

Listing 17-17 begins with a require statement to load the REXML library's document handling. Next, REXML is included, so that references to objects in that namespace don't need the REXML prefix (that is, Document.new, not REXML::Document.new). The file is loaded into a Document object. It is at this point that the tree structures are built up in memory. The line that prints the first element shows one difference between the REXML structure and the W3C DOM: the root element in Ruby is root, rather than the documentElement of the DOM.

Elements can be extracted from the document via the elements method. Notice that the first element is numbered 1, rather than 0. Alternatively, the name of the element or attribute can be used to identify the item in the collection. The line that selects the HUNGC customer shows a third method: the elements collection, as well as the each_ methods (such as each_element), accepts an XPath statement to restrict the selection.
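The three ways of selecting from the elements collection can be compared side by side. This is a small sketch, assuming a hypothetical order document (the order, item, and note names and the sku attribute are invented for illustration).

```ruby
require "rexml/document"

# Hypothetical sample document, for illustration only
xml = '<order id="42"><item sku="p1">pen</item><item sku="i2">ink</item><note>rush</note></order>'
doc = REXML::Document.new(xml)
order = doc.root

by_index = order.elements[1]                    # first child element (numbering starts at 1)
by_name  = order.elements["note"]               # first child element named "note"
by_xpath = doc.elements["//item[@sku='i2']"]    # first match of an XPath statement
order_id = order.attributes["id"]               # attribute looked up by name

# each_element accepts an XPath statement to restrict the iteration
items = []
order.each_element("item") { |e| items << e.text }

puts by_index.text
puts by_name.text
puts by_xpath.text
puts order_id
puts items.inspect
```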

The last two samples in the previous listing show the each_element method that iterates over child elements. This is a shorthand method for doc.elements.each. As described earlier, the each_element also accepts an XPath statement to restrict the returned children. The output of the code in Listing 17-17 should look similar to Listing 17-18.

Listing 17-18: Output of Ruby tree-based processing

      >>Print the full first element
      <customer id='ALFKI'>
        <company>Alfreds Futterkiste</company>
        <address>
            <street>Obere Str. 57</street>
            <city>Berlin</city>
            <zip>12209</zip>
            <country>Germany</country>
        </address>
        <contact>
            <name>Maria Anders</name>
            <title>Sales Representative</title>
            <phone>030-0074321</phone>
            <fax>030-0076545</fax>
        </contact>
      </customer>

      >>Print the id of the first customer
      ALFKI

      >>Select an element via an XPath and display it
      <customer id='HUNGC'>
        <company>Hungry Coyote Import Store</company>
        <address>
            <street>City Center Plaza 516 Main St.</street>
            <city>Elgin</city>
            <region>OR</region>
            <zip>97827</zip>
            <country>USA</country>
        </address>
        <contact>
            <name>Yoshi Latimer</name>
            <title>Sales Representative</title>
            <phone>(503) 555-6874</phone>
            <fax>(503) 555-2376</fax>
        </contact>
      </customer>

      >>Iterate over child elements
      Antonio Moreno Taquería
      name: Antonio Moreno
      title: Owner
      phone: (5) 555-3932

      >>Select elements via XPath and display child elements
      Montreal
      Tsawassen
      Vancouver

In addition to the methods listed previously, the Ruby XML implementation includes a number of methods that are designed to get information on the structure of the document. These methods use common Ruby idioms, making the code feel more Ruby-like. The following table outlines some of these methods.


Method	Description
`each_element`	Iterates over the child elements of the selected element. Can be passed an XPath statement to restrict the elements iterated over.
`has_elements?`	Returns `true` if the current node has child elements.
`has_attributes?`	Returns `true` if the current node has attributes.
`has_text?`	Returns `true` if the current node has a child text node.
`text=`	Assigns a value as the inner text for an element.
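As a quick illustration of the query methods in the table, the sketch below runs them against a tiny in-memory document. The contact, name, and phone elements are assumptions made for the example.

```ruby
require "rexml/document"

# Hypothetical sample: a contact with one attribute, one filled child, one empty child
doc = REXML::Document.new('<contact type="sales"><name>Ann</name><phone/></contact>')
contact = doc.root
name    = contact.elements["name"]
phone   = contact.elements["phone"]

puts contact.has_elements?     # contact has child elements
puts contact.has_attributes?   # contact carries the type attribute
puts name.has_text?            # name contains a text node
phone_before = phone.has_text? # phone is empty at this point

phone.text = "555-0100"        # text= assigns the inner text
phone_after = phone.has_text?

puts phone_before
puts phone_after
puts phone.text
```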

Listing 17-19 shows some of these methods being used to query the structure of the XML.

Listing 17-19: Getting structure information with Ruby

      require "rexml/document"
      include REXML

      doc = Document.new(File.new('customers.xml', 'r'))

      # Get information about an element
      el = doc.elements["//customer[@id='RICAR']"]
      if el.has_attributes?
        puts el.name + " has attributes"
        el.attributes.each { |name, value| puts name + ": " + value }
      end

      # Recursively print an element: descend into children, or print leaf text
      def dump_element(e)
        if e.has_elements?
          puts
          puts e.name + " has children"
          e.each_element { |el| dump_element(el) }
        else
          if e.has_text?
            puts e.name + ": " + e.text
          end
        end
      end

      if el.has_elements?
        dump_element(el)
      end

Listing 17-20 shows the output of the preceding code.

Listing 17-20: XML Structure Information

      customer has attributes
      id: RICAR

      customer has children
      company: Ricardo Adocicados

      address has children
      street: Av. Copacabana, 267
      city: Rio de Janeiro
      region: RJ
      zip: 02389-890
      country: Brazil

      contact has children
      name: Janete Limeira
      title: Assistant Sales Agent
      phone: (21) 555-3412

In addition to the tree-based document parsing, REXML supports a stream-based parser. With this technique, you create one or more listener classes with methods that are called as the document is processed. You create a listener class by including the StreamListener module (a mixin) and then overriding its methods to provide the implementation you want. The following table describes the most common methods you should override.


Method	Description
`tag_start`	Called when the parser first encounters a new tag. The name of the new element will be passed to the method, as well as the attributes for the element. The attributes are provided in an array of name-value pairs. This method is usually used to prepare for the processing by identifying the elements you are interested in.
`tag_end`	Called when the parser encounters the end of an element. This is usually used to undo whatever settings were enabled when the corresponding `tag_start` method was called, such as turning off flags or decrementing counters.
`text`	Called when the parser encounters a text node. This is often where the bulk of the processing occurs when using stream-based parsers.

Listing 17-21 shows some of these methods processing the customers XML file.

Listing 17-21: Reading XML using streams with Ruby

      require 'rexml/document'
      require 'rexml/streamlistener'
      include REXML
      include Parsers

      class Listener
        include StreamListener

        def initialize
          @cities = Hash.new(0)
          @flag = false
        end

        def tag_start(name, attributes)
          if name == 'city'
            @flag = true
          end
        end

        def tag_end(name)
          if name == 'customers'
            puts
            dump_list
          end
        end

        def text(text)
          if @flag
            puts "Adding " + text
            @cities[text] = @cities[text] + 1
            @flag = false
          end
        end

        def dump_list()
          puts ">> Count of each city"
          @cities.each { |key, value| puts key + ": " + value.to_s }
          puts "==="
        end
      end

      listener = Listener.new
      parser = StreamParser.new(File.new("customers.xml"), listener)
      parser.parse

The Listener class includes the StreamListener mixin and contains five methods, three of which are used by the streaming parser (tag_start, tag_end and text). The tag_start method is called as each new tag is reached by the parser, whereas tag_end is called at the end.

The tag_start method receives two parameters: the name of the element and an array containing the keys and values of the attributes for that element. As the code is identifying cities, it simply sets a flag if the parser has reached a city element.

The counting is done within the text method. Because this method is called many times throughout the life of the application, it uses the @flag variable to determine whether it is within a city element. If this is the case, the entry for the city in the hash table @cities is incremented. Because the hash table was created with Hash.new(0), a city that does not yet have an entry yields the default value 0, so the first increment creates the entry with a value of 1. Finally, the flag is turned off. This could also have been done in the tag_end method.

Once the end of the document has been reached (identified by the tag_end method being called on the customers end element), the contents of the hash table are printed to the console. This method uses a Ruby block to print each entry in the @cities hash table.
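The alternative mentioned above, turning the flag off in tag_end rather than in text, could look like the following sketch. It is not the listing's code: the CityCounter class name, the @in_city variable, and the inline sample XML are assumptions made for illustration, and the counts are exposed through a reader instead of being printed from tag_end.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: counts city names, resetting the flag in tag_end
class CityCounter
  include REXML::StreamListener

  attr_reader :cities

  def initialize
    @cities = Hash.new(0)   # missing keys default to 0
    @in_city = false
  end

  def tag_start(name, attributes)
    @in_city = true if name == "city"
  end

  def tag_end(name)
    @in_city = false if name == "city"
  end

  def text(text)
    @cities[text] += 1 if @in_city
  end
end

# Made-up sample data, not the chapter's customers.xml
xml = "<customers>" \
      "<customer><city>London</city></customer>" \
      "<customer><city>Lyon</city></customer>" \
      "<customer><city>London</city></customer>" \
      "</customers>"

counter = CityCounter.new
REXML::Parsers::StreamParser.new(xml, counter).parse

counter.cities.each { |city, count| puts "#{city}: #{count}" }
```

Resetting the flag in tag_end keeps the start and end handlers symmetrical, which tends to be easier to follow once more than one element is being tracked.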

Listing 17-22: Output of the Ruby stream-based processor

      Adding Lyon
      Adding Reims
      Adding Stuttgart
      Adding Oulu
      Adding Resende
      Adding Seattle
      Adding Helsinki
      Adding Warszawa

      >> Count of each city
      Stuttgart: 1
      Butte: 1
      Kobenhavn: 1
      Tsawassen: 1
      London: 6
      Brandenburg: 1
      Cunewalde: 1
      Marseille: 1
      Berlin: 1
      Sao Paulo: 4
      Portland: 2
      Lyon: 1
      Albuquerque: 1
      Warszawa: 1
      Lille: 1
      Frankfurt a.M.: 1

Writing XML

Writing XML with the REXML library is quite simple. The Document class is used to create the new document, whereas Element and Attribute classes add elements and attributes, respectively.

Writing XML using Ruby is significantly different from using the W3C DOM. Instead of sticking with the API used with the DOM, the authors of the Ruby library chose to follow common Ruby idioms. The following table outlines some of these methods.


Method	Description
`Document.new`	`new(source = nil, context = {})` Constructor for the `Document` class. Creates a new document, using the provided file, string, or IO stream. If no parameter is supplied, it creates the document in memory. The context parameter is deprecated.
`XMLDecl.new`	`XMLDecl.new(version, encoding, standalone)` Returns the XML declaration string. Defaults are: version=1.0, encoding=UTF-8, and standalone=false.
`Element.new`	`new (arg = UNDEFINED, parent=nil, context=nil)` Creates a new element. The arg parameter can either be a string, providing a name for the newly created element, or another element, meaning that this new element is a shallow copy of the provided element. If parent is provided, the newly created element is a child of the parent. The context parameter provides a number of options for the content of the element.
`Element.add_element`	`add_element(arg=nil, arg2=nil)` Adds a new child element. The two arguments are the name of the newly added element and an optional hash table containing the attributes for the new element. For example, the line in Listing 17-23 below: book3 = folder.add_element("bookmark", {"href"=>"http://www.wrox.com"})
`Element.add_attribute`	`add_attribute(key, value=nil)` Adds a new child attribute. The first parameter can be either an existing `Attribute` object, which would be copied into the parent element, or a string, in which case that becomes the name of the new attribute. The second parameter provides the value of the attribute.
`Element.<<`	`<< item` An alias for the `add` method. This is handy shorthand for adding new elements to the document.

Listing 17-23 shows the creation of a simple XBEL (XML Bookmarks Exchange Language) document using REXML. Note that this sample uses a number of different techniques on purpose; it shows the choices available for creating new elements and attributes.

Listing 17-23: Writing XML with Ruby

      require "rexml/document"
      include REXML

      doc = Document.new
      doc << XMLDecl.new
      doc << Element.new("xbel")
      doc.root.attributes["version"] = "1.0"

      folder = Element.new("folder")
      folder << Element.new("title").add_text("Some useful bookmarks")

      book1 = folder.add_element("bookmark")
      book1.add_attribute("href", "http://www.geekswithblogs.net/evjen")
      book1.add_element("title").add_text("Bill Evjen's Weblog")

      book2 = Element.new("bookmark")
      book2.add_attribute("href", "http://www.acmebinary.com/blogs/kent")
      book2 << Element.new("title") << Text.new("Kent Sharkey's Weblog")
      folder << book2

      book3 = folder.add_element("bookmark", {"href"=>"http://www.wrox.com"})
      book3 << Element.new("title") << Text.new("Wrox Home Page")
      book3 << Element.new("desc") << Text.new("Home of great, red books")

      doc.root.add_element(folder)
      doc.write(File.new("output.xml", "w"), 2)

First, the REXML library is loaded with the require statement, and its namespace is brought into scope with the include REXML statement. This eliminates the need to prefix every use of the library with REXML::. The XML document is created, and the standard XML declaration added. Next, the root element is added, along with an attribute.

Listing 17-24 shows the resulting XML document.

Listing 17-24: Output from Ruby

      <?xml version='1
      <xbel version='1
        <folder>
          <title>Some useful bookmarks</title>
          <bookmark href='http://www
            <title>Bill Evjen's Weblog</title>
          </bookmark>
          <bookmark href='http://www
            <title>Kent Sharkey's Weblog</title>
          </bookmark>
          <bookmark href='http://www
            <title>Wrox Home Page</title>
            <desc>Home of great, red books</desc>
          </bookmark>
        </folder>
      </xbel>

New elements can be created either standalone (with Element.new), or as part of the existing structure (with add_element). Similarly, attributes can be added via add_attribute, or using Attribute.new. Finally, the append method (<<) is overridden to permit adding either elements or attributes. The choice in methods allows you to either select the method that works best for you or for the situation at hand.
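The choices described above can be shown together in one short sketch. The example.com URLs are placeholders, and passing the parent as the second argument to Element.new follows the constructor description in the earlier table; treat this as one possible arrangement rather than a recommended style.

```ruby
require "rexml/document"

doc = REXML::Document.new
folder = REXML::Element.new("folder")
doc << folder

# 1. Create the child in place, attributes supplied as a hash
folder.add_element("bookmark", { "href" => "http://example.com/a" })

# 2. Create the element standalone, then append it with <<
b = REXML::Element.new("bookmark")
b.add_attribute("href", "http://example.com/b")
folder << b

# 3. Pass the parent to the constructor and add an Attribute object
c = REXML::Element.new("bookmark", folder)
c.add_attribute(REXML::Attribute.new("href", "http://example.com/c"))

hrefs = doc.root.get_elements("bookmark").map { |e| e.attributes["href"] }
puts hrefs
```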

Notice from the output that text is automatically encoded, as with the single quote characters. In addition, other characters that are not appropriate in XML files (such as & or ") are encoded. This behavior can be overridden by adding the :raw value to the context (for Element.new). You might use this option when writing the entries to a CDATA block or other location where the characters are valid.
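The default encoding behavior is easy to observe on a single element. The sketch below, using an invented title value, assigns text containing an ampersand and compares the serialized form with the value read back through text. It does not exercise the :raw option.

```ruby
require "rexml/document"

title = REXML::Element.new("title")
title.text = "Salt & Pepper"   # hypothetical value containing a reserved character

serialized = title.to_s        # the ampersand is written as an entity reference
read_back  = title.text        # the API still returns the plain string

puts serialized
puts read_back
```

The encoding happens only when the element is serialized, so code that reads and writes text through the API works with ordinary strings throughout.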

Support for Other XML Formats

The base libraries for Ruby also include support for creating and accessing Web services using either SOAP or XML-RPC. In addition, they provide support for working with RSS and W3C XML Schemas. External libraries are generally installed as Ruby gems, a packaging format built into Ruby. As of this writing, there are approximately 1200 Ruby gems, many providing support for various XML formats. Some of the more notable Ruby gems include:

- Amrita2: An XHTML templating engine that provides for the transformation of XML documents into XHTML. It is similar in concept to XSLT, but does not follow that standard.
- FeedTools: A powerful library for working with RSS, Atom, and CDF (Channel Definition Format) files.
- XMPP4R: A library for communicating with the XML format used by the Jabber instant messenger protocol.