XML Parsing in Ruby | The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming (2nd Edition)

By Hal Fulton

Since the late 1990s, one of the biggest buzzword technologies in Internet programming has been XML. XML (Extensible Markup Language) is a text-based document specification language. It enables developers and users to easily create their own, parseable document formats. XML is both machine-parseable and easily readable by humans, making it a good choice for processes that require both manual and automated tasks. Its readability also makes it easy to debug.

As is the case with most modern programming languages, there are Ruby libraries that greatly simplify the tasks of parsing and creating XML documents. There are two prevalent ways to approach XML parsing: DOM (Document Object Model), a specification developed by the World Wide Web Consortium, providing a tree-like representation of a structured document, and event-based parsers including SAX (Simple API for XML), which view occurrences of elements in an XML document as events that can be handled by callbacks.

Two pervasive Ruby XML packages are available. The first and most widely used is XMLParser by Yoshida Masato. XMLParser is an interface to James Clark's popular expat library for C. It will be used in the majority of our examples because of its stability and broader acceptance. Fairly new on the scene is Jim Menard's NQXML (Not Quite XML). The advantage of NQXML that has many Ruby programmers excited is that it is written in pure Ruby, which makes it very easy to modify and to install or include in software distributions. An example of NQXML will be included at the end of this section.

A short sample XML document is presented in Listing 9.16.

Listing 9.16 Sample XML Document

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <addressbook>
        <person relationship="business">
            <name>Matt Hooker</name>
            <address>
                <street>111 Central Ave.</street>
                <city>Memphis</city>
                <state>TN</state>
                <company>J+H Productions</company>
                <zipcode>38111</zipcode>
            </address>
            <phone>901-555-5255</phone>
        </person>
        <person relationship="friend">
            <name>Michael Nilnarf</name>
            <address>
                <street>10 Kiehl Ave.</street>
                <city>Sherwood</city>
                <state>AR</state>
                <zipcode>72120</zipcode>
            </address>
            <phone>501-555-6343</phone>
        </person>
    </addressbook>

A detailed explanation of how XML itself works is beyond the scope of this text, but there are a few items worth noting. First, you can see that XML is made up of tags, which are pieces of text surrounded by < and >. Generally, these tags have a beginning, <mytag>, and an end, </mytag>; and they can contain either plain text or other tags. If a tag doesn't have a closing tag, it should contain a trailing slash, as in <mytag/>. Tags, also called elements, can optionally have attributes, as in <person relationship="friend">, which are name/value pairs placed in the tag itself. For a more detailed introduction to XML, consult a reference.

Using XMLParser (Tree-Based Processing)

The first step in using XMLParser's DOM parsing library is to perform some parser initialization and setup. The XML::DOM::Builder object is a parser whose specialty is building, not surprisingly, DOM trees. In the setup portion of the code, we perform tasks such as setting default encoding for tag names and data, and setting the base URI for locating externally referenced XML objects. Next, a call to Builder.parse creates a new Document object, which is the highest level object in the DOM tree hierarchy. Finally, any successive blocks of text in the DOM tree are merged by calling normalize on the document's root element, and the tree is returned. Refer to Listing 9.17.

Listing 9.17 Setting Up a DOM Object

    require 'xml/dom/builder'

    def setup_dom(xml)
      builder = XML::DOM::Builder.new(0)
      builder.setBase("./")
      begin
        xmltree = builder.parse(xml, true)
      rescue XMLParserError
        # Report the failing line and give up
        line = builder.line
        print "#{$0}: #{$!} (in line #{line})\n"
        exit 1
      end
      # Unify sequential Text nodes
      xmltree.documentElement.normalize
      return xmltree
    end

What has been created so far is a first-class Ruby object that provides a structured representation of our original XML document, including the data within. All that remains is to actually do something with this structure. This is the easy part. Refer to Listing 9.18.

Listing 9.18 Parsing a DOM Object

    # Defined first: a top-level method must exist before it is called.
    def printPerson(person)
      rel = person.getAttribute("relationship")
      puts "Found person of type #{rel}."
      name = person.getElementsByTagName("name")[0].firstChild.data
      puts "\tName is: #{name}"
    end

    xml = $<.read
    xmltree = setup_dom(xml)
    xmltree.getElementsByTagName("person").each do |person|
      printPerson(person)
    end

With a call to our previously created setup_dom method, we get a handle to the DOM tree representing our XML document. The DOM tree is made up of a hierarchy of Nodes. A Node can optionally have children, which would be a collection of Nodes. In an object-oriented sense, extending from Node are higher level classes such as Element, Document, Attr, and others, modeling higher level behavior appropriately.

In our simple example, we use getElementsByTagName to iterate through all elements of type person. With each person, the printPerson method prints the recorded relationship and name of the person in the XML file. Of interest are the two different methods of storing and accessing data that are represented here. The relationship is stored as an attribute of the person element. For that reason, we get a handle to an object of type Attr with a call to the Element's getAttribute method; string interpolation then converts it to a String by calling to_s implicitly. In the case of the person's name, we are storing it as character data within an Element. Character data is represented as a separate Node of type CharacterData. In this case, it appears as a child of the Node that represents the name element. To access it, we make a call to firstChild and then to the CharacterData's data method. For a more detailed treatment of XMLParser's DOM capabilities, refer to the samples and embedded documentation provided with the XMLParser distribution.
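The difference between attribute data and character data can be seen without any XML library at all. What follows is only a minimal sketch that builds a tiny tree by hand; the Element and Text structs and their method names are hypothetical stand-ins chosen for illustration, not XMLParser's classes.

```ruby
# Hypothetical stand-ins for DOM node types (not XMLParser's classes).
Element = Struct.new(:name, :attributes, :children) do
  # Look up a name/value pair stored on the tag itself.
  def attribute(key)
    attributes[key]
  end

  # Return all descendant elements with the given tag name, in document order.
  def elements_by_tag_name(tag)
    children.grep(Element).flat_map do |child|
      (child.name == tag ? [child] : []) + child.elements_by_tag_name(tag)
    end
  end
end
Text = Struct.new(:data)

# <person relationship="friend"><name>Michael Nilnarf</name></person>
person = Element.new("person", { "relationship" => "friend" }, [
  Element.new("name", {}, [Text.new("Michael Nilnarf")])
])

# The relationship lives on the element; the name lives in a child text node.
rel  = person.attribute("relationship")
name = person.elements_by_tag_name("name")[0].children[0].data

puts "Found person of type #{rel}."
puts "\tName is: #{name}"
```

The attribute is reached in one step from the element, whereas the name requires descending to the name element and then to its first child, which is the same shape as the firstChild.data call in the listing.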

Using XMLParser (Event-Based Processing)

As mentioned earlier, a common alternative to DOM-based XML parsing is to view the parsing process as a series of events for which handlers can be written. This is SAX-like, or event-based parsing. In Listing 9.19, we'll reproduce the functionality of our DOM example using the event-based parsing method that XMLParser provides.

Listing 9.19 Event-Based Parsing

    require 'xmlparser'

    class XMLRetry < Exception; end

    class SampleParser < XMLParser
      private

      def startElement(name, attr)
        if name == "person"
          attr.each do |key, value|
            print "Found person of type #{value}.\n"
          end
        end
        if name == "name"
          $print_cdata = true
          self.defaultCurrent
        else
          $print_cdata = false
        end
      end

      def endElement(name)
        if name == "name"
          $print_cdata = false
        end
      end

      def character(data)
        if $print_cdata
          puts("\tName is: #{data}")
        end
      end
    end

    xml = $<.read
    parser = SampleParser.new

    def parser.unknownEncoding(e)
      raise XMLRetry, e
    end

    begin
      parser.parse(xml)
    rescue XMLRetry
      newencoding = nil
      e = $!.to_s
      parser = SampleParser.new(newencoding)
      retry
    rescue XMLParserError
      line = parser.line
      print "Parse error(#{line}): #{$!}\n"
    end

To use XMLParser's event-based parsing API, you must define a class that extends from XMLParser. This class has a method, parse, which is responsible for the main logic of tokenizing an XML document and iterating over its pieces. Your job when writing an extension to XMLParser is to define methods that will be called by parse when certain events take place. This example defines three such methods: startElement, endElement, and character. Not surprisingly, startElement is called when an opening XML tag is encountered, endElement is called after finding a closing tag, and character is called for a block of character data. The XMLParser API defines 23 events like these, which can be defined if needed. Undefined events (events for which no method has been explicitly overridden by the end developer) are ignored. For a complete list of events, see the README file in the XMLParser distribution.

Admittedly, this example does not lend itself well to event-based parsing. The $print_cdata global variable is a hack to maintain state across events. Without $print_cdata, the character method would have no way of knowing whether it had encountered character data inside a name element or at any other arbitrary spot in a document. This illustrates an interesting constraint in the event-based parsing model: It's up to you, the developer, to maintain the context in which these events are fired. Whereas DOM provides a neatly organized tree, event-based parsing triggers events that are totally unaware of each other.
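One way to keep that context without a global is to store it in the handler object itself, for instance as a stack of the elements that are currently open. The following is a sketch only: the NameCollector class and its method names are hypothetical, and the events are fired by hand in the order a parser would report them rather than by XMLParser.

```ruby
# Hypothetical handler that tracks its own context with a stack of open
# element names instead of a global flag. Not part of XMLParser.
class NameCollector
  attr_reader :names

  def initialize
    @open_elements = []   # innermost open element is last
    @names = []
  end

  def start_element(name, attrs)
    @open_elements.push(name)
  end

  def end_element(name)
    @open_elements.pop
  end

  def character(data)
    # Only text directly inside <person><name>...</name> is collected.
    @names << data if @open_elements.last(2) == %w[person name]
  end
end

# Fire the events by hand, in document order.
handler = NameCollector.new
handler.start_element("addressbook", {})
handler.start_element("person", { "relationship" => "business" })
handler.start_element("name", {})
handler.character("Matt Hooker")
handler.end_element("name")
handler.start_element("address", {})
handler.start_element("city", {})
handler.character("Memphis")
handler.end_element("city")
handler.end_element("address")
handler.end_element("person")
handler.end_element("addressbook")

p handler.names
```

Because the stack records the full path to the current element, the same handler can tell a person's name apart from character data anywhere else in the document, such as the city.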

Now that we've presented two different approaches to parsing XML, you might be asking yourself how to choose between them. For most people, DOM is the more intuitive solution. Its tree-based approach is easy to comprehend and easy to manage. It is ideal when viewing XML as a document in the truest sense of the word. The primary disadvantage of DOM is that it parses and loads the entire document into memory before any operations can be performed. This can have ramifications on the performance and scalability of an application. If speed is an issue, event-based parsing enables a program to react to each element as it is read. For example, if a program had to read XML data from a slow data source (an international network link, for example), it might be advantageous to start operating on the data as it streams in, rather than loading the entire document and parsing it after the fact. From a scalability perspective, event-based parsing can also be more efficient. If you wanted to parse an extremely large file, it might be better to parse while scanning through it, rather than allocating memory to store the entire file before parsing and operating on its data.
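The memory argument can be illustrated with a small sketch that reads its input in chunks and hands over each person record as soon as its closing tag arrives. This is not a real XML parser: the each_person_record method is a hypothetical helper that matches tags textually, and it assumes that person elements are never nested. A StringIO stands in for the slow data source.

```ruby
require 'stringio'

# Read an IO in small chunks and yield each complete <person> record as soon
# as its closing tag has arrived. Text that has already been matched is
# discarded, so the whole document is never held in memory at once.
# Assumption: <person> elements are not nested and appear literally as written.
def each_person_record(io, chunk_size = 64)
  buffer = +""
  while (chunk = io.read(chunk_size))
    buffer << chunk
    while (match = buffer.match(%r{<person\b.*?</person>}m))
      yield match[0]
      buffer = match.post_match
    end
  end
end

xml = <<~XML
  <addressbook>
    <person relationship="business"><name>Matt Hooker</name></person>
    <person relationship="friend"><name>Michael Nilnarf</name></person>
  </addressbook>
XML

names = []
each_person_record(StringIO.new(xml), 16) do |record|
  # Each record can be processed while later ones are still unread.
  names << record[%r{<name>(.*?)</name>}m, 1]
end
puts names
```

The first record is handled before the second has been read from the source, which is the behavior that makes event-based processing attractive for slow links and very large files.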

Using NQXML

For those with a need or desire to work with a pure Ruby solution to XML parsing, Jim Menard's NQXML is currently the only thing going. Presented in Listing 9.20 is an example of its (Not Quite) DOM-parsing capabilities.

Listing 9.20 Pure Ruby XML Parsing

    require 'nqxml/treeparser'

    xml = $<.read
    begin
      doc = NQXML::TreeParser.new(xml).document
      root = doc.rootNode
      root.children.each do |node|
        if node.entity.class == NQXML::Tag
          if node.entity.name == "person"
            rel = node.entity.attrs['relationship']
            puts "Found a person of type #{rel}."
          end
          node.children.each do |subnode|
            if subnode.entity.class == NQXML::Tag &&
                 subnode.entity.name == "name"
              puts "\tName is: #{subnode.children[0].entity}"
            end
          end
        else
          puts node.entity.class
        end
      end
    rescue NQXML::ParserError
      # Do something meaningful
    end

Structurally, the program is very similar to the XMLParser DOM example previously presented. NQXML closely, but loosely, follows the DOM way of doing things, so developers familiar with DOM should have little difficulty adjusting to the sometimes different class and method names of NQXML. NQXML also offers a SAX-like streaming parser that relies more heavily on Ruby's iterators than the callback methods of XMLParser.
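To give a feel for the difference between callbacks and iterators, here is a sketch of an iterator-style interface in plain Ruby. It does not use NQXML and does not reproduce its API; the each_xml_event method is a hypothetical toy tokenizer that understands only plain start tags, end tags, and text (no XML declaration, comments, CDATA sections, entities, or empty-element tags).

```ruby
# Hypothetical iterator-style tokenizer (not NQXML's API): instead of calling
# handler methods, it yields [kind, value] pairs to the caller's block.
def each_xml_event(xml)
  return enum_for(:each_xml_event, xml) unless block_given?

  xml.scan(%r{<(/?)([\w:-]+)[^>]*>|([^<]+)}) do |slash, tag, text|
    if text
      yield [:text, text] unless text.strip.empty?
    elsif slash == "/"
      yield [:end, tag]
    else
      yield [:start, tag]
    end
  end
end

xml = '<person relationship="friend"><name>Michael Nilnarf</name></person>'

# The caller keeps its context in ordinary local variables.
in_name = false
names = []
each_xml_event(xml) do |kind, value|
  case kind
  when :start then in_name = (value == "name")
  when :end   then in_name = false
  when :text  then names << value if in_name
  end
end
puts names
```

With an iterator, the state that had to live in a global variable in the callback version becomes a local variable in the code that drives the loop.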