Extracting Data from a Document's Tree Structure

Credit: Rod Gaither

Problem

You want to parse an XML file into a Ruby data structure, to traverse it or extract data from it.

Solution

Pass an XML document into the REXML::Document constructor to load and parse the XML. A Document object contains a tree of subobjects (of class Element and Text) representing the tree structure of the underlying document. The methods of Document and Element give you access to the XML tree data. The most useful of these methods is Element#each_element.

Here's some sample XML and the load process. The document describes a set of orders, each of which contains a set of items. This particular document contains a single order for two items.

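	orders_xml = %{
	<orders>
	  <order>
	    <number>105</number>
	    <date>02/10/2006</date>
	    <customer>Corner Store</customer>
	    <items>
	      <item desc="Red Roses" />
	      <item desc="Candy Hearts" />
	    </items>
	  </order>
	</orders>}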

	require 'rexml/document'
	orders = REXML::Document.new(orders_xml)
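
The constructor also accepts an open IO object, which is handy when the XML lives in a file rather than in a string. The following is a minimal sketch under that assumption; the file name orders.xml and the shortened document are illustrative only, and the file is written to a temporary directory so the example is self-contained.

```ruby
require 'rexml/document'
require 'tmpdir'

# Write a small sample file, then parse it by handing the open File to REXML.
doc = Dir.mktmpdir do |dir|
  path = File.join(dir, 'orders.xml')   # hypothetical file name
  File.write(path, '<orders><order><number>105</number></order></orders>')

  # The document is fully parsed inside the constructor, so the file
  # can be closed as soon as the block returns.
  File.open(path) { |file| REXML::Document.new(file) }
end

puts doc.root.name                           # prints "orders"
puts doc.root.elements[1].elements[1].text   # prints "105"
```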

To process each order in this document, we can use Document#root to get the document's root element (<orders>) and then call Element#each_element to iterate over the children of the root element (the <order> elements). This code repeatedly calls each_element to move down the document tree and print the details of each order in the document:

	orders.root.each_element do |order|    # each <order> in <orders>
	  order.each_element do |node|         # <number>, <date>, etc. in <order>
	    if node.has_elements?
	      node.each_element do |child|     # each <item> in <items>
	        puts "#{child.name}: #{child.attributes['desc']}"
	      end
	    else
	      # the contents of <number>, <date>, etc.
	      puts "#{node.name}: #{node.text}"
	    end
	  end
	end
	# number: 105
	# date: 02/10/2006
	# customer: Corner Store
	# item: Red Roses
	# item: Candy Hearts
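
The same traversal can collect values into ordinary Ruby hashes and arrays instead of printing them. This is a sketch, not part of the original recipe: the helper order_to_hash is a hypothetical name, and the sample XML assumes the element names implied by the output above, with desc as the only attribute on each item.

```ruby
require 'rexml/document'

xml = %{
<orders>
  <order>
    <number>105</number>
    <date>02/10/2006</date>
    <customer>Corner Store</customer>
    <items>
      <item desc="Red Roses" />
      <item desc="Candy Hearts" />
    </items>
  </order>
</orders>}

# Hypothetical helper: turn one <order> element into a plain Hash.
def order_to_hash(order)
  result = {}
  order.each_element do |node|
    if node.has_elements?
      # A container such as <items>: collect the desc attribute of each child.
      descriptions = []
      node.each_element { |child| descriptions << child.attributes['desc'] }
      result[node.name] = descriptions
    else
      # A leaf such as <number>: store its text under the element name.
      result[node.name] = node.text
    end
  end
  result
end

order_list = []
REXML::Document.new(xml).root.each_element do |order|
  order_list << order_to_hash(order)
end

order_list.first['customer']   # => "Corner Store"
order_list.first['items']      # => ["Red Roses", "Candy Hearts"]
```

Once the data is in hashes and arrays, the rest of the program no longer needs to know anything about REXML.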

Discussion

Parsing an XML file into a Document gives you a tree-like data structure that you can treat much like an array of arrays. Starting at the document root, you can move down the tree until you find the data that interests you. In the example above, note how the structure of the Ruby code mirrors the structure of the original document. Every call to each_element moves the focus of the code down a level: from <orders> to <order> to <items> to <item>.
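
When the depth of the document is not known in advance, the nested each_element calls can be replaced by recursion. The sketch below assumes a shortened version of the orders document; outline is a hypothetical helper that records each element name indented by its depth in the tree.

```ruby
require 'rexml/document'

# Hypothetical helper: walk the tree depth-first, collecting one line per
# element, indented two spaces for every level below the starting element.
def outline(element, depth = 0, lines = [])
  lines << "#{'  ' * depth}#{element.name}"
  element.each_element { |child| outline(child, depth + 1, lines) }
  lines
end

doc = REXML::Document.new(
  '<orders><order><number>105</number>' \
  '<items><item desc="Red Roses" /></items></order></orders>'
)

puts outline(doc.root)
# orders
#   order
#     number
#     items
#       item
```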

There are many other methods of Element you can use to navigate the tree structure of an XML document. Not only can you iterate over the child elements, you can reference a specific child by indexing the parent's elements collection as though it were an array (following XPath, the indexes start at 1, not 0). You can navigate through siblings with Element#next_element and Element#previous_element. You can move up the document tree with Element#parent:

	my_order = orders.root.elements[1]
	first_node = my_order.elements[1]
	first_node.name # => "number"
	first_node.next_element.name # => "date"
	first_node.parent.name # => "order"
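
A few more navigation calls, shown as a sketch on a shortened, assumed version of the order: the elements collection also accepts a string, which it treats as an XPath expression and resolves to the first match, and the sibling methods return nil when there is no sibling element in that direction.

```ruby
require 'rexml/document'

doc = REXML::Document.new(
  '<order><number>105</number><date>02/10/2006</date>' \
  '<customer>Corner Store</customer></order>'
)
order = doc.root

# Integer indexes are 1-based.
order.elements[1].name              # => "number"

# A string index is an XPath; here it names a child element.
order.elements['customer'].text     # => "Corner Store"

date = order.elements['date']
date.previous_element.name          # => "number"
date.next_element.name              # => "customer"
date.parent.name                    # => "order"

# There is no element before the first child or after the last one.
order.elements[1].previous_element  # => nil
order.elements[3].next_element      # => nil
```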

This only scratches the surface; there are many other ways to interact with the data loaded from an XML source. For example, explore the convenience methods Element#each_element_with_attribute and Element#each_element_with_text, which let you select child elements by an attribute they carry or by the text they contain.
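
Here is a brief sketch of those two methods. The sample XML is an assumption made for illustration: it reuses the element names from the order document and adds a third item with no desc attribute so that the filtering is visible.

```ruby
require 'rexml/document'

doc = REXML::Document.new(%{
<order>
  <customer>Corner Store</customer>
  <items>
    <item desc="Red Roses" />
    <item desc="Candy Hearts" />
    <item />
  </items>
</order>})

items = doc.root.elements['items']

# Children that have a desc attribute, whatever its value.
with_desc = []
items.each_element_with_attribute('desc') do |item|
  with_desc << item.attributes['desc']
end
with_desc   # => ["Red Roses", "Candy Hearts"]

# Children whose desc attribute has one specific value.
candy = []
items.each_element_with_attribute('desc', 'Candy Hearts') do |item|
  candy << item.attributes['desc']
end
candy       # => ["Candy Hearts"]

# Children whose text is exactly the given string.
matches = []
doc.root.each_element_with_text('Corner Store') do |element|
  matches << element.name
end
matches     # => ["customer"]
```

Passing an explicit value is usually the safer choice with each_element_with_text, because whitespace between tags also counts as text when no value is given.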

See Also

  • The RDoc documentation for the REXML::Document and REXML::Element classes
  • The section "Tree Parsing XML and Accessing Elements" in the REXML Tutorial (http://www.germane-software.com/software/rexml/docs/tutorial.html#id2247335)
  • If you want to start navigating the document at some point other than the root, an XPath statement is probably the simplest way to get where you want; see Recipe 11.4, "Navigating a Document with XPath"


Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696
