Converting an XML Document into a Hash

Problem

When you parse an XML document with Document.new, you get a representation of the document as a complex data structure. You'd like to represent an XML document using simple, built-in Ruby data structures instead.

Solution

Use the XmlSimple library, found in the xml-simple gem. It parses an XML document into a hash.

Consider an XML document like this one:

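	xml = %{
	<freezer temp="-12" scale="celcius">
	 <food>Phyllo dough</food>
	 <food>Ice cream</food>
	 <icecubetray>
	  <cube1 />
	  <cube2 />
	 </icecubetray>
	</freezer>}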

Here's how you parse it with XmlSimple:

	require 'rubygems'
	require 'xmlsimple'

	doc = XmlSimple.xml_in xml

And here's what the result looks like:

	require 'pp'
	pp doc
	# {"icecubetray"=>[{"cube2"=>[{}], "cube1"=>[{}]}],
	# "food"=>["Phyllo dough", "Ice cream"],
	# "scale"=>"celcius",
	# "temp"=>"-12"}
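
Because the result is built from ordinary hashes and arrays, you can navigate it with ordinary Ruby. The following is a minimal sketch that uses a literal copy of the structure printed above, so it runs without the gem installed; the variable names are only illustrative.

```ruby
# A literal copy of the structure XmlSimple produced above.
doc = {
  "icecubetray" => [{ "cube2" => [{}], "cube1" => [{}] }],
  "food"        => ["Phyllo dough", "Ice cream"],
  "scale"       => "celcius",
  "temp"        => "-12"
}

# Repeated elements become an array of strings.
foods = doc["food"]

# Attribute values arrive as strings, so convert them yourself.
temperature = doc["temp"].to_i

# Nested elements are hashes wrapped in arrays; Hash#dig walks both.
cube_names = doc.dig("icecubetray", 0).keys.sort

puts foods.join(", ")    # Phyllo dough, Ice cream
puts temperature         # -12
puts cube_names.inspect  # ["cube1", "cube2"]
```

Note that every nested element sits inside an array, even when there is only one of it, which is why the index 0 appears in the call to dig.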

Discussion

XmlSimple is a lightweight alternative to the Document class. Instead of exposing a tree of Element objects, it exposes a nested structure of Ruby hashes and arrays. There's no performance savings (XmlSimple actually builds a Document object behind the scenes and iterates over it, so it's about half as fast as using Document directly), but the resulting object is easy to use. XmlSimple also provides several tricks that can make a document more concise and navigable.

The most useful trick is the KeyAttr one. Suppose you had a better-organized freezer than the one above, a freezer in which everything had its own name attribute:^[2]

^[2] Okay, it's not really better organized. In fact, it's exactly the same. But it sure looks cooler!

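	xml = %{
	<freezer temp="-12" scale="celcius">
	 <item name="Phyllo dough" type="food" />
	 <item name="Ice cream" type="food" />
	 <item name="Ice cube tray" type="container">
	  <item name="Ice cube" type="food" />
	  <item name="Ice cube" type="food" />
	 </item>
	</freezer>}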

You could parse this data with just a call to XmlSimple.xml_in, but you get a more concise representation by specifying the name attribute as a KeyAttr argument. Compare:

	parsed1 = XmlSimple.xml_in xml
	pp parsed1
	# {"scale"=>"celcius",
	# "item"=>
	# [{"name"=>"Phyllo dough", "type"=>"food"},
	# {"name"=>"Ice cream", "type"=>"food"},
	# {"name"=>"Ice cube tray",
	# "type"=>"container",
	# "item"=>
	# [{"name"=>"Ice cube", "type"=>"food"},
	# {"name"=>"Ice cube", "type"=>"food"}]}],
	# "temp"=>"-12"}

	parsed2 = XmlSimple.xml_in(xml, 'KeyAttr' => 'name')
	pp parsed2
	# {"scale"=>"celcius",
	# "item"=>
	# {"Phyllo dough"=>{"type"=>"food"},
	# "Ice cube tray"=>
	# {"type"=>"container",
	# "item"=>{"Ice cube"=>{"type"=>"food"}}},
	# "Ice cream"=>{"type"=>"food"}},
	# "temp"=>"-12"}

The second parsing is also easier to navigate:

	parsed1["item"].detect { |i| i['name'] == 'Phyllo dough' }['type']
	# => "food"
	parsed2["item"]["Phyllo dough"]["type"]
	# => "food"

But notice that the second parsing represents the ice cube tray as containing only one ice cube. This is because both ice cubes have the same name. When two tags at the same level have the same KeyAttr, one overwrites the other in the hash.
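
The collision can be reproduced without XmlSimple. The sketch below folds a list of hashes into a hash keyed on one attribute, which is roughly the effect KeyAttr has on a list of same-named elements. Both fold_by_key and duplicate_keys are hypothetical helpers written for this illustration; they are not part of XmlSimple and do not reflect how it is implemented.

```ruby
# Hypothetical helper: fold an array of hashes into one hash keyed on
# the value of `key`, dropping that key from each entry.
def fold_by_key(items, key)
  items.each_with_object({}) do |item, folded|
    rest = item.reject { |k, _| k == key }
    # A later item with the same key value replaces the earlier one.
    folded[item[key]] = rest
  end
end

# Hypothetical helper: list the key values that occur more than once,
# so you can tell in advance whether folding would lose data.
def duplicate_keys(items, key)
  items.group_by { |item| item[key] }
       .select { |_, group| group.size > 1 }
       .keys
end

tray_contents = [
  { "name" => "Ice cube", "type" => "food" },
  { "name" => "Ice cube", "type" => "food" }
]

folded = fold_by_key(tray_contents, "name")
dupes  = duplicate_keys(tray_contents, "name")

puts folded.size    # 1
puts dupes.inspect  # ["Ice cube"]
```

Running a check like duplicate_keys over the unfolded parse (parsed1 in this recipe) is one way to decide whether a given attribute is safe to use as a KeyAttr.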

You can modify the data structure with normal Ruby hash and array methods, then write it back out to XML with XmlSimple.xml_out:

	parsed1["item"] << {"name"=>"Curry leaves", "type"=>"spice"}
	parsed1["item"].delete_if { |i| i["name"] == "Ice cube tray" }

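	puts XmlSimple.xml_out(parsed1, "RootName"=>"freezer")
	# <freezer scale="celcius" temp="-12">
	#   <item name="Phyllo dough" type="food" />
	#   <item name="Ice cream" type="food" />
	#   <item name="Curry leaves" type="spice" />
	# </freezer>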

Be sure to specify a RootName argument when you call xml_out. When it parses a file, XmlSimple removes one level of indirection by throwing away the name of your document's root element. You can prevent this by using the KeepRoot argument in your original call to xml_in. You'll need an extra hash lookup to navigate the resulting data structure, but you'll retain the name of your root element.

	parsed3 = XmlSimple.xml_in(xml, 'KeepRoot'=>true)
	# Now there's no need to add an extra root element when writing back to XML.
	XmlSimple.xml_out(parsed3, 'RootName'=>nil)

One disadvantage of XmlSimple is that, since it puts elements into a hash, it does not preserve the order of the original document: elements are grouped by tag name, and on Ruby versions before 1.9 a hash has a random-looking order of its own. This is fine for a document listing the contents of a freezer, where order doesn't matter, but it would give interesting results if you tried to use it on a web page.

Another disadvantage is that, since an element's attributes and children are put into the same hash, you have no reliable way of telling one from the other. Indeed, attributes and subelements may even end up in a list together, as in this example:

	pp XmlSimple.xml_in(%{
	<freezer temp="-12" scale="celcius">
	 <temp>Body of temporary worker who knew too much</temp>
	</freezer>})
	# {"scale"=>"celcius",
	# "temp"=>["-12", "Body of temporary worker who knew too much"]}
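
To see why the two values end up side by side, consider what happens when attributes and child elements are stored in one hash under their names. The sketch below is only an illustration of that idea: merge_value is a hypothetical helper, not XmlSimple's actual code.

```ruby
# Hypothetical helper: store `value` under `key`, collecting values
# into an array when the key is already taken.
def merge_value(hash, key, value)
  if hash.key?(key)
    existing = hash[key]
    hash[key] = existing.is_a?(Array) ? existing + [value] : [existing, value]
  else
    hash[key] = value
  end
  hash
end

result = {}
# The two attributes of the root element...
merge_value(result, "temp", "-12")
merge_value(result, "scale", "celcius")
# ...and then a child element that happens to share a name with one of them.
merge_value(result, "temp", "Body of temporary worker who knew too much")

puts result["temp"].size  # 2
puts result["scale"]      # celcius
```

Nothing in the resulting hash records which value came from an attribute and which from a child element. If your documents reuse a name for both, you need to know that from the document format itself.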
