Ruby Cookbook
Authors: Carlson L. Richardson L.
Published year:
Pages: 174-175/399

Chapter 11. XML and HTML

XML and HTML are the most popular markup languages (textual ways of describing structured data). HTML is used to describe textual documents, like you see on the Web. XML is used for just about everything else: data storage, messaging, configuration files, you name it. Just about every software buzzword forged over the past few years involves XML.

Java and C++ programmers tend to regard XML as a lightweight, agile technology, and are happy to use it all over the place. XML is a lightweight technology, but only compared to Java or C++. Ruby programmers see XML from the other end of the spectrum, and from there it looks pretty heavy. Simpler formats like YAML and JSON usually work just as well (see Recipe 13.1 or Recipe 13.2), and are easier to manipulate. But to shun XML altogether would be to cut Ruby off from the rest of the world, and nobody wants that. This chapter covers the most useful ways of parsing, manipulating, slicing, and dicing XML and HTML documents.

There are two standard APIs for manipulating XML: DOM and SAX. Both are overkill for most everyday uses, and neither is a good fit for Ruby's code-block-heavy style. Ruby's solution is to offer a pair of APIs that capture the style of DOM and SAX while staying true to the Ruby programming philosophy. [1] Both APIs are in the standard library's REXML package, written by Sean Russell.

[1] REXML also provides the SAX2Parser and SAX2Listener classes, which implement the basic SAX2 API.

Like DOM, the Document class parses an XML document into a nested tree of objects. You can navigate the tree with Ruby accessors (Recipe 11.2) or with XPath queries (Recipe 11.4). You can modify the tree by creating your own Element and Text objects (Recipe 11.9). If even Document is too heavyweight for you, you can use the XmlSimple library to transform an XML file into a nested Ruby hash (Recipe 11.6).
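
As a rough illustration of the Document workflow described above, the following sketch parses a small document, walks it with accessors, queries it with XPath, and adds an element. The sample XML and the variable names are made up for this example; the later recipes cover each technique properly.

```ruby
require 'rexml/document'

# Hypothetical sample document, invented for illustration.
xml = <<~XML
  <groceries>
    <bread kind="loaf">Wheat</bread>
    <bread kind="roll">Rye</bread>
  </groceries>
XML

doc = REXML::Document.new(xml)

# Navigate with Ruby accessors: visit each <bread> child of the root.
bread_names = []
doc.root.elements.each('bread') { |e| bread_names << e.text }

# Navigate with an XPath query instead.
rolls = REXML::XPath.match(doc, "//bread[@kind='roll']").map(&:text)

# Modify the tree by adding a new element with some text.
milk = doc.root.add_element('milk')
milk.text = 'Whole'
item_names = doc.root.elements.map { |e| e.name }

p bread_names   # => ["Wheat", "Rye"]
p rolls         # => ["Rye"]
p item_names    # => ["bread", "bread", "milk"]
```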

With a DOM-style API like Document, you have to parse the entire XML file before you can do anything. The XML document becomes a large number of Ruby objects nested under a Document object, all sitting around taking up memory. With a SAX-style parser like the StreamParser class, you can process a document as it's parsed, creating only the objects you want. The StreamParser API is covered in Recipe 11.3.
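
To make the contrast concrete, here is a minimal sketch of stream-style processing. The EntryCollector class and the sample XML are hypothetical, invented for this example; the sketch relies on REXML's StreamListener callbacks and on Document.parse_stream, and keeps only the text of entry elements rather than building a full tree.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects only the text inside <entry> elements.
class EntryCollector
  include REXML::StreamListener
  attr_reader :entries

  def initialize
    @entries = []
    @in_entry = false
  end

  # Called when the parser sees an opening tag.
  def tag_start(name, _attributes)
    @in_entry = true if name == 'entry'
  end

  # Called for each run of character data.
  def text(text)
    @entries << text if @in_entry
  end

  # Called when the parser sees a closing tag.
  def tag_end(name)
    @in_entry = false if name == 'entry'
  end
end

xml = '<tasks><pending><entry>Grocery Shopping</entry></pending>' \
      '<done><entry>Dry Cleaning</entry></done></tasks>'

collector = EntryCollector.new
REXML::Document.parse_stream(xml, collector)
p collector.entries   # => ["Grocery Shopping", "Dry Cleaning"]
```

No Document or Element objects are kept here; the only state that survives parsing is the array of strings the listener chose to collect.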

The main problem with the REXML APIs is that they're very picky. They'll only parse a document that's valid XML, or close enough to have an unambiguous representation. This makes them nearly useless for parsing HTML documents off the World Wide Web, since the average web page is not valid XML. Recipe 11.5 shows how to use the third-party tools Rubyful Soup and SGMLParser; they give a DOM- or SAX-style interface that handles even invalid XML.

  • http://www.germane-software.com/software/rexml/

  • http://www.germane-software.com/software/rexml/docs/tutorial.html



Recipe 11.1. Checking XML Well-Formedness

Credit: Rod Gaither

Problem

You want to check that an XML document is well-formed before processing it.

Solution

The best way to see whether a document is well-formed is to try to parse it. The REXML library raises an exception when it can't parse an XML document, so just try parsing it and rescue any exception.

The valid_xml? method below returns nil unless it's given a well-formed XML document. If the document is well-formed, it returns a parsed Document object, so you don't have to parse it again:

require 'rexml/document'

def valid_xml?(xml)
  begin
    REXML::Document.new(xml)
  rescue REXML::ParseException
    # Return nil if an exception is thrown
  end
end

Discussion

To be useful, an XML document must be structured correctly or "well-formed." For instance, an opening tag must either be self-closing or be paired with an appropriate closing tag.

As a file and messaging format, XML is often used in situations where you don't have control over the input, so you can't assume that it will always be well-formed. Rather than just letting REXML throw an exception, you'll need to handle ill-formed XML gracefully, providing options to retry or continue on a different path.
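
One possible shape for that graceful handling is sketched below. The parse_with_report helper and its return convention are assumptions made for this example, not part of REXML: it hands back either the parsed document or the first line of the parser's error message, so the caller can decide whether to retry, skip the input, or take another path.

```ruby
require 'rexml/document'

# Hypothetical helper: returns [document, nil] on success,
# or [nil, error_message] when the input can't be parsed.
def parse_with_report(xml)
  [REXML::Document.new(xml), nil]
rescue REXML::ParseException => e
  # REXML's messages span several lines; keep only the first one.
  [nil, e.message.lines.first.to_s.strip]
end

# The <pending> tag is never closed, so parsing fails.
doc, error = parse_with_report('<tasks><pending></tasks>')

if doc
  puts "Parsed document with root <#{doc.root.name}>"
else
  # Continue on a different path instead of crashing.
  warn "Skipping ill-formed input: #{error}"
end
```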

This bit of XML is not well-formed: it's missing ending tags for both the pending and done elements:

bad_xml = %{
	<tasks>
	 <pending>
	   <entry>Grocery Shopping</entry>
	 <done>
	   <entry>Dry Cleaning</entry>
	</tasks>}

	valid_xml?(bad_xml)                           # => nil

This bit of XML is well-formed, so valid_xml? returns the parsed Document object.

good_xml = %{
	<groceries>
	 <bread>Wheat</bread>
	 <bread>Quadrotriticale</bread>
	</groceries>}

	doc = valid_xml?(good_xml)
	doc.root.elements[1]                          # => <bread> … </>

When your program is responsible for writing XML documents, you'll want to write unit tests that make sure you generate well-formed XML. You can use a feature of the Test::Unit library to simplify the checking. Since ill-formed XML makes REXML raise an exception, your unit test can use the assert_nothing_raised method to make sure your XML parses (assert_nothing_thrown only guards against Ruby's throw, not against raised exceptions):

doc = nil
assert_nothing_raised {doc = REXML::Document.new(source_xml)}

This is a simple, clean test to verify XML when using a unit test.

Note that valid_xml? doesn't work perfectly: some invalid XML is unambiguous, which means REXML can parse it. Consider this truncated version of the valid XML example. It's missing its closing tags, but there's no ambiguity about which closing tag should come first, so REXML can parse the file and provide the closing tags:

invalid_xml = %{
	<groceries>
	 <bread>Wheat
	}

	(valid_xml? invalid_xml) == nil          # => false # That is, it is "valid"
	REXML::Document.new(invalid_xml).write
	# <groceries>
	#   <bread>Wheat
	# </bread></groceries>

See Also

  • Official information on XML can be found at http://www.w3.org/XML/

  • The Wikipedia has a good description of the difference between Well-Formed and Valid XML documents at http://en.wikipedia.org/wiki/Xml#Correctness_in_an_XML_document

  • Recipe 11.5, "Parsing Invalid Markup"

  • Recipe 17.3, "Handling an Exception"

