Checking XML Well-Formedness

Credit: Rod Gaither

Problem

You want to check that an XML document is well-formed before processing it.

Solution

The best way to see whether a document is well-formed is to try to parse it. The REXML library raises an exception when it can't parse an XML document, so just try parsing it and rescue any exception.

The valid_xml? method below returns nil unless it's given a valid XML document. If the document is valid, it returns a parsed Document object, so you don't have to parse it again:

	require 'rexml/document'

	def valid_xml?(xml)
	  begin
	    REXML::Document.new(xml)
	  rescue REXML::ParseException
	    # Return nil if an exception is thrown
	  end
	end

Discussion

To be useful, an XML document must be structured correctly or "well-formed." For instance, an opening tag must either be self-closing or be paired with an appropriate closing tag.

As a file and messaging format, XML is often used in situations where you don't have control over the input, so you can't assume that it will always be well-formed. Rather than just letting REXML throw an exception, you'll need to handle ill-formed XML gracefully, providing options to retry or continue on a different path.
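
One way to continue on a different path can be sketched as follows. The helper name parse_with_fallback and the default fallback document are hypothetical, chosen only for illustration; the REXML calls are the same ones used in the Solution. The sketch rescues REXML::ParseException, reports the first line of the parser's message, and carries on with a fallback document:

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): parse xml, and if it is
# ill-formed, report the problem and fall back to a default document.
def parse_with_fallback(xml, fallback_xml = '<empty/>')
  REXML::Document.new(xml)
rescue REXML::ParseException => e
  # e.message describes what the parser objected to; keep only its first line.
  warn "Ill-formed XML, using fallback: #{e.message.lines.first.to_s.strip}"
  REXML::Document.new(fallback_xml)
end

doc = parse_with_fallback('<a><b></a>') # mismatched closing tag
puts doc.root.name                      # prints "empty"
```

Whether the right recovery is a fallback document, a retry, or a user-visible error depends on your application; the rescue clause is where that decision goes.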

This bit of XML is not well-formed: it's missing ending tags for both the pending and done elements:

	bad_xml = %{
	<tasks>
	  <pending>
	    <entry>Grocery Shopping</entry>
	  <done>
	    <entry>Dry Cleaning</entry>
	</tasks>}

	valid_xml?(bad_xml) # => nil

This bit of XML is well-formed, so valid_xml? returns the parsed Document object.

	good_xml = %{
	<groceries>
	  <bread>Wheat</bread>
	  <bread>Quadrotriticale</bread>
	</groceries>}

	doc = valid_xml?(good_xml)
	doc.root.elements[1] # => <bread> ... </>

When your program is responsible for writing XML documents, you'll want to write unit tests that make sure you generate valid XML. You can use a feature of the Test::Unit library to simplify the checking. Since REXML raises an exception when given invalid XML, your unit test can use the assert_nothing_raised method to make sure your XML is valid:

	doc = nil
	assert_nothing_raised {doc = REXML::Document.new(source_xml)}

This is a simple, clean test to verify XML when using a unit test.
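
Placed in a complete test case, the check might look like the sketch below. The generate_xml method and the class name WellFormednessTest are hypothetical stand-ins for your own XML-writing code and test class. The sketch uses assert_nothing_raised, the Test::Unit assertion that checks for raised exceptions, and assumes the test-unit library is installed:

```ruby
require 'test/unit'
require 'rexml/document'

# Hypothetical stand-in for whatever code your program uses to write XML.
def generate_xml
  doc = REXML::Document.new
  root = doc.add_element('groceries')
  root.add_element('bread').text = 'Wheat'
  doc.to_s
end

class WellFormednessTest < Test::Unit::TestCase
  def test_generated_xml_is_well_formed
    doc = nil
    # Fails if REXML raises while parsing the generated XML.
    assert_nothing_raised { doc = REXML::Document.new(generate_xml) }
    # Keeping the parsed document lets the test go on to check content.
    assert_equal 'groceries', doc.root.name
  end
end
```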

Note that valid_xml? doesn't work perfectly: some invalid XML is unambiguous, which means REXML can parse it. Consider this truncated version of the valid XML example. It's missing its closing tags, but there's no ambiguity about which closing tag should come first, so REXML can parse the file and provide the closing tags:

	invalid_xml = %{
	<groceries>
	  <bread>Wheat
	}

	(valid_xml? invalid_xml) == nil # => false # That is, it is "valid"
	REXML::Document.new(invalid_xml).write
	# <groceries>
	#   <bread>Wheat
	# </bread></groceries>
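
A related gap, offered as a hedged sketch: a well-formed XML document must contain a root element, but depending on the REXML version, input with no elements at all (such as an empty string) may parse without raising an exception and simply yield a Document whose root is nil. The hypothetical helper xml_with_root? below is a slightly stricter variant of valid_xml? that treats such input as not well-formed:

```ruby
require 'rexml/document'

# Hypothetical stricter variant of valid_xml?: also rejects input that
# parses without error but contains no root element.
def xml_with_root?(xml)
  doc = REXML::Document.new(xml)
  doc.root ? doc : nil
rescue REXML::ParseException
  # Return nil if the parser rejects the input outright
  nil
end

xml_with_root?('')                   # => nil
xml_with_root?('<groceries/>').class # => REXML::Document
```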

See Also

  • Official information on XML can be found at http://www.w3.org/XML/
  • Wikipedia has a good description of the difference between well-formed and valid XML documents at http://en.wikipedia.org/wiki/Xml#Correctness_in_an_XML_document
  • Recipe 11.5, "Parsing Invalid Markup"
  • Recipe 17.3, "Handling an Exception"


Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696