Extracting Data While Parsing a Document

Credit: Rod Gaither


You want to process a large XML file without loading it all into memory.


The method REXML::Document.parse_stream gives you a fast and flexible way to scan a large XML file and process the parts that interest you.

Consider this XML document, the output of a hypothetical program that runs automated tasks. We want to parse the document and find the tasks that failed (that is, returned an error code other than zero).

	event_xml = %{

We can process the document as it's being parsed by writing a REXML::StreamListener subclass that responds to parsing events such as tag_start and tag_end. Here's a subclass that listens for tags with a nonzero value for their error attribute. It prints a message for every failed event it finds.


	require 'rexml/document'
	require 'rexml/streamlistener'
	class ErrorListener
	  include REXML::StreamListener
	  def tag_start(name, attrs)
	    if attrs["error"] != nil and attrs["error"] != "0"
	      puts %{Event "#{name}" failed for system "#{attrs["system"]}" } +
	        %{with code #{attrs["error"]}}
	    end
	  end
	end

To actually parse the XML data, pass it along with the StreamListener into the method REXML::Document.parse_stream:

	REXML::Document.parse_stream(event_xml, ErrorListener.new)
	# Event "clean" failed for system "dev" with code 1
	# Event "backup" failed for system "dev" with code 2
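Because the sample document is cut off in this copy, here is a minimal end-to-end sketch you can run as-is. The document contents are a hypothetical stand-in: only the "clean" and "backup" failures on the "dev" system come from the output above, and the other two events, the constant SAMPLE_EVENTS, and the class name CollectingErrorListener are assumptions made for illustration. This variant collects failures in an array rather than printing from inside the listener, which makes the result easier to reuse.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical stand-in for the recipe's event log. Only the two failing
# events on "dev" are taken from the recipe's output; the rest is assumed.
SAMPLE_EVENTS = %{
<events>
  <clean system="dev" error="1" />
  <backup system="dev" error="2" />
  <backup system="prod" error="0" />
  <deploy system="prod" />
</events>}

# Records each failed event as [name, system, code] instead of printing it.
class CollectingErrorListener
  include REXML::StreamListener
  attr_reader :failures

  def initialize
    @failures = []
  end

  # Called once for every opening tag; attrs is a hash of attribute strings.
  def tag_start(name, attrs)
    code = attrs["error"]
    @failures << [name, attrs["system"], code] if code && code != "0"
  end
end

listener = CollectingErrorListener.new
REXML::Document.parse_stream(SAMPLE_EVENTS, listener)

listener.failures.each do |name, system, code|
  puts %{Event "#{name}" failed for system "#{system}" with code #{code}}
end
```

The listener never sees the document as a whole. It only reacts to one opening tag at a time, so its memory use depends on what it chooses to remember, not on the size of the input.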


We could find the failed events in less code by loading the XML into a Document and running an XPath query. That approach would work fine for this example, since the document only contains four events. It wouldn't work as well if the document were a file on disk containing a billion events. Building a Document means building an elaborate in-memory data structure representing the entire XML document. If you only care about part of a document (in this case, the failed events), it's faster and less memory-intensive to process the document as it's being parsed. Once the parser reaches the end of the document, you're done.
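For comparison, the tree-based alternative mentioned above might look like the following sketch. The document is a hypothetical stand-in (the element and attribute names are assumptions), and the XPath expression is deliberately kept simple: it selects every element that carries an error attribute, and plain Ruby filters out the zero codes.

```ruby
require 'rexml/document'

# Hypothetical event log; element and attribute names are assumptions.
events_xml = %{
<events>
  <clean system="dev" error="1" />
  <backup system="dev" error="2" />
  <backup system="prod" error="0" />
  <deploy system="prod" />
</events>}

# Build the full in-memory tree (the costly step for very large inputs).
doc = REXML::Document.new(events_xml)

# XPath picks every element that has an error attribute; plain Ruby then
# drops the ones whose code is "0".
failed = REXML::XPath.match(doc, "//*[@error]").reject do |element|
  element.attributes["error"] == "0"
end

failed.each do |element|
  puts %{Event "#{element.name}" failed for system } +
       %{"#{element.attributes["system"]}" with code #{element.attributes["error"]}}
end
```

This is shorter than a listener class, but REXML::Document.new has to finish building the whole tree before the query can run.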

The stream-oriented approach to parsing XML can be as simple as shown in this recipe, but it can also handle much more complex scenarios. Your StreamListener subclass can keep arbitrary state in instance variables, letting you track complex combinations of elements and attributes.
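As a sketch of a listener that keeps state, suppose the status of each task were stored in a child element rather than an attribute. The document layout below (task, name, and status elements) and the class name FailedTaskListener are hypothetical, invented only to show the technique. The listener uses a stack of open element names to know where it is, and a hash to accumulate the text of the task currently being read.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical nested log: each <task> has a <name> and a <status> child.
tasks_xml = %{
<tasks>
  <task><name>clean</name><status>failed</status></task>
  <task><name>backup</name><status>ok</status></task>
  <task><name>rotate</name><status>failed</status></task>
</tasks>}

class FailedTaskListener
  include REXML::StreamListener
  attr_reader :failed

  def initialize
    @failed  = []   # names of tasks whose status was "failed"
    @path    = []   # stack of currently open element names
    @current = {}   # text collected for the task being read
  end

  def tag_start(name, attrs)
    @path.push(name)
    @current = {} if name == "task"   # start a fresh record
  end

  def text(data)
    # Keep only text that sits directly inside <name> or <status>;
    # whitespace between elements is ignored.
    key = @path.last
    return unless %w[name status].include?(key)
    @current[key] = (@current[key] || "") + data
  end

  def tag_end(name)
    @path.pop
    # A task can only be judged once its closing tag arrives.
    if name == "task" && @current["status"].to_s.strip == "failed"
      @failed << @current["name"].to_s.strip
    end
  end
end

task_listener = FailedTaskListener.new
REXML::Document.parse_stream(tasks_xml, task_listener)
puts task_listener.failed.join(", ")
```

The listener holds only one task's worth of data at a time, so the memory cost stays flat no matter how many tasks the document contains.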

See Also

  • The RDoc documentation for the REXML::StreamParser class
  • The "Stream Parsing" section of the REXML Tutorial (http://www.germane-software.com/software/rexml/docs/tutorial.html#id2248457)
  • Recipe 11.2, "Extracting Data from a Document's Tree Structure"



Ruby Cookbook
Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696