Substituting XML Entities

Problem

You've parsed a document that contains internal XML entities. You want to replace the entity references in the document with their values.

Solution

To perform entity substitution on a specific text element, call its value method. If it's the first text element of its parent, you can call text on the parent instead.

Here's a simple document that defines and uses two entities in a single text node. We can substitute the entities' values without changing the document itself:

	require 'rexml/document'

	str = %{<!DOCTYPE doc [
	 <!ENTITY product 'Stargaze'>
	 <!ENTITY version '2.3'>
	]>
	<doc>
	 &product; v&version; is the most advanced astronomy product on the market.
	</doc>}
	doc = REXML::Document.new str

	doc.root.children[0].value
	# => "\n Stargaze v2.3 is the most advanced astronomy product on the market.\n"
	doc.root.text
	# => "\n Stargaze v2.3 is the most advanced astronomy product on the market.\n"

	doc.root.children[0].to_s
	# => "\n &product; v&version; is the most advanced astronomy product on the market.\n"
	doc.root.write
	# <doc>
	#  &product; v&version; is the most advanced astronomy product on the market.
	# </doc>

Discussion

Internal XML entities are often used to factor out data that changes a lot, like dates or version numbers. But REXML only provides a convenient way to perform substitution on a single text node. What if you want to perform substitutions throughout the entire document?

When you call Document#write to send a document to some IO object, it ends up calling Text#to_s on each text node. As seen in the Solution, this method presents a "normalized" view of the data, one where entities are displayed instead of having their values substituted in.
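One way to see this for yourself is to write the document into an in-memory buffer and inspect the result. The following is only a sketch: it uses a cut-down version of the Solution's document, and the buffer variable and the use of StringIO are illustrative choices rather than part of the recipe.

```ruby
require 'rexml/document'
require 'stringio'

# A cut-down version of the Solution's document (assumed for illustration).
xml = <<~XML
  <!DOCTYPE doc [
    <!ENTITY product 'Stargaze'>
    <!ENTITY version '2.3'>
  ]>
  <doc>&product; v&version;</doc>
XML
doc = REXML::Document.new(xml)

# Capture what Document#write emits instead of printing it.
buffer = StringIO.new
doc.write(buffer)
written = buffer.string

# The serialized text node still carries the entity references...
still_normalized = written.include?('&product; v&version;')

# ...while Text#value returns the string with the values substituted in.
substituted = doc.root.children[0].value
```

Here still_normalized should be true, and substituted should hold the text with both entity values filled in.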

We could write our own version of Document#write that presents an "unnormalized" view of the document, one with entity values substituted in, but that would be a lot of work. We could hack Text#to_s to work more like Text#value, or hack Text#write to call the value method instead of to_s. But it's less intrusive to do the entity replacement outside of the write method altogether. Here's a class that wraps any IO object and performs entity replacement on all the text that comes through it:

	require 'delegate'
	require 'rexml/text'
	class EntitySubstituter < DelegateClass(IO)
	  def initialize(io, document, filter=nil)
	    @document = document
	    @filter = filter
	    super(io)
	  end

	  # Substitute entity values into each string before passing it on
	  # to the wrapped IO object.
	  def <<(s)
	    super(REXML::Text::unnormalize(s, @document.doctype, @filter))
	  end
	end

	output = EntitySubstituter.new($stdout, doc)
	doc.write(output)
	# <!DOCTYPE doc [
	# <!ENTITY product "Stargaze">
	# <!ENTITY version "2.3">
	# ]>
	# <doc>
	#  Stargaze v2.3 is the most advanced astronomy product on the market.
	# </doc>

Because it processes the entire output of Document#write, this code will replace all entity references in the document. This includes any references found in attribute values, which may or may not be what you want.
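If substituting inside attribute values is not what you want, one alternative is to skip the output wrapper and walk the tree yourself, calling Text#value on each text node. This is a minimal sketch, not part of the recipe: the helper name substituted_texts is made up for illustration, the sample document is an assumption, and the helper returns the substituted strings rather than producing a serialized document.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the entity-substituted value of every
# text node under +node+, in document order. It never looks at attributes.
def substituted_texts(node)
  node.children.flat_map do |child|
    case child
    when REXML::Text    then [child.value]
    when REXML::Element then substituted_texts(child)
    else []             # comments, processing instructions, and so on
    end
  end
end

# Sample document (assumed for illustration); note the entity reference
# inside the name attribute as well as in the text.
xml = <<~XML
  <!DOCTYPE doc [
    <!ENTITY product 'Stargaze'>
    <!ENTITY version '2.3'>
  ]>
  <doc name="&product;"><p>&product; v&version;</p><p>Version &version;</p></doc>
XML
doc = REXML::Document.new(xml)

texts = substituted_texts(doc.root)
```

Because the walk only visits Text nodes, what happens to references in attribute values stays a separate decision.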

If you create a Text object manually, or set the value of an existing object, REXML assumes that you're giving it unnormalized text, and normalizes it. This can be problematic if your text contains strings that happen to be the values of entities:

	text_node = doc.root.children[0]
	text_node.value = "&product; v&version; has a catalogue of 2.3 " +
	 "million celestial objects."

	doc.write
	# <!DOCTYPE doc [
	# <!ENTITY product "Stargaze">
	# <!ENTITY version "2.3">
	# ]>
	# <doc>&product; v&version; has a catalogue of &version; million celestial objects.</doc>

To avoid this, you can create a "raw" text node:

	text_node.raw = true
	doc.write
	# <!DOCTYPE doc [
	# <!ENTITY product "Stargaze">
	# <!ENTITY version "2.3">
	# ]>
	# <doc>&product; v&version; has a catalogue of 2.3 million celestial objects.</doc>

	text_node.value
	# => "Stargaze v2.3 has a catalogue of 2.3 million celestial objects."
	text_node.to_s
	# => "&product; v&version; has a catalogue of 2.3 million celestial objects."

In addition to entities you define, REXML automatically processes five named character entities: the ones for left and right angle brackets, single and double quotes, and the ampersand. Each is replaced with the corresponding ASCII character.

	str = %{<!DOCTYPE doc [ <!ENTITY year '2006'> ]>
	<doc>&copy; &year; Komodo Dragon &amp; Bob Productions</doc>
	}

	doc = REXML::Document.new str
	text_node = doc.root.children[0]

	text_node.value
	# => "&copy; 2006 Komodo Dragon & Bob Productions"
	text_node.to_s
	# => "&copy; &year; Komodo Dragon &amp; Bob Productions"

"&copy;" is an HTML character entity representing the copyright symbol, but REXML doesn't know that. It only knows about the five XML character entities. Also, REXML only knows about internal entities: ones whose values are defined within the same document that uses them. It won't resolve external entities.
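The five predefined entities mentioned above are &lt;, &gt;, &apos;, &quot;, and &amp;. As a small sketch (the sample string is made up for illustration), a document needs no DOCTYPE at all for REXML to resolve them:

```ruby
require 'rexml/document'

# No DOCTYPE is declared: these five entities are predefined by XML itself.
doc = REXML::Document.new(
  %{<doc>&lt;b&gt; &amp; &quot;q&quot; &apos;a&apos;</doc>}
)
text_node = doc.root.children[0]

resolved = text_node.value   # the five references replaced with characters
escaped  = text_node.to_s    # the text as it appears in the document
```

As with user-defined entities, value gives the substituted view and to_s keeps the references.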

See Also

  • The section "Text Nodes" of the REXML tutorial (http://www.germane-software.com/software/rexml/docs/tutorial.html#id2248004)



Ruby Cookbook
ISBN: 0596523696