Recipe 12.6. Removing Whitespace-only Text Nodes from an XML DOM Node's Subtree

Credit: Brian Quinlan, David Wilson

Problem

You want to remove, from the DOM representation of an XML document, all the text nodes within a subtree that contain only whitespace.

Solution

XML parsers consider several complex conditions when deciding which whitespace-only text nodes to preserve during DOM construction. Unfortunately, the result is often not what you want, so it's helpful to have a function to remove all whitespace-only text nodes from among a given node's descendants:

    from xml import dom

    def remove_whitespace_nodes(node):
        """ Removes all of the whitespace-only text descendants of a DOM node. """
        # prepare the list of text nodes to remove (and recurse when needed)
        remove_list = []
        for child in node.childNodes:
            if child.nodeType == dom.Node.TEXT_NODE and not child.data.strip():
                # add this text node to the to-be-removed list
                remove_list.append(child)
            elif child.hasChildNodes():
                # recurse, it's the simplest way to deal with the subtree
                remove_whitespace_nodes(child)
        # perform the removals
        for node in remove_list:
            node.parentNode.removeChild(node)
            node.unlink()
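A short usage sketch with xml.dom.minidom may help show the effect. The sample document and the names `SAMPLE`, `before`, and `after` are illustrative assumptions, not part of the original recipe. The function is repeated here in compact form only so that the snippet runs on its own.

```python
from xml import dom
from xml.dom import minidom

def remove_whitespace_nodes(node):
    """Compact copy of the recipe's function, so this sketch is self-contained."""
    # collect the whitespace-only text children first
    remove_list = [child for child in node.childNodes
                   if child.nodeType == dom.Node.TEXT_NODE
                   and not child.data.strip()]
    # recurse into any child that has children of its own
    for child in node.childNodes:
        if child.hasChildNodes():
            remove_whitespace_nodes(child)
    # only now alter node.childNodes
    for doomed in remove_list:
        node.removeChild(doomed)
        doomed.unlink()

# Hypothetical sample: pretty-printed XML, so the parser keeps
# whitespace-only text nodes between the elements.
SAMPLE = """<root>
    <item>first</item>
    <item>  second  </item>
</root>"""

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

before = len(root.childNodes)   # 3 whitespace-only text nodes + 2 elements
remove_whitespace_nodes(root)
after = len(root.childNodes)    # only the 2 elements remain

print(before, after)            # prints: 5 2
print(root.toxml())
```

Note that the text inside the second `item` keeps its leading and trailing spaces: only text nodes made up entirely of whitespace are removed.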

Discussion

This recipe's code works with any correctly implemented Python XML DOM, including the xml.dom.minidom that is part of the Python Standard Library and the more complete DOM implementation that comes with PyXML.

The implementation of function remove_whitespace_nodes is quite simple but rather instructive: in the first for loop we build a list of all child nodes to remove, and then in a second, separate loop we do the removal. This precaution is a good example of a general rule in Python: do not alter the very container you're looping on. Sometimes you can get away with it, but it is unwise to count on it in the general case. On the other hand, the function can perfectly well call itself recursively within its first for loop, because such a call does not alter the very list node.childNodes on which the loop is iterating (it may alter some items in that list, but it does not alter the list object itself).
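The rule about not altering the container you are looping on can be illustrated with a plain Python list. This is only a minimal sketch: the helper names `remove_blanks_unsafe` and `remove_blanks_safe` and the sample data are hypothetical, chosen for illustration.

```python
def remove_blanks_unsafe(items):
    """Removes blank strings while iterating over the same list."""
    for item in items:
        if not item.strip():
            # removal shifts later items left, so the loop skips the next one
            items.remove(item)
    return items

def remove_blanks_safe(items):
    """Collects first, removes afterwards, as the recipe does."""
    remove_list = [item for item in items if not item.strip()]
    for item in remove_list:
        items.remove(item)
    return items

data = ["a", " ", "  ", "b"]
unsafe_result = remove_blanks_unsafe(list(data))
safe_result = remove_blanks_safe(list(data))

print(unsafe_result)   # prints: ['a', '  ', 'b']  (one blank entry survives)
print(safe_result)     # prints: ['a', 'b']
```

The unsafe version misses the second blank entry because it moves into the slot the loop has already visited. In a DOM, whitespace-only text nodes are often separated by elements, which is one way code that removes during iteration can appear to work until the input changes.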

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
