Recipe 12.5. Converting an XML Document into a Tree of Python Objects

Credit: John Bair, Christoph Dietze

Problem

You want to load an XML document into memory, but you don't like the complicated access procedures of DOM. You'd prefer something more Pythonic: specifically, you'd like to map the document into a tree of Python objects.

Solution

To build our tree of objects, we can directly wrap the fast expat parser:

from xml.parsers import expat

class Element(object):
    ''' A parsed XML element '''
    def __init__(self, name, attributes):
        # Record tagname and attributes dictionary
        self.name = name
        self.attributes = attributes
        # Initialize the element's cdata and children to empty
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self, key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)

class Xml2Obj(object):
    ''' XML to Object converter '''
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        'Expat start element event handler'
        # Instantiate an Element object (expat already delivers the
        # name as a text string, so no encoding step is needed)
        element = Element(name, attributes)
        # Push element onto the stack and make it a child of parent
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        'Expat end element event handler'
        # Pop the element that just closed off the stack itself
        self.nodeStack.pop()
    def CharacterData(self, data):
        'Expat character data event handler'
        # Ignore whitespace-only chunks; append the rest to the open element
        if data.strip():
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        # Create an Expat parser
        Parser = expat.ParserCreate()
        # Set the Expat event handlers to our methods
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
        # Parse the XML file; reading bytes lets expat honor the
        # encoding declared in the document
        with open(filename, 'rb') as f:
            Parser.Parse(f.read(), True)
        return self.root

parser = Xml2Obj()
root_element = parser.Parse('sample.xml')

Discussion

I saw Christoph Dietze's recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/116539) about turning the structure of an XML document into a simple combination of dictionaries and lists and thought it was a really good idea. This recipe is a variation on that idea, with several differences.

For maximum speed, the recipe uses the low-level expat parser directly. It would get no real added value from the richer SAX interface, much less from the slow and memory-hungry DOM approach. Building the parent-children connections is not hard even with an event-driven interface, as this recipe shows by using a simple stack for the purpose.
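The stack technique can be seen in isolation in the following minimal sketch. It is not the recipe's code: the function name build_tree and the (name, attributes, children) tuple layout are assumptions chosen only to keep the example short, and the document is parsed from a string rather than a file.

```python
from xml.parsers import expat

def build_tree(xml_text):
    """Return a nested (name, attributes, children) tuple for xml_text."""
    root = None
    stack = []  # elements that are currently open, innermost last

    def start(name, attributes):
        nonlocal root
        node = (name, attributes, [])
        if stack:
            stack[-1][2].append(node)  # attach to the innermost open element
        else:
            root = node                # the first element seen is the root
        stack.append(node)

    def end(name):
        stack.pop()                    # the innermost element is now closed

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return root

# Hypothetical sample document
tree = build_tree('<a><b x="1"/><c><d/></c></a>')
print(tree[0], [child[0] for child in tree[2]])
```

Whatever element sits on top of the stack when a start event arrives is, by construction, the parent of the new element, so no lookahead or backtracking is needed.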

The main difference with respect to Dietze's original idea is that this recipe loads the XML document into a tree of Python objects (rather than a combination of dictionaries and lists), one per node, with nicely named attributes allowing access to each node's characteristics: tag name, attributes (as a Python dictionary), character data (i.e., cdata in XML parlance), and child elements (as a Python list).
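As a rough illustration of what those four characteristics make possible, here is a sketch of a recursive walk over such a tree. The helper names describe and node are hypothetical, and the nodes are SimpleNamespace stand-ins carrying the same four fields (name, attributes, cdata, children) so that the example runs without an XML file; the same function should work unchanged on the recipe's Element objects.

```python
from types import SimpleNamespace

def describe(element, depth=0):
    """Return indented lines summarising element and its descendants."""
    lines = ['%s%s %r %r' % ('  ' * depth, element.name,
                             element.attributes, element.cdata)]
    for child in element.children:
        lines.extend(describe(child, depth + 1))
    return lines

def node(name, attributes=None, cdata='', children=()):
    # Stand-in with the same fields as the recipe's Element (assumption)
    return SimpleNamespace(name=name, attributes=attributes or {},
                           cdata=cdata, children=list(children))

# Hypothetical hand-built tree
doc = node('book', {'id': 'b1'},
           children=[node('title', cdata='Python Cookbook')])
lines = describe(doc)
print('\n'.join(lines))
```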

The various accessor methods of class Element are, of course, optional. You might prefer to access the attributes directly. I think they add no complexity and look nicer, but, obviously, your tastes may differ. This is, after all, just a recipe, so feel free to alter the mix of seasonings at will!
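To make the trade-off concrete, the sketch below reads the same values both ways. The Element class here is a trimmed copy of the recipe's class, included only so the example is self-contained, and the book/title data is made up.

```python
class Element(object):
    """Trimmed copy of the recipe's Element, for illustration only."""
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []
    def getAttribute(self, key):
        return self.attributes.get(key)
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        return list(self.children)

# Hand-built tree (hypothetical data)
book = Element('book', {'id': 'b1'})
title = Element('title', {})
title.cdata = 'Python Cookbook'
book.children.append(title)

# The same information through accessor methods and through plain attributes
via_methods = (book.getAttribute('id'), book.getElements('title')[0].cdata)
via_fields = (book.attributes.get('id'),
              [c for c in book.children if c.name == 'title'][0].cdata)

# getElements returns a new list, so changing it leaves the tree intact
snapshot = book.getElements()
snapshot.clear()
still_there = len(book.children)
```

One practical difference is that getElements hands back a copy of the child list, whereas element.children is the live list; code that mutates the result should pick between the two deliberately.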

You can find other similar ideas (e.g., bypass the DOM, build something more Pythonic as the memory representation of an XML document) in many other excellent and more complete projects, such as PyRXP (http://www.reportlab.org/pyrxp.html), ElementTree (http://effbot.org/zone/element-index.htm), and XIST (http://www.livinglogic.de/Python/xist/).
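For comparison, ElementTree later became part of the Python standard library as xml.etree.ElementTree. The short sketch below shows roughly the same kind of access (tag name, attributes, text, children) using that module; the order/item document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, parsed from a string
root = ET.fromstring(
    '<order id="42"><item>tea</item><item>milk</item></order>')

tag = root.tag                  # tag name of the root element
order_id = root.get('id')       # attribute lookup, None if missing
items = [child.text for child in root.findall('item')]  # text of matching children
print(tag, order_id, items)
```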
