Recipe 12.5. Converting an XML Document into a Tree of Python Objects

Credit: John Bair, Christoph Dietze

Problem

You want to load an XML document into memory, but you don't like the complicated access procedures of DOM. You'd prefer something more Pythonic; specifically, you'd like to map the document into a tree of Python objects.

Solution

To build our tree of objects, we can directly wrap the fast expat parser:

    from xml.parsers import expat

    class Element(object):
        ''' A parsed XML element '''
        def __init__(self, name, attributes):
            # Record tagname and attributes dictionary
            self.name = name
            self.attributes = attributes
            # Initialize the element's cdata and children to empty
            self.cdata = ''
            self.children = []
        def addChild(self, element):
            self.children.append(element)
        def getAttribute(self, key):
            # Return the attribute's value, or None if it is absent
            return self.attributes.get(key)
        def getData(self):
            return self.cdata
        def getElements(self, name=''):
            # Return the children with the given tagname, or all children
            if name:
                return [c for c in self.children if c.name == name]
            else:
                return list(self.children)

    class Xml2Obj(object):
        ''' XML to Object converter '''
        def __init__(self):
            self.root = None
            self.nodeStack = []
        def StartElement(self, name, attributes):
            'Expat start element event handler'
            # Instantiate an Element object (expat passes the name as text)
            element = Element(name, attributes)
            # Push element onto the stack and make it a child of parent
            if self.nodeStack:
                parent = self.nodeStack[-1]
                parent.addChild(element)
            else:
                self.root = element
            self.nodeStack.append(element)
        def EndElement(self, name):
            'Expat end element event handler'
            # The element is complete: pop it off the stack
            self.nodeStack.pop()
        def CharacterData(self, data):
            'Expat character data event handler'
            # Skip chunks that contain only whitespace
            if data.strip():
                element = self.nodeStack[-1]
                element.cdata += data
        def Parse(self, filename):
            # Create an Expat parser
            Parser = expat.ParserCreate()
            # Set the Expat event handlers to our methods
            Parser.StartElementHandler = self.StartElement
            Parser.EndElementHandler = self.EndElement
            Parser.CharacterDataHandler = self.CharacterData
            # Parse the XML file; reading bytes lets expat honor the
            # document's own encoding declaration
            with open(filename, 'rb') as xml_file:
                Parser.Parse(xml_file.read(), True)
            return self.root

    parser = Xml2Obj()
    root_element = parser.Parse('sample.xml')

Discussion

I saw Christoph Dietze's recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/116539) about turning the structure of an XML document into a simple combination of dictionaries and lists and thought it was a really good idea. This recipe is a variation on that idea, with several differences.

For maximum speed, the recipe uses the low-level expat parser directly. It would get no real added value from the richer SAX interface, much less from the slow and memory-hungry DOM approach. Building the parent-children connections is not hard even with an event-driven interface, as this recipe shows by using a simple stack for the purpose: each start-element event pushes a new node, and each end-element event pops it, so the top of the stack is always the parent of whatever comes next.

The main difference with respect to Dietze's original idea is that this recipe loads the XML document into a tree of Python objects (rather than a combination of dictionaries and lists), one per node, with nicely named attributes allowing access to each node's characteristics: tagname, attributes (as a Python dictionary), character data (i.e., cdata in XML parlance), and child elements (as a Python list).

The various accessor methods of class Element are, of course, optional. You might prefer to access the attributes directly. I think they add no complexity and look nicer, but, obviously, your tastes may differ. This is, after all, just a recipe, so feel free to alter the mix of seasonings at will!

You can find other similar ideas (e.g., bypass the DOM, build something more Pythonic as the memory representation of an XML document) in many other excellent and more complete projects, such as PyRXP (http://www.reportlab.org/pyrxp.html), ElementTree (http://effbot.org/zone/element-index.htm), and XIST (http://www.livinglogic.de/Python/xist/).

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library, and xml.parsers.expat in particular.
PyRXP is at http://www.reportlab.org/pyrxp.html; ElementTree is at http://effbot.org/zone/element-index.htm; XIST is at http://www.livinglogic.de/Python/xist/.
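
The stack-based bookkeeping that the Discussion describes can be tried without a file on disk. What follows is a minimal sketch, not the recipe's own code: it assumes Python 3, feeds expat an in-memory string, and records each node as a plain dictionary. The helper name parse_to_dicts and the sample document are hypothetical, chosen only for illustration.

```python
from xml.parsers import expat

def parse_to_dicts(xml_text):
    """Build a tree of plain dicts from an XML string (illustrative helper)."""
    stack = []     # currently open elements, innermost last
    result = {}    # receives the root node under the key 'root'

    def start(name, attributes):
        node = {'name': name, 'attributes': attributes,
                'cdata': '', 'children': []}
        if stack:
            # The top of the stack is the parent of the new node
            stack[-1]['children'].append(node)
        else:
            # First element seen: this is the document root
            result['root'] = node
        stack.append(node)

    def end(name):
        # The innermost open element is finished
        stack.pop()

    def chars(data):
        # Skip whitespace-only chunks; append the rest to the open element
        if data.strip():
            stack[-1]['cdata'] += data

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)   # True marks this as the final chunk
    return result['root']

# Hypothetical sample document
SAMPLE = ('<menu id="m1"><item price="3">tea</item>'
          '<item price="4">coffee</item></menu>')
tree = parse_to_dicts(SAMPLE)
item_names = [child['cdata'] for child in tree['children']]
```

Because the stack holds only the chain of currently open elements, its depth never exceeds the nesting depth of the document, however many nodes the finished tree contains.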
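
For comparison, ElementTree, one of the projects the Discussion points to, has shipped with the Python Standard Library as xml.etree.ElementTree since Python 2.5. The short sketch below shows the same kind of tree access with that module; the sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document
SAMPLE = ('<menu id="m1"><item price="3">tea</item>'
          '<item price="4">coffee</item></menu>')

root = ET.fromstring(SAMPLE)

tag_name = root.tag                          # the element's tagname
menu_id = root.get('id')                     # attribute lookup, None if absent
items = root.findall('item')                 # direct children named 'item'
item_text = [item.text for item in items]    # text at the start of each child
prices = [int(item.get('price')) for item in items]
```

Roughly, root.tag plays the role of Element.name in this recipe, root.get corresponds to getAttribute, and findall to getElements with a name. The match for getData is looser: an ElementTree element's text attribute holds only the text that precedes its first child, whereas the recipe's cdata accumulates all non-whitespace character data directly inside the element.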