12.7 Parsing an XML File with xml.parsers.expat


Credit: Mark Nenadov

12.7.1 Problem

The expat parser is normally used through the SAX interface, but sometimes you may want to use expat directly to extract the best possible performance.

12.7.2 Solution

Python is very explicit about the lower-level mechanisms that its higher-level modules' packages use. You're normally better off accessing the higher levels, but sometimes, in the last few stages of an optimization quest, or just to gain better understanding of what, exactly, is going on, you may want to access the lower levels directly from your code. For example, here is how you can use Expat directly, rather than through SAX:

import xml.parsers.expat, sys class MyXML:     Parser = ""     # Prepare for parsing     def _ _init_ _(self, xml_filename):         assert xml_filename != ""         self.xml_filename = xml_filename         self.Parser = xml.parsers.expat.ParserCreate(  )         self.Parser.CharacterDataHandler = self.handleCharData         self.Parser.StartElementHandler = self.handleStartElement         self.Parser.EndElementHandler = self.handleEndElement     # Parse the XML file     def parse(self):         try:             xml_file = open(self.xml_filename, "r")         except:             print "ERROR: Can't open XML file %s"%self.xml_filename             raise         else:             try: self.Parser.ParseFile(xml_file)             finally: xml_file.close(  )     # to be overridden by implementation-specific methods     def handleCharData(self, data): pass     def handleStartElement(self, name, attrs): pass     def handleEndElement(self, name): pass

12.7.3 Discussion

This recipe presents a reusable way to use xml.parsers.expat directly to parse an XML file. SAX is more standardized and rich in functionality, but expat is also usable, and sometimes it can be even lighter than the already lightweight SAX approach. To reuse the MyXML class, all you need to do is define a new class, inheriting from MyXML. Inside your new class, override the inherited XML handler methods, and you're ready to go.

Specifically, the MyXML class creates a parser object that does callbacks to the callables that are its attributes. The StartElementHandler callable is called at the start of each element, with the tag name and the attributes as arguments. EndElementHandler is called at the end of each element, with the tag name as the only argument. Finally, CharacterDataHandler is called for each text string the parser encounters, with the string as the only argument. The MyXML class uses the handleStartElement, handleEndElement, and handleCharData methods as such callbacks. Therefore, these are the methods you should override when you subclass MyXML to perform whatever application-specific processing you require.

12.7.4 See Also

Recipe 12.2, Recipe 12.3, Recipe 12.4, and Recipe 12.6 for uses of the higher-level SAX API; while Expat was the brainchild of James Clark, Expat 2.0 is a group project, with a home page at http://expat.sourceforge.net/.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2005
Pages: 346

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net