Another extremely useful Python module for XML processing is xml.parsers.expat, which provides an interface to the Expat nonvalidating XML parser. Expat is a fast parser that processes character data and markup by calling handler functions as it reads the document.

To use the expat parser to search through an XML document for specific data, define a search class that derives from the base object class. In that class, add StartElement, EndElement, and CharacterData methods that will later be installed as the handlers of the expat parser.

After you have defined the handler methods, define a Parse method that creates the parser by calling the ParserCreate() function of the expat module, which returns an expat parser object. Then set the StartElementHandler, EndElementHandler, and CharacterDataHandler attributes of the parser object to the corresponding methods of your search object.

With the handlers in place, the Parse method of the search object invokes the Parse(buffer [, isFinal]) method of the expat parser object, which accepts a string buffer and parses it, calling the handlers as it goes.

Note: The isFinal argument is set to 1 if this is the last data to be parsed or 0 if there is more data to be parsed.

After you have defined the search class, create an instance of the class and call its Parse method to parse the XML file and search for data.
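The isFinal argument matters most when a document is fed to the parser in pieces rather than as one string. The following is a minimal sketch in Python 3 syntax, not part of the original listing: the count_elements function, the chunk size, and the sample data are all assumptions made for illustration. It passes a false isFinal for every chunk and then signals the end of input with one final empty call.

```python
import io
from xml.parsers import expat


def count_elements(stream, chunk_size=16):
    """Feed a binary XML stream to expat in small chunks and count start tags.

    count_elements and chunk_size are hypothetical names for this sketch.
    """
    counts = {}

    def start_element(name, attributes):
        # Called by expat once for every opening tag it encounters.
        counts[name] = counts.get(name, 0) + 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # More data may follow, so isFinal is false here.
        parser.Parse(chunk, False)

    # An empty buffer with isFinal set to true tells expat the input is complete.
    parser.Parse(b"", True)
    return counts


# Hypothetical sample document, held in memory instead of a file.
sample = io.BytesIO(
    b"<contacts><email>a@example.com</email>"
    b"<email>b@example.com</email></contacts>"
)
counts = count_elements(sample)
print(counts)
```

Because the parser keeps its state between calls, a chunk boundary may fall in the middle of a tag without affecting the result.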
from xml.parsers import expat

searchStringList = ["dayley@sfcn.org", "also"]
searchElement = "email"
xmlFile = "emails.xml"

#Define a search class that will handle
#elements and search character data
class xmlSearch(object):
    def __init__ (self, cStr, nodeName):
        self.nodeName = nodeName
        self.curNode = 0
        self.nodeActive = 0
        self.hits = []
        self.cStr = cStr
    def StartElement(self, name, attributes):
        if name == self.nodeName:
            self.nodeActive = 1
            self.curNode += 1
    def EndElement(self, name):
        if name == self.nodeName:
            self.nodeActive = 0
    def CharacterData(self, data):
        if data.strip():
            data = data.encode('ascii')
            if self.nodeActive:
                if data.find(self.cStr) != -1:
                    if not self.hits.count(self.curNode):
                        self.hits.append(self.curNode)
                        print "\tFound %s..." % self.cStr
    def Parse(self, fName):
#Create the expat parser object
        xmlParser = expat.ParserCreate()
#Override the handler methods
        xmlParser.StartElementHandler = self.StartElement
        xmlParser.EndElementHandler = self.EndElement
        xmlParser.CharacterDataHandler = self.CharacterData
#Parse the XML file
        xmlParser.Parse(open(fName).read(), 1)

for searchString in searchStringList:
#Create search class
    search = xmlSearch(searchString, searchElement)
#Invoke the search objects Parse method
    print "\nSearching <%s> nodes . . ." % searchElement
    search.Parse(xmlFile)
#Display parsed results
    print "Found '%s' in the following nodes:" % searchString
    print search.hits

xml_search.py

Searching <email> nodes . . .
    Found dayley@sfcn.org...
    Found dayley@sfcn.org...
Found 'dayley@sfcn.org' in the following nodes:
[1, 2]

Searching <email> nodes . . .
    Found also...
Found 'also' in the following nodes:
[2]

Output from xml_search.py code.
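The listing above uses Python 2 print statements and searches each piece of character data as it arrives. Expat may deliver the text of a single element in more than one CharacterDataHandler call, so a search string that spans two calls could be missed. The sketch below is one possible Python 3 variant, not the original author's code: it collects the text of each target element and searches it once the element closes. The class name XmlSearch, its method names, and the sample document are assumptions made for illustration, and the sketch assumes target elements are not nested inside one another.

```python
from xml.parsers import expat


class XmlSearch:
    """Record the 1-based index of each target element whose text contains a string."""

    def __init__(self, search_string, node_name):
        self.search_string = search_string
        self.node_name = node_name
        self.cur_node = 0
        self.text_parts = None  # None while outside a target element
        self.hits = []

    def start_element(self, name, attributes):
        if name == self.node_name:
            self.cur_node += 1
            self.text_parts = []  # start collecting text for this element

    def character_data(self, data):
        # Text may arrive in several calls, so only collect it here.
        if self.text_parts is not None:
            self.text_parts.append(data)

    def end_element(self, name):
        if name == self.node_name and self.text_parts is not None:
            # Search the complete text once the element is closed.
            if self.search_string in "".join(self.text_parts):
                self.hits.append(self.cur_node)
            self.text_parts = None

    def parse(self, xml_text):
        parser = expat.ParserCreate()
        parser.StartElementHandler = self.start_element
        parser.EndElementHandler = self.end_element
        parser.CharacterDataHandler = self.character_data
        parser.Parse(xml_text, True)  # the whole document is passed at once
        return self.hits


# Hypothetical sample data; the real emails.xml is not shown in the text.
SAMPLE = (
    "<emails>"
    "<email><from>alice@example.com</from><body>Hello</body></email>"
    "<email><from>bob@example.com</from>"
    "<body>See also alice@example.com</body></email>"
    "</emails>"
)

hits_address = XmlSearch("alice@example.com", "email").parse(SAMPLE)
hits_word = XmlSearch("also", "email").parse(SAMPLE)
print(hits_address)  # [1, 2]
print(hits_word)     # [2]
```

Searching at the end of the element trades a little memory for correctness. Setting the parser's buffer_text attribute to true is another way to reduce how often text is split across handler calls.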