Searching XML Documents


from xml.parsers import expat

class xmlSearch(object):
    def __init__(self, cStr, nodeName):
        self.nodeName = nodeName
        self.curNode = 0
        self.nodeActive = 0
        self.hits = []
        self.cStr = cStr

    def StartElement(self, name, attributes):
        pass  #body omitted here; see xml_search.py

    def EndElement(self, name):
        pass  #body omitted here; see xml_search.py

    def CharacterData(self, data):
        pass  #body omitted here; see xml_search.py

    def Parse(self, fName):
        xmlParser = expat.ParserCreate()
        xmlParser.StartElementHandler = self.StartElement
        xmlParser.EndElementHandler = self.EndElement
        xmlParser.CharacterDataHandler = self.CharacterData
        xmlParser.Parse(open(fName).read(), 1)

search = xmlSearch(searchString, searchElement)
search.Parse(xmlFile)
print(search.hits)

Another useful Python module for XML processing is xml.parsers.expat, which provides an interface to the Expat nonvalidating XML parser. Expat is a fast, event-driven parser: as it reads an XML document, it calls handler functions that you supply to process the markup and character data.
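As a quick illustration of what "nonvalidating" means in practice, the following Python 3 sketch shows that expat still rejects XML that is not well formed, but does not check a document against its DTD. The helper name is_well_formed and the sample documents are made up for this example and are not part of the expat module.

```python
from xml.parsers import expat

def is_well_formed(xml_text):
    """Hypothetical helper: True if expat parses xml_text without a syntax error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except expat.ExpatError:
        return False
    return True

# Mismatched tags are a well-formedness error, so expat rejects the document.
print(is_well_formed("<a><b></a>"))

# Here the DTD declares <a> as EMPTY, yet <a> contains <b/>. That is a
# validity error, which a nonvalidating parser does not check.
print(is_well_formed("<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>"))
```

If you need DTD or schema validation, you would have to use a different parser; expat only reports syntax problems.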

To use the expat parser to search an XML document for specific data, define a search class that derives from the basic object class.

After the search class is defined, add StartElement, EndElement, and CharacterData methods to it. These methods will later be assigned to the corresponding handlers of the expat parser.

After you have defined the handler methods of the search class, define a Parse method that creates the expat parser by calling the ParserCreate() function of the expat module. The ParserCreate() function returns an expat parser object.

After the expat parser object is created in the search object's Parse method, set the StartElementHandler, EndElementHandler, and CharacterDataHandler attributes of the parser object by assigning the corresponding methods of your search object to them.
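To see what the three handlers receive once they are assigned, the following Python 3 sketch records every callback that expat makes while parsing a small in-memory document. The EventLog class and the sample XML are hypothetical and exist only for this illustration; the handler attribute names are the ones described above.

```python
from xml.parsers import expat

class EventLog(object):
    """Hypothetical helper that records each expat callback in order."""
    def __init__(self):
        self.events = []

    def StartElement(self, name, attributes):
        # attributes arrives as a dictionary of attribute names to values
        self.events.append(("start", name, attributes))

    def EndElement(self, name):
        self.events.append(("end", name))

    def CharacterData(self, data):
        # Text may arrive in one call or in several pieces
        self.events.append(("text", data))

    def Parse(self, xmlText):
        xmlParser = expat.ParserCreate()
        # Assign the bound methods to the parser's handler attributes
        xmlParser.StartElementHandler = self.StartElement
        xmlParser.EndElementHandler = self.EndElement
        xmlParser.CharacterDataHandler = self.CharacterData
        xmlParser.Parse(xmlText, True)

log = EventLog()
log.Parse('<email id="1">a@example.org</email>')
for event in log.events:
    print(event)
```

The start event comes first and the end event last, with one or more text events in between.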

After you have set the handler attributes of the expat parser object, the Parse method needs to call the Parse(buffer [, isFinal]) method of the parser object. This method accepts a string buffer and parses it, calling your handler methods as it encounters elements and character data.

Note

The isFinal argument is set to 1 if this is the last data to be parsed or 0 if there is more data to be parsed.
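The following Python 3 sketch shows one way the isFinal argument might be used when a document arrives in pieces. The function name parse_in_chunks and the sample chunks are assumptions made for this example. Each piece is passed with isFinal set to false, and an empty final call tells the parser that no more data will follow.

```python
from xml.parsers import expat

def parse_in_chunks(chunks):
    """Hypothetical helper: feed string chunks to one parser, return element names."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attributes: names.append(name)
    for chunk in chunks:
        parser.Parse(chunk, False)  # more data may follow
    parser.Parse("", True)          # no more data: finish the document
    return names

# A tag may be split across chunks; the parser keeps its state between calls.
print(parse_in_chunks(["<emails><em", "ail>a@example.org</email>", "</emails>"]))
```

The final call is also where an incomplete document is detected: if the root element was never closed, that call raises expat.ExpatError.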


After you have defined the search class, create an instance of the class and call the Parse method you defined to parse the XML file and search for data.

from xml.parsers import expat

searchStringList = ["dayley@sfcn.org", "also"]
searchElement = "email"
xmlFile = "emails.xml"

#Define a search class that will handle
#elements and search character data
class xmlSearch(object):
    def __init__(self, cStr, nodeName):
        self.nodeName = nodeName
        self.curNode = 0
        self.nodeActive = 0
        self.hits = []
        self.cStr = cStr

    def StartElement(self, name, attributes):
        #Entering a target element: mark it active and count it
        if name == self.nodeName:
            self.nodeActive = 1
            self.curNode += 1

    def EndElement(self, name):
        #Leaving a target element
        if name == self.nodeName:
            self.nodeActive = 0

    def CharacterData(self, data):
        #Skip whitespace-only data
        if data.strip():
            if self.nodeActive:
                if data.find(self.cStr) != -1:
                    #Record each matching node only once
                    if not self.hits.count(self.curNode):
                        self.hits.append(self.curNode)
                        print("\tFound %s..." % self.cStr)

    def Parse(self, fName):
        #Create the expat parser object
        xmlParser = expat.ParserCreate()
        #Override the handler methods
        xmlParser.StartElementHandler = self.StartElement
        xmlParser.EndElementHandler = self.EndElement
        xmlParser.CharacterDataHandler = self.CharacterData
        #Parse the XML file
        xmlParser.Parse(open(fName).read(), 1)

for searchString in searchStringList:
    #Create search class
    search = xmlSearch(searchString, searchElement)
    #Invoke the search object's Parse method
    print("\nSearching <%s> nodes . . ." % searchElement)
    search.Parse(xmlFile)
    #Display parsed results
    print("Found '%s' in the following nodes:" % searchString)
    print(search.hits)


xml_search.py

Searching <email> nodes . . .
    Found dayley@sfcn.org...
    Found dayley@sfcn.org...
Found 'dayley@sfcn.org' in the following nodes:
[1, 2]

Searching <email> nodes . . .
    Found also...
Found 'also' in the following nodes:
[2]


Output from xml_search.py code.
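One caveat about the CharacterData handler in xml_search.py: expat does not promise to deliver the text of an element in a single call, so a search string that straddles two pieces could be missed. The following Python 3 sketch shows one possible remedy, the parser's buffer_text attribute, which asks the parser to combine adjacent character data before calling the handler. The collect_text function and the sample document are assumptions made for this illustration.

```python
from xml.parsers import expat

def collect_text(xml_text, buffer_text):
    """Hypothetical helper: return the pieces passed to CharacterDataHandler."""
    pieces = []
    parser = expat.ParserCreate()
    # When true, adjacent character data is combined before the handler is called
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = pieces.append
    parser.Parse(xml_text, True)
    return pieces

doc = "<email>brad &amp; co also</email>"

# Without buffering, the text may arrive in several pieces
print(collect_text(doc, False))

# With buffering, the handler receives the text of the element as one string
print(collect_text(doc, True))
```

Another option would be to accumulate the pieces yourself in CharacterData and run the search in EndElement.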



Python Phrasebook(c) Essential Code and Commands
ISBN: 0672329107
Pages: 138
Authors: Brad Dayley
