Extracting Text from XML Documents


from xml.parsers import expat

#Define a class that will store the character data
class xmlText(object):
    def __init__(self):
        self.textBuff = ""
    def CharacterData(self, data):
        data = data.strip()
        if data:
            data = data.encode('ascii')
            self.textBuff += data + "\n"
    def Parse(self, fName):
        xmlParser = expat.ParserCreate()
        xmlParser.CharacterDataHandler = self.CharacterData
        xmlParser.Parse(open(fName).read(), 1)

xText = xmlText()
xText.Parse(xmlFile)
print xText.textBuff

A common task when parsing XML documents is to quickly retrieve the text from them without the markup tags and attribute data. The expat parser provided with Python offers a simple interface for just that. To use the expat parser to quickly parse through an XML document and store only the text, define a simple text parser class that derives from the basic object class.

After the text parser class is defined, add a CharacterData() method that will be assigned to the CharacterDataHandler attribute of the expat parser. The parser calls this method with each piece of text it encounters while parsing the document, and the method stores that text.
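The handler is an ordinary method that receives one string per call, so its filtering logic can be tried without any parser at all. The following is a minimal sketch, assuming Python 3; the names TextCollector, text_buff, and character_data are hypothetical stand-ins for the class described here, and the sample chunks are invented for illustration.

```python
class TextCollector(object):
    """Hypothetical stand-in for the text parser class described above."""

    def __init__(self):
        self.text_buff = ""

    def character_data(self, data):
        # Skip chunks that are only whitespace (indentation and newlines
        # between tags); store everything else, one chunk per line.
        data = data.strip()
        if data:
            self.text_buff += data + "\n"


collector = TextCollector()
# Call the handler by hand, the way a parser would for each piece of text.
for chunk in ["\n    ", "Update List", "\n", "  Please add me to the list.  "]:
    collector.character_data(chunk)

print(collector.text_buff)
```

Stripping each chunk is what keeps the layout whitespace of the XML file out of the stored text.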

After you have defined the handler method of the text parser object, define a parse routine that creates the expat parser by calling the ParserCreate() function of the expat module. The ParserCreate() function returns an expat parser object.
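One likely reason to call ParserCreate() inside the parse routine, rather than once for the whole program, is that an expat parser object handles a single document: after the final chunk has been parsed, it rejects further input. The sketch below illustrates this under the assumption of Python 3; the two tiny documents are invented for the example.

```python
from xml.parsers import expat

first = expat.ParserCreate()
first.Parse("<note>hello</note>", True)  # True marks the final chunk

# Feeding a second document to the same, finished parser is rejected.
try:
    first.Parse("<note>again</note>", True)
    reused = True
except expat.ExpatError:
    reused = False

# A newly created parser object handles the second document.
second = expat.ParserCreate()
second.Parse("<note>again</note>", True)

print("finished parser accepted more input:", reused)
```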

After the expat parser object is created in the text parser object's parse routine, override the CharacterDataHandler attribute of the parser object by assigning it the CharacterData() method of your text parser object.
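Assigning the handler is a plain attribute assignment; a bound method works because it is a callable that takes the text as its one argument. Here is a minimal sketch, assuming Python 3, with a hypothetical Collector class and an invented one-line document.

```python
from xml.parsers import expat


class Collector(object):
    """Hypothetical holder for the text pieces the parser reports."""

    def __init__(self):
        self.pieces = []

    def character_data(self, data):
        self.pieces.append(data)


collector = Collector()
parser = expat.ParserCreate()
# Hand the bound method to the parser; expat calls it for character data.
parser.CharacterDataHandler = collector.character_data

parser.Parse("<msg><to>a@example.com</to><body>Hi there</body></msg>", True)

joined = "".join(collector.pieces)
print(joined)
```

Expat may report one run of text in more than one call, so the sketch joins the pieces instead of assuming one call per element.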

After you have overridden the handler attribute of the expat parser object, the parse routine needs to invoke the Parse(buffer [, isFinal]) method of the expat parser object. The Parse method accepts a string buffer and parses it, calling the handler methods you assigned. Set isFinal to true when the buffer is the last (or only) piece of the document.
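The isFinal argument matters when a document is fed to the parser in several pieces, for example when reading a large file in blocks. The following sketch, assuming Python 3 and an invented sample document, splits the input at an arbitrary position and passes a false isFinal for every call but the last.

```python
from xml.parsers import expat

document = "<list><item>first</item><item>second</item></list>"
pieces = []

parser = expat.ParserCreate()
parser.CharacterDataHandler = pieces.append

# Feed the document in two halves, split at an arbitrary position.
middle = len(document) // 2
parser.Parse(document[:middle], False)  # more input will follow
parser.Parse(document[middle:], True)   # last chunk: finish the parse

text = "".join(pieces)
print(text)
```

The listing in this section reads the whole file into one buffer, which is why it passes 1 (true) on its single Parse call.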

After you have defined the text parser class, create an instance of the class and use the Parse(fName) method you defined to parse the XML file and retrieve the text.

from xml.parsers import expat

xmlFile = "emails.xml"

#Define a class that will store the character data
class xmlText(object):
    def __init__(self):
        self.textBuff = ""
    def CharacterData(self, data):
        data = data.strip()
        if data:
            data = data.encode('ascii')
            self.textBuff += data + "\n"
    def Parse(self, fName):
        #Create the expat parser object
        xmlParser = expat.ParserCreate()
        #Override the handler methods
        xmlParser.CharacterDataHandler = \
            self.CharacterData
        #Parse the XML file
        xmlParser.Parse(open(fName).read(), 1)

#Create the text parser object
xText = xmlText()

#Invoke the text parser object's Parse method
xText.Parse(xmlFile)

#Display parsed results
print "Text from %s\n====================" % xmlFile
print xText.textBuff


xml_text.py

Text from emails.xml
==========================
bwdayley@novell.com
bwdayley@sfcn.org
ddayley@sfcn.org
Update List
Please add me to the list.
bwdayley@novell.com
bwdayley@sfcn.org
cdayley@sfcn.org
More Updated List
Please add me to the list also.


Output from xml_text.py code.
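The xml_text.py listing uses Python 2 syntax (the print statement, and encode() to produce a byte string). As a rough sketch of how the same approach might look under Python 3, the version below drops the encode() step and reads the file as bytes. The class and method names are restyled, and the sample document is hypothetical; it is not the emails.xml file used for the output above.

```python
import os
import tempfile
from xml.parsers import expat


class XmlText(object):
    """Collects the character data of an XML file, one chunk per line."""

    def __init__(self):
        self.text_buff = ""

    def character_data(self, data):
        data = data.strip()
        if data:
            # Python 3 strings are already Unicode, so no encode() step.
            self.text_buff += data + "\n"

    def parse(self, file_name):
        parser = expat.ParserCreate()
        parser.CharacterDataHandler = self.character_data
        # Read bytes so expat can apply the document's own encoding.
        with open(file_name, "rb") as xml_file:
            parser.Parse(xml_file.read(), True)


# Hypothetical sample input, written to a temporary file for the demo.
sample = (b"<emails><email><to>a@example.com</to>"
          b"<subject>Hello</subject></email></emails>")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(sample)

x_text = XmlText()
x_text.parse(tmp.name)
os.remove(tmp.name)

print(x_text.text_buff)
```

Dropping encode('ascii') also means text containing non-ASCII characters is stored as is rather than raising an encoding error.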




Python Phrasebook: Essential Code and Commands
ISBN: 0672329107
Pages: 138
Authors: Brad Dayley
