Recipe 12.3. Extracting Text from an XML Document

Credit: Paul Prescod

Problem

You need to extract only the text from an XML document, not the tags.

Solution

Once again, subclassing SAX's ContentHandler makes this task quite easy:

from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")

Discussion

Sometimes you want to get rid of XML tags: for example, to re-key a document or to spell-check it. This recipe performs this task and works with any well-formed XML document. It is also quite efficient, since SAX streams through the document rather than building a tree in memory.

In this recipe's textHandler class, we override ContentHandler's characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string. We have to encode this Unicode string before we can emit it to standard output. (See Recipe 1.22 for more information about emitting Unicode to standard output.) In this recipe, we're using the Latin-1 (also known as ISO-8859-1) encoding, which covers all western European alphabets and is supported by many popular output devices (e.g., printers and terminal-emulation windows). However, you should use whatever encoding is most appropriate for the documents you're handling, as long, of course, as that encoding is supported by the devices you need to use. The configuration of your devices may depend on your operating system's concepts of locale and code page. Unfortunately, these issues vary too much between operating systems for me to go into further detail.
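As an illustration of the characters callback and the encoding step described above, here is a minimal sketch written for Python 3, which is an assumption on my part since the recipe's own code targets Python 2. The names TextCollector, extract_text, and sample are hypothetical and not part of the recipe. The handler gathers the text into a list instead of writing it out immediately, so the choice of output encoding can be made in one place.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: accumulate every chunk of character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may report one run of text in several calls,
        # so keep the pieces and join them once parsing is done.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Return the text content of a well-formed XML document."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# Illustrative input: the comment is skipped, the accented text is kept.
sample = "<doc><p>caf\u00e9 <b>cr\u00e8me</b></p><!-- note --></doc>".encode("utf-8")
text = extract_text(sample)

# Encode explicitly for a Latin-1 device; characters that Latin-1
# cannot represent are replaced rather than raising an error.
encoded = text.encode("latin-1", errors="replace")
```

In Python 3 the text arrives as an ordinary str, so the explicit encode call matters only when you write bytes yourself. Passing errors="replace" is one possible policy; a stricter program might prefer to let the encoding error surface.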

A simple alternative, if you know that handling Unicode is not going to be a problem, is to use sgmllib. It's not quite as fast, but it is somewhat more robust against XML of dubious well-formedness:

from sgmllib import SGMLParser

class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data

XMLJustText().feed(open('text.xml').read())
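The sgmllib module is not part of the Python 3 standard library. A rough equivalent of the same lenient approach, assuming Python 3 and swapping in html.parser.HTMLParser (my substitution, not the recipe's method), might look like the sketch below. The names LenientTextExtractor, lenient_text, and broken are hypothetical.

```python
from html.parser import HTMLParser


class LenientTextExtractor(HTMLParser):
    """Hypothetical stand-in for the sgmllib-based class: keep only text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag nesting is not checked.
        self.chunks.append(data)


def lenient_text(markup):
    """Return the text from markup that need not be well-formed."""
    parser = LenientTextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return "".join(parser.chunks)


# The <b> element is never closed, which a conforming XML parser rejects.
broken = "<doc><p>Hello <b>world</p></doc>"
result = lenient_text(broken)
```

Because HTMLParser applies HTML conventions (for example, special handling of script and style content), treat it as an approximation for XML input rather than a drop-in replacement.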

An even simpler and rougher way to extract text from an XML document is shown in Recipe 2.26.

See Also

Recipe 12.1 and Recipe 12.2 for other uses of SAX.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420