Recipe 12.2. Counting Tags in a Document

Credit: Paul Prescod

Problem

You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.

Solution

You can subclass SAX's ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:

import xml.sax
from xml.sax.handler import ContentHandler

class countHandler(ContentHandler):
    def __init__(self):
        # Maps each tag name to the number of times it has been seen
        self.tags = {}
    def startElement(self, name, attr):
        # The parser calls this method at the start of every element
        self.tags[name] = 1 + self.tags.get(name, 0)

parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
# Emit the counts in alphabetical order of tag name
for tag in sorted(handler.tags):
    print(tag, handler.tags[tag])

Discussion

When I start working with a new XML content set, I like to get a sense of which elements are in it and how often they occur. For this purpose, I use several small variants of this recipe. I could also collect attributes just as easily, as you can see, since attributes are also passed to the startElement method that I'm overriding. If you add a stack, you can also keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
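The two variants just mentioned (counting attributes, and using a stack to track which elements occur inside which) might look roughly like the following sketch. It assumes Python 3; the class name NestingCountHandler, the "(root)" marker for the top-level element, and the in-memory SAMPLE document are all made up for illustration and are not part of the original recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NestingCountHandler(ContentHandler):
    """Hypothetical variant: counts attribute names and parent/child tag pairs."""

    def __init__(self):
        super().__init__()
        self.attributes = {}  # attribute name -> number of occurrences
        self.nesting = {}     # (parent tag, child tag) -> number of occurrences
        self.stack = []       # tag names of the elements currently open

    def startElement(self, name, attrs):
        # Attributes arrive together with the tag name, so counting them is cheap
        for attr_name in attrs.getNames():
            self.attributes[attr_name] = 1 + self.attributes.get(attr_name, 0)
        # The top of the stack is the enclosing element, if there is one
        parent = self.stack[-1] if self.stack else "(root)"
        key = (parent, name)
        self.nesting[key] = 1 + self.nesting.get(key, 0)
        self.stack.append(name)

    def endElement(self, name):
        # Leaving an element: its parent becomes the innermost open element again
        self.stack.pop()


# Illustrative document only; a real run would parse a file instead
SAMPLE = b'<doc id="1"><item id="2" kind="x"/><item id="3"><note/></item></doc>'

handler = NestingCountHandler()
xml.sax.parseString(SAMPLE, handler)
for attr_name in sorted(handler.attributes):
    print(attr_name, handler.attributes[attr_name])
for parent, child in sorted(handler.nesting):
    print(parent, ">", child, handler.nesting[(parent, child)])
```

Keying the nesting counts on (parent, child) pairs keeps memory proportional to the number of distinct pairs rather than to the size of the document, which preserves the main advantage of SAX.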

This recipe also works well as a simple example of a SAX application, usable as the basis for any SAX application. Alternatives to SAX include pulldom and minidom. For any simple processing (including this example), these alternatives would be overkill, particularly if the document you are processing is very large. DOM approaches are generally justified only when you need to perform complicated editing and alteration on an XML document, when the document itself is made complicated by references that go back and forth inside it, or when you need to correlate (i.e., compare) multiple documents.
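For comparison, here is a minimal sketch of the same count done with minidom, assuming Python 3 and a small in-memory document (SAMPLE and dom_counts are names invented for this illustration). It yields the same kind of result, but only after the whole tree has been built in memory, which is the overhead the SAX approach avoids.

```python
from xml.dom import minidom

# Illustrative document only; not part of the original recipe
SAMPLE = "<doc><a/><a/><b><a/></b></doc>"

# minidom builds the complete tree before any counting can start
document = minidom.parseString(SAMPLE)
dom_counts = {}
# "*" matches every element in the document, including the root
for element in document.getElementsByTagName("*"):
    dom_counts[element.tagName] = 1 + dom_counts.get(element.tagName, 0)
# Release the tree explicitly once the counts have been collected
document.unlink()

for tag in sorted(dom_counts):
    print(tag, dom_counts[tag])
```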

ContentHandler subclasses offer many other options, and the online Python documentation does a pretty good job of explaining them. This recipe's countHandler class overrides ContentHandler's startElement method, which the parser calls at the start of each element, passing as arguments the element's tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).

See Also

Recipe 12.3 for other uses of SAX.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420