Recipe 12.10. Merging Continuous Text Events with a SAX Filter

Credit: Uche Ogbuji, James Kew, Peter Cogolo

Problem

A SAX parser can report contiguous text using multiple characters events (meaning, in practice, multiple calls to the characters method), and this multiplicity of events for a single text string can cause problems for SAX handlers. You want to insert a filter into the SAX handler chain to ensure that each text node in the document is reported as a single SAX characters event (meaning, in practice, that the characters method is called just once per text node).

Solution

Module xml.sax.saxutils in the standard Python library includes a class XMLFilterBase that we can subclass to implement any XML filter we may need:

from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """ SAX filter to ensure that contiguous text nodes are merged into one
    """
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []
    def _complete_text_node(self):
        # flush any accumulated text downstream as a single characters event
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []
    def characters(self, text):
        self._accumulator.append(text)
    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(self._downstream, method_name)(*a, **k)
    # 2.4 only: method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

for n in '''startElement startElementNS endElement endElementNS
            processingInstruction comment'''.split():
    _wrap_complete(n)

if __name__ == "__main__":
    import sys
    from xml import sax
    from xml.sax.saxutils import XMLGenerator
    parser = sax.make_parser()
    # XMLGenerator is a special predefined SAX handler that merely writes
    # SAX events back into an XML document
    downstream_handler = XMLGenerator()
    # upstream, the parser; downstream, the next handler in the chain
    filter_handler = text_normalize_filter(parser, downstream_handler)
    # The SAX filter base is designed so that the filter takes on much of the
    # interface of the parser itself, including the "parse" method
    filter_handler.parse(sys.argv[1])

Discussion

A SAX parser can report contiguous text using multiple characters events (meaning, in practice, multiple calls to the characters method of the downstream handler). In other words, given an XML document whose content is 'abc', the text could technically be reported as up to three characters events: one for the 'a' character, one for the 'b', and a third for the 'c'. Such an extreme case of "fragmentation" of a text string into multiple events is unlikely in real life, but it is not impossible.

A typical reason that might cause a parser to report text nodes a bit at a time would be buffering of the XML input source. Most low-level parsers use a buffer of a certain number of characters that are read and parsed at a time. If a text node straddles such a buffer boundary, many parsers will just wrap up the current text event and start a new one to send characters from the next buffer. If you don't account for this behavior in your SAX handlers, you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when you may want to run your code in a situation where a different parser is selected. You'd need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
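To make the "cumbersome" alternative concrete, here is a minimal sketch of a handler that copes with fragmented text on its own, without any filter. The class name TitleCollector and the title element are hypothetical, chosen only for illustration. The buffering state has to be threaded through three separate callbacks, which is the sort of bookkeeping a filter can take off your hands:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TitleCollector(ContentHandler):
    """Hypothetical handler gathering the text of every <title> element."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.titles = []
        self._buffer = None          # None means "not inside a <title>"

    def startElement(self, name, attrs):
        if name == 'title':
            self._buffer = []        # start collecting text pieces

    def characters(self, content):
        # May be called several times for one text node, so only append here
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'title':
            # Only now is the text node known to be complete
            self.titles.append(''.join(self._buffer))
            self._buffer = None

collector = TitleCollector()
xml.sax.parseString(b'<doc><title>abc</title><title>d&amp;e</title></doc>',
                    collector)
print(collector.titles)
```

The handler gives the same result however the parser happens to split the text, but every handler that cares about whole text nodes would need the same three-part dance.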

The class text_normalize_filter presented in this recipe ensures that all text events are reported to downstream SAX handlers in the contiguous manner that most developers would expect. In this recipe's example case, the filter would consolidate the three characters events into a single one for the entire text node 'abc'.
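The 'abc' case can be checked without depending on how any particular parser fragments its text, by driving a filter by hand. The sketch below uses a cut-down stand-in for text_normalize_filter (named MergingFilter here, and wrapping only the element events) plus a hypothetical RecordingHandler that logs what reaches the downstream side; both names are illustrative assumptions, not part of the recipe:

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class RecordingHandler(ContentHandler):
    """Hypothetical downstream handler that logs each event it receives."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(('start', name))
    def endElement(self, name):
        self.events.append(('end', name))
    def characters(self, content):
        self.events.append(('chars', content))

class MergingFilter(XMLFilterBase):
    """Cut-down stand-in for text_normalize_filter (element events only)."""
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []
    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []
    def characters(self, text):
        self._accumulator.append(text)
    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)
    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

recorder = RecordingHandler()
merger = MergingFilter(None, recorder)   # no real parser: events sent by hand
merger.startElement('doc', {})
for piece in 'abc':                      # worst case: one event per character
    merger.characters(piece)
merger.endElement('doc')
print(recorder.events)
```

The downstream handler sees a single characters event carrying 'abc', sandwiched between the start and end events for the enclosing element.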

For more information on SAX filters in general, see my article "Tip: SAX filters for flexible processing," http://www-106.ibm.com/developerworks/xml/library/x-tipsaxflex.html.


Python's XMLGenerator does not do anything with processing instructions, so, if you run the main code presented in this recipe on an XML document that uses them, you'll have a gap in the output, along with other minor deviations between input and output. Comments are similar but worse, because XMLFilterBase does not even filter them; if you do need to get comments, your text_normalize_filter class must multiply inherit from xml.sax.saxlib.LexicalHandler, as well as from xml.sax.saxutils.XMLFilterBase, and it must override the parse method as follows:

    # needs: from xml.sax.handler import property_lexical_handler
    def parse(self, source):
        # force connection of self as the lexical handler
        self._parent.setProperty(property_lexical_handler, self)
        # Delegate to XMLFilterBase for the rest
        XMLFilterBase.parse(self, source)

This code is hairy enough, using the "internal" attribute self._parent, and the need to deal properly with XML comments is rare enough, to make this addition somewhat doubtful, which is why it is not part of this recipe's Solution.
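For readers who do need comments, the following is a rough, self-contained sketch of the parse override in action. It rests on several assumptions: the class name CommentCollectingFilter is hypothetical, the filter merely records comments rather than forwarding them, and the lexical-handler methods are supplied by duck typing (comment, startCDATA, endCDATA, startDTD, endDTD) instead of by inheriting from a LexicalHandler class, on the expectation that the standard expat-based parser only looks up those method names:

```python
import io
import xml.sax
from xml.sax.handler import property_lexical_handler
from xml.sax.saxutils import XMLFilterBase

class CommentCollectingFilter(XMLFilterBase):
    """Hypothetical filter that records comments reported by the parser."""
    def __init__(self, parent):
        XMLFilterBase.__init__(self, parent)
        self.comments = []

    # Lexical-handler callbacks; only comment() does real work here
    def comment(self, content):
        self.comments.append(content)
    def startCDATA(self):
        pass
    def endCDATA(self):
        pass
    def startDTD(self, name, public_id, system_id):
        pass
    def endDTD(self):
        pass

    def parse(self, source):
        # force connection of self as the lexical handler
        self._parent.setProperty(property_lexical_handler, self)
        # Delegate to XMLFilterBase for the rest
        XMLFilterBase.parse(self, source)

comment_filter = CommentCollectingFilter(xml.sax.make_parser())
comment_filter.parse(io.BytesIO(b'<doc><!-- first -->text<!--second--></doc>'))
print(comment_filter.comments)
```

In the real text_normalize_filter, the comment method would also have to call _complete_text_node before passing the comment along, since a comment splits the surrounding text into two separate text nodes.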

If you need ease of chaining to other filters, you may prefer not to take both upstream and downstream parameters in __init__. In this case, keep the same signature as XMLFilterBase.__init__:

    def __init__(self, parent):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []

and change the _wrap_complete factory function so that the wrapper, rather than calling methods on the downstream handler directly, delegates to the default implementations in XMLFilterBase, which in turn call out to handlers that have been set on the filter with such methods as setContentHandler and the like:

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(XMLFilterBase, method_name)(self, *a, **k)
    # 2.4 only: method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

This is slightly less convenient for the typical simple case, but it pays back this inconvenience by letting you easily chain filters:

parser = sax.make_parser()
filtered_parser = text_normalize_filter(some_other_filter(parser))

as well as letting you use a filter in contexts that call the parse method on your behalf:

doc = xml.dom.minidom.parse(input_file, parser=filtered_parser)
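Putting the chainable pieces together, a complete version might look like the sketch below. Two details are assumptions rather than something the recipe spells out: _complete_text_node is changed to flush through XMLFilterBase.characters (the chainable variant has no self._downstream to call), and comment is left out of the wrapped method list because XMLFilterBase has no default comment method to delegate to. A plain XMLFilterBase stands in for some_other_filter, and TextRecorder is a hypothetical handler used only to observe the result:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """Chainable variant: same constructor signature as XMLFilterBase."""
    def __init__(self, parent):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []
    def _complete_text_node(self):
        if self._accumulator:
            # Assumption: flush via the base class, which forwards to
            # whatever handler was installed with setContentHandler
            XMLFilterBase.characters(self, ''.join(self._accumulator))
            self._accumulator = []
    def characters(self, text):
        self._accumulator.append(text)
    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(XMLFilterBase, method_name)(self, *a, **k)
    method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

for n in '''startElement startElementNS endElement endElementNS
            processingInstruction'''.split():
    _wrap_complete(n)

class TextRecorder(ContentHandler):
    """Hypothetical handler that keeps each characters event it receives."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

recorder = TextRecorder()
# A plain XMLFilterBase plays the role of some_other_filter in the chain
filtered_parser = text_normalize_filter(XMLFilterBase(xml.sax.make_parser()))
filtered_parser.setContentHandler(recorder)
filtered_parser.parse(io.BytesIO(b'<doc>hello &amp; goodbye</doc>'))
print(recorder.chunks)
```

Text containing an entity reference is a handy test input, because parsers commonly report the expanded entity as a separate characters event; with the filter in the chain, the recorder should receive the whole string in one piece.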

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420