Recipe 12.9. Filtering Elements and Attributes Belonging to a Given Namespace

Credit: A.M. Kuchling

Problem

While parsing an XML document with SAX, you need to filter out all of the elements and attributes that belong to a particular namespace.

Solution

The SAX filter concept is just what we need here:

    from xml import sax
    from xml.sax import handler, saxutils, xmlreader

    # the namespace we want to remove in our filter
    RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

    class RDFFilter(saxutils.XMLFilterBase):
        def __init__(self, *args):
            saxutils.XMLFilterBase.__init__(self, *args)
            # initially, we're not in RDF, and just one stack level is needed
            self.in_rdf_stack = [False]

        def startElementNS(self, name, qname, attrs):
            # name is a (uri, localname) pair
            uri, localname = name
            if uri == RDF_NS or self.in_rdf_stack[-1]:
                # skip elements with namespace, if that namespace is RDF or
                # the element is nested in an RDF one -- and grow the stack
                self.in_rdf_stack.append(True)
                return
            # Make a dict of attributes that DON'T belong to the RDF namespace
            keep_attrs = {}
            for key, value in attrs.items():
                # use separate names here, so that the element's own uri
                # and localname are not overwritten by an attribute's
                attr_uri, attr_localname = key
                if attr_uri != RDF_NS:
                    keep_attrs[key] = value
            # prepare the cleaned-up bunch of non-RDF-namespace attributes
            attrs = xmlreader.AttributesNSImpl(keep_attrs, attrs.getQNames())
            # grow the stack by replicating the latest entry
            self.in_rdf_stack.append(self.in_rdf_stack[-1])
            # finally delegate the rest of the operation to our base class
            saxutils.XMLFilterBase.startElementNS(self, name, qname, attrs)

        def characters(self, content):
            # skip characters that are inside an RDF-namespaced tag being skipped
            if self.in_rdf_stack[-1]:
                return
            # delegate the rest of the operation to our base class
            saxutils.XMLFilterBase.characters(self, content)

        def endElementNS(self, name, qname):
            # pop the stack -- nothing else to be done, if we were skipping
            if self.in_rdf_stack.pop():
                return
            # delegate the rest of the operation to our base class
            saxutils.XMLFilterBase.endElementNS(self, name, qname)

    def filter_rdf(input, output):
        """ filter_rdf(input=some_input_stream, output=some_output_stream)
            Parses the XML input from the input stream, filtering out all
            elements and attributes that are in the RDF namespace.
        """
        output_gen = saxutils.XMLGenerator(output)
        parser = sax.make_parser()
        filter = RDFFilter(parser)
        filter.setFeature(handler.feature_namespaces, True)
        filter.setContentHandler(output_gen)
        filter.setErrorHandler(handler.ErrorHandler())
        filter.parse(input)

    if __name__ == '__main__':
        import io, sys
        TEST_RDF = '''<?xml version="1.0"?>
    <metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
       <title>  This is non-RDF content </title>
       <rdf:RDF>
         <rdf:Description rdf:about="%s">
           <dc:Creator>%s</dc:Creator>
         </rdf:Description>
       </rdf:RDF>
      <element />
    </metadata>
    '''
        input = io.StringIO(TEST_RDF)
        filter_rdf(input, sys.stdout)

This module, when run as a main script, emits something like:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
       <title>  This is non-RDF content </title>
      <element></element>
    </metadata>

Discussion

My motivation for originally writing this recipe came from processing files of metadata, containing RDF mixed with other elements. I wanted to generate a version of the metadata with the RDF filtered out.

The filter_rdf function does the job, reading XML input from the input stream and writing it to the output stream. The standard XMLGenerator class in xml.sax.saxutils is used to produce the output. Function filter_rdf internally uses a filtering class called RDFFilter, also shown in this recipe's Solution, pushing that filter on top of the XML parser to suppress elements and attributes belonging to the RDF_NS namespace.
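To see the plumbing on its own, here is a minimal sketch that wires a bare XMLFilterBase between the parser and an XMLGenerator. With no methods overridden, the filter simply forwards every event, so the document comes out essentially unchanged. The sketch assumes Python 3; the names passthrough and SAMPLE, and the urn:example:x namespace, are made up for illustration and are not part of the recipe.

```python
import io
from xml import sax
from xml.sax import handler, saxutils

# hypothetical sample document, used only for this illustration
SAMPLE = '<doc xmlns:x="urn:example:x"><x:item id="1">text</x:item></doc>'

def passthrough(xml_text):
    """Run xml_text through parser -> filter -> generator; return the output."""
    out = io.StringIO()
    parser = sax.make_parser()
    # a plain XMLFilterBase forwards every event it receives unchanged
    xml_filter = saxutils.XMLFilterBase(parser)
    # namespace-aware parsing, so the *NS handler methods are the ones called
    xml_filter.setFeature(handler.feature_namespaces, True)
    xml_filter.setContentHandler(saxutils.XMLGenerator(out))
    xml_filter.setErrorHandler(handler.ErrorHandler())
    # parse() installs the filter as the parser's handler, then starts parsing
    xml_filter.parse(io.StringIO(xml_text))
    return out.getvalue()

result = passthrough(SAMPLE)
print(result)
```

RDFFilter slots into the same chain; the only difference is that it overrides startElementNS, characters, and endElementNS to decide which events get forwarded.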

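Non-RDF elements contained within an RDF element are also removed. To keep them instead, change the guard at the top of the startElementNS method to just if uri == RDF_NS, and push False onto the stack (rather than a copy of the latest entry) for the elements that are kept, so that their text and end tags are emitted too.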
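A sketch of that variant, assuming Python 3, might look like the following. Each element's own namespace alone decides whether it is skipped, and kept elements push False so that their character data and end tags still reach the output. Attribute filtering is left out to keep the sketch short, and the names ShallowRDFFilter, keep_nested, and SAMPLE are hypothetical.

```python
import io
from xml import sax
from xml.sax import handler, saxutils

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

class ShallowRDFFilter(saxutils.XMLFilterBase):
    """Drop RDF-namespaced elements, but keep non-RDF elements nested in them."""
    def __init__(self, *args):
        super().__init__(*args)
        # one entry per open element: True if that element is being skipped
        self.skip_stack = [False]

    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        skip = (uri == RDF_NS)       # only the element's own namespace matters
        self.skip_stack.append(skip)
        if not skip:
            super().startElementNS(name, qname, attrs)

    def characters(self, content):
        # text directly inside a skipped element is dropped
        if not self.skip_stack[-1]:
            super().characters(content)

    def endElementNS(self, name, qname):
        if not self.skip_stack.pop():
            super().endElementNS(name, qname)

def keep_nested(xml_text):
    """Filter xml_text through ShallowRDFFilter and return the result."""
    out = io.StringIO()
    xml_filter = ShallowRDFFilter(sax.make_parser())
    xml_filter.setFeature(handler.feature_namespaces, True)
    xml_filter.setContentHandler(saxutils.XMLGenerator(out))
    xml_filter.parse(io.StringIO(xml_text))
    return out.getvalue()

# hypothetical sample: a Dublin Core element wrapped in an RDF element
SAMPLE = ('<m xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
          'xmlns:dc="http://purl.org/dc/elements/1.1/">'
          '<rdf:RDF><dc:creator>Ann</dc:creator></rdf:RDF><t>kept</t></m>')
result = keep_nested(SAMPLE)
print(result)
```

In the output, the rdf:RDF wrapper is gone while the dc:creator element it contained survives, promoted to a child of the root element.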

This code doesn't delete the xmlns declaration for the RDF namespace; I'm willing to live with a little unnecessary but harmless cruft in the output rather than go to huge trouble to remove it.
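For readers who do want the declaration gone, one possible approach (a sketch, not part of the original recipe) is to intercept the prefix-mapping events as well. The filter below, written for Python 3, withholds startPrefixMapping for the RDF namespace and the matching endPrefixMapping, so XMLGenerator never learns about the prefix. The class name DropRDFDeclaration and the helper strip_rdf_declaration are hypothetical. This is only safe when every RDF element and attribute is also being filtered out, because the generator would then have no prefix with which to write such names.

```python
import io
from xml import sax
from xml.sax import handler, saxutils

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

class DropRDFDeclaration(saxutils.XMLFilterBase):
    """Withhold the prefix mapping (xmlns declaration) for the RDF namespace."""
    def __init__(self, *args):
        super().__init__(*args)
        # prefix -> stack of flags, one per open mapping; True means suppressed
        self._suppressed = {}

    def startPrefixMapping(self, prefix, uri):
        drop = (uri == RDF_NS)
        self._suppressed.setdefault(prefix, []).append(drop)
        if not drop:
            super().startPrefixMapping(prefix, uri)

    def endPrefixMapping(self, prefix):
        # forward the end event only if the matching start was forwarded
        if not self._suppressed[prefix].pop():
            super().endPrefixMapping(prefix)

def strip_rdf_declaration(xml_text):
    """Filter xml_text through DropRDFDeclaration and return the result."""
    out = io.StringIO()
    xml_filter = DropRDFDeclaration(sax.make_parser())
    xml_filter.setFeature(handler.feature_namespaces, True)
    xml_filter.setContentHandler(saxutils.XMLGenerator(out))
    xml_filter.parse(io.StringIO(xml_text))
    return out.getvalue()

# hypothetical sample that declares, but never uses, the RDF namespace
SAMPLE = ('<m xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
          'xmlns:dc="http://purl.org/dc/elements/1.1/">'
          '<dc:title>x</dc:title></m>')
result = strip_rdf_declaration(SAMPLE)
print(result)
```

The two methods could be folded into RDFFilter itself. The per-prefix stack is there because a prefix may be rebound to a different namespace in a nested scope, and endPrefixMapping receives only the prefix, not the URI.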

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420