Section 18.6. XML Processing Tools

18.6. XML Processing Tools

Python ships with XML parsing support in its standard library and plays host to a vigorous XML special-interest group. XML (eXtended Markup Language) is a tag-based markup language for describing many kinds of structured data. Among other things, it has been adopted in roles such as a standard database and Internet content representation by many companies. As an object-oriented scripting language, Python mixes remarkably well with XML's core notion of structured document interchange.

XML is based upon a tag syntax familiar to web page writers. Python's standard library xml module package includes tools for parsing XML, in both the SAX and the DOM parsing models. In short, SAX parsers provide a subclass with methods called during the parsing operation, and DOM parsers are given access to an object tree representing the (usually) already parsed document. SAX parsers are essentially state machines and must record details as the parse progresses; DOM parsers walk object trees using loops, attributes, and methods defined by the DOM standard.

Beyond these parsing tools, Python also ships with an xmlrpclib to support the XML-RPC protocol (remote procedure calls that transmit objects encoded as XML over HTTP), as well as a standard HTML parser, htmllib, that works on similar principles and is based upon the sgmllib SGML parser module. The third-party domain has even more XML-related tools; some of these are maintained separately from Python to allow for more flexible release schedules.

18.6.1. A Brief Introduction to XML Parsing

XML processing is a large, evolving topic, and it is mostly beyond the scope of this book. For an example of a simple XML parsing task, though, consider the XML file in Example 18-9. This file defines a handful of O'Reilly Python booksISBN numbers as attributes, and titles and authors as nested tags.

Example 18-9. PP3E\Internet\Other\XML\books.xml

 <catalog>   <book isbn="0-596-00128-2">     <title>Python &amp; XML</title>     <author>Jones, Drake</author>   </book>   <book isbn="0-596-00085-5">     <title>Programming Python</title>     <author>Lutz</author>   </book>    <book isbn="0-596-00281-5">     <title>Learning Python</title>     <author>Lutz, Ascher</author>   </book>    <book isbn="0-596-00797-3">     <title>Python Cookbook</title>     <author>Martelli, Ravenscroft, Ascher</author>   </book>  <!-- imagine more entries here --> </catalog>

Now, suppose we wish to parse this XML code, extracting just the ISBN numbers and titles for each book defined, and stuffing the details into a dictionary indexed by ISBN number. Python's XML parsing tools let us do this in an accurate way. Example 18-10, for instance, defines a SAX-based parsing procedure: its class implements callback methods that will be called during the parse.

Example 18-10. PP3E\Internet\Other\XML\bookhandler.py

 ################################ # SAX is a callback-based API # for intercepting parser events ################################ import xml.sax.handler class BookHandler(xml.sax.handler.ContentHandler):   def _ _init_ _(self):     self.inTitle = 0                                # handle XML parser events     self.mapping = {}                               # a state machine model   def startElement(self, name, attributes):     if name == "book":                              # on start book tag       self.buffer = ""                              # save ISBN for dict key       self.isbn = attributes["isbn"]     elif name == "title":                           # on start title tag       self.inTitle = 1                              # save title text to follow   def characters(self, data):     if self.inTitle:                                # on text within tag       self.buffer += data                           # save text if in title   def endElement(self, name):     if name == "title":       self.inTitle = 0                              # on end title tag       self.mapping[self.isbn] = self.buffer         # store title text in dict

The SAX model is efficient, but it is potentially confusing at first glance, because the class must keep track of where the parse currently is using state information. For example, when the title tag is first detected, we set a state flag and initialize a buffer; as each character within the title tag is parsed, we append it to the buffer until the ending portion of the title tag is encountered. The net effect saves the title tag's content as a string.

To kick off the parse, we make a parser, set its handler to the class in Example 18-10, and start the parse; as Python scans the XML file our class's methods are called automatically as components are encountered:

 C:\...\PP3E\Internet\Other\XML>python >>> import xml.sax >>> import bookhandler >>> import pprint >>> >>> parser = xml.sax.make_parser( ) >>> handler = bookhandler.BookHandler( ) >>> parser.setContentHandler(handler) >>> parser.parse('books.xml') >>> >>> pprint.pprint(handler.mapping) {u'0-596-00085-5': u'Programming Python',  u'0-596-00128-2': u'Python & XML',  u'0-596-00281-5': u'Learning Python',  u'0-596-00797-3': u'Python Cookbook'}

When the parse is completed, we use the Python pprint ("pretty printer") module to display the resultthe mapping dictionary object attached to our handler. Beginning with Python 2.3 the Expat parser is included with Python as the underlying parsing engine that drives the events intercepted by our class.

DOM parsing is perhaps simpler to understandwe simply traverse a tree of objects after the parsebut it might be less efficient for large documents, if the document is parsed all at once ahead of time. DOM also supports random access to document parts via tree fetches; in SAX, we are limited to a single linear parse. Example 18-11 is a DOM-based equivalent to the SAX parser listed earlier.

Example 18-11. PP3E\Internet\Other\XML\dombook.py

 ##################################### # DOM gives the whole document to the # application as a traversable object ##################################### import pprint import xml.dom.minidom from xml.dom.minidom import Node doc = xml.dom.minidom.parse("books.xml")          # load doc into object                                                   # usually parsed up front mapping = {} for node in doc.getElementsByTagName("book"):     # traverse DOM object   isbn = node.getAttribute("isbn")                # via DOM object API   L = node.getElementsByTagName("title")   for node2 in L:     title = ""     for node3 in node2.childNodes:       if node3.nodeType == Node.TEXT_NODE:         title += node3.data     mapping[isbn] = title # mapping now has the same value as in the SAX example pprint.pprint(mapping)

The output of this script is the same as what we generated interactively for the SAX parser; here, though, it is built up by walking the document object tree after the parse has finished using method calls and attributes defined by the cross-language DOM standard specification:

 C:\...\PP3E\Internet\Other\XML>dombook.py {u'0-596-00085-5': u'Programming Python',  u'0-596-00128-2': u'Python & XML',  u'0-596-00281-5': u'Learning Python',  u'0-596-00797-3': u'Python Cookbook'}

Naturally, there is much more to Python's XML support than these simple examples imply. In deference to space, though, here are pointers to XML resources in lieu of additional examples:

Standard library: First, be sure to consult the Python library manual for more on the standard library's XML support tools. See the entries for xml.sax and xml.dom for more on this section's examples.
ElementTree: The popular ElementTree extension provides easy-to-use tools for parsing, changing, and generating XML documents. It represents documents as a tree of Python objects (in the spirit of HTMLgen described earlier in this chapter). As of this writing, ElementTree (and a fast C implementation of it) is still a third-party package, but its core components were scheduled to be incorporated into the standard library of Python 2.5 in six months, as the package xml.etree. See the 2.5 library manual for details.
PyXML SIG tools: You can also find Python XML tools and documentation at the XML Special Interest Group (SIG) web page at http://www.python.org (click on the SIGs link near the top). This SIG is dedicated to wedding XML technologies with Python, and it publishes a free XML tools package distribution called PyXML. That package contains tools not yet part of the standard Python distribution. Much of the standard library's XML support originated in PyXML, though that package still contains tools not present in Python itself.
Third-party tools: You can also find free, third-party Python support tools for XML on the Web by following links at the XML SIGs web page. Of special interest, the 4Suite package from Fourthought provides integrated tools for XML processing, including open technologies such as DOM, SAX, RDF, XSLT, XInclude, XPointer, XLink, and XPath.
Documentation: O'Reilly offers a book dedicated to the subject of XML processing in Python, Python & XML, written by Christopher A. Jones and Fred L. Drake, Jr.

As usual, be sure to check Python's web site or your favorite web search engine for more recent developments on this front.