18.6. XML Processing Tools

Python ships with XML parsing support in its standard library and plays host to a vigorous XML special-interest group. XML (Extensible Markup Language) is a tag-based markup language for describing many kinds of structured data. Among other things, many companies have adopted it as a standard representation for database and Internet content. As an object-oriented scripting language, Python mixes remarkably well with XML's core notion of structured document interchange.

XML is based upon a tag syntax familiar to web page writers. The xml package in Python's standard library includes tools for parsing XML in both the SAX and the DOM parsing models. In short, SAX parsers call the methods of a handler subclass as the parse proceeds, and DOM parsers give access to an object tree representing the (usually) already parsed document. SAX handlers are essentially state machines and must record details as the parse progresses; DOM-based code walks object trees using loops, attributes, and methods defined by the DOM standard.

Beyond these parsing tools, Python also ships with an xmlrpclib module to support the XML-RPC protocol (remote procedure calls that transmit objects encoded as XML over HTTP), as well as a standard HTML parser, htmllib, that works on similar principles and is based upon the sgmllib SGML parser module. The third-party domain has even more XML-related tools; some of these are maintained separately from Python to allow for more flexible release schedules.

18.6.1. A Brief Introduction to XML Parsing

XML processing is a large, evolving topic, and it is mostly beyond the scope of this book. For an example of a simple XML parsing task, though, consider the XML file in Example 18-9. This file defines a handful of O'Reilly Python books, with ISBN numbers as attributes, and titles and authors as nested tags.

Example 18-9. PP3E\Internet\Other\XML\books.xml
Now, suppose we wish to parse this XML code, extracting just the ISBN number and title of each book defined, and storing the details in a dictionary indexed by ISBN number. Python's XML parsing tools let us do this accurately. Example 18-10, for instance, defines a SAX-based parsing procedure: its class implements callback methods that will be called during the parse.

Example 18-10. PP3E\Internet\Other\XML\bookhandler.py
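A minimal sketch of such a handler, written for current Python 3 rather than reproduced from bookhandler.py, might look like the following. The inline sample document, its catalog, book, title, and author tag names, the isbn attribute name, and the parse_books helper are all assumptions made for illustration; only the BookHandler class name and its mapping attribute are taken from the interactive session in this section.

```python
import xml.sax
import xml.sax.handler

# Hypothetical stand-in for books.xml; the tag and attribute names are assumed.
SAMPLE_XML = b"""<catalog>
  <book isbn="0-596-00128-2">
    <title>Python &amp; XML</title>
    <author>Jones, Drake</author>
  </book>
  <book isbn="0-596-00085-5">
    <title>Programming Python</title>
    <author>Lutz</author>
  </book>
</catalog>"""


class BookHandler(xml.sax.handler.ContentHandler):
    """Collect {isbn: title} pairs from SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.mapping = {}        # result: ISBN -> title
        self.isbn = None         # ISBN of the book currently being parsed
        self.in_title = False    # state flag: are we inside a <title> tag?
        self.title_parts = []    # buffer for the title's character data

    def startElement(self, name, attrs):
        if name == "book":
            self.isbn = attrs["isbn"]
        elif name == "title":
            self.in_title = True
            self.title_parts = []

    def characters(self, data):
        # The parser may deliver text in several chunks, so buffer them.
        if self.in_title:
            self.title_parts.append(data)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
            self.mapping[self.isbn] = "".join(self.title_parts)


def parse_books(xml_bytes):
    """Hypothetical helper: run a SAX parse and return the mapping."""
    handler = BookHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.mapping


if __name__ == "__main__":
    print(parse_books(SAMPLE_XML))
```

Buffering in characters matters because a parser is free to report one run of text as several pieces, for instance around an entity such as &amp;amp;.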
The SAX model is efficient, but it is potentially confusing at first glance, because the class must use state information to keep track of where the parse currently is. For example, when the title tag is first detected, we set a state flag and initialize a buffer; as each character within the title tag is parsed, we append it to the buffer until the ending portion of the title tag is encountered. The net effect saves the title tag's content as a string.

To kick off the parse, we make a parser, set its handler to an instance of the class in Example 18-10, and start the parse; as Python scans the XML file, our class's methods are called automatically as components are encountered:

    C:\...\PP3E\Internet\Other\XML>python
    >>> import xml.sax
    >>> import bookhandler
    >>> import pprint
    >>>
    >>> parser = xml.sax.make_parser()
    >>> handler = bookhandler.BookHandler()
    >>> parser.setContentHandler(handler)
    >>> parser.parse('books.xml')
    >>>
    >>> pprint.pprint(handler.mapping)
    {u'0-596-00085-5': u'Programming Python',
     u'0-596-00128-2': u'Python & XML',
     u'0-596-00281-5': u'Learning Python',
     u'0-596-00797-3': u'Python Cookbook'}

When the parse is completed, we use the Python pprint ("pretty printer") module to display the result: the mapping dictionary object attached to our handler. Beginning with Python 2.3, the Expat parser is included with Python as the underlying parsing engine that drives the events intercepted by our class.

DOM parsing is perhaps simpler to understand (we simply traverse a tree of objects after the parse), but it might be less efficient for large documents if the document is parsed all at once ahead of time. DOM also supports random access to document parts via tree fetches; in SAX, we are limited to a single linear parse. Example 18-11 is a DOM-based equivalent to the SAX parser listed earlier.

Example 18-11. PP3E\Internet\Other\XML\dombook.py
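As a rough illustration of the DOM approach, and not a reproduction of dombook.py, the following Python 3 sketch builds the same kind of ISBN-to-title dictionary by walking a parsed tree with xml.dom.minidom. The inline sample document, its tag names, the isbn attribute name, and the books_by_isbn function are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical stand-in for books.xml; the tag and attribute names are assumed.
SAMPLE_XML = """<catalog>
  <book isbn="0-596-00128-2">
    <title>Python &amp; XML</title>
    <author>Jones, Drake</author>
  </book>
  <book isbn="0-596-00085-5">
    <title>Programming Python</title>
    <author>Lutz</author>
  </book>
</catalog>"""


def books_by_isbn(xml_text):
    """Parse the whole document, then walk the tree to build {isbn: title}."""
    doc = minidom.parseString(xml_text)
    mapping = {}
    for book in doc.getElementsByTagName("book"):
        isbn = book.getAttribute("isbn")
        # Take the first <title> nested in this book and join its text nodes.
        title = book.getElementsByTagName("title")[0]
        text = "".join(
            node.data
            for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        mapping[isbn] = text
    return mapping


if __name__ == "__main__":
    print(books_by_isbn(SAMPLE_XML))
```

No state flags are needed here: because the full tree exists before the loop runs, each book's attribute and title can be fetched directly, at the cost of holding the whole document in memory.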
The output of this script is the same as what we generated interactively for the SAX parser; here, though, it is built up by walking the document object tree after the parse has finished, using method calls and attributes defined by the cross-language DOM standard specification:

    C:\...\PP3E\Internet\Other\XML>dombook.py
    {u'0-596-00085-5': u'Programming Python',
     u'0-596-00128-2': u'Python & XML',
     u'0-596-00281-5': u'Learning Python',
     u'0-596-00797-3': u'Python Cookbook'}

Naturally, there is much more to Python's XML support than these simple examples imply. In the interest of space, though, here are pointers to XML resources in lieu of additional examples:
As usual, be sure to check Python's web site or your favorite web search engine for more recent developments on this front.