Python | Professional XML (Programmer to Programmer)

Python is an interpreted scripting language that has been around for quite some time. It was created by Guido van Rossum in 1990. The name is not derived from the snake, but is in fact a reference to the television show, "Monty Python's Flying Circus." This reference tends to be repeated throughout Python code, resulting in many samples making reference to spam and parrots.

Python has many of the text-processing features of Perl, but with a more readable syntax and more object-orientation. One of the core tenets of Python coding is that readability is more important than brevity. While it may take more code in Python to perform a given task compared to some languages (most notably Perl), you are much more likely to understand just what the code is intended to do, even months later. In addition to text-processing, the breadth of libraries means that Python can write just about any type of application. It has been embraced fairly strongly by Web developers. The first version of Microsoft's Site Server was written in Python. In addition, the popular video-sharing site YouTube is reported to be written, "almost entirely in Python."

Python is available for almost every platform imaginable, including the new IronPython implementation for Microsoft's .NET platform. The current version as of this writing is 2.5. This section focuses on some of the more commonly used XML parsers. As with Perl, there are many more modules available. The Python Cheese Shop Web site (python.org/pypi) is the official central repository of Python modules. As of this writing, it lists over 1850 packages, of which there are approximately 50 for working with XML. While this seems lower than other languages, this is likely due to the strength of the core XML modules. For more details on Python, see Beginning Python (ISBN 978-0-7645-9654-4) or the Python Web site (python.org). The samples in this section were created using the Windows version of Python version 2.5.

Reading and Writing XML

Since Python 2, the standard library includes support for processing XML, including modules that include DOM, SAX, pull, and object-based syntaxes.

The DOM implementation is a tree-based model, and is modeled on the W3C DOM implementation. As with other DOM implementations, the result is an in-memory structure that contains the whole document. You can then move forward and backward through the document as necessary. SAX and pull provide lighter-weight models for reading XML. The SAX parsing is similar to the other implementations: You provide a number of event handlers that are called while the document is being read. Pull parsers, on the other hand, are generally called in a loop. Each time through the loop, the current position is advanced. Both SAX and pull parsers are best when you need to read through a document in a forward-only fashion, or when the document is quite large. Finally, object-based parsers convert the XML into Python data structures. While this provides the most natural, or Pythonic, means of working with XML, it is also typically the least portable code.

Reading XML

The DOM support in Python is included in the xml.dom and xml.dom.minidom libraries. The xml.dom library is a full implementation of the W3C DOM, while the xml.dom.minidom was designed as a lightweight version of the DOM, removing support for some of the less used features. Although reading XML with the DOM may not be the most "Pythonesque" means of processing XML, it is the most portable technique. For more details on the DOM, see Chapter 12. Listing 17-10 shows loading the customers.xml file using Python with the minidom library.

Listing 17-10: Loading XML using DOM in Python

      from xml.dom.minidom import parse      doc = parse("customers.xml")      print doc.documentElement.childNodes.length      print "========="      print "\tusing toxml"      print "Print first customer"      print doc.documentElement.childNodes[1].toxml()      print "========="      print "\tusing toprettyxml"      print doc.documentElement.childNodes[1].toprettyxml()      print "========="      print "getElementsByTagName returns array of customers"       customers = doc.getElementsByTagName("customer")      print customers[15].toprettyxml('..', '\n')      print "========="      print "Return Attribute"      print customers[34].attributes.item(0).value

In order to use the DOM objects, you import them into the Python script. The first line in the preceding code imports the parse object, which is used to load the local file. There is also a parseString object that is used to load string data containing XML. At this step, the memory structure of the XML document is generated. After it is loaded, you can use the DOM methods described in Chapter 12 to extract the nodes in the document. Each of the nodes in the minidom have the methods toxml() and toprettyxml() included. These methods are non-standard, and are used to export the node in XML. The difference between the two is that the toprettyxml supports the addition of parameters for altering the format. In the code in Listing 17-10, the code retrieves the first child of the root element, and prints it to the console. The first call, using toxml looks like the code in Listing 17-11 (extra lines have been removed).

Listing 17-11: Output using toxml

        using toxml      Print first customer      <customer >        <company>          Alfreds Futterkiste        </company>        <address>          <street>            Obere Str. 57          </street>          <city>            Berlin          </city>          <zip>            12209          </zip>          <country>            Germany          </country>        </address>        <contact>          <name>            Maria Anders          </name>          <title>            Sales Representative          </title>          <phone>            030-0074321          </phone>          <fax>            030-0076545          </fax>         </contact>      </customer>

In contrast, the default printout using toprettyxml creates a more compact document, as seen in Listing 17-12.

Listing 17-12: Output using toprettyxml

        using toprettyxml      <customer >          <company>Alfreds Futterkiste</company>          <address>            <street>Obere Str. 57</street>            <city>Berlin</city>            <zip>12209</zip>            <country>Germany</country>          </address>          <contact>            <name>Maria Anders</name>            <title>Sales Representative</title>            <phone>030-0074321</phone>            <fax>030-0076545</fax>          </contact>        </customer>

Alternately, you can use the properties of the toprettyxml to alter the characters used to indent the lines and the character to use at the end of lines.

Rather than use the DOM, you may want to use stream-based parsing. Tree-based parsers that provide a DOM interface have a bit of a bad reputation for memory use. This is because they must load the entire XML document into memory and create the in-memory representation of the DOM. Stream-based parsers, such as those based on SAX, do not have this limitation, because they hold only a small fraction of the document in memory at any one time. They are also incredibly fast at processing files. The major problem with stream-based processors, however, is that they are forward-only. If you want to move backwards through the document, stream-based parsers do not give you this capability. For more details on SAX, see Chapter 13. The Python SAX parser is defined in the xml.sax library. Within that library, there are three main functions that are used.

Open table as spreadsheet

Class	Description
`make_parser`	Creates a SAX parser. Before using this parser, you must set the class that will perform the processing. This class must inherit from `xml.sax.handler.ContentHandler`.
`parse`	Calls the `ContentHandler` to process the document. The class needs to be created first using `make_parser`. Takes a file as a parameter.
`parseString`	Calls the `ContentHandler` to process the document. The class needs to be created first using `make_parser`. Takes a string as a parameter.

Listing 17-13 shows using the SAX parser with Python.

Listing 17-13: Parsing XML with SAX using Python

            from xml.sax import make_parser      from xml.sax.handler import ContentHandler      file = "customers.xml"      class CityCounter(ContentHandler):        def __init__(self):          self.in_city = 0          self.cities = {}        def startElement(self, name, attrs):          if name == 'city':            self.in_city = 1        def endElement(self, name):          if name == 'customers':            for city, count in self.cities.items():              print city, ": ", count        def characters(self, text):          if self.in_city:            self.in_city = 0            if self.cities.has_key(text):              self.cities[text] = self.cities[text] + 1            else:              self.cities[text] = 1      #main routine      p = make_parser()      cc = CityCounter()      p.setContentHandler(cc)      p.parse(file)

As with other SAX-based parsers, you create one or more methods that are called by the parser when specific XML nodes are processed. In this case, the three methods are created in a class that inherits from the default SAX content handler (xml.sax.handler.ContentHandler). This enables chaining, in case you wanted to have multiple SAX processors working on the same XML file. This class is assigned to the parser using the setContentHandler method. In addition, you could use the setErrorHandler method to identify a handler that will be called if an error occurs during the processing of the XML. When the parse method of the SAX parser begins the processing, your methods are called as needed. Just as with other SAX implementations, the startElement method is called at the beginning of each element, endElement for the close of the element, and characters is called for the content of the element. You can test this script on the command line using the command (assuming that Python.exe is on your system path):

      python streamcustomers.py

Listing 17-14: Output of the Python SAX processor

            Boise :  1      Leipzig :  1      Caracas :  1      Strasbourg :  1      Lille :  1      Barcelona :  1      Oulu :  1      Aachen :  1      Warszawa :  1      Marseille :  1      Montreal :  1      Mannheim :  1      Elgin :  1      Reggio Emilia :  1      Toulouse :  1      Walla Walla :  1      Madrid :  3      San Cristobal :  1      Sevilla :  1      Kobenhavn :  1      Munchen :  1      Bruxelles :  1      London :  6      Helsinki :  1      Lisboa :  2      Portland :  2      Seattle :  1      Bräcke :  1

Writing XML

Just as Python has methods for reading, it also has a number of ways to write XML.

XML Bookmark Exchange Language (XBEL) is an XML format defined by the Python XML Special Interest Group as a format for applications to share Web browser bookmarks. It has many of the features of OPML, but has the advantage that it is not tied as closely to the implementation of one program as OPML is. In addition, it has a DTD, enabling validation. You can read more about the XBEL format at the XBEL Resources page (http://www.pyxml.sourceforge.net/topics/xbel/). Listing 17-15 shows the creation of a small XBEL file using Python.

Listing 17-15: Writing XML with Python

      from xml.dom.minidom import getDOMImplementation      from xml.dom import EMPTY_NAMESPACE      class simpleXBELWriter:        def __init__(self, name):       impl = getDOMImplementation()          doctype = impl.createDocumentType("xbel",            "+//IDN python.org//DTD XML Bookmark Exchange Language 1.0//EN//XML",            "http://www.python.org/topics/xml/dtds/xbel-1.0.dtd")          self.doc = impl.createDocument(EMPTY_NAMESPACE,            "xbel", doctype)          self.doc.documentElement.setAttribute("version", "1.0")          root = self.doc.createElement("folder")          self.doc.documentElement.appendChild(root)          def addBookmark(self, uri, title, desc=None):          book = self.doc.createElement("bookmark")          book.setAttribute("href", uri)          t = self.doc.createElement("title")          t.appendChild(self.doc.createTextNode(title))          book.appendChild(t)          if(desc):            d = self.doc.createElement("desc")            d.appendChild(self.doc.createTextNode(desc))            book.appendChild(d)          self.doc.getElementsByTagName("folder")[0].appendChild(book)          def Print(self):          print self.doc.toprettyxml()      w = simpleXBELWriter("Some useful bookmarks")      w.addBookmark("http://www.geekswithblogs.net/evjen",        "Bill Evjen's Weblog")      w.addBookmark("http://www.acmebinary.com/blogs/kent",        "Kent Sharkey's Weblog")      w.addBookmark("http://www.wrox.com",        "Wrox Home Page",        "Home of great, red books")      w.Print()

Just as with other DOM implementations, you create each node at the document level and append it where necessary. In the code in Listing 17-15, a doctype is first created, to enable validating parsers to check the resulting document. This uses the doctype defined for XBEL documents.

The basic structure of an XBEL document is a root node (xbel) containing multiple folder elements. The sample shown in Listing 17-15 creates only a single folder and inserts all bookmarks into it. In a more robust implementation of an XBEL generator, you would want to enable multiple nested folders. Title and, optionally, description elements are added to each bookmark. Listing 17-16 shows the resulting XML.

Listing 17-16: A created XBEL document

            <?xml version="1.0" encoding="UTF-8"?>      <!DOCTYPE xbel PUBLIC         "+//IDN python.org//DTD XML Bookmark Exchange Language 1.0//EN//XML"         "http://www.python.org/topics/xml/dtds/xbel-1.0.dtd">      <xbel version="1.0">        <folder>          <bookmark href="http://www.geekswithblogs.net/evjen">            <title>Bill Evjen's          </bookmark>          <bookmark href="http://www.acmebinary.com/blogs/kent">            <title>Kent Sharkey's          </bookmark>          <bookmark href="http://www.wrox.com">            <title>Wrox Home Page</title>            <desc>Home of great, red books</desc>          </bookmark>         </folder>      </xbel>

Support for Other XML Formats

As you might expect, what you have seen is only the tip of a huge iceberg. Python has a number of additional libraries for working with XML. Some of the most notable ones are:

q xml.marshal-Part of the PyXML distribution. This library enables a simple means of converting between Python objects and XML.
q XSLT-A number of XSLT processors are available for Python, including 4XSLT, Pyana, and libxslt.
q Web services-A number of Python Web service clients exist, including XML-RPC (xmlrpclib) and SOAP (SOAPpy).