Introduction


Credit: Paul Prescod, co-author of XML Handbook (Prentice-Hall)

XML has become a central technology for all kinds of information exchange. Today, most new file formats that are invented are based on XML. Most new protocols are based upon XML. It simply isn't possible to work with the emerging Internet infrastructure without supporting XML. Luckily, Python has had XML support since many versions ago, and Python's support for XML has kept growing and maturing year after year.

Python and XML are perfect complements. XML is an open standards way of exchanging information. Python is an open source language that processes the information. Python excels at text processing and at handling complicated data structures. XML is text based and is, above all, a way of exchanging complicated data structures.

That said, working with XML is not so seamless that it requires no effort. There is always somewhat of a mismatch between the needs of a particular programming language and a language-independent information representation. So there is often a requirement to write code that reads (i.e., deserializes or parses) and writes (i.e., serializes) XML.

Parsing XML can be done with code written purely in Python, or with a module that is a C/Python mix. Python comes with the fast Expat parser written in C. Many XML applications use the Expat parser, and one of these recipes accesses Expat directly to build its own concept of an ideal in-memory Python representation of an XML document as a tree of "element" objects (an alternative to the standard DOM approach, which I will mention later in this introduction).

However, although Expat is ubiquitous in the XML world, it is far from being the only parser available, or necessarily the best one for any given application. A standard API called SAX allows any XML parser to be plugged into a Python program. The SAX API is demonstrated in several recipes that perform typical tasks such as checking that an XML document is well formed, extracting text from a document, or counting the tags in a document. These recipes should give you a good understanding of how SAX works. One more advanced recipe shows how to use one of SAX's several auxiliary features, "filtering", to normalize "text events" that might otherwise happen to get "fragmented".

XML-RPC is a protocol built on top of XML for sending data structures from one program to another, typically across the Internet. XML-RPC allows programmers to completely hide the implementation languages of the two communicating components. Two components running on different operating systems, written in different languages, can still communicate easily. XML-RPC is built into Python. This chapter does not deal with XML-RPC, because, together with other alternatives for distributed programming, XML-RPC is covered in Chapter 15.

Other recipes in this chapter are a little bit more eclectic, dealing with issues that range from interfacing, to proprietary XML parsers and document formats, to representing an entire XML document in memory as a Python object. One, in particular, shows how to auto-detect the Unicode encoding that an XML document uses without parsing the document. Unicode is central to the definition of XML, so it's important to understand Python's Unicode support if you will be doing any sophisticated work with XML.

The PyXML extension package supplies a variety of useful tools for working with XML. PyXML offers a full implementation of the Document Object Model (DOM)as opposed to the subset bundled with Python itselfand a validating XML parser written entirely in Python. The DOM is a standard API that loads an entire XML document into memory. This can make XML processing easier for complicated structures in which there are many references from one part of the document to another, or when you need to correlate (i.e., compare) more than one XML document. One recipe shows how to use PyXML's validating parser to validate and process an XML document, and another shows how to remove whitespace-only text nodes from an XML document's DOM. You'll find many other examples in the documentation of the PyXML package (http://pyxml.sourceforge.net/).

Other advanced tools that you can find in PyXML or, in some cases, in FourThought's open source 4Suite package (http://www.4suite.org/) from which much of PyXML derives, include implementations of a variety of XML-related standards, such as XPath, XSLT, XLink, XPointer, and RDF. If PyXML is already an excellent resource for XML power users in Python, 4Suite is even richer and more powerful.

XML has become so pervasive that, inevitably, you will also find XML-related recipes in other chapters of this book. Recipe 2.26 strips XML markup in a very rough and ready way. Recipe 1.23 shows how to insert XML character references while encoding Unicode text. Recipe 10.17, parses a Mac OS X pinfo-format XML stream to get detailed system information. Recipe 11.10 uses Tkinter to display a XML DOM as a GUI Tree widget. Recipe 14.11 deals with two XML file formats related to RSS[1] feeds, fetching and parsing a FOAF[2]-format input to produce an OPML[3]-format resultquite a typical XML-related task in today's programming, and a good general example of how Python can help you with such tasks.

[1] RSS (Really Simple Syndication)

[2] FOAF (Friend of a Friend)

[3] OPML (Outline Processor Markup Language)

For more information on using Python and XML together, see Python and XML by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly).



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net