5.4 Understanding XML

Extensible Markup Language (XML) is a text format increasingly used for a wide variety of storage and transport requirements. Parsing and processing XML is an important element of many text processing applications. This section discusses the most common techniques for dealing with XML in Python. While XML held an initial promise of simplifying the exchange of complex and hierarchically organized data, it has itself grown into a standard of considerable complexity. This book will not cover most of the API details of XML tools; an excellent book dedicated to that subject is:

Python & XML, Christopher A. Jones & Fred L. Drake, Jr., O'Reilly 2002. ISBN: 0-596-00128-2.

The XML format is sufficiently rich to represent any structured data, some forms more straightforwardly than others. One task to which XML is naturally suited is representing marked-up text (documentation, books, articles, and the like), as is its parent SGML. But XML is probably used more often to represent data than texts: record sets, OOP data containers, and so on. In many of these cases, the fit is more awkward and requires extra verbosity. XML itself is more like a metalanguage than a language: there is a set of syntax constraints that any XML document must obey, but typically particular APIs and document formats are defined as XML dialects. That is, a dialect consists of a particular set of tags that are used within a type of document, along with rules for when and where to use those tags. What I refer to as an XML dialect is also sometimes more formally called "an application of XML."

THE DATA MODEL

At base, XML has two ways to represent data. Attributes in XML tags map names to values. Both names and values are Unicode strings (as are XML documents as a whole), but values frequently encode other basic datatypes, especially when specified in W3C XML Schemas. Attribute names are mildly restricted by the special characters used for XML markup; attribute values can encode any strings once a few characters are properly escaped. XML attribute values are whitespace normalized when parsed, but whitespace can itself also be escaped. A bare example is:

    >>> from xml.dom import minidom
    >>> x = '''<x a="b" d="e f g" num="38" />'''
    >>> d = minidom.parseString(x)
    >>> d.firstChild.attributes.items()
    [(u'a', u'b'), (u'num', u'38'), (u'd', u'e f g')]

As with a Python dictionary, no order is defined for the list of key/value attributes of one tag.
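Because attribute order carries no meaning, it is often convenient to copy the attributes of a tag into a dictionary and convert values explicitly. The following is a minimal sketch written for Python 3 rather than the older interpreter used in the sessions of this section; the variable names are only illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString('<x a="b" d="e f g" num="38" />')
elem = doc.documentElement

# Attribute order is not significant, so a dict is a natural container
attrs = dict(elem.attributes.items())

# Values always arrive as strings; any conversion is up to the application
num = int(attrs['num'])

print(sorted(attrs))   # sort the names to get a stable display order
print(num + 1)
```

Any datatype that a schema assigns to an attribute is invisible at this level; the parser hands back strings only.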

The second way XML represents data is by nesting tags inside other tags. In this context, a tag together with a corresponding "close tag" is called an element, and it may contain an ordered sequence of subelements. The subelements themselves may also contain nested subelements. A general term for any part of an XML document, whether an element, an attribute, or one of the special parts discussed below, is a "node." A simple example of an element that contains some subelements is:

    >>> x = '''<?xml version="1.0" encoding="UTF-8"?>
    ... <root>
    ...   <a>Some data</a>
    ...   <b data="more data" />
    ...   <c data="a list">
    ...     <d>item 1</d>
    ...     <d>item 2</d>
    ...   </c>
    ... </root>'''
    >>> d = minidom.parseString(x)
    >>> d.normalize()
    >>> for node in d.documentElement.childNodes:
    ...     print node
    ...
    <DOM Text node "
      ">
    <DOM Element: a at 7033280>
    <DOM Text node "
      ">
    <DOM Element: b at 7051088>
    <DOM Text node "
      ">
    <DOM Element: c at 7053696>
    <DOM Text node "
    ">
    >>> d.documentElement.childNodes[3].attributes.items()
    [(u'data', u'more data')]

There are several things to notice about the Python session above.

  1. The "document element," named root in the example, contains three ordered subelement nodes, named a, b, and c.

  2. Whitespace is preserved within elements. Therefore the spaces and newlines that come between the subelements make up several text nodes. Text and subelements can intermix, each potentially meaningful. Spacing in XML documents is significant, but it is nonetheless also often used for visual clarity (as above).

  3. The example contains an XML declaration, <?xml...?>, which is optional but generally included.

  4. Any given element may contain attributes and subelements and text data.
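Since whitespace between subelements shows up as text nodes, code that wants only the subelements usually has to filter by node type. The sketch below (Python 3; element_children is a hypothetical helper name, not part of xml.dom) shows one way to do that for the sample document.

```python
from xml.dom import minidom, Node

xml_text = '''<root>
  <a>Some data</a>
  <b data="more data" />
  <c data="a list">
    <d>item 1</d>
    <d>item 2</d>
  </c>
</root>'''

def element_children(node):
    """Return only the element children, skipping text and other nodes."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]

root = minidom.parseString(xml_text).documentElement
names = [child.tagName for child in element_children(root)]

# The raw child list also holds the whitespace text nodes
print(len(root.childNodes), names)
```

Filtering this way avoids hard-coding odd-numbered indices such as childNodes[3], which break as soon as the document is reformatted.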

OTHER XML FEATURES

Besides regular elements and text nodes, XML documents can contain several kinds of "special" nodes. Comments are common and useful, especially in documents intended to be hand edited at some point (or even potentially). Processing instructions may indicate how a document is to be handled. Document type declarations may indicate expected validity rules for where elements and attributes may occur. A special type of node called CDATA lets you embed mini-XML documents or other special codes inside of other XML documents, while leaving markup untouched. Examples of each of these forms look like:

    <?xml version="1.0" ?>
    <!DOCTYPE root SYSTEM "sometype.dtd">
    <root>
    <!-- This is a comment -->
    This is text data inside the &lt;root&gt; element
    <![CDATA[Embedded (not well-formed) XML:
             <this><that> >>string<< </that>]]>
    </root>
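To see how a parser reports these special nodes, the following Python 3 sketch parses a small document of my own (not the one above, and without a document type declaration) and labels each child of the document element by its DOM node type.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<root><?target some data?><!-- a comment -->'
    'text &lt;escaped&gt;<![CDATA[<raw> & unescaped]]></root>')

labels = {
    Node.PROCESSING_INSTRUCTION_NODE: 'processing instruction',
    Node.COMMENT_NODE: 'comment',
    Node.TEXT_NODE: 'text',
    Node.CDATA_SECTION_NODE: 'CDATA section',
}

children = doc.documentElement.childNodes
kinds = [labels[child.nodeType] for child in children]
for kind, child in zip(kinds, children):
    # Each of these node types stores its content in a .data attribute
    print('%-22s %r' % (kind, child.data))
```

Note that the escaped text comes back with its entities resolved, while the CDATA content is returned exactly as written.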

XML documents may be either "well-formed" or "valid." The first characterization simply indicates that a document obeys the proper syntactic rules for XML documents in general: All tags are either self-closed or followed by a matching end tag; reserved characters are escaped; tags are properly hierarchically nested; and so on. Of course, particular documents can also fail to be well-formed, but in that case they are not XML documents sensu stricto, merely fragments or near-XML. A formal description of well-formed XML can be found at <http://www.w3.org/TR/REC-xml> and <http://www.w3.org/TR/xml11/>.
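A quick well-formedness test can be written by simply running a nonvalidating parser over the text and watching for an error. This is a minimal Python 3 sketch; is_well_formed is a hypothetical helper name.

```python
from xml.parsers import expat

def is_well_formed(text):
    """Return True if text parses as a complete, well-formed XML document."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)    # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True

print(is_well_formed('<a><b/></a>'))          # properly nested
print(is_well_formed('<a><b></a>'))           # <b> is never closed
print(is_well_formed('<a>fish & chips</a>'))  # reserved character not escaped
```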

Beyond well-formedness, some XML documents are also valid. Validity means that a document matches a further grammatical specification given in a Document Type Definition (DTD), or in an XML Schema. The most popular style of XML Schema is the W3C XML Schema specification, found in formal detail at <http://www.w3.org/TR/xmlschema-0/> and in linked documents. There are competing schema specifications, however; one popular alternative is RELAX NG, which is documented at <http://www.oasis-open.org/committees/relax-ng/>.

The grammatical specifications indicated by DTDs are strictly structural. For example, you can specify that certain subelements must occur within an element, with a certain cardinality and order. Or, certain attributes may or must occur with a certain tag. As a simple case, the following DTD is one that the prior example of nested subelements would conform to. There are an infinite number of DTDs that the sample could match, but each one describes a slightly different range of valid XML documents:

    <!ELEMENT root ((a|OTHER-A)?, b, c*)>
    <!ELEMENT a (#PCDATA)>
    <!ELEMENT b EMPTY>
    <!ATTLIST b data CDATA #REQUIRED
                NOT-THERE (this | that) #IMPLIED>
    <!ELEMENT c (d+)>
    <!ATTLIST c data CDATA #IMPLIED>
    <!ELEMENT d (#PCDATA)>

The W3C recommendation on the XML standard also formally specifies DTD rules. A few features of the above DTD example can be noted here. The element OTHER-A and the attribute NOT-THERE are permitted by this DTD, but were not utilized in the previous sample XML document. The quantifications ?, *, and +; the alternation |; and the comma sequence operator have meanings similar to those in regular expressions and BNF grammars. Attributes may be required or optional as well and may contain any of several specific value types; for example, the data attribute may contain any string, while the NOT-THERE attribute may contain this or that only.

Schemas go further than DTDs in a couple of ways. Besides specifying that elements or attributes must contain strings describing particular datatypes, such as numbers or dates, schemas allow more flexible quantification of subelement occurrences. For example, the following W3C XML Schema might describe an XML document for purchases:

    <xsd:element name="item">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="USPrice" type="xsd:decimal"/>
          <xsd:element name="shipDate" type="xsd:date"
                       minOccurs="0" maxOccurs="3" />
        </xsd:sequence>
        <xsd:attribute name="partNum" type="SKU"/>
      </xsd:complexType>
    </xsd:element>
    <!-- Stock Keeping Unit, a code for identifying products -->
    <xsd:simpleType name="SKU">
       <xsd:restriction base="xsd:string">
          <xsd:pattern value="\d{3}-[A-Z]{2}"/>
       </xsd:restriction>
    </xsd:simpleType>

An XML document that is valid under this schema is:

    <item partNum="123-XQ">
      <USPrice>21.95</USPrice>
      <shipDate>2002-11-26</shipDate>
    </item>

Formal specifications of schema languages can be found at the above-mentioned URLs; this example is meant simply to illustrate the types of capabilities they have.
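To make the datatype constraints concrete, here is a hand-rolled Python 3 sketch that checks two of the rules from the purchase schema: the SKU pattern and the decimal price. It is only an approximation of what a schema validator does (Python's Decimal accepts a few forms that xsd:decimal does not, and shipDate is not checked at all); check_item is a hypothetical helper.

```python
import re
from decimal import Decimal, InvalidOperation
from xml.dom import minidom

SKU = re.compile(r'\d{3}-[A-Z]{2}')    # same pattern as the SKU simpleType

def check_item(xml_text):
    """Apply two of the schema's datatype rules to an <item> by hand."""
    item = minidom.parseString(xml_text).documentElement
    # XML Schema patterns must match the whole value, hence fullmatch()
    if not SKU.fullmatch(item.getAttribute('partNum')):
        return False
    prices = item.getElementsByTagName('USPrice')
    if len(prices) != 1:
        return False
    try:
        Decimal(prices[0].firstChild.data)
    except (InvalidOperation, AttributeError):
        # Not a number, or the element had no text child at all
        return False
    return True

good = '<item partNum="123-XQ"><USPrice>21.95</USPrice></item>'
bad_sku = '<item partNum="12-XQ"><USPrice>21.95</USPrice></item>'
bad_price = '<item partNum="123-XQ"><USPrice>cheap</USPrice></item>'
print(check_item(good), check_item(bad_sku), check_item(bad_price))
```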

In order to check the validity of an XML document against a DTD or schema, you need to use a validating parser. Some stand-alone tools perform validation, generally with diagnostic messages in cases of invalidity. As well, certain libraries and modules support validation within larger applications. As a rule, however, most Python XML parsers are nonvalidating and check only for well-formedness.
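The difference is easy to demonstrate. In the following Python 3 sketch, a document carries an internal DTD that it plainly violates, yet the nonvalidating parser behind xml.dom.minidom accepts it without complaint because it is still well-formed.

```python
from xml.dom import minidom

# The internal DTD says <root> must contain exactly one <a> element,
# yet the document body contains a <b> element instead.
invalid_but_well_formed = '''<!DOCTYPE root [
  <!ELEMENT root (a)>
]>
<root><b/></root>'''

doc = minidom.parseString(invalid_but_well_formed)   # no exception is raised
child = doc.documentElement.firstChild
print(child.tagName)
```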

Quite a number of technologies have been built on top of XML, many endorsed and specified by W3C, OASIS, or other standards groups. One in particular that you should be aware of is XSLT. There are a number of thick books available that discuss XSLT, so the matter is too complex to document here. But in shortest characterization, XSLT is a declarative programming language whose syntax is itself an XML application. An XML document is processed using a set of rules in an XSLT stylesheet, to produce a new output, often a different XML document. The elements in an XSLT stylesheet each describe a pattern that might occur in a source document and contain an output block that will be produced if that pattern is encountered. That is the simple characterization, anyway; in the details, "patterns" can have loops, recursions, calculations, and so on. I find XSLT to be more complicated than genuinely powerful and would rarely choose the technology for my own purposes, but you are fairly likely to encounter existing XSLT processes if you work with existing XML applications.

5.4.1 Python Standard Library XML Modules

There are two principal APIs for accessing and manipulating XML documents that are in widespread use: DOM and SAX. Both are supported in the Python standard library, and these two APIs make up the bulk of Python's XML support. Both of these APIs are programming language neutral, and using them in other languages is substantially similar to using them in Python.

The Document Object Model (DOM) represents an XML document as a tree of nodes. Nodes may be of several types (a document type declaration, processing instructions, comments, elements, and attribute maps), but whatever the type, they are arranged in a strictly nested hierarchy. Typically, nodes have children attached to them; of course, some nodes are leaf nodes without children. The DOM allows you to perform a variety of actions on nodes: delete nodes, add nodes, find sibling nodes, find nodes by tag name, and other actions. The DOM itself does not specify anything about how an XML document is transformed (parsed) into a DOM representation, nor about how a DOM can be serialized to an XML document. In practice, however, all DOM libraries, including xml.dom, incorporate these capabilities. Formal specification of DOM can be found at:

<http://www.w3.org/DOM/>

and:

<http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/>
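The node actions just listed (delete, add, find siblings, find by tag name) map directly onto DOM methods. The following Python 3 sketch exercises each of them with xml.dom.minidom on a small fragment.

```python
from xml.dom import minidom

doc = minidom.parseString('<c data="a list"><d>item 1</d><d>item 2</d></c>')
c = doc.documentElement

# Find nodes by tag name, then delete the first match
first = c.getElementsByTagName('d')[0]
c.removeChild(first)

# Create a new element with a text child and append it
new = doc.createElement('d')
new.appendChild(doc.createTextNode('item 3'))
c.appendChild(new)

# Sibling navigation: the node before the new one holds "item 2"
print(new.previousSibling.firstChild.data)

# Serialize the modified element back to XML
print(c.toxml())
```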

The Simple API for XML (SAX) is an event-based API for XML documents. Unlike DOM, which envisions XML as a rooted tree of nodes, SAX sees XML as a sequence of events occurring linearly in a file, text, or other stream. SAX is a very minimal interface, both in the sense of telling you very little inherently about the structure of an XML document, and also in the sense of being extremely memory friendly. SAX itself is forgetful in the sense that once a tag or content is processed, it is no longer in memory (unless you manually save it in a data structure). However, SAX does maintain a basic stack of tags to assure well-formedness of parsed documents. The module xml.sax raises exceptions in case of problems in well-formedness; you may define your own custom error handlers for these. Formal specification of SAX can be found at:

<http://www.saxproject.org/>
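Because SAX forgets each event once it has been delivered, any structural knowledge you want has to be accumulated by your own handler. The Python 3 sketch below keeps a running tag count and nesting depth; DepthTracker is a hypothetical class name.

```python
import xml.sax

class DepthTracker(xml.sax.ContentHandler):
    """Count tags and remember the deepest nesting level seen."""
    def __init__(self):
        super().__init__()
        self.counts = {}
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        self.depth -= 1

tracker = DepthTracker()
xml.sax.parseString(b'<root><a/><c><d>1</d><d>2</d></c></root>', tracker)
print(tracker.counts, tracker.max_depth)
```

Only the small dictionary and two integers stay in memory, no matter how large the document is.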

xml.dom

The module xml.dom is a Python implementation of most of the W3C Document Object Model, Level 2. As much as possible, its API follows the DOM standard, but a few Python conveniences are added as well. A brief example of usage is below:

    >>> from xml.dom import minidom
    >>> dom = minidom.parse('address.xml')
    >>> addrs = dom.getElementsByTagName('address')
    >>> print addrs[1].toxml()
    <address city="New York" number="344" state="NY" street="118 St."/>
    >>> jobs = dom.getElementsByTagName('job-info')
    >>> for key, val in jobs[3].attributes.items():
    ...     print key,'=',val
    ...
    employee-type = Part-Time
    is-manager = no
    job-description = Hacker

SEE ALSO: gnosis.xml.objectify

xml.dom.minidom

The module xml.dom.minidom is a lightweight DOM implementation built on top of SAX. You may pass in a custom SAX parser object when you parse an XML document; by default, xml.dom.minidom uses the fast, nonvalidating xml.parsers.expat parser.

xml.dom.pulldom

The module xml.dom.pulldom is a DOM implementation that conserves memory by only building the portions of a DOM tree that are requested by calls to accessor methods. In some cases, this approach can be considerably faster than building an entire tree with xml.dom.minidom or another DOM parser; however, xml.dom.pulldom remains somewhat underdocumented and experimental at the time of this writing.
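As a rough illustration of building only the requested portions of a tree, the following Python 3 sketch streams through a document and expands just the d elements into DOM subtrees; everything else passes by as lightweight events.

```python
from xml.dom import pulldom

xml_text = ('<c data="a list"><d>item 1</d>'
            '<skip>ignored</skip><d>item 2</d></c>')

items = []
stream = pulldom.parseString(xml_text)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.tagName == 'd':
        stream.expandNode(node)     # build the subtree for this node only
        items.append(node.toxml())
print(items)
```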

xml.parsers.expat

Interface to the expat nonvalidating XML parser. Both the xml.sax and the xml.dom.minidom modules utilize the services of the fast expat parser, whose functionality lives mostly in a C library. You can use xml.parsers.expat directly if you wish, but since the interface uses the same general event-driven style as the standard xml.sax, there is usually no reason to.
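For comparison with xml.sax, direct use of expat looks roughly like the following Python 3 sketch: callbacks are plain functions assigned to attributes of the parser object rather than methods of a handler class. The function and variable names here are illustrative only.

```python
from xml.parsers import expat

events = []

def start(name, attrs):
    # attrs arrives as an ordinary dictionary of attribute names to values
    events.append(('start', name, attrs))

def end(name):
    events.append(('end', name))

parser = expat.ParserCreate()
parser.StartElementHandler = start
parser.EndElementHandler = end
parser.Parse('<root><b data="more data" /></root>', True)
print(events)
```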

xml.sax

The package xml.sax implements the Simple API for XML. By default, xml.sax relies on the underlying xml.parsers.expat parser, but any parser supporting a set of interface methods may be used instead. In particular, the validating parser xmlproc is included in the PyXML package.

When you create a SAX application, your main task is to create one or more callback handlers that will process events generated during SAX parsing. The most important handler is a ContentHandler, but you may also define a DTDHandler, EntityResolver, or ErrorHandler. Generally you will specialize the base handlers in xml.sax.handler for your own applications. After defining and registering desired handlers, you simply call the .parse() method of the parser that you registered handlers with. Or alternately, for incremental processing, you can use the feed() method.
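Incremental processing with feed() can be sketched as follows in Python 3. The chunk boundaries are deliberately awkward to show that the parser copes with input split in the middle of a tag; TagLogger is a hypothetical handler name.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class TagLogger(ContentHandler):
    """Record each start tag as it is seen."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)

logger = TagLogger()
parser = make_parser()
parser.setContentHandler(logger)

# Chunks may split the document anywhere, even in the middle of a tag
for chunk in ['<root><a>Some', ' data</a><b da', 'ta="more data"/></root>']:
    parser.feed(chunk)
parser.close()      # signals that no more data will arrive

print(logger.seen)
```

This style suits data that arrives over a socket or is too large to read at once.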

A simple example illustrates usage. The application below reads in an XML file and writes an equivalent, but not necessarily identical, document to STDOUT. The output can be used as a canonical form of the document:

xmlcat.py
    #!/usr/bin/env python
    import sys
    from xml.sax import handler, make_parser
    from xml.sax.saxutils import escape

    class ContentGenerator(handler.ContentHandler):
        def __init__(self, out=sys.stdout):
            handler.ContentHandler.__init__(self)
            self._out = out
        def startDocument(self):
            xml_decl = '<?xml version="1.0" encoding="iso-8859-1"?>\n'
            self._out.write(xml_decl)
        def endDocument(self):
            sys.stderr.write("Bye bye!\n")
        def startElement(self, name, attrs):
            self._out.write('<' + name)
            name_val = attrs.items()
            name_val.sort()                 # canonicalize attributes
            for (name, value) in name_val:
                self._out.write(' %s="%s"' % (name, escape(value)))
            self._out.write('>')
        def endElement(self, name):
            self._out.write('</%s>' % name)
        def characters(self, content):
            self._out.write(escape(content))
        def ignorableWhitespace(self, content):
            self._out.write(content)
        def processingInstruction(self, target, data):
            self._out.write('<?%s %s?>' % (target, data))

    if __name__=='__main__':
        parser = make_parser()
        parser.setContentHandler(ContentGenerator())
        parser.parse(sys.argv[1])

xml.sax.handler

The module xml.sax.handler defines classes ContentHandler, DTDHandler, EntityResolver, and ErrorHandler that are normally used as parent classes of custom SAX handlers.

xml.sax.saxutils

The module xml.sax.saxutils contains utility functions for working with SAX events. Several functions allow escaping and munging special characters.
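The escaping helpers are the part of xml.sax.saxutils used most often. A short Python 3 sketch of escape(), unescape(), and quoteattr():

```python
from xml.sax.saxutils import escape, unescape, quoteattr

text = 'if a < b & b > c'
escaped = escape(text)
print(escaped)

# unescape() reverses the substitution
print(unescape(escaped) == text)

# quoteattr() escapes a value and also chooses a safe quote character
print(quoteattr('He said "hi"'))
print('<note text=%s/>' % quoteattr("it's"))
```

Note that escape() alone does not protect quotation marks, so quoteattr() is the safer choice when writing attribute values.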

xml.sax.xmlreader

The module xml.sax.xmlreader provides a framework for creating new SAX parsers that will be usable by the xml.sax module. Any new parser that follows a set of API conventions can be plugged in to the xml.sax.make_parser() class factory.

xmllib

Deprecated module for XML parsing. Use xml.sax or other XML tools in Python 2.0+.

xmlrpclib
SimpleXMLRPCServer

XML-RPC is an XML-based protocol for remote procedure calls, usually layered over HTTP. For the most part, the XML aspect is hidden from view. You simply use the module xmlrpclib to call remote methods and the module SimpleXMLRPCServer to implement your own server that supports such method calls. For example:

    >>> import xmlrpclib
    >>> betty = xmlrpclib.Server("http://betty.userland.com")
    >>> print betty.examples.getStateName(41)
    South Dakota

The XML-RPC format itself is a bit verbose, even as XML goes. But it is simple and allows you to pass argument values to a remote method:

    >>> import xmlrpclib
    >>> print xmlrpclib.dumps((xmlrpclib.True,37,(11.2,'spam')))
    <params>
    <param>
    <value><boolean>1</boolean></value>
    </param>
    <param>
    <value><int>37</int></value>
    </param>
    <param>
    <value><array><data>
    <value><double>11.199999999999999</double></value>
    <value><string>spam</string></value>
    </data></array></value>
    </param>
    </params>
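In Python 3 the same functionality lives in the module xmlrpc.client. The sketch below round-trips the same argument values without contacting any server; the method name examples.echo is made up for illustration.

```python
import xmlrpc.client

params = (True, 37, (11.2, 'spam'))

# Serialize a method call to the XML-RPC wire format
payload = xmlrpc.client.dumps(params, methodname='examples.echo')

# Parse it back, as a server receiving the request would
decoded, method = xmlrpc.client.loads(payload)
print(method)
print(decoded)
```

The nested tuple comes back as a list because XML-RPC has only one array type.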

SEE ALSO: gnosis.xml.pickle

5.4.2 Third-Party XML-Related Tools

A number of projects extend the XML capabilities in the Python standard library. I am the principal author of several XML-related modules that are distributed with the gnosis package. Information on the current release can be found at:

<http://gnosis.cx/download/Gnosis_Utils.ANNOUNCE>

The package itself can be downloaded as a distutils package tarball from:

<http://gnosis.cx/download/Gnosis_Utils-current.tar.gz>

The Python XML-SIG (special interest group) produces a package of XML tools known as PyXML. The work of this group is incorporated into the Python standard library with new Python releases; not every PyXML tool, however, makes it into the standard library. At any given moment, the most sophisticated and often experimental capabilities can be found by downloading the latest PyXML package. Be aware that installing the latest PyXML overrides the default Python XML support and may break other tools or applications.

<http://pyxml.sourceforge.net/>

Fourthought, Inc. produces the 4Suite package, which contains a number of XML tools. Fourthought releases 4Suite as free software, and many of its capabilities are incorporated into the PyXML project (albeit at a varying time delay); however, Fourthought is a for-profit company that also offers customization and technical support for 4Suite. The community page for 4Suite is:

<http://4suite.org/index.xhtml>

The Fourthought company Web site is:

<http://fourthought.com/>

Two other modules are discussed briefly below. Neither of these is an XML tool per se. However, both PYX and yaml fill many of the same requirements as XML does, while being easier to manipulate with text processing techniques, easier to read, and easier to edit by hand. There is a contrast between these two formats, however. PYX is semantically identical to XML, merely using a different syntax. YAML, on the other hand, has quite different semantics from XML; I present it here because in many of the concrete applications where developers might instinctively turn to XML (which has a lot of "buzz"), YAML is a better choice.

The home page for PYX is:

<http://pyxie.sourceforge.net/>

I have written an article explaining PYX in more detail than in this book at:

<http://gnosis.cx/publish/programming/xml_matters_17.html>

The home page for YAML is:

<http://yaml.org>

I have written an article contrasting the utility and semantics of YAML and XML at:

<http://gnosis.cx/publish/programming/xml_matters_23.html>

gnosis.xml.indexer

The module gnosis.xml.indexer builds on the full-text indexing program presented as an example in Chapter 2 (and contained in the gnosis package as gnosis.indexer). Instead of file contents, gnosis.xml.indexer creates indices of (large) XML documents. This allows for a kind of "reverse XPath" search. That is, where a tool like 4xpath, in the 4Suite package, lets you see the contents of an XML node specified by XPath, gnosis.xml.indexer identifies the XPaths to the point where a word or words occur. This module may be used either in a larger application or as a command-line tool; for example:

    % indexer symmetric
    ./crypto1.xml::/section[2]/panel[8]/title
    ./crypto1.xml::/section[2]/panel[8]/body/text_column/code_listing
    ./crypto1.xml::/section[2]/panel[7]/title
    ./crypto2.xml::/section[4]/panel[6]/body/text_column/p[1]
    4 matched wordlist: ['symmetric']
    Processed in 0.100 seconds (SlicedZPickleIndexer)

    % indexer "-filter=*::/*/title" symmetric
    ./crypto1.xml::/section[2]/panel[8]/title
    ./crypto1.xml::/section[2]/panel[7]/title
    2 matched wordlist: ['symmetric']
    Processed in 0.080 seconds (SlicedZPickleIndexer)

Indexed searches, as the example shows, are very fast. I have written an article with more details on this module:

<http://gnosis.cx/publish/programming/xml_matters_10.html>
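The "reverse XPath" idea can be illustrated without any index at all. The following Python 3 sketch uses only the standard library and is not how gnosis.xml.indexer is implemented: it walks a document and reports an XPath-like location for every element whose own text contains a word. The function paths_containing and the rule for when to add a positional index are my assumptions.

```python
import xml.etree.ElementTree as ET

def paths_containing(xml_text, word):
    """Return XPath-like locations of elements whose own text has word."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, path):
        # Text directly inside this element, not inside its children
        own = (elem.text or '') + ''.join(c.tail or '' for c in elem)
        if word.lower() in own.lower().split():
            hits.append(path)
        # Count same-named siblings so repeated tags get a position
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += '[%d]' % seen[child.tag]
            walk(child, path + '/' + step)

    walk(root, '/' + root.tag)
    return hits

doc = ('<section><panel><title>Symmetric ciphers</title>'
       '<body><p>Intro</p><p>A symmetric key</p></body></panel>'
       '<panel><title>Hashes</title></panel></section>')
found = paths_containing(doc, 'symmetric')
print(found)
```

A real indexer does this walk once, stores the word-to-path mapping, and answers later queries from the stored index, which is where the speed comes from.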

gnosis.xml.objectify

The module gnosis.xml.objectify transforms arbitrary XML documents into Python objects that have a "native" feel to them. Where XML is used to encode a data structure, I believe that using gnosis.xml.objectify is the quickest and simplest way to utilize that data in a Python application.

The Document Object Model defines an OOP model for working with XML, across programming languages. But while DOM is nominally object-oriented, its access methods are distinctly un-Pythonic. For example, here is a typical "drill down" to a DOM value (skipping whitespace text nodes for some indices, which is far from obvious):

    >>> from xml.dom import minidom
    >>> dom_obj = minidom.parse('address.xml')
    >>> dom_obj.normalize()
    >>> print dom_obj.documentElement.childNodes[1].childNodes[3]\
    ...                              .attributes.get('city').value
    Los Angeles

In contrast, gnosis.xml.objectify feels like you are using Python:

    >>> from gnosis.xml.objectify import XML_Objectify
    >>> xml_obj = XML_Objectify('address.xml')
    >>> py_obj = xml_obj.make_instance()
    >>> py_obj.person[2].address.city
    u'Los Angeles'

gnosis.xml.pickle

The module gnosis.xml.pickle lets you serialize arbitrary Python objects to an XML format. In most respects, the purpose is the same as for the pickle module, but an XML target is useful for certain purposes. You may process the data in an xml_pickle using standard XML parsers, XSLT processors, XML editors, validation utilities, and other tools.

In several respects, gnosis.xml.pickle offers finer-grained control than the standard pickle module does. You can control security permissions accurately; you can customize the representation of object types within an XML file; you can substitute compatible classes during the pickle/unpickle cycle; and several other "guru-level" manipulations are possible. However, in basic usage, gnosis.xml.pickle is fully API compatible with pickle. An example illustrates both the usage and the format:

    >>> class Container: pass
    ...
    >>> inst = Container()
    >>> dct = {1.7:2.5, ('t','u','p'):'tuple'}
    >>> inst.this, inst.num, inst.dct = 'that', 38, dct
    >>> import gnosis.xml.pickle
    >>> print gnosis.xml.pickle.dumps(inst)
    <?xml version="1.0"?>
    <!DOCTYPE PyObject SYSTEM "PyObjects.dtd">
    <PyObject module="__main__"  >
    <attr name="this" type="string" value="that" />
    <attr name="dct" type="dict"  >
      <entry>
        <key type="tuple"  >
          <item type="string" value="t" />
          <item type="string" value="u" />
          <item type="string" value="p" />
        </key>
        <val type="string" value="tuple" />
      </entry>
      <entry>
        <key type="numeric" value="1.7" />
        <val type="numeric" value="2.5" />
      </entry>
    </attr>
    <attr name="num" type="numeric" value="38" />
    </PyObject>

SEE ALSO: pickle; cPickle; yaml; pprint

gnosis.xml.validity

The module gnosis.xml.validity allows you to define Python container classes that restrict their containment according to XML validity constraints. Such validity-enforcing classes always produce string representations that are valid XML documents, not merely well-formed ones. When you attempt to add an item to a gnosis.xml.validity container object that is not permissible, a descriptive exception is raised. Constraints, as with DTDs, may specify quantification, subelement types, and sequence.

For example, suppose you wish to create documents that conform with a "dissertation" Document Type Definition:

dissertation.dtd
    <!ELEMENT dissertation (dedication?, chapter+, appendix*)>
    <!ELEMENT dedication (#PCDATA)>
    <!ELEMENT chapter (title, paragraph+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT paragraph (#PCDATA | figure | table)+>
    <!ELEMENT figure EMPTY>
    <!ELEMENT table EMPTY>
    <!ELEMENT appendix (#PCDATA)>

You can use gnosis.xml.validity to ensure that your application produces only conformant XML documents. First, you create a Python version of the DTD:

dissertation.py
    from gnosis.xml.validity import *
    class appendix(PCDATA):   pass
    class table(EMPTY):       pass
    class figure(EMPTY):      pass
    class _mixedpara(Or):     _disjoins = (PCDATA, figure, table)
    class paragraph(Some):    _type = _mixedpara
    class title(PCDATA):      pass
    class _paras(Some):       _type = paragraph
    class chapter(Seq):       _order = (title, _paras)
    class dedication(PCDATA): pass
    class _apps(Any):         _type = appendix
    class _chaps(Some):       _type = chapter
    class _dedi(Maybe):       _type = dedication
    class dissertation(Seq):  _order = (_dedi, _chaps, _apps)

Next, import your Python validity constraints, and use them in an application:

    >>> from dissertation import *
    >>> chap1 = LiftSeq(chapter,('About Validity','It is a good thing'))
    >>> paras_ch1 = chap1[1]
    >>> paras_ch1 += [paragraph('OOP can enforce it')]
    >>> print chap1
    <chapter><title>About Validity</title>
    <paragraph>It is a good thing</paragraph>
    <paragraph>OOP can enforce it</paragraph>
    </chapter>

If you attempt an action that violates constraints, you get a relevant exception; for example:

    >>> try:
    ...     paras_ch1.append(dedication("To my advisor"))
    ... except ValidityError, x:
    ...     print x
    ...
    Items in _paras must be of type <class 'dissertation.paragraph'>
    (not <class 'dissertation.dedication'>)

PyXML

The PyXML package contains a number of capabilities in advance of those in the Python standard library. PyXML was at version 0.8.1 at the time this was written, and as the number indicates, it remains an in-progress/beta project. Moreover, as of this writing, the last released version of Python was 2.2.2, with 2.3 in preliminary stages. When you read this, PyXML will probably be at a later number and have new features, and some of the current features will have been incorporated into the standard library. Exactly what is where is a moving target.

Some of the significant features currently available in PyXML but not in the standard library are listed below. You may install PyXML on any Python 2.0+ installation, and it will override the existing XML support.

  • A validating XML parser written in Python called xmlproc. Being a pure Python program rather than a C extension, xmlproc is slower than xml.sax (which uses the underlying expat parser).

  • A SAX extension called xml.sax.writers that will reserialize SAX events to either XML or other formats.

  • A fully compliant DOM Level 2 implementation called 4DOM, borrowed from 4Suite.

  • Support for canonicalization. That is, two XML documents can be semantically identical even though they are not byte-wise identical. You have freedom in choice of quotes, attribute orders, character entities, and some spacing that change nothing about the meaning of the document. Two canonicalized XML documents are semantically identical if and only if they are byte-wise identical.

  • XPath and XSLT support, with implementations written in pure Python. There are faster XSLT implementations around, however, that call C extensions.

  • A DOM implementation, called xml.dom.pulldom, that supports lazy instantiation of nodes has been incorporated into recent versions of the standard library. For older Python versions, this is available in PyXML.

  • A module with several options for serializing Python objects to XML. This capability is comparable to gnosis.xml.pickle, but I like the tool I created better in several ways.
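The canonicalization point in the list above can be tried out with current tools. This Python 3 sketch assumes Python 3.8 or later, where the standard library module xml.etree.ElementTree provides a canonicalize() function; it is not the PyXML implementation discussed here. The two sample documents are my own.

```python
import xml.etree.ElementTree as ET

# Same meaning, different bytes: quote style, attribute order,
# extra spacing inside the tag, and a character reference
doc1 = ("<item  partNum='123-XQ' flavor=\"pork\">"
        "<note>fish &#38; chips</note></item>")
doc2 = ('<item flavor="pork" partNum="123-XQ">'
        '<note>fish &amp; chips</note></item>')

canon1 = ET.canonicalize(doc1)
canon2 = ET.canonicalize(doc2)

print(doc1 == doc2)       # the raw texts differ
print(canon1 == canon2)   # the canonical forms do not
```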

PYX

PYX is both a document format and a Python module to support working with that format. As well as the Python module, tools written in C are available to transform documents between XML and PYX format.

The idea behind PYX is to eliminate the need for complex parsing tools like xml.sax. Each node in an XML document is represented in the PYX format on a separate line, using a prefix character to indicate the node type. Most of XML semantics is preserved, with the exception of document type declarations, comments, and namespaces. These features could be incorporated into an updated PYX format, in principle.

Documents in the PYX format are easily processed using traditional line-oriented text processing tools like sed, grep, awk, sort, wc, and the like. Python applications that use a basic FILE.readline() loop are equally able to process PYX nodes, one per line. This makes it much easier to use familiar text processing programming styles with PYX than it is with XML. A brief example illustrates the PYX format:

    % cat test.xml
    <?xml version="1.0"?>
    <?xml-stylesheet href="test.css" type="text/css"?>
    <Spam flavor="pork">
      <Eggs>Some text about eggs.</Eggs>
      <MoreSpam>Ode to Spam (spam="smoked-pork")</MoreSpam>
    </Spam>

    % ./xmln test.xml
    ?xml-stylesheet href="test.css" type="text/css"
    (Spam
    Aflavor pork
    -\n
    (Eggs
    -Some text about eggs.
    )Eggs
    -\n
    (MoreSpam
    -Ode to Spam (spam="smoked-pork")
    )MoreSpam
    -\n
    )Spam
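A line-oriented reader for PYX output needs little more than a test on the first character of each line. The Python 3 sketch below is a hypothetical reader, not the PYX module itself; it assumes the prefix characters seen in the listing above and collects attributes and non-whitespace text along with the path of open tags.

```python
pyx = r'''(Spam
Aflavor pork
-\n
(Eggs
-Some text about eggs.
)Eggs
-\n
(MoreSpam
-Ode to Spam (spam="smoked-pork")
)MoreSpam
-\n
)Spam'''

def read_pyx(lines):
    """Return (attributes, texts) found in an iterable of PYX lines."""
    stack, attrs, texts = [], [], []
    for line in lines:
        kind, rest = line[:1], line[1:].rstrip('\n')
        if kind == '(':                      # start tag
            stack.append(rest)
        elif kind == ')':                    # end tag
            stack.pop()
        elif kind == 'A':                    # attribute of the open tag
            name, value = rest.split(' ', 1)
            attrs.append(('/'.join(stack), name, value))
        elif kind == '-':                    # character data
            text = rest.replace('\\n', '\n') # PYX writes newlines as \n
            if text.strip():
                texts.append(('/'.join(stack), text))
    return attrs, texts

attrs, texts = read_pyx(pyx.splitlines())
print(attrs)
print(texts)
```

The same function works unchanged on an open file object, since it only iterates over lines.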

4Suite

The tools in 4Suite focus on the use of XML documents for knowledge management. The server element of the 4Suite software is useful for working with catalogs of XML documents, searching them, transforming them, and so on. The base 4Suite tools address a variety of XML technologies. In some cases 4Suite implements standards and technologies not found in the Python standard library or in PyXML, while in other cases 4Suite provides more advanced implementations.

Among the XML technologies implemented in 4Suite are DOM, RDF, XSLT, XInclude, XPointer, XLink and XPath, and SOAP. Among these, of particular note is 4xslt for performing XSLT transformations. 4xpath lets you find XML nodes using concise and powerful XPath descriptions of how to reach them. 4rdf deals with "meta-data" that documents use to identify their semantic characteristics.

I discuss 4Suite technologies in a bit more detail in an article at:

<http://gnosis.cx/publish/programming/xml_matters_15.html>

yaml

The native data structures of object-oriented programming languages are not straightforward to represent in XML. While XML is in principle powerful enough to represent any compound data, the only inherent mapping in XML is within attributes, but that only maps strings to strings. Moreover, even when a suitable XML format is found for a given data structure, the XML is quite verbose and difficult to scan visually, or especially to edit manually.

The YAML format is designed to match the structure of datatypes prevalent in scripting languages: Python, Perl, Ruby, and Java all have support libraries at the time of this writing. Moreover, the YAML format is extremely concise and unobtrusive; in fact, the acronym cutely stands for "YAML Ain't Markup Language." In many ways, YAML can act as a better pretty-printer than pprint, while simultaneously working as a format that can be used for configuration files or to exchange data between different programming languages.

There is no fully general and clean way, however, to convert between YAML and XML. You can use the yaml module to read YAML data files, then use the gnosis.xml.pickle module to read and write to one particular XML format. But when XML data starts out in other XML dialects than gnosis.xml.pickle, there are ambiguities about the best Python native and YAML representations of the same data. On the plus side (and this can be a very big plus), there is essentially a straightforward, one-to-one correspondence between Python data structures and YAML representations.

In the YAML example below, refer back to the same Python instance serialized using gnosis.xml.pickle and pprint in their respective discussions. As with gnosis.xml.pickle (but in this case unlike pprint), the serialization can be read back in to re-create an identical object (or to create a different object after editing the text, by hand or by application).

    >>> class Container: pass
    ...
    >>> inst = Container()
    >>> dct = {1.7:2.5, ('t','u','p'):'tuple'}
    >>> inst.this, inst.num, inst.dct = 'that', 38, dct
    >>> import yaml
    >>> print yaml.dump(inst)
    --- !!__main__.Container
    dct:
        1.7: 2.5
        ?
            - t
            - u
            - p
        : tuple
    num: 38
    this: that

SEE ALSO: pprint; gnosis.xml.pickle



Text Processing in Python
ISBN: 0321112547
EAN: 2147483647
Year: 2005
Pages: 59
Authors: David Mertz
