24.3. Parsing XML with DOM

SAX parsing does not build any structure in memory to represent the XML document. This makes SAX fast and highly scalable, as your application builds exactly as little or as much in-memory structure as needed for its specific tasks. However, for particularly complicated processing tasks involving reasonably small XML documents, you may prefer to let the library build in-memory structures that represent the whole XML document, and then traverse those structures. The XML standards describe the DOM (Document Object Model) for XML. A DOM object represents an XML document as a tree whose root is the document object, while other nodes correspond to elements, text contents, element attributes, and so on. The ElementTree module mentioned in the introduction of this chapter provides a different, more Pythonic (and faster) approach to build an in-memory representation of an XML document, while DOM mimics existing W3C standards (mostly developed with other languages, such as Java, in mind).

The Python standard library supplies a minimal implementation of the XML DOM standard: xml.dom.minidom. minidom builds everything up in memory, with the typical pros and cons of the DOM approach to parsing. The Python standard library also supplies a different DOM-like approach in module xml.dom.pulldom. pulldom occupies an interesting middle ground between SAX and DOM, presenting the stream of parsing events as a Python iterator object so that you do not code callbacks, but rather loop over the events and examine each event to see if it's of interest. When you do find an event of interest to your application, you ask pulldom to build the DOM subtree rooted at that event's node by calling method expandNode, and then work with that subtree as you would in minidom. Paul Prescod, pulldom's author and an XML and Python expert, describes the net result as "80 percent of the performance of SAX, 80 percent of the convenience of DOM." Other DOM parsers are part of the PyXML and 4Suite extension packages, mentioned at the start of this chapter.

24.3.1. The xml.dom Package

The xml.dom package supplies exception class DOMException and subclasses of it to support fine-grained exception handling. xml.dom also supplies a class Node, typically used as a base class for all nodes by DOM implementations. Class Node itself supplies only constant attributes that give the codes for node types, such as ELEMENT_NODE for elements, ATTRIBUTE_NODE for attributes, and so on. xml.dom also supplies constant module attributes with the URIs of some important namespaces: XML_NAMESPACE, XMLNS_NAMESPACE, XHTML_NAMESPACE, and EMPTY_NAMESPACE.
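As a small illustration of the names just listed, the following sketch prints a few node-type codes and namespace URIs and catches a DOM exception through its base class. It is written for Python 3 (print as a function), which is an assumption on my part; the variable caught_code is introduced purely for the example.

```python
import xml.dom
from xml.dom import Node

# Node-type codes are plain integer class attributes of Node.
print(Node.ELEMENT_NODE, Node.ATTRIBUTE_NODE, Node.TEXT_NODE)

# Namespace URIs are module-level constants of xml.dom.
print(xml.dom.XML_NAMESPACE)
print(xml.dom.XMLNS_NAMESPACE)
print(xml.dom.XHTML_NAMESPACE)
print(xml.dom.EMPTY_NAMESPACE)

# Subclasses of DOMException can be caught individually or,
# as here, through the common base class.
caught_code = None
try:
    raise xml.dom.NotFoundErr("no such node")
except xml.dom.DOMException as exc:
    caught_code = exc.code
print(caught_code == xml.dom.NOT_FOUND_ERR)
```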

24.3.2. The xml.dom.minidom Module

The xml.dom.minidom module supplies two functions.

parse

parse(file,parser=None)

file is a filename string or a file-like object open for reading, and contains an XML document. parser, if given, is an instance of a SAX parser class; otherwise, parse generates a default SAX parser by calling xml.sax.make_parser( ). parse returns a minidom document object instance that represents the given XML document.

parseString

parseString(string,parser=None)

Like parse, except that string is the XML document in string form.
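The two functions can be compared side by side in a short sketch. It assumes Python 3 and uses io.BytesIO as a stand-in for a real file; the sample document and the variable names are made up for the example.

```python
import io
import xml.dom.minidom

XML_TEXT = "<greeting lang='en'>Hello, <b>world</b>!</greeting>"

# parseString takes the XML document itself, in string form.
doc_from_string = xml.dom.minidom.parseString(XML_TEXT)

# parse takes a filename or a file-like object open for reading;
# an in-memory bytes buffer plays the role of the file here.
buffer = io.BytesIO(XML_TEXT.encode("utf-8"))
doc_from_file = xml.dom.minidom.parse(buffer)

# Both calls return a document object for the same XML.
for doc in (doc_from_string, doc_from_file):
    print(doc.documentElement.tagName,
          doc.documentElement.getAttribute("lang"))
```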


xml.dom.minidom also supplies many classes as specified by the XML DOM standard. Almost all of these classes subclass Node. Class Node supplies the methods and attributes that all kinds of nodes have in common. A notable class of module xml.dom.minidom that is not a subclass of Node is AttributeList, identified in the DOM standard as NamedNodeMap, which is a mapping that collects the attributes of a single node of class Element.

For methods and attributes related to changing and creating XML documents, see "Changing and Generating XML." Here, I present the classes, methods, and attributes that you use most often to traverse a DOM tree, usually after the tree has been built by parsing an XML document. For concreteness and simplicity, I mention Python classes. However, the DOM specifications deal with abstract interfaces, never with concrete classes. Your code must never deal with the class objects directly, only with instances of those classes. Do not type-test nodes (for example, don't use isinstance on them) and do not instantiate node classes directly (rather, use the factory methods covered in "Factory Methods of a Document Object"). This is good Python practice in general, but it's particularly important here.

24.3.2.1. Node objects

Each node n in the DOM tree is an instance of some subclass of Node; thus, n supplies all attributes and methods of Node, with appropriate overriding implementations if needed. The most frequently used methods and attributes are as follows.

attributes

The n.attributes attribute is either None or an AttributeList instance with all attributes of n.

childNodes

The n.childNodes attribute is a list of all nodes that are children of n, possibly an empty list.

firstChild

The n.firstChild attribute is None when n.childNodes is empty; otherwise, n.childNodes[0].

hasChildNodes

n.hasChildNodes( )

Like len(n.childNodes)!=0, but possibly faster.

isSameNode

n.isSameNode(other)

Returns True when n and other refer to the same DOM node; otherwise, False. Do not use the normal Python idiom n is other: a DOM implementation is free to generate multiple Node instances that refer to the same DOM node. Therefore, to check the identity of DOM node references, always and exclusively use method isSameNode.

lastChild

The n.lastChild attribute is None when n.childNodes is empty; otherwise, n.childNodes[-1].

localName

The n.localName attribute is the local part of n's qualified name (relevant when namespaces are involved).

namespaceURI

The n.namespaceURI attribute is None when n's qualified name has no namespace part; otherwise, the namespace's URI.

nextSibling

The n.nextSibling attribute is None when n is the last child of n's parent; otherwise, the next child of n's parent.

nodeName

The n.nodeName attribute is n's name string. The string is a node-specific name when that makes sense for n's node type (e.g., the tag name when n is an Element); otherwise, a string starting with '#'.

nodeType

The n.nodeType attribute is n's type code, an integer that is one of the constant attributes of class Node.

nodeValue

The n.nodeValue attribute is None when n has no value (e.g., when n is an Element); otherwise, n's value (e.g., the text content when n is an instance of Text).

normalize

n.normalize( )

Normalizes the entire subtree rooted at n, merging adjacent Text nodes. Parsing may separate ranges of text in the XML document into arbitrary chunks; normalize ensures that text ranges remain separate only when there is markup between them.

ownerDocument

The n.ownerDocument attribute is the Document instance that contains n.

parentNode

The n.parentNode attribute is n's parent node in the DOM tree, or None for attribute nodes and nodes not in the tree.

prefix

The n.prefix attribute is None when n's qualified name has no namespace prefix; otherwise, the namespace prefix. Note that a name may have a namespace even if it has no namespace prefix.

previousSibling

The n.previousSibling attribute is None when n is the first child of n's parent; otherwise, the previous child of n's parent.
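The following sketch exercises several of the Node attributes and methods above on a tiny document: it walks siblings, inspects nodeType, nodeName, and nodeValue, and shows the effect of normalize. It assumes Python 3; the helper describe and the sample document are hypothetical, and the extra Text node is created with the document's createTextNode factory method only to produce adjacent Text nodes.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<list><item>one</item><!-- note --><item>two</item></list>")
root = doc.documentElement

def describe(n):
    """Return a (nodeType, nodeName, nodeValue) summary of node n."""
    return n.nodeType, n.nodeName, n.nodeValue

# Walk the children of the root by following nextSibling links.
summaries = []
child = root.firstChild
while child is not None:
    summaries.append(describe(child))
    child = child.nextSibling
for summary in summaries:
    print(summary)

# Check node identity with isSameNode, not with the is operator.
same = root.firstChild.parentNode.isSameNode(root)
print(same)

# Appending a second Text node leaves two adjacent Text children;
# normalize merges them back into one.
first = root.firstChild
first.appendChild(doc.createTextNode(" more"))
before = len(first.childNodes)
first.normalize()
after = len(first.childNodes)
print(before, after, first.firstChild.nodeValue)
```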


24.3.2.2. Attr objects

The Attr class is a subclass of Node that represents an attribute of an Element. Besides attributes and methods of class Node, an instance a of Attr supplies the following attributes.

ownerElement

The a.ownerElement attribute is the Element instance of which a is an attribute.

specified

The a.specified attribute is true if a was explicitly specified in the document, and false if obtained by default.
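A brief sketch of working with Attr instances follows, reaching them both through getAttributeNode and through the attributes mapping of the element. It assumes Python 3, and the sample element is invented for the illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<a href="http://example.com/" id="x1">link</a>')
elem = doc.documentElement

# An Attr instance carries the attribute's name and value.
attr = elem.getAttributeNode("href")
print(attr.name, attr.value)

# ownerElement leads back to the element that holds the attribute.
owner_ok = attr.ownerElement.isSameNode(elem)
print(owner_ok)

# The attributes mapping collects all Attr instances of the element.
names = sorted(elem.attributes.keys())
id_value = elem.attributes["id"].value
print(names, id_value)
```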


24.3.2.3. Document objects

The Document class is a subclass of Node whose instances are returned by the parse and parseString functions of module xml.dom.minidom. All nodes in the document refer to the same Document node as their ownerDocument attribute. To check this, however, you must exclusively use the isSameNode method, not Python identity checking (operator is). Besides the attributes and methods of class Node, an instance d of Document supplies the following attributes and methods.

doctype

The d.doctype attribute is the DocumentType instance that corresponds to d's DTD. This attribute comes directly from the !DOCTYPE declaration in d's XML source.

documentElement

The d.documentElement attribute is the Element instance that is d's root element.

getElementById

d.getElementById(elementId)

Returns the Element instance within the document that has the given ID (which element attributes are IDs is specified by the DTD), or None if there is no such instance (or the underlying parser does not supply ID information).

getElementsByTagName

d.getElementsByTagName(tagName)

Returns the list of Element instances in the document whose tag equals string tagName, in the same order as in the XML document. May be the empty list. When tagName is '*', returns the list of all Element instances in the document, with any tag.

getElementsByTagNameNS

d.getElementsByTagNameNS(namespaceURI,localName)

Returns the list of Element instances in the document with the given namespaceURI and localName, in the same order as in the XML document. May be the empty list. A value of '*' for namespaceURI, localName, or both matches all values of the field.
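To make the Document attributes and methods concrete, here is a minimal sketch on a document that mixes namespaced and plain elements. It assumes Python 3; the namespace URI http://example.com/ns and the element names are invented for the example.

```python
import xml.dom.minidom

XML_TEXT = """<doc xmlns:x="http://example.com/ns">
  <title>DOM</title>
  <x:item>a</x:item>
  <item>b</item>
  <x:item>c</x:item>
</doc>"""

d = xml.dom.minidom.parseString(XML_TEXT)

# This source has no !DOCTYPE declaration, so doctype is None.
print(d.doctype)

# documentElement is the root element of the document.
root_tag = d.documentElement.tagName
print(root_tag)

# '*' matches every element in the document, in document order.
all_tags = [e.tagName for e in d.getElementsByTagName("*")]
print(all_tags)

# Matching by tag name looks at the qualified name only...
plain_items = d.getElementsByTagName("item")
print([e.firstChild.data for e in plain_items])

# ...while the NS variant matches on namespace URI plus local name.
ns_items = d.getElementsByTagNameNS("http://example.com/ns", "item")
print([e.firstChild.data for e in ns_items])
```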


24.3.2.4. Element objects

Element is a subclass of Node that represents tagged elements. Besides attributes and methods of Node, an instance e of Element supplies the following methods.

getAttribute

e.getAttribute(name)

Returns the value of e's attribute with the given name. Returns the empty string '' if e has no attribute with the given name.

getAttributeNS

e.getAttributeNS(namespaceURI,localName)

Returns the value of e's attribute with the given namespaceURI and localName.

getAttributeNode

e.getAttributeNode(name)

Returns the Attr instance that is e's attribute with the given name, or None if e has no attribute with the given name.

getAttributeNodeNS

e.getAttributeNodeNS(namespaceURI,localName)

Returns the Attr instance that is e's attribute with the given namespaceURI and localName, or None if e has no attribute with the given namespace and name.

getElementsByTagName

e.getElementsByTagName(tagName)

Returns the list of Element instances among the descendants of e whose tag equals string tagName, in the same order as in the XML document. e itself is never included in the list, even if e's tag equals tagName. getElementsByTagName returns the empty list when no descendant of e has a tag equal to tagName. When tagName is '*', getElementsByTagName returns the list of all Element instances that are descendants of e.

getElementsByTagNameNS

e.getElementsByTagNameNS(namespaceURI,localName)

Returns the list of Element instances among the descendants of e with the given namespaceURI and localName, in the same order as in the XML document. A value of '*' for namespaceURI, localName, or both matches all values of the corresponding field. The list never includes e and may be empty, just as for method getElementsByTagName.

hasAttribute

e.hasAttribute(name)

True if and only if e has an attribute with the given name. If the underlying parser extracts the relevant information from the DTD, hasAttribute is also true for attributes of e that have a default value, even when they are not explicitly specified.

hasAttributeNS

e.hasAttributeNS(namespaceURI,localName)

True if and only if e has an attribute with the given namespaceURI and localName. Same as method hasAttribute for attributes with default values in the DTD.
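The Element methods for reading attributes can be tried out with a sketch like the following, which assumes Python 3 and an invented sample document. It highlights the difference between getAttribute, which returns '' for a missing attribute, and getAttributeNode, which returns None.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<ul class="nav">'
    '<li><a href="/home">Home</a></li>'
    '<li><a>No link</a></li>'
    '</ul>')
e = doc.documentElement

# getAttribute returns the value, or '' when the attribute is missing.
cls = e.getAttribute("class")
missing = e.getAttribute("id")
print(repr(cls), repr(missing))

# hasAttribute tells a missing attribute apart from an empty one,
# and getAttributeNode returns None rather than ''.
print(e.hasAttribute("id"), e.getAttributeNode("id"))

# Collect href values only from the 'a' elements that have one.
hrefs = [a.getAttribute("href")
         for a in e.getElementsByTagName("a")
         if a.hasAttribute("href")]
print(hrefs)
```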


24.3.3. Parsing XHTML with xml.dom.minidom

The following example uses xml.dom.minidom to perform the same task as in the previous example for xml.sax, fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks:

    import xml.dom.minidom, urllib, urlparse

    f = urllib.urlopen('http://www.w3.org/MarkUp/')
    doc = xml.dom.minidom.parse(f)
    links = doc.getElementsByTagName('a')
    seen = set()
    for a in links:
        value = a.getAttribute('href')
        if value and value not in seen:
            seen.add(value)
            pieces = urlparse.urlparse(value)
            if pieces[0] == 'http' and pieces[1] != 'www.w3.org':
                print urlparse.urlunparse(pieces)

In this example, we get the list of all elements with tag 'a', and the relevant attribute, if any, for each of them. We then work in the usual way with the attribute's value.
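The same filtering logic can also be tried without network access. The sketch below is an adaptation, not the example above: it assumes Python 3 (where the urlparse function lives in module urllib.parse), parses an inline sample document with parseString, and wraps the loop in a hypothetical helper named external_links.

```python
import xml.dom.minidom
from urllib.parse import urlparse

SAMPLE = """<html><body>
<a href="http://www.w3.org/TR/">same host</a>
<a href="http://example.com/a">first</a>
<a href="http://example.com/a">duplicate</a>
<a name="anchor">no href</a>
<a href="mailto:someone@example.com">mail</a>
</body></html>"""

def external_links(xml_text, own_host="www.w3.org"):
    """Return distinct http links that point away from own_host."""
    doc = xml.dom.minidom.parseString(xml_text)
    seen = set()
    result = []
    for a in doc.getElementsByTagName("a"):
        value = a.getAttribute("href")   # '' when there is no href
        if value and value not in seen:
            seen.add(value)
            pieces = urlparse(value)
            if pieces.scheme == "http" and pieces.netloc != own_host:
                result.append(pieces.geturl())
    return result

links = external_links(SAMPLE)
print(links)
```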

24.3.4. The xml.dom.pulldom Module

The xml.dom.pulldom module supplies two functions.

parse

parse(file,parser=None)

file is a filename or a file-like object open for reading, and contains an XML document. parser, if given, is an instance of a SAX parser class; otherwise, parse generates a default SAX parser by calling xml.sax.make_parser( ). parse returns a pulldom event stream instance that represents the given XML document.

parseString

parseString(string,parser=None)

Like parse, except that string is the XML document in string form.

xml.dom.pulldom also supplies class DOMEventStream, an iterator whose items are pairs (event,node), where event is a string that gives the event type, and node is an instance of an appropriate subclass of class Node. The possible values for event are constant uppercase strings that are also available as constant attributes of module xml.dom.pulldom with the same names: CHARACTERS, COMMENT, END_DOCUMENT, END_ELEMENT, IGNORABLE_WHITESPACE, PROCESSING_INSTRUCTION, START_DOCUMENT, and START_ELEMENT.

An instance d of class DOMEventStream supplies one other important method.

expandNode

d.expandNode(node)

node must be the latest instance of Node so far returned by iterating on d, i.e., the instance of Node returned by the latest call to d.next( ). expandNode processes the part of the XML document stream that corresponds to the subtree rooted at node so that you can then access the subtree with the usual minidom approach. d iterates on itself for the purpose so that after calling expandNode, the next call to d.next( ) continues right after the subtree thus expanded.
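A minimal sketch of the expandNode pattern follows. It assumes Python 3 and an invented inventory document: the loop skips events until it sees the start of an item element, expands that node, and then queries the resulting subtree with the usual minidom methods.

```python
import xml.dom.pulldom

XML_TEXT = ("<inventory>"
            "<item sku='1'><name>bolt</name><qty>40</qty></item>"
            "<item sku='2'><name>nut</name><qty>0</qty></item>"
            "</inventory>")

events = xml.dom.pulldom.parseString(XML_TEXT)
in_stock = []
for event, node in events:
    if event == xml.dom.pulldom.START_ELEMENT and node.tagName == "item":
        # Build the subtree rooted at this item; iteration then
        # resumes right after the item's end tag.
        events.expandNode(node)
        name = node.getElementsByTagName("name")[0].firstChild.data
        qty = int(node.getElementsByTagName("qty")[0].firstChild.data)
        if qty > 0:
            in_stock.append((node.getAttribute("sku"), name))

print(in_stock)
```

Because expandNode consumes the events of the subtree, the name and qty elements never show up as separate START_ELEMENT events in the loop.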


24.3.5. Parsing XHTML with xml.dom.pulldom

The following example uses xml.dom.pulldom to perform the same task as our previous examples, fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks:

    import xml.dom.pulldom, urllib, urlparse

    f = urllib.urlopen('http://www.w3.org/MarkUp/')
    doc = xml.dom.pulldom.parse(f)
    seen = set()
    for event, node in doc:
        if event == 'START_ELEMENT' and node.nodeName == 'a':
            doc.expandNode(node)
            value = node.getAttribute('href')
            if value and value not in seen:
                seen.add(value)
                pieces = urlparse.urlparse(value)
                if pieces[0] == 'http' and pieces[1] != 'www.w3.org':
                    print urlparse.urlunparse(pieces)

In this example, we select only elements with tag 'a'. For each of them, we request full expansion, and then proceed just like in the minidom example (i.e., we get the relevant attribute, if any, then work in the usual way with the attribute's value). The expansion is in fact not necessary in this specific case, since we do not need to work with the subtree rooted in each element with tag 'a', just with the attributes, and attributes can be accessed without calling expandNode. Therefore, this example works just as well if you remove the call to doc.expandNode. However, I put the expandNode call in the example to show how this crucial method of pulldom is normally used in context.
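The point about attributes being available without expansion can be checked offline with a short sketch. It assumes Python 3 and an invented inline document, and it deliberately omits the call to expandNode.

```python
import xml.dom.pulldom

SAMPLE = ('<p><a href="http://example.com/a">first <b>bold</b></a>'
          '<a name="x">no href</a></p>')

hrefs = []
for event, node in xml.dom.pulldom.parseString(SAMPLE):
    if event == xml.dom.pulldom.START_ELEMENT and node.nodeName == "a":
        # No expandNode here: the attributes are already on the node.
        value = node.getAttribute("href")
        if value:
            hrefs.append(value)

print(hrefs)
```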




Python in a Nutshell, Second Edition (In a Nutshell)
ISBN: 0596100469
Year: 2004
Pages: 192
Authors: Alex Martelli
