Recipe 12.1. Checking XML Well-Formedness

Credit: Paul Prescod, Farhad Fouladi

Problem

You need to check whether an XML document is well formed (not whether it conforms to a given DTD or schema), and you need to do this check quickly.

Solution

SAX (presumably using a fast parser such as Expat underneath) offers a fast, simple way to perform this task. Here is a script to check well-formedness on every file you mention on the script's command line:

    from xml.sax.handler import ContentHandler
    from xml.sax import make_parser
    from glob import glob
    import sys

    def parsefile(filename):
        # The do-nothing base ContentHandler is enough: we only want parse errors.
        parser = make_parser()
        parser.setContentHandler(ContentHandler())
        parser.parse(filename)

    for arg in sys.argv[1:]:
        # Expand wildcards ourselves, for shells that do not do it for us.
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is NOT well-formed! %s" % (filename, e))

Discussion

A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, its XML declaration (if it has one) is correct, it has a single root element, all tags are properly nested, attribute values are quoted, and so on.
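As a rough illustration of these rules, the following sketch feeds a few tiny in-memory documents to the same SAX machinery the recipe uses. It assumes Python 3, and the helper name is_well_formed and the sample documents are made up for this example; they are not part of the recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def is_well_formed(data):
    """Return True if the bytes in `data` parse as well-formed XML."""
    try:
        # The default error handler raises SAXParseException on errors.
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

# Hypothetical sample documents, one per rule mentioned above.
samples = [
    ("single root, no declaration", b"<a><b/></a>"),
    ("with XML declaration", b'<?xml version="1.0"?><a x="1"/>'),
    ("improperly nested tags", b"<a><b></a></b>"),
    ("unquoted attribute value", b"<a x=1/>"),
    ("two root elements", b"<a/><b/>"),
]

for label, data in samples:
    print("%-30s %s" % (label, is_well_formed(data)))
```

The first two samples should be reported as well formed and the last three as not well formed.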

This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document's contents. But in this case, we only want to know whether the document meets the most fundamental syntax constraints of XML; therefore, we need not do any processing, and the do-nothing handler suffices.
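To make the contrast concrete, here is a small sketch (assuming Python 3) that parses the same document twice: once with a handler that actually processes content, and once with the do-nothing base class. The ElementCounter class and the sample document are hypothetical and serve only as an illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag.
        self.count += 1

doc = b"<root><item/><item/></root>"

# A handler that processes the document's contents.
counter = ElementCounter()
xml.sax.parseString(doc, counter)
print(counter.count)  # 3: root plus two item elements

# The base ContentHandler ignores every event; the parse still
# raises an exception if the document is not well formed.
xml.sax.parseString(doc, ContentHandler())
```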

The parsefile function parses the whole document and raises an exception if the parser finds an error. The recipe's main code catches any such exception and prints it out like this:

    $ python wellformed.py test.xml
    test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that the parser detected a mismatched tag at line 1,002, column 2 of test.xml.
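If you would rather have the location as numbers than as part of a message string, the exception object carries it. The sketch below assumes Python 3 and a small, made-up in-memory document; getLineNumber, getColumnNumber, and getMessage are methods of xml.sax.SAXParseException.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical broken document: <item> is closed by </oops> on line 3.
bad = b"<root>\n  <item>\n  </oops>\n</root>"

location = None
message = None
try:
    xml.sax.parseString(bad, ContentHandler())
except xml.sax.SAXParseException as err:
    # Line and column of the point where the parser gave up.
    location = (err.getLineNumber(), err.getColumnNumber())
    message = err.getMessage()

print("error at line %s, column %s: %s" % (location[0], location[1], message))
```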

This recipe does not check adherence to a DTD or schema, which is a separate procedure called validation. The performance of the script should be quite good, precisely because it focuses on performing a minimal irreducible core task. However, sometimes you need to squeeze out the last drop of performance because you're checking the well-formedness of truly huge files. If you know for sure that you do have Expat, specifically, installed on your system, you may alternatively choose to use Expat directly instead of SAX. To try this approach, you can change function parsefile to the following code:

    import xml.parsers.expat

    def parsefile(filename):
        parser = xml.parsers.expat.ParserCreate()
        # Binary mode lets Expat work out the document's encoding itself.
        with open(filename, "rb") as f:
            parser.ParseFile(f)
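When you call Expat directly, errors arrive as xml.parsers.expat.ExpatError rather than as SAX exceptions. The following sketch (assuming Python 3; the function name check_bytes is invented for this example) shows one way to pull the line, column offset, and error text out of such an exception.

```python
import xml.parsers.expat

def check_bytes(data):
    """Return None if `data` is well formed, else (line, offset, reason)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return (err.lineno, err.offset, reason)
    return None

print(check_bytes(b"<a><b/></a>"))
print(check_bytes(b"<a>\n</b>"))
```

The first call should print None; the second should print a tuple describing a mismatched tag on line 2.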

Don't expect all that much of an improvement in performance when using Expat directly instead of SAX. However, you might gain a little bit.
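If you want to see how the two approaches compare on your own machine, a rough timing sketch along the following lines may help. It assumes Python 3 and uses a synthetic in-memory document; the document size and repetition counts are arbitrary choices, and the numbers you get will vary with your system and Python version.

```python
import timeit
import xml.sax
import xml.parsers.expat
from xml.sax.handler import ContentHandler

# Synthetic test document; real files may behave differently.
doc = b"<root>" + b"<item a='1'>text</item>" * 5000 + b"</root>"

def via_sax():
    xml.sax.parseString(doc, ContentHandler())

def via_expat():
    xml.parsers.expat.ParserCreate().Parse(doc, True)

# Take the best of three runs, each parsing the document five times.
sax_seconds = min(timeit.repeat(via_sax, number=5, repeat=3))
expat_seconds = min(timeit.repeat(via_expat, number=5, repeat=3))

print("SAX:   %.4f seconds" % sax_seconds)
print("Expat: %.4f seconds" % expat_seconds)
```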

See Also

Recipe 12.2 and Recipe 12.3, for other uses of SAX; the PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs; the PyRXP package from ReportLab is a wrapper around the fast validating parser RXP (http://www.reportlab.com/xml/pyrxp.html), which is available under the GPL license.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
