Parsing


Parsing is the process of reading an XML document and reporting its content to a client application while checking the document for well-formedness. SAX represents parsers as instances of the XMLReader interface. The specific class that implements this interface varies from parser to parser. For example, in Xerces it's org.apache.xerces.parsers.SAXParser. In Crimson it's org.apache.crimson.parser.XMLReaderImpl. Most of the time you don't construct instances of these classes directly; instead, you use the static XMLReaderFactory.createXMLReader() factory method to create a parser-specific implementation of this interface. Then you pass InputSource objects containing XML documents to the parse() method of XMLReader. The parser reads the document and throws an exception if it detects any well-formedness errors.
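The sequence just described (obtain an XMLReader from the factory, wrap the document in an InputSource, call parse()) can be sketched as follows. This is only an illustrative sketch: the class name InputSourceCheck and the method isWellFormed() are hypothetical, and the document is supplied as an in-memory string purely for convenience. It also assumes a parser is available to XMLReaderFactory; if none is, the resulting SAXException is reported here as "not well-formed," which is a simplification.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class; the names are illustrative only.
class InputSourceCheck {

  // Returns true if the XML text parses without a SAXException.
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
  static boolean isWellFormed(String xml) {
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // Wrap the text in an InputSource so that parse() reads
      // from memory rather than from a URL.
      parser.parse(new InputSource(new StringReader(xml)));
      return true;
    }
    catch (SAXException e) {
      // Simplification: a missing parser also ends up here.
      return false;
    }
    catch (IOException e) {
      // Not expected when reading from a string.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<greeting>Hello</greeting>"));
    System.out.println(isWellFormed("<greeting>Hello"));
  }
}
```

An InputSource can equally be built from a system ID (a URL string), an InputStream, or a Reader; the Reader form is used here only so the sketch needs no network access.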

SAX in Other Languages

SAX has been unofficially ported to several other object-oriented languages, including C++, Visual Basic, Python, and Perl. The general patterns and names of most functions are the same, but the details of implementation are quite a bit different. For example, C++ doesn't have interfaces but does have multiple inheritance, so ContentHandler, XMLReader, and the like become classes containing nothing but pure virtual functions. And because C++ string classes can't handle Unicode, parsers must instead use pointers to arrays of custom types such as XMLCh. Unfortunately, there's no standard C++ binding for SAX, so the custom classes vary from one parser to the next, and you can't easily port C++ SAX programs between different compilers and platforms in either binary or source form.

Although supporting the "Desperate Perl Hacker" was a goal of the original XML working group, Perl has always lagged behind other languages quite a bit when it comes to XML. The initial problem was the lack of support for Unicode, a sine qua non for XML. Today Perl has decent Unicode support. To really handle XML, you need at least version 5.005_52 of Perl, preferably Perl 5.6.1 or later, and ideally Perl 5.8.

Several XML parsers are available for Perl, but far and away the most popular is Larry Wall and Clark Cooper's XML::Parser. This is a wrapper around James Clark's expat [http://www.jclark.com/xml/expat.html], an XML parser written in C. However, this parser isn't truly SAX compatible, even though it's used in a lot of legacy code. New projects should use XML::SAX [http://sax.perl.org/] instead.

In my opinion, however, even with this module, Perl is still not as ideal a language for processing XML as you might expect. Perl's strength is its ability to work with the implicit structure in text documents, such as tab-delimited text files and comma-separated values (CSV) files. However, XML documents tend to have very explicit structure that is easily addressed by a language like Java. Perl's strengths don't come into play, but you still suffer the numerous well-known disadvantages of working with Perl. The inevitable obfuscation of Perl code seems to me too high a price to pay.

Python probably has the best support for SAX and XML of any of the non-Java languages. XML parsing, including a SAX port, has been a standard part of Python since version 2.0. Furthermore, Python has a standard Unicode string type. This is not quite the same as Python's regular string type, but Python's dynamic typing means this isn't nearly as big an inconvenience as it is in C++. However, the fact remains that SAX was designed in and for Java, and Java is certainly the most convenient language with which to write SAX programs.

Example 6.1 demonstrates the complete process with a simple program whose main() method parses a document found at a URL entered on the command line. If this document is well-formed, a simple message to that effect is printed on System.out. Otherwise, if the document is not well-formed, the parser throws a SAXException. If an I/O error such as a broken network connection occurs, then the parse() method throws an IOException. In this case, you don't know whether or not the document is well-formed.

Example 6.1 A SAX Program That Parses a Document
 import org.xml.sax.*;
 import org.xml.sax.helpers.XMLReaderFactory;
 import java.io.IOException;

 public class SAXChecker {

   public static void main(String[] args) {

     if (args.length <= 0) {
       System.out.println("Usage: java SAXChecker URL");
       return;
     }

     try {
       XMLReader parser = XMLReaderFactory.createXMLReader();
       parser.parse(args[0]);
       System.out.println(args[0] + " is well-formed.");
     }
     catch (SAXException e) {
       System.out.println(args[0] + " is not well-formed.");
     }
     catch (IOException e) {
       System.out.println(
        "Due to an IOException, the parser could not check "
        + args[0]
       );
     }

   }

 }

Note

Don't forget that you'll probably need to install a parser such as Xerces or Ælfred somewhere in your class path before you can compile or run this program. Only Java 1.4 and later include a built-in parser.


This program's output is straightforward. For example, here's the output I got when I first ran it across my Cafe con Leche home page:

 % java SAXChecker http://www.cafeconleche.org
 http://www.cafeconleche.org is not well-formed.

After I located and fixed the bugs in that document, I got this output:

 % java SAXChecker http://www.cafeconleche.org
 http://www.cafeconleche.org is well-formed.
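Example 6.1 reports only whether a document is well-formed, not where the problem lies. When bugs have to be located and fixed, it can help to know that parsers normally signal well-formedness errors with SAXParseException, a subclass of SAXException that carries a line and column number. The following is a minimal sketch of using it; the class ErrorLocator and its firstError() method are hypothetical names introduced for illustration, and the exact message text and positions reported depend on the parser.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class; the names are illustrative only.
class ErrorLocator {

  // Returns null if the document is well-formed; otherwise returns
  // a description of the first error the parser reported.
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
  static String firstError(InputSource source) {
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.parse(source);
      return null;
    }
    catch (SAXParseException e) {
      // The subclass must be caught before SAXException.
      return "line " + e.getLineNumber()
       + ", column " + e.getColumnNumber()
       + ": " + e.getMessage();
    }
    catch (SAXException e) {
      // Some other SAX problem with no location information.
      return "SAX error: " + e.getMessage();
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java ErrorLocator URL");
      return;
    }
    // An InputSource built from a string treats it as a system ID (URL).
    String error = firstError(new InputSource(args[0]));
    if (error == null) {
      System.out.println(args[0] + " is well-formed.");
    }
    else {
      System.out.println(args[0] + " is not well-formed: " + error);
    }
  }
}
```

Because a parser may stop at the first fatal error, this reports one problem per run; fixing a document can take several passes.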

However, some readers will encounter a different result when they run Example 6.1. In particular, you may get this output:

 % java SAXChecker http://www.cafeconleche.org
 org.xml.sax.SAXException: System property org.xml.sax.driver not specified

What this really means is that your parser has not properly customized its version of the XMLReaderFactory class. Unfortunately, far too many parsers, including Xerces and Crimson, fail to do this. Consequently, you need to set the org.xml.sax.driver Java system property to the fully package-qualified name of the Java class for your parser. For Xerces, it's org.apache.xerces.parsers.SAXParser. For Crimson, it's org.apache.crimson.parser.XMLReaderImpl. For other parsers, consult the parser documentation. You can specify a one-time value for this property using the -D flag to the Java interpreter like this:

 % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser SAXChecker http://www.cafeconleche.org/
 http://www.cafeconleche.org/ is well-formed.
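If you would rather not depend on a command line flag, the same choice can be made in code, because XMLReaderFactory also has a createXMLReader(String className) method that loads a named parser class. The following is a sketch under stated assumptions, not a recommended design: the class ReaderLocator and its locate() method are hypothetical, and the candidate list simply repeats the two class names given above, which load only if the corresponding parser is actually in your class path.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class; the names are illustrative only.
class ReaderLocator {

  // Parser classes named in the text. Whether either one can be
  // loaded depends on what is installed in the class path.
  private static final String[] CANDIDATES = {
    "org.apache.xerces.parsers.SAXParser",
    "org.apache.crimson.parser.XMLReaderImpl"
  };

  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
  static XMLReader locate() {
    try {
      // First honor org.xml.sax.driver or the parser's own default.
      return XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      // No default available; fall through to the explicit names.
    }
    for (String className : CANDIDATES) {
      try {
        return XMLReaderFactory.createXMLReader(className);
      }
      catch (SAXException e) {
        // This class is not available; try the next candidate.
      }
    }
    throw new IllegalStateException("No SAX parser could be located");
  }

  public static void main(String[] args) {
    XMLReader parser = locate();
    System.out.println("Using " + parser.getClass().getName());
  }
}
```

Creating the parser in a separate step like this also keeps a missing driver from being mistaken for a malformed document, since the failure surfaces before parse() is ever called.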


Processing XML with Java. A Guide to SAX, DOM, JDOM, JAXP, and TrAX
ISBN: 0201771861
EAN: 2147483647
Year: 2001
Pages: 191
