Features and Properties


SAX parsers, that is, XMLReader objects, are configured by setting features and properties. A feature has a boolean true/false value. A property has an object value. Both features and properties are named by absolute URIs. This allows just a handful of standard methods to support an arbitrary number of standard and nonstandard features and properties of various types.

Features and properties can be read-only, write-only (rare), or read-write. If you attempt to change a read-only feature or property, then a SAXNotSupportedException, a subclass of SAXException, is thrown. The accessibility of a feature or property can change depending on whether or not the XMLReader is currently parsing a document. For example, you can turn validation on or off before or after parsing a document, but not while the XMLReader is parsing a document.

Getting and Setting Features

The XMLReader interface provides these two methods to turn features on and off:

    public void setFeature(String name, boolean value)
      throws SAXNotRecognizedException, SAXNotSupportedException
    public boolean getFeature(String name)
      throws SAXNotRecognizedException, SAXNotSupportedException

The first argument is the name of the feature to set or get. Feature names are absolute URIs. Standard features that are supported by multiple parsers have names that begin with http://xml.org/sax/features/. For example, this next code fragment checks whether the XMLReader object parser is currently validating and, if it isn't, turns on validation by setting the feature http://xml.org/sax/features/validation to true.

    if (!parser.getFeature("http://xml.org/sax/features/validation")) {
      parser.setFeature("http://xml.org/sax/features/validation",
       true);
    }

Different parsers also support nonstandard, custom features. The names of these features begin with URLs somewhere in the parser vendor's domain. For example, nonstandard features of the Xerces parser from the Apache XML Project begin with http://apache.org/xml/features/.

If the XMLReader object can never access the feature you're trying to get or set, then getFeature() or setFeature() throws a SAXNotRecognizedException. On the other hand, if you try to get or set a feature that the parser recognizes but is unable to access at the current time, then the method throws a SAXNotSupportedException. Both are subclasses of SAXException. For example, if parser were a nonvalidating parser like gnu.xml.aelfred2.SAXDriver, then the preceding code would throw a SAXNotRecognizedException. However, if parser were a validating parser like Xerces but the setFeature() method were invoked while it was parsing a document, then it would throw a SAXNotSupportedException because you can't turn on validation halfway through a document. Because these are checked exceptions, you'll need to either catch them or declare that your method throws them. For example:

    try {
      if (!parser.getFeature(
          "http://xml.org/sax/features/validation")) {
        parser.setFeature("http://xml.org/sax/features/validation",
         true);
      }
    }
    catch (SAXNotRecognizedException e) {
      System.out.println(parser + " is not a validating parser.");
    }
    catch (SAXNotSupportedException e) {
      System.out.println(
       "Cannot turn on validation right now. Try again later."
      );
    }
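The same pattern can be wrapped in small helper methods. The following is only a sketch under stated assumptions: FeatureProbe and its methods are hypothetical names, the http://example.com/ feature URI is made up to provoke an unrecognized-feature error, and the parser is obtained through JAXP's SAXParserFactory rather than the XMLReaderFactory used elsewhere in this chapter. The values printed depend on the installed parser.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class; not part of SAX.
class FeatureProbe {

  // Assumption: the parser is obtained through JAXP.
  static XMLReader newReader() {
    try {
      return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }
    catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("No SAX parser available", e);
    }
  }

  // Returns "true" or "false" for a readable feature; otherwise
  // says which of the two exceptions getFeature() threw.
  static String describe(XMLReader parser, String name) {
    try {
      return String.valueOf(parser.getFeature(name));
    }
    catch (SAXNotRecognizedException e) {
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      return "not supported";
    }
  }

  // Returns true if the feature could be set to the requested value.
  static boolean trySet(XMLReader parser, String name, boolean value) {
    try {
      parser.setFeature(name, value);
      return true;
    }
    catch (SAXNotRecognizedException | SAXNotSupportedException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    XMLReader parser = newReader();
    String[] names = {
      "http://xml.org/sax/features/namespaces",
      "http://xml.org/sax/features/namespace-prefixes",
      "http://xml.org/sax/features/validation",
      "http://example.com/features/no-such-feature" // made-up URI
    };
    for (String name : names) {
      System.out.println(name + ": " + describe(parser, name));
    }
  }
}
```

Returning a description instead of letting the exceptions propagate keeps the calling code short, at the cost of hiding the exception message.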

Getting and Setting Properties

The XMLReader interface uses the following two methods to set and get the values of properties:

    public void setProperty(String name, Object value)
      throws SAXNotRecognizedException, SAXNotSupportedException
    public Object getProperty(String name)
      throws SAXNotRecognizedException, SAXNotSupportedException

Properties are named by absolute URIs, just like features. Standard properties have names that begin with http://xml.org/sax/properties/, such as http://xml.org/sax/properties/declaration-handler and http://xml.org/sax/properties/xml-string. However, most parsers also support some nonstandard, custom properties. The names of these begin with URLs somewhere in the parser vendor's domain. Nonstandard properties of the Xerces parser from the Apache XML Project begin with http://apache.org/xml/properties/, for example:

    http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation

The value of a property is an object, the type of which varies. For example, the value of the http://xml.org/sax/properties/declaration-handler property is an org.xml.sax.ext.DeclHandler, whereas the value of the http://xml.org/sax/properties/xml-string property is a java.lang.String. Passing an object of the wrong type for the property to setProperty() results in a SAXNotSupportedException.

For example, suppose you're using Xerces and you want to set the schema location for elements that are not in any namespace to http://www.example.com/schema.xsd. The following code fragment accomplishes that:

    try {
      parser.setProperty("http://apache.org/xml/properties/schema/"
       + "external-noNamespaceSchemaLocation",
       "http://www.example.com/schema.xsd");
    }
    catch (SAXNotRecognizedException e) {
      System.out.println(parser
       + " is not a schema-validating parser.");
    }
    catch (SAXNotSupportedException e) {
      System.out.println(
       "Cannot change the schema right now. Try again later."
      );
    }
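Properties can be probed the same way as features. The sketch below is an illustration rather than a definitive recipe: PropertyProbe and trySet() are hypothetical names, the http://example.com/ property URI is invented, the parser is assumed to come from JAXP's SAXParserFactory, and org.xml.sax.ext.DefaultHandler2 is used only as a convenient ready-made LexicalHandler. The second call passes a String where a handler is expected, which is the wrong-type case described above; what each call reports depends on the parser.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class; not part of SAX.
class PropertyProbe {

  static final String LEXICAL_HANDLER
   = "http://xml.org/sax/properties/lexical-handler";

  // Assumption: the parser is obtained through JAXP.
  static XMLReader newReader() {
    try {
      return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }
    catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("No SAX parser available", e);
    }
  }

  // Tries to set a property and reports what happened.
  static String trySet(XMLReader parser, String name, Object value) {
    try {
      parser.setProperty(name, value);
      return "ok";
    }
    catch (SAXNotRecognizedException e) {
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      return "not supported";
    }
  }

  public static void main(String[] args) {
    XMLReader parser = newReader();
    // A value of the expected type
    System.out.println(
     trySet(parser, LEXICAL_HANDLER, new DefaultHandler2()));
    // A value of the wrong type
    System.out.println(
     trySet(parser, LEXICAL_HANDLER, "not a handler"));
    // A made-up property name
    System.out.println(trySet(parser,
     "http://example.com/properties/no-such-property", "value"));
  }
}
```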

Required Features

There are only a couple of features that all SAX parsers must support, and no absolutely required properties. The two required features are

  • http://xml.org/sax/features/namespaces

  • http://xml.org/sax/features/namespace-prefixes

The http://xml.org/sax/features/namespaces feature determines whether namespace URIs and local names are passed to startElement() and endElement(). The default, true, passes both namespace URIs and local names. However, if http://xml.org/sax/features/namespaces is false, then the parser may pass the namespace URI and the local name, or it may just pass empty strings for these two arguments. There's not a lot of reason to change the default. (You can always ignore the URI and local name if you don't need them.)

The http://xml.org/sax/features/namespace-prefixes feature determines two things:

  • Whether or not namespace declaration xmlns and xmlns:prefix attributes are included in the Attributes list passed to startElement(). The default, false, is not to include them.

  • Whether or not the qualified names should be passed as the third argument to the startElement() method. The default, false, is not to require qualified names. However, even if http://xml.org/sax/features/namespace-prefixes is false, parsers are allowed to report the qualified name, and most do so.

For example, consider this start-tag:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://www.purl.org/dc/" id="R1">

If http://xml.org/sax/features/namespace-prefixes is false and http://xml.org/sax/features/namespaces is true, then when a SAX parser reads this tag it may invoke the startElement() method in its registered ContentHandler object with these arguments:

    startElement(
      namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      localName="RDF",
      qualifiedName="",
      attributes={id="R1"}
    )

Alternately, it can choose to provide the qualified name even though it isn't required to:

    startElement(
      namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      localName="RDF",
      qualifiedName="rdf:RDF",
      attributes={id="R1"}
    )

However, if http://xml.org/sax/features/namespace-prefixes is true and http://xml.org/sax/features/namespaces is also true, then when a SAX parser reads this tag it invokes the startElement() method, like this:

    startElement(
      namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      localName="RDF",
      qualifiedName="rdf:RDF",
      attributes={id="R1", xmlns:dc="http://www.purl.org/dc/",
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
    )

If http://xml.org/sax/features/namespace-prefixes is true and http://xml.org/sax/features/namespaces is false, then when a SAX parser reads this tag it may invoke the startElement() method, like this:

    startElement(
      namespaceURI="",
      localName="",
      qualifiedName="rdf:RDF",
      attributes={id="R1", xmlns:dc="http://www.purl.org/dc/",
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
    )

Then again it may provide the namespace URI and local name anyway, even though it isn't required to:

    startElement(
      namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      localName="RDF",
      qualifiedName="rdf:RDF",
      attributes={id="R1", xmlns:dc="http://www.purl.org/dc/",
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
    )

The possibilities can be summarized as follows:

  • The parser is guaranteed to provide the namespace URIs and local names of elements and attributes only if http://xml.org/sax/features/namespaces is true (which it is by default).

  • The parser is guaranteed to provide the qualified names of elements and attributes only if http://xml.org/sax/features/namespace-prefixes is true (which it is not by default).

  • The parser provides namespace declaration attributes if and only if http://xml.org/sax/features/namespace-prefixes is true (which it is not by default).

  • The parser always has the option to provide the namespace URI, local name, and qualified name, regardless of the values of http://xml.org/sax/features/namespaces and http://xml.org/sax/features/namespace-prefixes. But you should not rely on this behavior.

In other words, the defaults are fine as long as you don't care about namespace prefixes, only local names and URIs.
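To see which of these possibilities a particular parser chooses, you can parse the start-tag above under each combination of the two features and record what reaches startElement(). The following is a minimal sketch, assuming a parser obtained through JAXP's SAXParserFactory that lets both features be set before parsing begins. NamespaceFeatureDemo is a hypothetical class name, and the tag has been made an empty-element tag so that it forms a complete document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demonstration class; not part of SAX.
class NamespaceFeatureDemo extends DefaultHandler {

  static final String DOC
   = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
   + " xmlns:dc=\"http://www.purl.org/dc/\" id=\"R1\"/>";

  // What the parser passed to startElement() for the root element
  String namespaceURI;
  String localName;
  String qualifiedName;
  int attributeCount;

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    this.namespaceURI = namespaceURI;
    this.localName = localName;
    this.qualifiedName = qualifiedName;
    this.attributeCount = atts.getLength();
  }

  // Parses DOC with the two features set as requested.
  static NamespaceFeatureDemo parse(boolean namespaces,
   boolean prefixes) {
    try {
      // Assumption: the parser is obtained through JAXP.
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setFeature(
       "http://xml.org/sax/features/namespaces", namespaces);
      parser.setFeature(
       "http://xml.org/sax/features/namespace-prefixes", prefixes);
      NamespaceFeatureDemo handler = new NamespaceFeatureDemo();
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(DOC)));
      return handler;
    }
    catch (ParserConfigurationException | SAXException
     | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[][] settings
     = {{true, false}, {true, true}, {false, true}};
    for (boolean[] s : settings) {
      NamespaceFeatureDemo result = parse(s[0], s[1]);
      System.out.println("namespaces=" + s[0]
       + ", namespace-prefixes=" + s[1]
       + ": namespaceURI=\"" + result.namespaceURI
       + "\", localName=\"" + result.localName
       + "\", qualifiedName=\"" + result.qualifiedName
       + "\", attributes=" + result.attributeCount);
    }
  }
}
```

Only the guaranteed values are portable. The optional ones, such as the qualified name when namespace-prefixes is false, may differ from one parser to the next.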

Standard Features

In addition to the two required features, SAX defines a number of standard features that parsers may support if they choose. These have names that are consistent across different parsers, including

  • http://xml.org/sax/features/external-general-entities

  • http://xml.org/sax/features/external-parameter-entities

  • http://xml.org/sax/features/string-interning

  • http://xml.org/sax/features/validation

external-general-entities

If http://xml.org/sax/features/external-general-entities is true, then the parser resolves all external general entity references. If it's false, then it does not. If the parser is validating, then this feature is required to be true.

The default is parser dependent. Not all parsers are able to resolve external entity references. Attempting to set this to true with a parser that cannot resolve external entity references will throw a SAXNotRecognizedException.

external-parameter-entities

If http://xml.org/sax/features/external-parameter-entities is true, then the parser resolves all external parameter entity references. If it's false, then it does not. If the parser is validating, then this feature is required to be true.

The default is parser dependent. Not all parsers are able to resolve external entity references. Attempting to set this to true with a parser that cannot resolve external entity references will throw a SAXNotRecognizedException.

string-interning

If http://xml.org/sax/features/string-interning is true, then the parser internalizes all XML names using the intern() method of the String class before passing them to the various callback methods. Thus if there are 100 different paragraph elements in your document, then the parser will only use one "paragraph" string for all 100 start-tags and 100 end-tags rather than 200 separate strings. This can save memory and also enable you to compare element names using the == operator instead of the equals() method. In addition to element names, this also affects attribute names, entity names, notation names, namespace prefixes, and namespace URIs. The default is parser dependent.
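The following sketch shows one way to take advantage of this. It uses == only when the parser reports that string-interning is on, and falls back to equals() otherwise. InternedNameCounter is a hypothetical class name, and the parser is assumed to come from JAXP's SAXParserFactory with its default settings, under which the qualified name is passed to startElement().

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demonstration class; not part of SAX.
class InternedNameCounter extends DefaultHandler {

  private final boolean interned;
  int paragraphs = 0;

  InternedNameCounter(boolean interned) {
    this.interned = interned;
  }

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    // The literal "paragraph" is a compile-time constant, so it is
    // interned too. Comparing with == is safe only if the parser
    // has promised to intern the names it reports.
    boolean match = interned
     ? qualifiedName == "paragraph"
     : qualifiedName.equals("paragraph");
    if (match) paragraphs++;
  }

  // Counts the paragraph elements in the given document text.
  static int count(String xml) {
    try {
      // Assumption: the parser is obtained through JAXP.
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      boolean interned;
      try {
        interned = parser.getFeature(
         "http://xml.org/sax/features/string-interning");
      }
      catch (SAXException e) {
        // Feature unknown or unreadable: assume no interning
        interned = false;
      }
      InternedNameCounter handler = new InternedNameCounter(interned);
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.paragraphs;
    }
    catch (ParserConfigurationException | SAXException
     | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(count(
     "<doc><paragraph/><paragraph/><note/><paragraph/></doc>"));
  }
}
```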

validation

If http://xml.org/sax/features/validation is true, then the parser validates the document against its DTD. Of course not all parsers are capable of doing this. Attempting to set http://xml.org/sax/features/validation to true for a parser that doesn't know how to validate will throw a SAXNotRecognizedException.

Because validation requires resolving all external entity references, setting http://xml.org/sax/features/validation to true automatically sets http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities to true as well.

The default value of this feature is allegedly parser dependent, but I've yet to encounter a parser that turns it on by default.

Example 7.9 is a program that uses this feature to validate documents. As well as setting the http://xml.org/sax/features/validation feature to true, it's also necessary to register an ErrorHandler object that can receive messages about validity errors.

Example 7.9 A SAX Program That Validates Documents

    import org.xml.sax.*;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.IOException;

    public class SAXValidator implements ErrorHandler {

      // Flag to check whether any errors have been spotted.
      private boolean valid = true;

      public boolean isValid() {
        return valid;
      }

      // If this handler is used to parse more than one document,
      // its initial state needs to be reset between parses.
      public void reset() {
        // Assume document is valid until proven otherwise
        valid = true;
      }

      public void warning(SAXParseException exception) {
        // Warnings are reported but do not make the document invalid
        System.out.println("Warning: " + exception.getMessage());
        System.out.println(" at line " + exception.getLineNumber()
         + ", column " + exception.getColumnNumber());
      }

      public void error(SAXParseException exception) {
        System.out.println("Error: " + exception.getMessage());
        System.out.println(" at line " + exception.getLineNumber()
         + ", column " + exception.getColumnNumber());
        // Unfortunately there's no good way to distinguish between
        // validity errors and other kinds of non-fatal errors
        valid = false;
      }

      public void fatalError(SAXParseException exception) {
        System.out.println("Fatal Error: " + exception.getMessage());
        System.out.println(" at line " + exception.getLineNumber()
         + ", column " + exception.getColumnNumber());
        // Well-formedness is a prerequisite for validity
        valid = false;
      }

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java SAXValidator URL");
          return;
        }
        String document = args[0];

        try {
          XMLReader parser = XMLReaderFactory.createXMLReader();
          SAXValidator handler = new SAXValidator();
          parser.setErrorHandler(handler);
          // Turn on validation.
          parser.setFeature(
           "http://xml.org/sax/features/validation", true);
          parser.parse(document);
          if (handler.isValid()) {
            System.out.println(document + " is valid.");
          }
          else {
            // If the document isn't well-formed, an exception has
            // already been thrown and this has been skipped.
            System.out.println(document + " is well-formed.");
          }
        }
        catch (SAXParseException e) {
          System.out.print(document + " is not well-formed at ");
          System.out.println("Line " + e.getLineNumber()
           + ", column " + e.getColumnNumber());
        }
        catch (SAXException e) {
          System.out.println("Could not check document because "
           + e.getMessage());
        }
        catch (IOException e) {
          System.out.println(
           "Due to an IOException, the parser could not check "
           + document
          );
        }

      }

    }

Following is the beginning of the output from running this program across the DocBook XML source code for an early draft of this chapter:

    % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
      SAXValidator xmlreader.xml
    Error: Element type "xinclude:include" must be declared.
     at line 344, column 92
    Error: Attribute "href" must be declared for
     element type "xinclude:include".
     at line 344, column 92
    Error: Attribute "parse" must be declared for
     element type "xinclude:include".
     at line 344, column 92
    Error: The content of element type "programlisting" must match
    "(#PCDATA|footnoteref|xref|abbrev|acronym|citation|citerefentry|
    citetitle|emphasis|firstterm|foreignphrase|glossterm|footnote|
    phrase|quote|trademark|wordasword|link|olink|ulink|action|
    application|classname|methodname|interfacename|exceptionname|
    ooclass|oointerface|ooexception|command|computeroutput|database|
    email|envar|errorcode|errorname|errortype|filename|function|
    guibutton|guiicon|guilabel|guimenu|guimenuitem|guisubmenu|
    hardware|interface|keycap|keycode|keycombo|keysym|literal|
    constant|markup|medialabel|menuchoice|mousebutton|option|
    optional|parameter|prompt|property|replaceable|returnvalue|
    sgmltag|structfield|structname|symbol|systemitem|token|type|
    userinput|varname|anchor|author|authorinitials|corpauthor|
    modespec|othercredit|productname|productnumber|revhistory|remark|
    subscript|superscript|inlinegraphic|inlinemediaobject|
    inlineequation|synopsis|cmdsynopsis|funcsynopsis|classsynopsis|
    fieldsynopsis|constructorsynopsis|destructorsynopsis|
    methodsynopsis|indexterm|beginpage|co|lineannotation)*".
     at line 344, column 110
    ...
    xmlreader.xml is well-formed.

SAXValidator is complaining about the XInclude elements I use to merge in source code examples such as Example 7.9. These are not expected by the DocBook DTD and need to be replaced before the file becomes valid. Once I do that, the merged file (ch07.xml) is valid:

    % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
      SAXValidator ch07.xml
    ch07.xml is valid.

Standard Properties

SAX does not require parsers to support any properties, but it does define four standard properties that parsers may support if they choose. These are

  • http://xml.org/sax/properties/xml-string

  • http://xml.org/sax/properties/dom-node

  • http://xml.org/sax/properties/lexical-handler

  • http://xml.org/sax/properties/declaration-handler

xml-string

http://xml.org/sax/properties/xml-string is a read-only property that contains the string of text corresponding to the current SAX event. For example, in the startElement() method, this property would contain the actual start-tag that caused the method invocation.

This property is useful in a very straightforward program that echoes an XML document onto a Writer, as shown in Example 7.10. Assuming a validating parser, the parsing process merges a document that was originally split across multiple parsed entities into a single entity. Here each callback method in the ContentHandler simply invokes a private method that writes out the current value of the http://xml.org/sax/properties/xml-string property.

Example 7.10 A SAX Program That Echoes the Parsed Document

    import org.xml.sax.*;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.*;

    public class DocumentMerger implements ContentHandler {

      private XMLReader parser;
      private Writer out;

      public DocumentMerger(XMLReader parser, Writer out) {
        this.parser = parser;
        this.out = out;
      }

      private void output() throws SAXException {
        try {
          String s = (String) parser.getProperty(
           "http://xml.org/sax/properties/xml-string");
          out.write(s);
        }
        catch (IOException e) {
          throw new SAXException("Nested IOException", e);
        }
      }

      public void setDocumentLocator(Locator locator) {}

      public void startDocument() throws SAXException {
        this.output();
      }

      public void endDocument() throws SAXException {
        this.output();
      }

      public void startPrefixMapping(String prefix, String uri)
       throws SAXException {
        this.output();
      }

      public void endPrefixMapping(String prefix)
       throws SAXException {
        this.output();
      }

      public void startElement(String namespaceURI, String localName,
       String qualifiedName, Attributes atts) throws SAXException {
        this.output();
      }

      public void endElement(String namespaceURI, String localName,
       String qualifiedName) throws SAXException {
        this.output();
      }

      public void characters(char[] text, int start, int length)
       throws SAXException {
        this.output();
      }

      public void ignorableWhitespace(char[] text, int start,
       int length) throws SAXException {
        this.output();
      }

      public void processingInstruction(String target, String data)
       throws SAXException {
        this.output();
      }

      public void skippedEntity(String name)
       throws SAXException {
        this.output();
      }

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println(
           "Usage: java DocumentMerger url"
          );
          return;
        }

        try {
          XMLReader parser = XMLReaderFactory.createXMLReader();

          // Since this just writes onto the console, it's best
          // to use the system default encoding, which is what
          // we get by not specifying an explicit encoding here.
          Writer out = new OutputStreamWriter(System.out);
          ContentHandler handler = new DocumentMerger(parser, out);
          parser.setContentHandler(handler);

          parser.parse(args[0]);

          out.flush();
          out.close();
        }
        catch (Exception e) {
          System.err.println(e);
        }

      }

    }

The document that is output may not be quite the same as the document that was read. Character references will have been resolved. General entity references will probably have been resolved. Parts of the prolog, particularly the DOCTYPE declaration, may be missing. Attributes that were read in from defaults in the DTD will be explicitly specified. However, the complete information content of the original document should be present, even if the form is different.

The biggest issue with this program is finding a parser that recognizes the http://xml.org/sax/properties/xml-string property. In my tests, Xerces 1.4.3, Crimson, and Ælfred all threw a SAXNotRecognizedException or a SAXNotSupportedException. I have not yet found a parser that supports this property, and there's some suspicion in the SAX community that defining it in the first place may have been a mistake.

dom-node

The http://xml.org/sax/properties/dom-node property contains the org.w3c.dom.Node object corresponding to the current SAX event. For example, in the startElement() and endElement() methods, this property contains an org.w3c.dom.Element object representing that element. In the characters() method, this property contains the org.w3c.dom.Text object that contained the characters from which the text had been read.

lexical-handler

Lexical events are those ephemera of parsing that don't really mean anything. In some sense, they really aren't part of the document's information. Comments are the most obvious example. On the other hand, lexical data also includes entity boundaries, CDATA section delimiters, and the DOCTYPE declaration. What unifies all of these is that they really don't matter 99.9 percent of the time. Unfortunately, there's that annoying 0.1 percent when you really do care about some lexical detail you'd normally ignore.

Parsers are not required to report lexical data; but if they want to do so, SAX provides a standard callback interface they can use, LexicalHandler, shown in Example 7.11. Notice that it is in the org.xml.sax.ext package, not the core org.xml.sax package.

Example 7.11 The LexicalHandler Interface

    package org.xml.sax.ext;

    public interface LexicalHandler {

      public void startDTD(String name, String publicId,
       String systemId) throws SAXException;
      public void endDTD() throws SAXException;
      public void startEntity(String name)
       throws SAXException;
      public void endEntity(String name) throws SAXException;
      public void startCDATA() throws SAXException;
      public void endCDATA() throws SAXException;
      public void comment(char[] text, int start, int length)
       throws SAXException;

    }

Because parsers are not required to support the LexicalHandler interface, it can't be registered with a setLexicalHandler() method in XMLReader like the other callback interfaces. Instead, it's set as the value of the http://xml.org/sax/properties/lexical-handler property. Example 7.12 is a concrete implementation of LexicalHandler that dumps comments from an XML document onto System.out.

Example 7.12 An Implementation of the LexicalHandler Interface

    import org.xml.sax.*;
    import org.xml.sax.ext.LexicalHandler;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.IOException;

    public class CommentReader implements LexicalHandler {

      public void comment(char[] text, int start, int length)
       throws SAXException {
        String comment = new String(text, start, length);
        System.out.println(comment);
      }

      public static void main(String[] args) {

        // set up the parser
        XMLReader parser;
        try {
          parser = XMLReaderFactory.createXMLReader();
        }
        catch (SAXException e) {
          System.err.println("Error: could not locate a parser.");
          System.err.println(
           "Try setting the org.xml.sax.driver system property to "
           + "the fully package qualified name of your parser class."
          );
          return;
        }

        // turn on comment handling
        try {
          LexicalHandler handler = new CommentReader();
          parser.setProperty(
           "http://xml.org/sax/properties/lexical-handler", handler);
        }
        catch (SAXNotRecognizedException e) {
          System.err.println(
           "Installed XML parser does not provide lexical events...");
          return;
        }
        catch (SAXNotSupportedException e) {
          System.err.println(
           "Cannot turn on comment processing here");
          return;
        }

        if (args.length == 0) {
          System.out.println("Usage: java CommentReader URL");
          // Without an argument there is nothing to parse
          return;
        }

        // start parsing...
        try {
          parser.parse(args[0]);
        }
        catch (SAXParseException e) { // well-formedness error
          System.out.println(args[0] + " is not well formed.");
          System.out.println(e.getMessage()
           + " at line " + e.getLineNumber()
           + ", column " + e.getColumnNumber());
        }
        catch (SAXException e) { // some other kind of error
          System.out.println(e.getMessage());
        }
        catch (IOException e) {
          System.out.println("Could not read " + args[0]
           + " because of the IOException " + e);
        }

      }

      // do-nothing methods not needed in this example
      public void startDTD(String name, String publicId,
       String systemId) throws SAXException {}
      public void endDTD() throws SAXException {}
      public void startEntity(String name) throws SAXException {}
      public void endEntity(String name) throws SAXException {}
      public void startCDATA() throws SAXException {}
      public void endCDATA() throws SAXException {}

    }

The main() method builds an XMLReader, constructs an instance of CommentReader, and uses setProperty() to make this CommentReader the parser's LexicalHandler. Then it parses the document indicated on the command line.

It's amusing to run this across the XML source for various W3C specifications. For example, here's the output when the XML version of the XML 1.0 specification, second edition, is fed into CommentReader:

    % java CommentReader http://www.w3.org/TR/2000/REC-xml-20001006.xml
     ArborText, Inc., 1988-2000, v.4002
     ...............................................................
     XML specification DTD .........................................
     ...............................................................

    TYPICAL INVOCATION:
    #  <!DOCTYPE spec PUBLIC
    #       "-//W3C//DTD Specification V2.1//EN"
    #       "http://www.w3.org/XML/1998/06/xmlspec-v21.dtd">

    PURPOSE:
      This XML DTD is for W3C specifications and other technical reports.
      It is based in part on the TEI Lite and Sweb DTDs.
    ...

The comments you're seeing are actually from the DTD used by the XML specification. Comments and processing instructions in the DTD, in both the internal and external subsets, are reported to their respective callback methods, just like comments and processing instructions in the instance document.

Example 7.12 is a pure LexicalHandler that does not implement any of the other SAX callback interfaces such as ContentHandler. However, it's not uncommon to implement several callback interfaces in one class. Among other advantages, that makes it a lot easier to write programs that rely on information available in different interfaces.
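As a rough illustration of combining interfaces, the following sketch tracks the open elements through ContentHandler callbacks and uses that information inside the LexicalHandler comment() callback to report where each comment appears. The class name CommentLocator is hypothetical. The sketch assumes a parser obtained through JAXP's SAXParserFactory that supports the lexical-handler property and passes qualified names to startElement(), and it extends org.xml.sax.ext.DefaultHandler2, a convenience class that supplies do-nothing implementations of both interfaces.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demonstration class; not part of SAX.
class CommentLocator extends DefaultHandler2 {

  // Qualified names of the currently open elements, innermost first
  private final Deque<String> open = new ArrayDeque<>();
  final List<String> notes = new ArrayList<>();

  // ContentHandler callbacks keep track of the current element
  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    open.push(qualifiedName);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    open.pop();
  }

  // LexicalHandler callback uses the ContentHandler's state
  @Override
  public void comment(char[] text, int start, int length) {
    String where = open.isEmpty() ? "(document)" : open.peek();
    notes.add(where + ": " + new String(text, start, length).trim());
  }

  // Parses the document text and returns one note per comment.
  static List<String> locate(String xml) {
    try {
      // Assumption: the parser is obtained through JAXP.
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      CommentLocator handler = new CommentLocator();
      // The same object is registered in both roles
      parser.setContentHandler(handler);
      parser.setProperty(
       "http://xml.org/sax/properties/lexical-handler", handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.notes;
    }
    catch (ParserConfigurationException | SAXException
     | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<a><!-- first --><b><!-- second --></b></a>";
    for (String note : locate(xml)) {
      System.out.println(note);
    }
  }
}
```

Splitting this across two classes would require the comment handler to hold a reference to the content handler just to ask which element is open.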

declaration-handler

The http://xml.org/sax/properties/declaration-handler property identifies the parser's DeclHandler. DeclHandler, summarized in Example 7.13, is an optional interface in the org.xml.sax.ext package that parsers use to report the parts of the DTD that don't affect the content of instance documents, specifically ELEMENT, ATTLIST, and parsed ENTITY declarations. Together with the information reported by the DTDHandler, this gives you enough information to reproduce a parsed document's DTD. The reproduced DTD may not be identical to the original DTD. For example, parameter entities will have been resolved, and only the first declaration of each general entity will be reported. Nonetheless, the model represented by the entire DTD should be intact.

Example 7.13 The DeclHandler Interface

    package org.xml.sax.ext;

    public interface DeclHandler {

      public void elementDecl(String name, String model)
       throws SAXException;
      public void attributeDecl(String elementName,
       String attributeName, String type, String mode,
       String defaultValue) throws SAXException;
      public void internalEntityDecl(String name, String value)
       throws SAXException;
      public void externalEntityDecl(String name, String publicID,
       String systemID) throws SAXException;

    }

Example 7.14 is a little DeclHandler I whipped up to help me make sense out of heavily modular, very customizable DTDs such as XHTML 1.1 or SMIL 2.0. It implements the DeclHandler interface with methods that copy each declaration onto System.out. Because all parameter entity references and conditional sections are resolved before these methods are invoked, it outputs a single monolithic DTD. I can see, for example, exactly what the content model for an element such as blockquote really is without having to manually trace the parameter entity references through seven separate modules and figure out which modules are likely to be included and which ignored.

Example 7.14 A Program That Prints Out a Complete DTD

    import org.xml.sax.*;
    import org.xml.sax.ext.DeclHandler;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.IOException;

    public class DTDMerger implements DeclHandler {

      public void elementDecl(String name, String model)
       throws SAXException {
        System.out.println("<!ELEMENT " + name + " " + model + " >");
      }

      public void attributeDecl(String elementName,
       String attributeName, String type, String mode,
       String defaultValue) throws SAXException {

        System.out.print("<!ATTLIST ");
        System.out.print(elementName);
        System.out.print(" ");
        System.out.print(attributeName);
        System.out.print(" ");
        System.out.print(type);
        System.out.print(" ");
        if (mode != null) {
          System.out.print(mode + " ");
        }
        if (defaultValue != null) {
          System.out.print('"' + defaultValue + "\" ");
        }
        System.out.println(">");

      }

      public void internalEntityDecl(String name,
       String value) throws SAXException {

        if (!name.startsWith("%")) { // ignore parameter entities
          System.out.println("<!ENTITY " + name + " \""
           + value + "\">");
        }

      }

      public void externalEntityDecl(String name,
       String publicID, String systemID) throws SAXException {

        if (!name.startsWith("%")) { // ignore parameter entities
          if (publicID != null) {
            System.out.println("<!ENTITY " + name + " PUBLIC \""
             + publicID + "\" \"" + systemID + "\">");
          }
          else {
            System.out.println("<!ENTITY " + name + " SYSTEM \""
             + systemID + "\">");
          }
        }

      }

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java DTDMerger URL");
          return;
        }
        String document = args[0];

        XMLReader parser = null;
        try {
          parser = XMLReaderFactory.createXMLReader();
          DeclHandler handler = new DTDMerger();
          parser.setProperty(
           "http://xml.org/sax/properties/declaration-handler",
           handler);
          parser.parse(document);
        }
        catch (SAXNotRecognizedException e) {
          System.err.println(parser.getClass()
           + " does not support declaration handlers.");
        }
        catch (SAXNotSupportedException e) {
          System.err.println(parser.getClass()
           + " does not support declaration handlers.");
        }
        catch (SAXException e) {
          System.err.println(e);
          // As long as we finished with the DTD we really don't care
        }
        catch (IOException e) {
          System.out.println(
           "Due to an IOException, the parser could not check "
           + document
          );
        }

      }

    }

I ran this program across the start of an XHTML document. (The document was the XHTML 1.1 specification itself, although that detail doesn't really matter because it's the DTD we care about here, not the instance document. In fact, the instance document doesn't even need to be well-formed, as long as the error isn't spotted until after the DOCTYPE declaration has been read.) Here is the beginning of the merged DTD:

% java DTDMerger http://www.w3.org/TR/xhtml11
<!ATTLIST a onfocus CDATA #IMPLIED >
<!ATTLIST a onblur CDATA #IMPLIED >
<!ATTLIST form onsubmit CDATA #IMPLIED >
<!ATTLIST form onreset CDATA #IMPLIED >
<!ATTLIST label onfocus CDATA #IMPLIED >
...

If what you really want to know is the content specification for a particular element type, the output from this program is a lot easier to read than the original DTD. For example, here's the original ELEMENT declaration for the p element:

<!ENTITY % p.element  "INCLUDE" >
<![%p.element;[
<!ENTITY % p.content
     "( #PCDATA | %Inline.mix; )*" >
<!ENTITY % p.qname  "p" >
<!ELEMENT %p.qname;  %p.content; >
<!-- end of p.element -->]]>

Now here's the merged version:

<!ELEMENT p
   (#PCDATA|br|span|em|strong|dfn|code|samp|kbd|var|cite|abbr|
   acronym|q|tt|i|b|big|small|sub|sup|bdo|a|img|map|object|input|
   select|textarea|label|button|ruby|ins|del|script|noscript)* >

I think you'll agree that the second version is a lot easier to follow and understand. There are good and valid reasons to write the original DTD in the form used by the first declaration, but that's just not a form you want to present to a human being instead of a computer.
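To watch the merging happen on a small scale, here is a minimal sketch that is separate from DTDMerger. Everything specific in it is an assumption made for illustration: the class name MiniMerger, the three-element DTD, and the http://example.com/p.dtd system identifier are all invented, and an EntityResolver feeds the parser the DTD text from memory so that nothing is fetched from the network. It also gets its XMLReader through JAXP's SAXParserFactory rather than XMLReaderFactory, purely so that it runs on a stock JDK, and it assumes that parser supports the declaration-handler property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

public class MiniMerger implements DeclHandler {

  // A made-up external DTD subset written in the modular,
  // parameter-entity-heavy style of the XHTML DTDs
  static final String DTD =
      "<!ENTITY % Inline.mix \"em | strong\">\n"
    + "<!ENTITY % p.content \"( #PCDATA | %Inline.mix; )*\">\n"
    + "<!ELEMENT p %p.content;>\n"
    + "<!ELEMENT em (#PCDATA)>\n"
    + "<!ELEMENT strong (#PCDATA)>\n";

  // The system identifier is hypothetical; it is never dereferenced
  static final String DOCUMENT =
      "<!DOCTYPE p SYSTEM \"http://example.com/p.dtd\">\n"
    + "<p>Hello <em>there</em></p>";

  private final List<String> declarations = new ArrayList<>();

  // The parser passes the content model with entities already resolved
  @Override
  public void elementDecl(String name, String model) {
    declarations.add("<!ELEMENT " + name + " " + model + ">");
  }

  // The other declarations are ignored in this sketch
  @Override
  public void attributeDecl(String elementName, String attributeName,
   String type, String mode, String defaultValue) {}

  @Override
  public void internalEntityDecl(String name, String value) {}

  @Override
  public void externalEntityDecl(String name, String publicID,
   String systemID) {}

  // Parses DOCUMENT and returns the merged ELEMENT declarations
  public static List<String> merge() {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      MiniMerger handler = new MiniMerger();
      parser.setProperty(
       "http://xml.org/sax/properties/declaration-handler", handler);
      // Serve the DTD from memory whatever system ID is requested
      parser.setEntityResolver((publicID, systemID) -> {
        InputSource source = new InputSource(new StringReader(DTD));
        source.setSystemId(systemID);
        return source;
      });
      parser.parse(new InputSource(new StringReader(DOCUMENT)));
      return handler.declarations;
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not read the DTD", e);
    }
  }

  public static void main(String[] args) {
    for (String declaration : merge()) {
      System.out.println(declaration);
    }
  }

}
```

The declaration printed for p should list em and strong directly, with no parameter entity references left in it. The exact spacing of the content model is up to the parser.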

Xerces Custom Features

Individual parsers generally have a set of their own custom features and properties that control their own special capabilities. This allows you to configure a parser through the standard SAX API, rather than calling parser-specific classes and thus binding your code to one specific parser. There's generally no problem with using these nonstandard features. Just be on the alert for SAXNotRecognizedException in case you later need to switch to a different parser that doesn't support the same features.
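One way to stay on the alert is to route every nonstandard feature through a small helper that reports whether the parser accepted the setting. What follows is only a sketch under stated assumptions: the class name FeatureSetter and the methods trySetFeature() and newReader() are invented for illustration, the http://example.com/ feature name is deliberately bogus, and the XMLReader comes from JAXP's SAXParserFactory simply so that the sketch runs on a stock JDK.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class FeatureSetter {

  // Hypothetical helper: returns true if the parser accepted the
  // setting, false if it does not know or cannot change the feature
  public static boolean trySetFeature(XMLReader parser, String name,
   boolean value) {
    try {
      parser.setFeature(name, value);
      return true;
    }
    catch (SAXNotRecognizedException e) {
      return false; // this parser has never heard of the feature
    }
    catch (SAXNotSupportedException e) {
      return false; // known feature, but it can't take this value now
    }
  }

  // Assumption: whatever parser JAXP finds by default is good enough
  public static XMLReader newReader() {
    try {
      return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }
    catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("No SAX parser available", e);
    }
  }

  public static void main(String[] args) {
    XMLReader parser = newReader();
    String[] names = {
      "http://xml.org/sax/features/namespaces",
      "http://apache.org/xml/features/continue-after-fatal-error",
      "http://example.com/features/no-such-feature" // deliberately bogus
    };
    for (String name : names) {
      String result
       = trySetFeature(parser, name, true) ? "set" : "not available";
      System.out.println(name + ": " + result);
    }
  }

}
```

A program built this way can decide for itself whether a missing feature is fatal or merely a lost convenience, instead of dying on the first parser that doesn't recognize a URI.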

For purposes of illustration, I'll look at the custom features in Xerces 1.4.3, all of which are in the http://apache.org/xml/features/ hierarchy. Other parsers will have some features similar to these and some unique ones of their own. However, each parser's custom features will be named by URIs in a domain that belongs to that parser's vendor.

http://apache.org/xml/features/validation/schema

If true, Xerces will use any XML schemas it finds for applying default attribute values, for assigning types to attributes, and possibly for validation. (Validation also depends on the http://xml.org/sax/features/validation feature.) If false, then Xerces won't use schemas at all, just the DTD. The default is true; that is, use the schema if present.

http://apache.org/xml/features/validation/schema-full-checking

A number of features of the W3C XML Schema Language are extremely compute-intensive. For example, the rather technical requirement for "Unique Particle Attribution" mandates that, given any element, it's possible to tell which part of a schema that element matches without considering the items the element contains or the elements that follow it. This is extremely difficult to state, much less implement, in a precisely correct way. Consequently, Xerces by default skips these expensive checks. However, if you want them performed despite their cost, you can turn them on by setting this feature to true.

http://apache.org/xml/features/validation/dynamic

If true, then Xerces will only attempt to validate documents that have a DOCTYPE declaration or an xsi:schemaLocation attribute. It will not attempt to validate merely well-formed documents that have neither.
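The effect of dynamic validation is easiest to see with two tiny documents, one with a DOCTYPE and one without. The following is a hedged sketch rather than part of the original examples. The class name DynamicValidationDemo and both documents are made up, and it assumes the parser JAXP supplies recognizes this Xerces feature (the parser bundled with the JDK is derived from Xerces, so I expect it to).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DynamicValidationDemo {

  // Invalid against its own DTD: a is declared EMPTY but contains b
  static final String WITH_DOCTYPE
   = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>";

  // Merely well-formed: no DTD, so there is nothing to be invalid against
  static final String WITHOUT_DOCTYPE = "<a><b/></a>";

  // Counts the nonfatal errors reported with dynamic validation on
  public static int countValidityErrors(String xml) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setFeature("http://xml.org/sax/features/validation", true);
      parser.setFeature(
       "http://apache.org/xml/features/validation/dynamic", true);
      final int[] errors = {0};
      parser.setErrorHandler(new DefaultHandler() {
        @Override
        public void error(SAXParseException e) {
          errors[0]++;
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
      return errors[0];
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not check document", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("With DOCTYPE: "
     + countValidityErrors(WITH_DOCTYPE) + " error(s)");
    System.out.println("Without DOCTYPE: "
     + countValidityErrors(WITHOUT_DOCTYPE) + " error(s)");
  }

}
```

With the dynamic feature off, the second document would be reported as invalid simply because it has no DTD. With it on, only the first document, the one that claims to follow a DTD, is held to it.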

http://apache.org/xml/features/validation/warn-on-duplicate-attdef

It is technically legal to declare the same attribute twice, and the declarations don't even have to be compatible. For example:

<!ATTLIST Order id ID #IMPLIED>
<!ATTLIST Order id CDATA #REQUIRED>

The parser simply picks the first declaration and ignores the rest. Nonetheless, this probably indicates a mistake in the DTD. If the warn-on-duplicate-attdef feature is true, then Xerces should warn of duplicate attribute declarations by invoking the warning() method in the registered ErrorHandler. The default is to warn of this problem.

http://apache.org/xml/features/validation/warn-on-undeclared-elemdef

It is technically legal to declare an attribute for an element that is not declared. This might happen if you delete an ELEMENT declaration but forget to delete one of the ATTLIST declarations for that element. Nonetheless, this almost certainly indicates a mistake. If this feature is true, then Xerces will warn of attribute declarations for nonexistent elements. The default is to warn of this problem.

http://apache.org/xml/features/allow-java-encodings

By default Xerces only recognizes the standard encoding names such as ISO-8859-1 and UTF-8. However, if this feature is turned on, then Xerces will also recognize Java style encoding names such as 8859_1 and UTF8. The default is false.

http://apache.org/xml/features/continue-after-fatal-error

If true, Xerces will continue to parse a document after it detects a well-formedness error in order to detect and report more errors. This is useful for debugging because it allows you to be informed of multiple errors and correct them before parsing a document again. The default is false. Note that the only thing Xerces will do after it sees the first well-formedness error is to look for more errors. It will not invoke any methods in any of the callback interfaces except ErrorHandler.

http://apache.org/xml/features/nonvalidating/load-dtd-grammar

If true, Xerces will attach default attributes to elements and specify attribute types even if it isn't validating. If false it won't. The default is true; and if validation is turned on, this feature is automatically turned on and cannot be turned off.

http://apache.org/xml/features/nonvalidating/load-external-dtd

If true, Xerces will load the external DTD subset. If false it won't. The default is true. If validation is turned on, this feature is automatically turned on and cannot be turned off.
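Because defaults and even the list of recognized features can shift from one release to the next, it can be worth asking the parser itself rather than trusting a printed list. Here is a minimal sketch of such a report. The class name XercesFeatureReport and its methods are hypothetical, the feature names are simply the nine described above, and the XMLReader comes from JAXP's SAXParserFactory so that the sketch runs on a stock JDK. A parser that isn't derived from Xerces will most likely answer "not recognized" for all of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class XercesFeatureReport {

  static final String PREFIX = "http://apache.org/xml/features/";

  // The Xerces custom features discussed in this section
  static final String[] FEATURES = {
    "validation/schema",
    "validation/schema-full-checking",
    "validation/dynamic",
    "validation/warn-on-duplicate-attdef",
    "validation/warn-on-undeclared-elemdef",
    "allow-java-encodings",
    "continue-after-fatal-error",
    "nonvalidating/load-dtd-grammar",
    "nonvalidating/load-external-dtd"
  };

  // Maps each feature name to its current value or to a short
  // explanation of why the value could not be read
  public static Map<String, String> report(XMLReader parser) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String feature : FEATURES) {
      String status;
      try {
        status = String.valueOf(parser.getFeature(PREFIX + feature));
      }
      catch (SAXNotRecognizedException e) {
        status = "not recognized";
      }
      catch (SAXNotSupportedException e) {
        status = "not readable right now";
      }
      result.put(feature, status);
    }
    return result;
  }

  // Convenience for the default JAXP parser (an assumption of this sketch)
  public static Map<String, String> defaultReport() {
    try {
      return report(
       SAXParserFactory.newInstance().newSAXParser().getXMLReader());
    }
    catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("No SAX parser available", e);
    }
  }

  public static void main(String[] args) {
    for (Map.Entry<String, String> entry : defaultReport().entrySet()) {
      System.out.println(PREFIX + entry.getKey() + " = " + entry.getValue());
    }
  }

}
```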

Example 7.15 is a variation of the earlier SAXValidator program (Example 7.9) that uses Xerces custom features to provide as many warnings and errors as possible. Because it uses dynamic validation, it only reports validity errors if the document is in fact trying to be valid. It turns on all optional warnings, and it continues parsing after a fatal error so that it can find and report any more errors it spots in the document. This program is more useful for checking documents than the earlier generic program in Example 7.9. The downside is that it is totally dependent on the Xerces parser: It will not run with any other parser. Indeed it might even be problematic with earlier or later versions of Xerces. (I wrote this with version 1.4.3.)

Example 7.15 Making Maximal Use of Xerces' Special Capabilities
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class XercesChecker implements ErrorHandler {

  // Flags to check whether any errors have been spotted.
  private boolean valid = true;
  private boolean wellFormed = true;

  public boolean isValid() {
    return valid;
  }

  public boolean isWellFormed() {
    return wellFormed;
  }

  // If this handler is used to parse more than one document,
  // its initial state needs to be reset between parses.
  public void reset() {
    // Assume document is valid until proven otherwise
    valid = true;
    wellFormed = true;
  }

  public void warning(SAXParseException exception) {

    System.out.println("Warning: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber()
     + ", column " + exception.getColumnNumber());
    System.out.println(" in entity " + exception.getSystemId());

  }

  public void error(SAXParseException exception) {

    System.out.println("Error: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber()
     + ", column " + exception.getColumnNumber());
    // Unfortunately there's no good way to distinguish between
    // validity errors and other kinds of non-fatal errors
    valid = false;

  }

  public void fatalError(SAXParseException exception) {

    System.out.println("Fatal Error: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber()
     + ", column " + exception.getColumnNumber());
    System.out.println(" in entity " + exception.getSystemId());
    // Because continue-after-fatal-error is turned on, the parser
    // keeps going instead of throwing, so the problem must be
    // recorded here or it will be forgotten.
    wellFormed = false;
    valid = false;

  }

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java XercesChecker URL");
      return;
    }
    String document = args[0];

    try {
      XMLReader parser = XMLReaderFactory.createXMLReader(
       "org.apache.xerces.parsers.SAXParser"
      );
      XercesChecker handler = new XercesChecker();
      parser.setErrorHandler(handler);

      // This is a hack to fit some long lines of code that
      // follow between the margins of this printed page
      String features = "http://apache.org/xml/features/";

      // Turn on Xerces specific features
      parser.setFeature(features + "validation/dynamic", true);
      parser.setFeature(features
       + "validation/schema-full-checking", true);
      parser.setFeature(features
       + "validation/warn-on-duplicate-attdef", true);
      parser.setFeature(features
       + "validation/warn-on-undeclared-elemdef", true);
      parser.setFeature(features + "continue-after-fatal-error",
       true);

      parser.parse(document);
      if (!handler.isWellFormed()) {
        // Fatal errors were reported but parsing continued
        System.out.println(document + " is not well-formed.");
      }
      else if (handler.isValid()) {
        System.out.println(document + " is valid.");
      }
      else {
        System.out.println(document + " is well-formed.");
      }
    }
    catch (SAXParseException e) {
      System.out.print(document + " is not well-formed at ");
      System.out.println("Line " + e.getLineNumber()
       + ", column " +  e.getColumnNumber()
       + " in file " + e.getSystemId());
    }
    catch (SAXException e) {
      System.out.println("Could not check document because "
       + e.getMessage());
    }
    catch (IOException e) {
      System.out.println(
       "Due to an IOException, the parser could not check "
       + document
      );
    }

  }

}

Following is the beginning of the output from running it across one of my web pages that was supposed to be well-formed HTML, but proved not to be:

% java XercesChecker http://www.cafeconleche.org/
Fatal Error: The element type "br" must be terminated by the
 matching end-tag "</br>".
 at line 73, column 4
Fatal Error: The element type "br" must be terminated by the
 matching end-tag "</br>".
 at line 74, column 16
Fatal Error: The element type "dd" must be terminated by the
 matching end-tag "</dd>".
 at line 123, column 4
Fatal Error: The element type "br" must be terminated by the
 matching end-tag "</br>".
 at line 162, column 4
Fatal Error: The reference to entity "section" must end with
 the ';' delimiter.
 at line 183, column 78
...

There were actually quite a few more errors than I've included here. The advantage of using XercesChecker over one of the earlier generic checking programs is that XercesChecker gives me a reasonably complete list of all errors in one pass. I couldn't necessarily do this with any off-the-shelf parser. With the earlier programs that stopped at the first fatal error, I'd have to fix one error, retest, fix the next error, retest, and so on until I had fixed the final error.

Xerces Custom Properties

DTDs require instance documents to specify what DTDs they should be validated against. Although often useful, this can be dangerous. For example, imagine that you've written an order processing system that accepts XML documents containing orders from many heterogeneous systems around the world. You can't necessarily trust the people sending you orders to send them in the correct format, so as a first step you validate every order received. If the order document is invalid, your system rejects it.

This system has a flaw. Because the documents themselves specify which DTD they'll be validated against, hackers can introduce bad data into your system by replacing the system identifier for your DTD with a URI for a DTD on a site they control. Then they can send you a document that will test as valid, even though it's not, because it's being validated against the wrong DTD!

For this and other reasons, the schema specification explicitly states that the xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes are not the only way to attach a schema to an instance document. The client application parsing a document is allowed to override the schema locations given in the document with schemas of its own choosing. For this purpose, Xerces has two custom properties:

  • http://apache.org/xml/properties/schema/external-schemaLocation

  • http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation

Both of these properties are strings telling the parser where a schema for elements in particular namespaces (or no namespace) can be found. They have the same syntax as the xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes in instance documents. For example, this code fragment specifies that elements not in any namespace should be validated against the schema found at the relative URL orders.xsd:

parser.setProperty(
  "http://apache.org/xml/properties/schema/"
 + "external-noNamespaceSchemaLocation", "orders.xsd");

The following code fragment specifies that elements in the http://schemas.xmlsoap.org/soap/envelope/ namespace should be validated against the schema found at the URL http://www.w3.org/2002/06/soap-envelope, and that elements in the http://schemas.xmlsoap.org/soap/encoding/ namespace should be validated against the schema found at the URL http://www.w3.org/2002/06/soap-encoding:

parser.setProperty(
 "http://apache.org/xml/properties/schema/external-schemaLocation",
 "http://schemas.xmlsoap.org/soap/envelope/ "
 + "http://www.w3.org/2002/06/soap-envelope "
 + "http://schemas.xmlsoap.org/soap/encoding/ "
 + "http://www.w3.org/2002/06/soap-encoding");

If these properties are used and xsi:schemaLocation and/or xsi:noNamespaceSchemaLocation attributes are present in the instance document, then the schemas named by the properties take precedence.
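To tie this back to the order processing scenario, the following sketch validates incoming documents against a schema the application picks for itself. Treat it as an illustration under assumptions rather than a finished design. The class name TrustedSchemaDemo, the tiny Order schema, and the two sample orders are all invented. The schema is written to a temporary file only so that the example is self-contained, and the XMLReader comes from JAXP's SAXParserFactory on the assumption that the parser it returns understands these Xerces features and properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TrustedSchemaDemo {

  // A made-up schema: an Order holds exactly one Item
  static final String SCHEMA =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
    + "<xs:element name='Order'><xs:complexType><xs:sequence>"
    + "<xs:element name='Item' type='xs:string'/>"
    + "</xs:sequence></xs:complexType></xs:element>"
    + "</xs:schema>";

  // Writes the schema to a temporary file and returns its URL
  public static String writeSchema() {
    try {
      Path xsd = Files.createTempFile("orders", ".xsd");
      Files.write(xsd, SCHEMA.getBytes(StandardCharsets.UTF_8));
      xsd.toFile().deleteOnExit();
      return xsd.toUri().toString();
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Counts the validity errors found when the document is checked
  // against the schema the application chose
  public static int countErrors(String xml, String schemaURL) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // schemas depend on namespaces
      XMLReader parser = factory.newSAXParser().getXMLReader();
      parser.setFeature("http://xml.org/sax/features/validation", true);
      parser.setFeature(
       "http://apache.org/xml/features/validation/schema", true);
      parser.setProperty(
        "http://apache.org/xml/properties/schema/"
       + "external-noNamespaceSchemaLocation", schemaURL);
      final int[] errors = {0};
      parser.setErrorHandler(new DefaultHandler() {
        @Override
        public void error(SAXParseException e) {
          errors[0]++;
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
      return errors[0];
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not check document", e);
    }
  }

  public static void main(String[] args) {
    String schemaURL = writeSchema();
    System.out.println("Good order: " + countErrors(
     "<Order><Item>Widget</Item></Order>", schemaURL) + " error(s)");
    System.out.println("Bad order: " + countErrors(
     "<Order><Bogus/></Order>", schemaURL) + " error(s)");
  }

}
```

Neither sample order names a schema at all, yet both are validated. The decision about what counts as a valid order stays with the receiving application, not with whoever sent the document.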

These properties are only available in Xerces. Other parsers may support something similar, but if so they'll place it at their own URL. In fact, as I write this, Sun has just proposed [http://java.sun.com/xml/jaxp/change-requests-12.html] adding http://java.sun.com/xml/jaxp/properties/schemaLanguage and http://java.sun.com/xml/jaxp/properties/schemaLocation properties to JAXP. The rough idea is the same, although Sun's proposal would allow supporting arbitrary schema languages and allow the schemaLocation property to have a value from which the schema itself could be read, rather than merely giving the location of the schema. For example, it could be an InputStream or an InputSource object. Other parsers will doubtless implement this in other ways.



Processing XML with Java. A Guide to SAX, DOM, JDOM, JAXP, and TrAX
ISBN: 0201771861
Year: 2001
Pages: 191
