Practical Usage | Professional XML Development with Apache Tools: Xerces, Xalan, FOP, Cocoon, Axis, Xindice (Wrox Professional Guides)

We’ve covered a lot of ways you can use Xerces to get information out of XML documents and into your application. Here are two more practical usage tips.

Xerces isn’t thread safe. Two threads can’t use a single parser instance at the same time. If you’re in a multithreaded situation, you should create one parser instance for each thread. If for some reason you don’t want to do that, make sure that access to the shared parser instance is synchronized, or you’ll run into some nasty problems. A common solution pattern for concurrent systems is to give the threads a pool of parser instances that have already been created.
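
One way to sketch the one-parser-per-thread advice is with a ThreadLocal. The example below is only an illustration: it uses the standard JAXP DocumentBuilder rather than the Xerces-native parser classes, and the class and method names (PerThreadParser, rootName) are made up for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper: every thread lazily gets its own parser instance.
class PerThreadParser {
    private static final ThreadLocal<DocumentBuilder> BUILDER =
        ThreadLocal.withInitial(PerThreadParser::newBuilder);

    private static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot create parser", e);
        }
    }

    /** Parses the text with the calling thread's parser and returns the root element name. */
    static String rootName(String xml) {
        DocumentBuilder builder = BUILDER.get(); // never shared between threads
        try {
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
            Thread.currentThread().getName() + " parsed <" + rootName("<order/>") + ">");
        Thread a = new Thread(task, "worker-a");
        Thread b = new Thread(task, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

A ThreadLocal suits long-lived worker threads. With many short-lived threads, an explicit pool of pre-created parsers, as described above, is likely the better fit.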

That leads us into the second tip. If your application is processing many XML documents, you should try to reuse parser instances. Both the Xerces SAXParser and DOMParser provide a method called reset that you can use to reset the parser’s internal data structures so the instance can be used to parse another document. This saves the overhead of creating all the internal data structures for each document. When you combine this with grammar caching, you can get some nice improvements in performance relative to creating a parser instance over and over again.
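
The reuse pattern might look like the following sketch. It uses the JAXP DocumentBuilder and its reset method as a stand-in for the Xerces-native SAXParser and DOMParser classes mentioned above. The ReuseParser class and rootNames method are hypothetical names, and grammar caching is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: one parser instance handles a whole batch of documents.
class ReuseParser {
    /** Parses every document with the same parser and returns the root element names. */
    static List<String> rootNames(List<String> documents) {
        List<String> names = new ArrayList<>();
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            for (String xml : documents) {
                names.add(builder.parse(new InputSource(new StringReader(xml)))
                                 .getDocumentElement().getTagName());
                // Return the parser to its initial configuration before the next
                // document. Handlers set on the builder itself are cleared too.
                builder.reset();
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("batch parse failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(rootNames(Arrays.asList("<invoice/>", "<order/>")));
    }
}
```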

Common Problems

This section addresses some common problems that people encounter when they use Xerces. Most of these issues aren’t Xerces specific, but they happen so frequently that we wanted to address them.

Classpath problems—It’s a simple mistake but a surprisingly common one. Both xml-apis.jar and xercesImpl.jar must be on your classpath in order to use Xerces. Leaving one of them out will cause pain and suffering. If you want to use the samples, you need to include xercesSamples.jar on your classpath.

The other thing to beware of is strange interactions between your classpath and either the JDK 1.3 Extension Mechanism or the JDK 1.4 Endorsed Standards Override Mechanism. If it looks like you aren’t getting Xerces or the Xerces version that you think you’re using, look for old versions of Xerces in these places. You can determine the version of Xerces by executing the following at your command line:
```
java org.apache.xerces.impl.Version
```
This command prints out the version of Xerces you’re using. You can also call the static method org.apache.xerces.impl.Version#getVersion from inside a program to get the version string.
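
As a rough sketch of the in-program check, the class below looks up org.apache.xerces.impl.Version by reflection, so the check itself still runs when Xerces is missing from the classpath. With Xerces present, calling Version.getVersion() directly does the same job. Reporting where the class was loaded from is an addition of this sketch, intended to help spot a stray older jar, and XercesVersionCheck is a hypothetical name.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic: which Xerces (if any) is this JVM actually using?
class XercesVersionCheck {
    /** Returns the version string and load location, or a note that Xerces is absent. */
    static String describe() {
        try {
            Class<?> version = Class.forName("org.apache.xerces.impl.Version");
            Method getVersion = version.getMethod("getVersion");
            String text = (String) getVersion.invoke(null); // static method, no receiver
            CodeSource source = version.getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                ? "an unknown location"
                : source.getLocation().toString();
            return text + " loaded from " + location;
        } catch (ClassNotFoundException e) {
            return "Xerces not found on the classpath";
        } catch (ReflectiveOperationException e) {
            return "Xerces found, but its version could not be read: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```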
Errors not reported or always reported to the console—If you don’t provide an ErrorHandler, one of two behaviors will occur, depending on the version of Xerces. In every version prior to 2.3.0, no error messages are displayed when no ErrorHandler is registered, so you must register your own ErrorHandler if you want errors to be reported. This confused a lot of people, so starting with version 2.3.0, error messages are echoed to the console when no ErrorHandler is registered. In version 2.3.0 and later, you need to register your own ErrorHandler to turn off the messages to the console.
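
A minimal sketch of registering your own ErrorHandler, assuming the JAXP DocumentBuilder interface. The CollectingErrorHandler class and the problemsIn method are hypothetical names. The handler stores each message in a list instead of relying on whatever the parser does by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler that records every problem the parser reports.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        messages.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override public void warning(SAXParseException e) { record("warning", e); }
    @Override public void error(SAXParseException e) { record("error", e); }
    @Override public void fatalError(SAXParseException e) throws SAXParseException {
        record("fatal", e);
        throw e; // parsing cannot continue after a well-formedness error
    }

    /** Parses the text and returns everything the parser complained about. */
    static List<String> problemsIn(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler); // replaces the default behavior
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by the handler before it was rethrown.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        problemsIn("<a><b></a>").forEach(System.out::println);
    }
}
```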
Multiple calls to characters—In SAX applications, it’s common to forget that the characters callback may be called more than once for the character data inside an element. Unless you buffer up the text by, say, appending it to a StringBuffer, it may look like your application is randomly throwing away pieces of character data.
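
The buffering idea can be sketched as follows. TextCollector and leafTexts are hypothetical names, and the sketch assumes that only leaf elements (those with no child elements) carry text worth keeping. StringBuilder is used in place of StringBuffer because the handler runs on a single thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that gathers the complete text of each leaf element.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> texts = new ArrayList<>();
    private boolean leaf; // true while the open element has seen no child element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting afresh for this element
        leaf = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may run several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (leaf) {
            texts.add(buffer.toString()); // the text is complete only here
        }
        leaf = false; // the parent of this element is not a leaf
        buffer.setLength(0);
    }

    /** Returns the full text of every leaf element, in document order. */
    static List<String> leafTexts(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.texts;
    }

    public static void main(String[] args) {
        System.out.println(leafTexts("<list><item>Tom &amp; Jerry</item><item>b</item></list>"));
    }
}
```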
When is ignorableWhitespace called?—The definition of ignorable whitespace is confusing enough on its own, and the callback adds a further wrinkle. The ignorableWhitespace callback is called for ignorable whitespace only when a DTD is associated with the document. If there’s no DTD, ignorableWhitespace isn’t called. This is true even if there is an XML schema but no DTD.
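
To see the difference, the sketch below counts the characters delivered through ignorableWhitespace for the same content with and without a DTD. The class name, method name, and sample documents are made up for illustration. Validation is switched on for the DTD case only to keep the sketch's behavior predictable across parsers.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that totals the characters reported as ignorable whitespace.
class IgnorableWhitespaceDemo extends DefaultHandler {
    // The DTD declares that <list> holds only elements, so whitespace inside it is ignorable.
    static final String WITH_DTD =
        "<!DOCTYPE list [<!ELEMENT list (item)*><!ELEMENT item (#PCDATA)>]>\n"
        + "<list>\n  <item>a</item>\n</list>";
    static final String WITHOUT_DTD = "<list>\n  <item>a</item>\n</list>";

    private int ignorable;

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable += length;
    }

    /** Returns how many characters arrived through the ignorableWhitespace callback. */
    static int ignorableCount(String xml, boolean validating) {
        IgnorableWhitespaceDemo handler = new IgnorableWhitespaceDemo();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.ignorable;
    }

    public static void main(String[] args) {
        System.out.println("with DTD:    " + ignorableCount(WITH_DTD, true));
        System.out.println("without DTD: " + ignorableCount(WITHOUT_DTD, false));
    }
}
```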
Forgot validation switches—Another common problem is forgetting to turn on the validation features. This is true both for DTD validation and for schema validation. A single feature (http://xml.org/sax/features/validation) must be turned on for DTD validation; for schema validation you must also turn on namespace support (http://xml.org/sax/features/namespaces) and the schema validation feature (http://apache.org/xml/features/validation/schema). That’s three features. Make sure you have them all on.
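
The switches for schema validation might be set as in this sketch. It obtains an XMLReader through JAXP and assumes the underlying parser is Xerces or derived from it, since the schema feature URI is specific to Xerces and another parser may reject it. ValidationSwitches and its methods are hypothetical names.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper that turns on everything schema validation needs.
class ValidationSwitches {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String SCHEMA = "http://apache.org/xml/features/validation/schema";

    /** Creates a reader with namespace support, validation, and schema validation on. */
    static XMLReader schemaValidatingReader() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(VALIDATION, true); // sufficient alone for DTD validation
            reader.setFeature(SCHEMA, true);     // Xerces-specific feature
            return reader;
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("parser rejected a feature", e);
        }
    }

    /** Reads a feature back, so a forgotten switch shows up early. */
    static boolean isOn(XMLReader reader, String feature) {
        try {
            return reader.getFeature(feature);
        } catch (SAXException e) {
            return false; // unrecognized or unsupported counts as off
        }
    }

    public static void main(String[] args) {
        XMLReader reader = schemaValidatingReader();
        for (String feature : new String[] {NAMESPACES, VALIDATION, SCHEMA}) {
            System.out.println(feature + " = " + isOn(reader, feature));
        }
    }
}
```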
Multiple documents in one file—People like to try to put multiple XML documents into a single file. This isn’t legal XML, and Xerces won’t swallow it. You’ll definitely see errors for that.
Mismatched encoding declaration—The character encoding used in a file and the encoding name specified in the encoding declaration must match. The encoding declaration is the encoding="name" part of the XML declaration, <?xml version="1.0" encoding="name"?>, at the top of an XML document. If the encoding of the file and the declared encoding don’t match, you may see errors about invalid characters.
Forgetting to use namespace-aware methods—If you’re working with namespaces, be sure to use the namespace-aware versions of the methods. With SAX this is fairly easy because most people are using the SAX 2.0 ContentHandler, which has only the namespace-aware callback methods. If you’re using DocumentHandler and trying to do namespaces, you’re in the wrong place. You need to use ContentHandler. In DOM-based parsers, this is a little harder because there are namespace-aware versions of methods that have the letters NS appended to their names. So, Element#getAttributeNS is the namespace-aware version of the Element#getAttribute method.
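
A small DOM sketch of the difference, assuming the JAXP interfaces. The namespace URI and class name are invented for the example. Note that a JAXP factory is not namespace aware unless you ask for it, which is an easy extra step to miss.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example contrasting getAttribute with getAttributeNS.
class NamespaceAwareLookup {
    static final String META = "urn:example:meta"; // made-up namespace URI
    static final String SAMPLE =
        "<book xmlns:m='urn:example:meta' m:lang='en' lang='fr'/>";

    /** Parses the text with namespace support and returns the root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the JAXP default is false
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element book = parseRoot(SAMPLE);
        // Matches on namespace URI plus local name, whatever prefix the document used.
        System.out.println("getAttributeNS: " + book.getAttributeNS(META, "lang"));
        // Matches on the literal attribute name only.
        System.out.println("getAttribute:   " + book.getAttribute("lang"));
    }
}
```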
Out of memory using the DOM—Depending on the document you’re working with, you may see out-of-memory errors if you’re using the DOM. This happens because the DOM tends to be very memory intensive. There are several possible solutions. You can increase the size of the Java heap. You can use the DOM in deferred mode—if you’re using the JAXP interfaces, then you aren’t using the DOM in deferred mode. Finally, you can try to prune some of the nodes in the DOM tree by setting the feature http://apache.org/xml/features/dom/include-ignorable-whitespace to false.
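
The whitespace-pruning option can be illustrated with the sketch below. It uses the JAXP switch setIgnoringElementContentWhitespace rather than setting the Xerces feature URI directly. Treating the two as equivalent is an assumption of this sketch, and the class name and sample document are made up. The JAXP documentation says this switch relies on the content model, so the sketch turns validation on and supplies a DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example: fewer nodes when ignorable whitespace is left out of the tree.
class PruneWhitespace {
    static final String SAMPLE =
        "<!DOCTYPE list [<!ELEMENT list (item)*><!ELEMENT item (#PCDATA)>]>\n"
        + "<list>\n  <item>a</item>\n</list>";

    /** Returns the number of child nodes under the root element. */
    static int rootChildCount(String xml, boolean dropIgnorableWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // the content model comes from the DTD
            factory.setIgnoringElementContentWhitespace(dropIgnorableWhitespace);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getChildNodes().getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("whitespace kept:    " + rootChildCount(SAMPLE, false));
        System.out.println("whitespace dropped: " + rootChildCount(SAMPLE, true));
    }
}
```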
Using appendChild instead of importNode across DOM trees—The Xerces DOM implementation tries to enforce some integrity constraints on the contents of the DOM. One common thing developers want to do is create a new DOM tree and then copy some nodes from another DOM tree into it. Usually they try to do this using Node#appendChild, and then they start seeing exceptions like DOMException: DOM005 Wrong document, which is confusing. To copy nodes between DOM trees you need to use the Document#importNode method, and then you can call the method you want to put the node into its new home.
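
The following sketch shows both the failure and the fix, using only the standard DOM and JAXP interfaces. The class and method names are hypothetical. The exact exception message varies by implementation, so the sketch checks the DOMException code instead.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical example of moving content between two DOM trees.
class ImportAcrossDocuments {
    /** Creates a fresh document with a single root element and returns that element. */
    private static Element newRoot(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement(name);
            doc.appendChild(root);
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Tries the tempting shortcut and reports whether the DOM rejected it. */
    static boolean directAppendRejected() {
        Element item = newRoot("item");   // owned by one document
        Element order = newRoot("order"); // owned by another
        try {
            order.appendChild(item);
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.WRONG_DOCUMENT_ERR;
        }
    }

    /** Copies the node into the target document first, then attaches the copy. */
    static Element importThenAppend() {
        Element item = newRoot("item");
        item.setTextContent("widget");
        Element order = newRoot("order");
        Node copy = order.getOwnerDocument().importNode(item, true); // true = deep copy
        order.appendChild(copy);
        return order;
    }

    public static void main(String[] args) {
        System.out.println("direct append rejected: " + directAppendRejected());
        System.out.println("imported child: " + importThenAppend().getFirstChild().getNodeName());
    }
}
```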