Receiving Documents | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

In general, a single XMLReader may parse multiple documents and may do so with the same ContentHandler. Consequently, it's important to tell where one document ends and the next document begins. To provide this information, the parser invokes startDocument() as soon as it begins parsing a new document and before it invokes any other methods in ContentHandler. It calls endDocument() after it has finished parsing the document, and it will not report any further content from that document. No arguments are passed to either of these methods, which serve no purpose other than marking the beginning and end of a complete XML document.

Because an XMLReader may parse multiple documents with the same ContentHandler object, per-document data structures are normally initialized in the startDocument() method rather than in a constructor. These data structures can be flushed, saved, or committed as appropriate by the endDocument() method.

Caution

If you are using one ContentHandler for multiple documents, do not assume that the endDocument() method for the previous document actually ran. If one of the earlier methods such as startElement() threw an exception, it's likely that the parsing was not finished and that any cleanup code you put in endDocument() was not executed. For safety, it's a good idea to reinitialize all per-document data structures in startDocument().
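The caution above can be illustrated with a minimal sketch. It assumes the JAXP parser bundled with the JDK; the class name SkippedEndDocumentDemo, the nested RecordingHandler, and the rule that a bad element aborts the parse are all hypothetical names and behavior invented for this illustration. The handler records which callbacks it receives while one XMLReader parses two documents, the first of which makes startElement() throw.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SkippedEndDocumentDemo {

  // Hypothetical handler: logs document boundaries and rejects <bad> elements
  static class RecordingHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
      events.add("start");
    }

    @Override
    public void endDocument() {
      events.add("end");
    }

    @Override
    public void startElement(String namespaceURI, String localName,
        String qualifiedName, Attributes atts) throws SAXException {
      // Throwing here abandons the parse of the current document
      if ("bad".equals(qualifiedName)) {
        throw new SAXException("Rejected element: " + qualifiedName);
      }
    }
  }

  // Parses each document in turn with the same reader and handler
  static List<String> run(String... xmlDocs) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    RecordingHandler handler = new RecordingHandler();
    reader.setContentHandler(handler);
    for (String xml : xmlDocs) {
      try {
        reader.parse(new InputSource(new StringReader(xml)));
      } catch (SAXException e) {
        // The handler's own exception surfaces here
        handler.events.add("aborted");
      }
    }
    return handler.events;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run("<bad/>", "<good/>"));
  }
}
```

With the JDK's bundled parser, the recorded events should contain two "start" entries but only one "end" entry, because the first parse is abandoned before endDocument() is reached. Any state the handler built up during the first document would still be present when the second startDocument() call arrives.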

Let's revise the tag stripper program so that it can operate on multiple XML documents in series. Furthermore, rather than printing the results on a Writer, we'll store them in a List of Strings. As is common in SAX programs, we need a data structure that holds the information collected from each document. For this simple program, a simple data structure suffices, namely a StringBuffer, which is stored in the currentDocument field. This field is initialized to a new StringBuffer object in the startDocument() method, and converted to a string and added to the documents list in the endDocument() method. Example 6.6 demonstrates the necessary ContentHandler class. The characters() method simply appends text to the currentDocument buffer.

Example 6.6 A ContentHandler Implementation That Resets Its Data Structures Between Documents

import org.xml.sax.*;
import java.util.List;

public class MultiTextExtractor implements ContentHandler {

  private List documents;

  // This field is deliberately not initialized in the
  // constructor. It is initialized for each document parsed, not
  // for each object constructed.
  private StringBuffer currentDocument;

  public MultiTextExtractor(List documents) {
    if (documents == null) {
      throw new NullPointerException(
       "Documents list must be non-null");
    }
    this.documents = documents;
  }

  // Initialize the per-document data structures
  public void startDocument() {
    currentDocument = new StringBuffer();
  }

  // Flush and commit the per-document data structures
  public void endDocument() {
    String text = currentDocument.toString();
    documents.add(text);
  }

  // Update the per-document data structures
  public void characters(char[] text, int start, int length) {
    currentDocument.append(text, start, length);
  }

  // do-nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startPrefixMapping(String prefix, String uri) {}
  public void endPrefixMapping(String prefix) {}
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {}
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {}
  public void ignorableWhitespace(char[] text, int start,
   int length) {}
  public void processingInstruction(String target,
   String data) {}
  public void skippedEntity(String name) {}

}
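A handler like this might be driven as in the following sketch, which parses several documents in series with one XMLReader and one handler instance. To stay self-contained, the sketch uses a compact stand-in, CompactExtractor, built on DefaultHandler. That class, the driver class MultiTextExtractorDriver, the extractAll() helper, and the sample documents are assumptions made for this illustration. With Example 6.6 on the class path, new MultiTextExtractor(documents) could presumably be passed to setContentHandler() instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MultiTextExtractorDriver {

  // Hypothetical stand-in for Example 6.6 with the same per-document logic
  static class CompactExtractor extends DefaultHandler {
    private final List<String> documents;
    private StringBuilder currentDocument;

    CompactExtractor(List<String> documents) {
      this.documents = documents;
    }

    // Reset the per-document buffer at the start of every document
    @Override
    public void startDocument() {
      currentDocument = new StringBuilder();
    }

    // Commit the collected text once the document is complete
    @Override
    public void endDocument() {
      documents.add(currentDocument.toString());
    }

    @Override
    public void characters(char[] text, int start, int length) {
      currentDocument.append(text, start, length);
    }
  }

  // Parses each XML string in turn, reusing one reader and one handler
  static List<String> extractAll(String... xmlDocs) throws Exception {
    List<String> documents = new ArrayList<>();
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setContentHandler(new CompactExtractor(documents));
    for (String xml : xmlDocs) {
      reader.parse(new InputSource(new StringReader(xml)));
    }
    return documents;
  }

  public static void main(String[] args) throws Exception {
    List<String> texts = extractAll(
        "<a>Hello <b>SAX</b></a>", "<c>second document</c>");
    for (String text : texts) {
      System.out.println(text);
    }
  }
}
```

Because the buffer is replaced in startDocument(), the text of the first document does not leak into the second entry of the list, even though both parses share the same handler object.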