Input | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Once you've got an instance of XMLReader, you're going to want to parse some documents with it. There are two methods that do this:

 public void parse(String systemID) throws IOException, SAXException
 public void parse(InputSource in) throws IOException, SAXException

In the last chapter you learned how these methods call back to the client application. What I want to look at now is how the document is fed into the parser.

The parse() method takes either a String containing a system ID or an InputSource object as an argument. An InputSource is a wrapper for the various kinds of input streams and readers from which an XML document can be read. If a system ID is passed to the parse() method, then it's used to construct a new InputSource object that can be passed to the other overloaded parse() method.
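As a rough sketch of that equivalence, the two calls below should visit exactly the same document. The class name ParseOverloads and the ElementCounter handler are made up for this illustration, and I'm assuming the XMLReader comes from the JAXP SAXParserFactory that ships with the JDK; any other way of obtaining an XMLReader would do as well.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParseOverloads {

    // Hypothetical handler that simply counts start-tags.
    static class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            count++;
        }
    }

    // Assumption: the JAXP factory is used to locate a SAX parser.
    static XMLReader newReader()
            throws ParserConfigurationException, SAXException {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // Hands the system ID straight to the parser.
    static int countViaSystemId(String systemId)
            throws ParserConfigurationException, SAXException, IOException {
        ElementCounter counter = new ElementCounter();
        XMLReader parser = newReader();
        parser.setContentHandler(counter);
        parser.parse(systemId);
        return counter.count;
    }

    // Wraps the same system ID in an InputSource first.
    static int countViaInputSource(String systemId)
            throws ParserConfigurationException, SAXException, IOException {
        ElementCounter counter = new ElementCounter();
        XMLReader parser = newReader();
        parser.setContentHandler(counter);
        parser.parse(new InputSource(systemId));
        return counter.count;
    }
}
```

Either method can be handed any URL the virtual machine knows how to open, including a file URL.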

The system ID is an absolute URL, such as http://www.example.com/example.xml; or a relative URL, such as example.xml. Although XML allows any type of URI to be used for system IDs, the parser will need to resolve them. This means you need to use a URL here, not a URN; and that URL must have a scheme supported by your virtual machine's protocol handlers. http and ftp are generally safe choices. ^[1]

^[1] Technically, I suppose it's not absolutely necessary that a parser use the java.net.URL class and protocol handlers to download content from a URL. However, in practice all Java parsers I'm aware of are implemented this way.

Relative URLs are normally relative to the current working directory of the Java program. I've occasionally had problems resolving relative URLs on Windows when the URLs point into the local file system but outside the current working directory and its descendants. But most of the time, this is not a problem. Complicated relative URLs seem completely reliable on Unix and on web servers. More than anything else, this reflects the limitations of the java.net.URL class and the unfamiliarity with Windows among the developers who wrote the Java class library.
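One way to sidestep relative-URL surprises for local files is to hand the parser an absolute file URL instead. The helper below is only a sketch; the class and method names are invented for this example, and it relies on nothing beyond java.nio.file.

```java
import java.nio.file.Paths;

class SystemIds {

    // Hypothetical helper: resolves a local file name against the
    // current working directory and returns an absolute file: URL
    // that can be passed to parse() as a system ID.
    static String toAbsoluteSystemId(String fileName) {
        return Paths.get(fileName).toAbsolutePath().toUri().toString();
    }
}
```

A call such as parser.parse(SystemIds.toAbsoluteSystemId("example.xml")) then no longer depends on how the parser itself interprets a relative URL.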

InputSource

The InputSource class summarized in Example 7.1 is an unusual three-way wrapper around an InputStream and/or a Reader and/or a system ID string. An XML document can be read from this source. Either the InputStream, the Reader, or the system ID can be set in the constructor. There's also a no-args constructor that sets none of these. Properties that aren't set in the constructor can be set later with various setter methods. An encoding property can be set via a setter method but not in a constructor. Finally, the current value of all four properties can be retrieved using getter methods. Any properties that are not explicitly set will be null.

Example 7.1 The SAX InputSource Class

 package org.xml.sax;

 public class InputSource {

   public InputSource()
   public InputSource(String systemID)
   public InputSource(InputStream byteStream)
   public InputSource(Reader characterStream)

   public void        setPublicId(String publicID)
   public String      getPublicId()
   public void        setSystemId(String systemID)
   public String      getSystemId()
   public void        setByteStream(InputStream byteStream)
   public InputStream getByteStream()
   public void        setEncoding(String encoding)
   public String      getEncoding()
   public void        setCharacterStream(Reader characterStream)
   public Reader      getCharacterStream()

 }

What's strange about this class is that there's no guarantee that the three properties of each InputSource object are in any way related to each other. The byte stream might read one XML document, the character stream a different document, and the system ID might point to still a third document or no document at all. When the parser attempts to parse an InputSource object, it first tries to read from the reader, ignoring the encoding. If no reader is available, the parser then tries to read the document from the InputStream using the specified encoding. If no encoding has been specified, the parser will attempt to determine it from the XML document itself by using the first few bytes of the file and the encoding declaration. If neither a Reader nor an InputStream is available, then the parser will try to open a connection to the URI identified by the system ID. If all three options fail, then the parse() method throws a SAXException.

Note

Do not make both the byte stream and the character stream non-null in an InputSource object. The parser will not use the additional source as a backup for a problematic initial source (for example, a system ID that points to a 404 Not Found error). It attempts to load the document from the first non-null property. If that fails, it throws an IOException or a SAXException. It does not consider any extra sources that might be available.
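Purely to illustrate that lookup order, and not as something to imitate in real code, the sketch below deliberately loads one InputSource with three unrelated sources and reports which one the parser actually read. The class, handler, and element names are all invented for the demonstration, and I'm assuming the parser comes from the JDK's JAXP SAXParserFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SourcePrecedence {

    // Hypothetical handler that remembers the name of the root element.
    static class RootNameHandler extends DefaultHandler {
        String rootName;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (rootName == null) rootName = qName;
        }
    }

    // An InputSource whose three properties disagree with each other.
    static InputSource conflictingSource() {
        InputSource in = new InputSource();
        in.setCharacterStream(new StringReader("<fromReader/>"));
        in.setByteStream(new ByteArrayInputStream(
            "<fromBytes/>".getBytes(StandardCharsets.UTF_8)));
        // Never contacted, because a stream is available.
        in.setSystemId("http://www.example.com/never-fetched.xml");
        return in;
    }

    // Parses the source and returns the root element's qualified name.
    static String rootOf(InputSource in)
            throws ParserConfigurationException, SAXException, IOException {
        RootNameHandler handler = new RootNameHandler();
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(handler);
        parser.parse(in);
        return handler.rootName;
    }
}
```

Passing conflictingSource() to rootOf() should report the root element of the document held by the character stream; the byte stream and the system ID are never consulted.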

You should always set the system ID for an InputSource, even if you intend to read the actual data from a byte stream or a character stream. The system ID is the base URL for relative URLs found in the DOCTYPE declaration and external entity references. Some applications may also need it to resolve relative URLs in XML content, such as XLink and XInclude elements. Relative URLs cannot be resolved against an InputStream or Reader without the additional hint provided by the system ID. Finally, the system ID is used by the Locator and ErrorHandler interfaces to identify the file in which a particular element or problem appeared. This is particularly important when you're working with XML documents that are composed of many different files.
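To see the system ID travel from the InputSource to the Locator, consider the following sketch. The document is read from a StringReader, so the only place the reported system ID can come from is the setSystemId() call. The class and handler names are invented here, and the parser is assumed to be the one returned by the JDK's SAXParserFactory, which does supply a Locator (SAX does not strictly require parsers to provide one).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocatorDemo {

    // Hypothetical handler that records the system ID the Locator
    // reports when the first start-tag is seen.
    static class SystemIdRecorder extends DefaultHandler {
        private Locator locator;
        String reported;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (reported == null && locator != null) {
                reported = locator.getSystemId();
            }
        }
    }

    // Parses XML held in memory, labeling it with the given system ID.
    static String reportedSystemId(String xml, String systemId)
            throws ParserConfigurationException, SAXException, IOException {
        InputSource in = new InputSource(new StringReader(xml));
        in.setSystemId(systemId);
        SystemIdRecorder recorder = new SystemIdRecorder();
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(recorder);
        parser.parse(in);
        return recorder.reported;
    }
}
```

The same value is what a SAXParseException passed to an ErrorHandler would return from its getSystemId() method.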

Most often a system ID is sufficient, but the InputSource class can be quite useful when you can get a stream from a document but don't have a convenient URL for it. For example, one common criticism of XML is that it's verbose; that is, that XML documents are too large compared with binary equivalents. This complaint is fallacious. In practice, most binary documents are bigger than XML equivalents. Nonetheless, this hasn't stopped uninformed developers from complaining about XML's verbosity. In any event, it's quite straightforward to compress XML documents and decompress them as necessary, particularly in Java where the java.util.zip package does all the hard work for you. The one downside to this approach is that URLs which point to your files no longer identify well-formed XML documents. Instead they point to a non-well-formed binary document. Thus they can't be used as system IDs. You can still get an InputStream from these compressed files, however; and you can use that InputStream to create an InputSource, which you then parse. For example, the following code parses the gzipped document found at http://www.example.com/bigdata.xml.gz:

 URL u = new URL("http://www.example.com/bigdata.xml.gz");
 InputStream raw = u.openStream();
 InputStream decompressed = new GZIPInputStream(raw);
 InputSource in = new InputSource(decompressed);
 // Base URL for relative references inside the document
 in.setSystemId("http://www.example.com/bigdata.xml");
 parser.parse(in);
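If you'd like to try the technique without a web server, the following self-contained sketch compresses a small document in memory and then parses it back through a GZIPInputStream. Everything here apart from the java.util.zip and SAX classes (the class name, the method names, the sample system ID) is invented for the illustration, and the parser is assumed to come from the JDK's SAXParserFactory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class GzipParsing {

    // Compresses a document so there is something to decompress.
    static byte[] gzip(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Parses a gzipped stream and returns the number of elements seen.
    // The system ID names the uncompressed document, as in the text.
    static int countElements(InputStream compressed, String systemId)
            throws ParserConfigurationException, SAXException, IOException {
        final int[] count = {0};
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try (InputStream decompressed = new GZIPInputStream(compressed)) {
            InputSource in = new InputSource(decompressed);
            in.setSystemId(systemId);
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(counter);
            parser.parse(in);
        }
        return count[0];
    }
}
```

The try-with-resources block closes the stream whether or not the parse succeeds, which matters more when the stream is a network connection than it does here.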

Similar techniques apply anytime you've got an InputStream that doesn't come from a URL. If an XML document is stored in a String, you can use a StringReader to construct an InputSource. If the XML document is stored in a BLOB field in a relational database, you can use JDBC to retrieve a java.sql.Blob object, then use that class's getBinaryStream() method to convert the BLOB into an InputStream from which an InputSource can be constructed. If a UDP packet received by the DatagramSocket class contains an XML document, you can extract the data from the packet as a byte array using the DatagramPacket class's getData() method, construct a ByteArrayInputStream from the portion of that array the packet actually filled (as reported by getOffset() and getLength()), and use the ByteArrayInputStream to construct an InputSource object for parsing.
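Two of these cases, the String and the UDP packet, might look something like the sketch below. The class and method names are hypothetical. Note that the packet version limits the ByteArrayInputStream to the bytes the packet actually contains, since getData() returns the whole underlying buffer. The parser is again assumed to come from the JDK's SAXParserFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.net.DatagramPacket;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InMemorySources {

    // An XML document held in a String is read through a Reader.
    static InputSource fromString(String xml) {
        return new InputSource(new StringReader(xml));
    }

    // A packet's buffer may be larger than its payload, so only the
    // region between getOffset() and getOffset() + getLength() is used.
    static InputSource fromPacket(DatagramPacket packet) {
        return new InputSource(new ByteArrayInputStream(
            packet.getData(), packet.getOffset(), packet.getLength()));
    }

    // Parses the source and returns the root element's qualified name.
    static String rootName(InputSource in)
            throws ParserConfigurationException, SAXException, IOException {
        final String[] root = {null};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) root[0] = qName;
            }
        };
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(handler);
        parser.parse(in);
        return root[0];
    }
}
```

In a real program you would also call setSystemId() on each of these sources if the documents contain relative URLs.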

EntityResolver

An XML document is made up of entities. Each entity is identified by a public identifier, a system identifier, or both. The system IDs tend to be URLs, and the public IDs generally require some sort of catalog system that can convert them into URLs. An XML parser reads each entity using an InputSource connected to the correct URL. Most of the time, you simply give the parser a system ID or an InputSource pointing to the document entity, and let the parser figure out where to find any further entities referenced from the document entity. However, sometimes you may want the parser to read from different URLs than the ones the document specifies. For example, the parser might ask for the XHTML DTD from the W3C web site. You might choose to replace that with a cached copy stored locally. Or the parser might ask for the SMIL 1.0 DTD, but you want to give it the SMIL 2.0 DTD instead.

The EntityResolver interface allows you to filter the parser's requests for external parsed entities, so you can replace the files it requests with your own copies, either faithful or modified. You might even use this interface to provide some form of custom proxy server support, although chances are that would be better implemented at the socket level rather than in the parsing API.

EntityResolver is a callback interface much like ContentHandler. It is attached to an XMLReader with set and get methods:

 public void setEntityResolver(EntityResolver resolver)
 public EntityResolver getEntityResolver()

The EntityResolver interface, summarized in Example 7.2, contains just a single method, resolveEntity(). If you register an EntityResolver with an XMLReader, then every time that XMLReader needs to load an external parsed entity, it will pass the entity's public ID and system ID to resolveEntity() first. resolveEntity() can return either an InputSource or null. If it returns an InputSource, then this InputSource provides the entity's replacement text. If it returns null, then the parser reads the entity in the same way it would have were there not an EntityResolver, probably just by using the system ID and the java.net.URL class.

Example 7.2 The EntityResolver Interface

 package org.xml.sax;

 public interface EntityResolver {

   public InputSource resolveEntity(String publicId,
    String systemId) throws SAXException, IOException;

 }
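To watch this contract at work without touching the network, here is a small sketch of a resolver that answers one made-up public ID with a DTD held in a string and returns null for everything else. The public ID, the DTD, and the class and method names are all invented for this illustration; the parser is assumed to be the one supplied by the JDK's SAXParserFactory with its default settings, under which the external DTD subset is loaded.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InMemoryDtdResolver implements EntityResolver {

    // Hypothetical public ID and DTD, used only in this example.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Greeting//EN";
    static final String DTD =
        "<!ELEMENT doc (#PCDATA)>\n<!ENTITY greeting \"hello\">";

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (PUBLIC_ID.equals(publicId)) {
            InputSource in = new InputSource(new StringReader(DTD));
            in.setSystemId(systemId); // keep a base URL for the entity
            return in;
        }
        return null; // let the parser load anything else itself
    }

    // Parses the document with the resolver installed and returns
    // all the character data the ContentHandler received.
    static String textOf(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setEntityResolver(new InMemoryDtdResolver());
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

A document whose DOCTYPE declaration names that public ID can then use the greeting entity reference even though the system ID in the declaration points nowhere useful.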

Example 7.3 is a simple EntityResolver implementation that maps the XHTML public IDs to URLs on a local server at http://www.cafeconleche.org. A more extensible implementation would allow the lists of IDs and URLs to be customized.

Example 7.3 An XHTML EntityResolver

 import org.xml.sax.*;
 import java.util.Hashtable;

 public class LocalXHTML implements EntityResolver {

   private Hashtable entities = new Hashtable();

   // fill the list of URLs
   public LocalXHTML() {

     // The XHTML 1.0 DTDs
     this.addMapping("-//W3C//DTD XHTML 1.0 Strict//EN",
      "http://www.cafeconleche.org/DTD/xhtml1-strict.dtd");
     this.addMapping("-//W3C//DTD XHTML 1.0 Transitional//EN",
      "http://www.cafeconleche.org/DTD/xhtml1-transitional.dtd");
     this.addMapping("-//W3C//DTD XHTML 1.0 Frameset//EN",
      "http://www.cafeconleche.org/DTD/xhtml1-frameset.dtd");

     // The XHTML 1.0 entity sets
     this.addMapping("-//W3C//ENTITIES Latin 1 for XHTML//EN",
      "http://www.cafeconleche.org/DTD/xhtml-lat1.ent");
     this.addMapping("-//W3C//ENTITIES Symbols for XHTML//EN",
      "http://www.cafeconleche.org/DTD/xhtml-symbol.ent");
     this.addMapping("-//W3C//ENTITIES Special for XHTML//EN",
      "http://www.cafeconleche.org/DTD/xhtml-special.ent");

   }

   private void addMapping(String publicID, String URL) {
     entities.put(publicID, URL);
   }

   public InputSource resolveEntity(String publicID,
    String systemID) throws SAXException {

     // The public ID is null for entities that have only a system ID,
     // and the table is keyed by public ID, so match on the key.
     if (publicID != null && entities.containsKey(publicID)) {
       String url = (String) entities.get(publicID);
       InputSource local = new InputSource(url);
       return local;
     }
     else return null;

   }

 }

Other schemes are certainly possible. For example, instead of looking at the public ID, you could replace the host in the system ID to load the DTDs from a mirror site. You could bundle the DTDs into your application's JAR file and load them from there. You could even hardwire the DTDs in the EntityResolver as string literals and load them with a StringReader.
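The mirror-site idea, for instance, might be sketched as follows. MirrorResolver is a hypothetical class, the host names passed to it are whatever you choose, and the approach assumes the mirror lays out its files under the same paths as the original site. Nothing is downloaded by the resolver itself; it merely returns an InputSource whose system ID points at the mirror.

```java
import java.net.URI;
import java.net.URISyntaxException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MirrorResolver implements EntityResolver {

    private final String originalHost;
    private final String mirrorHost;

    MirrorResolver(String originalHost, String mirrorHost) {
        this.originalHost = originalHost;
        this.mirrorHost = mirrorHost;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException {
        if (systemId == null) return null;
        try {
            URI original = new URI(systemId);
            // Leave entities on other hosts (or with no host) alone.
            if (!originalHost.equalsIgnoreCase(original.getHost())) {
                return null;
            }
            // Same scheme, port, path, and query; different host.
            URI mirrored = new URI(original.getScheme(),
                original.getUserInfo(), mirrorHost, original.getPort(),
                original.getPath(), original.getQuery(),
                original.getFragment());
            InputSource in = new InputSource(mirrored.toString());
            in.setPublicId(publicId);
            return in;
        } catch (URISyntaxException ex) {
            throw new SAXException("Malformed system ID: " + systemId, ex);
        }
    }
}
```

You'd install it with something like parser.setEntityResolver(new MirrorResolver("www.w3.org", "mirror.example.com")), where the second host name stands in for a mirror you actually control.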