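5.12. XML
Java 1.4 and Java 5.0 added powerful XML processing features to the Java platform: SAX and DOM parsing (org.xml.sax and org.w3c.dom, with parsers obtained through javax.xml.parsers), document transformation (javax.xml.transform), schema validation (javax.xml.validation), and XPath evaluation (javax.xml.xpath).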
Examples using each of these packages are presented in the following sections.

5.12.1. Parsing XML with SAX
The first step in parsing an XML document with SAX is to obtain a SAX parser. If you have a SAX parser implementation of your own, you can simply instantiate the appropriate parser class. It is usually simpler, however, to use the javax.xml.parsers package to instantiate whatever SAX parser is provided by the Java implementation. The code looks like this:

    import javax.xml.parsers.*;

    // Obtain a factory object for creating SAX parsers
    SAXParserFactory parserFactory = SAXParserFactory.newInstance();
    // Configure the factory object to specify attributes of the parsers it creates
    parserFactory.setValidating(true);
    parserFactory.setNamespaceAware(true);
    // Now create a SAXParser object
    SAXParser parser = parserFactory.newSAXParser();   // May throw exceptions

The SAXParser class is a simple wrapper around the org.xml.sax.XMLReader interface. Once you have obtained a SAXParser, as shown in the previous code, you can parse a document by calling one of its parse() methods. Some of these methods use the deprecated SAX 1 HandlerBase class, and others use the current SAX 2 org.xml.sax.helpers.DefaultHandler class. The DefaultHandler class provides an empty implementation of all the methods of the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces. These are all the methods that the SAX parser can call while parsing an XML document. By subclassing DefaultHandler and defining the methods you care about, you can perform whatever actions are necessary in response to the method calls generated by the parser.
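As a minimal sketch of this subclassing pattern, the handler below overrides only three of the DefaultHandler methods: startElement() and endElement() to track how deeply elements are nested, and error() so that recoverable errors are not silently ignored. The class names DepthHandlerSketch and DepthHandler and the helper method maxDepth() are hypothetical, chosen only for illustration; the document is parsed from a string to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of any library
class DepthHandlerSketch {
    // Overrides only the callbacks it needs; every other DefaultHandler
    // method keeps its empty default implementation.
    static class DepthHandler extends DefaultHandler {
        int depth = 0;      // current nesting level
        int maxDepth = 0;   // deepest nesting level seen so far

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            depth++;
            if (depth > maxDepth) maxDepth = depth;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }

        // The inherited error() does nothing, so recoverable errors would
        // otherwise go unnoticed. Rethrow them to stop the parse.
        @Override
        public void error(SAXParseException e) throws SAXParseException {
            throw e;
        }
    }

    // Parse the XML text and return the maximum element nesting depth
    static int maxDepth(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DepthHandler handler = new DepthHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.maxDepth;
    }

    public static void main(String[] args) throws Exception {
        // <a> is at depth 1, <b> and <d> at depth 2, <c> at depth 3
        System.out.println(maxDepth("<a><b><c/></b><d/></a>"));  // prints 3
    }
}
```

Because the handler keeps its results in fields, the calling code reads them after parse() returns, which is the same arrangement the next example uses for its counts.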
The following code shows a method that uses SAX to parse an XML file and determine the number of XML elements that appear in a document as well as the number of characters of plain text (possibly excluding "ignorable whitespace") that appear within those elements:

    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class SAXCount {
        public static void main(String[] args)
            throws SAXException, IOException, ParserConfigurationException
        {
            // Create a parser factory and use it to create a parser
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            SAXParser parser = parserFactory.newSAXParser();

            // This is the name of the file you're parsing
            String filename = args[0];

            // Instantiate a DefaultHandler subclass to do your counting for you
            CountHandler handler = new CountHandler();

            // Start the parser. It reads the file and calls methods of the handler.
            parser.parse(new File(filename), handler);

            // When you're done, report the results stored by your handler object
            System.out.println(filename + " contains " + handler.numElements +
                               " elements and " + handler.numChars +
                               " other characters ");
        }

        // This nested class extends DefaultHandler to count elements and text in
        // the XML file and saves the results in public fields. There are many
        // other DefaultHandler methods you could override, but you need only
        // these.
        public static class CountHandler extends DefaultHandler {
            public int numElements = 0, numChars = 0;  // Save counts here

            // This method is invoked when the parser encounters the opening tag
            // of any XML element. Ignore the arguments but count the element.
            public void startElement(String uri, String localname, String qname,
                                     Attributes attributes) {
                numElements++;
            }

            // This method is called for any plain text within an element.
            // Simply count the number of characters in that text.
            public void characters(char[] text, int start, int length) {
                numChars += length;
            }
        }
    }

5.12.2. Parsing XML with DOM
The DOM API is quite different from the SAX API. While SAX is an efficient way to scan an XML document, it is not well suited to programs that need to modify documents. Instead of converting an XML document into a series of method calls, a DOM parser converts the document into an org.w3c.dom.Document object, which is a tree of org.w3c.dom.Node objects. The conversion of the complete XML document to tree form allows random access to the entire document but can consume substantial amounts of memory.

In the DOM API, each node in the document tree implements the Node interface and a type-specific subinterface. (The most common types of node in a DOM document are Element and Text nodes.) When the parser is done parsing the document, your program can examine and manipulate that tree using the various methods of Node and its subinterfaces.

The following code uses JAXP to obtain a DOM parser (which, in JAXP parlance, is called a DocumentBuilder). It then parses an XML file and builds a document tree from it. Next, it examines the Document tree to search for <sect1> elements and prints the contents of the <title> of each.

    import java.io.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class GetSectionTitles {
        public static void main(String[] args)
            throws IOException, ParserConfigurationException,
                   org.xml.sax.SAXException
        {
            // Create a factory object for creating DOM parsers and configure it
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true);  // We want to ignore comments
            factory.setCoalescing(true);        // Convert CDATA to Text nodes
            factory.setNamespaceAware(false);   // No namespaces: this is default
            factory.setValidating(false);       // Don't validate DTD: also default

            // Now use the factory to create a DOM parser, a.k.a. DocumentBuilder
            DocumentBuilder parser = factory.newDocumentBuilder();

            // Parse the file and build a Document tree to represent its content
            Document document = parser.parse(new File(args[0]));

            // Ask the document for a list of all <sect1> elements it contains
            NodeList sections = document.getElementsByTagName("sect1");

            // Loop through those <sect1> elements one at a time
            int numSections = sections.getLength();
            for(int i = 0; i < numSections; i++) {
                Element section = (Element)sections.item(i);  // A <sect1>

                // The first Element child of each <sect1> should be a <title>
                // element, but there may be some whitespace Text nodes first, so
                // loop through the children until you find the first element
                // child.
                Node title = section.getFirstChild();
                while(title != null && title.getNodeType() != Node.ELEMENT_NODE)
                    title = title.getNextSibling();

                // Print the text contained in the Text node child of this element.
                // Skip the element if it is missing or has no children.
                if (title != null && title.getFirstChild() != null)
                    System.out.println(title.getFirstChild().getNodeValue());
            }
        }
    }

5.12.3. Transforming XML Documents
The javax.xml.transform package defines a TransformerFactory class for creating Transformer objects. A transformer can transform a document from its Source representation into a new Result representation and optionally apply an XSLT transformation to the document content in the process. Three subpackages define concrete implementations of the Source and Result interfaces, which allow documents to be transformed among three representations: streams of XML text (javax.xml.transform.stream), DOM document trees (javax.xml.transform.dom), and SAX parser events (javax.xml.transform.sax).
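To make the idea of moving between representations concrete, here is a minimal sketch that uses a Transformer with no stylesheet to copy a stream of XML text into a DOM tree, using StreamSource from javax.xml.transform.stream and DOMResult from javax.xml.transform.dom. The class name StreamToDomSketch and the helper method toDom() are hypothetical. The cast to Document assumes that the transformer populates an empty DOMResult with a Document node, which is the usual behavior but is stated here as an assumption.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical example class; not part of any library
class StreamToDomSketch {
    // Convert XML text (stream representation) into a DOM tree
    static Document toDom(String xml) throws Exception {
        // No stylesheet argument: the transformer copies its input unchanged
        Transformer identity = TransformerFactory.newInstance().newTransformer();

        // An empty DOMResult asks the transformer to build a new tree
        DOMResult result = new DOMResult();
        identity.transform(new StreamSource(new StringReader(xml)), result);

        // Assumption: the node created for an empty DOMResult is a Document
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom("<book><chapter/><chapter/></book>");
        System.out.println(doc.getDocumentElement().getTagName());  // prints book
    }
}
```

Swapping the Source and Result classes is all it takes to go in another direction; only the two constructor calls change, not the call to transform().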
The following code shows one use of these packages to transform the representation of a document from a DOM Document tree into a stream of XML text. An interesting feature of this code is that it does not create the Document tree by parsing a file; instead, it builds it up from scratch.

    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.stream.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class DOMToStream {
        public static void main(String[] args)
            throws ParserConfigurationException,
                   TransformerConfigurationException,
                   TransformerException
        {
            // Create a DocumentBuilderFactory and a DocumentBuilder
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();

            // Instead of parsing an XML document, however, just create an empty
            // document that you can build up yourself.
            Document document = db.newDocument();

            // Now build a document tree using DOM methods
            Element book = document.createElement("book");  // Create new element
            book.setAttribute("id", "javanut4");            // Give it an attribute
            document.appendChild(book);                     // Add to the document
            for(int i = 1; i <= 3; i++) {                   // Add more elements
                Element chapter = document.createElement("chapter");
                Element title = document.createElement("title");
                title.appendChild(document.createTextNode("Chapter " + i));
                chapter.appendChild(title);
                chapter.appendChild(document.createElement("para"));
                book.appendChild(chapter);
            }

            // Now create a TransformerFactory and use it to create a Transformer
            // object to transform our DOM document into a stream of XML text.
            // No arguments to newTransformer() means no XSLT stylesheet
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();

            // Create the Source and Result objects for the transformation
            DOMSource source = new DOMSource(document);          // DOM document
            StreamResult result = new StreamResult(System.out);  // to XML text

            // Finally, do the transformation
            transformer.transform(source, result);
        }
    }

The most interesting uses of javax.xml.transform involve XSLT stylesheets. XSLT is a complex but powerful XML grammar that describes how XML document content should be converted to another form (e.g., XML, HTML, or plain text). A tutorial on XSLT stylesheets is beyond the scope of this book, but the following code (which contains only six key lines) shows how you can apply such a stylesheet (which is an XML document itself) to another XML document and write the resulting document to a stream:

    import java.io.*;
    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    public class Transform {
        public static void main(String[] args)
            throws TransformerConfigurationException,
                   TransformerException
        {
            // Get Source and Result objects for input, stylesheet, and output
            StreamSource input = new StreamSource(new File(args[0]));
            StreamSource stylesheet = new StreamSource(new File(args[1]));
            StreamResult output = new StreamResult(new File(args[2]));

            // Create a transformer and perform the transformation
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer(stylesheet);
            transformer.transform(input, output);
        }
    }

5.12.4. Validating XML Documents
The javax.xml.validation package allows you to validate XML documents against a schema.
SAX and DOM parsers obtained from the javax.xml.parsers package can perform validation against a DTD during the parsing process, but this package separates validation from parsing and also provides general support for arbitrary schema types. All implementations must support W3C XML Schema and are allowed to support other schema types, such as RELAX NG.

To use this package, begin with a SchemaFactory instance, a parser for a specific type of schema. Use this parser to parse a schema file into a Schema object. Obtain a Validator from the Schema, and then use the Validator to validate your XML document. The document is specified as a SAXSource or DOMSource object. You may recall these classes from the subpackages of javax.xml.transform.

If the document is valid, the validate() method of the Validator object returns normally. If it is not valid, validate() throws a SAXException. You can install an org.xml.sax.ErrorHandler object for the Validator to provide some control over the kinds of validation errors that cause exceptions.

    import javax.xml.XMLConstants;
    import javax.xml.validation.*;
    import javax.xml.transform.sax.SAXSource;
    import org.xml.sax.*;
    import java.io.*;

    public class Validate {
        public static void main(String[] args) throws IOException {
            File documentFile = new File(args[0]);  // 1st arg is document
            File schemaFile = new File(args[1]);    // 2nd arg is schema

            // Get a parser to parse W3C schemas. Note use of javax.xml package
            // This package contains just one class of constants.
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

            // Now parse the schema file to create a Schema object
            Schema schema = null;
            try { schema = factory.newSchema(schemaFile); }
            catch(SAXException e) { fail(e); }

            // Get a Validator object from the Schema.
            Validator validator = schema.newValidator();

            // Get a SAXSource object for the document
            // We could use a DOMSource here as well
            // Identifying the document by system ID (rather than by a Reader)
            // lets the parser detect the document encoding and lets fail()
            // report the document name with any error.
            SAXSource source =
                new SAXSource(new InputSource(documentFile.toURI().toString()));

            // Now validate the document
            try { validator.validate(source); }
            catch(SAXException e) { fail(e); }

            System.err.println("Document is valid");
        }

        static void fail(SAXException e) {
            if (e instanceof SAXParseException) {
                SAXParseException spe = (SAXParseException) e;
                System.err.printf("%s:%d:%d: %s%n",
                                  spe.getSystemId(), spe.getLineNumber(),
                                  spe.getColumnNumber(), spe.getMessage());
            }
            else {
                System.err.println(e.getMessage());
            }
            System.exit(1);
        }
    }

5.12.5. Evaluating XPath Expressions
XPath is a language for referring to specific nodes in an XML document. For example, the XPath expression "//section/title/text()" refers to the text inside of a <title> element inside a <section> element at any depth within the document. A full description of the XPath language is beyond the scope of this book. The javax.xml.xpath package, new in Java 5.0, provides a way to find all nodes in a document that match an XPath expression.

    import javax.xml.xpath.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class XPathEvaluator {
        public static void main(String[] args)
            throws ParserConfigurationException, XPathExpressionException,
                   org.xml.sax.SAXException, java.io.IOException
        {
            String documentName = args[0];
            String expression = args[1];

            // Parse the document to a DOM tree
            // XPath can also be used with a SAX InputSource
            DocumentBuilder parser =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new java.io.File(documentName));

            // Get an XPath object to evaluate the expression.
            // This form of evaluate() returns the result as a String.
            XPath xpath = XPathFactory.newInstance().newXPath();
            System.out.println(xpath.evaluate(expression, doc));

            // Or evaluate the expression to obtain a DOM NodeList of all matching
            // nodes. Then loop through each of the resulting nodes
            NodeList nodes = (NodeList)xpath.evaluate(expression, doc,
                                                      XPathConstants.NODESET);
            for(int i = 0, n = nodes.getLength(); i < n; i++) {
                Node node = nodes.item(i);
                System.out.println(node);
            }
        }
    }
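As a further sketch, the code below evaluates the example expression "//section/title/text()" directly against a SAX InputSource, which the comment in the previous example mentions as an alternative to a DOM tree. It compiles the expression once with XPath.compile() and collects the value of each matching text node. The class name XPathTitlesSketch, the helper method titles(), and the sample document are hypothetical, introduced only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of any library
class XPathTitlesSketch {
    // Return the text of every <title> that is a child of a <section>
    static List<String> titles(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Compile once; the XPathExpression can be reused for other documents
        XPathExpression expr = xpath.compile("//section/title/text()");

        // Evaluate against an InputSource instead of a pre-built DOM tree
        InputSource source = new InputSource(new StringReader(xml));
        NodeList nodes = (NodeList) expr.evaluate(source, XPathConstants.NODESET);

        // Each match is a Text node, so its node value is the title text
        List<String> result = new ArrayList<String>();
        for (int i = 0, n = nodes.getLength(); i < n; i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>"
            + "<section><title>Intro</title></section>"
            + "<section><title>Usage</title></section>"
            + "</doc>";
        System.out.println(titles(xml));
    }
}
```

Extracting getNodeValue() from each node gives plain strings, whereas printing a Node directly relies on the toString() output of whatever DOM implementation is in use.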