The CharacterData Interface | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The CharacterData interface is a generic superinterface for nodes that are composed mostly of text, including Text, CDATASection, and Comment. You almost never encounter an object that is only a CharacterData; each such object is an instance of one of these three subinterfaces. Even so, you almost always work with text, CDATA section, and comment nodes through the methods of the CharacterData interface.

Example 11.7 summarizes the CharacterData interface. This interface has methods that manipulate the text content of the node. As usual, it also inherits all the methods of its superinterface Node, such as getParentNode() and getNodeValue().

Example 11.7 The CharacterData Interface

 package org.w3c.dom;

 public interface CharacterData extends Node {

   public String getData() throws DOMException;
   public void   setData(String data) throws DOMException;
   public int    getLength();
   public String substringData(int offset, int length)
    throws DOMException;
   public void   appendData(String data) throws DOMException;
   public void   insertData(int offset, String data)
    throws DOMException;
   public void   deleteData(int offset, int length)
    throws DOMException;
   public void   replaceData(int offset, int length, String data)
    throws DOMException;

 }

The getData() method returns a String containing the complete content of the node. Any escaped characters such as &amp;amp; or &amp;#xA0; are replaced by the actual characters they represent. The setData() method replaces the entire text content of the node. There's no need to escape the string you pass to this method. If the document is written out to a file or a stream, then the serialization code is responsible for escaping these characters. In memory, the type of the object is enough to determine whether a less-than sign is the start of a tag or just a less-than sign.
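As a minimal sketch of this behavior, the following class stores a string containing markup characters in a Text node and reads it back unchanged. The class name SetDataDemo and its helper methods are hypothetical names chosen for this illustration, and the sketch assumes a JAXP parser is available on the class path.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

// Hypothetical demo class; not part of the book's examples.
class SetDataDemo {

    // Creates an empty DOM document with whatever JAXP parser is installed.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No JAXP parser available", e);
        }
    }

    // Stores raw, unescaped text in a Text node and reads it back.
    static String roundTrip(String raw) {
        Document doc = newDocument();
        Text text = doc.createTextNode("");
        text.setData(raw);       // pass the literal characters, no escaping
        return text.getData();   // the same literal characters come back
    }

    public static void main(String[] args) {
        // The in-memory node holds "<" and "&" as plain characters.
        System.out.println(roundTrip("if (a < b && b > c)"));
    }
}
```

Escaping only becomes a concern when the tree is serialized, and that is the serializer's job rather than the caller's.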

There are also methods to read and write only parts of the text content. The offsets are all zero based, as in Java's String class. For example, the following code fragment deletes the first six characters from the CharacterData object text:

 text.deleteData(0, 6);
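The other partial-access methods follow the same zero-based offset convention. The sketch below chains several of them on a comment node. The class name PartialEditDemo, its helper methods, and the sample string are assumptions made for illustration only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the book's examples.
class PartialEditDemo {

    // Creates a Comment node (one kind of CharacterData) holding the given text.
    static CharacterData newComment(String data) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            return doc.createComment(data);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No JAXP parser available", e);
        }
    }

    // Applies a sequence of partial edits and returns the final content.
    static String edit() {
        CharacterData text = newComment("Hello, DOM world");
        text.deleteData(0, 7);           // now "DOM world"
        text.insertData(0, "The ");      // now "The DOM world"
        text.replaceData(8, 5, "tree");  // now "The DOM tree"
        text.appendData("!");            // now "The DOM tree!"
        return text.getData();
    }

    // Reads three characters starting at offset 4 of the edited content.
    static String middle() {
        CharacterData text = newComment(edit());
        return text.substringData(4, 3);
    }

    public static void main(String[] args) {
        System.out.println(edit());
        System.out.println(middle());
    }
}
```

Each method takes an offset and, where relevant, a count of char values rather than an end index, which differs from String.substring(int, int).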

Java's String type is a very good match for DOM strings. Each char in a Java String is a single UTF-16 code unit. Most Unicode characters are represented by exactly one Java char. However, characters with code points greater than 65,535, such as many musical symbols, are represented by two chars each, one for each half of the surrogate pair representing the character in UTF-16. The getLength() method in this interface returns the number of UTF-16 code units, not the number of Unicode characters. This is also how the length() method in Java's String class behaves.
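The difference is easy to see with a character outside the Basic Multilingual Plane. This sketch uses U+1D11E, the musical G clef symbol, and compares getLength() with a count of Unicode characters obtained from String.codePointCount(). The class name LengthDemo and its methods are hypothetical names used only here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the book's examples.
class LengthDemo {

    // Creates a Text node (one kind of CharacterData) holding the given string.
    static CharacterData newText(String data) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            return doc.createTextNode(data);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No JAXP parser available", e);
        }
    }

    // Length as DOM reports it: the number of 16-bit char values.
    static int domLength(String s) {
        return newText(s).getLength();
    }

    // Number of actual Unicode characters in the node's content.
    static int characterCount(String s) {
        String data = newText(s).getData();
        return data.codePointCount(0, data.length());
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair
        System.out.println(domLength(clef));      // 2
        System.out.println(characterCount(clef)); // 1
    }
}
```

Offsets passed to methods such as deleteData() count the same 16-bit units, so code that edits text containing such characters should take care not to split a surrogate pair.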

On Usenet, jokes that some people are likely to find offensive are often obscured by rotating the letters of the alphabet 13 places. That is, the first letter of the alphabet, A, is transformed into the fourteenth letter of the alphabet, N. The second letter of the alphabet, B, is transformed into the fifteenth letter of the alphabet, O, and so forth through M, which becomes Z. Then N is transformed into A, O into B, and so on through Z, which becomes M. Digits, punctuation, and white space are left alone. It's not a particularly strong cipher, but it's enough to prevent people from accidentally reading something they don't want to read. It has the extra advantage of reversing itself. That is, running the cipher text through the rot-13 algorithm one more time restores the original text.

Example 11.8 is a simple program that obscures text nodes, comments, and CDATA sections by applying the rot-13 algorithm to them. The encoded documents are as well-formed and valid as the original documents. Only the character data gets changed, not the markup. ^[4] This program can also decode documents that are already encoded.

^[4] ROT13XML could also encode attribute values and processing instructions without affecting well-formedness or validity, but because DOM does not represent these nodes as instances of CharacterData, I leave this as an exercise for the reader.

Example 11.8 Rot-13 Encoder for XML Documents

 import javax.xml.parsers.*;
 import javax.xml.transform.*;
 import javax.xml.transform.stream.StreamResult;
 import javax.xml.transform.dom.DOMSource;
 import org.w3c.dom.*;
 import org.xml.sax.SAXException;
 import java.io.IOException;

 public class ROT13XML {

   // note use of recursion
   public static void encode(Node node) {

     if (node instanceof CharacterData) {
       CharacterData text = (CharacterData) node;
       String data = text.getData();
       text.setData(rot13(data));
     }

     // recurse the children
     if (node.hasChildNodes()) {
       NodeList children = node.getChildNodes();
       for (int i = 0; i < children.getLength(); i++) {
         encode(children.item(i));
       }
     }

   }

   public static String rot13(String s) {

     StringBuffer out = new StringBuffer(s.length());
     for (int i = 0; i < s.length(); i++) {
       int c = s.charAt(i);
       if (c >= 'A' && c <= 'M') out.append((char) (c+13));
       else if (c >= 'N' && c <= 'Z') out.append((char) (c-13));
       else if (c >= 'a' && c <= 'm') out.append((char) (c+13));
       else if (c >= 'n' && c <= 'z') out.append((char) (c-13));
       else out.append((char) c);
     }
     return out.toString();

   }

   public static void main(String[] args) {

     if (args.length <= 0) {
       System.out.println("Usage: java ROT13XML URL");
       return;
     }
     String url = args[0];

     try {
       DocumentBuilderFactory factory
        = DocumentBuilderFactory.newInstance();
       DocumentBuilder parser = factory.newDocumentBuilder();

       // Read the document
       Document document = parser.parse(url);

       // Modify the document
       ROT13XML.encode(document);

       // Write it out again
       TransformerFactory xformFactory
        = TransformerFactory.newInstance();
       Transformer idTransform = xformFactory.newTransformer();
       Source input = new DOMSource(document);
       Result output = new StreamResult(System.out);
       idTransform.transform(input, output);
     }
     catch (SAXException e) {
       System.out.println(url + " is not well-formed.");
     }
     catch (IOException e) {
       System.out.println(
        "Due to an IOException, the parser could not encode " + url
       );
     }
     catch (FactoryConfigurationError e) {
       System.out.println("Could not locate a factory class");
     }
     catch (ParserConfigurationException e) {
       System.out.println("Could not locate a JAXP parser");
     }
     catch (TransformerConfigurationException e) {
       System.out.println("Could not locate a TrAX transformer");
     }
     catch (TransformerException e) {
       System.out.println("Could not transform");
     }

   } // end main

 }

The encode() method recursively descends the tree, applying the rot-13 algorithm to every CharacterData object it finds, whether a Comment, Text, or CDATASection. The algorithm itself is encapsulated in the rot13() method. Because both methods merely operate on their arguments but otherwise have no interaction with any state maintained in the class, I made them static. The main() method encodes a document at a URL typed on the command line, and then copies the result to System.out.
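To check that only character data changes and that a second pass restores the original, the traversal can be exercised on a small document built in memory. The following is a sketch rather than part of Example 11.8: the class Rot13RoundTrip, its sample() and describe() helpers, and the sample content are all hypothetical, and rot13() is a compact modular-arithmetic variant intended to behave the same as the four-branch version in the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class; not part of the book's examples.
class Rot13RoundTrip {

    // Rotates ASCII letters 13 places and leaves everything else alone.
    static String rot13(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') out.append((char) ('a' + (c - 'a' + 13) % 26));
            else if (c >= 'A' && c <= 'Z') out.append((char) ('A' + (c - 'A' + 13) % 26));
            else out.append(c);
        }
        return out.toString();
    }

    // Same idea as encode() in Example 11.8: rewrite every CharacterData node.
    static void encode(Node node) {
        if (node instanceof CharacterData) {
            CharacterData text = (CharacterData) node;
            text.setData(rot13(text.getData()));
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            encode(child);
        }
    }

    // Builds <joke> containing a comment, a text node, and a CDATA section.
    static Document sample() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No JAXP parser available", e);
        }
        Element root = doc.createElement("joke");
        doc.appendChild(root);
        root.appendChild(doc.createComment("Setup"));
        root.appendChild(doc.createTextNode("Hello"));
        root.appendChild(doc.createCDATASection("a < b"));
        return doc;
    }

    // Joins the root's tag name and each child's value with '|' for inspection.
    static String describe(Document doc) {
        Element root = doc.getDocumentElement();
        StringBuilder sb = new StringBuilder(root.getTagName());
        for (Node child = root.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            sb.append('|').append(child.getNodeValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document doc = sample();
        encode(doc);
        System.out.println(describe(doc)); // joke|Frghc|Uryyb|n < o
        encode(doc);
        System.out.println(describe(doc)); // joke|Setup|Hello|a < b
    }
}
```

The element name joke survives both passes untouched because Element is not a CharacterData, which is why the encoded document stays well-formed.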

Here's a joke encoded by this program. You'll have to run the program if you want to find out what it says. :-)

 D:\books\XMLJAVA> java ROT13XML joke.xml
 <?xml version="1.0" encoding="utf-8"?><joke>
   Gubhfnaqf bs crbcyr nggraq gur Oheavat Zna srfgviny rirel lrne
   va Arinqn'f Oynpx Ebpx Qrfreg. Guvf vf gur ovt uvccvr srfgviny,
   jurer crbcyr eha nebhaq anxrq, qevax, naq trg fgbarq,
   be nf Trbetr J. Ohfu yvxrf gb pnyy vg,
   trg ernql gb eha sbe cerfvqrag
 </joke>