The Text Interface


The Text interface represents a text node. This can be a child of an element, an attribute, or an entity reference. When a document is built by a parser, each text node will contain the longest possible run of contiguous parsed character data from the document, and thus no text node will be adjacent to any other. By contrast, documents built in memory may contain adjacent text nodes. Invoking the normalize() method in the Node interface on any ancestor of the text nodes will merge these together.
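The merging behavior can be sketched with a small in-memory document. This is a minimal illustration, not part of the original examples: the class name NormalizeDemo, the greeting element, and the helper method countBeforeAndAfter() are all hypothetical, and the sketch assumes the default JAXP parser bundled with the JDK.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {

  // Builds <greeting> with two adjacent text nodes, then normalizes
  // the document. Returns {child count before, child count after}.
  public static int[] countBeforeAndAfter() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().newDocument();
      Element greeting = doc.createElement("greeting");
      doc.appendChild(greeting);

      // Two separate appendChild() calls leave two adjacent text nodes
      greeting.appendChild(doc.createTextNode("Hello, "));
      greeting.appendChild(doc.createTextNode("world"));
      int before = greeting.getChildNodes().getLength();

      // The document node is an ancestor of both text nodes
      doc.normalize();
      int after = greeting.getChildNodes().getLength();

      return new int[] {before, after};
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] counts = countBeforeAndAfter();
    System.out.println(counts[0] + " text nodes before normalize(), "
        + counts[1] + " after");
  }
}
```

Calling normalize() on the greeting element itself would have the same effect here, because the method works on the full subtree beneath whatever node it is invoked on.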

Example 11.9 summarizes the Text interface. In addition to methods such as setData() and getNodeValue() that Text inherits from its superinterfaces, it has one new method that splits a Text object into two.

Example 11.9 The Text Interface
 package org.w3c.dom;

 public interface Text extends CharacterData {

   public Text splitText(int offset) throws DOMException;

 }

The splitText() method splits one text node into two by dividing its data at a specified offset. The original node keeps the characters before the offset. The characters from the offset onward are moved into a new text node, which is returned. If the original node has a parent, the new node is inserted as its next sibling, so both nodes are included in the tree. If the offset is less than zero or greater than the length of the data, then splitText() throws a DOMException with the code INDEX_SIZE_ERR.
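A short sketch makes these rules concrete. The class SplitTextDemo, its helper methods, and the p element are hypothetical names introduced only for illustration, and the sketch assumes the JAXP parser that ships with the JDK.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class SplitTextDemo {

  // Hypothetical helper: creates <p>data</p> and returns its text node
  private static Text textNode(String data) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().newDocument();
      Element p = doc.createElement("p");
      doc.appendChild(p);
      Text text = doc.createTextNode(data);
      p.appendChild(text);
      return text;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns {data left in the original node, data in the new node,
  //          "true" if the new node is the original's next sibling}
  public static String[] split(String data, int offset) {
    Text head = textNode(data);
    Text tail = head.splitText(offset);
    boolean adjacent = head.getNextSibling() == tail;
    return new String[] {
        head.getData(), tail.getData(), String.valueOf(adjacent)};
  }

  // True if splitText() rejects the offset with INDEX_SIZE_ERR
  public static boolean rejectsOffset(String data, int offset) {
    try {
      textNode(data).splitText(offset);
      return false;
    } catch (DOMException e) {
      return e.code == DOMException.INDEX_SIZE_ERR;
    }
  }

  public static void main(String[] args) {
    String[] parts = split("2002-01-08", 4);
    System.out.println("original: " + parts[0]);
    System.out.println("new node: " + parts[1]);
    System.out.println("siblings: " + parts[2]);
    System.out.println("offset 11 rejected: "
        + rejectsOffset("2002-01-08", 11));
  }
}
```

Note that an offset equal to the length of the data is legal. It simply produces an empty new text node.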

The main reason to split a text node is so that you can move or delete part of some text, but not the entire node. You also can use it to insert a new node in the middle of a run of text. For example, suppose date is an Element object representing this element:

 <date>2002-01-08</date> 

Now suppose you want to change date to represent this element:

 <date><year>2002</year><month>01</month><day>08</day></date> 

The following code will do it:

 Document document = date.getOwnerDocument();

 // Split "2002-01-08" into "2002", "-", "01", "-", and "08"
 Text yearText = (Text) date.getFirstChild();
 Text hyphen = yearText.splitText(4);
 Text monthText = hyphen.splitText(1);
 Text nextHyphen = monthText.splitText(2);
 Text dayText = nextHyphen.splitText(1);

 Element year = document.createElement("year");
 Element month = document.createElement("month");
 Element day = document.createElement("day");

 // Detach all five text nodes; the two hyphens are discarded
 date.removeChild(hyphen);
 date.removeChild(monthText);
 date.removeChild(yearText);
 date.removeChild(nextHyphen);
 date.removeChild(dayText);

 // Reattach the three data nodes inside their new elements
 year.appendChild(yearText);
 month.appendChild(monthText);
 day.appendChild(dayText);
 date.appendChild(year);
 date.appendChild(month);
 date.appendChild(day);

Much of the time, these operations can be more easily implemented through String methods.
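As a rough sketch of the String-based alternative, the same restructuring can be done by reading the node value, splitting it with String.split(), and building fresh text nodes. The class name DateRestructurer and its methods are hypothetical, and the code assumes the date element contains exactly one text node of the form YYYY-MM-DD.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DateRestructurer {

  // Replaces the text of date (assumed to look like "2002-01-08")
  // with <year>, <month>, and <day> child elements
  public static void restructure(Element date) {
    Document document = date.getOwnerDocument();
    String[] parts = date.getFirstChild().getNodeValue().split("-");
    String[] names = {"year", "month", "day"};

    // Remove the original text node (and anything else present)
    while (date.hasChildNodes()) {
      date.removeChild(date.getFirstChild());
    }

    for (int i = 0; i < names.length; i++) {
      Element child = document.createElement(names[i]);
      child.appendChild(document.createTextNode(parts[i]));
      date.appendChild(child);
    }
  }

  // Builds <date>2002-01-08</date>, restructures it, and
  // summarizes the resulting children as name=value pairs
  public static String demo() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().newDocument();
      Element date = doc.createElement("date");
      doc.appendChild(date);
      date.appendChild(doc.createTextNode("2002-01-08"));

      restructure(date);

      StringBuilder summary = new StringBuilder();
      for (Node n = date.getFirstChild(); n != null;
           n = n.getNextSibling()) {
        if (summary.length() > 0) summary.append(',');
        summary.append(n.getNodeName()).append('=')
               .append(n.getFirstChild().getNodeValue());
      }
      return summary.toString();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The trade-off is that this version creates new text nodes rather than reusing the pieces of the original one, which matters only if other code holds references to the original node.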

Example 11.10 is a simple program that recursively descends a DOM tree and prints all text nodes on System.out . This has the effect of stripping out the markup while leaving all text inside the document intact.

Example 11.10 Printing the Text Nodes in an XML Document
 import javax.xml.parsers.*;
 import org.w3c.dom.*;
 import org.xml.sax.SAXException;
 import java.io.IOException;

 public class DOMTextExtractor {

   // Print the data of a node if it is a text node
   public void processNode(Node node) {
     if (node instanceof Text) {
       Text text = (Text) node;
       String data = text.getData();
       System.out.println(data);
     }
   }

   // note use of recursion
   public void followNode(Node node) {
     processNode(node);
     if (node.hasChildNodes()) {
       NodeList children = node.getChildNodes();
       for (int i = 0; i < children.getLength(); i++) {
         followNode(children.item(i));
       }
     }
   }

   public static void main(String[] args) {

     if (args.length <= 0) {
       System.out.println("Usage: java DOMTextExtractor URL");
       return;
     }
     String url = args[0];

     try {
       DocumentBuilderFactory factory
        = DocumentBuilderFactory.newInstance();

       // If expandEntityReferences isn't turned off, there
       //  won't be any entity reference nodes in the DOM tree.
       // This must be set before the parser is created; factory
       //  settings do not affect an existing DocumentBuilder.
       factory.setExpandEntityReferences(false);
       DocumentBuilder parser = factory.newDocumentBuilder();

       // Read the document
       Document document = parser.parse(url);

       // Process the document
       DOMTextExtractor extractor = new DOMTextExtractor();
       extractor.followNode(document);
     }
     catch (SAXException e) {
       System.out.println(url + " is not well-formed.");
     }
     catch (IOException e) {
       System.out.println(
        "Due to an IOException, the parser could not check " + url
       );
     }
     catch (FactoryConfigurationError e) {
       System.out.println("Could not locate a factory class");
     }
     catch (ParserConfigurationException e) {
       System.out.println("Could not locate a JAXP parser");
     }

   } // end main

 }

Here is the result of running the XML specification through this program:

 D:\books\XMLJAVA> java DOMTextExtractor http://www.w3.org/TR/2000/REC-xml-20001006.xml
 Extensible Markup Language (XML) 1.0 (Second Edition)
 REC-xml-20001006
 W3C Recommendation
 6 October 2000
 ...

Notice that white space is included in text nodes and is significant. Text inside entity references is also found, one way or another. If the DOM parser is producing entity reference nodes, then the replacement text of the entity becomes children of the entity reference nodes. Otherwise, the replacement text of the entity is simply resolved into the surrounding text nodes.
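The point about white space can be checked with a small sketch that counts the text nodes directly beneath a root element. The class WhitespaceDemo and its method countTextChildren() are hypothetical, and the sketch assumes a default, nonvalidating DocumentBuilderFactory that has not been told to ignore element content white space.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WhitespaceDemo {

  // Parses the XML string and counts the Text nodes that are
  // direct children of the root element
  public static int countTextChildren(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList children = doc.getDocumentElement().getChildNodes();
      int count = 0;
      for (int i = 0; i < children.getLength(); i++) {
        if (children.item(i) instanceof Text) count++;
      }
      return count;
    } catch (ParserConfigurationException | SAXException
        | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String indented =
        "<date>\n  <year>2002</year>\n  <month>01</month>\n</date>";
    String compact =
        "<date><year>2002</year><month>01</month></date>";
    System.out.println("indented: " + countTextChildren(indented));
    System.out.println("compact:  " + countTextChildren(compact));
  }
}
```

In the indented document, the line breaks and spaces before, between, and after the child elements each form a text node of their own, so a program like DOMTextExtractor prints them along with the visible text.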



Processing XML with Java. A Guide to SAX, DOM, JDOM, JAXP, and TrAX
ISBN: 0201771861
EAN: 2147483647
Year: 2001
Pages: 191