The NamedNodeMap Interface | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

If for some reason you want all the attributes of an element, or you don't know their names, you can use the getAttributes() method inherited from Node to retrieve a NamedNodeMap.^[3] The NamedNodeMap interface, summarized in Example 11.5, has methods to get and set the various named nodes as well as to iterate through the nodes as a list. Here it's used for attributes, but soon you'll see it used for notations and entities as well.

^[3] Why getAttributes() is in Node instead of Element I have no idea. Elements are the only kind of node that can have attributes. For all other types of node, getAttributes() returns null.

Example 11.5 The NamedNodeMap Interface

    package org.w3c.dom;

    public interface NamedNodeMap {

      // for iterating through the map as a list
      public Node item(int index);
      public int  getLength();

      // For working with particular items in the list
      public Node getNamedItem(String name);
      public Node setNamedItem(Node arg) throws DOMException;
      public Node removeNamedItem(String name)
       throws DOMException;
      public Node getNamedItemNS(String namespaceURI,
       String localName);
      public Node setNamedItemNS(Node arg) throws DOMException;
      public Node removeNamedItemNS(String namespaceURI,
       String localName) throws DOMException;

    }
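As a small illustration of the list-style methods in Example 11.5, the following sketch parses a short document and walks the root element's attributes with getLength() and item(). The class name AttributeLister, the method listAttributes(), and the sample id and lang attributes are made up for this illustration. Note that a NamedNodeMap does not promise any particular order, so the attributes may not come back in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
public class AttributeLister {

  // Parses the XML text and returns one "name=value" string
  // for each attribute of the root element
  public static List<String> listAttributes(String xml)
      throws Exception {
    DocumentBuilderFactory factory
        = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder parser = factory.newDocumentBuilder();
    Document document
        = parser.parse(new InputSource(new StringReader(xml)));
    Element root = document.getDocumentElement();

    // getAttributes() is inherited from Node
    NamedNodeMap attributes = root.getAttributes();
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < attributes.getLength(); i++) {
      // Every node in an element's attribute map is an Attr
      Attr attribute = (Attr) attributes.item(i);
      result.add(attribute.getName() + "=" + attribute.getValue());
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    // Sample document invented for this sketch
    String xml
        = "<book id='b1' lang='en'>Processing XML with Java</book>";
    for (String pair : listAttributes(xml)) {
      System.out.println(pair);
    }
  }
}
```

In a namespace-aware parse, xmlns declarations also show up in this map as attributes, so real documents may list more entries than you expect.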

I'll demonstrate with an XLink spider program like the one in Chapter 6, this time implementing the program on top of DOM rather than SAX. You can judge for yourself which one is more natural.

Recall that XLink is an attribute-based syntax for denoting connections between documents. The element that is the link has an xlink:type attribute with the value simple, and an xlink:href attribute whose value is the URL of the remote document. For example, the following book element points to this book's home page:

    <book xlink:type="simple"
          xlink:href="http://www.cafeconleche.org/books/xmljava/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
      Processing XML with Java
    </book>

The customary prefix xlink is bound to the namespace URI http://www.w3.org/1999/xlink. It's usually advisable to depend on the URI and not the prefix, which may change.
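To make the point about depending on the URI concrete, here is a minimal sketch that reads the link target with getAttributeNS(), which matches on namespace URI and local name rather than on the prefix. The class name PrefixIndependence, the method xlinkHref(), and the xl prefix in the sample document are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
public class PrefixIndependence {

  public static final String XLINK_NAMESPACE
      = "http://www.w3.org/1999/xlink";

  // Returns the value of the root element's href attribute in the
  // XLink namespace, or the empty string if there is none.
  // The prefix used in the document does not matter.
  public static String xlinkHref(String xml) throws Exception {
    DocumentBuilderFactory factory
        = DocumentBuilderFactory.newInstance();
    // Without this, getAttributeNS() cannot match by namespace
    factory.setNamespaceAware(true);
    Element root = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
    return root.getAttributeNS(XLINK_NAMESPACE, "href");
  }

  public static void main(String[] args) throws Exception {
    // Same link as the book element above, but with the made-up
    // prefix xl instead of the customary xlink
    String xml = "<book xl:type='simple'"
        + " xl:href='http://www.cafeconleche.org/books/xmljava/'"
        + " xmlns:xl='http://www.w3.org/1999/xlink'>"
        + "Processing XML with Java</book>";
    System.out.println(xlinkHref(xml));
  }
}
```

A lookup by qualified name, such as getAttribute("xlink:href"), would miss this link because the document happens to use a different prefix.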

Relative URLs are relative to the nearest ancestor xml:base attribute if one is present, or the location of the document otherwise. For example, the book element in this library element also points to http://www.cafeconleche.org/books/xmljava/.

    <library xml:base="http://www.cafeconleche.org/"
             xmlns:xlink="http://www.w3.org/1999/xlink">
      <book xlink:type="simple" xlink:href="books/xmljava/">
        Processing XML with Java
      </book>
    </library>

The prefix xml is bound to the namespace URI http://www.w3.org/XML/1998/namespace. This is a special case, however. The xml prefix cannot be changed, and it does not need to be declared.
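The lookup rule for relative URLs can be sketched as a walk up the tree from the linking element until an xml:base attribute turns up. The class name BaseResolver and its methods are hypothetical, and the sketch makes a simplifying assumption: every xml:base value is itself an absolute URL. A fuller treatment would also resolve relative xml:base values against the enclosing base.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
public class BaseResolver {

  public static final String XML_NAMESPACE
      = "http://www.w3.org/XML/1998/namespace";
  public static final String XLINK_NAMESPACE
      = "http://www.w3.org/1999/xlink";

  // Returns the xml:base of the element or its nearest ancestor
  // that has one, or documentLocation if no such attribute exists.
  // Assumption: xml:base values are absolute URLs.
  public static String nearestBase(Element element,
      String documentLocation) {
    Node node = element;
    while (node != null
        && node.getNodeType() == Node.ELEMENT_NODE) {
      String base
          = ((Element) node).getAttributeNS(XML_NAMESPACE, "base");
      if (!base.equals("")) return base;
      node = node.getParentNode();
    }
    return documentLocation;
  }

  // Finds the first element with an xlink:href attribute and
  // returns its target as an absolute URL, or null if none
  public static String resolveFirstLink(String xml,
      String documentLocation) throws Exception {
    DocumentBuilderFactory factory
        = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document document = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    NodeList all = document.getElementsByTagName("*");
    for (int i = 0; i < all.getLength(); i++) {
      Element element = (Element) all.item(i);
      String href = element.getAttributeNS(XLINK_NAMESPACE, "href");
      if (!href.equals("")) {
        URI base = URI.create(nearestBase(element, documentLocation));
        return base.resolve(href).toString();
      }
    }
    return null;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<library xml:base='http://www.cafeconleche.org/'"
        + " xmlns:xlink='http://www.w3.org/1999/xlink'>"
        + "<book xlink:type='simple' xlink:href='books/xmljava/'>"
        + "Processing XML with Java</book></library>";
    // The second argument is a made-up document location
    System.out.println(
        resolveFirstLink(xml, "http://example.com/docs/library.xml"));
  }
}
```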

Attributes provide all of the information needed to process the link. Consequently, a spider can follow XLinks without knowing any details about the rest of the markup in the document. Example 11.6 is such a program. Currently this spider does nothing more than follow the links and print their URLs. It would not be hard to add code to load the discovered documents into a database or perform some other useful operation, however. You would simply subclass DOMSpider while overriding the process() method.

Example 11.6 An XLink Spider That Uses DOM

    import org.xml.sax.SAXException;
    import javax.xml.parsers.*;
    import java.io.*;
    import java.util.*;
    import java.net.*;
    import org.w3c.dom.*;


    public class DOMSpider {

      public static String XLINK_NAMESPACE
       = "http://www.w3.org/1999/xlink";

      // This will be used to read all the documents. We could use
      // multiple parsers in parallel. However, it's a lot easier
      // to work in a single thread, and doing so puts some real
      // limits on how much bandwidth this program will eat.
      private DocumentBuilder parser;

      // Builds the parser
      public DOMSpider() throws ParserConfigurationException {

        try {
          DocumentBuilderFactory factory
           = DocumentBuilderFactory.newInstance();
          factory.setNamespaceAware(true);
          parser = factory.newDocumentBuilder();
        }
        catch (FactoryConfigurationError e) {
          // I don't absolutely need to catch this, but I hate to
          // throw an Error for no good reason.
          throw new ParserConfigurationException(
           "Could not locate a factory class");
        }

      }

      // store the URLs already visited
      private Vector visited = new Vector();

      // Limit the amount of bandwidth this program uses
      private int maxDepth = 5;
      private int currentDepth = 0;

      public void spider(String systemID) {

        currentDepth++;
        try {
          if (currentDepth < maxDepth) {
            Document document = parser.parse(systemID);
            process(document, systemID);

            Vector toBeVisited = new Vector();
            // search the document for URIs,
            // store them in vector, and print them
            findLinks(document.getDocumentElement(),
             toBeVisited, systemID);

            Enumeration e = toBeVisited.elements();
            while (e.hasMoreElements()) {
              String uri = (String) e.nextElement();
              visited.add(uri);
              spider(uri);
            }

          }

        }
        catch (SAXException e) {
          // Couldn't load the document,
          // probably not well-formed XML, skip it
        }
        catch (IOException e) {
          // Couldn't load the document,
          // likely network failure, skip it
        }
        finally {
          currentDepth--;
          System.out.flush();
        }

      }

      public void process(Document document, String uri) {
        System.out.println(uri);
      }

      // Recursively descend the tree of one document
      private void findLinks(Element element, List uris,
       String base) {

        // Check for an xml:base attribute
        String baseAtt = element.getAttribute("xml:base");
        if (!baseAtt.equals(""))  base = baseAtt;

        // look for XLinks in this element
        if (isSimpleLink(element)) {
          String uri
           = element.getAttributeNS(XLINK_NAMESPACE, "href");
          if (!uri.equals("")) {
            try {
              String wholePage = absolutize(base, uri);
              if (!visited.contains(wholePage)
               && !uris.contains(wholePage)) {
                uris.add(wholePage);
              }
            }
            catch (MalformedURLException e) {
              // If it's not a good URL, then we can't spider it
              // anyway, so just drop it on the floor.
            }
          } // end if
        } // end if

        // process child elements recursively
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          Node node = children.item(i);
          int type = node.getNodeType();
          if (type == Node.ELEMENT_NODE) {
            findLinks((Element) node, uris, base);
          }
        } // end for

      }

      // If you're willing to require Java 1.4, you can do better
      // than this with the new java.net.URI class
      private static String absolutize(String context, String uri)
       throws MalformedURLException {

        URL contextURL = new URL(context);
        URL url = new URL(contextURL, uri);
        // Remove fragment identifier if any
        String wholePage = url.toExternalForm();
        int fragmentSeparator = wholePage.indexOf('#');
        if (fragmentSeparator != -1) {
          // There is a fragment identifier
          wholePage = wholePage.substring(0, fragmentSeparator);
        }
        return wholePage;

      }

      private static boolean isSimpleLink(Element element) {

        String type
         = element.getAttributeNS(XLINK_NAMESPACE, "type");
        if (type.equals("simple")) return true;
        return false;

      }

      public static void main(String[] args) {

        if (args.length == 0) {
          System.out.println("Usage: java DOMSpider topURL");
          return;
        }

        // start parsing...
        try {
          DOMSpider spider = new DOMSpider();
          spider.spider(args[0]);
        }
        catch (Exception e) {
          System.err.println(e);
          e.printStackTrace();
        }

      } // end main

    } // end DOMSpider

There are two levels of recursion here. The spider() method recursively spiders documents. The findLinks() method recursively searches through the elements in a document looking for XLinks. It adds the URLs found in these links to a list of unvisited pages. As each of these documents is finished, the next document is retrieved from the list and processed in turn. If it's an XML document, then it is parsed and passed to the process() method. Non-XML documents found at the end of XLinks are ignored.
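A comment in Example 11.6 notes that absolutize() could be written with the java.net.URI class in Java 1.4 and later. The following is one possible shape for that variant, offered as a sketch rather than as the book's own code. The class name UriAbsolutizer is made up, and the method throws URISyntaxException where the original throws MalformedURLException. One practical difference is that URI does not need a protocol handler for the scheme, so links with schemes that java.net.URL does not recognize can still be resolved.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical variant of absolutize() from Example 11.6
public class UriAbsolutizer {

  // Resolves uri against context and strips any fragment identifier
  public static String absolutize(String context, String uri)
      throws URISyntaxException {
    URI contextURI = new URI(context);
    // resolve() also normalizes ./ and ../ path segments
    URI resolved = contextURI.resolve(new URI(uri));
    String wholePage = resolved.toString();
    // In a syntactically valid URI, the only unescaped '#' is the
    // one that introduces the fragment identifier
    int fragmentSeparator = wholePage.indexOf('#');
    if (fragmentSeparator != -1) {
      wholePage = wholePage.substring(0, fragmentSeparator);
    }
    return wholePage;
  }

  public static void main(String[] args) throws URISyntaxException {
    // The #toc fragment is invented for this illustration
    System.out.println(absolutize(
        "http://www.cafeconleche.org/", "books/xmljava/#toc"));
  }
}
```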

I tested this program by pointing it at the Resource Directory Description Language (RDDL) [http://www.rddl.org/] specification, which is one of the few real-world documents I know of that uses XLinks. I was surprised to find out just how much XLinked XML there is out there, although as of yet most of it is simply more XML specifications. This must be what the Web felt like circa 1991. Here's a sample of the more interesting output:

    D:\books\XMLJAVA> java DOMSpider http://www.rddl.org/
    http://www.rddl.org/
    http://www.rddl.org/purposes
    http://www.rddl.org/purposes/software
    http://www.rddl.org/rddl.rdfs
    http://www.rddl.org/rddl-integration.rxg
    http://www.rddl.org/modules/rddl-1.rxm
    ...
    http://www.w3.org/2001/XMLSchema
    http://www.w3.org/2001/XMLSchema.xsd
    http://www.examplotron.org
    http://www.examplotron.org/compile.xsl
    http://www.examplotron.org/examplotron.xsd
    http://www.examplotron.org/0/1/
    http://www.examplotron.org/0/2/
    http://www.examplotron.org/0/3/
    http://webns.net/rdfs/
    http://www.w3.org/2000/01/rdf-schema
    http://webns.net/rdfs/?format=rdf
    http://webns.net/foaf/
    http://xmlns.com/foaf/0.1/
    http://webns.net/foaf/?format=rdf
    http://webns.net/dc/
    http://purl.org/dc/elements/1.1/
    http://webns.net/dc/?format=rdf
    http://openhealth.org/XSet
    http://xsltunit.org/0/1/
    http://xsltunit.org/0/1/xsltunit.xsl
    http://xsltunit.org/0/1/tst_library.xsl
    http://xsltunit.org/0/1/library.xml
    http://xsltunit.org/0/1/library.xsl
    http://venetica.com/venicebridgecontent/
    http://www.venetica.com/VeniceBridgeContent
    http://www.venetica.com/VeniceBridgeContent/
      VeniceBridgeContent40.xsd
    http://www.venetica.com/VeniceBridgeContent/VeniceBridgeContent.biz
    http://www.venetica.com/VeniceBridgeContent/rddl30.html
    http://www.w3.org/TR/xhtml-basic
    http://www.w3.org/TR/xml-infoset/
    http://www.w3.org/TR/xhtml-modularization/