DOM is a recommendation by the World Wide Web Consortium (W3C). It was designed as a language-neutral interface to an in-memory representation of an XML document, and implementations are available in Java, ECMAScript, Perl, and other languages.
While SAX defines an interface of handler methods, the DOM specification calls for a number of classes, each with an interface of methods that affect a particular type of XML markup. Thus, every object instance manages a portion of the document tree, providing accessor methods to add, remove, or modify nodes and data. These objects are typically created by a factory object, which makes life a little easier for programmers, who only have to initialize the factory object themselves.
In DOM, every piece of XML (an element, text, a comment, etc.) is a node represented by a node object. The Node class is extended by more specific classes that represent the types of XML markup, including Element, Attr (attribute), ProcessingInstruction, Comment, EntityReference, Text, CDATASection, and Document. These classes are the building blocks of every XML tree in DOM.
The standard also calls for a couple of classes that serve as containers for nodes, convenient for shuttling XML fragments from place to place. These classes are NodeList, an ordered list of nodes, such as all the children of an element; and NamedNodeMap, an unordered set of nodes accessed by name. These objects are frequently required as arguments to methods or given as return values. Note that these objects are all live, meaning that any change made to them immediately affects the nodes in the document itself, rather than a copy.
In naming these classes and their methods, DOM merely specifies the outward appearance of an implementation and leaves the internal specifics up to the developer. Particulars like memory management, data structures, and algorithms are not addressed at all, as those issues may vary among programming languages and the needs of users. This is like describing a key so a locksmith can make a lock that it will fit into; you know the key will unlock the door, but you have no idea how the lock really works.
In particular, specifying only the outward appearance makes it easy to write extensions to legacy modules so they can comply with the standard, but it does not guarantee efficiency or speed.
DOM is a very large standard, and you will find that implementations vary in their level of compliance. To make things worse, the standard has not one, but two (soon to be three) levels. DOM1 has been around since 1998, DOM2 emerged more recently, and the W3C is already working on a third. The main difference between Levels 1 and 2 is that the latter adds support for namespaces. If you aren't concerned about namespaces, then DOM1 should be suitable for your needs.
10.9.1 Class Interface Reference
In this section, I describe the interfaces specified in DOM.
10.9.1.1 Document
The Document class controls the overall document, creating new objects when requested and maintaining high-level information such as references to the document type declaration and the root element.
10.9.1.1.1 Properties
Following are the properties for the Document class.
10.9.1.1.2 Methods
Here are the methods for the Document class:
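As a rough illustration of the factory role played by the Document class, the sketch below uses Python's standard-library xml.dom.minidom module. The choice of Python and minidom is an assumption made for illustration (the chapter's own examples use Java and Perl), and the element names are invented.

```python
from xml.dom.minidom import getDOMImplementation

# Ask the DOM implementation for an empty document whose root
# element is <book>. No document type is supplied (third argument).
impl = getDOMImplementation()
doc = impl.createDocument(None, "book", None)

# The Document object creates every other kind of node on request.
root = doc.documentElement
title = doc.createElement("title")
title.appendChild(doc.createTextNode("Wood Sprites"))
root.appendChild(title)
root.appendChild(doc.createComment("draft"))

print(root.toxml())
# <book><title>Wood Sprites</title><!--draft--></book>
```

Note that nodes are never constructed directly; they are requested from the document and then attached to the tree with generic node methods.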
10.9.1.2 DocumentFragment
The DocumentFragment class is used to contain a document fragment. Its children are (zero or more) nodes representing the tops of XML trees. This class contrasts with Document, which has at most one child element, the document root, plus metadata like the document type. In this respect, DocumentFragment's content is not well-formed, though it must obey the XML well-formedness rules in all other respects (no illegal characters in text, etc.). No specific methods or properties are defined; use the generic node methods to access data.
10.9.1.3 DocumentType
This class holds all the information contained in the document type declaration at the beginning of the document, except the specifics about an external DTD. Thus, it names the root element and any declared entities or notations in the internal subset. No specific methods are defined for this class, but the properties are public (though read-only).
10.9.1.3.1 Properties
Here are the properties for the DocumentType class:
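The two classes just described can be tried out with a small sketch in Python's standard xml.dom.minidom module (an assumption made for illustration; the document content and entity are made up). It reads the root element name from the DocumentType node and then uses a DocumentFragment to carry two sibling elements into the tree.

```python
from xml.dom.minidom import parseString

# A document whose DOCTYPE names the root element and declares one
# entity in the internal subset.
xml_text = """<!DOCTYPE chapter [
  <!ENTITY sprite "wood sprite">
]>
<chapter><title>Habits</title></chapter>"""

doc = parseString(xml_text)

# DocumentType: read-only information from the declaration.
doctype = doc.doctype
print(doctype.name)                      # chapter

# DocumentFragment: a container for several top-level nodes.
frag = doc.createDocumentFragment()
for text in ("First.", "Second."):
    para = doc.createElement("para")
    para.appendChild(doc.createTextNode(text))
    frag.appendChild(para)

# Appending the fragment moves its children into the tree and
# leaves the fragment itself empty.
doc.documentElement.appendChild(frag)
print(len(doc.documentElement.getElementsByTagName("para")))   # 2
print(len(frag.childNodes))                                    # 0
```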
10.9.1.4 Node
All node types inherit from the class Node. Any properties or methods common to all node types can be accessed through this class. A few properties, such as the value of the node, are undefined for some node types, like Element. The generic methods of this class are useful in some programming contexts, such as when writing code that processes nodes of different types. At other times, you'll know in advance what type you're working with, and you should use the specific class's methods instead. All properties but nodeValue and prefix are read-only.
10.9.1.4.1 Properties
Here are the properties for the Node class:
10.9.1.4.2 Methods
Here are the methods for the Node class:
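To show how the generic Node interface lets one piece of code handle nodes of different types, here is a minimal sketch in Python's standard xml.dom.minidom (an assumption made for illustration; describe is a hypothetical helper, not part of DOM). It reads nodeType, nodeName, and nodeValue from mixed children, then rearranges the tree with cloneNode, insertBefore, and removeChild.

```python
from xml.dom.minidom import parseString, Node

doc = parseString("<para>Wood <emphasis>sprites</emphasis><!-- shy --></para>")
para = doc.documentElement

def describe(node):
    """Summarize any node using only generic Node properties."""
    return (node.nodeType, node.nodeName, node.nodeValue)

# One text node, one element, and one comment, handled uniformly.
summary = [describe(child) for child in para.childNodes]
print(summary)
# [(3, '#text', 'Wood '), (1, 'emphasis', None), (8, '#comment', ' shy ')]

# Generic methods work on any node type.
emphasis = para.childNodes[1]
copy = emphasis.cloneNode(True)             # deep copy, children included
para.insertBefore(copy, para.firstChild)    # put the copy first
para.removeChild(emphasis)                  # drop the original
print(para.toxml())
# <para><emphasis>sprites</emphasis>Wood <!-- shy --></para>
```

The element's nodeValue comes back as None, which matches the remark above that the value is undefined for some node types.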
10.9.1.5 NodeList
This class is a container for an ordered list of nodes. It is "live," meaning that any changes to the nodes it references will appear in the document immediately.
10.9.1.5.1 Properties
Here are the properties for the NodeList class:
10.9.1.5.2 Methods
Here are the methods for the NodeList class:
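The ordered access and the "live" behavior described for NodeList can be seen in a short sketch using Python's standard xml.dom.minidom (an assumption made for illustration). In minidom, the childNodes list of an element behaves this way; results returned by other methods, and other DOM implementations, may not.

```python
from xml.dom.minidom import parseString

doc = parseString("<list><item>a</item><item>b</item><item>c</item></list>")
root = doc.documentElement
children = root.childNodes        # a NodeList

# Ordered access by index through item() and length.
texts = [children.item(i).firstChild.data for i in range(children.length)]
print(texts)                      # ['a', 'b', 'c']

# An out-of-range index yields None rather than an error.
print(children.item(5))           # None

# The list is live: removing a child from the element is reflected
# in the NodeList we fetched earlier.
root.removeChild(children.item(0))
print(children.length)            # 2
print(children.item(0).firstChild.data)   # b
```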
10.9.1.6 NamedNodeMap
This unordered set of nodes is designed to allow access to nodes by name. An alternate access by index is also provided for enumerations, but no order is implied.
10.9.1.6.1 Properties
Here are the properties for the NamedNodeMap class:
10.9.1.6.2 Methods
Here are the methods for the NamedNodeMap class:
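The most common NamedNodeMap you will meet is the set of attributes on an element. The sketch below, written with Python's standard xml.dom.minidom (an assumption made for illustration, with invented attribute names), shows access by name, enumeration by index, and removal. Because no order is implied, the names are sorted before being compared.

```python
from xml.dom.minidom import parseString

doc = parseString('<section id="s1" role="intro"/>')
attrs = doc.documentElement.attributes     # a NamedNodeMap

# Access by name.
id_attr = attrs.getNamedItem("id")
print(id_attr.value)                       # s1

# Enumeration by index; sort because the order carries no meaning.
names = sorted(attrs.item(i).name for i in range(attrs.length))
print(names)                               # ['id', 'role']

# Removing a named item changes the element itself.
attrs.removeNamedItem("role")
missing = attrs.getNamedItem("role")
print(missing)                             # None
print(doc.documentElement.hasAttribute("role"))   # False
```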
10.9.1.7 CharacterData
This class extends Node to facilitate access to certain types of nodes that contain character data, such as Text, CDATASection, Comment, and ProcessingInstruction. Specific classes like Text inherit from this class.
10.9.1.7.1 Properties
Here are the properties for the CharacterData class:
10.9.1.7.2 Methods
Here are the methods for the CharacterData class:
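The CharacterData interface treats a node's content as an editable string with offset-based operations. Here is a minimal sketch using a Text node in Python's standard xml.dom.minidom (an assumption made for illustration; the sample text is invented).

```python
from xml.dom.minidom import parseString

doc = parseString("<title>Habits of the Sprite</title>")
text = doc.documentElement.firstChild      # a Text node

print(text.data)                  # Habits of the Sprite
print(text.length)                # 20
print(text.substringData(0, 6))   # Habits

# Offsets count characters from zero.
text.insertData(14, "Wood ")      # Habits of the Wood Sprite
text.appendData("s")              # Habits of the Wood Sprites
text.replaceData(0, 6, "Haunts")  # Haunts of the Wood Sprites
text.deleteData(0, 7)             # of the Wood Sprites

print(doc.documentElement.toxml())
# <title>of the Wood Sprites</title>
```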
10.9.1.8 Element
This is the most common type of node you will encounter. An element can contain other nodes and has attribute nodes.
10.9.1.8.1 Properties
Here are the properties for the Element class:
10.9.1.8.2 Methods
Here are the methods for the Element class:
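A short sketch of the Element interface, again in Python's standard xml.dom.minidom (an assumption made for illustration; the markup is invented), shows the element name, the attribute accessors, and a descendant search by tag name.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<chapter id="c1"><title>One</title>'
    '<section><title>Two</title></section></chapter>'
)
chapter = doc.documentElement

print(chapter.tagName)                  # chapter
print(chapter.getAttribute("id"))       # c1

# Attributes can be set and removed by name.
chapter.setAttribute("status", "draft")
chapter.removeAttribute("id")
print(chapter.hasAttribute("id"))       # False
print(chapter.getAttribute("status"))   # draft

# getElementsByTagName searches all descendants, in document order.
titles = [t.firstChild.data for t in chapter.getElementsByTagName("title")]
print(titles)                           # ['One', 'Two']
```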
10.9.1.9 Attr
This kind of node represents attributes.
10.9.1.9.1 Properties
Here are the properties for the Attr class:
10.9.1.10 Text
This type of node represents text.
10.9.1.10.1 Methods
Here are the methods for the Text class:
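The one method DOM adds to Text beyond the CharacterData operations is splitText, which divides a text node in two at a given offset. The sketch below uses Python's standard xml.dom.minidom (an assumption made for illustration) to split a text node and wrap the second half in a new, invented emphasis element.

```python
from xml.dom.minidom import parseString

doc = parseString("<para>wood sprite</para>")
para = doc.documentElement

# Split after the fifth character: "wood " stays in the original
# node and "sprite" moves into a new sibling Text node.
first = para.firstChild
second = first.splitText(5)
print(repr(first.data), repr(second.data))   # 'wood ' 'sprite'

# Wrap the second half in a new element.
em = doc.createElement("emphasis")
para.replaceChild(em, second)
em.appendChild(second)
print(para.toxml())
# <para>wood <emphasis>sprite</emphasis></para>
```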
10.9.1.11 CDATASection
CDATASection is like a text node, but protects its contents from being parsed. It may contain markup characters (<, &) that would be illegal in text nodes. Use generic Node methods to access data.
10.9.1.12 ProcessingInstruction
This class represents processing instructions.
10.9.1.12.1 Properties
Here are the properties for the ProcessingInstruction class:
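As a hedged illustration of these two node types, the following sketch builds a CDATA section and a processing instruction with Python's standard xml.dom.minidom (an assumption made for illustration; the stylesheet name and content are invented) and prints how each one serializes.

```python
from xml.dom.minidom import getDOMImplementation, Node

doc = getDOMImplementation().createDocument(None, "doc", None)

# A CDATA section may hold characters that would need escaping
# in an ordinary text node.
cdata = doc.createCDATASection("a < b && c")
doc.documentElement.appendChild(cdata)
print(cdata.nodeType == Node.CDATA_SECTION_NODE)   # True
print(cdata.toxml())              # <![CDATA[a < b && c]]>

# A processing instruction has a target and a data string.
pi = doc.createProcessingInstruction("xml-stylesheet", 'href="book.css"')
doc.insertBefore(pi, doc.documentElement)
print(pi.target)                  # xml-stylesheet
print(pi.data)                    # href="book.css"
print(pi.toxml())                 # <?xml-stylesheet href="book.css"?>
```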
10.9.1.13 Comment
This is a class representing comment nodes. Use the generic Node methods to access the data.
10.9.1.14 EntityReference
This is a reference to an entity defined by an Entity node. Sometimes the parser will be configured to resolve all entity references into their values for you. If that option is disabled, the parser should create this node. No explicit methods force resolution, but some actions on the node may have that side effect.
10.9.1.15 Entity
This class provides access to an entity in the document, based on information in an entity declaration in the DTD.
10.9.1.15.1 Properties
Here are the properties for the Entity class:
10.9.1.16 Notation
Notation represents a notation declaration appearing in the DTD.
10.9.1.16.1 Properties
Here are the properties for the Notation class:
10.9.2 An Example in Perl
Perl is quite different from Java. It was not designed from the outset to be object oriented; that functionality was added later in a somewhat ad hoc manner. Perl is loose with type checking and rather idiomatic. For these reasons, it is not always taken seriously by XML pundits. Yet Perl is a fixture of the World Wide Web, being the original duct tape that holds web sites together. It has a huge following and excellent support in books and online resources, and it's very easy to get started using it. For small, quick-and-dirty utilities that achieve fast results, it simply cannot be beat. Having cut my teeth in the text processing world of publishing, I found Perl to be a boon.
Including a Perl example to contrast with Java gives us a nice range of programming environments to showcase XML development strategies. If you are developing a large, complex system, you will likely want to consider Java for its robustness and strong object-oriented programming capabilities. If you want a small tool for simple tasks in shaping your XML files, then Perl would be a great candidate.
The example I propose for using DOM is a small application that fixes a simple problem. When I used to prepare DocBook-XML documents for formatting, I found there were a few common structural errors that would cause problems in the formatting software. One of these was the tendency of busy indexing specialists to insert <indexterm> elements inside titles. It is an easy mistake to make, and just as easy to fix. Now I will show you how to go about solving this problem with Perl.
My favorite parser in Perl is Matt Sergeant's XML::LibXML. It is an interface to the C library libxml2, which is incredibly fast and reliable. This module also implements most of the DOM2 specification and adds XPath node-fetching capability.
In this portion of the script, we set up the parser and use it to assemble DOM trees out of files from the command line:

    use XML::LibXML;

    my $parser = new XML::LibXML;    # a parser object

    # This table gives us the ability to test the type of
    # most common nodes. It is not a complete list, but these are
    # the ones we are most likely to encounter (and care about
    # for this example).
    my %nodeTypes = (
        element      => 1,  attribute => 2,  text       => 3,
        cdatasection => 4,  entityref => 5,  entitynode => 6,
        procinstruc  => 7,  comment   => 8,  document   => 9
    );

    # Loop through the arguments on the command line, feeding them to
    # the parser as filenames. After testing that parsing was successful,
    # apply the map_proc_to_elems subroutine to the document node to
    # make the needed fixes. Finally, write the XML back out to the file.
    foreach my $fileName ( @ARGV ) {
        my $docRef;
        eval{ $docRef = $parser->parse_file( $fileName ); };
        die( "Parser error: $@" ) if( $@ );
        map_proc_to_elems( \&fix_iterms, $docRef );
        open( OUT, ">$fileName" ) or die( "Can't write $fileName" );
        print OUT $docRef->toString();
        close OUT;
    }

After instantiating the parser, we create a hash table that maps English words for node types to the numeric codes used in the parser. This gives us the ability to test what kind of node we are looking at as we traverse the document. In the loop below that declaration, we take filenames from the command line argument list ( @ARGV ) and feed them to the parser. The eval{ } block traps any parse error, which we detect in the die( ) statement that follows. The parser's error message ends up in $@, indicating what may have confused it. If all goes well, the parser returns a reference to the top of the DOM tree, specifically an XML::LibXML::Document object.
map_proc_to_elems( ) is a yet-to-be-written subroutine that will apply a procedure (also not yet written) to nodes in the DOM tree. This is where the real work will take place in the program.
It makes changes directly to the object tree, so all we have to do is print it out as text with the toString( ) method.
Now let us dig into the map_proc_to_elems( ) routine. The purpose of this function is to map a procedure to every element in the document:

    sub map_proc_to_elems {
        my( $proc, $nodeRef ) = @_;
        my $nodeType = $nodeRef->nodeType;
        if( $nodeType == $nodeTypes{document} ) {
            map_proc_to_elems( $proc, $nodeRef->getDocumentElement );
        } elsif( $nodeType == $nodeTypes{element} ) {
            &$proc( $nodeRef );
            foreach my $childNodeRef ( $nodeRef->getChildnodes ) {
                map_proc_to_elems( $proc, $childNodeRef );
            }
        }
    }

You start it with the document node or any element and it will visit every element in that subtree, recursing on the children and their children and so on. Testing the node's type ensures that we descend only from the document node or an element, and that the procedure is applied only to elements. The procedure to be applied comes in the form of a subroutine reference, which we dereference and call whenever the current node is an element. When the current node is the document node, the routine simply recurses on the root element; for any other node type, it returns without doing anything. Driving this traversal are the methods getDocumentElement( ), which obtains the root element, and getChildnodes( ), which returns a list of child nodes in the order they appear in the document.
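For comparison, the same traversal can be sketched with Python's standard xml.dom.minidom module. This is an illustrative translation under stated assumptions, not part of the Perl program: minidom supplies named constants for node types, so no lookup table is needed, and the child list is copied before iterating because the applied procedure may rearrange the tree.

```python
from xml.dom.minidom import parseString, Node

def map_proc_to_elems(proc, node):
    """Apply proc to every element at or below node, in document order."""
    if node.nodeType == Node.DOCUMENT_NODE:
        # The document node itself is not an element; start at the root.
        map_proc_to_elems(proc, node.documentElement)
    elif node.nodeType == Node.ELEMENT_NODE:
        proc(node)
        # Iterate over a snapshot, since proc may move nodes around.
        for child in list(node.childNodes):
            map_proc_to_elems(proc, child)
    # Text, comments, and other node types are ignored.

doc = parseString(
    "<chapter><title>Habits</title>"
    "<para>Text <emphasis>here</emphasis></para></chapter>"
)
seen = []
map_proc_to_elems(lambda elem: seen.append(elem.tagName), doc)
print(seen)    # ['chapter', 'title', 'para', 'emphasis']
```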
Now we turn our attention to the subroutine that performs the fix on elements. It is called fix_iterms( ) because it moves indexterm elements out of title elements where they would cause trouble. We could just as easily substitute this procedure with another that does something else to elements. That is the beauty of this program: it can be quickly re-engineered to do any task on elements you want. Here it is:

    sub fix_iterms {
        my $nodeRef = shift;

        # test: is this an indexterm?
        return unless( $nodeRef->nodeName eq 'indexterm' );

        # test: is the parent a title?
        my $parentNodeRef = $nodeRef->parentNode;
        return unless( $parentNodeRef->nodeName eq 'title' );

        # If we get this far, we must be
        # looking at an indexterm inside a title.
        # Therefore, remove this indexterm and
        # stick it just after the parent (title).
        $parentNodeRef->removeChild( $nodeRef );
        my $ancestorNodeRef = $parentNodeRef->parentNode;
        $ancestorNodeRef->insertAfter( $nodeRef, $parentNodeRef );
    }

At the top of the procedure are lines that select which element to process. Since this procedure is called for every element, we have to weed out the ones we don't want to touch. The first test determines whether the element is an <indexterm> and, if it is not, returns immediately. The next two lines examine the parent of this element, aborting unless it is of type title. If processing gets past these two tests, we know this must be an indexterm inside a title.
The processing that follows removes the offending indexterm element from its parent's list of children and inserts it into the list of its parent's parent's children, just after the parent. So the indexterm goes from being a child of title to being its sibling, positioned immediately after it. This puts the element where it will do no harm to the formatter and will still be seen by an index generator later. Wasn't that simple? Example 10-5 shows the complete program.
Example 10-5.
A DOM program for moving indexterms out of titles

    #!/usr/bin/perl

    use XML::LibXML;

    my $parser = new XML::LibXML;

    my %nodeTypes = (
        element      => 1,  attribute => 2,  text       => 3,
        cdatasection => 4,  entityref => 5,  entitynode => 6,
        procinstruc  => 7,  comment   => 8,  document   => 9
    );

    foreach my $fileName ( @ARGV ) {
        my $docRef;
        eval{ $docRef = $parser->parse_file( $fileName ); };
        die( "Parser error: $@" ) if( $@ );
        map_proc_to_elems( \&fix_iterms, $docRef );
        open( OUT, ">$fileName" ) or die( "Can't write $fileName" );
        print OUT $docRef->toString();
        close OUT;
    }

    sub map_proc_to_elems {
        my( $proc, $nodeRef ) = @_;
        my $nodeType = $nodeRef->nodeType;
        if( $nodeType == $nodeTypes{document} ) {
            map_proc_to_elems( $proc, $nodeRef->getDocumentElement );
        } elsif( $nodeType == $nodeTypes{element} ) {
            &$proc( $nodeRef );
            foreach my $childNodeRef ( $nodeRef->getChildnodes ) {
                map_proc_to_elems( $proc, $childNodeRef );
            }
        }
    }

    sub fix_iterms {
        my $nodeRef = shift;
        return unless( $nodeRef->nodeName eq 'indexterm' );
        my $parentNodeRef = $nodeRef->parentNode;
        return unless( $parentNodeRef->nodeName eq 'title' );
        $parentNodeRef->removeChild( $nodeRef );
        my $ancestorNodeRef = $parentNodeRef->parentNode;
        $ancestorNodeRef->insertAfter( $nodeRef, $parentNodeRef );
    }

Now, let's make sure this thing works.
Here is a sample data file, before processing:

    <chapter>
    <title><indexterm><primary>wee creatures</primary></indexterm>
    Habits of the Wood Sprite
    <indexterm><primary>woodland faeries</primary></indexterm></title>
    <indexterm>
    <primary>sprites</primary>
    <secondary>woodland</secondary>
    </indexterm>
    <para>The wood sprite likes to hang around rotting piles of wood
    and is easily dazzled by bright lights.</para>
    <section>
    <title><indexterm><primary>little people</primary></indexterm>
    Origins</title>
    <para>No one really knows where they came from.</para>
    <indexterm><primary>magical folk</primary></indexterm>
    </section>
    </chapter>

I have placed indexterms in various places, both inside and outside titles, to see which ones are affected. Here is the result, after running the script on it:

    <?xml version="1.0"?>
    <chapter>
    <title>Habits of the Wood Sprite</title><indexterm><primary>woodland faeries</primary></indexterm>
    <indexterm><primary>wee creatures</primary></indexterm>
    <indexterm>
    <primary>sprites</primary>
    <secondary>woodland</secondary>
    </indexterm>
    <para>The wood sprite likes to hang around rotting piles of wood
    and is easily dazzled by bright lights.</para>
    <section>
    <title>Origins</title><indexterm><primary>little people</primary></indexterm>
    <para>No one really knows where they came from.</para>
    <indexterm><primary>magical folk</primary></indexterm>
    </section>
    </chapter>

The indexterms have been moved out of the titles as we expected. Other indexterms have not been affected. The other contents in titles are still there, unchanged, including some extra space that abutted the indexterm elements. In short, it worked!
Perl works well for most of my XML needs. Historically, it has had a few issues with character encodings, but these problems are gradually going away as Perl adopts multibyte characters and adds support for Unicode. Check out http://www.cpan.org for a huge list of modules that do everything with XML including XSLT, XPath, DOM, SAX, and more.
You will also want to check out Python, which many people tout as superior in its object-oriented support. It is quickly growing in popularity, though it will be a while before it can match Perl's wealth of libraries.
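To give a taste of what a Python version might look like, here is a hedged sketch of the indexterm fix using the standard-library xml.dom.minidom module. It is an illustrative translation, not a drop-in replacement for the Perl program: it works on a string instead of files, the function names are invented, and it relies on insertBefore( ) with the title's next sibling, since DOM itself defines only insertBefore( ) and not the insertAfter( ) convenience that XML::LibXML provides.

```python
from xml.dom.minidom import parseString, Node

def fix_iterms(elem):
    """Move an indexterm that sits inside a title to just after the title."""
    if elem.nodeName != "indexterm":
        return
    title = elem.parentNode
    if title is None or title.nodeName != "title":
        return
    title.removeChild(elem)
    # Inserting before the title's next sibling places the node right
    # after the title; a next sibling of None means "append at the end".
    title.parentNode.insertBefore(elem, title.nextSibling)

def walk_elements(proc, node):
    """Apply proc to every element below node, in document order."""
    if node.nodeType == Node.DOCUMENT_NODE:
        walk_elements(proc, node.documentElement)
    elif node.nodeType == Node.ELEMENT_NODE:
        proc(node)
        for child in list(node.childNodes):   # snapshot: proc moves nodes
            walk_elements(proc, child)

doc = parseString(
    "<chapter><title><indexterm><primary>wee creatures</primary>"
    "</indexterm>Habits</title><para>Text</para></chapter>"
)
walk_elements(fix_iterms, doc)
result = doc.documentElement.toxml()
print(result)
# <chapter><title>Habits</title><indexterm><primary>wee creatures</primary></indexterm><para>Text</para></chapter>
```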