Manipulating an XML Document

Problem

You want to represent an XML document as a C++ object so that you can manipulate its elements, attributes, text, DTD, processing instructions, and comments.

Solution

Use Xerces's implementation of the W3C DOM. First, use the class xercesc::DOMImplementationRegistry to obtain an instance of xercesc::DOMImplementation, then use the DOMImplementation to create an instance of the parser xercesc::DOMBuilder. Next, register an instance of xercesc::DOMErrorHandler to receive notifications of parsing errors, and invoke the parser's parseURI( ) method with your XML document's URI or file pathname as its argument. If the parse is successful, parseURI will return a pointer to a DOMDocument representing the XML document. You can then use the functions defined by the W3C DOM specification to inspect and manipulate the document.

When you are done manipulating the document, you can save it to a file by obtaining a DOMWriter from the DOMImplementation and calling its writeNode( ) method with an output target and the node to be written, such as the DOMDocument or its root element.

Example 14-10 shows how to use DOM to parse the document animals.xml from Example 14-1, locate and remove the node corresponding to Herby the elephant, and save the modified document.

Example 14-10. Using DOM to load, modify, and then save an XML document

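#include <cstdlib> // EXIT_FAILURE
#include <exception>
#include <iostream> // cout
#include <xercesc/dom/DOM.hpp>
#include <xercesc/framework/LocalFileFormatTarget.hpp>
#include <xercesc/sax/SAXException.hpp>
#include <xercesc/util/PlatformUtils.hpp>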
#include "animal.hpp"
#include "xerces_strings.hpp"

using namespace std;
using namespace xercesc;

/*
 * Define XercesInitializer as in Example 14-8
 */

// RAII utility that releases a resource when it goes out of scope.
template<typename T>
class DOMPtr {
public:
 DOMPtr(T* t) : t_(t) { }
 ~DOMPtr( ) { t_->release( ); }
 T* operator->( ) const { return t_; }
private:
 // prohibit copying and assigning
 DOMPtr(const DOMPtr&);
 DOMPtr& operator=(const DOMPtr&);
 T* t_;
};

// Reports errors encountered while parsing using a DOMBuilder.
class CircusErrorHandler : public DOMErrorHandler {
public:
 bool handleError(const DOMError& e)
 {
 std::cout << toNative(e.getMessage( )) << "\n";
 return false;
 }
};

// Returns the value of the "name" child of an "animal" element.
const XMLCh* getAnimalName(const DOMElement* animal)
{
 static XercesString name = fromNative("name");

 // Iterate through animal's children
 DOMNodeList* children = animal->getChildNodes( );
 for ( size_t i = 0,
 len = children->getLength( ); 
 i < len; 
 ++i ) 
 {
 DOMNode* child = children->item(i);
 if ( child->getNodeType( ) == DOMNode::ELEMENT_NODE &&
 static_cast<DOMElement*>(child)->getTagName( ) == name )
 {
 // We've found the "name" element.
 return child->getTextContent( );
 }
 }
 return 0;
}

int main( )
{
 try {
 // Initialize Xerces and retrieve a DOMImplementation;
 // specify that you want to use the Load and Save (LS)
 // feature
 XercesInitializer init;
 DOMImplementation* impl = 
 DOMImplementationRegistry::getDOMImplementation(
 fromNative("LS").c_str( )
 );
 if (impl == 0) {
 cout << "couldn't create DOM implementation\n";
 return EXIT_FAILURE;
 }

 // Construct a DOMBuilder to parse animals.xml.
 DOMPtr<DOMBuilder> parser(
 static_cast<DOMImplementationLS*>(impl)->
 createDOMBuilder(DOMImplementationLS::MODE_SYNCHRONOUS, 0)
 );

 // Enable namespaces (not needed in this example)
 parser->setFeature(XMLUni::fgDOMNamespaces, true);

 // Register an error handler
 CircusErrorHandler err;
 parser->setErrorHandler(&err);

 // Parse animals.xml; you can use a URL here 
 // instead of a file name
 DOMDocument* doc = 
 parser->parseURI("animals.xml");
 if (doc == 0) {
 cout << "couldn't parse animals.xml\n";
 return EXIT_FAILURE;
 }

 // Search for Herby the elephant: first, obtain a pointer 
 // to the "animalList" element.
 DOMElement* animalList = doc->getDocumentElement( );
 if (animalList->getTagName( ) != fromNative("animalList")) {
 cout << "bad document root: " 
 << toNative(animalList->getTagName( ))
 << "\n";
 return EXIT_FAILURE;
 }

 // Next, iterate through the "animal" elements, searching
 // for Herby the elephant.
 DOMNodeList* animals = 
 animalList->getElementsByTagName(fromNative("animal").c_str( ));
 for ( size_t i = 0, 
 len = animals->getLength( );
 i < len;
 ++i )
 {
 DOMElement* animal = 
 static_cast<DOMElement*>(animals->item(i));
 const XMLCh* name = getAnimalName(animal);
 if (name != 0 && name == fromNative("Herby")) {
 // Found Herby -- remove him from document.
 animalList->removeChild(animal);
 animal->release( ); // optional.
 break;
 }
 }

 // Construct a DOMWriter to save animals.xml.
 DOMPtr<DOMWriter> writer(
 static_cast<DOMImplementationLS*>(impl)->createDOMWriter( )
 );
 writer->setErrorHandler(&err);

 // Save animals.xml.
 LocalFileFormatTarget file("animals.xml");
 writer->writeNode(&file, *animalList);
 } catch (const SAXException& e) {
 cout << "xml error: " << toNative(e.getMessage( )) << "\n";
 return EXIT_FAILURE;
 } catch (const DOMException& e) {
 cout << "xml error: " << toNative(e.getMessage( )) << "\n";
 return EXIT_FAILURE;
 } catch (const exception& e) {
 cout << e.what( ) << "\n";
 return EXIT_FAILURE;
 }
}

Discussion

Like the TinyXml parser, the Xerces DOM parser produces a representation of an XML document as a tree-structured C++ object with nodes representing the document's components. Xerces is a much more sophisticated parser, however: for instance, unlike TinyXml, it understands XML Namespaces and can parse complex DTDs. It also constructs a much more detailed representation of an XML document, including its processing instructions and the namespace URIs associated with elements and attributes. Most importantly, it provides access to this information through the interface described in the W3C DOM specification.

The W3C specification, which is still a work in progress, is divided into several "levels"; currently, there are three levels. The classes DOMImplementation, DOMDocument, DOMElement, and DOMNodeList, used in Example 14-10, are specified in DOM Level 1. The classes DOMBuilder and DOMWriter are specified in DOM Level 3, as part of the Load and Save recommendation.

The names of Xerces classes aren't always the same as the names of the W3C DOM interfaces they implement; this is because Xerces implements several specifications in a single namespace, and attaches prefixes to some class names to avoid name clashes.

Example 14-10 should now be pretty easy to understand. I start by initializing Xerces as shown in Example 14-8. Then I obtain a DOMImplementation from the DOMImplementationRegistry, requesting the Load and Save feature by passing the string "LS" to the static method DOMImplementationRegistry::getDOMImplementation( ). I next obtain a DOMBuilder from the DOMImplementation. I have to cast the DOMImplementation to type DOMImplementationLS, because Load and Save features are not accessible from the DOMImplementation interface specified by W3C DOM Level 1. The first argument to createDOMBuilder( ) indicates that the returned parser will operate in synchronous mode. The other possible mode, asynchronous mode, is not currently supported by Xerces.

After obtaining a DOMBuilder, I enable XML Namespace support, register a DOMErrorHandler, and parse the document. The parser returns a representation of the document as a DOMDocument; using the DOMDocument's getDocumentElement( ) method, I obtain a DOMElement object corresponding to the document's animalList element. I then call that element's getElementsByTagName( ) method to obtain a DOMNodeList of its animal elements and iterate over the list. When I find an element that has a child element of type name containing the text "Herby", I remove it from the document by calling the root element's removeChild( ) method.
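The search-and-remove logic can be easier to follow without the Xerces types around it. The following is a minimal sketch using only the standard library; the Element struct and the functions animalName( ), removeAnimal( ), and makeAnimal( ) are hypothetical stand-ins invented for illustration and are not part of Xerces or the DOM. Unlike getElementsByTagName( ), which considers all descendants, this sketch looks only at direct children.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, drastically simplified element node; not a Xerces type.
struct Element {
    std::string tag;
    std::string text;                // text content, if any
    std::vector<Element> children;
};

// Returns the text of the first "name" child, or 0 if there is none
// (plays the role of getAnimalName( ) in Example 14-10).
const std::string* animalName(const Element& animal)
{
    for (std::size_t i = 0; i < animal.children.size( ); ++i)
        if (animal.children[i].tag == "name")
            return &animal.children[i].text;
    return 0;
}

// Removes the first direct "animal" child of root whose name matches.
// Returns true if a child was removed.
bool removeAnimal(Element& root, const std::string& name)
{
    for (std::size_t i = 0; i < root.children.size( ); ++i) {
        const Element& child = root.children[i];
        if (child.tag != "animal")
            continue;
        const std::string* n = animalName(child);
        if (n != 0 && *n == name) {
            root.children.erase(
                root.children.begin( ) + static_cast<std::ptrdiff_t>(i));
            return true;  // stop at the first match, like the break above
        }
    }
    return false;
}

// Helper for building test data: an "animal" with a single "name" child.
Element makeAnimal(const std::string& name)
{
    assert(!name.empty( ));
    Element nameElem;
    nameElem.tag = "name";
    nameElem.text = name;
    Element animal;
    animal.tag = "animal";
    animal.children.push_back(nameElem);
    return animal;
}
```

Calling removeAnimal(root, "Herby") on a root whose children were built with makeAnimal( ) removes the matching child and leaves the others in place. As in Example 14-10, the strings are compared by value, not by pointer.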

Just as SAX2XMLReader has a parse( ) method taking an instance of InputSource, DOMBuilder has a parse( ) method taking an instance of xercesc::DOMInputSource, an abstract class encapsulating a source of character data. DOMInputSource has a concrete subclass, Wrapper4InputSource, that can be used to transform an arbitrary InputSource into a xercesc::DOMInputSource. See Recipe 14.3.

Finally, I obtain a DOMWriter object from the DOMImplementation, in much the same way that I obtained a DOMBuilder, and save the modified XML document to disk by calling its writeNode( ) method with the document's root element as argument.

You must free pointers returned by methods of the form DOMImplementation::createXXX( ) by calling the method release( ). Use the DOMPtr utility from Example 14-10 to make sure such pointers are released even if an exception is thrown. Pointers returned by methods of the form DOMDocument::createXXX( ) need not be explicitly released, although they can be if they are no longer needed. See the Xerces documentation for details.
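To see why a utility like DOMPtr helps, the following sketch reproduces the idea without Xerces. FakeResource, ReleaseGuard, and releasesAfterScope( ) are hypothetical names made up for this illustration; FakeResource merely counts calls to release( ) so that the behavior can be observed. ReleaseGuard follows the same pattern as DOMPtr, with a null check added in the destructor as a precaution.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an object that must be freed with release( ).
// It only counts calls; it is not a Xerces class.
struct FakeResource {
    explicit FakeResource(int& counter) : releases_(counter) { }
    void release( ) { ++releases_; }
    int& releases_;
};

// Same pattern as DOMPtr in Example 14-10: call release( ) on scope exit.
template<typename T>
class ReleaseGuard {
public:
    explicit ReleaseGuard(T* t) : t_(t) { }
    ~ReleaseGuard( ) { if (t_) t_->release( ); }
    T* operator->( ) const { assert(t_ != 0); return t_; }
private:
    // prohibit copying and assigning
    ReleaseGuard(const ReleaseGuard&);
    ReleaseGuard& operator=(const ReleaseGuard&);
    T* t_;
};

// Returns how many times release( ) was called once the guard's scope
// has been left, either normally (fail == false) or through an
// exception (fail == true).
int releasesAfterScope(bool fail)
{
    int count = 0;
    FakeResource res(count);
    try {
        ReleaseGuard<FakeResource> guard(&res);
        if (fail)
            throw std::runtime_error("simulated failure");
    } catch (const std::runtime_error&) {
        // The guard's destructor already ran during stack unwinding.
    }
    return count;
}
```

In both cases release( ) is called exactly once, which is the guarantee you want for pointers returned by DOMImplementation::createXXX( ). Note that the guard is constructed with direct initialization, since its copy constructor is private.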
