Parsing a Complex XML Document

Problem

You have a collection of data stored in an XML document that uses an internal DTD or XML Namespaces. You want to parse the document and turn the data it contains into a collection of C++ objects.

Solution

Use Xerces's implementation of the SAX2 API (the Simple API for XML, Version 2.0). First, derive a class from xercesc::ContentHandler; this class will receive notifications about the structure and content of your XML document as it is being parsed. Next, if you like, derive a class from xercesc::ErrorHandler to receive warnings and error notifications. Then construct a parser of type xercesc::SAX2XMLReader and register instances of your handler classes using the parser's setContentHandler( ) and setErrorHandler( ) methods. Finally, invoke the parser's parse( ) method, passing the file pathname of your document as its argument.

For example, suppose you want to parse the XML document animals.xml from Example 14-1 and construct a std::vector of Animals representing the animals listed in the document. (See Example 14-2 for the definition of the class Animal.) In Example 14-3, I showed how to do this using TinyXml. To make the problem more challenging, let's add namespaces to the document, as shown in Example 14-5.

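Example 14-5. List of circus animals, using XML Namespaces

<?xml version="1.0" encoding="UTF-8"?>

<ffc:animalList xmlns:ffc="http://www.feldman-family-circus.com">
    <ffc:animal>
        <ffc:name>Herby</ffc:name>
        <ffc:species>elephant</ffc:species>
        <ffc:dateOfBirth>1992-04-23</ffc:dateOfBirth>
        <ffc:veterinarian name="..." phone="..."/>
        <ffc:trainer name="..." phone="..."/>
    </ffc:animal>
    <!-- etc. -->
</ffc:animalList>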

To parse this document with SAX2, define a ContentHandler, as shown in Example 14-6, and an ErrorHandler, as shown in Example 14-7. Then construct a SAX2XMLReader, register your handlers, and run the parser. This is illustrated in Example 14-8.

Example 14-6. A SAX2 ContentHandler for parsing the document animals.xml

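#include <stdexcept> // runtime_error
#include <string>
#include <vector>
#include <xercesc/sax2/Attributes.hpp>
#include <xercesc/sax2/DefaultHandler.hpp> // Contains no-op 
 // implementations of
 // the various handlers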
#include "xerces_strings.hpp" // Example 14-4
#include "animal.hpp"

using namespace std;
using namespace xercesc;

// Returns an instance of Contact based 
// on the given collection of attributes
Contact contactFromAttributes(const Attributes &attrs)
{
 // For efficiency, store frequently used string 
 // in static variables
 static XercesString name = fromNative("name");
 static XercesString phone = fromNative("phone");

 Contact result; // Contact to be returned.
 const XMLCh* val; // Value of name or phone attribute.

 // Set Contact's name.
 if ((val = attrs.getValue(name.c_str( ))) != 0) {
 result.setName(toNative(val));
 } else {
 throw runtime_error("contact missing name attribute");
 }

 // Set Contact's phone number.
 if ((val = attrs.getValue(phone.c_str( ))) != 0) {
 result.setPhone(toNative(val));
 } else {
 throw runtime_error("contact missing phone attribute");
 }

 return result;
}

// Implements callbacks that receive character data and
// notifications about the beginnings and ends of elements 
class CircusContentHandler : public DefaultHandler {
public:
 CircusContentHandler(vector<Animal>& animalList) 
 : animalList_(animalList)
 { }

 // If the current element represents a veterinarian or trainer,
 // use attrs to construct a Contact object for the current 
 // Animal; otherwise, clear currentText_ in preparation for the 
 // characters( ) callback
 void startElement( 
 const XMLCh *const uri, // namespace URI
 const XMLCh *const localname, // tag name w/o NS prefix
 const XMLCh *const qname, // tag name + NS prefix
 const Attributes &attrs ) // element's attributes
 {
 static XercesString animalList = fromNative("animalList");
 static XercesString animal = fromNative("animal");
 static XercesString vet = fromNative("veterinarian");
 static XercesString trainer = fromNative("trainer");
 static XercesString xmlns = 
 fromNative("http://www.feldman-family-circus.com");

 // Check namespace URI
 if (uri != xmlns)
 throw runtime_error(
 string("wrong namespace uri: ") + toNative(uri)
 );
 if (localname == animal) {
 // Add an Animal to the list; this is the new
 // "current Animal"
 animalList_.push_back(Animal( ));
 } else if (localname!= animalList) {
 Animal& animal = animalList_.back( );
 if (localname == vet) {
 // We've encountered a "veterinarian" element.
 animal.setVeterinarian(contactFromAttributes(attrs));
 } else if (localname == trainer) {
 // We've encountered a "trainer" element.
 animal.setTrainer(contactFromAttributes(attrs));
 } else {
 // We've encountered a "name", "species", or 
 // "dateOfBirth" element. Its content will be supplied
 // by the callback function characters( ).
 currentText_.clear( );
 }
 }
 }

 // If the current element represents a name, species, or date
 // of birth, use the text stored in currentText_ to set the
 // appropriate property of the current Animal.
 void endElement( 
 const XMLCh *const uri, // namespace URI
 const XMLCh *const localname, // tag name w/o NS prefix
 const XMLCh *const qname ) // tag name + NS prefix
 {
 static XercesString animalList = fromNative("animalList");
 static XercesString animal = fromNative("animal");
 static XercesString name = fromNative("name");
 static XercesString species = fromNative("species");
 static XercesString dob = fromNative("dateOfBirth");

 if (localname!= animal && localname!= animalList) {
 // currentText_ contains the content of the element 
 // which has ended. Use it to set the current Animal's 
 // properties.
 Animal& animal = animalList_.back( );
 if (localname == name) {
 animal.setName(toNative(currentText_));
 } else if (localname == species) {
 animal.setSpecies(toNative(currentText_));
 } else if (localname == dob) {
 animal.setDateOfBirth(toNative(currentText_));
 } 
 }
 }
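
 // Receives notifications when character data is encountered.
 // Note: Xerces-C++ 3.x declares the second parameter as
 // const XMLSize_t; use that type there so this overrides the base method.
 void characters( const XMLCh* const chars, 
 const unsigned int length ) 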
 {
 // Append characters to currentText_ for processing by
 // the method endElement( )
 currentText_.append(chars, length);
 }
private:
 vector<Animal>& animalList_;
 XercesString currentText_;
};

 

Example 14-7. A SAX2 ErrorHandler

#include <stdexcept> // runtime_error
#include <xercesc/sax2/DefaultHandler.hpp>

// Receives Error notifications.
class CircusErrorHandler : public DefaultHandler {
public: 
 void warning(const SAXParseException& e)
 {
 /* do nothing */
 }
 void error(const SAXParseException& e)
 {
 throw runtime_error(toNative(e.getMessage( )));
 }
 void fatalError(const SAXParseException& e) { error(e); }
};

 

Example 14-8. Parsing the document animals.xml with the SAX2 API

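#include <cstdlib> // EXIT_FAILURE
#include <iostream> // cout
#include <memory> // auto_ptr
#include <vector>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/util/PlatformUtils.hpp>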
#include "animal.hpp"
#include "xerces_strings.hpp" // Example 14-4

using namespace std;
using namespace xercesc;

// RAII utility that initializes the parser and frees resources
// when it goes out of scope
class XercesInitializer {
public:
 XercesInitializer( ) { XMLPlatformUtils::Initialize( ); }
 ~XercesInitializer( ) { XMLPlatformUtils::Terminate( ); }
private:
 // Prohibit copying and assignment
 XercesInitializer(const XercesInitializer&);
 XercesInitializer& operator=(const XercesInitializer&);
};

int main( )
{
 try {
 vector<Animal> animalList;

 // Initialize Xerces and obtain parser
 XercesInitializer init; 
 auto_ptr<SAX2XMLReader> 
 parser(XMLReaderFactory::createXMLReader( ));

 // Register handlers
 CircusContentHandler content(animalList);
 CircusErrorHandler error;
 parser->setContentHandler(&content);
 parser->setErrorHandler(&error);

 // Parse the XML document
 parser->parse("animals.xml");
 
 // Print animals' names
 for ( vector<Animal>::size_type i = 0,
 n = animalList.size( );
 i < n;
 ++i )
 {
 cout << animalList[i] << "\n";
 }
 } catch (const SAXException& e) {
 cout << "xml error: " << toNative(e.getMessage( )) << "\n";
 return EXIT_FAILURE;
 } catch (const XMLException& e) {
 cout << "xml error: " << toNative(e.getMessage( )) << "\n";
 return EXIT_FAILURE;
 } catch (const exception& e) {
 cout << e.what( ) << "\n";
 return EXIT_FAILURE;
 }
}

 

Discussion

Some XML parsers parse an XML document and return it to the user as a complex C++ object. The TinyXml parser and the W3C DOM parser that you'll see in the next recipe both work this way. The SAX2 parser, by contrast, uses a collection of callback functions to deliver information about an XML document to the user as the document is being parsed. The callback functions are grouped into several handler interfaces: a ContentHandler receives notifications about an XML document's elements, attributes, and text; an ErrorHandler receives warnings and error notifications; and a DTDHandler receives notifications about an XML document's DTD.

Designing a parser around a collection of callback functions has several important advantages. For example, it makes it possible to parse documents that are too large to fit into memory. In addition, it can save processing time by avoiding the numerous dynamic allocations needed to construct nodes in an internal representation of an XML document, and by allowing the user to construct her own representation of a document's data directly, instead of having to traverse the document tree as I did in Example 14-3.

Example 14-8 is pretty straightforward: I obtain a SAX2 parser, register a ContentHandler and ErrorHandler, parse the document animals.xml, and print the list of Animals populated by the ContentHandler. There are two interesting points: First, the function XMLReaderFactory::createXMLReader() returns a dynamically allocated instance of SAX2XMLReader that must be freed explicitly by the user; I use a std::auto_ptr for this purpose to make sure that the parser is deleted even in the event of an exception. Second, the Xerces framework must be initialized using xercesc::XMLPlatformUtils::Initialize( ) and be cleaned up using xercesc::XMLPlatformUtils::Terminate( ). I encapsulate this initialization and cleanup in a class called XercesInitializer, which calls XMLPlatformUtils::Initialize( ) in its constructor and XMLPlatformUtils::Terminate( ) in its destructor. This ensures that Terminate( ) is called even if an exception is thrown. This is an example of the Resource Acquisition Is Initialization (RAII) technique demonstrated in Example 8-3.
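
The guarantee that cleanup runs during stack unwinding can be checked in isolation. The following is a minimal sketch that does not use Xerces: fakeInitialize( ), fakeTerminate( ), LibraryGuard, and parseAndFail( ) are hypothetical stand-ins invented for illustration, with counters taking the place of the real XMLPlatformUtils calls.

```cpp
#include <stdexcept>

// Stand-ins for XMLPlatformUtils::Initialize( ) and Terminate( ); they only
// count calls so that the effect of the guard can be observed.
int initializeCalls = 0;
int terminateCalls = 0;
void fakeInitialize() { ++initializeCalls; }
void fakeTerminate() { ++terminateCalls; }

// Same shape as XercesInitializer in Example 14-8.
class LibraryGuard {
public:
    LibraryGuard() { fakeInitialize(); }
    ~LibraryGuard() { fakeTerminate(); }
private:
    // Prohibit copying and assignment
    LibraryGuard(const LibraryGuard&);
    LibraryGuard& operator=(const LibraryGuard&);
};

// Simulates a parse that fails. Returns true if cleanup had already
// happened by the time the exception was caught.
bool parseAndFail()
{
    try {
        LibraryGuard guard;
        throw std::runtime_error("simulated parse error");
    } catch (const std::runtime_error&) {
        // guard's destructor ran during stack unwinding,
        // before control reached this handler
        return terminateCalls == initializeCalls;
    }
    return false;
}
```

Calling parseAndFail( ) once leaves both counters at 1: the destructor runs even though the function body never reaches its normal end.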

Let's look at how the class CircusContentHandler from Example 14-6 implements the SAX2 ContentHandler interface. The SAX2 parser calls the method startElement( ) each time it encounters the opening tag of an element. If the element has an associated namespace, the first argument, uri, contains the element's namespace URI, and the second argument, localname, contains the portion of the element's tag name following its namespace prefix. If the element has no associated namespace, these two arguments are empty strings. The third argument, qname, contains the element's tag name if the element has no associated namespace; if the element does have an associated namespace, this argument may contain the element's tag name as it appears in the document being parsed, but it may also be an empty string. The fourth argument is an instance of the class Attributes, which represents the element's collection of attributes.
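
To make the relationship between a qualified name and a local name concrete, here is a minimal sketch, independent of Xerces, that splits a tag name at its first colon. The helper splitQName( ) and the use of std::string in place of XMLCh strings are assumptions made for illustration; the parser performs this split itself, so a real handler never needs to.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of Xerces): splits a qualified name such as
// "p:animal" into its prefix ("p") and local name ("animal").
// A name with no colon yields an empty prefix.
std::pair<std::string, std::string> splitQName(const std::string& qname)
{
    const std::string::size_type colon = qname.find(':');
    if (colon == std::string::npos) {
        return std::make_pair(std::string(), qname);
    }
    return std::make_pair(qname.substr(0, colon), qname.substr(colon + 1));
}
```

The prefix is only a document-local abbreviation. What identifies the namespace is the URI bound to that prefix, which is why the handler in Example 14-6 compares uri and localname rather than the qualified name.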

In the implementation of startElement( ) in Example 14-6, I ignore the animalList element. When I encounter an animal element, I add a new Animal to the list of animals (let's call this Animal the current Animal) and delegate the job of setting the Animal's properties to the handlers for other elements. When I encounter a veterinarian or trainer element, I call the function contactFromAttributes( ) to construct an instance of Contact from the element's collection of attributes, and then use this Contact to set the current Animal's veterinarian or trainer property. When I encounter a name, species, or dateOfBirth element, I clear the member variable currentText_, which will be used to store the element's textual content.

The SAX2 parser calls the method characters( ) to deliver the character data contained by an element. The parser is allowed to deliver an element's character data in a series of calls to characters( ); until an element's closing tag is encountered, there's no guarantee that all its character data has been delivered. Consequently, in the implementation of characters( ), I simply append the provided characters to the member variable currentText_, which I use to set the current Animal's name, species, or date of birth as soon as a closing name, species, or dateOfBirth tag is encountered.
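
The append-then-consume pattern can be seen in a toy handler that has nothing to do with Xerces. In this sketch, the class TextCollector and the function deliverInTwoChunks( ) are hypothetical names, and plain char strings stand in for XMLCh data; the point is only that the text of one element may arrive in pieces.

```cpp
#include <cstddef>
#include <string>

// Toy stand-in for a SAX2 content handler (not a Xerces class).
class TextCollector {
public:
    void startElement(const std::string& /*localname*/)
    {
        currentText_.clear();    // forget text left over from earlier elements
    }
    void characters(const char* chars, std::size_t length)
    {
        currentText_.append(chars, length);    // append, never assign
    }
    void endElement(const std::string& /*localname*/)
    {
        lastText_ = currentText_;    // the element's text is complete only now
    }
    const std::string& lastText() const { return lastText_; }
private:
    std::string currentText_;
    std::string lastText_;
};

// Simulates a parser that delivers the text "elephant" in two chunks.
std::string deliverInTwoChunks()
{
    TextCollector handler;
    handler.startElement("species");
    handler.characters("ele", 3);
    handler.characters("phant", 5);
    handler.endElement("species");
    return handler.lastText();
}
```

Had characters( ) assigned to currentText_ instead of appending, the handler would have kept only the second chunk.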

The SAX2 parser calls the method endElement( ) each time it leaves an element. Its arguments have the same interpretation as the first three arguments to startElement( ). In the implementation of endElement( ) in Example 14-6, I ignore all elements other than name, species, and dateOfBirth. When a callback corresponding to one of these elements occurs, signaling that the parser is just leaving the element, I use the character data stored in currentText_ to set the current Animal's name, species, or date of birth.

Several important features of SAX2 are not illustrated in Examples 14-6, 14-7, and 14-8. For example, the class SAX2XMLReader provides an overload of the method parse( ) taking an instance of xercesc::InputSource as an argument instead of a C-style string. InputSource is an abstract class encapsulating a source of character data; its concrete subclasses, including xercesc::MemBufInputSource and xercesc::URLInputSource, allow the SAX2 parser to parse XML documents stored in locations other than the local filesystem.

Furthermore, the ContentHandler interface contains many additional methods, such as startDocument( ) and endDocument( ), which signal the start and end of the XML document, and setDocumentLocator( ), through which the parser hands you a Locator object that keeps track of the current position in the file being parsed. There are also other handler interfaces, including DTDHandler and EntityResolver, from the core SAX 2.0 specification, and DeclHandler and LexicalHandler, from the standardized extensions to SAX 2.0.

It's also possible for a single class to implement several handler interfaces. The class xercesc::DefaultHandler makes this easy, because it derives from all the handler interfaces and provides no-op implementations of their virtual functions. Consequently, I could have added the methods from CircusErrorHandler to CircusContentHandler, and modified Example 14-8 as follows:

// Register handlers
CircusContentHandler handler(animalList);
parser->setContentHandler(&handler);
parser->setErrorHandler(&handler);
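
The arrangement that makes this work is ordinary multiple inheritance. The following sketch reproduces it in miniature without Xerces; every name in it (ToyContentHandler, ToyErrorHandler, ToyDefaultHandler, CountingHandler, runToyParser( )) is hypothetical, and the interfaces are cut down to one method each.

```cpp
#include <string>

// Toy versions of two handler interfaces (not the Xerces classes).
struct ToyContentHandler {
    virtual ~ToyContentHandler() { }
    virtual void startElement(const std::string& localname) = 0;
};
struct ToyErrorHandler {
    virtual ~ToyErrorHandler() { }
    virtual void error(const std::string& message) = 0;
};

// Plays the role of DefaultHandler: derives from both interfaces and
// supplies no-op implementations.
struct ToyDefaultHandler : ToyContentHandler, ToyErrorHandler {
    void startElement(const std::string&) { }
    void error(const std::string&) { }
};

// A single handler class that overrides only the callbacks it needs.
struct CountingHandler : ToyDefaultHandler {
    CountingHandler() : elements(0), errors(0) { }
    void startElement(const std::string&) { ++elements; }
    void error(const std::string&) { ++errors; }
    int elements;
    int errors;
};

// Registers one object in both roles, then fires a few events at it.
CountingHandler runToyParser()
{
    CountingHandler handler;
    ToyContentHandler* content = &handler;   // like setContentHandler(&handler)
    ToyErrorHandler* errors = &handler;      // like setErrorHandler(&handler)
    content->startElement("animal");
    content->startElement("name");
    errors->error("simulated error");
    return handler;
}
```

Because both pointers refer to the same object, the content callbacks and the error callbacks share state, which is convenient when an error message should mention the element being processed.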

There's one last feature of Example 14-6 you should notice: CircusContentHandler makes no attempt to verify that the document being parsed has the correct structure; for instance, it doesn't check that the root is an animalList element or that all the children of the root are animal elements. This is in sharp contrast with Example 14-3. For example, the main( ) function in Example 14-3 verifies that the top-level element is an animalList, and the function nodeToAnimal( ) verifies that its argument represents an animal element with exactly five child elements of type name, species, dateOfBirth, veterinarian, and trainer.

It's possible to modify Example 14-6 so that it performs this type of error checking. The ContentHandler in Example 14-9, for instance, verifies that the document's root element is an animalList, that its children are of type animal, and that the children of an animal element don't contain other elements. It works by maintaining three boolean flags, parsingAnimalList_, parsingAnimal_, and parsingAnimalChild_, which record the region of the document that is being parsed at any given time. The methods startElement( ) and endElement( ) simply update these flags and check them for consistency, delegating the task of updating the current Animal to the helper methods startAnimalChild( ) and endAnimalChild( ), whose implementations are very similar to the implementations of startElement( ) and endElement( ) in Example 14-6.

Example 14-9. A SAX2 ContentHandler for animals.xml that checks the document's structure

// Implements callbacks which receive character data and
// notifications about the beginnings and ends of elements
class CircusContentHandler : public DefaultHandler {
public:
 CircusContentHandler(vector<Animal>& animalList) 
 : animalList_(animalList), // list to be populated
 parsingAnimalList_(false), // parsing state
 parsingAnimal_(false), // parsing state
 parsingAnimalChild_(false) // parsing state
 { }

 // Receives notifications from the parser each time the
 // beginning of an element is encountered
 void startElement( 
 const XMLCh *const uri, // Namespace uri
 const XMLCh *const localname, // simple tag name
 const XMLCh *const qname, // qualified tag name
 const Attributes &attrs ) // Collection of attributes
 {
 static XercesString animalList = fromNative("animalList");
 static XercesString animal = fromNative("animal");
 static XercesString xmlns = 
 fromNative("http://www.feldman-family-circus.com");

 // Validate the namespace uri
 if (uri != xmlns)
 throw runtime_error(
 string("wrong namespace uri: ") + toNative(uri)
 );

 // (i) Update the flags parsingAnimalList_, parsingAnimal_, 
 // and parsingAnimalChild_, which indicate where we are
 // within the document 
 // (ii) verify that the elements are correctly
 // nested; 
 // (iii) Delegate most of the work to the method 
 // startAnimalChild( )
 if (!parsingAnimalList_) { 
 // We've just encountered the document root
 if (localname == animalList) { 
 parsingAnimalList_ = true; // Update parsing state.
 } else {
 // Incorrect nesting
 throw runtime_error(
 string("expected 'animalList', got ") + 
 toNative(localname )
 );
 }
 } else if (!parsingAnimal_) {
 // We've just encountered a new animal
 if (localname == animal) {
 parsingAnimal_ = true; // Update parsing state.
 animalList_.push_back(Animal( )); // Add an Animal to the list.
 } else {
 // Incorrect nesting
 throw runtime_error(
 string("expected 'animal', got ") + 
 toNative(localname )
 );
 }
 } else {
 // We're in the middle of parsing an animal element.
 if (parsingAnimalChild_) {
 // Incorrect nesting
 throw runtime_error("bad animal element"); 
 } 
 // Update parsing state.
 parsingAnimalChild_ = true; 

 // Let startAnimalChild( ) do the real work
 startAnimalChild(uri, localname, qname, attrs); 
 }
 }

 
 void endElement( 
 const XMLCh *const uri, // Namespace uri
 const XMLCh *const localname, // simple tag name
 const XMLCh *const qname ) // qualified tag name
 {
 static XercesString animalList = fromNative("animalList");
 static XercesString animal = fromNative("animal");

 // Update the flags parsingAnimalList_, parsingAnimal_,
 // and parsingAnimalChild_; delegate most of the work
 // to endAnimalChild( )
 if (localname == animal) {
 parsingAnimal_ = false;
 } else if (localname == animalList) {
 parsingAnimalList_ = false;
 } else {
 endAnimalChild(uri, localname, qname);
 parsingAnimalChild_ = false;
 }
 }

 // Receives notifications when character data is encountered
 void characters(const XMLCh* const chars, const unsigned int length) 
 {
 // Append characters to currentText_ for processing by
 // the method endAnimalChild( )
 currentText_.append(chars, length);
 }
private:
 // If the current element represents a veterinarian or trainer,
 // use attrs to construct a Contact object for the current 
 // Animal; otherwise, clear currentText_ in preparation for the 
 // characters( ) callback
 void startAnimalChild(
 const XMLCh *const uri, // Namespace uri
 const XMLCh *const localname, // simple tag name
 const XMLCh *const qname, // qualified tag name
 const Attributes &attrs ) // Collection of attributes
 {
 static XercesString vet = fromNative("veterinarian");
 static XercesString trainer = fromNative("trainer");

 Animal& animal = animalList_.back( );
 if (localname == vet) {
 // We've encountered a "veterinarian" element.
 animal.setVeterinarian(contactFromAttributes(attrs));
 } else if (localname == trainer) {
 // We've encountered a "trainer" element.
 animal.setTrainer(contactFromAttributes(attrs));
 } else {
 // We've encountered a "name" , "species", or 
 // "dateOfBirth" element. Its content will be supplied
 // by the callback function characters( ).
 currentText_.clear( );
 }
 }

 // If the current element represents a name, species, or date
 // of birth, use the text stored in currentText_ to set the
 // appropriate property of the current Animal.
 void endAnimalChild(
 const XMLCh *const uri, // Namespace uri
 const XMLCh *const localname, // simple tag name
 const XMLCh *const qname ) // qualified tag name
 {
 static XercesString name = fromNative("name");
 static XercesString species = fromNative("species");
 static XercesString dob = fromNative("dateOfBirth");

 // currentText_ contains the content of the element which has
 // just ended. Use it to set the current Animal's properties.
 Animal& animal = animalList_.back( );
 if (localname == name) {
 animal.setName(toNative(currentText_));
 } else if (localname == species) {
 animal.setSpecies(toNative(currentText_));
 } else if (localname == dob) {
 animal.setDateOfBirth(toNative(currentText_));
 } 
 }

 vector<Animal>& animalList_; // list to be populated
 bool parsingAnimalList_; // parsing state
 bool parsingAnimal_; // parsing state
 bool parsingAnimalChild_; // parsing state
 XercesString currentText_; // character data of the
 // current text node
};

Comparing Example 14-9 with Example 14-6, you can see how complex it can be to verify a document's structure using callbacks. What's more, Example 14-9 still doesn't perform as much checking as Example 14-3: it doesn't verify that the children of an animal element appear in the correct order, for instance. Fortunately, there are much easier ways to verify a document's structure using SAX2, as you'll see in Recipe 14.5 and Recipe 14.6.
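
To give a sense of what the missing order check would involve if done by hand, here is a rough sketch that is not part of the original examples. It assumes the children must appear in the order name, species, dateOfBirth, veterinarian, trainer; the class ChildOrderChecker and the function acceptsChildren( ) are hypothetical, and std::string stands in for XercesString.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: tracks which child of an animal element is
// expected next and rejects anything else.
class ChildOrderChecker {
public:
    ChildOrderChecker() : next_(0) { }

    // Call when an animal element starts.
    void startAnimal() { next_ = 0; }

    // Call for each child element of the current animal.
    void startChild(const std::string& localname)
    {
        if (next_ == count() || localname != expected()[next_]) {
            throw std::runtime_error("unexpected element: " + localname);
        }
        ++next_;
    }

    // Call when the animal element ends.
    void endAnimal() const
    {
        if (next_ != count()) {
            throw std::runtime_error("animal element is missing children");
        }
    }
private:
    // Assumed order of an animal's children
    static const char* const* expected()
    {
        static const char* const names[] = {
            "name", "species", "dateOfBirth", "veterinarian", "trainer"
        };
        return names;
    }
    static std::size_t count() { return 5; }

    std::size_t next_;    // index of the next expected child
};

// Feeds a space-separated list of child names to the checker, as a parser
// would through successive startElement( ) calls; returns false on any error.
bool acceptsChildren(const std::string& names)
{
    ChildOrderChecker checker;
    try {
        checker.startAnimal();
        std::istringstream in(names);
        std::string name;
        while (in >> name) {
            checker.startChild(name);
        }
        checker.endAnimal();
    } catch (const std::runtime_error&) {
        return false;
    }
    return true;
}
```

Even this small check adds another piece of state to maintain alongside the three flags in Example 14-9, which is the kind of bookkeeping that validation against a DTD or schema takes off your hands.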

See Also

Recipe 14.1, Recipe 14.4, Recipe 14.5, and Recipe 14.6




