Parsing a Simple XML Document

Problem

You have a collection of data stored in an XML document. You want to parse the document and turn the data it contains into a collection of C++ objects. Your XML document is small enough to fit into memory and doesn't use an internal Document Type Definition (DTD) or XML Namespaces.

Solution

Use the TinyXml library. First, define an object of type TiXmlDocument and call its LoadFile( ) method, passing the pathname of your XML document as its argument. If LoadFile( ) returns true, your document has been parsed successfully; call the RootElement( ) method to obtain a pointer to an object of type TiXmlElement representing the document root. This object has a hierarchical structure that reflects the structure of your XML document; by traversing this structure, you can extract information about the document and use this information to create a collection of C++ objects.

For example, suppose you have an XML document animals.xml representing a collection of circus animals, as shown in Example 14-1. The document root is named animalList and has a number of child animal elements each representing an animal owned by the Feldman Family Circus. Suppose you also have a C++ class named Animal, and you want to construct a std::vector of Animals corresponding to the animals listed in the document.

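Example 14-1. An XML document representing a list of circus animals

<animalList>
 <animal>
  <name>Herby</name>
  <species>elephant</species>
  <dateOfBirth>1992-04-23</dateOfBirth>
  <veterinarian name="…" phone="…"/>
  <trainer name="…" phone="…"/>
 </animal>
 <animal>
  <name>Sheldon</name>
  <species>parrot</species>
  <dateOfBirth>1998-09-30</dateOfBirth>
  <veterinarian name="…" phone="…"/>
  <trainer name="…" phone="…"/>
 </animal>
 <animal>
  <name>Dippy</name>
  <species>penguin</species>
  <dateOfBirth>2001-06-08</dateOfBirth>
  <veterinarian name="…" phone="…"/>
  <trainer name="…" phone="…"/>
 </animal>
</animalList>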

Example 14-2 shows how the definition of the class Animal might look. Animal has five data members corresponding to an animal's name, species, date of birth, veterinarian, and trainer. An animal's name and species are represented as std::strings, its date of birth is represented as a boost::gregorian::date from Boost.Date_Time, and its veterinarian and trainer are represented as instances of the class Contact, also defined in Example 14-2. Example 14-3 shows how to use TinyXml to parse the document animals.xml, traverse the parsed document, and populate a std::vector of Animals using data extracted from the document.

Example 14-2. The header animal.hpp

#ifndef ANIMALS_HPP_INCLUDED
#define ANIMALS_HPP_INCLUDED

#include <ostream>
#include <string>
#include <stdexcept> // runtime_error
#include <boost/date_time/gregorian/gregorian.hpp>
#include <boost/regex.hpp>

// Represents a veterinarian or trainer
class Contact {
public:
 Contact( ) { }
 Contact(const std::string& name, const std::string& phone)
 : name_(name)
 { 
 setPhone(phone);
 }
 std::string name( ) const { return name_; }
 std::string phone( ) const { return phone_; }
 void setName(const std::string& name) { name_ = name; }
 void setPhone(const std::string& phone)
 { 
 using namespace std;
 using namespace boost;
 // Use Boost.Regex to verify that phone 
 // has the form (ddd)ddd-dddd
 static regex pattern("\\([0-9]{3}\\)[0-9]{3}-[0-9]{4}");
 if (!regex_match(phone, pattern)) {
 throw runtime_error(string("bad phone number:") + phone);
 }
 phone_ = phone;
 }
private:
 std::string name_;
 std::string phone_;
};

// Compare two Contacts for equality; used in Recipe 14.9
// (for completeness, you should also define operator!=)
inline bool operator==(const Contact& lhs, const Contact& rhs)
{
 return lhs.name( ) == rhs.name( ) && lhs.phone( ) == rhs.phone( );
}

// Writes a Contact to an ostream
inline std::ostream& operator<<(std::ostream& out, const Contact& contact)
{
 out << contact.name( ) << " " << contact.phone( );
 return out;
}

// Represents an animal 
class Animal {
public:
 // Default constructs an Animal; this is 
 // the constructor you'll use most
 Animal( ) { }

 // Constructs an Animal with the given properties; 
 // you'll use this constructor in Recipe 14.9
 Animal( const std::string& name, 
 const std::string& species, 
 const std::string& dob, 
 const Contact& vet, 
 const Contact& trainer )
 : name_(name),
 species_(species),
 vet_(vet),
 trainer_(trainer)
 { 
 setDateOfBirth(dob);
 }

 // Getters
 std::string name( ) const { return name_; }
 std::string species( ) const { return species_; }
 boost::gregorian::date dateOfBirth( ) const { return dob_; }
 Contact veterinarian( ) const { return vet_; }
 Contact trainer( ) const { return trainer_; }

 // Setters
 void setName(const std::string& name) { name_ = name; }
 void setSpecies(const std::string& species) { species_ = species; }
 void setDateOfBirth(const std::string& dob) 
 { 
 dob_ = boost::gregorian::from_string(dob); 
 }
 void setVeterinarian(const Contact& vet) { vet_ = vet; }
 void setTrainer(const Contact& trainer) { trainer_ = trainer; }
private:
 std::string name_;
 std::string species_;
 boost::gregorian::date dob_;
 Contact vet_;
 Contact trainer_;
};

// Compare two Animals for equality; used in Recipe 14.9
// (for completeness, you should also define operator!=)
inline bool operator==(const Animal& lhs, const Animal& rhs)
{
 return lhs.name( ) == rhs.name( ) && 
 lhs.species( ) == rhs.species( ) && 
 lhs.dateOfBirth( ) == rhs.dateOfBirth( ) && 
 lhs.veterinarian( ) == rhs.veterinarian( ) && 
 lhs.trainer( ) == rhs.trainer( );
}

// Writes an Animal to an ostream
inline std::ostream& operator<<(std::ostream& out, const Animal& animal)
{
 out << "Animal {\n"
 << " name=" << animal.name( ) << ";\n"
 << " species=" << animal.species( ) << ";\n"
 << " date-of-birth=" << animal.dateOfBirth( ) << ";\n"
 << " veterinarian=" << animal.veterinarian( ) << ";\n"
 << " trainer=" << animal.trainer( ) << ";\n"
 << "}";
 return out;
}

#endif // #ifndef ANIMALS_HPP_INCLUDED

Example 14-3. Parsing animals.xml with TinyXml

#include <exception>
#include <iostream> // cout
#include <stdexcept> // runtime_error
#include <cstdlib> // EXIT_FAILURE
#include <cstring> // strcmp
#include <vector>
#include <tinyxml.h>
#include "animal.hpp"

using namespace std;

// Extracts the content of an XML element that contains only text
const char* textValue(TiXmlElement* e)
{
 TiXmlNode* first = e->FirstChild( ); 
 if ( first != 0 && 
 first == e->LastChild( ) &&
 first->Type( ) == TiXmlNode::TEXT )
 {
 // the element e has a single child, of type TEXT;
 // return the child's value
 return first->Value( );
 } else {
 throw runtime_error(string("bad ") + e->Value( ) + " element");
 }
}

// Constructs a Contact from a "veterinarian" or "trainer" element
Contact nodeToContact(TiXmlElement* contact)
{
 using namespace std;
 const char *name, *phone;
 if ( contact->FirstChild( ) == 0 &&
 (name = contact->Attribute("name")) && 
 (phone = contact->Attribute("phone")) )
 {
 // The element contact is childless and has "name" 
 // and "phone" attributes; use these values to
 // construct a Contact
 return Contact(name, phone);
 } else {
 throw runtime_error(string("bad ") + contact->Value( ) + " element");
 }
}

// Constructs an Animal from an "animal" element
Animal nodeToAnimal(TiXmlElement* animal)
{
 using namespace std;

 // Verify that animal corresponds to an "animal" element
 if (strcmp(animal->Value( ), "animal") != 0) {
 throw runtime_error(string("bad animal: ") + animal ->Value( ));
 }

 Animal result; // Return value
 TiXmlElement* element = animal->FirstChildElement( );

 // Read name
 if (element && strcmp(element->Value( ), "name") == 0) {
 // The first child element of animal is a "name"
 // element; use its text value to set the name of result
 result.setName(textValue(element));
 } else {
 throw runtime_error("no name element");
 }

 // Read species
 element = element->NextSiblingElement( );
 if (element && strcmp(element->Value( ), "species") == 0) {
 // The second child element of animal is a "species"
 // element; use its text value to set the species of result
 result.setSpecies(textValue(element));
 } else {
 throw runtime_error("no species element");
 }

 // Read date of birth
 element = element->NextSiblingElement( );
 if (element && strcmp(element->Value( ), "dateOfBirth") == 0) {
 // The third child element of animal is a "dateOfBirth"
 // element; use its text value to set the date of birth
 // of result
 result.setDateOfBirth(textValue(element));
 } else {
 throw runtime_error("no dateOfBirth element");
 }

 // Read veterinarian
 element = element->NextSiblingElement( );
 if (element && strcmp(element->Value( ), "veterinarian") == 0) {
 // The fourth child element of animal is a "veterinarian"
 // element; use it to construct a Contact object and
 // set result's veterinarian
 result.setVeterinarian(nodeToContact(element));
 } else {
 throw runtime_error("no veterinarian element");
 }

 // Read trainer
 element = element->NextSiblingElement( );
 if (element && strcmp(element->Value( ), "trainer") == 0) {
 // The fifth child element of animal is a "trainer"
 // element; use it to construct a Contact object and
 // set result's trainer
 result.setTrainer(nodeToContact(element));
 } else {
 throw runtime_error("no trainer element");
 }

 // Check that there are no more children
 element = element->NextSiblingElement( );
 if (element != 0) {
 throw runtime_error(
 string("unexpected element:") + 
 element->Value( )
 );
 
 }

 return result;
}

int main( )
{
 using namespace std;

 try {
 vector<Animal> animalList;

 // Parse "animals.xml"
 TiXmlDocument doc("animals.xml");
 if (!doc.LoadFile( )) 
 throw runtime_error("bad parse");
 
 // Verify that the document has a root named animalList
 TiXmlElement* root = doc.RootElement( );
 if (root == 0) {
 throw runtime_error("missing root element");
 }
 if (strcmp(root->Value( ), "animalList") != 0) {
 throw runtime_error(string("bad root: ") + root->Value( ));
 }

 // Traverse children of root, populating the list 
 // of animals
 for ( TiXmlElement* animal = root->FirstChildElement( );
 animal;
 animal = animal->NextSiblingElement( ) )
 {
 animalList.push_back(nodeToAnimal(animal));
 }
 
 // Print the animals
 for ( vector<Animal>::size_type i = 0,
 n = animalList.size( );
 i < n;
 ++i )
 {
 cout << animalList[i] << "\n";
 }
 } catch (const exception& e) {
 cout << e.what( ) << "\n";
 return EXIT_FAILURE;
 }
}

Discussion

TinyXml is an excellent choice for applications that need to do just a bit of XML processing. Its source distribution is small, it's easy to build and integrate with projects, and it has a very simple interface. It also has a very permissive license. Its main limitations are that it doesn't understand XML Namespaces, can't validate against a DTD or schema, and can't parse XML documents containing an internal DTD. If you need any of these features, or any of the XML-related technologies such as XPath or XSLT, you should use one of the other libraries covered in this chapter.

The TinyXml parser produces a representation of an XML document as a tree whose nodes represent the elements, text, comments and other components of an XML document. The root of the tree represents the XML document itself. This type of representation of a hierarchical document as a tree is known as a Document Object Model (DOM). The TinyXml DOM is similar to the one designed by the World Wide Web Consortium (W3C), although it does not conform to the W3C specification. In keeping with the minimalist spirit of TinyXml, the TinyXml DOM is simpler than the W3C DOM, but also less powerful.

The nodes in the tree representing an XML document can be accessed through the interface TiXmlNode, which provides methods to access a node's parent, to enumerate its child nodes, and to remove child nodes or insert additional child nodes. Each node is actually an instance of a more derived type; for example, the root of the tree is an instance of TiXmlDocument, nodes representing elements are instances of TiXmlElement, and nodes representing text are instances of TiXmlText. The type of a TiXmlNode can be determined by calling its Type( ) method; once you know the type of a node, you can obtain a representation of the node as a more derived type by calling one of the convenience methods such as ToDocument( ), ToElement( ), and ToText( ). These derived types contain additional methods appropriate to the type of node they represent.
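
The pattern of checking a node's type and then converting it can be sketched without TinyXml. In the following minimal model, the classes Node, Element, and Text and the function describe( ) are hypothetical stand-ins invented for illustration; they are not TinyXml's classes, and only the method names Type( ), ToElement( ), and ToText( ) mirror the ones mentioned above.

```cpp
#include <string>

// Hypothetical stand-ins for TiXmlNode, TiXmlElement and TiXmlText;
// these are NOT the TinyXml classes, only a model of the pattern.
class Element;
class Text;

class Node {
public:
    enum NodeType { ELEMENT, TEXT };
    virtual ~Node() { }
    virtual NodeType Type() const = 0;
    // Each conversion returns a null pointer unless a derived class overrides it
    virtual Element* ToElement() { return 0; }
    virtual Text* ToText() { return 0; }
};

class Element : public Node {
public:
    explicit Element(const std::string& tag) : tag_(tag) { }
    NodeType Type() const { return ELEMENT; }
    Element* ToElement() { return this; }
    std::string tag() const { return tag_; }
private:
    std::string tag_;
};

class Text : public Node {
public:
    explicit Text(const std::string& content) : content_(content) { }
    NodeType Type() const { return TEXT; }
    Text* ToText() { return this; }
    std::string content() const { return content_; }
private:
    std::string content_;
};

// Describes a node, using Type() to decide which conversion to call
std::string describe(Node& node)
{
    switch (node.Type()) {
    case Node::ELEMENT:
        return "element <" + node.ToElement()->tag() + ">";
    case Node::TEXT:
        return "text \"" + node.ToText()->content() + "\"";
    }
    return "unknown";
}
```

Calling a conversion that does not match the node's actual type yields a null pointer in this model, which is why the sketch consults Type( ) before converting.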

It's now easy to understand Example 14-3. First, the function textValue( ) extracts the text content from an element that contains only text, such as name, species, or dateOfBirth. It does this by first checking that the element has exactly one child, and that the child is a text node. It then obtains the child's text by calling the Value( ) method, which returns the textual content of a text node or comment node, the tag name of an element node, and the filename of a root node.

Next, the function nodeToContact( ) takes a node corresponding to a veterinarian or trainer element and constructs a Contact object from the values of its name and phone attributes, which it retrieves using the Attribute( ) method.
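
Because the Contact constructor calls setPhone( ), nodeToContact( ) also throws if the phone attribute does not have the form (ddd)ddd-dddd. Here is a rough sketch of that check, assuming a C++11 compiler and using std::regex in place of the Boost.Regex used in Example 14-2; the function names isValidPhone( ) and checkPhone( ) are hypothetical.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Returns true if phone has the form (ddd)ddd-dddd
bool isValidPhone(const std::string& phone)
{
    // Assumption: std::regex stands in for the Boost.Regex of Example 14-2.
    // The parentheses are escaped so that they match literally.
    static const std::regex pattern("\\([0-9]{3}\\)[0-9]{3}-[0-9]{4}");
    return std::regex_match(phone, pattern);
}

// Hypothetical helper: throws on malformed input, as Contact::setPhone() does
void checkPhone(const std::string& phone)
{
    if (!isValidPhone(phone)) {
        throw std::runtime_error("bad phone number:" + phone);
    }
}
```

Since regex_match( ) must match the entire string, input with extra leading or trailing characters is rejected as well.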

Similarly, the function nodeToAnimal( ) takes a node corresponding to an animal element and constructs an Animal object. It does this by iterating over the node's children using the NextSiblingElement( ) method, extracting the data contained in each element, and setting the corresponding property of the Animal object. The data is extracted using the function textValue( ) for the elements name, species, and dateOfBirth and the function nodeToContact( ) for the elements veterinarian and trainer.
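
Note that nodeToAnimal( ) expects the five child elements to appear in a fixed order, with nothing after them. That ordering rule can be sketched on its own, independent of TinyXml, as a check over a list of element names. The function checkChildOrder( ) below is a hypothetical helper, not part of Example 14-3, and its error messages are illustrative.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Throws unless actual holds exactly the expected child element
// names of an "animal" element, in the expected order
void checkChildOrder(const std::vector<std::string>& actual)
{
    static const char* const expected[] = {
        "name", "species", "dateOfBirth", "veterinarian", "trainer"
    };
    const std::size_t n = sizeof(expected) / sizeof(expected[0]);

    for (std::size_t i = 0; i < n; ++i) {
        // A missing or misplaced child is reported by the name expected here
        if (i >= actual.size() || actual[i] != expected[i]) {
            throw std::runtime_error(
                std::string("no ") + expected[i] + " element");
        }
    }
    // Anything after the fifth child is an error
    if (actual.size() > n) {
        throw std::runtime_error("unexpected element:" + actual[n]);
    }
}
```

A table of expected names keeps the order in one place; Example 14-3 instead spells out each step, which leaves room for the per-element extraction logic.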

In the main function, I first construct a TiXmlDocument object corresponding to the file animals.xml and parse it using the LoadFile() method. I then obtain a TiXmlElement corresponding to the document root by calling the RootElement( ) method. Next, I iterate over the children of the root element, constructing an Animal object from each animal element using the function nodeToAnimal( ). Finally, I iterate over the collection of Animal objects, writing them to standard output.

One feature of TinyXml that is not illustrated in Example 14-3 is the SaveFile( ) method of TiXmlDocument, which writes the document represented by a TiXmlDocument to a file. This allows you to parse an XML document, modify it using the DOM interface, and save the modified document. You can even create a TiXmlDocument from scratch and save it to disk:

// Create a document hello.xml, consisting 
// of a single "hello" element
TiXmlDocument doc;
TiXmlElement root("hello");
doc.InsertEndChild(root);
doc.SaveFile("hello.xml");