Parsing XML with XmlTextReader

Let’s start by looking at how you can parse XML with the XmlTextReader class. XmlTextReader provides you with a way to parse XML data that minimizes resource usage by reading forward through the document, recognizing elements as it reads. Very little data is cached in memory, but the forward-only style has two main consequences. The first is that it isn’t possible to go back to an earlier point in the file without starting to read from the top again. The second consequence is slightly more subtle: elements are read and presented to you one by one, with no context. So, if you need to keep track of where an element occurs within the document structure, you’ll need to do it yourself. If either of these consequences sounds like a limitation to you, you might need to use the XmlDocument class, which is discussed in the “Using XmlDocument” section later in this chapter.

XmlTextReader uses a pull model, which means that you call a function to get the next node when you’re ready. This model is in contrast to the widely used SAX (Simple API for XML), which uses a push model, meaning that it fires events at callback functions that you provide. The following tables list the main properties and methods of the XmlTextReader class.

Property	Description
AttributeCount	Returns the number of attributes on the current node
Depth	Returns the depth of the current node in the tree
Encoding	Returns the character encoding of the document
EOF	Returns true if the reader is at the end of the stream
HasValue	Returns true if the current node can have a value
IsEmptyElement	Returns true if the current element is an empty element (for example, <height value="3794" unit="m"/>)
Item	Gets the value of an attribute
LineNumber	Returns the current line number
LinePosition	Returns the character position within the current line
LocalName	Returns the name of the current element without a namespace prefix
Name	Returns the full name of the current element
Namespaces	Determines whether the parser should use namespaces
NamespaceURI	Gets the namespace URI for the current node
NodeType	Gets the type of the current node
Prefix	Returns the current namespace prefix
ReadState	Returns the state of the reader (for example, closed, at the end of the file, or still reading)
Value	Gets the value for the current node
XmlLang	Gets the current xml:lang scope

Method	Description
Close	Changes the state of the reader to Closed, and closes the underlying stream.
GetAttribute	Gets the value of an attribute.
IsStartElement	Returns true if the current node is a start tag.
MoveToAttribute	Moves to the attribute with a specified index or name.
MoveToContent	Moves to the next content node. This method will skip over non-content nodes, such as those of type ProcessingInstruction, DocumentType, Comment, Whitespace, or SignificantWhitespace.
MoveToElement	Moves to the element that contains the current attribute.
MoveToFirstAttribute, MoveToNextAttribute	Iterates over the attributes for an element.
Read	Reads the next node from the stream.
ReadAttributeValue	Processes attribute values that contain entities.
ReadBase64, ReadBinHex	Reads content encoded as Base64 or BinHex (binary to hexadecimal).
ReadChars	Reads character content.
ReadString	Reads the content of an element or a text node as a string.
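
To make the difference between the pull and push models described earlier a little more concrete, here is a minimal sketch in standard C++ (not Managed C++, and not the XmlTextReader API). The names PullReader, pushParse, readUntil, and collectPushed are hypothetical and exist only for this illustration, and the "document" is assumed to be nothing more than a flat list of node names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed document: a flat list of node names.
typedef std::vector<std::string> Nodes;

// Pull model: the caller asks for the next node when it is ready for it.
class PullReader {
public:
    explicit PullReader(const Nodes& nodes)
        : nodes_(nodes), next_(0), onNode_(false) {}

    // Advances to the next node; returns false once the input is exhausted.
    bool read() {
        onNode_ = next_ < nodes_.size();
        if (onNode_) ++next_;
        return onNode_;
    }

    // Name of the node that the last successful read() moved to.
    const std::string& name() const {
        assert(onNode_);
        return nodes_[next_ - 1];
    }

private:
    Nodes nodes_;
    std::size_t next_;
    bool onNode_;
};

// Push model: the parser owns the loop and calls back for every node.
void pushParse(const Nodes& nodes,
               const std::function<void(const std::string&)>& onNode) {
    for (const std::string& n : nodes) onNode(n);
}

// Pull: the caller owns the loop, so it can stop as soon as it has
// seen the node it was looking for.
std::vector<std::string> readUntil(const Nodes& nodes,
                                   const std::string& stopAt) {
    std::vector<std::string> seen;
    PullReader rdr(nodes);
    while (rdr.read()) {
        seen.push_back(rdr.name());
        if (rdr.name() == stopAt) break;
    }
    return seen;
}

// Push: in this sketch the callback is handed every node in turn.
std::vector<std::string> collectPushed(const Nodes& nodes) {
    std::vector<std::string> seen;
    pushParse(nodes, [&seen](const std::string& n) { seen.push_back(n); });
    return seen;
}
```

In the pull version the while loop belongs to your code, which is the same shape as the loop around Read that you will write in the exercise. In the push version the loop belongs to the parser, and your code only supplies the callback.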

The most important function in the second of these tables is Read, which tells the XmlTextReader to fetch the next node from the document. Once you’ve got the node, you can use the NodeType property to find out what you have. You’ll get one of the members of the XmlNodeType enumeration, whose members are listed in the following table.

Node Type	Description
Attribute	An attribute, for example, type=hardback
CDATA	A CDATA section
Comment	An XML comment
Document	The document object, representing the root of the XML tree
DocumentFragment	A fragment of XML that isn’t a document in itself
DocumentType	A document type declaration
Element, EndElement	The start and end of an XML element
Entity, EndEntity	The start and end of an entity declaration
EntityReference	An entity reference (for example, &lt;)
None	Used if the node type is queried when no node has been read
Notation	A notation entry in a DTD
ProcessingInstruction	An XML processing instruction
SignificantWhitespace	White space in a mixed content model document, or when xml:space=preserve has been set
Text	The text content of an element
Whitespace	White space between markup
XmlDeclaration	The XML declaration at the top of a document

The following exercise will show you how to read an XML document using the XmlTextReader class. Here’s the sample XML document used by this exercise and the other exercises in this chapter, which lists details of three volcanoes and which contains many common XML constructs:

    <?xml version="1.0" ?>
    <!-- Volcano data -->
    <geology>
      <volcano name="Erebus">
        <location>Ross Island, Antarctica</location>
        <height value="3794" unit="m"/>
        <type>stratovolcano</type>
        <eruption>constant activity</eruption>
        <magma>basanite to trachyte</magma>
      </volcano>
      <volcano name="Hekla">
        <location>Iceland</location>
        <type>stratovolcano</type>
        <height value="1491" unit="m"/>
        <eruption>1970</eruption>
        <eruption>1980</eruption>
        <eruption>1991</eruption>
        <magma>calcalkaline</magma>
        <comment>The type is actually intermediate between crater row and stratovolcano types</comment>
      </volcano>
      <volcano name="Mauna Loa">
        <location>Hawaii</location>
        <type>shield</type>
        <height value="13677" unit="ft"/>
        <eruption>1984</eruption>
        <magma>basaltic</magma>
      </volcano>
    </geology>

Create a new Visual C++ Console Application (.NET) project named CppXmlTextReader.
Add the following two lines to the top of CppXmlTextReader.cpp:
```
#using <System.xml.dll>
using namespace System::Xml;
```
The code for the XML classes lives in System.xml.dll, so include it via a #using directive. It’s also going to be easier to use the classes if you include a using directive for the System::Xml namespace, as shown in the preceding code.
Because you’re going to supply the name of the XML document when you run the program from the command line, change the declaration of the _tmain function to include the command-line argument parameters, as follows:
```
int _tmain(int argc, char* argv[])
```

Add this code to the start of the _tmain function to check the number of arguments and save the path:

    // Check for required arguments
    if (argc < 2)
    {
        Console::WriteLine(S"Usage: CppXmlTextReader path");
        return -1;
    }

    String* path = new String(argv[1]);

Now that you’ve got the path, create an XmlTextReader to parse the file.
```
try
{
    // Create the reader...
    XmlTextReader* rdr = new XmlTextReader(path);
}
catch (Exception* pe)
{
    Console::WriteLine(pe->ToString());
}
```
The XmlTextReader constructor takes the name of the document you want to parse. It’s a good idea to catch exceptions here because several things can go wrong at this stage, including passing the constructor a bad path name. You can build and run the application from the command line at this stage if you want to check that the file opens correctly.

Note that XmlTextReader isn’t limited to reading from files. Alternative constructors let you take XML input from URLs, streams, strings, and other TextReader objects.

Parsing the file simply means making repeated calls to the Read function until the parser runs out of XML to read. The simplest way to do this is to put a call to Read inside a while loop.
Add this code to the end of the code inside the try block:
```
// Read nodes
while (rdr->Read())
{
    // do something with the data
}
```
The Read function returns true or false depending on whether there are any more nodes to read.
Each call to Read positions the XmlTextReader on a new node, and you query the NodeType property to find out which of the node types listed in the preceding table you are dealing with. Add the following code, which checks the node type against several of the most common types:
```
// Read nodes
while (rdr->Read())
{
    switch (rdr->NodeType)
    {
    case XmlNodeType::XmlDeclaration:
        Console::WriteLine(S"-> XML declaration");
        break;
    case XmlNodeType::Document:
        Console::WriteLine(S"-> Document node");
        break;
    case XmlNodeType::Element:
        Console::WriteLine(S"-> Element node, name={0}", rdr->Name);
        break;
    case XmlNodeType::EndElement:
        Console::WriteLine(S"-> End element node, name={0}", rdr->Name);
        break;
    case XmlNodeType::Text:
        Console::WriteLine(S"-> Text node, value={0}", rdr->Value);
        break;
    case XmlNodeType::Comment:
        Console::WriteLine(S"-> Comment node, name={0}, value={1}",
            rdr->Name, rdr->Value);
        break;
    case XmlNodeType::Whitespace:
        // White space is deliberately ignored
        break;
    default:
        Console::WriteLine(S"** Unknown node type");
        break;
    }
}
```
Every time a new node is read, the switch statement checks its type against members of the XmlNodeType enumeration. I haven’t included the cases for every possible node type, but only those that occur in the sample document.

You’ll notice that the Name and Value properties are used for some node types. Whether a node has a Name and a Value depends on the node type. For example, elements have names but no value of their own (their text content arrives as a separate Text node), and comments have a value (the comment text) but no name. Processing instructions normally have both names and values.

Also notice that nodes of type XmlNodeType::Whitespace are simply discarded. The volcanoes.xml file contains plenty of white space to make it readable to humans, but the CppXmlTextReader program isn’t really interested in white space, so the program prints nothing when it encounters a white space node.
Build the application, and run it from the command line, giving the name of an XML file:
```
CppXmlTextReader volcanoes.xml
```
The first few lines of the output should look like this:
```
-> XML declaration
-> Comment node, name=, value= Volcano data
-> Element node, name=geology
-> Element node, name=volcano
-> Element node, name=location
-> Text node, value=Ross Island, Antarctica
-> End element node, name=location
-> Element node, name=height
-> Element node, name=type
-> Text node, value=stratovolcano
-> End element node, name=type
-> Element node, name=eruption
-> Text node, value=constant activity
```
The first node is the XML declaration at the top of the document, and it’s followed by a comment, whose value is the comment text. Each XML element that has content produces a matching pair of Element and EndElement nodes, with the text content represented by a nested Text node. An empty element such as height produces only an Element node, which is why no End element line follows it in the output.

You can see that the nodes are presented to you in linear sequence, so if you want to keep track of the hierarchical structure of the document, you’re going to have to put code in place to do it yourself.
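
One common way to keep track of the hierarchy yourself is to maintain a stack of the elements that are currently open. The following is a minimal sketch in standard C++ rather than Managed C++: the Node struct and the elementPaths function are hypothetical, and Node merely imitates the NodeType, Name, and IsEmptyElement properties of the reader. Note the special case for empty elements, which have to be popped immediately because no EndElement node follows them.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the nodes a forward-only reader hands out.
enum class NodeKind { Element, EndElement, Text };

struct Node {
    NodeKind kind;
    std::string name;   // element name; empty for text nodes
    bool isEmpty;       // true for an empty element such as <height .../>
};

// Returns the full path (for example, /geology/volcano/type) of every
// element, in the order in which the elements are encountered.
std::vector<std::string> elementPaths(const std::vector<Node>& nodes) {
    std::vector<std::string> open;    // names of the currently open elements
    std::vector<std::string> paths;

    for (const Node& n : nodes) {
        if (n.kind == NodeKind::Element) {
            open.push_back(n.name);

            // Build the path from the root down to this element.
            std::string path;
            for (const std::string& part : open) path += "/" + part;
            paths.push_back(path);

            // An empty element has no EndElement node, so close it now.
            if (n.isEmpty) open.pop_back();
        } else if (n.kind == NodeKind::EndElement) {
            assert(!open.empty());    // input is assumed to be well formed
            open.pop_back();
        }
        // Text nodes do not change the current position in the tree.
    }
    return paths;
}
```

The size of the stack at any moment corresponds to the nesting level of the current node, which is the same information that the Depth property of XmlTextReader reports.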

Verifying Well-Formed XML

XML that is correctly constructed is called well-formed XML, which means that elements will be correctly nested and that every element tag will have a matching end element tag. If the XmlTextReader encounters badly formed XML, it will throw an XmlException to tell you what it thinks is wrong. As with all parsing errors, the place where it’s reported might be some distance from the real site of the error.
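
To see why the reported position can be some way from the real mistake, consider this small nesting check. It is a sketch in standard C++ under simplifying assumptions, not a description of how XmlTextReader works internally: the Tag struct and the firstNestingError function are hypothetical, and the input is assumed to be already split into start and end tags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, already tokenized tag: <name> or </name>.
struct Tag {
    bool isEnd;         // true for an end tag, false for a start tag
    std::string name;
};

// Returns the index of the tag at which a nesting error is detected,
// tags.size() if elements are still open when the input ends,
// or -1 if every start tag is closed in the right order.
int firstNestingError(const std::vector<Tag>& tags) {
    std::vector<std::string> open;    // start tags not yet closed

    for (std::size_t i = 0; i < tags.size(); ++i) {
        assert(!tags[i].name.empty());    // tag names must not be empty

        if (!tags[i].isEnd) {
            open.push_back(tags[i].name);
        } else {
            // An end tag must match the most recently opened element.
            if (open.empty() || open.back() != tags[i].name)
                return static_cast<int>(i);
            open.pop_back();
        }
    }
    return open.empty() ? -1 : static_cast<int>(tags.size());
}
```

If the end tag for location is left out, the check only notices a problem when it reaches the end tag for volcano, because that is the first tag that doesn’t match the innermost open element. The error is therefore reported at the volcano end tag, even though the real mistake lies earlier in the input.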

Handling Attributes

XML elements can include attributes, which consist of name/value pairs and are always string data. In the sample XML file, the volcano element has a name attribute, and the height element has value and unit attributes. To process the attributes on an element, add code to the Element case in the switch statement so that it looks like this:

    case XmlNodeType::Element:
        Console::WriteLine(S"-> Element node, name={0}", rdr->Name);
        if (rdr->AttributeCount > 0)
        {
            // Indent the attribute list under the element line
            Console::Write(S" ");
            while (rdr->MoveToNextAttribute())
                Console::Write(S" {0}={1}", rdr->Name, rdr->Value);
            // Finish the attribute line
            Console::WriteLine();
        }
        break;

The AttributeCount property will tell you how many attributes an element has, and the MoveToNextAttribute method will let you iterate over the collection of attributes, each of which has a name and a value. Alternatively, you can use the MoveToAttribute function to position the reader on a particular attribute by specifying either a name or a zero-based index.

Attributes are read along with the element node that they’re part of. When reading attributes, you can use the MoveToElement method to position the reader back to the parent element. When you run the code, you should see output similar to this for nodes that have attributes:

    -> Element node, name=height
      value=13677 unit=ft
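
It can help to picture the reader as a cursor that sits either on the element itself or on one of its attributes, with Name and Value always describing whatever the cursor is currently on. The following toy model in standard C++ is only a sketch of that idea: the ElementCursor class and the listAttributes function are hypothetical, and their methods merely imitate MoveToNextAttribute and MoveToElement.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cursor over one element and its attributes.
class ElementCursor {
public:
    typedef std::pair<std::string, std::string> Attribute;

    ElementCursor(std::string name, std::vector<Attribute> attrs)
        : name_(std::move(name)), attrs_(std::move(attrs)), index_(-1) {}

    // Moves to the first attribute, or to the next one; returns false
    // (and stays where it is) when there are no more attributes.
    bool moveToNextAttribute() {
        if (index_ + 1 >= static_cast<int>(attrs_.size())) return false;
        ++index_;
        return true;
    }

    // Moves back to the element; returns true if the cursor was on an attribute.
    bool moveToElement() {
        bool wasOnAttribute = index_ >= 0;
        index_ = -1;
        return wasOnAttribute;
    }

    // Name and value of whatever the cursor is positioned on.
    std::string name() const {
        return index_ < 0 ? name_ : attrs_[index_].first;
    }
    std::string value() const {
        return index_ < 0 ? std::string() : attrs_[index_].second;
    }

private:
    std::string name_;
    std::vector<Attribute> attrs_;
    int index_;    // -1 means "on the element itself"
};

// Lists the attributes as " name=value" pairs, then restores the cursor.
std::string listAttributes(ElementCursor& cur) {
    std::string out;
    while (cur.moveToNextAttribute())
        out += " " + cur.name() + "=" + cur.value();

    bool moved = cur.moveToElement();
    assert(moved || out.empty());    // we only moved if something was listed
    (void)moved;
    return out;
}
```

After the loop the cursor is left on the last attribute, so asking for the name at that point would give you the attribute name rather than the element name. Moving back to the element first, as listAttributes does, avoids that surprise.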