Parsing XML with XmlTextReader



Let’s start by looking at how you can parse XML with the XmlTextReader class. XmlTextReader provides you with a way to parse XML data that minimizes resource usage by reading forward through the document, recognizing elements as it reads. Very little data is cached in memory, but the forward-only style has two main consequences. The first is that it isn’t possible to go back to an earlier point in the file without starting to read from the top again. The second consequence is slightly more subtle: elements are read and presented to you one by one, with no context. So, if you need to keep track of where an element occurs within the document structure, you’ll need to do it yourself. If either of these consequences sounds like a limitation to you, you might need to use the XmlDocument class, which is discussed in the “Using XmlDocument” section later in this chapter.

XmlTextReader uses a pull model, which means that you call a function to get the next node when you’re ready. This model is in contrast to the widely used SAX (Simple API for XML) API, which uses a push model, meaning that it fires events at callback functions that you provide. The following tables list the main properties and methods of the XmlTextReader class.

| Property | Description |
| --- | --- |
| AttributeCount | Returns the number of attributes on the current node |
| Depth | Returns the depth of the current node in the tree |
| Encoding | Returns the character encoding of the document |
| EOF | Returns true if the reader is at the end of the stream |
| HasValue | Returns true if the current node can have a value |
| IsEmptyElement | Returns true if the current node is an empty element (for example, <height/>) |
| Item | Gets the value of an attribute |
| LineNumber | Returns the current line number |
| LinePosition | Returns the character position within the current line |
| LocalName | Returns the name of the current element without a namespace prefix |
| Name | Returns the full name of the current element |
| Namespaces | Determines whether the parser should use namespaces |
| NamespaceURI | Gets the namespace URI for the current node |
| NodeType | Gets the type of the current node |
| Prefix | Returns the current namespace prefix |
| ReadState | Returns the state of the reader (for example, closed, at the end of the file, or still reading) |
| Value | Gets the value for the current node |
| XmlLang | Gets the current xml:lang scope |

| Method | Description |
| --- | --- |
| Close | Changes the state of the reader to Closed, and closes the underlying stream. |
| GetAttribute | Gets the value of an attribute. |
| IsStartElement | Calls MoveToContent and returns true if the current content node is a start tag or an empty element tag. |
| MoveToAttribute | Moves to the attribute with a specified index or name. |
| MoveToContent | Moves to the next content node. This method will skip over non-content nodes, such as those of type ProcessingInstruction, DocumentType, Comment, Whitespace, or SignificantWhitespace. |
| MoveToElement | Moves to the element that contains the current attribute. |
| MoveToFirstAttribute, MoveToNextAttribute | Iterates over the attributes for an element. |
| Read | Reads the next node from the stream. |
| ReadAttributeValue | Processes attribute values that contain entities. |
| ReadBase64, ReadBinHex | Reads content encoded as Base64 or BinHex (binary to hexadecimal). |
| ReadChars | Reads character content. |
| ReadString | Reads the content of an element or a text node as a string. |

The most important function in the second of these tables is Read, which tells the XmlTextReader to fetch the next node from the document. Once you’ve got the node, you can use the NodeType property to find out what you have. You’ll get one of the members of the XmlNodeType enumeration, whose members are listed in the following table.

| Node Type | Description |
| --- | --- |
| Attribute | An attribute, for example, type="hardback" |
| CDATA | A CDATA section |
| Comment | An XML comment |
| Document | The document object, representing the root of the XML tree |
| DocumentFragment | A fragment of XML that isn’t a document in itself |
| DocumentType | A document type declaration |
| Element, EndElement | The start and end of an XML element |
| Entity | An entity declaration |
| EndEntity | The end of an entity replacement, returned after a call to ResolveEntity |
| EntityReference | An entity reference (for example, &lt;) |
| None | Used if the node type is queried when no node has been read |
| Notation | A notation entry in a DTD |
| ProcessingInstruction | An XML processing instruction |
| SignificantWhitespace | White space in a mixed content model document, or when xml:space="preserve" has been set |
| Text | The text content of an element |
| Whitespace | White space between markup |
| XmlDeclaration | The XML declaration at the top of a document |
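To make the contrast between the pull and push models concrete, here is a minimal sketch in portable standard C++ rather than Managed Extensions, so that it compiles without .NET. It does not use XmlTextReader or SAX. NodeKind, Node, PullReader, pushParse, and the two elementNames functions are hypothetical names invented for this illustration, and the "document" is simply a list of nodes that has already been split up.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for XmlNodeType.
enum class NodeKind { Element, Text, EndElement };

struct Node {
    NodeKind kind;
    std::string text;  // element name, or the characters of a text node
};

// Pull model: the caller asks for the next node when it is ready,
// in the same spirit as calling Read in a while loop.
class PullReader {
public:
    explicit PullReader(std::vector<Node> nodes) : nodes_(std::move(nodes)) {}

    // Advances to the next node; returns false when the input is exhausted.
    bool read() {
        if (next_ >= nodes_.size()) return false;
        ++next_;
        return true;
    }

    // The node most recently read. Valid only after read() has returned true.
    const Node& current() const {
        assert(next_ > 0);
        return nodes_[next_ - 1];
    }

private:
    std::vector<Node> nodes_;
    std::size_t next_ = 0;  // index of the node that the next read() will deliver
};

// Push model: the parser owns the loop and calls back into your code,
// in the same spirit as SAX event handlers.
void pushParse(const std::vector<Node>& nodes,
               const std::function<void(const Node&)>& onNode) {
    for (const Node& n : nodes) onNode(n);
}

// Collects element names with the pull model. The loop belongs to the caller.
std::vector<std::string> elementNamesPull(const std::vector<Node>& nodes) {
    std::vector<std::string> names;
    PullReader rdr(nodes);
    while (rdr.read()) {
        if (rdr.current().kind == NodeKind::Element)
            names.push_back(rdr.current().text);
    }
    return names;
}

// Collects the same names with the push model. The loop belongs to the parser.
std::vector<std::string> elementNamesPush(const std::vector<Node>& nodes) {
    std::vector<std::string> names;
    pushParse(nodes, [&names](const Node& n) {
        if (n.kind == NodeKind::Element) names.push_back(n.text);
    });
    return names;
}
```

Both functions produce the same list. The practical difference is who controls the loop: with the pull version, the caller can stop reading as soon as it has what it needs, whereas the push version has to carry any state between callbacks.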

The following exercise will show you how to read an XML document using the XmlTextReader class. Here’s the sample XML document used by this exercise and the other exercises in this chapter, which lists details of three volcanoes and which contains many common XML constructs:

    <?xml version="1.0" ?>
    <!-- Volcano data -->
    <geology>
      <volcano name="Erebus">
        <location>Ross Island, Antarctica</location>
        <height value="3794" unit="m"/>
        <type>stratovolcano</type>
        <eruption>constant activity</eruption>
        <magma>basanite to trachyte</magma>
      </volcano>
      <volcano name="Hekla">
        <location>Iceland</location>
        <type>stratovolcano</type>
        <height value="1491" unit="m"/>
        <eruption>1970</eruption>
        <eruption>1980</eruption>
        <eruption>1991</eruption>
        <magma>calcalkaline</magma>
        <comment>The type is actually intermediate between crater row and stratovolcano types</comment>
      </volcano>
      <volcano name="Mauna Loa">
        <location>Hawaii</location>
        <type>shield</type>
        <height value="13677" unit="ft"/>
        <eruption>1984</eruption>
        <magma>basaltic</magma>
      </volcano>
    </geology>

  1. Create a new Visual C++ Console Application (.NET) project named CppXmlTextReader.

  2. Add the following two lines to the top of CppXmlTextReader.cpp:

    #using <System.xml.dll>
    using namespace System::Xml;

    The code for the XML classes lives in System.xml.dll, so include it via a #using directive. It’s also going to be easier to use the classes if you include a using directive for the System::Xml namespace, as shown in the preceding code.

  3. Because you’re going to supply the name of the XML document when you run the program from the command line, change the declaration of the _tmain function to include the command-line argument parameters, as follows:

    int _tmain(int argc, char* argv[])
  4. Add this code to the start of the _tmain function to check the number of arguments and save the path:

    // Check for required arguments
    if (argc < 2)
    {
        Console::WriteLine(S"Usage: CppXmlTextReader path");
        return -1;
    }

    String* path = new String(argv[1]);

  5. Now that you’ve got the path, create an XmlTextReader to parse the file.

    try
    {
        // Create the reader...
        XmlTextReader* rdr = new XmlTextReader(path);
    }
    catch (Exception* pe)
    {
        Console::WriteLine(pe->ToString());
    }

    The XmlTextReader constructor takes the name of the document you want to parse. It’s a good idea to catch exceptions here because several things can go wrong at this stage, including passing the constructor a bad path name. You can build and run the application from the command line at this stage if you want to check that the file opens correctly.

    Note that XmlTextReader isn’t limited to reading from files. Alternative constructors let you take XML input from URLs, streams, strings, and other TextReader objects.

    Parsing the file simply means making repeated calls to the Read function until the parser runs out of XML to read. The simplest way to do this is to put a call to Read inside a while loop.

  6. Add this code to the end of the code inside the try block:

    // Read nodes
    while (rdr->Read())
    {
        // do something with the data
    }

    The Read function returns true or false depending on whether there are any more nodes to read.

  7. Each call to Read positions the XmlTextReader on a new node, and you query the NodeType property to find out which of the node types listed in the preceding table you are dealing with. Add the following code, which checks the node type against several of the most common types:

    // Read nodes
    while (rdr->Read())
    {
        switch (rdr->NodeType)
        {
        case XmlNodeType::XmlDeclaration:
            Console::WriteLine(S"-> XML declaration");
            break;
        case XmlNodeType::Document:
            Console::WriteLine(S"-> Document node");
            break;
        case XmlNodeType::Element:
            Console::WriteLine(S"-> Element node, name={0}", rdr->Name);
            break;
        case XmlNodeType::EndElement:
            Console::WriteLine(S"-> End element node, name={0}", rdr->Name);
            break;
        case XmlNodeType::Text:
            Console::WriteLine(S"-> Text node, value={0}", rdr->Value);
            break;
        case XmlNodeType::Comment:
            Console::WriteLine(S"-> Comment node, name={0}, value={1}",
                rdr->Name, rdr->Value);
            break;
        case XmlNodeType::Whitespace:
            // White space between markup is ignored
            break;
        default:
            Console::WriteLine(S"** Unknown node type");
            break;
        }
    }

    Every time a new node is read, the switch statement checks its type against members of the XmlNodeType enumeration. I haven’t included the cases for every possible node type, but only those that occur in the sample document.

    You’ll notice that the Name and Value properties are used for some node types. Whether a node has a Name and a Value depends on the node type. For example, elements have names but no value of their own (their text content arrives as a separate Text node), and comments have a value (the comment text) but not names. Processing instructions normally have both names and values.

    Also notice that nodes of type XmlNodeType::Whitespace are simply discarded. The volcanoes.xml file contains plenty of white space to make it readable to humans, but the CppXmlTextReader program isn’t really interested in white space, so the program prints nothing when it encounters a white space node.

  8. Build the application, and run it from the command line, giving the name of an XML file:

    CppXmlTextReader volcanoes.xml

    The first few lines of the output should look like this:

    -> XML declaration
    -> Comment node, name=, value= Volcano data
    -> Element node, name=geology
    -> Element node, name=volcano
    -> Element node, name=location
    -> Text node, value=Ross Island, Antarctica
    -> End element node, name=location
    -> Element node, name=height
    -> Element node, name=type
    -> Text node, value=stratovolcano
    -> End element node, name=type
    -> Element node, name=eruption
    -> Text node, value=constant activity

    The first node is the XML declaration at the top of the document, and it’s followed by a comment, whose value is the comment text. Each XML element that has content will produce a matching pair of Element and EndElement nodes, with the content represented by a nested Text node. An empty element such as height produces only an Element node, with no EndElement node, which is why no end node for height appears in the output.

    You can see that the nodes are presented to you in linear sequence, so if you want to keep track of the hierarchical structure of the document, you’re going to have to put code in place to do it yourself.
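One common way to keep track of the hierarchy yourself is to maintain a stack of the elements that are currently open. The following is a minimal sketch of that idea in portable standard C++ rather than Managed Extensions, fed by a hand-built sequence of nodes instead of a real XmlTextReader. NodeKind, Node, and PathTracker are hypothetical names invented for this illustration. The separate EmptyElement kind is also a simplification: with the real reader, you would check the IsEmptyElement property on an Element node before deciding whether to push its name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node model for the sketch.
enum class NodeKind { Element, EmptyElement, Text, EndElement };

struct Node {
    NodeKind kind;
    std::string text;  // element name, or the characters of a text node
};

// Keeps a stack of open element names so that every node can be
// reported together with its position in the document tree.
class PathTracker {
public:
    // Call once per node, in document order. Returns the path of the
    // element that this node starts, ends, or is contained in.
    std::string visit(const Node& n) {
        switch (n.kind) {
        case NodeKind::Element:
            open_.push_back(n.text);
            return path();
        case NodeKind::EmptyElement:
            // No end node will follow, so the name is never pushed.
            return path() + "/" + n.text;
        case NodeKind::EndElement: {
            assert(!open_.empty());
            std::string closing = path();
            open_.pop_back();
            return closing;
        }
        case NodeKind::Text:
            return path();
        }
        return path();
    }

    // Number of elements currently open.
    std::size_t depth() const { return open_.size(); }

private:
    // Joins the open element names into a path such as /geology/volcano.
    std::string path() const {
        std::string p;
        for (const std::string& name : open_) p += "/" + name;
        return p;
    }

    std::vector<std::string> open_;
};
```

Fed the nodes for the first volcano, a tracker like this would report the text node Ross Island, Antarctica under /geology/volcano/location, which is the context that the bare node sequence does not carry.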

Verifying Well-Formed XML

XML that is correctly constructed is called well-formed XML, which means, among other things, that elements will be correctly nested and that every start tag will have a matching end tag (or be written as an empty element tag, such as <height/>). If the XmlTextReader encounters badly formed XML, it will throw an XmlException to tell you what it thinks is wrong. As with all parsing errors, the place where it’s reported might be some distance from the real site of the error.
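The nesting rule, and the reason errors tend to be reported late, can both be seen in a small stack-based check. This is a sketch in portable standard C++ under simplifying assumptions, not the algorithm that XmlTextReader actually uses: the input is a hand-built list of start and end tags, and Tag and checkNesting are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tag event: a start tag or an end tag, plus the line it is on.
struct Tag {
    bool isStart;
    std::string name;
    int line;
};

// Returns an empty string if every start tag is closed in the right order.
// Otherwise returns a message naming the line where the problem was detected.
std::string checkNesting(const std::vector<Tag>& tags) {
    std::vector<Tag> open;  // start tags that have not been closed yet
    for (const Tag& t : tags) {
        if (t.isStart) {
            open.push_back(t);
            continue;
        }
        if (open.empty()) {
            return "line " + std::to_string(t.line) + ": unexpected </" +
                   t.name + ">";
        }
        if (open.back().name != t.name) {
            return "line " + std::to_string(t.line) + ": <" +
                   open.back().name + "> from line " +
                   std::to_string(open.back().line) + " is closed by </" +
                   t.name + ">";
        }
        open.pop_back();
    }
    if (!open.empty()) {
        return "end of input: <" + open.back().name + "> from line " +
               std::to_string(open.back().line) + " is never closed";
    }
    return "";
}
```

If a location element opened on line 2 is missing its end tag, the check cannot notice anything until it meets the end tag for volcano several lines later, so the problem is detected at that later line even though the mistake is on line 2.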

Handling Attributes

XML elements can include attributes, which consist of name/value pairs and are always string data. In the sample XML file, the volcano element has a name attribute, and the height element has value and unit attributes. To process the attributes on an element, add code to the Element case in the switch statement so that it looks like this:

    case XmlNodeType::Element:
        Console::WriteLine(S"-> Element node, name={0}", rdr->Name);
        if (rdr->AttributeCount > 0)
        {
            // Indent the attribute list under the element line
            Console::Write(S" ");
            while (rdr->MoveToNextAttribute())
                Console::Write(S" {0}={1}", rdr->Name, rdr->Value);
            // Finish the attribute line
            Console::WriteLine();
        }
        break;

The AttributeCount property will tell you how many attributes an element has, and the MoveToNextAttribute method will let you iterate over the collection of attributes, each of which has a name and a value. Alternatively, you can use the MoveToAttribute function to position the reader on a particular attribute by specifying either a name or a zero-based index.

Attributes are read along with the element node that they’re part of. When reading attributes, you can use the MoveToElement method to position the reader back to the parent element. When you run the code, you should see output similar to this for nodes that have attributes:

    -> Element node, name=height
      value=13677 unit=ft




Microsoft Visual C++ .NET(c) Step by Step
ISBN: 735615675
EAN: N/A
Year: 2003
Pages: 208
