10.2. Techniques for Reading XML Data

XML can be represented in two basic ways: as the familiar external document containing embedded data, or as an in-memory tree structure known as a Document Object Model (DOM). In the former case, XML can be read in a forward-only manner as a stream of tokens representing the file's content. The object that performs the reading stays connected to the data source during the read operations. The XmlReader and XmlTextReader shown in Figure 10-3 operate in this manner.

Figure 10-3. Classes to read XML data

More options are available for processing the DOM because it is stored in memory and can be traversed randomly. For simply reading a tree, the XmlNodeReader class offers an efficient way to traverse a tree in a forward, read-only manner. Other more sophisticated approaches that also permit tree modification are covered later in this section.

XmlReader Class

XmlReader is an abstract class possessing methods and properties that enable an application to pull data from an XML file one node at a time in a forward-only, read-only manner. A depth-first search is performed, beginning with the root node in the document. Nodes are inspected using the Name, NodeType, and Value properties.

XmlReader serves as a base class for the concrete classes XmlTextReader and XmlNodeReader. As an abstract class, XmlReader cannot be directly instantiated; however, it has a static Create method that can return an instance of the XmlReader class. This feature became available with the release of .NET Framework 2.0 and is recommended over the XmlTextReader class for reading XML streams.

Listing 10-6 illustrates how to create an XmlReader object and use it to read the contents of a short XML document file. The code is also useful for illustrating how .NET converts the content of the file into a stream of node objects. It's important to understand the concept of nodes because an XML or HTML document is defined (by the official W3C Document Object Model (DOM) specification^[2] ) as a hierarchy of node objects.

^[2] W3C Document Object Model (DOM) Level 3 Core Specification, April, 2004, http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html

Listing 10-6. Using `XmlReader` to Read an XML Document

 // Include these namespaces:
 // using System;
 // using System.Xml;
 // using System.Xml.XPath;
 public void ShowNodes()
 {
    //(1) Settings object enables/disables features on XmlReader
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Fragment;
    settings.IgnoreWhitespace = true;
    try
    {
       //(2) Create XmlReader object
       XmlReader rdr = XmlReader.Create("c:\\oscarsshort.xml",
                                        settings);
       while (rdr.Read())
       {
          Format(rdr);
       }
       rdr.Close();
    }
    catch (Exception e)
    {
       Console.WriteLine("Exception: {0}", e.ToString());
    }
 }
 // Parameter is the abstract XmlReader so that it accepts
 // the object returned by XmlReader.Create
 private static void Format(XmlReader reader)
 {
    //(3) Print current node properties
    Console.Write(reader.NodeType + "<" + reader.Name + ">" +
                  reader.Value);
    Console.WriteLine();
 }

Before creating the XmlReader, the code first creates an XmlReaderSettings object. This object sets features that define how the XmlReader object processes the input stream. For example, the ConformanceLevel property specifies how the input is checked. The statement

 settings.ConformanceLevel = ConformanceLevel.Fragment;

specifies that the input must conform to the standards that define an XML 1.0 document fragment, that is, XML content that does not necessarily have a single root node.

This object and the name of the XML document file are then passed to the Create method that returns an XmlReader instance:

 XmlReader rdr = XmlReader.Create("c:\\oscarsshort.xml", settings);

The file's content is read a node at a time by the XmlReader.Read method, and the Format method prints the NodeType, Name, and Value of each node. Listing 10-7 shows the input file and a portion of the generated output. Line numbers have been added so that an input line and its corresponding node information can be compared.

Listing 10-7. XML Input and Corresponding Nodes

Input File: oscarsshort.xml

 (1) <?xml version="1.0" standalone="yes"?>
 (2) <films>
 (3)  <movies>
 (4)    <!-- Selected by AFI as best movie -->
 (5)    <movie_ID>5</movie_ID>
 (6)    <![CDATA[<a href="http://www.imdb.com/tt0467/">Kane</a>]]>
 (7)    <movie_Title>Citizen Kane </movie_Title>
 (8)    <movie_Year>1941</movie_Year>
 (9)    <movie_Director>Orson Welles</movie_Director>
 (10)   <bestPicture>Y</bestPicture>
 (11) </movies>
 (12)</films>

Program Output (NodeType, <Name>, Value):

 (1) XmlDeclaration<xml>version="1.0" standalone="yes"
 (2) Element<films>
 (3) Element<movies>
 (4) Comment<> Selected by AFI as best movie
 (5) Element<movie_ID>
       Text<>5
     EndElement<movie_ID>
 (6) CDATA<><a href="http://www.imdb.com/tt0467/">Kane</a>
 (7) Element<movie_Title>
       Text<>Citizen Kane
     EndElement<movie_Title>
       ...
 (12)EndElement<films>
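The effect of the ConformanceLevel setting can be seen with a small, self-contained sketch. It assumes the XML arrives as an in-memory string rather than a file, and the names FragmentDemo and DumpNodes are hypothetical. Each node is rendered in the same NodeType, <Name>, Value layout used by the Format method.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentDemo   // hypothetical name, for illustration only
{
    // Returns one "NodeType<Name>Value" string per node read from the input.
    public static List<string> DumpNodes(string xml, ConformanceLevel level)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = level;
        settings.IgnoreWhitespace = true;

        List<string> lines = new List<string>();
        using (XmlReader rdr = XmlReader.Create(new StringReader(xml), settings))
        {
            while (rdr.Read())
            {
                lines.Add(rdr.NodeType + "<" + rdr.Name + ">" + rdr.Value);
            }
        }
        return lines;
    }
}
```

Passing two sibling elements with no enclosing root, such as "<movie_ID>5</movie_ID><movie_Year>1941</movie_Year>", succeeds with ConformanceLevel.Fragment. The same input with ConformanceLevel.Document raises an XmlException when the second top-level element is reached.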

Programs that use XmlReader typically implement a logic pattern that consists of an outer loop that reads nodes and an inner switch statement that identifies the node using the XmlNodeType enumeration. The logic to process the node information is handled in the case blocks:

 while (reader.Read())
 {
    switch (reader.NodeType)
    {
       case XmlNodeType.Element:
          // Attributes are contained in elements
          while (reader.MoveToNextAttribute())
          {
             Console.WriteLine(reader.Name + reader.Value);
          }
          break;
       case XmlNodeType.Text:
          // Process ..
          break;
       case XmlNodeType.EndElement:
          // Process ..
          break;
    }
 }

The Element, Text, and Attribute nodes mark most of the data content in an XML document. Note that the Attribute node is regarded as metadata attached to an element and is the only one not exposed directly by the XmlReader.Read method. As shown in the preceding code segment, the attributes in an Element can be accessed using the MoveToNextAttribute method.
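As a minimal sketch of attribute access, the following reads XML from a string and records every attribute together with the element that owns it. The names AttributeDemo and ListAttributes and the "element@name=value" output format are assumptions made for illustration. The call to MoveToElement returns the reader to the owning element after its attributes have been visited.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AttributeDemo   // hypothetical name, for illustration only
{
    // Returns "element@attribute=value" for each attribute in the input.
    public static List<string> ListAttributes(string xml)
    {
        List<string> found = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                    continue;

                string element = reader.Name;
                // Attributes are not returned by Read; step through them explicitly
                while (reader.MoveToNextAttribute())
                {
                    found.Add(element + "@" + reader.Name + "=" + reader.Value);
                }
                // Reposition on the element that owns the attributes
                reader.MoveToElement();
            }
        }
        return found;
    }
}
```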

Table 10-1 summarizes the node types. It is worth noting that these types are not an arbitrary .NET implementation. With the exception of Whitespace and XmlDeclaration, they conform to the DOM Structure Model recommendation.

Table 10-1. `XmlNodeType` Enumeration
Option	Description and Use
`Attribute`	An attribute or value contained within an element. Example: `<movie_title genre="comedy">The Lady Eve </movie_title>` The attribute is `genre="comedy"`. Attributes must be located within an element: `if (reader.NodeType == XmlNodeType.Element) { while (reader.MoveToNextAttribute()) { Console.WriteLine(reader.Name + reader.Value); } }`
`CDATA`	Designates a section whose content is not to be parsed. Markup characters are treated as text: `<![CDATA[<ELEMENT> <a href="http://www.imdb.com">movies</a> </ELEMENT>]]>`
`Comment`	To make a comment: `<!-- comment -->` To have comments ignored: `XmlReaderSettings.IgnoreComments = true;`
`Document`	A document root object that provides access to the entire XML document.
`DocumentFragment`	A document fragment. This is a node or subtree within a document. It provides a way to work with part of a document.
`DocumentType`	Document type declaration indicated by `<!DOCTYPE … >`. Can refer to an external Document Type Definition (DTD) file or be an inline block containing `Entity` and `Notation` declarations.
`Element`	An XML element. Designated by the `< >` brackets: `<movie_Title>`
`EndElement`	An XML end element tag. Marks the end of an element: `</movie_Title>`
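`EndEntity`	Marks the end of an entity's replacement text. Returned when the reader reaches the end of an entity that was expanded by a call to `ResolveEntity`.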
`Entity`	Defines text or a resource to replace the entity name in the XML. An entity is defined as a child of a document type node: <!DOCTYPE movies[ <!ENTITY leadingactress "stanwyck"> ]> XML would then reference this as: `<actress>&leadingactress;</actress>`
`EntityReference`	A reference to the entity. In the preceding example, `&leadingactress;` is an `EntityReference`.
`Notation`	A notation that is declared within a `DocumentType` declaration. Primary use is to pass information to the XML processor. Example: `<!NOTATION homepage SYSTEM "www.sci.com">`
`ProcessingInstruction`	Useful for providing information about how the data was generated or how to process it. Example: `<?pi1 Requires IE 5.0 and above ?>`
`Text`	The text content of a node.
`Whitespace`	Whitespace refers to formatting characters such as tabs, line feeds, returns, and spaces that exist between the markup and affect the layout of a document.
`XmlDeclaration`	The first node in the document. It provides version information. `<?xml version="1.0" standalone="yes"?>`

XmlNodeReader Class

The XmlNodeReader is another forward-only reader that processes XML as a stream of nodes. It differs from the XmlReader class in two significant ways:

It processes nodes from an in-memory DOM tree structure rather than a text file.
It can begin reading at any subtree node in the structure, not just at the root node (beginning of the document).

In Listing 10-8, an XmlNodeReader object is used to list the movie title and year from the XML-formatted movies database. The code contains an interesting twist: The XmlNodeReader object is not used directly, but instead is passed as a parameter to the XmlReader.Create method. The XmlReader object that Create returns serves as a wrapper around the node reader, which performs the actual reading. This approach has the advantage of allowing XmlReaderSettings values to be assigned to the reader.

Listing 10-8. Using `XmlNodeReader` to Read an XML Document

 private void ListMovies()
 {
    // (1) Specify XML file to be loaded as a DOM
    XmlDocument doc = new XmlDocument();
    doc.Load("c:\\oscarwinners.xml");
    // (2) Settings for use with XmlNodeReader object
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Fragment;
    settings.IgnoreWhitespace = true;
    settings.IgnoreComments = true;
    // (3) Create a nodereader object
    XmlNodeReader noderdr = new XmlNodeReader(doc);
    // (4) Create an XmlReader as a wrapper around node reader
    XmlReader reader = XmlReader.Create(noderdr, settings);
    while (reader.Read())
    {
       if (reader.NodeType == XmlNodeType.Element)
       {
          if (reader.Name == "movie_Title")
          {
             reader.Read();  // next node is text for title
             Console.Write(reader.Value);    // Movie Title
          }
          if (reader.Name == "movie_Year")
          {
             reader.Read();  // next node is text for year
             Console.WriteLine(reader.Value); // year
          }
       }
    }
 }

The parameter passed to the XmlNodeReader constructor determines the first node in the tree to be read. When the entire document is passed, as in this example, reading begins with the top node in the tree. To select a specific node, use the XmlDocument.SelectSingleNode method as illustrated in this segment:

 XmlDocument doc = new XmlDocument();
 doc.Load("c:\\oscarwinners.xml");  // Build tree in memory
 XmlNodeReader noderdr = new
       XmlNodeReader(doc.SelectSingleNode("films/movies[2]"));

Refer to Listing 10-1 and you can see that this selects the second movies element group, which contains information on Casablanca.

If your application requires read-only access to XML data and the capability to read selected subtrees, the XmlNodeReader is an efficient solution. When updating, writing, and searching become requirements, a more sophisticated approach is required; we'll look at those techniques later in this section.
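The subtree technique can be sketched in a self-contained form by building the DOM from a string with LoadXml instead of loading a file. The names SubtreeDemo and TitlesUnder are hypothetical, and the sample XML used to exercise the method is a stand-in for the movies data, not the actual contents of Listing 10-1.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SubtreeDemo   // hypothetical name, for illustration only
{
    // Returns the text of every movie_Title element found in the subtree
    // selected by the XPath expression.
    public static List<string> TitlesUnder(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);                       // build tree in memory

        List<string> titles = new List<string>();
        XmlNode start = doc.SelectSingleNode(xpath);
        if (start == null)
            return titles;                      // no node matched the expression

        using (XmlNodeReader reader = new XmlNodeReader(start))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "movie_Title")
                {
                    reader.Read();              // next node is text for title
                    titles.Add(reader.Value);
                }
            }
        }
        return titles;
    }
}
```

Because the reader starts at the selected node, titles that belong to other movies groups in the document are never visited.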

The XmlReaderSettings Class

A significant advantage of using an XmlReader object directly or as a wrapper is the presence of the XmlReaderSettings class as a way to define the behavior of the XmlReader object. Its most useful properties specify which node types in the input stream are ignored and whether XML validation is performed. Table 10-2 lists the XmlReaderSettings properties.

Table 10-2. Properties of the `XmlReaderSettings` Class
Property	Default Value	Description
`CheckCharacters`	true	Indicates whether characters and XML names are checked for illegal XML characters. An exception is thrown if one is encountered.
`CloseInput`	false	An `XmlReader` object may be created by passing a stream to it. This property indicates whether the stream is closed when the reader object is closed.
`ConformanceLevel`	Document	Indicates whether the XML should conform to the standards for a `Document` or `DocumentFragment`.
`DtdValidate`	false	Indicates whether to perform DTD validation.
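`IgnoreComments`	false	Specifies whether comments are processed or ignored by the `XmlReader.Read` method.
`IgnoreInlineSchema`	true	Specifies whether an inline schema is processed or ignored by the `XmlReader.Read` method.
`IgnoreProcessingInstructions`	false	Specifies whether processing instructions are processed or ignored by the `XmlReader.Read` method.
`IgnoreSchemaLocation`	true	Specifies whether schema location hints are processed or ignored by the `XmlReader.Read` method.
`IgnoreValidationWarnings`	true	Specifies whether validation warnings are processed or ignored by the `XmlReader.Read` method.
`IgnoreWhitespace`	false	Specifies whether whitespace is processed or ignored by the `XmlReader.Read` method.
`LineNumberOffset`	0	Offset applied to the line numbers that the `XmlReader` reports. Set this property to change the beginning line number.
`LinePositionOffset`	0	Offset applied to the line positions that the `XmlReader` reports. Set this property to change the beginning line position.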
`Schemas`	is empty	Contains the `XmlSchemaSet` to be used for XML Schema Definition Language (XSD) validation.
`XsdValidate`	false	Indicates whether XSD validation is performed.

Using an XML Schema to Validate XML Data

The final two properties listed in Table 10-2, Schemas and XsdValidate, are used to validate XML data against a schema. Recall that a schema is a template that describes the permissible content in an XML file or stream. Validation can (and should) be used to ensure that data being read conforms to the rules of the schema. To request validation, first add the validating schema to the XmlSchemaSet collection of the Schemas property; next, set XsdValidate to true; and finally, define an event handler to be called if a validation error occurs. The following code fragment shows the code used with the schema and XML data in Listings 10-1 and 10-3:

 XmlReaderSettings settings = new XmlReaderSettings();
 // (1) Specify schema to be used for validation
 settings.Schemas.Add(null, "c:\\oscarwinners.xsd");
 // (2) Must set this to true
 settings.XsdValidate = true;
 // (3) Delegate to handle validation error event
 settings.ValidationEventHandler +=
    new System.Xml.Schema.ValidationEventHandler(SchemaValidation);
 // (4) Create reader and pass settings to it
 XmlReader rdr = XmlReader.Create("c:\\oscarwinners.xml",
                                  settings);
 // process XML data ...
 ...
 // Method to handle errors detected during schema validation
 private void SchemaValidation(object sender,
                               System.Xml.Schema.ValidationEventArgs e)
 {
    MessageBox.Show(e.Message);
 }

Note that a detected error does not stop processing. This means that all the XML data can be checked in one pass without restarting the program.
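The one-pass behavior can be tried with a self-contained sketch that validates an in-memory string against an in-memory schema and collects the error messages instead of displaying them. Released versions of the .NET Framework (2.0 and later) expose the validation switch as the ValidationType property rather than XsdValidate, so the sketch sets ValidationType.Schema. The names ValidationDemo and Validate are hypothetical, and the schema used to exercise the method is an assumed one-element example, not the oscarwinners.xsd file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidationDemo   // hypothetical name, for illustration only
{
    // Reads the whole document and returns every validation message raised.
    public static List<string> Validate(string xsd, string xml)
    {
        List<string> errors = new List<string>();

        XmlReaderSettings settings = new XmlReaderSettings();
        // (1) Schema is read from a string; null means "use the schema's own namespace"
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(xsd)));
        // (2) Request XSD validation (released-API counterpart of XsdValidate)
        settings.ValidationType = ValidationType.Schema;
        // (3) Handler records the message; reading continues after an error
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { errors.Add(e.Message); };

        // (4) Create the reader and read to the end so all nodes are checked
        using (XmlReader rdr = XmlReader.Create(new StringReader(xml), settings))
        {
            while (rdr.Read()) { }
        }
        return errors;
    }
}
```

An empty list means the document passed. Because the handler only records messages, a document with several problems reports them all in a single read.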

Options for Reading XML Data

All the preceding examples that read XML data share two characteristics: data is read a node at a time, and a node's value is extracted as a string using the XmlReader.Value property. This keeps things simple, but ignores the type of the underlying data. For example, XML often contains numeric data or data that is the product of serializing a class. Both cases can be handled more efficiently using other XmlReader methods.

XmlReader has a suite of ReadValueAsxxx methods that can read the contents of a node in its native form. These include ReadValueAsBoolean, ReadValueAsDateTime, ReadValueAsDecimal, ReadValueAsDouble, ReadValueAsInt32, ReadValueAsInt64, and ReadValueAsSingle. Here's an example:

 int age;
 if (reader.Name == "Age")
    age = reader.ReadValueAsInt32();
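In released versions of the .NET Framework (2.0 and later), the typed reads are exposed as ReadContentAsxxx and ReadElementContentAsxxx methods rather than under the ReadValueAsxxx names. The following minimal sketch uses ReadElementContentAsInt to return an element's content as an int. The names TypedReadDemo and ReadAge are hypothetical, and the XML is assumed to arrive as a string.

```csharp
using System;
using System.IO;
using System.Xml;

public static class TypedReadDemo   // hypothetical name, for illustration only
{
    // Returns the content of the first <Age> element as an int.
    public static int ReadAge(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "Age")
                {
                    // Reads the element's text and converts it in one step
                    return reader.ReadElementContentAsInt();
                }
            }
        }
        throw new InvalidOperationException("No <Age> element found.");
    }
}
```

If the element's text cannot be converted, the typed read throws an exception rather than returning a default value, so a caller that expects dirty data should be prepared to catch it.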

XML that corresponds to the public properties or fields of a class can be read directly into an instance of the class with the ReadAsObject method. This fragment reads the XML data shown in Listing 10-1 into an instance of the movies class. Note that the name of the field or property must match an element name in the XML data.

 // Deserialize XML into a movies object
 if (rdr.NodeType == XmlNodeType.Element && rdr.Name == "movies")
 {
    movies m = (movies)rdr.ReadAsObject(typeof(movies));
    // Do something with object
 }

 // XML data is read directly into this class
 public class movies
 {
    public int movie_ID;
    public string movie_Title;
    public string movie_Year;
    private string director;
    public string bestPicture;
    public string movie_Director
    {
       set { director = value; }
       get { return (director); }
    }
 }
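ReadAsObject is not part of the XmlReader class in released versions of the .NET Framework. A comparable result can be obtained with the XmlSerializer class, whose Deserialize method accepts an XmlReader positioned on the element to be read. The sketch below swaps in that technique; it is not the ReadAsObject approach itself. The names DeserializeDemo and ReadFirstMovie are hypothetical, and the XML is assumed to arrive as a string.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Same shape as the movies class above; member names match element names
public class movies
{
    public int movie_ID;
    public string movie_Title;
    public string movie_Year;
    public string bestPicture;
    private string director;
    public string movie_Director
    {
        get { return director; }
        set { director = value; }
    }
}

public static class DeserializeDemo   // hypothetical name, for illustration only
{
    // Returns the first <movies> element as an object, or null if none exists.
    public static movies ReadFirstMovie(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(movies));
        using (XmlReader rdr = XmlReader.Create(new StringReader(xml)))
        {
            while (rdr.Read())
            {
                if (rdr.NodeType == XmlNodeType.Element && rdr.Name == "movies")
                {
                    // Deserialize consumes the element the reader is positioned on
                    return (movies)serializer.Deserialize(rdr);
                }
            }
        }
        return null;
    }
}
```

XmlSerializer populates public fields and public read/write properties, which is why the private director field is filled through the movie_Director property.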
