Reading and Writing Streamed XML

Chapter 11 - Manipulating XML

by Simon Robinson et al.
Wrox Press 2002

Now that we have seen how things can be done today, let's take a look at what .NET will allow us to do. We start by looking at how to read and write XML.

The XmlReader and XmlWriter classes will feel familiar to anyone who has ever used SAX. XmlReader-based classes provide a very fast, forward-only, read-only cursor that streams the XML data for processing. Since it is a streaming model, the memory requirements are not very demanding. However, you don't have the navigation flexibility and the read/write capabilities that would be available from a DOM-based model. XmlWriter-based classes produce an XML document that conforms to the W3C's XML 1.0 and Namespaces in XML recommendations.

XmlReader and XmlWriter are both abstract classes. The graphic below shows what classes are derived from XmlReader and XmlWriter:

XmlTextReader and XmlTextWriter work with either a stream-based object or TextReader/TextWriter objects from the System.IO namespace. XmlNodeReader uses an XmlNode as its source instead of a stream. XmlValidatingReader adds validation against a DTD or a schema. We'll look at these in a little more detail later in the chapter.

Using the XmlTextReader Class

Again, XmlTextReader is a lot like SAX. One of the biggest differences, however, is that while SAX is a push type of model (that is, it pushes data out to the application, and the developer has to be ready to accept it), the XmlTextReader has a pull model, where data is pulled in to the application requesting it. This gives an easier and more intuitive programming model. Another advantage to this is that a pull model can be selective about the data that is sent to the application: if you don't want all of the data, then you don't need to process it. In a push model, all of the XML data has to be processed by the application whether it is needed or not.
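To make the selective nature of the pull model concrete, here is a minimal sketch. The class name PullModelSketch, the method TitlesOnly(), and the element names are our own assumptions for illustration; only the XmlTextReader members come from the framework. The reader's Skip() method jumps over a whole subtree, so the application never looks at the children of an element it does not care about:

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class PullModelSketch
{
    // Hypothetical helper: returns the text of every <title> element,
    // separated by ", ", without examining anything inside <author>.
    public static string TitlesOnly(string xml)
    {
        StringBuilder titles = new StringBuilder();
        XmlTextReader tr = new XmlTextReader(new StringReader(xml));
        try
        {
            while (!tr.EOF)
            {
                if (tr.NodeType == XmlNodeType.Element && tr.Name == "author")
                {
                    // Skip() moves past the whole <author> subtree in one call
                    tr.Skip();
                }
                else if (tr.NodeType == XmlNodeType.Element && tr.Name == "title")
                {
                    if (titles.Length > 0) titles.Append(", ");
                    // ReadElementString() returns the text and moves past </title>
                    titles.Append(tr.ReadElementString());
                }
                else
                {
                    // Anything else: just pull the next node
                    tr.Read();
                }
            }
        }
        finally
        {
            tr.Close();
        }
        return titles.ToString();
    }
}
```

Because the application drives the loop, it decides at each node whether to read on, skip ahead, or stop altogether.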

Let's take a look at a very simple example of reading XML data, and then we can take a closer look at the XmlTextReader class. You'll find the code in the XmlReaderSample1 folder. Instead of using the namespace MSXML2 as in the previous example, we will now be using the following:

   using System.Xml;

We also need to remove the following line from the module-level code:

 private DOMDocument40 doc;

This is what our button click event handler looks like now:

   protected void button1_Click(object sender, System.EventArgs e)
   {
      //Modify this path to find books.xml
      //(backslashes must be escaped inside a regular string literal)
      string fileName = "..\\..\\..\\books.xml";
      //Create the new TextReader Object
      XmlTextReader tr = new XmlTextReader(fileName);
      //Read in a node at a time
      while(tr.Read())
      {
         if(tr.NodeType == XmlNodeType.Text)
            listBox1.Items.Add(tr.Value);
      }
      //Release the file when we are done
      tr.Close();
   }

This is XmlTextReader at its simplest. First we create a string object with the name of the XML file. We then create a new XmlTextReader, passing in the fileName string. XmlTextReader has thirteen different constructor overloads. We can pass in various combinations of strings (filenames and URLs), streams, and NameTables (when an element or attribute name occurs several times, it can be stored in a NameTable, which allows for faster comparisons).
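As a rough sketch of why a NameTable allows faster comparisons (the class NameTableSketch and the method CountTitles() are hypothetical names of our own), we can add a name to the table up front and then compare object references instead of comparing strings character by character:

```csharp
using System.IO;
using System.Xml;

public static class NameTableSketch
{
    // Hypothetical helper: counts the <title> elements in the given XML.
    public static int CountTitles(string xml)
    {
        NameTable nt = new NameTable();
        // Add() returns the single shared (atomized) string for "title"
        object title = nt.Add("title");

        // This overload takes a TextReader and the name table to use
        XmlTextReader tr = new XmlTextReader(new StringReader(xml), nt);
        int count = 0;
        try
        {
            while (tr.Read())
            {
                // The reader takes element names from the same table, so a
                // reference comparison is enough to recognize "title"
                if (tr.NodeType == XmlNodeType.Element &&
                    object.ReferenceEquals(tr.LocalName, title))
                {
                    count++;
                }
            }
        }
        finally
        {
            tr.Close();
        }
        return count;
    }
}
```

For a small file the difference is negligible; the technique is more likely to pay off when the same names are tested many thousands of times.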

Just after an XmlTextReader object has been initialized, no node is selected. This is the only time that a node isn't current. When we go into the tr.Read() loop, the first Read() will move us to the first node in the document. This would typically be the XML declaration node. In this sample, as we move to each node we compare tr.NodeType against the XmlNodeType enumeration, and when we find a text node, we add the text value to the listbox. Here is a screenshot after the listbox is loaded:

Read Methods

There are several ways to move through the document. As we just saw, Read() takes us to the next node. We can then check to see if the node has a value (the HasValue property) or, as you will see shortly, if the node has any attributes (the HasAttributes property). We can also use the ReadStartElement() method, which will check to see if the current node is a start element, and then position you on the next node. If you are not on a start element, an XmlException is raised. Calling this method is the same as calling the IsStartElement() method, followed by a Read().

The ReadString() and ReadChars() methods both read in the text data from an element. ReadString() returns a string object containing the data, while ReadChars() reads the data into an array of chars.

ReadElementString() is similar to ReadString(), except that you can optionally pass in the name of an element. If the next content node is not a start tag, or if the name parameter does not match the Name of the current node, then an exception is raised.
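As a compact sketch of how ReadStartElement() and ReadElementString() work together when the document layout is known in advance (the class ReadMethodsSketch, the method ReadBook(), and the fixed book/title/price layout are assumptions made for this illustration):

```csharp
using System.IO;
using System.Xml;

public static class ReadMethodsSketch
{
    // Hypothetical helper: expects <book><title>..</title><price>..</price></book>
    // and returns "title|price". Any other layout raises an XmlException.
    public static string ReadBook(string xml)
    {
        XmlTextReader tr = new XmlTextReader(new StringReader(xml));
        try
        {
            // Checks that the next content node is <book>, then moves past it
            tr.ReadStartElement("book");
            // Each call checks the element name, returns the text,
            // and leaves us on the node after the end tag
            string title = tr.ReadElementString("title");
            string price = tr.ReadElementString("price");
            // Checks that we are on </book> and moves past it
            tr.ReadEndElement();
            return title + "|" + price;
        }
        finally
        {
            tr.Close();
        }
    }
}
```

This style suits documents with a rigid structure; if the elements may arrive in any order, a Read() loop that inspects each node is the safer choice.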

Here is an example of how ReadElementString() can be used (you'll find the code in the XmlReaderSample2 folder). Notice that this example uses a FileStream, so you will need to make sure that you include the System.IO namespace via a using statement.

   protected void button1_Click(object sender, System.EventArgs e)
   {
      //use a filestream to get the data
      FileStream fs = new FileStream("..\\..\\..\\books.xml", FileMode.Open);
      XmlTextReader tr = new XmlTextReader(fs);
      while(!tr.EOF)
      {
         //if we hit an element type, try and load it in the listbox
         if(tr.MoveToContent() == XmlNodeType.Element && tr.Name == "title")
         {
            listBox1.Items.Add(tr.ReadElementString());
         }
         else
         {
            //otherwise move on
            tr.Read();
         }
      }
      //closing the reader also closes the underlying stream
      tr.Close();
   }

In the while loop we use MoveToContent() to find each node of type XmlNodeType.Element with the name title. We use the EOF property of the XmlTextReader as the loop condition. If the node is not of type Element or not named title, the else clause will issue a Read() method to move to the next node. When we find a node that matches the criteria, we add the result of a ReadElementString() to the listbox. This should leave us with just the book titles in the listbox. Notice that we don't have to issue a Read() call after a successful ReadElementString(). This is because ReadElementString() consumes the entire Element, and positions you on the next node.

If you remove && tr.Name=="title" from the if clause, you will now have to catch the XmlException exception when it is thrown. If you look at the data file, you will see that the first element that MoveToContent() will find is the <bookstore> element. Since it is an element, it will pass the check in the if statement. However, since it contains child elements rather than simple text, it will cause ReadElementString() to raise an XmlException. One way to work around this is to put the ReadElementString() call in a function of its own. Then, if the call to ReadElementString() fails inside this function, we can deal with the error and return to the calling function.

Let's do this; we'll call this new method LoadList(), and pass in the XmlTextReader as a parameter. This is what the sample code looks like with these changes (you'll find the code in the XmlReaderSample3 folder):

   protected void button1_Click(object sender, System.EventArgs e)
   {
      //use a filestream to get the data
      FileStream fs = new FileStream("..\\..\\..\\books.xml", FileMode.Open);
      XmlTextReader tr = new XmlTextReader(fs);
      while(!tr.EOF)
      {
         //if we hit an element type, try and load it in the listbox
         if(tr.MoveToContent() == XmlNodeType.Element)
         {
            LoadList(tr);
         }
         else
         {
            //otherwise move on
            tr.Read();
         }
      }
      //closing the reader also closes the underlying stream
      tr.Close();
   }

   private void LoadList(XmlReader reader)
   {
      try
      {
         listBox1.Items.Add(reader.ReadElementString());
      }
      // if an XmlException is raised, ignore it.
      catch(XmlException) {}
   }

This is what you should see when you run this code:

Look familiar? It's the same result that we had before. What we are seeing is that there is more than one way to accomplish the same goal. This is where the flexibility of the classes in the System.Xml namespace starts to become apparent.

Retrieving Attribute Data

As you play with the sample code, you may notice that when the nodes are read in, you don't see any attributes. This is because attributes are not considered part of a document's structure. When you are on an element node, you can check for the existence of attributes, and optionally retrieve the attribute values.

For example, the HasAttributes property will return true if there are any attributes, otherwise false is returned. The AttributeCount property will tell you how many attributes there are, and the GetAttribute() method will get an attribute by name or by index. If you want to iterate through the attributes one at a time, there are also MoveToFirstAttribute() and MoveToNextAttribute() methods.

Here is an example of iterating through the attributes from XmlReaderSample4:

   protected void button1_Click(object sender, System.EventArgs e)
   {
      //set this path to match your data path structure
      string fileName = "..\\..\\..\\books.xml";
      //Create the new TextReader Object
      XmlTextReader tr = new XmlTextReader(fileName);
      //Read in node at a time
      while(tr.Read())
      {
         //check to see if it's a NodeType element
         if(tr.NodeType == XmlNodeType.Element)
         {
            //if it's an element, then let's look at the attributes.
            for(int i = 0; i < tr.AttributeCount; i++)
            {
               listBox1.Items.Add(tr.GetAttribute(i));
            }
         }
      }
      //Release the file when we are done
      tr.Close();
   }

This time we are looking for element nodes. When we find one, we loop through all of the attributes, and using the GetAttribute() method, we load the value of the attribute into the listbox. In this example those attributes would be genre, publicationdate, and ISBN.
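The same walk can be done with MoveToFirstAttribute() and MoveToNextAttribute(), which has the advantage that the attribute name is available as well as its value. The following is only a sketch; the class AttributeWalkSketch and the method ListAttributes() are hypothetical names chosen for this illustration:

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class AttributeWalkSketch
{
    // Hypothetical helper: returns every attribute as "name=value",
    // separated by semicolons, in document order.
    public static string ListAttributes(string xml)
    {
        StringBuilder sb = new StringBuilder();
        XmlTextReader tr = new XmlTextReader(new StringReader(xml));
        try
        {
            while (tr.Read())
            {
                if (tr.NodeType == XmlNodeType.Element && tr.MoveToFirstAttribute())
                {
                    do
                    {
                        // While positioned on an attribute, Name and Value
                        // describe that attribute rather than the element
                        if (sb.Length > 0) sb.Append(";");
                        sb.Append(tr.Name).Append("=").Append(tr.Value);
                    }
                    while (tr.MoveToNextAttribute());

                    // Step back to the element that owns the attributes
                    tr.MoveToElement();
                }
            }
        }
        finally
        {
            tr.Close();
        }
        return sb.ToString();
    }
}
```

MoveToFirstAttribute() returns false when the element has no attributes, so no separate HasAttributes check is needed here.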

Using the XmlValidatingReader Class

If you want to validate an XML document, you'll need to use the XmlValidatingReader class. It contains the same functionality as XmlTextReader (both classes extend XmlReader), but XmlValidatingReader adds a ValidationType property, a Schemas property, and a SchemaType property.

You set the ValidationType property to the type of validation that you want to do. The valid values for this property are:

Property Value	Description
Auto	If a DTD is declared in a <!DOCTYPE...> declaration, that DTD will be loaded and processed. Default attributes and general entities defined in the DTD will be made available. If an XSD schemaLocation attribute is found, the XSD is loaded and processed, and will return any default attributes defined in the schema. If a namespace with the MSXML x-schema: prefix is found, it will load and process the XDR schema and return any default attributes defined.
DTD	Validate according to DTD rules.
Schema	Validate according to XSD schema.
XDR	Validate according to XDR schema.
None	No validation is performed.

Once this property is set, a ValidationEventHandler will need to be assigned. This is an event that gets raised when a validation error occurs. You can then react to the error in any way you see fit.

Let's look at an example of how this works. First we will add an XDR (XML-Data Reduced) schema namespace to our books.xml file, and rename this file booksVal.xml. It now looks like this:

   <?xml version='1.0'?>
   <!-- This file represents a fragment of a book store inventory database -->
   <bookstore xmlns="x-schema:books.xdr">
      <book genre="autobiography" publicationdate="1981" ISBN="1-861003-11-0">
         <title>The Autobiography of Benjamin Franklin</title>
         <author>
            <first-name>Benjamin</first-name>
            <last-name>Franklin</last-name>
         </author>
         <price>8.99</price>
      </book>
      ...
   </bookstore>

Notice that the bookstore element now has the attribute xmlns="x-schema:books.xdr". This will point to the following XDR schema, called books.xdr:

   <?xml version="1.0"?>
   <Schema xmlns="urn:schemas-microsoft-com:xml-data"
           xmlns:dt="urn:schemas-microsoft-com:datatypes">
      <ElementType name="first-name" content="textOnly"/>
      <ElementType name="last-name" content="textOnly"/>
      <ElementType name="name" content="textOnly"/>
      <ElementType name="price" content="textOnly" dt:type="fixed.14.4"/>
      <ElementType name="author" content="eltOnly" order="one">
         <group order="seq">
            <element type="name"/>
         </group>
         <group order="seq">
            <element type="first-name"/>
            <element type="last-name"/>
         </group>
      </ElementType>
      <ElementType name="title" content="textOnly"/>
      <AttributeType name="genre" dt:type="string"/>
      <ElementType name="book" content="eltOnly">
         <attribute type="genre" required="yes"/>
         <element type="title"/>
         <element type="author"/>
         <element type="price"/>
      </ElementType>
      <ElementType name="bookstore" content="eltOnly">
         <element type="book"/>
      </ElementType>
   </Schema>

Now everything looks good except for the fact that we have a couple of attributes in the XML file that are not defined in the schema (publicationdate and ISBN from the book element). We have added these in order to show that validation is really taking place by raising a validation error. We can use the following code (from XmlReaderSample5) to verify this.

First, you will also need to add:

   using System.Xml.Schema;

to your class. Then add the following to the button event handler:

   protected void button1_Click(object sender, System.EventArgs e)
   {
      //change this to match your path structure.
      string fileName = "..\\..\\..\\booksVal.xml";
      XmlTextReader tr = new XmlTextReader(fileName);
      XmlValidatingReader trv = new XmlValidatingReader(tr);
      //Set validation type
      trv.ValidationType = ValidationType.XDR;
      //Add in the Validation eventhandler
      trv.ValidationEventHandler +=
         new ValidationEventHandler(this.ValidationEvent);
      //Read in node at a time
      while(trv.Read())
      {
         if(trv.NodeType == XmlNodeType.Text)
            listBox1.Items.Add(trv.Value);
      }
      //Release the file when we are done
      trv.Close();
   }

   public void ValidationEvent(object sender, ValidationEventArgs args)
   {
      MessageBox.Show(args.Message);
   }

Here we create an XmlTextReader to pass to the XmlValidatingReader. Once the XmlValidatingReader (trv) is created, we can use it in much the same way that we used XmlTextReader in the previous examples. The differences are that we specify the ValidationType, and add a ValidationEventHandler. You can handle the validation error any way that you see fit; in this example we are showing a MessageBox with the error. This is what the MessageBox looks like when the ValidationEvent is raised:

Unlike some parsers, once a validation error occurs, XmlValidatingReader will keep on reading. It's up to you to stop the reading and deal with the errors accordingly if you believe that the error is serious enough.
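One way to stop reading at the first error is to have the event handler set a flag that the read loop checks. The sketch below is a simplified illustration rather than part of the chapter's samples: the class StopOnErrorSketch and its members are hypothetical names, and it validates against an inline DTD instead of an XDR schema so that it does not depend on any external file. Later versions of the framework mark XmlValidatingReader as obsolete, hence the pragma.

```csharp
#pragma warning disable 618 // XmlValidatingReader is marked obsolete in later frameworks
using System.IO;
using System.Xml;
using System.Xml.Schema;

public class StopOnErrorSketch
{
    private bool failed;

    // Raised by the reader for each validation problem it finds
    private void OnValidationError(object sender, ValidationEventArgs args)
    {
        failed = true;
    }

    // Hypothetical helper: returns true if the whole document was read
    // without a validation error, false if we stopped early.
    public bool IsValid(string xml)
    {
        failed = false;
        XmlTextReader tr = new XmlTextReader(new StringReader(xml));
        XmlValidatingReader trv = new XmlValidatingReader(tr);
        trv.ValidationType = ValidationType.DTD;
        trv.ValidationEventHandler += new ValidationEventHandler(OnValidationError);
        try
        {
            // Read() keeps returning true after an error,
            // so we test our own flag as well
            while (!failed && trv.Read())
            {
            }
        }
        finally
        {
            trv.Close();
        }
        return !failed;
    }
}
```

Whether to stop or carry on is a design decision: stopping early saves work on documents that will be rejected anyway, while reading to the end lets you report every problem in one pass.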

Using the Schemas Property

The Schemas property of XmlValidatingReader holds an XmlSchemaCollection, which is found in the System.Xml.Schema namespace. This collection holds pre-loaded XSD and XDR schemas. This allows for very fast validation, especially if you need to validate several documents, since the schema will not have to be reloaded on each validation. In order to utilize this performance gain, you create an XmlSchemaCollection object. The Add() method, used to populate an XmlSchemaCollection, has four overloads. You can pass in an XmlSchema-based object, an XmlSchemaCollection-based object, a string with the namespace along with a string with the URI of the schema file, and finally a string with the namespace and an XmlReader-based object that contains the schema.
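The following sketch shows one way this caching might be arranged. The class SchemaCacheSketch and its members are hypothetical, and the schema is supplied as a string purely to keep the example self-contained; in practice it would more likely come from a file. The schema is loaded once in the constructor, using the overload that takes a namespace and an XmlReader, and is then handed to each new XmlValidatingReader through the overload that takes an XmlSchemaCollection. Both classes are marked obsolete in later versions of the framework, hence the pragma.

```csharp
#pragma warning disable 618 // these classes are marked obsolete in later frameworks
using System.IO;
using System.Xml;
using System.Xml.Schema;

public class SchemaCacheSketch
{
    // Loaded once, then reused for every document we validate
    private readonly XmlSchemaCollection cache = new XmlSchemaCollection();
    private int errors;

    // targetNamespace must match the targetNamespace declared in the XSD
    public SchemaCacheSketch(string targetNamespace, string xsd)
    {
        cache.Add(targetNamespace, new XmlTextReader(new StringReader(xsd)));
    }

    private void OnValidationError(object sender, ValidationEventArgs args)
    {
        errors++;
    }

    // Hypothetical helper: validates one document against the cached
    // schema and returns the number of validation errors raised.
    public int CountErrors(string xml)
    {
        errors = 0;
        XmlTextReader tr = new XmlTextReader(new StringReader(xml));
        XmlValidatingReader trv = new XmlValidatingReader(tr);
        trv.ValidationType = ValidationType.Schema;
        // Hand over the pre-loaded schemas instead of reloading them
        trv.Schemas.Add(cache);
        trv.ValidationEventHandler += new ValidationEventHandler(OnValidationError);
        try
        {
            while (trv.Read())
            {
            }
        }
        finally
        {
            trv.Close();
        }
        return errors;
    }
}
```

A single SchemaCacheSketch instance can then be used to check any number of documents, and the schema is parsed only once.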

Using the XmlTextWriter Class

The XmlTextWriter class allows you to write out XML to a stream, a file, or a TextWriter object. Like XmlTextReader, it does so in a forward-only, non-cached manner. XmlTextWriter is highly configurable, allowing you to specify such things as whether or not to indent, the amount to indent, what quote character to use in attribute values, and whether namespaces are supported.

Let's look at a simple example to see how the XmlTextWriter class can be used. This can be found in the XmlWriterSample1 folder:

   private void button1_Click(object sender, System.EventArgs e)
   {
      // change to match your path structure
      string fileName = "..\\..\\..\\booknew.xml";
      // create the XmlTextWriter
      XmlTextWriter tw = new XmlTextWriter(fileName, null);
      // set the formatting to indented
      tw.Formatting = Formatting.Indented;
      tw.WriteStartDocument();
      // Start creating elements and attributes
      tw.WriteStartElement("book");
      tw.WriteAttributeString("genre", "Mystery");
      tw.WriteAttributeString("publicationdate", "2001");
      tw.WriteAttributeString("ISBN", "123456789");
      tw.WriteElementString("title", "The Case of the Missing Cookie");
      tw.WriteStartElement("author");
      tw.WriteElementString("name", "Cookie Monster");
      tw.WriteEndElement();
      tw.WriteElementString("price", "9.99");
      tw.WriteEndElement();
      tw.WriteEndDocument();
      //clean up
      tw.Flush();
      tw.Close();
   }

Here we are writing to a new XML file called booknew.xml, and adding the data for a new book. Note that XmlTextWriter will overwrite an existing file with a new one. We will look at inserting a new element or node into an existing document later in the chapter. We instantiate the XmlTextWriter by passing in a string with the filename and path, together with an encoding (passing null writes the file out as UTF-8). We could also pass in a stream-based object with an encoding, or a TextWriter-based object. The next thing that we do is set the Formatting property to Formatting.Indented. Once this is set, child nodes are automatically indented from the parent. WriteStartDocument() will add the document declaration. Now we start writing data. First comes the book element, and then we add the genre, publicationdate, and ISBN attributes. Now we write the title, author, and price elements. Notice that the author element has a child element name.

When we click on the button, we'll produce the booknew.xml file, which looks like this:

   <?xml version="1.0"?>
   <book genre="Mystery" publicationdate="2001" ISBN="123456789">
     <title>The Case of the Missing Cookie</title>
     <author>
       <name>Cookie Monster</name>
     </author>
     <price>9.99</price>
   </book>

The nesting of elements is controlled by paying attention to when you start and finish writing elements and attributes. You can see this when we add the name child element to the author element. Note how the WriteStartElement() and WriteEndElement() method calls are arranged, and how that arrangement produces the nested elements in the output file.
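The formatting options mentioned at the start of this section can be tried out in the same way. In this sketch the class WriterOptionsSketch and the method WriteBook() are hypothetical names, and we write to a StringWriter rather than a file so that the result can be inspected directly; Formatting, Indentation, IndentChar, and QuoteChar are properties of XmlTextWriter:

```csharp
using System.IO;
using System.Xml;

public static class WriterOptionsSketch
{
    // Hypothetical helper: writes a small <book> element using
    // non-default formatting settings and returns the XML as a string.
    public static string WriteBook()
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter tw = new XmlTextWriter(sw);

        tw.Formatting = Formatting.Indented; // turn indenting on
        tw.Indentation = 4;                  // four indent characters per level
        tw.IndentChar = ' ';                 // indent with spaces
        tw.QuoteChar = '\'';                 // single quotes around attribute values

        tw.WriteStartElement("book");
        tw.WriteAttributeString("genre", "Mystery");
        tw.WriteElementString("title", "The Case of the Missing Cookie");
        tw.WriteEndElement();
        tw.Close();

        return sw.ToString();
    }
}
```

The returned string has the genre attribute in single quotes and the title element indented by four spaces beneath book.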

To go along with the WriteElementString() and WriteAttributeString() methods, there are several other specialized write methods. WriteCData() will output a CDATA section (<![CDATA[...]]>), writing out the text it takes as a parameter. WriteComment() writes out a comment in proper XML format. WriteChars() writes out the contents of a char buffer. This works in a similar way to the ReadChars() method that we looked at earlier; they both use the same type of parameters. WriteChars() needs a buffer (an array of characters), the starting position for writing (an integer), and the number of characters to write (an integer).
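Here is a short sketch of these three methods in use. The class SpecialWritesSketch, the method WriteNote(), and the element names are our own inventions for the illustration, and once again we write to a StringWriter so that no file is needed:

```csharp
using System.IO;
using System.Xml;

public static class SpecialWritesSketch
{
    // Hypothetical helper: writes a comment, a CDATA section, and part
    // of a char buffer, and returns the resulting XML as a string.
    public static string WriteNote()
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter tw = new XmlTextWriter(sw);

        tw.WriteStartElement("note");
        tw.WriteComment("generated sample");

        // The CDATA text is written as is, so < and & need no escaping
        tw.WriteStartElement("raw");
        tw.WriteCData("1 < 2 && 3 > 2");
        tw.WriteEndElement();

        // WriteChars(buffer, index, count): start at position 7, write 3 chars
        char[] buffer = "Hello, XML".ToCharArray();
        tw.WriteStartElement("text");
        tw.WriteChars(buffer, 7, 3);
        tw.WriteEndElement();

        tw.WriteEndElement();
        tw.Close();

        return sw.ToString();
    }
}
```

Only the characters X, M, and L from the buffer end up in the text element, since the write starts at index 7 and covers three characters.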

Reading and writing XML using the XmlReader and XmlWriter-based classes is surprisingly flexible and simple to use. Next, we will look at how the DOM is implemented in the System.Xml namespace, through the XmlDocument and XmlNode classes.