Processing XML Documents in Memory | Microsoft Visual J# .NET (Core Reference) (Pro-Developer)

As mentioned earlier, it is not always convenient to process information in one pass. For some types of applications, it is more convenient to keep an in-memory copy of the data being processed. To process an XML document in this way, you could use single-pass processing, converting the document's contents into internal, application-specific objects. You could manipulate these objects and then serialize them out as XML again once they've been processed. However, you might find it easier just to keep the data in an XML format in memory so you can manipulate it in situ and write it out again quite easily.

In-Memory Processing

The DOM defined by the W3C provides a model of an XML document from a programmer's point of view. Interfaces are provided for searching, navigating, retrieving content, and creating content. DOM implementations provide this functionality by creating an in-memory replica of the document on which these operations can be performed. The advantages of using an in-memory replica include the ability to access any part of the document at any time and the ability to easily add and remove parts of a document. The main disadvantage of in-memory processing is that it can consume large amounts of memory when processing large documents.

DOM represents the document as a tree, with text content and attributes forming the leaf nodes and the hierarchy of elements providing the branches, as shown in Figure 5-4.

Figure 5-4. DOM representation of an XML document

To process a DOM tree, you start at a particular point in the tree and navigate or search relative to that point. From this position, you can retrieve individual objects representing elements, attributes, and content or collections of such objects. The DOM model specifies particular interfaces for different parts of an XML document, such as elements and text content, some of which are shown in Figure 5-5. You can use the DOM interfaces to process objects you have retrieved from the DOM tree.

Figure 5-5. Mapping different parts of an XML document to different DOM interface types

The general principle of DOM is that everything is a node. The DOM Node interface exposes methods to navigate up (to the parent), down (to children), and sideways (to siblings) through the DOM tree. Because all nodes in the DOM tree support the Node interface, this provides a common navigational model. Individual nodes can be of particular types, such as elements or attributes. Additional type-specific interfaces are provided to make it easier to retrieve information from the node, such as the name and content of an element.

The .NET Framework Class Library contains a set of classes that represent the different parts of a DOM tree. The Microsoft DOM implementation extends the functionality offered by the W3C interfaces by supplying convenient helper functionality and extended capabilities.

Loading XML into the XmlDocument Class

The class System.Xml.XmlDocument provides access to an XML document through DOM interfaces. You can load an XML document into an instance of XmlDocument and then process it. The code to instantiate and load a document is straightforward:

    XmlDocument doc = new XmlDocument();
    doc.Load("CakeCatalog.xml");

The Load method has various overloaded forms that can accept XML from the following sources:

A URL or filename passed as a string
An instance of System.IO.Stream from which the XML document can be read
A TextReader representing the XML document
Any subclass of XmlReader

Alternatively, if the XML document is sent to an application as a String, you can use the LoadXml method to parse it. The Load and LoadXml methods will throw exceptions if the document being loaded is not well-formed.
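The following is a minimal sketch of the same idea (parsing XML held in a string and treating a parse failure as "not well-formed"). It is written in standard Java against the W3C DOM API (javax.xml.parsers and org.w3c.dom), not the System.Xml classes used in this chapter, so the class and method names differ from XmlDocument.LoadXml. The class name LoadFromStringSketch and the sample documents are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; not part of the book's sample projects.
final class LoadFromStringSketch {

    // Parses XML held in a String. Returns null if the text is not well-formed.
    static Document tryParse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser reports a document that is not well-formed as a SAXException.
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document good = tryParse("<cake><name>Sponge</name></cake>");
        Document bad = tryParse("<cake><name>Sponge</cake>"); // mismatched end tag
        System.out.println("good parsed: " + (good != null));
        System.out.println("bad parsed: " + (bad != null));
    }
}
```

Returning null keeps the sketch short; in an application you would more likely let the exception propagate or log its line and column information.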

If the XML document must be validated as it is read in, you must load it through an XmlValidatingReader, because the XmlDocument class itself does not have any validation capabilities. You should instantiate an XmlValidatingReader, passing it the document to process, setting its schema cache, and supplying a validation event handler as described earlier. You should then pass this XmlValidatingReader to the Load method of the XmlDocument. Validation failures will be flagged to the event handler you defined.

As with validation, the XmlDocument class does not provide control over the handling of entities. Entities will be resolved when you load a document, but they will not necessarily be expanded. Control over entity expansion is governed by the input to the Load method. By default, entities will be represented in the DOM tree as instances of XmlEntityReference. If you use an XmlValidatingReader to load the XML document into the DOM, you can set the EntityHandling flag to ExpandEntities. This setting will replace all entities in the document with their values as the tree is created.

Obtaining Information from a DOM Document

After the document is loaded, you can extract information from it by locating particular elements within the document and retrieving their text content or attributes. You can locate elements in two principal ways:

If you know the structure of the document, you can navigate through the DOM tree by moving up, down, or sideways.
You can search the document for particular elements. Several search mechanisms are available. The standard DOM mechanisms are covered in this chapter along with a more powerful search method based on the XPathNavigator class. XPathNavigator uses XPath syntax, which is described in Chapter 6.

Naturally, there's nothing to stop you from combining these two models by searching for a particular part of the tree and then navigating manually within that tree fragment.

Retrieving Information from a DOM Element

To navigate the DOM tree, you must start from somewhere, and the logical place is the root element of the document. Remember that the root element is the outermost element of your document. You can obtain the root element (also known as the document element) directly from the XmlDocument:

    XmlDocument doc = new XmlDocument();
    doc.Load("CakeCatalog.xml");
    XmlElement root = doc.get_DocumentElement();

All elements are represented by instances of the XmlElement class. You can retrieve name, namespace, attribute and content information from the element. The XmlElement class has properties for the name, local name, and namespace URI:

    private void listElement(XmlElement element)
    {
        Console.WriteLine("Element name: " + element.get_LocalName());
        Console.WriteLine("Element namespace: " + element.get_NamespaceURI());
    }

You can test to see whether the element has attributes and, if so, you can retrieve them as an XmlAttributeCollection using the Attributes property. The Count property of the collection gives you the number of attributes contained in the collection. You can then retrieve individual attributes using the indexed ItemOf property, which can be accessed using the get_ItemOf method and which takes an int index value. Each attribute retrieved is of type XmlAttribute, which has properties reflecting the name, local name, namespace URI, and value of the attribute. Here is an example:

    if (element.get_HasAttributes())
    {
        XmlAttributeCollection attributes = element.get_Attributes();
        Console.WriteLine("Element has: " + attributes.get_Count() + " attributes");
        for (int i = 0; i < attributes.get_Count(); i++)
        {
            XmlAttribute attribute = attributes.get_ItemOf(i);
            Console.WriteLine("\tAttribute name: " + attribute.get_LocalName());
            Console.WriteLine("\tAttribute namespace: " + attribute.get_NamespaceURI());
            Console.WriteLine("\tAttribute value: " + attribute.get_Value());
        }
    }

Alternatively, if you know the name of the attribute whose value you want to retrieve, you can use the XmlElement class's GetAttribute method, passing in the name and namespace information for the attribute. This method returns the value as a string. You can query the XmlElement class's HasAttribute method to determine whether the element has an attribute of a particular name.
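As a rough illustration of looking up an attribute by name, the sketch below uses the standard Java W3C DOM API (org.w3c.dom.Element) as a stand-in for XmlElement; hasAttribute and getAttribute play the roles of HasAttribute and GetAttribute. The cake element, its size attribute, and the helper names are invented for the example. In both APIs, asking for an attribute that is absent yields an empty string rather than null, which is why the presence check is worth making first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the book's sample projects.
final class AttributeLookupSketch {

    // Parses a small XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the value of the (made-up) "size" attribute, or the fallback if it is absent.
    static String sizeOrDefault(Element element, String fallback) {
        // Check for presence first: getAttribute returns "" for a missing attribute.
        return element.hasAttribute("size") ? element.getAttribute("size") : fallback;
    }

    public static void main(String[] args) {
        System.out.println(sizeOrDefault(parseRoot("<cake size=\"large\"/>"), "medium"));
        System.out.println(sizeOrDefault(parseRoot("<cake/>"), "medium"));
    }
}
```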

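You can retrieve the element's content in several ways, but you should first use the IsEmpty property to test whether it has any content at all. (IsEmpty is true only for an element written in the short form, such as <item/>; an element written as <item></item> reports false even though it has no children.) The simplest way of retrieving the content is then to examine the InnerXml property, as shown here: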

    if (!element.get_IsEmpty())
    {
        Console.WriteLine("Element has content:\n**********************");
        Console.Write(element.get_InnerXml());
        Console.WriteLine("\n**********************");
    }

Figure 5-6 shows a listing of the information about the root element of SchemaCakeCatalog.xml. One interesting point to note is that the inner XML has had the appropriate namespace attribute set on it. (It inherits this from the root element.)

Figure 5-6. Listing the root element of the SchemaCakeCatalog.xml file

If you're handling an element containing text, you can use the get_InnerText method to retrieve just the text content. Alternatively, given that any text content of an element is held as a child node of that element, you can navigate downward and extract the content directly from the child node. To retrieve information from an XmlText node, you can use its Value property or Data property (which return the same value for an XmlText node).
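The difference between asking an element for all of its text and reading a single child text node can be sketched as follows. This uses the standard Java W3C DOM API rather than System.Xml: getTextContent is roughly comparable to InnerText, and getNodeValue on a text node is comparable to the Value property of XmlText. The sample elements and helper names are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the book's sample projects.
final class TextContentSketch {

    // Parses a small XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // All text beneath the element, concatenated in document order.
    static String allText(Element element) {
        return element.getTextContent();
    }

    // Value of the first child only, if that child is a text node; otherwise null.
    static String firstTextChild(Element element) {
        Node child = element.getFirstChild();
        if (child != null && child.getNodeType() == Node.TEXT_NODE) {
            return child.getNodeValue();
        }
        return null;
    }

    public static void main(String[] args) {
        Element simple = parseRoot("<name>Sponge</name>");
        Element mixed = parseRoot("<desc>Rich <b>dark</b> cake</desc>");
        System.out.println(allText(simple) + " | " + firstTextChild(simple));
        System.out.println(allText(mixed) + " | " + firstTextChild(mixed));
    }
}
```

For an element that holds only text, the two approaches give the same answer. For mixed content, the whole-element query gathers text from nested elements too, whereas the child node holds only its own fragment.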

Navigating the DOM Tree

You can use methods inherited from the XmlNode class to navigate the DOM tree relative to any given node. Unlike the XmlReader, the DOM does not retain the concept of a current node; you can hold references to as many nodes as you like and navigate relative to any node to which you have a reference. For example, consider the following method that searches for elements in a document:

    private void lookForElements(XmlElement element)
    {
        listElement(element);
        if (element.get_HasChildNodes())
        {
            for (XmlNode child = element.get_FirstChild();
                 child != null;
                 child = child.get_NextSibling())
            {
                if (child.get_NodeType() == XmlNodeType.Element)
                {
                    lookForElements((XmlElement)child);
                }
            }
        }
    }

The listElement method shown earlier displays the contents of an element to the console. Once the contents of the element passed in have been displayed, the HasChildNodes property is used to check whether the element has children. If so, the FirstChild of the element is retrieved and its NodeType is tested to see whether it is XmlNodeType.Element. If the child is an element, a recursive call is made to this method, and all the elements below this child are listed. Once the first child has been processed, the NextSibling is obtained and processed in the same way. Siblings continue to be processed until the NextSibling property is null. All of the DOM processing described so far is illustrated in the DOMCatalogReader.jsl file in the DOMCatalogReader sample project.

You have seen that you can navigate downward to children and sideways to the next sibling. You can also navigate to the previous sibling using the PreviousSibling property. To iterate through a list of the current node's children, you can invoke the Microsoft DOM extension method GetEnumerator. This method returns a System.Collections.IEnumerator interface reference that will iterate through that XmlNode class's children. An alternative implementation of the lookForElements method that uses an enumerator can be found in the sample file IteratorDOMCatalogReader.jsl.
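A recursive walk that iterates over each node's child collection, instead of chaining FirstChild and NextSibling, might look something like the sketch below. It is written with the standard Java W3C DOM API, where the child collection is an index-based NodeList rather than an IEnumerator, so it is only an analogue of the enumerator-based sample and not a copy of it. The class name and catalog document are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the book's sample projects.
final class ChildIterationSketch {

    // Returns the names of all elements in the document, in depth-first order.
    static List<String> elementNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> names = new ArrayList<>();
            collect(root, names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Records this element, then recurses into each child that is itself an element.
    private static void collect(Element element, List<String> names) {
        names.add(element.getNodeName());
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Text, comment, and other node types are skipped.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, names);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<catalog><cake><name>Sponge</name><option/></cake><cake/></catalog>";
        System.out.println(elementNames(xml));
    }
}
```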

To move upward to a node's parent, you can query the ParentNode property. If you want to go all the way to the top of the document, you can use the OwnerDocument property, which contains the XmlDocument containing this node.
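Upward navigation can be sketched in the same way. The example below, again in standard Java with the W3C DOM API as a stand-in for System.Xml, follows parent links from an element to build a simple path string and then fetches the owning document directly; getParentNode and getOwnerDocument correspond to the ParentNode and OwnerDocument properties. The pathTo helper and the sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the book's sample projects.
final class ParentWalkSketch {

    // Parses a small XML string into a DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds a path such as /catalog/cake/name by following parent links upward.
    static String pathTo(Node start) {
        StringBuilder path = new StringBuilder();
        // The parent of the root element is the document node, which ends the loop.
        for (Node n = start; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            path.insert(0, "/" + n.getNodeName());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<catalog><cake><name>Sponge</name></cake></catalog>");
        Node name = doc.getElementsByTagName("name").item(0);
        System.out.println(pathTo(name));
        System.out.println(name.getOwnerDocument() == doc);
    }
}
```

Walking parent links costs one step per level of nesting, whereas the owner document is available in a single call from any node.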

Searching the DOM Tree

An alternative to walking through the whole document is to search the DOM tree. The simplest way to perform this search is to use the GetElementsByTagName method provided by the XmlElement and XmlDocument classes. This method expects the name and namespace information for the required element type, and it returns an XmlNodeList that contains matching elements. Here's a form of the lookForElements method (renamed searchForElements) that uses the GetElementsByTagName method:

    private void searchForElements(XmlElement start, String name,
        String namespace)
    {
        XmlNodeList nodes = start.GetElementsByTagName(name, namespace);
        System.Collections.IEnumerator enumerator = nodes.GetEnumerator();
        while (enumerator.MoveNext())
        {
            XmlElement element = (XmlElement)enumerator.get_Current();
            listElement(element);
        }
    }

The searchForElements method is implemented in the sample file GetElementsByTagNameDOMCatalogReader.jsl.

The XML family of standards defines a powerful language for locating one or more nodes within an XML document. This standard is called XPath, and it is heavily used by XSLT to identify the parts of an XML document to be transformed. As a Microsoft extension to the DOM, the XmlNode class allows you to specify an XPath pattern and use it to locate nodes within a part of an XML document. Two methods are available in the XmlNode class: SelectNodes and SelectSingleNode. The first of these returns an XmlNodeList containing all the nodes that match a given pattern. The SelectSingleNode method returns only the first node found that matches the pattern. These methods come in two forms, one of which uses an XmlNamespaceManager to associate prefixes with namespaces. This allows the prefixes to be used safely as part of the XPath string.

The following example, taken from the sample file SelectNodesDOMCatalogReader.jsl, shows how all of the Option elements can be retrieved from SchemaCakeCatalog.xml:

    XmlDocument doc = ...
    XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.get_NameTable());
    nsmgr.AddNamespace("cakes",
        "http://www.fourthcoffee.com/SchemaCakeCatalog.xsd");
    XmlElement root = doc.get_DocumentElement();
    XmlNodeList nodeList = root.SelectNodes("//cakes:Option", nsmgr);
    for (int i = 0; i < nodeList.get_Count(); i++)
    {
        XmlElement element = (XmlElement)nodeList.get_ItemOf(i);
        listElement(element);
    }
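To see why the prefix binding matters, consider the sketch below. It uses the standard Java XPath API (javax.xml.xpath), not SelectNodes, with a NamespaceContext taking the place of XmlNamespaceManager. The namespace URI is the one from the example above. The tiny catalog document, and the assumption that the catalog declares that URI as its default namespace, are made up for illustration. In XPath 1.0 an unprefixed name matches only elements in no namespace, so the unprefixed query finds nothing here.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the book's sample projects.
final class NamespacedXPathSketch {

    static final String CAKES_NS = "http://www.fourthcoffee.com/SchemaCakeCatalog.xsd";

    // Counts the nodes that an XPath expression selects in the given XML text.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-qualified matching
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "cakes" prefix to the catalog namespace for use in expressions.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "cakes".equals(prefix) ? CAKES_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return CAKES_NS.equals(uri) ? "cakes" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return CAKES_NS.equals(uri)
                            ? Collections.singletonList("cakes").iterator()
                            : Collections.<String>emptyList().iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Catalog xmlns=\"" + CAKES_NS + "\"><Option/><Option/></Catalog>";
        System.out.println("//Option       -> " + count(xml, "//Option"));
        System.out.println("//cakes:Option -> " + count(xml, "//cakes:Option"));
    }
}
```

The prefix used in the expression does not have to match any prefix in the document itself; only the namespace URI it is bound to is compared.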

A more powerful (but more complex) strategy is to use an XPathNavigator object to search for nodes within a document. The XmlDocument class, and any other subclass of XmlNode, contains a CreateNavigator method. This method returns an instance of an XPathNavigator that is a read-only representation of the XML tree below the given node. You can then specify an XPath expression that will be used to search the tree by one of the XPathNavigator class's Select methods. These methods return an XPathNodeIterator that can be used to walk through the matching nodes. The detailed use of XPathNavigator is beyond the scope of this book, but Chapter 6 contains more on XPath expressions.

Treating a DOM Fragment as a Stream

When we examined the XmlReader class earlier in the chapter, you saw that one of its subclasses was XmlNodeReader. This class allows you to treat part or all of a DOM tree as a stream. You can pass an XmlNode to the constructor of an XmlNodeReader and use it as an XmlReader from then on.