10.4. Using XPath to Search XML

A significant benefit of representing XML in a tree model as opposed to a data stream is the capability to query and locate the tree's content using XML Path Language (XPath). This technique is similar to using a SQL command on relational data. An XPath expression (query) is created and passed to an engine that evaluates it. The expression is parsed and executed against a data store. The returned value(s) may be a set of nodes or a scalar value.

XPath is a formal query language defined by the XML Path Language 1.0 specification (www.w3.org/TR/xpath). Syntactically, its most commonly used expressions resemble a file system path and may represent either the absolute or relative position of nodes in the tree.

In the .NET Framework, XPath evaluation is exposed through the XPathNavigator abstract class. The navigator is an XPath processor that works on top of any XML data source that implements the IXPathNavigable interface. The most important member of this interface is the CreateNavigator method, which returns an XPathNavigator object. Figure 10-4 shows three classes that implement this interface. Of these, XmlDocument and XmlDataDocument are members of the System.Xml namespace; XPathDocument (as well as the XPathNavigator class) resides in the System.Xml.XPath namespace.

XmlDocument. Implements the W3C Document Object Model (DOM) and supports XPath queries, navigation, and editing.
XmlDataDocument. In addition to the features it inherits from XmlDocument, it provides the capability to map XML data to a DataSet. Any changes to the DataSet are reflected in the XML tree and vice versa.
XPathDocument. This class is optimized to perform XPath queries and represents XML in a tree of read-only nodes that is more streamlined than the DOM.
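
Because every one of these classes implements IXPathNavigable, query code can be written once against the interface. The following is a minimal sketch of that idea; the class name NavigableDemo, the helper FirstLastName, and the inline XML string are assumptions made for illustration and are not part of the chapter's listings.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class NavigableDemo
{
    // Hypothetical inline data; a real application would load a file.
    public const string SampleXml =
        "<films><directors><last_name>Scorsese</last_name></directors></films>";

    // Accepts any data source that implements IXPathNavigable.
    public static string FirstLastName(IXPathNavigable source)
    {
        XPathNavigator nav = source.CreateNavigator();
        XPathNavigator node = nav.SelectSingleNode("/films/directors/last_name");
        return node == null ? null : node.Value;
    }

    public static void Main()
    {
        // Editable DOM tree
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(SampleXml);

        // Read-only tree optimized for queries
        XPathDocument readOnly = new XPathDocument(new StringReader(SampleXml));

        Console.WriteLine(FirstLastName(dom));
        Console.WriteLine(FirstLastName(readOnly));
    }
}
```

The same helper serves both document types because it depends only on CreateNavigator.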

Figure 10-4. XML classes that support XPath navigation

Constructing XPath Queries

Queries can be executed against each of these classes using an XPathNavigator object. XmlDocument and XmlDataDocument also offer the SelectNodes method; XPathDocument is queried only through its navigator. Generic code looks like this:

 // XPATHEXPRESSION is the XPath query applied to the data
 // (1) Return a list of nodes
 XmlDocument doc = new XmlDocument();
 doc.Load("movies.xml");
 XmlNodeList selection = doc.SelectNodes(XPATHEXPRESSION);
 // (2) Create a navigator and execute the query
 XPathNavigator nav = doc.CreateNavigator();
 XPathNodeIterator iterator = nav.Select(XPATHEXPRESSION);

The XPathNodeIterator class encapsulates a list of nodes and provides a way to iterate over the list using its MoveNext method and Current property.
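
A query may return a set of nodes or a single scalar value, and the navigator handles the two cases with different methods. The sketch below assumes a small inline XML string and hypothetical names (NodeSetVersusScalar, Titles, MovieCount): Select returns an iterator over nodes, while Evaluate returns a typed value such as a number.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class NodeSetVersusScalar
{
    // Hypothetical inline data used only for this sketch.
    public const string SampleXml =
        "<films><directors><last_name>Scorsese</last_name>" +
        "<movies><movie_Title>Taxi Driver</movie_Title></movies>" +
        "<movies><movie_Title>Raging Bull</movie_Title></movies>" +
        "</directors></films>";

    public static XPathNavigator CreateNavigator()
    {
        return new XPathDocument(new StringReader(SampleXml)).CreateNavigator();
    }

    // Node-set result: walk the iterator with MoveNext and Current.
    public static List<string> Titles(XPathNavigator nav)
    {
        List<string> titles = new List<string>();
        XPathNodeIterator it = nav.Select("//movie_Title");
        while (it.MoveNext())
            titles.Add(it.Current.Value);
        return titles;
    }

    // Scalar result: count() yields an XPath number, returned as a double.
    public static double MovieCount(XPathNavigator nav)
    {
        return (double)nav.Evaluate("count(//movies)");
    }

    public static void Main()
    {
        XPathNavigator nav = CreateNavigator();
        foreach (string title in Titles(nav))
            Console.WriteLine(title);
        Console.WriteLine(MovieCount(nav));
    }
}
```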

As with regular expressions (refer to Chapter 5, "C# Text Manipulation and File I/O"), an XPath query has its own syntax and operators that must be mastered in order to efficiently query an XML document. To demonstrate some of the fundamental XPath operators, we'll create queries against the data in Listing 10-10.

Listing 10-10. XML Representation of Directors/Movies Relationship

 <films>
   <directors>
     <director_id>54</director_id>
     <first_name>Martin</first_name>
     <last_name>Scorsese</last_name>
     <movies>
       <movie_ID>30</movie_ID>
       <movie_Title>Taxi Driver</movie_Title>
       <movie_DirectorID>54</movie_DirectorID>
       <movie_Year>1976</movie_Year>
     </movies>
     <movies>
       <movie_ID>28</movie_ID>
       <movie_Title>Raging Bull</movie_Title>
       <movie_DirectorID>54</movie_DirectorID>
       <movie_Year>1980</movie_Year>
     </movies>
   </directors>
 </films>

Table 10-3 summarizes commonly used XPath operators and provides an example of using each.

Table 10-3. XPath Operators
Operator	Description
Child operator (`/`)	Separates the steps of a path; each step selects children of the node to its left. A leading `/` references the root of the XML document, where the expression begins searching. The following expression returns the `last_name` node for each director in the table: `/films/directors/last_name`
Recursive descendant operator (`//`)	This operator indicates that the search should include descendants along the specified path. The following both return the same set of `last_name` nodes. The difference is that the first begins searching at the root, and the second at each `directors` node: `//last_name` and `//directors//last_name`
Wildcard operator (`*`)	Matches any element at the specified path location. The following returns all child elements of each `movies` node: `//movies/*`
Current operator (`.`)	Refers to the currently selected node in the tree, when navigating through a tree node-by-node. It effectively becomes the root node when the operator is applied. In this example, if the current node is a `directors` node, this will find any `last_name` child nodes: `.//last_name`
Parent operator (`..`)	Used to represent the node that is the parent of the current node. If the current node were a `movies` node, this would use the `directors` node as the start of the path: `../last_name`
Attribute operator (`@`)	Returns any attributes specified. The following example would return the movie's runtime assuming there were attributes such as `<movie_ID time="98">` included in the XML. `//movies//@time`
Filter operator (`[ ]`)	Allows nodes to be filtered based on matching criteria. The following example retrieves all movie titles directed by Martin Scorsese: `//directors[last_name='Scorsese']/movies/movie_Title`
Collection operator (`[ ]`)	Uses brackets just as the filter does, but specifies a node based on an ordinal value. It is used to distinguish among nodes with the same name. This example returns the node for the second movie, Raging Bull: `//movies[2]` (The index is 1-based, not 0-based.)
Union operator (`|`)	Returns the union of nodes found on specified paths. This example returns the first and last name of each director: `//last_name | //first_name`
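
To see several of these operators at work, the following sketch runs a few of the table's expressions against an inline copy of the Listing 10-10 data. The class name OperatorDemo and the helper Query are hypothetical, and the XML is abbreviated (the movie_DirectorID elements are omitted) to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class OperatorDemo
{
    // Abbreviated inline copy of the directors/movies data.
    public const string SampleXml = @"<films>
  <directors>
    <director_id>54</director_id>
    <first_name>Martin</first_name>
    <last_name>Scorsese</last_name>
    <movies>
      <movie_ID>30</movie_ID>
      <movie_Title>Taxi Driver</movie_Title>
      <movie_Year>1976</movie_Year>
    </movies>
    <movies>
      <movie_ID>28</movie_ID>
      <movie_Title>Raging Bull</movie_Title>
      <movie_Year>1980</movie_Year>
    </movies>
  </directors>
</films>";

    // Returns the text of every node selected by the expression.
    public static List<string> Query(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        List<string> values = new List<string>();
        foreach (XmlNode n in doc.SelectNodes(xpath))
            values.Add(n.InnerText);
        return values;
    }

    public static void Main()
    {
        string[] expressions = {
            "/films/directors/last_name",                            // child
            "//movies[2]/movie_Title",                               // collection
            "//directors[last_name='Scorsese']/movies/movie_Title",  // filter
            "//last_name | //first_name"                             // union
        };
        foreach (string exp in expressions)
            Console.WriteLine(exp + " -> " +
                string.Join(", ", Query(SampleXml, exp).ToArray()));
    }
}
```

Note that the union returns its nodes in document order, so first_name precedes last_name regardless of how the expression is written.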

Note that the filter operator permits nodes to be selected by their content. There are a number of functions and operators that can be used to specify the matching criteria. Table 10-4 lists some of these.

Table 10-4. Functions and Operators used to Create an XPath Filter
Function/Operator	Description
`and`, `or`	Logical operators.
Example: `"directors[last_name= 'Scorsese' and first_name= 'Martin']"`
`position( )`	Selects node(s) at specified position.
Example: `"//movies[position()=2]"`
`contains(node,string)`	Matches if node value contains specified string.
Example: `"//movies[contains(movie_Title,'Tax')]"`
`starts-with(node,string)`	Matches if node value begins with specified string.
Example: `"//movies[starts-with(movie_Title,'A')]"`
`substring-after(string,string)`	Extracts substring from the first string that follows occurrence of second string.
Example: `"//movies[substring-after('The Graduate','The ')='Graduate']"`
`substring(string,pos,length)`	Extracts a substring of `length` characters from the node value, starting at position `pos` (positions are 1-based).
Example: `"//movies[substring(movie_Title,2,1)='a']"`

Refer to the XPath standard (http://www.w3.org/TR/xpath) for a comprehensive list of operators and functions.
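
The filter functions can be tried out quickly by substituting different predicates into one path. The sketch below assumes an abbreviated inline copy of the Listing 10-10 data; FilterDemo and Titles are hypothetical names, and the last predicate is an extra illustration of the `or` operator rather than one of the table's examples.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class FilterDemo
{
    // Abbreviated inline copy of the directors/movies data.
    public const string SampleXml = @"<films>
  <directors>
    <last_name>Scorsese</last_name>
    <movies>
      <movie_ID>30</movie_ID>
      <movie_Title>Taxi Driver</movie_Title>
      <movie_Year>1976</movie_Year>
    </movies>
    <movies>
      <movie_ID>28</movie_ID>
      <movie_Title>Raging Bull</movie_Title>
      <movie_Year>1980</movie_Year>
    </movies>
  </directors>
</films>";

    // Returns the titles of the movies that satisfy the given predicate.
    public static List<string> Titles(string predicate)
    {
        XPathDocument doc = new XPathDocument(new StringReader(SampleXml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator it =
            nav.Select("//movies[" + predicate + "]/movie_Title");
        List<string> titles = new List<string>();
        while (it.MoveNext())
            titles.Add(it.Current.Value);
        return titles;
    }

    public static void Main()
    {
        string[] predicates = {
            "contains(movie_Title,'Tax')",
            "starts-with(movie_Title,'R')",
            "position()=2",
            "movie_Year='1980' or movie_ID='30'"
        };
        foreach (string p in predicates)
            Console.WriteLine(p + " -> " +
                string.Join(", ", Titles(p).ToArray()));
    }
}
```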

Let's now look at examples of using XPath queries to search, delete, and add data to an XML tree. Our source XML file is shown in Listing 10-10. For demonstration purposes, examples are included that represent the XML data as an XmlDocument, XPathDocument, and XmlDataDocument.

XmlDocument and XPath

The expression in this example extracts the set of last_name nodes. It then prints the associated text. Note that underneath, SelectNodes uses a navigator to evaluate the expression.

 string exp = "/films/directors/last_name";
 XmlDocument doc = new XmlDocument();
 doc.Load("directormovies.xml");  // Build DOM tree
 XmlNodeList directors = doc.SelectNodes(exp);
 foreach(XmlNode n in directors)
    Console.WriteLine(n.InnerText);  // Last name of director

The XmlNode.InnerText property concatenates the values of a node and all its child nodes and returns them as a single text string. This is a convenient way to display tree contents during application testing.
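
The concatenation performed by InnerText is easiest to see on a node that has more than one child. In this small sketch (the class name InnerTextDemo and the inline XML are assumptions for illustration), the text of the child elements is joined with no separator between the values.

```csharp
using System;
using System.Xml;

public static class InnerTextDemo
{
    // Hypothetical fragment based on the director data.
    public const string SampleXml =
        "<directors><first_name>Martin</first_name>" +
        "<last_name>Scorsese</last_name></directors>";

    // Returns the InnerText of the node selected by the expression.
    public static string TextOf(string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectSingleNode(xpath).InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(TextOf("/directors/last_name")); // Scorsese
        Console.WriteLine(TextOf("/directors"));           // MartinScorsese
    }
}
```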

XPathDocument and XPath

For applications that only need to query an XML document, the XPathDocument is the recommended class. It is free of the overhead required for updating a tree and runs 20 to 30 percent faster than XmlDocument. In addition, it can be created using an XmlReader to load all or part of a document into it. This is done by creating the reader, positioning it to a desired subtree, and then passing it to the XPathDocument constructor. In this example, the XmlReader is positioned at the root node, so the entire tree is read in:

 string exp = "/films/directors/last_name";
 // Create method was added with .NET 2.0
 XmlReader rdr = XmlReader.Create("c:\\directormovies.xml");
 // Pass XmlReader to the constructor
 XPathDocument xDoc = new XPathDocument(rdr);
 XPathNavigator nav = xDoc.CreateNavigator();
 XPathNodeIterator iterator;
 iterator = nav.Select(exp);
 // List last name of each director
 while (iterator.MoveNext())
    Console.WriteLine(iterator.Current.Value);
 // Now, list only movies for Martin Scorsese
 string exp2 =
     "//directors[last_name='Scorsese']/movies/movie_Title";
 iterator = nav.Select(exp2);
 while (iterator.MoveNext())
    Console.WriteLine(iterator.Current.Value);
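
Loading only part of a document follows the same pattern, except that the reader is advanced before the document is constructed. The sketch below is one possible way to do this, not the chapter's own code: it uses ReadToFollowing to position the reader and ReadSubtree to confine the load to the current element, both of which are choices made for this illustration. The names SubtreeDemo and FirstMovieTitle, and the inline XML, are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class SubtreeDemo
{
    // Hypothetical inline data used only for this sketch.
    public const string SampleXml =
        "<films><directors><last_name>Scorsese</last_name>" +
        "<movies><movie_Title>Taxi Driver</movie_Title></movies>" +
        "<movies><movie_Title>Raging Bull</movie_Title></movies>" +
        "</directors></films>";

    public static string FirstMovieTitle(string xml)
    {
        using (XmlReader rdr = XmlReader.Create(new StringReader(xml)))
        {
            // Advance the reader to the first <movies> element.
            if (!rdr.ReadToFollowing("movies"))
                return null;

            // ReadSubtree exposes only that element and its descendants,
            // so <movies> becomes the root element of the new document.
            XPathDocument subDoc = new XPathDocument(rdr.ReadSubtree());
            XPathNavigator title =
                subDoc.CreateNavigator().SelectSingleNode("/movies/movie_Title");
            return title == null ? null : title.Value;
        }
    }

    public static void Main()
    {
        Console.WriteLine(FirstMovieTitle(SampleXml));
    }
}
```

Because the subtree is loaded as its own document, absolute paths in later queries start at movies rather than films.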

Core Note

Unlike the SelectNodes method, the navigator's Select method accepts XPath expressions as both plain text and precompiled objects. The following statements demonstrate how a compiled expression could be used in the preceding example:

 string exp = "/films/directors/last_name";
 // Use XPathNavigator to create XPathExpression object
 XPathExpression compExp = nav.Compile(exp);
 iterator = nav.Select(compExp);

Compiling an expression improves performance when the expression (query) is used more than once.
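
As a rough illustration of reuse, the following sketch compiles an expression once and then executes it several times against the same navigator. The names CompiledExpressionDemo and RunRepeatedly and the inline XML are assumptions; no timing is measured here, so the sketch shows only the pattern, not the size of any performance gain.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class CompiledExpressionDemo
{
    // Hypothetical inline data used only for this sketch.
    public const string SampleXml =
        "<films><directors><last_name>Scorsese</last_name>" +
        "<movies><movie_Title>Taxi Driver</movie_Title></movies>" +
        "<movies><movie_Title>Raging Bull</movie_Title></movies>" +
        "</directors></films>";

    // Runs the same compiled query 'times' times and returns the
    // total number of nodes selected across all runs.
    public static int RunRepeatedly(int times)
    {
        XPathNavigator nav =
            new XPathDocument(new StringReader(SampleXml)).CreateNavigator();

        // The expression text is parsed once, here.
        XPathExpression compiled = nav.Compile("//movies/movie_Title");

        int total = 0;
        for (int i = 0; i < times; i++)
        {
            XPathNodeIterator it = nav.Select(compiled);
            total += it.Count;
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(RunRepeatedly(3));
    }
}
```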

XmlDataDocument and XPath

The XmlDataDocument class allows you to take a DataSet (an object containing rows of data) and create a replica of it as a tree structure. The tree not only represents the DataSet, but is synchronized with it. This means that changes made to the DOM or DataSet are automatically reflected in the other.

Because XmlDataDocument is derived from XmlDocument, it supports the basic methods and properties used to manipulate XML data. To these, it adds methods specifically related to working with a DataSet. The most interesting of these is the GetRowFromElement method that takes an XmlElement and converts it to a corresponding DataRow.

A short example illustrates how XPath is used to retrieve the set of nodes representing the movies associated with a selected director. Each node is then converted to a DataRow, which is used to print data from a column in the row.

 // Create document by passing in associated DataSet
 XmlDataDocument xmlDoc = new XmlDataDocument(ds);
 string exp = "//directors[last_name='Scorsese']/movies";
 XmlNodeList nodeList =
       xmlDoc.DocumentElement.SelectNodes(exp);
 DataRow myRow;
 foreach (XmlNode myNode in nodeList)
 {
    myRow = xmlDoc.GetRowFromElement((XmlElement)myNode);
    if (myRow != null)
    {
       // Print Movie Title from a DataRow
       Console.WriteLine(myRow["movie_Title"].ToString());
    }
 }

This class should be used only when its hybrid features add value to an application. Otherwise, use XmlDocument if updates are required or XPathDocument if the data is read-only.

Adding and Removing Nodes on a Tree

Besides locating and reading data, many applications need to add, edit, and delete information in an XML document tree. This is done using methods that edit the content of a node and add or delete nodes. After the changes have been made to the tree, the updated DOM is saved to a file.

To demonstrate how to add and remove nodes, we'll operate on the subtree presented as text in Listing 10-10 and as a graphical tree in Figure 10-5.

Figure 10-5. Subtree used to add and remove nodes

This example uses the XmlDocument class to represent the tree, from which we will remove one movies element and to which we will add another. XPath is used to locate the movies node for Raging Bull along the path containing Scorsese as the director:

 "//directors[last_name='Scorsese']/movies[movie_Title='Raging Bull']"

This node is deleted by locating its parent node, which is on the level directly above it, and executing its RemoveChild method.

Listing 10-11. Using `XmlDocument` and `XPath` to Add and Remove Nodes

 public void UseXPath()
 {
    XmlDocument doc = new XmlDocument();
    doc.Load("c:\\directormovies.xml");
    // (1) Locate movie to remove
    string exp = "//directors[last_name='Scorsese']/" +
                 "movies[movie_Title='Raging Bull']";
    XmlNode movieNode = doc.SelectSingleNode(exp);
    if (movieNode == null) return;   // Nothing matched the expression
    // (2) Delete node and child nodes for movie
    XmlNode directorNode = movieNode.ParentNode;
    directorNode.RemoveChild(movieNode);
    // (3) Add new movie for this director
    //     First, get and save director's ID
    string directorID =
          directorNode.SelectSingleNode("director_id").InnerText;
    // XmlElement is derived from XmlNode and adds members
    XmlElement movieEl = doc.CreateElement("movies");
    directorNode.AppendChild(movieEl);
    // (4) Add Movie Description
    AppendChildElement(movieEl, "movie_ID", "94");
    AppendChildElement(movieEl, "movie_Title", "Goodfellas");
    AppendChildElement(movieEl, "movie_Year", "1990");
    AppendChildElement(movieEl, "movie_DirectorID",
                                directorID);
    // (5) Save updated XML Document
    doc.Save("c:\\directormovies2.xml");
 }
 // Create node and append to parent
 public void AppendChildElement(XmlNode parent, string elName,
                                string elValue)
 {
    XmlElement newEl =
          parent.OwnerDocument.CreateElement(elName);
    newEl.InnerText = elValue;
    parent.AppendChild(newEl);
 }

Adding a node requires first locating the node to which the new node will be attached. Then, one of the document's CreateXxx methods (such as CreateElement) is used to generate an XmlNode or XmlNode-derived object that will be added to the tree. The node is attached using the parent node's AppendChild, InsertAfter, or InsertBefore method to position the new node in the tree. In this example, we add a movies element that contains information for the movie Goodfellas.
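
The listing relies on AppendChild, which always places the new node last among its siblings. When position matters, InsertBefore and InsertAfter take a reference node as their second argument. The following sketch assumes a reduced director fragment and hypothetical names (InsertDemo, BuildDirector) to show both calls.

```csharp
using System;
using System.Xml;

public static class InsertDemo
{
    // Builds a small tree, inserts two siblings around <first_name>,
    // and returns the resulting markup.
    public static string BuildDirector()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<directors><first_name>Martin</first_name></directors>");

        XmlNode director = doc.DocumentElement;
        XmlNode firstName = director.SelectSingleNode("first_name");

        // Place <last_name> immediately after <first_name>.
        XmlElement lastName = doc.CreateElement("last_name");
        lastName.InnerText = "Scorsese";
        director.InsertAfter(lastName, firstName);

        // Place <director_id> immediately before <first_name>.
        XmlElement id = doc.CreateElement("director_id");
        id.InnerText = "54";
        director.InsertBefore(id, firstName);

        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(BuildDirector());
    }
}
```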