6.2 Data Model | Secure XML: The New Syntax for Signatures and Encryption

The XPath data model is relatively simple. Any XML document or object is a set of nodes of one of seven types (listed below). These nodes are organized into a hierarchical tree. In addition to this tree structure, a linear ordering of the nodes is maintained; this ordering is called "document order."

The document order of nodes matches the order in which the first character of each node appears in the character string form of the document. Thus an element node precedes all of its children, because the opening left angle bracket of the element's start tag occurs before all element content, attributes, or namespace declarations. By convention, the root node, which has no character representation, comes first in document order. The seven node types are as follows:

Root node
Element nodes
Attribute nodes
Namespace nodes
Text nodes
Processing instruction nodes
Comment nodes
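
As a rough illustration of document order, the following Python sketch walks a DOM tree built by the standard library's xml.dom.minidom and lists the nodes in the order described above. The function name document_order and the sample document are invented for this sketch. Namespace nodes are left out for brevity, and the sketch simply accepts whatever order minidom reports attributes in.

```python
from xml.dom import minidom, Node

def document_order(node):
    """Yield (kind, label) pairs in document order; namespace nodes omitted."""
    if node.nodeType == Node.DOCUMENT_NODE:
        yield ("root", "")
    elif node.nodeType == Node.ELEMENT_NODE:
        yield ("element", node.tagName)
        attrs = node.attributes
        for i in range(attrs.length):
            attr = attrs.item(i)
            # Namespace declarations are not attribute nodes in XPath.
            if attr.name != "xmlns" and not attr.name.startswith("xmlns:"):
                yield ("attribute", attr.name)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        yield ("text", node.data)
        return
    elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        yield ("processing instruction", node.target)
        return
    elif node.nodeType == Node.COMMENT_NODE:
        yield ("comment", node.data)
        return
    else:
        return  # e.g. a document type node, which XPath does not model
    # Only the root and element nodes reach this point; visit their children.
    for child in node.childNodes:
        yield from document_order(child)

doc = minidom.parseString('<a x="1"><!--c--><b>t</b><?p d?></a>')
nodes = list(document_order(doc))
```

The element "a" is reported before its attribute and before all of its children, and the root comes first even though nothing in the character string corresponds to it.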

Note that no provision is made for DTDs or the XML declaration. In effect, XPath takes the point of view that DTDs and the XML declaration are part of the external form of XML. After reading XML into an application, the DTD and declaration have already been taken into account and are no longer useful. Similarly, the parsing of characters into an XPath node-set removes external artifacts such as CDATA sections and character references. Example 6-1 and Figure 6-2 provide an example of some XML and the resulting XPath data model.

Figure 6-2. XPath data model

Example 6-1 External XML

 <?xml version="1.0" ?>
 <example>qwert yuiop<ens:foo xmlns:ens="http://bar.example"> <bar a='b' c='d'/><bar/> <!--fun--></ens:foo><![CDATA[more text]]></example>
 <!--more fun-->

The following sections describe the seven node types in XPath, including the string value and extended name for each node type. In XPath, every node has a "string value" and some have "extended names"; both of these values are accessible through XPath functions, as described in Section 6.5.

6.2.1 Root Nodes

Every XPath node-set has one and only one root node for its tree. The top-level element of an XML document, often called the "root element," is a child of the XPath root node in the XPath node-set for that document. It is necessary to provide the root node because comments and processing instructions can appear both before and after the root element. The root node is the parent of such outside-of-document nodes. Many uses of XPath are intended to also apply to general external parsed entities and so allow multiple element children of the root node.

The root node has no parent. Every other node in an XPath node-set has exactly one parent.

The string value of the root node is the concatenation of the string values of all text node descendants of the root node, in document order. Root nodes do not have extended names.
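
The role of the root node can be sketched with xml.dom.minidom from the Python standard library, treating the DOM Document node as a stand-in for the XPath root node. That correspondence is an assumption of this sketch, not something DOM or XPath defines, and the sample document is invented.

```python
from xml.dom import minidom, Node

# A comment before the top-level element and a processing instruction after it.
doc = minidom.parseString("<!--before--><doc>text</doc><?after done?>")

kinds = {
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}
# The Document node is the parent of everything at the top level.
root_children = [kinds[child.nodeType] for child in doc.childNodes]
root_has_parent = doc.parentNode is not None
```

Here root_children lists a comment, the top-level element, and a processing instruction, all siblings under the root, and the root itself has no parent.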

6.2.2 Element Nodes

An element node exists for each element in the original XML object. The XPath string value of an element is the concatenation of the string values of all of its text node descendants, in document order. For example, the string value of

 <e>(one<!--two--> three<f g="hijklmnop">four <?five      six?>seven</f>nine) xyz</e>

is

 (one threefour sevennine) xyz
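
The computation can be sketched in Python with xml.dom.minidom. The helper name string_value is invented here; it approximates the XPath rule by concatenating the data of DOM text and CDATA nodes, on the assumption that the DOM tree mirrors the XPath tree closely enough for this purpose.

```python
from xml.dom import minidom, Node

TEXT_KINDS = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)

def string_value(node):
    """Concatenate the data of all text-node descendants in document order."""
    if node.nodeType in TEXT_KINDS:
        return node.data
    if node.nodeType in (Node.ELEMENT_NODE, Node.DOCUMENT_NODE):
        return "".join(string_value(child) for child in node.childNodes)
    # Comments and processing instructions contribute nothing to the
    # string value of an enclosing element or of the root.
    return ""

source = ('<e>(one<!--two--> three<f g="hijklmnop">four '
          '<?five      six?>seven</f>nine) xyz</e>')
doc = minidom.parseString(source)
value = string_value(doc.documentElement)
```

The comment, the processing instruction, and the attribute value all drop out, so value matches the string shown above.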

The extended name of an element node is its local name and the URI of its namespace, if any. The URI is that of the namespace bound to the element's prefix if a prefix is present, or that of the default namespace if no prefix is present. The URI of the extended name is null only if no namespace prefix exists and the default namespace is null or not declared in scope.
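
Python's xml.etree.ElementTree exposes a comparable pair for each element by writing the tag as "{uri}local". The sketch below splits that form apart; the helper expanded_name and the urn:example: URIs are made up for illustration.

```python
import xml.etree.ElementTree as ET

def expanded_name(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return (uri, local)
    return (None, tag)  # no prefix and no default namespace in scope

root = ET.fromstring(
    '<outer><a xmlns="urn:example:default" xmlns:p="urn:example:p">'
    '<p:b/><c/></a></outer>'
)
names = [expanded_name(elem.tag) for elem in root.iter()]
```

The unprefixed elements "a" and "c" pick up the default namespace, "b" takes the URI bound to its prefix, and "outer" has a null URI because no default namespace is in scope there.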

6.2.3 Attribute Nodes

Every element node has an associated and possibly empty set of attribute nodes. (Namespace declarations are not attributes, although they may look like them.) Although XPath considers the element to be the parent of its attribute nodes, it does not consider the attribute nodes to be children or descendants of their element. As a consequence, to access the attribute nodes, an application must use different XPath operations than the application uses to access children. This treatment of attributes in XPath differs from that found in the Document Object Model [DOM]: DOM does not treat elements as the parents of their attributes.
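
The DOM side of this comparison can be observed with xml.dom.minidom from the Python standard library. This is only a sketch of how one DOM implementation behaves; the variable names are arbitrary.

```python
from xml.dom import minidom

doc = minidom.parseString('<bar a="b" c="d"><child/></bar>')
bar = doc.documentElement
attr_a = bar.getAttributeNode("a")

# Attributes are reached through a separate collection, not the child list.
child_names = [child.nodeName for child in bar.childNodes]
attribute_names = sorted(bar.attributes.keys())

dom_parent = attr_a.parentNode    # DOM: an attribute has no parent
dom_owner = attr_a.ownerElement   # DOM links back to the element this way
```

In both models the attributes are absent from the child list. The difference is that XPath's parent axis leads from an attribute back to its element, whereas DOM leaves parentNode empty and offers ownerElement instead.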

Because the XPath model is invoked after XML has been parsed by an application on input, the XPath node-set includes the default attributes. Attributes in the xml namespace, which affect all descendants of an element until they are overridden, such as xml:lang, nevertheless appear only as single attribute nodes for the elements in whose start tags they occur.

The string value of an attribute node is the normalized attribute value. See [XML].
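
Both effects, default attributes and attribute value normalization, can be seen with the expat-based parser behind Python's xml.etree.ElementTree. The document below is invented for the sketch, and it assumes the default is declared in the internal DTD subset, where a nonvalidating parser can see it.

```python
import xml.etree.ElementTree as ET

source = (
    '<!DOCTYPE e [<!ATTLIST e kind CDATA "defaulted">]>\n'
    # The value of "a" contains a literal newline and a literal tab.
    '<e a="one\ntwo\tthree &amp; &#x41;"/>'
)
elem = ET.fromstring(source)
attributes = dict(elem.attrib)
```

The attribute "kind" appears even though the start tag never mentions it. In the value of "a" the literal newline and tab have each become a space, and the entity and character references have been replaced by "&" and "A".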

The extended name of an attribute is its local name and the URI of its namespace if it has a namespace prefix. The URI part of the extended name is null if no namespace prefix is present.
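
The rule that the default namespace does not apply to attributes can be checked with xml.etree.ElementTree, which writes qualified names as "{uri}local". The element and the urn:example: URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    '<e xmlns="urn:example:default" xmlns:p="urn:example:p" '
    'plain="1" p:qualified="2"/>'
)
element_name = elem.tag
attribute_names = sorted(elem.attrib)
```

The element name carries the default namespace, but the unprefixed attribute "plain" has no URI at all; only the prefixed attribute is namespace-qualified. The namespace declarations themselves do not show up among the attributes.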

6.2.4 Namespace Nodes

As shown in Example 6-1 and Figure 6-2, a namespace declaration does not simply create a namespace node attached to the element in whose start tag the declaration occurs. Rather, XPath creates a namespace node on that element and on each of its descendant elements, except at and below any element where a new declaration of the same prefix overrides the ancestral one.

Perhaps the designers of XPath chose to replicate namespace nodes over descendant elements to make it easier to access the set of namespace declarations in scope. In reality, this choice destroys information. Consider Example 6-2. After parsing and conversion to an XPath node-set, it is no longer possible to tell that the namespace declaration of prefix "x" appeared on element "B." Because of the namespace declaration at element "A," XPath would have created a namespace node at "A" and at all of its descendants, including "B," even if that same namespace declaration had not occurred at "B." Similarly, after conversion to an XPath node-set, you can't tell whether this declaration also occurred on element "C" or "D" in the input. This problem makes canonicalization more difficult, as explained in Chapter 9.

The string value of a namespace node is the namespace URI that is bound to the prefix. Thus the string value of the namespace nodes created in Example 6-2 is

 http://foo.example/bar

The extended name of a namespace node has the prefix as its local name; under XPath 1.0, the URI part of this extended name is always null. If the declaration involves the default namespace, the local name is empty.

Example 6-2 XPath and namespaces

 <x:A xmlns:x="http://foo.example/bar">
   <x:B xmlns:x="http://foo.example/bar">
     <x:C><D/></x:C>
   </x:B>
 </x:A>
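
The loss of information can be demonstrated with a small Python sketch built on xml.dom.minidom. The function in_scope_namespaces is a hypothetical helper that approximates the namespace nodes of each element as a prefix-to-URI mapping; it ignores the implicit "xml" prefix. The second document is Example 6-2 with the redundant declaration on "B" removed.

```python
from xml.dom import minidom, Node

def in_scope_namespaces(xml_text):
    """Map each element's qualified name to its in-scope {prefix: URI} bindings."""
    scopes = {}

    def visit(elem, inherited):
        scope = dict(inherited)  # start from the ancestors' bindings
        attrs = elem.attributes
        for i in range(attrs.length):
            attr = attrs.item(i)
            if attr.name == "xmlns":
                prefix = ""
            elif attr.name.startswith("xmlns:"):
                prefix = attr.name[len("xmlns:"):]
            else:
                continue  # an ordinary attribute
            if attr.value:
                scope[prefix] = attr.value
            else:
                scope.pop(prefix, None)  # xmlns="" undeclares the default
        scopes[elem.tagName] = scope
        for child in elem.childNodes:
            if child.nodeType == Node.ELEMENT_NODE:
                visit(child, scope)

    visit(minidom.parseString(xml_text).documentElement, {})
    return scopes

URI = "http://foo.example/bar"
with_redeclaration = (
    f'<x:A xmlns:x="{URI}"><x:B xmlns:x="{URI}"><x:C><D/></x:C></x:B></x:A>'
)
without_redeclaration = (
    f'<x:A xmlns:x="{URI}"><x:B><x:C><D/></x:C></x:B></x:A>'
)
same_model = (in_scope_namespaces(with_redeclaration)
              == in_scope_namespaces(without_redeclaration))
```

The two inputs differ as character strings, yet same_model is true: every element, including the unprefixed "D," ends up with the same binding for "x" in both cases.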

6.2.5 Text Nodes

Character content appears as text nodes. Text nodes always have at least one character in them, and a text node in the full XPath node-set never has an adjacent text node. Contiguous text in the original input, however long, appears as a single text node. An XPath selection, however, can select two or more text nodes without selecting the intervening nodes of other types. Processing such a selection can therefore mean processing two or more text nodes in a row.

The character data in a text node is the internal application version. Thus CDATA sections have already been processed, character references have been replaced by the characters they refer to, and so on. For example, the external representation

 &quot;<![CDATA[>"<]]>&amp;&quot;

appears as a text node with content

 ">"<&"

The string value of a text node is its character data. A text node has no extended name.
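
The example above can be reproduced with xml.etree.ElementTree from the Python standard library. The wrapper element "t" is added only so that the fragment parses as a document.

```python
import xml.etree.ElementTree as ET

external = '<t>&quot;<![CDATA[>"<]]>&amp;&quot;</t>'
# The parser resolves the references and unwraps the CDATA section.
text = ET.fromstring(external).text
```

After parsing, text holds the six characters shown above, with no trace of which of them came from references and which from the CDATA section.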

6.2.6 Processing Instruction Nodes

The extended name of a processing instruction is the target name of the processing instruction with a null URI. The string value of a processing instruction is the part of the instruction that occurs after the target and after any white space separating it from the target. The string value does not include the terminating "?>". Thus

 <?foo  ?>

has a null string value, whereas

 <?bar Bokm 42 yZgv?>

has the string value "Bokm 42 yZgv".

Note that any processing instructions in the DTD and the XML declaration do not appear as nodes of any sort in the corresponding XPath node-set. Other than that, a processing instruction node corresponds to each processing instruction in the external XML. (Although the XML declaration looks like a processing instruction, it is defined not to be one.)
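
A quick check of the two examples with xml.dom.minidom, assuming that its target and data fields correspond to the XPath extended name and string value. The wrapper element "r" is invented for the sketch.

```python
from xml.dom import minidom, Node

doc = minidom.parseString("<r><?foo  ?><?bar Bokm 42 yZgv?></r>")
instructions = [
    (node.target, node.data)
    for node in doc.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
```

The white space after the target "foo" is discarded, leaving an empty value, while "bar" keeps its content up to but not including the closing "?>".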

6.2.7 Comment Nodes

A comment node does not have an extended name. The string value of a comment comprises everything between the opening "<!--" and the closing "-->". For example, the string value of

 <!---->

is null, whereas that of

 <!--Four score and Seven.-->

is "Four score and Seven."
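
The same two comments, read back through xml.dom.minidom as a sketch; the wrapper element "r" is made up, and the data field is assumed to correspond to the XPath string value.

```python
from xml.dom import minidom, Node

doc = minidom.parseString("<r><!----><!--Four score and Seven.--></r>")
comments = [
    node.data
    for node in doc.documentElement.childNodes
    if node.nodeType == Node.COMMENT_NODE
]
```

The empty comment still produces a node; its value is simply the empty string.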