Understanding the XPath Data Model | XPath Kick Start: Navigating XML with XPath 1.0 and 2.0

The XPath 2.0 data model is based on an XML document's infoset. A document's infoset contains all the document's data reduced to a standard form in a set of properties; you can read all about infosets at www.w3.org/TR/xml-infoset/.

For example, if you're working with a processing instruction, the infoset will contain a number of properties for that processing instruction, including target, content, base-uri, and parent. These properties are then translated into XPath 2.0 data model properties.

The details of how this works are not directly important to us because they're handled by the software, and the XPath 2.0 properties for a node aren't directly available to us anyway (these properties are accessed by the XPath 2.0 processor when you use the XPath 2.0 language). However, it's good to know how the process works in overview.

In general, an XML document is first reduced to its infoset, which may be validated against an XML schema (although XPath 2.0 makes provisions for DTD validation, the W3C's focus is clearly on schemas), resulting in a Post-Schema-Validation Infoset (PSVI). The PSVI's properties are then converted to the corresponding XPath 2.0 data model properties and made available to XPath processors.

All the data from the PSVI is represented in sequences (single items, like single nodes, are represented as singleton sequences). As you know, sequences can contain nodes or atomic values, or a mix of the two. The XPath 2.0 data model uses the same seven node kinds as the XPath 1.0 data model does (except that root nodes are now called document nodes):

Document nodes
Element nodes
Attribute nodes
Processing instruction nodes
Comment nodes
Text nodes
Namespace nodes

Note that in the XPath 2.0 data model, each node has two kinds of values: its string value and its typed value. The string value is the node's content expressed as a string. The typed value, on the other hand, is of the type the node has been declared to be. For example, if you've declared an element to contain decimal data, and it holds the string "1.0", its typed value will be the decimal value 1.0. As a result of schema validation, every element and attribute node has a type annotation, which is the name of the type against which the node was successfully validated. For attribute nodes, the type annotation is always the name of a simple type. For element nodes, the type annotation may be the name of a simple or a complex type. Because the type annotation carries this type data, nodes also have an associated typed value.

Typed values of attributes and elements based on simple types are just sequences of atomic values corresponding to the node's content after validation. The typed value of an element based on a complex type, on the other hand, is considered undefined.

Atomic values, in contrast to nodes, correspond to the primitive simple types defined by the XML Schema specification, or are values whose types are derived from those types by restriction in a schema.

That's what the picture looks like in overview. Now we'll take a closer look at the various legal items in the XPath 2.0 data model (nodes and atomic values), starting with the kinds of nodes allowed.

The first node kind we'll take a look at is the document node in XPath 2.0.

Document Nodes

The document node (the same as XPath 1.0 root nodes) encapsulates the entire XML document; it's the starting point in the tree that describes the XML document. In the XPath 2.0 data model, document nodes have a number of properties derived from the PSVI. You don't access these properties directly (the software you're using does), but they give you an idea of the data that is available for a document node:

base-uri
children
unparsed-entities
document-uri

Every document node must have a unique identity and must be distinct from all other nodes. If there are children, they must consist only of element, processing instruction, comment, and text nodes. You cannot have attribute, namespace, or document nodes as direct children of a document node.

Note also that the sequence of nodes in the children property is ordered (those nodes will be in document order), and the children property must not contain two consecutive text nodes (they must be merged to normalize that text). In well-formed XML documents, the children of the document node must not be empty and must consist only of element nodes, processing-instruction nodes, and comment nodes. Exactly one of these children, the document element, is an element node. (Don't confuse the document element with the document node. The document element contains all the other elements in the document.)

Included in the information available to XPath 2.0 software about a document node is: the base URI, the kind of node the node is (which returns "document" in this case), the string value of the node (which is all the string values of all text node descendants concatenated together), the typed value of this node (which is its string value), the node's children, and the URI of the document itself.
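To make the rules about the document node's children concrete, here's a minimal sketch of a style sheet that lists those children in document order. It assumes an XSLT 2.0 processor such as Saxon and any well-formed input document; the wording of the report lines is our own choice, not anything defined by the data model.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- "/" matches the document node, not the document element. -->
  <xsl:template match="/">
    <!-- Here node() selects the children of the document node. -->
    <xsl:for-each select="node()">
      <xsl:choose>
        <xsl:when test="self::*">
          <xsl:text>element: </xsl:text>
          <xsl:value-of select="name()"/>
        </xsl:when>
        <xsl:when test="self::comment()">
          <xsl:text>comment</xsl:text>
        </xsl:when>
        <xsl:when test="self::processing-instruction()">
          <xsl:text>processing instruction: </xsl:text>
          <xsl:value-of select="name()"/>
        </xsl:when>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
    <!-- A well-formed document has exactly one element child. -->
    <xsl:text>element children: </xsl:text>
    <xsl:value-of select="count(*)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

For a document parsed from well-formed XML, the last line should report one element child (the document element), and no text node should appear in the list.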

Element Nodes

Element nodes encapsulate XML elements. In the XPath 2.0 data model, elements have these properties:

base-uri
node-name
parent
type
children
attributes
namespaces

In addition, element nodes must have a type annotation, which indicates what type of element they are. (As mentioned earlier, exactly how the type annotation works is implementation-specific at this point, and is not defined by XPath 2.0.) Element nodes must also have a unique identity, distinct from all other nodes. If there are children, the children of an element must be only element, processing instruction, comment, and text nodes. Attribute, namespace, and document nodes cannot be element node children.

Also, the children property may not contain two consecutive text nodes, and the sequence of nodes in the children property is ordered (in document order). The attributes of an element must have distinct names, as must its namespace nodes, if there are any. And no namespace node may have the name "xmlns".

ELEMENT AND ATTRIBUTE NODES THAT DO NOT HAVE PARENTS

The XPath 2.0 data model supports element and attribute nodes that do not have parents. It does this to let you work with partial results during expression processing. However, as you'd expect, these elements or attributes may not be children of any other node.

Included in the PSVI for element nodes is this kind of information:

The element's base URI
The node kind (which returns "element" here)
The node name (which is the qualified name of the element)
The parent of the element
Its string value
Its typed value
The children of the element if there are any
Its attributes
Its namespaces, if there are any

Attribute Nodes

In the XPath 2.0 data model, attribute nodes encapsulate XML attributes. Attributes have these properties:

node-name
string-value
parent
type

In XPath 2.0, attribute nodes must have a type annotation, which indicates what type of attribute they are. Attribute nodes must also have a unique identity, distinct from all other nodes. Note that in XPath 2.0, the element node that owns an attribute is often called its parent. However, an attribute node is not considered a child of its parent element.

Included in the information about an attribute in the PSVI are its base URI, node kind (which is "attribute" here), its node name (which is the qualified name of the attribute), its parent element, its string value, its typed value, and its type.
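The one-way relationship between an attribute and its parent element can be checked from XPath itself. The following is a sketch only, assuming an XSLT 2.0 processor and an input document in which at least one element has attributes; it relies on the XPath 2.0 node identity operator (is) and the intersect operator.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Match every element that carries at least one attribute. -->
  <xsl:template match="*[@*]">
    <xsl:value-of select="name()"/>
    <!-- Each attribute's parent is this element (identity test with "is"). -->
    <xsl:text>: parent is owner = </xsl:text>
    <xsl:value-of select="every $a in @* satisfies $a/.. is ."/>
    <!-- No attribute node appears among the element's children. -->
    <xsl:text>, attribute among children = </xsl:text>
    <xsl:value-of select="exists(node() intersect @*)"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- Suppress text so that only the report lines are written. -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

For each matching element, the first test should come out true and the second false: the attribute can reach its element through the parent axis, but the element's children never include the attribute.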

Namespace Nodes

In XPath 2.0, namespace nodes encapsulate XML namespaces. Namespaces have these properties in the XPath 2.0 data model:

prefix
uri
parent

Namespace nodes must have a unique identity, distinct from all other nodes. Namespace prefixes may be an empty sequence. In fact, if the namespace URI is an empty string, the prefix must be an empty sequence.

USING THE NAMESPACE AXIS

Because the namespace axis is deprecated in XPath 2.0, the information held in namespace nodes is instead made available to applications using two functions: get-in-scope-namespaces and get-namespace-uri-for-prefix.

The information the data model stores for a namespace node includes its base URI, its node kind (which returns "namespace" here), its node name (which returns a qualified name with the namespace prefix and an empty URI), its parent node, and its string value (which is the namespace URI of the node).
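Here's a hedged sketch of reading namespace information through functions rather than through the namespace axis. Note the assumption: the function names used in this chapter come from a working draft, and this sketch instead uses in-scope-prefixes and namespace-uri-for-prefix, the names these functions carry in the final XPath 2.0 Recommendation. If your processor follows the draft, substitute the draft names. The variable name root is our own.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- root is a hypothetical variable name holding the document element. -->
    <xsl:variable name="root" select="*"/>
    <!-- One line per prefix that is in scope on the document element. -->
    <xsl:for-each select="in-scope-prefixes($root)">
      <xsl:value-of select="."/>
      <xsl:text> = </xsl:text>
      <xsl:value-of select="namespace-uri-for-prefix(., $root)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The list always includes the built-in xml prefix, and a default namespace, if one is declared, shows up with a zero-length prefix.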

Processing Instruction Nodes

In XPath 2.0, processing instruction nodes encapsulate XML processing instructions. Processing instructions have these properties:

target
content
base-uri
parent

Included in the information the XPath 2.0 data model stores for processing instructions is its base URI, node kind (returns "processing-instruction"), node name (returns a qualified name with the processing-instruction target in the local-name and an empty URI), its parent, its string value (the content of the processing-instruction), and its typed value (which is the string value of the processing-instruction).

Comment Nodes

Comment nodes encapsulate XML comments. Comments have these properties:

content
parent

Included in the information the XPath 2.0 data model stores for comments is the base URI of the comment's parent, its node kind (which returns "comment" here), its parent, its string value, and its typed value (which is just the string value of the comment).

Text Nodes

Text nodes encapsulate XML character content. Text nodes have these properties in the XPath 2.0 data model:

content
parent

In XPath 2.0, a text node cannot have the empty string as its content, and document and element nodes impose the constraint that two text nodes can never occur as adjacent siblings.

Included in the information the XPath 2.0 data model stores for text nodes are the base URI of the node's parent, the node kind (which returns "text" here), the parent element or document node, and its string value (which is just the content of the text node).

Atomic Values

As opposed to nodes, atomic values correspond to the primitive simple types defined by the XML Schema specification, or are values whose types are derived from them by restriction in a schema. Here are the primitive simple types as predefined in the XML Schema specification (the xs namespace corresponds to "http://www.w3.org/2001/XMLSchema"):

xs:string
xs:boolean
xs:decimal
xs:float
xs:double
xs:duration
xs:dateTime
xs:time
xs:date
xs:gYearMonth
xs:gYear
xs:gMonthDay
xs:gDay
xs:gMonth
xs:hexBinary
xs:base64Binary
xs:anyURI
xs:QName
xs:NOTATION

Along with the primitive simple types, types that are derived from them (including by the user) by restriction are considered atomic types in the XPath 2.0 data model. You use the <restriction> element in an XML schema to derive this kind of type and restrict the values it can take.

MORE ON PRIMITIVE TYPES

For more on the primitive simple types built into XML schema, see http://www.w3.org/TR/xmlschema-2.

In the following example, we're declaring a derived and restricted type named StateAbbreviation, and using a <pattern> element to restrict it to two-character strings like AZ or CA:

 <simpleType name='StateAbbreviation'>
   <restriction base='xs:string'>
     <pattern value='[A-Z]{2}'/>
   </restriction>
 </simpleType>
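To see where a definition like this lives, here's a minimal sketch of a complete schema built around it. The element name state is hypothetical and is there only to give the type something to annotate. Unlike the fragment above, this sketch puts the xs prefix on the schema elements themselves instead of relying on a default namespace declaration.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- The restricted type from the text: exactly two uppercase letters. -->
  <xs:simpleType name="StateAbbreviation">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{2}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical element declared to hold a StateAbbreviation value. -->
  <xs:element name="state" type="StateAbbreviation"/>

</xs:schema>
```

Against this schema, an instance such as <state>AZ</state> validates, while <state>Arizona</state> does not, because it fails the pattern. After validation, the state element's type annotation would be StateAbbreviation.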

There are some types built into XPath 2.0 that have already been derived by restriction from the XML schema xs:duration type. These types are in the namespace http://www.w3.org/2003/05/xpath-datatypes, which is represented by the prefix xdt:

xdt:dayTimeDuration is a subtype of xs:duration, which contains only day, hour, minute, and second components. In XPath 2.0, if you subtract two date values, the result is of the xdt:dayTimeDuration type.
xdt:yearMonthDuration is a subtype of xs:duration, which is restricted to only year and month components.
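As a quick illustration of the date subtraction mentioned above, here's a sketch of a style sheet that subtracts one xs:date value from another. It assumes an XSLT 2.0 processor; the variable name elapsed is our own, and days-from-duration is the function name used in the final XPath 2.0 Recommendation (draft-era processors may use a different name for it).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Subtracting two dates yields a day-time duration. -->
    <xsl:variable name="elapsed"
        select="xs:date('2004-09-02') - xs:date('2004-08-29')"/>
    <!-- The duration in its lexical form, for example P4D. -->
    <xsl:value-of select="$elapsed"/>
    <xsl:text> (</xsl:text>
    <!-- The days component of the duration as a number. -->
    <xsl:value-of select="days-from-duration($elapsed)"/>
    <xsl:text> days)</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The two dates are four days apart, so the result should display as P4D (4 days).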

In addition, there are three abstract types, xdt:anyAtomicType, xdt:untypedAtomic, and xdt:untypedAny, which are now built into XPath 2.0. Because they're abstract, you can't create variables of these types directly, but you can use them in certain places, as we'll see in the next chapter (for example, see the discussion of the instance of expression in the next chapter, where we can use xdt:anyAtomicType to indicate that we want to match any atomic type):

xdt:anyAtomicType is an abstract type that is the base type of all atomic values. All atomic types, such as xs:integer, xs:string, and xdt:untypedAtomic, are subtypes of xdt:anyAtomicType.
xdt:untypedAtomic is an atomic type used for untyped data, such as text that is not given a specific type by schema validation. It is the type used to annotate unvalidated attribute nodes, for example, attribute nodes in well-formed documents.
xdt:untypedAny is the type used to annotate unvalidated element nodes, that is, elements whose type is unknown (such as the elements in a well-formed, schemaless document).
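The following sketch shows how these types surface in an instance of test. One assumption needs stating loudly: in the final XPath 2.0 Recommendation these types were moved into the XML Schema namespace, so the sketch writes xs:anyAtomicType and xs:untypedAtomic. On a processor that follows the draft described here, you would bind the xdt prefix and use the xdt names instead. The sketch also assumes the input document has not been validated against a schema.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Any atomic value matches the base atomic type. -->
    <xsl:text>date is atomic: </xsl:text>
    <xsl:value-of
        select="xs:date('2004-09-02') instance of xs:anyAtomicType"/>
    <xsl:text>&#10;</xsl:text>
    <!-- data() returns the typed value of the document element. -->
    <xsl:text>document element data is untyped: </xsl:text>
    <xsl:value-of select="data(*) instance of xs:untypedAtomic"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

With an unvalidated input document, both tests should report true: the date is an atomic value, and the typed value of an unvalidated element is its string value treated as untyped atomic data.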

As in XPath 1.0, you often don't deal with the various data types directly. Instead, you might use functions or expressions that return items of the various data types. For example, say that you wanted to use the XPath 2.0 current-dateTime function, which returns a value of the xs:dateTime type. Here's how we might assign an XSLT variable named rightNow the xs:dateTime value returned by current-dateTime:

 <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:variable name="rightNow" select="current-dateTime()" />
         .
         .
         .
 </xsl:stylesheet>

Now we can use this new variable as a valid XPath 2.0 expression, because its type, xs:dateTime, is valid in XPath 2.0. Here's how that might work in a style sheet that just displays the current date and time (whatever document you apply it to, the output replaces the document node with that data):

 <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:variable name="rightNow" select="current-dateTime()" />
   <xsl:template match="/">
     The date and time is:
     <xsl:value-of select="$rightNow"/>
   </xsl:template>
 </xsl:stylesheet>

And here's what you get when you use the style sheet with Saxon. As you can see, our xs:dateTime variable was indeed supported:

 <?xml version="1.0" encoding="UTF-8"?>
         The date and time is:
         2003-08-29T19:38:01.787Z

However, sometimes you do want to work with the supported data types explicitly. Here's an example, where we're using the xs:date constructor to create an xs:date value:

 <xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xsl:template match="/">
     <xsl:value-of select="xs:date('2004-09-02')"/>
   </xsl:template>
 </xsl:stylesheet>

TYPES DERIVED BY LIST OR UNION

What about types you may derive in XML schema that are not restricted, such as types derived by list or union? Items of these types are converted into sequences in XPath 2.0. It's easy to see how list types are converted into sequences, but union types are more troublesome. When you derive a type from the union of other types, that union is converted into a simple sequence of the types in the union, one after the other. The actual type defined by union is not preserved, although its components are. Only the type of each individual item in the union is kept in this case.

Now that we've created our xs:date value, Saxon is able to display its value to us this way:

 <?xml version="1.0" encoding="UTF-8"?>
     2004-09-02