Section 24.1. The XPath data model | XML in Office 2003: Information Sharing with Desktop XML



24.1. The XPath data model

It is only possible to construct an address – any address – given a model. For instance, the US postal system is built on a model of states containing cities containing streets with house numbers. To some degree the model falls naturally out of the geography of the country, but it is mostly artificial. State and city boundaries are not exactly visible from an airplane. We give new houses street numbers so that they can be addressed within the postal system's model.

Relational databases also have a model that revolves around tables, records, columns, foreign keys and so forth. This "relational model" is the basis for the SQL query language. Just as SQL depends on the relational model, XPath depends on a formal model of the logical structure and data in an XML document.

24.1.1 Why do we need a model?

You may wonder if XML really needs a formal model. It seems so simple: elements within elements, attributes of elements and so forth. It is simple, but some details must be standardized for addresses to behave reliably. The tricky part is that there are many ways of representing what might seem to be the "same" information. We can represent a less-than symbol in at least four ways:

a predefined entity reference: &lt;
a CDATA section: <![CDATA[<]]>
a decimal Unicode character reference: &#60;
a hex Unicode character reference: &#x3C;

We could also reference a text entity that embeds a CDATA section and a text entity that embeds another text entity that embeds a character reference, etc. In a query you would not want to explicitly search for the less-than symbol in all of these variations. It would be easier to have a processor that could magically normalize them to a single model. Every XPath-based query engine needs to get exactly the same data model from any particular XML document.
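The normalization described above can be observed with any conforming parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module stands in for the XML processor underneath an XPath engine. The element name v and the helper parsed_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Four spellings of the same less-than character inside a
# hypothetical <v> element.
VARIANTS = {
    "predefined entity reference": "<v>&lt;</v>",
    "CDATA section": "<v><![CDATA[<]]></v>",
    "decimal character reference": "<v>&#60;</v>",
    "hex character reference": "<v>&#x3C;</v>",
}

def parsed_text(xml_source):
    """Return the character data of the document element after parsing."""
    return ET.fromstring(xml_source).text

for label, source in VARIANTS.items():
    print(f"{label}: {parsed_text(source)!r}")
```

Each variant should report the same single character, which is the property a query engine relies on when it compares data rather than markup.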

24.1.2 Tree addressing

The XPath data model views a document as a tree of nodes, or node tree. Most nodes correspond to document components, such as elements and attributes.

It is very common to think of XML documents as being either families (elements have child elements, parent elements and so forth) or trees (roots, branches and leaves). This is natural: trees and families are both hierarchical in nature, just as XML documents are. XPath uses both metaphors but tends to lean more heavily on the familial one.^[1]

^[1] Politicians take note: in this case, family values win out over environmentalism!

XPath uses genealogical taxonomy to describe the hierarchical makeup of an XML document, referring to children, descendants, parents and ancestors. The parent is the element that contains the element under discussion. The list of ancestors includes the parent, the parent's parent and so forth. A list of descendants includes children, children's children and so forth.

As there is no culture-independent way to talk about the first ancestor, XPath calls it the "root". The root is not an element. It is a logical construct that holds the document element and any comments and processing instructions that precede and follow it.
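The family vocabulary maps onto two simple tree walks. The sketch below assumes Python's xml.dom.minidom, whose Document node is used here only as a stand-in for the XPath root; the DOM is a different model, so treat this as an approximation. The helper names ancestors and descendants and the shortened sample document are ours.

```python
from xml.dom import minidom

SAMPLE = (
    '<part-list><part-name nbr="A12">bolt</part-name>'
    '<warning type="ignore"/></part-list>'
)

def ancestors(node):
    """Yield the parent, the parent's parent, and so on up to the top."""
    node = node.parentNode
    while node is not None:
        yield node
        node = node.parentNode

def descendants(node):
    """Yield children, children's children, and so on, in document order."""
    for child in node.childNodes:
        yield child
        yield from descendants(child)

doc = minidom.parseString(SAMPLE)
bolt = doc.getElementsByTagName("part-name")[0].firstChild  # the text "bolt"

print([n.nodeName for n in ancestors(bolt)])
print([n.nodeName for n in descendants(doc.documentElement)])
```

The last ancestor reported is the document node rather than an element, which mirrors the point that the root is not itself an element.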

Trees in computer science are very rarely (if ever) illustrated as a natural tree is drawn, with the root at the bottom and the branches and leaves growing upward. Far more typically, trees are depicted with the root at the top just as family trees are. This is probably due to the nature of our writing systems and the way we have learned to read.^[2] Accordingly, this chapter refers to stepping "down" the tree towards the leaf-like ends and "up" the tree towards the root as the tree is depicted in Figure 24-1. One day we will genetically engineer trees to grow this way and nature will be in harmony with technology.

^[2] To do: rotate all tree diagrams for Japanese edition of this book!

Figure 24-1. Vertical tree depictions

24.1.3 Node tree construction

A node tree is built by an XPath processor after parsing an XML document like that in Example 24-1.

Example 24-1. Sample document

<?xml version="1.0"?>
<!--start-->
<part-list><part-name nbr="A12">bolt</part-name>
<part-name nbr="B45">washer</part-name><warning type="ignore"/>
<!--end of list--><?cursor blinking?>
</part-list>
<!--end of file-->

In constructing the node tree, the boundaries and contents of "important" constructs are preserved, while other constructs are discarded. For example, entity references to both internal and external entities are expanded and character references are resolved. The boundaries of CDATA sections are discarded. Characters within the section are treated as character data.

The node tree constructed from the document in Example 24-1 is shown in Figure 24-2. In the following sections, we describe the components of node trees and how they are used in addressing. You may want to refer back to this diagram from time to time as we do so.

Figure 24-2. Node tree for document in Example 24-1
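For readers without the figure at hand, a rough textual equivalent can be printed. This is a sketch only: it assumes Python's xml.dom.minidom as an approximation of the XPath data model, the line-break positions in the sample are our assumption, and the helpers describe and dump are hypothetical names.

```python
from xml.dom import Node, minidom

# Example 24-1; the line-break positions are an assumption of this sketch.
SAMPLE = """<?xml version="1.0"?>
<!--start-->
<part-list><part-name nbr="A12">bolt</part-name>
<part-name nbr="B45">washer</part-name><warning type="ignore"/>
<!--end of list--><?cursor blinking?>
</part-list>
<!--end of file-->"""

def describe(node):
    """Return an XPath-style description of a single DOM node."""
    if node.nodeType == Node.DOCUMENT_NODE:
        return "root"
    if node.nodeType == Node.ELEMENT_NODE:
        return f"element name={node.tagName}"
    if node.nodeType == Node.TEXT_NODE:
        return f"text value={node.data!r}"
    if node.nodeType == Node.COMMENT_NODE:
        return f"comment value={node.data!r}"
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        return f"processing-instruction name={node.target} value={node.data!r}"
    return node.nodeName  # node kinds this sketch does not label

def dump(node, depth=0):
    """Return indented description lines for a node and everything below it."""
    lines = ["  " * depth + describe(node)]
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes hang off the element but are not among its children.
        for name, value in node.attributes.items():
            lines.append("  " * (depth + 1) + f"attribute name={name} value={value!r}")
    for child in node.childNodes:
        lines.extend(dump(child, depth + 1))
    return lines

print("\n".join(dump(minidom.parseString(SAMPLE))))
```

Note that the XML declaration does not appear in the output, and that the line feeds inside part-list show up as text nodes of their own.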

24.1.4 Node types

The XPath data model describes seven types of nodes used to construct the node tree representing any XML document. We are interested primarily in the root, element, attribute and text node types, but will briefly discuss the others.

For each node type, XPath defines a way to compute a string-value (labeled "value" in Figure 24-2). Some node types also have a "name".

24.1.4.1 Root node

The top of the hierarchy that represents the XML document is the root node.

It is important to remember that in the XPath data model the root of the tree representing an XML document is not the document (or root) element of the document. A root node is different from a root element. The root node contains the root element.

The nodes that are children of the root node represent the document element and the comments and processing instructions found before and after the document element.
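The distinction between root node and root element is easy to check. In this sketch we again assume Python's xml.dom.minidom, treating its Document node as an approximation of the XPath root node; the cut-down sample is ours.

```python
from xml.dom import minidom

SAMPLE = "<!--start--><part-list/><!--end of file-->"

doc = minidom.parseString(SAMPLE)     # stand-in for the root node
root_element = doc.documentElement    # the part-list element

# The root element is a child of the root node, not the same thing.
print(root_element is doc)
print(root_element.parentNode is doc)

# The root node also holds the comments around the document element.
print([child.nodeName for child in doc.childNodes])
```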

24.1.4.2 Element nodes

Every element in an XML document is represented in the node tree as an element node. Each element has a parent node. Usually an element's parent is another element but the document element has as its parent the root node.

Element nodes can have as their children other element nodes, text nodes, comment nodes and processing instruction nodes.

An element node also exhibits properties, such as its name, its attributes and information about its active namespaces.

Element nodes in documents with DTDs may have unique identifiers. These allow us to address element nodes by name. IDs are described in 15.3.3.2, "ID and IDREF attributes", on page 361.

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in the document order. You can think of it as all of the data with none of the markup, organized into one long character string.
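The definition translates almost directly into a recursive function. The following sketch assumes Python's xml.dom.minidom as the tree; string_value is a hypothetical helper of ours, not a library call, and the line breaks in the sample are an assumption.

```python
from xml.dom import Node, minidom

SAMPLE = """<part-list><part-name nbr="A12">bolt</part-name>
<part-name nbr="B45">washer</part-name><warning type="ignore"/>
<!--end of list--><?cursor blinking?>
</part-list>"""

def string_value(element):
    """Concatenate all text-node descendants in document order."""
    parts = []
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(string_value(child))
        # Comments and processing instructions contribute nothing.
    return "".join(parts)

doc = minidom.parseString(SAMPLE)
print(repr(string_value(doc.documentElement)))
```

Attribute values such as A12, the comment and the processing instruction are all absent from the result; only character data, including the line feeds, survives.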

24.1.4.3 Text nodes

The XML Recommendation describes character data as all text that is not markup. In other words it is the textual data content of the document and it does not include data in attribute values, processing instructions and comments.

XPath does not care how a character was originally represented. The string "&lt;&gt;" in an XML document is simply "<>" from the data model's point of view. The same goes for "&#60;&#62;" and "<![CDATA[<>]]>". The characters represented by any of these will be grouped with the data characters that precede and follow them and called a "text node." The individual characters of the text node are not considered its children: they are just part of its value. Text nodes do not have any children.^[3]

^[3] As the word "text" means something different in XPath from its meaning in the XML Recommendation, we try always to say "text node", even when the context is clear, reserving "text" as a noun for its normal meaning.
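The grouping of differently represented characters into one run of character data can be sketched as follows. We assume Python's xml.etree.ElementTree, which exposes the run as a plain string in the text attribute rather than as a node object; the element name p and its content are made up.

```python
import xml.etree.ElementTree as ET

# Literal characters, an entity reference, a CDATA section and a
# character reference, all in the content of one hypothetical <p> element.
SOURCE = "<p>a &lt; b <![CDATA[& c]]> &#62; d</p>"

p = ET.fromstring(SOURCE)

print(repr(p.text))  # one string; the original boundaries are gone
print(len(p))        # number of child elements
```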

Remember that whitespace is significant. A text node might contain nothing else. In Figure 24-2, for example, nodes T2, T4, and T5 contain line feed characters, represented by hexadecimal character references.^[4]

^[4] Character references are described in 15.6, "Character references", on page 368.

24.1.4.4 Attribute nodes

If an element has attributes then these are represented as attribute nodes. These nodes are not considered children of the element node. They are more like cousins who live in the guest house.

An attribute node exhibits name, string-value, and namespace URI properties. Defaulted attributes are reported as having the default values. The data model does not record whether they were explicitly specified or merely defaulted. No node is created for an unspecified attribute that had an #IMPLIED default value declared. Attribute nodes are also not created for attributes used as namespace declarations.
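These rules can be seen with a small internal DTD subset. The sketch assumes Python's xml.etree.ElementTree, which reads the internal subset; the level attribute, the nbr value and the namespace URI urn:example:parts are invented for the example.

```python
import xml.etree.ElementTree as ET

SOURCE = """<!DOCTYPE warning [
  <!ATTLIST warning type  CDATA "ignore"
                    level CDATA #IMPLIED>
]>
<warning xmlns:p="urn:example:parts" nbr="A12"/>"""

warning = ET.fromstring(SOURCE)

# type is present through its default, level (#IMPLIED, unspecified)
# is absent, and the xmlns:p declaration is not reported as an attribute.
print(warning.attrib)
```

Nothing in the result says that type was defaulted while nbr was specified, which matches the data model's indifference to that distinction.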

Note that an XML processor is not required to read an external DTD unless it is validating the document. This means that detection of ID attributes and default attribute values is not mandatory.

24.1.4.5 Other node types

Namespace nodes keep track of the set of namespace prefix/URI pairs that are in effect at a particular point in a document. Like attribute nodes, namespace nodes are attached to element nodes and are not in any particular order.

Each comment and processing instruction in the XML document is instantiated as a comment or processing instruction node in the node tree. The string-value property accesses the content of these constructs, as you can see in Figure 24-2.

