document

The document() function finds an external XML document by resolving a URI reference, parses the XML into a tree structure, and returns its root node. It may also be used to find a set of external documents, and it may be used to find a node other than the root by using a fragment identifier in the URI.

For example, the expression «document('data.xml')» looks for the file data.xml in the same directory as the stylesheet, parses it, and returns the root node of the resulting tree.

Changes in 2.0

XPath 2.0 defines a simplified version of the document() function called doc(). The full document() function is retained as an XSLT function for backwards compatibility with XSLT 1.0.

The specification of the function has been generalized to allow the first argument to be an arbitrary sequence of URIs, and it has also become less prescriptive, to allow greater freedom to configure the way in which the URI is interpreted and the way in which the retrieved documents are parsed.

Signature

Argument	Data type	Meaning
href	item()*	A sequence, which may contain values of type xs:string or xs:anyURI, or nodes containing such values. These URIs are used to locate the documents to be loaded.
base (optional)	node()	If the argument is present, it must be a node. The base URI of this node is used for resolving any relative URIs found in the first argument.
Result	node()*	A sequence of nodes, in document order. In the common case where a single URI is specified, and this URI contains no fragment identifier, the result will normally be a single document node.

Effect

In brief, the document() function locates an XML document, using a URI. The resulting XML document is parsed and a tree is constructed. On completion, the result of the document() function is the document node of the new document.

If a sequence of URIs is provided, rather than a single URI, then the result is a sequence of document nodes. If a URI contains a fragment identifier, then the result may be an element node rather than a document node. The details are described in the following sections.
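As a rough sketch of the sequence case, and assuming an XSLT 2.0 processor, the following stylesheet passes two URIs in a single call. The file names jan.xml and feb.xml, the named template main, and the <loaded> and <doc> elements are invented for the illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point; the name "main" is hypothetical -->
  <xsl:template name="main">
    <!-- Two URIs in one call: the result is a sequence of document nodes.
         jan.xml and feb.xml are assumed to sit next to the stylesheet. -->
    <xsl:variable name="docs" select="document(('jan.xml', 'feb.xml'))"/>
    <loaded count="{count($docs)}">
      <xsl:for-each select="$docs">
        <!-- document-uri() reports the URI of each document, if known -->
        <doc uri="{document-uri(.)}"/>
      </xsl:for-each>
    </loaded>
  </xsl:template>

</xsl:stylesheet>
```

Because the URIs are supplied as strings, they are resolved relative to the stylesheet, as explained later in this section.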

I will describe the effect of the function by considering three separate cases, which reflect different ways of determining a base URI to use for resolving relative URIs. However, first a word about URIs and URLs, which are terms I use rather freely throughout this section.

Resolving the URI

The XSLT specification always uses the term URI: Uniform Resource Identifier. The concept of a URI is a generalization of the URLs (Uniform Resource Locators) that are widely used on the Web today and displayed on every cornflakes packet. The URI extends the URL mechanism, which is based on the established Domain Name System (with its hierarchic names such as www.ibm.com and www.cam.ac.uk), to allow other global naming and numbering schemes, including established ones such as ISBN book numbers and international telephone numbers. While URIs are a nice idea, the only ones that really enable you to retrieve resources on the Web are the familiar URLs. This is why the terms URI and URL seem to be used rather interchangeably in this section and indeed throughout the book. If you read carefully, though, you'll see that I've tried to use both terms correctly.

The way URIs are used to locate XML documents, and the way these XML documents are parsed to create a tree representation, is not defined in detail. In fact, the XSLT document() function defines this process in terms of the XPath doc() function, and the XPath doc() function essentially says that it's a piece of magic performed by the context of the XPath expression, not by the XPath processor itself. This reflects the reality that when you are using an application programming interface (API) such as the Java JAXP interface or the System.Xml.Xsl class in Microsoft's .NET, you can supply your own code that maps URIs to document nodes in any way you like. (The relevant class is called URIResolver in JAXP, XmlResolver in .NET.) This might not even involve any parsing of a real XML file; for example, the URIResolver might actually retrieve data from a relational database, and return an XML document that encapsulates the results of the query.

There's an expectation, though, that most XSLT processors (unless running in some kind of secure environment) will allow you to specify a URL (typically one that starts «http:» or «file:») that can be dereferenced in the usual way to locate a source XML document, which is then parsed. The details of how it is parsed, for example whether schema or Document Type Definition (DTD) validation is attempted and whether XInclude processing is performed, are likely to depend on configuration settings (perhaps options on the command line, or properties set via the processor's API). The language specification leaves this open-ended.

A URI used as input to the document() function should generally identify an XML document. If the URI is invalid, or if it doesn't identify any resource, or if that resource is not an XML document, the specification leaves it up to the implementation to decide what to do: it can either report the error, or ignore that particular URI. Implementations may go beyond this: for example, if the URI identifies an HTML document they may attempt to convert the HTML to XML. This is all outside the scope of the W3C specifications.

A URI can be relative rather than absolute. A typical example of a relative URI is data.xml. Such a URI is resolved (converted to an absolute, globally unique URI) by interpreting it as relative to some base URI. By default, a relative URI that appears in the text of an XML document is interpreted relative to the URI of the document (or more precisely, the XML entity) that contains it, which in the case of the document() function is usually either the source document or the stylesheet. So if the relative URI data.xml appears in the source document, the system will try to find the file in the same directory as the source document, while if it appears in the stylesheet, the system will look in the directory containing the stylesheet. The base URI of a node in an XML document can be changed using the xml:base attribute, and this will be taken into account. In addition, the document() function provides a second argument so that the base URI can be specified explicitly, if required.

The actual rule is that the href argument may be a sequence of nodes or atomic values. In the case of a node in this sequence, the node may contain a URI (or indeed, a sequence of URIs), and if such a URI is relative then it is expanded against the base URI of the node from which it came. In the case of an atomic value in the sequence, this must be an xs:string or xs:anyURI value, and it is expanded using the base URI of the stylesheet.
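The difference between the two cases can be sketched as follows. The <link> element and its href attribute are hypothetical; the point is only that the same relative URI may be resolved against two different base URIs depending on whether it reaches the function as a node or as a string.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- <link href="data.xml"/> is an assumed element in the source document -->
  <xsl:template match="link">
    <!-- Node argument: data.xml is resolved against the base URI of the
         href attribute, normally the location of the source document -->
    <xsl:apply-templates select="document(@href)"/>

    <!-- Atomic argument: string() discards the node, so data.xml is
         resolved against the base URI of the stylesheet instead -->
    <xsl:apply-templates select="document(string(@href))"/>
  </xsl:template>

</xsl:stylesheet>
```

If the source document and the stylesheet live in the same directory the two calls locate the same file, which is why the distinction often goes unnoticed.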

The expansion of relative URIs exploits the fact that in the XPath data model, described on page 53 in Chapter 2, every node has a base URI. (Don't confuse this with the namespace URI, which is quite unrelated.) By default, the base URI of a node in the source document or the stylesheet will be the URI of the XML document or entity from which the node was constructed. In some cases, for example when the input comes from a Document Object Model (DOM) document or from a relational database, it may be difficult for the processor to determine the base URI (the concept does not exist in the DOM standard). What happens in this situation is implementer defined. Microsoft, whose MSXML3 processor is built around its DOM implementation, has extended its DOM so it retains knowledge of the URI from which the document was loaded.

With XSLT 2.0, you can override the default rules for establishing the base URI of a node by using the xml:base attribute of an element. This attribute is defined in a W3C Recommendation called XML Base ( http://www.w3.org/TR/xmlbase/ ); it is intended to fulfill the same function as the <base> element in HTML. If an element has an xml:base attribute, the value of the attribute must be a URI, and this URI defines the base URI for the element itself and for all descendants of the element node, unless overridden by another xml:base attribute.

The URI specified in xml:base may itself be a relative URI, in which case it is resolved relative to the base URI of the parent of the element containing the xml:base attribute (that is, the URI that would have been the base URI of the element if it hadn't had an xml:base attribute).
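To make the nesting rule concrete, here is a small illustrative source document. The element names and the example.com URIs are made up; the comments show the base URI I would expect each element to have under the rules just described.

```xml
<catalog xml:base="http://example.com/books/">
  <!-- base URI of <catalog>: http://example.com/books/ -->

  <section xml:base="reviews/">
    <!-- relative xml:base, resolved against the parent's base URI:
         http://example.com/books/reviews/ -->

    <review href="r1.xml"/>
    <!-- document(@href) applied to this attribute would look for
         http://example.com/books/reviews/r1.xml -->
  </section>

  <review href="r2.xml"/>
  <!-- outside <section>, so document(@href) would look for
       http://example.com/books/r2.xml -->
</catalog>
```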

With XSLT 2.0, it is also possible that the node used to establish the base URI for the document() function will be a node in a temporary tree created as the value of a variable. Normally, the base URI for such a node will be the base URI of the <xsl:variable> (or <xsl:param>, or <xsl:with-param>) element that defines the temporary tree. But if an element in the stylesheet has an xml:base attribute, that defines the base URI in the same way as for a source document.

If several calls on the document() function use the same URI (after expansion of a relative URI into an absolute URI), then the same document node is returned each time. You can tell that it's the same node because the «is» operator returns true: «document('a.xml') is document('a.xml')» will always be true. If you use a different URI in two calls, then you may or may not get the same document node back: «document('a.xml') is document('A.XML')» might be either true or false.
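A minimal way to check this for yourself, assuming an XSLT 2.0 processor and a file a.xml alongside the stylesheet (both the file and the <same-node> element are placeholders):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="main">
    <!-- Both calls expand to the same absolute URI, so the "is"
         operator compares a document node with itself -->
    <same-node>
      <xsl:value-of select="document('a.xml') is document('a.xml')"/>
    </same-node>
  </xsl:template>

</xsl:stylesheet>
```

Changing the second URI to a different spelling of the same file name is a quick way to find out how your own processor and file system treat the uncertain case.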

A fragment identifier identifies a part of a resource: for example, in the URL http://www.wrox.com/booklist#april2004, the fragment identifier is april2004. In principle, a fragment identifier allows the URI to reference a node or set of nodes other than the root node of the target document; for example, the fragment identifier could be an XPointer expression containing a complex expression to select nodes within the target document. In practice though, this is all implementation defined. The interpretation of a fragment identifier depends on the media type (often called MIME type) of the returned document. Implementations are not required to support any particular media types (which means they are not required to support fragment identifiers at all). Many products support a simple fragment identifier consisting of a name that must be the value of an ID attribute in the target document, and support for XPointer fragment identifiers is likely to become increasingly common now that a usable XPointer specification has finally been ratified.

Parsing the Document

Once the URI has been resolved against a base URI, the next steps are to fetch the XML document found at that URI, and then to parse it into a tree representation. The specification says very little about these processes, which allows the implementation considerable freedom to configure what kind of URLs are acceptable, and how the parsing is done. It is not even required that the resource starts life as XML: an implementation could quite legitimately return a document node that represents an HTML document, or the results of a database query. If the URL does refer to an XML file, there are still variations allowed in how it is parsed, for example whether DTD or schema validation takes place, and whether XInclude references are expanded. A vendor might provide additional options such as the ability to strip comments, processing instructions, and unreferenced namespaces. You need to check the documentation for your product to see how such factors can be controlled.

The specification does say that whitespace-only nodes are stripped following the same rules as for the source document, based on the <xsl:strip-space> and <xsl:preserve-space> declarations in force. This is true even if the document happens to be a stylesheet.
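As an illustration, the declarations below would apply both to the principal source document and to anything loaded through document(). The file name extra.xml, the <pre> element, and the <text-nodes> wrapper are assumptions made for the sketch.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Strip whitespace-only text nodes everywhere except inside <pre> -->
  <xsl:strip-space elements="*"/>
  <xsl:preserve-space elements="pre"/>

  <xsl:template match="/">
    <!-- extra.xml is loaded with the same stripping rules in force -->
    <xsl:variable name="extra" select="document('extra.xml')"/>
    <text-nodes>
      <xsl:value-of select="count($extra//text())"/>
    </text-nodes>
  </xsl:template>

</xsl:stylesheet>
```

Running this with and without the <xsl:strip-space> declaration should give different counts for any document that is indented for readability.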

URIs Held in Nodes

For a simple case such as «document(@href)», the result is a single node, namely the root node of the document referenced by the href attribute of the context node.

More generally, the argument may be a sequence of nodes, each of which contains a sequence of URIs. The result is then the sequence obtained by processing each of these in turn. For example, «document(//@href)» returns the sequence of documents located by dereferencing the URIs in all the href attributes in the original context document. The result is returned in document order of the returned nodes (a somewhat academic concept since they will usually be different documents). The result is not necessarily in the order of the href attributes, and duplicates will be eliminated.

If any of the nodes contains a relative URI, it will be resolved relative to the base URI of that node. The base URI of a node is established using the rules given on page 53. In fact, each node in the supplied sequence could potentially have a different base URI.

This all sounds terribly complicated, but all it really means is that if the source document contains the link «data.xml», then the system will look for the file data.xml in the same directory as the source document.

These rules also cover the case where the argument is a reference to a variable containing a temporary tree, for example:

  <xsl:variable name="index">index.xml</xsl:variable>
  <xsl:for-each select="document($index)">
    . . .
  </xsl:for-each>

In this case the relative URI «index.xml» is resolved relative to the base URI of the <xsl:variable> element in the stylesheet, which is generally the URI of the stylesheet module itself.

Usage

A common use of the document() function is to access a document referenced from the source document, typically in an attribute such as href. For example, a book catalog might include links to reviews of each book, in a format such as:

  <book>
    <review date="1999-12-28" publication="New York Times"
            text="reviews/NYT/19991228/rev3.xml"/>
    <review date="2000-01-06" publication="Washington Post"
            text="reviews/WPost/20000106/rev12.xml"/>
  </book>

If you want to incorporate the text of these reviews in your output document, you can achieve this using the document() function. For example:

  <xsl:template match="book">
    <xsl:for-each select="review">
      <h2>Review in <xsl:value-of select="@publication"/></h2>
      <xsl:apply-templates select="document(@text)"/>
    </xsl:for-each>
  </xsl:template>

As the argument @text is a node, the result will be the root node of the document whose URI is the value of the text attribute, interpreted relative to the base URI of the <review> element, which (unless it comes from an external XML entity or is affected by an xml:base attribute on some ancestor node) will be the same as the URI of the source document itself.

Note that in processing the review document, exactly the same template rules are used as we used for the source document itself. There is no concept of particular template rules being tied to particular document types. If the review document uses the same element tags as the book catalog, but with different meanings, this can potentially create problems. There are two possible ways round this:

Namespaces: use a different namespace for the book catalog and for the review documents.
Modes: use a different mode to process nodes in the review document, so that the <xsl:apply-templates> instruction in the example would become:

  <xsl:apply-templates select="document(@text)" mode="review"/>

You might find that even if the element names are distinct, the use of modes is a good discipline for maintaining readability of your stylesheet. For more detail on modes, see <xsl:apply-templates> (page 187) and <xsl:template> (page 450) in Chapter 5.
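To show how the mode carries through, here is a sketch of the book template together with one template rule for the review vocabulary. The <para> element is hypothetical, since the structure of the review documents isn't given; the only point is that rules in the review mode never fire for the catalog, and vice versa.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Catalog vocabulary: processed in the default (unnamed) mode -->
  <xsl:template match="book">
    <xsl:for-each select="review">
      <h2>Review in <xsl:value-of select="@publication"/></h2>
      <!-- Switch to the review mode when entering the external document -->
      <xsl:apply-templates select="document(@text)" mode="review"/>
    </xsl:for-each>
  </xsl:template>

  <!-- Review vocabulary: <para> is an assumed element name -->
  <xsl:template match="para" mode="review">
    <!-- Stay in the review mode for the children as well -->
    <p><xsl:apply-templates mode="review"/></p>
  </xsl:template>

</xsl:stylesheet>
```

Note that the mode has to be repeated on the inner <xsl:apply-templates>; otherwise processing would fall back to the default mode and the catalog's rules would be considered again.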

Another useful approach, which helps to keep your stylesheet modular, is to include the templates for processing the review document in a separate stylesheet incorporated using <xsl:include>.

Using the document() Function to Analyze a Stylesheet

A stylesheet is an XML document, so it can be used as the input to another stylesheet. This makes it very easy to write little tools that manipulate stylesheets. This example shows such a tool, designed to report on the hierarchic structure of the modules that make up a stylesheet.

This example uses the document() function to examine a stylesheet and see which stylesheet modules it incorporates using <xsl:include> or <xsl:import>. The modules referenced by <xsl:include> or <xsl:import> are fetched and processed recursively.

Source

The source is any stylesheet, preferably one that uses <xsl:include> or <xsl:import>. A file dummy.xsl is provided in the code download for the book for you to use as a sample.

Stylesheet

The stylesheet list-includes.xsl uses the document() function to access the document referenced in the href attribute of <xsl:include> or <xsl:import>. It then applies the same template rules to this document, recursively. Note that the root template is applied only to the initial source document, to create the HTML skeleton page.

  <xsl:transform
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="1.0"
  >
  <xsl:template match="/">
    <html><body>
      <h1>Stylesheet Module Structure</h1>
      <ul>
        <xsl:apply-templates select="*/xsl:include | */xsl:import"/>
      </ul>
    </body></html>
  </xsl:template>
  <xsl:template match="xsl:include | xsl:import">
    <li><xsl:value-of select="concat(local-name(),'s ',@href)"/>
    <xsl:variable name="module" select="document(@href)"/>
    <ul>
      <xsl:apply-templates
         select="$module/*/xsl:include | $module/*/xsl:import"/>
    </ul>
    </li>
  </xsl:template>
  </xsl:transform>

Output

The output for the dummy.xsl stylesheet is as shown in Figure 7-1.

Figure 7-1

URIs as Atomic Values

As an alternative to supplying a URI that is held in the content of a node, the first argument may supply a URI as an atomic string. For convenience, the function accepts both xs:string and xs:anyURI types, as well as untyped atomic values. (Untyped atomic values are unlikely to arise in practice, since they normally arise only from atomizing a node in a schema-less document, and if you supply a node as an argument to the document() function, then the rules that apply are those in the previous section, URIs Held in Nodes.)

The first argument may be evaluated to produce a single atomic value containing a URI, or a sequence of them. It is even possible to mix atomic values and nodes in the input sequence; nodes are processed as described in the previous section, and atomic values as described here.

The most common case is a URL hard-coded in the stylesheet, for example «document('tax-rates.xml')».

Another common case is «document('')», which refers to the stylesheet itself. This construct was often used with XSLT 1.0, where it provided a convenient way to maintain lookup tables in the stylesheet itself. It is likely to be less common with XSLT 2.0, since the ability to hold a temporary tree in a global variable is usually much more convenient. The URI may be supplied as an xs:string, an xs:anyURI, or an untyped atomic value, and in each case is converted to a string. (XSLT 1.0 also allowed it to be supplied as a boolean or an integer, which creates a theoretical backwards incompatibility, but since converting a boolean or number is unlikely to yield a useful URL, the point is rather academic.)

The string is treated as a URI reference; that is, a URI optionally followed by a fragment identifier separated from the URI proper by a «#» character. If it is a relative URI, it is treated as being relative to the base URI of the stylesheet element that contains the expression in which the function call was encountered. This will normally be the URI of the principal stylesheet document, but it may be different if <xsl:include> or <xsl:import> was used, or if pieces of the stylesheet are contained in external XML entities, or if the base URI of any relevant element in the stylesheet has been set explicitly by using the xml:base attribute.

Again, all this really means is that relative URLs are handled just like relative URLs in HTML. If you write «document('tax-rates.xml')» in a particular stylesheet module, then the system looks for the file tax-rates.xml in the same directory as that stylesheet module.

If the string is an empty string, then the document referenced by the base URI is used. The XSLT specification states that «document('')» will return the root node of the stylesheet. Strictly speaking, however, this is true only if the base URI of the XSL element containing the call to the document() function is the same as the system identifier of the stylesheet module. If the base URI is different, perhaps because the stylesheet has been built up from a number of external entities, or because the xml:base attribute has been used, the object loaded by «document('')» will not necessarily be the current stylesheet module; in fact, it might not be a well-formed document at all, in which case an error will be reported.

The specification refers to RFC2396 ( http://www.ietf.org/rfc/rfc2396.txt ) for the definitive interpretation of the use of a zero-length relative URI to refer to the containing document. This is actually a nifty piece of buck-passing, since there has been some debate as to what exactly RFC2396 means in this case. Whereas in all other cases the RFC talks about resolving a relative URI against a base URI, in this case it talks about using the "current document," a concept that some people claim is a different thing. This means there is disagreement as to whether xml:base should affect the meaning of this particular relative URI.

If the call is contained in a stylesheet brought in using <xsl:include> or <xsl:import>, it returns the root node of the included or imported stylesheet, not that of the principal stylesheet document.

Usage

With XSLT 1.0, this form of the document() function was very useful for handling data used by the stylesheet for reference information: for example, lookup tables to expand abbreviations, message files in different languages, or the text of the message of the day, to be displayed to users on the login screen. Such data can either be in the stylesheet itself (referenced as «document('')»), or be in a separate file held in the same directory as the stylesheet (referenced as «document('messages.xml')») or a related directory (for example «document('../data/messages.xml')»).

With XSLT 2.0, it is no longer necessary to use a secondary document for these purposes, because the data can be held in a tree-valued variable in the stylesheet and accessed directly. However, it may in some cases be more convenient to maintain the data in a separate file (for example, it makes it easier to generate the data periodically from a database), and in any case you may still want to write stylesheets that work with XSLT 1.0 processors, especially if you want the transformation to happen client-side. So I'll show the XSLT 1.0 technique first, and then show how the same problem can be tackled in XSLT 2.0.

XSLT allows data such as lookup tables to appear within any top-level stylesheet element, provided it belongs to a non-default namespace.

A Lookup Table in the Stylesheet

This example uses data in a lookup table to expand abbreviations of book categories. Two techniques are shown: in the first example the lookup table is held in the stylesheet; in the second example it is held in a separate XML document.

Source

This is the booklist.xml file we saw earlier.

  <booklist>   <book category="S">   <title>Number, the Language of Science</title>   <author>Danzig</author>   </book>   <book category="FC">   <title>The Young Visiters</title>   <author>Daisy Ashford</author>   </book>   <book category="FC">   <title>When We Were Very Young</title>   <author>A. A. Milne</author>   </book>   <book category="CS">   <title>Design Patterns</title>   <author>Erich Gamma</author>   <author>Richard Helm</author>   <author>Ralph Johnson</author>   <author>John Vlissides</author>   </book>   </booklist>

Stylesheet

The stylesheet is list-categories.xsl. It processes each of the <book> elements in the source file and, for each one, finds the <book:category> element in the stylesheet whose code attribute matches the category attribute of the <book>. Note the use of current() to refer to the current book; it would be wrong to use «.» here, because «.» refers to the context node, which is the <book:category> element being tested.

  <xsl:transform
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="1.0"
   xmlns:book="books.uri"
   exclude-result-prefixes="book"
  >
  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="//book">
        <h1><xsl:value-of select="title"/></h1>
        <p>Category: <xsl:value-of
            select="document('')/*/book:category
                    [@code=current()/@category]/@desc"/>
        </p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
  <book:category code="S" desc="Science"/>
  <book:category code="CS" desc="Computing"/>
  <book:category code="FC" desc="Children's Fiction"/>
  </xsl:transform>

Output

The output of this stylesheet is as follows.

  <html>
    <body>
      <h1>Number, the Language of Science</h1>
      <p>Category: Science</p>
      <h1>The Young Visiters</h1>
      <p>Category: Children's Fiction</p>
      <h1>When We Were Very Young</h1>
      <p>Category: Children's Fiction</p>
      <h1>Design Patterns</h1>
      <p>Category: Computing</p>
    </body>
  </html>
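For the variant in which the table lives in a separate XML document, a sketch might look like the following. The file name categories.xml and its <categories>/<category> structure are my own assumptions, chosen to mirror the in-stylesheet table; the file is taken to sit in the same directory as the stylesheet.

```xml
<!--
  Assumed contents of categories.xml (hypothetical file):
    <categories>
      <category code="S"  desc="Science"/>
      <category code="CS" desc="Computing"/>
      <category code="FC" desc="Children's Fiction"/>
    </categories>
-->
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="//book">
        <h1><xsl:value-of select="title"/></h1>
        <!-- The string 'categories.xml' is resolved relative to the
             stylesheet; current() still refers to the <book> element -->
        <p>Category: <xsl:value-of
            select="document('categories.xml')/categories/category
                    [@code=current()/@category]/@desc"/>
        </p>
      </xsl:for-each>
    </body></html>
  </xsl:template>

</xsl:transform>
```

With this arrangement the lookup elements no longer need a namespace of their own, because they are not top-level children of the stylesheet.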

XSLT 2.0 Stylesheet

Now, let's modify this stylesheet to take advantage of XSLT 2.0 facilities. It's renamed list-categories2-0.xsl. It isn't a big change; the lines that differ are the lookup expression and the variable that holds the table.

  <xsl:transform
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="2.0"
   xmlns:book="books.uri"
   exclude-result-prefixes="book"
  >
  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="//book">
        <h1><xsl:value-of select="title"/></h1>
        <p>Category: <xsl:value-of
            select="$categories/category
                    [@code=current()/@category]/@desc"/>
        </p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
  <xsl:variable name="categories">
    <category code="S" desc="Science"/>
    <category code="CS" desc="Computing"/>
    <category code="FC" desc="Children's Fiction"/>
  </xsl:variable>
  </xsl:transform>

Supplying an Explicit Base URI

This section discusses what happens when the second argument to the document() function is supplied. In this instance, instead of using the containing node or the stylesheet as the base for resolving a relative URI, the base URI of the node supplied as the second argument is used. In other words, if a node in href contains a relative URL such as «data.xml», the system will look for the file data.xml in the directory containing the XML document from which the node in $base was derived.

The value of the second argument must be a single node. For example, the call «document(@href, /)» will use the root node of the source document as the base URI, even if the element containing the href attribute was found in an external entity with a different URI.

Usage

This option is not one that you will need to use very often, but it is there for completeness. If you want to interpret a URI relative to the stylesheet, you can write, for example:

  document(@href, document(''))

This works because the second argument returns the root node of the stylesheet, which is then used as the base URI for the relative URI contained in the href attribute.

With the extended function library that XPath 2.0 makes available, an alternative is to resolve the relative URI yourself by calling the resolve-uri() function, which is described in XPath 2.0 Programmer's Reference. This allows you to resolve a relative URI against any base URI, which does not have to be the base URI of any particular node.
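A sketch of this alternative, assuming an XSLT 2.0 processor, is shown next. The <link> element and the base URI http://example.com/data/ are invented for the example; any absolute URI could be used in its place.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- <link href="..."/> is an assumed element in the source document -->
  <xsl:template match="link">
    <!-- Resolve the relative URI against a base chosen by the stylesheet
         author rather than the base URI of any node -->
    <xsl:variable name="target"
        select="resolve-uri(@href, 'http://example.com/data/')"/>

    <!-- $target is already absolute, so document() needs no base URI -->
    <xsl:apply-templates select="document($target)"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the result of resolve-uri() is an atomic value, the rules for URIs as atomic values apply, but they have no visible effect here since there is nothing left to resolve.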
