unparsed-text

The unparsed-text() function returns the content of an external file in the form of a string.

Changes in 2.0

This function is new in XSLT 2.0.

Signature

Argument	Data Type	Meaning
href	xs:string	The URI of the external text file to be loaded.
encoding (optional)	xs:string	The character encoding of the text in the file.
Result	xs:string	The textual content of the file.

Effect

This function is analogous to the document() function described on page 532, except that the file referenced by the URI is treated as text rather than as XML. The file is located and its textual content is returned as the result of the unparsed-text() function, in the form of a string.

The value of the href argument is a URI. It mustn't contain a fragment identifier (the part marked with a «#» sign). It may be an absolute URI or a relative URI; if it is relative, it is resolved against the base URI of the stylesheet. This is true even if the relative URI is contained in a node in a source document. In this situation it is a good idea to resolve the URI yourself before calling this function. For example, if the URI is in an attribute called src then the call might be «unparsed-text(resolve-uri(@src, base-uri(@src)))». The resolve-uri() function is described in XPath 2.0 Programmer's Reference; its first argument is the relative URI, and the second argument is the base URI used to resolve it.
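As a minimal sketch of this technique, the template below resolves the URI against the source document before loading the file. The <include> and <included> element names are hypothetical, and the sketch assumes that the source document has a known base URI.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Hypothetical source element: <include src="notes.txt"/> -->
  <xsl:template match="include[@src]">
    <!-- Resolve @src against the base URI of the attribute's document,
         rather than letting it default to the base URI of the stylesheet -->
    <xsl:variable name="uri" select="resolve-uri(@src, base-uri(@src))"/>
    <included from="{$uri}">
      <xsl:value-of select="unparsed-text($uri)"/>
    </included>
  </xsl:template>

</xsl:stylesheet>
```

Holding the resolved URI in a variable also makes it available for diagnostics, as in the from attribute here.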

The optional encoding argument specifies the character encoding of the file. This can be any character encoding supported by the implementation; the only encodings that an implementation must support are UTF-8 and UTF-16. The system will not necessarily use this encoding: the rules for deciding an encoding are as follows, and are based on the rules given in the XLink recommendation:

First, the processor looks for so-called external encoding information. This typically means information supplied in an HTTP header, but the term is general and could apply to any metadata associated with the file, for example WebDAV properties.
Next, it looks at the media type (MIME type), and if this identifies the file as XML, then it determines the encoding using the same rules as an XML parser (for example, it looks for an XML declaration, and if there is none, it looks for a byte order mark). Why would you use this function, rather than document(), to access an XML document? The thinking is that it is quite common for one XML document to act as an envelope for another XML document that is carried transparently in a CDATA section, and if you want to create such a composite document, you will want to read the payload document without parsing it.
Next, it uses the encoding argument if this has been supplied.
If there is no encoding argument, it tries to use UTF-8 encoding.
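The following sketch shows the two forms of the call side by side. The file names notes.txt and legacy.txt and the <files> output are hypothetical. Since iso-8859-1 is not one of the two encodings that every implementation must support, the second call assumes a processor that recognizes it.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:template name="main">
    <!-- No encoding argument: unless external information or the media type
         determines the encoding, the processor tries UTF-8 -->
    <xsl:variable name="notes" select="unparsed-text('notes.txt')"/>

    <!-- Explicit encoding: used only if the earlier rules
         have not already determined one -->
    <xsl:variable name="legacy"
                  select="unparsed-text('legacy.txt', 'iso-8859-1')"/>

    <files>
      <file name="notes.txt" length="{string-length($notes)}"/>
      <file name="legacy.txt" length="{string-length($legacy)}"/>
    </files>
  </xsl:template>

</xsl:stylesheet>
```

Both relative URIs here are resolved against the base URI of the stylesheet.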

Various errors can occur in this process. In most cases there is a defined recovery action, so the processor has an option of treating the error as fatal or struggling on. Some processors will provide configuration options that pass this choice on to the user. If the file identified by the URI cannot be found, the fallback action is to return a zero-length string. If the file contains characters that are invalid in XML (this applies to most control characters in the range x00 to x1F under XML 1.0, but only to the null character x00 under XML 1.1) then the invalid characters are substituted by the special Unicode character xFFFD, which is specifically intended for such purposes. If the file is found, but the bytes in the file cannot be decoded into characters using the encoding chosen by following the earlier rules, this is a fatal error.
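If you would rather not depend on how a particular processor recovers from these errors, you can test before you read. The sketch below assumes a processor that implements the unparsed-text-available() function defined in the final XSLT 2.0 Recommendation; processors based on earlier drafts may not provide it. The <text> wrapper element is hypothetical.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">

  <xsl:param name="input-uri" as="xs:string"/>

  <xsl:template name="main">
    <xsl:choose>
      <!-- True only if the corresponding unparsed-text() call would succeed -->
      <xsl:when test="unparsed-text-available($input-uri, 'utf-8')">
        <text>
          <xsl:value-of select="unparsed-text($input-uri, 'utf-8')"/>
        </text>
      </xsl:when>
      <xsl:otherwise>
        <!-- Report the problem and carry on with empty content -->
        <xsl:message>Cannot read <xsl:value-of select="$input-uri"/> as UTF-8</xsl:message>
        <text/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```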

Usage and Examples

There are a number of ways this function can be used, and I will show three. These are as follows:

Up-conversion: that is, loading text that lacks markup in order to generate the XML markup
XML envelope/payload applications
HTML boilerplate generation

Up-Conversion

Up-conversion is the name often given to the process of analyzing input data for structure that is implicit in the textual content, and producing as output an XML document in which this structure is revealed by explicit markup. I have used this process, for example, to analyze HTML pages containing census data, in order to clean the data to make it suitable for adding to a structured genealogy database. It can also be used to process data that arrives in non-XML formats such as comma-separated values or EDI syntax.

The unparsed-text() function is not the only way of supplying non-XML data as input to a stylesheet; it can also be done simply by passing a string as the value of a stylesheet parameter. But the unparsed-text() function is particularly useful because the data is referenced by URI, and accessed under the control of the stylesheet.

XSLT 2.0 is much more suitable for use in up-conversion applications than XSLT 1.0. The most important tools are the <xsl:analyze-string> instruction, which enables the stylesheet to make use of structure that is implicit in the text, and the <xsl:for-each-group> instruction, which makes it much easier to analyze poorly structured markup. These can often be used in tandem: in the first stage in processing, <xsl:analyze-string> is used to recognize patterns in the text and mark these patterns using elements in a temporary tree, and in the second stage, <xsl:for-each-group> is used to turn flat markup structures into hierarchic structures that reflect the true data model.
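The two-stage approach can be sketched as follows. The input convention (a line starting with «# » is a heading, and any other non-empty line is a paragraph) and the element names <h>, <p>, <section>, and <doc> are all invented for illustration. The text is supplied as a parameter default to keep the sketch self-contained; in practice it could equally come from unparsed-text().

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Hypothetical sample text; &#10; is a newline -->
  <xsl:param name="text"
             select="'# One&#10;alpha&#10;beta&#10;# Two&#10;gamma'"/>

  <xsl:template name="main">
    <!-- Stage 1: mark each non-empty line as <h> or <p> in a temporary tree -->
    <xsl:variable name="flat">
      <xsl:for-each select="tokenize($text, '\n')[normalize-space()]">
        <xsl:analyze-string select="." regex="^# (.*)$">
          <xsl:matching-substring>
            <h><xsl:value-of select="regex-group(1)"/></h>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <p><xsl:value-of select="."/></p>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>

    <!-- Stage 2: turn the flat h/p sequence into one <section> per heading -->
    <doc>
      <xsl:for-each-group select="$flat/*" group-starting-with="h">
        <section title="{current-group()[self::h]}">
          <xsl:copy-of select="current-group()[self::p]"/>
        </section>
      </xsl:for-each-group>
    </doc>
  </xsl:template>

</xsl:stylesheet>
```

With the default parameter value this should give two sections, titled One and Two, holding two paragraphs and one paragraph respectively. Any paragraphs that appear before the first heading would form a group of their own with an empty title.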

Here is an example of a stylesheet that reads a comma-separated-values file and turns it into structured markup.

Processing a Comma-Separated-Values File

This example is a stylesheet that reads a comma-separated-values file, given the URL of the file as a stylesheet parameter. It outputs an XML representation of this file, placing the rows in a <row> element and each value in a <cell> element. It does not attempt to process a header row containing field names, but this would be a simple extension.

Input

This stylesheet does not use any source XML document. Instead, it expects the URI of an ordinary text file to be supplied as a parameter to the stylesheet.

This is what the input file names.csv looks like.

  123,"Mary Jones","IBM","USA",1997-05-14
  423,"Barbara Smith","General Motors","USA",1996-03-12
  6721,"Martin McDougall","British Airways","UK",2001-01-15
  830,"Jonathan Perkins","Springer Verlag","Germany",2000-11-17

Stylesheet

This stylesheet analyze-names.xsl uses a named template main as its entry point: a new feature in XSLT 2.0.

To run this under Saxon, you will need Saxon 7.9 or a later release. The command for running the stylesheet looks like this.

  java -jar saxon7.jar -it main analyze-names.xsl input-uri=names.csv

The -it option here indicates that processing should start without an XML source document, at the named template main.

The stylesheet first reads the input file using the unparsed-text() function, and then uses two levels of processing using <xsl:analyze-string> to identify the structure. The first level (using the regex «\n») splits the input into lines. The second level is explained more fully under the description of the regex-group() function on page 580: it extracts either the contents of a quoted string, or any value terminated by a comma, and copies this to a <cell> element.

  <?xml version="1.0"?>
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:xs="http://www.w3.org/2001/XMLSchema"
                  version="2.0">

    <xsl:param name="input-uri" as="xs:string"/>
    <xsl:output indent="yes"/>

    <xsl:template name="main">
      <xsl:variable name="in"
                    select="unparsed-text($input-uri, 'iso-8859-1')"/>
      <table>
        <!-- First level: each substring between newlines is one row -->
        <xsl:analyze-string select="$in" regex="\n">
          <xsl:non-matching-substring>
            <row>
              <!-- Second level: a quoted string (group 2), or an
                   unquoted value terminated by a comma (group 3) -->
              <xsl:analyze-string select="." regex='("([^"]*?)")|([^,]+?),'>
                <xsl:matching-substring>
                  <cell>
                    <xsl:value-of select="regex-group(2)"/>
                    <xsl:value-of select="regex-group(3)"/>
                  </cell>
                </xsl:matching-substring>
              </xsl:analyze-string>
            </row>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </table>
    </xsl:template>

  </xsl:stylesheet>

Output

The output is as follows.

  <?xml version="1.0" encoding="UTF-8"?>
  <table xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <row>
      <cell>123</cell>
      <cell>Mary Jones</cell>
      <cell>IBM</cell>
      <cell>USA</cell>
    </row>
    <row>
      <cell>423</cell>
      <cell>Barbara Smith</cell>
      <cell>General Motors</cell>
      <cell>USA</cell>
    </row>
    <row>
      <cell>6721</cell>
      <cell>Martin McDougall</cell>
      <cell>British Airways</cell>
      <cell>UK</cell>
    </row>
    <row>
      <cell>830</cell>
      <cell>Jonathan Perkins</cell>
      <cell>Springer Verlag</cell>
      <cell>Germany</cell>
    </row>
  </table>
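The header-row extension mentioned earlier might be sketched as follows. This is one possible approach rather than a definitive one: it assumes that the first line of the file holds simple unquoted field names (for example a hypothetical «id,name,company,country,joined»), and it records each name in a name attribute, which is an invented convention. The cell-matching regex is the same as in the stylesheet above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:param name="input-uri" as="xs:string"/>
  <xsl:output indent="yes"/>

  <xsl:template name="main">
    <!-- Split the file into non-empty lines -->
    <xsl:variable name="lines" as="xs:string*"
        select="tokenize(unparsed-text($input-uri, 'iso-8859-1'),
                         '\r?\n')[normalize-space()]"/>
    <!-- Assumption: the first line is a header of unquoted names -->
    <xsl:variable name="names" as="xs:string*"
        select="tokenize($lines[1], ',')"/>
    <table>
      <xsl:for-each select="$lines[position() gt 1]">
        <!-- Collect the cell values of this row as a sequence of strings -->
        <xsl:variable name="cells" as="xs:string*">
          <xsl:analyze-string select="." regex='("([^"]*?)")|([^,]+?),'>
            <xsl:matching-substring>
              <xsl:sequence
                  select="concat(regex-group(2), regex-group(3))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <row>
          <!-- Pair each value with the header name at the same position -->
          <xsl:for-each select="$cells">
            <xsl:variable name="i" as="xs:integer" select="position()"/>
            <cell name="{$names[$i]}">
              <xsl:value-of select="."/>
            </cell>
          </xsl:for-each>
        </row>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

Collecting the values into a sequence first means that position() reflects the column number; inside <xsl:matching-substring> it would instead count both matching and non-matching substrings. As in the original stylesheet, the final field of each line is not matched, because it is not followed by a comma.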

XML Envelope/Payload Applications

It is not uncommon to find structures in which one XML document is wrapped in a CDATA section inside another. For example:

  <envelope>
    <header>...</header>
    <payload>
      <![CDATA[<target-document>...</target-document>]]>
    </payload>
  </envelope>

I don't normally recommend this as a good way of designing nested structures. In general, it is usually better to nest the structure directly, without using CDATA. That is, to use:

  <envelope>
    <header>...</header>
    <payload>
      <target-document>...</target-document>
    </payload>
  </envelope>

But sometimes you don't get to design the documents yourself; and there are some advantages for the CDATA approach, such as the ability for the payload document to include a DOCTYPE declaration.

Handling such structures in XSLT is not easy: the payload document is presented as a single text node, not as a tree of element nodes. However, the unparsed-text() function makes it much easier to output such structures. All you need to do is:

  <xsl:output cdata-section-elements="payload"/>

  <xsl:template match="/">
    <envelope>
      <header>... </header>
      <payload>
        <xsl:value-of select="unparsed-text('payload.xml')"/>
      </payload>
    </envelope>
  </xsl:template>

HTML Boilerplate Generation

Generally, it is best to think of HTML in terms of a tree of element and text nodes, and to manipulate it as such in the stylesheet. Occasionally, you may need to process HTML that is not well formed, and cannot easily be converted into a well-formed structure. For example, you may be dealing with a syndicated news feed that arrives in HTML, whose format is sufficiently unpredictable that you don't want to rely on tools that automatically turn the HTML into structured XHTML. You might want to output the HTML news stories embedded in your own XSLT-generated pages.

An option in such cases is to treat the HTML as unparsed text rather than as a tree of nodes. You can read the HTML news feed using the unparsed-text() function and you can output it to the serialized result, using the disable-output-escaping option, provided your processor supports this.

  <xsl:value-of select="unparsed-text('news.html')"
                disable-output-escaping="yes"/>

Remember when you use disable-output-escaping that not all processors support the feature, and that it works only if the output of the stylesheet is serialized. You can't always tell whether the output is going to be serialized or not: for example, if you run a transformation in Internet Explorer, the output HTML is serialized and then reparsed before being displayed, but if you run the same transformation in the Netscape browser, the result tree is passed directly to the rendering engine, bypassing the serialization stage. This means that disable-output-escaping doesn't work with a client-side transformation in Netscape.
