xsl:character-map

The <xsl:character-map> element is a top-level XSLT declaration used to provide detailed control over the way individual characters are serialized. A character map is used only when the result of the transformation is serialized, and when the <xsl:output> declaration that controls the serialization references the character map.

Changes in 2.0

Character maps are a new feature in XSLT 2.0, designed as a replacement for disable-output-escaping, which is now deprecated.

Format

  <xsl:character-map
    name = qname
    use-character-maps? = qnames>
    <!-- Content: (xsl:output-character*) -->
  </xsl:character-map>

Position

<xsl:character-map> is a top-level declaration, so it must always occur as a child of the <xsl:stylesheet> element.

Attributes

Name	Value	Meaning
name (mandatory)	Lexical QName	The name of this character map
use-character-maps (optional)	Whitespace-separated list of lexical QNames	The names of other character maps to be incorporated into this character map

Content

Zero or more <xsl:output-character> elements.

Effect

The name attribute is mandatory, and defines the name of the character map. It must be a lexical QName: a name with or without a namespace prefix. If the name uses a prefix, it must refer to a namespace declaration that is in scope at this point in the stylesheet, and as usual it is the namespace URI rather than the prefix that is used when matching names. If several character maps in the stylesheet have the same name, then the one with highest import precedence is used; an error is reported if this rule does not identify a character map uniquely. Import precedence is explained on page 314.

The character map contains zero or more <xsl:output-character> elements. Each <xsl:output-character> element defines a mapping between a single Unicode character and a string that is used to represent that character in the serialized output. For example, the element:

  <xsl:output-character char="&#160;" string="&amp;nbsp;"/>

indicates that the nonbreaking space character (Unicode codepoint 160) is to be represented on output by the string «&nbsp;». This illustrates one of the possible uses of character maps, which is to render specific characters using XML or HTML entity references.
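To show where such a declaration sits, here is a minimal sketch of a complete stylesheet that declares a character map and references it from <xsl:output>. The map name nbsp-as-entity, the choice of the html output method, and the copy-everything template are assumptions made for illustration; they are not taken from the text above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map only takes effect because xsl:output names it. -->
  <xsl:output method="html" use-character-maps="nbsp-as-entity"/>

  <!-- Hypothetical map name, chosen for this sketch. -->
  <xsl:character-map name="nbsp-as-entity">
    <xsl:output-character char="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

  <!-- Copy the input through unchanged; only serialization differs. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that the string attribute is written as «&amp;nbsp;» in the stylesheet so that, after XML parsing, the replacement string is the six characters «&nbsp;».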

The use-character-maps attribute is optional. It is used to build up one character map from a number of others. If present, its value must be a whitespace-separated list of tokens each of which is a valid QName that refers to another named character map in the stylesheet. For example:

  <xsl:character-map name="NBSP">
    <xsl:output-character char="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

  <xsl:character-map name="latin-1-symbols">
    <xsl:output-character char="&#161;" string="&amp;iexcl;"/>
    <xsl:output-character char="&#162;" string="&amp;cent;"/>
    <xsl:output-character char="&#163;" string="&amp;pound;"/>
    <xsl:output-character char="&#164;" string="&amp;curren;"/>
    ...
  </xsl:character-map>

  <xsl:character-map name="latin-1-accented-letters">
    <xsl:output-character char="&#192;" string="&amp;Agrave;"/>
    <xsl:output-character char="&#193;" string="&amp;Aacute;"/>
    <xsl:output-character char="&#194;" string="&amp;Acirc;"/>
    <xsl:output-character char="&#195;" string="&amp;Atilde;"/>
    ...
  </xsl:character-map>

  <xsl:character-map name="latin-1-entities"
      use-character-maps="NBSP
                          latin-1-symbols
                          latin-1-accented-letters"/>

This example creates a composite character map called latin-1-entities that is effectively the union of three underlying character maps. The effect in this case is as if all the <xsl:output-character> elements in the three underlying character maps were actually present as children of the composite character map.

The rules for merging character maps are as follows. First, there must be no circularities (a character map must not reference itself, directly or indirectly). The expanded content of a character map can then be defined (recursively) as the concatenation of the expanded content of each of the character maps referenced in its use-character-maps attribute, in the order in which they are named, followed by the <xsl:output-character> elements that are directly contained in the <xsl:character-map> element, in the order that they appear in the stylesheet. If the expanded content of a character map contains two mappings for the same Unicode character, then the one that comes last in this sequence is the one that is used.

A character map is used during serialization when it is named in the use-character-maps attribute of an <xsl:output> declaration. This is itself a list of named character maps; these character maps are combined using the same rule: they are concatenated in the order in which they are listed, and conflicts are resolved by choosing the mapping for a character that is last in the list. It is also possible to merge the lists of character maps defined in several <xsl:output> declarations: the rules for this are given in the description of the <xsl:output> element on page 375.
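The ordering rules can be sketched with a small stylesheet in which the same character is mapped more than once. All the map names here (base, numeric-nbsp, plain-pound) are hypothetical, and the comments describe the result these rules lead me to expect rather than the output of any particular processor.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Maps are combined in the order listed, so where both define a
       mapping for the same character, plain-pound wins. -->
  <xsl:output method="xml"
      use-character-maps="numeric-nbsp plain-pound"/>

  <xsl:character-map name="base">
    <xsl:output-character char="&#160;" string="&amp;nbsp;"/>
    <xsl:output-character char="&#163;" string="&amp;pound;"/>
  </xsl:character-map>

  <!-- Expanded content: the two mappings from base, followed by the
       mapping below. The later mapping for character 160 is used. -->
  <xsl:character-map name="numeric-nbsp" use-character-maps="base">
    <xsl:output-character char="&#160;" string="&amp;#160;"/>
  </xsl:character-map>

  <!-- Listed last in xsl:output, so this mapping for character 163
       overrides the one inherited from base. -->
  <xsl:character-map name="plain-pound">
    <xsl:output-character char="&#163;" string="GBP "/>
  </xsl:character-map>

  <xsl:template match="/">
    <price>&#163;5&#160;each</price>
  </xsl:template>

</xsl:stylesheet>
```

Under these rules the text content of the price element should be serialized as «GBP 5&#160;each»: the nonbreaking space takes the mapping declared directly in numeric-nbsp, and the pound sign takes the mapping from the map listed last.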

During serialization, character mapping is applied to characters appearing in the content of text nodes and attribute nodes. It is not applied to other content (such as comments and processing instructions), nor to element and attribute names. It is not applied to characters for which disable-output-escaping has been specified, nor to characters in CDATA sections (that is, characters in the content of elements listed in the cdata-section-elements attribute of <xsl:output>). It is also not applied to characters in URI-valued attributes that are subjected to URI escaping under the rules of the HTML and XHTML output methods.
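As a rough illustration of which parts of the result are affected, the following sketch writes the same character into an attribute, a comment, and a text node. The map name trademark and the element names are invented for this example.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" use-character-maps="trademark"/>

  <!-- Hypothetical map: render the trade mark sign as plain ASCII. -->
  <xsl:character-map name="trademark">
    <xsl:output-character char="&#8482;" string="(TM)"/>
  </xsl:character-map>

  <xsl:template match="/">
    <!-- Attribute value: character mapping applies. -->
    <product label="Widget&#8482;">
      <!-- Comment node: character mapping does not apply. -->
      <xsl:comment>Widget&#8482;</xsl:comment>
      <!-- Text node: character mapping applies. -->
      <name>Widget&#8482;</name>
    </product>
  </xsl:template>

</xsl:stylesheet>
```

Following the rules above, the attribute and the text node should appear with «(TM)» in the serialized output, while the comment should keep the original trade mark character.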

If a character is included in the character map, this bypasses the normal XML/HTML escaping. For example, if the ampersand character «&» is mapped to the string «&#38;», then an ampersand appearing in the content of a text or attribute node will be output as «&#38;» (and not as «&#38;amp;»). But the character map is not applied to the output of XML/HTML escaping: a «<» character will still be output as «&lt;», not as «&#38;lt;».
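A minimal sketch of this interaction follows. It assumes the ampersand is mapped to the numeric character reference «&#38;»; the map name numeric-amp and the note element are hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" use-character-maps="numeric-amp"/>

  <!-- After XML parsing, char is the single character "&" and
       string is the five characters "&#38;". -->
  <xsl:character-map name="numeric-amp">
    <xsl:output-character char="&amp;" string="&amp;#38;"/>
  </xsl:character-map>

  <xsl:template match="/">
    <!-- The text node contains: R&D <draft -->
    <note>R&amp;D &lt;draft</note>
  </xsl:template>

</xsl:stylesheet>
```

The content of the note element should be serialized as «R&#38;D &lt;draft»: the mapped ampersand is written exactly as given, while the «<» goes through normal escaping and the ampersand in the resulting «&lt;» is left alone.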

The string that is substituted for a character using character mapping is inserted into the stream of characters produced by the serializer, and is processed along with the other characters in this stream by the final two stages of serialization, namely Unicode normalization and character encoding. Unicode normalization (if requested using the normalization-form attribute of <xsl:output>) affects the way that combining characters are represented; for example, it may cause a sequence consisting of a lowercase letter «c» followed by a combining cedilla (x0327) to be replaced by the single character «ç». Finally, character encoding (which is determined by the encoding attribute of <xsl:output>) converts logical Unicode characters into actual bytes or octets; for example, if the encoding is UTF-8 then the character «ç» will be represented by the two octets «xC3 xA7». You cannot use character maps to alter the effect of the Unicode normalization and character encoding processes.

Usage and Examples

Character maps are useful in many situations where you need precise control over the serialization of the result tree.

In general, if you are producing XML output that is to be used by another application, or if you are producing HTML output that is destined to be displayed in a browser, then the standard serialized output should be perfectly adequate. The situations where you need a finer level of control are typically:

If the output is designed to be edited by humans rather than processed by a machine. In this case you may want, for example, to control the use of entity references in the output.
If the output format is not standard HTML or XML, but some proprietary extension with its own rules. Such dialects are commonly encountered with HTML, though fortunately they are very rare in the case of XML. A similar requirement arises where the required output format is SGML.
If the application that processes the HTML or XML that you produce is buggy. You live in the real world and life isn't perfect. For example, it is rumored that some older browsers will not accept an «&» in a URL that has been escaped as «&amp;», even though the HTML standard requires the escaped form. If you encounter such bugs, you may need to work around them.
If the required output format uses what I call "double markup." By this I mean the use of XML tags in places where tags are not recognized by an XML parser, generally within CDATA sections or comments. I don't think that this is a particularly good design pattern for XML, because it is not possible to model the structure correctly as a tree using the XPath data model, but document structures such as this exist and you may be obliged to produce them. You can solve this problem using character maps by choosing two characters to map to the CDATA start and end delimiters («<![CDATA[» and «]]>») or the comment start and end delimiters («<!--» and «-->»). An example is shown below.
Finally, there are some transformations where generating the correct result tree is really difficult, or really slow. An example might be where the document structure uses interleaved markup. This is used where there are two parallel hierarchies running through the same document, for example one for the chapter/section/paragraph structure and one for the paginated layout. An expert will know when it's time to give up and cheat, which in this case means producing markup in the result document by direct intervention at the serialization stage, rather than generating the correct result tree and having the markup produced automatically by the serializer. The problem, of course, is that beginners are inclined to give up and cheat far too soon, which leads to code that is very difficult to extend and maintain.
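For the "double markup" case, the CDATA variant might be sketched as follows. The map name cdata-delimiters, the payload element, and the two private-use codepoints xE001 and xE002 are all assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" use-character-maps="cdata-delimiters"/>

  <!-- Two private-use characters stand in for the CDATA delimiters. -->
  <xsl:character-map name="cdata-delimiters">
    <xsl:output-character char="&#xE001;" string="&lt;![CDATA["/>
    <xsl:output-character char="&#xE002;" string="]]&gt;"/>
  </xsl:character-map>

  <!-- Hypothetical element: its children are copied as ordinary
       nodes, so their tags appear as markup between the delimiters. -->
  <xsl:template match="payload">
    <payload>
      <xsl:text>&#xE001;</xsl:text>
      <xsl:copy-of select="node()"/>
      <xsl:text>&#xE002;</xsl:text>
    </payload>
  </xsl:template>

  <!-- Everything else is copied unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

This sketch does nothing to guard against the sequence «]]>» occurring inside the copied content, which would end the CDATA section early.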

Character Maps versus disable-output-escaping

The mechanism provided in XSLT 1.0 to handle these requirements was the disable-output-escaping attribute of the <xsl:text> and <xsl:value-of> elements. This was always an optional feature. XSLT processors were not obliged to implement it, and of course it would have no effect unless serialization was invoked. In XSLT 2.0, disable-output-escaping has become deprecated, so it's rather more likely that processors will be encountered that don't support the feature.

Character maps are less powerful than disable-output-escaping, because you can't switch them on and off for different parts of the result tree. But this is also their strength. The problem with disable-output-escaping is that it requires some extra information to pass between the transformation engine and the serializer, in addition to the information that's defined in the data model. (As evidence for this, look at the clumsy way that disable-output-escaping requests are encoded in a SAXResult stream in the Java JAXP interface.) This information is generally lost if you want to pass the result tree to another application before serializing it. The problem gets worse in XSLT 2.0, which allows temporary trees and parentless text nodes to be created and processed within the course of a transformation. One of the difficulties in designing this feature was whether a request to disable output escaping should be meaningful when the data being written was not being passed straight to the serializer, but was being written to a temporary tree or a parentless text node.

Most of the things that can be done with disable-output-escaping, including the bad things, can also be done with character maps. The big advantage of character maps is that they don't distort the data model, which means that they don't impact your ability to use a stylesheet-based transformation as a component in an application with clean interfaces to other components.
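To make the contrast concrete, here is a sketch of one task, emitting an unescaped «<br>» tag, done with a character map, with the XSLT 1.0 form shown in a comment for comparison. The line-break element, the map name raw-br, and the private-use codepoint xE000 are hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" use-character-maps="raw-br"/>

  <!-- The result tree holds an ordinary character; only the
       serializer turns it into markup. -->
  <xsl:character-map name="raw-br">
    <xsl:output-character char="&#xE000;" string="&lt;br&gt;"/>
  </xsl:character-map>

  <!-- XSLT 1.0 form of the same template body, where the request
       has to travel with the text node:
       <xsl:text disable-output-escaping="yes">&lt;br&gt;</xsl:text>
  -->
  <xsl:template match="line-break">
    <xsl:text>&#xE000;</xsl:text>
  </xsl:template>

  <!-- Everything else is copied unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

In the character map version, the text node can be written to a temporary tree or passed to another component without losing anything, because it is just a character in the data model.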

Choosing Characters to Map

Applications for character maps probably fall into two categories: those where you want to choose a nonstandard string representation of a character that occurs naturally in the data, and those where you want to choose some special character to trigger some special effect in the output.

An example in the first category would be the example shown earlier:

  <xsl:character-map name="NBSP">
    <xsl:output-character char="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

This forces the nonbreaking space character to be output as an entity reference. If the document is to be edited, many people will find the entity reference easier to manipulate because it shows up as a visible character, whereas the nonbreaking space character itself appears on the screen just like an ordinary space.

An example in the second category would be choosing two characters to represent the start and end of a comment. Suppose that the requirement is to transform an input document by "commenting out" any element that has the attribute «delete="yes"». By commenting out, I mean outputting something like:

  <!--
  <para delete="yes">
  This paragraph has been deleted
  </para>
  -->

This is tricky, because the result cannot be modeled naturally as a result tree: comment nodes cannot have element nodes as children. So we'll choose instead to output the <para> element to the result tree unchanged, but preceded and followed by special characters, which we will map during serialization to comment start and end delimiters.

The best characters to choose for such purposes are the characters in the Unicode Private Use Area, for example the characters from xE000 to xF8FF. These characters have no defined meaning in Unicode, and are intended to be used for communications where there is a private agreement between the sender and the recipient as to what they mean. In this case, the sender is the stylesheet and the recipient is the serializer.

If you assign private use characters in information that is passed between applications, especially applications owned by different organizations, you should make sure that your use of the characters is well documented.

Here is a stylesheet that performs the required transformation:

Using a character-map to comment-out Elements

This example copies the input unchanged to the output, except that any element in the input that has the attribute «delete="yes"» is output within a comment.

Stylesheet

The stylesheet is comment-out.xsl:

  <?xml version="1.0"?>
  <!DOCTYPE xsl:stylesheet [
    <!ENTITY start-comment "&#xE501;">
    <!ENTITY end-comment "&#xE502;">
  ]>
  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output use-character-maps="comment-delimiters"/>

    <!-- Map the two private-use characters to comment delimiters -->
    <xsl:character-map name="comment-delimiters">
      <xsl:output-character char="&start-comment;" string="&lt;!--"/>
      <xsl:output-character char="&end-comment;" string="--&gt;"/>
    </xsl:character-map>

    <!-- Copy every element, with its attributes, unchanged -->
    <xsl:template match="*">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:template>

    <!-- Surround deleted elements with the special characters -->
    <xsl:template match="*[@delete='yes']">
      <xsl:text>&start-comment;</xsl:text>
      <xsl:copy-of select="."/>
      <xsl:text>&end-comment;</xsl:text>
    </xsl:template>

  </xsl:stylesheet>

Source

One of the paragraphs in the source file resume.xml is:

  <p delete="yes">Aidan is also in demand as a consort singer, performing
  with groups including the Oxford Camerata and the Sarum Consort, with
  whom he has made several acclaimed recordings on the ASV label of
  motets by Bach and Peter Philips sung by solo voices.</p>

Output

When the stylesheet is applied to the source file resume.xml, the above paragraph appears as:

  <!--<p delete="yes">Aidan is also in demand as a consort singer,
  performing with groups including the Oxford Camerata and the Sarum
  Consort, with whom he has made several acclaimed recordings on the ASV
  label of motets by Bach and Peter Philips sung by solo voices.</p>-->

Limitations

A character map applies to a whole result document; you cannot switch character mapping on and off at will.

The character map must be fixed at compile time. You cannot compute the output string at runtime, and there is no way the process can be parameterized. (You can, however, substitute a different character map by having different definitions of the same character map in different stylesheet modules, and deciding which one to import using <xsl:import>.)
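One way to arrange such a substitution is sketched below as two stylesheet modules shown one after the other. The file names main.xsl and numeric.xsl and the map name special are hypothetical. The sketch relies on the rule, stated earlier, that when several character maps share a name the one with the highest import precedence is used, and on the fact that declarations in an importing module take precedence over those in the module it imports.

```xml
<!-- main.xsl (hypothetical file name): the default definition -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" use-character-maps="special"/>

  <xsl:character-map name="special">
    <xsl:output-character char="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

<!-- numeric.xsl (hypothetical file name): run this module instead of
     main.xsl to get the alternative mapping. Its definition of
     "special" has higher import precedence than the imported one. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:import href="main.xsl"/>

  <xsl:character-map name="special">
    <xsl:output-character char="&#160;" string="&amp;#160;"/>
  </xsl:character-map>

</xsl:stylesheet>
```

The choice between the two mappings is still made before the transformation starts, by choosing which module to run; nothing here is computed at runtime.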

Character mapping may impose a performance penalty, especially if a large number of characters are mapped.

Character mapping has no effect unless the result of the transformation is actually serialized. If the result tree is passed straight to another application that doesn't understand the special characters, it is unlikely to have the desired effect.

Character mapping only affects the content of text and attribute nodes. It doesn't affect characters in element and attribute names, or markup characters such as the quotes around an attribute value.

The character to be mapped, and all the characters in the replacement string, must be valid XML characters. This is because there is no way of representing invalid characters in the <xsl:output-character> element in the stylesheet. This means that character maps cannot be used to generate text files containing characters not allowed in XML, such as the NUL character (x00).