Recipe7.1.Dealing with Whitespace


Recipe 7.1. Dealing with Whitespace

Problem

You need to convert XML into formatted text, but whitespace issues are ruining the results.

Solution

Consider the following annotated XML sample. The symbols (newline), (tab), and (space) mark whitespace-only text nodes that are often overlooked but subject to being copied to the output:

Too much whitespace

  1. Use xsl:strip-space to get rid of whitespace-only nodes.

    This top-level element with a single attribute, elements, is assigned a whitespace-separated list of element names that you want stripped of extra whitespace. Here, extra whitespace means whitespace-only text nodes. This means, for example, that the whitespace separating words in the previous comment element are significant because they are not whitespace only. On the other hand, the whitespace designated by the special symbols are whitespace only.

    A common idiom uses <xsl:strip-space elements="*"/> to strip whitespace by default and xsl:preserve-space (see later) to override specific elements. In XSLT 2.0, you are also allowed to have elements="*:Name", which tells the processor to strip whitespace in all elements with the given local name regardless of namespace.

  2. Use normalize-space to get rid of extra whitespace.

    A common mistake is to assume that xsl:strip-space takes care of "extra" whitespace like that used to align text in the previous comment element. This is not the case. The parser always considers significant whitespace inside an element's text that is mixed with nonwhitespace. To remove this extra space, use normalize-space, as in <xsl:value-of select="normalize-space(comment)"/>.

  3. Use translate to get rid of all whitespace.

    Another common mistake is to assume normalize-space strips all whitespace. This is not the case. Instead, it strips only leading and trailing whitespace and converts multiple internal whitespace characters to single spaces. If you need to strip all whitespace, use TRanslate(something,'&#x20;&#xa;&#xd; &#x9;', '').

  4. Use an empty xsl:text element to prevent terminating whitespace in the stylesheet from being considered relevant.

    xsl:text is normally considered a way to preserve whitespace. However, a strategically placed empty xsl:text element can prevent trailing whitespace in the stylesheet from being interpreted as significant.

Consider the results of the two modes in the following document and stylesheet, shown in Example 7-1 to Example 7-3.

Example 7-1. Input
<numbers>   <number>10</number>   <number>3.5</number>   <number>4.44</number>   <number>77.7777</number> </numbers>

Example 7-2. Processing numbers with and without an empty xsl:text element
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">     <xsl:output method="text"/> <xsl:strip-space elements="*"/>     <xsl:template match="numbers"> Without empty text element: <xsl:apply-templates mode="without"/> With empty text element: <xsl:apply-templates mode="with"/> </xsl:template>        <xsl:template match="number" mode="without">   <xsl:value-of select="."/>, </xsl:template>     <xsl:template match="number" mode="with">   <xsl:value-of select="."/>,<xsl:text/> </xsl:template>     </xsl:stylesheet>

Example 7-3. Output
Without empty text element: 10, 3.5, 4.44, 77.7777,     With empty text element: 10,3.5,4.44,77.7777,

Note that there is nothing magical about xsl:text when it is used this way. It works just as well if you replace <xsl:text/> with <xsl:if test="0"/> (but don't do so unless you enjoy confusing others). The effect is the placement of an element node between the comma and the trailing newline, which creates a whitespace-only node that will be ignored. Of course, some find this confusing regardless, so you can also write <xsl:text>,</xsl:text> if you prefer.

Too little whitespace

  1. Use xsl:preserve-space to override xsl:strip-space for specific elements.

    There is not much point in using xsl:preserve-space unless you also use xsl:strip-space. This is because the default behavior preserves space in the input document and documents loaded with the document( ) function.

    Remember, MSXML strips whitespace-only text nodes by default. In this case, you can use xsl:preserve-space to counteract this nonconformance.


  2. Use xsl:text to precisely specify text-output spacing.

    All whitespace inside an xsl:text element is preserved. This preservation allows precise control over whitespace placement. Sometimes you can use xsl:text to simply introduce line breaks:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">     <xsl:output method="text"/> <xsl:strip-space elements="*"/>     <xsl:template match="number">   <xsl:value-of select="."/>   <xsl:text>&#xa;</xsl:text> </xsl:template> </xsl:stylesheet>

  3. However, the problem with outputting newline characters directly is that some platforms (e.g., Microsoft's) expect a line break to be represented as carriage- return plus newline. However, since XML parsers are required to convert carriage-return plus newline into a single newline, there is no way to create a platform-independent stylesheet. Fortunately, most Windows-based editors and the Windows command prompt handle single newlines correctly. The one exception is the notepad editor that comes free with Windows.

  4. Use nonbreaking space characters.

    XSLT does not treat the character #xA0; (nonbreaking space) as normal whitespace. In particular, xsl:strip-space and normalize-space( ) both ignore this character. If you need to strip whitespace most of the time but have specific instances when it should remain in place, you might try to use this character in the XML input. Nonbreaking space is particularly useful for HTML output, but may be of lesser value in other contexts (depending on how the renderer handles it).

Discussion

The solution section lists techniques for managing whitespace. However, knowing the XSLT rules that underlie the techniques is also useful.

The most important rules to know apply to both the stylesheet and input document(s):

  1. A text node is never stripped unless it contains only whitespace characters (#x20, #x9, #xD, or #xA).

    Although they are not all that common, you should also understand the effect of xml:space attributes in both the stylesheet and the input document(s).

  2. If a text-node's ancestor element has an xml:space attribute with a value of preserve, and no closer ancestor element has xml:space with a value of default, then whitespace-only text nodes are not stripped.

    The chapter now looks at the rules for stylesheets and source documents separately. For stylesheets, your options are simple.

  3. The only stylesheet elements for which whitespace-only nodes are preserved by default are xsl:text. Here, "by default" means unless otherwise specified using xml:space="preserve" as stated earlier in Step 2. See Example 7-4 and Example 7-5.

Example 7-4. Stylesheet demonstrating the effect of xsl:text and xml:space=preserve
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">     <xsl:output method="text"/>     <xsl:strip-space elements="*"/>     <xsl:template match="numbers"> Without xml:space="preserve": <xsl:apply-templates mode="without-preserve"/> With xml:space="preserve": <xsl:apply-templates mode="with-preserve"/> </xsl:template>     <xsl:template match="number" mode="without-preserve">   <xsl:value-of select="."/><xsl:text> </xsl:text> </xsl:template>     <xsl:template match="number" mode="with-preserve" xml:space="preserve">   <xsl:value-of select="."/><xsl:text> </xsl:text> </xsl:template>     </xsl:stylesheet>

Example 7-5. Output
Without xml:space="preserve": 10 3.5 4.44 77.7777 With xml:space="preserve":       10       3.5       4.44       77.7777

The only whitespace introduced by the first number match is the single space contained in the xsl:text element. However, when you use xml:space="preserve" in the second number match template, you pick up all the whitespace contained in the element including the two line breaks (the first is after the <xsl:template ...> and the second is after the </xsl:text>).

For source documents, the rules are as follows:

  • Initially, the list of elements in which whitespace is preserved includes all elements in the document.

  • If an element matches a NameTest in an xsl:strip-space element, then it is removed from the list of whitespace-preserving element names.

  • If an element name matches a NameTest in an xsl:preserve-space element, then it is added to the list of whitespace-preserving element names.

A NameTest is either a simple name (e.g., doc) or a name with a namespace prefix (e.g., my:doc), wildcard (e.g., *), or a wildcard with a namespace prefix (e.g., my:*). In XSLT 2.0 *:doc are also allowed. The default priority and import precedence rules apply when conflicts exist between xml:strip-space and xml:preserve-space:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"  xmlns:my="http://www.ora.com/XSLTCookbook/ns/my">     <!-- Strip whitespace in all elements --> <xsl:strip-space="*"/>     <!-- except those in the "my" namespace --> <xsl:preserve-space="my:*"/>     <!-- and those named foo --> <xsl:preserve-space="foo"/>

See Also

One of the most complete discussions of whitespace handling is in Michael Kay's XSLT Programmer's Reference (Wrox, 2001). The book also includes good coverage of the import precedence rules.




XSLT Cookbook
XSLT Cookbook: Solutions and Examples for XML and XSLT Developers, 2nd Edition
ISBN: 0596009747
EAN: 2147483647
Year: 2003
Pages: 208
Authors: Sal Mangano

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net