Whitespace


Whitespace handling can be a considerable source of confusion. When the output of a stylesheet is HTML, you can get away without worrying too much about it, because except in some very specific contexts HTML generally treats any sequence of spaces and newlines in the same way as a single space. But with other output formats, getting spaces and newlines where you want them, and avoiding them where you don't, can be crucial.

There are two issues:

  • Controlling which whitespace in the source document is significant, and therefore visible to the stylesheet.

  • Controlling which whitespace in the stylesheet is significant, because significant whitespace in the stylesheet is likely to get copied to the output.

Whitespace is defined as any sequence of the following four characters:

  Character          Unicode code point
  Tab                #x9
  Newline            #xA
  Carriage return    #xD
  Space              #x20

The definition in XSLT is exactly the same as in XML itself. Other characters such as non-breaking space (#xA0), which is familiar to HTML authors as the entity reference &nbsp;, may use just as little black ink as these four, but they are not included in the definition.

There are some additional complications about the definition. Writing a character reference &#x20; is in many ways exactly the same as hitting the space bar on the keyboard, but in some circumstances it behaves differently. The character reference &#x20; will be treated as whitespace by the XSLT processor, but not by the XML parser, so you need to understand which rules are applied at which stage of processing.

The XML standard makes some attempt to distinguish between significant and insignificant whitespace. Whitespace in elements with element-only content is considered insignificant, whereas whitespace in elements that allow #PCDATA content is significant. However, the distinction depends on whether a validating parser is used or not, and in any case, the standard requires both kinds of whitespace to be notified to the application. The writers of the XSLT specification decided that the handling of whitespace should not depend on anything in the DTD or schema, and should not depend on whether a validating or nonvalidating parser was used. Instead the handling of whitespace is controlled entirely from the source document (using the xml:space attribute) or from the stylesheet (using the <xsl:strip-space> and <xsl:preserve-space> declarations), which are fully described in Chapter 5.

The first stages in whitespace handling are the job of the XML parser, and are done long before the XSLT processor gets to see the data. Remember that these apply both to source documents and to stylesheets:

  • End-of-line appearing in the textual content of an element is always normalized to a single newline (#xA) character. This eliminates the differences between line endings on Unix, Windows, and Macintosh systems. XML 1.1 introduces additional rules to normalize the line endings found on IBM mainframes.

  • The XML parser will normalize attribute values. A tab or newline will always be replaced by a single space, unless it is written as a character reference such as &#9; or &#xA;. For some types of attribute (anything except type CDATA), a validating XML parser will also remove leading and trailing whitespace, and normalize other sequences of whitespace to a single space character.

This attribute normalization can be significant when the attribute in question is an XPath expression in the stylesheet. For example, suppose you want to test whether a string value contains a newline character. You can write this as follows:

  <xsl:if test="contains(address, '&#xA;')">  

It's important to use the character reference &#xA; here, rather than a real newline, because a newline character would be converted to a space by the XML parser, and the expression would then actually test whether the supplied string contains a space.

What this means in practice is that if you want to be specific about whitespace characters, write them as character references; if you just want to use them as separators and padding, use the whitespace characters directly.

Important  

The XSLT specification assumes that the XML parser will hand over all whitespace text nodes to the XSLT processor. However, the input to the XSLT processor is technically a tree, and the XSLT specification claims no control over how this tree is built. If you use Microsoft's MSXML3, the tree is supplied in the form of a DOM, and the default option when building a DOM in MSXML3 is to remove whitespace text nodes. If you want the parser to behave the way that the XSLT specification expects, you must set the preserveWhitespace property on the Document object to true before you load the document.

Once the XML parser has done its work, further manipulation of whitespace may be done by the schema processor. This is more likely to affect source documents than stylesheets, since there is little point in putting a stylesheet through a schema processor. For each simple data type, XML Schema defines whitespace handling as one of three options:

  • Preserve: All whitespace characters in the value are preserved. This option is used for the data type xs:string.

  • Replace: Each newline, carriage return, and tab character is replaced by a single-space character. This option is used for the data type xs:normalizedString and types derived from it.

  • Collapse: Leading and trailing whitespace is removed, and any internal sequence of whitespace characters is replaced by a single space. This option is used for all other data types (including those where internal whitespace is not actually allowed).
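
As a rough illustration of the three options, the following schema fragment declares one element for each. The element names verbatim and caption and the type name collapsedText are invented for this sketch; only the built-in types and the xs:whiteSpace facet come from XML Schema itself.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- preserve: xs:string keeps every whitespace character -->
  <xs:element name="verbatim" type="xs:string"/>

  <!-- replace: each tab, newline, and carriage return becomes a space -->
  <xs:element name="caption" type="xs:normalizedString"/>

  <!-- collapse: xs:token also trims the ends and squeezes internal runs -->
  <xs:element name="abstract" type="xs:token"/>

  <!-- A user-defined type (hypothetical name) can select an option
       explicitly with the xs:whiteSpace facet -->
  <xs:simpleType name="collapsedText">
    <xs:restriction base="xs:string">
      <xs:whiteSpace value="collapse"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```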

When source documents are processed using a schema, the rules for the XPath 2.0 data model say that for attributes, and for elements with simple content (that is, elements that can't have child elements), the typed value of the element or attribute is the value after whitespace normalization has been done according to the XML Schema rules for the particular data type. In the current draft (November 2003) it is not entirely clear whether the string value of an element or attribute is the value before or after the schema whitespace rules are applied: this will probably be clarified in later versions of the specification. However, the only thing that depends on the string value of a node is the string() function itself: everything else uses the typed value.

Finally, the XSLT processor then applies some processing of its own. By this time entity and character references have been expanded, so there is no difference between a space written as a space and one written as &#x20;:

  • Adjacent text nodes are merged into a single text node (normalized in the terminology of the DOM).

  • Then, if a text node consists entirely of whitespace, it is removed (or stripped) from the tree if the containing element is listed in an <xsl:strip-space> definition in the stylesheet. The detailed rules are more complex than this, and also take into account the presence of the xml:space attribute in the source document: see the <xsl:text> element on page 459, in Chapter 5 for details.

This process never removes whitespace characters that are adjacent to non-whitespace characters. For example, consider the following.

  <article>
    <title>Abelard and Heloise</title>
    <subtitle>Some notes towards a family tree</subtitle>
    <author>Brenda M Cook</author>
    <abstract>
  The story of Abelard and Heloise is best recalled nowadays from the
  stage drama of 1970 and it is perhaps inevitable that Diana Rigg stripping
  off for Keith Mitchell should be the most enduring image of this historic
  couple in some people's minds.
    </abstract>
  </article>

Our textual analysis will focus entirely on the whitespace; the actual content of the piece is best ignored.

There are five whitespace-only text nodes in this fragment, one before each of the child elements <title>, <subtitle>, <author>, and <abstract>, and another between the end of the <abstract> and the end of the <article>. The whitespace in these nodes is passed by the XML parser to the XSLT processor, and it is up to the stylesheet whether to take any notice of it or not. Typically, in this situation this whitespace is of no interest and it can be stripped from the tree by specifying <xsl:strip-space elements="article"/>.

The whitespace within the <abstract> cannot be removed by the same process. The newline characters at the start and end of the abstract, and at the end of each line, are part of the text passed by the parser to the application, and it is not possible in the stylesheet to declare them as being irrelevant. If the <abstract> element is defined in the schema as being of type xs:token (or a type derived from this) then the schema processor will remove the leading and trailing whitespace characters, and convert the newlines into single spaces. But if it is of type xs:string, or if no schema processing is done, then all the spaces and newlines will be present in the tree model of the source document. What you can do is to call the normalize-space() function when processing these nodes on the source tree, which will have the same effect as schema processing for a type that specifies the collapse option (that is, it will remove leading and trailing whitespace and replace all other sequences of one or more whitespace characters by a single space). The normalize-space() function is described in Chapter 10 of XPath 2.0 Programmer's Reference.

The processing done by a schema processor for data of type xs:normalizedString is to replace each newline, tab, and carriage return by a single space character. This is not the same as the processing done by the normalize-space() function in XPath. The term normalization, unfortunately, does not have a standard meaning.
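
A minimal sketch of the normalize-space() approach, assuming the abstract is to be written out as an HTML paragraph (the <p> wrapper is my assumption, not something the source document dictates):

```xml
<!-- Fragment: belongs inside an xsl:stylesheet element -->
<xsl:template match="abstract">
  <p>
    <!-- Trim both ends and collapse each internal run of
         whitespace, newlines included, to a single space -->
    <xsl:value-of select="normalize-space(.)"/>
  </p>
</xsl:template>
```

The whitespace around <xsl:value-of> inside <p> sits in whitespace-only text nodes of the stylesheet, so it is stripped and does not reach the output.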

So we can see that XSLT makes a very firm distinction between text nodes that comprise whitespace only, and those that hold something other than whitespace. A whitespace text node can exist only where there is nothing between two pieces of markup other than whitespace characters.

To take another example, consider the following document.

  <person>
    <name>Prudence Flowers</name>
    <employer>Lloyds Bank</employer>
    <place-of-work>
      71 Lombard Street
      London, UK
      <zip>EC3P 3BS</zip>
    </place-of-work>
  </person>

Where are the whitespace nodes? Let's look at it again, this time making the whitespace characters visible (→ for a tab, ¶ for a newline, and ♦ for a space).

  <person>¶
  →<name>Prudence♦Flowers</name>¶
  →<employer>Lloyds♦Bank</employer>¶
  →<place-of-work>¶
  →→71♦Lombard♦Street¶
  →→London,♦UK¶
  →→<zip>EC3P♦3BS</zip>¶
  →</place-of-work>¶
  </person>

The newline and tab between <person> and <name> are not adjacent to any non-whitespace characters, so they constitute a whitespace node. So do the characters between </name> and <employer>, and between </employer> and <place-of-work>. However, most of the whitespace characters within the <place-of-work> element are in the same text node as non-whitespace characters, so they do not constitute a whitespace node. To make it even clearer, let's mark only the whitespace characters in whitespace nodes (→ for a tab, ¶ for a newline), and show the others as ordinary spaces.

  <person>¶
  →<name>Prudence Flowers</name>¶
  →<employer>Lloyds Bank</employer>¶
  →<place-of-work>
      71 Lombard Street
      London, UK
      <zip>EC3P 3BS</zip>¶
  →</place-of-work>¶
  </person>

Why is all this relevant? As we've seen, the <xsl:strip-space> element allows you to control what happens to whitespace nodes (those highlighted in the immediately preceding example), but it doesn't let you do anything special with whitespace characters that appear in ordinary text nodes (those shown as ordinary spaces).

All the whitespace nodes in this example, except the one between </zip> and </place-of-work>, are immediate children of the <person> element, so they could be stripped by writing:

  <xsl:strip-space elements="person"/>  

Whitespace nodes are retained on the source tree unless you ask for them to be stripped, either by using <xsl:strip-space> , or by using some option provided by the XML parser or schema processor during the building of the tree.

Whitespace Nodes in the Stylesheet

For the stylesheet itself, whitespace nodes are all stripped, with two exceptions, namely whitespace within an <xsl:text> element, and whitespace controlled by the attribute xml:space="preserve". If you explicitly want to copy a whitespace text node from the stylesheet to the result tree, write it within an <xsl:text> element, like this.

  <xsl:value-of select="address-line[1]"/>
  <xsl:text>&#xA;</xsl:text>
  <xsl:value-of select="address-line[2]"/>

The only reason for using &#xA; here rather than an actual newline is that it's more clearly visible to the reader; it's also less likely to be accidentally turned into a newline followed by tabs or spaces. Writing the whitespace as a character reference doesn't stop it being treated as whitespace by XSLT, because the character references will have been expanded by the XML parser before the XSLT processor gets to see them.

Another way of coding the previous fragment in XSLT 2.0 is to write:

  <xsl:value-of select="address-line[position() = 1 to 2]"
                separator="&#xA;"/>

You can also cause whitespace text nodes in the stylesheet to be retained by using the option xml:space="preserve". Although this is defined in the XML specification, its defined effect is to advise the application that whitespace is significant, and XSLT (which is the application in this case) will respect this. In XSLT 1.0 this sometimes caused problems because certain elements, such as <xsl:choose> and <xsl:apply-templates>, do not allow text nodes as children, even whitespace-only text nodes. Many processors, however, were forgiving on this. XSLT 2.0 has clarified that in situations where text nodes are not allowed, a whitespace-only text node is now stripped, despite the xml:space attribute. (However, an element that must always be empty, such as <xsl:output>, must be completely empty: whitespace-only text nodes are not allowed within these elements.)

Despite this clarification of the rules, I wouldn't normally recommend using the xml:space attribute in a stylesheet, but if there are large chunks of existing XML that you want to copy into the stylesheet verbatim, the technique can be useful.

The Effect of Stripping Whitespace Nodes

There are two main effects of stripping whitespace nodes, as done in the <person> element in the earlier example:

  • When you use <xsl:apply-templates/> to process all the children of the <person> element, the whitespace nodes aren't there, so they don't get selected, which means they don't get copied to the result tree. If they had been left on the source tree, then by default they would be copied to the result tree.

  • When you use <xsl:number> or the position() or count() functions to count nodes, the whitespace nodes aren't there, so they aren't counted. If you had left the whitespace nodes on the tree, then the <name>, <employer>, and <place-of-work> elements would be nodes 2, 4, and 6 instead of 1, 2, and 3.
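
The second effect is easy to check for yourself. The following stylesheet is a sketch that assumes the <person> document shown earlier as its input, and a tree builder that passes whitespace text nodes through to the XSLT processor. As written, it should report name, employer, and place-of-work as nodes 1, 2, and 3; delete the <xsl:strip-space> declaration and the numbers become 2, 4, and 6.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Remove the whitespace-only children of <person> -->
  <xsl:strip-space elements="person"/>

  <!-- Process every child node of <person>, text nodes included -->
  <xsl:template match="person">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Report each child element with its position among the
       nodes selected by the apply-templates call above -->
  <xsl:template match="person/*">
    <xsl:value-of select="concat(name(), ' is node ', position(), '&#xA;')"/>
  </xsl:template>

  <!-- Any surviving text nodes are still counted by position(),
       but produce no output of their own -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```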

There are cases where it's important to keep the whitespace nodes. Consider the following.

  <para>
  Edited by <name>James Clark</name>♦<email>jjc@jclark.com</email>
  </para>

The diamond represents a space character that needs to be preserved, but because it is not adjacent to any other text, it would be eligible for stripping. In fact, whitespace is nearly always significant in elements that have mixed content (that is, elements that have both element and text nodes as children).

If you want to strip all the whitespace nodes from the source tree, you can write:

  <xsl:strip-space elements="*"/>  

If you want to strip all the whitespace nodes except those within certain named elements, you can write:

  <xsl:strip-space elements="*"/>
  <xsl:preserve-space elements="para h1 h2 h3 h4"/>

If any elements in the document (either the source document or the stylesheet) use the XML-defined attribute xml:space="preserve", this takes precedence over these rules: whitespace nodes in that element, and in all its descendants, will be kept on the tree unless the attribute is cancelled on a descendant element by specifying xml:space="default". This allows you to control on a per-instance basis whether whitespace is kept, whereas <xsl:strip-space> controls it at the element-type level.
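
As an illustration, here is a sketch of a source document that uses both values of the attribute. The <article> and <para> elements echo the earlier examples; <references> and <ref> are names invented for this sketch.

```xml
<article xml:space="preserve">
  <!-- Whitespace nodes here are kept, even under
       <xsl:strip-space elements="*"/>, including the single
       space between </name> and <email> -->
  <para>Edited by <name>James Clark</name> <email>jjc@jclark.com</email></para>
  <!-- xml:space="default" cancels the inherited setting, so the
       whitespace nodes inside <references> may be stripped again -->
  <references xml:space="default">
    <ref>XML 1.0</ref>
    <ref>XSLT 2.0</ref>
  </references>
</article>
```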

Solving Whitespace Problems

There are two typical problems with whitespace in the output: too much of it, or too little.

If you are generating HTML, a bit of extra whitespace usually doesn't matter, though there are some places where it can slightly distort the layout of your page. With some text formats, however (a classic example is comma-separated values) you need to be very careful to output whitespace in exactly the right places.

Too Much Whitespace

If you are getting too much whitespace, there are three possible places it can be coming from:

  • The source document

  • The stylesheet

  • Output indentation

First ensure that you set indent="no" on the <xsl:output> element, to eliminate the last of these possibilities.

If the output whitespace is adjacent to text, then it probably comes from the same place as that text.

  • If this text comes from the stylesheet, use <xsl:text> to control more precisely what is output. For example, the following code outputs a comma between the items in a list, but it also outputs a newline after the comma, because the newline is part of the same text node as the comma:

      <xsl:for-each select="item">
        <xsl:value-of select="."/>,
      </xsl:for-each>

    If you want the comma but not the newline, change this so that the newline is in a text node of its own, and is therefore stripped.

      <xsl:for-each select="item">
        <xsl:value-of select="."/>,<xsl:text/>
      </xsl:for-each>

  • If the text comes from the source document, use normalize-space() to trim leading and trailing spaces from the text before outputting it.

If the offending whitespace is between tags in the output, then it probably comes from whitespace nodes in the source tree that have not been stripped, and the remedy is to add an <xsl:strip-space> element to the stylesheet.
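
Putting these remedies together for the comma-separated-values case mentioned earlier, a sketch might look like this. The <items> wrapper element is an assumption made for the example; <item> is the element used in the fragments above.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Text output, so no indentation is added by the serializer -->
  <xsl:output method="text"/>

  <!-- Remove whitespace-only nodes between elements in the source -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="items">
    <xsl:for-each select="item">
      <!-- Trim whitespace that shares a text node with the data -->
      <xsl:value-of select="normalize-space(.)"/>
      <!-- The comma is alone in its text node, so no newline follows it -->
      <xsl:if test="position() != last()">,</xsl:if>
    </xsl:for-each>
    <!-- End the record with exactly one explicit newline -->
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Testing position() against last() is one way to avoid a trailing comma; in XSLT 2.0 the separator attribute of <xsl:value-of> shown earlier in this section does the same job more compactly.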

Too Little Whitespace

If you want whitespace in the output and aren't getting it, use <xsl:text> to generate it at the appropriate point. For example, the following code will output the lines of a poem in HTML, with each line of the poem being shown on a new line.

  <xsl:for-each select="line">
    <xsl:value-of select="."/><br/>
  </xsl:for-each>

This will display perfectly correctly in the browser, but if you want to view the HTML in a text editor, it will be difficult because everything goes on a single line. It would be useful to start a new line after each <br> element; you can do this as follows.

  <xsl:for-each select="line">
    <xsl:value-of select="."/><br/><xsl:text>&#xa;</xsl:text>
  </xsl:for-each>

Another trick I have used to achieve this is to exploit the fact that the non-breaking-space character (#xa0), although invisible, is not classified as whitespace. So you can achieve the required effect by writing:

  <xsl:for-each select="line">
    <xsl:value-of select="."/><br/>&#xa0;
  </xsl:for-each>

This works because the newline after the &#xa0; is now part of a non-whitespace node.




XSLT 2.0 Programmer's Reference