3.2 Outputting XML

With the XML output method, whether declared explicitly or selected by default, a compliant XSLT processor produces well-formed XML as output. As you already know, well-formed XML follows the syntax rules outlined in the XML specification: rules such as matching start and end tags, matching quotes around attribute values, proper nesting of elements, and so forth. For example, if you create XML as you did in Chapter 2, the processor will make sure that the XML is well-formed. If it is not, the XSLT processor will report the errors.

The output element helps you to control a number of features relating to XML output, including the XML declaration, document type declarations, and CDATA sections, all of which are discussed in the sections that follow.

3.2.1 The XML Declaration

As explained in Chapter 1, the XML declaration is optional. You don't have to use it, except under certain circumstances, such as when an encoding declaration is imperative. XSLT allows you to have control over the XML declaration with the output element. With output, you can keep XML declarations from being written to output, change version information, control the encoding declaration, and set the standalone declaration. I'll cover all of these features step-by-step in the sections that follow.

3.2.1.1 Omitting the XML declaration

Most XSLT processors automatically write an XML declaration at the top of the result. If the XML declaration is not essential to your output, you can turn this behavior off by giving output's omit-xml-declaration attribute a value of yes; by default, the value is no when the attribute is not present. The omit-xml-declaration attribute is used in omit.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
 </name>
</xsl:template>

</xsl:stylesheet>

This stylesheet uses two output elements. You could merge them into one output element if you wish. The only reason I use two output elements in this example is because it makes a cleaner line break this way!

When applied to name.xml using:

xalan name.xml omit.xsl

the XML declaration is dropped, as you can see in the output:

<name>
<family>Churchill</family>
<given>Winston</given>
</name>

3.2.1.2 The encoding declaration

XML 1.0 supports characters, or atomic units of text, as described in ISO/IEC 10646-1:1993, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, plus its seven amendments (see http://www.iso.ch). The mission of the UCS standard is to identify all characters in all writing systems in the world. Since XML 1.0 became a W3C recommendation, ISO/IEC 10646-1:1993 has advanced to ISO/IEC 10646-1:2000.

Unicode is a parallel standard developed by the Unicode Consortium (see http://www.unicode.org). XML 1.0 likewise supports Unicode Version 2.0, but Unicode has recently advanced to Version 4.0, so there are some differences in what XML 1.0 supports and in what the latest version of Unicode supports.

Both ISO/IEC 10646-1 and Unicode assign the same values and descriptions for each character, but Unicode defines some semantics for the characters that ISO/IEC 10646-1 does not. In this book, I'll generally refer to Unicode, although Unicode and ISO/IEC 10646-1 are not exact synonyms.

Good background reading on Unicode and character sets is Mike Brown's XML tutorial at http://www.skew.org/xml/tutorial. To look up character charts, see Kosta Kostis' charts at http://www.kostis.net/charsets/.

Each character in Unicode is assigned a unique number, conventionally written in hexadecimal (base 16). The numbers that represent characters are called code points. The first 128 code points in Unicode are the same as the characters in US-ASCII, and the first 256 are the same as those in Latin-1 (ISO-8859-1), which makes the transition to Unicode easier to follow.

Code Points

You got a very brief introduction to the concept of character encoding in Chapter 2. An XML document, whether in a file or in a stream, is really just a series of bytes. A byte is a chunk of bits (ones and zeroes), usually eight of them. When you assign a character encoding to a document, you express an intent to the processing software to transform the bytes in the document into a sequence of characters that another processor can recognize.

Character encoding is the mapping of binary values to code points or character positions. Let me explain what code points are. Back in the 1960s, ANSI created the ASCII or US-ASCII character-encoding format. US-ASCII represents only 128 characters, numbered 0-127, with each numbered position representing a code point. In binary form, every US-ASCII character is represented by only 7 bits, that is, by a 7-bit byte rather than an 8-bit byte (octet). Other 7-bit encoding forms were created in other parts of the world at this time as well, not just in the U.S.

The uppercase letter A in US-ASCII, for example, is represented by the 7 bits 1000001 and is mapped to the code point 65 (decimal or integer) or 41 in hexadecimal. So the character-encoding scheme we call US-ASCII maps the code point 65 to the 7-bit binary representation 1000001. Character sets map integers to graphic character representations: the US-ASCII character set maps the integer 65 to the character A, for example.
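The mapping just described can be checked with a few lines of code. The following is only a sketch in Python, a language this book does not otherwise use; the variable names are mine. The built-in ord( ) and chr( ) functions convert between a character and its code point:

```python
# Look up the code point of the uppercase letter A and show it
# in decimal, hexadecimal, and 7-bit binary form.
letter = "A"
code_point = ord(letter)                  # character -> integer code point

decimal_form = str(code_point)            # decimal digits
hex_form = format(code_point, "X")        # uppercase hexadecimal digits
binary_form = format(code_point, "07b")   # binary, zero-padded to 7 bits

print(letter, decimal_form, hex_form, binary_form)

# chr() goes the other way: integer code point -> character.
round_trip = chr(code_point)
```

The padding width of 7 in the binary format is chosen to match the 7-bit US-ASCII representation discussed above.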

But 7 bits can only represent 128 distinct values (the highest 7-bit binary number 1111111 equals the decimal equivalent 127). There are thousands of characters in human writing systems beyond ordinary, provincial 128-character US-ASCII. So if you want more characters, such as 256 rather than 128, you need to bump up your binary numbers from 7 bits to 8 bits.

3.2.1.2.1 ISO/IEC 8859

ISO-8859-1, commonly called Latin-1, represents 256 Western European characters, numbered 0-255, using 8-bit bytes or octets. It was originally specified by the European Computer Manufacturers Association (ECMA) in the 1980s and is currently defined there as ECMA-94 (see http://www.ecma-international.org). This standard is also endorsed by ISO and is specified in ISO/IEC 8859-1:1998, Information technology -- 8-bit single-byte graphic character sets -- Part 1: Latin alphabet No. 1 (see http://www.iso.ch). ISO-8859-1 is only the beginning: there are actually 15 character sets in this family. These character sets helped to unify earlier 7-bit efforts. All 15 of these 8-bit character sets are specified by ISO and are listed in Table 3-1.

Table 3-1. ISO 8859 specifications
ISO standard	Description	Character set name
ISO/IEC 8859-1:1998	Part 1, Latin 1	ISO-8859-1
ISO/IEC 8859-2:1999	Part 2, Latin 2	ISO-8859-2
ISO/IEC 8859-3:1999	Part 3, Latin 3	ISO-8859-3
ISO/IEC 8859-4:1998	Part 4, Latin 4	ISO-8859-4
ISO/IEC 8859-5:1998	Part 5, Cyrillic	ISO-8859-5
ISO/IEC 8859-6:1996	Part 6, Arabic	ISO-8859-6
ISO 8859-7:1987	Part 7, Greek	ISO-8859-7
ISO/IEC 8859-8:1999	Part 8, Hebrew	ISO-8859-8
ISO/IEC 8859-9:1999	Part 9, Latin 5	ISO-8859-9
ISO/IEC 8859-10:1998	Part 10, Latin 6	ISO-8859-10
ISO/IEC 8859-11:2001	Part 11, Thai	ISO-8859-11
ISO/IEC 8859-13:1998	Part 13, Latin 7	ISO-8859-13
ISO/IEC 8859-14:1998	Part 14, Latin 8 (Celtic)	ISO-8859-14
ISO/IEC 8859-15:1999	Part 15, Latin 9	ISO-8859-15
ISO/IEC 8859-16:2001	Part 16, Latin 10	ISO-8859-16

Using octets to represent single characters expands the limit to 256 characters. The ISO 8859 character sets reuse the code points 0-255 for each part. Part 1 assigns the small Latin letter ÿ (y with dieresis) to code point 255, but the same code point 255 is assigned to џ (Cyrillic small letter dzhe) in Part 5. Unicode avoids code point conflicts by assigning a unique number to each character. Unicode accomplishes this by not limiting character definitions to a single octet.
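To see the conflict concretely, here is a small Python sketch (Python is my choice for illustration, not a tool used elsewhere in this chapter). It decodes the single octet 255 with two parts of ISO 8859 and reports the Unicode name and code point of each result. The codec names iso8859_1 and iso8859_5 are the ones Python's standard library uses for Parts 1 and 5:

```python
import unicodedata

# The same octet, 0xFF (255), decoded with two different ISO 8859 parts.
octet = bytes([0xFF])

latin1_char = octet.decode("iso8859_1")    # Part 1, Latin 1
cyrillic_char = octet.decode("iso8859_5")  # Part 5, Cyrillic

# Unicode gives each of these characters its own name and code point.
latin1_name = unicodedata.name(latin1_char)
cyrillic_name = unicodedata.name(cyrillic_char)

print(format(ord(latin1_char), "04X"), latin1_name)
print(format(ord(cyrillic_char), "04X"), cyrillic_name)
```

One octet, two characters: only the declared character set tells a processor which one is meant.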

3.2.1.2.2 UTF-8 and UTF-16

XML processors are required to support both UTF-8 and UTF-16 character encodings. These encodings provide different ways of representing Unicode characters in binary form. (UTF stands for UCS Transformation Format.) UTF-8 is a variable-length encoding: as originally specified, it can use between one and six octets to represent a character (the later RFC 3629 restricts it to four, which is enough for every Unicode code point). Unicode code points in the range of 0-127 are represented with one octet, those in the range of 128-2047 are represented with two octets, those in the range of 2048-65535 are represented with three octets, and so forth. It uses a special encoding scheme to get the most out of the fewest bits, using the first octet of a sequence of more than one octet to indicate how many octets are in the sequence. (See http://www.ietf.org/rfc/rfc2279.txt.)
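A quick way to get a feel for UTF-8's variable length is to encode a few characters and count the octets. This Python sketch is illustrative only; the sample characters and the labels in the dictionary are my own choices:

```python
# Count the octets UTF-8 needs for code points in different ranges.
samples = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+20AC EURO SIGN": "\u20ac",
    "U+10348 GOTHIC LETTER HWAIR": "\U00010348",
}

utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, count in utf8_lengths.items():
    print(count, "octet(s):", label)

# In a multi-octet sequence, the high bits of the first octet announce
# the length: three leading 1 bits followed by a 0 mean three octets.
lead_octet = "\u20ac".encode("utf-8")[0]
lead_bits = format(lead_octet, "08b")
print("lead octet of EURO SIGN:", lead_bits)
```

The four samples fall into the one-, two-, three-, and four-octet ranges respectively.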

UTF-16 uses a minimum of two octets to represent characters and, if the character cannot be represented with two octets, it uses four octets. It also uses a special encoding scheme (see http://www.ietf.org/rfc/rfc2781.txt), but if you are using only Latin characters, UTF-16 takes up more space than it needs to. For example, the letter A takes only one octet in UTF-8 but two in UTF-16. On the other hand, most Chinese, Japanese, and Korean characters take three octets in UTF-8 but only two in UTF-16, and no character takes more than four octets in UTF-16. UTF-8 is a good choice for Latin alphabets, and UTF-16 is a good choice for text that is mostly Chinese, Japanese, or Korean.
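The space trade-off between the two encodings is easy to measure. The following Python sketch (the function name octet_counts is hypothetical, and the sample characters are my own) counts the octets each encoding needs for a single character:

```python
def octet_counts(ch):
    """Return (UTF-8 octets, UTF-16 octets) for a one-character string."""
    # "utf-16-be" encodes without a byte order mark, so only the
    # character itself is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

latin_a = octet_counts("A")                 # a Latin letter
cjk_zhong = octet_counts("\u4e2d")          # a common CJK ideograph
supplementary = octet_counts("\U00010348")  # outside the Basic Multilingual Plane

for label, counts in [("A", latin_a), ("U+4E2D", cjk_zhong), ("U+10348", supplementary)]:
    print(label, "UTF-8:", counts[0], "UTF-16:", counts[1])
```

The Latin letter is smaller in UTF-8, the ideograph is smaller in UTF-16, and the supplementary character takes four octets either way.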

3.2.1.2.3 The Byte Order Mark

A Byte Order Mark, or BOM, is a special space character (Unicode character FEFF) that is used only as an encoding signature. If an XML document is UTF-16, it must begin with a BOM; if it is UTF-8, it may begin with a BOM. If the document is not UTF-8 or UTF-16, the character encoding must be declared. You can also declare UTF-8 or UTF-16 encoding explicitly in an XML declaration. (See Section 4.3.3 of the XML specification.)
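As an illustration of how a BOM works as an encoding signature, here is a Python sketch. The function sniff_bom is a hypothetical helper of mine, not part of any XML processor, and it is deliberately simplified: it ignores UTF-32 and does not look at the encoding declaration at all. The BOM constants come from Python's standard codecs module:

```python
import codecs

# The BOM is the character U+FEFF; its octets depend on the encoding.
bom_char = "\ufeff"
utf16_be_bom = bom_char.encode("utf-16-be")  # big-endian byte order
utf16_le_bom = bom_char.encode("utf-16-le")  # little-endian byte order
utf8_bom = bom_char.encode("utf-8")

def sniff_bom(data):
    """Guess an encoding from a leading BOM; return None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    return None

print(sniff_bom(utf16_be_bom + "<name/>".encode("utf-16-be")))
print(sniff_bom(b"<name/>"))
```

A real XML processor combines this kind of check with the encoding declaration, as described in the XML specification.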

XML processors may support other encodings such as US-ASCII, ISO-8859-1, or Shift_JIS (Japanese). The Internet Assigned Numbers Authority keeps track of encoding names and publishes them at http://www.iana.org/assignments/character-sets. You can use your own private encoding name if you start it with x-, but you would have to write your own code to process it.

3.2.2 Unicode and the Command Shell Window

In a shell or command prompt window, it's difficult, if not impossible, to see the difference between one kind of character encoding and another. To show you the effect of this, apply the stylesheet encoding.xsl to name.xml with Xalan:

xalan name.xml encoding.xsl

Here's encoding.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:output encoding="UTF-16"/>

<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
 </name>
</xsl:template>

</xsl:stylesheet>

The result in a Windows command prompt window, which doesn't handle UTF-16 properly, will look something like this:

 < ? x m l   v e r s i o n = " 1 . 0  "  e n c o d i n g = " U T F - 1 6 " ? >  < n a m e >  < f a m i l y > A s a m i < / f a m i l y >  < g i v e n > T o m o h a r u < / g i v e n >  < / n a m e >

The dark block at the beginning of the document shows you where the BOM is. Even though the BOM is a zero-width space, the code page used by the Windows command prompt represents it differently. A code page is a Microsoft character set, and if your computer is configured for U.S. English, the code page is likely to be 437. Code page 437, using the Lucida Console font, interprets 8 bits of the character (FE in hexadecimal, 11111110 in binary, and 254 in decimal) as a black square. That is what is mapped to the character in the code page (see http://www.kostis.net/charsets/cp437.htm). In Unicode, the black square is 25A0 in hexadecimal (see Figure 3-1), and it is 9632 in decimal.

Changing the Code Page in Windows

Here is how to test what code page your window is using at a Windows command prompt (such as on Windows XP Professional). Enter the command:

 mode con: cp

You can use the mode command to display the status of your system, among other things. If you change the code page to 850 (multilingual Latin 1) with this command:

mode con: cp select=850

and then transform name.xml with encoding.xsl, the result in your command prompt window will look different. To change your code page back to 437, type this command:

mode con: cp select=437

Where did that extra space come from in the output of encoding.xsl? Because you are using UTF-16 encoding, each character in the output is represented by two octets. Code page 437 interprets the other 8 bits (FF in hexadecimal, 11111111 in binary, and 255 in decimal) as nonbreaking space. Unicode numbers the nonbreaking space as A0 in hexadecimal and as 160 in decimal. That's where the extra space is coming from. This incompatibility between encoding schemes and the display of characters in a shell window or text editor is the cause of a lot of confusion. It is good to be aware of it. Character Map and UniPad are tools that can help analyze Unicode characters.
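You can reproduce the console's interpretation of the BOM without a Windows machine. This Python sketch (my own illustration; cp437 is the name of Python's standard codec for code page 437) decodes the two BOM octets as if each were a separate code page 437 character:

```python
import unicodedata

# The two octets of a UTF-16 BOM written in big-endian byte order.
bom_octets = bytes([0xFE, 0xFF])

# Decode them one octet per character, as a code page 437 console would.
shown = bom_octets.decode("cp437")

names = [unicodedata.name(ch) for ch in shown]
code_points = [format(ord(ch), "04X") for ch in shown]

for cp, name in zip(code_points, names):
    print("U+" + cp, name)
```

The first octet comes out as the black square and the second as the nonbreaking space, which matches what appears at the start of the console output.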

Looking at a File with xxd

If you are running Linux or Cygwin on a Windows box (see http://www.cygwin.com), you probably have the xxd utility available to you on the command line. This utility can examine a file and let you see it in hexadecimal or binary form, which may be of use to you with regard to encoding as you can look at a file character-by-character. For example, if you execute the following transformation:

xalan -o dump.xml name.xml encoding.xsl

the result of the transformation is saved to the file dump.xml. You can look at dump.xml with xxd using this command line:

xxd -g 1 dump.xml

By default, each line of output from xxd shows 16 octets and begins with the offset of its first octet in hexadecimal, so the first line covers the octets numbered 0000000 through 000000f (0-15 in decimal). Following the offset, each octet is printed in hexadecimal, with the normal Latin characters shown on the far right. If an octet can't be represented as a printable ASCII character, it is represented by a dot (.) on the right side.
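If xxd is not available, a rough equivalent is easy to sketch. The Python function below, hexdump, is a hypothetical stand-in of mine that is loosely modeled on the layout just described; it does not reproduce every detail of xxd's output, and the exact offset width varies between xxd versions:

```python
def hexdump(data, width=16):
    """Return xxd-style lines: offset, hex octets, printable ASCII."""
    pad = width * 3 - 1  # room for `width` two-digit octets plus spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(format(b, "02x") for b in chunk)
        # Printable ASCII (0x20-0x7e) is shown as-is; anything else as a dot.
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:07x}: {hex_part:<{pad}}  {text_part}")
    return lines

# A BOM followed by a start tag, encoded as big-endian UTF-16.
sample = "\ufeff<name>".encode("utf-16-be")
dump = hexdump(sample)
for line in dump:
    print(line)
```

In the dump of a UTF-16 document that uses only Latin characters, every other octet is 00, which shows up as alternating dots in the right-hand column.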

3.2.3 Using Character Map and UniPad

The Windows Character Map utility allows you to select and copy characters in available fonts for use in other applications, but it also helps you quickly identify the Unicode code point and names for characters. Notice the lower-left corner in Figure 3-1, which identifies the Unicode code point in hexadecimal (U+25A0), plus the character name (Black Square). Figure 3-1 shows what the Character Map looks like in Windows XP Professional.

Figure 3-1. Character Map utility

Another useful program is Sharmahd Computing's SC UniPad, a Unicode text editor available for free download from http://www.unipad.org. Among other things, UniPad shows you the Unicode value of a character based on the position of the cursor in the edit window. Figure 3-2 shows you dump.xml in a UniPad window. Note the Unicode character information in the status bar. Among the things the status bar tells you are the Unicode code point for the character where the cursor is located (U+003C) and the character's descriptive name (LESS-THAN SIGN). It also indicates the encoding (UTF-16 (L) for little endian) and tells you that the byte-order mark is present (BOM).

Figure 3-2. dump.xml in UniPad

3.2.3.1 Entities and text declarations

A text declaration is similar to an XML declaration, but it does not have to provide version information. Text declarations are used for separate, external documents called entities. If an external entity is not in UTF-8 or UTF-16, the external entity must have a text declaration (see Section 4.3.3 of the XML specification). To understand what an external entity is, look at the document entity.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="entity.css" type="text/css"?>
<!DOCTYPE name [
<!ENTITY first SYSTEM "name.ent">
]>

<name>
 <last>Churchill</last>
 <first>&first;</first>
</name>

This document contains an internal document type definition, or DTD, called an internal subset. It's internal to the XML document that it qualifies. The entity is declared in the internal subset (note the keyword ENTITY). You'll learn about DTDs in Section 3.2.4, later in this chapter. For right now, I'll focus only on the entity.

The entity is an external, parsed entity. External means that the content of the entity is stored in an external file. Parsed means that the entity is made of text that may be parsed. The name of this entity is first. The SYSTEM keyword indicates that the entity is in a named file, and the name of that file is name.ent. The first element contains an entity reference (&first;) that, when processed, will be expanded or replaced with the contents of the file name.ent:

<?xml encoding="ISO-8859-1"?>Randolph

The external entity name.ent contains a text declaration that has an encoding declaration with the encoding name ISO-8859-1. It looks like an XML declaration, but the version information is not required (nor is it forbidden). If you display entity.xml in IE, at least in Version 6.0 or greater, the entity will be expanded so that the content of the first element will be Randolph.

Figure 3-3 shows what entity.xml looks like in IE when using the stylesheet entity.css:

name {font-size: 18pt}
last {display:inline}

Figure 3-3. The document entity.xml displayed in IE

You'll read more about entities in Section 3.2.3.2 to follow. For more information on text declarations, see Section 4.3.1 of the XML specification.

3.2.3.2 The standalone declaration

The standalone declaration in an XML declaration indicates explicitly whether an XML document depends on external markup declarations. An element type declaration, such as <!ELEMENT family (#PCDATA)>, is an example of a markup declaration. Markup declarations are stored in DTDs. The following document, standalone.xml, states bluntly that it does not depend on external documents:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<name>
 <last>Churchill</last>
 <first>Winston</first>
</name>

If, however, you apply the stylesheet notalone.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:output doctype-system="notalone.dtd"/>
<xsl:output standalone="no"/>

<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
 </name>
</xsl:template>

</xsl:stylesheet>

to standalone.xml, using:

xalan -o notalone.xml standalone.xml notalone.xsl

the value of the standalone declaration is changed from yes to no in the output document notalone.xml, and a document type declaration is also added:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE name SYSTEM "notalone.dtd">
<name>
<family>Churchill</family>
<given>Winston</given>
</name>

The DTD notalone.dtd contains three markup declarations, all for elements:

<!ELEMENT name (family, given)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>

You'll learn more about the document type declaration later in this chapter in Section 3.2.4.

It is important for you to know (though you have probably already realized it) that standalone declarations are not required. They may be useful in some applications because the XML declaration must be on the first line in a document, and so information about whether the document has dependencies is available to applications early on.

If a document declares standalone="no" but actually has no external dependencies, an XML processor will simply ignore the declaration. If a document does have dependencies that affect its content, declaring standalone="yes" is an error that a validating processor will report. If a document doesn't have a standalone declaration in an XML declaration, it usually doesn't matter much anyway: an XML processor will find the external markup declarations nevertheless. Again, for more insight, see Section 3.2.4.

3.2.3.3 XML version information

Version 1.0 of XML was approved as a W3C recommendation in February 1998. While the 1.0 specification has held its ground for over five years, it is likely that the W3C will deliver XML 1.1 as a recommendation in 2003. If so, XSLT is ready in at least one respect: you can control XML version information in an XML declaration with output's version attribute.

Here is an example of how it works. The stylesheet version.xsl uses the version attribute on the output element:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:output version="1.1"/>

<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
 </name>
</xsl:template>

</xsl:stylesheet>

When applied to name.xml with:

xalan name.xml version.xsl

this stylesheet will produce the following result with an altered XML declaration:

<?xml version="1.1" encoding="UTF-8"?>
<name>
<family>Churchill</family>
<given>Winston</given>
</name>

The XML version is changed from 1.0 to 1.1.

Xalan and Saxon both support the version attribute of output.

3.2.4 Controlling Document Type Declarations

A document type declaration associates document type definitions (DTDs) with an XML document. In essence, it helps an XML validator find where DTDs exist. The DTD can be either internal to an XML document, external to it, or both. To illustrate, the document internal.xml has an internal subset:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name [
<!ELEMENT name (last, first)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT first (#PCDATA)>
]>

<name>
 <last>Churchill</last>
 <first>Winston</first>
</name>

DTDs, as you can already see, have a different syntax than ordinary XML. DOCTYPE is the keyword for the document type declaration. Following that keyword is the name of the document element of the XML document, which is name. Inside the square brackets ([ ]) are three element declarations that begin with the keyword ELEMENT.

According to this internal subset, a name element must contain exactly one last element, which is followed by exactly one first element. Both last and first must contain parsed character data (#PCDATA). The document contained in internal.xml is valid with regard to its internal subset.

The document external.xml references an external DTD called the external subset. It is in a file called external.dtd; external.xml is valid with regard to it:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name SYSTEM "external.dtd">

<name>
 <last>Churchill</last>
 <first>Winston</first>
</name>

The SYSTEM keyword indicates that the following value will be a system identifier or URI. Here is external.dtd, which has the same declarations as the internal subset of internal.xml, but in a document separate from the instance:

<!ELEMENT name (last, first)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT first (#PCDATA)>

The document both.xml contains an internal subset and also refers to an external subset:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name SYSTEM "both.dtd" [
<!ELEMENT last (#PCDATA)>
]>
<name>
 <last>Churchill</last>
 <first>Winston</first>
</name>

The document type declaration encloses an internal subset and also points to the external subset both.dtd with a system identifier:

<!ELEMENT name (last, first)>
<!ELEMENT first (#PCDATA)>

The external subset contains declarations for the name and first elements, and the internal subset holds a declaration for last only. Both the internal and external subsets are needed to validate the document.

3.2.4.1 Validation with transformation

You can validate a source document at the same time that you transform it by using the -v (validate) command-line option. For example, the following command line performs validation on both.xml before the document is transformed with both.xsl:

xalan -v both.xml both.xsl

The validate option works with Saxon and MSXSL as well. MSXSL is a fast, Windows-native command-line processor available free from Microsoft (see the appendix for more information on MSXSL).

3.2.4.2 Adding a document type declaration with a system identifier

XSLT won't let you add markup declarations such as <!ELEMENT name (last, first)> to an internal subset through a transformation, but it will let you add document type declarations to a result. The document name.xml, for example, doesn't have a document type declaration. You can add one with XSLT by using the doctype-system attribute on output. The following stylesheet, doctype-system.xsl, shows you how:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:output doctype-system="name.dtd"/>

<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
 </name>
</xsl:template>

</xsl:stylesheet>

When name.xml is transformed with this stylesheet:

xalan name.xml doctype-system.xsl

the doctype-system attribute triggers the creation of a document type declaration in the result that references the system identifier name.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name SYSTEM "name.dtd">
<name>
<family>Churchill</family>
<given>Winston</given>
</name>

3.2.4.3 Adding a document type declaration with a public identifier

Public identifiers are often associated with widely accepted DTDs, such as the strict DTD associated with XHTML. In some situations, software can resolve the names of public identifiers with local copies of a DTD, rather than by using a remote DTD over a network. Finding and using local DTDs can save processing time, especially when you have many files to validate.

Following is a public identifier for strict XHTML 1.0:

-//W3C//DTD XHTML 1.0 Strict//EN

The leading - indicates that the public identifier is not registered with ISO. The name of the identifier's owner is preceded by a pair of slashes (//W3C), followed by a pair of slashes and the description of the DTD (//DTD XHTML 1.0 Strict), followed by a pair of slashes and a language code (//EN).
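The structure of a public identifier can be pulled apart mechanically. The Python sketch below is only an illustration: parse_public_id is a hypothetical helper, the dictionary keys are labels of my own choosing, and the function assumes the four-field layout described above rather than handling every possible form of public identifier:

```python
def parse_public_id(public_id):
    """Split a public identifier into its four //-separated fields."""
    registration, owner, description, language = public_id.split("//")
    # The description starts with a keyword (such as DTD) that names
    # the kind of resource, followed by free text.
    keyword, _, title = description.partition(" ")
    return {
        "registered": registration == "+",  # "-" means not registered
        "owner": owner,
        "keyword": keyword,
        "title": title,
        "language": language,
    }

fields = parse_public_id("-//W3C//DTD XHTML 1.0 Strict//EN")
for key, value in fields.items():
    print(key + ":", value)
```

Software that maps public identifiers to local copies of a DTD usually treats the whole string as a lookup key, so splitting it like this is mainly useful for understanding what each part means.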

The stylesheet doctype-public.xsl adds a public identifier for strict XHTML 1.0 to a result:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:output doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"/>
<xsl:output doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>

<xsl:template match="name">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
 <title><xsl:value-of select="name( )"/></title>
</head>
<body>
  <p><xsl:apply-templates select="last"/></p>
  <p><xsl:apply-templates select="first"/></p>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

In addition to a public identifier, this stylesheet also specifies a system identifier URI for an XHTML DTD. The value-of element's select attribute contains an expression that calls the XPath name( ) function that returns the name of a node, rather than its content. You'll learn more about XPath functions such as name( ) in Chapter 5.

When applied to name.xml with:

xalan name.xml doctype-public.xsl

doctype-public.xsl produces the following output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>name</title>
</head>
<body>
<p>Churchill</p>
<p>Winston</p>
</body>
</html>

3.2.4.4 Validating XHTML

This output is valid, strict XHTML 1.0. Save the output to a file, for example, with the command:

xalan -o name.html name.xml doctype-public.xsl

Because name.html is XHTML, you can validate it just as you would any XML document. One easy way to do this is with W3C's online validation tool. If you go to the W3C Markup Validation Service page at http://validator.w3.org, you can upload a local file, such as name.html, using the Browse button (see Figure 3-4). Then you can click the Validate File button, and the service will attempt to validate the file. One of the nice things about the W3C service is that it provides diagnostics if there are errors present on the page, making it easier to correct the errors. This online tool also works as an XML and HTML validator.

Figure 3-4. The W3C Markup Validation Service

3.2.5 Outputting CDATA Sections

CDATA sections in XML allow you to hide characters like < and & from the XSLT processor. The difference between a CDATA section and an individual entity reference is that you hide a section of characters rather than just one at a time.

A CDATA section begins with the characters <![CDATA[ and ends with ]]>. For example, the company element in this fragment contains a CDATA section:

<company><![CDATA[<pub>O'Reilly & Associates</pub>]]></company>

The & and < characters in the CDATA section are hidden so that they aren't interpreted as markup (such as the start of an entity or character reference). The cdata-section-elements attribute on output lets you tell the XSLT processor which elements you want to contain CDATA sections in the result.

To see how it's done, consider the stylesheet cdata.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:output cdata-section-elements="notes"/>

<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
  <notes>Author &amp; British prime minister</notes>
 </name>
</xsl:template>

</xsl:stylesheet>

In this example, the cdata-section-elements attribute of output contains the name of an element (notes) whose content you want to enclose in a CDATA section. If you process name.xml with cdata.xsl:

xalan name.xml cdata.xsl

you will see the following result:

<?xml version="1.0" encoding="UTF-8"?>
<name>
<family>Churchill</family>
<given>Winston</given>
<notes><![CDATA[Author & British prime minister]]></notes>
</name>

The character data content of notes (from the template in the stylesheet) is surrounded by a CDATA section in the result, and the entity reference &amp; is changed into &. The cdata-section-elements attribute can contain a list of whitespace-separated element names. Each element in such a list must contain character data in the result, as notes does, because only the text directly inside a listed element is wrapped in a CDATA section.

You can also serialize CDATA sections by using literal text. To do this, use literal text such as shown in literal-cdata.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:output cdata-section-elements="notes"/>

<xsl:template match="name">
 <name>
  <family><xsl:apply-templates select="last"/></family>
  <given><xsl:apply-templates select="first"/></given>
  <notes><![CDATA[Author & British prime minister]]></notes>
 </name>
</xsl:template>

</xsl:stylesheet>

When you transform name.xml with this stylesheet using:

xalan name.xml literal-cdata.xsl

you will see the CDATA section passed on literally to the result:

<?xml version="1.0" encoding="UTF-8"?>
<name>
<family>Churchill</family>
<given>Winston</given>
<notes><![CDATA[Author & British prime minister]]></notes>
</name>
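It may help to confirm that, to an XML parser, a CDATA section and the equivalent escaped text carry exactly the same character data. This Python sketch uses the standard library's xml.etree.ElementTree parser purely as an illustration (it is not one of the processors used in this book), and the variable names are mine:

```python
import xml.etree.ElementTree as ET

# The same content written two ways: in a CDATA section, and with
# the ampersand escaped as an entity reference.
with_cdata = ET.fromstring(
    "<notes><![CDATA[Author & British prime minister]]></notes>"
)
with_entity = ET.fromstring(
    "<notes>Author &amp; British prime minister</notes>"
)

# After parsing, both elements hold identical text.
same_text = with_cdata.text == with_entity.text
print(same_text)
print(with_cdata.text)
```

This is why both stylesheets produce the same result: by the time the XSLT processor sees the notes element in the stylesheet, the two forms are indistinguishable, and it is the cdata-section-elements attribute that decides how the text is written out.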

You can find more about CDATA sections in Section 2.7 of the XML specification.