Hack 39 Create a Text File from an XML Document

   


Use this stylesheet to extract only the text from any XML document.

Sometimes you just want to leave the XML behind and keep only the text found in a document. The stylesheet text.xsl can do that for you. (There's an even easier way; see "Built-in Templates" below.) It can be applied to any XML document, including XHTML. It is shown in Example 3-15.

Example 3-15. text.xsl
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="/">
 <xsl:apply-templates select="*"/>
</xsl:template>

</xsl:stylesheet>

This stylesheet finds the root node and then selects all element children (*) for processing. To test, apply this stylesheet to the XHTML document magnacarta.html, the pact between King John and the barony in England that was first signed at Runnymede on June 15, 1215 (see http://www.cs.indiana.edu/statecraft/magna-carta.html):

xalan magnacarta.html text.xsl

A small portion of the output is shown in Example 3-16. The source document, as displayed in IE, is shown in Figure 3-18.

Example 3-16. A portion of the Magna Carta
Magna Carta

The Magna Carta
JOHN, by the grace of God King of England, Lord of Ireland,
Duke of Normandy and Aquitaine, and Count of Anjou, to his
archbishops, bishops, abbots, earls, barons, justices,
foresters, sheriffs, stewards, servants, and to all his
officials and loyal subjects, Greeting.

KNOW THAT BEFORE GOD, for the health of our soul and those of
our ancestors and heirs, to the honour of God, the exaltation
of the holy Church, and the better ordering of our kingdom, at
the advice of our reverend fathers Stephen, archbishop of
Canterbury, primate of all England, and cardinal of the holy
Roman Church, Henry archbishop of Dublin, William bishop of
London, Peter bishop of Winchester, Jocelin bishop of Bath and
Glastonbury, Hugh bishop of Lincoln, Walter Bishop of Worcester,
William bishop of Coventry, Benedict bishop of Rochester, Master
Pandulf subdeacon and member of the papal household, Brother
Aymeric master of the knighthood of the Temple in England,
William Marshal earl of Pembroke, William earl of Salisbury,
William earl of Warren, William earl of Arundel, Alan de
Galloway constable of Scotland, Warin Fitz Gerald, Peter Fitz
Herbert, Hubert de Burgh seneschal of Poitou, Hugh de Neville,
Matthew Fitz Herbert, Thomas Basset, Alan Basset, Philip Daubeny,
Robert de Roppeley, John Marshal, John Fitz Hugh, and other loyal
subjects:

Figure 3-18. The Magna Carta (magnacarta.html) in IE
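To see the traversal on something much smaller than the Magna Carta, it may help to try text.xsl on a tiny input. The document below is a hypothetical file (call it note.xml; the name and its content are invented here for illustration, not taken from the hack's sample files). It is deliberately written without whitespace between elements so that no whitespace-only text nodes end up in the output.

```xml
<?xml version="1.0"?>
<!-- This comment is skipped: select="*" picks up element children only -->
<note id="n1"><to>Barons</to><body>Greeting.</body></note>
```

Applied in the same way as magnacarta.html, text.xsl should emit only the concatenated text BaronsGreeting. The comment is dropped, and so is the value of the id attribute, because attributes are not children of their element and nothing in the stylesheet selects them.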


3.10.1 Built-in Templates

You can also extract text from a document just by relying on XSLT's built-in templates. A stylesheet as simple as this single line:

<xsl:stylesheet version="1.0"     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>

will invoke the built-in templates because there is no explicit template for any nodes that might be found in the source document. The built-in templates process all the children of the root and of every element, and copy through the text of text nodes and attributes (they do nothing for comment, processing-instruction, or namespace nodes). Attribute values are copied only if some template selects the attributes, because the built-in rule for elements processes child nodes, and attributes are not children. The benefit of using text.xsl over built-in templates is that text.xsl gives you a framework to exercise some control over the output (e.g., by adding templates). However, adding templates to text.xsl makes a difference only for the nodes those templates match explicitly; any explicit template takes precedence over the built-in template matching *, and everything else still falls through to the built-in templates. An empty stylesheet is the simplest one to start from if you want to add more precise templates.
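For reference, the built-in rules can be written out as ordinary templates. The sketch below follows the definitions given in the XSLT 1.0 Recommendation (section 5.8), leaving out the mode-specific variant. Two points are worth hedging. First, namespace nodes need no rule, since no pattern can match them. Second, xsl:output is an addition here: a truly empty stylesheet has no xsl:output element, so a processor will normally fall back to the xml output method and may write an XML declaration and escape characters such as & in the result.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Added for this sketch; an empty stylesheet has no xsl:output -->
<xsl:output method="text"/>

<!-- Root node and elements: keep walking down through the child nodes -->
<xsl:template match="*|/">
 <xsl:apply-templates/>
</xsl:template>

<!-- Text and attribute nodes: copy the string value through -->
<xsl:template match="text()|@*">
 <xsl:value-of select="."/>
</xsl:template>

<!-- Comments and processing instructions: produce nothing -->
<xsl:template match="processing-instruction()|comment()"/>

</xsl:stylesheet>
```

Because these explicit templates mirror the built-in ones, this stylesheet should extract the same text as text.xsl does.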

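As a sketch of the kind of control that added templates give you, the stylesheet below extends text.xsl with two rules. Both are hypothetical additions, not part of the original hack: one drops the head element (and with it the title text), and the other normalizes the whitespace in each p element and follows it with a blank line. The patterns use local-name() on the assumption that the XHTML source may or may not declare the XHTML namespace; if you know the namespace, a prefixed pattern would be more conventional.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<!-- Same starting point as text.xsl -->
<xsl:template match="/">
 <xsl:apply-templates select="*"/>
</xsl:template>

<!-- Hypothetical addition: an empty template suppresses head and its text -->
<xsl:template match="*[local-name()='head']"/>

<!-- Hypothetical addition: one tidy paragraph, then a blank line -->
<xsl:template match="*[local-name()='p']">
 <xsl:value-of select="normalize-space(.)"/>
 <xsl:text>&#10;&#10;</xsl:text>
</xsl:template>

</xsl:stylesheet>
```

Elements that neither rule matches still fall through to the built-in templates, so their text is copied as before.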


XML Hacks: 100 Industrial-Strength Tips and Tools
ISBN: 0596007116