3.4 Modifying Word Documents


There are plenty of use cases for processing Word documents in which both the input and output are Word documents. Since XSLT is a particularly suitable tool for incrementally processing XML, it also works quite nicely for modifying Word documents. An important tool for making incremental modifications to a document is the identity transformation. Example 3-9 shows the canonical identity transformation, exactly as it appears in the XSLT recommendation itself (http://www.w3.org/TR/xslt#copying).

Example 3-9. The identity transformation, identity.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

What is the identity transformation? Shown in Example 3-9, it's a stylesheet with one template rule that effectively copies the source tree to the result tree unchanged. Here's how it works. The single template rule, with its pattern @*|node(), matches all elements, attributes, comments, text, and processing instructions in the source tree. Each time the template rule fires, a shallow copy of the node is created (using the xsl:copy element), and templates are applied to all of the node's attributes and children. Thus, the entire source document is recursively copied, one node at a time. (This powerful template rule and variations of it also appear in Chapter 4, in Example 4-9, saveDataOnly.xsl, and Example 4-11, create-onload-stylesheet.xsl.)

By using the identity stylesheet as your departure point, you can incrementally alter its default copying behavior by specifying exceptions to the rule, using custom template rules. Since this stylesheet serves as the baseline for each example in this section, we'll use xsl:include to include it (as identity.xsl), rather than repeatedly list the identity template rule inside each example.
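
As a minimal sketch of this pattern, the following stylesheet includes identity.xsl and adds a single exception: an empty template rule that drops every XML comment node, while everything else is copied through. The filename stripComments.xsl is hypothetical and is not one of this book's examples. The explicit priority attribute is there because comment() and node() have the same default priority in XSLT 1.0, so without it the two rules would conflict.

```xml
<!-- stripComments.xsl (hypothetical filename, for illustration only) -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Default behavior: copy every node unchanged -->
  <xsl:include href="identity.xsl"/>

  <!-- Exception: match XML comment nodes and emit nothing for them.
       priority="1" makes this rule win over the identity rule. -->
  <xsl:template match="comment()" priority="1"/>

</xsl:stylesheet>
```

The custom rules in the rest of this section match named elements, often with predicates, so their default priority is already higher than that of the identity rule and no priority attribute is needed.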

3.4.1 Cleaning Up a Document for Publication

When Word saves documents, it includes a lot of information that you may not want to include in the final published document that you share with others. Sensitive information might include previous authors, comments, deleted text, revision marks, spelling and grammar error marks, and custom document properties. Example 3-10 shows a stylesheet (cleanup.xsl) that removes all such information. Each template rule is accompanied by a descriptive comment, which is highlighted in this listing. Rather than walking through the stylesheet step-by-step, we'll let it speak for itself.

Example 3-10. A stylesheet for cleaning up Word documents, cleanup.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core">

  <xsl:include href="identity.xsl"/>

  <!-- Normalize document's view and zoom percentage (Normal at 100%) -->
  <xsl:template match="w:docPr">
    <xsl:copy>
      <w:view w:val="normal"/>
      <w:zoom w:percent="100"/>
      <xsl:apply-templates select="*[not(self::w:view or self::w:zoom)]"/>
    </xsl:copy>
  </xsl:template>

  <!-- Remove all but the Author and Title document properties -->
  <xsl:template match="o:DocumentProperties">
    <xsl:copy>
      <xsl:copy-of select="o:Author|o:Title"/>
    </xsl:copy>
  </xsl:template>

  <!-- Remove all custom document properties -->
  <xsl:template match="o:CustomDocumentProperties"/>

  <!-- Remove all comments and comment references -->
  <xsl:template match="aml:annotation[starts-with(@w:type,'Word.Comment')]"/>

  <!-- Remove all spelling and grammar errors -->
  <xsl:template match="w:proofErr"/>

  <!-- Remove all deletions -->
  <xsl:template match="aml:annotation[@w:type='Word.Deletion']"/>

  <!-- Remove all formatting changes -->
  <xsl:template match="aml:annotation[@w:type='Word.Formatting']"/>

  <!-- Remove all insertion marks -->
  <xsl:template match="aml:annotation[@w:type='Word.Insertion']">
    <!-- Process content, but do not copy -->
    <xsl:apply-templates select="aml:content/*"/>
  </xsl:template>

</xsl:stylesheet>

As in all the rest of the examples in this section, we include the identity.xsl stylesheet, which establishes the default copying behavior:

<xsl:include href="identity.xsl"/>

Everything after that is a custom template rule overriding the default behavior for a particular element. A common pattern in this stylesheet is the use of empty xsl:template elements. These are used to remove elements from the result document. Since an empty template rule does nothing when fired (overriding the default copying behavior), it effectively strips out the matched node from the resulting document.

This stylesheet by no means provides the definitive cleanup for all the different kinds of documents you might want to publish. More than likely, you'll want to customize it to meet your particular needs. For example, if you don't want to strip out comments, then you would remove the template rule that strips out comments. Similarly, if you want to strip out another kind of information not covered by this stylesheet, then you would add your own template rule for doing that.
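
If you would rather not edit cleanup.xsl itself, one possible alternative is to layer your changes on top of it with xsl:import, whose rules take lower precedence than those in the importing stylesheet. The sketch below is an illustration under assumptions: the filename myCleanup.xsl is hypothetical, and the choice to keep comments and the o:Created property is only an example of a customization.

```xml
<!-- myCleanup.xsl (hypothetical filename, for illustration only) -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core">

  <!-- Rules in this file take precedence over the imported ones -->
  <xsl:import href="cleanup.xsl"/>

  <!-- Keep comments after all: copy them the way identity.xsl would -->
  <xsl:template match="aml:annotation[starts-with(@w:type,'Word.Comment')]">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Keep the creation date in addition to Author and Title
       (o:Created is assumed to be present in the source document) -->
  <xsl:template match="o:DocumentProperties">
    <xsl:copy>
      <xsl:copy-of select="o:Author|o:Title|o:Created"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because cleanup.xsl already includes identity.xsl, the importing stylesheet does not need to include it again.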

Let's take a look at cleanup.xsl in action. Figure 3-5 shows a document with lots of cruft: deleted text, tracked insertions (underlined), a tracked formatting change, comments, and some spelling and grammar errors. It was saved in "Web" view with a zoom percentage of 125%.

Figure 3-5. A document with comments, tracked changes, and proof errors, dirty.xml
figs/oxml_0305.gif


If we apply cleanup.xsl to the WordprocessingML representation of the document shown in Figure 3-5, then we'll get the result shown in Figure 3-6.

Figure 3-6. clean.xml, the result of applying cleanup.xsl to dirty.xml
figs/oxml_0306.gif


Not only have all the comments, proof errors, and tracked changes been removed, but the document's view has also been normalized to the "Normal" view with a zoom percentage of 100%.

If you publish your documents as WordprocessingML, then you have complete control over what information is contained within them. However, only users that have Word 2003 will be able to view your documents. When publishing .doc files instead, you'll have backward compatibility on your side, but you won't have quite as much control over what metadata is included. For example, whoever last saved the file will be listed under "Last saved by:" (corresponding to the o:LastAuthor element in WordprocessingML).


3.4.2 Removing All Direct (Local) Formatting

A commonly promoted "best practice" in authoring Word documents is to use styles only and no direct formatting. While there is a function in Word that allows you to remove direct formatting (by selecting text and pressing Ctrl-Space), it is sometimes handy to apply such cleanup to an entire document ex post facto, using XSLT. Example 3-11 shows a stylesheet that leaves the entire source document intact, except for the paragraph and run properties that have been applied as direct formatting; those are removed.

Example 3-11. A stylesheet for removing direct run and paragraph formatting, removeDirectFormatting.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:include href="identity.xsl"/>

  <!-- Remove all direct paragraph formatting -->
  <xsl:template match="w:p/w:pPr/*[not(self::w:pStyle)]"/>

  <!-- Remove all direct run formatting -->
  <xsl:template match="w:r/w:rPr/*[not(self::w:rStyle)]"/>

</xsl:stylesheet>

Once again, the default behavior for all nodes is to copy them through, because the stylesheet includes the identity.xsl stylesheet.

This stylesheet has two custom template rules, one for direct paragraph formatting and one for direct run formatting:

<xsl:template match="w:p/w:pPr/*[not(self::w:pStyle)]"/>
...
<xsl:template match="w:r/w:rPr/*[not(self::w:rStyle)]"/>

Both of these are empty, which means that matched nodes effectively get stripped from the result. All element children of local w:pPr and w:rPr elements get stripped from the document, with one exception in each case: the w:pStyle and w:rStyle elements are preserved. That's because these elements are used not to apply direct formatting but to associate the paragraph or run with a particular style defined in the document. We need to preserve these associations; otherwise, the stylesheet would strip out all of the document's formatting, not just direct formatting.
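
To make the effect concrete, here is a small before-and-after sketch. The paragraph is an invented fragment, not taken from a real document, and the style names Heading1 and Emphasis are assumptions; the "after" fragment is shown with whitespace tidied for readability.

```xml
<!-- Before: a paragraph with style references plus direct formatting -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
    <w:jc w:val="center"/>        <!-- direct paragraph formatting -->
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="Emphasis"/>
      <w:b/>                      <!-- direct run formatting -->
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
</w:p>

<!-- After removeDirectFormatting.xsl: only the style references remain -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="Emphasis"/>
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
</w:p>
```

Note that the patterns begin with w:p and w:r, so the w:pPr and w:rPr elements inside style definitions (children of w:style) are not affected.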

An alternative version of this stylesheet could be customized according to a particular Word template so that, rather than just removing direct formatting, an appropriate style would be used instead. For example, when you come across a run that has italics turned on as direct formatting (using the w:i element), you could convert that to a run that uses the "Emphasis" character style instead (using the w:rStyle element). Such a conversion could go a long way in updating legacy Word documents according to an organization's current authoring standards. Fortunately, with Word 2003's new document protection features (introduced in Chapter 4), such restrictions can now be enforced at authoring time.
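
A minimal sketch of such a conversion might look like the following. It is an illustration under assumptions rather than a complete solution: the filename italicToEmphasis.xsl is hypothetical, the document is assumed to define a character style whose w:styleId is "Emphasis", and runs that already reference a character style are deliberately left alone.

```xml
<!-- italicToEmphasis.xsl (hypothetical filename, for illustration only) -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:include href="identity.xsl"/>

  <!-- Match run properties that turn italics on directly
       and that do not already reference a character style -->
  <xsl:template match="w:r/w:rPr[w:i[not(@w:val='off')] and not(w:rStyle)]">
    <xsl:copy>
      <!-- Assumes a character style with w:styleId="Emphasis" exists -->
      <w:rStyle w:val="Emphasis"/>
      <!-- Keep the remaining run properties, dropping the direct italics -->
      <xsl:apply-templates select="*[not(self::w:i)]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

A production version would need a policy for runs that combine italics with an existing character style, which this sketch simply copies unchanged.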

3.4.3 Removing Linked "Char" Styles

At the end of Chapter 2, in Section 2.7.8, we learned about the character styles that Word automatically creates when a user tries to apply a paragraph style to only a portion of a paragraph. Word names the new character style by appending the word "Char" to the end of the existing paragraph style's name. Unfortunately, Word does not provide a way to delete a linked character style without deleting the paragraph style it is linked to. If a user tries to delete the automatically created linked style, Word also deletes the corresponding paragraph style. However, by processing a document's WordprocessingML representation outside of Word, we can overcome that restriction. Example 3-12 shows a stylesheet that strips out linked character styles and references to them, while retaining the paragraph styles they are linked to.

Example 3-12. A stylesheet for removing linked "Char" styles, removeLinkedStyles.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:include href="identity.xsl"/>

  <!-- Remove all linked character styles -->
  <xsl:template match="w:style[@w:type='character' and w:link]"/>

  <!-- Remove the w:link element from linked paragraph styles -->
  <xsl:template match="w:link"/>

  <!-- Remove w:rStyle elements that refer to linked character styles -->
  <xsl:template match="w:rStyle[@w:val = /w:wordDocument/w:styles/w:style
                                [@w:type='character' and w:link]/@w:styleId]"/>

</xsl:stylesheet>

The first custom template rule (overriding the default copying behavior of identity.xsl) strips out all linked character styles. A linked character style definition is easily identified as a w:style element that has a w:type attribute whose value is character and that contains a w:link element:

<xsl:template match="w:style[@w:type='character' and w:link]"/>

In addition to stripping out all the linked character styles, we need to strip out otherwise dangling references to them. These occur in two places. First, we strip out the remaining w:link elements (inside linked paragraph style definitions):

<xsl:template match="w:link"/>

Then, we strip out all of the document's w:rStyle elements that refer to linked character styles:

<xsl:template match="w:rStyle[@w:val = /w:wordDocument/w:styles/w:style
                              [@w:type='character' and w:link]/@w:styleId]"/>

This pattern is a little more complex, but it is pretty straightforward when you break it down into its respective parts. If we were to translate this pattern into English, it would read something like this:

"Match all w:rStyle elements whose w:val attribute is equal to the w:styleId attribute of any w:style element that has both a w:link element and a w:type attribute equal to character."

The last part of this translation (beginning with the word "any") could be replaced with simply "any linked character style," thereby reducing the translation to:

"Match all w:rStyle elements whose w:val attribute is equal to the w:styleId attribute of any linked character style."

Since we know (from Chapter 2) that the w:styleId attribute is precisely what the w:rStyle element refers to in order to associate a run with a particular character style, we can finally reduce the translation to our top-level intent: "Match all references to linked character styles." When a matching w:rStyle element triggers the rule, nothing happens, thereby excluding the linked character style reference from the result.
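
The same intent can also be expressed with an XSLT key, which some readers may find easier to follow and which lets the processor index the linked character styles once. This is only an alternative sketch, not the approach used in Example 3-12; the key name linked-char-styles is an invented identifier.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:include href="identity.xsl"/>

  <!-- Index every linked character style by its style ID
       (the key name is hypothetical) -->
  <xsl:key name="linked-char-styles"
           match="w:style[@w:type='character' and w:link]"
           use="@w:styleId"/>

  <!-- Remove all linked character styles -->
  <xsl:template match="w:style[@w:type='character' and w:link]"/>

  <!-- Remove the w:link element from linked paragraph styles -->
  <xsl:template match="w:link"/>

  <!-- Remove w:rStyle elements whose w:val finds an entry in the key -->
  <xsl:template match="w:rStyle[key('linked-char-styles', @w:val)]"/>

</xsl:stylesheet>
```

The predicate is true whenever key() returns a non-empty node-set, that is, whenever the referenced style ID belongs to a linked character style.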

3.4.4 Adjusting Font Sizes

Word's style inheritance features can help reduce duplicate work when it comes to making global formatting changes to your document. For example, if you want to double the size of all fonts in your document, you may only need to update the "Normal" style, as long as all of your paragraph styles are based on the "Normal" style and do not explicitly override the font size they inherit. However, when that's not the case or when your document also contains direct formatting, such changes have to be made in multiple places, a tedious and error-prone process.

Once again, WordprocessingML and XSLT come to the rescue. The stylesheet in Example 3-13 adjusts the font sizes within a document (whether in style definitions or direct formatting) by multiplying them by a factor that you specify (through xsl:param).

Example 3-13. A stylesheet for adjusting font sizes, adjustFontSize.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:include href="identity.xsl"/>

  <xsl:param name="factor" select="2"/>

  <!-- Adjust all w:sz elements (in style definitions or direct formatting) -->
  <xsl:template match="w:sz">
    <w:sz w:val="{floor(@w:val * $factor)}"/>
  </xsl:template>

  <!-- Account for Word's application default font size (10 points)
       in underived paragraph styles when the w:sz element isn't present -->
  <xsl:template match="w:style[@w:type='paragraph' and
                               not(w:rPr/w:sz) and not(w:basedOn)]">
    <xsl:copy>
      <xsl:apply-templates select="@*|*[not(self::w:rPr)]"/>
      <w:rPr>
        <w:sz w:val="{floor(20 * $factor)}"/>
        <xsl:apply-templates select="w:rPr/*"/>
      </w:rPr>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

As with the other examples in this section, we include the identity.xsl stylesheet module, effecting the default copying behavior of the stylesheet:

<xsl:include href="identity.xsl"/>

The xsl:param element supplies a default factor of 2, so that the default behavior of the stylesheet (when no external parameters are supplied) is to double the font sizes:

<xsl:param name="factor" select="2"/>
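
Most XSLT processors let you pass a value for factor from outside the stylesheet, but how you do so depends on the processor. One processor-independent way to change the default, sketched here as an assumption-laden illustration, is a small wrapper stylesheet that imports adjustFontSize.xsl and declares the parameter again; the declaration with the higher import precedence wins. The filename enlargeByHalf.xsl is hypothetical.

```xml
<!-- enlargeByHalf.xsl (hypothetical filename, for illustration only) -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:import href="adjustFontSize.xsl"/>

  <!-- This declaration has higher import precedence than the one in
       adjustFontSize.xsl, so the default factor becomes 1.5 -->
  <xsl:param name="factor" select="1.5"/>

</xsl:stylesheet>
```

A value supplied externally at transformation time still overrides this default, just as it would override the original default of 2.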

The first template rule of the stylesheet matches all w:sz elements, whether they occur in a style definition or within a local w:rPr element. The value of the resulting font size is the previous size multiplied by the specified factor. The floor() function ensures that the result is an integer:

<xsl:template match="w:sz">
  <w:sz w:val="{floor(@w:val * $factor)}"/>
</xsl:template>
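
As a worked example of the arithmetic, consider an invented run property fragment with a size of 21 half-points (10.5 points) and a factor of 1.5 supplied as the parameter value. The product is 31.5, which the floor function rounds down to 31.

```xml
<!-- Before: 21 half-points, that is, 10.5 points -->
<w:rPr xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:sz w:val="21"/>
</w:rPr>

<!-- After, with factor = 1.5: floor(21 * 1.5) = floor(31.5) = 31 -->
<w:rPr xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:sz w:val="31"/>
</w:rPr>
```

With the default factor of 2, the same element would simply become a w:sz with a value of 42.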

Our work would be done at this point, if it weren't for one other scenario we need to handle: paragraph style definitions that do not contain a w:sz element and that are not based on (do not derive from) another style. In that case, what is the font size? The answer is: an application default, 10 points (as explained in Chapter 2). To handle that scenario, we use a template rule that matches w:style elements that meet these conditions:

  <xsl:template match="w:style[@w:type='paragraph' and
                               not(w:rPr/w:sz) and not(w:basedOn)]">

We make a shallow copy of the w:style element and then copy all of its attributes and element children, except for the w:rPr element:

    <xsl:copy>
      <xsl:apply-templates select="@*|*[not(self::w:rPr)]"/>

Then, we create the w:sz element, nested inside a new w:rPr element. Its value is the application default (10 points) expressed in hard-coded half-points (20), and multiplied by the specified factor, once again using the floor() function to ensure that the result is an integer:

      <w:rPr>
        <w:sz w:val="{floor(20 * $factor)}"/>

Finally, we copy any remaining child elements of the w:rPr element, if present in the source document's style definition:

        <xsl:apply-templates select="w:rPr/*"/>
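
Putting these steps together, the following sketch shows what the rule does to a style definition that has no w:sz element and no w:basedOn element. The style shown is an invented fragment rather than an excerpt from a real document, the factor is the default of 2, and the "after" fragment is shown with whitespace tidied for readability.

```xml
<!-- Before: no w:sz and no w:basedOn, so the 10-point default applies -->
<w:style xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
         w:type="paragraph" w:default="on" w:styleId="Normal">
  <w:name w:val="Normal"/>
  <w:rPr>
    <w:lang w:val="EN-US"/>
  </w:rPr>
</w:style>

<!-- After, with factor = 2: floor(20 * 2) = 40 half-points (20 points) -->
<w:style xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
         w:type="paragraph" w:default="on" w:styleId="Normal">
  <w:name w:val="Normal"/>
  <w:rPr>
    <w:sz w:val="40"/>
    <w:lang w:val="EN-US"/>
  </w:rPr>
</w:style>
```

The attributes and the w:name element are copied first, then the new w:rPr element is written with the explicit size followed by whatever run properties the style already had.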

Now let's take a look at adjustFontSize.xsl in action. Figure 3-7 shows an early draft of this book's Chapter 2 (Chapter2.xml), using the normal font sizes dictated by the O'Reilly Word template.

Figure 3-7. A draft of Chapter 2 before font size adjustment
figs/oxml_0307.gif


Figure 3-8 shows the result of applying adjustFontSize.xsl to Chapter2.xml, leaving the default factor of 2. As you can see, the font sizes have doubled across the board.

Figure 3-8. The result of applying adjustFontSize.xsl to Chapter2.xml
figs/oxml_0308.gif




Office 2003 XML
ISBN: 0596005385
EAN: 2147483647
Year: 2003
Pages: 135
