3.1 Endless Possibilities


We have to be careful when talking about use cases for generating and processing Word documents. By defining categories too strictly, we might ignore possibilities that others have explored or have yet to explore. The purpose of this chapter is to open your mind to what's possible now that the information locked in the world's Word documents can be exposed as XML. The categories and examples in this chapter are only the tip of the iceberg. Perhaps they will help trigger some of your own ideas and creativity. As you read along, if you think of an example that we failed to cover, then we have succeeded in our goal!

That said, you can break down the scripts in this chapter into three basic categories:

  • Input is WordprocessingML

  • Output is WordprocessingML

  • Both input and output are WordprocessingML

We'll cover examples of each of these under the general activities of creating, extracting, modifying, and converting. Creation produces WordprocessingML as output; extraction takes WordprocessingML as input; modification both takes WordprocessingML as input and produces it as output; and conversion either takes WordprocessingML as input or produces it as output.
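To make the three categories concrete, here is a minimal sketch in Python rather than the XSLT used for the chapter's examples. It assumes the Word 2003 WordprocessingML namespace and the basic w:wordDocument, w:body, w:p, w:r, and w:t elements; the function names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003's WordprocessingML vocabulary.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return a tag name qualified with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def create_document(paragraphs):
    """Creation: output is WordprocessingML (one run per paragraph)."""
    root = ET.Element(w("wordDocument"))
    body = ET.SubElement(root, w("body"))
    for text in paragraphs:
        p = ET.SubElement(body, w("p"))
        r = ET.SubElement(p, w("r"))
        ET.SubElement(r, w("t")).text = text
    return root


def extract_paragraphs(root):
    """Extraction: input is WordprocessingML; returns each paragraph's text."""
    return [
        "".join(t.text or "" for t in p.iter(w("t")))
        for p in root.iter(w("p"))
    ]


def modify_uppercase(root):
    """Modification: both input and output are WordprocessingML."""
    for t in root.iter(w("t")):
        if t.text:
            t.text = t.text.upper()
    return root


doc = create_document(["Hello, Word.", "Second paragraph."])
print(extract_paragraphs(doc))
print(ET.tostring(modify_uppercase(doc), encoding="unicode"))
```

Conversion would pair one of these functions with a reader or writer for some other format, such as plain text or HTML.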

Command-Line Tools

To execute the example stylesheets in this chapter, you'll need an XSLT processor. The Office 2003 Professional and standalone editions of Word 2003 come with an XSLT processor built in (for onload and onsave stylesheets, as introduced in Chapter 4), but the examples in this chapter assume you will be invoking them outside of Word, for example, with a command-line processor. You can read about and download one such utility, msxsl.exe, at this URL: http://msdn.microsoft.com/library/en-us/dnxml/html/msxsl.asp.

The libxml project (hosted at http://www.xmlsoft.org) houses some quite useful command-line utilities for XML processing. I personally use Cygwin (a Linux-like environment for Windows; see http://www.cygwin.com) and the Cygwin distribution of the libxml tools. But there are also native Windows binaries for each of the libxml tools, available at http://www.zlatkovic.com/libxml.en.html. One particularly convenient tool in the libxml suite is the xmllint command. Its --format option, which inputs an XML document and outputs a pretty-printed version of it (adding line breaks and indentation), is an excellent tool for learning WordprocessingML and for helping to author stylesheets that create Word documents. It was also instrumental in preparing many of the code examples in this book.
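If xmllint is not at hand, the effect of pretty-printing can be approximated with Python's standard library. The sketch below is not xmllint and does not reproduce its exact output; it assumes Python 3.9 or later (for ET.indent), and the pretty_print name and sample document are made up for illustration. Unlike xmllint, the default parser used here drops comments and processing instructions, so treat it as a reading aid rather than a round-trip tool.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W_NS)  # keep the conventional "w" prefix on output


def pretty_print(xml_text):
    """Return an indented copy of an XML string (requires Python 3.9+)."""
    root = ET.fromstring(xml_text)
    # indent() adds whitespace only between child elements; the text of
    # leaf elements such as w:t is left unchanged.
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


compact = (
    f'<w:wordDocument xmlns:w="{W_NS}"><w:body><w:p><w:r>'
    "<w:t>Hello</w:t></w:r></w:p></w:body></w:wordDocument>"
)
print(pretty_print(compact))
```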

The related libxslt project provides an XSLT processor, with a command-line tool called xsltproc. Other freely available XSLT processors you may want to try out include Saxon (http://saxon.sourceforge.net) and Xalan (http://xml.apache.org/xalan-j/), both of which are Java-based processors.
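The command-line processors differ in argument order, which is easy to get wrong when switching between them. The following sketch wraps msxsl and xsltproc behind one hypothetical helper; the file names are placeholders, and the helper assumes the chosen processor is installed and on your PATH.

```python
import shutil
import subprocess


def build_command(processor, source, stylesheet, output):
    """Return the argument list for a command-line XSLT transformation."""
    if processor == "xsltproc":
        # xsltproc expects the stylesheet before the source document.
        return ["xsltproc", "-o", output, stylesheet, source]
    if processor == "msxsl":
        # msxsl expects the source document before the stylesheet.
        return ["msxsl", source, stylesheet, "-o", output]
    raise ValueError(f"unsupported processor: {processor}")


def transform(processor, source, stylesheet, output):
    """Run the transformation, or raise if the processor is not installed."""
    command = build_command(processor, source, stylesheet, output)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    subprocess.run(command, check=True)


# Hypothetical file names, shown only to illustrate the argument order.
print(build_command("xsltproc", "report.xml", "extract.xsl", "out.xml"))
print(build_command("msxsl", "report.xml", "extract.xsl", "out.xml"))
```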

WARNING: If you process or create WordprocessingML documents using XML tools that output line endings using a linefeed character (LF) rather than a carriage return and linefeed pair (CRLF), and if your documents contain Base64-encoded data such as VBA macros or embedded images, then you will need to convert the line endings to CRLF before opening the document in Word. Otherwise, Word will not be able to open the document correctly, even though it is well-formed XML. This is arguably a bug in Word's XML processing behavior, but it can be explained by the fact that the Base64 specification requires that individual lines end with a CRLF sequence in the canonical Base64 format. Fortunately, there are easy workarounds. For example, in a Unix or Cygwin environment, you can run the unix2dos command on your file, converting each instance of the LF character to a CRLF sequence.
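Where unix2dos is not available, the same conversion takes only a few lines of Python. This is a minimal sketch with hypothetical function names; it works on bytes so that no newline translation happens behind the scenes, and it leaves existing CRLF pairs untouched so that it is safe to run twice.

```python
import re


def to_crlf(data: bytes) -> bytes:
    """Convert bare LF line endings to CRLF, leaving existing CRLF pairs alone."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


def convert_file(path):
    """Rewrite a file in place with CRLF line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(to_crlf(data))


print(to_crlf(b"line one\nline two\r\nline three\n"))
# prints b'line one\r\nline two\r\nline three\r\n'
```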




Office 2003 XML
ISBN: 0596005385
Year: 2003
Pages: 135