3.5 Converting Between WordprocessingML and Other Formats


While it can be very easy to translate simple custom XML formats into WordprocessingML (as we saw with Example 3-1), the reverse is not usually true, at least not when you're interested in preserving all aspects of a document's formatting. The sheer size and complexity of WordprocessingML make it a daunting task to write a generic stylesheet for converting WordprocessingML documents into some other format. For that reason, we won't attempt a generic conversion stylesheet in this section. We can, however, point to some existing work that's being done in this area.

3.5.1 HTML

During the beta program for Office 2003, Microsoft released an XSLT stylesheet for converting WordprocessingML documents to HTML. At just under 4,000 lines, this stylesheet is an impressive and enlightening look at processing Word documents in XML format. At the time of this writing, Microsoft has not yet released an updated version of the stylesheet. Fortunately, the beta stylesheet largely works with the final release, provided that you update a few of the top-level namespace declarations. You can find this stylesheet by searching for "wordml" at Microsoft's download center (http://www.microsoft.com/downloads/search.aspx). It's quite possible that an updated version of the stylesheet will be available by the time you read this.

3.5.2 PDF

Converting Word documents to PDF can, of course, be done using products like Adobe Acrobat Distiller. However, another possible way to perform this conversion is by way of XSL Formatting Objects (XSL-FO). Antenna House, Inc., maker of a premier XSL-FO processor, has released a commercial XSL stylesheet that does just that. For more information, including some interesting discussion of the problem and solution, see http://www.antennahouse.com/product/wordmltofo.htm.

3.5.3 OpenOffice.org

Since OpenOffice.org, the open source alternative to Microsoft Office, saves all of its files in an XML format, it only makes sense that there should be translations between WordprocessingML and the OpenOffice.org formats. Of course, this is easier said than done. While nothing significant has been released so far, this is listed on the OpenOffice.org web site as an open issue: "Develop support for Microsoft Office 2003 XML, i.e., WordprocessingML and SpreadsheetML."

3.5.4 DocBook

Just as Norm Walsh has created a suite of stylesheets for transforming DocBook to HTML and XSL-FO, it is only a matter of time before someone releases a stylesheet for converting DocBook to WordprocessingML. Since DocBook provides rich document structure and semantics, while WordprocessingML is concerned only with document formatting, such a conversion would be a "down-translation." Accordingly, it should not, in principle, be difficult.
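To make the idea of a down-translation concrete, here is a minimal sketch, written in Python rather than as a stylesheet. It assumes a namespace-free (DocBook 4 style) source that uses only title and para elements, and it assumes the target document defines a paragraph style named Heading1. The sample document and the function names are hypothetical, and inline markup such as emphasis is simply flattened to text.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2003
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)

# Hypothetical DocBook input, for illustration only
DOCBOOK_SAMPLE = (
    "<article><title>Release</title>"
    "<para>Hello <emphasis>world</emphasis>.</para></article>"
)


def w(tag):
    """Return the namespace-qualified name for a w: element or attribute."""
    return f"{{{W}}}{tag}"


def add_paragraph(body, text, style=None):
    """Append a w:p holding a single run; optionally set a paragraph style."""
    p = ET.SubElement(body, w("p"))
    if style:
        ppr = ET.SubElement(p, w("pPr"))
        ET.SubElement(ppr, w("pStyle"), {w("val"): style})
    run = ET.SubElement(p, w("r"))
    ET.SubElement(run, w("t")).text = text
    return p


def docbook_to_wordml(docbook_xml):
    """Flatten DocBook titles and paragraphs into a list of w:p elements."""
    src = ET.fromstring(docbook_xml)
    doc = ET.Element(w("wordDocument"))
    body = ET.SubElement(doc, w("body"))
    for node in src.iter():  # document order
        text = "".join(node.itertext()).strip()
        if node.tag == "title":
            # Assumption: the target document defines a "Heading1" style
            add_paragraph(body, text, style="Heading1")
        elif node.tag == "para":
            add_paragraph(body, text)
    return doc


print(ET.tostring(docbook_to_wordml(DOCBOOK_SAMPLE), encoding="unicode"))
```

The structure is thrown away on purpose: every title becomes a styled paragraph regardless of nesting depth. A fuller conversion would map the section depth to different heading styles and turn inline elements into separately formatted runs.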

Converting from WordprocessingML to DocBook, on the other hand, is a much less straightforward task. Certainly the wx:sub-section element (as described in Chapter 2) would be helpful for gleaning hierarchy from the Word document, but overall such a translation would have to be very special-purpose, akin to converting PDF to a meaningful XML format. Usually, such "up-translations" are one-time conversions that must rely on a variety of heuristics and guesswork.
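As a rough illustration of how wx:sub-section can help, the following Python sketch turns nested wx:sub-section elements into nested DocBook section elements. It relies on one heuristic, stated here as an assumption: the first paragraph of each sub-section is treated as its heading. The sample document, the article root, and the function names are hypothetical, and everything other than paragraphs (tables, lists, formatting) is ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"

P = f"{{{W}}}p"
BODY = f"{{{W}}}body"
SECT = f"{{{WX}}}sect"
SUB_SECTION = f"{{{WX}}}sub-section"

# Hypothetical input, trimmed to the elements this sketch looks at
WORDML_SAMPLE = f"""
<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body>
    <wx:sect>
      <wx:sub-section>
        <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
        <w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
        <wx:sub-section>
          <w:p><w:r><w:t>Details</w:t></w:r></w:p>
          <w:p><w:r><w:t>More.</w:t></w:r></w:p>
        </wx:sub-section>
      </wx:sub-section>
    </wx:sect>
  </w:body>
</w:wordDocument>
"""


def para_text(p):
    """Concatenate the text of every w:t inside a paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def convert(container, out, heading_first=False):
    """Copy paragraphs into `out`, opening a section per wx:sub-section."""
    for child in container:
        if child.tag == SUB_SECTION:
            convert(child, ET.SubElement(out, "section"), heading_first=True)
        elif child.tag == SECT:
            # wx:sect only groups content; pass straight through it
            convert(child, out, heading_first)
        elif child.tag == P:
            # Heuristic: the first paragraph of a sub-section is its heading
            name = "title" if heading_first else "para"
            ET.SubElement(out, name).text = para_text(child)
        heading_first = False


def wordml_to_docbook(wordml):
    body = ET.fromstring(wordml).find(BODY)
    article = ET.Element("article")
    convert(body, article)
    return article


print(ET.tostring(wordml_to_docbook(WORDML_SAMPLE), encoding="unicode"))
```

Even this small sketch shows why up-translation stays special-purpose: the hierarchy comes for free, but deciding what each paragraph means still depends on assumptions about how the document was authored.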

3.5.5 Special-Purpose Translations

While creating general-purpose, lossless translations of WordprocessingML into other formats is no doubt useful, there are plenty of use cases for creating special-purpose translations specific to particular classes of documents. For example, a set of documents created using the same template could be converted into a custom XML format. This could be done by translating certain parts of the document into custom XML elements in the result, or even by translating paragraph and character styles into custom XML elements. In fact, that's just what part of Chapter 4's primary example does. In the content of press release documents, individual w:p elements are translated to para elements in the result, and certain character styles within the paragraph are translated to custom XML elements in the result.
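The general shape of such a style-driven translation can be sketched in a few lines of Python. This is not the Chapter 4 example itself: the character style names (CompanyName, ProductName), the result element names (company, product, pressRelease), and the sample document are all hypothetical. Each w:p becomes a para, and each run whose w:rStyle appears in the mapping is wrapped in the corresponding custom element.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical mapping from character style names to result elements
STYLE_TO_ELEMENT = {"CompanyName": "company", "ProductName": "product"}

# Hypothetical input: one paragraph with two styled runs
WORDML_SAMPLE = f"""
<w:wordDocument xmlns:w="{W}"><w:body><w:p>
  <w:r><w:t>Today </w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="CompanyName"/></w:rPr><w:t>Acme</w:t></w:r>
  <w:r><w:t> announced </w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="ProductName"/></w:rPr><w:t>Widget</w:t></w:r>
  <w:r><w:t>.</w:t></w:r>
</w:p></w:body></w:wordDocument>
"""


def run_style(run):
    """Return the run's character style name, or None if it has none."""
    rstyle = run.find(f"{{{W}}}rPr/{{{W}}}rStyle")
    return rstyle.get(f"{{{W}}}val") if rstyle is not None else None


def run_text(run):
    return "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))


def append_text(parent, text):
    """Add plain text after the last child, or as leading text if none."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def translate_paragraph(p):
    para = ET.Element("para")
    # Only direct child runs are considered in this sketch
    for run in p.findall(f"{{{W}}}r"):
        name = STYLE_TO_ELEMENT.get(run_style(run))
        if name:
            ET.SubElement(para, name).text = run_text(run)
        else:
            append_text(para, run_text(run))
    return para


def translate_document(wordml):
    out = ET.Element("pressRelease")
    for p in ET.fromstring(wordml).iter(f"{{{W}}}p"):
        out.append(translate_paragraph(p))
    return out


print(ET.tostring(translate_document(WORDML_SAMPLE), encoding="unicode"))
```

Keeping the style-to-element mapping in a single table means the translation can be adapted to another template by editing the table rather than the traversal logic.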



Office 2003 XML
ISBN: 0596005385
Year: 2003
Pages: 135