Using XML with Microsoft Word


The feature set of Microsoft Word brings it almost within reach of the quality achieved by some desktop publishing applications. Therefore, it should come as no surprise that it has extensive XML support. Word can create XML documents either in WordML format or in a custom schema.

Saving Word documents as XML

The capability to save documents in WordprocessingML (or WordML) was added in Word 2003. This XML dialect attempts to capture much of the functionality of Word in XML. As you might expect, WordML can be rather verbose. As an example, the resume shown in Figure 25-5 is 39 KB in .doc format, but 44 KB in XML format. This may not seem like much, but keep in mind that this is a one-page document. For a more realistic example, Chapter 3 is 666 KB as a .doc file and 852 KB as XML, an increase of approximately 28 percent. Because the XML format attempts to represent the .doc format as accurately as possible, you can expect to find the formatting information within the document, as well as any revisions or other metadata. If you are familiar with Rich Text Format (RTF), the XML structure should look familiar because the format is essentially an XML rendition of RTF.

Figure 25-5
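
The size figures quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the function name percent_increase is made up for this illustration, and the only inputs are the file sizes given in the text.

```python
def percent_increase(old_kb: float, new_kb: float) -> float:
    """Return the growth from old_kb to new_kb as a percentage."""
    return (new_kb - old_kb) / old_kb * 100.0


# File sizes quoted in the text, in KB.
resume_growth = percent_increase(39, 44)     # one-page resume
chapter_growth = percent_increase(666, 852)  # Chapter 3

print(f"resume:  {resume_growth:.1f}%")   # prints "resume:  12.8%"
print(f"chapter: {chapter_growth:.1f}%")  # prints "chapter: 27.9%"
```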

Listing 25-6 shows a couple of fragments of this XML.

Listing 25-6: WordML

      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <?mso-application progid="Word.Document"?>
      <w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:v="urn:schemas-microsoft-com:vml"
        xmlns:w10="urn:schemas-microsoft-com:office:word"
        xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
        xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
        xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:dt="uuid:"
        w:macrosPresent="no"
        w:embeddedObjPresent="no"
        w:ocxPresent="no"
        xml:space="preserve">
        <o:DocumentProperties>
        </o:DocumentProperties>
        <o:CustomDocumentProperties>
        </o:CustomDocumentProperties>
        <w:fonts>
        </w:fonts>
        <w:lists>
        </w:lists>
        <w:styles>
        </w:styles>
        <w:shapeDefaults>
        </w:shapeDefaults>
        <w:docPr>
        </w:docPr>
        <w:body>
          <wx:sect>
            <w:tbl>
              <w:tblPr>
                <w:tblW w:w="8924" w:type="dxa"/>
                <w:tblLayout w:type="Fixed"/>
              </w:tblPr>
              <w:tblGrid>
                <w:gridCol w:w="446"/>
                <w:gridCol w:w="22"/>
                <w:gridCol w:w="6120"/>
                <w:gridCol w:w="180"/>
                <w:gridCol w:w="2156"/>
              </w:tblGrid>
              <w:tr>
                <w:tblPrEx>
                  <w:tblCellMar>
                    <w:top w:w="0" w:type="dxa"/>
                    <w:bottom w:w="0" w:type="dxa"/>
                  </w:tblCellMar>
                </w:tblPrEx>
                <w:trPr>
                  <w:cantSplit/>
                </w:trPr>
                <w:tc>
                  <w:tcPr>
                    <w:tcW w:w="8924" w:type="dxa"/>
                    <w:gridSpan w:val="5"/>
                    <w:tcBorders>
                      <w:top w:val="nil"/>
                      <w:left w:val="nil"/>
                      <w:bottom w:val="single" w:sz="4" wx:bdrwidth="10"
                        w:space="0" w:color="999999"/>
                      <w:right w:val="nil"/>
                    </w:tcBorders>
                  </w:tcPr>
                  <w:p>
                    <w:r>
                      <w:t>Foo deBar</w:t>
                    </w:r>
                  </w:p>
                  <w:p>
                    <w:r>
                      <w:t>123 Any Drive, Some Place, PA, 12345</w:t>
                    </w:r>
                  </w:p>
                  <w:p>
                    <w:r>
                      <w:t>+1 (111) 555-1212</w:t>
                    </w:r>
                  </w:p>
                  <w:p>
                    <w:pPr>
                      <w:pStyle w:val="E-mailaddress"/>
                    </w:pPr>
                    <w:r>
                      <w:t>foo@debar.com</w:t>
                    </w:r>
                  </w:p>
                </w:tc>
              </w:tr>
      ...
        </w:body>
      </w:wordDocument>

Notice that this format includes the processing instruction <?mso-application progid="Word.Document"?> at the beginning of the document. This identifies the ProgID, or program identifier, of the application that runs if this document is opened from the Desktop or via Internet Explorer. The ProgID is a value stored in the Windows Registry that points at the current executable for Word.
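
As an illustration of how a consuming program might detect this processing instruction, here is a minimal Python sketch using the standard xml.dom.minidom module. The function name find_progid and the three-line sample document are assumptions made for this example; they are not part of Word or of the listing above.

```python
import re
from xml.dom import minidom

# Tiny stand-in for a saved WordML file (illustrative only).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"/>
"""


def find_progid(xml_bytes):
    """Return the progid named by the mso-application instruction, or None."""
    doc = minidom.parseString(xml_bytes)
    # Processing instructions ahead of the root element are children of the document node.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "mso-application"):
            match = re.search(r'progid="([^"]*)"', node.data)
            return match.group(1) if match else None
    return None


print(find_progid(SAMPLE))  # prints "Word.Document"
```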

Next comes the rather lengthy collection of namespaces, the most important of which is http://schemas.microsoft.com/office/word/2003/wordml, used by the bulk of the elements. In addition, the namespace for Vector Markup Language (VML) is included. Any drawing elements included in the document are rendered using this namespace. Note that the namespaces are a mix of URL-style and URN-style namespaces. The URL-style namespaces are simply unique identifiers; they do not point at schema documents. Because a URL can suggest otherwise, the Office team tends to use URN-style namespaces, which do not imply the existence of a schema document.
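
To see the mix of namespace styles for yourself, you could list the declarations programmatically. The following Python sketch uses the start-ns events of xml.etree.ElementTree.iterparse; the function name classify_namespaces and the abbreviated sample root element are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Abbreviated root element carrying four of the declarations from Listing 25-6.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"/>"""


def classify_namespaces(xml_text):
    """Map each declared prefix to a (kind, uri) pair, where kind is URL, URN, or other."""
    result = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # Each "start-ns" event carries a (prefix, uri) tuple.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        if uri.startswith(("http://", "https://")):
            kind = "URL"
        elif uri.startswith("urn:"):
            kind = "URN"
        else:
            kind = "other"
        result[prefix] = (kind, uri)
    return result


for prefix, (kind, uri) in sorted(classify_namespaces(SAMPLE).items()):
    print(f"{prefix:3} {kind:5} {uri}")
```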

The bulk of the document is composed of the <w:body> element. This contains the text of the document, as well as pointers to the styles and explicit formatting stored elsewhere in the document. Each paragraph is denoted as a <w:p> element, such as the summary heading:

      <w:p>
        <w:pPr>
          <w:pStyle w:val="Heading1"/>
        </w:pPr>
        <w:r>
          <w:t>Summary</w:t>
        </w:r>
      </w:p>

The <w:t> element contains the actual text, whereas the <w:pStyle> is a pointer to an element in the <w:styles> section, where the Heading1 style is defined as:

      <w:style w:type="paragraph" w:styleId="Heading1">
        <w:name w:val="heading 1"/>
        <wx:uiName wx:val="Heading 1"/>
        <w:basedOn w:val="Normal"/>
        <w:next w:val="Normal"/>
        <w:rsid w:val="00DE7766"/>
        <w:pPr>
          <w:pStyle w:val="Heading1"/>
          <w:spacing w:before="80" w:after="60"/>
          <w:outlineLvl w:val="0"/>
        </w:pPr>
        <w:rPr>
          <wx:font wx:val="Tahoma"/>
          <w:caps/>
        </w:rPr>
      </w:style>
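
The pointer from a paragraph to its style definition can be followed in code. The sketch below is a Python illustration, not anything Word ships: the function name paragraphs_with_styles and the trimmed sample document are invented for this example, and it assumes that style definitions are keyed by a w:styleId attribute.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Trimmed-down document for illustration; w:styleId is assumed to be the style key.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="Heading1">
      <w:name w:val="heading 1"/>
      <w:basedOn w:val="Normal"/>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Summary</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Foo deBar</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs_with_styles(xml_text):
    """Return (text, style id, style display name) for every paragraph."""
    root = ET.fromstring(xml_text)

    # Index the style definitions by their identifier.
    style_names = {}
    for style in root.iterfind("w:styles/w:style", NS):
        name = style.find("w:name", NS)
        style_names[style.get(f"{{{W}}}styleId")] = (
            name.get(f"{{{W}}}val") if name is not None else None
        )

    rows = []
    for para in root.iterfind("w:body//w:p", NS):
        p_style = para.find("w:pPr/w:pStyle", NS)
        style_id = p_style.get(f"{{{W}}}val") if p_style is not None else None
        text = "".join(t.text or "" for t in para.iterfind("w:r/w:t", NS))
        rows.append((text, style_id, style_names.get(style_id)))
    return rows


for row in paragraphs_with_styles(SAMPLE):
    print(row)
```

Paragraphs without a <w:pStyle> element, such as the second one in the sample, come back with None for both style fields.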

In addition to saving in WordML, you can apply an XSLT transformation to the document when it is saved. This enables you to create a simplified or customized version of the document when needed. For example, applying the XSLT listed in Listing 25-7 results in the simple HTML document shown in Listing 25-8.

Listing 25-7: SimpleWord.xsl

      <?xml version="1.0" encoding="UTF-8"?>
      <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:fo="http://www.w3.org/1999/XSL/Format"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:xdt="http://www.w3.org/2005/xpath-datatypes"
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:o="urn:schemas-microsoft-com:office:office">
        <xsl:output encoding="UTF-8" standalone="omit" method="html" indent="yes"/>
        <xsl:template match="/">
          <html>
            <head>
              <title>
                <xsl:value-of select="/w:wordDocument/o:DocumentProperties/o:Title"/>
              </title>
            </head>
            <body>
              <xsl:apply-templates select="/w:wordDocument/w:body"/>
            </body>
          </html>
        </xsl:template>
        <xsl:template match="w:p">
          <div>
            <xsl:if test="exists(w:pPr)">
              <xsl:attribute name="class">
                <xsl:value-of select="w:pPr/w:pStyle/@w:val"/>
              </xsl:attribute>
            </xsl:if>
            <xsl:value-of select="w:r/w:t"/>
          </div>
        </xsl:template>
      </xsl:stylesheet>

The second template matches each <w:p> element in the document and converts it to a <div> tag in the resulting HTML. If the paragraph has a child <w:pPr> element, the name of its style is written to the class attribute of the <div>. Finally, the text of each paragraph is extracted and added to the <div>. The resulting HTML provides a simpler view of the document (see Listing 25-8).

Listing 25-8: Output of SimpleWord.xsl

      <?xml version="1.0" encoding="UTF-8"?>
      <html xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:fo="http://www.w3.org/1999/XSL/Format"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:xdt="http://www.w3.org/2005/xpath-datatypes"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <head>
          <title>Foo deBar</title>
        </head>
        <body>
          <div>Foo deBar</div>
          <div>123 Any Drive, Some Place, PA, 12345</div>
          <div>+1 (111) 555-1212</div>
          <div >foo@debar.com</div>
          <div >Summary</div>
          <div ></div>
          <div >More than 7 years programming and application development experience.</div>
          <div >Computer skills</div>
          <div ></div>
          <div >Languages</div>
          <div >Proficient in: Microsoft Visual C++ (r)  and C</div>
          <div >Familiar with: C#, Microsoft Visual Basic (r) , Java</div>
          <div >Software</div>
          <div >Database: Microsoft SQL Server and Microsoft Access</div>
          <div >Platforms: Microsoft Windows (r)  2000, Microsoft Windows XP</div>
          <div >Experience</div>
          <div></div>
          <div >Programmer Analyst</div>
          <div >1997-Present</div>
          <div></div>
          <div>Contoso Pharmaceuticals</div>
          <div >Primary responsibilities include design and development of server code.</div>
          <div >Developed and tested new financial reporting system using Visual Basic.</div>
          <div >Performed Y2K modifications on existing financial software.</div>
          <div></div>
          <div >Programmer Analyst</div>
          <div >1992-1997</div>
          <div></div>
          <div >Wide World Importers</div>
          <div >Developed online and batch test plans using Y2K critical test dates.</div>
          <div >Developed and tested the new inventory management system using C++.</div>
          <div >Modified and tested order processing system using C++.</div>
          <div></div>
          <div >Information System Specialist</div>
          <div >1990-1992</div>
          <div></div>
          <div>The Phone Company</div>
          <div >Provided object-oriented design, programming and implementation support to the customer billing system, written in C++.</div>
          <div >Prepared test plans and data, and user documentation for customer billing system.</div>
          <div >Problem-solved hardware issues with fault-tolerant hard drives.</div>
          <div >Education</div>
          <div></div>
          <div >Oak Tree University</div>
          <div >1989</div>
          <div></div>
          <div >Salt Lake City, Utah</div>
          <div >B.S., Computer Science</div>
          <div></div>
        </body>
      </html>
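
If no XSLT 2.0 processor is at hand, the logic of SimpleWord.xsl can be approximated in a general-purpose language. The following Python sketch is a rough equivalent written for this illustration, not a replacement for the stylesheet: the function name wordml_to_html and the small sample document are assumptions, and the output omits the namespace declarations that appear in Listing 25-8.

```python
import xml.etree.ElementTree as ET
from html import escape

W = "http://schemas.microsoft.com/office/word/2003/wordml"
O = "urn:schemas-microsoft-com:office:office"
NS = {"w": W, "o": O}

# Small stand-in for a saved WordML document (illustrative only).
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:o="{O}">
  <o:DocumentProperties><o:Title>Foo deBar</o:Title></o:DocumentProperties>
  <w:body>
    <w:p><w:r><w:t>Foo deBar</w:t></w:r></w:p>
    <w:p>
      <w:pPr><w:pStyle w:val="E-mailaddress"/></w:pPr>
      <w:r><w:t>foo@debar.com</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def wordml_to_html(xml_text):
    """Emit one <div> per <w:p>, mirroring the two templates in SimpleWord.xsl."""
    root = ET.fromstring(xml_text)
    title = root.findtext("o:DocumentProperties/o:Title", default="", namespaces=NS)
    lines = ["<html>", "  <head>", f"    <title>{escape(title)}</title>",
             "  </head>", "  <body>"]
    for para in root.iterfind("w:body//w:p", NS):
        attrs = ""
        ppr = para.find("w:pPr", NS)
        if ppr is not None:
            # Like the xsl:if, add a class attribute whenever w:pPr exists.
            p_style = ppr.find("w:pStyle", NS)
            style_id = p_style.get(f"{{{W}}}val", "") if p_style is not None else ""
            attrs = f' class="{escape(style_id)}"'
        # XSLT 2.0 xsl:value-of joins multiple selected items with a single space.
        text = " ".join(t.text or "" for t in para.iterfind("w:r/w:t", NS))
        lines.append(f"    <div{attrs}>{escape(text)}</div>")
    lines += ["  </body>", "</html>"]
    return "\n".join(lines)


print(wordml_to_html(SAMPLE))
```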

In addition to generating this simple document, you could also extract any tables or graphics used by the document or use the style definitions to create a CSS stylesheet.

Editing XML documents

Just as with Excel, you can use Word to edit XML documents. Also as with Excel, you must first create a mapping between the XML data and the document. With Word, you add one or more XML schemas to the document, and Word uses them to validate the contents of the document. This can be an invaluable resource when using Word to create highly structured documents.

Listing 25-9 shows an XML schema for a simple resume format (for a more full-featured resume schema, see the HR-XML version listed in the resources). The schema contains sections for contact information, experience, and education.

Listing 25-9: A simple resume schema

      <?xml version="1.0" encoding="UTF-8"?>
      <xs:schema xmlns="http://www.example.com/resume-simple"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.example.com/resume-simple"
        elementFormDefault="qualified"
        attributeFormDefault="unqualified" version="1.0">
        <xs:element name="resume">
          <xs:annotation>
            <xs:documentation>Simple resume schema</xs:documentation>
          </xs:annotation>
          <xs:complexType mixed="true">
            <xs:sequence>
              <xs:element name="name" type="nameType"/>
              <xs:element name="address" type="addressType"/>
              <xs:element name="objectives" type="xs:string"/>
              <xs:element name="experience" type="experienceType" maxOccurs="unbounded"/>
              <xs:element name="education" type="educationType" maxOccurs="unbounded"/>
              <xs:element name="interests" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:complexType name="nameType">
          <xs:sequence>
            <xs:element name="firstName" type="xs:string"/>
            <xs:element name="lastName" type="xs:string"/>
            <xs:element name="middleInitials" type="xs:string" minOccurs="0"/>
          </xs:sequence>
        </xs:complexType>
        <xs:complexType name="addressType">
          <xs:sequence>
            <xs:element name="street" type="xs:string"/>
            <xs:element name="city" type="xs:string"/>
            <xs:element name="region" type="regionType"/>
            <xs:element name="postalCode" type="pcodeType"/>
          </xs:sequence>
        </xs:complexType>
        <xs:simpleType name="regionType">
          <xs:restriction base="xs:string">
            <xs:length value="2"/>
          </xs:restriction>
        </xs:simpleType>
        <xs:simpleType name="pcodeType">
          <xs:restriction base="xs:string">
            <xs:minLength value="5"/>
          </xs:restriction>
        </xs:simpleType>
        <xs:complexType name="experienceType">
          <xs:sequence>
            <xs:element name="yearFrom" type="xs:int"/>
            <xs:element name="yearTo" type="xs:int"/>
            <xs:element name="company" type="xs:string"/>
            <xs:element name="position" type="xs:string"/>
            <xs:element name="description" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
        <xs:complexType name="educationType">
          <xs:sequence>
            <xs:element name="yearFrom" type="xs:int"/>
            <xs:element name="yearTo" type="xs:int"/>
            <xs:element name="institution" type="xs:string"/>
            <xs:element name="degree" type="xs:string"/>
            <xs:element name="description" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:schema>

You add this schema to a Word document or template using the XML Schema tab of the Tools, Templates and Add-ins dialog (see Figure 25-6). If you have a number of related schemas, you can create a Schema Library to work with them together.

Figure 25-6

As when you use Excel, the next step is to mark up the document to identify the regions that will be populated with the XML data. Figure 25-7 shows a document with the mapping visible. The element markers can be hidden if they are disruptive. However, showing the markers can increase the likelihood that the fields will be filled in correctly. Alternatively, if you were creating a document template for producing XML data, you would probably add fields within the elements, protect the document, and hide the element markers.

While you edit the document, Word validates it against the schema. In Figure 25-7, you can see that a validation error is currently active. This is shown by the purple squiggly line at the side of the elements that have errors. In addition, these elements are highlighted in the XML Structure sidebar. Hovering over the item in either the sidebar or the main document reveals the error.

Figure 25-7
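
To make the kind of error Word reports more concrete, the sketch below hand-checks a few of the constraints from the resume schema: the two-character regionType, the five-character minimum of pcodeType, and the xs:int year fields. It is written in Python and only approximates schema validation. It is not how Word validates, the function name check_resume and both sample documents are invented for this illustration, and a real XSD validator checks far more (element order, required elements, the 32-bit range of xs:int).

```python
import re
import xml.etree.ElementTree as ET

NS = {"r": "http://www.example.com/resume-simple"}
INT_PATTERN = re.compile(r"[+-]?[0-9]+")  # simplified lexical form of xs:int

GOOD = """<resume xmlns="http://www.example.com/resume-simple">
  <address><region>MA</region><postalCode>12345</postalCode></address>
  <experience><yearFrom>1990</yearFrom><yearTo>1994</yearTo></experience>
</resume>"""

BAD = """<resume xmlns="http://www.example.com/resume-simple">
  <address><region>Mass</region><postalCode>123</postalCode></address>
  <experience><yearFrom>1990</yearFrom><yearTo>now</yearTo></experience>
</resume>"""


def check_resume(xml_text):
    """Return a list of messages for the hand-checked constraints that fail."""
    root = ET.fromstring(xml_text)
    problems = []

    region = root.findtext("r:address/r:region", default="", namespaces=NS)
    if len(region) != 2:  # regionType: xs:length value="2"
        problems.append(f"region {region!r}: length must be exactly 2")

    postal = root.findtext("r:address/r:postalCode", default="", namespaces=NS)
    if len(postal) < 5:  # pcodeType: xs:minLength value="5"
        problems.append(f"postalCode {postal!r}: length must be at least 5")

    for section in ("experience", "education"):
        for entry in root.iterfind(f"r:{section}", NS):
            for tag in ("yearFrom", "yearTo"):
                value = entry.findtext(f"r:{tag}", default="", namespaces=NS).strip()
                if not INT_PATTERN.fullmatch(value):
                    problems.append(f"{section}/{tag} {value!r}: not an integer")
    return problems


print(check_resume(GOOD))  # prints []
for message in check_resume(BAD):
    print(message)
```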

After the document is completed and passes validation, you can save it as a complete document in WordprocessingML or save only the data. Listing 25-10 shows the data saved from the preceding document. Notice that it includes no reference back to Word; only the data identified by the schema is present.

Listing 25-10: Resume data as XML

      <?xml version="1.0" encoding="UTF-8" standalone="no"?>
      <resume xmlns="http://www.example.com/resume-simple">
        <name>
          <firstName>Deborah</firstName>
          <lastName>Greer</lastName>
        </name>
        <address>
          <street>1337 42nd Avenue</street>
          <city>Blahford</city>
          <region>MA</region>
          <postalCode>12345</postalCode>
        </address>
        <objectives>Develop with XML, change the world
          one angle bracket at a time.</objectives>
        <experience>
          <yearFrom>1990</yearFrom>
          <yearTo>1994</yearTo>
          <company>Arbor Shoes</company>
          <position>National Sales Manager</position>
          <description>Increased sales from $50 million to $100 million.
            Doubled sales per representative
            from $5 million to $10 million.
            Suggested new products that increased earnings by 23%.</description>
        </experience>
        <education>
          <yearFrom>1971</yearFrom>
          <yearTo>1975</yearTo>
          <institution>South Ridge State University</institution>
          <degree>B.A., Business Administration and Computer Science.</degree>
          <description>Graduated summa cum laude.</description>
        </education>
        <interests>South Ridge Board of Directors, running,
          gardening, carpentry, computers.</interests>
      </resume>
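
Because the saved data carries only the schema's namespace, any XML-aware program can consume it without knowing that Word produced it. As a minimal sketch, the following Python code reads a few fields from an abbreviated copy of the resume data; the function name summarize and the shape of the dictionary it returns are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

NS = {"r": "http://www.example.com/resume-simple"}

# Abbreviated copy of the resume data from Listing 25-10.
SAMPLE = """<resume xmlns="http://www.example.com/resume-simple">
  <name>
    <firstName>Deborah</firstName>
    <lastName>Greer</lastName>
  </name>
  <experience>
    <yearFrom>1990</yearFrom>
    <yearTo>1994</yearTo>
    <company>Arbor Shoes</company>
  </experience>
</resume>"""


def summarize(xml_text):
    """Pull the person's name and a (from, to, company) tuple per job."""
    root = ET.fromstring(xml_text)
    first = root.findtext("r:name/r:firstName", default="", namespaces=NS)
    last = root.findtext("r:name/r:lastName", default="", namespaces=NS)
    jobs = [
        (
            int(job.findtext("r:yearFrom", namespaces=NS)),
            int(job.findtext("r:yearTo", namespaces=NS)),
            job.findtext("r:company", namespaces=NS),
        )
        for job in root.iterfind("r:experience", NS)
    ]
    return {"name": f"{first} {last}", "jobs": jobs}


print(summarize(SAMPLE))
```

Note that every element name in the lookups is prefixed, because the default namespace declared on <resume> applies to all of its descendants.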

Users in most workplaces have at least a passing knowledge of Microsoft Word. Therefore, it makes sense to use it for manipulating XML documents. The capability to save as XML, optionally with a transformation, means that you can use Word to generate simple XML formats. In addition, the XML editing feature extends the powerful forms capabilities of Word to generate valid XML documents. You might use this as the front end to a Web service, for example, using Word to generate the payload for the request.
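
As a sketch of the Web service scenario, the data saved from Word could be wrapped in an HTTP request along the following lines. This Python example only builds the request object and does not send it; the endpoint URL, the function name build_request, and the choice of a plain POST with an application/xml body are all assumptions, because the text does not describe a particular service.

```python
import urllib.request

# Hypothetical endpoint; substitute the address of a real service.
ENDPOINT = "http://www.example.com/resume-service"


def build_request(xml_bytes, url=ENDPOINT):
    """Wrap the XML saved from Word in a POST request (not sent here)."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


payload = b'<resume xmlns="http://www.example.com/resume-simple"/>'
request = build_request(payload)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would transmit it; a SOAP-based service would additionally require the payload to be wrapped in an envelope.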




Professional XML
Professional XML (Programmer to Programmer)
ISBN: 0471777773
Year: 2004
Pages: 215
