5.1. Word Markup Language (WordML)

The Word Markup Language (WordML) is the native XML representation for Microsoft Word. It captures everything that might be known about a Word document. It covers not just the text of the document itself, but also all the formatting, all the styles associated with that document (whether they are used or not), and all of the various settings (such as page margins and tabs). Since it covers so many things, it is very verbose, and it is somewhat difficult to understand just by reading it.

Nevertheless, WordML has a significant benefit over the equivalent .doc binary format of Word documents: Any tool that can parse XML can make use of the Word document. This includes tools that transform, display, search, validate, store, index and query XML documents.

As Office 2003 increases in popularity, we expect third-party tools to be released that will use WordML to process Word documents in new ways and to generate Word documents from other data sources.

Caution

Because WordML is a native Word document representation, Word treats it quite differently from other uses of XML. To avoid the constant interjection of "except for WordML", we normally do not include WordML when we discuss Word's treatment of XML documents. If we do mean to include it, that will be clear from the context.




5.1.1 The WordML vocabulary

WordML is a large, complex vocabulary with over 400 different element types. Fortunately, in order to create, or even parse, WordML documents, you only need to be familiar with a small fraction of the vocabulary.[1] In fact, the first WordML document you write can be quite small and simple. It is shown in Example 5-1.

[1] A reference guide that covers the entire WordML vocabulary is included with the Microsoft Word XML Content Development Kit that can be downloaded from the MSDN library at: http://msdn.microsoft.com

Example 5-1. Your first WordML document (minimal WordML.xml)
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <?mso-application progid="Word.Document"?>
 <w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
   <w:body>
     <w:p>
       <w:r><w:t>hello, Word</w:t></w:r>
     </w:p>
   </w:body>
 </w:wordDocument>
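A document this small is also easy to generate from a program. The following is a minimal sketch in Python, which is our choice for illustration and not something Word requires. It assumes only the standard xml.etree.ElementTree module; the helper names w and minimal_wordml are hypothetical. Because ElementTree does not write processing instructions ahead of the root element, the sketch prepends the two header lines as plain text.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W_NS)  # serialize with the customary w prefix


def w(tag):
    """Return a WordML element name in the {namespace}local form ElementTree uses."""
    return f"{{{W_NS}}}{tag}"


def minimal_wordml(text):
    """Build wordDocument/body/p/r/t around one string, as in Example 5-1."""
    root = ET.Element(w("wordDocument"))
    body = ET.SubElement(root, w("body"))
    para = ET.SubElement(body, w("p"))
    run = ET.SubElement(para, w("r"))
    ET.SubElement(run, w("t")).text = text
    # The XML declaration and the mso-application processing instruction
    # are added as literal text in front of the serialized root element.
    header = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
              '<?mso-application progid="Word.Document"?>\n')
    return header + ET.tostring(root, encoding="unicode")


doc = minimal_wordml("hello, Word")
print(doc)
```

Writing the returned string to a file with an .xml extension, encoded as UTF-8, should give a file equivalent to Example 5-1 apart from whitespace.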

5.1.2 Saving a Word document as WordML

Recall Doug's article for Worldwide Widget's newsletter. It started life as an ordinary Word document. We repeat it here in Figure 5-1 for your convenience.

Figure 5-1. Doug's article (article.doc)




The default format when you save a Word document is still the binary .doc format. However, if you choose to save a document as XML and Word cannot associate that document with a schema, it will be saved as WordML.[2]

[2] We saw in 4.6, "Saving a document", on page 79 how to save a document that is associated with a schema, using that schema alone. Later we will see how to save it using a combination of its own schema and WordML.

Let's save Doug's article as WordML and see what we get. To do so:

1. On the File menu, click Save As.

2. Select XML Document (*.xml) from the Save as type list.

3. Click Save.

We'll look at the actual WordML representation, as a Word rendition would be identical to Figure 5-1. Because the WordML document is extremely long, we will excerpt pieces as examples as we go along.

5.1.3 Structure of a WordML document

The basic structure of a WordML document is shown in Model 5-1.

Model 5-1. WordML document structure
 Document (wordDocument)
   [0..1] Document Properties -- General (DocumentProperties)
   [0..1] Lists (lists)
   [0..1] Styles (styles)
   [0..1] Document Properties -- Word-specific (docPr)
   [1..1] Body (body)

The root of a WordML document is always a wordDocument element. The most commonly used children of a wordDocument element are:

  • an optional DocumentProperties element, which contains general information about the document such as the date it was created and last updated, the author name, and the revision number

  • an optional lists element, which contains information about the formatting of lists, such as the type of bullet or number, and the indentation used

  • an optional styles element, which contains the information about the styles used in the document, such as the font and size, language, and paragraph formatting

  • an optional docPr element, which contains Word-specific information on the settings for the document, such as margins and header and footer properties

  • a required body element that contains the bulk of the document

As you can see, most of these elements can be left out. If you omit an optional element, it defaults to the settings for new documents in Word.
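To see which of these children a particular document actually contains, a short script is enough. The sketch below is in Python with the standard xml.etree.ElementTree module, both of which are our assumptions; the function name top_level_summary and the sample document are hypothetical. It reports the local names of the root's children and counts the body elements, of which there should be exactly one.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def top_level_summary(xml_text):
    """Return (local names of the root's children, number of WordML body children)."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{W_NS}}}wordDocument":
        raise ValueError("root element is not a WordML wordDocument")
    # Strip the {namespace} part so the names read like those in Model 5-1.
    names = [child.tag.split("}")[-1] for child in root]
    body_count = sum(1 for child in root if child.tag == f"{{{W_NS}}}body")
    return names, body_count


# A hypothetical document with only two of the five children present.
SAMPLE = f"""<w:wordDocument xmlns:w="{W_NS}"
    xmlns:o="urn:schemas-microsoft-com:office:office">
  <o:DocumentProperties><o:Title>Sales Update</o:Title></o:DocumentProperties>
  <w:body><w:p><w:r><w:t>hello, Word</w:t></w:r></w:p></w:body>
</w:wordDocument>"""

names, body_count = top_level_summary(SAMPLE)
print(names, body_count)
```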

5.1.4 In the beginning

Example 5-2 shows the very beginning of the WordML document.[3]

[3] Some whitespace was added to all examples to make them more readable.

Example 5-2. Beginning of WordML document (article WordML.xml)
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <?mso-application progid="Word.Document"?>
 <w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:w10="urn:schemas-microsoft-com:office:word"
  xmlns:SL="http://schemas.microsoft.com/schemaLibrary/2003/core"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xml:space="preserve">
   <o:DocumentProperties>
     <o:Title>Heading 1</o:Title>
     <o:Author>Priscilla Walmsley</o:Author>
     <o:LastAuthor>Priscilla Walmsley</o:LastAuthor>
     <o:Revision>2</o:Revision>

line 1

The document starts out on line 1 with an XML declaration, which identifies the document as XML and indicates the encoding used in the document.

line 2

On line 2, a processing instruction appears which identifies the document as a Word document. The purpose of this processing instruction is to tell Windows to open this file in Word, rather than in Internet Explorer, which is often the application associated with the .xml extension.

line 3

The root element is w:wordDocument, whose start-tag has a number of namespace declarations.

line 4

The namespace of the WordML vocabulary is http://schemas.microsoft.com/office/word/2003/wordml. This namespace is commonly mapped to the w prefix, although there is no requirement that this prefix be used.

line 13

The first child of w:wordDocument is an o:DocumentProperties element that contains general information about the document. It is followed by a huge number of elements representing style information, which is not shown.
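The remark about the w prefix has a practical consequence for anyone who processes WordML: code should match elements by namespace name, not by prefix. The following Python sketch, which assumes the standard xml.etree.ElementTree module and uses hypothetical helper names, finds the same t element in two documents that differ only in the prefix they declare.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def sample(prefix):
    """Build a tiny WordML document that uses the given namespace prefix."""
    return (f'<{prefix}:wordDocument xmlns:{prefix}="{W_NS}">'
            f'<{prefix}:body><{prefix}:p><{prefix}:r>'
            f'<{prefix}:t>hello, Word</{prefix}:t>'
            f'</{prefix}:r></{prefix}:p></{prefix}:body>'
            f'</{prefix}:wordDocument>')


def first_text(xml_text):
    """Return the content of the first t element, matched by namespace name."""
    root = ET.fromstring(xml_text)
    node = root.find(f".//{{{W_NS}}}t")
    return None if node is None else node.text


print(first_text(sample("w")))
print(first_text(sample("word")))
```

Both calls find the text, because the parser resolves each prefix to the same namespace name before any matching takes place.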

5.1.5 The body

The body of the WordML document, represented by the body element, contains all the text of the document. Its structure is shown in Model 5-2.

The body can contain sections that contain paragraphs, or it can contain paragraphs directly. Paragraphs, in turn, contain text runs, which contain text elements, which contain data characters. There is a separate text run for every data character string that has a distinct style or other properties. A paragraph can also contain images, hyperlinks and other components.

Model 5-2. WordML body structure
 Body (body)
   [0..*] Section (sect)
          [0..*] Paragraph (p)
                 [0..*] Text Run (r)
                        [0..*] Text (t)
   [0..*] Paragraph (p)
       ...

5.1.5.1 Paragraphs and text

Each paragraph is represented by a p element. The paragraph has a style (and possibly other settings) associated with it in its properties child, pPr. If no style is associated with the paragraph, it defaults to "Normal" style.

A text run (r) can contain multiple text elements, as well as pictures, footnotes, fields and other Word objects. A text element (t), on the other hand, can only contain data characters, with no child elements. Every data character in the document text is contained directly in a t element.

An excerpt from the body of the WordML representation of Figure 5-1 is shown in Example 5-3. It contains two paragraphs (p elements). The first paragraph has a pPr child that identifies properties of the paragraph, namely that the style is "Heading2". It then contains a text run (r element), which contains a single text element (t).

Example 5-3. WordML paragraphs (article WordML.xml)
 <w:p>
   <w:pPr>
     <w:pStyle w:val="Heading2"/>
   </w:pPr>
   <w:r><w:t>A great month!</w:t></w:r>
 </w:p>
 <w:p>
   <w:r><w:t>This month's figures are a </w:t></w:r>
   <w:r>
     <w:rPr>
       <w:i/>
     </w:rPr>
     <w:t>huge</w:t>
   </w:r>
   <w:r>
     <w:t> improvement over this month last year. We sold 1,342 widgets for a total revenue of $14,327.</w:t>
   </w:r>
 </w:p>

The second paragraph contains three text runs (w:r elements). As the word "huge" is in italics, it must have its own text run with its own properties (the w:rPr element) that specify the italics (the w:i element).
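Because every data character sits in a t element inside a run, recovering the plain text of a paragraph is a matter of concatenating its runs. The Python sketch below assumes the standard xml.etree.ElementTree module; the function name paragraph_runs and the shortened sample are hypothetical. It also reports which runs carry an i element in their rPr. As a simplification, it treats any i element as italics switched on and ignores attributes the element might carry.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}


def paragraph_runs(p):
    """Return (text, is_italic) for each text run that is a child of paragraph p."""
    runs = []
    for r in p.findall("w:r", NS):
        # A run may hold several t elements; join them in document order.
        text = "".join(t.text or "" for t in r.findall("w:t", NS))
        italic = r.find("w:rPr/w:i", NS) is not None
        runs.append((text, italic))
    return runs


# A shortened version of the second paragraph of Example 5-3.
SAMPLE = f"""<w:body xmlns:w="{W_NS}"><w:p>
  <w:r><w:t>This month's figures are a </w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>huge</w:t></w:r>
  <w:r><w:t> improvement.</w:t></w:r>
</w:p></w:body>"""

body = ET.fromstring(SAMPLE)
runs = paragraph_runs(body.find("w:p", NS))
plain_text = "".join(text for text, _ in runs)
italic_words = [text for text, italic in runs if italic]
print(plain_text)
print(italic_words)
```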

5.1.5.2 Lists

Bulleted and numbered lists are common in Word documents. In WordML, list items are simply paragraphs that refer to a list ID in their properties. The list ID corresponds to a list defined in the lists section of the document.

For example, suppose Doug wanted to list the identifying elements of his article in a bulleted list, as shown in Figure 5-2. The corresponding WordML would look like Example 5-4.

Figure 5-2. List in Word




Example 5-4. WordML list
 <w:p>
   <w:pPr>
     <w:listPr><w:ilvl w:val="0"/><w:ilfo w:val="2"/></w:listPr>
   </w:pPr>
   <w:r><w:t>Title: Sales Update</w:t></w:r>
 </w:p>
 <w:p>
   <w:pPr>
     <w:listPr><w:ilvl w:val="0"/><w:ilfo w:val="2"/></w:listPr>
   </w:pPr>
   <w:r><w:t>Author: Doug Jones</w:t></w:r>
 </w:p>
 <w:p>
   <w:pPr>
     <w:listPr><w:ilvl w:val="0"/><w:ilfo w:val="2"/></w:listPr>
   </w:pPr>
   <w:r><w:t>Date: February 3, 2004</w:t></w:r>
 </w:p>

Each paragraph properties (pPr) element contains a list properties (listPr) element which in turn has two children:

  • The ilvl element indicates the level of the item in the list, starting with zero. If a list contains items at different outline levels, this property indicates this.

  • The ilfo element associates the paragraph with a specific list. The number specified in its val attribute is an ID that corresponds to the ilfo attribute of a list element in the lists section.

The lists element of the same document appears in Example 5-5. Notice that it has two types of children. The listDef element defines various properties of the list, such as the style used and a unique identifier. The list element has only a unique identifier and the link to a listDef element through its ilst child. The many levels of definitions for lists are due to the complexity of starting and stopping the numbering for numbered lists.

Example 5-5. The WordML lists element
 <w:lists>
   <w:listDef w:listDefId="0">
     <w:lsid w:val="1E525C74"/>
     <w:listStyleLink w:val="Style1bulletpw"/>
   </w:listDef>
   <w:list w:ilfo="1">
     <w:ilst w:val="0"/>
   </w:list>
 </w:lists>
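The chain of references from a list item to its definition can be followed mechanically. The Python sketch below, using the standard xml.etree.ElementTree module, takes the ilfo value found in a paragraph, locates the list element with that ilfo attribute, and then follows its ilst child to a listDef. It rests on one assumption that should be stated plainly: that each listDef carries a listDefId attribute whose value is what ilst refers to. The function name list_style_for is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}
W = f"{{{W_NS}}}"  # prefix for attribute names in ElementTree notation


def list_style_for(lists, ilfo):
    """Follow ilfo -> list -> ilst -> listDef; return the listStyleLink value or None."""
    for lst in lists.findall("w:list", NS):
        if lst.get(W + "ilfo") != ilfo:
            continue
        ilst = lst.find("w:ilst", NS)
        if ilst is None:
            return None
        for list_def in lists.findall("w:listDef", NS):
            # Assumption: listDefId is the attribute that ilst points at.
            if list_def.get(W + "listDefId") == ilst.get(W + "val"):
                link = list_def.find("w:listStyleLink", NS)
                return None if link is None else link.get(W + "val")
        return None
    return None


SAMPLE = f"""<w:lists xmlns:w="{W_NS}">
  <w:listDef w:listDefId="0">
    <w:lsid w:val="1E525C74"/>
    <w:listStyleLink w:val="Style1bulletpw"/>
  </w:listDef>
  <w:list w:ilfo="1">
    <w:ilst w:val="0"/>
  </w:list>
</w:lists>"""

lists = ET.fromstring(SAMPLE)
print(list_style_for(lists, "1"))
print(list_style_for(lists, "2"))
```

An ilfo value with no matching list element yields None, which a generator of WordML could treat as a sign that the lists section was not copied along with the body.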

5.1.5.3 Tables

The structure of WordML tables (Model 5-3) is very similar to XHTML tables, so if you are familiar with HTML you have a head start. A table element (tbl) can appear anywhere a paragraph can appear, namely as a child of body.

Model 5-3. WordML table structure
 Table (tbl)
   [1..1] Table Properties (tblPr)
   [1..1] Table Grid (tblGrid)
          [1..*] Table Grid Column (gridCol)
   [0..*] Row (tr)
          [0..1] Row Properties (trPr)
          [1..*] Cell (tc)
                 [1..1] Cell Properties (tcPr)
                 [0..*] Tables (tbl)
                 [1..*] Paragraphs (p)

The table properties element (tblPr) is used to specify the properties of the table, such as the style used, the cell spacing, and the borders. The element is required, but none of its children (which set the individual properties) is required, so it is possible to have an empty tblPr element. All of the settings have defaults, which are used in case they are not specified.

The table grid element (tblGrid) is used to set the column widths. For each column in the table it contains a gridCol element with a w attribute that specifies the column width in twips (twentieths of a point). The tblGrid element and its gridCol children are required.
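Twips are awkward to read, so a small conversion helps when checking column widths. The Python sketch below relies on the definition given above (20 twips to the point) and on the standard 72 points to the inch; the function names are hypothetical.

```python
TWIPS_PER_POINT = 20   # a twip is one twentieth of a point
POINTS_PER_INCH = 72


def twips_to_points(twips):
    """Convert a width in twips to points."""
    return twips / TWIPS_PER_POINT


def twips_to_inches(twips):
    """Convert a width in twips to inches."""
    return twips / (TWIPS_PER_POINT * POINTS_PER_INCH)


for width in (828, 1620, 1440):
    print(width, twips_to_points(width), twips_to_inches(width))
```

A width of 1440 twips, for instance, is 72 points, or exactly one inch.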

Each row in the table is represented by a tr element. Each tr element has an optional properties child, trPr, and one or more cells, represented by tc elements. Each tc may itself have a properties child, tcPr, and must have one or more other tables (tbl) or paragraphs (p). The last child of the tc must always be a paragraph rather than another table.

Suppose that Doug wants to display sales data in a table. The table shown in Example 5-6 will look like Figure 5-3 when shown in Word.

Figure 5-3. Sales table displayed in Word




Example 5-6. WordML table
 <w:tbl>
   <w:tblGrid>
     <w:gridCol w:w="828"/>
     <w:gridCol w:w="1620"/>
     <w:gridCol w:w="1440"/>
   </w:tblGrid>
   <w:tr>
     <w:tc>
       <w:p>
         <w:pPr><w:pStyle w:val="Heading3"/></w:pPr>
         <w:r><w:t>Q</w:t></w:r>
       </w:p>
     </w:tc>
     <w:tc>
       <w:p>
         <w:pPr><w:pStyle w:val="Heading3"/></w:pPr>
         <w:r><w:t>Revenue</w:t></w:r>
       </w:p>
     </w:tc>
     <w:tc>
       <w:p>
         <w:pPr><w:pStyle w:val="Heading3"/></w:pPr>
         <w:r><w:t>Profit</w:t></w:r>
       </w:p>
     </w:tc>
   </w:tr>
   <w:tr>
     <w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$14,332.35</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$2,115.12</w:t></w:r></w:p></w:tc>
   </w:tr>
   <w:tr>
     <w:tc><w:p><w:r><w:t>2</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$13,224.22</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$1,655.51</w:t></w:r></w:p></w:tc>
   </w:tr>
   <w:tr>
     <w:tc><w:p><w:r><w:t>3</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$14,778.26</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$2,243.98</w:t></w:r></w:p></w:tc>
   </w:tr>
   <w:tr>
     <w:tc><w:p><w:r><w:t>4</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$17,455.15</w:t></w:r></w:p></w:tc>
     <w:tc><w:p><w:r><w:t>$2,988.22</w:t></w:r></w:p></w:tc>
   </w:tr>
 </w:tbl>
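Since the tr and tc elements mirror the rows and cells of the table, the cell text can be pulled out into a list of rows with a few lines of code. The Python sketch below assumes the standard xml.etree.ElementTree module; the function names and the two-row sample are hypothetical. It reads only paragraphs that are direct children of a cell, so the content of nested tables is deliberately left out.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}


def cell_text(tc):
    """Join the text of the paragraphs that are direct children of a cell."""
    paragraphs = []
    for p in tc.findall("w:p", NS):
        paragraphs.append("".join(t.text or "" for t in p.findall("w:r/w:t", NS)))
    return "\n".join(paragraphs)


def table_rows(tbl):
    """Return the table as a list of rows, each a list of cell strings."""
    return [[cell_text(tc) for tc in tr.findall("w:tc", NS)]
            for tr in tbl.findall("w:tr", NS)]


# The first two rows of Example 5-6, without the paragraph styles.
SAMPLE = f"""<w:tbl xmlns:w="{W_NS}">
  <w:tblGrid>
    <w:gridCol w:w="828"/><w:gridCol w:w="1620"/><w:gridCol w:w="1440"/>
  </w:tblGrid>
  <w:tr>
    <w:tc><w:p><w:r><w:t>Q</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Revenue</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Profit</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>$14,332.35</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>$2,115.12</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""

rows = table_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```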

For more complex tables, you can use the many table formatting features of Word, such as vertical and horizontal merge, and borders and shading. You can even include tables within other tables, as we saw.

Tip

When designing a complex table, the best approach is to create an example of the table in Word and save it as WordML. This will give you a model to work from, and will save you the effort of learning every single relevant WordML element.




5.1.5.4 Images

An image embedded in a Word document is represented in WordML by a pict element. Each pict element contains a Vector Markup Language (VML) description of the shape, location and size of the image, and the image data itself in base64Binary datatype format.

Tip

As with other Word components, the best way to include an image in a generated WordML document is to create a Word document that contains the image in the desired location and size, and save it as WordML. You can then copy the pict element from the saved WordML document and place it in your XSLT stylesheet.




5.1.5.5 Hyperlinks

A hyperlink is represented in WordML by an hlink element. Example 5-7 shows a paragraph that has an embedded hyperlink.

Example 5-7. Hyperlink in WordML
 <w:p>
   <w:r>
     <w:t>More information on the new marketing
          plan can be found at </w:t>
   </w:r>
   <w:hlink w:dest="http://www.xmlinoffice.com/mkplan">
     <w:r>
       <w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
       <w:t>http://www.xmlinoffice.com/mkplan</w:t>
     </w:r>
   </w:hlink>
   <w:r>
     <w:t>. </w:t>
   </w:r>
 </w:p>

The hlink element is contained directly within the p element, rather than within a text run. In fact, it contains its own text run for the hyperlink text that appears when the document is presented, as in Figure 5-4. The dest attribute of the hlink element specifies the linked URL.

Figure 5-4. Hyperlink displayed in Word
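Because hlink is a child of the paragraph and wraps its own runs, collecting the links in a paragraph is straightforward. The Python sketch below assumes the standard xml.etree.ElementTree module; the function name hyperlinks and the shortened sample are hypothetical. It returns the dest attribute together with the text that Word would display.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}


def hyperlinks(p):
    """Return (dest, displayed text) for each hlink that is a child of paragraph p."""
    links = []
    for hlink in p.findall("w:hlink", NS):
        # The displayed text lives in the runs nested inside the hlink.
        text = "".join(t.text or "" for t in hlink.findall("w:r/w:t", NS))
        links.append((hlink.get(f"{{{W_NS}}}dest"), text))
    return links


# A shortened version of the paragraph in Example 5-7.
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:t>More information can be found at </w:t></w:r>
  <w:hlink w:dest="http://www.xmlinoffice.com/mkplan">
    <w:r>
      <w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
      <w:t>http://www.xmlinoffice.com/mkplan</w:t>
    </w:r>
  </w:hlink>
  <w:r><w:t>. </w:t></w:r>
</w:p>"""

links = hyperlinks(ET.fromstring(SAMPLE))
print(links)
```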




5.1.6 Using Word styles

There are four kinds of style in Word:

  • A character style applies to a data character string within a paragraph.

  • A paragraph style applies to an entire paragraph.

  • A table style has special settings relating to tables, such as background color and justification.

  • A list style has special settings related to lists, such as the bullet or numbering used.

There are quite a few different properties of a style, ranging from character properties, such as font and size, to paragraph properties, such as indentation and tab settings. Any style setting that can be specified in Word can also be expressed in WordML.

5.1.6.1 A style example

The styles element that appears before the body contains all the information about the styles used in the document. Each style element has a unique name that is specified in its styleId attribute. The text in the body of the document then refers to these styles by name.

In Example 5-3, the first paragraph refers to the style whose name is "Heading2". The style element for Heading2 is shown in Example 5-8.

Example 5-8. WordML style (article WordML.xml)
 <w:style w:type="paragraph" w:styleId="Heading2">
   <w:name w:val="heading 2"/>
   <w:basedOn w:val="Normal"/>
   <w:next w:val="Normal"/>
   <w:rsid w:val="CF4316"/>
   <w:pPr>
     <w:pStyle w:val="Heading2"/>
     <w:spacing w:before="240" w:after="60"/>
   </w:pPr>
   <w:rPr>
     <w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial"/>
     <w:b/>
     <w:b-cs/>
     <w:kern w:val="48"/>
     <w:sz w:val="48"/>
     <w:sz-cs w:val="48"/>
   </w:rPr>
 </w:style>
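The link between a pStyle reference in the body and a style element in the styles section can be resolved with a simple lookup on the styleId attribute. The Python sketch below assumes the standard xml.etree.ElementTree module; the function names and the abbreviated sample are hypothetical, and the summary it produces covers only a few of the many properties a style can hold.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}
W = f"{{{W_NS}}}"  # prefix for attribute names in ElementTree notation


def styles_by_id(styles):
    """Index the style children of a styles element by their styleId attribute."""
    return {s.get(W + "styleId"): s for s in styles.findall("w:style", NS)}


def describe(style):
    """Summarize a style: display name, parent style, and whether it is bold."""
    name = style.find("w:name", NS)
    based_on = style.find("w:basedOn", NS)
    return {
        "name": None if name is None else name.get(W + "val"),
        "basedOn": None if based_on is None else based_on.get(W + "val"),
        "bold": style.find("w:rPr/w:b", NS) is not None,
    }


# An abbreviated version of the style in Example 5-8.
SAMPLE = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="Heading2">
    <w:name w:val="heading 2"/>
    <w:basedOn w:val="Normal"/>
    <w:rPr><w:b/><w:sz w:val="48"/></w:rPr>
  </w:style>
</w:styles>"""

index = styles_by_id(ET.fromstring(SAMPLE))
summary = describe(index["Heading2"])
print(summary)
```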

5.1.6.2 Generating WordML style definitions

Fortunately, there is no need to learn all the WordML elements for the style settings you need. Attempting to construct WordML style definitions by hand would be a tedious, trial-and-error process. Because Word already provides a user-friendly front-end for defining styles, you should use Word itself to create a document that has all the styles you want to use.

You can save that document as WordML using the procedure described in 5.1.2, "Saving a Word document as WordML", on page 89. The result is a WordML document that contains all the styles you need. You can then copy the styles section of that document (and the lists section if needed).

This is a good approach not just for paragraph styles, but also for character styles. For example, if you wish to italicize a word in the middle of a sentence, you could do this using the i property for the text run, as shown in Example 5-3. However, it is sometimes difficult to remember the names of all the different properties that can be applied to text.

Using Word, you can create a character style for italics named, for example, "emphasis". Any text that should be italicized because it should be emphasized can then refer to that style, rather than using the i property. In effect you are using the principles of generalized markup for style names, just as you do for XML element-type names.

As with XML, this approach to style definitions has the added benefit of making it easy to apply a change to all text of that type. For example, if you use italics for both emphasized words and citations, you can create two styles: "emphasis" and "citation". If you later decide to put citations in a different font, you can simply change the "citation" style rather than having to change the font of some, but not all, of the italicized text.
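When generating WordML, referring to a character style means emitting an rStyle element in the run's properties, as Example 5-7 does for the "Hyperlink" style. The Python sketch below, using the standard xml.etree.ElementTree module, builds such a run. The function name styled_run is hypothetical, and the sketch assumes that a character style with the ID "emphasis" has already been defined in the styles section of the target document.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
W = f"{{{W_NS}}}"
ET.register_namespace("w", W_NS)  # serialize with the customary w prefix


def styled_run(text, style_id=None):
    """Build a text run; if style_id is given, refer to that character style."""
    run = ET.Element(W + "r")
    if style_id is not None:
        rpr = ET.SubElement(run, W + "rPr")
        ET.SubElement(rpr, W + "rStyle", {W + "val": style_id})
    ET.SubElement(run, W + "t").text = text
    return run


# "emphasis" is an assumed style ID; it must exist in the document's styles.
run = styled_run("huge", "emphasis")
print(ET.tostring(run, encoding="unicode"))
```

Changing how emphasized text looks then requires editing only the style definition, not the generated runs.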

XML in Office 2003: Information Sharing with Desktop XML
ISBN: 013142193X
Year: 2003
Pages: 176