2.5 Document Structure and Formatting


Now that you've been inundated with information about lots of document-level constructs, let's move into the actual content of a Word document and how it is represented in WordprocessingML. All Word documents contain three levels of hierarchy: one or more sections containing zero or more paragraphs containing zero or more characters. A run is a grouping of contiguous characters that have the same properties. Tables can occur where paragraphs can, and list items are just a special kind of paragraph. You cannot have nested structures in WordprocessingML: there are no sections within sections, or paragraphs within paragraphs. The one exception to this rule is that tables may contain tables.

2.5.1 Runs

A "run" is the basic leaf container for a document's content and is represented by the w:r element. As we've seen, the w:r element may contain w:t elements, which contain text. Including the w:t element, there are 24 valid child elements of the w:r element, representing things like text, images, deleted text, hyphens, breaks, tabs, footnotes, endnotes, footnote and endnote references, page numbers, field text, etc. We'll look at just a few of these.

The w:r element may occur in five separate element contexts: w:p, w:fldSimple, w:hlink, w:rt, and w:rubyBase. The first one, the paragraph, is the most common. The w:fldSimple element represents a Word field, the w:hlink element represents a hyperlink in Word, and the w:rt ("ruby text") and w:rubyBase elements are used together for laying out Asian ruby text.

The run is not an essential part of a Word document in the same way that paragraphs and sections are. Rather, it is WordprocessingML's way of grouping multiple characters (or other objects) that have the same property settings. To illustrate this point, consider the following WordprocessingML paragraph:

<w:p>
  <w:r><w:t>H</w:t></w:r>
  <w:r><w:t>e</w:t></w:r>
  <w:r><w:t>l</w:t></w:r>
  <w:r><w:t>l</w:t></w:r>
  <w:r><w:t>o</w:t></w:r>
  <w:r><w:t> </w:t></w:r>
  <w:r><w:t>w</w:t></w:r>
  <w:r><w:t>o</w:t></w:r>
  <w:r><w:t>r</w:t></w:r>
  <w:r><w:t>l</w:t></w:r>
  <w:r><w:t>d</w:t></w:r>
</w:p>

The above paragraph is exactly equivalent to the paragraph below:

<w:p>
  <w:r>
    <w:t>Hello world</w:t>
  </w:r>
</w:p>

When Word saves a document as XML, it merges consecutive runs that have the same property settings. It also merges consecutive w:t elements into a single w:t element. In the above paragraph's case, all of the run properties are assigned through the document's default paragraph and character styles, because no explicit, local property settings are applied (through the w:rPr element).
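By way of contrast, here is a minimal sketch (written by hand, not taken from Word's output) of a paragraph whose two runs have different property settings. It uses the w:rPr and w:b elements, which are introduced later in this section. Because the first run is bold and the second is not, the two runs would presumably remain separate when the document is saved, whereas adjacent runs with identical settings are merged as described above.

```xml
<w:p>
  <!-- Bold run: carries a local w:rPr -->
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
  <!-- Plain run: no local properties, so it differs from the run above -->
  <w:r>
    <w:t> world</w:t>
  </w:r>
</w:p>
```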

2.5.1.1 Text and whitespace handling

The w:t element, which stands for "text," has no attributes and may only contain text. Being one of the few string-valued elements in Word, it is also one of the few contexts in which whitespace is significant. The handling of whitespace within the w:t element can be summarized in three basic rules:

  1. Each space character (#x20) is preserved as a space and shows up as a space in Word.

  2. Each line-feed character (#xA) and character reference to a carriage-return (#xD) is converted into a space.

  3. Each tab character (#x9) is replaced by a w:tab element (broken out into a separate run).

The one exception is that when xml:space="default" is present, tab characters are instead converted to spaces (and w:tab elements ignored altogether).
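As an illustration of rule 3, the sketch below shows the explicit form that a tab takes: instead of a literal tab character inside a w:t element, the tab appears as a w:tab element in a run of its own. The text content is made up for the example, and the fragment assumes that xml:space="preserve" is set on the document's root element.

```xml
<w:p>
  <w:r><w:t>Name:</w:t></w:r>
  <!-- The tab, broken out into a separate run -->
  <w:r><w:tab/></w:r>
  <w:r><w:t>Evan</w:t></w:r>
</w:p>
```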

2.5.1.2 Tabs and breaks

The run inside the following WordprocessingML paragraph contains text as well as a text-wrapping break and a tab, represented by the w:br and w:tab elements.

<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:body>
    <w:p>
      <w:r>
        <w:t>This is the first line.</w:t>
        <w:br/>
        <w:t>This is a tab:</w:t>
        <w:tab/>
        <w:t>And this is some more text.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

The first thing to note here is that the presence of xml:space="preserve" is necessary for the w:tab element to be interpreted correctly. Otherwise, the tab is stripped out when the document is loaded (even though it technically doesn't constitute whitespace as far as XML is concerned). Again, for this reason, xml:space="preserve" should be included on the root element of any WordprocessingML document you create.

The w:br element, like its HTML counterpart, inserts a break within the text flow. It is short for <w:br w:type="text-wrapping"/>. The w:type attribute may have two other values: column and page, representing column and page breaks. Figure 2-7 shows the result of opening this document in Word, with formatting marks turned on.

Figure 2-7. A text-wrapping break and a tab inside a single paragraph


The bent arrow at the end of the first line indicates that this is a text-wrapping break (represented in WordprocessingML by the w:br element) rather than the end of the paragraph. (Word users can insert text-wrapping breaks by pressing Shift-Enter). The right-pointing arrow on the second line denotes the presence of a tab. The w:tab element inserts a tab into the text flow, according to the tab settings for the current paragraph. In this case, since the tab stops for this paragraph are not specified either locally or in the Normal paragraph style, the tab stops default to the application default: every half inch (as specified by the document's w:defaultTabStop element).
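The other w:type values mentioned above can be sketched in the same way. The fragment below is an illustrative variation on the earlier example (the text is invented) that replaces the text-wrapping break with a page break, so the second piece of text should start on a new page.

```xml
<w:p>
  <w:r>
    <w:t>Last line of this page.</w:t>
    <!-- A page break rather than the default text-wrapping break -->
    <w:br w:type="page"/>
    <w:t>First line of the next page.</w:t>
  </w:r>
</w:p>
```

Substituting w:type="column" would request a column break instead.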

2.5.1.3 Run properties

Among all the valid child elements of w:r, the w:rPr element is special. It stands for "run properties." All of the other children of w:r may occur in any order, but the w:rPr element, when present, must come first. Its child elements collectively set properties on the run, controlling primarily how text inside the run is to be displayed. There are 42 possible child elements of the w:rPr element, all of which are empty elements. Their various attribute values specify formatting properties such as font, font size, font color, bold, italic, underline, strikethrough, character spacing, text effects, etc. They correspond to the properties you see in Word's Font dialog box, accessed by selecting Format → Font..., as shown in Figure 2-8.

Figure 2-8. Word's font settings which correspond to run properties


When font settings are applied using a local w:rPr element, such settings are called "local settings," "manual formatting," or "direct formatting," as distinct from font settings applied through a selection's associated paragraph and character styles. Individual font properties applied through direct formatting always override the corresponding properties defined in the associated paragraph or character styles.

Example 2-3 shows the use of some of these formatting elements, each of which is highlighted.

Example 2-3. Applying various font properties
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i w:val="on"/> <!-- turns italics on -->
          <w:b/>            <!-- turns bold on -->
        </w:rPr>
        <w:t>This run is bold and italic. </w:t>
        <w:br/>
      </w:r>
      <w:r>
        <w:rPr>
          <w:u w:val="single"/> <!-- single underline -->
          <w:rFonts w:ascii="Arial"/>
        </w:rPr>
        <w:t>This is Arial and underlined.</w:t>
        <w:br/>
      </w:r>
      <w:r>
        <w:rPr>
          <w:sz w:val="56"/>   <!-- 28-point font size -->
        </w:rPr>
        <w:t>This is big.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

This example contains a single paragraph that contains three runs, each of which contains text. The first two runs also contain trailing text-wrapping breaks (w:br elements), effectively separating the text of each run onto its own line. Each run has different run properties specified in the w:rPr element. These properties, since they are applied as direct formatting, override the corresponding settings in the Normal style (the "default default" paragraph style, as we saw earlier).

The first run introduces the w:b and w:i elements:

        <w:rPr>
          <w:i w:val="on"/> <!-- turns italics on -->
          <w:b/>            <!-- turns bold on -->
        </w:rPr>

The w:b and w:i elements stand for "bold" and "italic," respectively. They are among 19 of w:rPr's 42 possible child elements that, like many of w:docPr's children, are declared with the onOffProperty type in the WordprocessingML schema. This means that the default value of the w:val attribute is on. Thus, w:val="on" on the w:i element above is technically redundant. As might be guessed, by turning these properties on, all of the text within the run will be formatted in bold weight and italic style.

The presence of the w:val attribute is necessary to turn off a particular property, overriding its setting in the style. For example, if you want to turn off bold for a particular portion of text that's associated as a whole with a style in which the bold property is turned on, then you would include <w:b w:val="off"/> inside the w:rPr element.
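The following sketch shows what that might look like in practice. It assumes a hypothetical character style named "Strong" in which bold is turned on, and it uses the w:rStyle element (covered later in this section) to associate both runs with that style. Only the second run turns bold off again.

```xml
<w:r>
  <w:rPr>
    <!-- "Strong" is a hypothetical character style that turns bold on -->
    <w:rStyle w:val="Strong"/>
  </w:rPr>
  <w:t>This text is bold, </w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:rStyle w:val="Strong"/>
    <!-- Direct formatting overrides the style's bold setting -->
    <w:b w:val="off"/>
  </w:rPr>
  <w:t>but this text is not.</w:t>
</w:r>
```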


The second run in Example 2-3 introduces the w:u and w:rFonts elements:

        <w:rPr>
          <w:u w:val="single"/> <!-- single underline -->
          <w:rFonts w:ascii="Arial"/>
        </w:rPr>

The w:u element is similar to w:b and w:i, in that it is empty and has a w:val attribute. The difference is that, instead of having only the values on and off, you have a choice between 18 different values, including single (as in this example) and none. These values correspond to the choices in the "Underline style" drop-down menu in Word's Font dialog.

This run also specifies the Arial font, overriding the default Times New Roman font of the Normal style. This is done using the w:rFonts element, which has the same declared type in the WordprocessingML schema as the global w:defaultFonts element we saw earlier. Specifically, it allows the same attributes for specifying the fonts of different character sets: w:ascii, w:h-ansi, w:cs, and w:fareast. In this case, only the w:ascii attribute is supplied, which means that the other character sets still assume the default font.

The third and final run in our single-paragraph document sets the font size using the w:sz element:

        <w:rPr>
          <w:sz w:val="56"/>   <!-- 28-point font size -->
        </w:rPr>

The value of the w:val attribute in this case is measured in half-points, or 10 twips, or 144ths of an inch. Thus, while its value is 56 in the XML, the actual font size (in full points) is 28.
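To make the half-point arithmetic concrete, here are two more illustrative runs (the text is invented for the example). In each case the point size is simply the w:val value divided by two, which also means that half-point sizes such as 10.5 can be expressed as whole numbers.

```xml
<w:r>
  <w:rPr>
    <w:sz w:val="24"/>  <!-- 24 half-points = 12 points -->
  </w:rPr>
  <w:t>Twelve-point text. </w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:sz w:val="21"/>  <!-- 21 half-points = 10.5 points -->
  </w:rPr>
  <w:t>Ten-and-a-half-point text.</w:t>
</w:r>
```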

Finally, we see the result of opening this document in Word in Figure 2-9.

Figure 2-9. Direct formatting using local w:rPr elements


Figure 2-9 also shows how direct formatting is represented in the Word UI. In this case, the cursor is inside the third run, containing the text "This is big." There are two things worth noting about how this direct formatting is represented:

  • The style drop-down box, as shown at the top right of the window, says "Normal + 28 pt." This is how all direct formatting is represented here (style name + individual property settings).

  • The Reveal Formatting task pane, because "Distinguish style source" is checked, distinguishes between the font size as set in the Normal style (12 pt) and the overriding font size as applied through Direct Formatting (28 pt).

2.5.1.4 Associating a run with a character style

In addition to specifying direct formatting, a run can explicitly associate itself with one of its document's character styles. This is done using the w:rStyle element. Below are three runs excerpted from a document in which the "Hyperlink" character style is defined. All three runs are associated with the "Hyperlink" style, but the middle run also applies some direct formatting (italics):

      <w:r>
        <w:rPr>
          <w:rStyle w:val="Hyperlink"/>
        </w:rPr>
        <w:t>This just </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="Hyperlink"/>
          <w:i/>
        </w:rPr>
        <w:t>looks</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="Hyperlink"/>
        </w:rPr>
        <w:t> like a hyperlink.</w:t>
      </w:r>

Figure 2-10 shows the result of opening this document in Word, assuming it has defined the "Hyperlink" style in its w:styles element (rendering the font blue and underlined).

Figure 2-10. A run of text associated with the "Hyperlink" style


Once again, the Reveal Formatting task pane shows the distinction between the properties applied through direct formatting ("Italic") and the properties defined in a style ("Font color: Blue" and "Underline"). It also reveals the character style for this run: "Hyperlink."
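Direct formatting can also switch off a property that the character style turns on. As a sketch, and assuming the same "Hyperlink" style definition (blue and underlined), the run below keeps the style's color but uses the none value of w:u to remove the underline.

```xml
<w:r>
  <w:rPr>
    <w:rStyle w:val="Hyperlink"/>
    <!-- Direct formatting: overrides the underline defined in the style -->
    <w:u w:val="none"/>
  </w:rPr>
  <w:t>Blue, but not underlined.</w:t>
</w:r>
```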

2.5.2 Paragraphs

Paragraphs are the basic block-oriented element in Word. All text content within a document is contained within paragraphs, whether it's inside the main body of the document, a table cell, a header, a footer, a footnote, an endnote, or a textbox embedded in an image. Normally, a new paragraph is created whenever a user hits the Enter key while editing.

In WordprocessingML, a paragraph is represented by the w:p element. The area inside the w:p element could be called a "run-level" context, because it is a context in which runs (w:r elements) may appear. Similarly, the area inside the w:body element is a "block-level" context, because it is a context in which paragraphs and tables may appear. The traditional distinction between a block and an inline element (or run) is that blocks are laid out on separate lines, whereas inline elements (runs) are laid out continuously, without any hard line breaks.

The content model of the w:p element is simple enough that it's worth showing here (using a DTD-like notation):

w:pPr?, (w:r|w:proofErr|w:permStart|w:permEnd|w:fldSimple|w:hlink|w:subDoc)*

This follows the same pattern as w:r's content model: an optional properties element followed by any of a number of element choices in any order. (We didn't show w:r's entire content model because it has so many element choices.)

Three of the elements in w:p's content model, as we've seen, may also occur as children of w:body. The w:proofErr, w:permStart, and w:permEnd elements are thus both block-level and run-level elements. They are explained later in Section 2.5.6.

The w:fldSimple element represents a Word field, and the w:hlink element represents a hyperlink in Word. You may recall that these elements are also run-level contexts, i.e., they themselves may contain runs. The w:subDoc element represents a link to a sub-document of the current document.

As is the case with the w:body element, w:p's content model is actually more open than implied above. The WordprocessingML schema also allows any element from any other namespace to occur here. This enables annotations from the AML (Annotation Markup Language) namespace, as well as tags from a custom XML schema to be embedded inside WordprocessingML. As we'll see in Chapter 4, Word renders custom XML tags differently depending on whether they occur at the block level (inside w:body) or run level (inside w:p).

2.5.2.1 Paragraph properties

Among all the valid child elements of w:p, the w:pPr element is special. It stands for "paragraph properties." All of the other children of w:p may occur in any order, but the w:pPr element, when present, must come first. Its child elements collectively set properties on the paragraph, controlling how the paragraph will be displayed. There are 34 possible child elements of the w:pPr element, many but not all of which are empty elements. Their various attribute values and child elements specify paragraph properties such as alignment, indentation, spacing, tab stops, widow/orphan control, paragraph borders, etc. Most of these properties correspond to the properties you see in Word's Paragraph dialog box, accessed by selecting Format → Paragraph..., as shown in Figure 2-11.

Figure 2-11. Word's Paragraph dialog, corresponding to properties inside the w:pPr element


When paragraph settings are applied using a local w:pPr element, such settings are called "local settings," "manual formatting," or "direct formatting," as distinct from settings applied through a paragraph's associated paragraph style. Individual paragraph properties applied through direct formatting always override the corresponding properties defined in the associated paragraph style. If this sounds familiar, it should. It's the same basic rule as for font settings. Local w:rPr and w:pPr elements always override settings applied through (explicit or default) style association. Also, the properties within the w:rPr and w:pPr elements are completely disjoint from each other, so there is no possibility of conflict between these two elements.

Example 2-4 shows the use of some of these paragraph formatting elements, each of which is highlighted.

Example 2-4. Applying various paragraph properties
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:body>
    <w:p>
      <w:pPr>
        <w:jc w:val="center" />
      </w:pPr>
      <w:r>
        <w:t>All work and no play makes Evan a dull boy.</w:t>
      </w:r>
    </w:p>
    <w:p />
    <w:p>
      <w:pPr>
        <w:spacing w:line="480" w:line-rule="auto" />
        <w:ind w:left="720" w:first-line="720" />
      </w:pPr>
      <w:r>
        <w:t>All work and no play makes Evan a dull boy. All work and no play makes Evan a
        dull boy. All work and no play makes Evan a dull boy. All work and no play
        makes Evan a dull boy.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:ind w:left="2880" w:right="2880" />
      </w:pPr>
      <w:r>
        <w:t>All work and no play makes Evan a dull boy.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

The result of opening this document in Word is shown in Figure 2-12. Also, the Format → Paragraph... dialog shown earlier in Figure 2-11 reflects the paragraph settings of the third paragraph of this example (note that the second paragraph is empty).

Figure 2-12. Applying paragraph properties as direct formatting


Example 2-4 contains four paragraphs. The second paragraph is empty and does not apply any direct formatting. The other three each specify paragraph properties that override the corresponding settings in the Normal style (the "default default" paragraph style).

The first paragraph is centered. The w:jc element represents the paragraph justification settings:

<w:jc w:val="center" />

Its w:val attribute value may be left, center, right, both, or one of several other options specific to East Asian text. The first four values correspond to the "Left," "Centered," "Right," and "Justified" options in the Alignment drop-down menu in the Format → Paragraph... dialog.

The second non-empty paragraph is double-spaced, indented on the left, and has a first-line indent. The double-spacing effect is achieved through the w:spacing element:

<w:spacing w:line="480" w:line-rule="auto" />

Unlike the w:jc element, which has specific keywords corresponding to each of the UI options, the w:spacing element specifies its values numerically in twips. The w:line attribute's value of 480 (equivalent to 24 points), in conjunction with the w:line-rule attribute's value of auto, represents the overall setting of "Double" in the Line Spacing drop-down menu in the Format → Paragraph... dialog, as shown earlier in Figure 2-11. When the w:line-rule attribute's value is auto, the w:line attribute's value is interpreted in a pre-defined way, regardless of the current paragraph's font size. A value of 480 means "Double," 360 means "1.5 lines," and 240 means "Single." The actual line spacing distance is automatically adjusted according to the current font size, but the w:line attribute's value stays the same.

The other possible values of w:line-rule are exact and at-least. These correspond to the "Exactly" and "At least" options in the Line Spacing drop-down menu and affect how the w:line value is interpreted. For example, a value of exact would fix the line spacing distance to the value specified in the w:line attribute, regardless of the current font size. The w:spacing element also has other attributes (not present in this example) that are used to determine the spacing before and after the paragraph itself.
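The two interpretations of w:line can be compared side by side. The paragraphs below are an illustrative sketch with invented text: the first uses the auto rule with the pre-defined value for one-and-a-half spacing, and the second uses the exact rule with a distance in twips (280 twips, or 14 points at 20 twips per point).

```xml
<!-- "1.5 lines": adjusted automatically to the font size -->
<w:p>
  <w:pPr>
    <w:spacing w:line="360" w:line-rule="auto"/>
  </w:pPr>
  <w:r>
    <w:t>This paragraph is one-and-a-half spaced.</w:t>
  </w:r>
</w:p>
<!-- "Exactly" 14 points: 280 twips, regardless of the font size -->
<w:p>
  <w:pPr>
    <w:spacing w:line="280" w:line-rule="exact"/>
  </w:pPr>
  <w:r>
    <w:t>This paragraph has a fixed line spacing.</w:t>
  </w:r>
</w:p>
```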

The indentation of the third paragraph (following the empty second paragraph) is specified using the w:ind element:

<w:ind w:left="720" w:first-line="720" />

The w:left attribute specifies the left indentation distance as 720 positive twips, or half an inch to the right of the page margin. (Negative indent values move the text into the page margin.) The w:first-line attribute specifies a first-line indent of another half inch. The effect of these settings on Word's ruler is shown in Figure 2-13.

Figure 2-13. A half-inch left indent and a half-inch first-line indent


The w:ind element may also have a w:hanging attribute which specifies a hanging indent. Its presence is mutually exclusive with the w:first-line attribute, because the same paragraph cannot have both first-line and hanging indents. If our example used a hanging indent rather than a first-line indent, then the WordprocessingML would look like this:

<w:ind w:left="720" w:hanging="720" />

And the ruler would look like Figure 2-14.

Figure 2-14. A half-inch left indent and a half-inch hanging indent


Interestingly enough, you can also supply negative values for the w:first-line and w:hanging attributes. Since a hanging indent is essentially the opposite of a first-line indent, Word interprets a negative value as if you had supplied a positive value of the other type of indent. In fact, when it subsequently saves the document as WordprocessingML, it replaces one attribute with the other attribute (w:hanging with w:first-line or vice versa) and its negative value with its opposite (positive) value. For example, if you open a document that has this:

<w:ind w:hanging="-720" />

then Word will normalize it to this instead:

<w:ind w:first-line="720" />

The two are equivalent.
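The same rule should apply in the other direction. As a sketch of the "vice versa" case, assuming Word behaves symmetrically as described above, a hand-written negative first-line indent would be rewritten as a positive hanging indent:

```xml
<!-- As written by hand: a negative first-line indent -->
<w:ind w:left="720" w:first-line="-720"/>

<!-- What Word would presumably write back when saving -->
<w:ind w:left="720" w:hanging="720"/>
```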

The last paragraph in Example 2-4 has both right and left indents:

<w:ind w:left="2880" w:right="2880" />

The positive value (in twips) of 2880 in each of the w:left and w:right attributes means that the paragraph will be indented two inches from the margin on each side.
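Since 720 twips is half an inch, one inch is 1440 twips, and other distances follow by simple multiplication. The sketch below (with invented text) applies that arithmetic to two paragraphs, the second of which uses a negative left indent to move the text into the page margin.

```xml
<w:p>
  <w:pPr>
    <!-- 1440 twips = 1 inch on the left; 360 twips = a quarter inch on the right -->
    <w:ind w:left="1440" w:right="360"/>
  </w:pPr>
  <w:r>
    <w:t>Indented one inch from the left margin.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <!-- A negative value extends the text half an inch into the left margin -->
    <w:ind w:left="-720"/>
  </w:pPr>
  <w:r>
    <w:t>Extends into the left margin.</w:t>
  </w:r>
</w:p>
```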

The w:left, w:right, w:first-line, and w:hanging attributes all measure distance in twips. You can alternatively measure distance in character spaces, by using the w:ind element's other four optional attributes instead: w:left-chars, w:right-chars, w:first-line-chars, and w:hanging-chars.

2.5.2.2 Defining tab stops

Paragraphs can specify custom tab stops, overriding the document's default tab stop interval. This is done using the w:tabs child element of a paragraph's w:pPr element. Example 2-5 shows a paragraph with custom tab stops as well as some tabs inside the paragraph that make use of those stops.

Example 2-5. Defining custom tab stops
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:body>
    <w:p>
      <w:pPr>
        <w:tabs>
          <w:tab w:val="left" w:pos="720" />
          <w:tab w:val="center" w:pos="3600" />
          <w:tab w:val="right" w:pos="6480" />
        </w:tabs>
      </w:pPr>
      <w:r>
        <w:tab/>
        <w:t>Left-aligned tab</w:t>
        <w:tab/>
        <w:t>Centered tab</w:t>
        <w:tab/>
        <w:t>Right-aligned tab</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

Each w:tab element within the w:tabs element defines a different tab stop. Both the w:val and w:pos attributes are required. The w:val attribute indicates the type of tab stop, controlling the alignment of text around it. Its value must be one of left, center, right, decimal, bar, list, or clear. (The value clear enables tab stops defined in an associated paragraph style to be explicitly cleared.) The w:pos attribute specifies the position of the tab stop on the ruler, as the number of twips to the right of the left page margin. The w:tab element may also have an optional w:leader attribute, which sets the style of the empty space in front of the tab. These properties correspond to the settings found in Word's Format → Tabs... dialog, shown in Figure 2-15, which here is populated with the same tab stops as defined in Example 2-5.

Figure 2-15. Tab stop definitions, corresponding to Example 2-5


Finally, the result of opening this file in Word is shown in Figure 2-16, with formatting marks turned on.

Figure 2-16. Three kinds of custom tab stops


The custom tab stops can be seen on the ruler, and the tabs themselves are signified by arrows in the document content. The document's default tab stops (every half inch) are signified by small vertical lines below the ruler and do not resume until after the last custom tab, beginning at the 5-inch mark.
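The clear value and the w:leader attribute, which Example 2-5 does not use, might be combined as in the following sketch. It assumes a hypothetical paragraph style named "MyTabbedStyle" that defines a tab stop at 720 twips (the w:pStyle element is covered later in this section), and it assumes that dot is among the accepted w:leader values. The paragraph clears the inherited stop and adds a right-aligned stop at 6 inches (8640 twips) with a dotted leader, in the manner of a table-of-contents entry.

```xml
<w:p>
  <w:pPr>
    <!-- "MyTabbedStyle" is a hypothetical style with a tab stop at 720 twips -->
    <w:pStyle w:val="MyTabbedStyle"/>
    <w:tabs>
      <!-- Remove the tab stop inherited from the style -->
      <w:tab w:val="clear" w:pos="720"/>
      <!-- Right-aligned stop at 6 inches; the "dot" leader value is an assumption -->
      <w:tab w:val="right" w:pos="8640" w:leader="dot"/>
    </w:tabs>
  </w:pPr>
  <w:r>
    <w:t>Chapter 1</w:t>
    <w:tab/>
    <w:t>15</w:t>
  </w:r>
</w:p>
```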

2.5.2.3 Paragraph mark properties

You may be surprised to learn that the w:rPr element ("run properties") may also occur as a child of the w:pPr element. Actually, it shows up quite often when editing documents in Word. For example, if you turn bold on, type a short paragraph, and hit Enter, then the resulting paragraph in WordprocessingML will look like this:

      <w:p>
        <w:pPr>
          <w:rPr>
            <w:b/>
          </w:rPr>
        </w:pPr>
        <w:r>
          <w:rPr>
            <w:b/>
          </w:rPr>
          <w:t>This text is bold.</w:t>
        </w:r>
      </w:p>

This may look redundant, but it isn't. By now, you should be familiar with the purpose of the second w:rPr element above. It sets the properties (in this case, bold) on the run in which it is contained. However, the first w:rPr element (inside the w:pPr element) functions differently than you might expect. Rather than setting properties of the runs inside the paragraph, it represents properties of the paragraph's paragraph mark. If we removed the first w:rPr element altogether, it would have no actual effect on the formatting of our document. In fact, we wouldn't even see a difference in the Word UI unless paragraph marks are turned on. In that case, we might notice whether or not the paragraph mark itself is displayed in bold weight.

The run properties, or font settings, of a paragraph mark, though they do not directly affect the paragraph's formatting, do have an effect on Word's behavior when subsequently editing the document. For that reason, you can think of the paragraph mark properties as containing information about your document's editing state rather than its actual formatting. For example, one practical effect of setting bold on a paragraph mark is that if the user selects the paragraph mark (by double-clicking it) and drags and drops it to create a new paragraph, bold will be turned on by default for runs entered in the new paragraph.

In practice, Word synchronizes the font settings of the paragraph mark with the font settings of the last run in the paragraph. For example, if you are typing a paragraph and you hit Enter when italics are turned on, then the paragraph mark of the paragraph you just created will also have italics turned on, as will the paragraph mark of the following paragraph, at least initially. If, on the other hand, you turn italics off right before you hit the Enter key, then the last part of your paragraph will still be italicized, but the paragraph mark won't be, and neither will the following paragraph's paragraph mark.

One final example may help elucidate the function of paragraph mark properties. Consider the WordprocessingML document in Example 2-6. It is devoid of any text content, but it does have one empty paragraph whose paragraph mark has italics turned on.

Example 2-6. An empty paragraph with italics turned on
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:body>
    <w:p>
      <w:pPr>
        <w:rPr>
          <w:i/>
        </w:rPr>
      </w:pPr>
    </w:p>
  </w:body>
</w:wordDocument>

If we open this document in Word, we'll see nothing but a blank document with a flashing cursor: an italicized flashing cursor. This, again, reflects the document's editing state, rather than its formatting. Any time you create a new paragraph while editing, Word tries to remember the formatting properties you had in effect on the last paragraph, even when you create an empty paragraph, save the document, close it, and open it again later, which is what Example 2-6 demonstrates.

It's good to clear up the potential confusion surrounding w:pPr's seemingly redundant w:rPr child. Now that you're cognizant of what instances of this element do not represent, you can safely exclude them from WordprocessingML documents that you create. Their absence will have negligible impact on the user's editing experience. Don't worry: Word will still work its magic.
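As a sketch of that advice, the bold paragraph shown at the start of this section could be written by hand without the paragraph mark properties. The run's own w:rPr element is all that is needed to make the text bold.

```xml
<w:p>
  <!-- No w:pPr/w:rPr here: the paragraph mark properties are simply omitted -->
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>This text is bold.</w:t>
  </w:r>
</w:p>
```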

2.5.2.4 Associating a paragraph with a paragraph style

In addition to specifying direct formatting, a paragraph can explicitly associate itself with one of its document's paragraph styles. This is done using the w:pStyle element. Below is a paragraph excerpted from a document in which the "Heading1" paragraph style is defined:

      <w:p>
        <w:pPr>
          <w:pStyle w:val="Heading1" />
        </w:pPr>
        <w:r>
          <w:t>This is a heading</w:t>
        </w:r>
      </w:p>

This paragraph will be formatted according to the explicitly associated paragraph style, provided that the containing document has a style definition that looks something like this:

    <w:style w:type="paragraph" w:styleId="Heading1">
      <w:name w:val="Heading 1"/>
      <!-- other style options -->
      <w:pPr>
        <!-- paragraph property settings -->
      </w:pPr>
      <w:rPr>
        <!-- font property settings -->
      </w:rPr>
    </w:style>
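Style association and direct formatting can be combined in the same w:pPr element. The sketch below is a variation on the heading paragraph above (the text is invented) that keeps the "Heading1" association and adds a centered alignment as direct formatting, which takes precedence over whatever alignment the style defines.

```xml
<w:p>
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
    <!-- Direct formatting: overrides the alignment defined in the style -->
    <w:jc w:val="center"/>
  </w:pPr>
  <w:r>
    <w:t>This is a centered heading</w:t>
  </w:r>
</w:p>
```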

2.5.3 Tables

Tables may occur anywhere that paragraphs may occur (and vice versa), which most commonly is directly inside the w:body element (or inside an intervening wx:sect element when the WordprocessingML is output by Word). The other contexts in which paragraphs and tables may occur are the w:hdr, w:ftr, w:footnote, w:endnote, w:tc, w:txbxContent, and w:cfChunk elements, which we already introduced briefly.

The basic structure of the w:tbl element looks like this:

<w:tbl>
  <w:tblPr>...</w:tblPr>
  <w:tblGrid>
    <w:gridCol w:val="..."/>
    <w:gridCol w:val="..."/>
    ...
  </w:tblGrid>
  <w:tr>
    <w:tc>...</w:tc>
    <w:tc>...</w:tc>
    ...
  </w:tr>
  <w:tr>...</w:tr>
  ...
</w:tbl>

The content model for the w:tbl element, using a DTD-like syntax, is:

aml:annotation*, w:tblPr, w:tblGrid, (w:tr | w:proofErr | w:permStart | w:permEnd)+

In other words, the w:tbl element may contain zero or more aml:annotation elements, followed by a w:tblPr element and a w:tblGrid element, followed by one or more w:tr, w:proofErr, w:permStart, or w:permEnd elements, in any order. The w:tblPr element contains table-wide properties. The w:tblGrid element contains w:gridCol elements that define the widths of columns in the table.

Table rows are represented by the w:tr element. The content model of the w:tr element, using the same notation, is:

w:tblPrEx?, w:trPr?, (w:tc | w:proofErr | w:permStart | w:permEnd)+

The w:tblPrEx element contains exceptions to the table-wide properties for this row only. The w:trPr element contains table row properties for this row.

Table cells are represented by the w:tc element. The content model of the w:tc element, using the same notation, is:

w:tcPr?, (w:p | w:tbl | w:cfChunk | w:proofErr | w:permStart | w:permEnd)*

Thus, after optionally specifying the table cell properties (with the w:tcPr element), we are once again inside a block-level context. At this point, paragraphs may contain the text for the table cell, or another table can be nested inside this one.
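
To make the nesting concrete, here is a minimal sketch of a cell that holds a paragraph followed by a nested table. It follows the w:tc content model given above. The text content and the MyTableStyle style name are placeholders chosen for illustration, the nested table omits w:tblGrid for brevity, and the empty trailing paragraph reflects our assumption that Word expects a cell to end with a paragraph rather than a table.

```xml
<w:tc>
  <w:p>
    <w:r>
      <w:t>Outer cell text</w:t>
    </w:r>
  </w:p>
  <!-- A table nested directly inside the cell, as a sibling of the paragraph -->
  <w:tbl>
    <w:tblPr>
      <!-- Placeholder style name; any table style defined in the document would do -->
      <w:tblStyle w:val="MyTableStyle"/>
    </w:tblPr>
    <w:tr>
      <w:tc>
        <w:p>
          <w:r>
            <w:t>Inner cell text</w:t>
          </w:r>
        </w:p>
      </w:tc>
    </w:tr>
  </w:tbl>
  <!-- Assumed: the outer cell still ends with a paragraph -->
  <w:p/>
</w:tc>
```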

We've repeatedly seen the trio of w:proofErr, w:permStart, and w:permEnd now at row-level, cell-level, block-level, and run-level contexts. See Section 2.5.6, later in this chapter, to find out what exactly these elements are for and how they function.

Example 2-7 shows a simple table that references one of its document's table styles and additionally utilizes several table formatting features.

Example 2-7. A sample table with a style and merged cells
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:styles>
    <w:style w:type="table" w:styleId="MyTableStyle">
      <w:name w:val="My Table Style" />
      <w:tblPr>
        <w:tblBorders>
          <w:top w:val="single"/>
          <w:left w:val="single"/>
          <w:bottom w:val="single"/>
          <w:right w:val="single"/>
          <w:insideH w:val="single"/>
          <w:insideV w:val="single"/>
        </w:tblBorders>
        <w:tblCellMar>
          <w:left w:w="108" w:type="dxa" />
          <w:right w:w="108" w:type="dxa" />
        </w:tblCellMar>
      </w:tblPr>
    </w:style>
  </w:styles>
  <w:body>
    <w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="MyTableStyle" />
      </w:tblPr>
      <w:tr>
        <w:tc>
          <w:p>
            <w:r>
              <w:t>First row, first column</w:t>
            </w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:vmerge w:val="restart" />
          </w:tcPr>
          <w:p>
            <w:r>
              <w:t>First row, second column (merged with second row, second column)</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
      <w:tr>
        <w:tc>
          <w:p>
            <w:r>
              <w:t>Second row, first column</w:t>
            </w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:vmerge />
          </w:tcPr>
          <w:p/>
        </w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:wordDocument>

The result of opening this WordprocessingML document in Word is shown in Figure 2-17.

Figure 2-17. A simple table, with automatically sized cells
figs/oxml_0217.gif


There are a few things to note about this table:

  • The table is associated with "MyTableStyle," which is defined within the document.

  • The "MyTableStyle" style adds borders and left and right cell margins (via the w:tblBorders and w:tblCellMar elements) to the table.

  • Word opens the document without complaint, even though it doesn't have a w:tblGrid element; Word automatically sizes the cells to contain the content.

  • The w:vmerge element is a table cell property that vertically merges one table cell with the cell below it. The first cell of the merge carries w:val="restart", and each continuing cell carries an empty w:vmerge element. Its horizontal equivalent is the w:hmerge element.

  • The w:tbl element as generated by Word tends to be much more verbose than this example, explicitly specifying many individual property settings.

There is a lot that this example doesn't cover. To give you an idea just how much more there is to tables, the w:tblPr element has 17 possible child elements (many of which contain their own children), the w:trPr element has 12 possible child elements, and the w:tcPr element has 13 possible child elements. That's not to mention the w:tblPrEx (exceptions for a specific row), w:tblStylePr (for table-style conditional override properties), and w:tblpPr (for specifying the position of a table) elements. If you're writing WordprocessingML for tables, the main things you'll need to configure are the properties of the table, rows, and cells. These work in the same way as the paragraph properties that we've looked at in detail earlier, so we won't go into them here. A quick look at the properties dialogs for tables should give you an idea of what's involved.
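
As one small illustration of a cell property that Example 2-7 does not use, the following sketch shows how a horizontal merge might be written with w:hmerge. We are assuming here that w:hmerge mirrors w:vmerge, with w:val="restart" on the first cell and an empty element on the cell that continues the merge; the cell text is invented for the example.

```xml
<w:tr>
  <w:tc>
    <w:tcPr>
      <!-- Assumed to work like w:vmerge: "restart" begins the merged region -->
      <w:hmerge w:val="restart"/>
    </w:tcPr>
    <w:p>
      <w:r>
        <w:t>This cell spans both columns</w:t>
      </w:r>
    </w:p>
  </w:tc>
  <w:tc>
    <w:tcPr>
      <!-- An empty w:hmerge continues the merge begun in the previous cell -->
      <w:hmerge/>
    </w:tcPr>
    <w:p/>
  </w:tc>
</w:tr>
```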

2.5.4 Lists

Lists are a rather strange beast in WordprocessingML. Though tables can get pretty hairy, they at least are generally structured the way you would expect: tables containing rows containing cells. Lists, on the other hand, have no such explicit structure in WordprocessingML. Instead, a list consists of a sequence of paragraphs that function as list items. They do not have a common container, nor, unfortunately, does Word provide an auxiliary hint for list containers when outputting WordprocessingML. The member paragraphs of a list are linked to one of its document's "list definitions." These are responsible for maintaining the identity of a single list. When numbering restarts, for example, a new list definition is automatically created. These list definitions, in turn, are linked to one of the document's "base list definitions", which, if there is no subsequent list style link to traverse, define the actual formatting properties of the list. If the phrase "spectacularly convoluted" comes to mind, just wait until you see an example of this.

2.5.4.1 What makes a paragraph a list item

A paragraph participates as a member of a list under either of two circumstances:

  • It has a w:listPr element inside its w:pPr element, which refers to a specific list definition (via the w:ilfo element).

  • It is associated with a paragraph style that includes list formatting.

Let's take a look at how the first mechanism works. The following paragraph is a member of a list:

<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="0"/>
      <w:ilfo w:val="1"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is item one.</w:t>
  </w:r>
</w:p>

The w:ilfo element (whose name may stand for something like "item list format," though Microsoft has not documented what it actually means) refers to one of the document's list definitions, identified by the number 1. The w:ilvl element specifies at what level of nesting this list item occurs. It is incremented each time a list is nested within another list. Since there are nine possible levels of list indentation in Word (starting at 0), its value can be anywhere from 0 to 8. It basically says, "Once you find the definition for how each level of this list is supposed to look, sign me up for the formatting and indentation that are defined for level 0." Finding the list definition is the trick. But before we figure out how that's done, let's take a look at how WordprocessingML lists compare with HTML lists.

2.5.4.2 Comparing HTML and WordprocessingML lists

Below is a simple nested list in HTML:

<ol>
  <li>
    <p>This is top-level item 1</p>
    <ol>
      <li>This is second-level item 1</li>
      <li>This is second-level item 2</li>
    </ol>
  </li>
  <li>This is top-level item 2</li>
</ol>

In WordprocessingML, a list like this is expressed much differently. Instead of using a hierarchical structure to express the list hierarchy, we must represent the list as a flat sequence of four sibling paragraphs, assigning them to the same list but to different levels within the list:

<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="0"/>
      <w:ilfo w:val="1"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is top-level item 1</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="1"/>
      <w:ilfo w:val="1"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is second-level item 1</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="1"/>
      <w:ilfo w:val="1"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is second-level item 2</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="0"/>
      <w:ilfo w:val="1"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is top-level item 2</w:t>
  </w:r>
</w:p>

For this list to display correctly, the document must contain at least one list definition (a w:list element with w:ilfo="1", as we'll see) and a corresponding base list definition (w:listDef element), which contains the actual formatting information for list items. Each paragraph's w:ilvl value represents how far it is nested in the list. The "top-level" paragraphs are each at level 0, whereas the "second-level" paragraphs are each at level 1. Figure 2-18 shows how Word renders this WordprocessingML list, using one of its built-in list styles.

Figure 2-18. A simple nested list in Word
figs/oxml_0218.gif


2.5.4.3 Finding the list definitions

Now let's take a look at where the "list definitions" and "base list definitions" are actually defined. Unsurprisingly, they are both to be found inside the top-level w:lists element, whose basic content model is a sequence of w:listDef elements followed by a sequence of w:list elements:

<w:lists>
  <w:listDef ...>
    ...
  </w:listDef>
  <!-- more w:listDef elements -->
  <w:list ...>
    ...
  </w:list>
  <!-- more w:list elements -->
</w:lists>

The w:list elements represent what we're calling "list definitions," and the w:listDef elements represent what we're calling "base list definitions."

Consider the first example list paragraph we saw earlier. This will be our starting point for finding the list definitions in the same way that Word does. Here's the paragraph again:

<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="0"/>
      <w:ilfo w:val="1"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is item one.</w:t>
  </w:r>
</w:p>

Since our paragraph's w:ilfo element refers to the value 1, we need to find the list definition identified by the number 1. In other words, we need to find a w:list element that looks something like this (whose w:ilfo attribute's value is 1):

<w:list w:ilfo="1">
  <w:ilst w:val="5"/>
</w:list>

Now that we've found the list definition, the next step is finding the "base list definition." We do that by looking at the value provided by the w:ilst element. In this case, it is referring to a base list definition identified by the number 5. Recalling that the base list definitions are represented by w:listDef elements and that they precede the w:list elements inside the w:lists element, we continue to search further back in our WordprocessingML document. Eventually, we find what we're looking for:

<w:listDef w:listDefId="5">
  ...
  <w:lvl w:ilvl="0">...</w:lvl>
  <w:lvl w:ilvl="1">...</w:lvl>
  <w:lvl w:ilvl="2">...</w:lvl>
  <w:lvl w:ilvl="3">...</w:lvl>
  <w:lvl w:ilvl="4">...</w:lvl>
  <w:lvl w:ilvl="5">...</w:lvl>
  <w:lvl w:ilvl="6">...</w:lvl>
  <w:lvl w:ilvl="7">...</w:lvl>
  <w:lvl w:ilvl="8">...</w:lvl>
</w:listDef>

The w:listDef element is identified by its w:listDefId attribute and contains one w:lvl element for each level of list nesting for which it defines formatting. While you can create base list definitions that define fewer levels without a problem, Word's built-in list styles define all nine levels of nesting. The content of the w:lvl element includes all kinds of formatting information, such as indentation, tab stops, the number to start on, number format, and bullet images.

Once Word finds the base list definition, with all its formatting information, it then applies the appropriate level's formatting to the paragraph, according to the value of the w:ilvl element that occurs in the paragraph's list properties. Thus, Word applies the level 0 list item formatting to our example paragraph above.
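
Putting the two pieces together, a minimal w:lists element for our example paragraph might look something like the following sketch. The identifiers (1 and 5) come from the walkthrough above. The contents of the two w:lvl elements are assumptions made purely for illustration, modeled on the level definition shown in the next section; Word's own output defines all nine levels and includes considerably more detail.

```xml
<w:lists>
  <!-- Base list definition: holds the per-level formatting -->
  <w:listDef w:listDefId="5">
    <w:lvl w:ilvl="0">
      <w:start w:val="1"/>
      <w:lvlText w:val="%1."/>
      <w:lvlJc w:val="left"/>
      <w:pPr>
        <w:tabs>
          <w:tab w:val="list" w:pos="360"/>
        </w:tabs>
        <w:ind w:left="360" w:hanging="360"/>
      </w:pPr>
    </w:lvl>
    <w:lvl w:ilvl="1">
      <w:start w:val="1"/>
      <w:lvlText w:val="%2."/>
      <w:lvlJc w:val="left"/>
      <w:pPr>
        <w:tabs>
          <w:tab w:val="list" w:pos="720"/>
        </w:tabs>
        <w:ind w:left="720" w:hanging="360"/>
      </w:pPr>
    </w:lvl>
  </w:listDef>
  <!-- List definition: what a paragraph's w:ilfo value of 1 points to -->
  <w:list w:ilfo="1">
    <!-- w:ilst points back to the base list definition with w:listDefId="5" -->
    <w:ilst w:val="5"/>
  </w:list>
</w:lists>
```

With this in place, a paragraph whose w:listPr contains w:ilfo="1" and w:ilvl="0" resolves to the first w:lvl element, and one with w:ilvl="1" resolves to the second.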

2.5.4.4 List Styles

An even more complex variation of this approach occurs when list styles are used. Unlike paragraph, table, and character styles, which can be directly associated with paragraphs, tables, and runs (via the w:pStyle, w:tblStyle, and w:rStyle elements, respectively), list styles are not directly associated with paragraphs in WordprocessingML; there is no corresponding element for direct list style references. For example, when an end user applies the built-in list style "1 / a / i" to a paragraph, the paragraph is effectively associated with a list definition, but it is not directly associated with the "1 / a / i" list style that was applied to it. The resulting WordprocessingML paragraph looks essentially no different from the example paragraph we looked at earlier. Here it is again (with the only difference here being that the w:ilfo element happens to refer to a list definition identified by the number 2):

<w:p>
  <w:pPr>
    <w:listPr>
      <w:ilvl w:val="0"/>
      <w:ilfo w:val="2"/>
    </w:listPr>
  </w:pPr>
  <w:r>
    <w:t>This is item one.</w:t>
  </w:r>
</w:p>

This is what the WordprocessingML looks like when an end user applies a list style to a paragraph. Rather than being directly associated with the list style, the paragraph refers to a list definition using the w:ilfo element no differently than when a list style is not involved. However, the list style association is still retained; it's just that you can't tell that from looking at the paragraph alone. The list style association only becomes evident when we start traversing the graph, and that's where things get complicated. First, the paragraph associates itself with the document's list definition (w:list element), identified by the value 2:

<w:list w:ilfo="2">
  <w:ilst w:val="1"/>
</w:list>

The list definition, in turn, refers (via the w:ilst element) to a base list definition (w:listDef element) identified by the value 1. So far, so good. Now, here is where a few extra levels of indirection appear. Whereas before we were done at this point (the base list definition contained all the formatting properties for each level of the list), now we're only halfway there. This time, the referenced base list definition doesn't contain any formatting properties (inside w:lvl elements) at all. Instead, it contains yet another reference: the w:listStyleLink element.

    <w:listDef w:listDefId="1">
      <w:lsid w:val="27DC6005"/>
      <w:plt w:val="Multilevel"/>
      <w:tmpl w:val="0409001D"/>
      <w:listStyleLink w:val="1ai"/>
    </w:listDef>

This w:listDef element refers, via its w:listStyleLink element, to a list style definition whose w:styleId attribute's value is 1ai. This corresponds to the "1 / a / i" style that the end user applied. Here is the document's list style definition that it refers to:

    <w:style w:type="list" w:styleId="1ai">
      <w:name w:val="Outline List 1"/>
      <wx:uiName wx:val="1 / a / i"/>
      <w:basedOn w:val="NoList"/>
      <w:rsid w:val="00283CEE"/>
      <w:pPr>
        <w:listPr>
          <w:ilfo w:val="1"/>
        </w:listPr>
      </w:pPr>
    </w:style>

As you can see, the list style definition, in turn, contains a reference to yet another list definition (identified by the number 1). Dizzy yet?

    <w:list w:ilfo="1">
      <w:ilst w:val="0"/>
    </w:list>

This list definition refers to yet another base list definition, identified by the number 0. Finally, we are home free, as this base list definition actually contains the list formatting properties Word needs in order to format each level of the list:

    <w:listDef w:listDefId="0">
      <w:lsid w:val="1B850634"/>
      <w:plt w:val="Multilevel"/>
      <w:tmpl w:val="0409001D"/>
      <w:styleLink w:val="1ai"/>
      <w:lvl w:ilvl="0">
        <w:start w:val="1"/>
        <w:lvlText w:val="%1)"/>
        <w:lvlJc w:val="left"/>
        <w:pPr>
          <w:tabs>
            <w:tab w:val="list" w:pos="360"/>
          </w:tabs>
          <w:ind w:left="360" w:hanging="360"/>
        </w:pPr>
      </w:lvl>
      <w:lvl w:ilvl="1">
        ...
      </w:lvl>
      <w:lvl w:ilvl="2">
        ...
      </w:lvl>
      <w:lvl w:ilvl="3">
        ...
      </w:lvl>
      <w:lvl w:ilvl="4">
        ...
      </w:lvl>
      <w:lvl w:ilvl="5">
        ...
      </w:lvl>
      <w:lvl w:ilvl="6">
        ...
      </w:lvl>
      <w:lvl w:ilvl="7">
        ...
      </w:lvl>
      <w:lvl w:ilvl="8">
        ...
      </w:lvl>
    </w:listDef>

In summary, w:ilfo refers to w:list, which refers to w:listDef, which refers to w:style, which refers to another w:list, which refers to another w:listDef. Home, sweet home. Oh yeah, and the last w:listDef refers back to the same w:style through an element called w:styleLink (which you can see in the last code snippet above), thereby throwing in a little circularity for good measure.

2.5.5 Sections

A section in Word is an area or set of areas within a document, characterized by the same page settings, such as margin width, header and footer size, orientation, border, and print settings. These settings are accessible within the Word UI through the File → Page Setup... dialog, shown in Figure 2-19. Figure 2-19 also shows the five different kinds of section breaks you can insert into a document: "Continuous," "New column," "New page," "Even page," and "Odd page."

Figure 2-19. The Page Setup dialog for section settings
figs/oxml_0219.gif


As mentioned previously, the structure of a Word document consists of one or more sections containing zero or more paragraphs containing zero or more characters. WordprocessingML, however, does not reflect that hierarchy exactly. In fact, there is no section container element in WordprocessingML proper. (As we'll see later in Section 2.6.1, the wx:sect element helps to fill this void by acting as a surrogate container, thereby aiding external processing.) Rather, sections are represented indirectly through the presence of section breaks. A section break is signified in WordprocessingML by the presence of a w:sectPr element inside the w:pPr element of the section's last paragraph. Example 2-8 shows the WordprocessingML for a document that contains two section breaks, and therefore three sections. The w:sectPr elements are highlighted.

Example 2-8. Multiple sections in a document
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xml:space="preserve">
  <w:docPr>
    <w:view w:val="normal"/>
  </w:docPr>
  <w:body>
    <w:p>
      <w:pPr>
        <w:sectPr/>
      </w:pPr>
      <w:r>
        <w:t>First section</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Second section, first paragraph</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:sectPr/>
      </w:pPr>
      <w:r>
        <w:t>Second section, second paragraph</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Third section, first paragraph</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Third section, second paragraph</w:t>
      </w:r>
    </w:p>
    <w:sectPr/>
  </w:body>
</w:wordDocument>

The first two w:sectPr elements in this document represent section breaks, because they each occur inside a w:pPr element. One thing to keep in mind about WordprocessingML's way of representing section breaks is that it can be deceiving. Specifically, the w:sectPr elements do not lexically divide the text of the document according to its true section boundaries. For example, though from a first glance it may look as if the paragraph that says "Second section, second paragraph" belongs to the third and final section, that is not the case. It only looks that way because the w:sectPr element comes before the text of the paragraph in which it resides. This potential confusion is all the more reason to look forward to Section 2.6.1, later in this chapter.

The last w:sectPr element in Example 2-8 does not occur inside the w:pPr element. Rather, it is a child of w:body, following the last paragraph in the document. This is where Word always expects to see the final w:sectPr element of the document. It does not represent a section break; rather, its job is simply to apply properties to the final (and possibly only) section of the document. If it isn't there when Word loads the document, Word will add it. The presence of w:sectPr inside a w:pPr element always denotes a section break, but the presence of w:sectPr as the last child of the w:body element does not. It's important to keep this distinction in mind when generating WordprocessingML documents that have multiple sections.

Figure 2-20 shows what we see when Word opens the document in Example 2-8.

Figure 2-20. Three sections separated by Next Page section breaks
figs/oxml_0220.gif


In the "Normal" view (which we see automatically, thanks to Example 2-8's use of the w:view element), all section breaks are visible. The first mystery of the empty w:sectPr section break element is answered: by default it stands for a "Next Page" break. We could have explicitly specified this in our document by using the w:type child element of w:sectPr, like this:

<w:sectPr>
  <w:type w:val="next-page"/>
</w:sectPr>

Besides next-page, the other possible values (corresponding to the drop-down menu options we saw in Figure 2-19) are next-column, continuous, even-page, and odd-page.
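
For instance, to end a section without forcing a new page, the last paragraph of that section could carry a continuous break. This is only a sketch that combines the w:type element shown above with the paragraph-level w:sectPr placement described earlier; the paragraph text is invented.

```xml
<w:p>
  <w:pPr>
    <w:sectPr>
      <!-- The section ends here, and the next one begins on the same page -->
      <w:type w:val="continuous"/>
    </w:sectPr>
  </w:pPr>
  <w:r>
    <w:t>Last paragraph of a section that flows straight into the next one</w:t>
  </w:r>
</w:p>
```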

Of course, the insertion of section breaks is not the only responsibility of the w:sectPr element, which stands for "section properties." Its content model, after all, includes 21 possible element children, which collectively represent the settings a user can edit through the File → Page Setup... dialog. The properties specified inside the w:sectPr element apply to the section before the break that it represents (i.e., the section containing the paragraph with which the w:sectPr element is associated).

Normally, when you create a new blank document in Word, all of the page settings defined in the Normal.dot document template are copied into the document. These include margins, paper dimensions, vertical alignment, orientation, etc. But our hand-coded WordprocessingML document (Example 2-8) isn't "normal" in this sense. It was created outside of Word and specifies no page settings at all (as the w:sectPr elements are empty). Word gracefully handles this scenario when it loads the document by automatically inserting its application defaults for page settings. These default page settings are the same settings that are automatically copied into the Normal.dot template when Word is first installed, or when it is forced to create a new Normal.dot template.

We can see Word's application defaults for margins and paper size in the Reveal Formatting task pane in Figure 2-20. The underlying XML representation for these values looks something like this:

<w:sectPr>
  <w:pgSz w:w="12240" w:h="15840"/>
  <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"
           w:header="720" w:footer="720" w:gutter="0"/>
</w:sectPr>

All of the attribute values shown here are expressed in twips, or 1,440ths of an inch. The w:pgSz element sets the page size to 8.5" x 11" (12240 / 1440 = 8.5 and 15840 / 1440 = 11). The w:pgMar element sets the margin widths around the page: one inch (1440 twips) on the top and bottom, and 1.25 inches (1800 twips) on the right and left. It also sets header and footer areas, each with a height of half an inch (720 twips).

If you need to override the default page settings for a particular section, you can simply specify your own values, using any of the other child elements of w:sectPr as necessary.
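
As a rough sketch of such an override, the following section properties describe a landscape Letter page with one-inch margins all around. The values are simply inches multiplied by 1,440 (11 x 1440 = 15840, 8.5 x 1440 = 12240, 1 x 1440 = 1440). The w:orient attribute is our assumption for how the orientation is flagged; the remaining attributes are the ones shown in the application defaults above.

```xml
<w:sectPr>
  <!-- Width and height swapped relative to the portrait defaults -->
  <w:pgSz w:w="15840" w:h="12240" w:orient="landscape"/>
  <!-- One inch (1440 twips) on every side; header and footer left at half an inch -->
  <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"
           w:header="720" w:footer="720" w:gutter="0"/>
</w:sectPr>
```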

2.5.6 Proofing, Protection, and Annotation Markings

The w:proofErr, w:permStart, w:permEnd, and aml:annotation elements have shown up in various places so far without any real explanation. One thing they have in common is that they are all used to mark up ranges of text in a Word document: w:proofErr for spelling and grammar errors, w:permStart and w:permEnd for an editable area within a protected document, and aml:annotation for annotating comments, bookmarks, and revisions within a document.

A range is a span of text defined by a start character position and an end character position. The distinctive thing about ranges is that they can cross paragraph and section boundaries. From within a VBA application, a commonly used range is the range that corresponds to the user's current selection. Individual sentences and words are also examples of ranges that you can access through the Word object model, but they are not actually stored as part of the information in a Word document. Instead, such ranges are purely derivative and calculated on the fly, as the Word or VBA application demands. However, certain kinds of ranges must be stored as part of the Word document itself. These include the various kinds of annotations you can make to a document without affecting its actual formatting, and markings that are automatically created, such as proofing marks for grammar and spelling.

There is a problem with representing such ranges of text in XML, because XML only allows you to represent a single tree. The problem of needing to represent multiple, overlapping hierarchies (which is what such annotations amount to) is commonly addressed in XML by inserting markers into the flow for the start and end positions of the range in question. This is exactly what Word does, too.

Figure 2-21 shows a paragraph in Word in which three ranges are overlapping, namely a document protection range, a grammar error range, and a comment annotation range.

Figure 2-21. Overlapping grammar, protection, and comment markings
figs/oxml_0221.gif


The outer brackets surrounding the entire sentence delineate the boundaries of an editing region with particular permissions; the inner parentheses delineate the boundaries of the text about which a comment was made; and the squiggly line under "This were" is a grammar error automatically recognized and flagged as such by Word. Example 2-9 shows the underlying WordprocessingML for this document excerpt, as output by Word. The start and end markers for each range, all of which are empty elements, are highlighted.

Example 2-9. Overlapping protection, proofing, and comment ranges
    <w:p/>
    <w:permStart w:id="0" w:edGrp="everyone"/>
    <w:proofErr w:type="gramStart"/>
    <w:p>
      <w:r>
        <w:t>This </w:t>
      </w:r>
      <aml:annotation aml:id="0" w:type="Word.Comment.Start"/>
      <w:r>
        <w:t>were</w:t>
      </w:r>
      <w:proofErr w:type="gramEnd"/>
      <w:r>
        <w:t> a grammatically</w:t>
      </w:r>
      <aml:annotation aml:id="0" w:type="Word.Comment.End"/>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <aml:annotation aml:id="0" aml:author="Evan Lenz"
                        aml:createdate="2003-12-22T12:15:00Z"
                        w:type="Word.Comment" w:initials="edl">
          <aml:content>
            <w:p>
              <w:pPr>
                <w:pStyle w:val="CommentText"/>
              </w:pPr>
              <w:r>
                <w:rPr>
                  <w:rStyle w:val="CommentReference"/>
                </w:rPr>
                <w:annotationRef/>
              </w:r>
              <w:r>
                <w:t>Isn't that bad grammar?</w:t>
              </w:r>
            </w:p>
          </aml:content>
        </aml:annotation>
      </w:r>
      <w:r>
        <w:t> suspect sentence.</w:t>
      </w:r>
      <w:permEnd w:id="0"/>
    </w:p>
    <w:p/>

This example illustrates the use of start and end markers to annotate ranges of text, regardless of whether they overlap each other or other elements, such as paragraphs. This explains, at long last, why these elements crop up in so many places in the WordprocessingML schema. They need to occur as block-level elements as well as run-level elements. The w:permStart element occurs in this example in a block context, as a sibling of paragraphs, whereas the corresponding w:permEnd element occurs in a run context, before the end of the paragraph. Likewise, the first of the w:proofErr elements occurs as a block-level element, before the beginning of the paragraph, but the second w:proofErr element, which ends the range at the word "were," occurs as a run-level element.

2.5.6.1 Document protection

Now let's look at how each type of annotation works. The w:permStart and w:permEnd elements work together to identify a range of text that has a particular editing permission enabled. The w:id attribute of each element is used to associate the markers with each other. In this case, we know that they go together, because the w:id attribute value is 0 for both of them:

    <w:permStart w:id="0" w:edGrp="everyone"/>
    ...
      <w:permEnd w:id="0"/>

The value of the w:edGrp attribute denotes a group of people who can edit this region of text. In this case, the value is everyone, which means that there are no restrictions for this particular range. This is useful as a way of overriding a global document protection policy in which the rest of the document is off-limits for making changes. For more information on Word's document protection features, see Chapter 4.

2.5.6.2 Proof errors

The w:proofErr elements in Example 2-9 are used to identify the start and end points of a grammar error. The type of each marker is denoted by the w:type attribute:

    <w:proofErr w:type="gramStart"/>
    ...
      <w:proofErr w:type="gramEnd"/>

Since grammar errors, like spelling errors, cannot overlap each other, there is no need for an ID attribute to associate start and end markers with each other. Word knows that a grammar error ends at the first gramEnd marker that it finds after the gramStart marker. Spelling errors are represented in the same way, using the values of spellStart and spellEnd for the w:type attribute. Thus, the w:proofErr element's w:type attribute has four possible values:

gramStart
gramEnd
spellStart
spellEnd
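
By analogy with the grammar markers in Example 2-9, a flagged spelling error inside a single paragraph might be marked up as in the following sketch. The sentence and its misspelling are invented for illustration; only the spellStart and spellEnd values come from the list above.

```xml
<w:p>
  <w:r>
    <w:t>This sentence has a </w:t>
  </w:r>
  <!-- The spelling error range opens just before the misspelled word -->
  <w:proofErr w:type="spellStart"/>
  <w:r>
    <w:t>mispeled</w:t>
  </w:r>
  <!-- The first spellEnd after spellStart closes the range; no ID is needed -->
  <w:proofErr w:type="spellEnd"/>
  <w:r>
    <w:t> word.</w:t>
  </w:r>
</w:p>
```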

2.5.6.3 Comments and other annotations

Example 2-9 also demonstrates how comments are represented in WordprocessingML. Every comment is represented using three separate aml:annotation elements. The three are associated with each other by having the same aml:id attribute value (0 in Example 2-9's case). The first two aml:annotation elements are used to denote the start and end of the range that the comment is about:

      <aml:annotation aml:id="0" w:type="Word.Comment.Start"/>
      ...
      <aml:annotation aml:id="0" w:type="Word.Comment.End"/>

The w:type attribute values distinguish the start and end markers from each other: Word.Comment.Start and Word.Comment.End. The third aml:annotation element occurs inside a run (w:r element) that immediately follows the comment end marker:

      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        ...
      </w:r>

This run is associated with the CommentReference character style, a built-in style that is automatically inserted into the document when you insert a comment. So far, this looks like a normal run that might appear in the flow of document text. The content of the run, however, does not consist of normal document text. Instead, inside the run, we see the third and last aml:annotation element for this comment:

        <aml:annotation aml:id="0" aml:author="Evan Lenz"
                        aml:createdate="2003-12-22T12:15:00Z"
                        w:type="Word.Comment" w:initials="edl">
          ...
        </aml:annotation>

The aml:id attribute's value is 0, which associates this annotation with the previous two. The w:type attribute is Word.Comment, which indicates that this element contains the actual content of the comment. The other three attributes contain metadata about the comment, including who made the comment, their initials, and the date and time they made it.

Inside the aml:annotation element is the aml:content element, which is used to contain the text of the comment:

          <aml:content>
            <w:p>
              <w:pPr>
                <w:pStyle w:val="CommentText"/>
              </w:pPr>
              <w:r>
                <w:rPr>
                  <w:rStyle w:val="CommentReference"/>
                </w:rPr>
                <w:annotationRef/>
              </w:r>
              <w:r>
                <w:t>Isn't that bad grammar?</w:t>
              </w:r>
            </w:p>
          </aml:content>

The comment text is represented using a sequence of Word paragraphs. These paragraphs are "out-of-band" in the sense that they do not occur in the normal flow of document text. After all, they ultimately occur inside a w:r element. A paragraph inside a run isn't normally allowed; it wouldn't make any sense. Only because of the intervening aml:annotation and aml:content elements is the w:p element allowed to occur as a descendant of a w:r element.

In addition to comments, the aml:annotation element is also used to represent bookmarks and revision markings (recorded when "Track Changes" is turned on). In each case, the type of annotation is identified by the value of the w:type attribute, which has these possible values:

Word.Insertion
Word.Deletion
Word.Formatting
Word.Bookmark.Start
Word.Bookmark.End
Word.Comment.Start
Word.Comment.End
Word.Insertion.Start
Word.Insertion.End
Word.Deletion.Start
Word.Deletion.End
Word.Comment
Word.Numbering
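
To give a feel for one of the other annotation types, here is a sketch of how a bookmark around a single word might look. It reuses the start/end marker pattern we saw for comments, with the two markers tied together by a shared aml:id value. The w:name attribute, the bookmark name, and the ID value are assumptions made for illustration rather than details confirmed in this chapter.

```xml
<w:p>
  <w:r>
    <w:t>See the </w:t>
  </w:r>
  <!-- Assumed: the start marker carries the bookmark's name -->
  <aml:annotation aml:id="1" w:type="Word.Bookmark.Start" w:name="ImportantTerm"/>
  <w:r>
    <w:t>glossary</w:t>
  </w:r>
  <!-- The matching aml:id value pairs this end marker with the start marker -->
  <aml:annotation aml:id="1" w:type="Word.Bookmark.End"/>
  <w:r>
    <w:t> for details.</w:t>
  </w:r>
</w:p>
```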

