2.4 A Simple Example Revisited


Example 2-2 shows how our "Hello, World" example looks after opening it in Word, selecting Save As . . . , and saving the file with a new name, HelloSaved.xml. For the sake of readability, we've added line breaks and indentation, neither of which affects the meaning of the file. The highlighted lines in this example correspond to the lines that were present in our original hand-edited WordprocessingML document in Example 2-1. Everything else is new.

Example 2-2. The same Word document, after Word saves it as XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:w10="urn:schemas-microsoft-com:office:word"
  xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
  w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no"
  xml:space="preserve">
  <o:DocumentProperties>
    <o:Title>Hello, World</o:Title>
    <o:Author>Evan Lenz</o:Author>
    <o:LastAuthor>Evan Lenz</o:LastAuthor>
    <o:Revision>4</o:Revision>
    <o:TotalTime>15</o:TotalTime>
    <o:Created>2003-12-06T22:45:00Z</o:Created>
    <o:LastSaved>2003-12-18T07:59:00Z</o:LastSaved>
    <o:Pages>1</o:Pages>
    <o:Words>2</o:Words>
    <o:Characters>12</o:Characters>
    <o:Lines>1</o:Lines>
    <o:Paragraphs>1</o:Paragraphs>
    <o:CharactersWithSpaces>13</o:CharactersWithSpaces>
    <o:Version>11.5604</o:Version>
  </o:DocumentProperties>
  <w:fonts>
    <w:defaultFonts w:ascii="Times New Roman" w:fareast="Times New Roman"
                    w:h-ansi="Times New Roman" w:cs="Times New Roman"/>
  </w:fonts>
  <w:styles>
    <w:versionOfBuiltInStylenames w:val="4"/>
    <w:latentStyles w:defLockedState="off" w:latentStyleCount="156"/>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
      <w:name w:val="Normal"/>
      <w:rsid w:val="00B15979"/>
      <w:rPr>
        <wx:font wx:val="Times New Roman"/>
        <w:sz w:val="24"/>
        <w:sz-cs w:val="24"/>
        <w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA"/>
      </w:rPr>
    </w:style>
    <w:style w:type="character" w:default="on"
             w:styleId="DefaultParagraphFont">
      <w:name w:val="Default Paragraph Font"/>
      <w:semiHidden/>
    </w:style>
    <w:style w:type="table" w:default="on" w:styleId="TableNormal">
      <w:name w:val="Normal Table"/>
      <wx:uiName wx:val="Table Normal"/>
      <w:semiHidden/>
      <w:rPr>
        <wx:font wx:val="Times New Roman"/>
      </w:rPr>
      <w:tblPr>
        <w:tblInd w:w="0" w:type="dxa"/>
        <w:tblCellMar>
          <w:top w:w="0" w:type="dxa"/>
          <w:left w:w="108" w:type="dxa"/>
          <w:bottom w:w="0" w:type="dxa"/>
          <w:right w:w="108" w:type="dxa"/>
        </w:tblCellMar>
      </w:tblPr>
    </w:style>
    <w:style w:type="list" w:default="on" w:styleId="NoList">
      <w:name w:val="No List"/>
      <w:semiHidden/>
    </w:style>
  </w:styles>
  <w:docPr>
    <w:view w:val="web"/>
    <w:zoom w:percent="100"/>
    <w:proofState w:spelling="clean" w:grammar="clean"/>
    <w:attachedTemplate w:val=""/>
    <w:defaultTabStop w:val="720"/>
    <w:characterSpacingControl w:val="DontCompress"/>
    <w:validateAgainstSchema/>
    <w:saveInvalidXML w:val="off"/>
    <w:ignoreMixedContent w:val="off"/>
    <w:alwaysShowPlaceholderText w:val="off"/>
    <w:compat/>
  </w:docPr>
  <w:body>
    <wx:sect>
      <w:p>
        <w:r>
          <w:t>Hello, World!</w:t>
        </w:r>
      </w:p>
      <w:sectPr>
        <w:pgSz w:w="12240" w:h="15840"/>
        <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"
                 w:header="720" w:footer="720" w:gutter="0"/>
        <w:cols w:space="720"/>
        <w:docGrid w:line-pitch="360"/>
      </w:sectPr>
    </wx:sect>
  </w:body>
</w:wordDocument>

The first thing that may come to mind when looking at this example is "Why does the XML contain so much more information when all I did was save it?" Or perhaps you've begun to panic.

Don't. While all of this XML is certainly daunting at first glance, we'll see that for the most part its meaning is straightforward. Take comfort in the fact that, while Word may create markup that's quite verbose, it can handle markup that minimally conforms to its schema without complaining at all. This liberality in what Word accepts makes it much easier to write applications that generate WordprocessingML.
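To make that last point concrete, here is a minimal sketch of a generator written in Python with the standard library's xml.etree.ElementTree module. It is only an illustration: the function names are our own invention, the structure (a body holding one paragraph, one run, and one text element) is taken from Example 2-2, and the progid value Word.Document in the processing instruction is an assumption about what Word expects rather than something this section spells out.

```python
import xml.etree.ElementTree as ET

# Primary WordprocessingML namespace, as declared in Example 2-2.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return an ElementTree-qualified name in the w namespace (hypothetical helper)."""
    return "{%s}%s" % (W_NS, tag)


def minimal_document(text):
    """Build a document holding one paragraph, one run, and one text element."""
    root = ET.Element(w("wordDocument"))
    body = ET.SubElement(root, w("body"))
    para = ET.SubElement(body, w("p"))
    run = ET.SubElement(para, w("r"))
    ET.SubElement(run, w("t")).text = text
    # Assumption: this processing instruction associates the file with Word.
    return ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
            '<?mso-application progid="Word.Document"?>\n'
            + ET.tostring(root, encoding="unicode"))


xml_text = minimal_document("Hello, World!")
print(xml_text)
```

Nothing here declares fonts, styles, or document properties; the point of the sketch is that the only namespace declaration it emits is the one for the w prefix.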

Let's take a tour through this document, examining each top-level element in turn. Getting an overall, top-down view of what goes into a WordprocessingML document will help bring context to the more nitty-gritty, bottom-up examination of the vocabulary that will follow later in this chapter.

2.4.1 The w:wordDocument Element

The root element of Example 2-2, w:wordDocument, has a large number of attributes:

<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:w10="urn:schemas-microsoft-com:office:word"
  xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
  w:macrosPresent="no" w:embeddedObjPresent="no"
  w:ocxPresent="no" xml:space="preserve">

Actually, most of these are technically namespace declarations. They are present on every WordprocessingML document that Word outputs, regardless of whether all the namespaces are actually used in the document. In WordprocessingML, you can safely leave out all the namespace declarations except the ones you actually use, which will minimally include the primary WordprocessingML namespace (normally mapped to the w prefix). Below is a list of the namespaces declared in this document, along with a brief description of the purpose of each.


http://schemas.microsoft.com/office/word/2003/wordml

Mapped to the w prefix. All of the core WordprocessingML elements and attributes are in this namespace.


urn:schemas-microsoft-com:vml

Mapped to the v prefix. Elements in this namespace represent embedded Vector Markup Language (VML) images.


urn:schemas-microsoft-com:office:word

Mapped to the w10 prefix. This namespace is used for legacy elements from Word Ten. It is used in HTML output.


http://schemas.microsoft.com/schemaLibrary/2003/core

Mapped to the sl prefix. The sl:schema and sl:schemaLibrary elements are used with Word's custom XML schema functionality, and are introduced in Chapter 4.


http://schemas.microsoft.com/aml/2001/core

Mapped to the aml prefix. The Annotation Markup Language (AML) elements are used to describe tracked changes, comments, and bookmarks.


http://schemas.microsoft.com/office/word/2003/auxHint

Mapped to the wx prefix. Elements in this namespace provide "auxiliary hints" for processing WordprocessingML documents outside of Word. They represent derivative information that is useful to us but that is of no internal use to Word. See "Auxiliary Hints in WordprocessingML," later in this chapter.


urn:schemas-microsoft-com:office:office

Mapped to the o prefix. This is the namespace for "shared" document properties and custom document properties. They are shared in that they also apply to other Office applications, such as Excel.


uuid:C2F41010-65B3-11d1-A29F-00AA00C14882

Mapped to the dt prefix. This is the XML Data Reduced (XDR) namespace, which, in WordprocessingML, qualifies the dt (data type) attributes of a document's custom document property elements.

While some confusing legacy is evident in this list, the overall distinction between namespaces is helpful, particularly between the wx and w namespaces, as we'll see.
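Since Word declares all eight namespaces whether or not they are used, it can be handy to find out which ones a given document really depends on. The following Python sketch is one way to do that under simple assumptions: the prefix map is copied from the root element of Example 2-2, the function name used_prefixes is hypothetical, and the sample document is a cut-down stand-in rather than real Word output.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map copied from the root element of Example 2-2.
NAMESPACES = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "v": "urn:schemas-microsoft-com:vml",
    "w10": "urn:schemas-microsoft-com:office:word",
    "sl": "http://schemas.microsoft.com/schemaLibrary/2003/core",
    "aml": "http://schemas.microsoft.com/aml/2001/core",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "o": "urn:schemas-microsoft-com:office:office",
    "dt": "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882",
}
PREFIX_BY_URI = {uri: prefix for prefix, uri in NAMESPACES.items()}


def used_prefixes(root):
    """Return the sorted prefixes whose namespaces qualify an element or attribute."""
    used = set()
    for elem in root.iter():
        for name in [elem.tag] + list(elem.attrib):
            if name.startswith("{"):
                uri = name[1:].split("}", 1)[0]
                if uri in PREFIX_BY_URI:
                    used.add(PREFIX_BY_URI[uri])
    return sorted(used)


# Cut-down sample: the v namespace is declared but never used.
SAMPLE = """<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
  xmlns:o="urn:schemas-microsoft-com:office:office">
  <o:DocumentProperties><o:Title>Hello, World</o:Title></o:DocumentProperties>
  <w:body><wx:sect><w:p><w:r><w:t>Hello, World!</w:t></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

print(used_prefixes(ET.fromstring(SAMPLE)))  # ['o', 'w', 'wx']
```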

The xml:space attribute is set to preserve, in order that whitespace characters (and even any instances of the empty w:tab element) are interpreted correctly. As a matter of best practice, you should include xml:space="preserve" on the root element of any WordprocessingML document you create.

The remaining three attributes of the w:wordDocument element are all optional and default to the value no.

w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no"

These are consistency checks for when certain kinds of base64-encoded binary objects are embedded in the document. Specifically, w:macrosPresent must be set to yes when the w:docSuppData element is present (containing toolbar customizations, VBA macros, etc.); w:embeddedObjPresent must be set to yes when the w:docOleData element is present (containing OLE objects from other applications, such as Excel); and w:ocxPresent must be set to yes when a w:ocx element is present somewhere in the body of the document (representing a control from Word's Control Toolbox). Unless your document contains any such objects, you can safely leave out these attributes.
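If you generate WordprocessingML yourself, a small check along these lines may help catch a forgotten flag. This is a sketch in Python, not a description of how Word performs the check; the function name flag_problems is hypothetical, and the empty w:docSuppData element in the sample is a placeholder for whatever binary content a real document would carry.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Root attribute -> element whose presence requires the attribute to be "yes".
FLAGS = {
    "macrosPresent": "docSuppData",
    "embeddedObjPresent": "docOleData",
    "ocxPresent": "ocx",
}


def flag_problems(root):
    """List the flags that are not "yes" although their element is present."""
    problems = []
    for attr, tag in FLAGS.items():
        declared = root.get(W + attr, "no")  # all three default to "no"
        present = root.find(".//" + W + tag) is not None
        if present and declared != "yes":
            problems.append(attr)
    return problems


# Placeholder sample: w:docSuppData is present, but w:macrosPresent says "no".
SAMPLE = """<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  w:macrosPresent="no">
  <w:docSuppData/>
  <w:body/>
</w:wordDocument>"""

print(flag_problems(ET.fromstring(SAMPLE)))  # ['macrosPresent']
```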

The child elements of w:wordDocument, as included in this example, represent only a portion of the root element's complete content model. Below is a list of all possible child elements in the order they are supposed to occur, according to the WordprocessingML schema. Word tends to be lenient about WordprocessingML documents that contain these elements in a different order, which suggests it does not validate documents against the published schema when they are loaded. However, to be on the safe side, you should ensure that these elements are in the correct order in WordprocessingML documents that you create. As mentioned before, w:body is the only required child element of w:wordDocument. Only five of the elements in this list (o:DocumentProperties, w:fonts, w:styles, w:docPr, and w:body) are actually present in Example 2-2.

w:ignoreSubtree
w:ignoreElements
o:SmartTagType
o:DocumentProperties
o:CustomDocumentProperties
sl:schemaLibrary
w:fonts
w:frameset
w:lists
w:styles
w:divs
w:docOleData
w:docSuppData
w:shapeDefaults
w:bgPict
w:docPr
w:body

Apart from the five elements that appear in Example 2-2, the w:lists element is the only one in the above list that will receive further coverage in this chapter.
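Because Word itself does not complain about misordered children, you may want your own tooling to do so. The Python sketch below encodes the list above and reports whether the children of a root element follow it. The names clark and in_schema_order are hypothetical, elements from other namespaces are simply ignored, and the empty w:styles and w:docPr elements in the samples are placeholders.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "o": "urn:schemas-microsoft-com:office:office",
    "sl": "http://schemas.microsoft.com/schemaLibrary/2003/core",
}

# Child elements of w:wordDocument in the order listed in this section.
ORDER = [
    "w:ignoreSubtree", "w:ignoreElements", "o:SmartTagType",
    "o:DocumentProperties", "o:CustomDocumentProperties", "sl:schemaLibrary",
    "w:fonts", "w:frameset", "w:lists", "w:styles", "w:divs", "w:docOleData",
    "w:docSuppData", "w:shapeDefaults", "w:bgPict", "w:docPr", "w:body",
]


def clark(qname):
    """Turn a prefixed name such as w:body into ElementTree's {uri}local form."""
    prefix, local = qname.split(":")
    return "{%s}%s" % (NS[prefix], local)


RANK = {clark(name): position for position, name in enumerate(ORDER)}


def in_schema_order(root):
    """True if the known children of root appear in the listed order."""
    ranks = [RANK[child.tag] for child in root if child.tag in RANK]
    return ranks == sorted(ranks)


GOOD = ('<w:wordDocument xmlns:w="%s"><w:styles/><w:docPr/><w:body/>'
        '</w:wordDocument>' % NS["w"])
SHUFFLED = ('<w:wordDocument xmlns:w="%s"><w:docPr/><w:styles/><w:body/>'
            '</w:wordDocument>' % NS["w"])

print(in_schema_order(ET.fromstring(GOOD)))      # True
print(in_schema_order(ET.fromstring(SHUFFLED)))  # False
```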

2.4.2 The o:DocumentProperties Element

The o:DocumentProperties element in Example 2-2, shown again below, is in the general Office namespace (mapped to the o prefix), because it includes properties, such as metadata and statistics, that are common to both Word and Excel:

  <o:DocumentProperties>
    <o:Title>Hello, World</o:Title>
    <o:Author>Evan Lenz</o:Author>
    <o:LastAuthor>Evan Lenz</o:LastAuthor>
    <o:Revision>4</o:Revision>
    <o:TotalTime>15</o:TotalTime>
    <o:Created>2003-12-06T22:45:00Z</o:Created>
    <o:LastSaved>2003-12-18T07:59:00Z</o:LastSaved>
    <o:Pages>1</o:Pages>
    <o:Words>2</o:Words>
    <o:Characters>12</o:Characters>
    <o:Lines>1</o:Lines>
    <o:Paragraphs>1</o:Paragraphs>
    <o:CharactersWithSpaces>13</o:CharactersWithSpaces>
    <o:Version>11.5604</o:Version>
  </o:DocumentProperties>

These elements are also serialized as such when Word saves a document as HTML. They correspond primarily to the properties you see when you open the document Properties dialog (by selecting File → Properties). Figure 2-3 shows the Statistics tab of the file Properties dialog.

Figure 2-3. The Statistics tab of the Properties dialog, corresponding to values inside the o:DocumentProperties element
figs/oxml_0203.gif


There are 12 more valid child elements of o:DocumentProperties not shown here, making a total of 26. A number of these can be added to a document from within Word, at user option. For example, there is an element corresponding to each of the fields in the Summary tab of the file Properties dialog, shown in Figure 2-4.

Figure 2-4. Other document properties can be populated at user option
figs/oxml_0204.gif
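Because these properties are plain child elements with text content, they are easy to read from outside Word. Here is a short Python sketch, assuming the element layout shown in Example 2-2; the function name document_properties is hypothetical, and the sample carries only a handful of the properties.

```python
import xml.etree.ElementTree as ET

O = "{urn:schemas-microsoft-com:office:office}"

# Abbreviated sample based on Example 2-2.
SAMPLE = """<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:o="urn:schemas-microsoft-com:office:office">
  <o:DocumentProperties>
    <o:Title>Hello, World</o:Title>
    <o:Author>Evan Lenz</o:Author>
    <o:Words>2</o:Words>
    <o:Characters>12</o:Characters>
    <o:CharactersWithSpaces>13</o:CharactersWithSpaces>
  </o:DocumentProperties>
  <w:body/>
</w:wordDocument>"""


def document_properties(root):
    """Map each o:DocumentProperties child's local name to its text."""
    props = root.find(O + "DocumentProperties")
    if props is None:
        return {}
    return {child.tag.split("}", 1)[-1]: (child.text or "") for child in props}


props = document_properties(ET.fromstring(SAMPLE))
print(props["Title"], props["Words"])
```

All values come back as strings; converting counts such as Words to integers is left to the caller, since not every property is numeric.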


2.4.3 The w:fonts Element

The w:defaultFonts element inside the w:fonts element specifies the default font for a document.

  <w:fonts>
    <w:defaultFonts w:ascii="Times New Roman" w:fareast="Times New Roman"
                    w:h-ansi="Times New Roman" w:cs="Times New Roman"/>
  </w:fonts>

A document's default font is applied to all of the document's paragraph styles that do not explicitly specify a font. Normally, when you create a new blank document in Word, the default font setting as specified in the Normal.dot document template is copied into the document. But our hand-coded WordprocessingML document (Example 2-1) isn't "normal" in this sense. It was created outside of Word and contains no default font definition at all. Word gracefully handles this scenario when it loads the document by automatically inserting a default font, as shown in Example 2-2. Times New Roman is thus the "default default" font. In fact, Times New Roman is also the default font assigned to the Normal.dot template when Word is first installed, or when it is forced to create a new Normal.dot template because someone deleted the Normal.dot file.

The attributes on the w:defaultFonts element indicate which font should be used for each character encoding range among ASCII, high ANSI, complex scripts, and East Asian characters. In Example 2-2, Times New Roman is the default font for all of these ranges.

The w:fonts element may also contain zero or more w:font elements (zero in the case of Example 2-2) following the w:defaultFonts element. The w:font elements are optional; you don't need to include a corresponding w:font element just to use a particular font. The only purpose of this element is to provide Word with descriptive information about a font (using its seven possible child elements) that could be useful in the event that the font is not available on a user's machine. In that case, Word can choose a reasonable alternative based on the information about the font provided in the document.

2.4.4 The w:styles Element

The w:styles element includes definitions of all of a document's styles. Before looking at the WordprocessingML syntax for defining styles, let's establish some basic terminology. A style is a group of formatting properties that can be applied as a unit. There are four possible style types in Word:

paragraph
character
table
list

These style types apply respectively to paragraphs, runs, tables, and lists. Every paragraph, run, table, and list in a Word document is necessarily associated with a style of the corresponding type. If a paragraph, run, table, or list in a WordprocessingML document doesn't explicitly specify an associated style (as is the case in Example 2-2), then it takes on the document's default style of the appropriate style type. Thus, styles are always involved, regardless of whether you specifically make use of them.

Normally, when you create a new blank document in Word, all of the styles defined in the Normal.dot document template are copied into the document. These include, at minimum, a default style definition for each style type. However, our hand-coded WordprocessingML document does not include the w:styles element. Just as Word automatically creates the w:fonts element when absent, Word automatically inserts four w:style elements, corresponding respectively to the four style types (paragraph, character, table, and list):

Normal
Default Paragraph Font
Normal Table
No List

These four Word-defined styles are what we see inside the w:styles element in Example 2-2. Effectively, they are implicitly present in any WordprocessingML document that does not explicitly define them. (However, to explicitly refer to them from within the body of the document, they must also be explicitly present in the document's w:styles element.) These "default default" styles are also the same four style definitions that are automatically copied into the Normal.dot template when Word is first installed, or when it is forced to create a new Normal.dot template.

Now let's take a look at the content of the w:styles element, extracted from Example 2-2. Preceding the style definitions themselves are two elements:

<w:versionOfBuiltInStylenames w:val="4"/>
<w:latentStyles w:defLockedState="off" w:latentStyleCount="156"/>

The w:versionOfBuiltInStylenames and w:latentStyles elements are used to refer to particular built-in styles when document formatting protection is turned on. Since document protection is an important ingredient in building custom XML solutions in Word, these elements will be covered in Chapter 4. For now, all you need to know is that there are no formatting restrictions on this document. In fact, this document would be interpreted no differently if we were to remove these two (optional) elements.

Next, there are four w:style elements, one for each of the "default default" styles listed above:

    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
      <w:name w:val="Normal"/>
      <w:rPr>
        <wx:font wx:val="Times New Roman"/>
        <w:sz w:val="24"/>
        <w:sz-cs w:val="24"/>
        <w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA"/>
      </w:rPr>
    </w:style>
    <w:style w:type="character" w:default="on"
             w:styleId="DefaultParagraphFont">
      <w:name w:val="Default Paragraph Font"/>
      <w:semiHidden/>
    </w:style>
    <w:style w:type="table" w:default="on" w:styleId="TableNormal">
      <w:name w:val="Normal Table"/>
      <wx:uiName wx:val="Table Normal"/>
      <w:semiHidden/>
      <w:rPr>
        <wx:font wx:val="Times New Roman"/>
      </w:rPr>
      <w:tblPr>
        <w:tblInd w:w="0" w:type="dxa"/>
        <w:tblCellMar>
          <w:top w:w="0" w:type="dxa"/>
          <w:left w:w="108" w:type="dxa"/>
          <w:bottom w:w="0" w:type="dxa"/>
          <w:right w:w="108" w:type="dxa"/>
        </w:tblCellMar>
      </w:tblPr>
    </w:style>
    <w:style w:type="list" w:default="on" w:styleId="NoList">
      <w:name w:val="No List"/>
      <w:semiHidden/>
    </w:style>

For now, we'll only look at the lines that are highlighted. The w:type attribute of each w:style element indicates the style type (paragraph, character, table, or list). The presence of w:default="on" denotes that this style is the default style for its style type. This attribute's default value is off.

Each style has two different names, as indicated by the w:styleId attribute and the w:name element. The w:styleId attribute is for intra-document references only; it must be unique within the file. Styles can be referred to either from within the document's body (to associate a paragraph with a certain paragraph style, for example) or from within another style definition (to derive the style from another style, for example). The w:styleId attribute is unused apart from these internal associations. In fact, Word doesn't preserve its value when it opens the document. When a document is subsequently saved as XML, Word auto-generates a value for the w:styleId attribute, usually deriving it from the style's primary name.

The primary name of a style is denoted by the w:val attribute of the w:name element. The primary name of a style is what the user sees in the Style drop-down menu in the Word UI. Also, for styles that came from a template, the primary name uniquely identifies the style in the attached template and is the basis by which styles are updated when the "Automatically update document styles" document option is turned on. This name, like the w:styleId attribute, must be unique within the file. Otherwise, Word will try to fix things up, probably not in the way that you intended.

For certain built-in styles, the style name displayed in the Word UI differs from the primary name of the style. For example, the "Normal Table" style appears as "Table Normal" in the UI. This (dubious) privilege is restricted to Word's built-in style names; there is no way in WordprocessingML to define a custom style whose UI name differs from its primary name. Word, however, does throw us a bone when it saves such styles as XML. The wx:uiName element clues us in to the distinction:

<wx:uiName wx:val="Table Normal"/>

This element is strictly informational. If you were to remove it or change the wx:val attribute's value, Word would behave no differently when opening the file. Elements and attributes in the namespace designated by the wx prefix are for our benefit only and are of no internal use to Word.
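To tie the pieces of this section together, the following Python sketch looks up the default style for each style type by reading the w:type and w:default attributes and the w:name element. It is an illustration only: the function name default_styles is hypothetical, the "My Heading" style is made up to show a non-default style, and the w:styleId values in the sample are illustrative.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Abbreviated sample; "My Heading" is a made-up, non-default style.
SAMPLE = """<w:styles xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
    <w:name w:val="Normal"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="MyHeading">
    <w:name w:val="My Heading"/>
  </w:style>
  <w:style w:type="list" w:default="on" w:styleId="NoList">
    <w:name w:val="No List"/>
  </w:style>
</w:styles>"""


def default_styles(styles):
    """Map each style type to the primary name of its default style."""
    defaults = {}
    for style in styles.findall(W + "style"):
        # w:default is "off" unless the attribute says otherwise.
        if style.get(W + "default", "off") == "on":
            name = style.find(W + "name")
            primary = name.get(W + "val") if name is not None else None
            defaults[style.get(W + "type")] = primary
    return defaults


print(default_styles(ET.fromstring(SAMPLE)))
```

The sketch keys on the primary name rather than w:styleId because, as noted above, Word regenerates the w:styleId value when it saves the document.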

2.4.5 The w:docPr Element

Have you ever wondered whether a particular option in the Word UI represents a property of the document you are editing as opposed to a property of the application's state? The answer to your question may lie inside the w:docPr element, which, like one of its siblings mentioned earlier, stands for "document properties." However, unlike the information inside the o:DocumentProperties element, these document properties are unique to Word and describe particular aspects of a document's state, options, and default settings, rather than metadata or statistics that are common to multiple Office applications.

The Tools → Options... dialog in the Word UI, with its many tabs, is rather notorious for being unclear about what exactly the user is modifying, whether global application options or document options. By investigating the contents of the w:docPr element, you can begin to identify which of these options are document-specific and which of them aren't.

The *Pr naming convention that w:docPr follows is common in WordprocessingML. As we'll see, a number of other elements follow this convention, such as w:pPr (paragraph properties), w:rPr (run properties), w:tblPr (table properties), w:trPr (table row properties), w:tcPr (table cell properties), and w:listPr (list properties). In fact, the baseline content model of these elements is also similar: a sequence of mostly empty elements, each standing for a particular property and each having zero or more attributes to set the values of that property. The most commonly used attribute is w:val. You may have noticed by now that WordprocessingML favors putting not only elements but also attributes in its namespace, which means you should get used to typing those w prefixes. (The attributeFormDefault value is set to qualified in each of the WordprocessingML schema documents.)

The w:docPr element has 84 optional child elements. They are declared in the WordprocessingML schema as an ordered sequence (as opposed to a repeating choice group), which suggests that they must occur in the declared order. In reality, Word does not enforce this order, though it does appear to follow it in the WordprocessingML documents it creates.

Now, let's look at the w:docPr element as output by Word in Example 2-2:

  <w:docPr>
    <w:view w:val="web"/>
    <w:zoom w:percent="100"/>
    <w:proofState w:spelling="clean" w:grammar="clean"/>
    <w:attachedTemplate w:val=""/>
    <w:defaultTabStop w:val="720"/>
    <w:characterSpacingControl w:val="DontCompress"/>
    <w:validateAgainstSchema/>
    <w:saveInvalidXML w:val="off"/>
    <w:ignoreMixedContent w:val="off"/>
    <w:alwaysShowPlaceholderText w:val="off"/>
    <w:compat/>
  </w:docPr>

The 11 child elements shown here provide a fairly representative sampling of these options.

The w:view element determines what view to use when opening the document. The default view for a WordprocessingML document that does not specify a view is web, which is also Word's default view for opening XML documents in general. That explains why we see the value web in this example:

<w:view w:val="web"/>

This value is the result of Word re-saving a WordprocessingML document that we constructed by hand, without specifying a view. The five possible values of view are print, outline, normal, web, and master-pages (similar to outline but applies only to documents that refer to sub-documents).

The w:zoom element denotes the zoom percentage that should be set when opening the document:

<w:zoom w:percent="100"/>

If you change the zoom percentage from within Word and re-save (provided that you also make a substantive change to the document's content to ensure that the file is actually updated), Word will save the document, recording the zoom level that you last used. Alternatively, you could directly edit the zoom property in the WordprocessingML, causing Word to display the document at some other zoom percentage the next time someone opens the file.

The w:proofState element records the state of the grammar and spelling checkers (clean or dirty) at the time Word saved the document:

<w:proofState w:spelling="clean" w:grammar="clean"/>

Since actual spelling and grammar errors are recorded in the body of the document, this state check reflects not whether there are errors in the document, but whether Word had a chance to finish checking for errors before the user saved the document. Thus, its primary purpose is as an optimization hint for Word when it opens the document. Its absence, however, could conceivably be a useful warning for applications that otherwise rely on Word having completed its proofing.

The w:attachedTemplate property is one of the two elements representing Templates and Add-Ins options (along with the w:linkStyles element):

<w:attachedTemplate w:val=""/>

Its value in this example is empty, which means simply that the default Normal.dot template is attached. Should you attach a different template (through the Tools → Templates and Add-Ins... dialog) and re-save, then this value would be populated with the specific file location of a template. Alternatively, you could manually edit the XML attribute value so that the next time Word opens the document, the new template will already be attached by virtue of your manual change. Note, however, that unless the w:linkStyles element is also present inside the w:docPr element (as explained later), the fact that a template is merely attached has no immediate effect on the document. The w:attachedTemplate element defines a loose association whose potential is only realized when the w:linkStyles element is also present.

The w:validateAgainstSchema, w:saveInvalidXML, w:ignoreMixedContent, and w:alwaysShowPlaceholderText properties (among several others not included in this example) are specific to Word's custom XML schema functionality (only available in Office 2003 Professional or standalone Word 2003), which is discussed in Chapter 4.

The w:defaultTabStop element sets the interval between default tab stops in the document:

<w:defaultTabStop w:val="720"/>

While the Word UI exposes this value in inches (when you select Format → Tabs...), the underlying value is stored in twips, or 20ths of a point, or 1,440ths of an inch. (Completing this equation, there are 72 points in an inch.) Since the value of the w:val attribute is 720 twips, the default tab stops for paragraphs in this document occur every half inch. Thus, when Word opens the document, it displays the short vertical lines beneath the ruler, spaced every half inch, as shown in Figure 2-5.

Figure 2-5. Default tab stops every half inch, or 720 twips
figs/oxml_0205.gif
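The twips arithmetic is simple enough to capture in a couple of helper functions. The Python sketch below uses hypothetical function names; the second call assumes that the w:pgSz width in Example 2-2 (12240) is also expressed in twips, which is consistent with a letter-size page but is not stated in this section.

```python
TWIPS_PER_POINT = 20
POINTS_PER_INCH = 72
TWIPS_PER_INCH = TWIPS_PER_POINT * POINTS_PER_INCH  # 1,440


def twips_to_inches(twips):
    """Convert a measurement in twips to inches."""
    return twips / TWIPS_PER_INCH


def inches_to_twips(inches):
    """Convert inches to the nearest whole number of twips."""
    return round(inches * TWIPS_PER_INCH)


print(twips_to_inches(720))    # 0.5, the default tab stop interval
print(twips_to_inches(12240))  # 8.5, assuming w:pgSz widths are in twips
print(inches_to_twips(0.5))    # 720
```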


Once again, Word supplies this value as an application default, because our original hand-edited document (Example 2-1) did not specify a default tab stop interval. As we'll see, individual paragraphs can define their own custom tab stops too. For those paragraphs, the default tab stops only take effect to the right of the last custom stop.

The w:characterSpacingControl element is one of several Asian Typography options.

<w:characterSpacingControl w:val="DontCompress"/>

There are three possible self-describing values (DontCompress, CompressPunctuation, or CompressPunctuationAndJapaneseKana) that can be used to set the compression option for East Asian characters. The default value that Word outputs, as evident in our example, is DontCompress. Of course, this doesn't have any real effect on our document, since it does not contain Asian characters.

Finally, the w:compat element is among the few w:docPr children that may themselves contain child elements (w:mailMerge, w:hdrShapeDefaults, w:footnotePr, w:endnotePr, and w:docVars being the only others). It has 51 possible child elements, corresponding to the compatibility options for a document that are set in the Compatibility tab of the Tools → Options... dialog, as shown in Figure 2-6.

Figure 2-6. Compatibility options, corresponding to the child elements of w:compat
figs/oxml_0206.gif


The w:compat element is empty in Example 2-2, because our document does not set any particular compatibility options.

Before moving on, it would be good to point out one more common WordprocessingML convention. Among w:docPr's 84 possible child elements, 49 are declared using the same type in the WordprocessingML schema: the onOffProperty. The declaration for the onOffProperty type in the WordprocessingML schema is as follows:

<xsd:complexType name="onOffProperty">
  <xsd:attribute name="val" type="onOffType" default="on"/>
</xsd:complexType>

The onOffType type referred to here allows for two possible values: on or off. As you can see, the attribute declaration for w:val specifies a default value of on. This means that for the elements inside the w:docPr element that are defined with this type, the presence of w:val="on" is always implied (and thus redundant), unless overridden by the value off. However, this has no bearing at all on Word's behavior when the property element itself is absent. Default behavior in those cases varies depending on the property, and the WordprocessingML schema itself does not generally cast any light on that question, although annotations therein do sometimes help. Experimentation is probably the best way to determine Word's default behavior when particular property elements are absent.
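The three cases (element present without w:val, element present with w:val, element absent) are easy to confuse, so here is a small Python sketch that keeps them apart. The function name on_off is hypothetical, and returning None for an absent element is our own convention for "unknown," reflecting the fact that Word's behavior in that case varies by property.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Two of the on/off properties from Example 2-2.
SAMPLE = """<w:docPr xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:validateAgainstSchema/>
  <w:saveInvalidXML w:val="off"/>
</w:docPr>"""


def on_off(doc_pr, name):
    """Return True or False for an on/off property, or None when it is absent."""
    prop = doc_pr.find(W + name)
    if prop is None:
        return None  # the schema default does not apply to a missing element
    return prop.get(W + "val", "on") == "on"


doc_pr = ET.fromstring(SAMPLE)
print(on_off(doc_pr, "validateAgainstSchema"))  # True (w:val defaults to "on")
print(on_off(doc_pr, "saveInvalidXML"))         # False
print(on_off(doc_pr, "ignoreMixedContent"))     # None (element absent)
```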

2.4.6 The wx:sect Element

Finally, we get to the content of our document, residing inside the w:body element. Our hand-coded original (Example 2-1) directly contained a w:p (paragraph) element inside the body. After saving, we now see that the paragraph element has been inserted into an intervening wx:sect element. As mentioned earlier, the namespace mapped to the wx prefix signals a piece of information that may be useful to us in processing the XML as output by Word, but that is ignored by Word when opening a WordprocessingML file. The wx elements and attributes are of no use to Word internally. In this case, we could remove the wx:sect element's start and end tags, leaving only its contents for Word to read, and Word would behave no differently the next time it opens the file.
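Removing the wrapper is a mechanical transformation, as the following Python sketch suggests. It replaces each wx:sect child of w:body with that element's own children; the function name unwrap_sections is hypothetical, and the sample body is simplified (its w:sectPr is left empty).

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"
WX = "{http://schemas.microsoft.com/office/word/2003/auxHint}"

# Simplified body based on Example 2-2.
SAMPLE = """<w:body
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <wx:sect>
    <w:p><w:r><w:t>Hello, World!</w:t></w:r></w:p>
    <w:sectPr/>
  </wx:sect>
</w:body>"""


def unwrap_sections(body):
    """Replace each wx:sect child of body with that section's children."""
    flattened = []
    for child in body:
        if child.tag == WX + "sect":
            flattened.extend(child)  # keep the contents, drop the wrapper
        else:
            flattened.append(child)
    body[:] = flattened
    return body


body = unwrap_sections(ET.fromstring(SAMPLE))
print([child.tag.split("}", 1)[-1] for child in body])  # ['p', 'sectPr']
```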

That's all well and good, you might be thinking, but what is the wx:sect element for? As you might guess, it stands for "section." As is true with many Word documents, our "Hello, World" example document contains only one section, so it's not particularly useful in this case. To learn what sections are and how they are defined using w:sectPr elements, see "Sections" later in this chapter. And to learn how the wx:sect element is a useful aid to external processing, see Section 2.6.1 later in this chapter.

2.4.7 The w:body Element

It may seem strange to talk about the w:body element after the wx:sect element, when until now we've been traversing our original example in document order. As already noted, however, the wx:sect element is a completely optional intervening element between w:body and its content. So, while in Example 2-2 it is the wx:sect element that contains a w:p element, that content model really belongs to w:body. Using a DTD-like syntax, we can express w:body's entire content model (much more simply than its XSD definition), like this:

(w:p|w:tbl|w:cfChunk|w:proofErr|w:permStart|w:permEnd)*, w:sectPr?

In other words, w:body may contain any number of w:p, w:tbl, w:cfChunk, w:proofErr, w:permStart, and w:permEnd elements, in any order, followed by an optional w:sectPr element. The w:p element represents a paragraph, the w:tbl element represents a table, and the w:cfChunk element represents a "context-free" chunk of inline default fonts, styles, list definitions, paragraphs, and tables.[1] We'll describe the purpose of the w:proofErr, w:permEnd, and w:permStart elements later, in Section 2.5.6.

[1] At least, that is how the WordprocessingML schema advertises it. A plethora of experiments yields few answers as to how this element is actually supposed to be used or how it is supposed to behave. Word tends to fix things up, merging such inline definitions with the document's global definitions. This is one area where more documentation from Microsoft is certainly needed.

The w:sectPr element, included in Example 2-2, defines the section properties for the last (and first, in this case) section of the document. See "Sections," later in the chapter, for more information on how w:sectPr elements are interpreted.

The first part of the w:body element's content model (that is, not including the optional w:sectPr element) is worth repeating:

(w:p | w:tbl | w:cfChunk | w:proofErr | w:permStart | w:permEnd)*

That's because it also functions as the content model for six other elements in WordprocessingML, namely w:hdr, w:ftr, w:footnote, w:endnote, w:tc, and w:txbxContent. (The only exception is that w:tc may also contain an optional preceding w:tcPr element.) The first two of these elements stand for "header" and "footer," respectively; they occur in the property definitions for a particular section, i.e., inside the w:sectPr element. Footnotes and endnotes may occur inside any "run," or w:r, element. The w:tc element represents a table cell; thus, tables may contain tables. Finally, the w:txbxContent element represents a text box that is embedded inside a VML (Vector Markup Language) image embedded somewhere inside a document's content.

This content model is actually more open than implied above. The WordprocessingML schema also allows any element from any other namespace to occur here. This enables annotations from the AML (Annotation Markup Language) namespace, as well as tags from a custom XML schema to be embedded inside WordprocessingML. (See Chapter 4.)



Office 2003 XML
ISBN: 0596005385
Year: 2003
Pages: 135
