2.3 WordprocessingML's Style of Markup

If you have any XML or HTML markup background, then WordprocessingML's style of markup may surprise you. WordprocessingML was not designed from a clean slate for the purpose of creating documents in XML markup. Instead, it is an unveiling of the internal structures that have been present in Microsoft Word for years. Though certain features have been added to make WordprocessingML usable outside the context of Word, by and large it represents a serialization of Word's internal data structures: various kinds of objects associated with myriad property values. Indeed, the object-oriented term "properties" permeates the WordprocessingML schema. If you want to make a run of text bold, you set the bold property. If you want to indent a particular paragraph, you set its indentation property. And so on.

2.3.1 No Mixed Content

Mixed content describes the presence of text content and elements inside the same parent element. It is standard fare in the world of markup, especially when using document-oriented markup. For example, in HTML, to make a sentence bold and only partially italicized, you would use code such as the following:

<b>This sentence has <i>mixed</i> formatting.</b>

WordprocessingML, however, never uses mixed content. All of the text in a WordprocessingML document resides in w:t elements, and w:t elements can only contain text (and no elements). The above sentence is represented much differently in WordprocessingML. The hierarchy is flattened into a sequence of runs having different formatting properties:

<w:r>
  <w:rPr>
    <w:b/>
  </w:rPr>
  <w:t>This sentence has </w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:b/>
    <w:i/>
  </w:rPr>
  <w:t>mixed</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:b/>
  </w:rPr>
  <w:t> formatting.</w:t>
</w:r>

As you can see, all of the text occurs by itself (no mixed content), within w:t elements.
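The flattening described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Word itself works: the flatten function is a hypothetical helper, and it treats each HTML tag name as if it were a formatting property. It walks the nested HTML sentence and yields one (properties, text) pair per chunk of text, which is essentially the list of runs that WordprocessingML writes out.

```python
import xml.etree.ElementTree as ET


def flatten(element, inherited=frozenset()):
    """Yield (properties, text) pairs for every chunk of text under element.

    Hypothetical helper: each enclosing tag name stands in for one property.
    """
    props = inherited | {element.tag}
    if element.text:
        yield props, element.text
    for child in element:
        yield from flatten(child, props)
        # Text that follows a child still belongs to the parent element,
        # so it carries the parent's properties, not the child's.
        if child.tail:
            yield props, child.tail


html = "<b>This sentence has <i>mixed</i> formatting.</b>"
runs = list(flatten(ET.fromstring(html)))

for props, text in runs:
    print(sorted(props), repr(text))
```

The sentence comes out as three runs: bold, bold plus italic, and bold again, each carrying its complete set of properties instead of inheriting them from an enclosing element.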

2.3.2 Properties Are Set Using Empty Sub-Elements

The above snippet illustrates another general principle in WordprocessingML's style of markup: properties are assigned using empty sub-elements (e.g., w:b and w:i in the above example). For runs, the w:rPr element contains a set of empty elements, each of which sets a particular property on the run. Similarly, for paragraphs (w:p elements), the w:pPr element contains the paragraph formatting properties. For tables, table rows, and table cells, there are the w:tblPr, w:trPr, and w:tcPr elements, respectively. In each case, the *Pr element must come first, so that the general structure of paragraphs, runs, tables, table rows, and table cells looks like this:

Object    Properties    Content

The properties are defined first, and the content follows. If you have any experience with RTF (Rich Text Format), then this pattern may look familiar. Before the advent of WordprocessingML, RTF was the most open format in which Word was willing to save documents. A look at the same sentence after saving it as RTF is instructive:

{\b\insrsid3691043 This sentence has }
{\b\i\insrsid3691043 mixed}
{\b\insrsid3691043  formatting.}

The parallels should be fairly easy to draw, without understanding every detail. There are three runs (delineated by curly braces). The first run has bold turned on by virtue of the \b command. The second run has both bold and italic turned on by virtue of the \b and \i commands. And the third run goes back to using just bold and no italic. From this perspective, WordprocessingML may look more like an XML format for RTF, an estimation that is not too far off the mark.

To learn more about RTF, consider the RTF Pocket Guide (O'Reilly), by Sean M. Burke.
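As a rough sketch of the properties-first pattern from this section, the following Python builds a run with the standard library's ElementTree. The make_run helper is hypothetical, introduced only for illustration. It always emits w:rPr (when there are any properties) before w:t, and it represents each property as an empty child element. The namespace URI is the one bound to the w prefix in Word 2003 documents.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w" prefix in WordprocessingML (Word 2003).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def make_run(text, *properties):
    """Build a w:r element: properties first, then content (hypothetical helper)."""
    run = ET.Element(f"{{{W}}}r")
    if properties:
        rpr = ET.SubElement(run, f"{{{W}}}rPr")
        for name in properties:
            # Each property is an empty sub-element such as w:b or w:i.
            ET.SubElement(rpr, f"{{{W}}}{name}")
    t = ET.SubElement(run, f"{{{W}}}t")
    t.text = text
    return run


runs = [
    make_run("This sentence has ", "b"),
    make_run("mixed", "b", "i"),
    make_run(" formatting.", "b"),
]

for run in runs:
    print(ET.tostring(run, encoding="unicode"))
```

Because the helper appends w:rPr before w:t, the element order required by the schema falls out of the construction order; nothing has to be sorted afterward.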

2.3.3 No Hierarchical Document Structures

Nested markup describes the use of element nesting to arbitrary depths. In addition to formatting text, nested markup is useful for structuring documents. For example, a DocBook document may have sections and sub-sections nested to an arbitrary depth, like this:

<article>
  <section>
    <title>Section 1</title>
    <para>This is the first section.</para>
    <section>
      <title>Section 1A</title>
      <para>This is a sub-section.</para>
    </section>
  </section>
</article>

The above document is represented much differently in WordprocessingML. The hierarchy is flattened into a sequence of four paragraphs having different properties. Below is the w:body element, excerpted from such a document:

<w:body>
  <w:p>
    <w:pPr>
      <w:pStyle w:val="Heading1"/>
    </w:pPr>
    <w:r>
      <w:t>Section 1</w:t>
    </w:r>
  </w:p>
  <w:p>
    <w:r>
      <w:t>This is the first section.</w:t>
    </w:r>
  </w:p>
  <w:p>
    <w:pPr>
      <w:pStyle w:val="Heading2"/>
    </w:pPr>
    <w:r>
      <w:t>Section 1A</w:t>
    </w:r>
  </w:p>
  <w:p>
    <w:r>
      <w:t>This is a sub-section.</w:t>
    </w:r>
  </w:p>
</w:body>

In Word, the paragraph is the basic block-oriented element, and paragraphs may not contain other paragraphs. Word does, however, provide a workaround for hierarchical documents, through use of the wx:sub-section element. In fact, if you were to open the above document and then save it from within Word, the result would include wx:sub-section elements that reflect the hierarchy intended by the heading paragraphs. We'll look at how this works in detail later, in Section 2.6.2.
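To make the idea of a hierarchy "intended by the heading paragraphs" concrete, here is a minimal Python sketch that rebuilds a nested outline from the flat paragraph list above. It is not Word's algorithm and does not produce wx:sub-section elements. The heading_level and build_outline helpers are hypothetical, and the sketch assumes that heading styles are named Heading1, Heading2, and so on, as in the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

BODY = f"""<w:body xmlns:w="{W}">
  <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Section 1</w:t></w:r></w:p>
  <w:p><w:r><w:t>This is the first section.</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Section 1A</w:t></w:r></w:p>
  <w:p><w:r><w:t>This is a sub-section.</w:t></w:r></w:p>
</w:body>"""


def heading_level(p):
    """Return N for a paragraph styled HeadingN, or None for ordinary text."""
    style = p.find("w:pPr/w:pStyle", NS)
    name = style.get(f"{{{W}}}val", "") if style is not None else ""
    suffix = name[len("Heading"):]
    if name.startswith("Heading") and suffix.isdigit():
        return int(suffix)
    return None


def build_outline(body):
    """Nest the flat paragraphs into sections according to heading levels."""
    root = {"title": None, "paras": [], "sections": []}
    stack = [(0, root)]  # (heading level, section) pairs; root is level 0
    for p in body.findall("w:p", NS):
        text = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))
        level = heading_level(p)
        if level is None:
            # Ordinary paragraph: it belongs to the innermost open section.
            stack[-1][1]["paras"].append(text)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = {"title": text, "paras": [], "sections": []}
        stack[-1][1]["sections"].append(section)
        stack.append((level, section))
    return root


outline = build_outline(ET.fromstring(BODY))
print(outline)
```

The stack holds the chain of currently open sections, so a Heading2 paragraph nests under the most recent Heading1, and a later Heading1 would close both.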

2.3.4 All Attributes Are Namespace-Qualified

One more peculiarity worth noting about WordprocessingML markup is its use of namespace-qualified attributes. In most XML vocabularies, attributes are not in a namespace. They are generally thought to "belong" to the element to which they are attached. As long as the element is in a namespace, then no naming ambiguities should arise. Namespace qualification, however, can be useful for "global attributes" that can be attached to different elements. Such attributes do not belong to any particular element. The xml:space attribute is a good example of a global attribute. XSLT also has some global attributes, such as the xsl:exclude-result-prefixes attribute, which can occur on any literal result element (in any namespace). These are considered good use cases for qualifying attributes with a namespace.

WordprocessingML, however, does not follow this convention. While there are some "global attributes" in WordprocessingML (such as the w:type attribute, which appears on the aml:annotation element, which we'll see), WordprocessingML does not restrict its use of namespace qualification to those cases. Instead, it qualifies all attributes across the board. For this reason, the key thing to remember when working with attributes in WordprocessingML is that they must always have a namespace prefix (because there's no such thing as a default namespace for attributes in XML).
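The practical consequence shows up as soon as you read an attribute programmatically. The following Python sketch uses the standard library's ElementTree, which spells a qualified name as {namespace-URI}local-name. The snippet and variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"

snippet = f'<w:pPr xmlns:w="{W}"><w:pStyle w:val="Heading1"/></w:pPr>'
style = ET.fromstring(snippet)[0]

# Looking the attribute up by its local name alone finds nothing...
unqualified = style.get("val")
# ...because its real name includes the WordprocessingML namespace.
qualified = style.get(f"{{{W}}}val")
print(unqualified, qualified)

# A default namespace applies to elements only, never to attributes:
# here the element is in the namespace, but its "val" attribute is not.
defaulted = ET.fromstring(f'<pStyle xmlns="{W}" val="Heading1"/>')
print(defaulted.tag, defaulted.get(f"{{{W}}}val"), defaulted.get("val"))
```

The second half illustrates why the prefix is mandatory: declaring the WordprocessingML namespace as the default puts pStyle in that namespace but leaves val unqualified, so the result is not the same attribute as w:val.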