4.9 Schema Validation | Office 2003 XML

When a schema is attached to a document, Word performs on-the-fly schema validation of the document's embedded custom XML, visibly flagging errors as the user edits. However, since the custom XML tags are intertwined with WordprocessingML elements, Word first needs to strip out the Word-specific markup before it can validate the document. This is actually the same "Save data only" process that optionally occurs in step 3 of our processing model diagram (in Figure 4-7), when a user saves the document. What is not evident in that diagram is that the "Save data only" process is also invoked repeatedly while the user is editing the document (during step 2). The difference is that, rather than stripping out the WordprocessingML markup permanently, Word strips it out temporarily, just for the purpose of validation.

4.9.1 The "Ignore Mixed Content" Document Option

When Word strips out the WordprocessingML markup in order to validate the embedded XML document, by default it leaves all text content (inside w:t elements) intact. Our press release template, however, includes boilerplate text that is not actually part of our data. If this text is included in the remaining XML document, then it will be invalid according to the press release schema. Example 4-7 shows what a press release XML document would look like if all of the text remained intact after stripping out the WordprocessingML markup.

Example 4-7. An invalid press release document, including template boilerplate text

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?mso-application prog?>
<pressRelease xmlns="http://xmlportfolio.com/pressRelease"><company><name>ACME Corp.</name><address><street>555 Market St.</street><city>Seattle</city>, <state>WA</state>  <zip>98101</zip>Phone <phone>222-222-2222</phone>Fax <fax>333-333-3333</fax></address></company>Press Release<contact>Contact: <firstName>John</firstName> <lastName>Doe</lastName>Phone: <phone>444-444-4444</phone></contact>FOR IMMEDIATE RELEASE<date>2004-01-23</date><title>This is the Headline</title><body><para>This is the lead-in, and this is not. The rest of the paragraph has no formatting either.This is the second paragraph. These are just regular Word paragraphs. They do not correspond to custom XML elements.</para></body>-End-</pressRelease>

Several text segments in Example 4-7, such as Phone and FOR IMMEDIATE RELEASE, are pieces of boilerplate text from the press release template. They are not supposed to be part of the data. Thus, merely stripping out the WordprocessingML markup is not sufficient. It is also necessary to strip out the boilerplate text. How is this done? Well, the boilerplate text in this example happens to represent the only mixed content text in the document, and Word happens to provide a document option called "Ignore mixed content." By turning this option on, you can effectively strip out the boilerplate text in this and other similar examples, for the purpose of validation.
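To make the idea of "mixed content text" concrete, here is a minimal XSLT 1.0 sketch of the stripping step. It is an illustration only, not Word's actual implementation and not the stylesheet listed later in this chapter. It assumes the WordprocessingML markup has already been removed, so that its input looks like Example 4-7, and it treats as mixed content any text node whose parent element also has element children.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy every attribute and node through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed definition of mixed content text: a text node whose
       parent also contains at least one element child. The empty
       template drops such text from the result. -->
  <xsl:template match="text()[../*]"/>

</xsl:stylesheet>
```

Applied to Example 4-7, this sketch drops Press Release, FOR IMMEDIATE RELEASE, -End-, and the Phone, Fax, and Contact: labels, because each sits beside element siblings. Text inside leaf elements such as name, date, and para is kept. Whitespace-only text between sibling elements is dropped as well.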

The "Ignore mixed content" document option can be viewed as a parameter to the "Save data only" process. It affects both on-the-fly schema validation and the document saving process when the "Save data only" document option is turned on. (The precise behavior of this process is approximated using an XSLT stylesheet listed later in this chapter, under "The 'Save Data Only' Document Option".)

In our press release template, the "Ignore mixed content" document option is turned on, but the "Save data only" document option is turned off. This means that mixed content text is stripped out for the purpose of on-the-fly schema validation, but it is not stripped out when the document is saved. (Instead, our press release template uses a custom onsave XSLT stylesheet applied directly to the merged XML and WordprocessingML representation.)

The "Ignore mixed content" document option is represented in WordprocessingML using the w:ignoreMixedContent element. Our press release application's "Elegant" stylesheet, pr2word.xsl, turns the option on by generating a w:ignoreMixedContent element in the result document, just like this one:

  <w:docPr>
    <!-- ... -->
    <w:ignoreMixedContent/>
    <!-- ... -->
  </w:docPr>
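As a rough illustration of how a stylesheet might emit this element, the following XSLT 1.0 skeleton writes w:ignoreMixedContent as a literal result element inside w:docPr. It is a sketch under stated assumptions, not the actual contents of pr2word.xsl: the pr prefix, the single template, and the one-paragraph body are hypothetical. A real stylesheet would also emit the custom XML elements and the rest of the document properties.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:pr="http://xmlportfolio.com/pressRelease"
    exclude-result-prefixes="pr">

  <!-- Hypothetical top-level template: builds the WordprocessingML
       document for a press release. -->
  <xsl:template match="/pr:pressRelease">
    <w:wordDocument>
      <w:docPr>
        <!-- Literal result element: turns on "Ignore mixed content". -->
        <w:ignoreMixedContent/>
      </w:docPr>
      <w:body>
        <!-- Placeholder body: one paragraph holding the headline.
             The custom XML tagging is omitted from this sketch. -->
        <w:p>
          <w:r>
            <w:t><xsl:value-of select="pr:title"/></w:t>
          </w:r>
        </w:p>
      </w:body>
    </w:wordDocument>
  </xsl:template>

</xsl:stylesheet>
```

Because the option is just an empty element in the output, turning it off is a matter of removing that one line from the template.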