Section 5.1. Schematron validation



5.1 Schematron validation

Before we start working on the main transformation stylesheet, let's discuss the implementation of a source validation subsystem. A basic Schematron schema was shown in Example 3.3 (page 149); here we'll see how to coordinate it with the stylesheet (5.1.1, 5.1.2) and how to extend it with many advanced checks for the super-document layer of your source definition (5.1.3).

Why coordinate? There are two reasons. First, you usually run your stylesheet after the source has been changed, so it is only natural to validate it just before transformation. Second, our schema language of choice, Schematron, is directly related to XSLT; we will use the reference implementation of Schematron written in XSLT. Therefore, for efficient validation, your schema must share some code with the stylesheet.

5.1.1 Shared XSLT library

What stylesheet code can we use during validation? An ideal schema must not only validate the document structure but also perform many types of content checks, such as checking for broken links. Recall the discussion of address abbreviations ( 3.5.3 ); a link-checking schema must be able to unabbreviate addresses in all types of links in your XML sources exactly as they are unabbreviated by the stylesheet during transformation. This means the unabbreviation functions must be shared between the schema and the stylesheet.

Before we can do any unabbreviation, we need to access the master document and retrieve the current processing parameters as well as any metadata it has for the current source document. This information must be checked against reality and, if correct, stored in global variables that will be used later in many places. All of this is quite a chunk of code, and another candidate for sharing between the schema and the stylesheet.

Our first task, therefore, is to create a shared XSLT library that will be imported into both the schema and the stylesheet. A complete listing of such a library is given in Example 5.19 (page 256). Below, I highlight and discuss some of its more important parts.

5.1.1.1 Stylesheet parameters

First, the library declares the stylesheet parameters, $env and $images:

 <xsl:param name="env" select="'final'"/>
 <xsl:param name="images" select="'no'"/>

These are the values that the user may provide on each stylesheet execution (usually on the command line; see 5.1.2 and 5.7 for command-line examples). Keep the number of stylesheet parameters to a minimum and never use them to supply any information, only to select information stored somewhere else (usually in the master document).

Note the two pairs of quotes around the parameter values. The inner single quotes enclose string literals in XPath; the outer double quotes are obligatory for attribute values in XML.

You won't be running the library directly; instead, you'll import it at the very beginning of both the schema and the stylesheet. As a result, both will accept the same parameters, and both will use the same default values (in our example, 'final' for $env and 'no' for $images ) if no parameters are supplied by the user.

5.1.1.2 Master document access

Before going any further, both the schema and the stylesheet need to access the master document. The $master variable solves this problem: it loads, parses, and stores the master document:

 <xsl:variable name="master" select="document('_master.xml')"/>

Note that the filename (or, possibly, pathname or URI) of the master is hard-wired into the variable, simply because we have nowhere else to take it from at this point. This is the only bit of configuration that has to be stored in XSLT and therefore may not be easy to change by site maintainers (who might be unable to access the shared library and/or edit its XSLT code). Everything else, as we'll see below, grows out of this seed and is easy to modify by editing the master document.

Even the master document location can be made user-configurable if you add a stylesheet parameter supplying the pathname or URI of the master.
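A minimal sketch of that option follows. The parameter name master-uri is hypothetical (it is not part of the library listing); its default reproduces the hard-wired value, so the library behaves as before when the parameter is not supplied.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: location of the master document.
       The default is the filename hard-wired in the original library. -->
  <xsl:param name="master-uri" select="'_master.xml'"/>

  <!-- Load and parse the master from wherever the parameter points -->
  <xsl:variable name="master" select="document($master-uri)"/>

</xsl:stylesheet>
```

Keep in mind that document() resolves a relative URI given as a string against the location of the stylesheet, not against the current working directory, so an absolute pathname or URI is the safer value to pass.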

Who am I speaking to? Most of the time, the stylesheet will be run with a page document as input. However, for batch processing ( 5.6 ) we'll need to use the master document as input. Now, for convenience, let's find out if the current input document is the master (by checking the name of its root element type) and store this in a boolean variable:

 <xsl:variable name="this-master" select="boolean(/*[name()='site'])"/>

5.1.1.3 Pathnames

Before we can do anything useful, we must figure out various pathnames specific to the current environment. To start, finding out the root directory of the source tree, the output directory, and the target directory is simple: they are stored in the children of the corresponding environment element in the master:

 <xsl:variable name="src-path"
   select="$master//environment[@id=$env]/src-path"/>
 <xsl:variable name="out-path"
   select="$master//environment[@id=$env]/out-path"/>
 <xsl:variable name="target-path"
   select="$master//environment[@id=$env]/target-path"/>

The images directory is assumed to be under the $out-path, and its name is also stored in the environment:

 <xsl:variable name="out-img">
   <xsl:value-of select="$out-path"/>
   <xsl:value-of select="$master//environment[@id=$env]/img-path"/>
   <xsl:text>/</xsl:text>
 </xsl:variable>

The stylesheet assumes that the directory exists and all necessary image files are already there (although it may add some generated images itself, 5.5.2 ).

For the img elements in the HTML pages we are creating, the image directory must be prefixed by $target-path , not $out-path . We therefore declare the $target-img variable that is identical to $out-img except that it uses $target-path instead of $out-path :

 <xsl:variable name="target-img">
   <xsl:value-of select="$target-path"/>
   <xsl:value-of select="$master//environment[@id=$env]/img-path"/>
   <xsl:text>/</xsl:text>
 </xsl:variable>

We'll also frequently use filename extensions for source XML files and output HTML files:

 <xsl:variable name="src-ext">.xml</xsl:variable>
 <xsl:variable name="out-ext">.html</xsl:variable>

Sometimes, you may want to use different output file extensions for different pages, site sections, or production environments. In this case, supply these extensions in the corresponding menu or environment branches in the master document instead of hard-wiring them into the stylesheet as we did here.
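As a rough sketch of the per-environment case, the variable below looks for an out-ext child of the current environment element and falls back to '.html' when there is none. The out-ext element is an assumption made for this illustration; the master document in Example 3.2 does not define it.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="env" select="'final'"/>
  <xsl:variable name="master" select="document('_master.xml')"/>

  <!-- Hypothetical <out-ext> child of <environment>.
       The sequence holds the element (if any) followed by the default;
       [1] picks whichever comes first. -->
  <xsl:variable name="out-ext"
      select="string(($master//environment[@id=$env]/out-ext, '.html')[1])"/>

</xsl:stylesheet>
```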

Where am I? For other parameters, the approach implemented below tries to guess as much as possible based on the source document's pathname (available via the saxon:systemId() function under Saxon), thus relieving the document author from the burden to specify that information in XML. This makes the system easier for daily maintenance and more error-proof.

Saxon's extension function saxon:systemId() returns the URL of the current source document, using the file: protocol prefix for local files (e.g., file:/dir/page.xml). Many other processors provide a similar extension function (for example, Xalan has an analogous function also called systemId()). If you don't have such a function available in your processor, you may need to rewrite this part of the shared library so that it relies on the page document's content (e.g., a src attribute of the page's root element) rather than its location. Another approach might use a stylesheet parameter set by some external processing framework that runs the XSLT transformation (compare 7.2.7).
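One possible shape for the parameter-based approach is sketched below. The parameter name source-uri is hypothetical. Its default uses the standard XPath 2.0 function base-uri(), on the assumption that your processor implements it and reports the source document's location through it; an external framework can override the value explicitly.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter holding the URI of the source document.
       Default: the base URI of the input document, if the processor
       knows it; otherwise an empty string. -->
  <xsl:param name="source-uri" select="string(base-uri(/))"/>

</xsl:stylesheet>
```

The rest of the library would then refer to $source-uri wherever it currently calls saxon:systemId().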

Language. Thus, in a multilingual site, instead of manually specifying the language of each page document (e.g., via an attribute of its root element), we assume that the site has several parallel directory trees, each under the subdirectory named after the language label. The $lang variable cuts out the part of the saxon:systemId() returned value that is between the $src-path and the first / character:

 <xsl:variable name="lang">
   <xsl:choose>
     <xsl:when test="substring-after(saxon:systemId(), $src-path)=''">
       <xsl:message terminate="yes">
         Error: Source file path doesn't match $src-path
         $env=<xsl:value-of select="$env"/>
         systemId=<xsl:value-of select="saxon:systemId()"/>
       </xsl:message>
     </xsl:when>
     <xsl:otherwise>
       <xsl:value-of select="
         substring-before(
           substring-after(saxon:systemId(), $src-path),
           '/')"/>
     </xsl:otherwise>
   </xsl:choose>
 </xsl:variable>

If the string returned by saxon:systemId() does not start with $src-path, an error is reported. For example, a source document whose pathname starts with

 {$src-path}de/

is assumed to be in German, and for it, the $lang variable will get the value 'de' . Your directory layout may of course be different, but the code will likely be similar.

Abbreviated location. Next, we initialize the $current variable that is the pathname of the current source document after removing the $src-path, $lang, and the filename extension:

 <xsl:variable name="current">
   <xsl:choose>
     <xsl:when test="
         not($master//languages/lang = $lang) and not($this-master)">
       <xsl:message terminate="yes">
         Error: Not in a valid language directory
         $lang=<xsl:value-of select="$lang"/>
       </xsl:message>
     </xsl:when>
     <xsl:otherwise>
       <xsl:value-of select="
         substring-before(
           substring-after(
             substring-after(saxon:systemId(), $src-path),
             concat($lang, '/')),
           $src-ext)"/>
     </xsl:otherwise>
   </xsl:choose>
 </xsl:variable>

The idea is to be able to directly compare the value of $current to the //page/@src values in the master document. Thus, for

 /var/www/xml/de/contact/contact.xml

the value of $current will be contact/contact (assuming $src-path for the current environment is /var/www/xml/).
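To trace the string operations step by step, here is a small standalone stylesheet that repeats them on literal values. The $system-id and $relative variables are stand-ins introduced for this sketch only; the error checks are left out. It can be run against any XML input.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Literal stand-ins for values the library normally computes -->
  <xsl:variable name="src-path" select="'/var/www/xml/'"/>
  <xsl:variable name="src-ext" select="'.xml'"/>
  <xsl:variable name="system-id"
      select="'file:/var/www/xml/de/contact/contact.xml'"/>

  <!-- Everything after $src-path: de/contact/contact.xml -->
  <xsl:variable name="relative"
      select="substring-after($system-id, $src-path)"/>

  <!-- Up to the first slash: de -->
  <xsl:variable name="lang"
      select="substring-before($relative, '/')"/>

  <!-- Language prefix and extension removed: contact/contact -->
  <xsl:variable name="current"
      select="substring-before(
                substring-after($relative, concat($lang, '/')),
                $src-ext)"/>

  <xsl:template match="/">
    <xsl:value-of select="concat('lang=', $lang, '&#10;',
                                 'current=', $current, '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```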

No way. The variables $lang and $current complain if the source document path does not contain a valid language label or does not start with $src-path . These checks are done in the shared library and not in the schema because the stylesheet absolutely needs this information to be correct, or the transformation will fail. On the other hand, since our Schematron schema imports the shared library as well, these errors will be reported if you attempt to validate, not only transform, a wrongly placed document.

Finally, we create a boolean variable that checks if the current page is the site's front page (this might be useful, e.g., for creating a different layout on the home page of the site):

 <xsl:variable name="frontpage" select="$current = 'index'"/>

Follow the right paths. All this fiddling with pathnames is an ugly but inevitable part of creating a web site. If you want to be able to run a transformation in more than one environment (and this cannot be avoided, because you don't want to do development or editing on a live site), the stylesheet setup described in this section can hardly be simplified.

I tried to make these examples as generic as possible, so you shouldn't have too many problems adapting them to your situation. Still, your production stylesheet may need to be even more complex. For example, you may have separate image directories for each language, or separate directories for downloadable content.

5.1.1.4 Unabbreviation functions

To finish our shared XSLT library, we implement a few simple unabbreviation functions. If you are new to user-defined functions in XSLT (it is a 2.0 feature), this is a good primer.

External links. Let's look at the external links unabbreviation first. The following function checks if the parameter string starts with a URL protocol part; if it does, the parameter is returned unchanged, otherwise it is concatenated with 'http://' . You can, of course, add tests for any other URL protocols in addition to http:// and ftp:// .

 <xsl:function name="eg:ext-link">
   <xsl:param name="abbr"/>
   <xsl:value-of select="
     if (starts-with($abbr, 'http://') or
         starts-with($abbr, 'ftp://'))
     then $abbr
     else concat('http://', $abbr)"/>
 </xsl:function>

Every function must have a qualified name; the http://www.example.org/ namespace, with the prefix eg , is used in examples in this chapter for our XSLT functions.
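If the list of protocols grows, one alternative is to test for any protocol prefix with a regular expression instead of enumerating them. The following is only a sketch of that variation, not the library's code: it uses the XPath 2.0 matches() function and treats anything of the form scheme:// at the start of the string as a complete URL. Addresses such as mailto: links, which have no //, would still need separate handling.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:eg="http://www.example.org/">

  <!-- Variation of eg:ext-link: accept any "scheme://" prefix -->
  <xsl:function name="eg:ext-link">
    <xsl:param name="abbr"/>
    <xsl:sequence select="
      if (matches($abbr, '^[a-zA-Z][a-zA-Z0-9+.\-]*://'))
      then string($abbr)
      else concat('http://', $abbr)"/>
  </xsl:function>

</xsl:stylesheet>
```

Requiring :// rather than a bare colon keeps an abbreviated address with a port number, such as www.example.org:8080/path, from being mistaken for a complete URL.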

Internal links. The internal links unabbreviation comes in two flavors, one for the source documents and the other for the transformed HTML pages.

The page-src() function is often used for checking if the linked document is present (we need to look for the source because the corresponding HTML file may not be created yet) and for accessing its XML source (e.g., to extract the title from the linked page to use in the menu, or to pull out an orthogonal content block referenced on another page).
The page-link() function is used for the actual href values in the links in HTML pages.

The first of these functions constructs the unabbreviated path from the base directory ($src-path), language subdirectory ($lang; you can, of course, drop that if your site is unilingual), the relative pathname of the page (the src attribute of the corresponding page or block element), and the source filename extension ($src-ext):

 <xsl:function name="eg:page-src">
   <xsl:param name="abbr"/>
   <xsl:param name="lang"/>
   <xsl:value-of select="
     concat(
       $src-path,
       $lang, '/',
       $master//(page|block)[(@id, tokenize(@alias, '\s+')) = $abbr]
         /@src,
       $src-ext)"/>
 </xsl:function>

The analogous function for HTML links differs only in the variables used to construct the complete pathname: $target-path instead of $src-path and $out-ext instead of $src-ext:

 <xsl:function name="eg:page-link">
   <xsl:param name="abbr"/>
   <xsl:param name="lang"/>
   <xsl:value-of select="
     concat(
       $target-path,
       $lang, '/',
       $master//(page|block)[(@id, tokenize(@alias, '\s+')) = $abbr]
         /@src,
       $out-ext)"/>
 </xsl:function>

These functions take two parameters:

$abbr is the page id (i.e., the abbreviated address);
$lang is the language of the target page.

A couple of points are worthy of note in these functions. First, the XPath expression that performs the actual lookup in the master document searches in both pages and blocks, hence (page|block) as a single location step; in XPath 1.0, only entire expressions, not steps within an expression, could be arguments to the | operator.

Second, we can refer to a page not only by its id but also by its alias ( 3.5.3.4 ), and the alias attribute in the master can store several values separated by spaces (refer to Example 3.2, page 143). Thus, we tokenize() the page's alias and combine it with the id value into a single sequence using the comma operator. This sequence is then compared to $abbr ; if any one of the sequence members passes the equality test, the = operator returns true .
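The behavior of = on sequences can be tried in isolation. The standalone stylesheet below uses made-up values ('contact' as an id, 'feedback mail' as an alias list) in place of the attributes of a real page element, and can be run against any XML input.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Made-up stand-ins for @id and @alias of one page element -->
    <xsl:variable name="id" select="'contact'"/>
    <xsl:variable name="alias" select="'feedback mail'"/>

    <!-- The comma operator builds one sequence of three strings:
         contact, feedback, mail -->
    <xsl:variable name="names"
        select="($id, tokenize($alias, '\s+'))"/>

    <!-- true: one member of the sequence equals 'mail' -->
    <xsl:value-of select="$names = 'mail'"/>
    <xsl:text>&#10;</xsl:text>

    <!-- false: no member equals 'about' -->
    <xsl:value-of select="$names = 'about'"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```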

Isn't XPath 2.0 wonderful?

5.1.2 Schematron wrapper for Saxon

If you're not interested in validation at this time, feel free to skip this and the next section; in 5.2 we'll finally start writing our main transformation stylesheet.

The reference implementation of Schematron, written in XSLT by Rick Jelliffe, consists of the "skeleton" library, ^[3] implementing the bulk of the Schematron functionality, and a set of wrappers ( metastylesheets ) that provide high-level interfaces to that functionality.

^[3] Latest version: www.ascc.net/xml/schematron/1.5/skeleton1-5.xsl

Custom wrapper. To run Schematron validation on Saxon, we will use the simplest wrapper, schematron-basic.xsl , ^[4] with these modifications:

^[4] www.ascc.net/xml/schematron/1.5/basic1-5/schematron-basic.html

The Saxon-specific line-number() function is added to the error handler for reporting the source line number in each report. This makes the schema's output much easier to use for fixing errors.
The shared XSLT library that we've just built ( 5.1.1 ) is imported.
Several namespaces are declared in the output (using xsl:namespace ) to access those of our extension functions that we'll need for super-document validation checks ( 5.1.3 ).

The complete listing of the schematron-saxon.xsl wrapper is shown in Example 5.1.

Example 5.1. Schematron wrapper for Saxon importing the `_lib.xsl` library (based on `schematron-basic.xsl` ).

 <xsl:stylesheet
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="2.0"
     xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias"
     xmlns:saxon="http://saxon.sf.net/">

 <xsl:import href="skeleton1-5.xsl"/>

 <xsl:template name="process-prolog">
   <!-- Namespace node for our XSLT extension functions: -->
   <xsl:namespace name="eg">http://www.example.org/</xsl:namespace>
   <!-- Another namespace node for an extension class: -->
   <xsl:namespace name="f">com.projectname.xslt.files</xsl:namespace>
   <!-- Importing the shared library: -->
   <axsl:import href="_lib.xsl"/>
   <!-- We don't really need any output parameters: -->
   <axsl:output method="text"/>
 </xsl:template>

 <xsl:template name="process-root">
   <xsl:param name="title"/>
   <xsl:param name="contents"/>
   <xsl:value-of select="$title"/>
   <xsl:text>&#10;</xsl:text>
   <xsl:copy-of select="$contents"/>
 </xsl:template>

 <xsl:template name="process-message">
   <xsl:param name="pattern"/>
   <xsl:param name="role"/>
   <!-- Outputting the source line number: -->
   <xsl:text>Line </xsl:text>
     <axsl:value-of select="saxon:line-number()"/>
   <xsl:if test="$role">
     <xsl:text> (</xsl:text>
     <xsl:value-of select="$role"/>
     <xsl:text>)</xsl:text>
   </xsl:if>:
   <xsl:apply-templates mode="text"/>
   <xsl:text>&#10;</xsl:text>
 </xsl:template>

 </xsl:stylesheet>

Deployment. The XSLT implementation of Schematron is actually a compiler that translates a Schematron schema into an XSLT stylesheet. This compiled schema is then applied to the document being checked. So, the two commands that you need to run in order to check the document src.xml against the schema schema.sch using our Saxon wrapper are these (you can create shell scripts or batch files to run them with different source files, or you can wait until 5.6 where we'll see how to run validation of all pages from within the transformation stylesheet):

 saxon -o schema-compiled.xsl schema.sch schematron-saxon.xsl

This command runs schema compilation by applying the wrapper stylesheet to the schema and producing a schema-compiled.xsl stylesheet. It is only necessary to rerun compilation if you have modified the schema. Replace saxon with (or make it an alias to) the command that runs the Saxon processor. ^[5]

^[5] On my system, it is java net.sf.saxon.Transform .

 saxon -l src.xml schema-compiled.xsl env=staging

This second command does actual validation of a source document. Here you can supply parameters (such as env) exactly as you would do to a transformation stylesheet. This is the command that prints any schema diagnostics and reports errors. The -l switch forces Saxon to keep track of line numbers; without it, the saxon:line-number() function does not work.

Namespace aliasing. As with any stylesheet that produces another stylesheet, our wrapper needs to alias the XSLT namespace. The axsl prefix is declared as corresponding to a non-XSLT URI and thus prevents axsl:* elements from being executed as XSLT instructions when the wrapper is run. Later, Schematron's "skeleton" file does the substitution:

 <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>

which results in all axsl:* elements being output as xsl:* into the compiled schema.
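The mechanism is easier to see in a stylesheet that does nothing else. The following standalone sketch (not part of the Schematron distribution) generates a tiny stylesheet: the axsl:* elements are treated as literal result elements while it runs and come out in the XSLT namespace.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias">

  <!-- In the output, map the alias namespace to the real XSLT one -->
  <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- Not executed here: copied to the result as xsl:* elements -->
    <axsl:stylesheet version="2.0">
      <axsl:output method="text"/>
      <axsl:template match="/">
        <axsl:text>Hello from the generated stylesheet</axsl:text>
      </axsl:template>
    </axsl:stylesheet>
  </xsl:template>

</xsl:stylesheet>
```

Running it on any XML input produces a second stylesheet that can itself be run, which is the same two-step arrangement as compiling a schema and then validating with it.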

5.1.3 Advanced Schematron

Our basic Schematron schema (Example 3.3, page 149) already has many advantages over grammar-based (2.2.1) schemas. It sports custom, arbitrarily detailed diagnostics and complex algorithmic checks, for instance, verifying the correspondence between the number of defined languages and the number of translation children in an element.

However, that example is still a document layer schema (2.1.1) because it can only check one document at a time, be that a page document or the master document. As promised, we will now extend that schema so that it also covers the super-document layer by validating links and dependencies between page documents and the master document as well as among different page documents. Now that we've prepared a shared library (5.1.1) and the wrapper (5.1.2) for our Schematron setup, we have all the tools we need for this.

You can add the new Schematron rules described in this section to the basic schema in Example 3.3. Alternatively, you can combine them into a separate schema and perform two-stage validation of your source: first checking the document layer with the basic schema, and then validating the super-document layer using the techniques of this section.

5.1.3.1 Document availability

Our first validation task is to make sure that all web site pages listed in the master are actually present as source XML documents and are readable by the XML parser (i.e., are well-formed). Moreover, in our multilingual site, we want to check for existence of all page documents in all defined languages. It is logical to only enable this check when we are validating the master document, because we don't want to scour the entire document tree every time we validate a single page.

Obviously, we need to get source paths for all page documents. However, we cannot simply look them up in the master. This is because our master document (Example 3.2) does not store a complete pathname for each language variant of each page; the subtree of pages and the subtree of languages are separate. If we were writing a stylesheet, we could run an xsl:for-each loop to find all combinations of children of these two subtrees. But the Schematron syntax is purely declarative and does not permit any loops. Or does it?

Recall (4.2) that XPath 2.0 makes it possible to emulate many of XSLT's processing constructs, including loops, right in an XPath expression. This means that if we run our schema on an XSLT 2.0-compliant processor, we can pack the entire loop into the test attribute of a report or assert. Here's how it might look:

 <rule context="menu//page | blocks/block">
   <report test="
       (for $l in $master//languages/lang
        return boolean(document(eg:page-src(@id, $l))))
       = false()">
     A source document not found for "<value-of select="@id"/>".
   </report>
 </rule>

Here, the for loop in the XPath expression attempts to feed the built-in document() function with source paths of all language variants for the current page or block (remember that a page may be registered either as one of the orthogonal blocks or as a page in the menu hierarchy). We use our custom page-src() function from the shared library (5.1.1.4) to go from a page's id attribute and a language to the complete pathname of the corresponding source document.

By the way, the page-src() function itself accesses the master document and searches it for a page or block with the given @id. This might seem awkward: when the rule is fired, we are already in that element and only have to make one step to reach its @src. However, our approach has the advantage of being generic and therefore robust, while any attempts to "optimize" it would likely breed hard-to-catch bugs.

If one of these source documents fails to load, document() returns an empty nodeset that is converted to a boolean value of false. When we compare the sequence of boolean values returned by the for loop to a single false() value, the result is true if at least one of the values in the sequence is false, that is, if at least one of the documents failed to load.
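The same check can also be phrased as an assertion with a quantified expression. The sketch below additionally swaps document() for the XPath 2.0 function doc-available(), which returns a boolean instead of a document; this is an alternative technique, and it assumes a processor that implements that function, which may not be true of the Saxon version used in this chapter. The pattern name is made up for the example.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Document availability (sketch)">
    <rule context="menu//page | blocks/block">
      <!-- Holds only if a source document loads for every language;
           @id still refers to the rule's context element -->
      <assert test="
          every $l in $master//languages/lang
          satisfies doc-available(eg:page-src(@id, $l))">
        A source document not found for "<value-of select="@id"/>".
      </assert>
    </rule>
  </pattern>
</schema>
```

Like the rules in this section, it relies on $master and eg:page-src() being supplied by the shared library through the wrapper.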

With Saxon, an attempt to load a non-existent file with document() produces a Java error message (in addition to the Schematron diagnostics). This is implementation-dependent; other processors may handle this situation differently. If the file exists but is not well-formed XML, you'll get a parser error because document() attempts to parse the file it loads. If you only want to quickly check for the existence of an arbitrary file (e.g., an image), use a custom extension function, such as files:exists() , written in Java ( 5.1.3.4 ). It may have an additional advantage of more readable and customizable diagnostic messages for missing files.

5.1.3.2 Internal links

Now that we are sure that all documents mentioned in the master document are present and loadable, we can check the validity of internal links simply by looking them up in the master. Here's the rule:

 <rule context="int | link[@linktype='internal']">
   <assert test="
       @link = (for $i in $master//(page|block)/(@id|@alias)
                return tokenize($i, '\s+'))">
     Broken internal link: no 'page' with
     @id="<value-of select="@link"/>" in the
     master document. Valid identifiers are:
     <value-of select="
       string-join((for $i in $master//(page|block)/(@id|@alias)
                    return tokenize($i, '\s+')), ', ')"/>.
   </assert>
 </rule>

Remember that links may use either an id or any of the defined aliases of a page. This means that here, just as in the unabbreviation functions (5.1.1.4), we have to go to some lengths in order to get a sequence of all valid page identifiers.

We start by creating a nodeset of all @id and @alias values of all pages and blocks, and then apply the tokenize() function to each member of the nodeset. The second argument of tokenize() is a regular expression meaning "one or more whitespace characters." Therefore, for singleton id values, tokenize() does nothing; for space-separated alias lists, it breaks them into sequences. Everything returned by the function is then joined into one common sequence by for and compared to the current element's link attribute. The comparison yields false only if none of the sequence values matches.

The diagnostic message for this rule demonstrates how you can use Schematron's value-of element with an arbitrary XPath expression. This rule not only displays the @link value that is incorrect, but also lists all correct values from the master document (of course, this only makes sense if there aren't too many of them). For this, the same sequence-constructing expression as in the test attribute of assert is fed to the string-join() function that strings all sequence members together separated by ", ".

5.1.3.3 External links

Now, the rule to check external links should seem easy:

 <rule context="ext | link[@linktype='external']">
   <assert test="
       boolean(unparsed-text(eg:ext-link(@link), 'iso-8859-1'))">
     Broken external link: <value-of select="@link"/>.
   </assert>
 </rule>

First of all, we unabbreviate the link by our eg:ext-link() function ( 5.1.1.4 ). Since we cannot expect all web pages that we link to to be valid XML, we use the unparsed-text() function to access the link URI because this function only retrieves the document without attempting to parse it. (Both document() and unparsed-text() can access URIs, not only local pathnames.) The boolean conversion returns false if the document is inaccessible (or empty).

A quest for a better probe. Unfortunately, the use of unparsed-text() has its share of problems:

The XSLT specification ^[6] mandates that the URI for unparsed-text() must not contain a fragment identifier; that is, you cannot have a # in the URI you pass to the function. This means you'll have to add another wrapper function that would strip such fragment identifiers, if present, before handing the URIs over to unparsed-text().

^[6] Unlike document() , unparsed-text() is an XSLT function, not an XPath function. This distinction may be confusing, but it rarely matters in practice.
The error that occurs when unparsed-text() cannot access its URI is defined as recoverable. However, currently (in version 7.5.1) Saxon does not recover and terminates processing on such an error. This means you can find at most one broken link per validation run, and if a valid URI is temporarily unavailable, your validation is halted. Hopefully, future versions of Saxon as well as other 2.0 processors will handle this error as recoverable.
As the name implies, unparsed-text() retrieves text. Among other things, this means that all characters in the retrieved document must be valid for that document's encoding. As a result, this function will likely choke on most binary files that you make it swallow, so you can't safely use unparsed-text() to access image files or other non-text resources.

All these limitations suggest that it might be a better idea to write your own extension function to test external URIs for availability. Such a function could ignore fragment identifiers, perform several retries for addresses that fail to respond, and handle both textual and binary resources. Additionally, to save bandwidth, it could only query the remote host for the availability of a resource without actually transferring it.
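Short of writing an extension function, the first limitation in the list above can be handled in XSLT itself. Here is a minimal sketch of such a wrapper; the name eg:strip-fragment is hypothetical and is not part of the shared library listing.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:eg="http://www.example.org/">

  <!-- Hypothetical helper: drop "#fragment" from a URI, if present -->
  <xsl:function name="eg:strip-fragment">
    <xsl:param name="uri"/>
    <xsl:sequence select="
      if (contains($uri, '#'))
      then substring-before($uri, '#')
      else string($uri)"/>
  </xsl:function>

  <!-- Intended use in the external links rule:
       unparsed-text(eg:strip-fragment(eg:ext-link(@link)), 'iso-8859-1') -->

</xsl:stylesheet>
```

This does nothing about the other two limitations, which is why an extension function remains the more complete answer.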

5.1.3.4 Local images

Images used on web pages are usually stored locally, and we can use our files:exists() extension function written in Java (Example 5.6, page 221) to make sure they are available:

 <rule context="section[@image]">
   <assert test="
       files:exists(eg:os-path(concat($out-img, @image, '.png')))">
     Image file missing:
     <value-of select="concat($out-img, @image, '.png')"/>
   </assert>
 </rule>

Here, only section elements with an image attribute are checked, for this is the only construct referring to an image in our sample site source (Example 3.1, page 141). You can, of course, write a similar rule for other image-referencing elements, such as a standalone img element.

I did not define, instead of the concat() expression, an eg:image-link() function that would take an image identifier and return the full pathname of the image file. Such a function can be written in XSLT if necessary.
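For reference, such a function might look like the sketch below. It only wraps the concat() expression from the rule, so the '.png' extension stays hard-wired. The $out-img declaration is a placeholder with an invented value that makes the example self-contained; in the real setup the variable comes from the shared library (5.1.1.3) and the placeholder must be removed.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:eg="http://www.example.org/">

  <!-- Placeholder only (invented path): normally defined in the library -->
  <xsl:variable name="out-img" select="'/var/www/html/img/'"/>

  <!-- Sketch of eg:image-link(): image identifier in, full pathname out -->
  <xsl:function name="eg:image-link">
    <xsl:param name="image"/>
    <xsl:sequence select="concat($out-img, $image, '.png')"/>
  </xsl:function>

</xsl:stylesheet>
```

With it, the test in the rule would read files:exists(eg:os-path(eg:image-link(@image))).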

5.1.3.5 Language links

A language link ( 3.5.3.4 ) uses a lang element, which should not be confused with a lang within languages in the master document (you can rename either of them, of course, if you perceive this as a problem):

 <rule context="lang[not(ancestor::languages)] | link[@linktype='language']">
   <assert test="@link = $master//languages/lang">
     Broken language link:
     no "<value-of select="@link"/>" language.
   </assert>
 </rule>

5.1.3.6 External blocks

Orthogonal block definitions. For orthogonal blocks, we must first verify that their definitions in the master document are valid. Remember that each such definition specifies a source document and the identifier of a block to be extracted from it ( 3.9.1.3 ). As for the presence of the source documents, this has already been tested ( 5.1.3.1 ). What remains is a check of the validity of the @select identifier, if it is present.

We can insert this check into the same rule context as the document availability check:

 <rule context="menu//page | blocks/block">
   <!-- ... document availability check ... -->
   <assert test="
       every $i in
         (for $l in $master//languages/lang
          return (
            if (@select)
            then boolean(document(eg:page-src(@id, $l))
                           //block[@id=current()/@select])
            else true()))
       satisfies $i">
     A block with @id="<value-of select="@select"/>" that the
     "<value-of select="@id"/>" orthogonal block refers to is missing.
   </assert>
 </rule>

Here, the test expression attempts to load documents with the current @id for all defined languages and searches each of them for a block whose @id matches our current @select. The quantified expression every ... in ... satisfies ... returns true only if all values in the sequence satisfy the test: in this case, if all pages or blocks with @select match existing blocks in the corresponding source documents in all languages.

Dynamic block definitions. These checks only cover orthogonal blocks defined in the master document. Dynamic block definitions (3.9.1.4) can also be validated, but this validation will depend on the implementation of the creators of dynamic data. For example, if some of your dynamic processes are implemented as callable templates in the main transformation stylesheet, nothing prevents you from loading that stylesheet with document() and checking that it does in fact contain an xsl:template with the corresponding name and xsl:params.

Block references. Now, the rule for validating orthogonal or dynamic block references in page documents is very simple, as it only needs to look up a block with a given @id in the master:

 <rule context="page//block[@idref]">
   <assert test="@idref = $master//blocks/block/@id">
     A block with @idref must match the @id of one of the 'block'
     elements in the master document.
   </assert>
 </rule>

5.1.3.7 Uniqueness

IDs in DTDs. XML has a limited mechanism for ensuring uniqueness of attribute values. If you declare an attribute of the type ID in your DTD, a validating parser will report an error if two or more elements in the same document share the same value of this attribute. It is easy to see three big problems with this approach:

The uniqueness constraint applies only to a single document. There's no way to ensure cross-document (e.g., site-wide) uniqueness.
The ID -typed attributes must be unique across all element types used in your document. For example, if you have <foo id="xyz">, then not only the other foo elements but all other elements as well are prevented from having the same identifier value. In other words, you cannot specify one group of unique identifiers for paragraphs and another for sections; if these two groups overlap, this is a validity error.
There is no way to check elements' content for uniqueness; the ID mechanism only applies to attribute values.

Along with these three big problems, there's a smaller one as well: To verify uniqueness, you must have a complete DTD and use a validating parser for your document. As we saw in Chapter 2 ( 2.2.4 ), you may prefer alternatives to DTDs, and introducing a DTD into your setup for the sole purpose of checking uniqueness is a major hassle.
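The second of these problems is easy to reproduce. The following is only an illustrative sketch: the internal DTD subset and its content models are assumptions made for this example, not part of the source definition discussed in this book. A validating parser rejects the document because the value intro is used twice, even though the two attributes belong to different element types.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!ELEMENT page    (section+)>
  <!ELEMENT section (head, p+)>
  <!ELEMENT head    (#PCDATA)>
  <!ELEMENT p       (#PCDATA)>
  <!ATTLIST section id ID #IMPLIED>
  <!ATTLIST head    id ID #IMPLIED>
  <!ATTLIST p       id ID #IMPLIED>
]>
<page>
  <section id="intro">
    <head>Introduction</head>
    <!-- Validity error: "intro" is already taken by the section,
         although section and p are different element types -->
    <p id="intro">First paragraph.</p>
  </section>
</page>
```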

The Schematron way. Suppose we want to be able to assign identifiers to our p and head elements and ensure that these id values are unique throughout the entire web site. On the other hand, if any other element (e.g., section ) uses the same id , this is not an error.

Can our Schematron schema handle this? Easy:

 <rule context="p[@id] | head[@id]">
   <assert test="
     count(
       for $src in distinct-values($master//(page|block)/@src)
       return
         document(concat($src-path, $lang, '/', $src, $src-ext))
           //(p|head)[@id=current()/@id]
     ) = 1">
     Non-unique @id in a 'p' or 'head':
     @id="<value-of select="current()/@id"/>" is used in:
     <value-of select="
       string-join(
         $master
           //(page|block)
             [document(eg:page-src(@id, $lang))
               //(p|head)[@id=current()/@id]
             ]
           /@src,
         ', ')"/>.
   </assert>
 </rule>

The test expression is not as scary as it might seem at first. It simply searches across all registered page documents ( $master//(page|block) ) and finds all distinct values of their @src attributes. The distinct-values() call is necessary because some pages may be mentioned twice, for example, once in the menu and again as a source of an orthogonal block.

Then, @src values are unabbreviated into complete paths,^[7] the corresponding documents are opened, and all p and head elements are taken. Of them, the expression selects those whose @id is the same as the current element's @id . If the number of such elements is exactly 1, we are fine. Otherwise, we have a problem.

^[7] Note that here, we cannot use any of our unabbreviation functions because these functions take a page @id as input, but what we have is @src .
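If the inline concat() call bothers you, one option is to add a small @src-based helper to the shared library. The sketch below is an assumption, not part of the library listing: the name eg:src-to-path is hypothetical, and the function presumes that $src-path and $src-ext are global variables visible in the library and that the xs and eg prefixes are declared there.

```xml
<!-- Hypothetical helper: builds a source document path from an
     @src value rather than from a page @id -->
<xsl:function name="eg:src-to-path" as="xs:string">
  <xsl:param name="src" as="xs:string"/>
  <xsl:param name="lang" as="xs:string"/>
  <!-- Same concatenation as in the uniqueness test above -->
  <xsl:sequence
      select="concat($src-path, $lang, '/', $src, $src-ext)"/>
</xsl:function>
```

With such a function in place, the document() call in the test could be written as document(eg:src-to-path($src, $lang)).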

Diagnostics. To report this problem, another XPath expression in value-of extracts the @src values (pathnames) of those source documents that contain the duplicate id attributes (at least one of them will correspond to the document you are validating). For example, you might get diagnostics like

 Line 32:
   Non-unique @id in a 'p' or 'head':
   @id="foo" is used in:
   team/index, team/hire, subscribe.

Extensible uniqueness. The same mechanism may be used not only for identifiers and not only for attributes, but also for any other data that must be unique. For example, you may want to ensure that each one from a set of images is referenced only once or that there are no two paragraphs with the same text across the site.
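As a sketch of the second idea, the rule below applies the same counting technique to paragraph text instead of @id values. It reuses the variables from the rule above ( $master , $src-path , $lang , $src-ext ); the pattern name, the use of normalize-space() to ignore whitespace differences, and the decision to skip empty paragraphs are my assumptions. The rule lives in a separate pattern so that paragraphs already matched by the @id rule are still checked.

```xml
<pattern name="Unique paragraph text">
  <!-- Skip empty paragraphs; they would all count as duplicates -->
  <rule context="p[normalize-space(.) != '']">
    <assert test="
      count(
        for $src in distinct-values($master//(page|block)/@src)
        return
          document(concat($src-path, $lang, '/', $src, $src-ext))
            //p[normalize-space(.) = normalize-space(current())]
      ) = 1">
      The text of this paragraph also appears elsewhere on the site.
    </assert>
  </rule>
</pattern>
```

On a large site this compares every paragraph against every other one, so it may be worth running only in batch validation rather than on every edit.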

Stay tuned. I think the examples in this section are an impressive testament to the combined power of Schematron and XPath 2.0. Still, this is not the ultimate schema yet. At the end of the chapter, we'll enable the stylesheet to run batch validation of all source documents ( 5.6 ). Validating, transforming, and possibly even uploading the entire site with one simple command: now that's convenience!

