2.2 Practical schematization

There are two levels of correctness of an XML document. The lower level is syntactic and structural: End tags must match their start tags, elements must nest properly, all quotes must be closed, all special characters must be properly escaped, and so on. The XML specification terms such documents well-formed. Fortunately, you don't have to do anything special to ensure well-formedness: Any XML parser (used in transformation or even built into your XML authoring tool, 6.1) will report well-formedness errors immediately.
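As a minimal illustration of these rules, the fragment below is well-formed. The element and attribute names are hypothetical and chosen only for this sketch; what matters is the matching tags, proper nesting, closed quotes, and escaped special characters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical page; the names are illustrative only. -->
<page title="Q&amp;A: the &quot;basics&quot;">
  <section>
    <!-- Every start tag has a matching end tag. -->
    <head>Start and end tags match</head>
    <!-- 'em' is closed before 'p' is, and '<' is escaped as &lt;. -->
    <p>Elements nest <em>properly</em>, and 2 &lt; 3 is escaped.</p>
  </section>
</page>
```

Dropping any single end tag, closing `em` after `p`, or writing a bare `&` in the title would make a parser reject the document before any schema is consulted.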

It is only at the higher level, the level of semantic correctness, that a well-formed document has to be validated against a source definition. As we saw in the previous section, a typical source definition may include a number of document types and be subdivided into the document and super-document layers (2.1.1). At the implementation level, the core of a source definition is one or more schema documents written in a schema language. This section examines the existing schema languages and discusses various issues related to schema design and implementation.

2.2.1 Choosing the language

One of the first decisions you face is the choice of the schema language to use. Once limited, this choice is now quite wide and keeps getting wider.

Besides the old and proven DTDs, you can use any of the numerous other languages capable of formally defining the structure and content of XML documents. The best known of these is W3C's XML Schema Definition Language ^[6] (often called simply "XML Schema" but abbreviated XSDL in this book). However, there are other schema languages that are no less deserving of your attention. Below we'll look at the main issues to consider when choosing a schema language for your project.

^[6] www.w3.org/XML/Schema

2.2.1.1 Languages for building grammars

All schema languages can be divided into two major groups. The first group encompasses grammar-based languages such as DTD, XSDL, and RELAX NG. ^[7] A grammar of a language is its formal description that aims to cover the entire language; if a document has a feature not covered by the grammar, this is because either the grammar is incomplete or the document is invalid. In either case, an error is reported.

^[7] www.oasis-open.org/committees/relax-ng

Grammar descriptions work "downward"; that is, they start from the most global structural units and proceed to the local constructs defining everything in between. Therefore, a grammar is often similar in structure to the document it describes; for example, an element type declaration in XSDL may have the declarations for its descendants laid out exactly as are the real descendant elements in a valid document.
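The following XSDL sketch shows this "downward" layout. It assumes a hypothetical section element containing one head and one or more p children, both treated as plain strings for simplicity.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- The declaration starts at the outermost unit... -->
  <xsd:element name="section">
    <xsd:complexType>
      <xsd:sequence>
        <!-- ...and lists its children in document order. -->
        <xsd:element name="head" type="xsd:string"/>
        <xsd:element name="p" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The nesting of the declarations mirrors the nesting of a valid instance: section on the outside, head followed by p inside it.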

Grammar limitations. Defining XML vocabularies through grammars is very natural, and the resulting schemas are usually straightforward to read and write. However, it is only in theory that this approach works perfectly; for many practical scenarios, it is too rigid and inflexible. One of the problems is that many practical types of constraints are impossible to define in grammar-based schema languages. More importantly, in practice you may want to use only a subset of validation rules.

For example, you may want to validate first the structure of a newly created document without checking attribute values or element content, and only go for full validation at a later stage. Sometimes, on the other hand, you may need to validate attribute values even though you know that the structure of the document is not yet valid. Some checks may be more important for you than others, and new checks are likely to become necessary as your document evolves through various stages of its lifecycle.

2.2.1.2 Languages for setting rules

With a grammar-based language, it is difficult to extract part of a schema and use it independently of the rest. A grammar either matches a document in its entirety, or it does not match at all. This is where the second kind of schema languages may be more suitable: the rule-based languages, of which the best known is Schematron. ^[8]

^[8] www.ascc.net/xml/resource/schematron/schematron.html

Precision aiming. A Schematron schema consists of an arbitrary number of rules, each describing one aspect of document structure or data values. These rules do not have to cover the entire grammar; anything in the document for which no rule is found is assumed to be OK by a Schematron validator. Rules can be given in arbitrary order having nothing in common with the order of the corresponding structural units in the document.

Thus, it is easy to start a Schematron schema by defining rules for what you think are the most important features of your document type, or those most likely to get botched by document authors. Even a one-rule schema is completely workable and may be useful. Later, you can grow your schema "upward" by adding new rules to it as you see fit, either to reflect new structures in an evolving document type or to guard against further practical markup errors. Rules can be grouped into patterns, and patterns can be turned on or off during validation to implement different validation scenarios (such as checking attribute values without checking the structure of elements).

Schematron primer. A simple Schematron example illustrates the above points. Example 2.1 is a Schematron rule combining three checks. The context attribute specifies that these checks will be applied to each section element in the source document. The first two checks verify the presence of obligatory child elements (head and p). The last check uses the XPath function normalize-space() to ensure that the section element contains no child text nodes other than whitespace, that is, no "dangling" bits of textual data not enclosed in an appropriate element.

Example 2.1. A simple Schematron rule.

 <rule context="section">
   <assert test="head">
     A 'section' must have a 'head'.
   </assert>
   <assert test="p">
     A 'section' must have at least one 'p' (paragraph).
   </assert>
   <!-- Passes only if no child text node has non-whitespace content;
        every text node is tested, not just the first one. -->
   <assert test="not(text()[normalize-space()])">
     A 'section' cannot contain text. Use a 'p' element to include a
     paragraph of text.
   </assert>
 </rule>

Of course, we could think up lots of other checks applicable to this simple structure. For instance, we could check that not more than one head element is a child of a section, or that a head comes before any p. However, the checks included in our example grew out of everyday markup practice: they were added to prevent the most common errors in real documents. You can always add more checks (including those that are impossible with a grammar-based schema language) to respond to the changing requirements of XML authors.
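The two extra checks mentioned above might be written as follows. This is only a sketch; the message wording is an assumption. Within one pattern, a node is checked only against the first rule whose context matches it, so these assertions would normally be added to the existing rule of Example 2.1 rather than placed in a second rule with the same context.

```xml
<rule context="section">
  <!-- At most one 'head' child is allowed. -->
  <assert test="count(head) &lt;= 1">
    A 'section' cannot have more than one 'head'.
  </assert>
  <!-- No 'p' may appear before a 'head' among the children. -->
  <assert test="not(head/preceding-sibling::p)">
    The 'head' of a 'section' must come before any 'p'.
  </assert>
</rule>
```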

As you can see, the only relatively tricky aspect of this example is the XPath expressions it uses. In fact, for those familiar with XPath, the learning curve of Schematron is nearly nonexistent. The reference implementation of Schematron (which we will use for our examples) is itself written in XSLT and translates a schema into an XSLT stylesheet ( 5.1.2 ). More complex Schematron schemas (Examples 3.3, 5.20) will be analyzed in the following chapters.

Growing rules into grammar. Ultimately, a set of rules in a Schematron schema may grow complete and thus become a grammar. Admittedly, because of the way it was developed, such a grammar may not be as prettily laid out and easy to read as an XSDL schema for the same document type (although it is likely to be more powerful). Of course, you can always organize it and clean it up if you feel like it, or you can even rewrite it completely in a grammar-based schema language. What's important is that your Schematron code has played its role: It allowed you to effectively validate your documents while they were being developed.

So, the rule-based approach makes Schematron an ideal "prototyping" schema language, useful at the early stages of any XML project. Moreover, the fact that Schematron is tightly coupled with XSLT and allows you to easily express rules on both document and super-document layers ( 2.1.1 ) makes it especially suitable for web site projects. If you use XSLT for transforming your XML and if you store the source in more than one document, Schematron is a natural choice.

Guided editing. One downside to a rule-based schema is that it cannot answer arbitrary questions about valid documents, such as "what attributes are permitted for this element type?" or "what type of element can come after this element?"

A rule-based schema is, in a sense, a collection of canned answers to questions that its developer deemed most important, so you cannot rely on it to contain the answer to your particular question. Conversely, a grammar-based schema is a complete description of a valid document, and you can use it to find out an answer to any question so long as it belongs to one of the types covered by this schema.

One practical consequence of this is that you need a grammar-based schema if you want your XML authoring tool to provide guided editing ( 6.1.1.1 ), that is, to suggest valid markup at any point in the document. In order to compile, for example, a list of element types that you can insert at some specific point, an XML editor must have a complete grammar of the document type, not a collection of disjointed checks from a rule-based schema.

Obviously, guided editing is a feature most useful for site editors, not developers. This gives you another reason to create a grammar of your source definition after it is developed and tested but before the bulk of the site's content is marked up with it.

Best of both worlds. With XSDL, you can embed modules written in other schema languages, including Schematron, into your grammar-based schemas. This approach is attractive because it combines the completeness and logical layout of XSDL with the power and precision of Schematron rules.
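One common way to do this is to place Schematron elements inside an xsd:appinfo annotation, as in the sketch below. The section, head, and p names and the particular check are assumptions made for illustration. An XSDL validator ignores the content of xsd:appinfo, so the embedded rules must be extracted and run by a separate Schematron step.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://www.ascc.net/xml/schematron">
  <xsd:element name="section">
    <xsd:annotation>
      <xsd:appinfo>
        <!-- A Schematron check that the grammar below cannot express. -->
        <sch:pattern name="Section checks">
          <sch:rule context="section">
            <sch:assert test="normalize-space(head) != ''">
              The 'head' of a 'section' cannot be empty.
            </sch:assert>
          </sch:rule>
        </sch:pattern>
      </xsd:appinfo>
    </xsd:annotation>
    <!-- The grammar part: structure and order of children. -->
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="head" type="xsd:string"/>
        <xsd:element name="p" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```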

2.2.1.3 Modularity

Modularity is the best way to keep complex projects under control. Unless you break your work down into manageable and reusable pieces, further development and maintenance may soon become excessively difficult.

A web site's source definition is no exception. XML is intrinsically modular in that element types and attributes, declared once, can be reused arbitrarily many times. However, for practical purposes this is not sufficient. A schema must enforce some higher-level abstractions above element type and attribute declarations, so it can be split into modules that are sufficiently orthogonal (such that changing one module introduces little risk of breaking other modules), easy to maintain, and easy to reuse.

Different schema languages provide different high-level abstractions and therefore different methods of modularizing schemas. This is another important aspect that you should consider before selecting one of the languages. Try to choose the language whose way of dividing schemas into interconnected modules appears closest to the way you tend to think about your source definition.

Since all schema languages exist in the common XML universe, the pieces they consist of at the lowest level are the same: element type declarations, attribute declarations, and content models for specifying what elements, in what order, may occur within other elements. Also, most schema languages support the notion of a document type and allow modularizing schemas at this level. Beyond this, however, it becomes more interesting.

XSDL emphasizes data types and provides an extensive set of tools that you can use to define, extend, restrict, inherit, and reuse data types. Therefore, one could say that XSDL is modular primarily at the data type level. A library of reusable components in XSDL is likely to consist mainly of type definitions that you can reuse in your schema's declarations.
In DTDs, the modularization mechanism is parameter entities (see 2.2.1.4 for an example). An entity is similar to a text editor's macro in that it works at the character level and just replaces an identifier (called a parameter entity reference) with an associated fragment of text or external object. Any syntax checks are made only after all entity references are expanded.

This approach is proven and powerful, but may lead to hard-to-track bugs. With DTDs, however, this paradigm is effective, as it allows you to create complex schemas that are pretty well modularized, even if sometimes hard to read.

Unlike most other schema languages, DTDs do not support local element type declarations. This means that you cannot restrict an element type to certain contexts, such as within a certain parent element. Any element type you declare becomes global, and you cannot have two global element types with the same name. For example, if a book element can have an author child and so can a song element, a DTD will validate this only if these two authors have exactly the same children and attributes, or if book and song are in different document types. This is an important reason why DTDs are hard to modularize, although it provides consistency for the authors.
Schematron is the only schema language that does not have explicit element type and attribute declarations as its basic building blocks. Instead, it allows you to specify arbitrary checks that a valid document must pass. Nothing prevents you, however, from arranging these checks into groups so that each group defines all aspects of one element type and is thus a functional equivalent of another schema language's element type declaration.

On the other hand, you can group your checks in any way that makes sense for your application. Schematron offers several levels at which checks can be grouped (rule, pattern, phase), and you can switch different phases on or off for each validation pass. Finally, new rules can be defined to extend existing abstract rules when applied to specific contexts. This flexibility makes it possible to create Schematron schemas that are not only effective but modular and easy to extend.
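The grouping levels described for Schematron can be sketched as follows, using Schematron 1.5 syntax. The element names (section, appendix), the identifiers, and the checks themselves are hypothetical. How a phase is selected for a validation pass depends on the implementation you use.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <!-- Each phase activates a subset of the patterns. -->
  <phase id="structure">
    <active pattern="structure-checks"/>
  </phase>
  <phase id="attributes">
    <active pattern="attribute-checks"/>
  </phase>

  <pattern name="Structure checks" id="structure-checks">
    <!-- An abstract rule has no context of its own. -->
    <rule abstract="true" id="has-head">
      <assert test="head">This element must have a 'head'.</assert>
    </rule>
    <!-- Concrete rules reuse the abstract rule in specific contexts. -->
    <rule context="section">
      <extends rule="has-head"/>
      <assert test="p">A 'section' must have at least one 'p'.</assert>
    </rule>
    <rule context="appendix">
      <extends rule="has-head"/>
    </rule>
  </pattern>

  <pattern name="Attribute checks" id="attribute-checks">
    <rule context="section">
      <assert test="@id">A 'section' must have an 'id' attribute.</assert>
    </rule>
  </pattern>
</schema>
```

Running only the attributes phase checks the id attribute without reporting structural problems, which matches the scenario of validating attribute values while the structure is still in flux.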

2.2.1.4 Expressiveness

Fake integers in DTDs. A typical schema can express much more than it can enforce. For example, if you want to declare in your DTD an attribute that only takes integer values, you might think you're out of luck because an integer attribute type is not enforceable via a DTD. However, you can define an entity:

 <!ENTITY % integer "CDATA">

(here CDATA means "any character data") and then use it whenever you want to define an integer-valued attribute. True, for an XML parser this trick is meaningless, as it still won't be able to tell that a value of "xyz" for such an attribute is wrong. But for a person looking up an element's attribute list in the DTD, a reference to such an entity

 <!ATTLIST element   attribute %integer; #IMPLIED >

makes a lot more sense than just

 <!ATTLIST element   attribute CDATA #IMPLIED >

to which it is formally equivalent.

Some might argue that this trickery is useless and can even be misleading, because it gives a DTD author a false feeling of security that is not based on any solid foundation. This may be true, but it is also true that readability is an important aspect of reliability. Other schema languages have other kinds of limitations where similar unenforceable hints might be necessary.

This is a complete sentence. Even though Schematron's XPath expressions are very powerful, you cannot use them to enforce rules that can't be formulated algorithmically, even though these rules may be very important for your source definition. For example, you may require that a heading is always a complete sentence (and not, say, a single word or a phrase). While you could output a warning if the number of words in a heading seems to be too small for a sentence, you cannot reliably catch this error using XPath.

You can, however, make your schema more expressive and more useful with regard to this rule in several ways:

Document the schema (see also 2.2.3 ). This is the least obtrusive but the least efficient approach, as only those XML authors who bother to read the documentation will be aware of the restriction.
Provide validation-time diagnostics. If you cannot check if an element satisfies a rule automatically, you can still remind the user to see to it whenever your schema runs across this element type. This is obviously a more obtrusive option, but it may be advisable if the element type in question does not occur too frequently or the requirement you're trying to enforce is very important.
Choose a "talking" name. Even though it is the structural role that must be the basis for selecting the name for an element type (2.3.4), sometimes other factors can participate too. If there's an important but formally unenforceable requirement concerning some element type, you can reflect it right in its name, for example by using heading-sentence instead of just heading if you want the heading to contain a complete sentence. This way, whoever is authoring the XML source will be reminded of the rule every time he or she inserts the corresponding element. Of course, longish and unwieldy names may become a major nuisance, so use this method only for really important aspects of your source definition.

Example 2.2 shows a fragment of a Schematron schema ^[9] that implements all three approaches listed above. The heading-is-a-complete-sentence rule is documented in the schema and is additionally reinforced by the choice of the element type name. An unconditional "reminder" is fired whenever a heading-sentence element is encountered, and two additional checks ensure that the element's value contains at least one space between words and that its last character is a letter or a digit. ^[10]

^[9] The function matches() in the test expressions is from XPath 2.0 ( 4.2 ).

^[10] Note that Schematron 1.5 does not allow p within rule , so I had to use XML comments to provide per-rule documentation.

2.2.1.5 Strictness

The question of how strict your schema must be is equivalent to the question of how wide the gray zone is: the zone of XML structures which, from the viewpoint of the schema author, do not make much sense but do no harm either and are therefore considered valid. There are two opposite approaches here: either "whatever is not permitted is forbidden" or "whatever is not forbidden is permitted".

Each of the schema languages naturally gravitates toward one of these two approaches. For example, if you don't explicitly permit a certain attribute on a certain element type in a DTD, using this attribute in an XML document is a validity error. On the other hand, if you say nothing about some element type in a Schematron schema, corresponding instance elements are always considered valid. Still, by using techniques such as wildcards you can to some extent emulate both approaches in any schema language.
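As a sketch of such emulation, the two fragments below go against the natural tendency of each language. The names section, head, p, and looseSection are hypothetical. The first fragment makes a Schematron schema strict about the children of a section; the second uses XSDL wildcards to leave part of a content model open.

```xml
<!-- Schematron: "whatever is not permitted is forbidden".
     Any child of 'section' other than 'head' or 'p' is an error. -->
<rule context="section/*">
  <assert test="self::head or self::p">
    Only 'head' and 'p' elements are allowed inside a 'section'.
  </assert>
</rule>

<!-- XSDL: "whatever is not forbidden is permitted".
     After the required 'head', any elements and any attributes pass. -->
<xsd:complexType name="looseSection"
                 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:sequence>
    <xsd:element name="head" type="xsd:string"/>
    <xsd:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:anyAttribute processContents="lax"/>
</xsd:complexType>
```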

Example 2.2. Checking `heading-sentence` with Schematron.

 <pattern name="Heading checks">
   <p>This pattern's rules check the validity of
   various heading elements.</p>
   <rule context="heading-sentence">
     <!-- This element must contain exactly one complete sentence (i.e., one
          with a subject and a predicate) but no punctuation at the end. -->
     <report test="true()">
       Check that this element contains a complete sentence.
     </report>
     <report test="matches(normalize-space(), '[^A-Za-z0-9]$')">
       The last character of this element's content is neither a
       letter nor a digit.
       Please check that there is no punctuation at the end
       of the heading sentence.
     </report>
     <report test="not(matches(normalize-space(), ' '))">
       This element's value has no spaces. You cannot write a
       complete sentence without at least one space between words.
     </report>
   </rule>
   <!-- more rules -->
 </pattern>

Which approach is better? For database-like XML ( 1.2 ) produced and consumed by programs, something not explicitly prescribed in a schema is most likely an error. For documents written and read by human beings, the opposite is more often true. You cannot possibly foresee all the real-world circumstances that may force you to look for markup workarounds in your documents. Therefore, the "whatever is not forbidden is permitted" approach is usually more suitable for a web site source definition.

Once again, Schematron turns out to be designed for the task. A grammar-based schema, for example, would require you to explicitly list all allowed attributes in an element type declaration; with Schematron, you can prohibit or require certain attributes within certain elements (possibly depending on the context in which an element occurs) and pay no attention to all others. The Schematron motto is, "Don't bother defining it unless it causes you problems."
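A minimal sketch of this selective approach is shown below, assuming a hypothetical link element with href and target attributes. Only the two attributes that matter are mentioned; all other attributes on link pass validation unnoticed.

```xml
<pattern name="Link attribute checks">
  <rule context="link">
    <!-- Required everywhere. -->
    <assert test="@href">
      A 'link' must have an 'href' attribute.
    </assert>
    <!-- Prohibited only in one context: inside a 'heading'. -->
    <report test="@target and ancestor::heading">
      A 'link' inside a 'heading' cannot have a 'target' attribute.
    </report>
  </rule>
</pattern>
```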

2.2.2 Schema creation scenarios

For those who prefer to get results fast, writing a formal source definition may look like a waste of time. XML is so intuitive that the temptation to jump straight into authoring (leaving the definition of the documents' structure for later) is very strong. And when the first sample pages are ready and tested (perhaps even with real content and a real stylesheet), the incentive to go back to a formal definition of what you've just created is even weaker. After all, the page templates are so self-explanatory, why bother describing them in yet another layer of complexity?

Indeed, the pedantic, make-a-plan-first-then-start-to-code approach may not be the best for everyone. Are there alternatives? This depends on the complexity of your project, your experience with XML, as well as the level of expertise of those who will be maintaining and supporting the site after it is launched. (In fact, it is likely that you'll get a strong motivation to formally define your markup once you see the incredible errors others are making in their documents.)

2.2.2.1 Working incrementally

If your site's source structure need not be too complex, and especially if you are reusing some bits from previous projects, you can work on the schema in parallel with the actual XML documents. This way, when you think you need a new structural unit, you add it both to the XML document you're writing and to the schema. Here, the actual documents are your drafting board; a schema is simply kept in sync so your documents will validate.

However, this approach will be pointless unless you take time to carefully review, clean up, modularize, and generalize your schema as soon as most of its components are in place.

Starting small. Especially convenient are schema languages that allow the schema to be incomplete but still workable, such as Schematron. For instance, you can quickly write a Schematron schema to check that a heading element has only translation children, each having a unique (within the heading ) value of the language attribute (compare 2.3.5 ). You need not specify any other restrictions or list any other element types or attributes for this schema to work; anything not mentioned in its single check will simply be ignored during validation.
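Such a starter schema might look like the sketch below, written in Schematron 1.5 syntax. The message wording is an assumption, and the checks are split into three assertions for readability.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Heading translations">
    <rule context="heading">
      <!-- No child element other than 'translation' is allowed. -->
      <assert test="not(*[not(self::translation)])">
        A 'heading' can contain only 'translation' elements.
      </assert>
      <!-- Every 'translation' must state its language. -->
      <assert test="not(translation[not(@language)])">
        Each 'translation' must have a 'language' attribute.
      </assert>
      <!-- Fires if a language value repeats within this 'heading'. -->
      <report test="translation[@language = preceding-sibling::translation/@language]">
        Two 'translation' elements in this 'heading' have the same language.
      </report>
    </rule>
  </pattern>
</schema>
```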

As you continue to add new structural units to your XML, you can expand this schema to express more constraints. In principle, you can even declare the project finished with such an incomplete schema, provided you are sure that it catches the most likely markup errors and that whoever authors new XML documents will not break anything "obvious" that the schema does not cover (this last assumption may sound more plausible if new documents are always created from templates, 2.2.3.3, that already contain basic structural blocks). However, it is still advisable to fill in all the blanks and make the rule-based schema as complete as possible so that only fully compliant documents will pass validation in the daily maintenance of the site.

Ending big. Even if throughout the development cycle, you were working with a rule-based language such as Schematron, you may still need to provide a grammar-based schema (e.g., a DTD or an XSDL schema) when the source definition is complete. This may be a result of several factors.

First, you have to consider the limitations of your production setting. For example, if your web site is going to be part of a larger XML framework that supports only DTDs, you must provide DTDs for your document types. The same is true if you need to integrate the web site with existing schema libraries; for example, you may want to base your web page's data types on those defined in existing XSDL schemas.

Second, you should remember that a schema is more than just a filter that separates valid and invalid documents; it is also one of the principal parts of the system's documentation, the ultimate reference manual of your source definition. It must therefore reflect not only valid structures in your documents, but also the larger ideas and concepts behind these structures. The schema language ideally satisfying these documentation requirements will likely be different from the language that is most convenient for development and practical validation; obviously, complete grammars make better documentation than collections of disjointed validation checks.

Still another reason (already mentioned in 2.2.1.2 ) to write a grammar for your source documents is to enable guided editing if your XML editor supports it.

2.2.2.2 Changing the rules

Don't worry if you don't get it right the first time and have to make changes to your source definition, either in the process of writing the stylesheet or even during after-launch site maintenance. Admittedly, such changes can be costly because they may necessitate modifying markup in a lot of documents (although in many cases, you can automate this by creating an XSLT stylesheet that will transform your documents from the old markup to the new one), but sometimes they are unavoidable. Here are some bits of advice:

Accumulate many small changes into a few large "releases" that are put into effect simultaneously across the entire system. (Be careful, however, not to frighten your users by making these changes too sweeping.)
Carefully document all changes (this, of course, implies that the original markup rules that you are changing were also well documented).
Make sure that your schemas, transformation stylesheet, and other software are aware of both old and new versions of markup. Provide corresponding checks and helpful error messages. Whenever possible, make the system backward-compatible, but warn the user that the old format, even if it still works, should be changed to the new one as soon as possible.
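The automated markup migration mentioned above can be sketched as an XSLT stylesheet built around the identity template. The change it implements, renaming heading to heading-sentence, is a hypothetical example; a real migration would replace the second template with whatever your release actually changes.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not handled below unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical change: old 'heading' becomes 'heading-sentence',
       keeping its attributes and content. -->
  <xsl:template match="heading">
    <heading-sentence>
      <xsl:apply-templates select="@*|node()"/>
    </heading-sentence>
  </xsl:template>

</xsl:stylesheet>
```

Because everything else is copied as is, the stylesheet can be run over all source documents, including those that do not use the old markup at all.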

2.2.3 Documenting schemas

An ideal schema does not need documentation because it is documentation itself. Indeed, a complete human-readable specification of an XML vocabulary is also its "schema" in the sense that you can use it as the ultimate authority on whether or not an instance of that vocabulary is conformant. The only problem with such a "schema" is that you need a human to apply it to each document, which makes it costly, slow, and error-prone.

2.2.3.1 Documenting in different languages

It is only natural to try to combine formalized schemas that permit automatic validation with human-readable documentation. Schema languages provide embedded documentation tools that are as different as user requirements can be. Here are a few examples:

With DTDs, you can add documentation to a schema by using comments (<!-- ... -->). The comments can contain XML markup, except for other comments.
XSDL offers a powerful mechanism whereby you can use element types from any vocabulary (e.g., HTML or DocBook) to provide any amount of documentation on the components of your schema. The documentation elements are distinguished from XSDL elements by their unique namespace. Because both the schema itself and its embedded documentation are in XML, it is possible to process schemas using an XSLT stylesheet (for example, to extract all documentation into a separate document).
Schematron can also embed arbitrary XML elements in schemas. ^[11] However, Schematron's focus is different; its rules are typically much less structured than XSDL or DTD declarations, so trying to structure the documentation according to the layout of these rules may not be an optimal strategy. Instead, what Schematron excels in is on-demand diagnostics tied to specific markup contexts or triggered by specific errors.

^[11] The Schematron 1.5 specification lists a number of element types permitted inside the p elements that are intended for documentation, but most implementations will allow you to use arbitrary markup there.

For many users, this "context-sensitive help" feature of Schematron can be even more useful than narrative-style documentation, as it may help them learn practical markup much faster. Note also that Schematron diagnostics are delivered to the user directly and are not (unlike other languages) interspersed with markup declarations; for some users, this makes a lot of difference in terms of documentation usability.
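A sketch of such on-demand help in Schematron 1.5 is given below. The identifier missing-head and the message texts are assumptions. The short assertion message reports the error, while the linked diagnostic carries the longer explanation that an implementation can show alongside it.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Section checks">
    <rule context="section">
      <!-- The 'diagnostics' attribute points to the help text below. -->
      <assert test="head" diagnostics="missing-head">
        A 'section' must have a 'head'.
      </assert>
    </rule>
  </pattern>
  <diagnostics>
    <diagnostic id="missing-head">
      Add a 'head' element as the first child of this 'section' and
      put the section title in it. Do not use a 'p' for the title.
    </diagnostic>
  </diagnostics>
</schema>
```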

2.2.3.2 Documentation components

What should be in the documentation of a source definition for it to be useful? Surely the idea of "well-written documentation" is quite subjective: Where one user would prefer a detailed narrative with examples and explanations, another would be perfectly happy with basic templates and a validator that provides minimal diagnostics for markup errors. Your best bet, therefore, is to discuss documentation requirements with the actual users of your source definition. In my experience, the components of successful documentation include, in order of decreasing usefulness, the following:

Any relevant rules that are not in the schema itself. As we've seen, schema languages vary widely in how much of a complete source definition they can formally express. DTDs are particularly weak in this regard, but even with XPath-based Schematron rules, there may be certain conventions that you cannot check automatically. It's obviously a priority to supply such rules in human-readable form so that XML authors can avoid bad practices or at least figure out what they've been doing wrong.

Example 2.3. An XSDL element declaration with embedded documentation containing markup examples.

 <xsd:element name="block"
              xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:annotation>
     <!-- xsd:annotation may contain only xsd:documentation and
          xsd:appinfo, so the HTML is wrapped in xsd:documentation. -->
     <xsd:documentation xmlns="http://www.w3.org/1999/xhtml">
       <p>An example of a block:</p>
       <pre><![CDATA[
 <block>
   <heading>A heading</heading> <!-- no full stop! -->
   <p>A paragraph of text.</p>
   <p>Possibly one more paragraph.</p>
 </block>
 ]]></pre>
       <p>Note, however, that a block may be empty if it
       contains a reference to an external resource:</p>
       <pre><![CDATA[
 <block idref="block-id"/>
 ]]></pre>
     </xsd:documentation>
   </xsd:annotation>
   ...
 </xsd:element>

Markup examples. As for documenting the rules that are in the schema, examples of conforming markup work best: An example is worth a thousand words. You should provide not only typical examples but also any special or borderline cases if they are likely to cause problems. If used in XML (as opposed to plain text or XML comments), use CDATA sections ^[12] for example code so it is not treated as part of the markup. Example 2.3 shows such a CDATA section embedded into HTML documentation in an XSDL schema.

^[12] www.w3.org/TR/REC-xml#sec-cdata-sect

Note how the examples alleviate the need for long descriptive documentation and allow the author to add a succinct note in an XML comment exactly where it is relevant. Note also that CDATA sections alone are not sufficient: they only protect special characters but are not elements themselves, so a pre element is added for each example.
Structural information, such as descriptions of content models and attribute data types. In a grammar-based schema, this information is already formally expressed by the schema itself, but depending on users' familiarity with your particular schema notation, it may be beneficial to reword it in plain English (possibly adding some non-formalizable usage requirements or suggestions). For rule-based schemas such as Schematron, providing this information is even more of a necessity because such a schema may not contain any coherent formal description of document structure at all.
Metadata. Finally, it is a good practice to document your schema's metadata, such as authorship and copyright information. For a schema that has already been in use, it is especially important to have a change log documenting all changes made to the current version since its first public release, including dates, authors, and details for each change (see also 2.2.2.2 ). This information should be given prominently at the top of a schema document.
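As a minimal sketch of the metadata item above, the top of an XSDL schema document might carry an annotation like the following. The layout and all uppercase placeholders (AUTHOR NAME, YYYY-MM-DD, and so on) are illustrative assumptions, not a prescribed format; substitute your own data.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Schema-level metadata, placed before any declarations -->
  <xsd:annotation>
    <xsd:documentation xmlns="http://www.w3.org/1999/xhtml">
      <!-- All uppercase values are placeholders (hypothetical) -->
      <p>Author: AUTHOR NAME. Copyright: COPYRIGHT NOTICE.</p>
      <p>Change log since the first public release (newest first):</p>
      <ul>
        <li>YYYY-MM-DD, AUTHOR NAME: DESCRIPTION OF THE CHANGE.</li>
        <li>YYYY-MM-DD, AUTHOR NAME: First public release.</li>
      </ul>
    </xsd:documentation>
  </xsd:annotation>
  <!-- Declarations follow; a single untyped element keeps the sketch complete -->
  <xsd:element name="block"/>
</xsd:schema>
```

Keeping the newest change-log entry first lets a reader see at a glance what has changed in the current version.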

2.2.3.3 Page templates

An important part of a source definition is a set of document templates. A template is an example document with dummy (or no) data content, showing a typical layout of a valid XML document, usually with comments. Users can take the template as a starting point for creating a new document simply by filling in content between the template's supplied tags. You should provide a template for each sufficiently distinct type of source document.

Even if the users of your source definition (i.e., site editors and maintainers) have never worked with XML before, the concept of storing content units within matching pairs of tags should not be hard to grasp. It's actually a very natural way to think even for nontechnical people. However, starting a new document from scratch may be difficult even if you know what you want to get, and this is where page templates are invaluable.

A part of the difficulty is that, along with the meaningful content, an XML document may also contain a lot of metadata such as the XML declaration (<?xml ... ?>), a stylesheet processing instruction (<?xml-stylesheet ... ?>), a DOCTYPE declaration, or an internal DTD subset. To many "light" XML users, these constructs look (deservedly) much more frightening and indecipherable than the body of the document. So, providing a template with all this stuff already filled in (and making sure it doesn't need to change for each new document) is usually a good idea.
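A page template along these lines might look as follows. This is only a sketch: the page, block, heading, and p element types and the page.dtd file name are borrowed from examples elsewhere in this chapter, while the title element and the placeholder texts are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Boilerplate: leave the lines above and below exactly as they are. -->
<!DOCTYPE page SYSTEM "page.dtd">
<page>
  <!-- Replace the placeholder with the title of the page (hypothetical element). -->
  <title>PAGE TITLE</title>
  <!-- Copy this block once for each section of the page. -->
  <block>
    <heading>SECTION HEADING</heading>
    <p>FIRST PARAGRAPH OF TEXT.</p>
  </block>
</page>
```

The comments tell the user which parts to edit and which to leave alone, so the prolog never has to be understood in order to create a valid document.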

2.2.4 Using DTDs

As far as the document layer definitions are concerned, DTDs support basic constraints such as element type names, attribute lists, and content models. You cannot use DTDs to check the data type of an element's content, ^[13] express the dependence of an element's content model on the presence of an attribute, or perform complex syntactic checks of data values. As for the super-document layer, it's totally out of reach for DTDs.

^[13] There exists a DTD extension for this purpose called Datatypes for DTDs; see www.w3.org/TR/dt4dtd for more information. Open source code is available at www.XMLHandbook.com/DT4DTD.
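To make the capabilities and limits described above concrete, here is a sketch of a small DTD, written as an internal subset so that the example is a single self-contained document. The element and attribute names follow the block markup of Example 2.3; the content models are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!-- Content model: a page consists of one or more blocks. -->
  <!ELEMENT page (block+)>
  <!-- A block holds an optional heading followed by any number of
       paragraphs; this model also allows an empty block. -->
  <!ELEMENT block (heading?, p*)>
  <!-- Attribute list: both attributes are optional. -->
  <!ATTLIST block
            id    ID    #IMPLIED
            idref IDREF #IMPLIED>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT p (#PCDATA)>
]>
<page>
  <block id="block-id">
    <heading>A heading</heading>
    <p>A paragraph of text.</p>
  </block>
  <!-- Empty block that refers to the block above -->
  <block idref="block-id"/>
</page>
```

The DTD can say that a block may be empty and that it may have an idref attribute, but not that it may be empty only when idref is present. A block with neither content nor idref would therefore pass DTD validation.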

On the other hand, the DTD notation stands apart from other schema languages in that it is defined right in the XML Recommendation. ^[14] Also, DTDs have traditionally been used for defining XML vocabularies (including W3C standards), and the DTD notation is still the default schema language in many existing XML tools and frameworks. Another big advantage of this language is that a DTD validator is built into every validating XML parser, so you don't need any additional software for validation. Let's look at some of the issues related to using DTDs in a web site source definition.

^[14] www.w3.org/TR/REC-xml

2.2.4.1 DTDs and namespaces

DTDs are not namespace-aware, simply because namespaces were introduced to XML after the first version of the XML Recommendation, including the DTD syntax, was finalized. You can still use DTDs to declare and validate namespace-qualified names, but each such name must be written with a fixed namespace prefix and the ":" separator, because a DTD validator treats the qualified name as a single indivisible name.

For example, you can write a DTD declaration for an element type named xsl:stylesheet to specify its content model and attributes. However, from the DTD viewpoint, this element type will have nothing in common with xsl:template and, more importantly, nothing in common with my:stylesheet even if the prefix my was declared for the same URI as xsl.

A partial workaround involves declaring names without prefixes in your DTD and always using the default (prefixless) namespace for them in your XML documents. By providing prefixless DTD declarations for your primary namespace and declarations with fixed prefixes as needed for foreign-namespace elements, you can to some extent reconcile the limitations of DTDs with the requirements of modern multi-namespace documents. (Although this approach is not likely to work if you freely mix elements from arbitrary namespaces in your documents, such documents would not be DTD-valid in any case, so the DTD would be irrelevant.)
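The workaround can be sketched as follows. The primary namespace URI (http://example.org/ns/site) is hypothetical, the svg:svg declaration is a deliberately tiny stand-in rather than the real SVG DTD, and declaring the xmlns attributes as #FIXED is one common option that we assume here, not the only possible one.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!-- Primary namespace: names are declared without a prefix. -->
  <!ELEMENT page (block | svg:svg)*>
  <!-- To a DTD, namespace declarations are ordinary attributes and must
       be declared; #FIXED supplies them with unchangeable values. -->
  <!ATTLIST page
            xmlns     CDATA #FIXED "http://example.org/ns/site"
            xmlns:svg CDATA #FIXED "http://www.w3.org/2000/svg">
  <!ELEMENT block (#PCDATA)>
  <!-- Foreign namespace: the prefix is a fixed part of the declared name. -->
  <!ELEMENT svg:svg EMPTY>
  <!ATTLIST svg:svg width CDATA #IMPLIED>
]>
<page>
  <block>Text in the primary namespace.</block>
  <svg:svg width="100"/>
</page>
```

A document that binds the same SVG namespace URI to another prefix, say s:svg, is equivalent for a namespace-aware processor but fails DTD validation, because no element type named s:svg is declared.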

2.2.4.2 Linking DTDs to documents

DTDs don't need a separate processor for validation. Any validating XML parser (possibly including the one that will read your documents to pass them on to the XSLT processor) will report DTD errors immediately. All you need to do is link each of your documents to its DTD, as follows:

<!DOCTYPE page SYSTEM "page.dtd">
<page>
  <!--...-->
</page>

Here, page.dtd is the name of the file containing your DTD (in this example, it is assumed to be in the same directory as your document file; if it is not, you can provide a relative path to the file or a URI instead). Note that the DOCTYPE declaration must also mention the name of the root element type of your document (page).

An XSLT processor with a validating parser will first read in the document and check it against the DTD. Only if it is valid will transformation begin. However, the same document without the DOCTYPE declaration will be transformed just as well, except that any DTD conformance errors will not be caught (only well-formedness errors, such as unclosed elements or missing quotes in attributes, will halt the parser). Therefore, you should use DOCTYPE-less documents if you have a schema other than a DTD and a corresponding processor for validation.
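The difference between the two kinds of errors can be seen in a sketch like this one. The note element is hypothetical, and the DTD is given as an internal subset only to keep the example in one piece; with an external page.dtd the behavior would be the same.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!ELEMENT page (block+)>
  <!ELEMENT block (#PCDATA)>
]>
<page>
  <!-- Well-formed: every tag is closed and properly nested.
       Not valid: note is not declared, and page requires a block. -->
  <note>Remember to update this page.</note>
</page>
```

A validating parser rejects this document before transformation begins. A non-validating parser, or any parser given the same document with the DOCTYPE declaration removed, accepts it and passes the undeclared note element on to the stylesheet.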