Chapter 16. XML Schemas

CONTENTS

16.1 Overview
16.2 Schema Basics
16.3 Working with Namespaces
16.4 Complex Types
16.5 Empty Elements
16.6 Simple Content
16.7 Mixed Content
16.8 Allowing Any Content
16.9 Controlling Type Derivation

Although Document Type Definitions can enforce basic structural rules on documents, many applications need a more powerful and expressive validation method. The W3C developed the XML Schema Recommendation, released on May 2, 2001 after a long incubation period, to address these needs. Schemas can describe complex restrictions on elements and attributes. Multiple schemas can be combined to validate documents that use multiple XML vocabularies. This chapter provides a rapid introduction to key W3C XML Schema concepts and usage.

This chapter progressively introduces the structures and concepts of XML Schemas, beginning with the fundamental structure that is common to all schemas. The chapter begins with a very simple schema and proceeds to add more functionality to it until every major feature of XML Schemas has been introduced.

16.1 Overview

A schema is a formal description of what comprises a valid document. An XML schema is an XML document containing a formal description of what comprises a valid XML document. A W3C XML Schema Language schema is an XML schema written in the particular syntax recommended by the W3C.

In this chapter, when we use the word schema without further qualification, we are referring specifically to a schema written in the W3C XML Schema language. However, there are numerous other XML schema languages, including RELAX NG and Schematron, each with its own strengths and weaknesses.

An XML document described by a schema is called an instance document. If a document satisfies all the constraints specified by the schema, it is considered to be schema-valid. The schema document is associated with an instance document through one of the following methods:

An xsi:schemaLocation attribute on an element contains a list of namespaces used within that element and the URLs of the schemas with which to validate elements in those namespaces.
An xsi:noNamespaceSchemaLocation attribute contains a URL for the schema used to validate elements that are not in any namespace.
The validating parser may attempt to locate the schema using the namespace of the element itself in one of these ways: directly by looking for a schema at that namespace, indirectly by looking for an RDDL document at that namespace, or implicitly by knowing in advance which schema is right for that namespace.
A validating parser may be instructed to validate a given document against an explicitly provided schema, ignoring any hints that might be provided within the document itself.
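
As a minimal sketch of the first method, the instance document below lists two namespaces in a single xsi:schemaLocation attribute. The namespace URIs, element names, and the orders.xsd and customers.xsd file names are hypothetical placeholders chosen for illustration; they are not part of this chapter's running example.

```xml
<?xml version="1.0"?>
<!-- Hypothetical instance document: every name and URL below is illustrative -->
<ord:order xmlns:ord="http://example.com/ns/orders"
    xmlns:cust="http://example.com/ns/customers"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.com/ns/orders    orders.xsd
                        http://example.com/ns/customers customers.xsd">
  <!-- Each namespace URI in the list is followed by a location hint for its schema -->
  <cust:customer>Zoe</cust:customer>
</ord:order>
```

The attribute value is a whitespace-separated list, so the pairs may be spread across lines for readability.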

16.1.1 Schemas Versus DTDs

DTDs provide the capability to do basic validation of the following items in XML documents:

Element nesting
Element occurrence constraints
Permitted attributes
Attribute types and default values

However, DTDs do not provide fine control over the format and data types of element and attribute values. Other than the various special attribute types (ID, IDREF, ENTITY, NMTOKEN, and so forth), once an element or attribute has been declared to contain character data, no limits may be placed on the length, type, or format of that content. For narrative documents (such as web pages, book chapters, newsletters, etc.), this level of control is probably good enough.

But as XML makes inroads into more data-intensive applications (such as web services using SOAP), more precise control over the text content of elements and attributes becomes important. The W3C XML Schema standard includes the following features:

Simple and complex data types
Type derivation and inheritance
Element occurrence constraints
Namespace-aware element and attribute declarations

The most important of these features is the addition of simple data types for parsed character data and attribute values. Unlike DTDs, schemas can enforce specific rules about the contents of elements and attributes. In addition to a wide range of built-in simple types (such as string, integer, decimal, and dateTime), the schema language provides a framework for declaring new data types, deriving new types from old types, and reusing types from other schemas.
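
To give a feel for what deriving a new type from an old one can look like, here is a minimal sketch that restricts the built-in xs:string type with a pattern. The zipCode type and zip element are hypothetical names invented for this illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a string limited to exactly five decimal digits -->
  <xs:simpleType name="zipCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- Hypothetical element that reuses the derived type by name -->
  <xs:element name="zip" type="zipCode"/>
</xs:schema>
```

A DTD could declare only that zip contains character data; it could not reject a value such as "abc".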

Besides simple data types, schemas add the ability to place more explicit restrictions on the number and sequence of child elements that can appear in a given location. This is even true when elements are mixed with character data, unlike the mixed content model (#PCDATA) supported by DTDs.

There are a few things that DTDs do that XML Schema can't do. Defining general entities for use in documents is one of these. XML Inclusions (XInclude) may be able to replace some uses of general entities, but DTDs remain extremely convenient for short entities.
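
For comparison, this is the kind of short general entity that a DTD can define and a schema cannot. The entity name sm is a hypothetical choice for this sketch.

```xml
<?xml version="1.0"?>
<!DOCTYPE fullName [
  <!-- General entity declared in the internal DTD subset -->
  <!ENTITY sm "Scott Means">
]>
<fullName>&sm;</fullName>
```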

16.1.2 Namespace Issues

As XML documents are exchanged between different people and organizations around the world, proper use of namespaces becomes critical to prevent misunderstandings. Depending on what type of document is being viewed, a simple element like <fullName>Zoe</fullName> could have widely different meanings. It could be a person's name, a pet's name, or the name of a ship that recently docked. By associating every element with a namespace URI, it is possible to distinguish between two elements with the same local name.

Because the Namespaces in XML recommendation was released after the XML 1.0 recommendation, DTDs do not provide explicit support for declaring namespace-aware XML applications. Unlike DTDs (where element and attribute declarations must include a namespace prefix), schemas validate against the combination of the namespace URI and local name rather than the prefixed name.
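
The following two fragments sketch what this means in practice. They use different prefixes (addr and a, both arbitrary) bound to the namespace URI that the examples later in this chapter adopt. These are two separate fragments, not one document.

```xml
<!-- Both fullName elements have the same namespace URI and local name,
     so a schema processor treats them identically -->
<addr:fullName xmlns:addr="http://namespaces.oreilly.com/xmlnut/address">Zoe</addr:fullName>

<a:fullName xmlns:a="http://namespaces.oreilly.com/xmlnut/address">Zoe</a:fullName>
```

A DTD, by contrast, would need a separate declaration for each prefixed name it wanted to allow.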

Namespaces are also used within instance documents to include directives to the schema processor. For example, the special attributes that are used to associate an element with a schema (schemaLocation and noNamespaceSchemaLocation) must be associated with the official XML Schema instance namespace URI (http://www.w3.org/2001/XMLSchema-instance) in order for the schema processor to recognize them as instructions to itself.

16.2 Schema Basics

This section will construct, step by step, a simple schema document representing a typical address book entry, introducing different features of the XML Schema language as needed. Example 16-1 shows a very simple well-formed XML document.

Example 16-1. addressdoc.xml

<?xml version="1.0"?>
<fullName>Scott Means</fullName>

Assuming that the fullName element can only contain a simple string value, the schema for this document would look like Example 16-2.

Example 16-2. address-schema.xsd

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="fullName" type="xs:string"/>
</xs:schema>

It is also common to associate the sample instance document explicitly with the schema document. Since the fullName element is not in any namespace, the xsi:noNamespaceSchemaLocation attribute is used as shown in Example 16-3.

Example 16-3. addressdoc.xml with schema reference

<?xml version="1.0"?>
<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="address-schema.xsd">Scott Means</fullName>

Validating the simple document against its schema requires a validating XML parser that supports schemas, such as the open source Xerces parser from the Apache XML Project (http://xml.apache.org/xerces-j/). Xerces is written in Java and includes a command-line program called dom.DOMWriter that can be used to validate addressdoc.xml like this:

% java dom.DOMWriter -V -S addressdoc.xml

Since the document is valid, DOMWriter will simply echo the input document to standard output. An invalid document will cause the parser to generate an error message. For instance, adding b elements to the contents of the fullName element violates the schema rules:

<?xml version="1.0"?>
<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="address-schema.xsd"
  >Scott <b>Means</b></fullName>

If this document were validated with DOMWriter, the following validity errors would be detected by Xerces:

[Error] addressdoc.xml:4:13: Element type "b" must be declared.
[Error] addressdoc.xml:4:31: Datatype error: In element 'fullName' : Can not
  have element children within a simple type content.

16.2.1 Document Organization

Now that there is a basic schema and a valid document from which to work, it is time to examine the structure of a schema document and its contents. Every schema document consists of a single root xs:schema element. This element contains declarations for all elements and attributes that may appear in a valid instance document.

The XML elements that make up an XML schema must belong to the XML Schema namespace (http://www.w3.org/2001/XMLSchema), which is frequently associated with the xs: prefix. For the remainder of this chapter, all schema elements will be written using the xs: prefix to indicate that they belong to the Schema namespace.

Instance elements declared by top-level elements in the schema (immediate child elements of the xs:schema element) are considered global elements. For example, the simple schema in Example 16-2 globally declares one element: fullName. According to the rules of schema construction, any element that is declared globally may appear as the root element of an instance document.

In this case, since only one element has been declared, that shouldn't be a problem. But when building more complex schemas, this side effect must be taken into consideration. If more than one element is declared globally, a schema-valid document may not contain the root element you expect.

Naming conflicts are another potential problem with multiple global declarations. When writing schema declarations, it is an error to declare two things of the same type at the same scope. For instance, trying to declare two global elements called fullName would generate an error. But declaring an element and an attribute with the same name would not create a conflict, because the two names are not used in the same way.
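
A minimal sketch of this rule follows. The language element and language attribute are hypothetical declarations used only to show that the two kinds of names do not collide.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A global element and a global attribute may share a name -->
  <xs:element name="language" type="xs:string"/>
  <xs:attribute name="language" type="xs:language"/>
  <!-- Adding a second global element named "language" here would be an error -->
</xs:schema>
```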

16.2.2 Annotations

Now that there is a working schema, it's good practice to include some documentary material about who authored it, what it was for, any copyright restrictions, etc. Since an XML schema document is an XML document in its own right, one simple option would be to use XML comments to include documentary information.

The major drawback to using XML comments is that parsers are not obliged to keep comments intact when parsing XML documents, and applications have to do a lot of work to negotiate their internal structures. This increases the likelihood that, at some point, important documentation will be lost during an otherwise harmless transformation or editing procedure. In contrast, encoding documentation as markup inline with the element and type declarations it refers to makes automatic documentation generation practical.

To accommodate this extra information, most schema elements may contain an optional xs:annotation element as their first child element. The annotation element may then, in turn, contain any combination of xs:documentation and xs:appinfo elements, which are provided to contain extra human-readable and machine-readable information, respectively.
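
The fragment below sketches an annotation that carries both kinds of information at once. It assumes the xs prefix is bound as in Example 16-2; the ui-hint element and its width attribute are hypothetical, standing in for whatever an application might define.

```xml
<xs:element name="fullName" type="xs:string">
  <xs:annotation>
    <!-- Human-readable note -->
    <xs:documentation>The person's full name.</xs:documentation>
    <!-- Machine-readable hint; the ui-hint element is hypothetical -->
    <xs:appinfo>
      <ui-hint width="40"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```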

16.2.2.1 The xs:documentation element

As a concrete example, let's add some authorship and copyright information to the simple schema document, as shown in Example 16-4.

Example 16-4. address-schema.xsd with annotation

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:documentation xml:lang="en-us">
      Simple schema example from O'Reilly's
      <a href="http://www.oreilly.com/catalog/xmlnut">XML in a Nutshell.</a>
      Copyright 2002 O'Reilly &amp; Associates
    </xs:documentation>
  </xs:annotation>
  <xs:element name="fullName" type="xs:string"/>
</xs:schema>

The xs:documentation element permits an xml:lang attribute to identify the language of the brief message. This attribute can also be applied to the xs:schema element to set the default language for the entire document. For more information about using the xml:lang attribute, see Chapter 5 and Chapter 20.

Also, notice that the documentation element contains additional markup: an a element (à la HTML). The xs:documentation element is allowed to contain any well-formed XML, not just schema elements. Section 16.8, later in this chapter, explains how this can be done in your own documents.

16.2.2.2 The xs:appinfo element

In reality, there is little difference between the xs:documentation element and the xs:appinfo element. Either one can contain any combination of character data or markup the schema author wants to include. But the developers of the schema specification intended the xs:documentation element to contain human-readable content, while the xs:appinfo element would contain application-specific extension information related to a particular schema element.

For example, let's say that it is necessary to encode context-sensitive help text with each of the elements declared in a schema. This text might be used to generate tool-tips in a GUI or system prompts in a voicemail system. Either way, it would be very convenient to associate this information directly with the particular element in question using the xs:appinfo element, like this:

. . .
<xs:element name="fullName" type="xs:string">
  <xs:annotation>
    <xs:appinfo>
      <help-text>Enter the person's full name.</help-text>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
. . .

Although schemas allow very sophisticated and powerful rules to be expressed, they cannot possibly encompass every conceivable need that a schema developer might face. That is why it is important to remember that there is a facility that can be used to include your own application-specific information directly within the actual schema declarations.

Schematron is especially well-suited to use in annotations and is capable of checking a wide variety of conditions well beyond the bounds of XML Schema. For more information about Schematron, see http://www.ascc.net/xml/resource/schematron/schematron.html.

16.2.3 Element Declarations

XML documents are composed primarily of nested elements, and the xs:element element is one of the most often-used declarations in a typical schema. The simple example schema already includes a single global element declaration that tells the schema processor that instance documents must consist of a single element called fullName:

<xs:element name="fullName" type="xs:string"/>

This declaration uses two attributes to describe the element that can appear in the instance document: name and type. The name attribute is self-explanatory, but the type attribute requires some additional explanation.

16.2.4 Simple Types

Schemas support two different types of content: simple and complex. Simple content equates with basic data types that are found in most modern programming languages (strings, integers, dates, times, etc.). Simple types cannot, by definition, contain nested element content.

In the previous example, the type="xs:string" attribute/value pair tells the schema processor that this element can only contain simple content of the built-in type xs:string. Table 16-1 lists a representative sample of the built-in simple types that are defined by the schema specification. See Chapter 21 for a complete listing.

Table 16-1. Built-in simple schema types
Type	Description
anyURI	A Uniform Resource Identifier
base64Binary	Base64 content-encoded binary data
boolean	May contain either true or false, 0 or 1
byte	A signed byte quantity >= -128 and <= 127
dateTime	An absolute date and time value combination
duration	A relative amount of time, expressed in units of years, months, days, hours, etc.
ID, IDREF, IDREFS, ENTITY, ENTITIES, NOTATION, NMTOKEN, NMTOKENS	Same values as defined in the attribute declaration section of the XML 1.0 recommendation
integer	Any positive or negative whole number, or zero
language	May contain the same values as the xml:lang attribute from the XML 1.0 recommendation
Name	An XML name
normalizedString	String with newline, tab, and carriage-return characters normalized to spaces
string	Unicode string
token	Same as normalizedString with multiple spaces collapsed and leading and trailing spaces removed

Since attribute values cannot contain elements, attributes must always be declared with simple types. Also, an element that is declared to have a simple type cannot have any attributes. This means that if an attribute must be added to the fullName element, some fairly significant changes to the element declaration are required.
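
As a brief illustration of a few of the built-in types from Table 16-1, the sketch below declares four elements with simple types. The element names are hypothetical, and the comments give one sample value that each type would accept.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element names; the types come from Table 16-1 -->
  <xs:element name="homepage" type="xs:anyURI"/>    <!-- e.g. http://www.oreilly.com/ -->
  <xs:element name="subscribed" type="xs:boolean"/> <!-- e.g. true -->
  <xs:element name="updated" type="xs:dateTime"/>   <!-- e.g. 2001-05-02T12:00:00 -->
  <xs:element name="nickname" type="xs:token"/>     <!-- e.g. Scott -->
</xs:schema>
```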

16.2.5 Attribute Declarations

To make the fullName element more informative, it would be nice to add a language attribute to provide a hint as to how it should be pronounced. Although adding an attribute to an element sounds like a fairly simple task, it is complicated by the fact that elements with simple types (like xs:string) cannot have attributes.

Attributes are declared using the xs:attribute element. Attributes may be declared globally by top-level xs:attribute elements (which may be referenced from anywhere within the schema) or locally as part of a complex type definition that is associated with a particular element.

To incorporate a language attribute into the fullName element declaration, a new complex type based on the built-in xs:string type must be created. To do this, three new schema elements must be used: xs:complexType, xs:simpleContent, and xs:extension:

<xs:element name="fullName">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="language" type="xs:language"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

This declaration no longer has a type attribute. Instead it has an xs:complexType child element. This element tells the schema processor that the fullName element may have attributes, but the xs:simpleContent element tells the processor that the content of the element is a simple type. To specify what type of simple content, it uses the base attribute of the xs:extension element to derive a new type from the built-in xs:string type. The xs:attribute element within the xs:extension element indicates that this derived type may have an attribute called language that contains values conforming to the built-in simple type xs:language (mentioned in Table 16-1). Type derivation is an important part of schema creation and will be covered in more detail later in this chapter.
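
The declaration above is local to the fullName element. As a sketch of the global alternative mentioned at the start of this section, the same attribute could be declared once at the top level and pulled in with the ref attribute. The nickName element here is hypothetical, added only to show a second place where the attribute might be reused.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Global attribute declaration: a top-level child of xs:schema -->
  <xs:attribute name="language" type="xs:language"/>
  <!-- Hypothetical nickName element that reuses the global attribute by reference -->
  <xs:element name="nickName">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute ref="language"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```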

16.2.5.1 Attribute groups

In DTDs, parameter entities are used to encapsulate repeated groups of attribute declarations that are shared between different element types. Schemas provide the same functionality in a more formal fashion using the xs:attributeGroup element.

An attribute group is simply a named group of xs:attribute declarations (or references to other attribute groups) that can be referenced from within a complex type definition. The attribute group must be declared globally (as a top-level xs:attributeGroup element) with a unique name attribute. The group is referenced within a complex type definition by including another xs:attributeGroup element with a ref attribute that matches the desired top-level attribute group name.

Within the fullName schema, an attribute group could be used to create a package of attributes related to a person's nationality. This package of attributes could be used on several elements, including the fullName element, without repeating the same attribute declarations. Then, if it were later necessary to extend this collection of attributes, it could be done in a single location:

<xs:element name="fullName">
. . .
      <xs:extension base="xs:string">
        <xs:attributeGroup ref="nationality"/>
      </xs:extension>
. . .
</xs:element>
<xs:attributeGroup name="nationality">
  <xs:attribute name="language" type="xs:language"/>
</xs:attributeGroup>

16.3 Working with Namespaces

So far, namespaces have only been dealt with as they relate to the schema processor and schema language itself. But the schema specification was written with the intention that schemas could support and describe XML namespaces. In an ideal world, any XML parser with access to the Internet would be able to validate any XML document, given only that document's namespace. In fact, the Resource Directory Description Language (RDDL) standard is an attempt to build the framework that will enable this functionality and is described in detail in Chapter 14.

16.3.1 Target Namespaces

Associating a schema with a particular XML namespace is extremely simple: add a targetNamespace attribute to the root xs:schema element, like so:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://namespaces.oreilly.com/xmlnut/address">

It is important to remember that many XML 1.0 documents are not associated with namespaces at all. To validate these documents, it is necessary to use a schema that doesn't have a targetNamespace attribute. When developing schemas that are not associated with a target namespace, you should explicitly qualify schema elements (like xs:element) to keep them from being confused with global declarations for your application.

However, adding a target namespace to the schema affects numerous other parts of the example application. Trying to validate the addressdoc.xml document as it stands (with the xsi:noNamespaceSchemaLocation attribute) causes the Xerces schema processor to report this validity error:

General Schema Error: Schema in address-schema.xsd has a different target
  namespace from the one specified in the instance document :.

To rectify this, it is necessary to change the instance document to reference the new, namespace-enabled schema properly. This is done using the xsi:schemaLocation attribute, like so:

<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
     address-schema.xsd"
  language="en">Scott Means</fullName>

Notice that the schemaLocation attribute value contains two tokens. The first is the target namespace URI that matches the target namespace of the schema document. The second is the physical location of the actual schema document.

Unfortunately, there are still problems. If this document is validated, the validator will report errors like these two:

Element type "fullName" must be declared.
Attribute "language" must be declared for element type "fullName".

This is because, even though a schema location has been declared, the element still doesn't actually belong to a namespace. Either a default namespace must be declared or a namespace prefix that matches the target namespace of the schema must be used. The following document uses a default namespace:

<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
     address-schema.xsd"
  xmlns="http://namespaces.oreilly.com/xmlnut/address"
  language="en">Scott Means</fullName>

But before this document can be successfully validated, it is necessary to fix one other problem that was introduced when a target namespace was added to the schema. Within the element declaration for the fullName element, there is a reference to the nationality attribute group. By associating the schema with a target namespace, every global declaration has been implicitly associated with that namespace. This means that the ref attribute of the attribute group element in the element declaration must be updated to point to an attribute group that belongs to the new target namespace.

The clearest way to do this is to declare a new namespace prefix in the schema that maps to the target namespace and use it to prefix any references to global declarations:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://namespaces.oreilly.com/xmlnut/address"
  xmlns:addr="http://namespaces.oreilly.com/xmlnut/address">
. . .
        <xs:attributeGroup ref="addr:nationality"/>
. . .

Now, having made these three simple changes, the document will once again validate cleanly against the schema.
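
Pulling together the fragments shown so far, the complete schema at this point might look like the following sketch. It is assembled from the declarations above rather than reproduced from a separate listing, and it leaves out the annotation added in Example 16-4.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://namespaces.oreilly.com/xmlnut/address"
  xmlns:addr="http://namespaces.oreilly.com/xmlnut/address">
  <xs:element name="fullName">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <!-- Reference is prefixed so that it resolves in the target namespace -->
          <xs:attributeGroup ref="addr:nationality"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:attributeGroup name="nationality">
    <!-- Local attribute: unqualified in instance documents by default -->
    <xs:attribute name="language" type="xs:language"/>
  </xs:attributeGroup>
</xs:schema>
```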

The obvious lesson from this is that namespaces should be incorporated into your schema design as early as possible. If not, there will likely be a large amount of cleanup involved as various assumptions that used to be true are no longer valid.

16.3.2 Controlling Qualification

One of the major headaches with DTDs is that they have no explicit support for namespace prefixes since they predate the Namespaces in XML recommendation. Although Namespaces in XML went to great pains to explain that prefixes were only placeholders and only the namespace URIs really matter, it was painful and awkward to design a DTD that could support arbitrary prefixes. Schemas correct this by validating against namespace URIs and local names rather than prefixed names.

The elementFormDefault and attributeFormDefault attributes of the xs:schema element control whether locally declared elements and attributes must be namespace-qualified within instance documents. Suppose the attributeFormDefault attribute is set to qualified in the schema like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://namespaces.oreilly.com/xmlnut/address"
  xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
  attributeFormDefault="qualified">

Now, if addressdoc.xml is validated against the schema, the validator reports the following error:

Attribute "language" must be declared for element type "fullName".

Since the default attribute form has been set to qualified, the schema processor doesn't recognize the unqualified language attribute as belonging to the same schema as the fullName element. This is because attributes, unlike elements, don't inherit the default namespace from the xmlns="..." attribute. They must always be explicitly prefixed if they need to belong to a particular namespace.

The easiest way to fix the instance document is to declare an explicit namespace prefix and use it to qualify the element and attribute, as shown in Example 16-5.

Example 16-5. addressdoc.xml with explicit namespace prefix

<?xml version="1.0"?>
<addr:fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
     address-schema.xsd"
  xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
  addr:language="en">Scott Means</addr:fullName>

The elementFormDefault attribute serves the same function with regard to namespace qualification of nested elements. If it is set to qualified, nested elements must belong to the target namespace of the schema (either through a default namespace declaration or an explicit prefix).
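
The two fragments below sketch the difference in an instance document. They assume a hypothetical schema in which a globally declared address element contains a locally declared fullName element; only the elementFormDefault setting differs between the two cases.

```xml
<!-- elementFormDefault="unqualified" (the default): only the globally
     declared root element is in the target namespace -->
<addr:address xmlns:addr="http://namespaces.oreilly.com/xmlnut/address">
  <fullName>Scott Means</fullName>
</addr:address>

<!-- elementFormDefault="qualified": the locally declared child must be
     in the target namespace too -->
<addr:address xmlns:addr="http://namespaces.oreilly.com/xmlnut/address">
  <addr:fullName>Scott Means</addr:fullName>
</addr:address>
```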

16.4 Complex Types

A schema assigns a type to each element and attribute it declares. In Example 16-5, the fullName element has a complex type. Elements with complex types may contain nested elements and have attributes. Only elements can have complex types. Attributes always have simple types.

Since the type is declared using an xs:complexType element embedded directly in the element declaration, it is also an anonymous type rather than a named type.

New types are defined using xs:complexType or xs:simpleType elements. If a new type is declared globally with a top-level element, it needs to be given a name so that it can be referenced from element and attribute declarations within the schema. If a type is declared inline (inside an element or attribute declaration), it does not need to be named. But since it has no name, it cannot be referenced by other element or attribute declarations. When building large and complex schemas, data types will need to be shared among multiple different elements. To facilitate this reuse, it is necessary to create named types.

To show how named types and complex content interact, let's expand the example schema. A new address element will contain the fullName element, and the person's name will be divided into a first- and last-name component. A typical instance document would look like Example 16-6.

Example 16-6. addressdoc.xml after adding address, first, and last elements

<?xml version="1.0"?>
<addr:address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
       address-schema.xsd"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    addr:language="en">
  <addr:fullName>
    <addr:first>Scott</addr:first>
    <addr:last>Means</addr:last>
  </addr:fullName>
</addr:address>

To accommodate this new format, fairly substantial structural changes to the schema are required, as shown in Example 16-7.

Example 16-7. address-schema.xsd to support address element

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://namespaces.oreilly.com/xmlnut/address"
  xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
  attributeFormDefault="qualified" elementFormDefault="qualified">
<xs:element name="address">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="fullName">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="first" type="addr:nameComponent"/>
            <xs:element name="last" type="addr:nameComponent"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attributeGroup ref="addr:nationality"/>
  </xs:complexType>
</xs:element>
<xs:complexType name="nameComponent">
  <xs:simpleContent>
    <xs:extension base="xs:string"/>
  </xs:simpleContent>
</xs:complexType>
</xs:schema>

The first major difference between this schema and the previous version is that the root element name has been changed from fullName to address. The same result could have been accomplished by creating a new top-level element declaration for the new address element, but that would have opened a loophole allowing a valid instance document to contain only a fullName element and nothing else.
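
To make the loophole concrete, here is a sketch of that alternative design, in which fullName stays a global declaration and address refers to it by name. This fragment is an illustration of the approach the example schema avoids; it assumes the same xs and addr prefix bindings as Example 16-7.

```xml
<!-- Alternative sketch: fullName declared globally and referenced by name -->
<xs:element name="address">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="addr:fullName"/>
    </xs:sequence>
    <xs:attributeGroup ref="addr:nationality"/>
  </xs:complexType>
</xs:element>
<!-- Because this declaration is global, a document whose root element is
     addr:fullName would also be schema-valid -->
<xs:element name="fullName">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="first" type="addr:nameComponent"/>
      <xs:element name="last" type="addr:nameComponent"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```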

Within the address element declaration, a new anonymous complex type is declared. Unlike the old declaration, this complex type is declared to contain complex content using the xs:sequence element. The sequence element tells the schema processor that the contained list of elements must appear in the target document in the exact order they are given. In this case, the sequence contains only one element declaration.

The nested element declaration is for the fullName element, which then repeats the xs:complexType and xs:sequence declaration process. Within this nested sequence, two element declarations appear for the first and last elements.

These two element declarations, unlike all prior element declarations, explicitly reference a new complex type that's declared in the schema, the addr:nameComponent type. It is fully qualified to differentiate it from possible conflicts with built-in schema data types.

The nameComponent type is declared by the xs:complexType element immediately following the address element declaration. It is identified as a named type by the presence of the name attribute, but in every other way it is constructed the same way it would have been as an anonymous type.

16.4.1 Occurrence Constraints

One feature of schemas that should be welcome to DTD developers is the ability to explicitly set the minimum and maximum number of times an element may occur at a particular point in a document using minOccurs and maxOccurs attributes of the xs:element element. For example, this declaration adds an optional middle name to the fullName element:

<xs:element name="fullName">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="first" type="addr:nameComponent"/>
      <xs:element name="middle" type="addr:nameComponent"
          minOccurs="0"/>
      <xs:element name="last" type="addr:nameComponent"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Notice that the element declaration for the middle element has a minOccurs value of 0. The default value for both minOccurs and maxOccurs is 1, if they are not provided explicitly. Therefore, setting minOccurs to 0 means that the middle element may appear 0 to 1 times. This is equivalent to using the ? operator in a DTD declaration. Another possible value for the maxOccurs attribute is unbounded, which indicates that the element in question may appear an unlimited number of times. This value is used to produce the same effect as the * and + operators in a DTD declaration.
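As a rough sketch of how the remaining DTD operators map onto occurrence constraints, the fragment below uses two hypothetical child elements, nickname and email, that are not part of the address schema developed in this chapter. It assumes the xs prefix is bound to the XML Schema namespace, as in the other listings:

```xml
<xs:sequence>
  <!-- DTD equivalent: nickname*  (zero or more occurrences) -->
  <xs:element name="nickname" type="xs:string"
      minOccurs="0" maxOccurs="unbounded"/>
  <!-- DTD equivalent: email+  (one or more; minOccurs defaults to 1) -->
  <xs:element name="email" type="xs:string"
      maxOccurs="unbounded"/>
</xs:sequence>
```

Unlike the DTD operators, the two attributes can also express bounded ranges, such as minOccurs="2" together with maxOccurs="5".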

16.4.2 Types of Element Content

So far you have seen elements that contain only character data and elements that contain only other elements. The next several sections cover each of the possible types of element content individually, from most restrictive to least restrictive:

Empty
Simple content
Mixed content
Any type

16.5 Empty Elements

In many cases, it is useful to declare an element that cannot contain anything. Most of these elements convey all of their information via attributes or simply by their position in relation to other elements (e.g., the br element from XHTML).

Let's add a contact-information element to the address element that will be used to contain a list of ways to contact a person. Example 16-8 shows the sample instance document after adding the new contacts element and a sample phone entry.

Example 16-8. addressdoc.xml with contact element

<?xml version="1.0"?> <addr:address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"     xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address        address-schema.xsd"     xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"     addr:language="en">   <addr:fullName>     <addr:first>William</addr:first>     <addr:middle>Scott</addr:middle>     <addr:last>Means</addr:last>   </addr:fullName>   <addr:contacts>     <addr:phone addr:number="888.737.1752"/>   </addr:contacts> </addr:address>

Supporting this new content requires further modifications to the schema document. Although it would be possible to declare the new element inline within the existing address-element declaration, for clarity it makes sense to create a new global type and reference it by name:

<xs:element name="address">   <xs:complexType>     <xs:sequence>       <xs:element name="fullName"> . . .       </xs:element>       <xs:element name="contacts" type="addr:contactsType" minOccurs="0"/>     </xs:sequence>   <xs:attributeGroup ref="addr:nationality"/>   </xs:complexType> </xs:element>

The declaration for the new contactsType complex type looks like this:

<xs:complexType name="contactsType">   <xs:sequence>     <xs:element name="phone" minOccurs="0">       <xs:complexType>         <xs:attribute name="number" type="xs:string"/>       </xs:complexType>     </xs:element>   </xs:sequence> </xs:complexType>

The syntax used to declare an empty element is actually very simple. Notice that the xs:element declaration for the previous phone element contains a complex type definition that only includes a single attribute declaration. This tells the schema processor that the phone element may only contain complex content (elements), and since no additional nested element declarations are provided, it must remain empty.

16.5.1 complexContent

The preceding example actually took a shortcut with the schema language. One of the early fullName element declarations used the xs:simpleContent element to indicate that the element could only contain simple content (no nested elements). There is a corresponding content-declaration element that specifies that a complex type can only contain complex content (elements). This is the xs:complexContent element.

When the phone element was declared using an xs:complexType element with no nested element declarations, the schema processor automatically inferred that it should contain only complex content. The phone element declaration could be rewritten like so, using the xs:complexContent element:

<xs:element name="phone" minOccurs="0">   <xs:complexType>     <xs:complexContent>       <xs:restriction base="xs:anyType">         <xs:attribute name="number" type="xs:string"/>       </xs:restriction>     </xs:complexContent>   </xs:complexType> </xs:element>

The most common reason to use the xs:complexContent element is to derive a complex type from an existing type. This example derives a new type by restriction from the built-in xs:anyType type. xs:anyType is the root of all of the built-in schema types and represents an unrestricted sequence of characters and markup. Since xs:complexContent indicates that the element can only contain element content, and the restriction declares no child elements, the effect is to prevent the element from containing either character data or markup.

16.6 Simple Content

Earlier, the xs:simpleContent element was used to declare an element that could only contain simple content:

<xs:element name="fullName">   <xs:complexType>     <xs:simpleContent>       <xs:extension base="xs:string">         <xs:attribute name="language" type="xs:language"/>       </xs:extension>     </xs:simpleContent>   </xs:complexType>  </xs:element>

The base type for the extension in this case was the built-in xs:string data type. But simple types are not limited to the predefined types. The xs:simpleType element can define new simple data types, which can be referenced by element and attribute declarations within the schema.

16.6.1 Defining New Simple Types

To show how new simple types can be defined, let's extend the phone element from the example application to support a new attribute called location. This attribute will be used to differentiate between work and home phone numbers. This attribute will have a new simple type called locationType, which will be referenced from the contactsType definition:

<xs:complexType name="contactsType">   <xs:sequence>     <xs:element name="phone" minOccurs="0">       <xs:complexType>         <xs:attribute name="number" type="xs:string"/>          <xs:attribute name="location" type="addr:locationType"/>       </xs:complexType>     </xs:element>   </xs:sequence> </xs:complexType> <xs:simpleType name="locationType">   <xs:restriction base="xs:string"/> </xs:simpleType>

Of course, a location type that just maps to the built-in xs:string type isn't particularly useful. Fortunately, schemas can strictly control the possible values of simple types through a mechanism called facets.

16.6.2 Facets

In schema-speak, a facet is an aspect of a possible value for a simple data type. Depending on the base type, some facets make more sense than others. For example, a numeric data type can be restricted by the minimum and maximum possible values it could contain. But these types of restrictions wouldn't make sense for a boolean value. The following list covers the different facet types that are supported by a schema processor:

length (or minLength and maxLength)
pattern
enumeration
whiteSpace
maxInclusive and maxExclusive
minInclusive and minExclusive
totalDigits
fractionDigits

Facets are applied to simple types using the xs:restriction element. Each facet is expressed as a distinct element within the restriction block, and multiple facets can be combined to further restrict potential values of the simple type.

16.6.2.1 Handling whitespace

The whiteSpace facet controls how the schema processor will deal with any whitespace within the target data. Whitespace normalization takes place before any of the other facets are processed. There are three possible values for the whiteSpace facet:

preserve: Keep all whitespace exactly as it was in the source document (basic XML 1.0 whitespace handling for content within elements).
replace: Replace occurrences of #x9 (tab), #xA (line feed), and #xD (carriage return) characters with #x20 (space) characters.
collapse: Perform the replace step first, then collapse runs of multiple space characters into a single space and remove any leading and trailing spaces.
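As a minimal sketch, a type that requests collapse processing could be declared as follows. The type name collapsedString is hypothetical and does not appear in the chapter's schema:

```xml
<xs:simpleType name="collapsedString">
  <xs:restriction base="xs:string">
    <!-- "  1400   Main St.  " is normalized to "1400 Main St."
         before any other facet is checked -->
    <xs:whiteSpace value="collapse"/>
  </xs:restriction>
</xs:simpleType>
```

Because normalization happens first, a length or pattern facet added to this restriction would be tested against the collapsed value, not the raw text.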

16.6.2.2 Restricting length

The length-restriction facets are fairly easy to understand. The length facet forces a value to be exactly the length given. The minLength and maxLength facets can be used to set a definite range for the lengths of values of the type given. For example, take the nameComponent type from the schema. What if a name component could not exceed 50 characters (because of a database limitation, for instance)? This rule can be enforced by using the maxLength facet. Incorporating this facet requires a new simple type to reference from within the nameComponent complex type definition:

<xs:complexType name="nameComponent">   <xs:simpleContent>     <xs:extension base="addr:nameString"/>   </xs:simpleContent>  </xs:complexType>    <xs:simpleType name="nameString">   <xs:restriction base="xs:string">     <xs:maxLength value="50"/>   </xs:restriction>  </xs:simpleType>

The new nameString simple type is derived from the built-in xs:string type, but can contain no more than 50 characters (the default is unlimited). The same approach can be used with the length and minLength facets.
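For illustration, the other two length facets might be used like this. Both type names (boundedNameString and stateCode) are assumptions made for this sketch and are not part of the chapter's schema:

```xml
<!-- Between 1 and 50 characters: minLength and maxLength combined -->
<xs:simpleType name="boundedNameString">
  <xs:restriction base="xs:string">
    <xs:minLength value="1"/>
    <xs:maxLength value="50"/>
  </xs:restriction>
</xs:simpleType>

<!-- Exactly two characters, e.g. a state abbreviation such as "SC" -->
<xs:simpleType name="stateCode">
  <xs:restriction base="xs:string">
    <xs:length value="2"/>
  </xs:restriction>
</xs:simpleType>
```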

16.6.2.3 Enumerations

One of the more useful types of restriction is the simple enumeration. In many cases, it is sufficient to restrict possible values for an element or attribute to a member of a predefined list. For example, values of the new locationType simple type defined earlier could be restricted to a list of valid options like so:

<xs:simpleType name="locationType">   <xs:restriction base="xs:string">     <xs:enumeration value="work"/>     <xs:enumeration value="home"/>     <xs:enumeration value="mobile"/>   </xs:restriction>       </xs:simpleType>

Then, if the location attribute in any instance document contained a value not found in the list of enumeration values, the schema processor would generate a validity error.

16.6.2.4 Numeric Facets

Almost half of the built-in data types defined by the schema specification represent numeric data of one type or another. More might be called numeric since the date/time and duration types are considered to be scalar quantities as well. The following two sections cover all of the numeric facets available, but for a comprehensive list of which of these facets are applicable to which data types, see Chapter 21.

16.6.2.4.1 Minimum and maximum values

Four facets control the minimum and maximum values of items:

minInclusive
minExclusive
maxInclusive
maxExclusive

The primary difference between the inclusive and exclusive flavors of the min and max facets is whether the value given is considered part of the set of allowable values. For example, when applied to an integer type, the following two facet declarations are equivalent:

<xs:maxInclusive value="0"/> <xs:maxExclusive value="1"/>

The difference between inclusive and exclusive becomes more significant when dealing with decimal or floating point values. For example, if maxExclusive were set to 5.0, the equivalent maxInclusive value would require an infinite number of nines to the right of the decimal point (4.99999...). These facets can also be applied to date and time values.
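A short sketch of both flavors follows. The type names latitudeType and positiveAmount are hypothetical, chosen only for this illustration:

```xml
<!-- Inclusive bounds: -90 and 90 are themselves valid values -->
<xs:simpleType name="latitudeType">
  <xs:restriction base="xs:decimal">
    <xs:minInclusive value="-90"/>
    <xs:maxInclusive value="90"/>
  </xs:restriction>
</xs:simpleType>

<!-- Exclusive bound: 0 is rejected, but 0.01 is accepted -->
<xs:simpleType name="positiveAmount">
  <xs:restriction base="xs:decimal">
    <xs:minExclusive value="0"/>
  </xs:restriction>
</xs:simpleType>
```

The second type shows a case where the exclusive form is the natural choice, since no inclusive lower bound can express "strictly greater than zero" for a decimal.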

16.6.2.4.2 Length and precision

There are two facets that control the length and precision of decimal numeric values: totalDigits and fractionDigits. The totalDigits facet sets the maximum total number of digits (only digits are counted, not signs or decimal points) that are allowed in a complete number. fractionDigits sets the maximum number of those digits that may appear to the right of the decimal point in the number.
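As an illustrative sketch, a hypothetical priceType (not part of the chapter's schema) could combine the two facets. Both facets act as upper limits, so shorter values remain valid:

```xml
<xs:simpleType name="priceType">
  <xs:restriction base="xs:decimal">
    <!-- At most six digits overall -->
    <xs:totalDigits value="6"/>
    <!-- At most two of them after the decimal point -->
    <xs:fractionDigits value="2"/>
  </xs:restriction>
</xs:simpleType>
```

With this definition, 1234.56 and 12.5 are accepted, while 12.345 (three fraction digits) and 12345.67 (seven digits in total) are rejected.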

16.6.2.5 Enforcing format

The xs:pattern facet can place very sophisticated restrictions on the format of string values. The pattern facet compares the value in question against a regular expression, and if the value doesn't conform to the expression, it generates a validation error. For example, this xs:simpleType element declares a social security number simple type using the pattern facet:

<xs:simpleType name="ssn">   <xs:restriction base="xs:string">     <xs:pattern value="\d\d\d-\d\d-\d\d\d\d"/>   </xs:restriction>  </xs:simpleType>

This new simple type enforces the rule that a social security number consists of three digits, a dash, two digits, another dash, and finally four more digits. The actual regular-expression language is very similar to that of the Perl programming language, but it also supports a wide range of Unicode characters. See Chapter 21 for more information on the full pattern-matching language.

16.6.2.6 Lists

XML 1.0 provided a few very simple list types that could be declared as possible attribute values: IDREFS, ENTITIES, and NMTOKENS. Schemas have generalized the concept of lists and provide the ability to declare lists of arbitrary types.

These list types are themselves simple types and may be used in the same places other simple types are used. For example, if the fullName element were to be expanded to accommodate multiple middle names, one approach would be to declare the middle element to contain a list of nameString values:

 <xs:element name="middle" type="addr:nameList" minOccurs="0"/> . . . <xs:complexType name="nameList">   <xs:simpleContent>     <xs:extension base="addr:nameListType"/>   </xs:simpleContent>  </xs:complexType>    <xs:simpleType name="nameListType">   <xs:list itemType="addr:nameString"/>  </xs:simpleType>

After this change has been made, the middle element of an instance document can contain an unlimited list of whitespace-separated names, each of which can contain up to 50 characters. Because list items are delimited by whitespace, an individual name cannot itself contain a space. The use of xs:complexType here will greatly simplify adding attributes later.
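An instance fragment using the list type might look like the following sketch. The second middle name is invented for illustration, and the addr prefix is assumed to be bound as in Example 16-8:

```xml
<addr:fullName>
  <addr:first>William</addr:first>
  <!-- One element, two list items separated by whitespace -->
  <addr:middle>Scott Robert</addr:middle>
  <addr:last>Means</addr:last>
</addr:fullName>
```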

16.6.2.7 Unions

In some cases, it is useful to allow potential values for elements and attributes to have any of several types. The xs:union element allows a type to be declared that can draw from multiple type spaces. For example, it might be useful to allow users to enter their own one-word descriptions into the location attribute of the phone element, as well as to choose from a list. The location attribute declaration could be modified to include a union that incorporated the locationType type and the xs:NMTOKEN types:

<xs:attribute name="location">   <xs:simpleType>     <xs:union memberTypes="addr:locationType xs:NMTOKEN"/>   </xs:simpleType> </xs:attribute>

Now the location attribute can contain either addr:locationType or xs:NMTOKEN content.

16.7 Mixed Content

XML 1.0 provided the ability to declare an element that could contain parsed character data (#PCDATA) and unlimited occurrences of elements drawn from a provided list. Schemas provide the same functionality plus the ability to control the number and sequence in which elements appear within character data.

16.7.1 Allowing Mixed Content

The mixed attribute of the complexType element controls whether character data may appear within the body of the element with which it is associated. To illustrate this concept, Example 16-9 gives us a new schema that will be used to validate form-letter documents.

Example 16-9. formletter.xsd

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">   <xs:element name="letter">     <xs:complexType mixed="true"/>   </xs:element> </xs:schema>

This schema seems to declare a single element called letter that may contain character data and nothing else. But attempting to validate the document shown in Example 16-10 produces an error.

Example 16-10. formletterdoc.xml

<letter xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"   xsi:noNamespaceSchemaLocation="formletter.xsd">Hello!</letter>

The following error is generated:

The content of element type "letter" must match "EMPTY".

This is because there's no complex content for the letter element. Setting mixed to true is not the same as declaring an element that may contain a string. The character data may only appear in relation to other complex content, which leads to the subject of relative element positioning.

16.7.2 Controlling Element Placement

You have already seen the xs:sequence element, which dictates that the elements it contains must appear in exactly the same order in which they appear within the sequence element. In addition to xs:sequence, schemas also provide the xs:choice and xs:all elements to control the order in which elements may appear. These elements may be nested to create sophisticated element structures.

Expanding the form-letter example, a sequence adds support for various letter components to the formletter.xsd schema:

<xs:element name="letter">   <xs:complexType mixed="true">     <xs:sequence>       <xs:element name="greeting"/>       <xs:element name="body"/>       <xs:element name="closing"/>     </xs:sequence>   </xs:complexType> </xs:element>

Now, thanks to the xs:sequence element, a letter must include a greeting element, a body element, and a closing element, in that order. But in some cases, what is desired is that one and only one element appear from a collection of possibilities. The xs:choice element supports this. For example, if the greeting element needed to be restricted to contain only one salutation out of a permissible list, it could be declared to do so using xs:choice:

<xs:element name="greeting">   <xs:complexType mixed="true">     <xs:choice>       <xs:element name="hello"/>       <xs:element name="hi"/>       <xs:element name="dear"/>     </xs:choice>   </xs:complexType> </xs:element>

Now one of the permitted salutations must appear in the greeting element for the letter to be considered valid.

The remaining element-order enforcement construct is the xs:all element. Unlike the xs:sequence and xs:choice elements, the xs:all element must appear at the top of the content model and can only contain elements that are optional or appear only once. The xs:all construct tells the schema processor that each of the contained elements must appear once in the target document, but can appear in any order. This could be applied in the form-letter example. If the form letter had certain elements that had to appear in the body element, but not in any particular order, xs:all could be used to control their appearance:

<xs:element name="body">   <xs:complexType mixed="true">     <xs:all>       <xs:element name="item"/>       <xs:element name="price"/>       <xs:element name="arrivalDate"/>     </xs:all>   </xs:complexType> </xs:element>

This would allow the letter author to mix these elements into the narrative without being restricted as to any particular order. Also, it would prevent the author from inserting multiple references to the same value by accident. A valid document instance, including the new body content, might look like Example 16-11.

Example 16-11. formletterdoc.xml

<letter xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"   xsi:noNamespaceSchemaLocation="formletter.xsd">   <greeting><hello/> Bob!</greeting>   <body>     Thank you for ordering the <item/> ($<price/>), it should arrive     by <arrivalDate/>.   </body>   <closing/> </letter>

The element order constructs are not just limited to complex types with mixed content. If the mixed attribute is not present, the declared sequence of child elements is still enforced, but no character data is permitted between them.

16.7.3 Using Groups

Just as the xs:attributeGroup element allows commonly used attributes to be grouped together and referenced as a unit, the xs:group element allows a model group (a sequence, choice, or all group of individual element declarations) to be defined once and given a unique name. These groups can then be included in another element-content model using an xs:group element with the ref attribute set to the same value as the name attribute of the source group.
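The following sketch shows one way a named group could be defined and referenced. The group name nameParts is hypothetical, fullName is shown as a standalone declaration for brevity, and the fragment assumes the addr prefix and the nameComponent type from the address schema. A named group must be declared as a direct child of the xs:schema element:

```xml
<!-- Named group: a reusable sequence of element declarations -->
<xs:group name="nameParts">
  <xs:sequence>
    <xs:element name="first" type="addr:nameComponent"/>
    <xs:element name="middle" type="addr:nameComponent" minOccurs="0"/>
    <xs:element name="last" type="addr:nameComponent"/>
  </xs:sequence>
</xs:group>

<!-- Reference the group by its qualified name -->
<xs:element name="fullName">
  <xs:complexType>
    <xs:group ref="addr:nameParts"/>
  </xs:complexType>
</xs:element>
```

Any other complex type that needs the same three name elements can reference addr:nameParts rather than repeating the declarations.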

16.8 Allowing Any Content

It is often necessary to allow users to include any type of markup content they see fit. Also, it is useful to tell the schema processor to validate the content of a particular element against another application's schema. Incorporating XHTML content into another document is an example of this usage.

These applications are supported by the xs:any element. This element accepts a processContents attribute that indicates what level of validation should be performed on the included content, if any. It also accepts a namespace attribute that can be used to limit the vocabulary of included content. For instance, going back to the address-book example, to associate a rich-text notes element with an address entry, you could add the following element declaration to the address element declaration:

<xs:element name="notes" minOccurs="0">   <xs:complexType>     <xs:sequence>       <xs:any namespace="http://www.w3.org/1999/xhtml"            minOccurs="0" maxOccurs="unbounded"            processContents="skip"/>     </xs:sequence>   </xs:complexType> </xs:element>

The attributes of the xs:any element tell the schema processor that zero or more elements belonging to the XHTML namespace (http://www.w3.org/1999/xhtml) may occur at this location. Notice that this is done by setting minOccurs to 0 and maxOccurs to unbounded. It also states that these elements should be skipped. This means that no validation will be performed against the actual XHTML namespace by the parser. Other possible values for the processContents attribute are lax and strict. When set to lax, the processor will attempt to validate any element it can find a declaration for and silently ignore any unrecognized elements. The strict option requires every element to be declared and valid per the schema associated with the namespace given.

There is also support in schemas to declare that any attribute may appear within a given element. The xs:anyAttribute element may include the namespace and processContents attributes, which perform the same function as they do in the xs:any element. For example, adding the following markup to the address element would allow any XLink attributes to appear in an instance document:

<xs:element name="address">   <xs:complexType> . . .   <xs:attributeGroup ref="addr:nationality"/>   <xs:attribute name="ssn" type="addr:ssn"/>   <xs:anyAttribute namespace="http://www.w3.org/1999/xlink"       processContents="skip"/>   </xs:complexType>  </xs:element>

As an application grows and becomes more complex, it is important to take steps to maintain readability and extensibility. Things like separating a large schema into multiple documents, importing declarations from external schemas, and deriving new types from existing types are all typical tasks that will face designers of real-world schemas.

16.8.1 Using Multiple Documents

Just as large computer programs are separated into multiple physical source files, large schemas can be separated into smaller, self-contained schema documents. Although a single large schema could be arbitrarily separated into multiple smaller documents, taking the time to group related declarations into reusable modules can simplify future schema development.

There are three mechanisms that include declarations from external schemas for use within a given schema: xs:include, xs:redefine, and xs:import. The next three sections will discuss the differences between these methods and when and where they should be used.

16.8.1.1 Including external declarations

The xs:include element is the most straightforward way to bring content from an external document into a schema. To demonstrate how xs:include might be used, Example 16-12 shows a new schema document called physical-address.xsd that contains a declaration for a new complex type called physicalAddressType.

Example 16-12. physical-address.xsd

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"   targetNamespace="http://namespaces.oreilly.com/xmlnut/address"   xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"   attributeFormDefault="qualified" elementFormDefault="qualified">      <xs:annotation>     <xs:documentation xml:lang="en-us">       Simple schema example from O'Reilly's       <a href="http://www.oreilly.com/catalog/xmlnut">XML in a         Nutshell.</a>       Copyright 2002 O'Reilly &amp; Associates     </xs:documentation>   </xs:annotation>      <xs:complexType name="physicalAddressType">     <xs:sequence>       <xs:element name="street" type="xs:string" maxOccurs="3"/>       <xs:element name="city" type="xs:string"/>       <xs:element name="state" type="xs:string"/>     </xs:sequence>   </xs:complexType>    </xs:schema>

The address-schema.xsd schema document can include and reference this declaration:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"   targetNamespace="http://namespaces.oreilly.com/xmlnut/address"   xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"   attributeFormDefault="qualified" elementFormDefault="qualified"> . . .    <xs:include schemaLocation="physical-address.xsd"/>    <xs:element name="address">   <xs:complexType>     <xs:sequence> . . .       <xs:element name="physicalAddress"           type="addr:physicalAddressType"/> . . .     </xs:sequence> . . .   </xs:complexType>  </xs:element>

Content that has been included using the xs:include element is treated as though it were actually a part of the including schema document. But unlike external entities, the included document must be a valid schema in its own right. That means that it must be a well-formed XML document and have an xs:schema element as its root element. Also, the target namespace of the included schema must match that of the including document.

16.8.1.2 Modifying external declarations

The xs:include element allows external declarations to be included and used as-is by another schema document. But sometimes it is useful to extend and modify types and declarations from another schema, which is where the xs:redefine element comes in.

Functionally, the xs:redefine element works very much like the xs:include element. The major difference is that within the scope of the xs:redefine element, types from the included schema may be redefined without generating an error from the schema processor. For example, the xs:redefine element could extend the physicalAddressType type to include longitude and latitude attributes without modifying the original declaration in physical-address.xsd:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"   targetNamespace="http://namespaces.oreilly.com/xmlnut/address"   xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"   attributeFormDefault="qualified" elementFormDefault="qualified"> . . . <xs:redefine schemaLocation="physical-address.xsd">   <xs:complexType name="physicalAddressType">     <xs:complexContent>       <xs:extension base="addr:physicalAddressType">         <xs:attribute name="latitude" type="xs:decimal"/>         <xs:attribute name="longitude" type="xs:decimal"/>       </xs:extension>     </xs:complexContent>   </xs:complexType>  </xs:redefine> . . . </xs:schema>

16.8.1.3 Importing schemas for other namespaces

The xs:include and xs:redefine elements are useful when the declarations are all part of the same application. But as more public schemas become available, incorporating declarations from external sources into custom applications will be important. The xs:import element is provided for this purpose.

Using xs:import, it is possible to make the global types and elements that are declared by a schema belonging to another namespace accessible from within an arbitrary schema. The W3C has used this functionality to create type libraries. A sample type library was developed by the schema working group and can be viewed on the W3C web site at http://www.w3.org/2001/03/XMLSchema/TypeLibrary.xsd.

To use some of the types from this library in a schema, include the following xs:import element as a child of the root schema element:

<xs:import namespace="http://www.w3.org/2001/03/XMLSchema/TypeLibrary"     schemaLocation="http://www.w3.org/2001/03/XMLSchema/TypeLibrary.xsd"/>

16.8.2 Derived Complex Types

We have been using the xs:extension and xs:restriction elements without going too deeply into how or why they work. The schema language provides functionality for extending existing types, which is conceptually similar to that of inheritance in object-oriented programming. The extension and restriction elements allow new types to be defined either by expanding or limiting the potential values of existing types.

16.8.2.1 Deriving by extension

When deriving a new type from an existing type by extension, the resulting type is equivalent to appending the contents of the new declaration to the contents of the base declaration. For instance, the following example declares a new type called mailingAddressType that extends the physicalAddressType type to include a zip code:

<xs:complexType name="mailingAddressType">   <xs:complexContent>     <xs:extension base="addr:physicalAddressType">       <xs:sequence>         <xs:element name="zipCode" type="xs:string"/>       </xs:sequence>     </xs:extension>   </xs:complexContent> </xs:complexType>

This declaration appends a required element, zipCode, to the existing physicalAddressType type. The biggest benefit of this approach is that as new declarations are added to the underlying type, the derived type will automatically inherit them.

16.8.2.2 Deriving by restriction

When a new type is a logical subset of an existing type, the xs:restriction element allows this relationship to be expressed directly. Like the xs:extension element, it allows a new type to be created based on an existing type. In the case of simple types, this restriction is a straightforward application of additional constraints on the values of the simple type.

In the case of complex types, it is not quite so straightforward. Unlike the extension process, it is necessary to completely reproduce the parent type definition as part of the restriction definition. By omitting parts of the parent definition, the restriction element creates a new, constrained type. As an example, this xs:complexType element derives a new type from the physicalAddressType that only allows a single street element to contain the street address. The original physicalAddressType looks like:

<xs:complexType name="physicalAddressType">   <xs:sequence>     <xs:element name="street" type="xs:string" maxOccurs="3"/>     <xs:element name="city" type="xs:string"/>     <xs:element name="state" type="xs:string"/>   </xs:sequence> </xs:complexType>

The restricted version looks like:

<xs:complexType name="simplePhysicalAddressType">   <xs:complexContent>     <xs:restriction base="addr:physicalAddressType">       <xs:sequence>         <xs:element name="street" type="xs:string"/>         <xs:element name="city" type="xs:string"/>         <xs:element name="state" type="xs:string"/>       </xs:sequence>     </xs:restriction>   </xs:complexContent>  </xs:complexType>

Notice that this type very closely resembles the physicalAddressType, except that the maxOccurs="3" attribute has been removed from the street element declaration, so the default maxOccurs value of 1 applies.

16.8.2.3 Using derived types

One of the chief benefits of creating derived types is that the derived type may appear in place of the parent type within an instance document. The xsi:type attribute tells the schema processor that the element on which it appears conforms to a type that is derived from the normal type expected. For example, take the instance document in Example 16-13, which conforms to the address schema.

Example 16-13. addressdoc.xml using a derived type

<?xml version="1.0"?> <addr:address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"     xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address        address-schema.xsd"     xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"     addr:language="en"     addr:ssn="123-45-6789"> . . .   <addr:physicalAddress addr:latitude="34.003855" addr:longitude="-81.034808"     xsi:type="addr:simplePhysicalAddressType">     <addr:street>1400 Main St.</addr:street>     <addr:city>Columbia</addr:city>     <addr:state>SC</addr:state>   </addr:physicalAddress> . . . </addr:address>

Notice that the physicalAddress element has an xsi:type attribute that informs the validator that the current element conforms to the simplePhysicalAddressType, rather than the physicalAddressType that would normally be expected. This feature is particularly useful when developing internationalized applications, as distinct address types could be derived for each country and then flagged in the instance document for proper validation.

16.8.3 Substitution Groups

A feature that is closely related to derived types is the substitution group. A substitution group is a collection of elements that are all interchangeable with a particular element, called the head element, within an instance document. To create a substitution group, all that is required is that an element declaration include a substitutionGroup attribute that names the head element for that group. Then, anywhere that the head element's declaration is referenced in the schema, any member of the substitution group may also appear. Unlike derived types, it isn't necessary to use the xsi:type attribute in an instance document to identify the type of the substituted element.

The primary restriction on substitution groups is that every element in the group must either be of the same type as or derived from the head element's type. Declaring a numeric element and trying to add it to a substitution group based on a string element would generate an error from the schema processor. The elements must also be declared globally and in the target namespace of the schema.
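As a minimal sketch of these rules, the following schema declares a head element and two members of its substitution group. The element names note, deliveryNote, billingNote, and notes are hypothetical and are not part of the address schema developed in this chapter; only the namespace URI is taken from it.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    targetNamespace="http://namespaces.oreilly.com/xmlnut/address">

  <!-- Head element of the group (hypothetical name) -->
  <xs:element name="note" type="xs:string"/>

  <!-- Member with the same type as the head element -->
  <xs:element name="deliveryNote" type="xs:string"
      substitutionGroup="addr:note"/>

  <!-- Member whose type (xs:token) is derived from xs:string -->
  <xs:element name="billingNote" type="xs:token"
      substitutionGroup="addr:note"/>

  <!-- Wherever the head is referenced, any member may appear instead -->
  <xs:element name="notes">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="addr:note" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Under this sketch, an addr:notes element could contain any mix of addr:note, addr:deliveryNote, and addr:billingNote children, with no xsi:type attribute required on any of them.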

16.9 Controlling Type Derivation

Just as some object-oriented programming languages allow the creator of a class to dictate the limits on how that class can be extended, the schema language allows schema authors to place restrictions on type extension and restriction.

16.9.1 Abstract Elements and Types

The abstract attribute applies to type and element declarations. When it is set to true, that element or type cannot appear directly in an instance document. If an element is declared as abstract, a member of a substitution group based on that element must appear in its place. If a type is declared as abstract, an element declared with that type may appear in an instance document only if it carries an xsi:type attribute that names a non-abstract type derived from it.
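The following sketch shows both uses of the attribute. All of the element and type names in it (note, deliveryNote, genericAddressType, usAddressType, mailingAddress) are hypothetical and were chosen only for illustration.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    targetNamespace="http://namespaces.oreilly.com/xmlnut/address">

  <!-- Abstract head element: addr:note itself may not appear
       in an instance document -->
  <xs:element name="note" type="xs:string" abstract="true"/>

  <!-- Concrete member that must be used in place of addr:note -->
  <xs:element name="deliveryNote" type="xs:string"
      substitutionGroup="addr:note"/>

  <!-- Abstract type: serves only as a base for other types -->
  <xs:complexType name="genericAddressType" abstract="true">
    <xs:sequence>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Concrete type derived from the abstract base -->
  <xs:complexType name="usAddressType">
    <xs:complexContent>
      <xs:extension base="addr:genericAddressType">
        <xs:sequence>
          <xs:element name="state" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Elements are declared with the concrete type -->
  <xs:element name="mailingAddress" type="addr:usAddressType"/>

</xs:schema>
```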

16.9.2 The Final Attribute

Until now, the schema has placed no restrictions on how other types or elements could be derived from its elements and types. The final attribute can be added to a complex type definition and set to either #all, extension, or restriction. On a simple type definition it can be set to #all or a list containing any combination of the values list, union, and/or restriction, in any order. When a type is derived from another type that has the final attribute set, the schema processor verifies that the desired derivation is legal. For example, a final attribute could prevent the physicalAddressType type from being extended:

<xs:complexType name="physicalAddressType" final="extension">

Since the main schema in address-schema.xsd attempts to redefine the physicalAddressType in an xs:redefine block, the schema processor generates the following errors when it attempts to validate the instance document:

ComplexType 'physicalAddressType': cos-ct-extends.1.1: Derivation by
  extension is forbidden by either the base type physicalAddressType_redefined
  or the schema.
Attribute "addr:latitude" must be declared for element type "physicalAddress".
Attribute "addr:longitude" must be declared for element type
  "physicalAddress".

The first error is a result of trying to extend a type that has been marked to prevent extension. The next two errors occur because the new, extended type was not parsed and applied to the content in the document. Now that you've seen how this works, removing this particular "feature" from the physicalAddressType definition gets the schema working again.
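The final attribute works the same way on simple types. The sketch below uses a hypothetical postalCodeType, which is not part of the address schema, to show a simple type that may still be used in list or union types but cannot be restricted further.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    targetNamespace="http://namespaces.oreilly.com/xmlnut/address">

  <!-- Hypothetical type: no type may be derived from it by restriction -->
  <xs:simpleType name="postalCodeType" final="restriction">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Still legal: derivation by list is not blocked by final="restriction" -->
  <xs:simpleType name="postalCodeListType">
    <xs:list itemType="addr:postalCodeType"/>
  </xs:simpleType>

</xs:schema>
```

Adding a third simple type that names addr:postalCodeType as the base of an xs:restriction element should cause the schema processor to reject the schema.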

16.9.3 Setting fixed Facets

Similar to the final attribute, the fixed attribute is provided to mark certain facets of simple types as immutable. Facets that have been marked as fixed="true" cannot be overridden in derived types.
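A minimal sketch of a fixed facet follows. The type names stateCodeType and upperStateCodeType are hypothetical and do not appear in the address schema.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    targetNamespace="http://namespaces.oreilly.com/xmlnut/address">

  <!-- Hypothetical base type: the length facet is locked at 2 -->
  <xs:simpleType name="stateCodeType">
    <xs:restriction base="xs:string">
      <xs:length value="2" fixed="true"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Legal: adds a pattern facet and leaves the length facet untouched -->
  <xs:simpleType name="upperStateCodeType">
    <xs:restriction base="addr:stateCodeType">
      <xs:pattern value="[A-Z]{2}"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

If upperStateCodeType instead tried to specify a length facet with a different value, such as 3, the schema processor should report an error because the facet is fixed in the base type.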

16.9.4 Uniqueness and Keys

Perhaps one of the most welcome features of schemas is the ability to express more sophisticated relationships between values in elements and attributes of a document. The limitations of the primitive index capability provided by the XML 1.0 ID and IDREF attributes became readily apparent as documents began to include multiple distinct types of element data with complex data keys. The two facilities for enforcing element uniqueness in schemas are the xs:unique and xs:key elements.

16.9.4.1 Forcing uniqueness

The xs:unique element enforces element and attribute value uniqueness for a specified set of elements in a schema document. This uniqueness constraint is constructed in two phases. First, the set of all of the elements to be evaluated is defined using a restricted XPath expression. Next, the precise element and attribute values that must be unique are defined.

To illustrate, let's add logic to the address schema to prevent the same phone number from appearing multiple times within a given contacts element. To add this restriction, the element declaration for contacts includes a uniqueness constraint:

<xs:element name="contacts" type="addr:contactsType" minOccurs="0">
  <xs:unique name="phoneNums">
    <xs:selector xpath="phone"/>
    <xs:field xpath="@addr:number"/>
  </xs:unique>
</xs:element>

Now, if a given contacts element contains two phone elements with the same value for their number attributes, the schema processor will generate an error.

This is the basic algorithm that the schema processor follows to enforce these restrictions:

Use the xpath attribute of the single xs:selector element to build a set of all of the elements to which the restriction will apply.
Logically combine the values referenced by each xs:field element for each selected element. Compare the combinations of values that you get for each of the elements.
Report any conflicts as a validity error.

The very perceptive among you will have noticed that the contactsType type definition permits only a single phone child element, so this particular restriction is not very useful as it stands. Modifying the contactsType definition to permit multiple phone child elements is not difficult.
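One way to make that change is sketched below. The content model shown here is an assumption, since the real contactsType may declare other children and attributes; the sketch also assumes that phone is an empty element with a number attribute of type xs:string, and that attributes are namespace-qualified, as the @addr:number field path implies.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    targetNamespace="http://namespaces.oreilly.com/xmlnut/address"
    attributeFormDefault="qualified">

  <!-- Assumed shape of contactsType, with phone now repeatable -->
  <xs:complexType name="contactsType">
    <xs:sequence>
      <xs:element name="phone" maxOccurs="unbounded">
        <xs:complexType>
          <!-- Qualified by attributeFormDefault, so it is written
               addr:number in instance documents -->
          <xs:attribute name="number" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```

With a definition along these lines, a contacts element holding two phone children that carry the same addr:number value would violate the phoneNums constraint, while two phone children with different values would be accepted.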

16.9.4.2 Keys and references

The xs:key element is closely related to the xs:unique element. Logically, the xs:key element functions exactly the same way the xs:unique element does. It uses the xs:selector element to define a set of elements it applies to, then one or more xs:field elements are used to define which values make up this particular key. The major difference is that, in the case of the xs:key element, uniqueness is not the only desired property of these elements. The goal of the xs:key element is to define a set of elements that can be referenced using the xs:keyref element. Having created a fairly full-featured address element, creating a collection of these elements called addressBook would be an excellent way to show this feature in operation.

First, the new addressBook element is declared, including a key based on the ssn attribute of each address entry:

<xs:element name="addressBook">
  <xs:complexType>
    <xs:sequence maxOccurs="unbounded">
      <xs:element ref="addr:address"/>
    </xs:sequence>
  </xs:complexType>
  <xs:key name="ssnKey">
    <xs:selector xpath="addr:address"/>
    <xs:field xpath="@addr:ssn"/>
  </xs:key>
</xs:element>

Now that the key is defined, you can add a new element to the address element declaration that connects a particular address record with another record. For example, to list references to the children of a particular person in the address book, add the following declaration for a kids element:

<xs:element name="address">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="fullName">
. . .
      </xs:element>
      <xs:element name="kids" minOccurs="0">
        <xs:complexType>
          <xs:sequence maxOccurs="unbounded">
            <xs:element name="kid">
              <xs:complexType>
                <xs:attribute name="ssn" type="addr:ssn"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
. . .
    </xs:sequence>
    <xs:attributeGroup ref="addr:nationality"/>
    <xs:attribute name="ssn" type="addr:ssn"/>
    <xs:anyAttribute namespace="http://www.w3.org/1999/xlink"
        processContents="skip"/>
  </xs:complexType>
</xs:element>

Now, an xs:keyref element in the addressBook element declaration enforces the constraint that the ssn attribute of a particular kid element must match an ssn attribute on an address element in the current document:

<xs:element name="addressBook">
. . .
  <xs:key name="ssnKey">
    <xs:selector xpath="addr:address"/>
    <xs:field xpath="@addr:ssn"/>
  </xs:key>
  <xs:keyref name="kidSSN" refer="addr:ssnKey">
    <xs:selector xpath="addr:address/kids/kid"/>
    <xs:field xpath="@addr:ssn"/>
  </xs:keyref>
</xs:element>

Now, if any kid element in an instance document refers to a nonexistent address record, the schema validator will generate an error.
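As a rough illustration, an instance fragment along the following lines would satisfy both the key and the key reference. The second ssn value is invented for this sketch, and the other children of each address are left out rather than guessed at. The sketch also assumes, as in Example 16-13, that attributes are namespace-qualified while locally declared elements such as kids and kid are not.

```xml
<?xml version="1.0"?>
<addr:addressBook
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address">

  <addr:address addr:ssn="123-45-6789">
    <!-- fullName and other children omitted from this sketch -->
    <kids>
      <!-- Matches the ssn key of the second address below -->
      <kid addr:ssn="987-65-4321"/>
    </kids>
  </addr:address>

  <addr:address addr:ssn="987-65-4321">
    <!-- fullName and other children omitted from this sketch -->
  </addr:address>

</addr:addressBook>
```

Changing the kid element's addr:ssn value to one that appears on no address element would break the kidSSN reference, and giving both address elements the same addr:ssn value would violate the ssnKey key.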