17.2 Schema Basics | XML in a Nutshell, Third Edition

This section will construct, step-by-step, a simple schema document representing a typical address book entry, introducing different features of the XML Schema language as needed. Example 17-1 shows a very simple, well-formed XML document.

Example 17-1. addressdoc.xml

 <?xml version="1.0"?>
 <fullName>Scott Means</fullName>

Assuming that the fullName element can only contain a simple string value, the schema for this document would look like Example 17-2.

Example 17-2. address-schema.xsd

 <?xml version="1.0"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="fullName" type="xs:string"/>
 </xs:schema>

It is also common to associate the sample instance document explicitly with the schema document. Since the fullName element is not in any namespace, the xsi:noNamespaceSchemaLocation attribute is used, as shown in Example 17-3.

Example 17-3. addressdoc.xml with schema reference

 <?xml version="1.0"?>
 <fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="address-schema.xsd">Scott Means</fullName>

Validating the simple document against its schema requires a validating XML parser that supports schemas, such as the open source Xerces parser from the Apache XML Project (http://xml.apache.org/xerces2-j/). Xerces is written in Java and includes a command-line program called dom.Writer that can be used to validate addressdoc.xml, like this:

 % java dom.Writer -V -S addressdoc.xml

Since the document is valid, dom.Writer will simply echo the input document to standard output. An invalid document will cause the parser to generate an error message. For instance, adding b elements to the contents of the fullName element violates the schema rules:

 <?xml version="1.0"?>
 <fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="address-schema.xsd">Scott <b>Means</b>
 </fullName>

If this document were validated with dom.Writer, the following validity errors would be detected by Xerces:

 [Error] addressdoc.xml:4:13: Element type "b" must be declared.
 [Error] addressdoc.xml:4:31: Datatype error: In element 'fullName' : Can not
 have element children within a simple type content.

17.2.1 Document Organization

Now that there is a basic schema and a valid document from which to work, it is time to examine the structure of a schema document and its contents. Every schema document consists of a single root xs:schema element. This element contains declarations for all elements and attributes that may appear in a valid instance document.

The XML elements that make up an XML Schema must belong to the XML Schema namespace (http://www.w3.org/2001/XMLSchema), which is frequently associated with the xs: prefix. For the remainder of this chapter, all schema elements will be written using the xs: prefix to indicate that they belong to the Schema namespace.

Instance elements declared using top-level xs:element elements in the schema (immediate child elements of the xs:schema element) are considered global elements. For example, the simple schema in Example 17-2 globally declares one element: fullName. According to the rules of schema construction, any element that is declared globally may appear as the root element of an instance document.

In this case, since only one element has been declared, that shouldn't be a problem. But when building more complex schemas, this side effect must be taken into consideration. If more than one element is declared globally, a schema-valid document may not contain the root element you expect.
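
As a minimal sketch of this side effect, assume the schema from Example 17-2 gains a second global declaration. The phoneNumber element is a hypothetical name introduced only for illustration:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="fullName" type="xs:string"/>
  <!-- Hypothetical second global element, not part of the original example -->
  <xs:element name="phoneNumber" type="xs:string"/>
</xs:schema>
```

Under that assumption, the following document should be just as schema-valid as one whose root is fullName, even though an address book application would probably not expect it:

```xml
<?xml version="1.0"?>
<phoneNumber>555-0100</phoneNumber>
```

An application that requires a particular root element therefore has to check the root element name itself rather than rely on schema validity alone.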

Naming conflicts are another potential problem with multiple global declarations. When writing schema declarations, it is an error to declare two components of the same kind with the same name at the same scope. For instance, trying to declare two global elements called fullName would generate an error. But declaring an element and an attribute with the same name would not create a conflict because the two names are not used in the same way.
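
The naming rule can be sketched as follows. The declarations are illustrative only and are not part of the address book schema; the commented-out declaration shows the kind of duplicate that a schema processor would be expected to reject:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A global element and a global attribute may share a name,
       because elements and attributes are different kinds of component -->
  <xs:element name="language" type="xs:string"/>
  <xs:attribute name="language" type="xs:language"/>

  <!-- Uncommenting this second global element named "language"
       would make the schema invalid:
  <xs:element name="language" type="xs:language"/>
  -->
</xs:schema>
```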

17.2.2 Annotations

Now that there is a working schema, it's good practice to include some documentary material about who authored it, what it was for, any copyright restrictions, etc. Since an XML Schema document is an XML document in its own right, one simple option would be to use XML comments to include documentary information.

The major drawback to using XML comments is that parsers are not obliged to keep comments intact when parsing XML documents, and applications have to do a lot of work to extract any structure from the free-form text inside them. This increases the likelihood that, at some point, important documentation will be lost during an otherwise harmless transformation or edit. Encoding documentation as markup inline with the element and type declarations it refers to, on the other hand, opens up many possibilities for automatic documentation generation.

To accommodate this extra information, most schema elements may contain an optional xs:annotation element as their first child element. The annotation element may then, in turn, contain any combination of xs:documentation and xs:appinfo elements, which are provided to contain extra human-readable and machine-readable information, respectively.

17.2.2.1 The xs:documentation element

As a concrete example, let's add some authorship and copyright information to the simple schema document, as shown in Example 17-4.

Example 17-4. address-schema.xsd with annotation

 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

   <xs:annotation>
     <xs:documentation xml:lang="en-US">
       Simple schema example from O'Reilly's
       <a href="http://www.oreilly.com/catalog/xmlnut">XML in a Nutshell.</a>
       Copyright 2004 O'Reilly Media, Inc.
     </xs:documentation>
   </xs:annotation>

   <xs:element name="fullName" type="xs:string"/>

 </xs:schema>

The xs:documentation element permits an xml:lang attribute to identify the language of the brief message. This attribute can also be applied to the xs:schema element to set the default language for the entire document. For more information about using the xml:lang attribute, see Chapter 5 and Chapter 21.

Also, notice that the documentation element contains additional markup: an a element (à la HTML). The xs:documentation element is allowed to contain any well-formed XML, not just schema elements.
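
As a sketch of the default-language behavior described above, the xml:lang attribute can be placed on xs:schema and then overridden on an individual xs:documentation element. The documentation text here is invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xml:lang="en-US">

  <xs:annotation>
    <!-- Inherits en-US from the xs:schema element -->
    <xs:documentation>Schema for an address book entry.</xs:documentation>
    <!-- Overrides the default for this one message -->
    <xs:documentation xml:lang="fr">Schéma pour une entrée de carnet d'adresses.</xs:documentation>
  </xs:annotation>

  <xs:element name="fullName" type="xs:string"/>

</xs:schema>
```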

17.2.2.2 The xs:appinfo element

In reality, there is little difference between the xs:documentation element and the xs:appinfo element. Either one can contain any combination of character data or markup the schema author wants to include. But the developers of the schema specification intended the xs:documentation element to contain human-readable content, while the xs:appinfo element would contain application-specific extension information related to a particular schema element.

For example, let's say that it is necessary to encode context-sensitive help text with each of the elements declared in a schema. This text might be used to generate tool-tips in a GUI or system prompts in a voicemail system. Either way, it would be very convenient to associate this information directly with the particular element in question using the xs:appinfo element, like this:

 . . .
 <xs:element name="fullName" type="xs:string">
   <xs:annotation>
     <xs:appinfo>
       <help-text>Enter the person's full name.</help-text>
     </xs:appinfo>
   </xs:annotation>
 </xs:element>
 . . .

Although schemas allow very sophisticated and powerful rules to be expressed, they cannot possibly encompass every conceivable need that a developer might face. That is why it is important to remember that there is a facility that can be used to include your own application-specific information directly within the actual schema declarations.

Schematron is especially well-suited to use in annotations and is capable of checking a wide variety of conditions well beyond the bounds of XML Schema. For more information about Schematron, see http://www.ascc.net/xml/resource/schematron/schematron.html.
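
The following is a rough sketch of what a Schematron rule embedded in an annotation might look like, assuming the Schematron 1.5 namespace (http://www.ascc.net/xml/schematron) and the sch: prefix. The rule itself is invented for illustration. A schema processor ignores the contents of xs:appinfo, so a separate Schematron-aware tool would have to extract and apply the rule:

```xml
<xs:element name="fullName" type="xs:string"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://www.ascc.net/xml/schematron">
  <xs:annotation>
    <xs:appinfo>
      <!-- Illustrative rule: xs:string alone would accept an empty name -->
      <sch:pattern name="Full name must not be blank">
        <sch:rule context="fullName">
          <sch:assert test="string-length(normalize-space(.)) &gt; 0">
            The fullName element must contain at least one non-space character.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```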

17.2.3 Element Declarations

XML documents are composed primarily of nested elements, and xs:element is one of the most often used declarations in a typical schema. This simple example schema already includes a single global element declaration that tells the schema processor that instance documents must consist of a single element, fullName:

 <xs:element name="fullName" type="xs:string"/>

This declaration uses two attributes to describe the element that can appear in the instance document: name and type. The name attribute is self-explanatory, but the type attribute requires some additional explanation.

17.2.3.1 Simple types

Schemas support two different types of content: simple and complex. Simple content consists of pure text that does not contain nested elements.

In the previous example, the type="xs:string" attribute tells the schema processor that this element can only contain simple content of the built-in type xs:string. Table 17-1 lists a representative sample of the built-in simple types that are defined by the schema specification. See Chapter 22 for a complete listing.

Table 17-1. Built-in simple schema types

Type	Description
`anyURI`	A Uniform Resource Identifier
`base64Binary`	Base64-encoded binary data
`boolean`	May contain either true or false, 0 or 1
`byte`	A signed byte quantity >= -128 and <= 127
`dateTime`	An absolute date and time
`duration`	A length of time, expressed in units of years, months, days, hours, etc.
`ID`, `IDREF`, `IDREFS`, `ENTITY`, `ENTITIES`, `NOTATION`, `NMTOKEN`, `NMTOKENS`	Same values as defined in the attribute declaration section of the XML 1.0 Recommendation
`integer`	Any integer: positive, negative, or zero
`language`	May contain same values as `xml:lang` attribute from the XML 1.0 Recommendation
`Name`	An XML name
`string`	Unicode string
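
As a brief sketch of how these built-in types are used, the declarations below apply several entries from Table 17-1. The element names are hypothetical and are not part of the address book example; the comments show sample lexical values:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- e.g. http://www.example.com/ -->
  <xs:element name="homePage" type="xs:anyURI"/>
  <!-- e.g. 2004-09-01T12:00:00Z -->
  <xs:element name="lastUpdated" type="xs:dateTime"/>
  <!-- e.g. P1Y2M3DT4H (1 year, 2 months, 3 days, 4 hours) -->
  <xs:element name="membershipLength" type="xs:duration"/>
  <!-- true, false, 1, or 0 -->
  <xs:element name="subscribed" type="xs:boolean"/>
  <!-- any whole number from -128 to 127 -->
  <xs:element name="age" type="xs:byte"/>
</xs:schema>
```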

Since attribute values cannot contain elements, attributes must always be declared with simple types. Also, an element that is declared to have a simple type cannot have any attributes. This means that if an attribute must be added to the fullName element, some fairly significant changes to the element declaration are required.

17.2.4 Attribute Declarations

To make the fullName element more informative, it would be nice to add a language attribute to provide a hint as to how it should be pronounced. Although adding an attribute to an element sounds like a fairly simple task, it is complicated by the fact that elements with simple types (like xs:string) cannot have attributes.

Attributes are declared using the xs:attribute element. Attributes may be declared globally by top-level xs:attribute elements (which may be referenced from anywhere within the schema) or locally as part of a complex type definition that is associated with a particular element.
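
As a sketch of the global form, the schema below declares language as a top-level attribute and then references it with the ref attribute. The xs:complexType wrapper it needs is explained in the rest of this section; this arrangement is only an illustration and assumes a schema with no target namespace:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Global attribute declaration: a direct child of xs:schema -->
  <xs:attribute name="language" type="xs:language"/>

  <xs:element name="fullName">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <!-- Reference to the global declaration above -->
          <xs:attribute ref="language"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

</xs:schema>
```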

To incorporate a language attribute into the fullName element declaration, a new complex type based on the built-in xs:string type must be created. To do this, three new schema elements must be used: xs:complexType, xs:simpleContent, and xs:extension:

 <xs:element name="fullName">

   <xs:complexType>
     <xs:simpleContent>
       <xs:extension base="xs:string">
         <xs:attribute name="language" type="xs:language"/>
       </xs:extension>
     </xs:simpleContent>
   </xs:complexType>

 </xs:element>

This declaration no longer has a type attribute. Instead, it has an xs:complexType child element. This element tells the schema processor that the fullName element may have attributes, but the xs:simpleContent element tells the processor that the content of the element is a simple type. To specify what type of simple content, it uses the base attribute of the xs:extension element to derive a new type from the built-in xs:string type. The xs:attribute element within the xs:extension element indicates that this derived type may have an attribute called language that contains values conforming to the built-in simple type xs:language (mentioned in Table 17-1). Type derivation is an important part of schema creation and will be covered in more detail later in this chapter.
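
For illustration, an instance document that uses the new attribute might look like the sketch below, assuming address-schema.xsd has been updated with the declaration above. Because xs:attribute declarations are optional unless stated otherwise, a fullName element without the attribute should remain valid as well:

```xml
<?xml version="1.0"?>
<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="address-schema.xsd"
  language="en-US">Scott Means</fullName>
```

A value such as language="not a language" would be expected to fail validation, since it does not match the lexical form that xs:language requires.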

17.2.4.1 Attribute groups

In DTDs, parameter entities are used to encapsulate repeated groups of attribute declarations that are shared between different element types. Schemas provide the same functionality in a more formal fashion using the xs:attributeGroup element.

An attribute group is simply a named group of xs:attribute declarations (or references to other attribute groups) that can be referenced from within a complex type definition. The attribute group must be declared as a global xs:attributeGroup element with a unique name attribute. The group is referenced within a complex type definition by including another xs:attributeGroup element with a ref attribute that matches the desired top-level attribute group name.

Within the fullName schema, an attribute group could be used to create a package of attributes related to a person's nationality. This package of attributes could be used on several elements, including the fullName element, without repeating the same attribute declarations. Then, if it were later necessary to extend this collection of attributes, it could be done in a single location:

 <xs:element name="fullName">
 . . .
       <xs:extension base="xs:string">
         <xs:attributeGroup ref="nationality"/>
       </xs:extension>
 . . .
 </xs:element>

 <xs:attributeGroup name="nationality">
   <xs:attribute name="language" type="xs:language"/>
 </xs:attributeGroup>
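
Put together, a complete schema along these lines might look like the following sketch. The nickname element and the country attribute are hypothetical additions, included only to show one group being shared by two elements and extended in a single place:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Shared package of attributes; extend it here and both elements pick it up -->
  <xs:attributeGroup name="nationality">
    <xs:attribute name="language" type="xs:language"/>
    <!-- Hypothetical extra attribute, not in the original example -->
    <xs:attribute name="country" type="xs:string"/>
  </xs:attributeGroup>

  <xs:element name="fullName">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attributeGroup ref="nationality"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

  <!-- Hypothetical second element reusing the same group -->
  <xs:element name="nickname">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attributeGroup ref="nationality"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

</xs:schema>
```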