XML Schema: An Overview | NetBeans™ IDE Field Guide: Developing Desktop, Web, Enterprise, and Mobile Applications (2nd Edition)

The primary purpose of XML Schema is to enable documents to be validated: a schema defines a set of rules that XML documents must conform to, and enables documents to be checked against these rules. This means that organizations using XML to exchange invoices and purchase orders can agree on a schema defining the rules for these messages, and both parties can validate the messages against the schema to ensure that they are right. So the schema, in effect, defines a type of document, and this is why schemas are central to the type system of XSLT.

In fact, the designers of XML Schema were more ambitious than this. They realized that rather than simply giving a "yes" or "no" answer, processing a document against a schema could make the application's life easier by attaching labels to the validated document indicating, for each element and attribute in the document, which schema definitions it was validated against. In the language of XML Schema, this document with validation labels is called a Post Schema Validation Infoset or PSVI. The data model used by XSLT and XPath is based on the PSVI, but it only retains a subset of the information in the PSVI: specifically, the type annotations attached to element and attribute nodes.

We begin by looking at the kinds of types that can be defined in XML Schema, starting with simple types and moving on to progressively more complex types.

Simple Type Definitions

Let's suppose that many of our messages refer to part numbers, and that part numbers have a particular format such as ABC12345. We can start by defining this as a type in the schema:

  <xs:simpleType name="part-number">
    <xs:restriction base="xs:token">
      <xs:pattern value="[A-Z]{3}[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>

Part number is a simple type because it doesn't have any internal node structure (that is, it doesn't contain any elements or attributes). I have defined it by restriction from xs:token, which is one of the built-in types that come for free with XML Schema. I could have chosen to base the type on xs:string, but xs:token is probably better because with xs:string, leading and trailing whitespace is considered significant, whereas with xs:token, it gets stripped automatically before the validation takes place. The particular restriction in this case is that the value must match the regular expression given in the <xs:pattern> element. This particular regular expression says that the value must consist of exactly three letters in the range A to Z, followed by exactly five digits.

Having defined this type, I can now refer to it in definitions of elements and attributes. For example, I can define the element:

  <xs:element name="part" type="part-number"/>

This allows documents to contain <part> elements whose content conforms to the rules for the type called part-number. Of course, I can also define other elements that have the same type, for example:

  <xs:element name="subpart" type="part-number"/>
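As a rough illustration of what these two declarations accept and reject, the following instance fragment may help. The <examples> wrapper is hypothetical and is there only to keep the fragment well-formed; the comments describe the outcome one would expect from a conforming validator.

```xml
<!-- Hypothetical wrapper element, not part of the schema above -->
<examples>
  <!-- valid: three uppercase letters followed by five digits -->
  <part>ABC12345</part>
  <!-- valid: xs:token collapses the surrounding whitespace before the pattern is checked -->
  <subpart>  XYZ00001  </subpart>
  <!-- invalid: lowercase letters do not match [A-Z]{3} -->
  <part>abc12345</part>
  <!-- invalid: only four digits -->
  <subpart>ABC1234</subpart>
</examples>
```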

Note the distinction between the name of an element and its type. Many element declarations in a schema (declarations that define elements with different names) can refer to the same type definition, if the rules for validating their content are the same. It's also permitted, though we won't go into the detail just yet, to use the same element name at different places within a document with different type definitions.

We can also use the same type definition in an attribute, for example:

  <xs:attribute name="part-nr" type="part-number"/>

As we will see later in the chapter, we can declare variables and parameters in a stylesheet whose values must be elements or attributes of a particular type. Once a document has been validated using this schema, elements that have been validated against the declarations of part and subpart given above, and attributes that have been validated against the declaration named part-nr, will carry the type annotation part-number, and they can be assigned to a variable such as:

  <xsl:variable name="part" as="element(*, part-number)">

This variable is allowed to contain any element node that has the type annotation part-number. If further types have been defined as restricted subtypes of part-number (for example, Boeing-part-number), these can be assigned to the variable too. The «*» indicates that we are not concerned with the name of the element, but only with its type.
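A minimal sketch of how such a restricted subtype might be declared is shown below. The pattern chosen for Boeing-part-number and the <boeing-part> element are invented for illustration; only the type names come from the text.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="part-number">
    <xs:restriction base="xs:token">
      <xs:pattern value="[A-Z]{3}[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical subtype: values must satisfy the base pattern
       and, in addition, start with the letters BA -->
  <xs:simpleType name="Boeing-part-number">
    <xs:restriction base="part-number">
      <xs:pattern value="BA[A-Z][0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical element using the subtype -->
  <xs:element name="boeing-part" type="Boeing-part-number"/>

</xs:schema>
```

Because Boeing-part-number is derived by restriction from part-number, a validated <boeing-part> element would also satisfy «element(*, part-number)».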

There are actually three kinds of simple types that you can define in XML Schema: atomic types, list types, and union types. Atomic types are treated specially in the XPath/XSLT type system, because values of an atomic type (called, naturally enough, atomic values) can be manipulated as free-standing items, independently of any node. Like integers, booleans, and strings, part numbers as defined above are atomic values, and you can hold a part number or a sequence of part numbers directly in a variable, without creating any node to contain it. For example, the following declaration defines a variable whose value is a sequence of three part numbers:

  <xsl:variable name="part" as="part-number*"
      select="part-number('WZH94623'),
              part-number('BYF67253'),
              part-number('PRG83692')"/>

Simple types in XML Schema are not the same thing as atomic types in the XPath data model. This is because a simple type can also allow a sequence of values. For example, it is possible to define the following simple type:

  <xs:simpleType name="colors">
    <xs:list>
      <xs:simpleType>
        <xs:restriction base="xs:NCName">
          <xs:enumeration value="red"/>
          <xs:enumeration value="orange"/>
          <xs:enumeration value="yellow"/>
          <xs:enumeration value="green"/>
          <xs:enumeration value="blue"/>
          <xs:enumeration value="indigo"/>
          <xs:enumeration value="violet"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:list>
  </xs:simpleType>

There are actually two type definitions here. The inner type is anonymous, because the <xs:simpleType> element has no name attribute. It defines an atomic value, which must be an xs:NCName, and more specifically, must be one of the values «red», «orange», «yellow», «green», «blue», «indigo», or «violet». The outer type is a named type (which means it can be referenced from elsewhere in the schema), and it defines a list type whose individual items must conform to the inner type.

This type therefore allows values such as «red green blue», or «violet yellow» or even «red red red». The values are written in textual XML as a list of color names separated by spaces, but once the document has been through schema validation, the typed value of an element with this type will be a sequence of xs:NCName values.
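To make this concrete, here is a small instance fragment. It assumes a hypothetical element declaration such as <xs:element name="palette" type="colors"/>, which does not appear in the schema above; the <examples> wrapper is likewise invented to keep the fragment well-formed.

```xml
<!-- Hypothetical wrapper and element names; assumes palette is declared with type="colors" -->
<examples>
  <!-- valid: the typed value is a sequence of three items -->
  <palette>red green blue</palette>
  <!-- valid: repeated items are allowed, and any run of whitespace separates them -->
  <palette>red   red
           red</palette>
  <!-- invalid: pink is not one of the enumerated values -->
  <palette>red pink</palette>
</examples>
```

After validation, atomizing the first <palette> element (for example with the data() function) should yield three separate atomic values rather than the single string «red green blue».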

The term simple type in XML Schema rules out types involving multiple attribute or element nodes, but it does allow composite values consisting of a sequence of atomic values.
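The third kind of simple type mentioned earlier, the union type, is not illustrated in this section. The following is a minimal sketch of one; the type name «size» and its member values are invented for illustration and do not come from the schema being discussed.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical union type: a value is valid if it is either
       a positive integer or one of three named sizes -->
  <xs:simpleType name="size">
    <xs:union memberTypes="xs:positiveInteger">
      <xs:simpleType>
        <xs:restriction base="xs:token">
          <xs:enumeration value="small"/>
          <xs:enumeration value="medium"/>
          <xs:enumeration value="large"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

</xs:schema>
```

Under this definition, both «42» and «medium» would be accepted, while «huge» would be rejected.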

Elements with Attributes and Simple Content

One thing that might occur quite frequently in an invoice or purchase order is an amount in money: there might be elements such as:

<unit-price currency="USD">50.00</unit-price>
<amount-due currency="EUR">1890.00</amount-due>

What these two elements have in common is that they have a currency attribute (with a particular range of allowed values) and content that is a decimal number. This is an example of a complex type. We defined part-number as a simple type because it didn't involve any nodes. The money-amount type is a complex type, because it involves a decimal number and an attribute value. We can define this type in the schema, together with a simple type for its currency attribute, so that both elements can share it:

  <xs:simpleType name="currency-type">
    <xs:restriction base="xs:token">
      <xs:enumeration value="USD"/>
      <xs:enumeration value="EUR"/>
      <xs:enumeration value="GBP"/>
      <xs:enumeration value="CAD"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="money-amount">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="currency-type"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

Here we have defined two new types in the schema, both of which are named. The first defines the type of the currency attribute. We could have used the same name for the attribute and its type, but many people prefer to keep the names of types distinct from those of elements and attributes, to avoid confusing the two. In this case I've chosen to define it (again) as a subtype of xs:token, but this time restricting the value to be one of four particular world currencies. In practice, of course, the list might be much longer. The currency-type is again a simple type, because it's just a value; it doesn't define any nodes.

The second definition is a complex type, because it defines two things. It's the type of an element that has a currency attribute conforming to the definition of currency-type, and has content (the text between the element start and end tags) that is a decimal number, indicated by the reference to the built-in type xs:decimal. This particular kind of complex type is called a complex type with simple content, which means that elements of this type can have attributes, but they cannot have child elements.

Again, the name of the type is quite distinct from the names of the elements that conform to this type. We can declare the two example elements above in the schema as follows:

  <xs:element name="unit-price" type="money-amount"/>
  <xs:element name="amount-due" type="money-amount"/>

But although the type definition doesn't constrain the element name, it does constrain the name of the attribute, which must be «currency». If the type definition defined child elements, it would also constrain these child elements to have particular names.
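The following instance fragment sketches how these constraints would be expected to play out in practice. The <examples> wrapper and the attribute name «ccy» are invented for illustration.

```xml
<!-- Hypothetical wrapper element, not part of the schema above -->
<examples>
  <!-- valid: decimal content and an enumerated currency code -->
  <unit-price currency="USD">50.00</unit-price>
  <!-- valid: the attribute declaration does not say use="required", so it may be omitted -->
  <amount-due>1890.00</amount-due>
  <!-- invalid: JPY is not one of the four enumerated values -->
  <amount-due currency="JPY">1890.00</amount-due>
  <!-- invalid: the type declares an attribute named currency, not ccy -->
  <unit-price ccy="USD">50.00</unit-price>
  <!-- invalid: the content is not an xs:decimal -->
  <unit-price currency="GBP">fifty</unit-price>
</examples>
```

If every money amount ought to carry a currency, adding use="required" to the attribute declaration would be one way to enforce it.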

In an XSLT 2.0 stylesheet, we can write a template rule for processing elements of this type, which means that all the logic for formatting money amounts can go in one place. For example, we could write:

  <xsl:template match="element(*, money-amount)">
    <xsl:value-of select="@currency, format-number(., '#,##0.00')"/>
  </xsl:template>

This would output the example <amount-due> element as «EUR 1,890.00». (The format-number() function is described in Chapter 7, on page 558.) The beauty of such a template rule is that it is highly reusable: however much the schema is extended to include new elements that hold amounts of money, this rule can be used to display them.
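For the type name money-amount to be recognized, the stylesheet has to import the schema, and the source document has to be validated so that its elements carry type annotations. A minimal sketch of a complete stylesheet follows; the file name invoice.xsd is an assumption, and a schema-aware XSLT 2.0 processor is required.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed location of the schema containing money-amount (a no-namespace schema) -->
  <xsl:import-schema schema-location="invoice.xsd"/>

  <!-- Applies to unit-price, amount-due, and any other element
       whose type annotation is money-amount or a type derived from it -->
  <xsl:template match="element(*, money-amount)">
    <xsl:value-of select="@currency, format-number(., '#,##0.00')"/>
  </xsl:template>

</xsl:stylesheet>
```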

Elements with Mixed Content

The type of an element that can contain child elements is called a complex type with complex content. Such types essentially fall into three categories, called empty content, mixed content, and element-only content. Mixed content allows intermingled text and child elements, and is often found in narrative XML documents, allowing markup such as:

  <para>The population of <city>London</city> reached
  <number>5,572,000</number> in <year>1891</year>, and had risen
  further to <number>7,160,000</number> by <year>1911</year>.</para>

The type of this element could be declared in a schema as:

  <xs:complexType name="para-type" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element ref="city"/>
      <xs:element ref="number"/>
      <xs:element ref="year"/>
    </xs:choice>
  </xs:complexType>

In practice, the list of permitted child elements would probably be much longer than this, and a common technique is to define substitution groups which allow a list of such elements to be referred to by a single name.
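As a sketch of that technique, the same content model could be expressed with a substitution group. The abstract head element «inline» and the types chosen for the three member elements are assumptions made for this illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical abstract head: never appears in a document itself -->
  <xs:element name="inline" abstract="true"/>

  <!-- Members of the group; the types here are illustrative choices -->
  <xs:element name="city"   type="xs:token" substitutionGroup="inline"/>
  <xs:element name="number" type="xs:token" substitutionGroup="inline"/>
  <xs:element name="year"   type="xs:gYear" substitutionGroup="inline"/>

  <!-- The content model now names only the head of the group -->
  <xs:complexType name="para-type" mixed="true">
    <xs:sequence>
      <xs:element ref="inline" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="para" type="para-type"/>

</xs:schema>
```

Adding a new inline element then means adding one declaration with substitutionGroup="inline", without touching para-type.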

Narrative documents tend to be less constrained than documents holding structured data such as purchase orders and invoices, and while schema validation is still very useful, the type annotations generated as a result of validation aren't generally so important when the time comes to process the data using XSLT: The names of the elements are usually more significant than their types. However, there is plenty of potential for using the types, especially if the schema is designed with this in mind.

When schemas are used primarily for validation, the tendency is to think of types in terms of the form that values assume. For example, it is natural to define the element <city> (as used in the example above) as a type derived from xs:token by restriction, because the names of cities are strings, perhaps consisting of multiple words, in which spaces are not significant. Once types start to be used for processing information (which is what you are doing when you use XSLT), it's also useful to think about what the value actually means. The content of the <city> element is not just a string of characters; it is the name of a geographical place, a place that has a location on the Earth's surface, that is in a particular country, and that may figure in postal addresses. If you have other similar elements such as <county>, <country>, and <state>, it might be a good idea to define a single type for all of them. Even if this type doesn't have any particular purpose for validation, because it doesn't define any extra constraints on the content, it can potentially be useful when writing XSLT templates because it groups a number of elements that belong together semantically.
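A minimal sketch of this idea is shown below, using an inline schema inside xsl:import-schema so that the example is self-contained. The type name «place-name» and the HTML output are assumptions; the source document would need to be validated against the same declarations by a schema-aware processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:import-schema>
    <xs:schema>
      <!-- Hypothetical type: adds no constraints beyond xs:token,
           but gives the four elements a common, meaningful type -->
      <xs:simpleType name="place-name">
        <xs:restriction base="xs:token"/>
      </xs:simpleType>

      <xs:element name="city"    type="place-name"/>
      <xs:element name="county"  type="place-name"/>
      <xs:element name="country" type="place-name"/>
      <xs:element name="state"   type="place-name"/>
    </xs:schema>
  </xsl:import-schema>

  <!-- One rule handles every element annotated as a place name -->
  <xsl:template match="element(*, place-name)">
    <span class="place"><xsl:value-of select="."/></span>
  </xsl:template>

</xsl:stylesheet>
```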

Elements with Element-Only Content

This category covers most of the "wrapper" elements that are found in data-oriented XML. A typical example is the outer <person> element in a structure such as:

  <person id="P517541">
    <name>
      <given>Michael</given>
      <given>Howard</given>
      <family>Kay</family>
    </name>
    <date-of-birth>1951-10-11</date-of-birth>
    <place-of-birth>Hannover</place-of-birth>
  </person>

The schema for this might be:

  <xs:element name="person" type="person-type"/>

  <xs:complexType name="person-type">
    <xs:sequence>
      <xs:element name="name" type="personal-name-type"/>
      <xs:element name="date-of-birth" type="xs:date"/>
      <xs:element name="place-of-birth" type="xs:token"/>
    </xs:sequence>
    <xs:attribute name="id" type="id-number"/>
  </xs:complexType>

  <xs:complexType name="personal-name-type">
    <xs:sequence>
      <xs:element name="given" maxOccurs="unbounded" type="xs:token"/>
      <xs:element name="family" type="xs:token"/>
    </xs:sequence>
  </xs:complexType>

  <xs:simpleType name="id-number">
    <xs:restriction base="xs:ID">
      <xs:pattern value="[A-Z][0-9]{6}"/>
    </xs:restriction>
  </xs:simpleType>

There are a number of ways these definitions could have been written. In the so-called Russian Doll style, the types would be defined inline within the element declarations, rather than being given separate names of their own. The schema could also have been written using more top-level element declarations; for example, the <name> element could have been declared at the top level. When you use a schema for validation, these design decisions mainly affect your ability to reuse definitions later when the schema changes. When you use a schema to describe types that can be referenced in XSLT stylesheets, however, they also affect the ease of writing the stylesheet.
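For comparison, here is a sketch of how the same rules might look in the Russian Doll style. It is intended to accept the same <person> documents, but none of the types has a name, so none of them can be referenced from elsewhere in the schema or from a stylesheet.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- anonymous type nested inside the element declaration -->
        <xs:element name="name">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="given" type="xs:token" maxOccurs="unbounded"/>
              <xs:element name="family" type="xs:token"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="date-of-birth" type="xs:date"/>
        <xs:element name="place-of-birth" type="xs:token"/>
      </xs:sequence>
      <!-- the attribute's simple type is also anonymous -->
      <xs:attribute name="id">
        <xs:simpleType>
          <xs:restriction base="xs:ID">
            <xs:pattern value="[A-Z][0-9]{6}"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

</xs:schema>
```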

In choosing the representation of the schema shown above, I made a number of implicit assumptions:

It's quite likely that there will be other elements with the same structure as <person>, or with an extension of this structure: perhaps not at the moment, but at some time in the future. Examples of such elements might be <employee> or <pensioner>. Therefore, it's worth describing the element and its type separately.
Similarly, personal names are likely to appear in a number of different places. Elements with this type won't always be called <name>, so it's a good idea to create a type definition that can be referenced from any element.
Not every element called <name> will be a personal name; the same tag might also be used (even in the same namespace) for other purposes. If I were confident that the tag would always be used for personal names, then I would probably have made it the subject of a top-level element declaration, rather than defining it inline within the <person> element.
The elements at the leaves of the tree (those with simple types) such as <date-of-birth>, <place-of-birth>, <given>, and <family> are probably best defined using local element declarations rather than top-level declarations. Even if they are used in more than one container element, there is relatively little to be gained by pulling the element declarations out to the top level. The important thing is that if any of them have a user-defined type (which isn't the case in this example) then the user-defined types are defined using top-level <xs:simpleType> declarations. This is what I have done for the id attribute (which is defined as a subtype of xs:ID, forcing values to be unique within any XML document), but I chose not to do the same for the leaf elements.
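The first of these assumptions can be sketched as follows: because person-type has a name, a later <employee> element can extend it. The employee-type name and the <employee-number> child are hypothetical, and the fragment assumes it sits in the same schema as the definitions shown earlier.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- person-type, personal-name-type, and id-number as defined above go here -->

  <xs:element name="employee" type="employee-type"/>

  <!-- Hypothetical extension: everything a person has, plus one more child -->
  <xs:complexType name="employee-type">
    <xs:complexContent>
      <xs:extension base="person-type">
        <xs:sequence>
          <xs:element name="employee-number" type="xs:token"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

</xs:schema>
```

A template rule written for «element(*, person-type)» would then also apply to validated <employee> elements, since their type is derived from person-type.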

Substitution Groups

The type of an element or attribute tells you what can appear inside the content of the element or attribute. Substitution groups, by contrast, classify elements according to where they can appear.

There is a schema for XSLT 2.0 stylesheets published as part of the XSLT Recommendation (see http://www.w3.org/TR/xslt20). Let's look at how this schema uses substitution groups.

Firstly, the schema defines a type that is applicable to any XSLT-defined element, and that simply declares the standard attributes that can appear on any element:

  <xs:complexType name="generic-element-type">
    <xs:attribute name="extension-element-prefixes" type="xsl:prefixes"/>
    <xs:attribute name="exclude-result-prefixes" type="xsl:prefixes"/>
    <xs:attribute name="xpath-default-namespace" type="xs:anyURI"/>
    <xs:attribute ref="xml:space"/>
    <xs:attribute ref="xml:lang"/>
    <xs:anyAttribute namespace="##other" processContents="skip"/>
  </xs:complexType>

There's a good mix of features used to define these attributes. Some attributes use built-in types (xs:anyURI), some use user-defined types defined elsewhere in the schema (xsl:prefixes), and two of them (xml:space and xml:lang) are defined in a schema for a different namespace. The <xs:anyAttribute> at the end says that XSLT elements can contain attributes from a different namespace, which are not validated. (Perhaps it would be better to specify lax validation, which would validate the attribute if and only if a schema is available for it.)
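The lax alternative mentioned in the parenthesis would differ only in the processContents value. This is a sketch of that suggestion, not what the published schema contains:

```xml
<!-- Alternative wildcard: validate foreign attributes only when a declaration for them is available -->
<xs:anyAttribute xmlns:xs="http://www.w3.org/2001/XMLSchema"
                 namespace="##other" processContents="lax"/>
```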

Every XSLT element except the <xsl:output> element allows a standard version attribute (the <xsl:output> element is different because its version attribute is defined for a different purpose and has a different type). So the schema defines another type that adds this attribute:

  <xs:complexType name="versioned-element-type">
    <xs:complexContent>
      <xs:extension base="xsl:generic-element-type">
        <xs:attribute name="version" type="xs:decimal" use="optional"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

The XSLT specification classifies many XSLT elements as instructions. This is not a structural distinction based on the attributes or content model of these elements (which varies widely); it is a distinction based on the way they are used. In particular, instruction elements are interchangeable in terms of where they may appear in a stylesheet: if you can use one instruction in a particular context, you can use any instruction. This calls for defining a substitution group:

  <xs:element name="instruction"
              type="xsl:versioned-element-type"
              abstract="true"/>

Note that although the substitution group is defined using an element declaration, it does not define a real element, because it specifies «abstract="true"». This means that an actual XSLT stylesheet will never contain an element called <xsl:instruction>. It is a fictional element that exists only so that others can be substituted for it.

What this declaration does say is that every element in the substitution group for <xsl:instruction> must be defined with a type that is derived from xsl:versioned-element-type. That is, every XSLT instruction allows the attributes extension-element-prefixes, exclude-result-prefixes, xpath-default-namespace, xml:space, xml:lang, and version. This is in fact the only thing that XSLT instructions have in common with each other, as far as their permitted content is concerned.

Individual instructions are now defined as members of this substitution group. Here is a simple example, the declaration of the <xsl:if> element:

  <xs:element name="if" substitutionGroup="xsl:instruction">
    <xs:complexType>
      <xs:complexContent mixed="true">
        <xs:extension base="xsl:sequence-constructor">
          <xs:attribute name="test" type="xsl:expression" use="required"/>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:element>

This shows that the <xsl:if> element is a member of the substitution group whose head is the abstract <xsl:instruction> element. It also tells us that the content model of the element (that is, its type) is defined as an extension of the type xsl:sequence-constructor, the extension being to require a test attribute whose value is of type xsl:expression. This is a simple type defined later in the same schema, representing an XPath expression that may appear as the content of this attribute.

The type xsl:sequence-constructor is used for all XSLT elements whose permitted content is a sequence constructor. A sequence constructor is simply a sequence of zero or more XSLT instructions, defined like this:

  <xs:complexType name="sequence-constructor">
    <xs:complexContent mixed="true">
      <xs:extension base="xsl:versioned-element-type">
        <xs:group ref="xsl:sequence-constructor-group" minOccurs="0"
                  maxOccurs="unbounded"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:group name="sequence-constructor-group">
    <xs:choice>
      <xs:element ref="xsl:variable"/>
      <xs:element ref="xsl:instruction"/>
      <xs:group ref="xsl:result-elements"/>
    </xs:choice>
  </xs:group>

The first definition says that the xsl:sequence-constructor type extends xsl:versioned-element-type, whose definition we gave earlier. If it didn't extend this type, we wouldn't be allowed to put <xsl:if> in the substitution group of <xsl:instruction>. It also says that the content of a sequence constructor consists of zero or more elements, each of which must be chosen from those in the group sequence-constructor-group. The second definition says that every element in sequence-constructor-group is an <xsl:instruction> (which implicitly allows any element in the substitution group for <xsl:instruction>, including of course <xsl:if>), an <xsl:variable>, or an element permitted by the referenced group xsl:result-elements.

The <xsl:variable> element is not defined as a member of the substitution group because it can be used in two different contexts: either as an instruction or as a top-level declaration in a stylesheet. This is one of the drawbacks of substitution groups: they can't overlap. The schema defines all the elements that can act as declarations in a very similar way, using a substitution group headed by an abstract <xsl:declaration> element. It's not possible for the same element, <xsl:variable>, to appear in both substitution groups, so it has been defined in neither, and needs to be treated as a special case.

If you need to use XSLT to access an XSLT stylesheet (which isn't as obscure a requirement as it may seem; there are many applications for this) then the classification of elements as instructions or declarations can be very useful. For example, you can match all the instructions that have an attribute in the Saxon namespace with the template rule:

  <xsl:template match="schema-element(xsl:instruction)[@saxon:*]">

assuming that the namespace prefix «saxon» has been declared appropriately. Here the expression «schema-element(xsl:instruction)» selects elements that are either named <xsl:instruction>, or are in the substitution group with <xsl:instruction> as its head element, and the expression «[@saxon:*]» is a filter that selects only those elements that have an attribute in the «saxon» namespace.
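Put into context, a complete stylesheet using this rule might look roughly like the sketch below. The schema-location value is an assumed local copy of the schema for XSLT 2.0, the body of the template is invented for illustration, and the stylesheet being analyzed must itself be validated against that schema by a schema-aware processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/">

  <!-- Assumed file name for a local copy of the schema for XSLT 2.0 stylesheets -->
  <xsl:import-schema namespace="http://www.w3.org/1999/XSL/Transform"
                     schema-location="schema-for-xslt20.xsd"/>

  <!-- Report every instruction that carries at least one saxon:* attribute -->
  <xsl:template match="schema-element(xsl:instruction)[@saxon:*]">
    <xsl:message>
      <xsl:value-of select="name()"/>
      <xsl:text> uses Saxon attributes: </xsl:text>
      <xsl:value-of select="@saxon:*/name()"/>
    </xsl:message>
    <!-- Continue into nested instructions -->
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```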

The penalty of choosing a real schema for our example is that we have to live with its complications. As we saw earlier, the <xsl:variable> element isn't part of this substitution group. So we might have to extend the query to handle <xsl:variable> elements as well. We can do this by writing:

  <xsl:template match="(schema-element(xsl:instruction)|xsl:variable)[@saxon:*]">

A detailed explanation of this match pattern can be found in Chapter 6.

So, to sum up this section, substitution groups are not only a very convenient mechanism for referring to a number of elements that can be substituted for each other in the schema, but can also provide a handy way of referring to a group of elements in XSLT match patterns. But they do have one limitation, which is that elements can only belong directly to one substitution group (or to put it another way, substitution groups must be properly nested; they cannot overlap).

At this point I will finish the lightning tour of XML Schema. The rest of the chapter builds on this understanding to show how the types defined in an XML Schema can be used in a stylesheet.