2.3 XML Schema | Programming Web Services with Perl

The XML Schema language is the specification that the W3C developed to replace the DTD as the preferred way to describe the content and structure of XML documents. While the DTD still has a well-established place in XML technologies, schemas are being used by more and more applications, and acceptance of the XML Schema format continues to grow at a steady pace.

In general, you have to read this section only if you're planning to read the chapters detailing the low-down and dirty details of SOAP, WSDL, and UDDI. Those three standards build heavily on the XML Schema language. If you're planning on letting toolkits do all the heavy lifting for you, you can flip straight past this to Chapter 3 and enjoy the simple life.

2.3.1 Why Replace the DTD?

The main argument against the DTD is simple: it isn't XML. The DTD structure was inherited from HTML's roots in SGML, which itself was designed to solve a much wider range of problems. Thus, the syntax and structure of the DTD have to manage and support a degree of flexibility that XML itself doesn't use or need.

The DTD still has some benefits over XML Schema:

A DTD is generally simpler and smaller in size than the schema describing the same structure.
XML Schema doesn't provide a way to define named text entities, such as &eacute; for the character é.
While being an XML application is a boon for XML Schema, the selection of available tools still heavily favors the DTD. This factor can be expected to change over time, however.

The main area in which XML Schema wins out over the DTD is in expressing more complex structures and relationships. While a DTD can express the same level of complexity, the complexity of the DTD itself grows at an alarming rate. The XML Schema language has very rich support not just for defining elements and types themselves, but also for defining them by extending and expanding upon existing structures.

The schema approach can also define the format of, and constraints on, the data in elements. Beyond simply stating whether an element contains character data, other elements, or a mix, this ability means putting specific limitations on what is allowed for an element's content. For example, if an element is declared as being of type integer (one of the predefined basic types, covered later), then when a document is validated against the schema, the content must be a numerical value with no decimal (or fractional) part. Types may define their constraints and conditions using a wide range of techniques.
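As a minimal sketch of that constraint, consider a tiny schema that declares one element of type integer. The element name count is hypothetical and used only for illustration; the xsd prefix is bound to the XML Schema namespace described later in this section.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "count" is a hypothetical element name used only for illustration -->
  <xsd:element name="count" type="xsd:integer"/>

  <!-- Validates:        <count>42</count>    <count>-7</count> -->
  <!-- Fails validation: <count>42.5</count>  <count>forty-two</count> -->
</xsd:schema>
```

A DTD could say only that count contains character data; it has no way to reject the last two instances.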

2.3.1.1 Document structure versus data structure

A schema can define not only the components of the overall document structure but can also describe complex structured data. The data structures may then be used to describe the document itself. These definitions are distinct from each other, making it possible to have both a dataRecord that is an element, and a dataRecord that describes a complex data structure.
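The following sketch shows how the two kinds of definition can coexist. It assumes a made-up namespace (urn:example:records) and made-up child elements; the point is only that the type named dataRecord and the element named dataRecord don't collide, because element and type names are kept in separate symbol spaces.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:ex="urn:example:records"
            targetNamespace="urn:example:records">

  <!-- A dataRecord that describes a data structure (a type) -->
  <xsd:complexType name="dataRecord">
    <xsd:sequence>
      <xsd:element name="id" type="xsd:integer"/>
      <xsd:element name="label" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- A dataRecord that is an element, described by the type above -->
  <xsd:element name="dataRecord" type="ex:dataRecord"/>
</xsd:schema>
```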

It is this data-centric approach that truly separates schemas from DTDs. With a DTD, the structural relationship can be described, but the distinction between data and description is lost. There is nothing that classifies the elements themselves within a DTD, so all definitions are treated exactly the same way.

2.3.1.2 Understanding more about XML Schema

The rest of the material in this chapter covers the high points of the schema language, but complete coverage is beyond the scope of the book, let alone a single chapter. The schema specifications are on the W3C's web site (http://www.w3c.org), and there is also an O'Reilly book devoted exclusively to the topic, XML Schema. The actual coverage of the schema language in this book is limited to the topics related to SOAP and its aggregate technologies such as WSDL (covered in more detail in Chapter 9).

2.3.2 Schema Components

A schema is built up from a collection of components. These components are classified in the formal specification as belonging to one of three groups, as illustrated in Table 2-6. The primary components are those most often seen in practice, but depending on the role the schema is intended to play, not all the primary components may be present.

Schemas can be used with other technologies such as WSDL to provide definitions of datatypes without actually defining the overall document structure. In fact, WSDL uses a schema description to describe its overall document layout but defers the definitions of types to an XML application outside of WSDL, such as XML Schema itself.

Table 2-6. Groups of XML Schema components

Group	Types of components
Primary components	Simple type definitions; complex type definitions; attribute declarations; element declarations
Secondary components	Attribute group definitions; identity-constraint definitions; model group definitions; notation declarations
"Helper" components	Annotations; model groups; particles; wildcards; attribute uses

A schema document uses a top-level container element called schema and the namespace http://www.w3.org/2001/XMLSchema. Some schema documents may refer to the earlier namespace, http://www.w3.org/2000/10/XMLSchema, but the newer one should be used for any new descriptions being written.
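A bare skeleton of such a document might look like the following sketch. The xsd prefix is only a common convention, and the targetNamespace value is a placeholder chosen for this illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:example:skeleton">
  <!-- Type definitions and element and attribute declarations go here -->
</xsd:schema>
```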

2.3.2.1 The predefined simple types

Before exploring the tools by which a schema author defines a document structure, you need to understand the basic types that XML Schema provides. All other definitions, whether they describe elements or attributes or new types, must reference some existing type (either supplied by XML Schema or declared elsewhere within the document). All new types are extensions of, or restrictions upon, existing types.

The specification for XML Schema has two parts, the second of which ^[2] is focused specifically on the datatypes that are provided as a starting point. The types themselves reflect the general functionality of a broad set of programming languages and tools. They exist both as components to be used directly within a schema and as building blocks from which to derive more specific types as the need presents itself.

^[2] http://www.w3.org/TR/xmlschema-2/

The full list of predefined basic types is too long to reproduce here in full. Table 2-7 highlights the list of types, showing both obvious basic concepts (numeric variations) and the more detailed and specialized types (date expressions, URIs).

Table 2-7. A subset of types defined by XML Schema

Type	Example	Description
`string`	Any character-based sequence	The `string` type covers the range of character data, allowing for facets that define properties such as minimum or maximum length, etc.
`boolean`	`true`, `false`, `1`, `0`	The `boolean` type illustrates a primitive that is an enumeration, allowing only the four values shown as examples.
`decimal`	123.456, -5, +100.0	The `decimal` type is one of three primitive numerical types (the other two being `float` and `double`). It defines numbers in terms of digits to the left and right of a decimal point, and an optional sign. Exponential notations aren't permitted; they're left instead to the other numerical primitives.
`dateTime`	2002-07-19T09:16:58	The `dateTime` is one of the more detailed type specifications. It provides for dates in ISO 8601 format.
`anyURI`	http://www.oreilly.com	The `anyURI` field is just that: a kind of string that describes a URI, whether relative or absolute. It may include fragment specifications (reference parts), etc.
`base64Binary`	(No example; imagine a really large, Base64-encoded PNG image)	This type allows for defining elements that are used to contain binary data encoded using the well-known Base64 algorithm.
`integer`, `int`	1, 32768, etc.	These show two generations of derivation from the previous `decimal` type. The `integer` type is a restriction of `decimal` that has no decimal point or following digits. The `int` type further restricts this by limiting the range to between -2147483648 and 2147483647, inclusive. This is the range of a signed, 32-bit integer (the C `int`).
`NMTOKEN`	`_Name`, `longerName`, `soap:Envelope`	The `NMTOKEN` is an expression of that type from the XML specification itself. This is a derivation of `token`, which is a derivation of `normalizedString`, which is derived from `string`. It represents the same type of name that the `NMTOKEN` specification in a DTD does.

Note that the last two rows of examples represent derivations of earlier types. The specification of schema types provides 19 primitive types and 25 types derived from these primitives. Each type represents some important basic form from the perspective of XML or XML Schema. While some may seem to be redundant at first glance, a closer look reveals that they each have a distinct role to play.
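To show how these types are put to use, here is a small sketch that declares one element for several of the types in Table 2-7. The element names are invented for the illustration; the sample values in the comments are the ones from the table.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="title" type="xsd:string"/>       <!-- any character data -->
  <xsd:element name="published" type="xsd:boolean"/>  <!-- true, false, 1, 0 -->
  <xsd:element name="price" type="xsd:decimal"/>      <!-- 123.456 -->
  <xsd:element name="updated" type="xsd:dateTime"/>   <!-- 2002-07-19T09:16:58 -->
  <xsd:element name="home" type="xsd:anyURI"/>        <!-- http://www.oreilly.com -->
  <xsd:element name="pages" type="xsd:int"/>          <!-- 32768 -->
</xsd:schema>
```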

2.3.2.2 Primary components

The primary components are those that define elements, attributes, and types (both simple and complex). Type declarations aren't required to have names, but attributes and elements must be named.

Each type of component may be considered local or global, depending on where it falls within the schema document. Items that are direct children of the top-level schema element are considered global. Any components that are nested within other structures are considered local to their containing structure. Whether to make a particular component global or local is as much a part of the design process as choosing content itself. Global parts may be referenced and reused within other parts of the schema, whereas local parts can't. Global elements also have the advantage of being candidates for the top-level element in any document based on the schema. Likewise, keeping an element local instead of global may be a method of keeping it from being used as a top-level container.
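The distinction can be sketched with a short schema. The element names order, item, and note are hypothetical, and the schema has no target namespace so that the reference stays as simple as possible.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Global: a direct child of xsd:schema, so it may be a document's
       top-level element and may be referenced from elsewhere -->
  <xsd:element name="note" type="xsd:string"/>

  <!-- Also global -->
  <xsd:element name="order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local: usable only inside "order", never as a top-level element -->
        <xsd:element name="item" type="xsd:string"/>
        <!-- Reuse of the global "note" declaration -->
        <xsd:element ref="note"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A document validated against this sketch may have either order or note as its top-level element, but not item.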

2.3.2.3 Attributes

Attributes are declared with the attribute element, which itself uses attributes to provide the basic information about the new component. The attribute's name, type, and use are the most commonly seen attributes in these definitions, with the addition of ref, which serves a special function. The name and type attributes define the name of the new attribute, as well as its type. The type must be one of the simple types available to the schema, either from the basic set provided by the XML Schema specification or one defined elsewhere in the schema itself.

The name must conform to what is known as an NCName in schema terms. That is simply a name (following the usual character limitations) that doesn't contain a colon (:) anywhere within. This prevents conflict with the name possibly being referenced by full namespace qualification at some future point.

The use attribute defines the nature of the attribute's use within the element it (eventually) gets associated with: a value of optional (the default) means that the attribute's presence in the element is optional. A value of required is the opposite, requiring that the attribute always be present on the given element. The last value is prohibited, which keeps an attribute from being inherited when a new type is being derived from some existing type (normally, as with class inheritance, all the class information would propagate to the new derived class, or type as in this case).

The ref attribute allows a complex definition to refer to attributes defined in other parts of the schema. Example 2-3 illustrates how ref lets an attribute be defined only once while being used in many places. The ref attribute can also be used when declaring elements, so it will be seen in several places.
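The pieces described above can be sketched together as follows. All of the names here (revision, Document, Draft, and so on) are made up for the illustration, and the schema has no target namespace to keep the references short.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- A global attribute, declared once -->
  <xsd:attribute name="revision" type="xsd:positiveInteger"/>

  <xsd:complexType name="Document">
    <!-- Must always be present -->
    <xsd:attribute name="title" type="xsd:string" use="required"/>
    <!-- No "use" given, so the default of optional applies -->
    <xsd:attribute name="author" type="xsd:string"/>
    <!-- Refers to the global declaration instead of repeating it -->
    <xsd:attribute ref="revision"/>
  </xsd:complexType>

  <xsd:complexType name="Draft">
    <!-- The same global attribute, used a second time -->
    <xsd:attribute ref="revision" use="required"/>
  </xsd:complexType>
</xsd:schema>
```

Note that use appears only where the attribute is attached to a type; the global declaration itself carries just the name and type.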

2.3.2.4 Elements

Elements are defined using a tag called element. They also feature name and type attributes, as well as a host of others (that aren't usually present in simple WSDL or SOAP applications). An element declaration may also use the ref attribute to define a local instance of a globally defined element. The name must also be an NCName, as with attribute declarations. The other attributes that warrant mentioning control how often an element may occur: minOccurs and maxOccurs.

When an element is defined as a local part of a type declaration, these attributes can specify the minimum and maximum number of times the element can occur. If minOccurs is 0, the element isn't required to be present. If maxOccurs is set to unbounded, the element may appear as many times as desired. ^[3]

^[3] In earlier versions of XML Schema, the character * was also used to express an unbounded value for maxOccurs.

Since elements are generally expected to be more complex than attributes, the type of an element may be any defined type in the schema. This includes the basic types provided by the specification as well as any defined within the schema itself. The type attribute may be skipped if the type information is going to be given within the element block itself.
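A short sketch of these attributes at work, using invented element names: the playlist element omits the type attribute because its type is given inline, and its local elements use minOccurs and maxOccurs to control repetition. When either attribute is left out, its value defaults to 1.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="playlist">
    <!-- No type attribute: the type is defined within the element block -->
    <xsd:complexType>
      <xsd:sequence>
        <!-- Exactly once (minOccurs and maxOccurs both default to 1) -->
        <xsd:element name="title" type="xsd:string"/>
        <!-- Optional -->
        <xsd:element name="description" type="xsd:string" minOccurs="0"/>
        <!-- One or more -->
        <xsd:element name="track" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```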

2.3.2.5 Simple and complex types

Defining types is necessarily more complex than defining elements or attributes. There are two categories of type definition: simple and complex. Simple types don't allow attributes or nested elements within them. Complex types may have attributes, elements, data, or any combination of those components. Complex types may even be built up from other complex or simple types.

Simple types have one aspect that complex types don't: a simple type is always a derivation of some sort, based on a previous type. A new simple type may be defined by declaring itself a restriction of an existing type. This is much like subclassing. Alternatively, the new type may be defined as a list of some other type, or a union of several types. A single named type can be defined in terms of only one of these three derivations. Once defined, however, a type may itself serve as the basis for other type declarations that use any of the three methods.
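The three kinds of derivation can be sketched side by side. The type names below are hypothetical; the second and third types are derived from the first, each using a different method.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Restriction: an integer limited to the range 0 through 100 -->
  <xsd:simpleType name="Percentage">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- List: whitespace-separated Percentage values, such as "10 50 100" -->
  <xsd:simpleType name="PercentageList">
    <xsd:list itemType="Percentage"/>
  </xsd:simpleType>

  <!-- Union: either a Percentage or a name token, such as "42" or "unknown" -->
  <xsd:simpleType name="PercentageOrToken">
    <xsd:union memberTypes="Percentage xsd:NMTOKEN"/>
  </xsd:simpleType>
</xsd:schema>
```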

Types, simple or complex, aren't required to have names if the definition is made at a local level. Global types must have names, and these names may then be used to define the type of attributes or elements or even as the basis for defining other types. Like elements and attributes, the type declaration tags (simpleType and complexType) use an attribute called name to define this part of the component.

Complex types are at once more and less difficult than simple types. They are less difficult in that a complex type can be smaller than a simple type in the number of elements that declare it. However, you can do much more with complex types, so a complex type definition can easily be larger than a simple type definition.

A complex type can also define itself as a restriction of a base type (which must be a complex type as well). A new complex type may also define itself as an extension of an existing type, a concept that has no counterpart in simple types. Where a restriction generally narrows the scope of what a type can express, an extension adds more content to a complex type. Still, both are very similar in nature to the relationship between parent classes and subclasses in object-oriented programming languages, which was the intent of the developers of XML Schema.
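As a sketch of extension, the following schema defines a base type and a second type that adds content to it, much as a subclass adds members to its parent class. The names Person, Employee, and their child elements are invented for the example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- The base type -->
  <xsd:complexType name="Person">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Everything in Person, followed by one more element -->
  <xsd:complexType name="Employee">
    <xsd:complexContent>
      <xsd:extension base="Person">
        <xsd:sequence>
          <xsd:element name="employeeId" type="xsd:int"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>
```

An element of type Employee contains a name element followed by an employeeId element.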

A complex type is declared in some cases where it would seem that the type should be simple, such as when attributes are going to be part of the data representation. The complexType container has a child element called simpleContent that is used in these cases, when the main goal is to overcome the limitation of simpleType without defining an overly complex structure. A simpleContent container may declare itself as extending a base (simple) type, or restricting one.
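A common case is a plain value that needs to carry an attribute. The sketch below, with hypothetical names, extends the predefined decimal type so that a price can also state its currency.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Simple (decimal) content plus one attribute -->
  <xsd:complexType name="Price">
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <!-- Matches, for example: <price currency="USD">19.95</price> -->
  <xsd:element name="price" type="Price"/>
</xsd:schema>
```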

When it comes to defining a complex type in terms of complex content, there is a range of elements that can be used. These are summarized in Table 2-8.

Table 2-8. Complex type declaration elements

Element	Role	Notes
`sequence`	Defines a list (sequence) of content parts that are ordered with regard to each other.	While the order within a `sequence` is a part of its definition, elements within it may still be defined with a `minOccurs` of 0, allowing for individual elements to be optional.
`choice`	Defines a component that will be one of a set of specified choices. Only one choice appears at a time.	One at a time, but the element itself typed with the choice declaration may appear multiple times.
`all`	Like `sequence`, this declares a set of parts that appear together. However, order isn't mandated here, and each part may appear only exactly once or not at all.	The `all` content description is meant for cases in which other types would be too unwieldy. It has other restrictions besides the number limitation; the parts listed within an `all` may be only elements, not other types or other core components.

Three elements may not seem to present an intimidating number of combinations, until you consider the fact that all these may appear as subdescriptions within the types being declared by any other format (even all, which can only contain element components, may have elements with anonymous type declarations).
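The parenthetical case can be sketched as follows: an all group whose members are all elements, one of which carries its own anonymous type. The Address type and its children are hypothetical names.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Address">
    <!-- The children may appear in any order, each at most once -->
    <xsd:all>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
      <!-- Optional, with an anonymous type declared in place -->
      <xsd:element name="postcode" minOccurs="0">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:maxLength value="10"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:all>
  </xsd:complexType>
</xsd:schema>
```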

2.3.2.6 A unified example schema

Example 2-3 shows the use of elements, attributes, and cross references between them. It partially describes an XML syntax for expressing Concurrent Versions System (CVS) operations. Remember that this is only a fraction of the full expressiveness of XML Schema, but more than this is outside the scope of the book.

Example 2-3. XML Schema syntax samples

    <?xml version="1.0"?>
    <xsd:schema targetNamespace="urn:schema-samples"
                xmlns="urn:schema-samples"
                xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                elementFormDefault="qualified">

      <!-- A basic attribute declaration -->
      <xsd:attribute name="lang" type="xsd:string" />

      <!-- An attribute with local (anonymous) typing -->
      <xsd:attribute name="lines">
        <xsd:simpleType>
          <xsd:restriction base="xsd:unsignedInt">
            <xsd:maxInclusive value="256000" />
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>

      <!-- A simple type using enumeration -->
      <xsd:simpleType name="programming.lang">
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="C" />
          <xsd:enumeration value="Perl" />
          <xsd:enumeration value="PHP" />
          <xsd:enumeration value="Java" />
        </xsd:restriction>
      </xsd:simpleType>

      <!-- A more complex type, using some of the above: string content
           that also carries attributes -->
      <xsd:complexType name="SoftwareModule">
        <xsd:simpleContent>
          <xsd:extension base="xsd:string">
            <xsd:attribute name="code.language"
                           type="programming.lang" />
            <xsd:attribute name="comment.language"
                           type="xsd:string" />
            <xsd:attribute ref="lines" />
          </xsd:extension>
        </xsd:simpleContent>
      </xsd:complexType>

      <!-- Defining an element using the previous type at this
           level allows for a document referencing this schema
           to use the element as a top-level container -->
      <xsd:element name="code" type="SoftwareModule" />

      <!-- Now define an even more complex type -->
      <xsd:complexType name="CVS.Checkin">
        <xsd:sequence>
          <xsd:element name="module" type="xsd:string" />
          <xsd:element name="credentials">
            <xsd:complexType>
              <xsd:choice>
                <xsd:element name="pserver"
                             type="xsd:string" />
                <xsd:sequence>
                  <xsd:element name="name" type="xsd:string" />
                  <xsd:element name="password"
                               type="xsd:string" />
                </xsd:sequence>
              </xsd:choice>
            </xsd:complexType>
          </xsd:element>
          <xsd:element name="file" type="xsd:string" />
          <xsd:element ref="code" />
        </xsd:sequence>
      </xsd:complexType>

      <xsd:element name="checkin" type="CVS.Checkin" />
    </xsd:schema>

While the example is admittedly contrived, it does show how the various components can work together. Example 2-4 shows a simple XML document that follows this schema.

Example 2-4. A CVS operation in XML

    <?xml version="1.0"?>
    <checkin xmlns="urn:schema-samples">
      <module>perl-web-examples</module>
      <credentials><pserver>...</pserver></credentials>
      <file>perl/server.pl</file>
      <code code.language="Perl" comment.language="en-US">
        # The Perl code would go here
      </code>
    </checkin>

Both the code and checkin elements are declared at the top level, making them global definitions. The type declarations that define the elements are also global. The type that defines the code.language attribute, programming.lang, is defined as an enumeration. Because enumeration values must match exactly, including case, "perl" would not have been acceptable as a value for the attribute.

Note the nesting that takes place around the definition of the credentials element: the content of this element will be either an element called pserver that contains a string, or a sequence of two elements (name and password, both strings). Using the choice construct here also contributes to future flexibility: if the set of choices is extended to include SSH (Secure SHell) key information or digest-authentication tokens, the existing choices are still valid in all the documents that already exist.
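As a rough sketch of that kind of extension, the credentials declaration could gain a third alternative. The sshKey element and its base64Binary type are hypothetical additions made up for this illustration, and the declaration is shown on its own in a schema without a target namespace for brevity.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="credentials">
    <xsd:complexType>
      <xsd:choice>
        <!-- The two original alternatives, unchanged -->
        <xsd:element name="pserver" type="xsd:string"/>
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
          <xsd:element name="password" type="xsd:string"/>
        </xsd:sequence>
        <!-- Hypothetical new alternative; documents that use either
             of the choices above remain valid -->
        <xsd:element name="sshKey" type="xsd:base64Binary"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```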

2.3.3 XML Schema in SOAP and Related Areas

XML Schema is important not just because of the momentum it has in the effort to replace the DTD. The SOAP specification uses basic datatypes from the schema specifications. Service descriptions written using WSDL relegate the description of datatypes to an external application without specifying a specific one. In practice, though, XML Schema is the dialect of choice for WSDL type detail. Even the basic types in XML-RPC are based on the same precursor documents that led to XML Schema, such as the XML Data specification.

Fortunately, the core elements of schema declarations tend to be clear and self-explanatory. The specification also allows for a documentation element, called annotation, to be present at almost all levels of a schema document. ^[4] While there are a lot of pieces to the schema puzzle that aren't covered here, the basics that were presented should help you through the majority of schemas in general, and a good part of the structure of more complex schemas (even if repeated reference to the W3C specifications is necessary).

^[4] Subject, of course, to the limited inclination of developers to comment their code.
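For illustration, here is a sketch of annotation used at two levels, once for the schema as a whole and once for a single element. The documentation text and the file element are invented for the example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Documentation for the schema as a whole -->
  <xsd:annotation>
    <xsd:documentation>Sample declarations for a check-in request.</xsd:documentation>
  </xsd:annotation>

  <xsd:element name="file" type="xsd:string">
    <!-- Documentation attached to one declaration -->
    <xsd:annotation>
      <xsd:documentation>Path of the file, relative to the module root.</xsd:documentation>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
```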

Schemas aren't limited to SOAP and WSDL in their use. More and more, XML applications are being defined in terms of schema rather than DTD. The Electronic Business XML (ebXML) initiative, an alternative approach to web services (in place of SOAP) between businesses, uses multiple schema documents to define the structure of its multitiered architecture.