Simple Schema Data Types | Developing XML Solutions (DV-MPS General)

For the most part, XML documents fall into two categories: document-oriented and data-oriented. The document-oriented XML document contains text sections mixed with field data, whereas the data-oriented XML document contains only field data. The XHTML document we created in Chapter 5 is an example of a document-oriented XML document. Another example of a document-oriented XML document is a message such as the one shown here:

 <message priority="high" date="2000-01-11">
   <from>Jake Sturm</from>
   <to>Gwen Sturm</to>
   <subject>DNA Course</subject>
   <body>
     The new DNA course that we are offering is now complete. It will
     provide a complete overview discussion of designing and building
     DNA systems, including DNS, DNA, COM, and COM+. The course is also
     listed on the Web site, at http://ies.gti.net.
   </body>
 </message>

This message has a large text body, but it also contains attributes, in this case date and priority. The date attribute has a date data type, and the priority attribute has an enumerated data type. It will be useful to be able to validate that these attributes are correctly formatted for these two data types. Schemas will allow you to do this.

A data-oriented document looks like this:

 <bill>
   <OrderDate>2001-02-11</OrderDate>
   <ShipDate>2001-02-12</ShipDate>
   <BillingAddress>
     <name>John Doe</name>
     <street>123 Main St.</street>
     <city>Anytown</city>
     <state>NY</state>
     <zip>12345-0000</zip>
   </BillingAddress>
   <voice>555-1234</voice>
   <fax>555-5678</fax>
 </bill>

This entire document contains data fields that will need to be validated. Validating data fields is an essential aspect of this type of XML document. We'll look at an example schema for a data-oriented document in the section "A Schema for a Data-Oriented XML Document" later in this chapter.

Up to this point, we've been looking only at document-oriented XML documents that contain only one data type (the string data type) because DTDs work best with document-oriented XML documents that contain only string data types. Because schemas allow you to validate data type information, it's time now to take a look at data types as they are defined in the schema specification.

The term data type is defined in the second schema standard, which can be found at http://www.w3.org/TR/xmlschema-2/. A data type represents a type of data, such as a string, an integer, and so on. The second schema standard defines simple data types in detail, and that's what we'll look at in this section.

The Components of a Schema Data Type

In a schema, a data type has three parts: a value space, a lexical space, and a set of facets. The value space is the set of acceptable values for a data type. The lexical space is the set of valid literals, that is, the ways in which a value of the data type can be written. For example, 100 and 1.0E2 are two different literals, but both denote the same floating point value. A facet is a characteristic of the data type. A data type can have many facets, each defining one or more characteristics. Facets specify how one data type differs from other data types, and together they define the value space for the data type.
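To make the difference between the lexical space and the value space concrete, here is a minimal sketch in Python. It is not part of the schema specification; Python's own float parser is used only as a stand-in for a schema processor, on the assumption that it reads these two literals the same way.

```python
# Two different literals (lexical space) ...
literals = ["100", "1.0E2"]

# ... converted to the values they denote (value space).
values = [float(literal) for literal in literals]

# The literals differ as strings, but they denote the same value.
same_literal = literals[0] == literals[1]
same_value = values[0] == values[1]
print(same_literal, same_value)
```

The literals compare as different strings, while the values they denote compare as equal.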

There are two kinds of facets: fundamental and constraining. Fundamental facets define the data type, and constraining facets place constraints on the data type. Examples of fundamental facets are rules specifying an order for the elements, a maximum or minimum allowable value, the finite or infinite nature of the data type, whether the instances of the data type are exact or approximate, and whether the data type is numeric. Constraining facets can include the limit on the length of a data type (number of characters for a string or number of bits for a binary data type), minimum and maximum lengths, enumerations, and patterns.

We can categorize the data types along several dimensions. First, data types can be atomic or aggregate. An atomic data type cannot be divided. An integer value or a date that is represented as a single character string is an atomic data type. If a date is presented as day, month, and year values, the date is an aggregate data type.

Data types can also be distinguished as primitive or generated. Primitive data types are not derived from any other data type; they are predefined. Generated data types are built from existing data types, called basetypes. Basetypes can be primitive or generated data types. Generated types, which will be discussed later in the chapter, can be either simple or complex data types.

Primitive data types include the following: string, Boolean, float, decimal, double, timeDuration, recurringDuration, binary, and uri. There is also the timeInstant data type, which is derived from the recurringDuration data type. Two of these primitive data types are specific to XML schemas: timeDuration and recurringDuration. The timeInstant data type is also specific to XML. Let's have a look at them here.

The timeInstant data type represents a combination of date and time values that identifies a specific instant of time. The pattern is shown here:

 CCYY-MM-DDThh:mm:ss.sss

CC represents the century, YY is the year, MM is the month, and DD is the day. The whole representation can be preceded by an optional leading sign to indicate a negative number. If the sign is omitted, a plus sign (+) is assumed. The letter T is the date/time separator, and hh, mm, and ss.sss represent the hour, minute, and second values. Additional digits can be used to increase the precision of fractional seconds if desired. To accommodate year values greater than 9999, digits can be added to the left of this representation.

The timeInstant representation can be immediately followed by a Z to indicate Coordinated Universal Time (UTC). Otherwise, time zone information is expressed as the difference between the local time and UTC. It is specified immediately following the time and consists of a plus or minus sign (+ or -) followed by hh:mm.
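The timeInstant pattern described above can be sketched as a regular expression. The following Python fragment is only an illustration under stated assumptions: the names TIME_INSTANT and parse_time_instant are hypothetical, and the expression checks the shape of the literal only (it does not verify that the month is between 01 and 12, for example).

```python
import re

# Hypothetical pattern for CCYY-MM-DDThh:mm:ss.sss with an optional leading
# sign, optional extra year digits, optional fractional seconds, and an
# optional time zone (Z or +hh:mm / -hh:mm).
TIME_INSTANT = re.compile(
    r"^(?P<sign>[+-]?)"
    r"(?P<year>\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}(?:\.\d+)?)"
    r"(?P<zone>Z|[+-]\d{2}:\d{2})?$"
)


def parse_time_instant(text):
    """Return the named parts of a timeInstant literal, or None if the shape is wrong."""
    match = TIME_INSTANT.match(text)
    return match.groupdict() if match else None


parts = parse_time_instant("2000-01-11T13:20:00.000-05:00")
print(parts["year"], parts["second"], parts["zone"])
```

A literal that carries only a date, such as 2000-01-11, does not match this pattern, because the T separator and the time portion are required.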

The timeDuration data type represents some duration of time. The pattern for timeDuration is shown here:

 PyYmMdDThHmMsS

In this pattern, the lowercase letters stand for numbers and the uppercase letters are fixed designators: yY is the number of years, mM is the number of months, dD is the number of days, T is the date/time separator, hH is the number of hours, mM is the number of minutes, and sS is the number of seconds. The P at the beginning indicates that this pattern represents a time period. The number of seconds can include decimal digits to arbitrary precision. An optional preceding minus sign is allowed to indicate a negative duration. If the sign is omitted, a positive duration is assumed.
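As a rough illustration of the timeDuration pattern, the following Python sketch splits a literal into its components. The names DURATION and parse_time_duration are hypothetical, and the sketch assumes that components with a value of zero can be left out of the literal, which the pattern shown above does not state.

```python
import re
from decimal import Decimal

# Hypothetical pattern for PyYmMdDThHmMsS with an optional leading minus sign.
# Assumption: any component may be omitted, and T appears only before time parts.
DURATION = re.compile(
    r"^(?P<sign>-?)P"
    r"(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_time_duration(text):
    """Return a dict of duration components, or None if the literal is malformed."""
    match = DURATION.match(text)
    if match is None:
        return None
    groups = match.groupdict()
    sign = groups.pop("sign")
    # Reject literals such as "P" or "PT" that carry no component at all.
    if all(value is None for value in groups.values()):
        return None
    fields = {name: Decimal(value or "0") for name, value in groups.items()}
    fields["negative"] = sign == "-"
    return fields


duration = parse_time_duration("P1Y2M3DT10H30M12.5S")
print(duration["years"], duration["minutes"], duration["seconds"])
```

Decimal is used for every component so that fractional seconds keep whatever precision the literal carries.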

The recurringDuration data type represents a moment in time that recurs. The pattern for recurringDuration is the left-truncated representation for timeInstant. For example, if the CC century value is omitted from the timeInstant representation, that timeInstant recurs every hundred years. Similarly, if CCYY is omitted, the timeInstant recurs every year.

Every two-character unit of the representation that is omitted is indicated by a single hyphen (-). For example, to indicate 1:20 P.M. on May 31 of every year in Eastern Standard Time, which is 5 hours behind UTC, you would write the following:

 --05-31T13:20:00-05:00

Creating Simple Data Types

New simple data types can be created by using simpleType elements. A simplified version of the DTD declaration for the simpleType element is shown below. (For a complete declaration, see the schema specification at http://www.w3.org/TR/xmlschema-2/.)

 <!ENTITY % ordered ' (minInclusive | minExclusive) |
                      (maxInclusive | maxExclusive) |
                      precision | scale '>
 <!ENTITY % unordered 'pattern | enumeration | length | maxLength |
                       minLength | encoding | period'>
 <!ENTITY % facet '%ordered; | %unordered;'>

 <!ELEMENT simpleType ((annotation)?, (%facet;)*)>
 <!ATTLIST simpleType
     name      NMTOKEN                             #IMPLIED
     base      CDATA                               #REQUIRED
     final     CDATA                               ''
     abstract  (true | false)                      'false'
     derivedBy (list | restriction | reproduction) 'restriction'>

 <!ELEMENT annotation (documentation)>

 <!ENTITY % facetAttr 'value CDATA #REQUIRED'>
 <!ENTITY % facetModel '(annotation)?'>

 <!ELEMENT maxExclusive %facetModel;>
 <!ATTLIST maxExclusive %facetAttr;>
 <!ELEMENT minExclusive %facetModel;>
 <!ATTLIST minExclusive %facetAttr;>
 <!ELEMENT maxInclusive %facetModel;>
 <!ATTLIST maxInclusive %facetAttr;>
 <!ELEMENT minInclusive %facetModel;>
 <!ATTLIST minInclusive %facetAttr;>
 <!ELEMENT precision %facetModel;>
 <!ATTLIST precision %facetAttr;>
 <!ELEMENT scale %facetModel;>
 <!ATTLIST scale %facetAttr;>
 <!ELEMENT length %facetModel;>
 <!ATTLIST length %facetAttr;>
 <!ELEMENT minLength %facetModel;>
 <!ATTLIST minLength %facetAttr;>
 <!ELEMENT maxLength %facetModel;>
 <!ATTLIST maxLength %facetAttr;>
 <!-- This one can be repeated. -->
 <!ELEMENT enumeration %facetModel;>
 <!ATTLIST enumeration %facetAttr;>
 <!ELEMENT pattern %facetModel;>
 <!ATTLIST pattern %facetAttr;>
 <!ELEMENT encoding %facetModel;>
 <!ATTLIST encoding %facetAttr;>
 <!ELEMENT period %facetModel;>
 <!ATTLIST period %facetAttr;>

 <!ELEMENT documentation ANY>
 <!ATTLIST documentation
     source   CDATA #IMPLIED
     xml:lang CDATA #IMPLIED>

As you can see, the simpleType element, which represents a simple data type, can be either ordered or unordered. The values of an ordered type can be placed in a specific sequence. Nonnegative integers, for example, are ordered: you can start at 0 and continue to the maximum integer value. Unordered data types, such as Boolean, have values that cannot be placed in a sequence. Using the preceding DTD, you can create your own simple data types. These simple data types can then be used in your schemas to define elements and attributes.

Unordered data types include Boolean and binary data types. All of the numeric data types are ordered. Strings are ordered, but when you are defining your own string data types, they will be defined with the unordered elements.

For each data type, numerous possible child elements can be used to define the simpleType element. Each child element carries a value attribute that holds the value of the facet, and it can contain an optional annotation. The child elements define facets for the data types you create.

Let's look now at how to create simple data types using ordered and unordered facets.

Using ordered facets

Notice that in the previous code listing, ordered facets consist of the following facets: maxExclusive, minExclusive, maxInclusive, minInclusive, precision, and scale. The value of maxExclusive is the smallest value for the data type outside the upper bound of the value space for the data type. The value of minExclusive is the largest value for the data type outside the lower bound of the value space for the data type. Thus, if you wanted to have an integer data type with a range of 100 to 1000, the value of minExclusive would be 99 and the value of maxExclusive would be 1001. The simple data type could be declared as follows:

 <simpleType name="limitedInteger" base="integer">
   <minExclusive value="99"/>
   <maxExclusive value="1001"/>
 </simpleType>

The minInclusive and maxInclusive facets work in the same way as minExclusive and maxExclusive, except that the minInclusive value is the lower bound of the value space for a data type, and the maxInclusive is the upper bound of the value space for a data type. Our simple data type could be rewritten as follows:

 <simpleType name="limitedInteger" base="integer">
   <minInclusive value="100"/>
   <maxInclusive value="1000"/>
 </simpleType>
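To see that the exclusive and inclusive forms of limitedInteger accept the same integers, consider the following Python sketch. The function names are hypothetical; they simply mirror the facet names and are not part of any schema processor.

```python
def in_exclusive_range(value, min_exclusive, max_exclusive):
    """True if value lies strictly between the two exclusive bounds."""
    return min_exclusive < value < max_exclusive


def in_inclusive_range(value, min_inclusive, max_inclusive):
    """True if value lies between the two inclusive bounds, endpoints included."""
    return min_inclusive <= value <= max_inclusive


# Check the boundary values of the 100 to 1000 range from the text.
results = {}
for candidate in (99, 100, 1000, 1001):
    results[candidate] = (
        in_exclusive_range(candidate, 99, 1001),
        in_inclusive_range(candidate, 100, 1000),
    )
    print(candidate, results[candidate])
```

For integers the two declarations are interchangeable. For a basetype with fractional values they would not be: 99.5 is greater than the exclusive bound of 99 but less than the inclusive bound of 100.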

Precision is the number of digits that will be used to represent a number. The scale, which cannot be greater than the precision, represents the number of digits that will appear to the right of the decimal place. For example, a data type that does not go above but includes 1,000,000 and that has two digits to the right of the decimal place (1,000,000.00) has a precision of 9 (ignore commas and decimals) and a scale of 2. The declaration would look as follows:

 <simpleType name="TotalSales" base="decimal">
   <minInclusive value="0"/>
   <maxInclusive value="1000000"/>
   <precision value="9"/>
   <scale value="2"/>
 </simpleType>

If you had left out the maxInclusive facet, numbers up to 9,999,999.99 would have been valid. If you had needed only values less than 1,000,000, the following declaration would have been sufficient:

 <simpleType name="TotalSales" base="decimal">
   <precision value="8"/>
   <scale value="2"/>
 </simpleType>
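The arithmetic behind precision and scale can be checked with a short Python sketch. The function name fits_precision_and_scale is hypothetical, and the sketch assumes the reading used in the text above: at most scale digits to the right of the decimal place, and at most precision minus scale digits to the left. It handles finite numbers only.

```python
from decimal import Decimal


def fits_precision_and_scale(literal, precision, scale):
    """Check a finite decimal literal against the precision and scale limits."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    # Digits to the right of the decimal place.
    fraction_digits = max(-exponent, 0)
    # Digits to the left of the decimal place.
    integer_digits = max(len(digits) + exponent, 0)
    return fraction_digits <= scale and integer_digits + scale <= precision


# precision 9, scale 2: seven digits to the left, two to the right.
print(fits_precision_and_scale("1000000.00", 9, 2))
print(fits_precision_and_scale("10000000.00", 9, 2))

# precision 8, scale 2: six digits to the left, two to the right.
print(fits_precision_and_scale("999999.99", 8, 2))
print(fits_precision_and_scale("1000000.00", 8, 2))
```

Under this reading, a precision of 9 with a scale of 2 admits 1,000,000.00 but not 10,000,000.00, and a precision of 8 with a scale of 2 stops just short of 1,000,000.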

Now that you have learned how to use ordered facets to create simple data types, let's look at how to use unordered facets to create simple data types.

Using unordered facets

In the previous code, you can see that unordered facets are made up of the following facets: period, length, maxLength, minLength, pattern, enumeration, and encoding.

For time data types, you can use the period facet to define the frequency of recurrence of the data type. The period facet applies to recurringDuration data types, and its value is a timeDuration. Recurring values can also be listed explicitly. For example, if you wanted to create a special holiday data type that includes recognized U.S. holidays, you could enumerate the yearly recurring dates with the following declaration:

 <simpleType name="holidays" base="date">
   <annotation>
     <documentation>Some U.S. holidays</documentation>
   </annotation>
   <enumeration value='--01-01'>
     <annotation>
       <documentation>New Year's Day</documentation>
     </annotation>
   </enumeration>
   <enumeration value='--07-04'>
     <annotation>
       <documentation>Fourth of July</documentation>
     </annotation>
   </enumeration>
   <enumeration value='--12-25'>
     <annotation>
       <documentation>Christmas</documentation>
     </annotation>
   </enumeration>
 </simpleType>

When you use the length facet, every value of the data type must have exactly that length. Using length, you can create fixed-length strings. The maxLength facet represents the maximum length a data type can have. The minLength facet represents the minimum length a data type can have. Using minLength and maxLength, you can define a variable-length string that can be as short as minLength and as long as maxLength.
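The three length facets can be pictured as simple checks on the length of a string. The following Python sketch uses a hypothetical function, check_length, whose keyword arguments mirror the facet names; the sample values are borrowed from the state and zip elements of the bill document shown earlier, and the limits chosen for them are assumptions made for illustration.

```python
def check_length(value, length=None, min_length=None, max_length=None):
    """Return True if the string satisfies every length facet that is given."""
    size = len(value)
    if length is not None and size != length:
        return False  # fixed-length facet violated
    if min_length is not None and size < min_length:
        return False  # too short
    if max_length is not None and size > max_length:
        return False  # too long
    return True


# A fixed-length string: a two-letter state code.
print(check_length("NY", length=2))

# A variable-length string: a zip code of 5 to 10 characters.
print(check_length("12345-0000", min_length=5, max_length=10))
print(check_length("1234", min_length=5, max_length=10))
```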

The pattern facet is a constraint on the value space of the data type that is achieved by constraining the lexical space (the valid literals). The enumeration facet limits the value space to a set of values. The encoding facet is used for binary types, which can be encoded as either hex or base64. In addition to containing facets, simple data types also have a set of attributes that can be used to define the data type. Let's now take a look at these attributes.

Attributes for simple data types

Notice in the DTD shown earlier that the simpleType element has the following attributes: name, base, abstract, final, and derivedBy. The name attribute is the name of the data type being defined. The base attribute is the basetype that is being used to define the new type; it can be either a built-in type or a user-defined type. The final attribute is discussed in detail later in this chapter. The abstract attribute of a data type is beyond the scope of this book. For more information about this attribute, refer to the schema specification.

The derivedBy attribute can be set to list, restriction, or reproduction. The list value allows you to create a data type that consists of a list of items separated by white space. For example, you can use the following declaration to create a list data type:

 <simpleType name='StringList' base='string' derivedBy='list'/>

This data type can then be used in an XML document to create a new list type, as shown here:

 <myListElement xsi:type='StringList'>
   This is not list item 1.
   This is not list item 2.
   This is not list item 3.
 </myListElement>

By using xsi:type, you override the default declaration of myListElement and make it a StringList data type. Since a StringList data type contains a list of strings, you can now use a list of strings as content for myListElement. The xsi namespace will be discussed in more detail later in the chapter.
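Because the items of a list data type are separated by white space, each word in the content of myListElement is its own item, which is why the sample text says "This is not list item 1." A minimal Python sketch makes the point; it assumes that splitting on white space is a fair stand-in for how a schema processor would break up the list.

```python
# The content of myListElement from the example above.
content = """
    This is not list item 1.
    This is not list item 2.
    This is not list item 3.
"""

# Split on any run of white space, as a list data type would.
items = content.split()

print(len(items))  # 18
print(items[:6])   # ['This', 'is', 'not', 'list', 'item', '1.']
```

The three sentences produce 18 list items rather than 3. If an item must contain spaces, a list derived from string is the wrong tool for the job.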

Up to this point, we have been discussing the XML schema 2 specification, which covers simple data types. The XML schema 1 specification covers all the general issues involving schemas and also covers complex data types. Let's now take a look at the complex data types described in the first schema specification.