Simple Content Elements | Using XML with Legacy Business Applications

In this section we'll talk about Elements that are declared with no child Elements, that is, those with simple content models. The main thing we're concerned with in regard to these Elements is imposing restrictions on the built-in schema data types, reducing the set of allowable values.

Data types for Elements with simple content are specified in Part 2: Datatypes of the XML Schema Recommendation.

Schema Built-in Data Types

Schema language provides a very large set of data types. If you work with enough different schemas for long enough, you may eventually see all the data types used. However, the set I list below accounts for most you will see in common business documents.

string : This data type is just what its name implies. The important thing to note is that an Element with a type of string can contain any character that XML allows, including characters outside the ASCII range, in whatever encoding the instance document's prolog declares.
boolean : Logically true or false, but the full set of allowed values for this data type is true, false, 0, and 1.
decimal : This is a number that may have a fractional part, that is, values to the left and right of the decimal point. However, it need not have both. Float and double are similar but aren't seen as often in business document schemas. A leading sign is allowed but may be omitted if positive. Leading and trailing zeroes are optional. If the fractional part is zero, the decimal and trailing zeroes may be omitted unless they're needed to indicate precision.
integer : Again, this data type is just what the name implies. Specialized integer types include nonPositiveInteger, negativeInteger, long, int, and short.
date, time, and dateTime : These data types represent dates and times as specified by ISO 8601 (the International Organization for Standardization's representations of dates and times). The most commonly used forms are CCYY-MM-DD, hh:mm:ss.sss, and CCYY-MM-DDThh:mm:ss, respectively. Fractional seconds may be omitted, but seconds are required. The T separates the date from the time. A trailing Z indicates Coordinated Universal Time (UTC). Local time is indicated by following the seconds with an offset from UTC. For example, 2002-08-23T20:27:00-05:00 could indicate August 23, 2002, 8:27 PM U.S. Central Daylight Time. Be aware that an offset from UTC is not necessarily the same thing as a time zone, although a time zone could be inferred from other knowledge. For example, -05:00 could be either Central Daylight Time or Eastern Standard Time depending on the day of the year.
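A few sample values may make these lexical forms more concrete. The instance fragment below is only an illustrative sketch: the Element names are invented for this example and are not part of the working example's schemas. Each comment names the built-in type the Element is assumed to be declared with.

```xml
<!-- Hypothetical Element names; each value is a valid lexical form
     for the built-in type named in the comment beside it -->
<Samples>
  <Description>Café table, 4-seat</Description>   <!-- string -->
  <Taxable>true</Taxable>                         <!-- boolean: true, false, 1, or 0 -->
  <UnitPrice>+19.50</UnitPrice>                   <!-- decimal: optional sign, trailing zero kept -->
  <Discount>.5</Discount>                         <!-- decimal: nothing left of the point -->
  <Quantity>12</Quantity>                         <!-- integer -->
  <ShipDate>2002-08-23</ShipDate>                 <!-- date -->
  <CutoffTime>20:27:00</CutoffTime>               <!-- time -->
  <Received>2002-08-23T20:27:00-05:00</Received>  <!-- dateTime with an offset from UTC -->
  <Posted>2002-08-24T01:27:00Z</Posted>           <!-- the same instant expressed in UTC -->
</Samples>
```

Note that the last two values denote the same moment: 20:27 at an offset of -05:00 is 01:27 UTC on the following day.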

To reiterate, there are many other built-in data types, but the short list above encompasses those most commonly used. If you encounter a type that isn't on this list, look for the definition in the Schema Recommendation.

Don't let talk of value space, canonical representations, lexical space, lexical representations, constraining facets, and order relations put you off. Here are the meanings of the terms you need to know the most.

Value space : This is the logical set of values that a data type might have, independent of how it might look in a data stream. For example, the concept of positive integers represents a value space.
Lexical representation, lexical space : These deal with how the values look in an XML instance document. For the value space of positive integers, we have the set 1, 2, 3, 4, and so on. The value space consisting of the integer value 1 might have a lexical space of 01 and 001 in addition to 1.
Constraining facets : Schema designers can apply these to the built-in types when they want to limit the range of allowable values. Each facet is written as a child Element of a restriction. We'll talk about some of the most commonly used facets.
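To make the distinction between value space and lexical space concrete, here is a small instance fragment. It is a sketch under one assumption: the hypothetical Count Element is declared with the built-in integer type. All four Elements are valid, and all four carry the same value, the integer 1.

```xml
<!-- Hypothetical Elements, each assumed to be declared as xs:integer.
     Four different lexical representations, one value. -->
<Counts>
  <Count>1</Count>    <!-- the canonical representation -->
  <Count>01</Count>
  <Count>001</Count>
  <Count>+1</Count>
</Counts>
```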

Note that despite the rich set of types that schema language provides, you will still see people using constructs that don't take advantage of those provided. For example, instead of using the conventional boolean, I have seen a Yes No Indicator used with the values Y and N enumerated. I have also seen date types created to accommodate alphabetic month abbreviations or slash separators, such as 23-Aug-2002 or 8/23/02, respectively. Another example is a derived type for a date range to express concepts such as June 1, 2002, through July 31, 2002. Using the built-in date data type you could easily express this concept as:

    <OnSaleRange>
      <BeginDate>2002-06-01</BeginDate>
      <EndDate>2002-07-31</EndDate>
    </OnSaleRange>

However, to save themselves a few bytes in instance documents, some schema authors create their own derived types that allow them to say:

 <OnSaleRange>2002-06-01 - 2002-07-31</OnSaleRange>

Schema language makes it very easy to do things like this. As I said earlier, there are a thousand different ways to hang yourself.

What is much more common than such practices, however, is restricting the range of values allowed for the built-in data types. There are very good business reasons for doing this. We'll talk next about some of the types of restrictions you're likely to see.

Extending and Restricting Simple Types

Simple types can be either extended or restricted. Again, we don't change the content model of a simple type since simple types always have a simple content model, that is, no child Elements may be defined. We extend the simple type into a complex type by adding one or more Attributes. We restrict a simple type by using one of its constraining facets that reduces the range of allowable values. Extension of a simple type is done by using the schema language xs:extension Element. We'll see it used in the Attribute Declarations subsection below. Restriction is done by using the schema language xs:restriction Element. We'll see examples of this next. Both extension and restriction must specify a "base," that is, the existing data type from which you are deriving your new data type.

Setting a Maximum Length

It is very common to see restrictions on the length of string Elements. To avoid truncating alphanumeric values before inserting them into a database, some programmers create schemas that make sure no one can use a string long enough to be truncated. To restrict our column lengths we can create a String1024Type type derived from string, as follows.

Setting a Maximum Length in SimpleCSV2.xsd

    <xs:simpleType name="String1024Type">
      <xs:annotation>
        <xs:documentation>This user-defined type shows how we
            restrict our columns to 1024 characters
        </xs:documentation>
      </xs:annotation>
      <xs:restriction base="xs:string">
        <xs:maxLength value="1024"/>
      </xs:restriction>
    </xs:simpleType>

We use the xs:simpleType Element with its name Attribute. The xs:restriction Element identifies the restriction base as the built-in string data type. We use the xs:maxLength Element as the constraining facet, with a value Attribute of 1024.

Our column Elements then are assigned this user-defined type instead of the built-in string type.

 <xs:element name="Column01" type="String1024Type" minOccurs="0"/>

You can see this in context in SimpleCSV2.xsd.
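Length facets can also be combined. As a sketch only, the type below sets both a lower and an upper bound; the type name and the limits are hypothetical and do not appear in SimpleCSV2.xsd. For types derived from string, these facets count characters, not bytes.

```xml
<!-- Hypothetical type, not part of SimpleCSV2.xsd:
     a string that must have from 1 through 40 characters -->
<xs:simpleType name="NonEmptyString40Type">
  <xs:restriction base="xs:string">
    <xs:minLength value="1"/>
    <xs:maxLength value="40"/>
  </xs:restriction>
</xs:simpleType>
```

An Element of this type would reject both an empty value and a value of 41 characters or more.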

I use a common convention of suffixing type names with Type. This is purely for readability. Schema language allows you to name types pretty much anything (well, anything allowable for NCName, that is, an XML name with no colon, as defined in the Namespaces in XML Recommendation). A type may even have the same name as an Element that uses it. This is allowed because type names and Element names are kept in separate symbol spaces.
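Here is a minimal sketch of a type and an Element sharing a name. The name Column is hypothetical, and the fragment assumes a schema with no target namespace so that the unprefixed type reference resolves.

```xml
<!-- Hypothetical names: a simple type and a global Element both called "Column".
     Assumes the enclosing xs:schema declares no target namespace. -->
<xs:simpleType name="Column">
  <xs:restriction base="xs:string">
    <xs:maxLength value="1024"/>
  </xs:restriction>
</xs:simpleType>

<xs:element name="Column" type="Column"/>
```

It is legal, but as the Type suffix convention suggests, it is rarely the most readable choice.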

Setting Minimum and Maximum Values

Just as people want schemas to prevent them from truncating alphanumeric fields, they also want schemas to set upper limits on numeric fields so they don't cause truncations or overflows. Using our working example, we could restrict the zip code to a set of values as follows:

Setting Maximum and Minimum Values

    <xs:simpleType name="zipCodeType">
      <xs:annotation>
        <xs:documentation>Here we define a ZIP Code as an integer
            from 1 through 99999
        </xs:documentation>
      </xs:annotation>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="99999"/>
      </xs:restriction>
    </xs:simpleType>

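This type of minimum and maximum value restriction probably makes more sense for the amount due on an invoice, but you get the idea. For zip codes an integer range is a poor fit, since it accepts values such as 501 that lack the required five digits. The next subsection shows a better way to do it.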
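For the invoice case just mentioned, a range restriction might look something like the following. This is only a sketch: the type name, the limits, and the choice of two decimal places are assumptions made for illustration, not part of the working example.

```xml
<!-- Hypothetical type and limits, chosen only for illustration -->
<xs:simpleType name="AmountDueType">
  <xs:restriction base="xs:decimal">
    <xs:minInclusive value="0.00"/>
    <xs:maxInclusive value="9999999.99"/>
    <!-- Allow at most two digits to the right of the decimal point -->
    <xs:fractionDigits value="2"/>
  </xs:restriction>
</xs:simpleType>
```

With this type, 10.5 and 1250.00 would be accepted, while -1, 10.505, and 10000000 would be rejected.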

Patterns for Identifiers

Identifiers are all around us. Social Security numbers, zip codes, and DUNS numbers (unique numbers issued by Dun & Bradstreet and used to identify business entities) are just a few. The range of values of such things is usually way too large and dynamic to practically enumerate in a schema, although business applications usually validate them in their databases. Even though we might not want to enumerate the set of allowable values, we can do a level of validation by specifying a pattern to which the identifier must conform. For example, U.S. Social Security numbers have the general pattern NNN-NN-NNNN, where N represents a digit from 0 through 9. U.S. zip codes (in the five-digit form) have a pattern of NNNNN. The common way to declare an identifier type is to restrict the string data type using the pattern facet. Here's an example for zip codes.

Specifying a Pattern in SimpleCSV2.xsd

    <xs:simpleType name="zipCodeType">
      <xs:annotation>
        <xs:documentation>Here we define a ZIP Code as 5-digit
            pattern.
        </xs:documentation>
      </xs:annotation>
      <xs:restriction base="xs:string">
        <xs:pattern value="\d\d\d\d\d"/>
      </xs:restriction>
    </xs:simpleType>

Each \d in the value Attribute of the xs:pattern Element indicates a decimal digit. Schema language offers many ways to specify patterns. Patterns are specified as regular expressions, which are sequences of characters that denote sets of strings. The schema language notation for regular expressions is very similar to that used in Perl. It is described in Appendix F, Part 2, of the Schema Recommendation. However, that appendix is somewhat dense and doesn't have any examples. If you're going to use patterns frequently and are new to regular expressions, I recommend getting a good introductory book on Perl, learning how Perl does regular expressions, then seeing how schema language differs. There's also a good resource on the Web, listed at the end of this chapter.
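A few more patterns may help show the notation. The types below are a sketch; their names are hypothetical and they are not part of SimpleCSV2.xsd. Two points are worth knowing: a schema pattern must match the entire value, so no ^ or $ anchors are needed, and \d matches any Unicode decimal digit, so use [0-9] if only ASCII digits should be allowed.

```xml
<!-- Hypothetical type names. Each pattern must match the whole value. -->

<!-- Five digits, written with a quantifier instead of \d\d\d\d\d -->
<xs:simpleType name="zipCode5Type">
  <xs:restriction base="xs:string">
    <xs:pattern value="\d{5}"/>
  </xs:restriction>
</xs:simpleType>

<!-- Five digits with an optional hyphen and four more digits (ZIP+4) -->
<xs:simpleType name="zipCodePlus4Type">
  <xs:restriction base="xs:string">
    <xs:pattern value="\d{5}(-\d{4})?"/>
  </xs:restriction>
</xs:simpleType>

<!-- The NNN-NN-NNNN form of a U.S. Social Security number, ASCII digits only -->
<xs:simpleType name="ssnType">
  <xs:restriction base="xs:string">
    <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
  </xs:restriction>
</xs:simpleType>
```

These check form only. A value such as 000-00-0000 satisfies the last pattern whether or not it was ever issued.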

Code Lists

Codes aren't used in real life as much as identifiers, but they abound in business applications. There is always some debate about what constitutes a code versus what constitutes an identifier, and more confusion is created when people talk about coded identifiers! However, we can put a stake in the ground at least a bit by talking about country codes, which use short, coded values in place of longer, less constrained values that are harder to validate. For example, we use "US" instead of "United States of America" and "FR" instead of "France." Country codes are especially useful since country names may be spelled differently in different languages. Other very common code lists are for state or province codes and units of measure.

We don't yet have any good applications for codes in our working example. Suppose we wanted to add an Element for gender, with a coded value indicating male, female, or unknown. We could do this as shown below.

Specifying a Code List in SimpleCSV2.xsd

    <xs:simpleType name="genderCodeType">
      <xs:annotation>
        <xs:documentation>This type specifies male, female,
            or unknown
        </xs:documentation>
      </xs:annotation>
      <xs:restriction base="xs:string">
        <xs:enumeration value="M"/>
        <xs:enumeration value="F"/>
        <xs:enumeration value="U"/>
      </xs:restriction>
    </xs:simpleType>

Note that, as with some types of patterns, when we specify the enumerations we don't have to specify lengths. Note, too, that although we are talking in this section about Elements, all of these simple types we are creating by restriction can also be used for Attributes.
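As a sketch of that last point, the same genderCodeType can serve as the type of an Element or of an Attribute. The complex type, Element, and Attribute names below are hypothetical; a real schema would normally pick one of the two placements rather than both.

```xml
<!-- Hypothetical names; genderCodeType is the enumeration defined above -->
<xs:complexType name="PersonType">
  <xs:sequence>
    <xs:element name="Name" type="xs:string"/>
    <!-- The simple type used for an Element -->
    <xs:element name="Gender" type="genderCodeType" minOccurs="0"/>
  </xs:sequence>
  <!-- The same simple type used for an Attribute -->
  <xs:attribute name="GenderCode" type="genderCodeType" use="optional"/>
</xs:complexType>
```

In either position, only M, F, or U would be accepted.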

As I write this there are very few (if any) approved, standard schemas for standard code lists that can be reused by anyone in their own schemas just by referencing them. For example, you would think that the folks at the U.S. Post Office might post a schema with the standard codes for states, military addresses, and U.S. territories and possessions. They haven't, at least not yet. People are restating such code lists on their own. This is pretty silly, but it's our only choice for now since there isn't yet a standard way to specify a code list in a schema. (Remember, a thousand different ways to hang yourself!) However, there is hope.

The Universal Business Language Technical Committee of the Organization for the Advancement of Structured Information Standards (the UBL TC in OASIS) has drafted a proposal for a standard way to express code lists in schema language. A task group within the U.S. EDI standards group, ANSI ASC X12, reviewed it shortly before I wrote this and had a favorable assessment. At that time the UBL folks said they planned to submit the work to the appropriate group within ISO. It would be very significant if this proposal is fully fleshed out and gains acceptance because it would lay the foundation for reusable code lists.

Attribute Declarations

As I said earlier, adding Attributes to Elements and understanding what that does to simple versus complex types is one of the more awkward parts of the Schema Recommendation. However, once you get your head around this idiosyncrasy, Attribute declarations are fairly straightforward. Here's a plain-vanilla version in which we just extend our Columns from a simple type to a complex type by adding the ColumnNumber Attribute.

Specifying a ColumnNumber Attribute in SimpleCSV3.xsd

    <xs:complexType name="ColumnType">
      <xs:annotation>
        <xs:documentation>Here we add the ColumnNumber Attribute,
            type integer, and optional to the ColumnType.
        </xs:documentation>
      </xs:annotation>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="ColumnNumber" type="xs:integer"
              use="optional"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>

The syntax for an Attribute declaration is very similar to an Element declaration. We still have the name and type Attributes, but the schema Element we use is xs:attribute. Note also that we still have simple content, as indicated by the xs:simpleContent Element, but the declaration is for an xs:complexType.
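In an instance document, the result might look like the fragment below. This is a sketch that assumes Column01 and Column02 are declared with the ColumnType type; the enclosing Row Element and the data values are hypothetical.

```xml
<!-- Assumes Column01 and Column02 are declared with type ColumnType;
     the Row wrapper and the values are hypothetical -->
<Row>
  <!-- Text content plus the optional Attribute -->
  <Column01 ColumnNumber="1">Acme Widget Company</Column01>
  <!-- Still valid without the Attribute, because use="optional" -->
  <Column02>123 Main Street</Column02>
</Row>
```

The Elements still hold only text, with no child Elements, which is why the content is simple even though the type is complex.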

The relationships between restriction and extension and simple and complex become a bit clearer when we consider the next example. We want to add the ColumnNumber Attribute, but we also want to restrict the length of the column to 1,024 characters. We're extending the type from a simple type to a complex type but restricting the content model. We have to do this in two steps. Although it requires a few more lines in the schema, we can thank the Schema Recommendation authors for implementing the concepts with a bit of consistency. I think it would ultimately be more confusing if we could do both operations in one step. In the following snippet, we first create the String1024Type type by restriction from string, then extend it to the ColumnType complex type by adding the Attribute.

Simple Content Restriction and Extension in SimpleCSV4.xsd

    <xs:simpleType name="String1024Type">
      <xs:annotation>
        <xs:documentation>This is the base for our ColumnType type,
            showing restriction, then extension.
        </xs:documentation>
      </xs:annotation>
      <xs:restriction base="xs:string">
        <xs:maxLength value="1024"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:complexType name="ColumnType">
      <xs:annotation>
        <xs:documentation>Here we add the ColumnNumber Attribute,
            type integer, and optional to the ColumnType, but our
            base is our restricted 1024-character string instead of
            the built-in string type.
        </xs:documentation>
      </xs:annotation>
      <xs:simpleContent>
        <xs:extension base="String1024Type">
          <xs:attribute name="ColumnNumber" type="xs:integer"
              use="optional"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>

This shows clearly that we create a new simple type by restriction from a built-in data type, then use it to create a new complex type by extension. Again, note that in both this and the previous example the content of our complex type is still simple.

As I noted earlier, there are many more built-in schema language data types, and there is certainly a lot more involved in creating simple content Elements than what I've presented here. However, this discussion should give you all you need to read about 90 percent of what you'll find in schemas for business data. Let's move on now to Elements with complex content.