Foundation Concepts and Terminology | Using XML with Legacy Business Applications

Although schemas are a tremendous advance over DTDs, they introduce some new concepts and terminology to XML that take some getting used to. It will help to review some of these before we discuss more specific features.

Elements and Types

Most programming languages make a distinction between a variable, which is a named location in memory, and the variable's data type, which can constrain the variable's range of values or how the variable should be interpreted. In general, the closest thing that XML 1.0 DTDs have to this separation is PCDATA for parsed character data and CDATA for (unparsed) character data. (I say "in general" because this is true for Elements. Attribute declarations have a bit more expressive power since they can be declared as CDATA, ID, IDREF, and so on.) The XML Schema Recommendation implements a more conventional view of variables and data types.
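As a rough illustration of the difference, consider a hypothetical Quantity Element. The Element name and the choice of xs:positiveInteger are assumptions made for this sketch, not taken from any particular application. A DTD can say only that the Element holds character data, while a schema can attach a data type to it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- DTD declaration, shown for comparison only:
     <!ELEMENT Quantity (#PCDATA)>
     The DTD cannot state that the content must be a number. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical Element: the "variable" is Quantity,
       and its data type is the built-in xs:positiveInteger -->
  <xs:element name="Quantity" type="xs:positiveInteger"/>
</xs:schema>
```

With this declaration, a schema-validating parser should accept an instance such as <Quantity>12</Quantity> and reject <Quantity>twelve</Quantity>, whereas the DTD declaration would allow both.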

NOTE "Data Type" or "Datatype"?

The Schema Recommendation uses "datatype," while most programming languages and texts use "data type." We'll go with convention and use the latter, but keep in mind that in reference to the schema language I also mean the former.

In addition to defining a set of base types to be used for simple Elements and Attributes, the schema language also specifies a way to define complex types. These are similar to structs in C, GROUPs in COBOL, and records in other languages. This facility greatly aids the reuse of common constructs.
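A minimal sketch of such a reusable complex type might look like the following. The names AddressType, ShipTo, BillTo, and the child Elements are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical named complex type, comparable to a C struct
       or a COBOL group item -->
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="Street" type="xs:string"/>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="PostalCode" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- The same type is reused by two different Elements -->
  <xs:element name="ShipTo" type="AddressType"/>
  <xs:element name="BillTo" type="AddressType"/>
</xs:schema>
```

Because the structure is defined once and referenced by name, a change to AddressType applies to every Element declared with that type.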

Both user-defined complex types and the built-in simple data types may be modified through the mechanisms of restriction and extension. These mechanisms and the most common simple data types will be discussed shortly.

Simple and Complex

One of the things I found somewhat confusing when first learning schema language is how it uses the terms "simple" and "complex." These are used in two different contexts.

In the first context, the terms "simple" and "complex" describe the type of content an Element may have, or its content model. In the most general terms, an Element may be declared in schema language with one of two content models.

Simple content: The Element has no child Elements.
Complex content: The Element may have child Elements.

These types of content are represented using the schema language Elements xs:simpleContent and xs:complexContent, respectively. Beyond the distinction between simple and complex, there are several kinds of content model for complex content. The most common are sequence and choice. We'll talk more about these later in the chapter. Mixed content indicates that both child Elements and character data are permissible. This content model is used infrequently in business document schemas intended for use directly by applications.
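The three content models named above can be sketched as follows. All of the Element names here are hypothetical. Note that a complex type written with xs:sequence or xs:choice directly inside it is the usual shorthand form, so xs:complexContent does not have to appear explicitly unless the type is being derived from another complex type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- sequence: the children must appear in the order listed -->
  <xs:element name="LineItem">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="PartNumber" type="xs:string"/>
        <xs:element name="Quantity" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- choice: exactly one of the listed children may appear -->
  <xs:element name="Payment">
    <xs:complexType>
      <xs:choice>
        <xs:element name="CreditCard" type="xs:string"/>
        <xs:element name="PurchaseOrder" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

  <!-- mixed: character data may be interleaved with child Elements -->
  <xs:element name="Note">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="Emphasis" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```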

The other context in which the terms "simple" and "complex" are used is in referring to types.

Simple type: A simple type is either one of the built-in schema data types or a type derived from one of them. An Element that is of a simple type may not have Attributes.
Complex type: A complex type may take several forms. A complex type is used to define Elements that may have child Elements. A complex type is used to define any Element that may have Attributes. A complex type is also used to declare that an Element may have mixed content.

These are declared in schema language using the xs:simpleType and xs:complexType Elements, respectively. For both of these types, schema language supports the concept of derivation, similar to the way it is used in object-oriented analysis, using the mechanisms of extension and restriction. These mechanisms are used somewhat differently in simple and complex types.

Simple types are derived from built-in schema language data types or other user-defined simple types by restriction. Restriction of a simple type involves reducing the set of values allowed for the type.
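As a small sketch of restriction, the following defines two hypothetical simple types, StateCodeType and OrderQuantityType. The names, the two-letter pattern, and the upper limit of 999 are assumptions made for this example. Each type allows a smaller set of values than its base type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Restricts xs:string to exactly two uppercase letters -->
  <xs:simpleType name="StateCodeType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{2}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Restricts xs:positiveInteger to the range 1 through 999 -->
  <xs:simpleType name="OrderQuantityType">
    <xs:restriction base="xs:positiveInteger">
      <xs:maxInclusive value="999"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="State" type="StateCodeType"/>
  <xs:element name="Quantity" type="OrderQuantityType"/>
</xs:schema>
```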

Complex types are either declared (created by the user) or derived from other types. Complex types may be derived from simple types by extension, usually by adding one or more Attributes. Complex types may also be derived from other complex types by either restriction or extension. Restriction generally involves removing child Elements from the content model, while extension generally involves adding Elements or Attributes. Restriction and extension will be discussed in more detail in later sections.
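The derivations just described might be sketched as shown here. Every type and Element name is hypothetical. PriceType extends a simple type by adding an Attribute, ContactAddressType extends a complex type by adding a child Element, and DomesticAddressType restricts the same complex type by leaving out an optional child Element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Complex type derived from a simple type by extension:
       decimal content plus a currency Attribute -->
  <xs:complexType name="PriceType">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="xs:string" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <!-- A declared (not derived) complex type used as the base below -->
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="Street" type="xs:string"/>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="Country" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Extension: Phone is appended after the base type's children -->
  <xs:complexType name="ContactAddressType">
    <xs:complexContent>
      <xs:extension base="AddressType">
        <xs:sequence>
          <xs:element name="Phone" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Restriction: the content model is restated without
       the optional Country Element -->
  <xs:complexType name="DomesticAddressType">
    <xs:complexContent>
      <xs:restriction base="AddressType">
        <xs:sequence>
          <xs:element name="Street" type="xs:string"/>
          <xs:element name="City" type="xs:string"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

Note that a restriction restates the whole content model it keeps, while an extension lists only what it adds.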

What can be a bit confusing is that there isn't a one-to-one correspondence between simple and complex content and simple and complex types. The use of "complex" is consistent since Elements that are defined to allow complex content always have a complex type. However, there isn't any such consistency with the use of "simple." Elements that are defined to allow only simple content may be of either simple type or complex type. An Element that is defined with simple content but that allows one or more Attributes always has a complex type. Attributes are not required for this, though: we can also define a complex type, derived from a simple type by extension, that has a simple content model and no Attributes.
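To make the mismatch concrete, here is a sketch of three hypothetical Elements that all have simple content (a decimal value and no child Elements). Only the first is of a simple type. The Element and Attribute names are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Simple content, simple type -->
  <xs:element name="Weight" type="xs:decimal"/>

  <!-- Simple content, complex type: the Attribute forces a complex type -->
  <xs:element name="ShippingWeight">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <xs:attribute name="unit" type="xs:string"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

  <!-- Simple content, complex type, no Attributes:
       an extension that adds nothing -->
  <xs:element name="NetWeight">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal"/>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

In an instance document, Weight and NetWeight would look the same (<Weight>12.5</Weight>, <NetWeight>12.5</NetWeight>), even though one is declared with a simple type and the other with a complex type.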

The best advice I can give you here is to make sure you understand the context, that is, whether you are dealing with content models or types, and don't confuse the two.