Less Commonly Used W3C XML Schema Language Features | Using XML with Legacy Business Applications

Up to this point I've focused on the schema language features I've most often seen used for EAI and electronic commerce applications (and therefore those you will most likely encounter). Other features are less frequently used, at least at present. However, it may help you to at least be familiar with them in case you do encounter them.

Wildcards

Schema language supports wildcards with xs:any, xs:anyAttribute, and xs:anyType. There are some good technical reasons for having wildcards in certain types of applications. But for most schemas that describe documents intended for production or consumption by business applications, and especially for those that define data exchanged between organizations, wildcards tend to be problematic. They can defeat much of the purpose of validating against a standard schema, since a wildcard accepts content the schema never describes. I am, however, aware of some standards bodies discussing wildcard use in specific extension Elements. These extension Elements would allow people to exchange nonstandard data in otherwise standard documents and would not require any changes to the standard schemas.
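As a rough sketch of what such an extension Element might look like, consider the fragment below. The Order and Extension names and the example.com namespace are invented for illustration and are not taken from any standard.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/order"
           elementFormDefault="qualified">
  <!-- Hypothetical standard document; all names are illustrative -->
  <xs:element name="Order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="OrderNumber" type="xs:string"/>
        <!-- Optional container reserved for nonstandard data -->
        <xs:element name="Extension" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <!-- Accepts Elements from any namespace other than this
                   schema's own; "lax" validates them only when a
                   declaration for them happens to be available -->
              <xs:any namespace="##other" processContents="lax"
                      minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Confining the wildcard to a single, clearly named Element keeps the rest of the document fully validated, which is presumably the attraction for a standards body.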

Default and Fixed Attributes

Ever had the experience of looking at the output of an application and not being able to figure out where certain data came from? Schema language carries over from DTDs the ability to define default values for Attributes in instance documents. Even if the Attribute doesn't appear in an instance document, a schema-aware XML processor will hand it over to your application, with the default value, when it processes the document. This feature may be of some use in scenarios where you are integrating in-house applications and have complete control over the environment. There are certainly differences of opinion, but many people don't find default values very friendly when exchanging data between organizations. We generally like to know explicitly what we are being told, with nothing implied or defaulted.
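Here is a minimal sketch of a default and a fixed Attribute declaration. The Price Element and its currency and unit Attributes are hypothetical names chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical Element: a decimal amount carrying two Attributes -->
  <xs:element name="Price">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <!-- default: reported as "USD" when the instance omits it -->
          <xs:attribute name="currency" type="xs:string" default="USD"/>
          <!-- fixed: if present it must be "each"; if absent, a
               schema-aware processor still reports it as "each" -->
          <xs:attribute name="unit" type="xs:string" fixed="each"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Given an instance containing only <Price>19.95</Price>, a schema-aware processor would report both Attributes to the application even though neither appears in the document. That is exactly the "where did this data come from?" situation.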

Whitespace Suppression

Space characters, tabs, carriage returns, and line feeds are considered to be whitespace. Schema language has a whiteSpace facet that lets you specify that whitespace should be suppressed when retrieving the content of a string Element or Attribute. For the string type, all whitespace is preserved by default. However, using this feature can cause leading and trailing spaces to be stripped before the string is passed to an application. Just as it can be unfriendly to add data with default Attributes, if your application thinks that leading or trailing spaces are significant, it can be unfriendly to have them removed. This is not a problem for most people since they tend to trim leading and trailing spaces when creating instance documents. However, it is something to be aware of.

If you have an application or trading partner that is sending you whitespace you don't want, you may be able to use this feature to your advantage. To employ it, you must modify your schema to create a new simple type by restriction, using string as the base type, and specify the whiteSpace facet with the value of "collapse". You then set the Elements where you want whitespace stripped to be of that type rather than string. For a plain string with no other constraining facets, whitespace suppression does not change whether a document is valid. It affects only the values returned to you by an XML processor.
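The steps just described might look something like this. CollapsedString and CustomerName are hypothetical names used only for this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- New simple type derived from string by restriction -->
  <xs:simpleType name="CollapsedString">
    <xs:restriction base="xs:string">
      <!-- Strip leading and trailing whitespace and reduce each
           internal run of whitespace to a single space -->
      <xs:whiteSpace value="collapse"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Use the new type instead of xs:string where stripping is wanted -->
  <xs:element name="CustomerName" type="CollapsedString"/>
</xs:schema>
```

Note that "collapse" does more than trim the ends: tabs and line breaks become spaces, and runs of spaces inside the value are reduced to one. The built-in xs:token type has the same whitespace behavior, so it may serve as a ready-made alternative.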

Derivation of Complex Types by Restriction

We talked earlier about deriving simple types from built-in types by restriction and about deriving complex types by extension. We didn't talk about deriving complex types by restriction. There are several reasons for this. One reason is the general mechanism by which this restriction is performed. For example, if you want to remove Elements from a sequence content model, you have to restate the sequence, listing only those Elements you want to use. This is a bit more awkward than just listing those you want to remove. In addition, most people designing hierarchies of complex types tend to follow an object-oriented approach. This generally involves defining base classes with a few common properties, then extending them into subclasses by adding properties.

Some groups, however, are taking a "kitchen sink" approach, similar to what has always been done in EDI, and defining in the base type everything that anyone ever might want to use. They then restrict down to specific usages by removing things they don't want. Even so, some groups that started out this way are moving back toward a conventional derivation model. Another persuasive reason for minimizing the use of restriction to derive complex types is that, even two years after the XML Schema Recommendation was published, most APIs still don't do derivation by restriction properly. This is no wonder since the Schema Recommendation is rather abstruse in this area.
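To make the "restate the sequence" point concrete, here is a small sketch. PartyType and PartyNoFaxType are hypothetical types invented for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Base type: the fuller content model -->
  <xs:complexType name="PartyType">
    <xs:sequence>
      <xs:element name="Name" type="xs:string"/>
      <xs:element name="Phone" type="xs:string" minOccurs="0"/>
      <xs:element name="Fax" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Restricted type: Fax is removed by restating the whole
       sequence and simply leaving it out -->
  <xs:complexType name="PartyNoFaxType">
    <xs:complexContent>
      <xs:restriction base="PartyType">
        <xs:sequence>
          <xs:element name="Name" type="xs:string"/>
          <xs:element name="Phone" type="xs:string" minOccurs="0"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

Fax can be dropped here only because it is optional in the base type. Every instance of the restricted type must still be a valid instance of the base type, so a required Element cannot be removed this way.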

Redefinition

Simple content models can be extended by means of a union, similar to a union in C or a REDEFINES clause in COBOL. I haven't seen many applications for it. A similar-sounding but fundamentally different feature is the schema xs:redefine Element. The xs:redefine Element can be used to modify the definitions of an external schema before they are used within the redefining schema. Some people may find reasons for doing this, but the extension and restriction mechanisms work just as well in most cases.
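For the union case, a minimal sketch might look like the following. QuantityOrUnknown and Quantity are hypothetical names, and the choice of member types is purely illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A value is valid if it matches either member type:
       a non-negative integer or the literal string "unknown" -->
  <xs:simpleType name="QuantityOrUnknown">
    <xs:union memberTypes="xs:nonNegativeInteger">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="unknown"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

  <xs:element name="Quantity" type="QuantityOrUnknown"/>
</xs:schema>
```

With this declaration both <Quantity>12</Quantity> and <Quantity>unknown</Quantity> would validate, while <Quantity>lots</Quantity> would not.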

Named Attribute Groups

Sets of Attributes that are repeatedly used together may be associated with each other by creating a named Attribute group. This serves the same function as a macro in a programming language or some uses of ENTITY in DTDs. I've not seen named Attribute groups used at all, but they could be very useful when the same Attributes are used repeatedly.
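A brief sketch of the idea follows, with a hypothetical AuditAttributes group reused by two equally hypothetical Elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Declared once at the top level -->
  <xs:attributeGroup name="AuditAttributes">
    <xs:attribute name="createdBy" type="xs:string"/>
    <xs:attribute name="createdOn" type="xs:date"/>
  </xs:attributeGroup>

  <!-- Referenced wherever the same Attributes are needed -->
  <xs:element name="Invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Total" type="xs:decimal"/>
      </xs:sequence>
      <xs:attributeGroup ref="AuditAttributes"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Shipment">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Carrier" type="xs:string"/>
      </xs:sequence>
      <xs:attributeGroup ref="AuditAttributes"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```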

Named Model Groups

These do for Elements what named Attribute groups do for Attributes. However, I have yet to see very many grounds for using them. In most cases, things are grouped together for a semantic reason. Putting them together as children under a parent Element explicitly shows their association when used in an instance document. Named model groups let you reuse groups of Elements in schema declarations, but they have no visibility in instance documents. You can't differentiate something that is part of a named model group from a sibling that isn't. When implemented in an organized fashion, deriving complex types by extension generally tends to meet most people's needs.
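The following sketch illustrates the visibility point. AddressLines and Customer are hypothetical names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Reusable group of Element declarations -->
  <xs:group name="AddressLines">
    <xs:sequence>
      <xs:element name="Street" type="xs:string"/>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="PostalCode" type="xs:string"/>
    </xs:sequence>
  </xs:group>

  <xs:element name="Customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <!-- The group's Elements are spliced in at this point -->
        <xs:group ref="AddressLines"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

In an instance document, Name, Street, City, and PostalCode all appear as plain siblings under Customer. Nothing marks the last three as having come from a group, whereas wrapping them in an Address parent Element would make the association explicit.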

Abstract Types and Substitution Groups

The best way to describe these is to use an analogy from object-oriented programming. You can define a base class, then define subclasses, all with the same named method. Then you can process objects of the subclass by referring to them as if they were all members of the base class. This type of polymorphism allows you to call a named method but have different behaviors because each of the derived classes has implemented the named method in a fashion tailored to the derived class. Abstract types are akin to the base class, and the members of a substitution group are akin to the derived classes. I've been told that this approach is used in the OASIS Security Assertion Markup Language (SAML), but I've seen it used in only one schema I've actually reviewed. The only reason I think it was used there is because, in my humble opinion, the schema designers were trying to address too many types of business processes and data in a single schema. However, they may have used it because they were generating Java classes directly from schemas (more on this in Chapter 12).

Most schemas are so tightly focused in their problem domain and data that there is no need for this type of flexibility. However, as I write this X12 is considering using this feature for a very limited, specific type of constraint on a lower-level type. There may be a need to say that at least one member of a group of things is required but that there could be more than one. An example is requiring at least a name or a frequent flyer number on an airline reservation. Abstract types and substitution groups are currently in the running for expressing that type of constraint.
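One way the airline reservation constraint might be sketched is shown below. This is my own illustration using an abstract head Element, not the approach any standards body has actually adopted, and every name in it is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Abstract head: may be referenced in content models but can
       never appear in an instance document itself -->
  <xs:element name="PassengerIdentifier" type="xs:string"
              abstract="true"/>

  <!-- Members that may stand in wherever the head is referenced -->
  <xs:element name="PassengerName" type="xs:string"
              substitutionGroup="PassengerIdentifier"/>
  <xs:element name="FrequentFlyerNumber" type="xs:string"
              substitutionGroup="PassengerIdentifier"/>

  <xs:element name="Reservation">
    <xs:complexType>
      <xs:sequence>
        <!-- At least one member is required; more are allowed -->
        <xs:element ref="PassengerIdentifier" maxOccurs="unbounded"/>
        <xs:element name="FlightNumber" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A Reservation with a PassengerName, a FrequentFlyerNumber, or both would validate, while one with neither would not. This sketch does not stop the same member from appearing twice, so a real design would still need to decide whether that matters.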

Keys and Uniqueness

Schema language offers four different mechanisms for relating Elements to each other. Several offer a means by which schema processors can enforce something similar to the referential integrity or unique key constraints that relational database systems offer. The four mechanisms are (1) ID with IDREF, (2) the xs:unique schema language Element, (3) xs:key and xs:keyref, and (4) XLink and XPointer. Any one of these would require at least a paragraph and an example to explain. That doesn't quite seem appropriate for a discussion about little-used features. To date, most people are content to let their applications enforce referential integrity and uniqueness and to relate different Elements in instance documents using the appropriate business data rather than schema language constructs.

Nillable Elements

Schema language offers a feature that allows you to identify an Element as being nillable, that is, capable of representing a null value. This is the same as saying that a column in a table in a relational database can be null. With this feature you can create an instance document with an Element that is empty but that has an attribute of xsi:nil with a boolean value of true. Strictly speaking, as far as the schema language syntax goes this is not the same as saying that the Element is empty or that it is absent. However, what your application chooses to do with such Elements is probably up to you. I've not seen this feature used yet. However, I will be surprised if I don't find some clever monkey using it before I retire from working with XML.
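As a small sketch, here is a nillable declaration followed by an instance that uses it. Shipment and ShipDate are hypothetical names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Shipment">
    <xs:complexType>
      <xs:sequence>
        <!-- A date that may be explicitly null in an instance -->
        <xs:element name="ShipDate" type="xs:date" nillable="true"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Shipment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <!-- Empty content is accepted only because xsi:nil is true -->
  <ShipDate xsi:nil="true"/>
</Shipment>
```

Without the xsi:nil attribute, an empty ShipDate would fail validation because an empty string is not a valid date. That illustrates why nil is not the same thing as empty.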

So, those are some features you probably won't see used very much any time soon. With some warning you may be able to avoid hanging yourself too badly when someone springs one on you.

The last topic of nillable Elements leads naturally to the problem in the next section.