Rope, Anyone? | Using XML with Legacy Business Applications

Back in a previous lifetime when I worked as a consultant for Digital Equipment Corporation, we all loved using the VMS operating system. However, we also joked that it was so capable and flexible that it gave us a thousand different ways to hang ourselves. The W3C XML Schema language is very similar. You can write a nearly infinite number of schemas to describe any particular instance document. Every one of these schemas can provide an accurate description of the document, be loaded into a parser, and validate the document.
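To make the point concrete, here is a minimal sketch of one small instance document and two differently structured schemas that would both validate it. The Order and Quantity names and the OrderType type are hypothetical, invented only for this illustration, and no target namespace is used in order to keep things short.

```xml
<!-- The instance document -->
<Order>
  <Quantity>5</Quantity>
</Order>
```

```xml
<!-- Schema A: everything nested inside a single global element declaration -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Quantity" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

```xml
<!-- Schema B: a named complex type plus a reference to a global element -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Order" type="OrderType"/>
  <xs:element name="Quantity" type="xs:positiveInteger"/>
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element ref="Quantity"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

The two schemas are not identical in every respect (Schema B, for example, would also accept a document whose root is Quantity), but both accept the instance shown, and many further variations are possible.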

XML Schemas and the Formal Study of Grammars

The situation of multiple schemas being able to validate the same instance document is analogous to a situation in the formal study of languages and automata in computer science. In this realm of study, a language is not necessarily a spoken language like French or English. It can even comprise nonsense strings of characters so long as they exhibit certain properties. An immediate application of this field of study is building compilers that transform programs in languages like Java and C++ into executable code.

Languages can be grouped by certain properties they exhibit. One of these groups is a class that can be specified by a special type of grammar called a context-free grammar, or CFG. An interesting property of these languages and CFGs is that an unlimited number of CFGs can be developed to describe a given language. Determining whether two arbitrary CFGs describe the same language is, however, undecidable in general; algorithms exist only for restricted classes such as regular grammars. I've not seen a tool that compares XML schemas in this way yet, but given enough time I'm sure someone will develop one.

If there were only one way to write a schema, I'm not sure I would devote a whole chapter to the topic. However, because all of this variation is possible, a certain level of instruction is called for in order to help you understand schemas. Understanding the schemas you use is important for a number of reasons. When used, schemas are the ultimate authority on the format of your XML instance documents. If you encounter parsing or validation errors, you may have to refer to the schema as well as the instance document to determine the source of the error. If you are like most people who use XML, your trading partners or application vendors will tell you which schemas to use. Because there is sometimes a perception that schemas are fully definitive and self-documenting, the schema may be the only documentation you'll get regarding how your instance documents are supposed to look.

In this chapter I attempt to convey to you the basics of what you need to know to read and make sense of schemas written in W3C's XML Schema language. This chapter focuses just on understanding the suckers. In Chapter 5 we'll write code to use schemas for validation. Later in this chapter we'll talk about creating schemas, and we'll revisit the topic in Chapter 12. However, for the most part I'll avoid discussing the advantages and disadvantages of the various schema design options. The topic of best practices in schema design is fairly broad and sometimes contentious. Since I anticipate that most of you reading this will not be designing schemas, I'll limit the discussion to a more pragmatic scope.

If you want a good introduction to most of the features of the schema language, you can't do better than W3C's Primer (see the Resources section at the end of this chapter). However, this chapter takes a different tack. While the Primer tries to describe most of the important schema features, I'll focus only on features and examples you are likely to see in schemas describing business documents (as opposed to wedding invitations, Web pages, or design documents for weapons systems). Although the Primer uses a purchase order as one of its main examples, many of the schema features discussed are not commonly used in business documents (not yet, anyway).

Although I have good words to say about the Primer, which is Part 0 of the XML Schema Recommendation, I don't have such good words to say about Parts 1 and 2, Structures and Datatypes, respectively. I find parts of them nearly impenetrable, and I have a master's degree in computer science! There also seems to me to be a great deal of functionality that is unnecessarily overlapping or outright duplicative. I don't want to come across as too harsh here since I have worked on standards committees myself and know firsthand what can happen when writing a document by committee. However, even when compared with other W3C Recommendations, Parts 1 and 2 of the Schema Recommendation don't measure up very well.

There are schema languages other than the W3C XML Schema language. However, despite my mixed feelings about the W3C XML Schema Recommendation, I'm not going to talk about those other schema languages. It's not that I think that RELAX NG or any of its cousins are technically deficient or harder to use than the W3C XML Schema language. Far from it. The truth is that the market just isn't very interested in them. Almost everyone looks to the W3C as the final authority on all things XML. They have spoken.

NOTE Chapter Conventions

When dealing with the topic of schemas it can sometimes be difficult to keep track of whether one is talking about an individual instance of a schema document or referring to the schema language or recommendation. To try to keep these distinct I'll use the conventions of referring to the W3C XML Schema language as "schema language" and to the W3C XML Schema Recommendation as "Schema Recommendation." I'll use "schema" to refer to an individual schema, that is, an instance document written in the W3C XML Schema language.

In addition, I frequently use the xs: namespace qualifier prefix to help make it clear that I'm referring to specific Elements from the W3C XML Schema language.
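As a brief illustration of that convention, the fragment below shows how the xs: prefix is typically bound to the Schema Recommendation's namespace on the root Element of a schema. The prefix itself is only a convention (xsd: is also common); what matters is the namespace URI it is bound to. The Invoice name is hypothetical and used only for this sketch.

```xml
<!-- xs: is bound to the W3C XML Schema namespace by the xmlns:xs declaration -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- xs:element and xs:string both come from the schema language itself -->
  <xs:element name="Invoice" type="xs:string"/>
</xs:schema>
```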