In the general sense of the word, a schema is a generic representation of a class of things. For example, a schema for restaurant menus could be the phrase "a list of dishes available at a particular eating establishment." A schema may resemble the thing it describes, the way a "smiley face" represents an actual human face. The information contained in a schema allows you to identify when something is or is not a representative instance of the concept.
In the XML context, a schema is a pass-or-fail test for documents. [1] A document that passes the test is said to conform to it, or be valid. Testing a document with a schema is called validation. A schema ensures that a document fulfills a minimum set of requirements, finding flaws that could result in anomalous processing. It also may serve as a way to formalize an application, being a publishable object that describes a language in unambiguous rules.
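To make the pass-or-fail idea concrete, here is a minimal sketch in Python. It is not a real schema language: the `MENU_RULES` table, the `conforms` and `validate` functions, and the menu vocabulary are all hypothetical, invented only to show a document being tested against a set of declarations and yielding a true-or-false answer.

```python
import xml.etree.ElementTree as ET

# Toy "schema" (an assumption for illustration, not a real schema language):
# each element name maps to the set of child element names it may contain.
MENU_RULES = {
    "menu": {"dish"},
    "dish": {"name", "price"},
    "name": set(),
    "price": set(),
}

def conforms(element, rules):
    """Return True if the element and all of its descendants obey the rules."""
    allowed = rules.get(element.tag)
    if allowed is None:  # the element is not declared at all
        return False
    for child in element:
        if child.tag not in allowed or not conforms(child, rules):
            return False
    return True

def validate(xml_text, rules, root_name):
    """Pass-or-fail test: True if the document is valid, False otherwise."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:  # the document is not even well-formed
        return False
    return root.tag == root_name and conforms(root, rules)

good = "<menu><dish><name>Soup</name><price>4.50</price></dish></menu>"
bad = "<menu><dish><name>Soup</name><color>red</color></dish></menu>"

print(validate(good, MENU_RULES, "menu"))  # True
print(validate(bad, MENU_RULES, "menu"))   # False
```

The second document fails because `color` is not declared as a child of `dish`. This toy only checks which elements may appear inside which; real schema languages also describe order, repetition, attributes, and data types.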
4.1.1 Validation
An XML schema is like a program that tells a processor how to read a document. It's very similar to a later topic we'll discuss called transformations. The processor reads the rules and declarations in the schema and uses this information to build a specific type of parser, called a validating parser. The validating parser takes an XML instance as input and produces a validation report as output. At a minimum, this report is a return code: true if the document is valid, false otherwise. Optionally, the parser can create a Post Schema Validation Infoset (PSVI) including information about data types and structure that may be used for further processing.
Validation happens on at least four levels: structure, data typing, integrity, and business rules.
Structural validation is the most important, and schemas are best prepared to handle this level. Data typing is often useful, especially in "data-style" documents, but not widely supported. Testing integrity is less common and somewhat problematic to define. Business rules are often checked by applications.
4.1.2 A History of Schema Languages
There are many different kinds of XML schemas, each with its own strengths and weaknesses.
4.1.2.1 DTD
The oldest and most widely supported schema language is the Document Type Definition (DTD). Borrowed from SGML, a simplified DTD was included in the XML Core recommendation. Though a DTD isn't necessary to read and process an XML document, it can be a useful component for a document, providing the means to define macro-like entities and other conveniences. DTDs were the first widely used method to formally define languages like HTML.
4.1.2.2 W3C XML Schema
As soon as XML hit the streets, developers began to clamor for an alternative to DTDs. DTDs don't support namespaces, which appeared after the XML 1.0 specification. They also have very weak data typing, being mostly markup-focused. The W3C formed a working group for XML Schema and began to receive proposals for what would later become their W3C XML Schema recommendation. Following are some of the proposals made by various groups.
Informed by these proposals, the W3C XML Schema Working Group arrived at a recommendation in May 2001, composed of three parts (XMLS0, XMLS1, and XMLS2) named Primer, Structures, and Datatypes, respectively. Although some of the predecessors are still in use, all involved parties agreed that they should be retired in favor of the one, true W3C XML Schema.
4.1.2.3 RELAX NG
An independent effort by a creative few coalesced into another schema language called RELAX NG (pronounced "relaxing"). It is the merging of Regular Language Description for XML (RELAX) and Tree Regular Expressions for XML (TREX). Like W3C Schema, it supports namespaces and datatypes. It also includes some unique innovations, such as interchangeability of elements and attributes in content descriptions and more flexible content models.
RELAX, a product of the Japanese Standards Association's INSTAC XML Working Group, led by Murata Makoto, was designed to be an easy alternative to XML Schema. "Tired of complex specifications?" the home page asks. "You can relax!" Unlike W3C Schema, with its broad scope and steep learning curve, RELAX is simple to implement and use.
You can think of RELAX as DTDs (formatted in XML) plus datatypes inherited from W3C Schema's datatype set. As a result, it is nearly painless to migrate from DTDs to RELAX and, if you want to do so later, fairly easy to migrate from RELAX to W3C Schemas. It supported two levels of conformance. "Classic" is just like DTD validation plus datatype checking. "Fully relaxed" added more features.
The theoretical basis of RELAX is Hedge Automata tree processing. While you don't need to know anything about Hedge Automata to use RELAX or RELAX NG, these mathematical foundations make it easier to write efficient code implementing RELAX NG. Murata Makoto has demonstrated a RELAX NG implementation which occupies 27K on a cell phone, including both the schema and the XML parser.
At about the same time RELAX was taking shape, James Clark of Thai Open Source Software Center was developing TREX. It came out of work on XDuce, a typed programming language for manipulating XML markup and data. XDuce (a contraction of "XML" and "transduce") is a transformation language which takes an XML document as input, extracts data, and outputs another document in XML or another format. TREX uses XDuce's type system and adds various features into an XML-based language. XDuce appeared in March 2000, followed by TREX in January 2001.
Like RELAX, TREX uses a very clear and flexible language that is easy to learn, read, and implement. Definitions of elements and attributes are interchangeable, greatly simplifying the syntax. It has full support for namespaces, mixed content, and unordered content, things that are missing from, or very difficult to achieve with, DTDs. Like RELAX, it uses the W3C XML Schema datatype set, reducing the learning curve further.
RELAX NG (new generation) combines the best features from both RELAX and TREX in one XML-based schema language. First announced in May 2001, it is developed under the oversight of an OASIS Technical Committee headed by James Clark and Murata Makoto. It was approved as a Draft International Standard by ISO/IEC.
4.1.2.4 Schematron
Also worth noting is Schematron, first proposed by Rick Jelliffe of the Academia Sinica Computing Centre in 1999. It uses XPath expressions to define validation rules and is one of the most flexible schema languages around.
4.1.3 Do You Need Schemas?
It may seem like schemas are a lot of work, and you'd be right to think so. In designing a schema, you are forced to think hard about how your language is structured. As your language evolves, you have to update your schema, which is like maintaining a piece of software. There will be bugs, version tracking, usability issues, and even the occasional overhaul to consider. So with all this overhead, is it really worth it? First, let's look at the benefits:
Using a schema also has some drawbacks:
To make the decision easier, think about it this way. A schema is basically a quality-control tool. If you are reasonably certain that your documents are good enough for processing, then you have no need for schemas. However, if you want extra assurance that your documents are complete and structurally sound, and the work you save fixing mistakes outweighs the work you will spend maintaining a schema, then you should look into it.
One thing to consider is whether a human will be involved with producing a document. No matter how careful we are, we humans tend to make a lot of mistakes. Validation can find those problems and save frustration later. But software-created documents tend to be very predictable and probably never need to be validated.
The really hard question to answer is not whether you need a schema, but which standard to use. There are a few very valuable choices that I will be describing in the rest of the chapter. I hope to provide you with enough information to decide which one is right for your application.