Item 25. Pretend There's No Such Thing as the PSVI
Schemas are an extremely useful means of checking preconditions on XML documents before processing them. While schema validation normally can't detect every possible violation of an application's constraints, it can often detect a lot of them. However, some schema languages (especially the W3C XML Schema Language) have a second, less salutary purpose. Although support is mostly experimental so far, the W3C XML Schema specifications indicate that a truly schema-aware parser should produce a post-schema-validation information set (PSVI).
The concept is that the PSVI includes not only the actual content of the XML document but also annotations about that content from schemas. In particular, it defines types for elements and attributes. Thus you can know, for example, that a Price element has type xsd:double, a CurrentStock element has type xsd:nonNegativeInteger, an expires attribute has type xsd:date, and a Weight element has type xsd:decimal.
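For instance, a schema declaring those types might look like this (a hypothetical fragment, assembled from the element names above rather than taken from any real schema):

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Each declaration annotates a name with a schema type;
       a schema-aware parser would carry these annotations
       into the PSVI. -->
  <xsd:element name="Price"        type="xsd:double"/>
  <xsd:element name="CurrentStock" type="xsd:nonNegativeInteger"/>
  <xsd:element name="Weight"       type="xsd:decimal"/>
  <xsd:attribute name="expires"    type="xsd:date"/>
</xsd:schema>
```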
This sounds good in theory. In practice, the real world is rarely so simple. The types defined in the schema are only occasionally the correct types to be used in the local processing environment.
The fact is, schema-defined types just aren't all that useful. XML 1.0 also has a notion of types based purely on element and attribute names. Along with the namespace and the outer context (that is, the parents and ancestors of the element and attribute), this is normally all you need to convert XML into some local structure for further processing. If you can recognize that the CurrentStock element represents the number of boxes sitting in the warehouse, you normally know that it's a non-negative integer, and you know how to handle it in your code. In fact, you can handle it a lot better by understanding that it is the number of boxes sitting in the warehouse than you can merely by knowing that it's a non-negative integer.
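The name-based approach can be sketched in a few lines. The following example (the document and its element names are hypothetical, and Python's standard xml.etree.ElementTree stands in for whatever parser you use) converts CurrentStock to a local integer based purely on the element's name, with no schema involved:

```python
import xml.etree.ElementTree as ET

document = """
<Item>
  <CurrentStock>42</CurrentStock>
  <Price>19.99</Price>
</Item>
"""

root = ET.fromstring(document)

# The element's name tells us what it represents: the number of boxes
# sitting in the warehouse. That knowledge, not a schema annotation,
# drives the conversion to a local type and the validity check.
stock = int(root.findtext("CurrentStock"))
if stock < 0:
    raise ValueError("CurrentStock cannot be negative")

print(stock)  # → 42
```

Note that the code even enforces the non-negativity constraint itself, because knowing what the number means tells you what constraints it must satisfy.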
Look at it this way: it's very rare to assign different types to the same element. What is the chance you'll have a CurrentStock element (or any other element, for that matter) that is an integer in one place and a string in the next? This does happen. For example, in a medical record a Duration element might contain minutes when describing the length of a procedure but days, months, and years when describing a preexisting condition. Normally, though, this indicates that the type is really some broader type that can handle both, in this case perhaps an ISO 8601 duration that handles everything from years down to fractions of a second. Almost always, knowing the name, namespace, and context of an element is sufficient to deduce its nature, and to do so in a far more robust and useful way than merely knowing the type. Adding the schema type really does not provide any new information.
The normal response of PSVI proponents is that the type is necessary for documentation: that it serves as some sort of contract between the producer of the data and the consumer of the data. The type tells the eventual recipient that the sender intended a particular element to be treated as a non-negative integer, or a date in the Gregorian calendar, or in some other way. A schema and its types can certainly be useful for documentation purposes. A schema helps define more unambiguously what is expected, and like all extra-document context it can be used to inform the development of the process that receives the document. However, the software generally won't dispatch based on the type. It will decide what to do with elements based on the element's name, namespace, position within the document, and sometimes content. The type will most often be ignored.
Even more importantly, there is absolutely no guarantee that the type the sender assigns to a particular element is in any way, shape, or form the type the recipient needs or can process. For example, some languages don't have integer types, only decimal types, so they might choose to parse an integer as a floating-point number. Even more likely, they may choose to treat all numbers as strings, or to deserialize the data into a custom class or type such as a money object. The idea that each piece of data has one type that can satisfy all possible uses of the data in all environments is a fantasy. Regardless of what type it's assigned, different processes will treat the data differently, as befits their local needs and concerns.
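As a sketch of this point (the Price element is hypothetical): a recipient handling money can simply ignore whatever the sender's schema said, perhaps xsd:double, and parse the text into a decimal type, since binary floating point is a notoriously poor fit for currency:

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

root = ET.fromstring("<Item><Price>19.99</Price></Item>")

# The schema may annotate Price as xsd:double, but the local code
# chooses Decimal because that is what money arithmetic requires.
price = Decimal(root.findtext("Price"))

# Decimal arithmetic is exact where binary floats are not:
print(price + Decimal("0.01"))  # → 20.00
```

The local type choice follows from what the data is for, not from what the schema declared.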
Some more recent W3C specifications, especially XPath 2.0, XSLT 2.0, and XQuery, are based on the PSVI, and this has caused no end of problems. The specifications are far more complex and elaborate than they would be if they were based on basic XML 1.0 structures. They impose significant additional costs on implementers, significant enough that several developers who have implemented XPath and XSLT 1.0 have announced they won't be implementing version 2 of the specification. They also impose noticeable performance penalties on users because schema validation becomes a prerequisite for even the simplest transformation.
What exactly is bought for this extra cost? Honestly, that's hard to say. It's not clear what, if anything, can be done with XPath 2.0/XSLT 2.0/XQuery as currently designed that couldn't be done with a less typed language. It is possible to search a document for all the elements with type int or for all the attributes with type date, but that's rarely useful without knowing what the int or date represents. Indeed, in the XQuery Use Cases specification (May 2003 working draft), only 12 of 77 examples actually use typed data, and most of those could easily be rewritten to depend on element names rather than element types. Some developers suspect they can use strong typing information to optimize and speed up certain queries, but so far this is no more than a hypothesis without any real evidence to back it up. Indeed, the main purpose of all the static typing seems to be soothing the stomachs of the relational database vendors in the working group who are constitutionally incapable of digesting a dynamically typed language.
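The name-based equivalent of such a query needs no type information at all. An XPath 1.0-style search by element name, sketched here with Python's ElementTree and hypothetical element names, finds everything a type-based search would, without any schema having been consulted:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Catalog>"
    "<Item><Price>10.00</Price></Item>"
    "<Item><Price>25.50</Price></Item>"
    "</Catalog>"
)

# Select by name, not by schema type: .//Price matches every Price
# element in the document whether or not a schema ever typed it.
prices = [element.text for element in root.findall(".//Price")]
print(prices)  # → ['10.00', '25.50']
```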
The PSVI also causes trouble in many of the data binding APIs such as Castor, Zeus, JAXB, and others. Here the problem is a little less explicit. These APIs start with the assumption that they can read a schema to determine the input form of the document and then deserialize it into equivalent local objects. They rely on schema types to determine which local forms the data takes: int, double, BigDecimal, and so on. There are numerous problems with this approach, which are not shared by more type-agnostic processes.
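A type-agnostic alternative is to write the mapping by hand. The sketch below (the Item class and element names are hypothetical) deserializes by element name into whatever local types the application actually wants, with no schema in sight:

```python
from dataclasses import dataclass
from decimal import Decimal
import xml.etree.ElementTree as ET

@dataclass
class Item:
    current_stock: int  # chosen locally, not dictated by a schema
    price: Decimal      # money as Decimal, whatever the schema says

def item_from_xml(xml: str) -> Item:
    """Deserialize by element name; each field's local type is an
    application decision, not a schema annotation."""
    root = ET.fromstring(xml)
    return Item(
        current_stock=int(root.findtext("CurrentStock")),
        price=Decimal(root.findtext("Price")),
    )

item = item_from_xml(
    "<Item><CurrentStock>7</CurrentStock><Price>3.50</Price></Item>"
)
print(item)
```

Because the application owns the mapping, there is no mismatch between what a schema declared and what the local code can digest.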
The PSVI is a useful theory for talking about schema languages and what they mean. However, its practice remains to be worked out. For the time being, simple validation is the most you can or should ask of a schema.