Item 25. Pretend There's No Such Thing as the PSVI | Effective XML: 50 Specific Ways to Improve Your XML

Item 25. Pretend There's No Such Thing as the PSVI

Schemas are an extremely useful means of checking preconditions on XML documents before processing them. While schema validation normally can't detect every possible violation of an application's constraints, it can often detect a lot of them. However, some schema languages (especially the W3C XML Schema Language) have a second, less salutary purpose. Although support is mostly experimental so far, the W3C XML Schema specifications indicate that a truly schema-aware parser should produce a post-schema-validation infoset (PSVI).

The concept is that the PSVI includes not only the actual content of the XML document but also annotations about that content from schemas. In particular, it defines types for elements and attributes. Thus you can know, for example, that a Price element has type xsd:double, a CurrentStock element has type xsd:nonNegativeInteger, an expires attribute has type xsd:date, and a Weight element has type xsd:decimal.

This sounds good in theory. In practice, the real world is rarely so simple. The types defined in the schema are only occasionally the correct types to be used in the local processing environment.

A Price element might be declared as a double. However, to avoid round-off errors when adding and subtracting prices, it might need to be processed as a fixed-point type as in COBOL or as an instance of a custom Money class that never has round-off errors.
CurrentStock might always be non-negative (that is, greater than or equal to zero), but it might still be represented as a signed int in languages such as Java that don't have unsigned types.
An expires attribute may contain a date, but the local database into which the dates are fed might represent dates as an integer containing the number of days since December 1, 1904, or the number of seconds since midnight, December 31, 1969. It might round all dates to the nearest month or the nearest week. It might even convert them to a non-Gregorian calendar.
A Weight element might be declared as type xsd:decimal, equivalent to BigDecimal in Java. However, local processing might change it to a double for efficiency of calculation if the local process doesn't need the extra precision or size of the decimal type. It might also need to convert from pounds to kilograms, or ounces to grams, or grams to kilograms. The mere annotation as a decimal is insufficient to truly determine the weight.
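
The round-off problem with prices is easy to reproduce. The following is a minimal sketch in Python, assuming a hypothetical Order document with two Price elements; the standard decimal module stands in for the fixed-point or Money type the local process might prefer.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical document; only the Price element name comes from the text.
order = ET.fromstring("<Order><Price>0.10</Price><Price>0.20</Price></Order>")

# Treat each Price as a double, as the schema type would suggest.
as_double = sum(float(p.text) for p in order.iter("Price"))

# Treat each Price as a fixed-point decimal, as local processing may require.
as_decimal = sum((Decimal(p.text) for p in order.iter("Price")), Decimal("0"))

print(as_double == 0.3)                # False: binary floating point rounds
print(as_decimal == Decimal("0.30"))   # True: decimal arithmetic is exact here
```

The choice between the two representations is made by the receiving code, which knows it is handling money. The schema's annotation plays no part in it.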

The fact is, schema-defined types just aren't all that useful. XML 1.0 also has a notion of types based purely on element and attribute names. Along with the namespace and the outer context (that is, the parents and ancestors of the element and attribute), this is normally all you need to convert XML into some local structure for further processing. If you can recognize that the CurrentStock element represents the number of boxes sitting in the warehouse, you normally know that it's a non-negative integer, and you know how to handle it in your code. In fact, you can handle it a lot better by understanding that it is the number of boxes sitting in the warehouse than you can merely by knowing that it's a non-negative integer.
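
One way to picture name-based conversion is a table of converters keyed on element name. This is only a sketch under stated assumptions: the Product wrapper, the units attribute on Weight, and the helper names weight_in_kg and to_record are all hypothetical, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

LB_TO_KG = Decimal("0.45359237")  # the international pound, by definition


def weight_in_kg(elem):
    # Assumed convention: a units attribute of "lb" or "kg", defaulting to "kg".
    value = Decimal(elem.text.strip())
    return value * LB_TO_KG if elem.get("units", "kg") == "lb" else value


# Conversions are keyed on element name, not on any schema-assigned type.
CONVERTERS = {
    "Price": lambda e: Decimal(e.text.strip()),
    "CurrentStock": lambda e: int(e.text),
    "Weight": weight_in_kg,
}


def to_record(product):
    """Build a local dictionary from a Product element, skipping unknown children."""
    record = {}
    for child in product:
        convert = CONVERTERS.get(child.tag)
        if convert is not None:
            record[child.tag] = convert(child)
    expires = product.get("expires")
    if expires is not None:
        record["expires"] = date.fromisoformat(expires)
    return record


product = ET.fromstring(
    '<Product expires="2004-12-31">'
    "<Price>19.95</Price>"
    "<CurrentStock>42</CurrentStock>"
    '<Weight units="lb">10</Weight>'
    "</Product>"
)
record = to_record(product)
print(record)
```

Each converter encodes what the element means locally (money, a box count, a weight in kilograms), which is more than a bare type annotation could say.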

Look at it this way: It's very rare to assign different types to the same element. What is the chance you'll have a CurrentStock element, or any other element for that matter, that is an integer in one place and a string in the next? This does happen. For example, in a medical record, a Duration element might contain minutes when describing the length of a procedure and days, months, and years when describing a preexisting condition. Normally, though, this indicates that the type is really some broader type that can handle both cases, in this case perhaps an ISO 8601 duration that handles everything from years down to fractions of a second. Almost always, knowing the name, namespace, and context of an element is sufficient to deduce its nature, and to do so in a far more robust and useful way than merely knowing the type. Adding the schema type really does not provide any new information.
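
To illustrate the Duration case, here is a rough sketch of a single parser that accepts both the minutes of a procedure and the years and months of a preexisting condition. It assumes a simplified subset of the ISO 8601 duration syntax (no weeks, no negative values, fractions only on seconds), and the function name parse_duration is hypothetical.

```python
import re

# Simplified ISO 8601 duration: PnYnMnDTnHnMnS, every component optional.
_DURATION = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(text):
    """Return a dict of the components present in a duration string."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized duration: {text!r}")
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"duration has no components: {text!r}")
    # Seconds may carry a fraction; every other component is a whole number.
    return {k: float(v) if k == "seconds" else int(v) for k, v in parts.items()}


print(parse_duration("PT45M"))   # {'minutes': 45}
print(parse_duration("P2Y3M"))   # {'years': 2, 'months': 3}
```

The same code handles every Duration element. What the value means is settled by the element's context: a procedure in one place, a medical history in another.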

The normal response of PSVI proponents is that the type is necessary for documentation: that it serves as some sort of contract between the producer of the data and the consumer of the data. The type tells the eventual recipient that the sender intended a particular element to be treated as a non-negative integer, or a date in the Gregorian calendar, or in some other way. A schema and its types can certainly be useful for documentation purposes. It helps to more unambiguously define what is expected, and like all extra-document context it can be used to inform the development of the process that receives the document. However, the software generally won't dispatch based on the type. It will decide what to do with elements based on the element's name, namespace, position within the document, and sometimes content. The type will most often be ignored.

Even more importantly, there is absolutely no guarantee that the type the sender assigns to a particular element is in any way, shape, or form the type the recipient needs or can process. For example, some languages don't have integer types, only decimal types. Thus they might choose to parse an integer as a floating point number. Even more likely, they may choose to treat all numbers as strings or to deserialize data into a custom class or type such as a money object. The idea that each piece of data has a type that can satisfy all possible uses of the data in all environments is a fantasy. Regardless of what type it's assigned, different processes will treat the data differently, as befits their local needs and concerns.

Some more recent W3C specifications, especially XPath 2.0, XSLT 2.0, and XQuery, are based on the PSVI, and this has caused no end of problems. The specifications are far more complex and elaborate than they would be if they were based on basic XML 1.0 structures. They impose significant additional costs on implementers, significant enough that several developers who have implemented XPath and XSLT 1.0 have announced they won't be implementing version 2 of the specification. They also impose noticeable performance penalties on users because schema validation becomes a prerequisite for even the simplest transformation.

What exactly is bought for this extra cost? Honestly, that's hard to say. It's not clear what, if anything, can be done with XPath 2.0/XSLT 2.0/XQuery as currently designed that couldn't be done with a less typed language. It is possible to search a document for all the elements with type int or for all the attributes with type date, but that's rarely useful without knowing what the int or date represents. Indeed, in the XQuery Use Cases specification (May 2003, working draft), only 12 of 77 examples actually use typed data, and most of those could easily be rewritten to depend on element names rather than element types. Some developers suspect they can use strong typing information to optimize and speed up certain queries, but so far this is no more than a hypothesis without any real evidence to back it up. Indeed, the main purpose of all the static typing seems to be soothing the stomachs of the relational database vendors in the working group who are constitutionally incapable of digesting a dynamically typed language.
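
A query that depends on element names rather than types needs no schema at all. The sketch below uses Python's xml.etree.ElementTree and its limited XPath support in place of XPath 2.0 or XQuery; the Catalog document, the sku attribute, and the threshold of five boxes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; no schema is consulted anywhere in this example.
catalog = ET.fromstring(
    "<Catalog>"
    '<Product sku="A1"><CurrentStock>0</CurrentStock></Product>'
    '<Product sku="B2"><CurrentStock>17</CurrentStock></Product>'
    '<Product sku="C3"><CurrentStock>3</CurrentStock></Product>'
    "</Catalog>"
)

# Select by name, then apply the conversion the query itself calls for.
low_stock = [
    product.get("sku")
    for product in catalog.findall("./Product")
    if int(product.findtext("CurrentStock")) < 5
]

print(low_stock)  # ['A1', 'C3']
```

The query knows CurrentStock is a count because of what the element is, so it converts the text to an integer itself.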

The PSVI also causes trouble in many of the data binding APIs such as Castor, Zeus, JAXB, and others. Here the problem is a little less explicit. These APIs start with the assumption that they can read a schema to determine the input form of the document and then deserialize it into equivalent local objects. They rely on schema types to determine which local forms the data takes: int, double, BigDecimal, and so on. There are numerous problems with this approach, which are not shared by more type-agnostic processes.

More documents don't have schemas than do. Limiting yourself only to documents with schemas cuts way down on what you can usefully process.
Documents that have schemas don't always have schemas in the right language. Most data binding APIs such as JAXB are based around the W3C XML Schema Language. Most actual schemas are DTDs. A few tools can handle DTDs, but what do you do when the schema language is RELAX NG or something still more obscure?
Documents that do have schemas aren't always valid. Just because some expected content is missing doesn't mean there isn't useful information still in the document. It's even more likely that extra, unexpected content will not get in the way of your processing. However, because most data binding tools take validity as a prerequisite, they throw up their hands in defeat at the first sign of trouble. Type-agnostic processes soldier on.
Many data binding tools make assumptions about types, particularly complex types, that simply aren't true of many, perhaps most, XML documents. For instance, they assume order doesn't matter, mixed content doesn't exist, elements aren't repeated (the normalization fallacy), and more. In essence, they're trying to fit XML into object or relational structures rather than building objects or tables around XML. Too often when somebody talks about strong typing in XML, they really mean limiting XML to just a few types they happen to be familiar with that work well in their environment. They aren't considering approaching XML on its own terms.
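
As a rough illustration of how a type-agnostic process soldiers on, the following sketch reads a hypothetical Product element that is missing an expected Price, carries an unexpected Note, repeats its Category element, and has mixed content in its Description. All element names other than Price are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that a strict schema-driven binding would likely reject.
product = ET.fromstring(
    "<Product>"
    "<Name>Widget</Name>"
    "<Category>tools</Category><Category>garden</Category>"
    "<Description>A <b>sturdy</b> widget.</Description>"
    "<Note>not declared in any schema</Note>"
    "</Product>"
)

name = product.findtext("Name")
price = product.findtext("Price")            # missing element: None, not a failure
categories = [c.text for c in product.findall("Category")]   # repeats, in order
description = "".join(product.find("Description").itertext())  # mixed content

print(name, price, categories, description)
```

The unexpected Note element is simply never asked for, so it never gets in the way. The code decides for itself how much a missing Price matters.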

The PSVI is a useful theory for talking about schema languages and what they mean. However, its practice remains to be worked out. For the time being, simple validation is the most you can or should ask of a schema.