Item 37. Validate Inside Your Program with Schemas | Effective XML: 50 Specific Ways to Improve Your XML

Rigorously testing preconditions is an important characteristic of robust, reliable software. Schemas make it very easy to define the preconditions for XML documents you parse and the postconditions for XML documents you write. Even if the document itself does not have a schema, you can write one and use it to test the documents before you operate on them. It is quite hard to attach a DTD to a document inside a program. Fortunately, however, most other schema languages are much more flexible about this.

For example, let's suppose you're in charge of a system at TV Guide that accepts schedule information from individual stations over the Web. Information about each show arrives as an XML document formatted as shown in Example 37-1.

Example 37-1 An XML Instance Document Containing a Television Program Listing

<Program xmlns="http://namespaces.example.com/tvschedule">
  <Title>Reality Bites</Title>
  <Description>
    Elimination tournament in which contestants eat a
    succession of gross items until only one is left standing.
    Tonight's episode features rancid apples, insects, and
    McDonald's Happy Meals.
  </Description>
  <Date>2003-11-21</Date>
  <Start>08:00:00-05:00</Start>
  <Duration>PT30M</Duration>
  <Station>KFOX</Station>
</Program>

Every day, around the clock, stations from all over the country send schedule updates like this one that you need to store in a local database. Some of these stations use software you sold them. Some of them hire interns to type the data into a password-protected form on your web site. Others use custom software they wrote themselves. There may even be a few hackers typing the information into text files using emacs and then telnetting to your web server on port 80, where they paste in the data. There are about a dozen different places where mistakes can creep in. Therefore, before you even begin to think about processing a submission, you want to verify that it's correct. In particular, you want to verify the following.

The root element of the document is Program.
All required elements are present.
No more than one of each element is present.
The Title element is not empty.
The date is a legal date in the future.
The Start element contains a sensible time.
The duration looks like a period of time.
The station identifier is a four-letter code beginning with either K or W.
The station identifier maps to a known station somewhere in the country, which can be determined by looking it up in a database running on a different machine in your intranet.

You could write program code to verify all of these statements after the document was parsed. However, it's much easier to write a schema that describes them declaratively and let the parser check them. The W3C XML Schema Language, RELAX NG, and Schematron can all handle about 85% of these requirements. They all have problems with the requirements that the date be in the future and that the station be listed in a remote database. These will have to be checked using real programming code written in Java, C++, or some other language after the document has been parsed. However, we can make the other checks with a schema. Example 37-2 shows one possible W3C XML Schema Language schema that tests most of the above constraints.
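Before turning to the schema, here is a minimal Java sketch of the two checks that fall outside it: the date being in the future and the station being known. Everything named here (ProgramChecks, isFutureDate, isKnownStation, and the in-memory set standing in for the remote station database) is hypothetical and chosen only for illustration. The sketch also assumes the Date value carries no time zone suffix, although xsd:date permits one.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Set;
import java.util.function.Predicate;

/** Post-parse checks a schema cannot express. All names are hypothetical. */
final class ProgramChecks {

    /**
     * Returns true if isoDate (for example "2003-11-21") is a legal ISO date
     * that falls strictly after today's date according to the given clock.
     * Assumption: the value has no time zone suffix.
     */
    static boolean isFutureDate(String isoDate, Clock clock) {
        try {
            return LocalDate.parse(isoDate.trim()).isAfter(LocalDate.now(clock));
        } catch (DateTimeParseException e) {
            return false; // not a legal date at all
        }
    }

    /**
     * Delegates the lookup to a caller-supplied directory. A real system
     * would back this predicate with a query against the remote database.
     */
    static boolean isKnownStation(String callSign, Predicate<String> stationDirectory) {
        return stationDirectory.test(callSign.trim());
    }

    public static void main(String[] args) {
        // Stand-in for the station database on the other machine.
        Set<String> knownStations = Set.of("KFOX", "WABC");
        Clock clock = Clock.systemDefaultZone();

        System.out.println(isFutureDate("2003-11-21", clock));
        System.out.println(isKnownStation("KFOX", knownStations::contains));
    }
}
```

Passing the clock and the station directory in as parameters keeps both checks testable without a network connection or a dependence on the day the test happens to run.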

Example 37-2 A W3C XML Schema for Television Program Listings

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Program">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="Title">
          <xsd:simpleType>
            <xsd:restriction base="xsd:string">
              <xsd:minLength value="1"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
        <xsd:element name="Description" type="xsd:string"/>
        <xsd:element name="Date"        type="xsd:date"/>
        <xsd:element name="Start"       type="xsd:time"/>
        <xsd:element name="Duration"    type="xsd:duration"/>
        <xsd:element name="Station">
          <xsd:simpleType>
            <xsd:restriction base="xsd:token">
              <xsd:pattern value="(W|K)[A-Z][A-Z][A-Z]"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

For simplicity, I'll assume this schema resides at the URL http://www.example.com/tvprogram.xsd in the examples that follow, but you can store it anywhere convenient.
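To see a single constraint from Example 37-2 at work without fetching anything from a URL, the following sketch checks only the Station rule: a four-letter code beginning with W or K. It uses the standard JAXP javax.xml.validation API, which is a different route from the Xerces-J SAX properties and DOM Level 3 approaches. The class name StationPatternDemo and the cut-down schema embedded as a string are assumptions made for illustration. The embedded schema declares only the Station element, not the whole Program structure.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Validates a lone Station element against a reduced schema (hypothetical demo). */
final class StationPatternDemo {

    // Reduced schema: just the Station declaration, with W or K as the first letter.
    static final String STATION_XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='Station'>"
        + "<xsd:simpleType><xsd:restriction base='xsd:token'>"
        + "<xsd:pattern value='(W|K)[A-Z][A-Z][A-Z]'/>"
        + "</xsd:restriction></xsd:simpleType>"
        + "</xsd:element>"
        + "</xsd:schema>";

    /** Returns true if xml is a well-formed document that is valid against STATION_XSD. */
    static boolean isValidStation(String xml) {
        Validator validator;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                factory.newSchema(new StreamSource(new StringReader(STATION_XSD)));
            validator = schema.newValidator();
        } catch (SAXException e) {
            // A failure here means the embedded schema is broken, not the input.
            throw new IllegalStateException("embedded schema failed to compile", e);
        }
        try {
            // With no ErrorHandler installed, validity errors surface as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // invalid or malformed input
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidStation("<Station>KFOX</Station>"));
        System.out.println(isValidStation("<Station>XFOX</Station>"));
    }
}
```

Compiling the schema on every call keeps the sketch short. A long-running server would more likely build the Schema object once and create a new Validator from it for each document.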

There are several different ways to programmatically validate a document, depending on the schema language, the parser, and the API. Here I'll demonstrate two: Xerces-J using SAX properties and DOM Level 3 validation.