Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX


Validity

Programmers have long known the value of verifiable preconditions on functions and methods. (A lot of us carelessly don't use them, but that's a topic for another book.) One of the important innovations of XML is the ability to place preconditions on the data the programs read, and to do this in a simple declarative way. XML allows you to say that every Order element must contain exactly one Customer element; that each Customer element must have an id attribute that contains an XML name token; that every ShipTo element must contain one or more Streets, one City, one State, and one Zip; and so forth. Checking an XML document against this list of conditions is called validation. Validation is an optional step but an important one.

There is more than one language in which you can express such conditions. Generically these are called schema languages, and the documents that list the constraints are called schemas. Various schema languages have different strengths and weaknesses. The document type definition (DTD) is the only schema language built into most XML parsers and endorsed as a standard part of XML. However, because of the extensible nature of XML, many other schema languages have been invented that you can easily integrate with your systems.

DTDs

A DTD focuses on the element structure of a document. It specifies what elements a document may contain, what each element may and must contain and in what order, and what attributes each element has.

Element Declarations

In order to be valid according to a DTD, each element used in the document must be declared in an ELEMENT declaration. For example, the following ELEMENT declaration specifies that Name elements contain #PCDATA, that is, text but no child elements:

 <!ELEMENT Name (#PCDATA)> 

Elements that can have children are declared by listing the names of their children in order, separated by commas. For example, the following ELEMENT declaration says that an Order element contains a Customer element, a Product element, a Subtotal element, a Tax element, a Shipping element, and a Total element in that order:

 <!ELEMENT Order (Customer, Product, Subtotal, Tax, 
          Shipping, Total)> 

The parenthesized list of things an element can contain is called the element's content model. You can attach a question mark (?) after an element name in the content model to indicate that the element is optional; that is, that either zero or one instance of the element may occur at that position. You can attach an asterisk (*) after the element name to indicate that zero or more instances of the element may occur at that position, or a plus sign (+) to indicate that one or more instances of the element must occur at that position. For example, the following element declaration states that a ShipTo element must contain zero or one GiftRecipient elements, one or more Street elements, and exactly one City, State, and Zip element each in that order:

 <!ELEMENT ShipTo (GiftRecipient?, Street+, City, State, Zip)> 

You can use a vertical bar (|) instead of a comma to indicate that either one or the other of the elements may appear. You can group collections of elements with parentheses to indicate that the entire group should be treated as a unit. You can suffix a *, ?, or + to the group to indicate that zero or more, zero or one, or one or more of those groups may appear at that point. Finally, you can replace the entire content model with the keyword EMPTY to specify that the element must contain no content at all.

Attribute Declarations

A DTD also specifies which attributes may appear and which must appear on which elements. Each attribute is declared in an ATTLIST declaration, which specifies

  • The element to which the attribute belongs

  • The name of the attribute

  • The type of the attribute

  • The default value of the attribute

For example, the following ATTLIST declaration states that every Customer element must have an attribute named id with type ID:

 <!ATTLIST Customer id ID #REQUIRED> 

DTDs define ten different types for attributes.

CDATA

Any string of text.

NMTOKEN

A string composed of one or more legal XML name characters. Unlike an XML name, a name token may start with a digit.

NMTOKENS

A white-space-separated list of name tokens.

ID

An XML name that is unique among ID type attributes in the document.

IDREF

An XML name used as an ID attribute value on some element in the document.

IDREFS

A white-space-separated list of XML names used as ID attribute values somewhere in the document.

ENTITY

The name of an unparsed entity declared in an ENTITY declaration in the DTD.

ENTITIES

A white-space-separated list of unparsed entities declared in the DTD.

NOTATION

The name of a notation declared in a NOTATION declaration in the DTD.

Enumeration

A list of all legal values for the attribute, separated by vertical bars. Each possible value must be an XML name token.

Most parsers and APIs will tell you what the type of an attribute is if you want to know, but in practice this knowledge is not very useful. W3C XML Schema Language schemas offer much more complete data typing for both elements and attributes, including not only these types but also the more customary data types such as int and double.

DTDs allow four possible default values for attributes:

#REQUIRED

Each element in the instance document must provide a value for this attribute.

#IMPLIED

Each element in the instance document may or may not provide a value for this attribute. If an element does not provide a value, then no default value is provided from the DTD. [8]

[8] This is really a poor choice of terminology, as nothing is being implied here. A more accurate keyword would be #OPTIONAL. However, #IMPLIED is what XML gives us.

#FIXED "value"

The attribute always has the value that follows #FIXED in double or single quotes, whether or not it's present in the instance document.

"value"

By default the attribute has the value specified in the DTD in single or double quotes. However, individual instances of the element may specify a different value.

Parsers may or may not tell you whether an attribute came from the instance document or came by default from the DTD, and seldom do you care about this one way or the other. However, if you're using a document that relies heavily on attribute values from DTDs (for example, for namespace declarations), then make sure you're using a parser that does read the external DTD subset.
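For parsers that do report the distinction, the DOM exposes it through the getSpecified() method of the Attr interface. The following is only a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name DefaultAttributeSketch, the helper names, and the shipping amount are made up for illustration, and the DTD is placed in the internal subset so that no external file is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name; not part of the book's examples.
class DefaultAttributeSketch {

    // The Shipping declarations from the order DTD, moved into the
    // internal DTD subset. The instance omits the method attribute.
    static final String XML =
        "<!DOCTYPE Shipping [\n"
      + "  <!ELEMENT Shipping (#PCDATA)>\n"
      + "  <!ATTLIST Shipping currency (USD | CAN | GBP) #REQUIRED\n"
      + "                     method   (USPS | UPS | Overnight) \"UPS\">\n"
      + "]>\n"
      + "<Shipping currency=\"USD\">8.95</Shipping>";

    /** Parses the text with a non-validating parser and returns the root element. */
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    /** Describes an attribute's value and where that value came from. */
    static String describe(Element element, String name) {
        Attr attr = element.getAttributeNode(name);
        if (attr == null) {
            return name + " is absent";
        }
        // getSpecified() is false when the value was supplied by the DTD
        String origin = attr.getSpecified() ? "instance document" : "DTD default";
        return name + "=" + attr.getValue() + " (" + origin + ")";
    }

    public static void main(String[] args) throws Exception {
        Element shipping = parseRoot(XML);
        System.out.println(describe(shipping, "currency"));
        System.out.println(describe(shipping, "method"));
    }
}
```

Here the parser is not validating at all; it still reads the internal DTD subset and fills in the default, which is why the choice of parser matters for documents that depend on such defaults.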

Example 1.8 is a complete DTD for order documents of the type shown in this chapter. It uses both ELEMENT and ATTLIST declarations.

Example 1.8 A DTD for Order Documents
 <!ELEMENT Order (Customer, Product+, Subtotal, Tax,
          Shipping, Total)>
<!ELEMENT Customer (#PCDATA)>
<!ATTLIST Customer id ID #REQUIRED>
<!ELEMENT Product (Name, SKU, Quantity, Price, Discount?,
                   ShipTo, GiftMessage?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT SKU (#PCDATA)>
<!ELEMENT Quantity (#PCDATA)>
<!ELEMENT Price (#PCDATA)>
<!ATTLIST Price currency (USD | CAN | GBP) #REQUIRED>
<!ELEMENT Discount (#PCDATA)>
<!ELEMENT ShipTo (GiftRecipient?, Street+, City, State, Zip)>
<!ELEMENT GiftRecipient (#PCDATA)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT City   (#PCDATA)>
<!ELEMENT State  (#PCDATA)>
<!ELEMENT Zip    (#PCDATA)>
<!ELEMENT GiftMessage (#PCDATA)>
<!ELEMENT Subtotal (#PCDATA)>
<!ATTLIST Subtotal currency (USD | CAN | GBP) #REQUIRED>
<!ELEMENT Tax (#PCDATA)>
<!ATTLIST Tax currency (USD | CAN | GBP) #REQUIRED
              rate CDATA "0.0"
>

<!ELEMENT Shipping (#PCDATA)>
<!ATTLIST Shipping currency (USD | CAN | GBP) #REQUIRED
                   method   (USPS | UPS | Overnight) "UPS">
<!ELEMENT Total (#PCDATA)>
<!ATTLIST Total currency (USD | CAN | GBP) #REQUIRED>

Document Type Declarations

Documents are associated with particular DTDs through document type declarations. Following is a document type declaration that points to the DTD in Example 1.8:

 <!DOCTYPE Order SYSTEM "order.dtd"> 

The document type declaration is placed in the instance document's prolog, after the XML declaration but before the root element start-tag. For example,

 <?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE Order SYSTEM "order.dtd">
<Order>
  ... 

This does assume that the DTD can be found in the same directory where the document itself resides. If you prefer, you can use an absolute URL instead. For example,

 <?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE Order SYSTEM "http://www.cafeconleche.org/dtds/order.dtd">
<Order>
  ... 

Even though Example 1.5 satisfies all the conditions expressed in Example 1.8, it is not valid because it does not have a document type declaration pointing to that DTD.
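To make the point concrete, here is a rough sketch of how validity shows up in a program. It asks a JAXP DocumentBuilder to validate and records whether the parser reported any validity errors. The class name DtdValidationSketch, the cut-down DTD (only the Customer declarations, placed in the internal subset), and the customer data are assumptions made to keep the example self-contained; they are not taken from the chapter's examples.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical class name; not part of the book's examples.
class DtdValidationSketch {

    // A document type declaration with an internal DTD subset, so the
    // sketch does not depend on an external order.dtd file.
    static final String DOCTYPE =
        "<!DOCTYPE Customer [\n"
      + "  <!ELEMENT Customer (#PCDATA)>\n"
      + "  <!ATTLIST Customer id ID #REQUIRED>\n"
      + "]>\n";

    /** Returns true if a validating parser reports no validity errors. */
    static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // request DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();

        final boolean[] valid = {true};
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                // warnings do not affect validity
            }
            public void error(SAXParseException e) {
                valid[0] = false; // validity errors arrive here
            }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // well-formedness errors are not recoverable
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return valid[0];
    }

    public static void main(String[] args) throws Exception {
        // Declared and conforming
        System.out.println(isValid(DOCTYPE + "<Customer id=\"c32\">Chez Fred</Customer>"));
        // Declared, but the required id attribute is missing
        System.out.println(isValid(DOCTYPE + "<Customer>Chez Fred</Customer>"));
        // Conforming content, but no document type declaration at all
        System.out.println(isValid("<Customer id=\"c32\">Chez Fred</Customer>"));
    }
}
```

With the parser bundled in the JDK, only the first document should be reported as valid. The third fails for the reason just described: without a document type declaration there is nothing to validate against.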

Caution

The acronym DTD is correctly used only to mean "document type definition." It should never be used to mean "document type declaration." The document type declaration may contain or point to the document type definition (or both), but the two are not the same.


DTDs are not just about validation. They can also affect the content of the instance document itself. In particular, they can

  • Define entities

  • Define notations

  • Provide default values for attributes

Assuming you are using a validating parser, there is little reason to care about how such things happen. The entities the DTD defines will be resolved before you see them. The notations will be applied to the appropriate elements and entities. A default attribute value will be just one more attribute in an element's list of attributes. Some APIs may tell you what entity a particular element came from, or whether an attribute value was defaulted from the DTD or present in the instance document. However, most of the time you simply do not need to know this.

Schemas

The W3C XML Schema Language (schemas for short, though it's hardly the only schema language) addresses several limitations of DTDs. First, schemas are written in XML instance document syntax, using tags, elements, and attributes. Second, schemas are fully namespace aware. Third, schemas can assign data types such as integer and date to elements, and validate documents based not only on the element structure but also on the element contents.

Example 1.9 shows a schema for order documents. Where order.dtd uses an ELEMENT declaration, order.xsd uses an xsd:element element. Where order.dtd uses an ATTLIST declaration, order.xsd uses an xsd:attribute element.

But order.xsd doesn't only repeat the same constraints found in order.dtd; it also assigns types and ranges to the elements. For example, it requires that all the money elements Tax, Shipping, Subtotal, Total, and Price contain a decimal number such as 9.85, 7.2, or -3.25. [9] If one of these elements contained text that was not a decimal number, such as "France," then the validator would notice and report the problem. DTDs cannot detect mistakes like this. A DTD can note that there is no Price element where one is expected, but it cannot determine that the Price element does not actually give a price.

[9] It would be possible to require further that each money item be a positive number with two decimal digits of precision such as 9.85 but not 7.2 or -3.25, but for now I want to keep this example smaller.

Example 1.9 order.xsd: A Schema for Order Documents
 <?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="Order">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Customer">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="xsd:string">
                <xsd:attribute name="id" type="xsd:ID"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
         </xsd:element>
        <xsd:element name="Product" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Name"     type="xsd:string"/>
              <xsd:element name="SKU"
               type="xsd:positiveInteger"/>
              <xsd:element name="Quantity"
               type="xsd:positiveInteger"/>
              <xsd:element name="Price"    type="MoneyType"/>
              <xsd:element name="Discount" type="xsd:decimal"
                           minOccurs="0"/>
              <xsd:element name="ShipTo">
                <xsd:complexType>
                  <xsd:sequence>
                    <xsd:element name="GiftRecipient"
                     type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded"/>
                    <xsd:element name="Street"
                     type="xsd:string"/>
                    <xsd:element name="City" type="xsd:string"/>
                    <xsd:element name="State"
                     type="xsd:string"/>
                    <xsd:element name="Zip" type="xsd:string"/>
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
              <xsd:element name="GiftMessage" type="xsd:string"
                           minOccurs="0"/>
            </xsd:sequence>
          </xsd:complexType>  
        </xsd:element>
        <xsd:element name="Subtotal" type="MoneyType"/>
        <xsd:element name="Tax">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="MoneyType">
                <xsd:attribute name="rate" type="xsd:decimal"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="Shipping">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="MoneyType">
                <xsd:attribute name="method" type="xsd:string"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="Total" type="MoneyType"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="MoneyType">
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

</xsd:schema> 

There are multiple ways to indicate that a document should satisfy a known schema. The most common is an xsi:noNamespaceSchemaLocation attribute on the root element of the instance document. The xsi prefix is bound to the http://www.w3.org/2001/XMLSchema-instance URI. For example,

 <?xml version="1.0" encoding="ISO-8859-1"?> 
<Order xsi:noNamespaceSchemaLocation="order.xsd"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  ... 

Some parsers also provide ways to specify a schema from inside a program, for example by setting various properties. I'll discuss this more when we get to programmatic validation in Chapter 7.
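As a rough idea of what specifying a schema from inside a program can look like, the following sketch uses the javax.xml.validation package, which was added to the Java platform after this text was written and is not one of the parser-specific properties mentioned above. The class name SchemaValidationSketch and the one-element schema are assumptions; the schema is a cut-down stand-in for order.xsd that types a single Price element as a decimal.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical class name; not part of the book's examples.
class SchemaValidationSketch {

    // A deliberately tiny stand-in for order.xsd: one typed element.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">\n"
      + "  <xsd:element name=\"Price\" type=\"xsd:decimal\"/>\n"
      + "</xsd:schema>";

    /** Returns true if the document satisfies the schema above. */
    static boolean satisfiesSchema(String xml) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The exception message describes the first problem found
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(satisfiesSchema("<Price>9.85</Price>"));
        System.out.println(satisfiesSchema("<Price>France</Price>"));
    }
}
```

The schema is supplied entirely by the program here, so the instance documents need no xsi:noNamespaceSchemaLocation attribute.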

Schemas are still pretty bleeding-edge technology at the time of this writing (September 2002). There are only a few parsers that provide complete implementations of the full W3C XML Schema Language 1.0 specification. Nonetheless, developers have been clamoring for this functionality (if not necessarily this syntax) for some time, so schemas seem likely to achieve broad adoption relatively quickly.

For the moment, schema support is limited to simple validation, much as DTD support is. A schema-aware parser will read an XML document, compare what it sees there with a schema, and return a boolean result: the document either satisfies the schema or it does not. If the document fails to satisfy the schema, the parser might give you a line number and a more detailed error message specifying the problem, but that's it. More complete use of schemas, in which parsers tell you what the type of any element is so you can, for example, convert elements with type xsd:int to actual Java ints, is still a matter for research and experimentation.

Schematron

Rick Jelliffe's Schematron is a radically different approach to an XML schema language. Whereas other languages are conservative (everything not permitted is forbidden), Schematron is liberal (everything not forbidden is permitted). Furthermore, Schematron is based on XPath, so it can check co-occurrence constraints between elements and attributes; for example, that the content of the total price element must be equal to the sum of the content of the subtotal, tax, and shipping elements. Finally, Schematron can be implemented as an XSLT stylesheet rather than requiring special software.

Example 1.10 shows a Schematron schema for order documents. To keep the example smaller, I did not test absolutely everything I could. Instead, I took advantage of Schematron's liberality to test only those conditions that neither DTDs nor schemas can validate; for example, that the total price is the sum of the subtotal, the tax, and the shipping. I haven't necessarily lost anything by doing this, because I can validate a single document against multiple different schemas. For instance, orders could first be checked against the DTD, then checked against a W3C XML Schema Language schema, and then checked against this Schematron schema only if they passed the first two tests.

Example 1.10 order.sct: A Schematron Schema for Order Documents
 <?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>A Schematron Schema for Orders</title>
  <pattern>
    <rule context="Order">
      <!-- Due to round-off error, floating point numbers
           should rarely be compared for direct equality.
           For this purpose, it's enough if they're accurate
           within one penny. -->
      <assert test="(Shipping+Subtotal+Tax - Total)&lt;0.01
                and (Shipping+Subtotal+Tax - Total)&gt;-0.01">
        The subtotal, tax, and shipping
        must add up to the total.
      </assert>
      
      <assert test=
       "(Subtotal+Shipping)*((Tax/@rate) div 100.0)
         - Tax &lt; 0.01 and (Subtotal+Shipping)*((Tax/@rate)
         div 100.0)-Tax &gt; -0.01"
      >
        The tax was incorrectly calculated.
      </assert>

    </rule>
  </pattern>
</schema> 
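The test attributes in this schema are ordinary XPath expressions, so it is possible to get a feel for what the first assertion checks without any Schematron software. The sketch below is not Schematron; it simply evaluates the same expression, with the &lt; and &gt; escapes undone, against the Order element using the javax.xml.xpath package, a later addition to the Java platform. The class name TotalAssertionSketch and the amounts in the sample orders are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name; not part of the book's examples.
class TotalAssertionSketch {

    // The first assertion's test expression from the schema above.
    static final String TOTAL_TEST =
        "(Shipping+Subtotal+Tax - Total) < 0.01"
      + " and (Shipping+Subtotal+Tax - Total) > -0.01";

    /** Evaluates the assertion with the root Order element as context. */
    static boolean totalsAddUp(String orderXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(orderXml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The context node plays the role of the rule's context="Order"
        return (Boolean) xpath.evaluate(
            TOTAL_TEST, doc.getDocumentElement(), XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        String good = "<Order><Subtotal>10.00</Subtotal><Tax>0.70</Tax>"
                    + "<Shipping>5.00</Shipping><Total>15.70</Total></Order>";
        String bad  = "<Order><Subtotal>10.00</Subtotal><Tax>0.70</Tax>"
                    + "<Shipping>5.00</Shipping><Total>20.00</Total></Order>";
        System.out.println(totalsAddUp(good));
        System.out.println(totalsAddUp(bad));
    }
}
```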

XPath is not by itself Turing complete, so there are still some limits to what you can express in a Schematron schema. For example, you can't multiply the Quantity by the Price for each Product element and make sure that the sum of these equals the Subtotal. However, Schematron is still much more powerful than other schema languages.
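A general-purpose language has no such limit. The following is a minimal sketch of the line-item check in Java, using DOM and BigDecimal. The class and method names are hypothetical, the sample amounts are invented, and the sketch ignores Discount elements entirely, which a real check would have to account for.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; not part of the book's examples.
class SubtotalCheckSketch {

    /** Reads the first descendant element with the given name as a decimal. */
    static BigDecimal decimalOf(Element parent, String name) {
        String text = parent.getElementsByTagName(name).item(0).getTextContent();
        return new BigDecimal(text.trim());
    }

    /** True if the sum of Quantity * Price over all Products equals Subtotal. */
    static boolean subtotalMatchesLineItems(String orderXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(orderXml)));
        Element order = doc.getDocumentElement();

        BigDecimal sum = BigDecimal.ZERO;
        NodeList products = order.getElementsByTagName("Product");
        for (int i = 0; i < products.getLength(); i++) {
            Element product = (Element) products.item(i);
            BigDecimal lineTotal =
                decimalOf(product, "Quantity").multiply(decimalOf(product, "Price"));
            sum = sum.add(lineTotal);
        }
        // compareTo ignores differences in scale, such as 11.250 versus 11.25
        return sum.compareTo(decimalOf(order, "Subtotal")) == 0;
    }

    public static void main(String[] args) throws Exception {
        String order =
            "<Order>"
          + "<Product><Quantity>2</Quantity><Price>3.50</Price></Product>"
          + "<Product><Quantity>1</Quantity><Price>4.25</Price></Product>"
          + "<Subtotal>11.25</Subtotal>"
          + "</Order>";
        System.out.println(subtotalMatchesLineItems(order));
    }
}
```

BigDecimal is used rather than double so that the comparison can be exact instead of relying on a one-penny tolerance.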

Schematron is implemented in a very unusual fashion. First you run your Schematron schema through an XSLT processor using a skeleton stylesheet that Jelliffe provides. This produces a new XSLT stylesheet. In essence, this compiles the Schematron schema into an XSLT stylesheet. The compiler itself is written in XSLT. You then transform all your instance documents using the compiled schema. If any of the assertions fail, the output will contain the assertion message. Otherwise it will contain just the XML declaration. For example, using Michael Kay's SAXON XSLT processor to validate Example 1.2 against Example 1.10 yields the following:

 C:\XMLJAVA>saxon order.sct skeleton1-5.xsl > order_sct.xsl
C:\XMLJAVA>saxon order.xml order_sct.xsl
<?xml version="1.0" encoding="utf-8"?>

Schematron is the idiosyncratic product of one person. It therefore is not a standard part of any major parser, unlike DTDs and the W3C XML Schema Language. However, it's not particularly difficult to install Jelliffe's Schematron validation software into most systems. Because Schematron is implemented in XSLT, all you need is a good API to access an XSLT engine. I'll take this up again in Chapter 17 when I discuss APIs for XSLT.

The Last Mile

Although Schematron is powerful, there are some checks it cannot perform. In particular, it cannot perform any checks that require information external to the document and the schema. For example, it cannot verify that the page at a referenced URL is reachable. It cannot verify that a file exists on the local file system. It cannot compare the SKUs, names, and prices in an order document with their values in a remote database. None of the existing schema languages allows you to state conditions such as these.

Java can do all of these things. The java.net.URL class can easily test whether a URL is live. The exists() method of the java.io.File class is a simple test for whether a file is where you think it is. JDBC is a whole API for remote database access. However, unlike the more limited constraints of DTDs, the W3C XML Schema Language, or even Schematron, simply listing the conditions is not enough. To test such conditions, you have to write the code that tests them. Nobody has done the hard work for you. There will always be some constraints that require a full-blown programming language to check. Indeed, doing exactly this will be a major focus of this book.

One thing you can learn from the existing languages is the clean way in which they separate validation from processing. If you design your own validation layer, you should do that too. Perform all validation before the document is processed for its contents. If possible, separate the constraints from the code that checks them.
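One possible shape for such a layer, offered only as a sketch under assumptions and not as a recommended design, is to hold the constraints as data (a name paired with a test) and to keep the code that runs them generic. The class name ValidationLayerSketch, the method names, and the two file constraints are hypothetical; the constraints use the exists() and canRead() methods of java.io.File.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical class name; not part of the book's examples.
class ValidationLayerSketch {

    /** The constraints, kept apart from the code that checks them. */
    static Map<String, Predicate<File>> fileConstraints() {
        Map<String, Predicate<File>> constraints = new LinkedHashMap<>();
        constraints.put("file exists", File::exists);
        constraints.put("file is readable", File::canRead);
        return constraints;
    }

    /** Runs every constraint and returns the names of those that fail. */
    static <T> List<String> violations(T input, Map<String, Predicate<T>> constraints) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Predicate<T>> constraint : constraints.entrySet()) {
            if (!constraint.getValue().test(input)) {
                failed.add(constraint.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        File order = new File(args.length > 0 ? args[0] : "order.xml");
        // All validation happens before any processing begins
        List<String> failed = violations(order, fileConstraints());
        if (failed.isEmpty()) {
            System.out.println("All checks passed; ready to process " + order);
        } else {
            System.out.println("Not processing " + order + "; failed: " + failed);
        }
    }
}
```

Adding a new constraint then means adding one entry to the map, and the checking loop does not change.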