Introducing Document Types | XML: A Manager's Guide (2nd Edition) (Addison-Wesley Information Technology Series)

From a business perspective, the single order document in Example 2-5 is obviously less useful in isolation than as a collection of many order documents. It would hardly make sense to write a piece of software or design a screen layout for just one document. The true power of XML to improve business processes comes when multiple documents all use the same data format. Then a single piece of software can process them all, and a single screen layout can display them all. Most important, parties can quickly agree on the structure of their information exchange by referring to the appropriate format.

DTDs provide these common, public data formats. A DTD is a collection of rules that specifies the allowable structure of a document class. It serves as a format referee at two important points in the software life cycle. During the design phase, a software developer can look at a DTD and know that as long as the application he builds outputs documents that conform to that DTD, other applications can process those documents. During the execution phase, the XML processor can verify that a document conforms to the DTD so the application that processes the document knows that it will receive properly structured content. In essence, the DTD is a contract between the supplier of the document and the consumer. A particular XML document is valid if it obeys all the rules of its parent DTD as well as the criteria for well-formed documents.

Earlier in the chapter, you saw that elements and attributes are the two primary constructs in an XML document. It shouldn't surprise you to learn that the syntax for DTDs deals primarily with specifying allowable element structure and rules that attributes must follow.

Defining Element Structure

In Example 2-1b, you saw a DTD for a simple business card that specified the shape of the document tree: the nodes and allowable branches from each node. In Example 2-7, this DTD is annotated to illustrate how to use DTD syntax to specify the different types of content.

Example 2-7

 <!ELEMENT BusinessCard (Name, Title, Author, ContactMethods)>  <!-- document element -->
 <!ELEMENT Name (GivenName, MiddleName?, FamilyName)>           <!-- element content -->
 <!ELEMENT GivenName (#PCDATA)>                                 <!-- data content -->
 <!ELEMENT MiddleName (#PCDATA)>                                <!-- data content -->
 <!ELEMENT FamilyName (#PCDATA)>                                <!-- data content -->
 <!ELEMENT Title (#PCDATA)>                                     <!-- data content -->
 <!ELEMENT Author EMPTY>                                        <!-- empty content -->
 <!ELEMENT ContactMethods (Phone*)>                             <!-- element content -->
 <!ELEMENT Phone (#PCDATA)>                                     <!-- data content -->

Each element declaration begins with "<!ELEMENT" and ends with ">". It contains the element name and a content model, which is enclosed in parentheses unless it is the keyword EMPTY. As we saw earlier, there are four types of allowable content. The corresponding syntax for each content model is as follows.

Data content. These elements contain only data. To indicate this structure, the element declaration specifies a content model of #PCDATA.
Element content. These elements contain only other elements. To indicate this structure, the element declaration specifies a content model that lists the element names, separated by commas. Note that this list is order sensitive.
Empty. These elements contain neither elements nor data. To indicate this structure, the element declaration uses the keyword EMPTY as the content model.
Mixed content. These elements contain both data and other elements. To indicate this structure, the element declaration specifies a content model that includes #PCDATA to indicate that data content is allowed and element names to indicate that elements of these types are allowed.
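
Example 2-7 contains no mixed-content declaration, so a minimal sketch may help. In XML 1.0, a mixed content model lists #PCDATA first, separates the allowed element names with vertical bars, and follows the closing parenthesis with an asterisk. The element names "Note," "Emphasis," and "PartNumber" are hypothetical and do not come from the book's examples.

```xml
<?xml version="1.0"?>
<!DOCTYPE Note [
  <!-- Mixed content: #PCDATA first, alternatives separated by vertical
       bars, and the whole group followed by an asterisk -->
  <!ELEMENT Note (#PCDATA | Emphasis | PartNumber)*>
  <!ELEMENT Emphasis (#PCDATA)>
  <!ELEMENT PartNumber (#PCDATA)>
]>
<Note>Ship <PartNumber>RAM-256</PartNumber> by Friday; this order is <Emphasis>urgent</Emphasis>.</Note>
```

Because of the asterisk, text and the listed elements may appear in any order and any number of times, so a DTD cannot constrain the sequence of children inside a mixed-content element.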

In defining the element model, a document designer starts with the document element. After the document element, the designer moves on to the subelements of the document element. Then come the subelements of those elements, continuing until there are only leaf elements, those that have only data content or are empty.
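
As a sketch of how these declarations constrain an actual document, the following business card carries the declarations of Example 2-7 in an internal DTD subset. It assumes that "FamilyName" is declared as data content, as the content model of "Name" requires; the names and phone number are made-up sample data. A validating parser (for example, xmllint with the --valid and --noout options) should accept it.

```xml
<?xml version="1.0"?>
<!DOCTYPE BusinessCard [
  <!-- Document element first, then its subelements, down to the leaves -->
  <!ELEMENT BusinessCard (Name, Title, Author, ContactMethods)>
  <!ELEMENT Name (GivenName, MiddleName?, FamilyName)>
  <!ELEMENT GivenName (#PCDATA)>
  <!ELEMENT MiddleName (#PCDATA)>
  <!ELEMENT FamilyName (#PCDATA)>
  <!ELEMENT Title (#PCDATA)>
  <!ELEMENT Author EMPTY>
  <!ELEMENT ContactMethods (Phone*)>
  <!ELEMENT Phone (#PCDATA)>
]>
<BusinessCard>
  <Name>
    <GivenName>Pat</GivenName>
    <!-- MiddleName is optional and omitted here -->
    <FamilyName>Example</FamilyName>
  </Name>
  <Title>Analyst</Title>
  <Author/>
  <ContactMethods>
    <Phone>555-0100</Phone>
  </ContactMethods>
</BusinessCard>
```

Swapping "Title" and "Name," or leaving out "Author," would make the document invalid, because element content is order sensitive and every listed subelement is required unless marked otherwise.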

In addition to these basic content models, DTD designers may use special characters to encode rules about the number of subelements that an element may contain. These cardinality rules use the following syntax. Note that the default cardinality is exactly 1.

0 or 1. The ? character indicates an optional subelement. So <!ELEMENT Person (FirstName, MiddleName?, LastName)> indicates that "Person" must have one "FirstName," then may or may not have one "MiddleName," then must have one "LastName."
0 or more. The * character indicates a subelement that may appear zero or more times. So <!ELEMENT ContactMethods (Phone*)> indicates that "ContactMethods" may have any number of "Phone" elements, including none.
1 or more. The + character indicates a subelement that must appear at least once. So <!ELEMENT ContactMethods (Phone+)> would indicate that there must be at least one "Phone."
Enumerated alternatives. A list of subelements separated by vertical bars indicates that the element must contain exactly one of the subelements in the list. So, <!ELEMENT Payment (Cash | Check | Card)> would indicate that "Payment" must contain either "Cash," "Check," or "Card."
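
The cardinality rules above can be sketched in one small document. The "Customer" wrapper element and the EMPTY declarations for "Cash," "Check," and "Card" are assumptions made only to keep the example self-contained; the three content models for "Person," "ContactMethods," and "Payment" are the ones quoted in the list.

```xml
<?xml version="1.0"?>
<!DOCTYPE Customer [
  <!-- Customer is a hypothetical wrapper element -->
  <!ELEMENT Customer (Person, ContactMethods, Payment)>
  <!ELEMENT Person (FirstName, MiddleName?, LastName)>
  <!ELEMENT ContactMethods (Phone+)>
  <!ELEMENT Payment (Cash | Check | Card)>
  <!ELEMENT FirstName (#PCDATA)>
  <!ELEMENT MiddleName (#PCDATA)>
  <!ELEMENT LastName (#PCDATA)>
  <!ELEMENT Phone (#PCDATA)>
  <!-- Assumed to be empty for brevity -->
  <!ELEMENT Cash EMPTY>
  <!ELEMENT Check EMPTY>
  <!ELEMENT Card EMPTY>
]>
<Customer>
  <Person>
    <FirstName>Pat</FirstName>
    <!-- No MiddleName: permitted by the ? marker -->
    <LastName>Example</LastName>
  </Person>
  <ContactMethods>
    <!-- Two Phone elements: permitted by the + marker -->
    <Phone>555-0100</Phone>
    <Phone>555-0101</Phone>
  </ContactMethods>
  <Payment>
    <!-- Exactly one of the three alternatives -->
    <Check/>
  </Payment>
</Customer>
```

An empty "ContactMethods" element would be invalid here because of the + marker, and a "Payment" element containing both "Cash" and "Check" would be invalid because the alternatives are mutually exclusive.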

Document designers can combine different rules in the same element declaration, using parentheses to group subelements together. For example, <!ELEMENT WorkId (Passport | (DriversLicense, SocialSecurityCard))> indicates that acceptable "WorkId" consists of either a "Passport" or both a "DriversLicense" and a "SocialSecurityCard." The declaration <!ELEMENT EMailList (Name?, (MailServer, To+, CC*)+, (Version | Updated))> indicates that an "EMailList" has (1) an optional "Name"; (2) one or more blocks that each include one "MailServer," one or more "To," and zero or more "CC"; and (3) either a "Version" or an "Updated." As you can see, the syntax for defining element structure is rich enough to make plain English descriptions difficult. These basic building blocks give developers the power to specify sophisticated content models.
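
Since the "EMailList" content model is hard to describe in words, a conforming instance may be easier to read. This sketch assumes every subelement holds character data, which the text does not state, and all addresses and the date are invented sample values.

```xml
<?xml version="1.0"?>
<!DOCTYPE EMailList [
  <!ELEMENT EMailList (Name?, (MailServer, To+, CC*)+, (Version | Updated))>
  <!-- Leaf declarations are assumptions; the text gives only the model above -->
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT MailServer (#PCDATA)>
  <!ELEMENT To (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT Version (#PCDATA)>
  <!ELEMENT Updated (#PCDATA)>
]>
<EMailList>
  <Name>Project team</Name>
  <!-- First block: one MailServer, two To, no CC -->
  <MailServer>mail.example.com</MailServer>
  <To>a@example.com</To>
  <To>b@example.com</To>
  <!-- Second block: one MailServer, one To, one CC -->
  <MailServer>mail.example.org</MailServer>
  <To>c@example.org</To>
  <CC>d@example.org</CC>
  <!-- Either Version or Updated, but not both -->
  <Updated>2002-01-15</Updated>
</EMailList>
```

Note that the parenthesized group leaves no trace in the document itself: the blocks are recognizable only because each one starts with a "MailServer" element.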

Defining Attribute Rules

As you've already learned, elements are only one of the constructs available to XML document authors. Attributes enhance the meaning of element content by providing additional metadata. Not surprisingly, XML DTDs include syntax for defining the rules that attributes must follow. Example 2-8 is a DTD for the database schema definition document from Example 2-5. As you can see, the element declarations specify a database document element with one or more table elements. Each table element must have one or more column elements, each of which has character data.

Example 2-8

 <!ELEMENT Database (Table+)>
 <!ATTLIST Database
   dbType    CDATA    #REQUIRED
   address   CDATA    #IMPLIED>
 <!ELEMENT Table (Column+)>
 <!ATTLIST Table
   name      CDATA    #REQUIRED>
 <!ELEMENT Column (#PCDATA)>
 <!ATTLIST Column
   dataType  (String | Int | Float | Date | BLOB)  "String">

Remember that in Example 2-5 attributes were added to an existing set of elements. Therefore, in addition to element declarations, Example 2-8 includes attribute declarations that define the rules that these attributes must follow. These declarations begin with "<!ATTLIST" and end with ">". Internally, they have four parts.

Element type. After the ATTLIST keyword, the declaration specifies the element to which the list applies. In Example 2-8, the "Database," "Table," and "Column" elements have attribute list declarations.
Attribute name. The rule for each attribute in the list appears on a new line. The first part of the rule is the attribute name. For example, the "Database" element has "dbType" and "address" attributes.
Attribute type. After the attribute name, the type specification of the attribute appears. For character values such as "dbType" and "address," the type specification is CDATA. For enumerated values, such as "dataType," the type specification is a list of the possible values, separated by vertical bars and enclosed in parentheses. Not appearing in Example 2-8 are the ID and IDREF type specifications. ID indicates a character string that must be unique within the document. IDREF indicates a character string that corresponds to the value of an ID attribute within the document. Using ID and IDREF attributes, document authors can create links between elements in the same document.
Default value. After the attribute type, the document designer must specify the default value for the attribute. There are a number of options. #REQUIRED indicates that every document must explicitly assign a value to the attribute. "dbType" is an example of a #REQUIRED attribute. #IMPLIED indicates that a document does not have to assign a value to the attribute; the XML processor will tell the application that no value was assigned. "address" is an example of an #IMPLIED attribute. A value in quotation marks indicates an attribute with a specific default value; if the document does not explicitly assign a value to the attribute, the XML processor will automatically assign the default value. "dataType" has the default value "String." Not appearing in Example 2-8 are attributes with #FIXED default actions. After the #FIXED keyword, there is an attribute value in quotation marks. If a document explicitly assigns a value to such an attribute, it must be the same as the value specified after #FIXED.
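
The four parts of an attribute rule can be sketched with the declarations of Example 2-8 and a conforming document. To cover the options that Example 2-8 leaves out, the sketch adds three hypothetical attributes that are not part of the book's DTD: "schemaVersion" (#FIXED), "tableId" (ID), and "references" (IDREF). All attribute values are sample data.

```xml
<?xml version="1.0"?>
<!DOCTYPE Database [
  <!ELEMENT Database (Table+)>
  <!-- schemaVersion is a hypothetical #FIXED attribute, added for illustration -->
  <!ATTLIST Database
    dbType        CDATA  #REQUIRED
    address       CDATA  #IMPLIED
    schemaVersion CDATA  #FIXED "1.0">
  <!ELEMENT Table (Column+)>
  <!-- tableId is a hypothetical ID attribute -->
  <!ATTLIST Table
    name     CDATA  #REQUIRED
    tableId  ID     #IMPLIED>
  <!ELEMENT Column (#PCDATA)>
  <!-- references is a hypothetical IDREF attribute -->
  <!ATTLIST Column
    dataType    (String | Int | Float | Date | BLOB)  "String"
    references  IDREF  #IMPLIED>
]>
<!-- dbType is required; address is implied and omitted here -->
<Database dbType="relational">
  <Table name="Customers" tableId="t1">
    <!-- No dataType given: the processor supplies the default "String" -->
    <Column>Name</Column>
    <Column dataType="Int">CustomerNumber</Column>
  </Table>
  <Table name="Orders" tableId="t2">
    <Column dataType="Date">OrderDate</Column>
    <!-- IDREF value must match some ID value in the same document -->
    <Column dataType="Int" references="t1">CustomerNumber</Column>
  </Table>
</Database>
```

A validating processor would reject a second "tableId" of "t1," a "references" value with no matching ID, a "dataType" outside the enumerated list, or a "schemaVersion" other than "1.0."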

DTDs give document designers a high degree of control over both the structure of document elements and the rules attributes must follow. This control opens the doors for document designers to create DTDs that apply to vastly different fields, from online catalogs, to database integration tools, to supply chain management systems.

Example DTD

Examples 2-7 and 2-8 demonstrate the syntax for specifying element structure and attribute rules for simple DTDs. Even DTDs for modest applications, such as the order in Example 2-6, can quickly become complex. Illustrating the potential complexity of a DTD for a complete XML application, Example 2-9 consists of a DTD for this order form.

Example 2-9

 <!-- Example Order Form DTD from _XML: A Manager's Guide_ -->
 <!-- Document Structure -->
 <!ELEMENT Order (Addresses, LineItems, Payment)>
 <!ATTLIST Order
   source        (web | phone | retail)    #REQUIRED
   customerType  (consumer | business)     "consumer"
   currency      CDATA                     "USD">
 <!-- Collection of Addresses -->
 <!ELEMENT Addresses (Address+)>
 <!-- Address Structure -->
 <!ELEMENT Address (FirstName, MiddleName?, LastName,
   Street+, City, State, Postal, Country)>
 <!ATTLIST Address
   addType       (bill | ship | billship)  "billship">
 <!ELEMENT FirstName (#PCDATA)>
 <!ELEMENT MiddleName (#PCDATA)>
 <!ELEMENT LastName (#PCDATA)>
 <!ELEMENT Street (#PCDATA)>
 <!ATTLIST Street
   lineOrder     CDATA                     #IMPLIED>
 <!ELEMENT City (#PCDATA)>
 <!ELEMENT State (#PCDATA)>
 <!ELEMENT Postal (#PCDATA)>
 <!ELEMENT Country (#PCDATA)>
 <!-- Collection of LineItems -->
 <!ELEMENT LineItems (LineItem+)>
 <!-- LineItem Structure -->
 <!ELEMENT LineItem (Product, Quantity, UnitPrice)>
 <!ATTLIST LineItem
   ID            ID                        #REQUIRED>
 <!ELEMENT Product (#PCDATA)>
 <!ATTLIST Product
   category      (CDROM | MBoard | RAM)    #REQUIRED>
 <!ELEMENT Quantity (#PCDATA)>
 <!ELEMENT UnitPrice (#PCDATA)>
 <!-- Payment Structure -->
 <!ELEMENT Payment (Card | PO)>
 <!-- Card Structure -->
 <!ELEMENT Card (CardHolder, Number, Expiration)>
 <!ATTLIST Card
   cardType      (VISA | MasterCard | Amex)  #REQUIRED>
 <!ELEMENT CardHolder (#PCDATA)>
 <!-- Number is declared once and shared by Card and PO -->
 <!ELEMENT Number (#PCDATA)>
 <!ELEMENT Expiration (#PCDATA)>
 <!-- PO Structure -->
 <!ELEMENT PO (Number, Authorization*)>
 <!ELEMENT Authorization (#PCDATA)>

The document element for this DTD is "Order," and it has three required subelements: "Addresses," "LineItems," and "Payment." The "Addresses" and "LineItems" elements are simply collections of one or more "Address" and "LineItem" elements, respectively. Therefore, "Addresses" and "LineItems" are not strictly necessary. The "Order" element could contain the collections of "Address" and "LineItem" elements directly. In this case, the first element declaration would be <!ELEMENT Order (Address+, LineItem+, Payment)>. However, at some point in the future, the application may need additional information that applies to a group of "Address" or "LineItem" elements. Also, many developers find it easier to work with documents structured in this way because it's analogous to collection types used in object-oriented programming languages.

To see how wrapping a collection of "LineItem" elements in a "LineItems" element may improve extensibility, suppose that the company develops a pricing strategy that allows sales representatives to apply different discount policies to different blocks of line items. In this case, you would need to allow multiple "LineItems" elements and add a "DiscountPolicy" subelement. The syntax for these changes would be <!ELEMENT Order (Addresses, LineItems+, Payment)> and <!ELEMENT LineItems (DiscountPolicy?, LineItem+)>, respectively. Notice that because the "DiscountPolicy" element is optional, all documents that conform to the original DTD will also conform to the new DTD. Planning for such future enhancements is an important part of DTD design.
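
A rough sketch of the extended structure follows. To stay short, it stubs "Addresses" and "Payment" as EMPTY and omits the attributes of "Order," which the real DTD would keep. The (#PCDATA) declaration for "DiscountPolicy" and the value "volume-10" are assumptions, since the text does not define that element's content.

```xml
<?xml version="1.0"?>
<!DOCTYPE Order [
  <!ELEMENT Order (Addresses, LineItems+, Payment)>
  <!-- Stubs for brevity only; Example 2-9 gives the full structures -->
  <!ELEMENT Addresses EMPTY>
  <!ELEMENT Payment EMPTY>
  <!ELEMENT LineItems (DiscountPolicy?, LineItem+)>
  <!-- Assumed content; the text does not define DiscountPolicy -->
  <!ELEMENT DiscountPolicy (#PCDATA)>
  <!ELEMENT LineItem (Product, Quantity, UnitPrice)>
  <!ATTLIST LineItem ID ID #REQUIRED>
  <!ELEMENT Product (#PCDATA)>
  <!ATTLIST Product category (CDROM | MBoard | RAM) #REQUIRED>
  <!ELEMENT Quantity (#PCDATA)>
  <!ELEMENT UnitPrice (#PCDATA)>
]>
<Order>
  <Addresses/>
  <!-- A block of line items with its own discount policy -->
  <LineItems>
    <DiscountPolicy>volume-10</DiscountPolicy>
    <LineItem ID="item1">
      <Product category="RAM">256 MB memory module</Product>
      <Quantity>20</Quantity>
      <UnitPrice>49.95</UnitPrice>
    </LineItem>
  </LineItems>
  <!-- A block without DiscountPolicy has the same shape as under the original DTD -->
  <LineItems>
    <LineItem ID="item2">
      <Product category="MBoard">Motherboard</Product>
      <Quantity>1</Quantity>
      <UnitPrice>129.00</UnitPrice>
    </LineItem>
  </LineItems>
  <Payment/>
</Order>
```

An order with a single "LineItems" block and no "DiscountPolicy" satisfies both the old and the new content models, which is what makes the change backward compatible.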

Beyond this subtle design point, the rest of the element structure is relatively straightforward. The other interesting features of this structure are in the "Payment" element. This element presents a binary decision between the credit card and purchase order payment methods with the "Card" and "PO" subelements. The "PO" subelement allows for an optional list of "Authorization" elements. This list enables the application to keep track of any employee as well as the customer who authorized the PO. If there is a dispute about payment, maintaining this information could be very important.

The attribute specification reveals the flexibility available to document designers. "currency" and "lineOrder" may both have character data values, but "currency" has a default value, whereas "lineOrder" is optional. "source," "customerType," "addType," "category," and "cardType" are all enumerated types, but the document must explicitly assign an option to "source," "category," and "cardType," whereas "customerType" and "addType" have default values. The required "ID" attribute of ID type for the "LineItem" element enables an application using an order document to identify uniquely any particular "LineItem" element by this attribute.
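
For reference, here is a sketch of an order document written against the DTD in Example 2-9. The file name "order.dtd" is hypothetical, and every name, address, and amount is invented sample data. The sketch assumes the enumerations in the DTD are separated by vertical bars and that "CardHolder" and "Number" are each declared once.

```xml
<?xml version="1.0"?>
<!-- "order.dtd" is a hypothetical file name for the DTD in Example 2-9 -->
<!DOCTYPE Order SYSTEM "order.dtd">
<!-- source is required; customerType and currency fall back to
     their defaults, "consumer" and "USD" -->
<Order source="web">
  <Addresses>
    <!-- addType omitted: the default "billship" applies -->
    <Address>
      <FirstName>Pat</FirstName>
      <LastName>Example</LastName>
      <Street lineOrder="1">100 Main Street</Street>
      <Street lineOrder="2">Suite 200</Street>
      <City>Springfield</City>
      <State>IL</State>
      <Postal>62701</Postal>
      <Country>USA</Country>
    </Address>
  </Addresses>
  <LineItems>
    <LineItem ID="item1">
      <Product category="RAM">256 MB memory module</Product>
      <Quantity>2</Quantity>
      <UnitPrice>49.95</UnitPrice>
    </LineItem>
  </LineItems>
  <Payment>
    <!-- Either Card or PO; this order uses a purchase order -->
    <PO>
      <Number>PO-1001</Number>
      <Authorization>J. Buyer</Authorization>
    </PO>
  </Payment>
</Order>
```

An application reading this document through a validating processor would see "customerType," "currency," and "addType" with their default values even though the document never mentions them.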

Although the DTD in Example 2-9 is somewhat long and modestly complex, it is still relatively simple in comparison with DTDs used in actual applications. You could imagine the necessary complexity of DTDs for documents for financial wire transfers, telecommunications service provisioning, and medical records. Designing DTDs that cover all the possible document cases as well as allowing for future enhancements is one of the crucial steps in deploying effective XML applications.

As organizations have attempted to design DTDs for some of these sophisticated domains, they have found DTDs lacking important features. The inability to specify the datatype of element and attribute values with enough precision to ensure compatibility with programming languages and databases is perhaps the biggest drawback. But the lack of other features, such as finer cardinality constraints on elements, range checking on values, and reusable structures, also creates the need for a more sophisticated alternative. Chapter 3 discusses how XML Schema provides this alternative.
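
The datatype limitation is easy to see in a small sketch that reuses the "LineItem" declarations from Example 2-9, without the attribute list. The values are deliberately nonsensical sample data. Because #PCDATA accepts any character data, the document below is valid, and the receiving application must check the values itself.

```xml
<?xml version="1.0"?>
<!DOCTYPE LineItem [
  <!ELEMENT LineItem (Product, Quantity, UnitPrice)>
  <!ELEMENT Product (#PCDATA)>
  <!ELEMENT Quantity (#PCDATA)>
  <!ELEMENT UnitPrice (#PCDATA)>
]>
<!-- Valid against the DTD, although neither value is a usable number -->
<LineItem>
  <Product>Motherboard</Product>
  <Quantity>several</Quantity>
  <UnitPrice>-5 dollars</UnitPrice>
</LineItem>
```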