Understanding Document Type Definitions

DTDs are useful because they let developers add and change the rules that govern an XML document without necessarily having to change the document or the data in it. This allows for much greater flexibility.

So far, we have learned about attributes and elements but not about DTDs. Let's look at a DTD and learn how to create one. Listing 19.2 shows an example of a DTD.

Listing 19.2 An Example of a DTD

 <?xml version="1.0" encoding="UTF-8"?>
 <!-- ******************************************************** -->
 <!-- VSI Pricelist file Definition                            -->
 <!-- ******************************************************** -->
 <!ELEMENT price-list (price-group*)>
 <!ELEMENT price-group (name, price-element*)>
 <!ELEMENT price-element (product-code, description, license*, list-price)>
 <!ELEMENT license (quantity | unlimited)>
 <!ATTLIST license
       type (Server | Clients | WebServer | FaxServer) "Server"
 >
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT product-code (#PCDATA)>
 <!ELEMENT description (#PCDATA)>
 <!ELEMENT list-price (#PCDATA)>
 <!ELEMENT quantity (#PCDATA)>
 <!ELEMENT unlimited (#PCDATA)>

Now that we have a DTD, let's briefly talk about how to associate DTDs with XML documents. The next sections, "Element Declaration" and "Attribute-List Declaration," dissect what goes into creating a DTD.

A DTD can be associated with an XML document in two ways: it can be declared within the XML document itself (inline), or it can live in a separate file that the document references. Either way, the association is made with a document type declaration, which must start with the string <!DOCTYPE. The next part of the string must be the document name, which must correspond to the document's root element. If you would like to include your DTD within your XML document, you place its declarations between the opening and closing square brackets, like this:

 <?xml version="1.0"?>
 <!DOCTYPE HelloWorld [
   <!ELEMENT HelloWorld (#PCDATA)>
 ]>
 <HelloWorld>
     Hi how are you world
 </HelloWorld>

You can also refer to a DTD from an external source (which is more common) via a uniform resource identifier (URI), as follows:

 <?xml version="1.0"?>
 <!DOCTYPE HelloWorld SYSTEM "HelloWorld.dtd">
 <HelloWorld>
 </HelloWorld>

This DOCTYPE declaration uses the SYSTEM keyword, which states that the DTD associated with the XML document is located at the URI quoted next. That URI can point to a local file, a URL on a web server, or even an FTP site. Another way you can refer to a DTD is via the PUBLIC keyword, as follows:

 <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">

This is the DOCTYPE declaration for XHTML 1.0 Transitional. It has two quoted strings after the PUBLIC keyword: the first is the public identifier, which is the unique name of the DTD, and the second is the URI. Public DTDs follow a specific naming convention. If you're interested in how to construct one of these, take a peek at the XML 1.0 recommendation (www.w3.org/TR/REC-xml).
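To see how a parser reports the pieces of a DOCTYPE declaration, here is a minimal sketch. It assumes Python and its standard xml.dom.minidom module, which is our choice for illustration and not something the examples in this chapter depend on. The two documents are the SYSTEM and PUBLIC examples from the text; the bare <html/> root is added only so that the second one parses. minidom does not validate and does not fetch the external DTD; it only records the identifiers.

```python
from xml.dom import minidom

# The external-DTD example from the text. HelloWorld.dtd is never opened;
# the parser only records the identifier found in the DOCTYPE.
system_doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<!DOCTYPE HelloWorld SYSTEM "HelloWorld.dtd">'
    '<HelloWorld></HelloWorld>'
)

# The XHTML declaration from the text, attached to a minimal root element
# (the root element is an assumption added for this sketch).
public_doc = minidom.parseString(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"DTD/xhtml1-transitional.dtd">'
    '<html/>'
)

for doc in (system_doc, public_doc):
    doctype = doc.doctype
    print("name:     ", doctype.name)
    print("public id:", doctype.publicId)
    print("system id:", doctype.systemId)
    # The document name in the DOCTYPE should match the root element.
    print("matches root:", doctype.name == doc.documentElement.tagName)
```

Because this parser does not validate, a mismatch between the DOCTYPE name and the root element would not raise an error here; the comparison on the last line is only a manual check.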

Most of a DTD is made up of element declarations, which are discussed in the next section.

Element Declaration

As you probably noticed in Listing 19.2, most of the statements in a DTD are element declarations in this form:

 <!ELEMENT elementname content-spec>

Element declarations can have one of four different element content types, as shown in Table 19.3.

Table 19.3. Element Declaration Types
Content Type Specification	Content
Empty content	No content, an empty element
Character content	Only data or text within an element
Any content/mixed content	Can have character data, other elements, or both
Element content	Can only have subelements as specified in the element content specification

Let's try to make this a little clearer by explaining each content model specification in the following sections.

Character Data Content

One of the unique things about XML is that it specifies that all character data must be in Unicode, a superset of ASCII characters that we all know and love. Apart from the benefits this brings to cultures that use a non-Roman alphabet (Chinese, Japanese, Inuit, and so on), it forces us to call "text" something else: parsed character data (PCDATA) to be specific.

When we want to declare that an element contains nothing but text (whether digits, letters, or any other characters), we can say something like this:

 <!ELEMENT item (#PCDATA)>

Examples of our newly updated item element would be:

 <item>This is an item.</item>
 <item>1234</item>

Any (or Mixed) Content

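When the ANY content identifier is used, it's typically called mixed content because this is the catchall definition for an element that can contain anything: either text (PCDATA) or other elements. Strictly speaking, the XML recommendation reserves the term mixed content for declarations of the form (#PCDATA | footnote)*, which name the elements allowed to appear among the text, whereas ANY permits any declared element.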

This type of unstructured content definition is most commonly used for complex hierarchies of elements and text or for something as simple as, say, marking up a technical book you might find on the shelves:

 <!ELEMENT para ANY>

 <para>This is a <footnote num="1">see reference</footnote> mixed content paragraph</para>

Empty Content

Empty content is the model for empty elements:

 <!ELEMENT item EMPTY>

For example:

 <item sku="1234" />

This declares that the item element is EMPTY. Although there's an attribute (sku="1234"), there's no content within the element; it's equivalent to <item sku="1234"></item>.
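The three content types covered so far can also be inspected programmatically. The following is a minimal sketch, assuming Python's standard xml.parsers.expat module (our choice for illustration), which calls a handler for every element declaration it finds in an inline DTD. The element name marker is hypothetical; it stands in for the empty item element so that each name is declared only once.

```python
from xml.parsers import expat
from xml.parsers.expat import model

# One declaration per content type discussed above.
# "marker" is a hypothetical name used only in this sketch.
DOC = """<!DOCTYPE para [
  <!ELEMENT item (#PCDATA)>
  <!ELEMENT para ANY>
  <!ELEMENT marker EMPTY>
]>
<para>text <item>1234</item> <marker/></para>"""

TYPE_NAMES = {
    model.XML_CTYPE_EMPTY: "empty content",
    model.XML_CTYPE_ANY: "any content",
    model.XML_CTYPE_MIXED: "character (or mixed) content",
}

declared = {}

def on_element_decl(name, content_model):
    # content_model is a tuple: (type, quantifier, name, children).
    declared[name] = content_model[0]

parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOC, True)

for name, content_type in declared.items():
    print(name, "->", TYPE_NAMES.get(content_type, "element content"))
```

Note that expat reports (#PCDATA) with the same type code it uses for mixed content, which is why the sketch labels the two together.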

Element Content

The following example states that the items_ordered element must have at least one subelement item that can be repeated (signified by the plus sign):

 <!ELEMENT items_ordered (item)+>

The parentheses are used to form what is called a content particle. A content particle can list several elements, and content particles can be nested within each other. A particle that lists several elements looks like this:

 <!ELEMENT billto (company,contact,street,city,state_province,zipcode,country)>

This example says that the billto element must have a company element followed by a contact element, and so on. The commas are used to specify a sequence that we would like to enforce, namely that company must be followed by contact, which is followed by street, and so on. This type of construct is called a sequence content particle.

Furthermore, you can make specific elements or content particles optional or repeatable, as in the items_ordered example. An element or particle with no indicator must occur exactly once. To change that, you use what are called occurrence indicators, which are listed in Table 19.4.

Table 19.4. Occurrence Indicators
Indicator	Element or Content Particle Can Occur…
`?`	Zero or one time (optional)
`*`	Zero or more times (optional and repeatable)
`+`	One or more times (required and repeatable)
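One way to get a feel for sequences and occurrence indicators is to notice that a content model reads much like a regular expression over child element names. The sketch below, written in Python as an illustration, translates an element-content model into a regex and checks a list of child names against it. The function names model_to_regex and children_match are our own, and the sketch handles only element content (names, commas, |, parentheses, and the three indicators); it is not a validating parser.

```python
import re

def model_to_regex(content_model):
    """Translate a DTD element-content model into a regex over child names."""
    pattern = re.sub(r"\s+", "", content_model)
    # Wrap each name (plus a separator space) in a group so that
    # ?, * and + apply to the whole name rather than its last letter.
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        pattern,
    )
    # A comma means "followed by", which in a regex is plain concatenation.
    return re.compile(pattern.replace(",", ""))

def children_match(content_model, children):
    """Return True if the ordered child names satisfy the content model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(content_model).fullmatch(sequence) is not None

BILLTO = "(company,contact,street,city,state_province,zipcode,country)"
IN_ORDER = ["company", "contact", "street", "city",
            "state_province", "zipcode", "country"]
SWAPPED = ["contact", "company"] + IN_ORDER[2:]

print(children_match("(item)+", ["item", "item"]))       # True
print(children_match("(item)+", []))                     # False
print(children_match(BILLTO, IN_ORDER))                  # True
print(children_match(BILLTO, SWAPPED))                   # False
print(children_match("(name, price-element*)", ["name"]))  # True
```

The last call uses the price-group model from Listing 19.2: because price-element carries the * indicator, a group with only a name is still acceptable.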

Additionally, you can combine grouping, the choice symbol (|), and occurrence indicators to describe the combinations in which elements can occur:

 <!ELEMENT poem (stanza+|line)>
 <!ELEMENT stanza (line+|(copyright,date))>
 <!ELEMENT line (#PCDATA)>
 <!ELEMENT copyright (#PCDATA)>
 <!ELEMENT date (#PCDATA)>

This DTD describes a bare-bones poem:

The <poem> element can consist of either one or more <stanza> elements or a single <line> element.
The <stanza> element must consist of either one or more <line> elements or a <copyright> element followed by a <date> element.
The <line>, <copyright>, and <date> elements just contain text.

A poem converted into XML might look like the following:

 <poem>
     <stanza>
         <line>Roses are red, Violets are blue</line>
         <line>I can write XML, and so can you!</line>
     </stanza>
     <stanza>
         <copyright>NewRiders</copyright>
         <date>2002</date>
     </stanza>
 </poem>

Use that one on Valentine's Day!
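To check that the poem really follows the rules described above, here is a small sketch in Python (used here only for illustration). It parses the poem with the standard xml.etree.ElementTree module and compares each element's children against a hand-translated regular expression for the poem and stanza content models. The MODELS table and the check function are our own constructions, not part of any library, and the BROKEN document is a made-up counterexample.

```python
import re
import xml.etree.ElementTree as ET

POEM = """<poem>
    <stanza>
        <line>Roses are red, Violets are blue</line>
        <line>I can write XML, and so can you!</line>
    </stanza>
    <stanza>
        <copyright>NewRiders</copyright>
        <date>2002</date>
    </stanza>
</poem>"""

# A made-up counterexample: date and copyright are in the wrong order.
BROKEN = """<poem><stanza>
    <date>2002</date><copyright>NewRiders</copyright>
</stanza></poem>"""

# Hand translations of the two content models; each child name is
# followed by a space so that names cannot run together.
MODELS = {
    "poem": r"(?:(?:stanza )+|line )",               # (stanza+|line)
    "stanza": r"(?:(?:line )+|copyright date )",     # (line+|(copyright,date))
}

def check(root):
    """Return the tags of elements whose children break their content model."""
    bad = []
    for element in root.iter():
        pattern = MODELS.get(element.tag)
        if pattern is None:
            continue  # text-only elements are not checked in this sketch
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(pattern, children):
            bad.append(element.tag)
    return bad

print(check(ET.fromstring(POEM)))    # []
print(check(ET.fromstring(BROKEN)))  # ['stanza']
```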

Attribute-List Declaration

We have already covered element declarations, so now let's look at attribute-list declarations. Attributes probably look much more familiar to you as a ColdFusion developer in that they always have a name-value pairing such as in HTML:

 <a href="http://www.w3c.org">

In a DTD, an attribute-list declaration always begins with the string <!ATTLIST, which is followed by the element name to which the attributes belong. After the element name, you can add one or more attribute declarations. Attribute declarations have three parts: the attribute name, its type, and the default declaration. The general form for an attribute declaration is as follows:

 <!ATTLIST elemName attName attType default-decl>

Or, as a concrete example for an item element:

 <!ATTLIST item
     sku CDATA #REQUIRED
     qty CDATA #REQUIRED
     description CDATA #IMPLIED
     price CDATA #IMPLIED>

Let's step through this example so that you can see how straightforward this really is. This attribute-list declaration says that we're talking about the attributes of the item element. The item element has four attributes: sku, qty, description, and price. All four attributes are of the string type, CDATA, but an attribute can also be one of two other kinds: a tokenized type (such as ID, IDREF, or NMTOKEN) or an enumerated type, like the type attribute of license in Listing 19.2.

Each of the attributes also has a default declaration that takes one of four forms: #REQUIRED, #IMPLIED, #FIXED, or a default value. The #REQUIRED declaration means that the attribute must be present, and the #IMPLIED declaration makes the attribute optional with no default. The #FIXED declaration is followed by a quoted value that the attribute always has; if a document supplies the attribute, it must use exactly that value. Finally, the last option is a quoted value on its own (there is no VALUE keyword), which defines a default value that will always be used unless the document overrides it, as with "Server" for the type attribute in Listing 19.2.
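Default declarations are easy to observe in practice, because even a non-validating parser applies defaults that are declared in an inline DTD. The sketch below assumes Python's standard xml.etree.ElementTree module (our choice for illustration). It reuses the type attribute of license from Listing 19.2, with a simplified content model for license; the units and currency attributes are hypothetical additions that show #IMPLIED and #FIXED.

```python
import xml.etree.ElementTree as ET

# "type" comes from Listing 19.2; "units" and "currency" are hypothetical
# attributes added for this sketch, and license is simplified to text content.
DOC = """<!DOCTYPE price-element [
  <!ELEMENT price-element (license*)>
  <!ELEMENT license (#PCDATA)>
  <!ATTLIST license
      type (Server | Clients | WebServer | FaxServer) "Server"
      units CDATA #IMPLIED
      currency CDATA #FIXED "USD">
]>
<price-element>
  <license>5</license>
  <license type="Clients" units="seats">25</license>
</price-element>"""

root = ET.fromstring(DOC)
first, second = root.findall("license")

# Default value: supplied by the parser when the document omits the attribute.
print(first.get("type"))      # Server
print(second.get("type"))     # Clients

# #IMPLIED: optional, and no value appears when it is omitted.
print(first.get("units"))     # None
print(second.get("units"))    # seats

# #FIXED: every license element carries the declared value.
print(first.get("currency"))  # USD
```

Because this parser does not validate, it would not complain about a missing #REQUIRED attribute or a value outside the enumerated list; a validating parser is needed to enforce those rules.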

One of the most common complaints with the XML recommendation is the syntax for creating DTDs. For example, in an element declaration we talk about text as PCDATA, whereas in an attribute declaration it's called CDATA. Oddities like this are what sparked the XML Schema language, a rewrite of DTDs plus data typing in an XML element syntax.