Chapter 3. Document Type Definitions (DTDs) | XML in a Nutshell, 2nd Edition

CONTENTS

3.1 Validation
3.2 Element Declarations
3.3 Attribute Declarations
3.4 General Entity Declarations
3.5 External Parsed General Entities
3.6 External Unparsed Entities and Notations
3.7 Parameter Entities
3.8 Conditional Inclusion
3.9 Two DTD Examples
3.10 Locating Standard DTDs

While XML is extremely flexible, not all the programs that read particular XML documents are so flexible. Many programs can work with only some XML applications but not others. For example, Adobe Illustrator 10 can read and write Scalable Vector Graphics (SVG) files, but you wouldn't expect it to understand a Platform for Privacy Preferences (P3P) document. And within a particular XML application, it's often important to ensure that a given document indeed adheres to the rules of that XML application. For instance, in XHTML, li elements should only be children of ul or ol elements. Browsers may not know what to do with them, or may act inconsistently, if li elements appear in the middle of a blockquote or p element.

The solution to this dilemma is a document type definition (DTD). DTDs are written in a formal syntax that explains precisely which elements and entities may appear where in the document and what the elements' contents and attributes are. A DTD can make statements such as "A ul element only contains li elements" or "Every employee element must have a social_security_number attribute." Different XML applications can use different DTDs to specify what they do and do not allow.

A validating parser compares a document to its DTD and lists any places where the document differs from the constraints specified in the DTD.^[1] The program can then decide what it wants to do about any violations. Some programs may reject the document. Others may try to fix the document or reject just the invalid element. Validation is an optional step in processing XML. A validity error is not necessarily a fatal error like a well-formedness error, though some applications may choose to treat it as one.

3.1 Validation

A valid document includes a document type declaration that identifies the DTD the document satisfies. The DTD lists all the elements, attributes, and entities the document uses and the contexts in which it uses them. The DTD may list items the document does not use as well. Validity operates on the principle that everything not permitted is forbidden. Everything in the document must match a declaration in the DTD. If a document has a document type declaration and the document satisfies the DTD that the document type declaration indicates, then the document is said to be valid. If it does not, it is said to be invalid.

There are many things the DTD does not say. In particular, it does not say the following:

What the root element of the document is
How many of instances of each kind of element appear in the document
What the character data inside the elements looks like
The semantic meaning of an element; for instance, whether it contains a date or a person's name

DTDs allow you to place some constraints on the form an XML document takes, but there can be quite a bit of flexibility within those limits. A DTD never says anything about the length, structure, meaning, allowed values, or other aspects of the text content of an element.

Validity is optional. A parser reading an XML document may or may not check for validity. If it does check for validity, the program receiving data from the parser may or may not care about validity errors. In some cases, such as feeding records into a database, a validity error may be quite serious, indicating that a required field is missing, for example. In other cases, rendering a web page perhaps, a validity error may not be so important, and you can work around it. Well-formedness is required of all XML documents; validity is not. Your documents and your programs can use it or not as you find needful.

3.1.1 A Simple DTD Example

Recall Example 2-2 from the last chapter; this described a person. The person had a name and three professions. The name had a first name and a last name. The particular person described in that example was Alan Turing. However, that's not relevant for DTDs. A DTD only describes the general type, not the specific instance. A DTD for person documents would say that a person element contains one name child element and zero or more profession child elements. It would further say that each name element contains a first_name child element and a last_name child element. Finally it would state that the first_name, last_name, and profession elements all contain text. Example 3-1 is a DTD that describes such a person element.

Example 3-1. A DTD for the person

<!ELEMENT person     (name, profession*)> <!ELEMENT name       (first_name, last_name)> <!ELEMENT first_name (#PCDATA)> <!ELEMENT last_name  (#PCDATA)> <!ELEMENT profession (#PCDATA)>

This DTD would probably be stored in a separate file from the documents it describes. This allows it to be easily referenced from multiple XML documents. However, it can be included inside the XML document if that's convenient, using the document type declaration we discuss later in this section. If it is stored in a separate file, then that file would most likely be named person.dtd, or something similar. The .dtd extension is fairly standard though not specifically required by the XML specification. If this file were served by a web server, it would be given the MIME media type application/xml-dtd.

Each line of Example 3-1 is an element declaration. The first line declares the person element; the second line declares the name element; the third line declares the first_name element; and so on. However, the line breaks aren't relevant except for legibility. Although it's customary to put only one declaration on each line, it's not required. Long declarations can even span multiple lines.

The first element declaration in Example 3-1 states that each person element must contain exactly one name child element followed by zero or more profession elements. The asterisk after profession stands for "zero or more." Thus, every person must have a name and may or may not have a profession or multiple professions. However, the name must come before all professions. For example, this person element is valid:

<person>   <name>     <first_name>Alan</first_name>     <last_name>Turing</last_name>   </name> </person>

However, this person element is not valid because it omits the name:

<person>   <profession>computer scientist</profession>   <profession>mathematician</profession>   <profession>cryptographer</profession> </person>

This person element is not valid because a profession element comes before the name:

<person>   <profession>computer scientist</profession>   <name>     <first_name>Alan</first_name>     <last_name>Turing</last_name>   </name>   <profession>mathematician</profession>   <profession>cryptographer</profession> </person>

The person element may not contain any element except those listed in its declaration. The only extra character data it can contain is whitespace. For example, this is an invalid person element because it adds a publication element:

<person>   <name>     <first_name>Alan</first_name>     <last_name>Turing</last_name>   </name>   <profession>mathematician</profession>   <profession>cryptographer</profession>   <publication>On Computable Numbers...</publication> </person>

This is an invalid person element because it adds some text outside the allowed children:

<person>   <name>     <first_name>Alan</first_name>     <last_name>Turing</last_name>   </name>   was a <profession>computer scientist</profession>,   a <profession>mathematician</profession>, and a   <profession>cryptographer</profession> </person>

In all these examples of invalid elements, you could change the DTD to make these elements valid. All the examples are well-formed, after all. However, with the DTD in Example 3-1, they are not valid.

The name declaration says that each name element must contain exactly one first_name element followed by exactly one last_name element. All other variations are forbidden.

The remaining three declarations first_name, last_name, and profession all say that their elements must contain #PCDATA. This is a DTD keyword standing for parsed character data that is, raw text possibly containing entity references such as & and <, but not containing any tags or child elements.

Example 3-1 placed the most complicated and highest-level declaration at the top. However, that's not required. For instance, Example 3-2 is an equivalent DTD that simply reorders the declarations. DTDs allow forward, backward, and circular references to other declarations.

Example 3-2. An alternate DTD for the person element

<!ELEMENT first_name (#PCDATA)> <!ELEMENT last_name  (#PCDATA)> <!ELEMENT profession (#PCDATA)> <!ELEMENT name       (first_name, last_name)> <!ELEMENT person     (name, profession*)>

3.1.2 The Document Type Declaration

A valid document includes a reference to the DTD to which it should be compared. This is given in the document's single document type declaration. A document type declaration looks like this:

<!DOCTYPE person SYSTEM "http://www.cafeconleche.org/dtds/person.dtd">

This says that the root element of the document is person and that the DTD for this document can be found at the URI http://www.cafeconleche.org/dtds/person.dtd.

URI stands for Uniform Resource Identifier. URIs are a superset of URLs . They include not only URLs but also Uniform Resource Names (URNs). A URN allows you to identify a resource such as the DTD for SVGs irrespective of its location. Indeed, the resource might exist at multiple locations, all equally authoritative. In practice, the only URIs in wide use today are URLs.

The document type declaration is included in the prolog of the XML document after the XML declaration but before the root element. (The prolog is everything in the XML document before the root element start-tag.) Example 3-3 demonstrates.

Example 3-3. A valid person document

<?xml version="1.0" standalone="no"?> <!DOCTYPE person SYSTEM "http://www.cafeconleche.org/dtds/person.dtd"> <person>   <name>     <first_name>Alan</first_name>     <last_name>Turing</last_name>   </name>   <profession>computer scientist</profession>   <profession>mathematician</profession>   <profession>cryptographer</profession> </person>

If the document resides at the same base site as the DTD, you can use a relative URL instead of the absolute form. For example:

<!DOCTYPE person SYSTEM "/dtds/person.dtd">

You can even use just the filename if the DTD is in the same directory as the document:

<!DOCTYPE person SYSTEM "person.dtd">

3.1.2.1 Public IDs

Standard DTDs may actually be stored at multiple URLs. For example, if you're drawing an SVG picture on your laptop at the beach, you probably want to validate your drawing without opening a network connection to the W3C's web site where the official SVG DTD resides. Such DTDs may be associated with public IDs. The name of the public ID uniquely identifies the XML application in use. At the same time, a backup URI is also included in case the validator does not recognize the public ID. To indicate that you're specifying a public ID, use the keyword PUBLIC in place of SYSTEM. For example, this document type declaration refers to the Rich Site Summary (RSS) DTD standardized by Netscape:

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"               "http://my.netscape.com/publish/formats/rss-0.91.dtd">

A local catalog server can be used to convert the public IDs into the most appropriate URLs for the local environment. The catalogs themselves can be written in XML, specifically, the OASIS XML catalog format (http://www.oasis-open.org/committees/entity/spec.html). In practice, however, PUBLIC IDs aren't used very much. Almost all validators rely on the URI to actually validate the document.

3.1.3 Internal DTD Subsets

When you're first developing a DTD, it's often useful to keep the DTD and the canonical example document in the same file so you can modify and check them simultaneously. Therefore, the document type declaration may actually contain the DTD between square brackets rather than referencing it at an external URI. Example 3-4 demonstrates.

Example 3-4. A valid person document with an internal DTD

<?xml version="1.0"?> <!DOCTYPE person [   <!ELEMENT first_name (#PCDATA)>   <!ELEMENT last_name  (#PCDATA)>   <!ELEMENT profession (#PCDATA)>   <!ELEMENT name       (first_name, last_name)>   <!ELEMENT person     (name, profession*)> ]> <person>   <name>     <first_name>Alan</first_name>     <last_name>Turing</last_name>   </name>   <profession>computer scientist</profession>   <profession>mathematician</profession>   <profession>cryptographer</profession> </person>

Some document type declarations contain some declarations directly but link in others using a SYSTEM or PUBLIC identifier. For example, this document type declaration declares the profession and person elements itself but relies on the file name.dtd to contain the declaration of the name element:

<!DOCTYPE person SYSTEM "name.dtd" [   <!ELEMENT profession (#PCDATA)>   <!ELEMENT person (name, profession*)> ]>

The part of the DTD between the brackets is called the internal DTD subset. All the parts that come from outside this document are called the external DTD subset. Together they make up the complete DTD. As a general rule, the two different subsets must be compatible. Neither can override the element declarations the other makes. For example, if name.dtd also declared the person element, then there would be a problem. However, entity declarations can be overridden with some important consequences for DTD structure and design, which we'll see shortly when we discuss entities.

When you use an external DTD subset, you should give the standalone attribute of the XML declaration the value no. For example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Actually, the XML specification includes four very detailed rules about exactly when the presence of an external DTD subset does and does not require the standalone attribute to have the value no. However, the net effect of these rules is that almost all XML documents that use external DTD subsets require standalone to have the value no. Since setting standalone to no is always permitted even when it's not required, it's simply not worth worrying about the uncommon cases.

A validating processor is required to read the external DTD subset. A nonvalidating processor may do so, but is not required to, even if standalone has the value no. This means that if the external subset makes declarations that have consequences for the content of a document (for instance, providing default values for attributes) then the content of the document depends on which parser you're using and how it's configured. This has led to no end of confusion. Although some of the earliest XML parsers did not resolve external entities, most of the parsers still being used can do so and generally will do so. You should read the external DTD subset unless efficiency is a major concern or you're very familiar with the structure of the documents you're parsing.

3.1.4 Validating a Document

As a general rule, web browsers do not validate documents but only check them for well-formedness. If you're writing your own programs to process XML, you can use the parser's API to validate documents. If you're writing documents by hand and you want to validate them, you can either use one of the online validators or run a local program to validate the document.

The online validators are probably the easiest way to validate your documents. There are two of note:

The Brown University Scholarly Technology Group's XML Validation Form at http://www.stg.brown.edu/service/xmlvalid/
Richard Tobin's XML well-formedness checker and validator at http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html

First, you have to place the document and associated DTDs on a publicly accessible web server. Next, load one of the previous URLs in a browser, and type the URL of the document you're checking into the online form. The validating server will retrieve your document and tell you what, if any, errors it found. Figure 3-1 shows the results of using the Brown validator on a simple invalid but well-formed document.

Figure 3-1. Validity errors detected by the Brown University online validator

figs/xian2_0301.gif

Most XML parser class libraries include a simple program to validate documents you can use if you're comfortable installing and using command-line programs. In Xerces 1.x, that program is sax.SAXCount. (Xerces 2.x uses sax.Counter instead.) Use the -v flag to turn on validation. (By default, SAXCount only checks for well-formedness.) Then pass the URLs or filenames of the documents you wish to validate to the SAXCount program on the command line like this:

C:\>java sax.SAXCount -v invalid_fibonacci.xml [Error] invalid_fibonacci.xml:8:10: Element type "title" must be declared. [Error] invalid_fibonacci.xml:110:22: The content of element type  "Fibonacci_Numbers" must match "(fibonacci)*". fibonacci.xml: 541 ms (103 elems, 101 attrs, 307 spaces, 1089 chars)

You can see from this output that the document invalid_fibonacci.xml has two validity errors that need to be fixed: the first in line 8 and the second in line 110.

There are also some simple GUI programs for validating XML documents, including the Topologi Schematron Validator for Windows (http://www.topologi.com) shown in Figure 3-2. Despite the name, this product can actually validate documents against schemas written in multiple languages, including DTDs and the W3C XML Schema Language, as well as Schematron.

Figure 3-2. Validity errors detected by the Topologi Schematron Validator

figs/xian2_0302.gif

3.2 Element Declarations

Every element used in a valid document must be declared in the document's DTD with an element declaration. Element declarations have this basic form:

<!ELEMENT element_name content_specification>

The name of the element can be any legal XML name. The content specification specifies what children the element may or must have in what order. Content specifications can be quite complex. They can say, for example, that an element must have three child elements of a given type, or two children of one type followed by another element of a second type, or any elements chosen from seven different types interspersed with text.

3.2.1 #PCDATA

About the simplest content specification is one that says an element may only contain parsed character data, but may not contain any child elements of any type. In this case the content specification consists of the keyword #PCDATA inside parentheses. For example, this declaration says that a phone_number element may contain text, but may not contain elements:

<!ELEMENT phone_number (#PCDATA)>

3.2.2 Child Elements

Another simple content specification is one that says the element must have exactly one child of a given type. In this case, the content specification simply consists of the name of the child element inside parentheses. For example, this declaration says that a fax element must contain exactly one phone_number element:

<!ELEMENT fax (phone_number)>

A fax element may not contain anything else except the phone_number element, and it may not contain more or less than one of those.

3.2.3 Sequences

In practice, however, a content specification that lists exactly one child element is rare. Most elements contain either parsed character data or (at least potentially) multiple child elements. The simplest way to indicate multiple child elements is to separate them with commas. This is called a sequence. It indicates that the named elements must appear in the specified order. For example, this element declaration says that a name element must contain exactly one first_name child element followed by exactly one last_name child element:

<!ELEMENT name (first_name, last_name)>

Given this declaration, this name element is valid:

<name>   <first_name>Madonna</first_name>   <last_name>Ciconne</last_name> </name>

However, this one is not valid because it flips the order of two elements:

<name>   <last_name>Ciconne</last_name>   <first_name>Madonna</first_name> </name>

This element is invalid because it omits the last_name element:

<name>   <first_name>Madonna</first_name> </name>

This one is invalid because it adds a middle_name element:

<name>   <first_name>Madonna</first_name>   <middle_name>Louise</middle_name>   <last_name>Ciconne</last_name> </name>

3.2.4 The Number of Children

As the previous examples indicate, not all instances of a given element necessarily have exactly the same children. You can affix one of three suffixes to an element name in a content specification to indicate how many of that element are expected at that position. These suffixes are:

?: Zero or one of the element is allowed.
*: Zero or more of the element is allowed.
+: One or more of the element is required.

For example, this declaration says that a name element must contain a first_name, may or may not contain a middle_name, and may or may not contain a last_name:

<!ELEMENT name (first_name, middle_name?, last_name?)>

Given this declaration, all these name elements are valid:

<name>   <first_name>Madonna</first_name>   <last_name>Ciconne</last_name> </name> <name>   <first_name>Madonna</first_name>   <middle_name>Louise</middle_name>   <last_name>Ciconne</last_name> </name> <name>   <first_name>Madonna</first_name> </name>

However, these are not valid:

<name>   <first_name>George</first_name>   <!-- only one middle name is allowed -->   <middle_name>Herbert</middle_name>   <middle_name>Walker</middle_name>   <last_name>Bush</last_name> </name> <name>   <!-- first name must precede last name -->   <last_name>Ciconne</last_name>   <first_name>Madonna</first_name> </name>

You can allow for multiple middle names by placing an asterisk after the middle_name:

<!ELEMENT name (first_name, middle_name*, last_name?)>

If you wanted to require a middle_name to be included, but still allow for multiple middle names, you'd use a plus sign instead, like this:

<!ELEMENT name (first_name, middle_name+, last_name?)>

3.2.5 Choices

Sometimes one instance of an element may contain one kind of child, and another instance may contain a different child. This can be indicated with a choice. A choice is a list of element names separated by vertical bars. For example, this declaration says that a methodResponse element contains either a params child or a fault child:

<!ELEMENT methodResponse (params | fault)>

However, it cannot contain both at once. Each methodResponse element must contain one or the other.

Choices can be extended to an indefinite number of possible elements. For example, this declaration says that each digit element can contain exactly one of the child elements named zero, one, two, three, four, five, six, seven, eight, or nine:

<!ELEMENT digit  (zero | one | two | three | four | five | six | seven | eight | nine) >

3.2.6 Parentheses

Individually, choices, sequences, and suffixes are fairly limited. However, they can be combined in arbitrarily complex fashions to describe most reasonable content models. Either a choice or a sequence can be enclosed in parentheses. When so enclosed, the choice or sequence can be suffixed with a ?, *, or +. Furthermore, the parenthesized item can be nested inside other choices or sequences.

For example, let's suppose you want to say that a circle element contains a center element and either a radius or a diameter element, but not both. This declaration does that:

<!ELEMENT circle (center, (radius | diameter))>

To continue with a geometry example, suppose a center element can either be defined in terms of Cartesian or polar coordinates. Then each center contains either an x and a y or an r and a . We would declare this using two small sequences, each of which is parenthesized and combined in a choice:

<!ELEMENT center ((x, y) | (r, ))>

Suppose you don't really care whether the x element comes before the y element or vice versa, nor do you care whether r comes before . Then you can expand the choice to cover all four possibilities:

<!ELEMENT center ((x, y) | (y, x) | (r, ) | (, r) )>

As the number of elements in the sequence grows, the number of permutations grows more than exponentially. Thus, this technique really isn't practical past two or three child elements. DTDs are not very good at saying you want n instances of A and m instances of B, but you don't really care which order they come in.

Suffixes can be applied to parenthesized elements too. For instance, let's suppose that a polygon is defined by individual coordinates for each vertex, given in order. For example, this is a right triangle:

figs/p39a.gif

What we want to say is that a polygon is composed of three or more pairs of x-y or r- coordinates. An x is always followed by a y, and an r is always followed by a . This declaration does that:

The plus sign is applied to ((x, y) | (r, )).

To return to the name example, suppose you want to say that a name can contain just a first name, just a last name, or a first name and a last name with an indefinite number of middle names. This declaration achieves that:

<!ELEMENT name (last_name                | (first_name, ( (middle_name+, last_name) | (last_name?) )                ) >

3.2.7 Mixed Content

In narrative documents it's common for a single element to contain both child elements and un-marked up, nonwhitespace character data. For example, recall this definition element from Chapter 2:

<definition>The <term>Turing Machine</term> is an abstract finite  state automaton with infinite memory that can be proven equivalent  to any any other finite state automaton with arbitrarily large memory.  Thus what is true for a Turing machine is true for all equivalent  machines no matter how implemented. </definition>

The definition element contains some nonwhitespace text and a term child. This is called mixed content. An element that contains mixed content is declared like this:

<!ELEMENT definition (#PCDATA | term)*>

This says that a definition element may contain parsed character data and term children. It does not specify in which order they appear, nor how many instances of each appear. This declaration allows a definition to have one term child, no term children, or twenty-three term children.

You can add any number of other child elements to the list of mixed content, though #PCDATA must always be the first child in the list. For example, this declaration says that a paragraph element may contain any number of name, profession, footnote, emphasize, and date elements in any order, interspersed with parsed character data:

<!ELEMENT paragraph   (#PCDATA | name | profession | footnote | emphasize | date )* >

This is the only way to indicate that an element contains mixed content. You cannot say, for example, that there must be exactly one term child of the definition element, as well as parsed character data. You cannot say that the parsed character data must all come after the term child. You cannot use parentheses around a mixed-content declaration to make it part of a larger grouping. You can only say that the element contains any number of any elements from a particular list in any order, as well as undifferentiated parsed character data.

3.2.8 Empty Elements

Some elements do not have any content at all. These are called empty elements and are sometimes written with a closing />. For example:

<image source="bus.jpg" width="152" height="345"        alt="Alan Turing standing in front of a bus" />

These elements are declared by using the keyword EMPTY for the content specification. For example:

<!ELEMENT image EMPTY>

This merely says that the image element must be empty, not that it must be written with an empty-element tag. Given this declaration, this is also a valid image element:

<image source="bus.jpg" width="152" height="345"        alt="Alan Turing standing in front of a bus"></image>

If an element is empty, then it can contain nothing, not even whitespace. For instance, this is an invalid image element:

<image source="bus.jpg" width="152" height="345"        alt="Alan Turing standing in front of a bus"> </image>

3.2.9 ANY

Very loose DTDs occasionally want to say that an element exists without making any assertions about what it may or may not contain. In this case you can specify the keyword ANY as the content specification. For example, this declaration says that a page element can contain any content including mixed content, child elements, and even other page elements:

<!ELEMENT page ANY>

The children that actually appear in the page elements' content in the document must still be declared in element declarations of their own. ANY does not allow you to use undeclared elements.

ANY is sometimes useful when you're just beginning to design the DTD and document structure and you don't yet have a clear picture of how everything fits together. However, it's extremely bad form to use ANY in finished DTDs. About the only time you'll see it used is when external DTD subsets and entities may change in uncontrollable ways. However, this is actually quite rare. You'd really only need this if you were writing a DTD for an application like XSLT or RDF that wraps content from arbitrary, unknown XML applications.

3.3 Attribute Declarations

As well as declaring its elements, a valid document must declare all the elements' attributes. This is done with ATTLIST declarations. A single ATTLIST can declare multiple attributes for a single element type. However, if the same attribute is repeated on multiple elements, then it must be declared separately for each element where it appears. (Later in this chapter you'll see how to use parameter entity references to make this repetition less burdensome.)

For example, this ATTLIST declaration declares the source attribute of the image element:

<!ATTLIST image source CDATA #REQUIRED>

It says that the image element has an attribute named source. The value of the source attribute is character data, and instances of the image element in the document are required to provide a value for the source attribute.

A single ATTLIST declaration can declare multiple attributes for the same element. For example, this ATTLIST declaration not only declares the source attribute of the image element, but also the width, height, and alt attributes:

<!ATTLIST image source CDATA #REQUIRED                 width  CDATA #REQUIRED                 height CDATA #REQUIRED                 alt    CDATA #IMPLIED >

This declaration says the source, width, and height attributes are required. However, the alt attribute is optional and may be omitted from particular image elements. All four attributes are declared to contain character data, the most generic attribute type.

This declaration has the same effect and meaning as four separate ATTLIST declarations, one for each attribute. Whether to use one ATTLIST declaration per attribute is a matter of personal preference, but most experienced DTD designers prefer the multiple-attribute form. Given judicious application of whitespace, it's no less legible than the alternative.

3.3.1 Attribute Types

In merely well-formed XML, attribute values can be any string of text. The only restrictions are that any occurrences of < or & must be escaped as < and & and whichever kind of quotation mark, single or double, is used to delimit the value must also be escaped. However, a DTD allows you to make somewhat stronger statements about the content of an attribute value. Indeed, these are stronger statements than can be made about the contents of an element. For instance, you can say that an attribute value must be unique within the document, that it must be a legal XML name token, or that it must be chosen from a fixed list of values.

There are ten attribute types in XML. They are:

CDATA
NMTOKEN
NMTOKENS
Enumeration
ENTITY
ENTITIES
ID
IDREF
IDREFS
NOTATION

These are the only attribute types allowed. A DTD cannot say that an attribute value must be an integer or a date between 1966 and 2002, for example.

3.3.1.1 CDATA

A CDATA attribute value can contain any string of text acceptable in a well-formed XML attribute value. This is the most general attribute type. For example, you would use this type for an alt attribute of an image element because there's no particular form the text in such an attribute has to follow.

<!ATTLIST image alt CDATA #IMPLIED>

You would also use this for other kinds of data such as prices, URIs, email and snail mail addresses, citations, and other types that while they have more structure than a simple string of text don't match any of the other attribute types. For example:

<!ATTLIST sku  list_price             CDATA #IMPLIED  suggested_retail_price CDATA #IMPLIED  actual_price           CDATA #IMPLIED > <!-- All three attributes should be in the form $XX.YY -->

3.3.1.2 NMTOKEN

An XML name token is very close to an XML name. It must consist of the same characters as an XML name, that is, alphanumeric and/or ideographic characters and the punctuation marks _, -, ., and :. Furthermore, like an XML name, an XML name token may not contain whitespace. However, a name token differs from an XML name in that any of the allowed characters can be the first character in a name token, while only letters, ideographs, and the underscore can be the first character of an XML name. Thus 12 and .cshrc are valid XML name tokens although they are not valid XML names. Every XML name is an XML name token, but not all XML name tokens are XML names.

The value of an attribute declared to have type NMTOKEN is an XML name token. For example, if you knew that the year attribute of a journal element should contain an integer such as 1990 or 2015, you might declare it to have NMTOKEN type, since all years are name tokens:

<!ATTLIST journal year NMTOKEN #REQUIRED>

This still doesn't prevent the document author from assigning the year attribute values like "99" or "March", but it at least eliminates some possible wrong values, especially those that contain whitespace such as "1990 C.E." or "Sally had a little lamb."

3.3.1.3 NMTOKENS

A NMTOKENS type attribute contains one or more XML name tokens separated by whitespace. For example, you might use this to describe the dates attribute of a performances element, if the dates were given in the form 08-26-2000, like this:

<performances dates="08-21-2001 08-23-2001 08-27-2001">   Kat and the Kings </performances>

The appropriate declaration is:

<!ATTLIST performances dates NMTOKENS #REQUIRED>

On the other hand, you could not use this for a list of dates in the form 08/27/2001 because the forward slash is not a legal name character.

3.3.1.4 Enumeration

An enumeration is the only attribute type that is not an XML keyword. Rather, it is a list of all possible values for the attribute, separated by vertical bars. Each possible value must be an XML name token. For example, the following declarations say that the value of the month attribute of a date element must be one of the twelve English month names, that the value of the day attribute must be a number between 1 and 31, and that the value of the year attribute must be an integer between 1970 and 2009:

<!ATTLIST date month (January | February | March | April | May | June   | July | August | September | October | November | December) #REQUIRED > <!ATTLIST date day (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12   | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25   | 26 | 27 | 28 | 29 | 30 | 31) #REQUIRED > <!ATTLIST date year (1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976   | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986   | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996   | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006   | 2007 | 2008 | 2009 ) #REQUIRED > <!ELEMENT date EMPTY>

Given this DTD, this date element is valid:

<date month="January" day="22" year="2001"/>

However, these date elements are invalid:

<date month="01"      day="22" year="2001"/> <date month="Jan"     day="22" year="2001"/> <date month="January" day="02" year="2001"/> <date month="January" day="2"  year="1969"/> <date month="Janvier" day="22" year="2001"/>

This trick works here because all the desired values happen to be legal XML name tokens. However, we could not use the same trick if the possible values included whitespace or any punctuation besides the underscore, hyphen, colon, and period.

3.3.1.5 ID

An ID type attribute must contain an XML name (not a name token but a name) that is unique within the XML document. More precisely, no other ID type attribute in the document can have the same value. (Attributes of non-ID type are not considered.) Each element may have no more than one ID type attribute.

As the keyword suggests, ID type attributes assign unique identifiers to elements. ID type attributes do not have to have the name "ID" or "id", though they very commonly do. For example, this ATTLIST declaration says that every employee element must have a social_security_number ID attribute:

<!ATTLIST employee social_security_number ID #REQUIRED>

ID numbers are tricky because a number is not an XML name and therefore not a legal XML ID. The normal solution is to prefix the values with an underscore or a common letter. For example:

<employee social_security_label="_078-05-1120"/>

3.3.1.6 IDREF

An IDREF type attribute refers to the ID type attribute of some element in the document. Thus, it must be an XML name. IDREF attributes are commonly used to establish relationships between elements when simple containment won't suffice.

For example, imagine an XML document that contains a list of project elements and employee elements. Every project has a project_id ID type attribute, and every employee has a social_security_number ID type attribute. Furthermore, each project has team_member child elements that identify who's working on the project, and each employee element has assignment children that indicate to which projects that employee is assigned. Since each project is assigned to multiple employees and some employees are assigned to more than one project, it's not possible to make the employees children of the projects or the projects children of the employees. The solution is to use IDREF type attributes like this:

<project project_id="p1">   <goal>Develop Strategic Plan</goal>   <team_member person="ss078-05-1120"/>   <team_member person="ss987-65-4320"/> </project> <project project_id="p2">   <goal>Deploy Linux</goal>   <team_member person="ss078-05-1120"/>   <team_member person="ss9876-12-3456"/> </project> <employee social_security_label="ss078-05-1120">   <name>Fred Smith</name>   <assignment project_id="p1"/>   <assignment project_id="p2"/> </employee> <employee social_security_label="ss987-65-4320">   <name>Jill Jones</name>   <assignment project_id="p1"/> </employee> <employee social_security_label="ss9876-12-3456">   <name>Sydney Lee</name>   <assignment project_id="p2"/> </employee>

In this example, the project_id attribute of the project element and the social_security_number attribute of the employee element would be declared to have type ID. The person attribute of the team_member element and the project_id attribute of the assignment element would have type IDREF. The relevant ATTLIST declarations look like this:

<!ATTLIST employee social_security_number ID    #REQUIRED> <!ATTLIST project  project_id             ID    #REQUIRED> <!ATTLIST team_member person              IDREF #REQUIRED> <!ATTLIST assignment  project_id          IDREF #REQUIRED>

These declarations constrain the person attribute of the team_member element and the project_id attribute of the assignment element to match the ID of something in the document. However, they do not constrain the person attribute of the team_member element to match only employee IDs or constrain the project_id attribute of the assignment element to match only project IDs. It would be valid (though not necessarily correct) for a team_member to hold the ID of another project or even the same project.

3.3.1.7 IDREFS

An IDREFS type attribute contains a whitespace-separated list of XML names, each of which must be the ID of an element in the document. This is used when one element needs to refer to multiple other elements. For instance, the previous project example could be rewritten so that the assignment children of the employee element were replaced by a single assignments attribute. Similarly, the team_member children of the project element could be replaced by a team attribute like this:

<project project_id="p1" team="ss078-05-1120 ss987-65-4320">   <goal>Develop Strategic Plan</goal> </project> <project project_id="p2" team="ss078-05-1120 ss9876-12-3456">   <goal>Deploy Linux</goal> </project> <employee social_security_label="ss078-05-1120" assignments="p1 p2">   <name>Fred Smith</name> </employee> <employee social_security_label="ss987-65-4320" assignments="p1">   <name>Jill Jones</name> </employee> <employee social_security_label="ss9876-12-3456" assignments="p2">   <name>Sydney Lee</name> </employee>

The appropriate declarations are:

<!ATTLIST employee social_security_number ID     #REQUIRED                    assignments            IDREFS #REQUIRED> <!ATTLIST project  project_id             ID     #REQUIRED                    team                   IDREFS #REQUIRED>

3.3.1.8 ENTITY

An ENTITY type attribute contains the name of an unparsed entity declared elsewhere in the DTD. For instance, a movie element might have an entity attribute identifying the MPEG or QuickTime file to play when the movie was activated:

<!ATTLIST movie source ENTITY #REQUIRED>

If the DTD declared an unparsed entity named X-Men-trailer, then this movie element might be used to embed that video file in the XML document:

<movie source="X-Men-trailer"/>

We'll discuss unparsed entities in more detail later in this chapter.

3.3.1.9 ENTITIES

An ENTITIES type attribute contains the names of one or more unparsed entities declared elsewhere in the DTD, separated by whitespace. For instance, a slide_show element might have an ENTITIES attribute identifying the JPEG files to show and in which order they were to be shown:

<!ATTLIST slide_show slides ENTITIES #REQUIRED>

If the DTD declared unparsed entities named slide1, slide2, slide3, and so on through slide10, then this slide_show element might be used to embed the show in the XML document:

<slide_show slides="slide1 slide2 slide3 slide4 slide5 slide6                     slide7 slide8 slide9 slide10"/>

3.3.1.10 NOTATION

A NOTATION type attribute contains the name of a notation declared in the document's DTD. This is perhaps the rarest attribute type and isn't much used in practice. In theory, it could be used to associate types with particular elements, as well as limiting the types associated with the element. For example, these declarations define four notations for different image types and then specify that each image element must have a type attribute that selects exactly one of them:

<!NOTATION gif  SYSTEM "image/gif"> <!NOTATION tiff SYSTEM "image/tiff"> <!NOTATION jpeg SYSTEM "image/jpeg"> <!NOTATION png  SYSTEM "image/png"> <!ATTLIST  image type NOTATION (gif | tiff | jpeg | png) #REQUIRED>

The type attribute of each image element can have one of the four values gif, tiff, jpeg, or png but not any other value. This has a slight advantage over the enumerated type in that the actual MIME media type of the notation is available, whereas an enumerated type could not specify image/png or image/gif as an allowed value because the forward slash is not a legal character in XML names.

3.3.2 Attribute Defaults

As well as providing a data type, each ATTLIST declaration includes a default declaration for that attribute. There are four possibilities for this default:

#IMPLIED: The attribute is optional. Each instance of the element may or may not provide a value for the attribute. No default value is provided.
#REQUIRED: The attribute is required. Each instance of the element must provide a value for the attribute. No default value is provided.
#FIXED: The attribute value is constant and immutable. This attribute has the specified value regardless of whether the attribute is explicitly noted on an individual instance of the element. If it is included, though, it must have the specified value.
Literal: The actual default value is given as a quoted string.

For example, this ATTLIST declaration says that person elements can have but do not have to have born and died attributes:

<!ATTLIST person born CDATA #IMPLIED                  died CDATA #IMPLIED >

This ATTLIST declaration says that every circle element must have center_x, center_y, and radius attributes:

<!ATTLIST circle center_x NMTOKEN #REQUIRED                  center_y NMTOKEN #REQUIRED                  radius   NMTOKEN #REQUIRED >

This ATTLIST declaration says that every biography element has an xmlns:xlink attribute and that the value of that attribute is http://www.w3.org/1999/xlink, even if the start-tag of the element does not explicitly include an xmlns:xlink attribute.

<!ATTLIST biography    xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">

This ATTLIST declaration says that every web_page element has a protocol attribute. If a particular web_page element doesn't have an explicit protocol attribute, then the parser will supply one with the value http:

<!ATTLIST web_page protocol NMTOKEN "http">

3.4 General Entity Declarations

As you learned in Chapter 2, XML predefines five entities for your convenience:

<: The less-than sign; a.k.a. the opening angle bracket (<)
&: The ampersand (&)
>: The greater-than sign; a.k.a. the closing angle bracket (>)
": The straight, double quotation marks (")
': The apostrophe; a.k.a. the straight single quote (')

The DTD can define many more. This is useful not just in valid documents, but even in documents you don't plan to validate.

Entity references are defined with an ENTITY declaration in the DTD. This gives the name of the entity, which must be an XML name, and the replacement text of the entity. For example, this entity declaration defines &super; as an abbreviation for supercalifragilisticexpialidocious:

<!ENTITY super "supercalifragilisticexpialidocious">

Once that's done, you can use &super; anywhere you'd normally have to type the entire word (and probably misspell it).

Entities can contain markup as well as text. For example, this declaration defines &footer; as an abbreviation for a standard web-page footer that will be repeated on many pages:

<!ENTITY footer '<hr size="1" noshade="true"/> <font CLASS="footer"> <a href="index.html">O&apos;Reilly Home</a> | <a href="sales/bookstores/">O&apos;Reilly Bookstores</a> | <a href="order_new/">How to Order</a> | <a href="oreilly/contact.html">O&apos;Reilly Contacts</a><br> <a href="http://international.oreilly.com/">International</a> | <a href="oreilly/about.html">About O&apos;Reilly</a> | <a href="affiliates.html">Affiliated Companies</a> </font> <p> <font CLASS="copy"> Copyright 2000, O&apos;Reilly &amp; Associates, Inc.<br/> <a href="mailto:webmaster@oreilly.com">webmaster@oreilly.com</a> </font> </p> '>

The entity replacement text must be well-formed. For instance, you cannot put a start-tag in one entity and the corresponding end-tag in another entity.

The other thing you have to be careful about is that you need to use different quote marks inside the replacement text from the ones that delimit it. Here we've chosen single quotes to surround the replacement text and double quotes internally. However, we did have to change the single quote in "O'Reilly" to the predefined general entity reference '. Replacement text may itself contain entity references that are resolved before the text is replaced. However, self-referential and circular references are forbidden.

General entities insert replacement text into the body of an XML document. They can also be used inside the DTD in places where they will eventually be included in the body of an XML document, for instance in an attribute default value or in the replacement text of another entity. However, they cannot be used to provide the text of the DTD itself. For instance, this is illegal:

Shortly, we'll see how to use a different kind of entity the parameter entity to achieve the desired result.

3.5 External Parsed General Entities

The footer example is about at the limits of what you can comfortably fit in a DTD. In practice, web sites prefer to store repeated content like this in external files and load it into their pages using PHP, server-side includes, or some similar mechanism. XML supports this technique through external general entity references, though in this case the client, rather than the server, is responsible for integrating the different pieces of the document into a coherent whole.

An external parsed general entity reference is declared in the DTD using an ENTITY declaration. However, instead of the actual replacement text, the SYSTEM keyword and a URI to the replacement text is given. For example:

<!ENTITY footer SYSTEM "http://www.oreilly.com/boilerplate/footer.xml">

Of course, a relative URL will often be used instead. For example:

<!ENTITY footer SYSTEM "/boilerplate/footer.xml">

In either case when the general entity reference &footer; is seen in the character data of an element, the parser may replace it with the document found at http://www.oreilly.com/boilerplate/footer.xml. References to external parsed entities are not allowed in attribute values. Most of the time, this shouldn't be too big a hassle because attribute values tend to be small enough to be easily included in internal entities.

Notice we wrote that the parser may replace the entity reference with the document at the URL, not that it must. This is an area where parsers have some leeway in just how much of the XML specification they wish to implement. A validating parser must retrieve such an external entity. However, a nonvalidating parser may or may not choose to retrieve the entity.

Furthermore, not all text files can serve as external entities. In order to be loaded in by a general entity reference, the document must be potentially well-formed when inserted into an existing document. This does not mean the external entity itself must be well-formed. In particular, the external entity might not have a single root element. However, if such a root element were wrapped around the external entity, then the resulting document should be well-formed. This means, for example, that all elements that start inside the entity must finish inside the same entity. They cannot finish inside some other entity. Furthermore, the external entity does not have a prolog and, therefore, cannot have an XML declaration or a document type declaration.

3.5.1 Text Declarations

Instead of an XML declaration, an external entity may have a text declaration; this looks a lot like an XML declaration. The main difference is that in a text declaration the encoding declaration is required, while the version info is optional. Furthermore, there is no standalone declaration. The main purpose of the text declaration is to warn the parser if the external entity uses a different text encoding than the including document. For example, this is a common text declaration:

<?xml version="1.0" encoding="MacRoman"?>

However, you could also use this text declaration with no version attribute:

<?xml encoding="MacRoman"?>

Example 3-5 is a well-formed external entity that could be included from another document using an external general entity reference.

Example 3-5. An external parsed entity

<?xml encoding="ISO-8859-1"?> <hr size="1" noshade="true"/> <font CLASS="footer">   <a href="index.html">O'Reilly Home</a> |   <a href="sales/bookstores/">O'Reilly Bookstores</a> |   <a href="order_new/">How to Order</a> |   <a href="oreilly/contact.html">O'Reilly Contacts</a><br>   <a href="http://international.oreilly.com/">International</a> |   <a href="oreilly/about.html">About O'Reilly</a> |   <a href="affiliates.html">Affiliated Companies</a> </font> <p>   <font CLASS="copy">     Copyright 2000, O'Reilly &amp; Associates, Inc.<br/>     <a href="mailto:webmaster@oreilly.com">webmaster@oreilly.com</a>   </font> </p>

3.6 External Unparsed Entities and Notations

Not all data is XML. There are a lot of ASCII text files in the world that don't give two cents about escaping < as < or adhering to the other constraints by which an XML document is limited. There are probably even more JPEG photographs, GIF line art, QuickTime movies, MIDI sound files, and so on. None of these are well-formed XML, yet all of them are necessary components of many documents.

The mechanism that XML suggests for embedding these things in your documents is the external unparsed entity. The DTD specifies a name and a URI for the entity containing the non-XML data. For example, this ENTITY declaration associates the name turing_getting_off_bus with the JPEG image at http://www.turing.org.uk/turing/pi1/bus.jpg:

<!ENTITY turing_getting_off_bus          SYSTEM "http://www.turing.org.uk/turing/pi1/bus.jpg"          NDATA jpeg>

3.6.1 Notations

Since the data is not in XML format, the NDATA declaration specifies the type of the data. Here the name jpeg is used. XML does not recognize this as meaning an image in a format defined by the Joint Photographs Experts Group. Rather this is the name of a notation declared elsewhere in the DTD using a NOTATION declaration like this:

<!NOTATION jpeg SYSTEM "image/jpeg">

Here we've used the MIME media type image/jpeg as the external identifier for the notation. However, there is absolutely no standard or even a suggestion for exactly what this identifier should be. Individual applications must define their own requirements for the contents and meaning of notations.

3.6.2 Embedding Unparsed Entities in Documents

The DTD only declares the existence, location, and type of the unparsed entity. To actually include the entity in the document at one or more locations, you insert an element with an ENTITY type attribute whose value is the name of an unparsed entity declared in the DTD. You do not use an entity reference like &turing_getting_off_bus;. Entity references can only refer to parsed entities.

Suppose the image element and its source attribute are declared like this:

<!ELEMENT image EMPTY> <!ATTLIST image source ENTITY #REQUIRED>

Then, this image element would refer to the photograph at http://www.turing.org.uk/turing/pi1/bus.jpg:

<image source="turing_getting_off_bus"/>

We should warn you that XML doesn't guarantee any particular behavior from an application that encounters this type of unparsed entity. It very well may not display the image to the user. Indeed, the parser may be running in an environment where there's no user to display the image to. It may not even understand that this is an image. The parser may not load or make any sort of connection with the server where the actual image resides. At most, it will tell the application on whose behalf it's parsing that there is an unparsed entity at a particular URI with a particular notation and let the application decide what, if anything, it wants to do with that information.

Unparsed general entities are not the only plausible way to embed non-XML content in XML documents. In particular, a simple URL, possibly associated with an XLink, does a fine job for many purposes, just as it does in HTML (which gets along just fine without any unparsed entities). Including all the necessary information in a single empty element like <image source = "http://www.turing.org.uk/turing/pi1/bus.jpg" /> is arguably preferable to splitting the same information between the element where it's used and the DTD of the document in which it's used. The only thing an unparsed entity really adds is the notation, but that's too nonstandard to be of much use.

In fact, many experienced XML developers, including the authors of this book, feel strongly that unparsed entities are a complicated, confusing mistake that should never have been included in XML in the first place. Nonetheless, they are a part of the specification, so we describe them here.

3.6.3 Notations for Processing Instruction Targets

Notations can also be used to identify the exact target of a processing instruction. A processing instruction target must be an XML name, which means it can't be a full path like /usr/local/bin/tex. A notation can identify a short XML name like tex with a more complete specification of the external program for which the processing instruction is intended. For example, this notation associates the target tex with the more complete path /usr/local/bin/tex:

<!NOTATION tex SYSTEM "/usr/local/bin/tex">

In practice, this technique isn't much used or needed. Most applications that read XML files and pay attention to particular processing instructions simply recognize a particular target string like php or robots.

3.7 Parameter Entities

It is not uncommon for multiple elements to share all or part of the same attribute lists and content specifications. For instance, any element that's a simple XLink will have xlink:type and xlink:href attributes, and perhaps xlink:show and xlink:actuate attributes. In XHTML, a th element and a td element contain more or less the same content. Repeating the same content specifications or attribute lists in multiple element declarations is tedious and error-prone. It's entirely possible to add a newly defined child element to the declaration of some of the elements but forget to include it in others.

For example, consider an XML application for residential real-estate listings that provides separate elements for apartments, sublets, coops for sale, condos for sale, and houses for sale. The element declarations might look like this:

<!ELEMENT apartment (address, footage, rooms, baths, rent)> <!ELEMENT sublet    (address, footage, rooms, baths, rent)> <!ELEMENT coop      (address, footage, rooms, baths, price)> <!ELEMENT condo     (address, footage, rooms, baths, price)> <!ELEMENT house     (address, footage, rooms, baths, price)>

There's a lot of overlap between the declarations, i.e., a lot of repeated text. And if you later decide you need to add an additional element, available_date for instance, then you need to add it to all five declarations. It would be preferable to define a constant that can hold the common parts of the content specification for all five kinds of listings and refer to that constant from inside the content specification of each element. Then to add or delete something from all the listings, you'd only need to change the definition of the constant.

An entity reference is the obvious candidate here. However, general entity references are not allowed to provide replacement text for a content specification or attribute list, only for parts of the DTD that will be included in the XML document itself. Instead, XML provides a new construct exclusively for use inside DTDs, the parameter entity, which is referred to by a parameter entity reference. Parameter entities behave like and are declared almost exactly like a general entity. However, they use a % instead of an &, and they can only be used in a DTD while general entities can only be used in the document content.

3.7.1 Parameter Entity Syntax

A parameter entity reference is declared much like a general entity reference. However, an extra percent sign is placed between the <!ENTITY and the name of the entity. For example:

<!ENTITY % residential_content "address, footage, rooms, baths"> <!ENTITY % rental_content      "rent"> <!ENTITY % purchase_content    "price">

Parameter entities are dereferenced in the same way as a general entity reference, only with a percent sign instead of an ampersand:

<!ELEMENT apartment (%residential_content;, %rental_content;)> <!ELEMENT sublet    (%residential_content;, %rental_content;)> <!ELEMENT coop      (%residential_content;, %purchase_content;)> <!ELEMENT condo     (%residential_content;, %purchase_content;)> <!ELEMENT house     (%residential_content;, %purchase_content;)>

When the parser reads these declarations, it substitutes the entity's replacement text for the entity reference. Now all you have to do to add an available_date element to the content specification of all five listing types is add it to the residential_content entity like this:

<!ENTITY % residential_content "address, footage, rooms,                                 baths, available_date">

The same technique works equally well for attribute types and element names. You'll see several examples of this in the next chapter on namespaces and in Chapter 9.

This trick is limited to external DTDs. Internal DTD subsets do not allow parameter entity references to be only part of a markup declaration. However, parameter entity references can be used in internal DTD subsets to insert one or more entire markup declarations, typically through external parameter entities.

3.7.2 Redefining Parameter Entities

What makes parameter entity references particularly powerful is that they can be redefined. If a document uses both internal and external DTD subsets, then the internal DTD subset can specify new replacement text for the entities. If ELEMENT and ATTLIST declarations in the external DTD subset are written indirectly with parameter entity references instead of directly with literal element names, the internal DTD subset can change the DTD for the document. For instance, a single document could add a bedrooms child element to the listings by redefining the residential_content entity like this:

<!ENTITY % residential_content "address, footage, rooms,                                 bedrooms, baths, available_date">

In the event of conflicting entity declarations, the first one encountered takes precedence. The parser reads the internal DTD subset first. Thus the internal definition of the residential_content entity is used. When the parser reads the external DTD subset, every declaration that uses the residential_content entity will contain a bedrooms child element it wouldn't otherwise have.

Modular XHTML, which we'll discuss in Chapter 7, makes heavy use of this technique to allow particular documents to select only the subset of HTML that they actually need.

3.7.3 External DTD Subsets

Real-world DTDs can be quite complex. The SVG DTD is over 1,000 lines long. The XHTML 1.0 strict DTD (the smallest of the three XHTML DTDs) is more than 1,500 lines long. And these are only medium-sized DTDs. The DocBook XML DTD is over 11,000 lines long. It can be hard to work with, comprehend, and modify such a large DTD when it's stored in a single monolithic file.

Fortunately, DTDs can be broken up into independent pieces. For instance, the DocBook DTD is distributed in 28 separate pieces covering different parts of the spec: one for tables, one for notations, one for entity declarations, and so on. These different pieces are then combined at validation time using external parameter entity references.

An external parameter entity is declared using a normal ENTITY declaration with a % sign just like a normal parameter entity. However, rather than including the replacement text directly, the declaration contains the SYSTEM keyword followed by a URI to the DTD piece it wants to include. For example, the following ENTITY declaration defines an external entity called names whose content is taken from the file at the relative URI names.dtd. Then the parameter entity reference %names; inserts the contents of that file into the current DTD.

<!ENTITY % names SYSTEM "names.dtd"> %names;

You can use either relative or absolute URIs as convenient. In most situations, relative URIs are more practical.

3.8 Conditional Inclusion

XML offers the IGNORE directive for the purpose of "commenting out" a section of declarations. For example, a parser will ignore the following declaration of a production_note element, as if it weren't in the DTD at all:

<![IGNORE[   <!ELEMENT production_note (#PCDATA)> ]]>

This may not seem particularly useful. After all, you could always simply use an XML comment to comment out the declarations you want to remove temporarily from the DTD. If you feel that way, the INCLUDE directive is going to seem even more pointless. Its purpose is to indicate that the given declarations are actually used in the DTD. For example:

<![INCLUDE[   <!ELEMENT production_note (#PCDATA)> ]]>

This has exactly the same effect and meaning as if the INCLUDE directive were not present. However, now consider what happens if we don't use INCLUDE and IGNORE directly. Instead, suppose we define a parameter entity like this:

<!ENTITY % notes_allowed "INCLUDE">

Then we use a parameter entity reference instead of the keyword:

<![%notes_allowed;[   <!ELEMENT production_note (#PCDATA)> ]]>

The notes_allowed parameter entity can be redefined from outside this DTD. In particular, it can be redefined in the internal DTD subset of a document. This provides a switch individual documents can use to turn the production_note declaration on or off. This technique allows document authors to select only the functionality they need from the DTD.

3.9 Two DTD Examples

Some of the best techniques for DTD design only become apparent when you look at larger documents. In this section, we'll develop DTDs that cover the two different document formats for describing people that were presented in Example 2-4 and Example 2-5 of the last chapter.

3.9.1 Data-Oriented DTDs

Data- oriented DTDs are very straightforward. They make heavy use of sequences, occasional use of choices, and almost no use of mixed content. Example 3-6 shows such a DTD. Since this is a small example, and since it's easier to understand when both the document and the DTD are on the same page, we've made this an internal DTD included in the document. However, it would be easy to take it out and put it in a separate file.

Example 3-6. A flexible yet data-oriented DTD describing people

<?xml version="1.0"?> <!DOCTYPE person  [   <!ELEMENT person (name+, profession*)>   <!ELEMENT name EMPTY>   <!ATTLIST name first CDATA #REQUIRED                  last  CDATA #REQUIRED>   <!-- The first and last attributes are required to be present        but they may be empty. For example,        <name first="Cher" last=""> -->   <!ELEMENT profession EMPTY>   <!ATTLIST profession value CDATA #REQUIRED> ]> <person>   <name first="Alan" last="Turing"/>   <profession value="computer scientist"/>   <profession value="mathematician"/>   <profession value="cryptographer"/> </person>

The DTD here is contained completely inside the internal DTD subset. First a person ELEMENT declaration states that each person must have one or more name children, and zero or more profession children, in that order. This allows for the possibility that a person changes his name or uses aliases. It assumes that each person has at least one name but may not have a profession.

This declaration also requires that all name elements precede all profession elements. Here the DTD is less flexible than it ideally would be. There's no particular reason that the names have to come first. However, if we were to allow more random ordering, it would be hard to say that there must be at least one name. One of the weaknesses of DTDs is that it occasionally forces extra sequence order on you when all you really need is a constraint on the number of some element. Schemas are more flexible in this regard.

Both name and profession elements are empty so their declarations are very simple. The attribute declarations are a little more complex. In all three cases the form of the attribute is open, so all three attributes are declared to have type CDATA. All three are also required. However, note the use of comments to suggest a solution for edge cases such as celebrities with no last names. Comments are an essential tool for making sense of otherwise obfuscated DTDs.

3.9.2 Narrative-Oriented DTDs

Narrative-oriented DTDs tend be a lot looser and make much heavier use of mixed content than do DTDs that describe more database-like documents. Consequently, they tend to be written from the bottom up, starting with the smallest elements and building up to the largest. They also tend to use parameter entities to group together similar content specifications and attribute lists.

Example 3-7 is a standalone DTD for biographies like the one shown in Example 2-5 of the last chapter. Notice that not everything it declares is actually present in Example 2-5. That's often the case with narrative documents. For instance, not all web pages contain unordered lists, but the XHTML DTD still needs to declare the ul element for those XHTML documents that do include them. Also, notice that a few attributes present in Example 2-5 have been made into fixed defaults here. Thus, they could be omitted from the document itself, once it is attached to this DTD.

Example 3-7. A narrative-oriented DTD for biographies

<!ATTLIST biography xmlns:xlink CDATA #FIXED                                        "http://www.w3.org/1999/xlink"> <!ELEMENT person (first_name, last_name)> <!-- Birth and death dates are given in the form yyyy/mm/dd --> <!ATTLIST person born CDATA #IMPLIED                  died CDATA #IMPLIED> <!ELEMENT date   (month, day, year)> <!ELEMENT month  (#PCDATA)> <!ELEMENT day    (#PCDATA)> <!ELEMENT year   (#PCDATA)> <!-- xlink:href must contain a URI.--> <!ATTLIST emphasize xlink:type (simple) #IMPLIED                     xlink:href CDATA   #IMPLIED> <!ELEMENT profession (#PCDATA)> <!ELEMENT footnote   (#PCDATA)> <!-- The source is given according to the Chicago Manual of Style      citation conventions --> <!ATTLIST footnote source CDATA #REQUIRED> <!ELEMENT first_name (#PCDATA)> <!ELEMENT last_name  (#PCDATA)> <!ELEMENT image EMPTY> <!ATTLIST image source CDATA   #REQUIRED                 width  NMTOKEN #REQUIRED                 height NMTOKEN #REQUIRED                 ALT    CDATA   #IMPLIED > <!ENTITY % top_level "( #PCDATA | image | paragraph | definition                        | person | profession | emphasize | last_name                       | first_name | footnote | date )*"> <!ELEMENT paragraph  %top_level; > <!ELEMENT definition %top_level; > <!ELEMENT emphasize  %top_level; > <!ELEMENT biography  %top_level; >

The root biography element has a classic mixed-content declaration. Since there are several elements that can contain other elements in a fairly unpredictable fashion, we group all the possible top-level elements (elements that appear as immediate children of the root element) in a single top_level entity reference. Then we can make all of them potential children of each other in a straightforward way. This also makes it much easier to add new elements in the future. That's important since this one small example is almost certainly not broad enough to cover all possible biographies.

3.10 Locating Standard DTDs

DTDs and validity are most important when you're exchanging data with others ; they let you verify that you're sending what the receiver expects and vice versa. Of course, this works best if both ends of a conversation agree on which DTD and vocabulary they will use. There are many standard DTDs for different professions and disciplines and more are created every day. It is often better to use an established DTD and vocabulary than to design your own.

However, there is no agreed-upon, central repository that documents and links to such efforts. We know of at least three attempts to create such a central repository. These are:

James Tauber's schema.net at http://www.schema.net/
OASIS's xml.org at http://www.xml.org/xml/registry.jsp
Microsoft's Biztalk initiative at http://www.biztalk.org (registration required)

However, none of these has succeeded in establishing itself as the standard place to list DTDs and none of them cover more than a minority of the existing public DTDs. Indeed, probably the largest list of DTDs online does not even attempt to be a general repository, instead being simply the result of collecting XML and SGML news for several years. This is Robin Cover's list of XML applications at http://www.oasis-open.org/cover/siteIndex.html#toc-applications.

The W3C is one of the most prolific producers of standard XML DTDs. It has moved almost all of its future development to XML including SVG, the Platform for Internet Content Selection (PICS), the Resource Description Framework (RDF), the Mathematical Markup Language (MathML), and even HTML itself. DTDs for these XML applications are generally published as appendixes to the specifications for the applications. The specifications are all found at http://www.w3.org/TR/.

However, XML isn't just for the Web, and far more activity is going on outside the W3C than inside it. Generally, within any one field, you should look to that field's standards bodies for DTDs relating to that area of interest. For example, the American Institute of Certified Public Accountants has published a DTD for XFRML, the Extensible Financial Reporting Markup Language. The Object Management Group (OMG) has published a DTD for describing Unified Modeling Language (UML) diagrams in XML. The Society of Automotive Engineers has published an XML application for emissions information as required by the 1990 U.S. Clean Air Act. Chances are that in any industry that makes heavy use of information technology, some group or groups, either formal or informal, are already working on DTDs that cover parts of that industry.

[1] The document type declaration and the document type definition are two different things. The abbreviation DTD is properly used only to refer to the document type definition.

CONTENTS