Section 7.5. Document Type Definition (DTD) | Web Design in a Nutshell: A Desktop Quick Reference (In a Nutshell (OReilly))

7.5. Document Type Definition (DTD)

A Document Type Definition (DTD) is a file associated with SGML and XML documents that defines how markup tags should be interpreted by the application reading the document. The DTD uses SGML syntax to explain precisely which elements and attributes may appear in a document and the context in which they may be used. DTDs were briefly introduced earlier in this chapter. In this section, we'll take a closer look.

A DTD is a text document that contains a set of rules, formally known as element declarations, attlist (attribute) declarations, and entity declarations. DTDs are most often stored in a separate file (with the .dtd suffix) and shared by multiple documents; however, DTD information can be included inside the XML document as well. Both methods are demonstrated later in this section.

Reading DTDs

While you may never be required to write a DTD, knowing how to read one is a useful skill if you plan on getting cozy with XHTML or any other markup language whose DTD is published by the W3C. This section should give you a good start, but you may also want to check out these online resources.

"How to Read W3C Specs" by J. David Eisenberg at www.alistapart.com/articles/readspec/
W3Schools DTD Tutorial at www.w3schools.com/dtd/default.asp

7.5.1. Document Type Declarations

XML documents specify which DTD they use via a document type declaration (also called a DOCTYPE declaration).

When the DTD is an external document, the DOCTYPE declaration identifies the root element for the document, lists the method used to identify the DTD (SYSTEM or PUBLIC), and then finally provides the location or name of the DTD itself. When using an external DTD, it is recommended that you include the standalone attribute set to "no" in the XML declaration.

A SYSTEM identifier points to the DTD file by location (its URI), as shown in this example:

 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE compilation SYSTEM "http://www.littlechair.com/notreal/comp.dtd">

DTDs that are shared by a large community or are hosted at multiple sites may have a PUBLIC ID that specifies the XML application. When public IDs are used, it is common practice to supply an additional SYSTEM URI because it is better supported. Web developers who write documents in XHTML will be familiar with the following DOCTYPE declaration that indicates the root element (html) and the public identifier for XHTML Strict. This declaration also specifies its URL as a backup method.

 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

As an alternative, the DTD may be included in the XML document itself, rather than as an external .dtd document. This is done by placing the DTD within square brackets in the document type declaration as shown here:

 <?xml version="1.0"?>
 <!DOCTYPE phonebook [
    <!ELEMENT listing (name, number)>
    <!ELEMENT name    (#PCDATA)>
    <!ELEMENT number  (#PCDATA)>
 ]>

An XML document may combine external and internal DTD subsets.
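
As a rough sketch of how the two subsets can be combined, the following document points to an external DTD and adds one declaration of its own in the internal subset. The filename phonebook.dtd and the areacode entity are made up for illustration, and the sketch assumes the external file declares the phonebook, listing, name, and number elements.

```xml
<?xml version="1.0" standalone="no"?>
<!-- phonebook.dtd is a hypothetical external file, assumed to declare
     the phonebook, listing, name, and number elements -->
<!DOCTYPE phonebook SYSTEM "phonebook.dtd" [
  <!-- Internal subset: a document-specific entity (hypothetical) -->
  <!ENTITY areacode "707">
]>
<phonebook>
  <listing>
    <name>Little Chair Studio</name>
    <number>(&areacode;) 555-0100</number>
  </listing>
</phonebook>
```

If the same entity is declared in both subsets, the declaration in the internal subset is the one that takes effect, which lets an individual document override a shared DTD.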

7.5.2. Valid XML

When an XML document conforms to all the rules established in the DTD, it is said to be valid, meaning that all the elements are used correctly.

A well-formed document is not necessarily valid, but if a document proves to be valid, it follows that it is also well-formed.
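
To make the distinction concrete, here is a small sketch based on the phonebook example. The declaration for the phonebook root element and the sample names are assumptions added for illustration. Both listings are well-formed, but only the first one is valid.

```xml
<?xml version="1.0"?>
<!DOCTYPE phonebook [
  <!ELEMENT phonebook (listing+)>
  <!ELEMENT listing   (name, number)>
  <!ELEMENT name      (#PCDATA)>
  <!ELEMENT number    (#PCDATA)>
]>
<phonebook>
  <!-- Valid: name comes first, then number, as the DTD requires -->
  <listing>
    <name>Ann Example</name>
    <number>555-0100</number>
  </listing>
  <!-- Well-formed but NOT valid: every element is properly nested and
       closed, but number precedes name and the email element is not
       declared in the DTD -->
  <listing>
    <number>555-0101</number>
    <name>Bob Example</name>
    <email>bob@example.com</email>
  </listing>
</phonebook>
```

A non-validating parser would accept this whole document without complaint. A validating parser would report errors for the second listing.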

When your document uses a DTD, you can check it for mistakes using a validating parser. The parser checks the document against the DTD for contextual errors, such as missing elements or improper order of elements. Some common parsers are Xerces from the Apache XML Project (available at xml.apache.org) and Microsoft MSXML (msdn.microsoft.com/xml/default.asp). A full list of validating parsers is provided by Web Developer's Virtual Library at wdvl.com/Software/XML/parsers.html.

As an alternative to downloading your own parser, you can use a free online parsing service. Just enter the locations of your documents at these sites:

The Brown University Scholarly Technology Group's XML Validation Form at www.stg.brown.edu/service/xmlvalid/
W3Schools XML Validator (based on MSXML) at www.w3schools.com/dom/dom_validate.asp

XML Names

When naming elements and attributes (and other less common XML constructs), you must follow the rules for XML names:

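Names may contain letters, digits, hyphens, underscores, periods, and non-English characters (such as accented letters).
Names may not start with a digit or punctuation (exception: _ (underscore) is allowed at the start).
Names must not start with the letters "xml" (in any combination of upper- and lowercase), which are reserved.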
Names may not contain whitespace of any kind (space, carriage return, line feed, or non-breaking space).
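
The rules above can be illustrated with a few invented element names. The catalog vocabulary is hypothetical. The illegal names appear only inside a comment, because using them as real tags would keep the document from being well-formed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <!-- Legal names -->
  <item_1>starts with a letter and contains a digit</item_1>
  <_draft>starts with an underscore</_draft>
  <résumé>contains non-English letters</résumé>
  <sale-price.usd>contains a hyphen and a period</sale-price.usd>

  <!-- Illegal names:
         <1item>      starts with a digit
         <-note>      starts with punctuation
         <item name>  whitespace ends the name after "item"
  -->

  <!-- Names beginning with "xml" (such as xmlData) are reserved.
       Many parsers accept them, but they should be avoided. -->
</catalog>
```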

7.5.3. DTD Syntax

The following example is made up of lines taken from the XHTML Strict DTD (the full DTD is over 1,500 lines long). It contains samples of element, attlist (attribute), and entity declarations.

 <!ELEMENT title (#PCDATA)>
 <!ELEMENT meta EMPTY>
 <!ELEMENT ul (li)+>
 <!ENTITY % i18n
  "lang        %LanguageCode; #IMPLIED
   xml:lang    %LanguageCode; #IMPLIED
   dir         (ltr|rtl)      #IMPLIED"
   >
 <!ATTLIST title
   %i18n;
   id          ID             #IMPLIED
   >
 <!ATTLIST meta
   %i18n;
   id          ID             #IMPLIED
   http-equiv  CDATA          #IMPLIED
   name        CDATA          #IMPLIED
   content     CDATA          #REQUIRED
   scheme      CDATA          #IMPLIED
   >

7.5.3.1. Element declarations

Element declarations are the core of the DTD. Every element must have an element declaration in order for the document to validate. Consider the parts of this declaration for the title element.

 <!ELEMENT title (#PCDATA)>

!ELEMENT identifies the line as an element declaration (no surprise there). The next part provides the element name (in this case, title) that will be used in the markup tag. Finally, the material within the parentheses identifies the content model for the element, or in other words, what type of content it may contain. In this example, the content model for the title element is #PCDATA, which stands for parsed character data. This means the content is character data that may or may not include escaped character entities (such as &lt; and &amp; for < and &, respectively), but it may not include other elements.
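
The XHTML sample also uses EMPTY and (li)+, which hint at the other forms a content model can take. The declarations below are a sketch of the most common ones. The recipe vocabulary is invented for illustration and is not part of XHTML.

```xml
<!-- Hypothetical DTD fragment illustrating common content models -->

<!-- A sequence (comma-separated) of child elements in fixed order:
     ?  zero or one      +  one or more      *  zero or more -->
<!ELEMENT recipe     (title, photo?, summary?, ingredient+, step*)>

<!-- Text only, no child elements -->
<!ELEMENT title      (#PCDATA)>
<!ELEMENT ingredient (#PCDATA)>
<!ELEMENT step       (#PCDATA)>
<!ELEMENT em         (#PCDATA)>

<!-- Mixed content: text and em elements in any order and number -->
<!ELEMENT summary    (#PCDATA | em)*>

<!-- EMPTY: the element may have attributes but no content -->
<!ELEMENT photo      EMPTY>
```

Read this way, the XHTML declaration for ul says that a list must contain one or more li elements and nothing else.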

7.5.3.2. Attlist (attribute) declarations

ATTLIST (attribute) declarations are used to declare the attributes permitted for a particular element. The following attribute declaration from the previous XHTML example says that the meta element may use the attributes id, http-equiv, name, content, and scheme. %i18n is an entity that represents still more available attributes (more on entities next).

 <!ATTLIST meta
   %i18n;
   id          ID             #IMPLIED
   http-equiv  CDATA          #IMPLIED
   name        CDATA          #IMPLIED
   content     CDATA          #REQUIRED
   scheme      CDATA          #IMPLIED
   >

After each attribute name is its attribute type, which provides an indication of the type of information its value may contain. The most common attribute types are CDATA (character data) and an enumerated list of possible values (for example, (left|right|center)). Other attribute types include ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, and NOTATION. Attributes that use the reserved xml: prefix, such as xml:lang, are declared with these same types.

Finally, a default value is provided for each attribute. The default value itself may be listed, or there may be an indication of whether the attribute is required within the element (#REQUIRED), optional (#IMPLIED), or fixed (#FIXED value).
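
The following sketch shows each kind of default in one declaration. The note element and its attributes are hypothetical and are used only to show the syntax.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note
    id        ID                 #REQUIRED
    author    CDATA              #IMPLIED
    priority  (low|normal|high)  "normal"
    version   CDATA              #FIXED "1.0"
  >
]>
<!-- id must be supplied; author may be omitted; priority is treated as
     "normal" when omitted; version is always "1.0" and may not be set
     to any other value -->
<note id="n1" author="Jen">Pick up the proofs on Friday.</note>
```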

7.5.3.3. Entity declarations

In XML, an entity is a string of characters that stands for something else. An entity can be used to represent a single character or a selection of marked up content, such as a footer containing copyright information. Entity declarations provide the name of the entity (which must be a legal XML name; see the earlier sidebar "XML Names") and its replacement text. The five character entities provided by XML were listed in Table 7-1.

General entities insert replacement text into the body of an XML document. The syntax for declaring general entities is:

 <!ENTITY address "1005 Gravenstein Highway, North Sebastopol, CA 95472">

As a result, wherever the author places an &address; entity in the XML source, it will be replaced by the full address upon display. The content may include markup tags. (Be sure that when double quotes are used to delimit the entity value, single quotes are used in the enclosed content, or vice versa.) The content of an entity may also reside in a separate, external file that is referenced in the entity declaration by its URL.
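
As a sketch of these variations, the document below declares one entity whose replacement text contains markup (note the single quotes inside the double-quoted value) and one whose content lives in an external file. The element names and the files legal.html and boilerplate.xml are assumptions made for illustration. The external file would need to contain content that is allowed inside a p element.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!ELEMENT page (p+)>
  <!ELEMENT p    (#PCDATA | a)*>
  <!ELEMENT a    (#PCDATA)>
  <!ATTLIST a
    href  CDATA  #REQUIRED
  >
  <!-- Internal entity whose replacement text includes markup -->
  <!ENTITY footer "Copyright <a href='legal.html'>Example Press</a>">
  <!-- External entity: the content is read from a separate file -->
  <!ENTITY boilerplate SYSTEM "boilerplate.xml">
]>
<page>
  <p>&footer;</p>
  <p>&boilerplate;</p>
</page>
```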

The XHTML sample at the beginning of this section includes another kind of entity called a parameter entity, shown here:

 <!ENTITY % i18n
  "lang        %LanguageCode; #IMPLIED
   xml:lang    %LanguageCode; #IMPLIED
   dir         (ltr|rtl)      #IMPLIED"
   >

Parameter entities are used only within the DTD itself, where they stand for text (such as a group of attribute or element declarations) that is repeated throughout the DTD. They are indicated by the % symbol (rather than &). The entity declaration above creates a parameter entity called i18n (shorthand for "internationalization"), referenced as %i18n;, that includes three language-related attributes. Because these three attributes apply to nearly every XHTML element, the DTD declares them once as a parameter entity rather than repeating them in every ATTLIST declaration. You can see it in use in the attribute declaration for the meta element.
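
The same technique can be sketched on a smaller scale. The fragment below is a hypothetical external DTD file. The common.attrs entity and the heading and para elements are invented for illustration. It has to be an external file because XML allows parameter entity references inside a declaration only in the external subset.

```xml
<!-- Fragment of a hypothetical external DTD file -->

<!-- Declare the shared attributes once -->
<!ENTITY % common.attrs
  "id     ID     #IMPLIED
   class  CDATA  #IMPLIED">

<!ELEMENT heading (#PCDATA)>
<!ATTLIST heading
  %common.attrs;
  level  (1|2|3)  #REQUIRED
>

<!-- para gets id and class from the same parameter entity -->
<!ELEMENT para (#PCDATA)>
<!ATTLIST para
  %common.attrs;
>
```

Adding an attribute to common.attrs later makes it available on every element that references the entity.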

7.5.4. When to Use a DTD

If you create a markup language in XML, it is not mandatory that it have a DTD. In fact, DTDs come with a few disadvantages, described below. That said, a DTD is useful when you have specific markup requirements to apply across a large number of documents. A DTD can ensure that certain data fields are present or delivered in a particular format. You may also want to spend the time preparing a DTD if you need to coordinate content from various sources and authors. Having a DTD makes it easier to find mistakes in your code.

The disadvantages of DTDs are that they require time and effort to develop and are inconvenient to maintain (particularly while the XML language is in flux). DTDs slow down processing times and may be too restrictive on the user's end. Another problem with DTDs is that they are not namespace-aware (namespaces are discussed next). Elements and attributes from another namespace won't validate under a DTD unless the DTD explicitly includes them. If you are creating just a few XML documents, you may choose not to create a DTD. If you are using namespaces and need formal documentation of your XML vocabulary, you will need a namespace-aware schema language such as XML Schema instead.

Because XHTML is a markup language that is used on a global scale, it was necessary to define the language and its various versions in DTDs. An XHTML document must include a DOCTYPE declaration to specify which DTD it follows in order to validate.