Building Your Own DTD | Professional XML (Programmer to Programmer)

Now that you know how to reference your DTD definitions with your XML documents, the next step is to look at how you can build your own DTDs. A number of tools enable you to build DTDs easily, but the first step is to learn how to build them from scratch in Notepad. This can help you understand all the steps that go into making DTDs.

First locate the XML file you want to work with. For this example, you can use the Shakespeare play Hamlet as it is represented in XML.

Note

You can find the play Hamlet as XML online at andrew.cmu.edu/user/akj/shakespeare/. On this page, you will find all of Shakespeare's plays, including Hamlet. To get the XML file, simply right-click the file and select Save Target As from the provided menu (in Microsoft's Internet Explorer). If this page is not present at the time of this reading, then simply do a search in an Internet search engine for "XML Shakespeare" to find a large number of results.

The Hamlet.xml file is a large file that includes all of the parts of the play itself. It is presented partially in Listing 5-9.

Listing 5-9: Part of the Hamlet.xml file

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/css" href="shakes.css"?>
      <PLAY>
         <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
         <PERSONAE>
            <TITLE>Dramatis Personae</TITLE>
            <PERSONA>CLAUDIUS, king of Denmark. </PERSONA>
            <PERSONA>HAMLET, son to the late, and nephew to the present king.</PERSONA>
            <PERSONA>POLONIUS, lord chamberlain. </PERSONA>
            <PERSONA>HORATIO, friend to Hamlet.</PERSONA>
            <PERSONA>LAERTES, son to Polonius.</PERSONA>
            <PERSONA>LUCIANUS, nephew to the king.</PERSONA>
            <PGROUP>
               <PERSONA>VOLTIMAND</PERSONA>
               <PERSONA>CORNELIUS</PERSONA>
               <PERSONA>ROSENCRANTZ</PERSONA>
               <PERSONA>GUILDENSTERN</PERSONA>
               <PERSONA>OSRIC</PERSONA>
               <GRPDESCR>courtiers.</GRPDESCR>
            </PGROUP>
            <PERSONA>A Gentleman</PERSONA>
            <PERSONA>A Priest. </PERSONA>
            <PGROUP>
               <PERSONA>MARCELLUS</PERSONA>
               <PERSONA>BERNARDO</PERSONA>
               <GRPDESCR>officers.</GRPDESCR>
            </PGROUP>
            <PERSONA>FRANCISCO, a soldier.</PERSONA>
            <PERSONA>REYNALDO, servant to Polonius.</PERSONA>
            <PERSONA>Players.</PERSONA>
            <PERSONA>Two Clowns, grave-diggers.</PERSONA>
            <PERSONA>FORTINBRAS, prince of Norway. </PERSONA>
            <PERSONA>A Captain.</PERSONA>
            <PERSONA>English Ambassadors. </PERSONA>
            <PERSONA>GERTRUDE, queen of Denmark, and mother to Hamlet. </PERSONA>
            <PERSONA>OPHELIA, daughter to Polonius.</PERSONA>
            <PERSONA>Lords, Ladies, Officers, Soldiers, Sailors, Messengers,
            and other Attendants.</PERSONA>
            <PERSONA>Ghost of Hamlet's Father. </PERSONA>
         </PERSONAE>
         <SCNDESCR>SCENE  Denmark.</SCNDESCR>
         <PLAYSUBT>HAMLET</PLAYSUBT>
         <ACT>
            <TITLE>ACT I</TITLE>
            <SCENE>
               <TITLE>SCENE I.  Elsinore. A platform before the castle.</TITLE>
               <STAGEDIR>FRANCISCO at his post. Enter to him BERNARDO</STAGEDIR>
               <SPEECH>
                  <SPEAKER>BERNARDO</SPEAKER>
                  <LINE>Who's there?</LINE>
               </SPEECH>
               <SPEECH>
                  <SPEAKER>FRANCISCO</SPEAKER>
                  <LINE>Nay, answer me: stand, and unfold yourself.</LINE>
               </SPEECH>

Although this is just a partial view of the XML file, you can see that it is a large file. Even though it is large, very few elements are involved. This means it isn't going to take much effort to create the DTD that you can use to validate the Hamlet.xml file.

After you have the Hamlet.xml file on your computer, the next step is to start building the DTD for this file. The DTD is a representation of the structure allowed for this large XML document. The first step is to incorporate the document type declaration within your Hamlet.xml file.

Document Type Declaration

The DTD acronym discussed so far in this chapter refers to Document Type Definition, a file that defines the XML structure of particular XML files. Don't confuse the DTD file with the other term that shares its initials and that we are talking about now: the document type declaration.

The document type declaration is the declaration that you place within an XML file, before the root element, to state which DTD (Document Type Definition) to use to validate the XML contained within the document. An example document type declaration is presented here:

      <!DOCTYPE PLAY SYSTEM "http://www.wrox.com/files/dtd/Hamlet.dtd">

A document type declaration starts with <!DOCTYPE and ends with a closing >. The different parts of this particular declaration are presented in Figure 5-2.

Figure 5-2

The generic construction of the DOCTYPE declaration is presented here:

      <!DOCTYPE [root element name] SYSTEM [URI]>

Other possible constructions include:

      <!DOCTYPE [root element name] [inline DTD]>
      <!DOCTYPE [root element name] SYSTEM [URI] [inline DTD]>
      <!DOCTYPE [root element name] PUBLIC [identifier] [URI]>
      <!DOCTYPE [root element name] PUBLIC [identifier] [URI] [inline DTD]>

After the opening <!DOCTYPE keyword, the first item provided is the name of the root element of the XML being defined. In the example from the XML file shown in Listing 5-9, the root element is <PLAY>. Therefore, PLAY is the value that must be used in the <!DOCTYPE> declaration, with exactly the same capitalization.
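
As a minimal sketch of the first alternative construction, the following self-contained document carries its entire DTD inline. The Note element and its text are hypothetical and appear only for illustration; they are not part of Hamlet.xml.

```xml
<?xml version="1.0"?>
<!-- The name after DOCTYPE (Note) must match the document's root element -->
<!DOCTYPE Note [
   <!-- Hypothetical element that may contain character data only -->
   <!ELEMENT Note (#PCDATA)>
]>
<Note>Rehearsal starts with Act I.</Note>
```

Because the declarations sit between the square brackets, no SYSTEM or PUBLIC keyword is needed in this form.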

The SYSTEM and PUBLIC Keywords

You can declare the DTD within the XML document as shown earlier in Listing 5-6. If you are not taking that approach, you use either the SYSTEM or PUBLIC keyword to specify whether your DTD is a private or public DTD.

By far, the most common method is to use the SYSTEM keyword, thereby making all your DTDs private. This doesn't inhibit you from sharing your DTDs with other groups, entities, or organizations. When using the SYSTEM keyword, you must specify the URI (Uniform Resource Identifier) of the DTD. In the previous examples, you saw that the URI can be a direct physical path to the DTD file:

      <!DOCTYPE PLAY SYSTEM "C:\Wrox\Files\DTD\Hamlet.dtd">

It can also be an HTTP accessible hyperlink to the DTD file:

      <!DOCTYPE PLAY SYSTEM "http://www.wrox.com/files/dtd/Hamlet.dtd">

Generally, you should stick to the SYSTEM keyword and avoid the PUBLIC keyword. Using the PUBLIC keyword in the <!DOCTYPE> declaration means that a standards body (either an official or non-official standards body) has defined a standard that is available to the public. You might not realize it, but you have already seen this used once in this chapter. In Listing 5-3, a <!DOCTYPE> declaration defines the vocabulary of an XHTML document. This <!DOCTYPE> declaration is presented here:

      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

In this case, the <!DOCTYPE> declaration contains the root element of the XML document that it defines (html), the PUBLIC keyword, the identifier "-//W3C//DTD XHTML 1.0 Transitional//EN", and finally the URI "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd".

The first character of the identifier (a dash) means that a non-official standards body developed the DTD. A plus sign instead of a dash (or minus sign) means that an official standards body developed the DTD. The second field within the identifier (the fields are separated by //) specifies the organization that defined the DTD. In this case, the World Wide Web Consortium, also known as the W3C, developed this DTD. The third field specifies the type of document being identified (here, a DTD) followed by its name and version. Finally, the fourth field defines the language used in the definition, because this DTD might be available in multiple languages.

Following the identifier is the URI defining the location of the DTD. If you are developing your own public DTD, you follow the same rules as shown here. Remember that you really could achieve the same thing if you just declared your XML vocabulary using the SYSTEM keyword and its related structure.
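
As a sketch of what a public declaration for your own DTD might look like, the following follows the same four-field pattern. The owner name, DTD name, version, and URI are all hypothetical and are shown only to illustrate the layout.

```xml
<!-- Hypothetical identifier and URI, for illustration only -->
<!-- "-" : the owner is not an official standards body      -->
<!-- "Example Theatre Group" : the owner of the DTD          -->
<!-- "DTD Play 1.0" : document type, name, and version       -->
<!-- "EN" : the language of the definition                   -->
<!DOCTYPE PLAY PUBLIC "-//Example Theatre Group//DTD Play 1.0//EN"
   "http://www.example.com/dtd/play.dtd">
```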

Using the URI and Inline DTD Together

As you examine the possible structures of the <!DOCTYPE> element, note that it is possible to combine both the external and internal DTDs.

      <!DOCTYPE [root element name] SYSTEM [URI] [inline DTD]>
      <!DOCTYPE [root element name] PUBLIC [identifier] [URI] [inline DTD]>

This means that in addition to invoking a DTD by making an external reference (as shown in Listing 5-10), you can also extend the DTD by using it in combination with some inline DTD markup (as shown in Listing 5-11).

Listing 5-10: Using an external DTD

      <?xml version="1.0" encoding="UTF-8" ?>
      <!DOCTYPE PLAY SYSTEM "http://www.wrox.com/files/dtd/Hamlet.dtd">
      <PLAY>
         <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
         <PERSONAE>
            <TITLE>Dramatis Personae</TITLE>
            <PERSONA>CLAUDIUS, king of Denmark. </PERSONA>
            <PERSONA>HAMLET, son to the late, and nephew to the present king.</PERSONA>
            <PERSONA>POLONIUS, lord chamberlain. </PERSONA>
            <PERSONA>HORATIO, friend to Hamlet.</PERSONA>
            <PERSONA>LAERTES, son to Polonius.</PERSONA>
            <PERSONA>LUCIANUS, nephew to the king.</PERSONA>
            <PGROUP>
               <PERSONA>VOLTIMAND</PERSONA>
               <PERSONA>CORNELIUS</PERSONA>
               <PERSONA>ROSENCRANTZ</PERSONA>
               <PERSONA>GUILDENSTERN</PERSONA>
               <PERSONA>OSRIC</PERSONA>
               <GRPDESCR>courtiers.</GRPDESCR>
            </PGROUP>
      <!-- XML cut short for space reasons -->

Listing 5-11: Using an external DTD with some inline DTD markup

      <?xml version="1.0" encoding="UTF-8" ?>
      <!DOCTYPE PLAY SYSTEM "http://www.wrox.com/files/dtd/Hamlet.dtd" [
         <!ELEMENT TITLE (#PCDATA)>
      ]>
      <PLAY>
         <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
         <PERSONAE>
            <TITLE>Dramatis Personae</TITLE>
            <PERSONA>CLAUDIUS, king of Denmark. </PERSONA>
            <PERSONA>HAMLET, son to the late, and nephew to the present king.</PERSONA>
            <PERSONA>POLONIUS, lord chamberlain. </PERSONA>
            <PERSONA>HORATIO, friend to Hamlet.</PERSONA>
            <PERSONA>LAERTES, son to Polonius.</PERSONA>
            <PERSONA>LUCIANUS, nephew to the king.</PERSONA>
            <PGROUP>
               <PERSONA>VOLTIMAND</PERSONA>
               <PERSONA>CORNELIUS</PERSONA>
               <PERSONA>ROSENCRANTZ</PERSONA>
               <PERSONA>GUILDENSTERN</PERSONA>
               <PERSONA>OSRIC</PERSONA>
               <GRPDESCR>courtiers.</GRPDESCR>
            </PGROUP>
      <!-- XML cut short for space reasons -->

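In this case, not only is the Hamlet.dtd utilized, but an inline DTD also supplies a declaration for the <TITLE> element. The declarations in the inline DTD are read before those in the external file. This works only when Hamlet.dtd does not declare <TITLE> itself: XML does not allow the same element to be declared more than once, so if both the inline DTD and the external DTD declare <TITLE>, a validating parser reports an error. This is why you often get validation errors with this type of structure.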
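
One use of the inline DTD that does not collide with the external element declarations is to add a declaration of a different kind, such as a general entity. The following is a sketch under the assumption that Hamlet.dtd does not already declare an entity named author; the entity name and the wording of the title are illustrative and are not taken from the original file.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE PLAY SYSTEM "http://www.wrox.com/files/dtd/Hamlet.dtd" [
   <!-- Hypothetical entity added by the inline DTD -->
   <!ENTITY author "William Shakespeare">
]>
<PLAY>
   <!-- &author; is replaced with the entity's text when the document is parsed -->
   <TITLE>The Tragedy of Hamlet, Prince of Denmark, by &author;</TITLE>
   <!-- Remaining elements omitted for space reasons -->
</PLAY>
```

As written, the fragment is shortened, so a full document would still need the remaining child elements that the external DTD requires.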

Element Declarations

When building your own DTD, whether it is in a separate file or inline within the XML, you are really defining elements, entities, attributes, and notations. You now look at defining elements. When defining a DTD, you must define every XML element using a DTD element declaration. The generic usage of the element declaration is as follows:

      <!ELEMENT [element name] [content specification]>

In this case, the element name is the name used for the element being defined, whereas the content specification determines what is allowed as a value within the element. This content definition section can get rather complex because it can contain a number of subelements of different types that are part of a specific sequence.

Therefore, to create a DTD for the Hamlet.xml document, you can start by creating an element declaration for the XML document's root element, <PLAY>. This DTD document is presented in Listing 5-12.

Listing 5-12: Hamlet.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT PLAY (#PCDATA)>

You can see that the element declaration is similar in form to the <!DOCTYPE> declaration used earlier. To declare an element, you use <!ELEMENT>. Just like <!DOCTYPE>, <!ELEMENT> is case-sensitive. Therefore, it is illegal to write this as <!Element> (just as you can't use <!Doctype>).

Listing 5-12 shows a single XML element, PLAY, being defined. The content specification allowed for the <PLAY> element is defined as #PCDATA. This essentially means anything is allowed as long as it is parsed character data.

With this definition in Hamlet.dtd in place, you can use the following XML structure in an XML document that makes use of this DTD:

      <PLAY>Here is some sample text</PLAY>

This also means that you can have the following:

      <PLAY></PLAY>

But you are not allowed to place items (such as additional nested XML elements) within the <PLAY> element like this:

      <PLAY>
         <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
         <PERSONAE></PERSONAE>
         <SCNDESCR>SCENE Denmark.</SCNDESCR>
         <PLAYSUBT>HAMLET</PLAYSUBT>
      </PLAY>

Using the DTD from Listing 5-12, the previous code would be illegal. Next, this chapter reviews how to further define the XML document so that constructions such as the preceding one can be built.

Content Specification with ANY

One method to provide a content specification for an element is to use the ANY value. This is illustrated in Listing 5-13.

Listing 5-13: Hamlet.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT ACT ANY>
      <!ELEMENT GRPDESCR ANY>
      <!ELEMENT LINE ANY>
      <!ELEMENT PERSONA ANY>
      <!ELEMENT PERSONAE ANY>
      <!ELEMENT PGROUP ANY>
      <!ELEMENT PLAY ANY>
      <!ELEMENT PLAYSUBT ANY>
      <!ELEMENT SCENE ANY>
      <!ELEMENT SCNDESCR ANY>
      <!ELEMENT SPEAKER ANY>
      <!ELEMENT SPEECH ANY>
      <!ELEMENT STAGEDIR ANY>
      <!ELEMENT TITLE ANY>

This DTD provides a DTD definition for all XML elements contained within the Hamlet.xml file. This means that you can use the following syntax and still have a valid XML document:

      <PLAY>This is my play!</PLAY>

But it also means that you can use any child elements that you want, such as the following:

      <PLAY>
         <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
         <PERSONAE></PERSONAE>
         <SCNDESCR>SCENE Denmark.</SCNDESCR>
         <PLAYSUBT>HAMLET</PLAYSUBT>
      </PLAY>

The ANY keyword means that you can place any character data or any combination of declared elements within the defined element, and the element is still considered valid. Although this is an easy way to create a DTD, it usually isn't the best approach because it does little more than list the XML elements that may appear in a valid XML document. It provides a minimal set of rules. This means someone using a DTD such as the one defined in Listing 5-13 could build an XML document such as the one illustrated in Listing 5-14.

Listing 5-14: Hamlet.xml

      <PLAY>
         <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
         <PERSONAE></PERSONAE>
         <SCNDESCR>SCENE Denmark.</SCNDESCR>
         <PLAYSUBT>HAMLET</PLAYSUBT>
         <PLAY>Another Play</PLAY>
         <PLAY>
            <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
            <PERSONAE></PERSONAE>
            <SCNDESCR>SCENE Denmark.</SCNDESCR>
            <PLAYSUBT>HAMLET</PLAYSUBT>
         </PLAY>
      </PLAY>

From this, you can see that the <PLAY> element is used in a number of different ways. It is used as the root element with a series of child elements. Two of those child elements are themselves <PLAY> elements, each used in a completely different manner: one holds only text, and the other repeats the structure of the root.

In the end, certain situations may require use of the ANY value for the content specification of elements that you define in the DTD, but in many cases you may prefer to strictly define the child elements or even limit the element to character data only. This is where the value of #PCDATA comes in.

Placing Limits on Elements with #PCDATA

As stated, a #PCDATA value means that the XML element being defined is allowed to have only parsed character data and is not allowed anything else, including any child elements. Usage of #PCDATA is illustrated in the following example:

      <!ELEMENT SPEAKER (#PCDATA)>

Notice that #PCDATA is enclosed in parentheses when it is included in the element definition. If you go back to the Hamlet.dtd (presented in Listing 5-13), you can change the definitions for all the elements that are not allowed to have child elements. This change is presented in Listing 5-15.

Listing 5-15: Hamlet.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT ACT ANY>
      <!ELEMENT GRPDESCR (#PCDATA)>
      <!ELEMENT LINE ANY>
      <!ELEMENT PERSONA (#PCDATA)>
      <!ELEMENT PERSONAE ANY>
      <!ELEMENT PGROUP ANY>
      <!ELEMENT PLAY ANY>
      <!ELEMENT PLAYSUBT (#PCDATA)>
      <!ELEMENT SCENE ANY>
      <!ELEMENT SCNDESCR (#PCDATA)>
      <!ELEMENT SPEAKER (#PCDATA)>
      <!ELEMENT SPEECH ANY>
      <!ELEMENT STAGEDIR (#PCDATA)>
      <!ELEMENT TITLE (#PCDATA)>

Now all the elements that disallow child elements are defined using #PCDATA instead of ANY. When you validate the Hamlet.xml file against this DTD, validation succeeds. The additional rules give a more tightly defined structure to the XML files that use this DTD, which in turn makes these documents easier to process.

Note that one of the limitations of using DTDs (instead of something like XML Schemas) is that you can define the textual content contained within an element only as parsed character data and nothing more specific. As shown, you do this by using #PCDATA. Unlike XML Schemas, DTDs don't let you determine that an element can contain only an integer, double, or a string value.

Empty Values

Having an empty element in your XML document may be important as a signal of a Boolean value and nothing more, or it might show a null value that should be stored in the database. DTDs allow for an empty element declaration.

      <!ELEMENT Member EMPTY>

In this case, to declare an empty element, simply use the EMPTY keyword in the <!ELEMENT> element declaration. Remember that it is case-sensitive.
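
To make the effect of EMPTY concrete, the following fragments sketch how a <Member> element declared this way may and may not be written in a document. They are separate illustrations, not one document.

```xml
<!-- Valid: the empty-element tag -->
<Member/>

<!-- Valid: a start tag followed immediately by an end tag -->
<Member></Member>

<!-- Invalid: an EMPTY element may not contain anything, not even white space -->
<Member> </Member>
```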

Child Elements

One of the first steps in building a DTD is to define your root element. Root elements within XML documents generally contain child elements (or nested elements). A DTD allows you to define these child elements through the content specification section of the <!ELEMENT> declaration. The root element of the Hamlet.xml file is <PLAY>. Listing 5-16 shows a revised version of the Hamlet.dtd document to further define the <PLAY> element and the other elements that allow for child elements.

Listing 5-16: Hamlet.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT ACT (TITLE, SCENE+)>
      <!ELEMENT GRPDESCR (#PCDATA)>
      <!ELEMENT LINE (#PCDATA | STAGEDIR)*>
      <!ELEMENT PERSONA (#PCDATA)>
      <!ELEMENT PERSONAE (TITLE | PERSONA | PGROUP)+>
      <!ELEMENT PGROUP (PERSONA+, GRPDESCR)>
      <!ELEMENT PLAY (TITLE, PERSONAE, SCNDESCR, PLAYSUBT, ACT+)>
      <!ELEMENT PLAYSUBT (#PCDATA)>
      <!ELEMENT SCENE (TITLE | STAGEDIR | SPEECH)+>
      <!ELEMENT SCNDESCR (#PCDATA)>
      <!ELEMENT SPEAKER (#PCDATA)>
      <!ELEMENT SPEECH (SPEAKER | LINE | STAGEDIR)+>
      <!ELEMENT STAGEDIR (#PCDATA)>
      <!ELEMENT TITLE (#PCDATA)>

When defining the required child elements, you list these elements within parentheses in the <!ELEMENT> declaration itself. Looking specifically at the <PLAY> element, you can see that it names five child elements:

      <!ELEMENT PLAY (TITLE, PERSONAE, SCNDESCR, PLAYSUBT, ACT+)>

This definition means that the <PLAY> element is required to contain <TITLE>, <PERSONAE>, <SCNDESCR>, <PLAYSUBT>, and <ACT> child elements. The defined elements are separated using commas. Each of the first four elements must appear exactly once, whereas <ACT> must appear at least once and may repeat because of the plus sign (this will be explained shortly). These elements are also required to appear in the <PLAY> element in this exact order because of their placement in this definition. This means that if <PERSONAE> comes before <TITLE>, the XML document won't validate.
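
The ordering rule can be illustrated with a short, deliberately invalid document. The sketch below is abbreviated for illustration and assumes the declarations from Listing 5-16; its only violation is that <PERSONAE> is placed ahead of <TITLE>.

```xml
<!-- Invalid against Listing 5-16: PERSONAE appears before TITLE -->
<PLAY>
   <PERSONAE>
      <TITLE>Dramatis Personae</TITLE>
   </PERSONAE>
   <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
   <SCNDESCR>SCENE  Denmark.</SCNDESCR>
   <PLAYSUBT>HAMLET</PLAYSUBT>
   <ACT>
      <TITLE>ACT I</TITLE>
      <SCENE>
         <TITLE>SCENE I.  Elsinore. A platform before the castle.</TITLE>
      </SCENE>
   </ACT>
</PLAY>
```

Swapping the first two children so that <TITLE> comes first would make the same content match the declared sequence.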

Because the PLAY definition in the DTD document includes a TITLE as a possible child element, you must define the TITLE child element in the DTD document.

      <!ELEMENT TITLE (#PCDATA)>

Looking through the Hamlet.dtd document shown in Listing 5-16, you can see that each of the five child elements is also defined in the document. Some nest further because their definitions include yet more child elements that must also be defined. The ACT definition shows even more child elements, thereby allowing further nesting in the XML document.

      <!ELEMENT ACT (TITLE, SCENE+)>
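
Listing 5-16 also contains a declaration that mixes character data with a child element: LINE is declared as (#PCDATA | STAGEDIR)*. As a brief sketch of what this permits, both of the following fragments would match that declaration. They are shown as separate illustrations rather than as a quotation from the Hamlet.xml file.

```xml
<!-- Text only -->
<LINE>Who's there?</LINE>

<!-- A stage direction mixed in with the spoken text -->
<LINE><STAGEDIR>Aside</STAGEDIR> A little more than kin, and less than kind.</LINE>
```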

Specifying a Number of Instances Required

Some XML documents require you to specify a set number of instances where the child element may occur in the XML document. For instance, suppose you have the XML document shown in Listing 5-17.

Listing 5-17: An XML document with two <Address> child elements

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

If you are building a DTD for this bit of XML, your DTD appears as presented in Listing 5-18.

Listing 5-18: Mail.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Mail (Name, Address, Address, ZipCode)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Address (#PCDATA)>
      <!ELEMENT ZipCode (#PCDATA)>

In this case, you can see that the <Name>, <Address>, and <ZipCode> elements are defined, and the <Mail> element specifies that it must include child elements for all of these. Note that the <Address> child element is mentioned twice, meaning that it has to appear two times in the document. If you include just a single <Address> element, the XML document is considered invalid.

Reusing XML Elements

It is also possible to reuse the elements that are defined within the DTD in any number of other elements. For instance, Listing 5-19 changes the XML document that is presented in Listing 5-17 so that it now includes two sets of addresses.

Listing 5-19: An XML document with two sets of addresses

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Home>
            <Name>Bill Evjen</Name>
            <Address>123 Main Street</Address>
            <Address>St. Charles, MO</Address>
            <ZipCode>63301</ZipCode>
         </Home>
         <Business>
            <Name>Lipper</Name>
            <Address>123 Main Street</Address>
            <Address>St. Louis, MO</Address>
            <ZipCode>63141</ZipCode>
         </Business>
      </Mail>

In this case, <Mail> includes two child elements, <Home> and <Business>, each of which makes use of the same <Name>, <Address>, and <ZipCode> elements. For this reason, you only define each of these elements a single time. This is illustrated in the DTD for this XML file in Listing 5-20.

Listing 5-20: Mail.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Mail (Home, Business)>
      <!ELEMENT Home (Name, Address, Address, ZipCode)>
      <!ELEMENT Business (Name, Address, Address, ZipCode)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Address (#PCDATA)>
      <!ELEMENT ZipCode (#PCDATA)>

From this you can see that the <Name>, <Address>, and <ZipCode> elements are defined only a single time, but they are used by both the <Home> and <Business> elements.

The + Quantifier

You saw earlier that it was possible to force a repeat of the <Address> element as a child element by repeating the number of times it was defined within the <!ELEMENT> element.

      <!ELEMENT Mail (Name, Address, Address, ZipCode)>

This is an easy way to get a specific number of child element instances in the document, but at the same time, it is very restrictive. If you use it, you are always required to have two instances of the <Address> element, no fewer and no more. Even if you need only a single address line, you must still include two instances. Also, if you have a foreign address, which in some cases might require three or four <Address> lines, you would still be unable to place more than two instances in the document.

Instead of placing the Address definition in the <!ELEMENT> element twice, another option is to use a quantifier. A quantifier is a symbol that you place after the defined item to specify more or fewer restrictions on the item. This was used in the Hamlet.dtd file.

      <!ELEMENT PLAY (TITLE, PERSONAE, SCNDESCR, PLAYSUBT, ACT+)>

Here, the + quantifier is used with the <ACT> element definition. The + quantifier signifies that the <ACT> element can appear one or more times within the <PLAY> element. You can also change the previous <Mail> element definition so that the <Address> element is allowed one or more times using the + quantifier.

      <!ELEMENT Mail (Name, Address+, ZipCode)>

The Address+ here signifies that the <Address> element can appear one or more times within the <Mail> element. This means that the following bit of XML in Listing 5-21 is valid:

Listing 5-21: An XML document using the + quantifier

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street; St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

Here the <Address> element is only used a single time. If the author of this XML document, however, wanted to use the <Address> element more often, it would be possible to do so. This is illustrated in Listing 5-22.

Listing 5-22: Another instance in using the + quantifier

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street</Address>
         <Address>Suite 520</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

In this case, the <Address> element is used three times, and this is considered a valid XML document. The + quantifier does signify, however, that the <Address> element must be included at least once. This means that the following XML (Listing 5-23) is considered invalid.

Listing 5-23: An invalid XML document

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Name>Bill Evjen</Name>
         <ZipCode>63301</ZipCode>
      </Mail>

As you saw earlier, the child elements are defined within a set of parentheses, and the Address element was followed by a + quantifier to signify that it can have one or more instances. If you want to apply this setting to all the children of the <Mail> element, one method would be to use the + quantifier on each of the elements:

      <!ELEMENT Mail (Name+, Address+, ZipCode+)>

Because a + quantifier follows the Name, Address, and ZipCode definitions, each of these elements can appear one or more times (in this sequence only). You can also place the + quantifier after the closing parenthesis, as shown here:

      <!ELEMENT Mail (Name, Address, ZipCode)+>

In this case, the + quantifier follows the parentheses, which means that it applies to the group as a whole rather than to each item individually. The entire sequence of <Name>, <Address>, and <ZipCode> can appear one or more times, but within each repetition every element appears exactly once. A quantifier applied to a group appeared earlier in the Hamlet.dtd in the <PERSONAE> element definition, where any one of the three listed elements can appear, one or more times.

      <!ELEMENT PERSONAE (TITLE | PERSONA | PGROUP)+>
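
The difference between quantifying each item and quantifying the whole group is easiest to see in a document. The following sketch assumes the declaration <!ELEMENT Mail (Name, Address, ZipCode)+> together with the #PCDATA declarations for the child elements used earlier; the second address block is illustrative.

```xml
<!-- Valid when Mail is declared as (Name, Address, ZipCode)+ -->
<!-- The complete Name, Address, ZipCode sequence appears twice -->
<Mail>
   <Name>Bill Evjen</Name>
   <Address>123 Main Street; St. Charles, MO</Address>
   <ZipCode>63301</ZipCode>
   <Name>Lipper</Name>
   <Address>123 Main Street; St. Louis, MO</Address>
   <ZipCode>63141</ZipCode>
</Mail>
```

The same document is invalid against (Name+, Address+, ZipCode+), because a <Name> element may not follow a <ZipCode> element there. Conversely, two <Address> elements in a row are valid against (Name+, Address+, ZipCode+) but invalid against the grouped form.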

The ? Quantifier

Another quantifier to work with in building your DTD documents is the ? quantifier. The ? quantifier allows you to specify that the child element can appear either zero times or exactly once within the element. Suppose you have an XML document like the one presented in Listing 5-24.

Listing 5-24: An XML document using the ? quantifier

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Salutation>Mr.</Salutation>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street; St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

In this case, a new XML element <Salutation> is contained as a child element within the <Mail> element. It makes sense to treat the <Salutation> element as optional, meaning that it can appear either zero times or once in the document. It doesn't make much sense for the <Salutation> element to appear more than once, which makes the ? quantifier an ideal choice in defining the child element.

In defining the child element using the ? quantifier, you take a similar approach to that used with the + quantifier. This approach is illustrated in Listing 5-25.

Listing 5-25: The Mail.dtd using the ? quantifier

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Mail (Salutation?, Name, Address+, ZipCode)>
      <!ELEMENT Salutation (#PCDATA)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Address (#PCDATA)>
      <!ELEMENT ZipCode (#PCDATA)>

In this case, the <Salutation> child element is defined with a ? quantifier specifying that the element can only appear zero or one time within the <Mail> element. This means that the XML document presented in Listing 5-26 is considered valid.

Listing 5-26: A valid XML document using the Mail.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Salutation>Mr.</Salutation>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

This example shows the <Salutation> element a single time (the maximum allowed). The XML document presented in Listing 5-27 is also valid.

Listing 5-27: Another valid XML document using the Mail.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

Because you use the ? quantifier, if you use the <Salutation> element more than once, you produce an invalid XML document (see Listing 5-28).

Listing 5-28: An invalid XML document

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Salutation>Mr.</Salutation>
         <Salutation>Mr.</Salutation>
         <Name>Bill Evjen</Name>
         <Address>123 Main Street</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

You can also apply the ? quantifier, like the + quantifier, to an entire set of child elements as presented here:

      <!ELEMENT Mail (Salutation, Name, Address+, ZipCode)?>

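Notice that here the ? quantifier applies to the entire sequence rather than to each element: the <Mail> element must contain either the complete sequence or nothing at all. The + quantifier still applies directly to the <Address> element, so when the sequence is present, <Address> can appear one or more times.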
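
A short sketch may help show what an optional group allows. The fragments below assume the declaration <!ELEMENT Mail (Salutation, Name, Address+, ZipCode)?> together with the #PCDATA declarations for the child elements from Listing 5-25. They are separate illustrations, not one document.

```xml
<!-- Valid: the optional group is omitted entirely -->
<Mail/>

<!-- Valid: the group appears once, in full -->
<Mail>
   <Salutation>Mr.</Salutation>
   <Name>Bill Evjen</Name>
   <Address>123 Main Street</Address>
   <ZipCode>63301</ZipCode>
</Mail>

<!-- Invalid: only part of the group is present -->
<Mail>
   <Name>Bill Evjen</Name>
</Mail>
```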

The * Quantifier

The final quantifier is the * quantifier. The use of this quantifier signifies that the child element can be contained within the designated element zero or more times. An example DTD using the * quantifier is presented in Listing 5-29.

Listing 5-29: Using the * quantifier in the Mail.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Mail (Salutation?, Name*, Address+, ZipCode)>
      <!ELEMENT Salutation (#PCDATA)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Address (#PCDATA)>
      <!ELEMENT ZipCode (#PCDATA)>

In this case, the <Name> element can appear zero or more times within the XML document that uses this DTD for validation. This means that the XML document presented in Listing 5-30 is considered valid XML.

Listing 5-30: A valid XML document

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Salutation>Mr.</Salutation>
         <Address>123 Main Street</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

This also means that Listing 5-31 is considered valid.

Listing 5-31: Another valid XML document

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Mail SYSTEM "Mail.dtd">
      <Mail>
         <Salutation>Mr.</Salutation>
         <Name>Bill Evjen</Name>
         <Name>or Resident</Name>
         <Address>123 Main Street</Address>
         <Address>St. Charles, MO</Address>
         <ZipCode>63301</ZipCode>
      </Mail>

Allowing a Choice

A choice option allows you to specify a selection of available child elements that can be used. For instance, suppose you wanted to allow a choice between <Item>, <Items>, or <Pallet> in your XML document. To accomplish this, you structure a DTD in the following fashion (Listing 5-32).

Listing 5-32: Providing a choice via your DTD

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Quantity (Item | Items | Pallet)>
      <!ELEMENT Item (#PCDATA)>
      <!ELEMENT Items (#PCDATA)>
      <!ELEMENT Pallet (#PCDATA)>

As you can see from the <Quantity> definition, you are providing a choice of three items to the consumer of this DTD: <Item>, <Items>, or <Pallet>. The options provided via the DTD are separated by a vertical bar (or pipe) instead of by the commas used for a sequence. Exactly one of the options must appear. This means that the following XML document is considered valid:

      <Quantity>
         <Item>3Q7854P</Item>
      </Quantity>

This is also considered valid XML:

      <Quantity>
         <Items>3Q7854P-6TY458P</Items>
      </Quantity>

Also valid is:

      <Quantity>
         <Pallet>5H3899K</Pallet>
      </Quantity>

Although only three items are provided as choices for the child element of the <Quantity> element, you can actually place as many options as you wish as long as they are all separated by a vertical bar.

Just like standard child elements, these choice child elements can take quantifiers.

      <!ELEMENT Quantity (Item | Items | Pallet)+>

The use of the + quantifier means that you can have any of the choices one or more times in your document. The following is, therefore, considered valid XML:

      <Quantity>
         <Item>3Q7854P</Item>
         <Item>6TY458P</Item>
         <Pallet>5H3899K</Pallet>
      </Quantity>
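Choices and sequences can also be nested inside one another. The following sketch illustrates the idea; the <Order> and <Customer> elements are hypothetical names introduced only for this example, and the declarations are placed in an internal DTD subset to keep it self-contained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Order [
  <!-- Order and Customer are hypothetical names used only in this sketch -->
  <!-- A sequence: one Customer, followed by one or more of the choices -->
  <!ELEMENT Order (Customer, (Item | Items | Pallet)+)>
  <!ELEMENT Customer (#PCDATA)>
  <!ELEMENT Item (#PCDATA)>
  <!ELEMENT Items (#PCDATA)>
  <!ELEMENT Pallet (#PCDATA)>
]>
<Order>
   <Customer>Bill Evjen</Customer>
   <Item>3Q7854P</Item>
   <Pallet>5H3899K</Pallet>
</Order>
```

Here the comma still enforces order (the <Customer> element must come first), while the vertical bars allow any mix of the three alternatives after it.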

Attribute Declarations

Not all XML documents contain only elements and their values and nothing more. Many XML documents use attributes to further define the XML document. Just as you can easily define your elements using a DTD, you can also incorporate the associated attributes into an element. The generic usage of the attribute declaration is shown here:

      <!ATTLIST [element name] [attribute name] [attribute type] [default value]                 [attribute name] [attribute type] [default value]>

In this case, the element name is the name of the element to which the attribute is added. The attribute name is the name of the attribute. The attribute type is a way to qualify the data type (a rather limited process). Finally, the default value is the value the attribute takes when the document does not supply one.

Before you begin to create a set of attributes using the <!ATTLIST> declaration, take a look at the following bit of XML:

      <Name first="Bill" middle="J." last="Evjen" />

Here you can see a single element, <Name>, which is really an empty element. Although empty, the <Name> element contains three attributes. Listing 5-33 shows how to declare attributes within this element.

Listing 5-33: Declaring attributes for the <Name> element

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Name EMPTY>
      <!ATTLIST Name first CDATA "">
      <!ATTLIST Name middle CDATA "">
      <!ATTLIST Name last CDATA "">

For this example, just a single element is defined: <Name>. From the DTD you can see that the <Name> element is declared as an empty element with three possible attributes. All the attributes are assigned to the <Name> element and given a data type of CDATA. This data type specification means that the attribute will contain character data. No meaningful default value is assigned; an empty string is used as the default instead.

Using this DTD, you can write the following bit of XML:

      <Name first="Bill" last="Evjen" />

In this case, the first and last attributes are used, but the middle attribute is not. This is fine because the DTD does not mark any of the attributes as required. This means that the following XML is also valid:

      <Name last="Evjen" />

The following would also be considered valid XML:

      <Name />

Note that the following bit of XML is also considered valid:

      <Name last="Evjen" first="Bill" />

Here you can see that the order of the attributes has been inverted. XML parsers ignore attribute ordering, so the attributes can be used in any order.
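The generic syntax shown earlier allows several attributes to be listed in one <!ATTLIST> declaration. As a sketch, the three declarations from Listing 5-33 can be combined as follows (an internal DTD subset is assumed here only to keep the example in one file).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Name [
  <!ELEMENT Name EMPTY>
  <!-- One ATTLIST declaration carrying all three attribute definitions -->
  <!ATTLIST Name first  CDATA ""
                 middle CDATA ""
                 last   CDATA "">
]>
<Name last="Evjen" first="Bill" />
```

Both styles are equivalent; a single declaration keeps related attributes together, whereas separate declarations can be easier to edit one at a time.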

Attribute Data Types

One of the requirements when declaring your attributes within a DTD is that the attribute be given a specific data type. In the previous example, you saw what is probably one of the more common data types used: CDATA. The list of available data types is presented in the following table.

Data Type	Description
`CDATA`	Any character data.
`ID`	Forces a unique ID to be provided for the attribute.
`IDREF`	Allows for a reference to an ID declared elsewhere in the document.
`IDREFS`	Allows for multiple ID references to be provided. The references must be separated by whitespace.
`ENTITY`	Allows for an entity to be provided. Entities are discussed shortly.
`ENTITIES`	Allows for multiple entities to be provided. Entities must be separated by whitespace. Entities are discussed shortly.
`NMTOKEN`	Allows for an XML name token to be provided.
`NMTOKENS`	Allows for multiple XML name tokens to be provided. Name tokens must be separated by whitespace.
`NOTATION`	Allows for one of a list of declared notation names to be provided.
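As an illustration of a few of these types working together, consider the sketch below. The <Team> and <Member> elements and their attribute names are hypothetical, and the #REQUIRED and #IMPLIED keywords it uses are explained in the sections that follow.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Team [
  <!-- Team, Member, and the attribute names are hypothetical -->
  <!ELEMENT Team (Member+)>
  <!ELEMENT Member (#PCDATA)>
  <!-- id must be unique in the document; manager must match some id -->
  <!ATTLIST Member id      ID      #REQUIRED
                   manager IDREF   #IMPLIED
                   role    NMTOKEN #IMPLIED>
]>
<Team>
   <Member id="m1" role="lead">Bill Evjen</Member>
   <Member id="m2" manager="m1" role="author">Jane Doe</Member>
</Team>
```

A validating parser would reject this document if two <Member> elements shared the same id value, or if a manager attribute referred to an id that does not exist.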

The #REQUIRED Keyword

If you have an attribute value that is required, you simply use the #REQUIRED keyword when declaring the attributes. Listing 5-34 shows how the three attributes for the <Name> element are turned into required attributes.

Listing 5-34: Declaring required attributes for the <Name> element

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Name EMPTY>
      <!ATTLIST Name first CDATA #REQUIRED>
      <!ATTLIST Name middle CDATA #REQUIRED>
      <!ATTLIST Name last CDATA #REQUIRED>

You make an attribute required by using the #REQUIRED keyword. Note that the keyword is case-sensitive. This forces the attribute to be present, even if its value is empty. Because the middle attribute is missing, the following XML is considered invalid:

      <Name first="Bill" last="Evjen" />

However, this bit of XML is considered valid:

      <Name first="Bill" middle="" last="Evjen" />

Even though no value is provided, the middle attribute is present and, therefore, the XML document is now considered valid.

Note that when using the #REQUIRED keyword, you do not provide a default value for the attribute, because the author of the XML document must always supply the value.

The #IMPLIED Keyword

Earlier I provided three attributes with a default value of "" instead of something actually meaningful. In the case of these three attributes, it doesn't make much sense to provide a default value because everyone has a different name. One way around this problem is to use the #REQUIRED keyword and force everyone to provide a value for all three attributes. This can work, but what if you don't want to require all these values? For instance, suppose you want to make the first and last attributes required, whereas the middle attribute can remain optional. In this kind of scenario, using the #IMPLIED keyword in your attribute declaration makes complete sense. Listing 5-35 shows its use in the DTD.

Listing 5-35: Declaring implied attributes for the <Name> element

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Name EMPTY>
      <!ATTLIST Name first CDATA #REQUIRED>
      <!ATTLIST Name middle CDATA #IMPLIED>
      <!ATTLIST Name last CDATA #REQUIRED>

In this case, the first and last attributes are required. The middle attribute, however, is not required and it doesn't include a default value if it isn't included. It is as if a null value is provided instead. With this DTD, the following bit of XML is considered valid:

      <Name first="Bill" middle="J." last="Evjen" />

Also, the following bit of XML is just as valid:

      <Name first="Bill" last="Evjen" />

The #FIXED Keyword

The last keyword to review is the #FIXED keyword. It enables you to assign an attribute with a default value that cannot be changed for any reason. Listing 5-36 shows an example of the #FIXED keyword.

Listing 5-36: Declaring fixed attributes for the <Name> element

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Name EMPTY>
      <!ATTLIST Name member CDATA #FIXED "true">
      <!ATTLIST Name first CDATA #REQUIRED>
      <!ATTLIST Name middle CDATA #IMPLIED>
      <!ATTLIST Name last CDATA #REQUIRED>

To declare an attribute that makes use of the #FIXED keyword, you follow the keyword with the default value in quotes. Listing 5-36 shows that the member attribute is set to be a fixed attribute with a default value of true. With this declaration in place, the following bit of XML is considered valid:

      <Name member="true" first="Bill" last="Evjen" />

Setting the member attribute to false causes the XML document to be invalid:

      <Name member="false" first="Bill" last="Evjen" />

One interesting point is that the attribute need not be included at all. If it is included, then the required value must be utilized. However, if the attribute is not included, then the XML parser makes use of the value that is provided via the DTD as if it were present. This means that the following bit of XML is also considered valid:

      <Name first="Bill" last="Evjen" />

Using Enumerations as Values

In some instances, you want an attribute to contain only one of a set of specific values. In these cases, you provide the user of the DTD with a list of enumerated values that can be used with the attribute. This is rather similar to the choices that were used when declaring an element.

Suppose you have a member attribute that you want to take a true or false value and nothing else. You accomplish this by providing the true and false values as enumerations. This syntax is illustrated in Listing 5-37.

Listing 5-37: Declaring enumerations to use with an attribute

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Name EMPTY>
      <!ATTLIST Name member (true | false) "true">
      <!ATTLIST Name first CDATA #REQUIRED>
      <!ATTLIST Name middle CDATA #IMPLIED>
      <!ATTLIST Name last CDATA #REQUIRED>

In this case, the member attribute allows for an enumeration of values: either true or false. These enumerations must be contained within parentheses and separated by vertical bars. Following the parentheses is the default value that the member attribute takes if it is not included by the user of the DTD.

With this DTD, the following bit of XML is considered valid:

      <Name member="true" first="Bill" last="Evjen" />

This means that the inverse value for the member attribute is also valid:

      <Name member="false" first="Bill" last="Evjen" />

If no member attribute is provided, a default value of true is assumed. Even if the member attribute is not included, the XML is still valid:

      <Name first="Bill" last="Evjen" />

When working with enumerations, you can also use the keywords discussed earlier. For instance, if you wish to make the member attribute required, you use the syntax in your DTD illustrated in Listing 5-38.

Listing 5-38: Declaring enumerations to use with a required attribute

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Name EMPTY>
      <!ATTLIST Name member (true | false) #REQUIRED>
      <!ATTLIST Name first CDATA #REQUIRED>
      <!ATTLIST Name middle CDATA #IMPLIED>
      <!ATTLIST Name last CDATA #REQUIRED>

From Listing 5-38, you can see that the default value was replaced with a #REQUIRED keyword to make the member attribute required. Now the user of this DTD is required to give a true or false value for the member attribute in order to have a valid XML document.
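The same pattern works with the #IMPLIED keyword. The following sketch, which assumes an internal DTD subset only to keep the example self-contained, makes the member attribute optional with no default while still restricting it to true or false whenever it does appear.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Name [
  <!ELEMENT Name EMPTY>
  <!-- member may be omitted; if present it must be true or false -->
  <!ATTLIST Name member (true | false) #IMPLIED>
  <!ATTLIST Name first CDATA #REQUIRED>
  <!ATTLIST Name middle CDATA #IMPLIED>
  <!ATTLIST Name last CDATA #REQUIRED>
]>
<Name first="Bill" last="Evjen" />
```

Unlike Listing 5-37, no value is assumed for member when it is left out.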

Entity Declarations

In the first chapter of this book, you were introduced to entities. An entity maps a character string to a specific symbol, character, or block of replacement text. XML already provides some entities out of the box, as presented in the following table.

Character	Entity
<	`&lt;`
>	`&gt;`
"	`&quot;`
'	`&apos;`
&	`&amp;`

In this table, you can see that the entity for the & symbol is &amp;. To use the & symbol in your document, you type &amp; in its place and then, when the XML is parsed, the &amp; string is converted to the appropriate referenced character.
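As a brief illustration of the built-in entities in use, consider the following sketch. The element names and the sample text are made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Element names and sample text are hypothetical -->
<Note>
   <!-- Parsed as: Smith & Sons -->
   <Company>Smith &amp; Sons</Company>
   <!-- Parsed as: a < b -->
   <Formula>a &lt; b</Formula>
   <!-- &quot; lets a double quote appear inside a double-quoted attribute -->
   <Quote text="She said &quot;hello&quot;" />
</Note>
```

No DTD is needed for these five entities; every XML parser recognizes them.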

Entities can be provided as internal or external entities. This section reviews internal entities.

Internal Entities

To declare an internal entity, you use the following syntax:

      <!ENTITY [entity key] [entity translated value]>

As you can see, it is rather simple to create an internal entity within your DTD. To create an entity you use the <!ENTITY> declaration within the DTD and simply provide it with a key and a translated value for the XML parser to use when it encounters the key in an XML document. Listing 5-39 shows a DTD making use of the <!ENTITY> declaration.

Listing 5-39: Fund.dtd using an entity

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Fund (Name, NumberShares, DataProvider)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT NumberShares (#PCDATA)>
      <!ELEMENT DataProvider (#PCDATA)>
      <!ENTITY LIP "Lipper Inc., A Reuters Company">

In this DTD, a single element is defined that includes three child elements. At the bottom of the DTD, an entity is declared using the <!ENTITY> declaration. A key of LIP is provided that should then be translated to Lipper Inc., A Reuters Company by an XML parser. Listing 5-40 shows this entity being used within an XML document.

Listing 5-40: Fund.xml using Fund.dtd

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Fund SYSTEM "Fund.dtd">
      <Fund>
         <Name>XYZ Fund Global Growth</Name>
         <NumberShares>22</NumberShares>
         <DataProvider>&LIP;</DataProvider>
      </Fund>

From this code, you can see that the entity key is provided as a value of the <DataProvider> element. Note that the key, although declared as LIP in the DTD, must be preceded with an ampersand and followed by a semi-colon (&LIP;). The Fund.xml file in Internet Explorer is shown in Figure 5-3.

Figure 5-3

From this figure, you can see that the &LIP; character sequence was converted by the XML parser to a larger content set because the <!ENTITY> declaration was utilized in the DTD.
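An entity reference can also be mixed with ordinary text. The sketch below assumes the same declarations as Fund.dtd but places them in an internal DTD subset so that the document stands alone; the words "Data supplied by" are added here only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Fund [
  <!ELEMENT Fund (Name, NumberShares, DataProvider)>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT NumberShares (#PCDATA)>
  <!ELEMENT DataProvider (#PCDATA)>
  <!ENTITY LIP "Lipper Inc., A Reuters Company">
]>
<Fund>
   <Name>XYZ Fund Global Growth</Name>
   <NumberShares>22</NumberShares>
   <!-- Parsed as: Data supplied by Lipper Inc., A Reuters Company -->
   <DataProvider>Data supplied by &LIP;</DataProvider>
</Fund>
```

Defining the company name once as an entity means a later name change needs to be made in only one place.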

External Entities

In addition to internal entities, you can also make reference to external entities. This allows you to pull XML fragments and other individual items into your XML documents. The general usage of an external entity is presented here:

      <!ENTITY [entity key] SYSTEM [URI]>

The difference between this declaration and the internal entity declaration is that this one includes the keyword SYSTEM that signifies that this is an external entity. Then, instead of providing the translated value of the entity key, you put a pointer in place to indicate its location.

An example usage is shown here:

      <!ENTITY LIP SYSTEM "http://www.lipperweb.com/entities/companyname.xml">
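Referencing an external entity in a document looks the same as referencing an internal one. The following sketch reuses the Fund example; it assumes that the file at the URI contains the company name as plain text, which the source does not confirm.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Fund [
  <!ELEMENT Fund (Name, NumberShares, DataProvider)>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT NumberShares (#PCDATA)>
  <!ELEMENT DataProvider (#PCDATA)>
  <!-- The replacement text is read from the URI instead of being given inline -->
  <!ENTITY LIP SYSTEM "http://www.lipperweb.com/entities/companyname.xml">
]>
<Fund>
   <Name>XYZ Fund Global Growth</Name>
   <NumberShares>22</NumberShares>
   <DataProvider>&LIP;</DataProvider>
</Fund>
```

Keep in mind that non-validating parsers are not required to retrieve external entities, so the result can differ from one parser to another.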

Notation Declarations

Notation declarations are a rudimentary way of providing some type casting capabilities to the values contained within your XML elements. It is more of a recommendation, and there is no actual enforcement by the XML parsers when you are using a notation declaration. One possible generic usage of the notation declaration is presented here:

      <!NOTATION [name] SYSTEM [URI or Description]>

To create a notation declaration, you use the <!NOTATION> declaration. An example of a notation that points to a naming standard is presented in Listing 5-41.

Listing 5-41: Fund.dtd using a notation

      <?xml version="1.0" encoding="UTF-8"?>
      <!ELEMENT Fund (Name, NumberShares, DataProvider, OrderDate)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT NumberShares (#PCDATA)>
      <!ELEMENT DataProvider (#PCDATA)>
      <!ELEMENT OrderDate (#PCDATA)>
      <!NOTATION Name SYSTEM "http://www.lipperweb.com/namingstandards.html">

Here a notation declaration is utilized to specify that the name needs to follow a specific naming standard and that the standard can be found at a specific URL on the Internet. Using this notation in no way forces XML parsers to make sure that the standard is followed; it is there purely as a reference. Note also that the notation merely shares its name with the <Name> element; the declaration itself does not bind the two together. You can put anything in place of the hyperlink as well. In fact, the value can also be a MIME type specifying the file type of the value contained within the XML element. For those who move images, documents, or other binary items around via an XML file, this might be a good method to specify to the user of the XML document the MIME type that can be contained within a specified XML element. An example MIME type is image/png.
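One way to tie a notation to a particular element is through an attribute of type NOTATION. The following is a minimal sketch of that approach; the <Picture> element, its format attribute, and the file name are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Picture [
  <!-- Each notation name is paired with a MIME type as its identifier -->
  <!NOTATION png SYSTEM "image/png">
  <!NOTATION gif SYSTEM "image/gif">
  <!-- Picture and format are hypothetical names for this sketch -->
  <!ELEMENT Picture (#PCDATA)>
  <!-- format must name one of the notations declared above -->
  <!ATTLIST Picture format NOTATION (png | gif) #REQUIRED>
]>
<Picture format="png">logo.png</Picture>
```

A validating parser checks only that the format value is one of the listed notation names. It is still up to the consuming application to interpret the MIME type and handle the content accordingly.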