Now that we've discussed XML documents, we will move on to the document that describes their structure: the Document Type Definition (DTD). Let's review where we are in the XML process now (see Figure 1.6).

Figure 1.6. DTDs in the XML process.
I've heard DTDs called everything from a dead-end road to a waste of time to learn. This stems from the fact that the W3C is going to replace DTDs with the new method of describing XML document structures, XML schemas. (We'll talk about schemas later in this chapter.) The reasons for replacing DTDs are valid (see the topic "XML Schemas" in this chapter). What I think some people are overlooking is that DTDs have been around a long time, and there are a lot of them out there. These DTDs will, first, need to be understood and, second, need to be converted to schema documents. It is difficult to convert something you don't understand in the first place into another form.

When do you use a DTD? DTDs are essential for large document-management projects. A serious problem with managing large volumes of documents of similar type is enforcing adherence to a standard. Documents are typically written or generated by more than one author over a period of time. Unfortunately, it's a fact of life that when left on their own, authors tend to adopt different styles. The resulting documents might turn out to have little or no resemblance to each other even though they are supposed to be identical! The solution to this is to adopt a style guide of some kind. Take this one step further and voila! You have DTDs. They force authors to conform to a standard document layout.

When do you not use a DTD? A DTD is a waste of time for small documents. Writing a DTD is not an easy task, just as writing information systems procedures is not easy. Take this example: DTDs are overkill for letters, memos, faxes, and documents of this type. Usually, enforcing strict guidelines for these is not essential. You wouldn't want a DTD in which people are limited to the use of stationery.
In contrast to these examples, though, you have to realize that there are environments in which a DTD would make sense for letters (for example, when there is a legal obligation to record correspondence).

DTDs can be defined internally to an XML document, or they can be referenced from within the document and stored externally. Moreover, a DTD is not a required element of an XML document but an optional entity. A validating XML parser does need a DTD to ensure that a document conforms to a certain structure, but a nonvalidating parser can still interpret the document's data. From here, we'll start looking at the components of DTDs and how they define an XML document's structure.

The DOCTYPE Declaration

The DOCTYPE declaration is really not a DTD declaration. It is an XML instruction with its own syntax, and it names the root element of the XML document. Look at the following example:

    <!DOCTYPE RESUMES [ the rest of the declarations for this XML document ]>

Here, the exclamation point (!) precedes the keyword DOCTYPE. This is followed by the root element of the document and then all of the declarations for the XML document, enclosed in brackets ([ and ]).

Public Versus Private

The DOCTYPE declaration can carry one of two additional keywords that specify whether the DTD is private or public. A public DTD is declared with the keyword PUBLIC. It has been publicized for widespread use and assigned a one-of-a-kind name, ensuring that the same DTD is used in every instance. It also gives the location of the DTD. An example of this is the DTD that defines the structure of HTML. The naming convention is known as the formal public identifier, and coverage is beyond the scope of this book. I refer those interested to the XML specification.

A private DTD utilizes the keyword SYSTEM. It specifies a DTD that is applied to all elements declared inside of the root element. This type of DTD is generally created for limited usage, such as an XML document regularly used by your company but not by the general public or other companies. Here's an example that expands on the previous one. It shows both forms of the declaration.

    <!DOCTYPE RESUMES SYSTEM "location of DTD">
    <!DOCTYPE RESUMES PUBLIC "name" "location of DTD">

I want to emphasize one point here: PUBLIC and SYSTEM apply only to external DTDs. If these keywords are left out and a DTD is required, then the DTD must be defined internally to the document.
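To make the internal case concrete, here is a minimal sketch of a document that carries its own DTD. The single #PCDATA content model and the placeholder text are assumptions made for illustration only; they are not the RESUMES structure used elsewhere in this chapter.

```xml
<?xml version="1.0" standalone="yes"?>
<!-- Internal DTD: no SYSTEM or PUBLIC keyword; the declarations sit inside [ ]. -->
<!DOCTYPE RESUMES [
  <!ELEMENT RESUMES (#PCDATA)>
]>
<RESUMES>placeholder content</RESUMES>
```

The external form would drop the bracketed declarations and point at a separate file instead, for example with SYSTEM followed by a quoted location such as "resumes.dtd" (a hypothetical file name).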
Comments

DTD comments are identical in form and function to HTML and XML comments. They begin with <!-- and end with -->. The following example is a DTD comment:

    <!-- ******* Here are the element descriptions ******* -->

Declaring Elements

Each and every element of an XML document must be declared in the appropriate DTD. These declarations must be contained inside the brackets of the DOCTYPE declaration, as the following example shows:

    <!DOCTYPE RESUMES [ all element declarations here ]>

To declare an element, use the following syntax:

    <!ELEMENT elementname rule>

The elementname is the name between the tag delimiters. The rule states what the element may contain; the available rules follow.

ANY

This is the simplest of the element rules. It specifies that between the opening tag and the closing tag, other tags and character data (PCDATA) can appear. It is written as follows:

    <!ELEMENT RESUMES ANY>

When this rule is used, you will most commonly see it in the declaration of the root element. Think about that for a minute. With this rule, you're actually saying any other tags and data. You are telling the parser to accept anything between the document root element and its closing tag. In effect, you are turning off validation checking of your document. You'll think you're the best DTD writer in the world because the parser will never report errors. I thought I was. Hey, this DTD writing can't be that bad, you tell yourself, until your documents start to deviate from standards and you can't figure out why. Be careful where you use this.

#PCDATA

There are situations in which you might want only character data to appear. In this case, you would use the #PCDATA rule, which is stated like this:

    <!ELEMENT ADDRESS (#PCDATA)>

PCDATA stands for parsed character data. In the previous example, notice that the rule is enclosed in parentheses. Also, be aware that PCDATA means that other elements are not allowed between the opening and closing tags of the declared element, just character data. Table 1.3 lists PCDATA examples.
Table 1.3. PCDATA Examples
In contrast to the situations listed in Table 1.3, there will be situations in which you want to declare an element as mandatory within another element. Take a look at this example:

    <!ELEMENT ADDRESS (STREET)>
    <!ELEMENT STREET (#PCDATA)>

These declarations specify that the ADDRESS element must contain another element called STREET. This STREET element can contain only character data and can appear only once. Table 1.4 shows required element examples.

Table 1.4. Required Element Examples
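As a small sketch of the required-element rule in a complete document, the two declarations can be paired with an instance that satisfies them. The street value is borrowed from the sequence example later in this section.

```xml
<?xml version="1.0"?>
<!DOCTYPE ADDRESS [
  <!-- ADDRESS must contain exactly one STREET child -->
  <!ELEMENT ADDRESS (STREET)>
  <!-- STREET may hold character data only, no child elements -->
  <!ELEMENT STREET (#PCDATA)>
]>
<ADDRESS>
  <STREET>911 Intranet Ave.</STREET>
</ADDRESS>
```

Under these declarations a validating parser should reject an ADDRESS that is empty, one that holds two STREET elements, or a STREET that contains another element.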
Sequences

How can we specify more than one element in a required order (sequence) inside of another element? Declare the required elements within parentheses, in the order they are to appear, as a comma-separated list similar to this example:

    <!ELEMENT ADDRESS (STREET, CITY, COUNTRY)>
    <!ELEMENT STREET (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT COUNTRY (#PCDATA)>

The declaration specifies that the ADDRESS element must contain three other elements: STREET, CITY, and COUNTRY, in that order. The three other elements must consist of only PCDATA, such as the following:

    <ADDRESS>
      <STREET>911 Intranet Ave.</STREET>
      <CITY>Canberra</CITY>
      <COUNTRY>Australia</COUNTRY>
    </ADDRESS>

Choices

How would you declare that either a specific element appears within a given element or another different element appears (the either-or selection)? Easy: use the pipe (|) symbol between the elements from which you can select. Listing 1.3 shows an example of this.

Listing 1.3 Specifying a Choice Between Elements

    <!ELEMENT ADDRESS (STREET, CITY, COUNTRY, (ZIP | PC))>
    <!ELEMENT STREET (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT COUNTRY (#PCDATA)>
    <!ELEMENT ZIP (#PCDATA)>
    <!ELEMENT PC (#PCDATA)>

Here, in addition to what we've already talked about with the STREET, CITY, and COUNTRY elements, we declare a mandatory fourth child element, which can be either a zip code (ZIP) or a postal code (PC). If you look back at our sample XML document in the section "Components of an XML Document," you'll see that the first job candidate's address has a postal code whereas the second has a zip code.

Children

Generally speaking, children or child elements are elements that are contained within other elements. When we talk about XPath queries in Chapter 6, "Using XPath Queries," we'll give a more formal definition, but this definition is all that's necessary for a DTD discussion. Now for the three things that I get confused about every time I have to use one of them.
Those of you who are familiar with regular expressions will have no problems here. It's possible that elements, and groupings of elements in parentheses, might not appear in a specific instance of an element even though they are declared. Also, it's possible that an element might occur more than once within another element (multiple occurrences). There are other possibilities, and we're going to talk about these possibilities here. One thing, though: Remember that we have already discussed the single occurrence of an element in the PCDATA examples section. Here are the other possibilities:
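The possibilities in question are the three occurrence indicators that DTDs share with regular expressions: ? (zero or one), + (one or more), and * (zero or more). The following is a minimal sketch; the content model is made up for illustration and is not one of this chapter's listings.

```xml
<?xml version="1.0"?>
<!DOCTYPE ADDRESS [
  <!-- STREET+  : required and repeatable, one or more
       CITY     : no indicator, exactly one
       REGION?  : optional, zero or one
       PHONE*   : optional and repeatable, zero or more -->
  <!ELEMENT ADDRESS (STREET+, CITY, REGION?, PHONE*)>
  <!ELEMENT STREET (#PCDATA)>
  <!ELEMENT CITY (#PCDATA)>
  <!ELEMENT REGION (#PCDATA)>
  <!ELEMENT PHONE (#PCDATA)>
]>
<ADDRESS>
  <STREET>911 Intranet Ave.</STREET>
  <STREET>Suite 4</STREET>
  <CITY>Canberra</CITY>
</ADDRESS>
```

The instance is valid even though REGION and PHONE are missing, because both are allowed to occur zero times. Removing every STREET or the CITY would make it invalid.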
Empty Element

This is another simple declaration. Any element that is an empty element is declared with the keyword EMPTY in the declaration.

    <!ELEMENT TEL EMPTY>

A valid TEL element with this declaration would be either <TEL/> or <TEL></TEL>.

Entities

In a DTD, an entity declaration functions very similarly to the way a macro functions in an application. In essence, one group of characters is substituted for another group of characters. I know this definition sounds redundant, but we'll get to an example. An entity is declared as follows:

    <!ENTITY name "substitute characters">

The quotation marks are required. After this declaration is made, the substitute characters will replace each reference to the name, written as &name; with a leading ampersand and a trailing semicolon. Let's look at that example I promised you:

    <!ENTITY trademark "™">

After this declaration, you can make use of it with a statement such as the following:

    <drive> Our GadgetDrive&trademark; is revolutionary </drive>

The &trademark; reference will be replaced with the ™ symbol. It's important to note here that we have been talking about a general entity reference, which only works inside of an XML document. There are several other types, but they are outside this book's scope.

Declaring Attributes

Attributes consist of a name and a value pair. They appear inside a starting tag and provide additional data or information concerning that tag or the data contained within that tag pair. Here's an example in which HAIRCOLOR is an attribute of the tag <STEPCHILD>.

    <STEPCHILD HAIRCOLOR="red">John</STEPCHILD>

Now let's look at how various attribute situations are handled in DTDs.

Single Attribute

Attributes, which appear within XML elements, must be declared in the accompanying DTD. You specify these attributes by using the following declaration:

    <!ATTLIST element attribute type default>

The first word after ATTLIST is the name of the element containing the attribute. Then comes the name of the attribute followed by the attribute's data type and then the default value.
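Filling in that template for the STEPCHILD example might look like the sketch below. The default value "brown" and the #PCDATA content model are assumptions added so that the fragment forms a complete, valid document.

```xml
<?xml version="1.0"?>
<!DOCTYPE STEPCHILD [
  <!ELEMENT STEPCHILD (#PCDATA)>
  <!-- element: STEPCHILD, attribute: HAIRCOLOR, type: CDATA, default: "brown" -->
  <!ATTLIST STEPCHILD HAIRCOLOR CDATA "brown">
]>
<STEPCHILD HAIRCOLOR="red">John</STEPCHILD>
```

If the instance left HAIRCOLOR out, a parser that reads the declaration would report the attribute with the default value "brown".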
Multiple Attributes

Most HTML tags have more than one possible attribute. An example is the <table> tag. It can have attributes of WIDTH, CELLSPACING, BORDER, and more. XML elements are no different. They, too, can have multiple attributes. There are two ways to declare multiple attributes in a DTD. The first is to list each attribute in its own declaration, referencing the same element, like this:

    <!ATTLIST TRIANGLE BASE CDATA "1">
    <!ATTLIST TRIANGLE HEIGHT CDATA "1">

The second method puts all of the attributes in a single declaration and relies on proper whitespace layout to make the declaration clear to the reader. This is written as follows:

    <!ATTLIST TRIANGLE BASE   CDATA "1"
                       HEIGHT CDATA "1">

You can do it the way you prefer. The second method does rely on proper layout, and I'm sure you'll come across DTDs written this way, but you have to decide on the way you want to write it.

Default Values

It isn't always necessary to specify a default value. The DTD specification contains three keywords that allow some leeway with default values. These keywords are as follows:
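The three keywords are #REQUIRED, #IMPLIED, and #FIXED. A minimal sketch of each, reusing the TRIANGLE element, follows; the UNITS attribute and the EMPTY content model are hypothetical additions for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE TRIANGLE [
  <!ELEMENT TRIANGLE EMPTY>
  <!-- #REQUIRED : every TRIANGLE must supply BASE; there is no default
       #IMPLIED  : HEIGHT may be omitted, and no default is substituted
       #FIXED    : UNITS always has the value "cm"; it may be omitted,
                   but if present it must match exactly -->
  <!ATTLIST TRIANGLE
    BASE   CDATA #REQUIRED
    HEIGHT CDATA #IMPLIED
    UNITS  CDATA #FIXED "cm">
]>
<TRIANGLE BASE="3"/>
```

The instance is valid with only BASE present. Dropping BASE, or writing UNITS="in", would be reported as a validity error by a validating parser.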
Data Types

There are 10 different data types for attribute values. They are listed in Table 1.5.

Table 1.5. Attribute Data Types
Ninety percent of the attributes you will see declared in DTDs use either the CDATA or the enumerated data type. These are the two types we will talk about. Coverage of the other eight data types is outside the scope of this book.
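As a rough illustration of those two types, the sketch below declares one CDATA attribute and one enumerated attribute. The NICKNAME and SHIFT attributes and their values are invented for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE EMPLOYEE [
  <!ELEMENT EMPLOYEE (#PCDATA)>
  <!-- CDATA      : NICKNAME accepts any character string
       enumerated : SHIFT must be one of the listed values; "day" is the default -->
  <!ATTLIST EMPLOYEE
    NICKNAME CDATA #IMPLIED
    SHIFT (day | night) "day">
]>
<EMPLOYEE NICKNAME="Big Al" SHIFT="night">Alan</EMPLOYEE>
```

A value such as SHIFT="evening" would fail validation because it is not in the enumerated list, whereas any text at all is acceptable for NICKNAME.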
Namespaces and DTDs

There is no direct support for namespaces in DTDs. This is not surprising when you remember that namespaces came along well after SGML appeared on the scene. There is a way around this limitation, however. As far as a DTD is concerned, a namespace declaration is nothing more than an attribute of an element. Knowing this, then, it is possible to define a namespace declaration in a DTD. Here's a namespace declaration:

    <VENDORS xmlns='http://www.myorg.com/tags'>

In the DTD, you would declare the following:

    <!ATTLIST VENDORS xmlns CDATA #FIXED 'http://www.myorg.com/tags'>

Here we declare the xmlns declaration as an attribute of the VENDORS tag and use CDATA to prevent any interpretation of the attribute's value. We also protect the value by making it read-only (#FIXED).

Here are some examples for you. See if you can figure out the attribute declarations for each line. I'll put the answers at the end of the chapter.

    <!ATTLIST PERSON PERSONID CDATA>
    <!ATTLIST EMPLOYEE MALE (true|false) "true">
    <!ATTLIST EMPLOYEE STATUS (single|double|divorced|widow) #REQUIRED>

Valid Versus Well-Formed

Now that we've covered quite a few of the properties of DTDs, I'd like to have a short discussion on what is a confusing topic to some. Earlier we talked about a well-formed XML document and what that meant. You can refer to the section "Well-Formed Documents" earlier in this chapter to refresh yourself on the requirements of a well-formed XML document. Now we have another term to become familiar with: valid. In essence, an XML document is valid when it has an associated DTD and it conforms to it.

Example of Writing a DTD for a Given XML Document

Now, let's do something different. Let's take an XML document and some verbiage that describes some things about the document that are not readily apparent and then generate a DTD for that document. Sound like fun? We'll use a portion of the SUPPLIERS table from the Northwind database of SQL Server 2000 (see Figure 1.7).

Figure 1.7.
Part of the SUPPLIERS table in SQL Server 2000.
Listing 1.4 shows the base document.

Listing 1.4 Sample XML Document

    <SUPPLIERS>
      <SUPPLIER>
        <SUPPLIERID>1</SUPPLIERID>
        <COMPANYNAME>Exotic Liquids</COMPANYNAME>
        <CONTACTNAME>Charlotte Cooper</CONTACTNAME>
        <ADDRESS>49 Gilbert St.</ADDRESS>
        <CITY>London</CITY>
        <PC>EC1 4SD</PC>
        <COUNTRY>UK</COUNTRY>
        <PHONE>(171) 555-2222</PHONE>
      </SUPPLIER>
      <SUPPLIER>
        <SUPPLIERID>2</SUPPLIERID>
        <COMPANYNAME>New Orleans Cajun Delights</COMPANYNAME>
        <CONTACTNAME>Shelley Burke</CONTACTNAME>
        <ADDRESS>P.O. Box 78934</ADDRESS>
        <CITY>New Orleans</CITY>
        <REGION>LA</REGION>
        <PC>70117</PC>
        <COUNTRY>USA</COUNTRY>
        <PHONE>(100) 555-4822</PHONE>
      </SUPPLIER>
    </SUPPLIERS>

Here are a few points concerning this document that describe information you can't deduce by just examining it.
All right, let's start writing this DTD by identifying the elements.

Table 1.6. Sample Element Breakdown
First we must identify the root element. This is the SUPPLIERS element. We use the DOCTYPE declaration for this, as shown in the following example:

    <!DOCTYPE SUPPLIERS[]>

Remember that this declaration specifies only the root element; it does nothing for structure specification. So next we declare SUPPLIERS in an element declaration.

    <!ELEMENT SUPPLIERS (SUPPLIER)+>

This states that SUPPLIERS must have at least one element contained within it named SUPPLIER and that there can be more than one of them. Here's what we have so far:

    <!DOCTYPE SUPPLIERS[
    <!ELEMENT SUPPLIERS (SUPPLIER)+>
    ]>

Within SUPPLIERS, there must be the element SUPPLIER specified like this:

    <!DOCTYPE SUPPLIERS[
    <!ELEMENT SUPPLIERS (SUPPLIER)+>
    <!ELEMENT SUPPLIER...>
    ]>

As enumerated in Table 1.6, several elements are required within the element SUPPLIER: SUPPLIERID, COMPANYNAME, CONTACTNAME, ADDRESS, CITY, PC, and COUNTRY. Here are those additions:

    <!DOCTYPE SUPPLIERS[
    <!ELEMENT SUPPLIERS (SUPPLIER)+>
    <!ELEMENT SUPPLIER (SUPPLIERID, COMPANYNAME, CONTACTNAME, ADDRESS, CITY, PC, COUNTRY)>
    ]>

As my daughters, Stephanie and Katie, used to say, "ARE WE THERE YET?" (Don't you just hate that?) No, not quite, but we're close. We still have to consider REGION, PHONE, and FAX. REGION is optional, so use the ? like this:

    <!DOCTYPE SUPPLIERS[
    <!ELEMENT SUPPLIERS (SUPPLIER)+>
    <!ELEMENT SUPPLIER (SUPPLIERID, COMPANYNAME, CONTACTNAME, ADDRESS, CITY, (REGION)?, PC, COUNTRY)>
    ]>

The question mark after REGION says that it may appear 0 or 1 time. For PHONE and FAX, at least one is required, but more than one can appear in any combination. So use the | symbol in combination with + like this:

    <!DOCTYPE SUPPLIERS[
    <!ELEMENT SUPPLIERS (SUPPLIER)+>
    <!ELEMENT SUPPLIER (SUPPLIERID, COMPANYNAME, CONTACTNAME, ADDRESS, CITY, (REGION)?, PC, COUNTRY, (PHONE | FAX)+)>
    ]>

Of course, it is mandatory to include the ELEMENT definitions of each of the declared elements with PCDATA, so here it is in Listing 1.5.
Listing 1.5 Completed DTD

    <!DOCTYPE SUPPLIERS[
    <!ELEMENT SUPPLIERS (SUPPLIER)+>
    <!ELEMENT SUPPLIER (SUPPLIERID, COMPANYNAME, CONTACTNAME, ADDRESS, CITY, (REGION)?, PC, COUNTRY, (PHONE | FAX)+)>
    <!ELEMENT SUPPLIERID (#PCDATA)>
    <!ELEMENT COMPANYNAME (#PCDATA)>
    <!ELEMENT CONTACTNAME (#PCDATA)>
    <!ELEMENT ADDRESS (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT REGION (#PCDATA)>
    <!ELEMENT PC (#PCDATA)>
    <!ELEMENT COUNTRY (#PCDATA)>
    <!ELEMENT PHONE (#PCDATA)>
    <!ELEMENT FAX (#PCDATA)>
    ]>

This is the DTD in all its glory. A simplistic one, yes, but think about what you've accomplished up to this point.

Example of an Invalid XML Document

Let's take a step back now and look at the entire picture of what we've talked about in this chapter, an XML document from start to finish. Listing 1.6 shows a sample XML document that would never make it through any XML parser. See if you can find the problems. Again, I'll put the answers at the end of the chapter.

Listing 1.6 A DTD That Needs Corrections

    <?xml version="1.0 standalone="no"? xmlns:vend='http://www.myorg.com/companytags' >
    <!DOCTYPE VENDORS [
    <!ELEMENT VENDORS (VENDOR)?>
    <!ELEMENT (NAME, LOCATION, BUSINESS, DIVISION+)>
    <!ATTLIST DIVISION NAME CDATA #REQUIRED BUDGET CDATA #IMPLIED >
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT LOCATION (#PCDATA)>
    <!ELEMENT BUSINESS (#PCDATA)>
    <!ELEMENT DIVISION (#PCDATA)>
    <!ELEMENT LOCATION (STREET, CITY, STATE, ZIP)>
    <!ATTLIST ZIP CDATA #REQUIRED>
    <!ELEMENT STREET (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT STATE (#PCDATA)>
    <!ELEMENT ZIP (#PCDATA)>
    ]
    <VENDORS>
    <VENDOR>
    <NAME>Iomega</NAME>
    <LOCATION>
    <STREET>1821 W.Iomega</STREET>
    <comment -- this is a test file-->
    <CITY>Roy</CITY>
    <STATE>UT</STATE>
    <ZIP SUB="8441">84067
    </LOCATION>
    <BUSINESS>Manufacturing</BUSINESS>
    <DIVISION NAME="Sales" BUDGET="350000"> </DIVISION>
    <DIVISION NAME="IT" BUDGET="650000"> </DIVISION>
    <DIVISION NAME="HR" BUDGET="650000"> </DIVISION>
    </VENDOR>
    <VENDOR>
    <NAME>Dell</NAME>
    <LOCATION>
    <STREET>1000 W. Addison</STREET>
    <CITY>Dallas</CITY>
    <STATE>TX</STATE>
    <ZIP SUB="3456">40078
    </LOCATION>
    <BUSINESS>Computer Manufacturing</BUSINESS>
    <DIVISION NAME="Sales" BUDGET="650000"> </DIVISION>
    <DIVISION NAME="IT" BUDGET="750000"> </DIVISION>
    <DIVISION NAME="HR" BUDGET="1650000"> </DIVISION>
    </VENDOR>
    </VENDORS>