In addition to the preceding syntax, which specifies the fine structure of the XML document, the DTD makes heavy use of the entities feature. We saw entities in our survey of XML proper, since most (though not all) entity declarations are valid within an XML document itself. However, it is in the DTD that entities are most commonly defined and encountered.
An entity is a richly overloaded term that identifies a coherent block of content. The entity has two formal parts: a name and the block of content that the name represents. This can be almost any self-contained block of content anywhere. It can be huge or tiny.
The XML document is an entity. This document entity is often associated with the idea of a file. While an XML file contains exactly one document, many XML documents never exist as files. In fact, such entities are most interesting to us in this book: only two of our projects involve an XML file. The rest deal with dynamic, ephemeral data that is passed back and forth between live processes. Each of these communications is an XML document entity.
At the other end of the spectrum, an entity can be as atomic as a single character. XML has five predefined entities (&lt;, &gt;, &amp;, &apos;, and &quot;), each of which represents a single character. These five characters are troublesome when they appear in character data (text), as they imply markup. The entity feature is used as an escape sequence to insert these characters where they would otherwise disturb the parser. For example, the entity reference &apos; represents the single-quote (apostrophe) character.
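As a rough illustration, the following Python sketch uses the standard library (xml.etree.ElementTree and xml.sax.saxutils) to show the five predefined entities being resolved by a parser, and then produced again by an escaping step. The sample strings are invented for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A made-up fragment that uses all five predefined entities.
fragment = "<note>Tom &amp; Jerry&apos;s &lt;b&gt; says &quot;hi&quot;</note>"

# The parser replaces each reference with the character it stands for.
resolved = ET.fromstring(fragment).text
print(resolved)

# Going the other way: escape() handles &, < and > by default;
# the two quote characters are supplied through the optional mapping.
raw = "5 < 6 & it's \"true\""
escaped = escape(raw, {"'": "&apos;", '"': "&quot;"})
print(escaped)
```

Escaping on output and resolving on input are mirror images, which is why text that has been escaped this way can pass through a parser unchanged.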
With the exception of these predefined entities and the implied document entity, entities are created by declaration. The declaration is a simple syntax, behind which lurks a bewildering tangle of variations.
<!ENTITY name "value">
Entities fall into five classes.
Character entities are the simplest. They each resolve to a single character. They include the five predefined entities and the numeric entity, in which a reference to any character can be made with a numeric escape sequence. The form is &#number;. For example, Zo&#235; resolves to Zoë.
Internal general entities are only slightly more complex. These are basic text substitutions whose replacement text is spelled out in the declaration itself, and they are referenced in the body of the XML document (not in a DTD). For example:
<!ENTITY cpyrt "copyright 2001 Jacobson and Jacobson">
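To make the numeric form concrete, here is a small Python sketch using the standard xml.etree.ElementTree parser. The element name is arbitrary; the hexadecimal variant (&#x...;) is part of XML but is not discussed in the text above.

```python
import xml.etree.ElementTree as ET

# Decimal reference: 235 is the code point of the letter e with diaeresis.
decimal_form = ET.fromstring("<name>Zo&#235;</name>").text

# The same character written as a hexadecimal reference.
hex_form = ET.fromstring("<name>Zo&#xEB;</name>").text

# Building a numeric reference from a character is just its code point.
reference = "&#{};".format(ord(decimal_form[-1]))

print(decimal_form, hex_form, reference)
```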
That text would be inserted into the following line:
Chapter 5. &cpyrt; All Rights Reserved
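A minimal sketch of this substitution, assuming the declaration is placed in the document's internal DTD subset and the document is read with Python's standard xml.etree.ElementTree. The root element name page is invented for the example.

```python
import xml.etree.ElementTree as ET

# The entity is declared between the square brackets of the DOCTYPE
# (the internal subset) and referenced in the element content.
document = """<!DOCTYPE page [
  <!ENTITY cpyrt "copyright 2001 Jacobson and Jacobson">
]>
<page>Chapter 5. &cpyrt; All Rights Reserved</page>"""

# The parser expands &cpyrt; before the application sees the text.
expanded = ET.fromstring(document).text
print(expanded)
```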
External general entities separate the value of the entity from the XML document. To resolve an external entity, the processor must read in a new document that is linked to this entity in the declaration. This action could be used to perform text substitution of large, volatile, or shared text content, where the files referenced contain blocks of straight character data, like
<!ENTITY address SYSTEM "http://lincoln.com/gettysburgaddress.txt">
<!ENTITY headline SYSTEM "headline.txt">
<!ENTITY motto SYSTEM "../missionstatement">
In these examples, the keyword SYSTEM introduces a system identifier: a URI that tells the processor where to find the resource. This means, almost always, a location on the Internet. When the identifier is relative, the processor seeks the referenced file beginning at the location of the file that contains the declaration. In the motto example, the missionstatement file will be found in the parent directory of the directory in which the file with that reference was found. For example, while processing an XML file, we might call in the DTD it references. The motto entity declaration might be found in that DTD. If so, the file is sought relative to the current DTD file, not the original XML.
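The resolution rule can be sketched with Python's urllib.parse.urljoin, which applies ordinary relative-URI resolution. The example.com locations are hypothetical, chosen only to show a DTD that lives in a different directory from the XML file that invokes it.

```python
from urllib.parse import urljoin

def resolve_system_id(declaring_entity_url, system_id):
    """Resolve a system identifier against the entity that declares it."""
    return urljoin(declaring_entity_url, system_id)

# Hypothetical locations: the XML file invokes a DTD stored elsewhere.
xml_url = "http://example.com/docs/chapter5.xml"
dtd_url = "http://example.com/docs/dtd/site.dtd"

# Declared in the DTD, so resolved relative to the DTD, not the XML file.
motto = resolve_system_id(dtd_url, "../missionstatement")
headline = resolve_system_id(dtd_url, "headline.txt")

# An absolute identifier is left as it is.
address = resolve_system_id(dtd_url, "http://lincoln.com/gettysburgaddress.txt")

print(motto, headline, address, sep="\n")
```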
In addition to the SYSTEM identifier, a PUBLIC identifier can also be established. These identifiers are meant to be used to invoke common standards. For example, the W3C (the World Wide Web Consortium) publishes a document that specifies standards for HTML. Every well-behaved web page, especially an XML-compliant one, should invoke the standard with a line like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
Without the PUBLIC identifier, every browser, when it opened any web page, would be obliged to refer to the DTD document stored on the w3.org server. The load would be insufferable. Instead, an abstract symbol identifies the standard. The naming convention has fields separated by double slashes (and peppered with white space). These identify, respectively,
"-": ISO-registered (+) or not (-) organization. (The leading - marks the W3C as unregistered.)
"W3C" (OwnerID): the organization itself
"DTD" (Public Text Class): type of document
"HTML 4.0 Transitional" (Public Text Description): subject, version, and status
"EN" (Public Text Language): language of the document
Catalog systems (there are multiple standards for these) retain local copies of the meaning of these symbols.
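The field structure, and the catalog lookup it makes possible, can be sketched in a few lines of Python. This is an illustration only: the PublicId type, the function names, and the local path in CATALOG are all invented here, and real catalog systems use their own file formats.

```python
from typing import NamedTuple

class PublicId(NamedTuple):
    registered: bool   # True for "+", False for "-"
    owner: str         # OwnerID
    text_class: str    # Public Text Class
    description: str   # Public Text Description
    language: str      # Public Text Language

def parse_public_id(fpi):
    """Split a public identifier into its double-slash-separated fields."""
    flag, owner, text, language = [part.strip() for part in fpi.split("//")]
    text_class, description = text.split(" ", 1)
    return PublicId(flag == "+", owner, text_class, description, language)

# Hypothetical catalog: public identifier -> local copy of the DTD.
CATALOG = {
    "-//W3C//DTD HTML 4.0 Transitional//EN": "/usr/local/share/dtd/html40-loose.dtd",
}

def locate(public_id, system_id, catalog=CATALOG):
    """Prefer the local copy; fall back to the system identifier."""
    return catalog.get(public_id, system_id)

html40 = "-//W3C//DTD HTML 4.0 Transitional//EN"
print(parse_public_id(html40))
print(locate(html40, "http://www.w3.org/TR/REC-html40/loose.dtd"))
```

The fallback in locate mirrors the role of the second string in the DOCTYPE line: the system identifier is consulted only when the public identifier is not recognized locally.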
This discussion implies another consideration. The external entity can hold far more than simple text. It is more common that the entity contains markup, in particular data definitions, entity declarations, and other production rules.
Sound familiar? It should. The DTD is itself such a document. It is an entity invoked by the XML file. It often invokes other entities.
The DTD is written not in XML syntax but in a sister syntax that is interdependent. All validating XML parsers are capable of reading the DTD. But other entities can be even more exotic.
Unparsed entities are external documents that the XML processor is not expected to understand. The documents can be text based or, for example, a GIF image or an audio file. All unparsed entities must be external entities, of course. The parser must be able to tell which external entities are meant to be parsed. It does this with the help of markup declarations called notations.
Notations are an XML feature that allows the document to identify the content type of external files. Notations are separately declared. Most formally they attach a name to a file definition using a PUBLIC or SYSTEM identifier. In practice, this resolves to one of two forms. A notation can be a MIME content-type identifier that is familiar throughout the Internet. Examples are text/html, image/gif, and application/xml. Alternatively, a notation can refer to the application required to process the data. The MIME identification is more solid, of course, since application names and locations vary greatly from machine to machine.
Unparsed entities are differentiated from general external entities by the presence of notation, introduced by the keyword NDATA :
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY snapshot SYSTEM "../photos/annaleah.jpg" NDATA jpeg>
So far we have seen four types of entities. Each of them can be declared in an XML file or preferably in a separate DTD. All four entities can be referenced (invoked) in the XML document, but none of them can be referenced within a DTD.
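How a processor hands this information to an application can be sketched with Python's xml.parsers.expat, which reports notation and unparsed entity declarations through callbacks without ever opening the file. The sketch assumes a notation named jpeg whose system identifier is the MIME type image/jpeg; the album element and the function name are invented for the example.

```python
from xml.parsers import expat

# Assumed declarations: a notation named jpeg, and an unparsed entity
# that points at the photo and names that notation after NDATA.
DOC = """<!DOCTYPE album [
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!ENTITY snapshot SYSTEM "../photos/annaleah.jpg" NDATA jpeg>
]>
<album/>"""

def collect_declarations(doc):
    """Return the notations and unparsed entities declared in doc."""
    notations, unparsed = {}, {}
    parser = expat.ParserCreate()

    def on_notation(name, base, system_id, public_id):
        notations[name] = system_id

    def on_unparsed(name, base, system_id, public_id, notation_name):
        # The parser only reports the entity; it never reads the file.
        unparsed[name] = (system_id, notation_name)

    parser.NotationDeclHandler = on_notation
    parser.UnparsedEntityDeclHandler = on_unparsed
    parser.Parse(doc, True)
    return notations, unparsed

notations, unparsed = collect_declarations(DOC)
print(notations, unparsed)
```

It is then up to the application to look up the notation name and decide what to do with the resource.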
Parameter entities are a special case of entities that can be referenced in a DTD. They resemble general internal entities except that they use a special character. As the ampersand ( & ) is the escape character that begins any entity reference, the percent symbol ( % ) introduces a parameter entity reference. The percent sign is also used in the declaration of a parameter entity, which is otherwise identical to the general entity declaration:
<!ENTITY % ENGLISH "INCLUDE">
<!ENTITY % SPANISH "IGNORE">
This example demonstrates the utility of parameter entities in the DTD when used in conjunction with the keywords INCLUDE and IGNORE. The preceding definition would be followed later in the document by
<![ %ENGLISH; [ ..stuff.. ]]>
<![ %SPANISH; [ ..stuff.. ]]>
Conditional sections like these can be as long as you like. They identify different versions of the DTD contents and can easily be reversed by changing the parameter entity declarations.
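The switch can be tried out with Python's xml.parsers.expat. The sketch below makes several assumptions: the DTD text is supplied from a string rather than fetched from page.dtd, the entity greeting and the element page are invented, and parameter entity parsing is switched on explicitly because expat leaves the external DTD unread by default. Conditional sections are only permitted in the external DTD subset, which is why the DTD is kept separate from the document here.

```python
from xml.parsers import expat

def build_dtd(english, spanish):
    """Return a small DTD whose two versions are switched by parameter entities."""
    return (
        '<!ENTITY % ENGLISH "{}">\n'
        '<!ENTITY % SPANISH "{}">\n'
        '<![%ENGLISH;[ <!ENTITY greeting "Hello"> ]]>\n'
        '<![%SPANISH;[ <!ENTITY greeting "Hola"> ]]>\n'
    ).format(english, spanish)

# Hypothetical document that invokes the external DTD.
DOC = '<!DOCTYPE page SYSTEM "page.dtd"><page>&greeting;</page>'

def expand(doc, dtd_text):
    """Parse doc, feeding dtd_text to the parser as its external DTD."""
    chunks = []
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.CharacterDataHandler = chunks.append

    def load_dtd(context, base, system_id, public_id):
        # Stand-in for fetching system_id: parse the supplied DTD text.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd_text, True)
        return 1

    parser.ExternalEntityRefHandler = load_dtd
    parser.Parse(doc, True)
    return "".join(chunks)

english_version = expand(DOC, build_dtd("INCLUDE", "IGNORE"))
spanish_version = expand(DOC, build_dtd("IGNORE", "INCLUDE"))
print(english_version, spanish_version)
```

Swapping the two keyword values is the only change between the runs, which is the reversal the text describes.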