3.4 XML Document Logical Structure

The logical structure of an XML document consists of declarations, elements, character references, comments, and processing instructions.

Table 3-4. Structure of an XML Document
Structure	Consists of…	Part Referred to as…	Notes
Logical	Markup	Declarations, elements, attributes, character references, CDATA sections, comments, namespaces, processing instructions	Explicit markup indicates each type of markup. The logical and physical structures must nest properly.
Physical	One or more storage units	Entities	Entities can be parsed or unparsed, and they can refer to other entities. Each document has a document entity. All entities have content, and all are identified by name except the document entity and an external DTD subset.

3.4.1 The XML Declaration

Two types of declarations exist: the XML declaration and a DTD. If you use both, the XML declaration must precede the DTD. Although the XML declaration is optional, the W3C specification suggests that you include it so that the appropriate parser, or parsing process, can interpret the document correctly. XML version information is required in an XML declaration. The version number indicates that the document conforms to that version of the XML specifications, although 1.0 is the only version defined to date.

A working draft of [XML 1.1] has been published that proposes to make relatively modest but incompatible changes to XML at the character set level, but no structural changes. For example, it proposes adding the [Unicode] line separator character to the set of allowed white space characters and changing the default for tokens from allowing only specified characters to allowing all characters not prohibited. With this provision, as Unicode expands, so will the allowed token characters.

The XML declaration and all processing instructions begin with less than question mark ("<?") and end with question mark greater than ("?>"). XML also allows you to specify an "encoding" attribute, which gives the character encoding in use, and a "standalone" attribute. When present, they must follow the version information in that order: version, encoding, standalone. The XML Recommendation requires that all parsers support UTF-8 and UTF-16; parsers can support additional encodings as well. Thus the XML declaration might look something like the declarations shown in Example 3-4.

Example 3-4 XML declarations

 <?xml version = "1.0"?>
 <?xml version = "1.0" encoding = "UTF-8" standalone="yes"?>

The "standalone" document declaration is optional. A "standalone" value of "yes" indicates that the document does not depend on an external DTD; "no" indicates that the document may depend on an external DTD. The encoding declaration is optional and describes the character encoding currently in use. An appendix to the XML Recommendation gives heuristics whereby a parser may recognize many encodings, including UTF-8 and UTF-16, even in the absence of an encoding attribute.
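As a minimal sketch of how the encoding declaration guides a parser, the following Python fragment serializes the same one-element document in two encodings and parses both. The <word> element and its content are made up for illustration, and the sketch assumes the parser behind Python's standard xml.etree.ElementTree module supports ISO-8859-1 in addition to the required UTF-8.

```python
import xml.etree.ElementTree as ET

# The same hypothetical one-element document, serialized in two encodings.
# Each byte string declares the encoding it was actually written in.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>café</word>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><word>café</word>'
).encode("utf-8")

# The parser reads the declaration and decodes the bytes accordingly.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
print(latin1_text, utf8_text)
```

The byte sequences differ, yet the application receives the same characters in both cases because each declaration matches the bytes that follow it.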

It was intended that applications for which external DTDs were inconvenient or unavailable would specify that the input must have standalone="yes". However, this approach is essentially never taken. In retrospect, it is not clear it was even worth the effort to define the standalone attribute.

3.4.2 Elements

An element consists of a start tag/end tag pair and any data found in between them. The start tag includes the name of an element type enclosed by a less than symbol ("<") and a greater than symbol (">"), as shown in Figure 3-2. The name of an element type is known as a generic identifier (GI).

Figure 3-2. An element with content


XML documents contain data marked up with element start and end tags. Element start tags may have attributes. Elements are the most common form of markup and are extensible.

XML elements have content.
XML elements have relationships.
XML elements have simple naming rules.

Each XML document contains one or more elements, which can be classified into one of two categories: empty elements and elements with content. Empty elements are simply markers where something occurs. To denote an empty element, you can use a start tag immediately followed by an end tag, or use an "empty tag" that ends with a slash greater than ("/>"). For example, the XML equivalent to HTML's "<HR>" is "<HR/>". The trailing "/>" tells the processor that the element is empty and no matching end tag exists. See Figure 3-3.
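The two ways of writing an empty element can be compared with a short Python sketch. It uses the standard xml.etree.ElementTree module; the helper name is_empty is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# An empty-element tag and a start tag/end tag pair with nothing in between.
empty_tag = ET.fromstring("<HR/>")
tag_pair = ET.fromstring("<HR></HR>")

def is_empty(element):
    """Return True when the element has no text and no child elements."""
    return element.text is None and len(element) == 0

print(is_empty(empty_tag), is_empty(tag_pair))
```

After parsing, an application using this module sees no difference between the two forms.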

Figure 3-3. Two examples of empty elements


The content of an element can consist of one or more elements, mixed content, simple text content or, as described above, no content. In Example 3-5, the <book> element contains only other elements. In contrast, the <chapter> element contains mixed content: both text and other elements. The <title> and <section> elements are examples of simple content; they contain only text.

The relationships between XML elements are named according to parent and child nomenclature. In Example 3-5, <book> is the root element, whereas <title> and <chapter> are child elements of <book>. Note that <book> is also the parent element of both <title> and <chapter>.

Example 3-5 Element contents and relationships in an XML document

 <book>                                   <!-- Root element; also parent element -->
   <title>XML Security</title>            <!-- Child element -->
   <chapter>XML and Security              <!-- Child element -->
     <section>Origins of XML</section>    <!-- Grandchild of root -->
     <section>XML Goals</section>
   </chapter>
 </book>
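To make the parent and child relationships concrete, the sketch below loads the document from Example 3-5 with Python's standard xml.etree.ElementTree module and walks it. The variable names are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

BOOK = """<book>
  <title>XML Security</title>
  <chapter>XML and Security
    <section>Origins of XML</section>
    <section>XML Goals</section>
  </chapter>
</book>"""

root = ET.fromstring(BOOK)

# Direct children of the root element <book>.
children = [child.tag for child in root]

# <chapter> has mixed content: leading text plus <section> child elements.
chapter = root.find("chapter")
chapter_text = chapter.text.strip()
grandchildren = [section.text for section in chapter.findall("section")]

print(root.tag, children, chapter_text, grandchildren)
```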

When creating names of elements, you should avoid using a colon (":") because it is reserved for use by namespaces (as described in Section 3.5). Apart from that, you can use almost any name you want because the XML Recommendation reserves no words. You should, however, try to keep element names simple and descriptive. A name can contain letters, numbers, and other characters. Do not start a name with a number, a punctuation character other than the underscore, or the letters "xml" in any capitalization (e.g., Xml, XML, xMl), and do not include any spaces in an element name.
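The naming advice above can be approximated with a small checker. This is only a sketch: the function name follows_naming_advice is hypothetical, and the pattern is deliberately restricted to ASCII, whereas the XML Recommendation also admits many non-ASCII letters in names.

```python
import re

# ASCII-only approximation of the advice in the text: start with a letter or
# underscore, continue with letters, digits, periods, hyphens, or underscores.
# Colons and spaces are excluded; real XML names allow more characters.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def follows_naming_advice(name: str) -> bool:
    """Return True if the name follows the (simplified) guidelines above."""
    if name.lower().startswith("xml"):
        return False  # reserved prefix, in any capitalization
    return _NAME.fullmatch(name) is not None

for candidate in ["chapter", "2ndChapter", "XmlData", "book title", "ns:book"]:
    print(candidate, follows_naming_advice(candidate))
```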

It is sometimes quite annoying that XML prohibits "Names," such as element and attribute names, from starting with digits. No strong reason for this restriction exists.

3.4.3 Attributes

XML elements can have attributes in the start tag, just as in HTML. Attributes provide additional information for elements but do not constitute part of the element's content.

Attributes have both a name and a value. In XML, the attribute value must always be quoted. You can use either single or double quotes, but double quotes are more common. Attribute specifications may appear only within start tags and empty-element tags.

Figure 3-4 identifies the parts of an element that denote an attribute. Example 3-6 shows the attribute

 language="latin"

incorporated into an XML document. Chapter 4 discusses specifying attributes in a DTD in more detail.

Figure 3-4. Example of an attribute


Example 3-6 Attributes in an XML document

 <classification>
   <order>Ciconiformes</order>
   <family>Ardeidae</family>
   <species/>
   <name language="latin">Ardea herodias</name>
   <name language="english">Great Blue Heron</name>
   <foe>Raccoon</foe>
   <foe>Red-shouldered Hawk</foe>
 </classification>
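As an illustration of how an application might read these attributes, the following sketch parses Example 3-6 with Python's standard xml.etree.ElementTree module. The dictionary name names_by_language is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

CLASSIFICATION = """<classification>
  <order>Ciconiformes</order>
  <family>Ardeidae</family>
  <species/>
  <name language="latin">Ardea herodias</name>
  <name language="english">Great Blue Heron</name>
  <foe>Raccoon</foe>
  <foe>Red-shouldered Hawk</foe>
</classification>"""

root = ET.fromstring(CLASSIFICATION)

# Map each language attribute value to the text content of its <name> element.
names_by_language = {
    name.get("language"): name.text for name in root.findall("name")
}
print(names_by_language)

# Single quotes around the attribute value are equally acceptable.
single_quoted = ET.fromstring("<name language='latin'>Ardea herodias</name>")
print(single_quoted.get("language"))
```

The attribute value is available separately from the element's text, which reflects the point that attributes are not part of the element's content.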

3.4.4 Special Attributes xml:space and xml:lang

The XML Recommendation defines the attributes xml:lang and xml:space. The xml:lang attribute facilitates the use of documents containing human-language dependent text, especially if they employ multiple languages. The xml:space attribute allows elements to declare to an application whether their white space is "significant."

Language

You can insert the special attribute xml:lang in elements to specify the default language that the application will use in the contents and attribute values of that element in an XML document. If it is present and you are using a validating parser, you must have declared the xml:lang attribute in the DTD. The IETF [RFC 1766] specification, "Tags for the Identification of Languages," or its successor on the IETF Standards Track, defines the values of the xml:lang attribute. By convention, the language code appears in lowercase and the country code (if any) appears in uppercase.

IETF [RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. The successor to IETF [RFC 1766] will likely introduce three-letter language codes for languages not presently covered by [ISO 639].

White Space

The use of white space in documents varies. For example, you can use spaces, tabs, and blank lines when writing code or creating markup to make them easier to read. Such white space is typically not intended to be part of the delivered version of the document. Even so, white space in the source can be significant, just as the white space in a poem is.

The XML Recommendation requires that an XML processor pass all characters that are not markup through to the application. When you attach the xml:space attribute to an element, it provides information to the application about handling white space found in that element.

No matter what you do, all white space that is part of element content must be passed to applications by an XML-conformant parser; an application could then make arbitrary decisions based on such white space. Without special application knowledge, all white space given to an application must be considered "significant" from a security point of view, even where the XML Recommendation says that it should be identified to the application as "insignificant."

You must declare the xml:space attribute if you use it with a validating parser. When declared, this attribute must be given as an enumerated type whose values are "default", "preserve", or both. For example:

 <!ATTLIST poem xml:space (default|preserve) 'preserve'>
 <!ATTLIST pre  xml:space (preserve) #FIXED 'preserve'>

In this example, the value "default" indicates that applications' default white-space processing modes are acceptable for this element. The value "preserve" indicates that applications should preserve all of the white space. Chapter 4 provides more information about declarations and attributes in an XML document.

The xml:space behavior is inherited from parent elements. If an element containing an xml:space value contains other elements, they also inherit the xml:space behavior from the parent element, unless they have an xml:space attribute of their own.
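The inheritance rule can be sketched as a recursive walk over the element tree. The example below uses Python's standard xml.etree.ElementTree module, which reports xml:space under the reserved XML namespace URI. The function name effective_space and the sample <poem> document are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in "xml" prefix to this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(element, inherited="default"):
    """Yield (tag, mode) pairs, letting children inherit the parent's mode."""
    mode = element.get(XML_SPACE, inherited)
    yield element.tag, mode
    for child in element:
        yield from effective_space(child, mode)

doc = ET.fromstring(
    '<poem xml:space="preserve">'
    "<stanza><line>  indented  </line></stanza>"
    '<note xml:space="default">free to normalize</note>'
    "</poem>"
)
modes = dict(effective_space(doc))
print(modes)

# The parser itself passes the white space through unchanged.
print(repr(doc.find("stanza/line").text))
```

Note that the parser delivers the white space in every case; the computed mode is only a hint that the application may act on.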

3.4.5 CDATA Sections

A CDATA (character data) section is a mechanism typically used to include blocks of text containing special characters as character data. CDATA sections provide a way to protect information from a parser. Specifically, a CDATA section specifies that all characters within it should be considered character data, whether or not they look like a tag or entity reference. A CDATA section can appear anywhere in a document where you can have character data.

CDATA sections begin with the string

 <![CDATA[

and end with the string

]]>

The character string "]]>" is, of course, not allowed within a CDATA section because it would signal the end of the section. Example 3-7 shows a CDATA section that treats "<greeting>" and "</greeting>" as character data, not markup.

CDATA is text that will not be analyzed by a parser, except to look for the magic CDATA termination string. The parser will not treat tags inside the text as markup, nor will it expand entities.

PCDATA, a type of element content (see Chapter 4), is text that will be parsed by a parser. In such a case, the parser will treat tags inside the text as markup and will expand entities.

Example 3-7 A CDATA section

 <![CDATA[<greeting>Hello, world!</greeting>]]>
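The difference between CDATA and parsed content can be observed with a short Python sketch using the standard xml.etree.ElementTree module. Because a CDATA section must appear inside an element, the wrapper element <doc> is added here as an assumption.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, the tags are plain character data.
cdata_doc = ET.fromstring(
    "<doc><![CDATA[<greeting>Hello, world!</greeting>]]></doc>"
)

# Without the CDATA section, the same characters are parsed as markup.
parsed_doc = ET.fromstring("<doc><greeting>Hello, world!</greeting></doc>")

print(repr(cdata_doc.text), len(cdata_doc))  # text only, no child elements
print(parsed_doc[0].tag, parsed_doc[0].text)  # one <greeting> child element
```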

3.4.6 Comments

Comments are used to annotate an XML document, but are not part of the document's text content. In addition, you can use them to comment out tag sets. Although an XML processor generally ignores comments, a processor may make it possible for an application to retrieve the text of comments.

Comments begin with

 <!--

and end with

-->

HTML uses the identical syntax for inserting comments into a document. For compatibility, the string "--" (double-hyphen) must not occur in a comment. Comments may appear anywhere in a document, provided the comment remains outside other markup. Example 3-8 gives several examples of comments.

Example 3-8 Comments

 <!-- This is a comment -->
 <!-- This is also a comment -->
 <!-- Begin the contributing author names -->
 <name>George W. Archibald</name>
 <name>James C. Lewis</name>
 <!-- End the contributing author names -->
 <!-- Comment out other contributing author names!
 <name>Able B. Charlie</name>
 <name>Delta E. Foxtrot</name>
 End commenting out other contributing author names. -->

When using comments, observe the following guidelines:

Never place a comment inside an entity declaration.
Never place a comment before the XML declaration. A comment may, however, occur after the XML declaration and before the root element or after the root element.
Never nest comments. The end of the first nested comment will terminate the outermost comment.
Never include two hyphens in a row (--) within a comment.
Never place a comment within a start or end tag.
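Several of these guidelines can be checked against a real parser. The sketch below uses Python's standard xml.etree.ElementTree module, which by default discards comments when building the tree. The helper is_well_formed and the wrapper elements are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Comments, including a commented-out tag set, do not reach the element tree.
doc = ET.fromstring(
    "<authors><!-- Begin the contributing author names -->"
    "<name>George W. Archibald</name>"
    "<!-- <name>Able B. Charlie</name> -->"
    "</authors>"
)
names = [name.text for name in doc.findall("name")]
print(names)

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><!-- fine --></a>"))
print(is_well_formed("<a><!-- double -- hyphen --></a>"))
print(is_well_formed("<a <!-- inside a tag -->></a>"))
print(is_well_formed("<!-- first --><?xml version='1.0'?><a/>"))
```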

3.4.7 Character Sets and Encoding

[ISO 10646] is the native character set of XML.

Every ISO 10646 character corresponds to a number between 0x0 and 0x10FFFF (in the hexadecimal system). Legal characters in XML are the tab, carriage return, line feed, and legal characters of [Unicode] or [ISO 10646] in the ranges 0x20 to 0xD7FF, 0xE000 to 0xFFFD, and 0x10000 to 0x10FFFF hexadecimal (i.e., all characters except for the surrogate blocks, 0xFFFE, and 0xFFFF). Note, however, that some changes in this system have been proposed [XML 1.1].
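The legal ranges translate directly into a range test. The following Python sketch encodes the ranges listed above; the function name is_legal_xml_char is made up for this illustration.

```python
def is_legal_xml_char(code_point: int) -> bool:
    """Return True if the code point is a legal XML 1.0 character."""
    return (
        code_point in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF        # up to the surrogate blocks
        or 0xE000 <= code_point <= 0xFFFD      # excludes 0xFFFE and 0xFFFF
        or 0x10000 <= code_point <= 0x10FFFF   # supplementary planes
    )

for cp in (0x9, 0x0, 0x41, 0xD800, 0xFFFE, 0x10FFFF):
    print(hex(cp), is_legal_xml_char(cp))
```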

All XML processors must accept the UTF-8 and UTF-16 encoding of ISO 10646.

Special Character Strings

Text in an XML document consists of intermingled character data and markup. All text that is not markup constitutes the character data of the document. The XML Recommendation specifies that the ampersand character ("&") and the left angle bracket ("<") may appear in their literal form only when used as follows:

As markup delimiters
Within a comment
Within a processing instruction
Within a CDATA section

If you use these characters outside of markup, you must "escape" them using either numeric character references or special escape strings, as defined in the XML Recommendation. In addition, the XML Recommendation specifies special escape strings to allow attribute values to contain both single and double quotes (see Table 3-5). You must escape the right angle bracket (">") using "&gt;" or a numeric character reference when it appears in the string "]]>" in content and when that string does not mark the end of a CDATA section.

Numeric Character References

Numeric character references allow you to insert into your document any legal Unicode characters including those that satisfy the following criteria:

You cannot type the characters directly on your keyboard.
You cannot input the characters from other available devices.
The characters are not available in the character encoding in use.

Character references take one of two forms: decimal references that start with "&#" and hexadecimal references that start with "&#x". Both forms end with a semicolon. For example, "&#169;" is the decimal representation of the standard copyright symbol ("©"). You can represent the Greek letter pi in an XML document using the decimal representation "&#960;" or the hexadecimal representation "&#x3C0;". An XML processor must expand numeric character references immediately on parsing them and must treat them as character data.
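A brief Python sketch shows both directions: a parser expanding numeric character references, and a helper that produces the decimal and hexadecimal references for a given character. The helper name numeric_refs and the <p> wrapper element are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# The parser expands the references and hands over ordinary characters.
text = ET.fromstring("<p>&#169; &#960; &#x3C0;</p>").text
print(text)

def numeric_refs(char):
    """Return the (decimal, hexadecimal) character references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"

print(numeric_refs("©"))
print(numeric_refs("π"))
```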

On a Windows machine, you can find the character code for most characters in the Keystroke field of the Character Map (see Figure 3-5).

Figure 3-5. The Character Map showing a character code


Table 3-5. Predefined Special Character Strings
Character	Escape String
Left angle bracket ("<")	&lt;
Right angle bracket (">")	&gt;
Ampersand ("&")	&amp;
Single apostrophe/single-quote (" ' ")	&apos;
Double apostrophe/double-quote (" " ")	&quot;
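In practice, escaping is usually delegated to a library rather than done by hand. The sketch below uses the escape and unescape functions from Python's standard xml.sax.saxutils module, which handle the ampersand and angle brackets by default; the two quote escapes are supplied through an extra mapping. The sample string and the <t> wrapper element are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

original = 'if a < b && c > d: say "hi"'

# escape() handles &, <, and > itself; the quote escapes are passed in.
QUOTES = {'"': "&quot;", "'": "&apos;"}
escaped = escape(original, QUOTES)
print(escaped)

# unescape() reverses the process when given the inverse mapping.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})

# A parser expands the predefined escape strings back to literal characters.
parsed = ET.fromstring(f"<t>{escaped}</t>").text
print(restored == original, parsed == original)
```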

3.4.8 Processing Instructions

A processing instruction (PI) is an explicit mechanism for embedding information in a document intended for an application rather than for the XML parser or browser. PIs are not part of an XML document's character data.

A processing instruction begins with a target that identifies the application to which the instruction is directed. The XML processor must pass PIs on to the appropriate application. The application then decides how to handle the instructions. Applications that do not recognize the instructions simply ignore them.

Processing instructions have the following form:

 <?APPLICATION_NAME INSTRUCTIONS?>

Notice that the XML declaration that appears on the first line of an XML document looks like a processing instruction (see Example 3-4) but is not.

See Example 3-9 for another example of a processing instruction. You can declare the PI target beforehand using a NOTATION declaration as shown in this example. Chapter 4 discusses NOTATION declarations in more detail.

The PI data for the PI target application should appear in a format that the application can interpret. Note that PIs are not required to have data after the target. Only the application recognizes data following the PI target. The XML Recommendation reserves the target name "xml" in any capitalization, including mixed capitalization, for standardization in future versions of the specification.

Example 3-9 Sample processing instruction

 <!NOTATION mybirdapp SYSTEM "file://mydir/birdapp.exe">
 <?mybirdapp Do_this?>
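To see how a processor hands a PI to an application, the following sketch parses the instruction from Example 3-9 with Python's standard xml.dom.minidom module. The root element <birds> is an assumption added so that the text forms a complete document, and the NOTATION declaration is omitted for brevity.

```python
from xml.dom import minidom

# The PI appears before a hypothetical root element <birds>.
doc = minidom.parseString("<?mybirdapp Do_this?><birds/>")

pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)  # it is a PI node
print(pi.target, pi.data)                             # target and PI data

# The XML declaration looks similar but is not reported as a PI node.
declared = minidom.parseString('<?xml version="1.0"?><birds/>')
print(declared.firstChild.nodeType == declared.ELEMENT_NODE)
```

The application receives the target and the data as separate strings and decides for itself whether the target is one it recognizes.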

Processing instructions were included in XML because they appeared in [SGML] and it seemed like a good idea at the time. Today, most people wish they had been left out because they represent an unnecessary complexity, but feel they must continue to be supported for compatibility. The modern approach calls for encoding essentially all semantics using the simpler syntax of plain XML without PIs.

More radical proposals such as a "Simple XML" would omit attributes; instead, elements would be used for everything. This idea isn't likely to be adopted as a standard but you can define your XML that way if desired.
