A.2 Anatomy of an XML Document

The best way to explain how an XML document is composed is to present one. Example A-1 shows an XML document you might use to describe two authors.

Example A-1. A very simple XML document

<?xml version="1.0" encoding="us-ascii"?>
<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>

The first line of the document is known as the XML declaration. This tells a processing application which version of XML you are using (the version indicator is mandatory) and which character encoding you have used for the document. In this example, the document is encoded in ASCII. (The significance of character encoding is covered later in this appendix.)

If the XML declaration is omitted, a processor will make certain assumptions about your document. In particular, it will expect it to be encoded in UTF-8, an encoding of the Unicode character set. However, it is best to use the XML declaration wherever possible, both to avoid confusion over the character encoding and to indicate to processors which version of XML you're using. (1.0 is most common, but 1.1, which makes relatively minor though potentially incompatible changes, has recently appeared.) Encoding handling should be automatic with Office, but you may need to watch for documents you import from other sources.

A.2.1 Elements and Attributes

The second line of Example A-1 begins an element, which has been named authors. The contents of that element include everything between the right angle bracket (>) in <authors> and the left angle bracket (<) in </authors>. The actual syntactic constructs <authors> and </authors> are often referred to as the element start tag and end tag, respectively. Do not confuse tags with elements! Tags mark the boundaries of elements. Note that elements, like the authors element here, may include other elements, as well as text. An XML document must contain exactly one root element, which contains all other content within the document. The name of the root element defines the type of the XML document.

Elements that contain both text and other elements simultaneously are classified as mixed content. Word supports the use of mixed content, while the other applications in the Office suite generally do not.
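
To make mixed content concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The para and emphasis element names are invented for illustration. ElementTree exposes the text before a child element as text and the text after the child's end tag as tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content element: text and a child element side by side.
para = ET.fromstring("<para>XML is <emphasis>not</emphasis> hard.</para>")
emphasis = para[0]

leading_text = para.text        # text before the child element
child_text = emphasis.text      # text inside the child element
trailing_text = emphasis.tail   # text after the child's end tag

print(repr(leading_text), repr(child_text), repr(trailing_text))
```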

The sample "authors" document uses elements named person to describe the authors themselves. Each person element has an attribute named id. Unlike elements, attributes can only contain textual content. Their values must be surrounded by quotes. Either single quotes (') or double quotes (") may be used, as long as you use the same kind of closing quote as the opening one.

Within XML documents, attributes are frequently used for metadata (i.e., "data about data"), describing properties of the element's contents. This is the case in our example, where id contains a unique identifier for the person being described.
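
As a rough illustration of how an application might read these elements and attributes, the following Python sketch parses a copy of the "authors" document with the standard library's xml.etree.ElementTree. The id values lear and asimov are assumptions made for the example; only mysteryperson is named in the text.

```python
import xml.etree.ElementTree as ET

# A copy of the "authors" document; the ids "lear" and "asimov" are assumed.
AUTHORS_XML = """<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>"""

root = ET.fromstring(AUTHORS_XML)

# Pair each id attribute with the text of the name child (None if there is none).
people = [(person.get("id"), person.findtext("name"))
          for person in root.findall("person")]

print(root.tag)
for person_id, name in people:
    print(person_id, name)
```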

As far as XML is concerned, it does not matter in what order attributes are presented in the element start tag. For example, these two elements contain exactly the same information as far as an XML 1.0 conformant processing application is concerned:

<animal name="dog" legs="4"></animal>
<animal legs="4" name="dog"></animal>

On the other hand, the information presented to an application by an XML processor on reading the following two lines will be different for each animal element because the ordering of elements is significant:

<animal><name>dog</name><legs>4</legs></animal>
<animal><legs>4</legs><name>dog</name></animal>

XML treats a set of attributes like a bunch of stuff in a bag (there is no implicit ordering), while elements are treated like items on a list, where ordering matters.
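
The bag-versus-list distinction can be checked with a short Python sketch. It uses xml.etree.ElementTree, which happens to expose attributes as a dictionary and child elements as a sequence; other XML APIs may present them differently.

```python
import xml.etree.ElementTree as ET

# Attributes: the two spellings yield equal attribute dictionaries.
a = ET.fromstring('<animal name="dog" legs="4"></animal>')
b = ET.fromstring('<animal legs="4" name="dog"></animal>')
same_attributes = a.attrib == b.attrib  # dictionary comparison ignores order

# Child elements: the order in the document is preserved and therefore differs.
c = ET.fromstring("<animal><name>dog</name><legs>4</legs></animal>")
d = ET.fromstring("<animal><legs>4</legs><name>dog</name></animal>")
order_c = [child.tag for child in c]
order_d = [child.tag for child in d]

print(same_attributes, order_c, order_d)
```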

New XML developers frequently ask when it is best to use attributes to represent information and when it is best to use elements. As you can see from the "authors" example, if order is important to you, then elements are a good choice. In general, there is no hard-and-fast best practice for choosing whether to use attributes or elements, though elements can contain other elements and attributes, while attributes can contain only text.

The final author described in our document has no information available. All we know about this person is his or her ID, mysteryperson. The document uses the XML shortcut syntax for an empty element. The following is a reasonable alternative:

<person id="mysteryperson"></person>
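
A quick way to convince yourself that the two spellings are equivalent is to parse both and compare what the parser reports. This Python sketch uses xml.etree.ElementTree; the summarize helper is a made-up name for this example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<person id="mysteryperson"/>')
long_form = ET.fromstring('<person id="mysteryperson"></person>')

def summarize(element):
    """Reduce an element to the parts an application typically inspects."""
    return (element.tag, dict(element.attrib), element.text, len(element))

forms_match = summarize(short_form) == summarize(long_form)
print(forms_match)
```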

A.2.2 Name Syntax

XML 1.0 has certain rules about element and attribute names. In particular:

Names are case-sensitive, e.g., <person/> is not the same as <Person/>.
Names beginning with xml (in any permutation of uppercase or lowercase) are reserved for use by XML 1.0 and its companion specifications.
A name must start with a letter or an underscore, not a digit, and may continue with any letter, digit, underscore, hyphen, or period. (Actually, a name may also contain a colon, but the colon is used to delimit a namespace prefix and is not available for arbitrary use as of the Second Edition of XML 1.0.)

A precise description of names can be found in Section 2.3 of the XML 1.0 specification, at http://www.w3.org/TR/REC-xml#sec-common-syn.
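
One informal way to explore these rules is to hand candidate names to a parser and see which ones it accepts. The Python sketch below does this with xml.etree.ElementTree; parser_accepts_name is a hypothetical helper, and the candidate names are arbitrary. It does not test the reserved xml prefix, which parsers generally do not enforce for element names.

```python
import xml.etree.ElementTree as ET

def parser_accepts_name(name):
    """Return True if the parser accepts `name` as an element name."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False

candidates = ["person", "_person", "person.1", "first-name", "1person", ".person"]
results = {name: parser_accepts_name(name) for name in candidates}

# Names are case-sensitive, so these are two different element names.
case_sensitive = ET.fromstring("<person/>").tag != ET.fromstring("<Person/>").tag

for name, accepted in results.items():
    print(name, accepted)
```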

A.2.3 XML Namespaces

XML 1.0 lets developers create their own elements and attributes, but leaves open the potential for overlapping names. title in one context may mean something entirely different than title in a different context. The Namespaces in XML specification (which can be found at http://www.w3.org/TR/REC-xml-names/) provides a mechanism developers can use to identify particular vocabularies using Uniform Resource Identifiers (URIs).

URIs are a combination of the familiar Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). From the perspective of XML namespaces, URIs are convenient because they combine an easily used syntax with a notion of ownership. While it's possible for me to create namespace URIs that begin with http://microsoft.com, general practice holds that it would be better for me to create URIs that begin with http://simonstl.com, a domain I own, and leave http://microsoft.com to Microsoft. In general, organizations and individuals who create XML vocabularies should choose a namespace URI in a space they control. This makes it possible (though it isn't required) to put information there documenting the vocabulary, or other resources for processing the vocabulary.

The rules for XML names don't permit developers to create elements with names like http://simonstl.com/ns/mine:Title, and it's not clear that working with names like that would be much fun anyway. To get around these problems, the Namespaces in XML specification defines a mechanism for associating URIs with element and attribute names through prefixes. Instead of typing out the whole URI, developers can work with a much shorter prefix, or even set a default URI that applies to names without prefixes.

To create a prefix, you use a namespace declaration, which looks like an attribute. For example, to create a prefix of xhtml associated with the URI http://www.w3.org/1999/xhtml, you would use an xmlns:xhtml attribute as shown below:

<container xmlns:xhtml="http://www.w3.org/1999/xhtml">
    ....
</container>

To apply a prefix, you put it in front of the element or attribute name, with a colon separating the prefix from the name. To put an XHTML p element inside of that container, you could write:

<container xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xhtml:p>This is an XHTML paragraph!</xhtml:p>
</container>

When a program encountered the xhtml:p, it would know that p was the local name of the element, xhtml was the prefix, and http://www.w3.org/1999/xhtml was the URI for that element. The namespace declaration applies to all elements inside the element where it appears, as well as the element containing the declaration. For example, the xhtml prefix works for all three of these paragraphs:

<container xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xhtml:p>This is XHTML paragraph 1!</xhtml:p>
    <xhtml:p>This is XHTML paragraph 2!</xhtml:p>
    <xhtml:p>This is XHTML paragraph 3!</xhtml:p>
</container>
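
To see how a namespace-aware processor resolves prefixed names, consider this Python sketch using xml.etree.ElementTree. ElementTree reports each name in the form {URI}local-name, which is a convention of that library rather than XML syntax. The second document uses an arbitrary prefix, h, chosen only for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

doc = ET.fromstring(
    f'<container xmlns:xhtml="{XHTML}">'
    "<xhtml:p>This is XHTML paragraph 1!</xhtml:p>"
    "<xhtml:p>This is XHTML paragraph 2!</xhtml:p>"
    "</container>"
)
# The prefix is resolved away; what remains is the URI plus the local name.
tags = [child.tag for child in doc]

# Same URI bound to a different prefix: the resulting element name is identical.
other = ET.fromstring(f'<container xmlns:h="{XHTML}"><h:p>Hi</h:p></container>')
same_name = other[0].tag == tags[0]

print(doc.tag, tags, same_name)
```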

In most XML processing, the prefix doesn't matter; the local name and the URI are what count, and the prefix is just a mechanism for associating them. (This is especially important in XSLT processing and XML Schemas.) In some documents, especially documents that use only structures from one namespace or where one vocabulary is dominant, developers choose to use the default namespace rather than prefixes. When the default namespace is used (assigned with an xmlns attribute), elements without a prefix are associated with a given URI. In XHTML, an XML derivative of HTML, this is the most typical path, since HTML developers aren't used to putting prefixes on all of their element names. A typical XHTML document might look like this:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>My Document</title>
  </head>
  <body>
    <p>Could use some content here</p>
  </body>
</html>

In this case, the URI http://www.w3.org/1999/xhtml applies to every element in the document, including html, head, title, body, and p. The default namespace has one quirk, though: it doesn't apply to attributes. Attributes can be given a namespace by explicitly using a prefix in their name, but unprefixed attributes have no namespace URI. This often doesn't matter, but it can be important when writing XSLT stylesheets and creating XML Schemas.
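
The attribute quirk is easy to observe in practice. In the following Python sketch (xml.etree.ElementTree again), a class attribute has been added to the p element purely for illustration; it is not part of the example document above.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

html = ET.fromstring(
    f'<html xmlns="{XHTML}"><body>'
    '<p class="note">Could use some content here</p>'
    "</body></html>"
)

# Unprefixed elements pick up the default namespace...
p = html.find(f"{{{XHTML}}}body/{{{XHTML}}}p")
element_name = p.tag

# ...but the unprefixed attribute does not.
attribute_names = list(p.attrib)

print(element_name, attribute_names)
```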

Typically, the namespaces used by a document are declared on the root element of the document, which lets them apply to all the content inside that document. They can, of course, also be declared throughout the document, though this makes it more difficult to read. Declarations can override each other as well, and the declaration closest to a given use of a prefix in the hierarchy will be used. This lets developers mix and match XML vocabularies even when they use the same prefix.
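
The overriding behavior can be sketched as follows in Python. The element names and the urn:example: URIs are hypothetical; the point is only that the same prefix, t, resolves to whichever declaration is nearest in the hierarchy.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<catalog xmlns:t="urn:example:books">'
    "<t:title>Outer vocabulary</t:title>"
    '<section xmlns:t="urn:example:people">'
    "<t:title>Inner vocabulary</t:title>"
    "</section>"
    "</catalog>"
)

outer_title = doc[0].tag     # resolved using the declaration on catalog
inner_title = doc[1][0].tag  # resolved using the closer declaration on section

print(outer_title, inner_title)
```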

Namespaces are very simple on the surface but are a well-known field of combat in XML arcana. For more information on namespaces, see Tim Bray's "XML Namespaces by Example," published at http://www.xml.com/pub/a/1999/01/namespaces.html; XML In a Nutshell; or Learning XML.

A.2.4 Well-Formedness

An XML document that conforms to the rules of XML syntax is known as well-formed. At its most basic level, well-formedness means that elements should be properly matched, and all opened elements should be closed. A formal definition of well-formedness can be found in Section 2.1 of the XML 1.0 specification, at http://www.w3.org/TR/REC-xml#sec-well-formed. Table A-1 shows some XML documents that are not well-formed.

Table A-1. Examples of poorly formed XML documents
Document	Reason why it's not well-formed
<foo> <bar> </foo> </bar>	The elements are not properly nested because `foo` is closed while inside its child element `bar`.
<foo> <bar> </foo>	The `bar` element was not closed before its parent, `foo`, was closed.
<foo baz> </foo>	The `baz` attribute has no value. While this is permissible in HTML (e.g., `<table border>`), it is forbidden in XML.
<foo baz=23> </foo>	The `baz` attribute value, 23, has no surrounding quotes. Unlike HTML, all attribute values must be quoted in XML.
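
Any conforming parser should refuse the documents in Table A-1. As a minimal check, this Python sketch feeds each one to xml.etree.ElementTree and records whether parsing succeeds; is_well_formed is a helper name made up for the example, and the last document is a repaired version added for contrast.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts `text` without a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<foo> <bar> </foo> </bar>",  # improper nesting
    "<foo> <bar> </foo>",         # bar never closed
    "<foo baz> </foo>",           # attribute without a value
    "<foo baz=23> </foo>",        # unquoted attribute value
]
table_results = [is_well_formed(sample) for sample in samples]

# A repaired document that combines the fixes.
fixed = is_well_formed('<foo baz="23"> <bar> </bar> </foo>')

print(table_results, fixed)
```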

A.2.5 Comments and Processing Instructions

As in HTML, it is possible to include comments within XML documents. XML comments are intended to be read only by people. With HTML, developers have occasionally employed comments to add application-specific functionality. For example, the server-side include functionality of most web servers uses instructions embedded in HTML comments. In XML, comments should not be used for any purpose other than those for which they were intended, as they are usually stripped from the document during parsing.

The start of a comment is indicated with <!--, and the end of the comment with -->. Any sequence of characters, aside from the string --, may appear within a comment. Comments can appear at the start or end of a document as well as inside elements. They cannot appear inside attributes or inside of a tag. A comment might look like:

<!--Hello, this is a comment -->

Comments tend to be used more in XML documents intended for human consumption than those intended for machine consumption. If you want to pass information to an XML application without affecting the structure of the document, you can use processing instructions, or PIs. Processing instructions use <? as a starting delimiter and ?> as a closing delimiter, must contain a target conforming to the rules for XML names, and may contain additional data. A typical PI might look like:

<?xml-style type="text/css" href="mystyle.css" ?>

In this case, xml-style is the target and type="text/css" href="mystyle.css" is the data. For more information on PIs, see Section 2.6 of the XML 1.0 specification, at http://www.w3.org/TR/REC-xml#sec-pi.
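
For a sense of how an application receives a processing instruction, here is a small Python sketch using the standard library's xml.dom.minidom. The authors root element is added only so that the snippet is a complete document. Note that the parser hands over the data as one opaque string; the attribute-like syntax inside it is up to the application to interpret.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-style type="text/css" href="mystyle.css" ?><authors/>'
)

pi = doc.firstChild  # the PI precedes the root element
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
target = pi.target
data = pi.data.strip()  # strip() guards against surrounding whitespace

print(is_pi, target, data)
```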

A.2.6 Entity References

You may occasionally need to use the mechanism for escaping characters. Because some characters have special significance in XML, there needs to be a way to represent them. For example, in some cases the < symbol might really be intended to mean "less than" rather than to signal the start of an element name. Clearly, just inserting the character without any escaping mechanism would result in a poorly formed document because a processing application would assume you were starting another element. Another instance of this problem is needing to include both double quotes and single quotes simultaneously in an attribute's value. Here's an example that illustrates both these difficulties:

<badDoc>
  <para>
    I'd really like to use the < character
  </para>
  <note title="On the proper 'use' of the " character"/>
</badDoc>

XML avoids this problem by the use of the predefined entity reference. The word "entity" in the context of XML simply means a unit of content. The term "entity reference" means just that, a symbolic way of referring to a certain unit of content. XML predefines entities for the following symbols: left angle bracket (<), right angle bracket (>), apostrophe ('), double quote ("), and ampersand (&).

An entity reference is introduced with an ampersand (&), which is followed by a name (using the word "name" in its formal sense, as defined by the XML 1.0 specification), and terminated with a semicolon (;). Table A-2 shows how the five predefined entities can be used within an XML document.

Table A-2. Predefined entity references in XML 1.0
Literal character	Entity reference
`<`	`&lt;`
`>`	`&gt;`
'	`&apos;`
"	`&quot;`
`&`	`&amp;`

Here's our problematic document revised to use entity references:

<badDoc>
  <para>
    I'd really like to use the &lt; character
  </para>
  <note title="On the proper &apos;use&apos; of the &quot; character"/>
</badDoc>
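
When the revised document is parsed, the application sees the literal characters again. The Python sketch below shows this with xml.etree.ElementTree, and also uses xml.sax.saxutils.escape, a standard library helper that performs the escaping of &, <, and > when you generate XML text yourself. The sample string passed to escape is arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

fixed = ET.fromstring(
    "<badDoc><para>I'd really like to use the &lt; character</para>"
    '<note title="On the proper &apos;use&apos; of the &quot; character"/>'
    "</badDoc>"
)

# The parser replaces each entity reference with the character it stands for.
para_text = fixed.find("para").text
title = fixed.find("note").get("title")

# Going the other way: escape text before writing it into a document.
escaped = escape("a < b & c > d")

print(para_text)
print(title)
print(escaped)
```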

Being able to use the predefined entities is often all you need; in general, entities are provided as a convenience for human-created XML. XML 1.0 allows you to define your own entities and use entity references as "shortcuts" in your document. Section 4 of the XML 1.0 specification, available at http://www.w3.org/TR/REC-xml#sec-physical-struct, describes the use of entities.

A.2.7 Character References

You may find character references in Office 2003 XML documents. Character references allow you to denote a character by its numeric position in the Unicode character set (this position is known as its code point). Table A-3 contains a few examples that illustrate the syntax.

Table A-3. Example character references
Actual character	Character reference
0	`&#48;`
A	`&#65;`
Ñ	`&#xD1;`
®	`&#xAE;`

Note that the code point can be expressed in decimal or, with the use of x as a prefix, in hexadecimal.
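
As a brief check of the decimal and hexadecimal forms, the following Python sketch parses a made-up chars element containing four character references and then recovers the code points of the resulting characters.

```python
import xml.etree.ElementTree as ET

# &#65; and &#x41; are the same code point, written in decimal and in hexadecimal.
text = ET.fromstring("<chars>&#65; &#x41; &#xD1; &#174;</chars>").text
code_points = [ord(ch) for ch in text.split()]

print(text, code_points)
```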

A.2.8 Character Encodings

The subject of character encodings is frequently a mysterious one for developers. Most code tends to be written for one computing platform and, normally, to run within one organization. Although the Internet is changing things quickly, most of us have never had cause to think too deeply about internationalization.

XML, designed to be an Internet-friendly syntax for information exchange, has internationalization at its very core. One of the basic requirements for XML processors is that they support the Unicode standard character encoding. Unicode attempts to include the requirements of all the world's languages within one character set. Consequently, it is very large!

A.2.8.1 Unicode encoding schemes

Unicode 3.0 has more than 57,700 code points, each of which corresponds to a character. (You can obtain charts of all these characters online by visiting http://www.unicode.org/charts/.) If one were to express a Unicode string by using the position of each character in the character set as its encoding (in the same way as ASCII does), expressing the whole range of characters would require four octets for each character. (An octet is a string of eight binary digits, or bits. A byte is commonly, but not always, considered the same thing as an octet.) Clearly, if a document is written in 100 percent American English, it will be four times larger than required, since all the characters in ASCII fit into a 7-bit representation. This places a strain both on storage space and on memory requirements for processing applications.

Fortunately, two encoding schemes for Unicode alleviate this problem: UTF-8 and UTF-16. As you might guess from their names, applications can process documents in these encodings in 8- or 16-bit segments. When code points are required in a document that cannot be represented by one chunk, a bit-pattern is used that indicates that the following chunk is required to calculate the desired code point. In UTF-8 this is denoted by having the most significant bit of the first octet set to 1.

This scheme means that UTF-8 is a highly efficient encoding for representing languages using Latin alphabets, such as English. All of the ASCII character set is represented natively in UTF-8: an ASCII-only document and its equivalent in UTF-8 are byte-for-byte identical. UTF-16 is more efficient for representing languages that use Unicode characters represented by larger numeric values, notably Chinese, Japanese, and Korean.
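
These size trade-offs can be measured directly. The Python sketch below uses only built-in string encoding; the sample strings are arbitrary, and the big-endian codec names (utf-16-be, utf-32-be) are used so that no byte order mark is counted.

```python
ascii_text = "Edward Lear"
# An ASCII-only string produces identical bytes in ASCII and UTF-8.
utf8_identical = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

accented = "\u00c1"  # LATIN CAPITAL LETTER A WITH ACUTE
utf8_bytes = accented.encode("utf-8")
# The first octet has its most significant bit set, signalling a multi-octet sequence.
lead_bit_set = bool(utf8_bytes[0] & 0x80)

cjk = "\u65e5\u672c\u8a9e"  # three CJK ideographs
sizes = {
    "utf-8": len(cjk.encode("utf-8")),
    "utf-16": len(cjk.encode("utf-16-be")),
    "utf-32": len(cjk.encode("utf-32-be")),  # four octets per character
}

print(utf8_identical, utf8_bytes, lead_bit_set, sizes)
```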

This knowledge will also help you debug encoding errors. One frequent error arises because ASCII is a proper subset of UTF-8: programmers get used to this fact and produce UTF-8 documents, but use them as if they were ASCII. Things start to go awry when the XML parser processes a document containing, for example, a character such as Á. Because this character cannot be represented using only one octet in UTF-8, it produces a two-octet sequence in the output document; in a non-Unicode viewer or text editor, it looks like a couple of characters of garbage.
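
The "couple of characters of garbage" effect can be reproduced in a few lines of Python. Here ISO-8859-1 (latin-1) stands in for the non-Unicode viewer; that choice is an assumption for the example, and other single-byte encodings would show different garbage.

```python
utf8_bytes = "\u00c1".encode("utf-8")    # A with acute accent, two octets in UTF-8

misread = utf8_bytes.decode("latin-1")   # what a single-byte viewer would display
correct = utf8_bytes.decode("utf-8")     # what a UTF-8 aware viewer would display

print(len(utf8_bytes), len(misread), len(correct))
```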

A.2.8.2 Other character encodings

Unicode, in the context of computing history, is a relatively new invention. Native operating system support for Unicode is by no means widespread. For instance, although Windows NT offers Unicode support, Windows 95 and 98 do not have it.

XML 1.0 allows a document to be encoded in any character set registered with the Internet Assigned Numbers Authority (IANA). European documents are commonly encoded in one of the ISO Latin character sets, such as ISO-8859-1. Japanese documents commonly use Shift-JIS, and Chinese documents use GB2312 and Big 5.

A full list of registered character sets may be found at http://www.iana.org/assignments/character-sets.

XML processors are not required by the XML 1.0 specification to support any more than UTF-8 and UTF-16, but most commonly support other encodings, such as US-ASCII and ISO-8859-1. Although many XML transactions are currently conducted in ASCII (or the ASCII subset of UTF-8), there is nothing to stop XML documents from containing, say, Korean text. You will, however, probably have to dig into the encoding support of your computing platform to find out if it is possible for you to use alternate encodings.
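
As an illustration of how the declared encoding drives decoding, this Python sketch parses the same text stored two ways: as ISO-8859-1 with an XML declaration saying so, and as UTF-8 with no declaration at all. The name element and its content are invented for the example, and support for ISO-8859-1 is a property of the parser used here rather than a guarantee of the specification.

```python
import xml.etree.ElementTree as ET

# Same text, two byte representations.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>\u00c1lvaro</name>'
).encode("iso-8859-1")
utf8_doc = "<name>\u00c1lvaro</name>".encode("utf-8")  # no declaration: UTF-8 assumed

from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text

print(from_latin1 == from_utf8, latin1_doc != utf8_doc)
```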

A.2.9 Validity

In addition to well-formedness, XML 1.0 offers another level of verification called validity. To explain why validity is important, let's take a simple example. Imagine you invented a simple XML format for your friends' telephone numbers:

<phonebook>
  <person>
    <name>Albert Smith</name>
    <number>123-456-7890</number>
  </person>
  <person>
    <name>Bertrand Jones</name>
    <number>456-123-9876</number>
  </person>
</phonebook>

Based on your format, you also construct a program to display and search your phone numbers. This program turns out to be so useful, you share it with your friends. However, your friends aren't so hot on detail as you are, and try to feed your program this phone book file:

<phonebook>
  <person>
    <name>Melanie Green</name>
    <phone>123-456-7893</phone>
  </person>
</phonebook>

Note that, although this file is perfectly well-formed, it doesn't fit the format you prescribed for the phone book, because there's a phone element where there should have been a number element. You will likely need to change your program to cope with this situation. If your friends had used number as you did to denote the phone number, there wouldn't have been a problem. However, as it is, this second file probably won't be usable by programs set up to work with the first file; from the program's perspective, it isn't valid.
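
Without a schema language, the phone book program would have to carry its own structural checks. The Python sketch below shows roughly what such hand-written checking might look like; phonebook_problems is a hypothetical function, not part of any library, and it tests only the element names and their order.

```python
import xml.etree.ElementTree as ET

def phonebook_problems(xml_text):
    """Return a list of complaints; an empty list means the format was followed."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "phonebook":
        problems.append(f"root element is {root.tag}, expected phonebook")
    for index, person in enumerate(root.findall("person"), start=1):
        tags = [child.tag for child in person]
        if tags != ["name", "number"]:
            problems.append(
                f"person {index}: expected name then number, found {tags}"
            )
    return problems

good = ("<phonebook><person><name>Albert Smith</name>"
        "<number>123-456-7890</number></person></phonebook>")
bad = ("<phonebook><person><name>Melanie Green</name>"
       "<phone>123-456-7893</phone></person></phonebook>")

print(phonebook_problems(good))
print(phonebook_problems(bad))
```

Checks like these have to be rewritten for every format and every program, which is the motivation for a declarative, machine-readable description of validity.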

For validity to be a useful general concept, we need a machine-readable way of saying what a valid document is; that is, which elements and attributes must be present and in what order. XML 1.0 achieves this by introducing document type definitions (DTDs). Office doesn't use DTDs, preferring to use W3C XML Schemas, as described in Appendix C.

A.2.9.1 Document Type Definitions (DTDs)

The purpose of a DTD is to express which elements and attributes are allowed in a certain document type and to constrain the order in which elements must appear within that document type. A DTD is generally composed of one file or a group of connected files, containing declarations defining element types, attribute lists, and entities. DTDs are explored in Appendix D.

A.2.9.2 Connecting DTDs to documents

Even if you don't work with DTDs, you should be aware of how DTDs are linked to XML documents. This is done with a document type declaration, <!DOCTYPE ...>, inserted at the beginning of the XML document, after the XML declaration in our fictitious example:

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE authors SYSTEM "http://example.com/authors.dtd">
<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>

This example assumes the DTD file has been placed on a web server at example.com. Note that the document type declaration specifies the root element of the document, not the DTD itself. You could use the same DTD to define person, name, or nationality as the root element of a valid document. Certain DTDs, such as the DocBook DTD for technical documentation (see http://www.docbook.org), use this feature to good effect, allowing you to use the same DTD while working with multiple document types.
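
To see what a parser extracts from the document type declaration, here is a short Python sketch using xml.dom.minidom. It reads only the declaration itself and performs no validation; the document is a shortened version of the example, kept to a single person element.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<!DOCTYPE authors SYSTEM "http://example.com/authors.dtd">'
    '<authors><person id="mysteryperson"/></authors>'
)

doctype_name = doc.doctype.name          # the root element named in the declaration
system_id = doc.doctype.systemId         # where the DTD is said to live
root_name = doc.documentElement.tagName  # the actual root element

print(doctype_name, system_id, root_name)
```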

A validating XML processor is obligated to check the input document against its DTD. If it does not validate, the document is rejected. To return to the phone book example, if your application validated its input files against a phone book DTD, you would have been spared the problems of debugging your program and correcting your friend's XML because your application would have rejected the document as being invalid. Office 2003 doesn't perform validation against DTDs; instead, it validates against XML Schemas.