XML Document Parts

An XML document contains different parts. It will always start with a prolog. The remainder of the XML document is contained within the document root or root element.

Document prolog

The document prolog appears at the top of an XML document and contains information about the XML document as a whole. It must appear before the root element in the document. The prolog is a bit like the <head> section of an HTML document. It can also include comments.

XML declaration

The prolog usually starts with an XML declaration, although this is optional. If you do include a declaration, it must be the first line of your XML document. The declaration tells software applications and humans that the content is an XML document:

 <?xml version="1.0"?>

The XML declaration includes an XML version, in this case 1.0. At the time of writing, the latest recommendation was XML 1.1. However, you should continue to use the version="1.0" attribute value for backward compatibility with XML processors. For example, adding a version 1.1 declaration causes an error when the XML document is opened in Microsoft Internet Explorer 6.

The XML declaration can also include the encoding and standalone attributes.

XML documents contain characters that follow the Unicode standard, maintained by the Unicode Consortium. You can find out more at www.unicode.org/.

Encoding determines the character set for the XML document. You can use the Unicode encodings UTF-8 and UTF-16 or ISO character sets like ISO 8859-1 (Latin-1, Western European). If no encoding attribute is included, it is assumed that the document uses UTF-8 encoding, or UTF-16 if the file starts with a byte order mark. Languages like Japanese and Chinese are often stored as UTF-16, although UTF-8 can represent them as well. Western European languages often use ISO 8859-1 to cope with the accents that aren't part of the English language.

The encoding attribute must appear after the version attribute:

 <?xml version="1.0" encoding="UTF-8"?>
 <?xml version="1.0" encoding="UTF-16"?>
 <?xml version="1.0" encoding="ISO-8859-1"?>
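To see how the encoding attribute relates to the bytes stored in the file, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The helper build_document and the sample names are assumptions made for this illustration only. The sketch writes the same document in each of the three encodings shown above and checks that the parser recovers the same text every time.

```python
import xml.etree.ElementTree as ET

def build_document(encoding, name="José"):
    """Return a tiny XML document as bytes (hypothetical helper for this sketch)."""
    text = f'<?xml version="1.0" encoding="{encoding}"?><name>{name}</name>'
    # The declared encoding must match the bytes that are actually written.
    return text.encode(encoding)

# Parse the same content stored in three different encodings.
names = {}
for encoding in ("UTF-8", "UTF-16", "ISO-8859-1"):
    names[encoding] = ET.fromstring(build_document(encoding)).text

# Japanese text can be stored as UTF-8 as well as UTF-16.
japanese = ET.fromstring(build_document("UTF-8", name="日本語")).text

print(names)
print(japanese)
```

The byte sequences differ between the three files, but the parsed text is identical because the declaration tells the parser how to decode them.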

The standalone attribute indicates whether the XML document uses external information, such as a Document Type Definition (DTD). A DTD specifies the rules about which elements and attributes to use in the XML document. It also provides information about the number of times each element can appear and whether an element is required or optional.

The standalone attribute is optional but must appear as the last attribute in the declaration. The value standalone="yes" can't be used when the document relies on declarations in an external DTD.

 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
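As a rough sketch of how software reads the declaration, the following Python code uses the XmlDeclHandler callback of the standard library's expat parser. The function read_declaration is a hypothetical helper written for this example. It returns the version, encoding, and standalone values, and the last call shows that the parser rejects a declaration whose attributes are in the wrong order.

```python
from xml.parsers import expat

def read_declaration(data):
    """Return (version, encoding, standalone) from the XML declaration, or None."""
    found = []

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (omitted).
        found.append((version, encoding, standalone))

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return found[0] if found else None

full = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><phoneBook/>'
)
minimal = read_declaration(b'<?xml version="1.0"?><phoneBook/>')

# Placing standalone before encoding is not well formed.
try:
    read_declaration(
        b'<?xml version="1.0" standalone="yes" encoding="UTF-8"?><phoneBook/>'
    )
    wrong_order_accepted = True
except expat.ExpatError:
    wrong_order_accepted = False

print(full, minimal, wrong_order_accepted)
```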

Processing instructions

The prolog can also include processing instructions (PI). These instructions pass information about the XML document to other applications.

Processing instructions start with <? and finish with ?>. The first item in a PI is a name, called the PI target. PI names that start with xml are reserved.

A common PI is the inclusion of an external XSLT style sheet. This PI must appear before the document root:

 <?xml-stylesheet type="text/xsl" href="listStyle.xsl"?>

Processing instructions can also appear in other places in the XML document.
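To illustrate how an application might pick up a processing instruction, here is a minimal Python sketch using xml.dom.minidom from the standard library. The empty <phoneBook/> root is an assumption added so that the document is complete. The code lists the target and data of each PI found in the prolog.

```python
from xml.dom import minidom

source = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="listStyle.xsl"?>'
    '<phoneBook/>'
)
document = minidom.parseString(source)

# Collect every processing instruction that is a direct child of the document.
# The XML declaration itself is not reported as a processing instruction.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

The parser hands over the data part as a single string, so it is up to the receiving application to interpret the type and href values.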

Document Type Definitions

Document Type Definitions (DTDs), or DOCTYPE declarations, appear in the prolog. These are rules about the elements and attributes within the XML document. A DTD provides information about which elements are legal in an XML document and tells you which elements are required and which are optional. In other words, a DTD provides the rules for a valid XML document.

The prolog can include a set of declarations about the XML document, a reference to an external DTD, or both. This code shows an external DTD reference:

 <?xml version="1.0"?>
 <!DOCTYPE phoneBook SYSTEM "phoneBook.dtd">
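The following Python sketch, again based on xml.dom.minidom, shows one way to inspect a DOCTYPE declaration. It assumes an empty <phoneBook/> root element to complete the document. Note that minidom only records the reference here. It does not load phoneBook.dtd or validate the document against it.

```python
from xml.dom import minidom

source = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE phoneBook SYSTEM "phoneBook.dtd">'
    '<phoneBook/>'
)
document = minidom.parseString(source)

# The DOCTYPE names the expected root element and points to the external DTD.
doctype = document.doctype
root_name = document.documentElement.tagName

print(doctype.name, doctype.systemId, root_name)
```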

We'll look at DTDs in more detail in Chapter 3.

Tree

Everything that isn't in the prolog is contained within the document tree. This includes the elements, attributes, and text in a hierarchical structure. The root node is the trunk of the tree. You call the child elements of the root node branches.

As we've seen, elements can include other elements or attributes. They can also contain text values or a mixture of both. HTML provides good examples of mixed content.

 <p>This is a paragraph element with an element <br/> inside</p>

This distinction becomes important when you use a schema to describe the structure of the document tree.
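As an illustration of mixed content, this Python sketch parses the paragraph above with xml.etree.ElementTree. The variable names are assumptions for this example. ElementTree keeps the text before the first child in the text property and the text that follows a child in that child's tail property.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    "<p>This is a paragraph element with an element <br/> inside</p>"
)

# Mixed content: text, then a child element, then more text.
leading_text = paragraph.text
child_tags = [child.tag for child in paragraph]
trailing_text = paragraph[0].tail

print(repr(leading_text), child_tags, repr(trailing_text))
```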

Document root

An XML document can have only one root element. All of the elements within an XML document are contained within this root element.

The root element can have any name at all, providing that it conforms to the standard element naming conventions. In HTML documents, you can think of the <html> tag as the root element.
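A short Python sketch can demonstrate the single-root rule. The helper is_well_formed and the two sample strings are assumptions made for this example. The parser accepts a document with one root element and rejects one that has two top-level elements.

```python
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

one_root = "<phoneBook><contact/><contact/></phoneBook>"
two_roots = "<contact/><contact/>"

print(is_well_formed(one_root), is_well_formed(two_roots))
```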

White space

XML documents include white space so that humans can read them more easily. White space refers to spaces, tabs, and returns that space out the content in the document. The XML specification allows you to include white space anywhere within an XML document except before the XML declaration.

XML processors do take notice of white space in a document, but many won't display the spaces. For example, Internet Explorer won't display more than one space at a time when it displays an XML or XHTML document.

If white space is important, maybe for poetry or a screenplay, you can use the xml:space attribute in an element. There are two possible values for this attribute: default and preserve. Choosing the default value is the same as leaving out the attribute.

You can add the xml:space="preserve" attribute to the root node of a document to preserve all space within the document tree:

 <phoneBook xml:space="preserve">
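The Python sketch below, using xml.etree.ElementTree, shows that a parser passes white space through to the application and that xml:space is simply an attribute the application can read. The <name> child and its padded text are assumptions for this example. The last part checks that white space before the XML declaration is rejected.

```python
import xml.etree.ElementTree as ET

# The xml prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

source = '<phoneBook xml:space="preserve"><name>  Sas   Jacobs  </name></phoneBook>'
root = ET.fromstring(source)

space_setting = root.get(f"{{{XML_NS}}}space")
raw_name = root.find("name").text

# An application that ignores xml:space might collapse the spaces for display.
collapsed_name = " ".join(raw_name.split())

# White space is not allowed before the XML declaration.
try:
    ET.fromstring(' <?xml version="1.0"?><phoneBook/>')
    leading_space_allowed = True
except ET.ParseError:
    leading_space_allowed = False

print(space_setting, repr(raw_name), repr(collapsed_name), leading_space_allowed)
```

The parser itself does not change its behavior because of xml:space. The attribute is a signal to whatever application processes the tree afterward.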

Namespaces

XML documents can get very complicated. One XML document can reference another XML document, and different rules may apply for each. When this happens, it's possible that two different XML documents will use the same element names.

To overcome this problem, we use namespaces. Namespaces associate XML elements with an owner. A namespace ensures that each element can be identified uniquely within a document, even if elements from another source use the same name.

You can find out more about namespaces by reading the latest recommendation at the W3C website. At the time of writing, this was the Namespaces in XML 1.1 recommendation at www.w3.org/TR/2004/REC-xml-names11-20040204/.

It isn't compulsory to use namespaces in your XML documents, but it can be a good idea. Namespaces are also useful when you start to work with schemas and style sheets. We'll look at some examples of schemas and style sheets in the next chapter.

Each namespace includes a reference to a Uniform Resource Identifier (URI). A URI is an Internet address, and each URI must be unique in the XML document. The URIs used in an XML document don't have to point to anything, although they often will.

You can define a namespace using the xmlns attribute within an element. Each namespace usually has a prefix that you use to identify elements belonging to that namespace. You can't start your prefixes with xml, and they can't include spaces.

 <FOE:fullName xmlns:FOE="http://www.friendsofed.com/">
   Sas Jacobs
 </FOE:fullName>

In the preceding element, the FOE prefix refers to the namespace http://www.friendsofed.com/. I've prefixed the element <fullName> with FOE, and I can use it with other elements and attributes.

 <FOE:address>
   123 Some Street, Some City, Some Country
 </FOE:address>

I'll then be able to tell that the <address> element also comes from the http://www.friendsofed.com/ namespace.
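Here is a small Python sketch of how a parser treats prefixed names, using xml.etree.ElementTree. The <FOE:contact> wrapper element is an assumption added so that both elements share one namespace declaration. ElementTree replaces each prefix with the namespace URI in braces, which suggests that the URI, not the prefix, is what identifies the namespace.

```python
import xml.etree.ElementTree as ET

source = """
<FOE:contact xmlns:FOE="http://www.friendsofed.com/">
  <FOE:fullName>Sas Jacobs</FOE:fullName>
  <FOE:address>123 Some Street, Some City, Some Country</FOE:address>
</FOE:contact>
"""
root = ET.fromstring(source)

# Each tag is reported as {namespace URI}localName.
tags = [child.tag for child in root]

# When searching, any prefix can be mapped to the URI; "f" is arbitrary.
address = root.find("f:address", {"f": "http://www.friendsofed.com/"}).text

print(tags)
print(address)
```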

You can also define a namespace without using a prefix. If you do this, the namespace will apply to all elements that don't have a prefix or namespace defined.

The following listing shows how to use a namespace with no prefix in an XML element:

 <contact id="1" xmlns="http://www.friendsofed.com/">
   <name>Sas Jacobs</name>
   <address>123 Some Street, Some City, Some Country</address>
   <phone>123 456</phone>
 </contact>

The namespace applies to all the child elements of the <contact> element, so the <name>, <address>, and <phone> elements will use the default namespace http://www.friendsofed.com/.
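The default namespace can be checked with a short Python sketch based on xml.etree.ElementTree. The constant FOE_URI and the variable names are assumptions for this example. The sketch confirms that every child element picks up the default namespace, while the unprefixed id attribute does not.

```python
import xml.etree.ElementTree as ET

FOE_URI = "http://www.friendsofed.com/"

source = """
<contact id="1" xmlns="http://www.friendsofed.com/">
  <name>Sas Jacobs</name>
  <address>123 Some Street, Some City, Some Country</address>
  <phone>123 456</phone>
</contact>
"""
contact = ET.fromstring(source)

# Every child element inherits the default namespace from <contact>.
all_namespaced = all(child.tag.startswith(f"{{{FOE_URI}}}") for child in contact)
local_names = [child.tag.replace(f"{{{FOE_URI}}}", "") for child in contact]

# A default namespace does not apply to unprefixed attributes.
attribute_names = list(contact.attrib)

# Searching needs the qualified name; the bare name finds nothing.
phone = contact.find(f"{{{FOE_URI}}}phone").text
unqualified = contact.find("phone")

print(all_namespaced, local_names, attribute_names, phone, unqualified)
```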

Namespaces will become clearer when we start working with schemas and style sheets in Chapter 3.