Chapter 2. Creating Well-Formed XML Documents

CONTENTS

The World Wide Web Consortium
What Is a Well-Formed XML Document?
Markup and Character Data
The Prolog
The XML Declaration
Comments
Processing Instructions
Tags and Elements
The Root Element
Attributes
Building Well-Formed Document Structure
CDATA Sections
XML Namespaces
Infosets
Canonical XML

In the previous chapter, we got our start in XML with an overview of how XML lets you structure your own documents, what XML is all about, and what uses you can make of it. It's now time to take a look at XML in more depth and sharpen our XML understanding until it's crystal clear.

In HTML, about 100 elements already are defined. Browsers can check the HTML on a Web page and display that page as they see fit. In XML, you have more freedom and, thus, more responsibility. In XML, you define your own elements, and it's up to you to decide how they should be used. Despite their apparently free-form nature, however, XML documents are subject to a number of rules that allow them to be handled in a useful and reproducible way.

In fact, the rules to which XML documents are subject are significantly more stringent than the rules for HTML documents. As mentioned in Chapter 1, "Essential XML," if an XML processor cannot successfully understand an XML document, the processor is not supposed to make any guesses about the structure of the document at all; it's just supposed to quit, possibly returning an error.
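To see that strictness in action, here's a minimal sketch that assumes Python and its standard xml.etree.ElementTree module (my choice for illustration; the XML recommendation does not prescribe any particular tool). The helper name where_parsing_stopped is hypothetical. The parser makes no attempt to repair the overlapping tags; it just reports where it gave up:

```python
import xml.etree.ElementTree as ET

def where_parsing_stopped(text):
    """Return None if text parses, else the (line, column) where the parser quit."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        # No guessing and no repair: the parser only reports the failure position.
        return err.position

good = "<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>"
bad = "<DOCUMENT><GREETING>Hello From XML</DOCUMENT></GREETING>"  # tags overlap

print(where_parsing_stopped(good))
print(where_parsing_stopped(bad))
```

An HTML browser would typically display something for the second string; an XML parser refuses it outright.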

As we also saw in Chapter 1, XML documents are subject to two specific constraints: well-formedness and validity. As far as the World Wide Web Consortium (W3C) is concerned, well-formedness is the more basic constraint. In the XML 1.0 specification itself, which represents the foundation of this chapter and Chapter 3, "Valid XML Documents: Creating Document Type Definitions," the W3C says that you can't even call a data object an XML document unless it's well-formed:

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

Why is it so important that XML documents be well-formed? Why does the W3C specify that XML processors should not attempt to fix documents that are not well-formed?

The reason that the W3C makes this stipulation is mainly to stop XML processors from doing the same thing that HTML browsers have done to HTML: By trying to fix things, the major browsers have introduced their own versions of HTML that authors now rely on. The result is that many versions of HTML currently exist.

In this chapter, we'll see what makes an XML document well-formed, which is the minimal requirement that a data object must satisfy to be an XML document. The second constraint that you can require of XML documents is that they be valid, which means that they must obey the document type definition (DTD) or schema that you use to specify the legal syntax of the document. This chapter is all about what makes XML documents well-formed. Chapter 3 is all about what makes them valid.

Now that we're taking a look at how to build XML documents in a formal way, I'll start from the beginning so that we build a complete and solid foundation. That means starting with the W3C itself.

The World Wide Web Consortium

We already know that the W3C is the body responsible for defining exactly what XML is, but who is the W3C? The W3C is not a government body; instead, it's a group made up of member organizations (currently more than 400) that have an interest in the World Wide Web. The W3C is hosted by the Massachusetts Institute of Technology, Laboratory for Computer Science (MIT/LCS), in the United States; the Institut National de Recherche en Informatique et en Automatique (INRIA), in Europe; and the Keio University Shonan Fujisawa Campus, in Japan. Currently, it has about 50 full-time staff members.

How does W3C set up specifications for the Web? It does so by publishing those standards in HTML (and, recently, in XHTML) form at its Web site, http://www.w3.org. These specifications are given three different levels:

Notes. These are specifications that usually are submitted to the W3C by a member organization and that the W3C is making public, although not necessarily endorsing. For example, the note submitted by Microsoft to W3C on Vector Markup Language (VML) is at http://www.w3.org/TR/NOTE-VML.
Working drafts. A working draft is a specification that is under consideration and open to comment. It's inappropriate to refer to such works as standards or as anything other than working drafts. For example, the working draft for XHTML 1.1 is at http://www.w3.org/TR/xhtml11/.
Recommendations. Working drafts that the W3C has accepted become recommendations. The W3C uses the term recommendations when it publishes its standards (because the W3C is not a government body, it does not use the term standard). For example, the XML 1.0 recommendation is at http://www.w3.org/TR/REC-xml.

Besides these official specification levels, W3C also has candidate recommendations, which are working drafts that have been proposed but not yet accepted as recommendations, and companion recommendations, which augment recommendations. In fact, there are plenty of companion recommendations for XML (such as schemas, XLinks, XPointers, and so on), and you'll find a good list of them at http://www.w3c.org/xml.

The recommendation for XML 1.0, which defines XML, is at http://www.w3.org/TR/REC-xml; you'll also find it in Appendix A, "The XML 1.0 Specification." This specification is the most important one as far as this book is concerned. Together with the associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), this recommendation gives you all you need to understand XML Version 1.0 and create XML documents. Now it's time to put this recommendation to work, creating well-formed XML documents.

What Is a Well-Formed XML Document?

The W3C, which is responsible for the term well-formedness, defines it this way in the XML 1.0 recommendation (I'll take a look at each of these stipulations later):

A textual object is a well-formed XML document if:

Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification (that is, http://www.w3.org/TR/REC-xml ).
Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.

W3C calls the individual specifications within a working draft or a recommendation productions. In this case, to be well-formed, a document must follow the "document" production, which means that the document itself must have three parts: a prolog (which can be empty), a root element, and an optional miscellaneous part.

The prolog, which I'll talk about in a few pages, can and should include an XML declaration (such as <?xml version = "1.0"?>) and an optional miscellaneous part that includes comments, processing instructions, and so on.

The root element of the document can itself hold other elements; in fact, it's hard to imagine useful XML documents in which the root element does not contain other elements. Note that each well-formed XML document must have exactly one root element, and all other elements in the document must be enclosed in the root element (this does not apply to the parts of the prolog, of course, because items such as processing instructions and comments are not considered elements).
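The one-root rule is easy to check for yourself. Here's a small sketch, again assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical. Comments before and after the root element are fine, but a second top-level element or stray text after the root is not:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

one_root = (
    '<?xml version="1.0"?><!--prolog comment-->'
    "<DOCUMENT><CUSTOMER/></DOCUMENT><!--trailing comment-->"
)
two_roots = '<?xml version="1.0"?><DOCUMENT/><DOCUMENT/>'
stray_text = '<?xml version="1.0"?><DOCUMENT/>some text outside the root'

for sample in (one_root, two_roots, stray_text):
    print(is_well_formed(sample))
```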

The optional miscellaneous part can be made up of XML comments, processing instructions, and whitespace (including spaces, tabs, and so on). I'll take a look at the prolog, the root element, and the miscellaneous part later in this chapter.

The next stipulation in the list says that to be well-formed, XML documents must also satisfy the well-formedness constraints listed in the XML 1.0 specification. This means that XML documents must adhere to the syntax rules specified in the XML 1.0 recommendation. I'll talk about those rules in this chapter, including the naming rules that you should follow when naming tags, how to nest elements, and so on.

Well-Formedness Constraint

If you search the XML 1.0 specification, which also appears in Appendix A, you'll see that all constraints that you must satisfy to create a well-formed document are marked with the words "Well-Formedness Constraint."

Finally, the last stipulation in the W3C well-formed document list is that each parsed entity must itself be well-formed. What does that mean?

The parts of an XML document are called entities. An entity is a part of a document that can hold text or binary data (but not both). An entity may refer to other entities and thus cause them to be included in the document. Entities can be either parsed (character data) or unparsed (character data that can include non-XML text or binary data that the XML processor does not parse). In other words, the term entity is just a generic way of referring to a data storage unit in XML. For example, a file with a few XML elements in it is an entity, but it's not a document unless it's also well-formed.

This stipulation about parsed entities means that if you refer to an entity and include that entity's data (which can include data from external sources) in your document, the included data also must be well-formed.
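As a rough illustration, the following sketch declares an internal parsed entity in a small DTD (DTDs are covered in Chapter 3) and refers to it from the document. It assumes Python's standard xml.etree.ElementTree module, which expands internal entities but does not load external ones; the entity name item is hypothetical. When the entity's replacement text is balanced markup, the document parses; when it opens an element and never closes it, the parser rejects the whole document:

```python
import xml.etree.ElementTree as ET

# The entity's replacement text is itself well-formed.
balanced = """<!DOCTYPE DOCUMENT [
<!ENTITY item "<PRODUCT>Tomatoes</PRODUCT>">
]>
<DOCUMENT>&item;</DOCUMENT>"""

# The same entity, but its replacement text leaves <PRODUCT> unclosed.
unbalanced = """<!DOCTYPE DOCUMENT [
<!ENTITY item "<PRODUCT>Tomatoes">
]>
<DOCUMENT>&item;</DOCUMENT>"""

root = ET.fromstring(balanced)
product_text = root.find("PRODUCT").text

try:
    ET.fromstring(unbalanced)
    unbalanced_accepted = True
except ET.ParseError:
    unbalanced_accepted = False

print(product_text, unbalanced_accepted)
```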

That's the W3C's definition of a well-formed document, but it's far from clear at this point. What are the well-formedness constraints that we need to follow? What exactly can be in a prolog? To answer questions like these, the rest of this chapter examines what these constraints mean in detail.

I'll start by looking at an XML document that we can refer to throughout the chapter as we examine what it means for a document to be well-formed. In this case, I'll store customer data for specific purchases in a document called order.xml. I'll start with the XML declaration itself:

<?xml version = "1.0"?>

Here, I'm using the <?xml?> declaration to indicate that this document is written in XML, and I'm specifying the only version possible at this time, version 1.0. Because all the documents in this chapter are self-contained (that is, they don't refer to or include any external entities), I can also use the standalone attribute, setting it to yes like this:

<?xml version = "1.0" standalone="yes"?>

This attribute, which may or may not be used by an XML parser, indicates that the document is completely self-contained. Technically, XML documents do not need to start with the XML declaration, but W3C recommends it.

Next, I add the root element, which I'll call <DOCUMENT> in this case (although you can use any name):

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    .
    .
    .
</DOCUMENT>

The root element can contain other elements, of course. Here, I add elements for three customers to the document:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

For each customer, I store a name in a <NAME> element, which itself encloses a <LAST_NAME> and a <FIRST_NAME> element, like this:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

I can also store the details of customer orders with a new element, <DATE>, and an element named <ORDERS> like this:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
    .
    .
    .
        </ORDERS>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

I also can record each item that a customer bought with an <ITEM> element, which itself is broken up into <PRODUCT>, <NUMBER>, and <PRICE> elements:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

That's what the data looks like for one customer; here's the full document, including data for all three customers:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Jones</LAST_NAME>
            <FIRST_NAME>Polly</FIRST_NAME>
        </NAME>
        <DATE>October 20, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Bread</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$14.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Apples</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$1.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$2.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$11.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

Documents like this can grow very long and can consist of markup that is many levels deep. Handling such documents is not a problem for XML parsers, however, as long as the document is well-formed (and, if the parser is a validating parser, valid). In this chapter, I'll refer back to this document, modifying it and taking a look at its parts as we see what makes a document well-formed.
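To give a feel for how a parser walks nested markup like this, here's a minimal sketch assuming Python's standard xml.etree.ElementTree module. To keep it short, the ORDER_XML string holds only the first customer from order.xml; the variable names are my own:

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of order.xml: just the first customer.
ORDER_XML = """<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>"""

root = ET.fromstring(ORDER_XML)

# Collect (customer name, list of products) for every <CUSTOMER> element.
summary = []
for customer in root.findall("CUSTOMER"):
    first = customer.findtext("NAME/FIRST_NAME")
    last = customer.findtext("NAME/LAST_NAME")
    products = [item.findtext("PRODUCT") for item in customer.iter("ITEM")]
    summary.append((f"{first} {last}", products))

print(root.tag, summary)
```

The depth of the nesting makes no difference to the parser; the same loop works unchanged on the full three-customer document.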

We're ready now to take XML documents apart, piece by piece. I'll start with the basics and work up through the prolog, root element, enclosed elements, and so on. We're going to see it all in this chapter.

At their most basic level, then, XML documents are combinations of markup and character data. We'll start from that point.

Markup and Character Data

XML documents are made up of markup and character data. Binary data might contribute to XML documents some day, but there is no provision for enclosing binary data in a document made up of markup and character data yet; until there is, you refer to external binary data with entity references, as we'll see.

The markup in a document gives it its structure. Markup includes start tags, end tags, empty element tags, entity references, character references, comments, CDATA section delimiters (we'll see more about CDATA sections in a few pages), document type declarations, and processing instructions. So what's the character data in an XML document? All the text in a document that is not markup is character data.

Here's a quick example using markup and character data that we've already seen:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

Tags begin with < and end with >, so it's easy to see that the markup here consists of tags, such as <?xml version="1.0" encoding="UTF-8"?>, <DOCUMENT>, and so on. The text Hello From XML and Welcome to the wild and woolly world of XML is the character data.
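One way to watch a parser separate markup from character data is to register callbacks for each. The sketch below assumes Python's standard xml.parsers.expat module; the variable names are my own. Start tags arrive through one handler and character data through another (possibly in several chunks, which is why the pieces are joined afterward):

```python
import xml.parsers.expat

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>"""

start_tags = []   # names reported for start tags (markup)
chunks = []       # pieces of character data, including whitespace

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: start_tags.append(name)
parser.CharacterDataHandler = chunks.append
parser.Parse(DOC, True)

# Join the chunks and collapse runs of whitespace for display.
character_data = " ".join("".join(chunks).split())

print(start_tags)
print(character_data)
```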

However, markup does not need to begin and end with < and >. Markup also can start with & and end with ; in the case of general entity references (an entity reference is replaced by the entity it refers to when it's parsed) or can start with % and end with ; in the case of parameter entity references, which are used in DTDs as we'll see in Chapter 3. Using entity references, some of the markup in a document can become character data when you process that document. For example, the markup &gt; is a general entity reference that is turned into a > when parsed, and the markup &lt; is turned into a < when parsed. Here's an example:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        This text is inside the &lt;GREETING&gt; element
    </GREETING>
</DOCUMENT>

You can see this XML document in Internet Explorer in Figure 2.1, where you see that the markup &gt; was turned into a >, and that the markup &lt; was turned into a <.

Figure 2.1. Using markup in Internet Explorer.


Because some markup can turn into character data when parsed, the character data that results after everything has been parsed and markup that should be replaced by character data has been replaced has a special name: parsed character data.
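Here's a quick sketch of that replacement, assuming Python's standard xml.etree.ElementTree module. The entity references in the source text come back from the parser as plain < and > characters, and no child element is created:

```python
import xml.etree.ElementTree as ET

greeting = ET.fromstring(
    "<GREETING>This text is inside the &lt;GREETING&gt; element</GREETING>"
)

# The parsed character data: entity references have been replaced.
parsed_text = greeting.text
child_count = len(greeting)

print(parsed_text)
print(child_count)
```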

Whitespace

If you're ever concerned about exactly what characters are legal in XML documents, you'll find them listed in the XML 1.0 specification under the production named Char. It's worth noting that spaces, carriage returns, line feeds, and tabs are all treated as whitespace in XML. Take a look at this document:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
<GREETING>
Hello From XML
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>

Practically speaking, that document is equivalent to this one:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT><GREETING>Hello From XML</GREETING>
<MESSAGE>Welcome to the wild and woolly world of XML.</MESSAGE></DOCUMENT>

It's also worth noting that the XML recommendation specifies that XML documents use the UNIX convention for line endings, which means that lines are ended with a linefeed character only (ASCII code 10). In DOS files, lines are ended with carriage-return linefeed pairs (ASCII codes 13 and 10), but when parsed, that's treated simply as a single linefeed (ASCII code 10).
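Both points can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree module; the helper name greeting_text is hypothetical. The parser hands back a single linefeed for a DOS-style line ending, and the spread-out and compact forms of a document carry the same text once the application trims the surrounding whitespace (the trimming is the application's decision, not the parser's):

```python
import xml.etree.ElementTree as ET

# Line endings: a carriage-return linefeed pair is reported as one linefeed.
dos_text = ET.fromstring("<GREETING>Hello\r\nFrom XML</GREETING>").text
unix_text = ET.fromstring("<GREETING>Hello\nFrom XML</GREETING>").text

# Whitespace: the same content laid out two different ways.
spread_out = "<DOCUMENT>\n<GREETING>\nHello From XML\n</GREETING>\n</DOCUMENT>"
compact = "<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>"

def greeting_text(doc):
    """Return the GREETING text with leading and trailing whitespace trimmed."""
    return ET.fromstring(doc).findtext("GREETING").strip()

print(repr(dos_text), repr(unix_text))
print(greeting_text(spread_out) == greeting_text(compact))
```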

Handling of Whitespace

You can use the special attribute xml:space in an element to indicate that whitespace should be preserved by applications within that element (if you use xml:space in documents with a DTD, you also must declare it before using it). You can set this attribute to "default" to indicate that the default handling of whitespace is fine, or you can set this attribute to "preserve" to indicate that you want all applications to preserve whitespace as it is in the document.
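As a sketch of how an application might honor this attribute, the following assumes Python's standard xml.etree.ElementTree module, which reports xml:space under the reserved XML namespace name. The <POEM> element and the collapsing rule used for the non-preserve case are my own illustration, not something the recommendation defines:

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace name by definition.
XML_NS = "http://www.w3.org/XML/1998/namespace"

poem = ET.fromstring('<POEM xml:space="preserve">  roses\n    are red  </POEM>')
space_setting = poem.get(f"{{{XML_NS}}}space")

# Application-level decision: keep the text as is, or collapse whitespace runs.
if space_setting == "preserve":
    text = poem.text
else:
    text = " ".join(poem.text.split())

print(space_setting, repr(text))
```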

That gets us started with what can go into XML documents: markup and character data. It's now time to move to the next step up and begin working on the actual structure of XML documents, starting with the prolog.

The Prolog

Prologs come at the very beginning of XML documents. XML documents actually do not need prologs to be considered well-formed. However, the W3C recommends that you include at least the XML declaration, which indicates the version of XML, in the document's prolog. In general, prologs can contain XML declarations, comments, processing instructions, whitespace, and document type declaration(s).

Here's an example. In this document, the prolog is everything that comes before the <DOCUMENT> start tag: an XML declaration, a processing instruction, and a DTD (which we'll see more about in Chapter 3):

<?xml version = "1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (CUSTOMER)*>
<!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
<!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
<!ELEMENT LAST_NAME (#PCDATA)>
<!ELEMENT FIRST_NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT ORDERS (ITEM)*>
<!ELEMENT ITEM (PRODUCT,NUMBER,PRICE)>
<!ELEMENT PRODUCT (#PCDATA)>
<!ELEMENT NUMBER (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
]>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
    .
    .
    .

Each part of the prolog bears a closer look, and I'll examine them here (except for document type declarations, which we'll explore in Chapter 3).

The XML Declaration

An XML document can (and should, according to W3C) start with an XML declaration, which indicates that the document is written in XML. If you use an XML declaration, it must be the very first thing in the document; nothing, not even whitespace or a comment, may come before it. If you use more than one attribute in the declaration, they must appear in the order version, encoding, standalone. Here's an example:

<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>

The XML declaration uses the <?xml?> element. In earlier drafts of XML, that was <?XML?>, but it was made lowercase in the final recommendation; it's an error to use uppercase. (You'll still find applications out there that insist on the original uppercase version, however. Browsers such as Internet Explorer accept either version and are thus not fully compliant with the W3C recommendation.)

You can use three possible attributes in the XML declaration:

version. This is the XML version; currently, only 1.0 is possible here. This attribute is required if you use an XML declaration.
encoding. This is the language encoding for the document. As discussed in Chapter 1, the default here is UTF-8. You also can use Unicode, UCS-2 or UCS-4, and many other character sets, such as ISO character sets. This attribute is optional.
standalone. Set this to "yes" if the document does not refer to any external entities; otherwise, use "no." This attribute is optional.
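If you want to see what a parser makes of these three attributes, here's a minimal sketch assuming Python's standard xml.parsers.expat module, whose XmlDeclHandler callback receives the version, the encoding, and the standalone setting (reported as 1 for "yes", 0 for "no", and -1 when omitted). The helper name read_declaration is hypothetical. The sketch also shows that the XML 1.0 grammar expects the attributes in the order version, encoding, standalone:

```python
import xml.parsers.expat

def read_declaration(data):
    """Return (version, encoding, standalone) from the XML declaration, if any."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        seen["decl"] = (version, encoding, standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return seen.get("decl")

in_order = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><DOCUMENT/>'
out_of_order = b'<?xml version="1.0" standalone="yes" encoding="UTF-8"?><DOCUMENT/>'

declaration = read_declaration(in_order)

try:
    read_declaration(out_of_order)
    out_of_order_accepted = True
except xml.parsers.expat.ExpatError:
    out_of_order_accepted = False

print(declaration, out_of_order_accepted)
```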

Comments

XML comments are very much like HTML comments. You can use comments to include explanatory notes in your document that are ignored by XML parsers; comments may appear anywhere in a document outside other markup. As in HTML, you start a comment with <!-- and end it with -->. Here's an example:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <!--Start the document off with a greeting.-->
    <GREETING>
    <!--Here's the greeting's text.-->
        Hello from XML!
    </GREETING>
</DOCUMENT>

You should follow a few rules when adding comments to an XML document. Comments must not come before an XML declaration; for example, this is incorrect:

<!--Here's my document.-->
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello from XML!
    </GREETING>
</DOCUMENT>

You also can't put a comment inside markup, like this:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT <!--Start the document-->>
    <GREETING>
        Hello from XML!
    </GREETING>
</DOCUMENT>

In addition, you cannot use -- inside a comment because XML parsers look for that string inside a comment to indicate the end of the comment. For example, this is incorrect:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <!--Start the document off--politely--with a greeting.-->
    <GREETING>
        Hello from XML!
    </GREETING>
</DOCUMENT>

You can use comments to remove parts of documents as long as the enclosed parts do not themselves contain any comments. For example, here I'm commenting out the <MESSAGE> element:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
<!--
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
-->
</DOCUMENT>

Here's how a parser treats this document:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
</DOCUMENT>
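The comment rules in this section can be tried out with a short sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses is hypothetical. Only the first sample follows the rules; the other three break them in the ways just described:

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = '<?xml version="1.0"?><DOCUMENT><!--Start with a greeting.--><GREETING/></DOCUMENT>'
before_declaration = '<!--Here is my document.--><?xml version="1.0"?><DOCUMENT/>'
inside_tag = "<DOCUMENT <!--Start the document-->></DOCUMENT>"
double_hyphen = "<DOCUMENT><!--Start--politely--here.--></DOCUMENT>"

for sample in (ok, before_declaration, inside_tag, double_hyphen):
    print(parses(sample))
```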

Processing Instructions

As their name implies, processing instructions are instructions to the XML processor. These instructions start with <? and end with ?>. The only restriction here is that you can't use <?xml?> (or <?XML?>, which is also reserved). Processing instructions must be understood by the XML processor, so they're processor-dependent, not built into the XML recommendation.

A very common and well-understood processing instruction (although like other processing instructions, not a part of the XML 1.0 recommendation) is <?xml-stylesheet?>, which connects a style sheet with the document. Here's an example:

<?xml version = "1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

XML processors, such as Internet Explorer 5 or Netscape Navigator 6, understand <?xml-stylesheet?> already.
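A general-purpose parser that doesn't act on a processing instruction still passes it along to the application. Here's a minimal sketch assuming Python's standard xml.dom.minidom module; the variable names are my own. Note that the parser reports the instruction's target and its data as plain text, so it is up to the application to interpret type and href:

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0" standalone="yes"?>'
    '<?xml-stylesheet type="text/css" href="greeting.css"?>'
    "<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>"
)

# Collect (target, data) for each processing instruction in the prolog.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

The XML declaration does not show up in this list, because it is not a processing instruction even though it looks like one.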

I've now taken a look at everything that a prolog can contain (except DTDs): XML declarations, comments, processing instructions, and whitespace. It's time to take a look at the actual structure of an XML document, as created with tags and elements.

Tags and Elements

You give structure to an XML document using markup, which consists of elements. In turn, an XML element consists of a start tag and an end tag, except in the case of elements that are defined to be empty, which consist of only one tag.

A start tag (also called an opening tag) starts with < and ends with >. End tags (also called closing tags) begin with </ and end with >.

Tag Names

The XML specification is very specific about tag names; you can start a tag name with a letter, an underscore, or a colon. The next characters may be letters, digits, underscores, hyphens, periods, and colons (but no whitespace).

Avoid Colons in Tag Names

Although the XML 1.0 recommendation does not say so, you should definitely avoid using colons in tag names because you use a colon when specifying namespaces in XML, as I'll discuss later in this chapter.

Here are some allowable XML tags:

<DOCUMENT>

<document>

<_Record>

<customer>

<PRODUCT>

Note that because XML processors are case-sensitive, the <DOCUMENT> tag is not the same as a <document> tag. (In fact, you can even have <DOCUMENT> and <document> and even <DoCuMeNt> as different tags in the same document, but I really recommend against it.)

Here are the corresponding closing tags:

</DOCUMENT>

</document>

</_Record>

</customer>

</PRODUCT>

These are some tags that XML considers illegal:

<2001DOCUMENT>

<.document>

<Record Number>

<customer*name>

<PRODUCT(ID)>
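You can let a parser confirm these lists. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper names parses and is_legal_tag_name are hypothetical. It tries each name as the tag of an empty element and also checks that a start tag and end tag differing only in case do not match:

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def is_legal_tag_name(name):
    """Try name as the tag of a lone empty element."""
    return parses(f"<{name}/>")

legal = ["DOCUMENT", "document", "_Record", "customer", "PRODUCT"]
illegal = ["2001DOCUMENT", ".document", "Record Number", "customer*name", "PRODUCT(ID)"]

print([is_legal_tag_name(name) for name in legal])
print([is_legal_tag_name(name) for name in illegal])

# Tag names are case-sensitive, so these two tags do not pair up.
case_mismatch_accepted = parses("<DOCUMENT></document>")
print(case_mismatch_accepted)
```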

Using start and end tags, you can create elements, as in this example, which has three elements, the <DOCUMENT>, <GREETING>, and <MESSAGE> elements; the <DOCUMENT> element contains the <GREETING> and <MESSAGE> elements:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

You also can create elements without using end tags if the elements are explicitly declared to be empty.

Empty Elements

Empty elements have only one tag, not a start and end tag. You may be familiar with empty elements from HTML; for example, the HTML <IMG>, <LI>, <HR>, and <BR> elements are empty, which is to say that they do not enclose any content (either character data or markup).

Empty elements are represented with only one tag (in HTML, there are no closing </IMG>, </LI>, </HR>, and </BR> tags). In XML, you can declare elements to be empty in the document's DTD, as we'll see in Chapter 3.

In XML, you close an empty element with />. For example, if the <GREETING> element is empty, it might look like this in an XML document:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING TEXT = "Hello From XML" />
</DOCUMENT>

This usage might seem a little strange at first, but this is XML's way of making sure that an XML processor isn't left searching for a nonexistent closing tag. In fact, in XHTML, which is a derivation of HTML in XML, the <IMG>, <LI>, <HR>, and <BR> tags are actually used as <IMG />, <LI />, <HR />, and <BR /> (except that XHTML tags use lowercase letters). The additional / doesn't seem to give the major browsers any trouble. We'll see how to declare empty tags in Chapter 3.
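Here's a brief sketch of how an empty element looks to a parser, assuming Python's standard xml.etree.ElementTree module. It also shows a point the text above doesn't spell out but that I'm confident of: a start tag followed immediately by its end tag describes the same empty element as the /> form:

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<GREETING TEXT = "Hello From XML" />')
long_form = ET.fromstring('<GREETING TEXT = "Hello From XML"></GREETING>')

# An empty element has attributes but no text and no child elements.
print(short_form.get("TEXT"))
print(short_form.text, len(short_form))
print(short_form.attrib == long_form.attrib)
```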

The Root Element

Each well-formed XML document must contain one element that contains all the other elements. This containing element is called the root element. The root element is a very important part of XML documents, especially when you look at them from a programming point of view, because you parse XML documents starting with the root element. In order.xml, developed at the start of this chapter, the root element is the <DOCUMENT> element (although you can give the root element any name):

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            .
            .
            .
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$11.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

Attributes

Attributes in XML are much like attributes in HTML; they're name-value pairs that let you specify additional data in start and empty tags. To assign a value to an attribute, you use an equal sign.

For example, I'm assigning values to the STATUS attribute of the <CUSTOMER> elements in this XML to indicate the status of a customer's credit:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER STATUS="Good credit">
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER STATUS="Lousy credit">
        <NAME>
            <LAST_NAME>Jones</LAST_NAME>
            <FIRST_NAME>Polly</FIRST_NAME>
        </NAME>
        <DATE>October 20, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Bread</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$14.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Apples</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$1.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER STATUS="Good credit">
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$2.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$11.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

You can see this XML document in Internet Explorer, including the attributes and their values, in Figure 2.2.

Figure 2.2. Using attributes in Internet Explorer.

An XML processor can read the attributes and their values, and you can put that data to work in your own applications. We'll see how to read attribute values in both JavaScript and Java in this book.

There has been a lot of debate over when you should store data using attributes and when you should store data using elements. There simply is no hard-and-fast rule, but here are a couple of guidelines that I find useful.

First, too many attributes definitely make documents hard to read. For example, take a look at this element:

<CUSTOMER>
    <NAME>
        <LAST_NAME>Smith</LAST_NAME>
        <FIRST_NAME>Sam</FIRST_NAME>
    </NAME>
    <DATE>October 15, 2001</DATE>
    <ORDERS>
        <ITEM>
            <PRODUCT>Tomatoes</PRODUCT>
            <NUMBER>8</NUMBER>
            <PRICE>$1.25</PRICE>
        </ITEM>
    </ORDERS>
</CUSTOMER>

It's fairly clear what's going on here, even if it is a little involved. However, if you try to convert all this data to attributes, you end up with something like this:

<CUSTOMER LAST_NAME="Smith" FIRST_NAME="Sam" DATE="October 15, 2001" PURCHASE="Tomatoes" PRICE="$1.25" NUMBER="8" />

Clearly, this will be a mess if you have a few such elements.

Another point is that you really can't specify document structure using attributes. For instance, the document we've already seen in this chapter stores multiple ordered items per customer. However, attribute names must be unique within a tag, so it's much tougher to store data like this using attributes:

<CUSTOMER>
    <NAME>
        <LAST_NAME>Smith</LAST_NAME>
        <FIRST_NAME>Sam</FIRST_NAME>
    </NAME>
    <DATE>October 15, 2001</DATE>
    <ORDERS>
        <ITEM>
            <PRODUCT>Tomatoes</PRODUCT>
            <NUMBER>8</NUMBER>
            <PRICE>$1.25</PRICE>
        </ITEM>
        <ITEM>
            <PRODUCT>Oranges</PRODUCT>
            <NUMBER>24</NUMBER>
            <PRICE>$4.98</PRICE>
        </ITEM>
    </ORDERS>
</CUSTOMER>

The upshot is that deciding whether to store your data in attributes or to create new elements is really a matter of taste until you get beyond a few attributes per tag. If you find yourself using (not just defining, but using) more than four attributes in a tag, consider breaking up the tag into a number of enclosed tags. Doing so will make the document structure much easier to work with and edit later.

You should follow specific rules when creating attributes, and those include correctly setting attribute names and specifying attribute values.

Attribute Names

According to the XML 1.0 specification, attribute names must follow the same rules as those for tag names, which means that you can start an attribute name with a letter, an underscore, or a colon. The next characters may be letters, digits, underscores, hyphens, periods, and colons (but no whitespace because you separate attribute name-value pairs with whitespace).

Take a look at these examples showing legal attribute names:

<circle origin_x="10.0" origin_y="20.0" radius="10.0" />
<image src="image1.jpg">
<pen color="red" width="5">
<book pages="1231" >

Here are some illegal attribute names:

<circle 1origin_x="10.0" 1origin_y="20.0" 1radius="10.0" />
<image src name="image1.jpg">
<pen color@="red" width@="5">
<book pages(excluding front matter)="1231" >

Attribute Values

Because markup is always text, attributes are also text. Even if you're assigning a number to an attribute, you treat that number as a text string and enclose it in quotes, like this:

<circle origin_x="10.0" origin_y="20.0" radius="10.0" />

Among other things, this means that XML processors will return attribute values as text strings. If you want to treat them as numbers, you'll have to make sure that you translate them to numbers as the programming language you're using allows.
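Here's a minimal sketch of that conversion step. It assumes Python and its standard xml.etree.ElementTree module, which I've picked purely for illustration; the variable names are my own.

```python
import math
import xml.etree.ElementTree as ET

# Parse the <circle> element shown above.
circle = ET.fromstring('<circle origin_x="10.0" origin_y="20.0" radius="10.0" />')

# The parser hands every attribute value back as a text string...
radius_text = circle.get("radius")

# ...so it has to be converted explicitly before doing arithmetic with it.
radius = float(radius_text)
area = math.pi * radius ** 2
```

Other languages work the same way in spirit: read the string first, and then convert it with whatever facility the language provides.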

In XML, you must enclose attribute values in quotation marks. Usually, you use double quotes, but consider the case in which the attribute value itself contains double quotes: you can't just surround such a value with double quotes because an XML processor will take the first embedded quote as the end of the value. In such a case, you can use single quotes to surround the text, like this:

<quotation text='He said, "Not that!"' />

What if the attribute value contains both single and double quotes? In that case, you can use the XML-defined entity reference &apos; for a single quote and &quot; for double quotes, like this (I'll discuss the five XML-defined entity references in a few pages):

<person height="5&apos;6&quot;" />
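As a rough sketch of both directions (reading such a value and producing one), here's how this might look with Python's standard library. The choice of Python, and of the quoteattr helper from xml.sax.saxutils, is my own assumption for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Reading: the parser replaces &apos; and &quot; with the real characters.
person = ET.fromstring('<person height="5&apos;6&quot;" />')
height = person.get("height")

# Writing: quoteattr() returns the value already wrapped in suitable quotes.
# A value with only double quotes gets wrapped in single quotes...
only_double = quoteattr('He said, "Not that!"')

# ...and a value with both kinds gets double quotes plus &quot; inside.
both_kinds = quoteattr('5\'6"')

# The generated text can be dropped straight into a tag.
tag = '<person height=' + both_kinds + ' />'
```

Note that quoteattr leaves the apostrophe alone when it wraps the value in double quotes, which is legal because an apostrophe can't end a double-quoted value.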

Assigning Values to Attributes

If you're going to use an attribute, you must assign it a value. Not doing so is a violation of well-formedness; that is, you cannot have "standalone" attributes in XML, such as the BORDER attribute in HTML tables, which need not be assigned a value.

A Useful Attribute: xml:lang

One general attribute bears mention: xml:lang. It's often convenient to specify the language of a document's content and attribute values, especially to help software such as Web search engines. You can specify the language in XML tags with the xml:lang attribute. (In valid documents, this attribute, like any other, must be declared if it is used.)

You can set the xml:lang attribute to these values:

A two-letter language code, as defined by [ISO 639].
A language identifier registered with the Internet Assigned Numbers Authority (IANA). Such identifiers begin with the prefix i- (or I-).
A language identifier assigned by you, or for private use. Such identifiers must begin with x- or X-.

As an example, I'm using the xml:lang attribute here to indicate that an element's language is English:

<p xml:lang="en">The color should be brown.</p>

You also can use language subcodes if you follow the language code with a hyphen and the subcode. A subcode indicates a dialect or a regional variation. For example, here I'm indicating that one element holds British English content and that one holds American English content:

<p xml:lang="en-GB">The colour should be brown.</p>
<p xml:lang="en-US">The color should be brown.</p>

Besides applying to element content, xml:lang also identifies the language used in a tag's attribute values, as in this case, where I'm using German:

<p farbe="braun" xml:lang="de">
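To give a feel for how software might pick up on xml:lang, here's a small sketch in Python (my choice of language for illustration). One detail worth knowing is that the ElementTree module reports the reserved xml: prefix by its namespace name, http://www.w3.org/XML/1998/namespace. The <page> wrapper element is hypothetical, added only so that the two paragraphs have a root element.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the namespace bound to the xml: prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    '<page>'
    '<p xml:lang="en-GB">The colour should be brown.</p>'
    '<p xml:lang="en-US">The color should be brown.</p>'
    '</page>'
)

# Map each language code to the text it labels.
by_language = {p.get(XML_LANG): p.text for p in doc.findall("p")}
```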

Building Well-Formed Document Structure

We've now covered a lot of the syntax and rules for creating XML documents at the level of elements and character data. It's time to move on to the next level: actually giving your document structure.

The W3C has a lot of rules about how to structure your document in a way to make it well-formed, and I'll take a look at those rules here. In this chapter, I'm going to talk only about standalone documents; in Chapter 3, we'll see that we have to adjust these points somewhat for documents that have a DTD.

Checking Well-Formedness

If you have doubts whether your XML document is well-formed, use an online XML validator, such as the excellent one hosted by the Brown University Scholarly Technology Group at http://www.stg.brown.edu/service/xmlvalid/. You'll get a complete report on your document's well-formedness and validity. To see all the well-formedness constraints as set up by the W3C, look at http://www.w3.org/TR/REC-xml (or Appendix A), and search for the text "Well-Formedness Constraint," which is how W3C marks those constraints.
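If you'd rather run a quick check on your own machine, nearly any XML parser will do because a conforming parser has to reject documents that aren't well-formed. Here's a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name is_well_formed is my own, and this checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = '<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>'
no_value = '<TABLE BORDER></TABLE>'   # attribute with no value assigned
no_quotes = '<IMG SRC=image.jpg />'   # attribute value without quotes

results = [is_well_formed(doc) for doc in (good, no_value, no_quotes)]
```

Only the first of the three documents passes; the other two break rules covered in this chapter.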

An XML Declaration Should Begin the Document

The first well-formedness structure constraint is that you should start the document with an XML declaration. Technically, you don't need to include an XML declaration in your document, but if you do, to make the document well-formed, the XML declaration must be the absolute first thing in the document, like this (not even whitespace should come before the XML declaration):

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER STATUS="Good credit">
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    .
    .
    .

Do You Need an XML Declaration?

The W3C says that XML documents should have an XML declaration, but documents really don't need to have one in all cases. For example, when you're combining XML documents with the same character encoding into one large one, you don't want to include an XML declaration at the head of each section of the document.
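To see the placement rule in action, here's a small sketch that feeds three variations to a parser. It assumes Python's standard xml.etree.ElementTree module, and the helper name parses is my own.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

with_declaration = '<?xml version="1.0" standalone="yes"?><DOCUMENT/>'
without_declaration = '<DOCUMENT/>'

# A single space in front of the declaration is enough to break the rule.
space_first = ' <?xml version="1.0" standalone="yes"?><DOCUMENT/>'

outcomes = [parses(doc) for doc in
            (with_declaration, without_declaration, space_first)]
```

The document with no declaration at all is accepted; the one with a space before the declaration is not.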

Include One or More Elements

To be a well-formed document, a document must include one or more elements. The first element it includes, of course, is the root element, and all other elements are enclosed by that element. The examples we've seen throughout this chapter show how this works, as here, where this XML document contains multiple elements:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

Include Both Start and End Tags for Elements That Aren't Empty

In HTML, Web browsers often handle the case where you omit end tags for HTML elements, even if you shouldn't omit those end tags, according to the HTML specification. For example, if you use the <P> tag and then follow it with another <P> tag without using a </P> tag, the browser will have no problem.

In XML, the story is different. To make sure that a document is well-formed, every element that is not empty must have both a start tag and an end tag, as in the example we just saw:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

In fact, there's another well-formedness constraint here: End tags must match start tags to complete an element.
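Here's a rough sketch of how a parser reacts when an end tag is missing or doesn't match. It assumes Python's standard xml.etree.ElementTree module; the helper name parse_error is my own, and the exact wording of the parser's complaint may vary, so I don't rely on it.

```python
import xml.etree.ElementTree as ET

def parse_error(text):
    """Return the parser's complaint as text, or None if text parses."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

complete = '<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>'

# The <GREETING> element is never closed.
missing_end = '<DOCUMENT><GREETING>Hello From XML</DOCUMENT>'

# XML is case-sensitive, so </greeting> does not match <GREETING>.
wrong_case = '<DOCUMENT><GREETING>Hello From XML</greeting></DOCUMENT>'

errors = [parse_error(doc) for doc in (complete, missing_end, wrong_case)]
```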

Close Empty Tags with />

Empty elements don't have closing tags. These tags have no content, which means that they do not enclose any character data or markup. Instead, these elements are made up entirely of one tag, like this:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING TEXT = "Hello From XML" />
</DOCUMENT>

In XML, an empty element written as a single tag must always end with />, as shown here, if you want your document to be well-formed. In general, the current crop of the major Web browsers deals well with elements such as <BR />; this is a good thing because the alternative was to write such elements as <BR></BR>, and that can be confusing. In fact, some browsers, such as Netscape, interpret that markup as two <BR> elements.

The Root Element Must Contain All Other Elements

One element in well-formed documents, the root element, contains all other elements. In this case, for example, the root element is the <BOOKS> element:

<?xml version = "1.0" standalone="yes"?>
<BOOKS>
    <BOOK>
        <TITLE>
            Inside XML
        </TITLE>
        <REVIEW>
            Excellent
        </REVIEW>
    </BOOK>
    <BOOK>
        <TITLE>
            Other XML Book
        </TITLE>
        <REVIEW>
            OK
        </REVIEW>
    </BOOK>
</BOOKS>

In this case, the root element must contain all other elements (excluding the XML declaration, comments, and other nonelements). This makes it easy for XML processors to handle XML documents as trees, starting at the root element, as we'll see when we start parsing XML documents.

Nest Elements Correctly

A very big part of making sure that documents are well-formed is ensuring that elements nest correctly (in fact, that's one of the reasons for the term well-formed). The idea here is that if an element contains the start tag of an element that's not empty, it must also contain that element's end tag.

For example, this XML is fine:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

However, there's a nesting problem in this next document because an XML processor will encounter the <MESSAGE> tag before finding the closing </GREETING> tag:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    <MESSAGE>
    </GREETING>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

Because you must nest elements correctly to create a well-formed document, and because XML processors are supposed to refuse documents that are not well-formed, you can always count on every nonroot element having exactly one parent element that encloses it. For example, in the well-formed document shown earlier in this section, the <GREETING> and <MESSAGE> elements both have the same parent: the <DOCUMENT> element itself, which is also the root element. Note that a parent element can enclose any number of child elements (including none at all).
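The one-parent guarantee is easy to see once a document has been parsed into a tree. Here's a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the names parent_of and nested_ok are my own.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring(
    '<DOCUMENT>'
    '<GREETING>Hello From XML</GREETING>'
    '<MESSAGE>Welcome to the wild and woolly world of XML.</MESSAGE>'
    '</DOCUMENT>'
)

# Walk the tree once, recording the parent of every child element.
parent_of = {child: parent for parent in document.iter() for child in parent}

# The same information keyed by tag name, for easy reading.
parents = {child.tag: parent.tag for child, parent in parent_of.items()}

# The badly nested version never becomes a tree at all.
badly_nested = (
    '<DOCUMENT><GREETING>Hello From XML<MESSAGE></GREETING>'
    'Welcome to the wild and woolly world of XML.</MESSAGE></DOCUMENT>'
)
try:
    ET.fromstring(badly_nested)
    nested_ok = True
except ET.ParseError:
    nested_ok = False
```

Every element except the root shows up exactly once as a key in parent_of, and the root doesn't show up as a key at all.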

Use Unique Attribute Names

One of the well-formedness constraints that the XML 1.0 specification lists is that no attribute name may appear more than once in the same start tag or empty element tag. It's hard to see how you would violate this one other than by mistake, as in this case, where I give a person two last names:

<PERSON LAST_NAME="Wooster" LAST_NAME="Jeeves">

Note that because XML is case-sensitive, attributes with different capitalizations are treated as being different, as in this case (although it's still hard to see how you'd write this except by mistake):

<PERSON LAST_NAME="Wooster" last_name="Jeeves">

(In general, using attribute names that differ only in terms of capitalization is a really bad idea.)
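Here's a short sketch contrasting the two cases. It assumes Python's standard xml.etree.ElementTree module, and the variable names are my own.

```python
import xml.etree.ElementTree as ET

# The same attribute name twice in one tag: the parser refuses the document.
try:
    ET.fromstring('<PERSON LAST_NAME="Wooster" LAST_NAME="Jeeves" />')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

# Names that differ only in capitalization count as two separate attributes.
mixed_case = ET.fromstring('<PERSON LAST_NAME="Wooster" last_name="Jeeves" />')
attributes = dict(mixed_case.attrib)
```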

Use Only the Five Pre-Existing Entity References

Five predefined entity references exist in XML. An entity reference is replaced by the corresponding entity when the XML document is processed. You may already know about entity references from HTML; for example, the HTML entity reference &copy; is replaced by the © symbol when a browser parses an HTML document.

As in HTML, general entity references in XML start with & and end with ;. Parameter entity references, which we'll use in DTDs in the next chapter, start with % and end with ;. Here are the five predefined entity references in XML and the character each one is replaced with when parsed:

`&amp;`	The `&` character
`&lt;`	The `<` character
`&gt;`	The `>` character
`&apos;`	The `'` character
`&quot;`	The `"` character

Normally, these characters are tricky to handle in XML documents because XML processors give them special importance; that is, < and > delimit markup tags, you use quotation marks to surround attribute values, and the & character starts entity references. Replacing them with the previous entity references makes them safe because the XML processor will replace them with the appropriate character when processing the document. Using an entity reference for a character is often called escaping that character (following the terminology of programming languages that use "escape sequences" to embed special characters in text).

For example, say that you wanted to use the term "The S&O Railway" in a document; you could use the &amp; entity reference for the ampersand this way:

<TOUR CAPTION="The S&amp;O Railway" />

Although only five predefined entity references exist in XML, you can define new entity references. I'll take a look at how to do that in the next chapter on DTDs.
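When documents are generated by a program, escaping is usually left to a library function. Here's a minimal sketch, assuming Python and its standard xml.sax.saxutils module, whose escape function handles &, <, and > by default and takes a dictionary for anything extra. The dictionaries QUOTES and UNQUOTES and the sample strings are my own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Extra replacements for the two quote characters.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}

escaped = escape("The S&O Railway")

sample = 'Tom & Jerry <"cat" & \'mouse\'>'
all_five = escape(sample, QUOTES)        # uses all five references
restored = unescape(all_five, UNQUOTES)  # back to the original text

# A parser undoes the escaping on its own when it reads the document.
tour = ET.fromstring('<TOUR CAPTION="The S&amp;O Railway" />')
caption = tour.get("CAPTION")

# An HTML-only reference such as &copy; is undefined in a DTD-less document.
try:
    ET.fromstring('<NOTE>&copy; 2001</NOTE>')
    copy_accepted = True
except ET.ParseError:
    copy_accepted = False
```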

The Final ; in Entity References

HTML browsers often let you omit the final ; in entity references if the entity reference is followed by whitespace (if the entity reference is embedded in non-whitespace text, you must include the final ; even in HTML). However, you cannot omit the final ; in XML entity references.

Surround Attribute Values with Quotes

In HTML, there's no problem if you omit the quotes around attribute values (as long as those values don't contain any whitespace). For example, this element presents no problem to HTML browsers:

<IMG SRC=image.jpg>

However, XML processors would refuse such an element because omitting the quotation marks around the attribute value image.jpg is a violation of well-formedness. Here's how this element would look when written properly:

<IMG SRC="image.jpg" />

You can also use single quotation marks, like this:

<IMG src='/books/1/261/1/html/2/image.jpg' />

In fact, if the attribute value contains double quotes, you should surround it with single quotes, as we've seen:

<quotation text='He said, "Not that!"' />

As indicated previously, XML makes provisions for handling single and double quotes inside attribute values. You can always replace single quotes with the entity reference for apostrophes, &apos;, and double quotes with the entity reference &quot;. For example, to assign the attribute height the value 5'6", you can do it this way:

<person height="5&apos;6&quot;" />

In XHTML, the derivation of HTML 4.0 in XML, you must surround attribute values in quotation marks, just as in any other XML document. I'm sure that this requirement will be one of the most persistently troublesome for Web authors switching to XHTML, simply because it's so easy to forget.

A few more well-formedness constraints on attribute values bear mention. Attribute values cannot contain direct or indirect references to external entities (more on this in Chapter 3), and you cannot use the < character in attribute values. If you must use <, use the entity reference &lt; instead, like this, where I'm assigning the text <-- to the TEXT attribute:

<ARROW TEXT="&lt;--" />

In fact, so strong is the prohibition against using <, except to start markup, that you shouldn't use it anywhere in the document except for that purpose; see the next section, "Use < and & Only to Start Tags and Entities."

Use < and & Only to Start Tags and Entities

XML processors assume that < always starts a tag and that & always starts an entity reference, so you should avoid using those characters for anything else. We've already seen this example, where the ampersand in "The S&O Railway" is replaced by &amp;:

<TOUR CAPTION="The S&amp;O Railway" />

You should particularly avoid the < character in nonmarkup text. This can be difficult sometimes, as when the < character must be used as the less-than operator in JavaScript, as in this example in XHTML:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Using The if Statement In JavaScript
        </title>
    </head>
    <body>
        <script language="javascript">
            var budget
            budget = 234.77
            if (budget < 0) {
                document.writeln("Uh oh.")
            }
        </script>
        <center>
            <h1>
                Using The if Statement In JavaScript
            </h1>
        </center>
    </body>
</html>

In cases like this, the W3C suggests that you enclose the JavaScript code in a CDATA section (see the next section) so that the XML processor will ignore it, but unfortunately, no major browser today understands CDATA sections. Another possible solution is to enclose the JavaScript code in a comment (between <!-- and -->), but the W3C doesn't recommend this because XML processors are allowed to remove comments before passing the XML on to the underlying application and so would remove the JavaScript code entirely from the document.

You can use &lt; for the < operator, like this:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Using The if Statement In JavaScript
        </title>
    </head>
    <body>
        <script language="javascript">
            var budget
            budget = 234.77
            if (budget &lt; 0) {
                document.writeln("Uh oh.")
            }
        </script>
        <center>
            <h1>
                Using The if Statement In JavaScript
            </h1>
        </center>
    </body>
</html>

Practically speaking, however, this still represents a problem for the major browsers, although it's the way you should go in the long run. In the short run, you actually should remove the whole problem from the scope of the browser by placing the script code in an external file, here named script.js:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Using The if Statement In JavaScript
        </title>
    </head>
    <body>
        <script language="javascript" src="script.js">
        </script>
        <center>
            <h1>
                Using The if Statement In JavaScript
            </h1>
        </center>
    </body>
</html>

CDATA Sections

As you know, XML processors are very sensitive to characters such as < and &. So what if you had a large section of text that contained a great many < and & characters that you didn't want to interpret as markup? You can escape those characters as &lt; and &amp;, of course, but with many such characters, that's awkward and hard to read. Instead, you can use a CDATA section.

CDATA sections hold character data that is supposed to remain unparsed by the XML processor. This is a useful asset to XML because otherwise all the text in an XML document is parsed and searched for characters such as < and &. You use CDATA sections simply to tell the XML processor to leave the enclosed text alone and to pass it on unchanged to the underlying application.

You start a CDATA section with the markup <![CDATA[ and end it with ]]>. Note that this means that CDATA sections are actually searched, but only for the ending text ]]>. Among other things, this means that you cannot include the text ]]> inside a CDATA section, and it also means that you cannot nest CDATA sections.

Here's an example; in this case, I've added an element named <MARKUP> to a document, and this element itself contains markup that I want to preserve as character data (so that it can be printed out, for example). To make sure that the markup inside this element is preserved as text, I enclose it in a CDATA section like this:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <MARKUP>
    <![CDATA[
        <CUSTOMER>
            <NAME>
                <LAST_NAME>Smith</LAST_NAME>
                <FIRST_NAME>Sam</FIRST_NAME>
            </NAME>
            <DATE>October 15, 2001</DATE>
            <ORDERS>
                <ITEM>
                    <PRODUCT>Tomatoes</PRODUCT>
                    <NUMBER>8</NUMBER>
                    <PRICE>$1.25</PRICE>
                </ITEM>
                <ITEM>
                    <PRODUCT>Oranges</PRODUCT>
                    <NUMBER>24</NUMBER>
                    <PRICE>$4.98</PRICE>
                </ITEM>
            </ORDERS>
        </CUSTOMER>
    ]]>
    </MARKUP>
</DOCUMENT>

As you can see, CDATA sections are powerful because they enable you to embed character data directly in XML documents without having it parsed. (Normally, character data in XML documents is parsed by the XML processor and becomes parsed character data.)
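Here's a small sketch of what an application actually receives from a CDATA section. It assumes Python's standard xml.etree.ElementTree module, and the shortened <MARKUP> content is my own. Note that the XML 1.0 specification spells the opening delimiter <![CDATA[, with a second square bracket after the keyword.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring(
    '<DOCUMENT>'
    '<MARKUP><![CDATA[<NAME>Smith & Sons</NAME>]]></MARKUP>'
    '</DOCUMENT>'
)
markup = document.find("MARKUP")

# The tags and the bare ampersand arrive as plain text...
text = markup.text

# ...and no <NAME> child element was created inside <MARKUP>.
child_count = len(markup)
```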

Here's another example. In this case, I'm adapting the JavaScript example in the previous section to show how the W3C wants to handle script code in XHTML pages by placing that code in a CDATA section:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Using The if Statement In JavaScript
        </title>
    </head>
    <body>
        <script language="javascript">
            <![CDATA[
                var budget
                budget = 234.77
                if (budget < 0) {
                    document.writeln("Uh oh.")
                }
            ]]>
        </script>
        <center>
            <h1>
                Using The if Statement In JavaScript
            </h1>
        </center>
    </body>
</html>

Unfortunately, as mentioned in the previous section, the idea of a CDATA section, especially one that starts with the expression <![CDATA[ and ends with the expression ]]>, confuses the current versions of the major browsers. When those browsers are configured to handle XHTML, this situation will improve.

XML Namespaces

There's considerable freedom in XML because you can define your own tags. However, as more XML applications came to be developed, a problem arose that had been unforeseen by the creators of the original XML specification: tag name conflicts.

As we saw in Chapter 1, two popular XML applications are XHTML (that is, HTML 4.0 as written in XML) and MathML, which lets you display equations. XHTML is useful because it lets you handle all the standard HTML 4.0 tags; if you need to display equations, MathML can be essential. So what if you want to use MathML inside an XHTML Web page? That's a problem because the tags defined in XHTML and MathML overlap (specifically, each application defines a <var> and a <select> element).

The solution is to use namespaces. Namespaces enable you to make sure that one set of tags cannot conflict with another. Namespaces work by letting you prepend a name followed by a colon to tag and attribute names, changing those names so that they don't conflict.

XML namespaces are one of those XML companion recommendations that keep being added to the XML specification; you can find the specification for namespaces at http://www.w3.org/TR/REC-xml-names/. A lot of debate still rages about this one (largely because namespaces can make writing DTDs difficult), but it's now an official W3C recommendation.

Creating a Namespace

Here's an example. In this case, I'll use a fictitious XML application designed for cataloging books whose root element is <library>, and I'll add my own reviews to each book. I start off with a book as specified with the fictitious XML application:

<library>
    <book>
        <title>
            Earthquakes for Lunch.
        </title>
    </book>
</library>

Now I want to add my own comments to this <book> item. To do that, I start by confining the book XML application to its own namespace, for which I'll use the prefix book:. To define a new namespace, use the xmlns:prefix attribute, where prefix is the prefix that you want to use for the namespace:

<library
    xmlns:book="http://www.amazingterrificbooks.com/spec">
    <book>
        <title>
            Earthquakes for Lunch.
        </title>
    </book>
</library>

To define a namespace, you assign a unique identifier to the xmlns:prefix attribute. In XML, that identifier is usually a uniform resource identifier (URI) (a URL, in this case) that may direct the XML processor to a DTD for the namespace (but it doesn't have to). After defining the book namespace, you can preface every tag and attribute name in this namespace with book:, like this:

<book:library
    xmlns:book="http://www.amazingterrificbooks.com/spec">
    <book:book>
        <book:title>
            Earthquakes for Lunch.
        </book:title>
    </book:book>
</book:library>

Now the tag and attribute names have actually been changed; for example, <library> is now <book:library> as far as the XML processor is concerned. (If you've defined tag and attribute names in a document's DTD, you must redefine the tags and attributes there as well to make the new names legal.)

Because all tag and attribute names from the book namespace are now in their own namespace, I'm free to add my own namespace to the document so that I can add my own comments to each book entry. I start by defining a new namespace named steve:

<book:library
    xmlns:book="http://www.amazingterrificbooks.com/spec"
    xmlns:steve="http://www.starpowder.com/steve">
    <book:book>
        <book:title>
            Earthquakes for Lunch.
        </book:title>
    </book:book>
</book:library>

Now I can use the new steve namespace to add markup to the document like this, keeping it separate from the other markup:

<book:library
    xmlns:book="http://www.amazingterrificbooks.com/spec"
    xmlns:steve="http://www.starpowder.com/steve">
    <book:book>
        <book:title>
            Earthquakes for Lunch.
        </book:title>
        <steve:review>
            This book was OK, no great shakes.
        </steve:review>
    </book:book>
</book:library>

I also can use attributes in the steve namespace as long as I prefix them with steve:, like this:

<book:library
    xmlns:book="http://www.amazingterrificbooks.com/spec"
    xmlns:steve="http://www.starpowder.com/steve">
    <book:book>
        <book:title>
            Earthquakes for Lunch.
        </book:title>
        <steve:review steve:ID="1000034">
            This book was OK, no great shakes.
        </steve:review>
    </book:book>
</book:library>

And that's how namespaces work: you can use them to separate tags, even tags with the same name, from each other. As you can see, using multiple namespaces in the same document is no problem at all; just use the xmlns attribute in the enclosing element to define the appropriate namespaces.
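A namespace-aware processor keeps the two vocabularies apart by the URI behind each prefix rather than by the prefix text itself. Here's a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which reports each name as the URI in curly braces followed by the local name. The constants BOOK and STEVE and the prefixes dictionary are my own.

```python
import xml.etree.ElementTree as ET

BOOK = "http://www.amazingterrificbooks.com/spec"
STEVE = "http://www.starpowder.com/steve"

library = ET.fromstring(
    '<book:library xmlns:book="' + BOOK + '" xmlns:steve="' + STEVE + '">'
    '<book:book>'
    '<book:title>Earthquakes for Lunch.</book:title>'
    '<steve:review steve:ID="1000034">This book was OK.</steve:review>'
    '</book:book>'
    '</book:library>'
)

# Searches take a prefix-to-URI map; the prefixes here are local to the query.
prefixes = {"book": BOOK, "steve": STEVE}
title = library.find("book:book/book:title", prefixes)
review = library.find("book:book/steve:review", prefixes)

# Prefixed attributes are reported by namespace URI as well.
review_id = review.get("{" + STEVE + "}ID")
```

Because the lookup goes through the URI, the query would still work if the document had used different prefixes for the same two namespaces.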

`xmlns` in Child Elements

In fact, you can use the xmlns attribute in child elements to redefine an enclosing namespace if you'd like.

Creating Local Namespaces

You don't need to use the xmlns attribute in the root element; you can use this attribute in any element. In this case, I've moved the steve namespace definition to the element in which it's used:

<book:library
    xmlns:book="http://www.amazingterrificbooks.com/spec">
    <book:book>
        <book:title>
            Earthquakes for Lunch.
        </book:title>
        <steve:review
            xmlns:steve="http://www.starpowder.com/steve"
            steve:ID="1000034">
            This book was OK, no great shakes.
        </steve:review>
    </book:book>
</book:library>

Because namespace prefixes are really just text prepended to tag and attribute names, they follow the same rules for naming tags and attributes; that is, a namespace prefix can start with a letter or an underscore. The characters that follow can be letters, digits, underscores, hyphens, and periods. Although colons are legal in tag names, you can't use a colon in a namespace prefix because the colon is what separates the prefix from the rest of the name. In addition, two prefixes are reserved: xml and xmlns. Note that because namespace prefixes are merely text prepended to tag and attribute names, followed by a colon (which is legal in names), XML processors that have never heard of namespaces can still handle documents that use them without a problem.

Names of Attributes in Namespaces

You can use two different prefixes to refer to the same namespace. Note, however, that because attribute names must be unique within an element, you cannot give one element two attributes that share the same name and whose prefixes both refer to that namespace.
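Here's a rough sketch of that situation as a namespace-aware parser sees it. It assumes Python's standard xml.etree.ElementTree module; the prefixes a and b, the rating attribute, and the helper name parses are hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the namespace-aware parser accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

URI = "http://www.starpowder.com/steve"
BOTH = 'xmlns:a="' + URI + '" xmlns:b="' + URI + '"'

# a:ID and b:ID look different but name the same attribute in one namespace.
clash = '<review ' + BOTH + ' a:ID="1" b:ID="2" />'

# Different local names are fine, even though both prefixes share a namespace.
no_clash = '<review ' + BOTH + ' a:ID="1" b:rating="2" />'

clash_ok = parses(clash)
no_clash_ok = parses(no_clash)
```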

Default Namespaces

Now I'll return to the example that introduced this topic: the idea of using MathML in an XHTML document. In this case, let's assume that I want to display an equation in an XHTML document. I start off with an XHTML document that looks like this:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Embedding MathML In XHTML
        </title>
    </head>
    <body>
        <center>
            <h1>
                Embedding MathML In XHTML
            </h1>
        </center>
        Here's the MathML:
    </body>
</html>

This document has a <!DOCTYPE> declaration, which you use to connect a DTD to a document, and the <html> element declares a namespace with the xmlns attribute. Note in particular that this time, the xmlns attribute is used by itself, without defining any prefix to specify a namespace (xmlns="http://www.w3.org/1999/xhtml"). When you use the xmlns attribute alone, without specifying any prefix, you are defining a default namespace. The element itself and all the enclosed elements that don't have a prefix of their own are assumed to belong to that namespace. (Attributes without a prefix are not affected by the default namespace.)

In XHTML documents, it's customary to make the W3C XHTML namespace, http://www.w3.org/1999/xhtml, into the default namespace for the document. When you do, you can then use the standard HTML tag names without any prefix, as you see in this example.
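To see what a default namespace does in practice, here's a short sketch using Python's standard xml.etree.ElementTree module, which reports each element name together with its namespace in the form {namespace}name. The trimmed-down document is an assumption made for brevity; it is not the full example above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A shortened version of the XHTML example, without the DTD.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en">'
    "<head><title>Embedding MathML In XHTML</title></head>"
    "<body/>"
    "</html>"
)

root = ET.fromstring(DOC)

# Every unprefixed element picks up the default namespace.
tags = [element.tag for element in root.iter()]

# The unprefixed lang attribute does not.
attributes = dict(root.attrib)

if __name__ == "__main__":
    for tag in tags:
        print(tag)
    print(attributes)
```

Each element name comes back qualified with the XHTML namespace, while the lang attribute keeps its plain name.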

However, I want to use MathML markup in this document. To do so, I declare a second namespace in this document, with the prefix m, using the namespace name http://www.w3.org/TR/REC-MathML/ for MathML (the later MathML 2.0 recommendation specifies http://www.w3.org/1998/Math/MathML instead):

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
    xmlns:m="http://www.w3.org/TR/REC-MathML/">
    <head>
        <title>
            Embedding MathML In XHTML
        </title>
    </head>
    <body>
        <center>
            <h1>
                Embedding MathML In XHTML
            </h1>
        </center>
        Here's the MathML:
    </body>
</html>

Now I can add MathML as I like, as long as I restrict that markup to the m namespace, like this:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
    xmlns:m="http://www.w3.org/TR/REC-MathML/">
    <head>
        <title>
            Embedding MathML In XHTML
        </title>
    </head>
    <body>
        <center>
            <h1>
                Embedding MathML In XHTML
            </h1>
        </center>
        Here's the MathML:
        <m:math>
            <m:mrow>
                <m:mrow>
                    <m:mn>3</m:mn>
                    <m:mo>&InvisibleTimes;</m:mo>
                    <m:msup>
                        <m:mi>Z</m:mi>
                        <m:mn>2</m:mn>
                    </m:msup>
                    <m:mo>-</m:mo>
                    <m:mrow>
                        <m:mn>6</m:mn>
                        <m:mo>&InvisibleTimes;</m:mo>
                        <m:mi>Z</m:mi>
                    </m:mrow>
                    <m:mo>+</m:mo>
                    <m:mn>12</m:mn>
                </m:mrow>
                <m:mo>=</m:mo>
                <m:mn>0</m:mn>
            </m:mrow>
        </m:math>
    </body>
</html>

This document works fine, and you can see the result in the W3C Amaya browser in Figure 2.3.

Figure 2.3. A MathML document in the Amaya browser.


We'll have occasion to use namespaces throughout this book, as when we work with the XSL transformation language in Chapter 13, "XSL Transformations."
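As a rough sketch of how a program can pick the MathML markup out of a mixed document like the one above, the following uses Python's standard xml.etree.ElementTree module. The document here is a simplified stand-in for the full example (no DTD, no entity references, a shorter equation), and the prefixes x and mathml used in the search are arbitrary choices that deliberately differ from the ones in the document.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the XHTML-plus-MathML example.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:m="http://www.w3.org/TR/REC-MathML/">
    <body>
        Here's the MathML:
        <m:math>
            <m:mrow>
                <m:mi>Z</m:mi>
                <m:mo>=</m:mo>
                <m:mn>0</m:mn>
            </m:mrow>
        </m:math>
    </body>
</html>"""

# Prefixes used only for searching; what matters is the namespace each maps to.
NAMESPACES = {
    "x": "http://www.w3.org/1999/xhtml",
    "mathml": "http://www.w3.org/TR/REC-MathML/",
}

root = ET.fromstring(DOC)
math = root.find("x:body/mathml:math", NAMESPACES)
row = math.find("mathml:mrow", NAMESPACES)

# Collect the text of each child of the row, in document order.
tokens = [child.text for child in row]

if __name__ == "__main__":
    print(tokens)
```

The search finds the equation even though it uses different prefixes than the document does, because the prefix is only shorthand for the namespace behind it.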

Infosets

While we're on the subject of creating XML documents, it's worth taking a look at a new XML specification: the XML Information Set specification, which you'll find at http://www.w3.org/TR/xml-infoset.

XML documents excel at storing data, and this has led developers to wonder whether XML will ultimately be able to solve an old problem: being able to directly compare and classify the data in multiple documents. Consider the World Wide Web as it stands today. There can be thousands of documents on a particular topic, but how can you possibly compare them? A search for "XML," for instance, turns up about 675,000 matches, but it would be extraordinarily difficult to write a program that would compare the data in those documents because none of that data is stored in any remotely compatible format.

The idea behind XML information sets, also called infosets, is to set up an abstract way of looking at an XML document so that it can be compared to others. To have an infoset, XML documents may not use colons in tag and attribute names unless they are used to support namespaces. Documents do not need to be valid to have an infoset, but they need to be well-formed.

An XML document's information set consists of two or more information items (the information set for any well-formed XML document contains at least the document information item and one element information item). An information item is an abstract representation of some part of an XML document, and each information item has a set of properties, some of which are considered core and some of which are considered peripheral.

An XML information set can contain 15 different types of information items, as listed in the W3C Infoset specification:

A document information item (core)
Element information items (core)
Attribute information items (core)
Processing instruction information items (core)
Reference to skipped entity information items (core)
Character information items (core)
Comment information items (peripheral)
A document type declaration information item (peripheral)
Entity information items (core for unparsed entities, peripheral for others)
Notation information items (core)
Entity start marker information items (peripheral)
Entity end marker information items (peripheral)
CDATA start marker information items (peripheral)
CDATA end marker information items (peripheral)
Namespace declaration information items (core)

There is always one document information item in the information set. Here's a list of the core properties of the document information item:

[children]. This property holds an ordered list of references to child information items, in the original document order.
[notations]. This property holds an unordered set of references to notation information items (which we'll see more about in Chapter 3).
[entities]. This property holds an unordered set of references to entity information items, one for each unparsed entity declaration in the DTD.

The document information item can also have these properties:

[base URI]. This property holds the absolute URI of the document entity.
[children - comments]. This property holds a reference to a comment information item for each comment outside the document element.
[children - doctype]. This property holds a reference to one document type declaration information item.
[entities - other]. This property holds a reference to an entity information item for each parsed general entity declaration in the DTD.

The other information items, such as element information items and processing instruction information items, have similar properties lists.
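Although an infoset is an abstract idea, you can get a feel for the [children] property with a sketch like the following, which uses the DOM implementation in Python's standard xml.dom.minidom module. This is only a loose analogy: a DOM tree is not an infoset, and the tiny document here is invented for the illustration.

```python
from xml.dom import Node, minidom

# A made-up document with a comment and a DTD ahead of the root element.
DOC = (
    '<?xml version="1.0"?>'
    "<!-- orders for October -->"
    "<!DOCTYPE DOCUMENT [<!ELEMENT DOCUMENT (#PCDATA)>]>"
    "<DOCUMENT>Tomatoes</DOCUMENT>"
)

# Readable names for the kinds of nodes that appear at the document level.
NODE_KINDS = {
    Node.COMMENT_NODE: "comment",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
    Node.ELEMENT_NODE: "element",
}

document = minidom.parseString(DOC)

# The document's children, in the original document order.
children = [NODE_KINDS[node.nodeType] for node in document.childNodes]

if __name__ == "__main__":
    print(children)
```

The list comes back in document order, with the comment and the document type declaration ahead of the single root element.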

Currently, no applications create and work with infosets. However, W3C documentation often refers to the information stored in an XML document as its infoset, so this is an important term to know. The closest you come to working with infosets right now is working with canonical XML documents (see the next section, "Canonical XML").

Canonical XML

Although infosets are a good idea, they are only abstract formulations of the information in an XML document. So, without reducing an XML document to its infoset, how can you approach the goal of being able to compare XML documents byte by byte?

It turns out that there is a way: You can use canonical XML. Canonical XML is a companion standard to XML, and you can read all about it at http://www.w3.org/TR/xml-c14n. Essentially, canonical XML is a strict XML syntax; documents in canonical XML can be compared directly. The information included in the canonical XML version of a document is the same as would appear in its infoset.

As you can imagine, two XML documents that actually contain the same information can be arranged differently. They can differ in terms of their structure, attribute ordering, and even character encoding. That means that it's very hard to compare such documents. However, when you place these documents in canonical XML format, they can be compared on a byte-by-byte level. In the canonical XML syntax, logically equivalent documents are identical byte for byte.
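Here's a minimal sketch of that idea, using the canonicalize function in Python's standard xml.etree.ElementTree module (available in Python 3.8 and later). Note the assumptions: that function implements a later version of the canonical XML specification than the one discussed in this chapter, so details of its output differ, and the id and unit attributes are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Two made-up documents that hold the same information but differ in
# attribute order, quote style, spacing inside tags, and empty-element syntax.
FIRST = '<ITEM id="7" unit="lb"><PRODUCT>Tomatoes</PRODUCT><NOTE/></ITEM>'
SECOND = "<ITEM  unit='lb'   id='7' ><PRODUCT>Tomatoes</PRODUCT><NOTE></NOTE></ITEM>"

canonical_first = ET.canonicalize(FIRST)
canonical_second = ET.canonicalize(SECOND)

if __name__ == "__main__":
    print("same text as written:", FIRST == SECOND)
    print("same canonical form:", canonical_first == canonical_second)
    print(canonical_first)
```

The two documents differ as written, but their canonical forms match character for character, which is what makes a direct comparison possible.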

The canonical XML syntax is very strict; for example, canonical XML uses UTF-8 character encoding only, carriage-return linefeed pairs are replaced with linefeeds, CDATA sections are replaced with their character content, all entity references must be expanded, and much more, as specified in http://www.w3.org/TR/xml-c14n. Because canonical XML is intended to be byte-by-byte correct, the upshot is that if you need a document in canonical form, you should use software to convert your XML documents to that form.
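A few of these rules can be seen in a short sketch, again using the canonicalize function from Python's standard xml.etree.ElementTree module (Python 3.8 and later). As an assumption to keep in mind, that function follows a later version of the canonical XML specification, so treat this as an illustration of the general idea and not of any one tool's exact output. The NOTE element and its content are made up.

```python
import xml.etree.ElementTree as ET

# A made-up element that uses carriage-return linefeed pairs, a CDATA
# section, and a character reference for an ampersand.
SOURCE = "<NOTE>\r\n<![CDATA[8 < 24]]> &#x26; more\r\n</NOTE>"

canonical = ET.canonicalize(SOURCE)

if __name__ == "__main__":
    # repr() makes the line-ending characters visible.
    print(repr(canonical))
```

In the canonical form, the carriage returns are gone, the CDATA section has been replaced by its content, and the special characters are written as escaped text.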

One such package that will convert valid XML documents to canonical form comes with the XML for Java software that you can get from IBM's AlphaWorks (http://www.alphaworks.ibm.com/tech/xml4j); we touched on this in Chapter 1, and we will be using it later in the book, in Chapter 11, "Java and the XML DOM." XML for Java comes with a Java program named DOMWriter that can convert documents to canonical XML form. To use this program, you must make sure that your document is valid, which means giving it a DTD or a schema to be checked against. I'll add a DTD to the example order.xml that we've seen in this chapter (we'll see how to create DTDs in Chapter 3):

<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (CUSTOMER)*>
<!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
<!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
<!ELEMENT LAST_NAME (#PCDATA)>
<!ELEMENT FIRST_NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT ORDERS (ITEM)*>
<!ELEMENT ITEM (PRODUCT,NUMBER,PRICE)>
<!ELEMENT PRODUCT (#PCDATA)>
<!ELEMENT NUMBER (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
]>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Jones</LAST_NAME>
            <FIRST_NAME>Polly</FIRST_NAME>
        </NAME>
        <DATE>October 20, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Bread</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$14.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Apples</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$1.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$2.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$11.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

Now you can use the DOMWriter program with the special -c switch to convert this document to canonical form; the > canonical.xml part at the end sends the output of the program to a file named canonical.xml. (In Chapter 11, we'll see how to set up the Java classpath environment variable, which must be set correctly to make this work.)

%java dom.DOMWriter -c order.xml > canonical.xml
java -cp xml4j.jar;xerces.jar;xercesSamples.jar dom.DOMWriter -c order.xml > canonical.xml

Here's the result. (Note that DOMWriter has preserved all the whitespace in the document, and the &#10; character references stand for the character code for a linefeed. You also can give codes in hexadecimal if you include an "x" before the number, like this for a linefeed: &#xA;.)

<DOCUMENT>&#10;    <CUSTOMER>&#10;        <NAME>&#10; <LAST_NAME>Smith</LAST_NAME>&#10; <FIRST_NAME>Sam</FIRST_NAME>&#10;        </NAME>&#10; <DATE>October 15, 2001</DATE>&#10;        <ORDERS>&#10; <ITEM>&#10;                <PRODUCT>Tomatoes</PRODUCT>&#10; <NUMBER>8</NUMBER>&#10; <PRICE>$1.25</PRICE>&#10;            </ITEM>&#10; <ITEM>&#10;                <PRODUCT>Oranges</PRODUCT>&#10; <NUMBER>24</NUMBER>&#10; <PRICE>$4.98</PRICE>&#10;            </ITEM>&#10; </ORDERS>&#10;    </CUSTOMER>&#10;    <CUSTOMER>&#10; <NAME>&#10;            <LAST_NAME>Jones</LAST_NAME>&#10; <FIRST_NAME>Polly</FIRST_NAME>&#10;        </NAME>&#10; <DATE>October 20, 2001</DATE>&#10;        <ORDERS>&#10; <ITEM>&#10;                <PRODUCT>Bread</PRODUCT>&#10; <NUMBER>12</NUMBER>&#10; <PRICE>$14.95</PRICE>&#10;            </ITEM>&#10; <ITEM>&#10;                <PRODUCT>Apples</PRODUCT>&#10; <NUMBER>6</NUMBER>&#10; <PRICE>$1.50</PRICE>&#10;            </ITEM>&#10; </ORDERS>&#10;    </CUSTOMER>&#10;    <CUSTOMER>&#10; <NAME>&#10;            <LAST_NAME>Weber</LAST_NAME>&#10; <FIRST_NAME>Bill</FIRST_NAME>&#10;        </NAME>&#10; <DATE>October 25, 2001</DATE>&#10;        <ORDERS>&#10; <ITEM>&#10;                <PRODUCT>Asparagus</PRODUCT>&#10; <NUMBER>12</NUMBER>&#10; <PRICE>$2.95</PRICE>&#10;            </ITEM>&#10; <ITEM>&#10;                <PRODUCT>Lettuce</PRODUCT>&#10; <NUMBER>6</NUMBER>&#10; <PRICE>$11.50</PRICE>&#10;            </ITEM>&#10; </ORDERS>&#10;    </CUSTOMER>&#10;</DOCUMENT>
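The linefeed references in this output can be checked with a short sketch using Python's standard xml.etree.ElementTree module. The two one-element documents are made up for the illustration; the point is only that the decimal and hexadecimal forms of the reference name the same character.

```python
import xml.etree.ElementTree as ET

# The same linefeed character, written as a decimal and as a
# hexadecimal character reference.
decimal = ET.fromstring("<DATE>&#10;</DATE>").text
hexadecimal = ET.fromstring("<DATE>&#xA;</DATE>").text

if __name__ == "__main__":
    print(ord(decimal), ord(hexadecimal))
    print(decimal == hexadecimal)
```

Both references come back from the parser as the single character with code 10, the linefeed.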

In their canonical form, documents can be compared directly, and any differences will be readily apparent.

This example is also useful because it shows exactly what a DTD looks like and provides us with the perfect starting point for Chapter 3, which is where we start writing DTDs ourselves and create valid XML documents.