Building Well-Formed Document Structure | Real World XML (2nd Edition)

Building Well-Formed Document Structure

We've learned a lot of the syntax and rules of creating XML documents at the element and character data levels. It's time to move on to the next level: actually giving your document structure.

The W3C has a lot of rules about how to structure your document in a way to make it well formed, and I'm going to take a look at those rules here. In this chapter, I'm going to talk about only standalone documents; in the next chapter, we'll see that we have to adjust these points somewhat for documents that have a DTD.

Checking Well-Formedness

If you have doubts about whether your XML document is well formed, use an online XML validator, such as the excellent one hosted by the Brown University Scholarly Technology Group at www.stg.brown.edu/service/xmlvalid/. You'll get a complete report on your document's well-formedness and validity. To see all the well-formedness constraints as set up by the W3C, look at www.w3.org/TR/REC-xml (or Appendix A) and search for the text "Well-Formedness Constraint," which is how the W3C names those constraints.

An XML Declaration Should Begin the Document

The first structural guideline is to start the document with an XML declaration. Technically, the declaration is optional, but if you include one, it must be absolutely the first thing in the document for the document to be well formed (not even whitespace may come before it), like this:

 <?xml version = "1.0" standalone="yes"?>
 <DOCUMENT>
     <CUSTOMER STATUS="Good credit">
         <NAME>
             <LAST_NAME>Smith</LAST_NAME>
             <FIRST_NAME>Sam</FIRST_NAME>
         </NAME>
         <DATE>October 15, 2003</DATE>
         <ORDERS>
             <ITEM>
                 <PRODUCT>Tomatoes</PRODUCT>
                 <NUMBER>8</NUMBER>
                 <PRICE>.25</PRICE>
             </ITEM>
             <ITEM>
                 <PRODUCT>Oranges</PRODUCT>
                 <NUMBER>24</NUMBER>
                 <PRICE>.98</PRICE>
             </ITEM>
         </ORDERS>
     </CUSTOMER>
     .
     .
     .

Do You Need an XML Declaration?

The W3C says that XML documents should have an XML declaration, but documents really don't need to have one in all cases. For example, when you're combining XML documents into one large one, you don't want to include an XML declaration at the head of each section of the document.
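As a rough illustration of the declaration rule (not part of the original text), the sketch below feeds three small documents to Python's standard xml.etree.ElementTree parser. The helper name is_well_formed is hypothetical, and other XML processors may report the problem differently.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

declaration_first = '<?xml version = "1.0" standalone="yes"?><DOCUMENT/>'
space_before = " " + declaration_first   # one space ahead of the declaration
no_declaration = "<DOCUMENT/>"           # declaration omitted entirely

for label, text in [("declaration first", declaration_first),
                    ("space before declaration", space_before),
                    ("no declaration", no_declaration)]:
    print(label, "->", is_well_formed(text))
```

Only the document with a space in front of the declaration should be refused; leaving the declaration out altogether is accepted.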

Include One or More Elements

To be a well-formed document, a document must include one or more elements. The first element it includes, of course, is the root element; all other elements are enclosed by that element. The examples we've seen throughout this chapter show how this works, as here, where this XML document contains multiple elements within the root element:

 <?xml version = "1.0" standalone="yes"?>
 <DOCUMENT>
     <GREETING>
         Hello From XML
     </GREETING>
     <MESSAGE>
         Welcome to the wild and woolly world of XML.
     </MESSAGE>
 </DOCUMENT>

Include Both Start and End Tags for Nonempty Elements

In HTML, Web browsers often handle the case in which you omit end tags for HTML elements, even if you shouldn't omit those end tags according to the HTML specification. For example, if you use the <P> tag and then follow it with another <P> tag, without using a </P> tag, the browser will have no problem.

In XML, the story is different. To make sure a document is well formed, every nonempty element must have both a start tag and an end tag, as in the example we just saw:

 <?xml version = "1.0" standalone="yes"?>
 <DOCUMENT>
     <GREETING>
         Hello From XML
     </GREETING>
     <MESSAGE>
         Welcome to the wild and woolly world of XML.
     </MESSAGE>
 </DOCUMENT>

In fact, there's another well-formedness constraint here: End tags must match start tags to complete an element.
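To see how a parser reacts when an end tag is missing or does not match, here is a minimal sketch using Python's xml.etree.ElementTree. The helper parse_error is a hypothetical name, and the exact error text depends on the parser, so the sketch only reports whether an error occurred.

```python
import xml.etree.ElementTree as ET

def parse_error(text):
    """Return the parser's error message, or None if text is well formed."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

complete = "<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>"
missing_end = "<DOCUMENT><GREETING>Hello From XML</DOCUMENT>"
# XML is case sensitive, so </greeting> does not close <GREETING>.
wrong_case = "<DOCUMENT><GREETING>Hello From XML</greeting></DOCUMENT>"

for text in (complete, missing_end, wrong_case):
    print(parse_error(text) or "well formed")
```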

Close Empty Tags with />

Some elements, called empty elements, don't have closing tags (although they may have attributes). These elements have no content, which means that they do not enclose any character data or markup. Instead, they are made up entirely of one tag, like this:

 <?xml version = "1.0" standalone="yes"?>
 <DOCUMENT>
     <GREETING TEXT = "Hello From XML" />
 </DOCUMENT>

In XML, you must always end empty elements with />, as shown here, if you want your document to be well formed. In general, the current crop of the major Web browsers deals well with elements such as <BR />. This is good because the alternative is to write such elements like <BR></BR>, and that can be confusing. In fact, some browsers, such as Netscape Navigator, interpret that markup as two <BR> elements.
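From an XML processor's point of view, the empty-element form and a start tag immediately followed by an end tag describe the same element. The following sketch, which assumes Python's xml.etree.ElementTree as the processor, parses both spellings and compares the results; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<GREETING TEXT = "Hello From XML" />')
long_form = ET.fromstring('<GREETING TEXT = "Hello From XML"></GREETING>')

# Both spellings yield an element with one attribute, no text, no children.
for element in (short_form, long_form):
    print(element.tag, element.attrib, element.text, len(element))
```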

The Root Element Must Contain All Other Elements

One element in well-formed documents contains all other elements. As we know, that element is called the root element. In this case, the root element is the <BOOKS> element:

 <?xml version = "1.0" standalone="yes"?>
 <BOOKS>
     <BOOK>
         <TITLE>
             Inside XML
         </TITLE>
         <REVIEW>
             Excellent
         </REVIEW>
     </BOOK>
     <BOOK>
         <TITLE>
             Other XML Book
         </TITLE>
         <REVIEW>
             OK
         </REVIEW>
     </BOOK>
 </BOOKS>

In this case, the root element must contain all other elements (excluding the XML declaration, comments, and other nonelements). This makes it easy for XML processors to handle XML documents as trees, starting at the root element, as we'll see when we start parsing XML documents.
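The tree view can be sketched with Python's xml.etree.ElementTree, used here only as one example of an XML processor. The compact books string is an abbreviated version of the example above, and two_roots is a made-up fragment showing that a second top-level element is refused.

```python
import xml.etree.ElementTree as ET

books = """<?xml version="1.0" standalone="yes"?>
<BOOKS>
    <BOOK><TITLE>Inside XML</TITLE><REVIEW>Excellent</REVIEW></BOOK>
    <BOOK><TITLE>Other XML Book</TITLE><REVIEW>OK</REVIEW></BOOK>
</BOOKS>"""

# Parsing hands back the root element; everything else hangs below it.
root = ET.fromstring(books)
titles = [book.findtext("TITLE") for book in root]
print(root.tag, titles)

# Two top-level elements means there is no single root element.
two_roots = "<BOOK/><BOOK/>"
try:
    ET.fromstring(two_roots)
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False
print(two_roots_accepted)
```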

Nest Elements Correctly

A very big part of making sure documents are well formed is ensuring that elements nest correctly (in fact, that's one of the reasons for the term well-formed). The idea here is that if an element contains the start tag of a nonempty element, it must also contain that element's end tag.

For example, this XML is fine:

 <?xml version = "1.0" standalone="yes"?>
 <DOCUMENT>
     <GREETING>
         Hello From XML
     </GREETING>
     <MESSAGE>
         Welcome to the wild and woolly world of XML.
     </MESSAGE>
 </DOCUMENT>

However, there's a nesting problem in this next document because an XML processor will encounter the <MESSAGE> tag before finding the closing </GREETING> tag:

 <?xml version = "1.0" standalone="yes"?>
 <DOCUMENT>
     <GREETING>
         Hello From XML
     <MESSAGE>
     </GREETING>
         Welcome to the wild and woolly world of XML.
     </MESSAGE>
 </DOCUMENT>

Because you must nest elements correctly to create a well-formed document, and because XML processors are supposed to refuse documents that are not well formed, you can always count on every nonroot element to have exactly one parent element that encloses it. For example, in the well-formed example just before the incorrectly nested one, the <GREETING> and <MESSAGE> elements both have the same parent: the <DOCUMENT> element itself, which is also the root element. Note that a parent element can enclose any number of child elements, including none.
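The one-parent guarantee can be made concrete with a short sketch. It assumes Python's xml.etree.ElementTree, which does not store parent links itself, so the hypothetical parent_of dictionary is built by walking the tree.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring(
    "<DOCUMENT>"
    "<GREETING>Hello From XML</GREETING>"
    "<MESSAGE>Welcome to the wild and woolly world of XML.</MESSAGE>"
    "</DOCUMENT>"
)

# Map each child element to the single element that encloses it.
parent_of = {child: parent
             for parent in document.iter()
             for child in parent}

for child, parent in parent_of.items():
    print(child.tag, "is enclosed by", parent.tag)

# The root element is the only element with no parent.
print(document in parent_of)
```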

Use Unique Attribute Names

One of the well-formedness constraints that the XML 1.0 specification lists is that no attribute name may appear more than once in the same start tag or empty-element tag. It's hard to see how you would violate this one except by mistake, as in this case, where I give a person two last names:

 <PERSON LAST_NAME="Wooster" LAST_NAME="Jeeves">

Note that because XML is case sensitive, attributes with different capitalization are different, as in this case (although it's still hard to see how you'd write this except by mistake):

 <PERSON LAST_NAME="Wooster" Last_Name="Jeeves">

(In general, using attribute names that differ only in terms of capitalization is a really bad idea.)
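Both cases can be checked quickly against a real parser. This sketch assumes Python's xml.etree.ElementTree; the helper attributes_or_error is a hypothetical name, and the elements are written as empty-element tags so that each string is a complete document.

```python
import xml.etree.ElementTree as ET

def attributes_or_error(text):
    """Return the root element's attributes, or the parser's error message."""
    try:
        return ET.fromstring(text).attrib
    except ET.ParseError as err:
        return str(err)

same_name = '<PERSON LAST_NAME="Wooster" LAST_NAME="Jeeves" />'
different_case = '<PERSON LAST_NAME="Wooster" Last_Name="Jeeves" />'

print(attributes_or_error(same_name))        # an error message
print(attributes_or_error(different_case))   # two distinct attributes
```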

Use Only the Five Pre-existing Entity References

XML has five predefined entity references. An entity reference is replaced by the corresponding entity when the XML document is processed. You may already know about entity references from HTML; for example, the HTML entity reference &copy; is replaced by the © symbol when a browser parses an HTML document.

As in HTML, general entity references in XML start with & and end with ;. Parameter entity references, which we'll use in DTDs in the next chapter, start with % and end with ;. Here are the five predefined entity references in XML and the characters they are replaced with when parsed:

&amp; The & character
&lt; The < character
&gt; The > character
&apos; The ' character
&quot; The " character

Normally, these characters are tricky to handle in XML documents because XML processors give them special importance. That is, < and > straddle markup tags, you use quotation marks to surround attribute values, and the & character starts entity references. Replacing them with the previous entity references makes them safe because the XML processor replaces them with the appropriate character when processing the document. Using an entity reference for a character is often called escaping that character (following the terminology of programming languages that use "escape sequences" to embed special characters in text).

For example, say that you wanted to use the term "The S&O Railway" in a document; you could use the &amp; entity reference for the ampersand this way:

 <TOUR CAPTION="The S&amp;O Railway" />

Although there are only five predefined entity references in XML, you can define new entity references. I'll take a look at how to do that in the next chapter on DTDs.
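As a small check of the five predefined references, the sketch below parses them with Python's xml.etree.ElementTree (an assumption; any conforming processor should behave the same way) and also tries an HTML-only reference in a document that has no DTD.

```python
import xml.etree.ElementTree as ET

# The five predefined references expand to the characters they stand for.
escaped = "<TEXT>&amp; &lt; &gt; &apos; &quot;</TEXT>"
expanded = ET.fromstring(escaped).text
print(expanded)

tour = ET.fromstring('<TOUR CAPTION="The S&amp;O Railway" />')
caption = tour.get("CAPTION")
print(caption)

# &copy; is defined in HTML but is not one of XML's predefined references.
try:
    ET.fromstring("<TEXT>&copy;</TEXT>")
    copy_accepted = True
except ET.ParseError:
    copy_accepted = False
print(copy_accepted)
```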

The Final ; in Entity References

HTML browsers often let you omit the final ; in entity references if the entity reference is followed by whitespace (if the entity reference is embedded in nonwhitespace text, you must include the final ; , even in HTML). However, you cannot omit the final ; in XML entity references.

Surround Attribute Values with Quotation Marks

In HTML, there's no problem if you omit the quotation marks around attribute values (as long as those values don't contain any whitespace). For example, this element presents no problem to HTML browsers:

 <IMG SRC=image.jpg>

However, XML processors would refuse such an element because omitting the quotation marks around the attribute value "image.jpg" is a violation of well-formedness. Here's how this element would look when written properly:

 <IMG SRC="image.jpg" />

You can also use single quotation marks, like this:

 <IMG SRC='image.jpg' />

In fact, if the attribute value contains double quotes, you should surround it with single quotes, as we've seen:

 <quotation text='He said, "Not that!"' />

XML makes provisions for handling single and double quotes inside attribute values: you can always replace single quotes with the entity reference for apostrophes, &apos;, and double quotes with the entity reference &quot;. For example, to assign the attribute height the value 5'6", you can do it this way:

 <person height="5&apos;6&quot;" />
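If you generate XML from a program, a library can pick the quoting for you. The sketch below uses quoteattr from Python's standard xml.sax.saxutils module and then parses the result to confirm that the value survives the round trip; the element name ITEM and attribute name VALUE are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

values = ["image.jpg", 'He said, "Not that!"', "5'6\""]

round_trip = []
for value in values:
    quoted = quoteattr(value)   # adds the surrounding quotes and escapes
    element = ET.fromstring("<ITEM VALUE=" + quoted + " />")
    round_trip.append(element.get("VALUE"))
    print(quoted, "->", element.get("VALUE"))
```

For the value containing both kinds of quotes, quoteattr keeps the apostrophe as is and writes only the double quote as &quot;, which is equivalent to escaping both.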

In XHTML, the XML-based version of HTML 4, you must surround attribute values in quotation marks, just as in any other XML document. I'm sure that requirement is going to be one of the most persistently troublesome for Web authors switching to XHTML, simply because it's so easy to forget.

A few more well-formedness constraints on attribute values bear mention. Attribute values cannot contain direct or indirect references to external entities (more on this in the next chapter), and you cannot use the < character in attribute values. If you must use <, use the entity reference &lt; instead, like this, where I'm assigning the text <-- to the TEXT attribute:

 <ARROW TEXT="&lt;--" />

In fact, so strong is the prohibition against using < except to start markup that you shouldn't use it anywhere in the document except for that purpose (see the next topic).
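The attribute-value rules in this topic can be tried out with a few one-line documents. This is a sketch assuming Python's xml.etree.ElementTree as the XML processor; the helper name accepted is hypothetical.

```python
import xml.etree.ElementTree as ET

def accepted(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

unquoted = '<IMG SRC=image.jpg />'       # attribute value without quotes
raw_less_than = '<ARROW TEXT="<--" />'   # literal < inside the value
escaped = '<ARROW TEXT="&lt;--" />'      # < written as &lt;

print(accepted(unquoted), accepted(raw_less_than), accepted(escaped))

# The application still sees the real character after parsing.
arrow_text = ET.fromstring(escaped).get("TEXT")
print(arrow_text)
```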

Use < and & Only to Start Tags and Entities

XML processors assume that < always starts a tag and & always starts an entity reference, so you should avoid using those characters for anything else. We've already seen this example where the ampersand in "The S&O Railway" is replaced by &amp;:

 <TOUR CAPTION="The S&amp;O Railway" />

You should particularly avoid the < character in nonmarkup text. This can be difficult sometimes, as when the < character must be used as the less-than operator in JavaScript, as in this example in XHTML:

 <?xml version="1.0"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head>
         <title>
             Using The if Statement In JavaScript
         </title>
     </head>
     <body>
         <script language="javascript">
             var budget
             budget = 234.77
             if (budget < 0) {
                 document.writeln("Uh oh.")
             }
         </script>
         <center>
             <h1>
                 Using The if Statement In JavaScript
             </h1>
         </center>
     </body>
 </html>

The W3C suggests that in cases like this, you should enclose the JavaScript code in a CDATA section (see the next topic in this chapter) so that the XML processor will ignore it. Unfortunately, no major browser today understands CDATA sections. Another possible solution is to enclose the JavaScript code in a comment, <!-- -->. However, the W3C doesn't recommend this because XML processors are allowed to remove comments before passing the XML to the underlying application, and so they would remove the JavaScript code entirely from the document.

You can use &lt; for the < operator; in fact, this is what you should do, like this:

 <?xml version="1.0"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head>
         <title>
             Using The if Statement In JavaScript
         </title>
     </head>
     <body>
         <script language="javascript">
             var budget
             budget = 234.77
             if (budget &lt; 0) {
                 document.writeln("Uh oh.")
             }
         </script>
         <center>
             <h1>
                 Using The if Statement In JavaScript
             </h1>
         </center>
     </body>
 </html>

Practically speaking, however, this still represents a problem for the major browsers, although it's the way you should go in the long run. In the short run, you should actually remove the whole problem from the scope of the browser by placing the script code in an external file, here named script.js:

 <?xml version="1.0"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head>
         <title>
             Using The if Statement In JavaScript
         </title>
     </head>
     <body>
         <script language="javascript" src="script.js">
         </script>
         <center>
             <h1>
                 Using The if Statement In JavaScript
             </h1>
         </center>
     </body>
 </html>
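Returning to the escaping and CDATA options discussed earlier in this topic, the following sketch compares how an XML processor (here Python's xml.etree.ElementTree, an assumption) treats the same line of script written three ways. It says nothing about how browsers handle the markup; it only shows what the parser passes on to an application. The escape function comes from the standard xml.sax.saxutils module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

script = 'if (budget < 0) { document.writeln("Uh oh.") }'

raw_doc = "<script>" + script + "</script>"
escaped_doc = "<script>" + escape(script) + "</script>"
cdata_doc = "<script><![CDATA[" + script + "]]></script>"

# The unescaped < makes the document not well formed.
try:
    ET.fromstring(raw_doc)
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False
print(raw_accepted)

# Escaping and CDATA both deliver the original script text.
escaped_text = ET.fromstring(escaped_doc).text
cdata_text = ET.fromstring(cdata_doc).text
print(escaped_text == script, cdata_text == script)
```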