The Well-Formed XML Document | XML and SOAP Programming for BizTalk(TM) Servers (DV-MPS Programming)

For the impatient, here are the most common rules for well-formed XML documents. I'll explain these rules in detail in the next section.

Every element must have a start tag and an end tag.

A document must have a single, unique root element.

Element and attribute names are case-sensitive.

Elements must be properly nested. (They cannot have structural overlaps.)

Certain characters must be escaped, or represented by a combination of characters.

Attribute values must be in quotes.

Empty elements have a special form that they must adhere to.
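If you'd like to watch a parser enforce these rules, here's a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree module as the parser (my choice for illustration, nothing this chapter depends on), and is_well_formed is a helper name I made up. Each sample breaks one rule from the list above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One sample per rule in the list above; each should be rejected.
broken_samples = {
    "missing end tag": "<a><b>text</a>",
    "two root elements": "<a/><b/>",
    "case mismatch": "<Item>text</item>",
    "overlapping elements": "<a><b><i>text</b></i></a>",
    "unescaped ampersand": "<a>AT&T</a>",
    "unquoted attribute value": "<a id=1/>",
    "malformed empty element": "<a><br/ ></a>",
}

results = {rule: is_well_formed(xml) for rule, xml in broken_samples.items()}
for rule, accepted in results.items():
    print(f"{rule}: {'accepted' if accepted else 'rejected'}")

# A document that follows all seven rules.
print(is_well_formed('<a id="1"><b>AT&amp;T</b><br/></a>'))
```

A yes-or-no answer is all that well-formedness gives you: the parser either accepts the whole document or stops at the first violation.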

The XML Declaration

A well-formed XML document starts with an optional XML declaration. The declaration defines the document as XML for the XML parser. The declaration can also provide the parser with useful information about the stream. The XML declaration looks like this:

 <?xml version="1.0"?>

XML 1.0 is by far the most widely used version of XML (a later XML 1.1 Recommendation exists but sees little use), and the version attribute is required whenever the declaration is present. Notice that the attribute value, 1.0, is in quotes. You can use either single quotes or double quotes.

The encoding declaration follows the version attribute. It looks like this:

 encoding="SHIFT_JIS"

The encoding declaration indicates the character set used to create the document. By default, XML documents are encoded in Unicode, either 8-bit (UTF-8) or 16-bit (UTF-16). If you use either of these encodings, you don't need to specify the encoding attribute.
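To see why the encoding declaration matters, here's a small sketch that assumes Python's xml.etree.ElementTree module and uses ISO-8859-1 as the example encoding (both are my choices for illustration). The same text is stored as ISO-8859-1 bytes twice, once with a declaration and once without.

```python
import xml.etree.ElementTree as ET

# The same text stored as ISO-8859-1 bytes, with and without a declaration.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'.encode("iso-8859-1")
undeclared = "<name>caf\u00e9</name>".encode("iso-8859-1")

# With the declaration, the parser decodes the bytes correctly.
print(ET.fromstring(declared).text)

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError as err:
    # With no declaration the bytes are read as UTF-8, and a lone 0xE9 byte is not valid UTF-8.
    undeclared_ok = False
    print("rejected:", err)

# UTF-8 bytes need no encoding declaration at all.
print(ET.fromstring("<name>caf\u00e9</name>".encode("utf-8")).text)
```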

The stand-alone document declaration follows the encoding declaration. It looks like this:

 standalone="yes"

The value of the stand-alone declaration can be either yes or no. A value of yes tells the parser that everything it needs to know about the document is contained in the stream that follows. That is, no external pieces (entities) are specified, and no defaulted attribute values are specified by an external schema. A value of no tells the application that it might need to go outside and get some more information to complete parsing.

Both the encoding declaration and the stand-alone document declaration must appear inside the XML declaration. Combining them all gives us an XML declaration that looks like this:

 <?xml version="1.0" encoding="EBCDIC" standalone="no"?>

The XML declaration must come first in the well-formed document and, like the rest of XML, it is case-sensitive.
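Here's a rough sketch of how an application might read the declaration and what happens when it isn't first. It assumes Python's xml.dom.minidom and xml.etree.ElementTree modules; the memo element and the parses helper are names I invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# minidom keeps the values from the XML declaration on the document object.
doc = minidom.parseString(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><memo/>')
print(doc.version, doc.encoding, doc.standalone)

def parses(text):
    """Hypothetical helper: True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses('<?xml version="1.0"?><memo/>'))   # declaration comes first
print(parses(' <?xml version="1.0"?><memo/>'))  # a single space precedes the declaration
print(parses('<memo/>'))                        # no declaration at all
```

Even one space ahead of the declaration is enough to get the document rejected, while leaving the declaration out entirely is fine.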

Start Tags and End Tags

A document consists of text and markup. Markup indicates the structure of the document to the application that processes the document. Markup consists of delimiting tags that are used to describe the data within a document. A tag is a markup construction that is set off from the content of a document by the left and right angle brackets (<>). Tag pairs are used to delimit data.

An element is a particular data object that you need to identify. An element has properties, the most important of which is the element name. (An element name is a descriptive name given to a piece or type of data.) You use tags to indicate the start and end of an element. Each element in an XML document must start with an indicator called a start tag and end with a complementary end tag. The end tag looks like the start tag, except that the end tag includes a slash (/) before the element name. The data between the start and end tags is the content of the element. Content can be text or even other elements with their own tags. Here's an example of an element:

 <prologue>We the people</prologue>

The start tag is <prologue>. The content is "We the people." The end tag is </prologue>. All three parts taken together constitute an element.
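The three parts map directly onto what a parser hands back. This is a minimal sketch, assuming Python's xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<prologue>We the people</prologue>")
print(element.tag)   # the element name, taken from the start tag
print(element.text)  # the content between the start and end tags

# Leave off the end tag and the parser refuses the document.
try:
    ET.fromstring("<prologue>We the people")
    missing_end_tag_ok = True
except ET.ParseError:
    missing_end_tag_ok = False
print(missing_end_tag_ok)
```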

Element names must begin with a letter or an underscore. Following the first character are zero or more letters, digits, underscores, hyphens, or periods.

NOTE
Many XML 1.0 parsers will let you use a colon in your element names, but you shouldn't—colons are reserved for a W3C Recommendation called "Namespaces in XML" that was adopted after the XML 1.0 Recommendation. I will talk about namespaces later in this chapter.
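The naming rule can be approximated with a regular expression. Treat this as a simplified sketch: it covers only ASCII letters, whereas real XML names may contain many other Unicode letters, and it includes the period, which the XML 1.0 name grammar permits after the first character. NAME_PATTERN and is_plain_name are names I made up, and colons are deliberately left out, following the advice in the note.

```python
import re

# Simplified, ASCII-only form of the naming rule: a letter or underscore first,
# then letters, digits, underscores, periods, or hyphens. No colons.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def is_plain_name(name):
    """Hypothetical helper: True for names that follow the simplified rule."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["prologue", "_private", "List-item", "2ndItem", "-dash", "soap:Envelope"]:
    print(candidate, is_plain_name(candidate))
```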

Root Elements

An XML document must have a single, unique root element: an element that has a start tag at the top of the document and an end tag at the bottom of the document. The first start tag that is encountered is considered the root element. As soon as the end tag for this element is encountered, the document is finished. If the parser encounters another start tag after this, it will issue an error, because that new start tag will be at the root level, and only one element is allowed at the root level.
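A quick sketch shows the error described above, assuming Python's xml.etree.ElementTree module. The memo and to elements and the root_error helper are invented for the example.

```python
import xml.etree.ElementTree as ET

def root_error(text):
    """Hypothetical helper: return the parser's complaint, or None if the text parses."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

# One root element: no complaint.
print(root_error("<memo><to>Paul</to></memo>"))

# A second element at the root level: the parser reports an error.
print(root_error("<memo><to>Paul</to></memo><memo/>"))
```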

Case Sensitivity

In HTML, element and attribute names are not case-sensitive. That is, you can create a table with any of these tags: <TABLE>, <table>, or even <TaBlE>. The HTML parser considers them all equivalent.

All XML element and attribute names are case-sensitive. If you start an element with the <List-item> tag, you must end it with the </List-item> end tag—not the </LISTITEM> end tag, the </list-item> end tag, or any other variation.
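Here's a small sketch of that rule in action, assuming Python's xml.etree.ElementTree module. The parses helper and the item element are hypothetical names for the example.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Hypothetical helper: True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses("<List-item>milk</List-item>"))  # start and end tags match exactly
print(parses("<List-item>milk</list-item>"))  # different case in the end tag
print(parses("<List-item>milk</LISTITEM>"))   # a different name altogether

# Attribute names are case-sensitive too, so these are two distinct attributes.
item = ET.fromstring('<item id="1" ID="2"/>')
print(item.attrib)
```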

Proper Nesting

Elements in a well-formed document must have a proper tree structure. The end tag for an element contained inside another element must come before the end tag of the parent. Most HTML browsers don't enforce this rule. For example, on an HTML page, you could write code like this:

 <P>The rights <B>of the <I>individual person</B> outweigh</I> the rights of the collective.

Notice that the B element starts, and then the I element starts. But the B element ends before the I element ends. A browser would render this code as follows:

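The rights of the individual person outweigh the rights of the collective. (Here "of the individual person" appears in bold and "individual person outweigh" appears in italics.)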

This is not proper nesting. XML that is formatted this way is rejected outright by the parser. To achieve the same effect with properly nested XML, the code would look like this:

 <P>The rights <B>of the <I>individual person</I></B> <I>outweigh</I> the rights of the collective.</P>
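To compare the two versions, here's a minimal sketch that assumes Python's xml.etree.ElementTree module. I've added a closing </P> tag to the overlapping version so that the overlap is the only problem, and outline is a helper I made up to print the tree.

```python
import xml.etree.ElementTree as ET

overlapping = ("<P>The rights <B>of the <I>individual person</B> outweigh</I> "
               "the rights of the collective.</P>")
nested = ("<P>The rights <B>of the <I>individual person</I></B> <I>outweigh</I> "
          "the rights of the collective.</P>")

try:
    ET.fromstring(overlapping)
    overlap_accepted = True
except ET.ParseError as err:
    overlap_accepted = False
    print("rejected:", err)

def outline(element, depth=0):
    """Hypothetical helper: return the element tree as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(nested))
print("\n".join(tree_lines))
```

The outline of the nested version is a clean tree: P contains B and I, and B contains its own I.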

Special Characters

The parser depends on a small number of characters to determine which parts of an XML document are content and which parts are markup. To differentiate content from markup, the parser continually looks for special characters. We've used three such characters in this chapter so far—the left and right angle brackets (<>) and the slash (/). If you want to use special characters in your XML content and you don't want the parser to see them as markup, you must use entity references. An entity reference is a string of characters that are read by the parser and translated into another character. For example, if you place a left angle bracket somewhere in the data in a well-formed XML document, you'll get an error because the parser will be confused. Upon seeing the left angle bracket, the parser begins to turn what it's parsing into either a new tag or a closing tag. Consider the following markup:

 <P>Paul used the calculation A<B, but I don't agree.</P>

You, Paul, and I see a calculation here, because we read the sentence with human cognitive powers. The parser, however, will infer from the left angle bracket before the B that a tag is coming up. The parser will find the B, which is a valid start character for an element name, and then it will find the comma and throw an error, because the comma is not a valid character inside of a name.

If you want a literal left angle bracket to appear in your document (as a less-than character, for example), you must use the &lt; entity reference:

 <P>Paul used the calculation A&lt;B, but I don't agree.</P>

Another character to watch out for is the ampersand. Notice that the entity reference starts with the ampersand character. If you want to store a literal ampersand, you must use the &amp; entity reference, as shown here:

 <P>My AT&amp;T mobile phone is working much better now that they put the cell antenna on my neighbor's birdhouse.</P>

A well-formed parser recognizes five entity references:

Entity Reference	Meaning
&lt;	< (less than)
&gt;	> (greater than)
&amp;	& (ampersand)
&apos;	' (apostrophe or single quote)
&quot;	" (double quote)
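You rarely need to type entity references by hand. As a sketch, assuming Python's standard library, the escape function in xml.sax.saxutils replaces the ampersand and the angle brackets for you, and the parser turns all five references back into literal characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

sentence = "Paul used the calculation A<B, but I don't agree."

# escape() replaces &, <, and > with entity references.
markup = "<P>" + escape(sentence) + "</P>"
print(markup)

# Parsing turns the entity reference back into the literal character.
round_trip = ET.fromstring(markup).text
print(round_trip)

print(escape("AT&T"))

# All five predefined entity references, decoded by the parser.
all_five = ET.fromstring("<t>&lt; &gt; &amp; &apos; &quot;</t>").text
print(all_five)
```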

Attributes

Sometimes, just naming an element isn't enough. You can qualify or describe XML elements using attributes. Attributes contain additional information about the element. Attributes appear as name-value pairs in the start tag of an element:

 <insuranceClaim dateFiled="2000-06-24">

The attribute name is dateFiled. The attribute value is 2000-06-24. Notice that an equal sign separates the name from the value. You can have white space on either side or both sides of the equal sign—that's up to you.

Notice, also, that the value is in quotes. In HTML, attribute values don't need quotes if the value is a single word. In XML, all attribute values must be in quotes, even if the value is a single word. You can use either single quotes or double quotes, as long as the opening and closing quotes match (not one of each).

You must escape certain characters if they appear in an attribute value. If you use single quotes to delimit your attribute value, you can use the double quote as a literal. Likewise, if you use double quotes to delimit your value, you can use a single quote inside:

 <driveway length="350'">
 <gangster name='Wally "Fingers" Gambino'>

What if you need to use both single and double quotes inside an attribute value? Use the &apos; or &quot; entity references:

 <irish-gangster name='Shawn "Lefty" O&apos;Doull'>

You must also escape the left angle bracket:

 <math calculation="A&lt;B">

You cannot place attributes in the end tag.
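The quoting rules are easy to get wrong by hand, so here's a sketch that lets a library pick the quotes. It assumes Python's xml.sax.saxutils.quoteattr function and the xml.etree.ElementTree parser; the claim element and the parses helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# quoteattr() picks a delimiter and escapes whatever would clash with it.
values = ["350'", 'Wally "Fingers" Gambino', 'Shawn "Lefty" O\'Doull', "A<B"]
for value in values:
    print(quoteattr(value))

# The parser hands back the literal value, with entity references resolved.
gangster = ET.fromstring("<irish-gangster name='Shawn \"Lefty\" O&apos;Doull'/>")
print(gangster.get("name"))

def parses(text):
    """Hypothetical helper: True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses("<claim dateFiled=2000-06-24/>"))           # value not in quotes
print(parses('<claim></claim dateFiled="2000-06-24">'))  # attribute in the end tag
```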

Over the years, XML programmers (and SGML programmers before that) have debated whether to specify a certain piece of information as content inside an element or as the value of an attribute. Below are some factors to consider when deciding which form is appropriate.

Whereas elements indicate objects such as purchase orders, line items, and part numbers, I like to think of attributes as specifying properties, such as date-last-modified, author, or currency-type. In other words, think of an attribute as a modifier of an element, just as an adjective modifies a noun.

Attributes are slightly more efficient in syntax. The smallest element has a seven-character overhead, whereas the smallest attribute has a five-character overhead. The difference becomes more dramatic when the attributes or elements have longer names.

Attributes are easier to access than elements in the W3C Document Object Model (DOM), but elements and attributes are equally easy to access in XSL.

Attributes cannot have element structure. If an object has child objects, make it an element, not an attribute.

But don't come to blows over these decisions. Let me make it easy for you: if you can't decide, make the information piece an element. And move on.
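To make the DOM and overhead points concrete, here's a rough sketch using Python's xml.dom.minidom module, which is one DOM implementation among many. The order and currency names are invented for the example.

```python
from xml.dom import minidom

as_attribute = minidom.parseString('<order currency="USD"/>').documentElement
as_element = minidom.parseString("<order><currency>USD</currency></order>").documentElement

# An attribute value is one call away.
print(as_attribute.getAttribute("currency"))

# Element content means finding the child element first, then its text node.
currency = as_element.getElementsByTagName("currency")[0]
print(currency.firstChild.data)

# Markup overhead for a one-character name and an empty value.
print(len("<a></a>"), len(' a=""'))
```

The element form repeats the name in the end tag, which is why its overhead grows faster than the attribute's as names get longer.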

Empty Elements

Earlier I mentioned that elements have a start tag, an end tag, and content. Sometimes you might have an element that doesn't have any content. Why bother? Why would you create an element without any content? Consider the horizontal rule tag in HTML. You use it to draw a rule across the page, indicating some kind of break. A horizontal rule element has no content. Since it has no content, you have no need for an end tag to delimit the end of the element. In HTML, the horizontal rule tag looks like this:

 <HR>

In a well-formed XML document, that horizontal rule tag would work fine, except that the parser would look for the HR element's end tag—and wouldn't find it. You could use the following code to create a well-formed <HR> tag:

 <HR></HR>

But that would be a silly tag because it really doesn't indicate the purpose of the element. To solve this problem, the developers of the XML Recommendation used an obscure SGML feature that combines the start tag and the end tag into a single tag—the empty element tag. The empty element tag starts like a start tag and ends with a slash. For our horizontal rule element, the tag looks like this:

 <HR/>

When the XML parser sees this tag, it knows not to look for a corresponding end tag.

An empty element can have attributes. Consider the HTML element that inserts an image into a page: IMG. Although the element IMG uses the SRC attribute to point to an image, it's still an empty element. To be well formed, the image element must look like this:

 <IMG SRC="/images/hookah.gif"/>

Notice that the tag ends with a slash-right-angle bracket combination (/>). The slash and right angle bracket characters must be together, without any characters or white space between them.
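Here's a short sketch of the empty-element forms, assuming Python's xml.etree.ElementTree module. The parses helper is a name I made up.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Hypothetical helper: True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses("<HR>"))       # start tag with no end tag
print(parses("<HR></HR>"))  # start tag and end tag, no content
print(parses("<HR/>"))      # empty element tag
print(parses("<HR/ >"))     # white space between the slash and the bracket

# Both accepted forms produce the same thing: an element with no text and no children.
long_form = ET.fromstring("<HR></HR>")
short_form = ET.fromstring("<HR/>")
print(long_form.text, short_form.text, len(long_form), len(short_form))

# An empty element can still carry attributes.
image = ET.fromstring('<IMG SRC="/images/hookah.gif"/>')
print(image.get("SRC"))
```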

Comments

Comments can appear almost anywhere in an XML document. Comments look like this:

 <!-- Better check these figures before sending the file -->

The parser is not required to pass comments to the application, and most applications never see them. Comments provide a good way to keep notes in the source document that are not part of its data.
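As a sketch of how this plays out in one parser, Python's xml.etree.ElementTree drops comments by default, so the application sees only the elements. The figures and total elements and the parses helper are hypothetical. The last line shows one of the places a comment cannot go: inside a tag.

```python
import xml.etree.ElementTree as ET

document = """<figures>
  <!-- Better check these figures before sending the file -->
  <total>42</total>
</figures>"""

# With the default parser settings, the comment never reaches the tree.
root = ET.fromstring(document)
child_tags = [child.tag for child in root]
print(child_tags)

def parses(text):
    """Hypothetical helper: True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A comment inside a start tag is not allowed.
print(parses('<total <!-- not here --> unit="USD">42</total>'))
```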

Examples of Well-Formed Documents

This section offers some examples of well-formed documents. The first example, which follows, describes a configuration entry for a piece of computer hardware. Based on the element names, you can probably see what it describes. That's the power of descriptive markup!

 <configuration type="printer">
   <parm name="port">/usr/lpr</parm>
   <parm name="driver">/usr/drivers/HP5SIPS</parm>
   <parm name="option">sheet feeder</parm>
   <configured/>
   <online/>
 </configuration>

Notice that the preceding example has no XML declaration. As you learned earlier in the chapter, the XML declaration is optional.

The next example shows a valuable piece of information:

 <?xml version="1.0"?>
 <Joke author="Groucho Marx">
   <Setup>Outside of a dog, a book is man's best friend.</Setup>
   <Punchline>Inside of a dog, it's too dark to read.</Punchline>
 </Joke>

Take a look at the XML declaration at the top. Notice also that the case of each start tag and end tag matches.
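Because both documents are well formed, any XML parser can read them without knowing anything about printers or jokes. Here's a minimal sketch, assuming Python's xml.etree.ElementTree module; the variable names are mine.

```python
import xml.etree.ElementTree as ET

configuration = ET.fromstring("""
<configuration type="printer">
  <parm name="port">/usr/lpr</parm>
  <parm name="driver">/usr/drivers/HP5SIPS</parm>
  <parm name="option">sheet feeder</parm>
  <configured/>
  <online/>
</configuration>
""")

# Collect each parm element's name attribute and content.
parms = {parm.get("name"): parm.text for parm in configuration.findall("parm")}
# The empty elements act as simple on/off flags.
flags = [child.tag for child in configuration if child.text is None and len(child) == 0]
print(configuration.get("type"), parms, flags)

# The declaration must be the very first thing in the text, so the string starts with it.
joke = ET.fromstring("""<?xml version="1.0"?>
<Joke author="Groucho Marx">
  <Setup>Outside of a dog, a book is man's best friend.</Setup>
  <Punchline>Inside of a dog, it's too dark to read.</Punchline>
</Joke>
""")
print(joke.get("author"))
print(joke.find("Setup").text)
print(joke.find("Punchline").text)
```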