What Is a Well-Formed XML Document? | Real World XML (2nd Edition)

What Is a Well-Formed XML Document?

The W3C, which is responsible for the term well-formedness, defines it this way in the XML 1.0 recommendation (I'll take a look at each of these stipulations later):

A textual object is a well-formed XML document if:

Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification (that is, www.w3.org/TR/REC-xml).
Each of the parsed entities that is referenced directly or indirectly within the document is well-formed.

The W3C calls the individual grammar rules within a working draft or recommendation productions. In this case, to be well formed, a document must follow the "document" production, which means that the document itself must have three parts: a prolog (which can be empty), a root element, and an optional miscellaneous part.

The prolog, which I'll talk about in a few pages, can include an XML declaration (such as <?xml version = "1.0"?>) and an optional miscellaneous part that includes comments, processing instructions, and so on.

The root element of the document can itself hold other elements; in fact, it's hard to imagine useful XML documents in which the root element does not contain other elements. Note that each well-formed XML document must have exactly one root element and that all other elements in the document must be enclosed in the root element (this does not apply to the parts of the prolog, of course, because items such as processing instructions and comments are not considered elements).
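To make the one-root rule concrete, here is a minimal sketch that hands a few strings to Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the sample strings are my own illustrations, not part of this chapter's listings.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One root element that encloses everything else: accepted.
one_root = "<DOCUMENT><CUSTOMER/><CUSTOMER/></DOCUMENT>"

# Two top-level elements: rejected, because a document has exactly one root.
two_roots = "<CUSTOMER/><CUSTOMER/>"

# A comment after the root is fine; comments are not elements.
root_then_comment = "<DOCUMENT/><!-- end of file -->"

for sample in (one_root, two_roots, root_then_comment):
    print(is_well_formed(sample), sample)
```

Note that the helper only reports whether the parser raised an error; it says nothing about validity against a DTD.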

The optional miscellaneous part can be made up of XML comments, processing instructions, and whitespace (including spaces, tabs, and so on). I'll take a look at each of these three parts in this chapter: the prolog, the root element, and the miscellaneous part.
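As a rough illustration of those three parts, the following sketch parses a small document with Python's xml.dom.minidom and lists what sits at the top level. The sample document and the processing instruction target app-hint are hypothetical, made up for this example.

```python
from xml.dom import minidom

SAMPLE = """<?xml version="1.0"?>
<!-- prolog: a comment before the root element -->
<?app-hint mode="demo"?>
<DOCUMENT>
    <CUSTOMER/>
</DOCUMENT>
<!-- miscellaneous part: a comment after the root element -->
"""

doc = minidom.parseString(SAMPLE)

# The XML declaration is not exposed as a node; comments, processing
# instructions, and the root element are.
top_level = [node.nodeName for node in doc.childNodes]
print(top_level)
print("root element:", doc.documentElement.tagName)
```

Comments show up under the node name #comment, a processing instruction under its target, and the root element under its tag name, so the list mirrors the prolog, root element, and miscellaneous part in document order.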

The next stipulation in the list says that to be well formed, XML documents must also satisfy the well-formedness constraints listed in the XML 1.0 specification. This means that XML documents must adhere to the syntax rules specified in the XML 1.0 recommendation. I'll talk about those rules in this chapter, including the naming rules you should follow when naming tags, how to nest elements, and so on.
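To give a feel for what such syntax rules look like in practice, here is a small sketch that asks Python's xml.etree.ElementTree parser for its first complaint about several fragments. The function name first_error and the sample fragments are assumptions chosen for illustration; the exact wording of the messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def first_error(text: str):
    """Return the parser's first well-formedness complaint, or None if there is none."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


samples = {
    "properly nested": "<NAME><LAST_NAME>Smith</LAST_NAME></NAME>",
    "overlapping elements": "<NAME><LAST_NAME>Smith</NAME></LAST_NAME>",
    "missing end tag": "<NAME><LAST_NAME>Smith</NAME>",
    "name starts with a digit": "<1NAME>Smith</1NAME>",
    "unquoted attribute value": "<NAME id=1>Smith</NAME>",
}

for label, text in samples.items():
    print(f"{label}: {first_error(text) or 'well formed'}")
```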

Well-Formedness Constraint

If you search the XML 1.0 specification, which also appears in Appendix A, you'll see that all constraints you need to satisfy to create a well-formed document are marked with the term "Well-Formedness Constraint."

Finally, the last stipulation in the W3C well-formed document list is that each parsed entity must itself be well formed. What does that mean?

The parts of an XML document are called entities. An entity is a part of a document that can hold text or binary data (but not both). An entity may refer to other entities and so cause them to be included in the document. Entities are either parsed (character data that the XML processor reads as XML) or unparsed (non-XML text or binary data that the XML processor does not parse). In other words, the term entity is just a generic way of referring to a data storage unit in XML. For example, a file with a few XML elements in it is an entity, but it's not a document unless it's also well formed.

The stipulation about parsed entities means that if you refer to an entity and so include that entity's data (which can include data from external sources) in your document, the included data must itself be well formed.
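Here is a minimal sketch of that stipulation, using internal entities declared in a DOCTYPE and Python's xml.etree.ElementTree parser. The entity name item and both sample documents are hypothetical; I'm also assuming the parser expands internal entities, which the standard-library parser does for small cases like this one.

```python
import xml.etree.ElementTree as ET

# The entity "item" expands to well-formed content, so the document parses.
GOOD = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE ORDERS [
    <!ENTITY item "<PRODUCT>Tomatoes</PRODUCT>">
]>
<ORDERS><ITEM>&item;</ITEM></ORDERS>"""

# Here the replacement text opens <PRODUCT> but never closes it, so the
# parsed entity is not well formed and the reference to it is rejected.
BAD = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE ORDERS [
    <!ENTITY item "<PRODUCT>Tomatoes">
]>
<ORDERS><ITEM>&item;</ITEM></ORDERS>"""

root = ET.fromstring(GOOD)
print(root.find("ITEM/PRODUCT").text)

try:
    ET.fromstring(BAD)
    bad_accepted = True
except ET.ParseError as err:
    bad_accepted = False
    print("rejected:", err)
```

The two documents differ only in the entity's replacement text, which is the point: the markup around the reference is fine in both cases, and it is the included data that decides whether the whole document is well formed.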

That's the W3C's definition of a well-formed document, but it's far from clear at this point. What are the well-formedness constraints we need to follow? What exactly can be in a prolog? To answer questions like that, the rest of this chapter is devoted to examining what these constraints mean in detail.

I'm going to start by looking at an XML document that we can refer to throughout the chapter as we examine what it means for a document to be well formed. In this case, I'll store customer data for specific purchases in a document called ch02_01.xml. I start with the XML declaration itself:

 <?xml version = "1.0"?>

Here, I'm using the <?xml?> declaration to indicate that this document is written in XML and to specify the only version possible at this time, version 1.0. Because all the documents in this chapter are self-contained (they don't refer to or include any external entities), I can also use the standalone attribute, setting it to "yes" like this:

 <?xml version = "1.0" standalone="yes"?>

This attribute, which may or may not be used by an XML parser, indicates that the document is completely self-contained. Technically, XML documents do not need to start with the XML declaration, but the W3C recommends it.
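If you're curious what a parser actually sees in the declaration, this sketch uses the XmlDeclHandler callback of Python's xml.parsers.expat module to capture it. The dictionary name seen and the one-line sample document are assumptions for illustration only.

```python
import xml.parsers.expat

seen = {}


def on_xml_decl(version, encoding, standalone):
    # standalone is 1 for "yes", 0 for "no", and -1 when the attribute is absent;
    # encoding is None when the declaration does not name one.
    seen["version"] = version
    seen["encoding"] = encoding
    seen["standalone"] = standalone


parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = on_xml_decl
parser.Parse('<?xml version = "1.0" standalone="yes"?><DOCUMENT/>', True)

print(seen)
```

Whether an application does anything with the standalone value is up to that application; the parser here simply reports it.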

Next, I add the root element, which I'll call <DOCUMENT> in this case (although you can use any name):

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    .
    .
    .
</DOCUMENT>

The root element can contain other elements, of course, as here, where I add elements for three customers to the document:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

For each customer, I will store a name in a <NAME> element, which itself encloses <LAST_NAME> and <FIRST_NAME> elements like this:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

I can also store the details of customer orders with a new element, <DATE>, and an element named <ORDERS> like this:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
        .
        .
        .
        </ORDERS>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

I can also record each item a customer bought with an <ITEM> element, which itself is broken up into <PRODUCT>, <NUMBER>, and <PRICE> elements:

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>.98</PRICE>
            </ITEM>
        </ORDERS>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
    <CUSTOMER>
    .
    .
    .
    </CUSTOMER>
</DOCUMENT>

That's what the data looks like for one customer; here's the full document, including data for all three customers:

Listing ch02_01.xml

<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Jones</LAST_NAME>
            <FIRST_NAME>Polly</FIRST_NAME>
        </NAME>
        <DATE>October 20, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Bread</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Apples</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

Documents like these can grow very long and consist of markup that is many levels deep. Handling such documents is not a problem for XML parsers, however, as long as the document is well formed (and, if the parser is a validating parser, valid). In this chapter, I'll refer back to this document, modifying it and taking a look at its parts as we see what makes a document well formed.
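To show that a parser really can walk markup this deep without trouble, here is a sketch that reads an abbreviated copy of the listing (first customer only) with Python's xml.etree.ElementTree and adds up the order. The function name order_totals is hypothetical, and treating <NUMBER> as a quantity and <PRICE> as a unit price is my assumption about what the data means.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Abbreviated copy of the listing: only the first customer is included here.
SAMPLE = """<?xml version = "1.0" standalone="yes"?>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>"""


def order_totals(xml_text: str) -> dict:
    """Map "FIRST LAST" to the sum of NUMBER * PRICE over that customer's items."""
    totals = {}
    for customer in ET.fromstring(xml_text).iter("CUSTOMER"):
        first = customer.findtext("NAME/FIRST_NAME")
        last = customer.findtext("NAME/LAST_NAME")
        # Assumption: NUMBER is a quantity and PRICE is a price per unit.
        totals[f"{first} {last}"] = sum(
            (int(item.findtext("NUMBER")) * Decimal(item.findtext("PRICE"))
             for item in customer.iter("ITEM")),
            Decimal("0"),
        )
    return totals


totals = order_totals(SAMPLE)
print(totals)
```

I use Decimal rather than float here so that prices such as .98 are not subject to binary rounding when they're multiplied and summed.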

We're ready now to take XML documents apart piece by piece. I'll start with the basics and work up through the prolog, root element, enclosed elements, and so on. We're going to see it all in this chapter.

At their most basic level, then, XML documents are combinations of markup and character data, and I'm going to start from that point.