2.10 Checking Documents for Well-Formedness | XML in a Nutshell, Third Edition

2.10 Checking Documents for Well-Formedness

Every XML document, without exception, must be well-formed. This means it must adhere to a number of rules, including the following:

Every start-tag must have a matching end-tag.
Elements may nest but may not overlap.
There must be exactly one root element.
Attribute values must be quoted.
An element may not have two attributes with the same name.
Comments and processing instructions may not appear inside tags.
No unescaped < or & signs may occur in the character data of an element or attribute.

This is not an exhaustive list. There are many, many ways a document can be malformed. You'll find a complete list in Chapter 21. Some of these involve constructs that we have not yet discussed, such as DTDs. Others are extremely unlikely to occur if you follow the examples in this chapter (for example, including whitespace between the opening < and the element name in a tag).
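The rules listed above can be tried out with a short script. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which is built on the Expat parser, not on any parser discussed in this chapter). The helper name is_well_formed and the sample snippets are hypothetical illustrations, one per rule, plus one well-formed control.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reports the problem; it never tries to repair it.
        return False


# Hypothetical snippets, each breaking one rule from the list above.
SAMPLES = {
    "missing end-tag": "<a><b></a>",
    "overlapping elements": "<a><b><c></b></c></a>",
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<a x=1/>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "comment inside a tag": "<a <!-- note --> />",
    "unescaped ampersand": "<a>AT&T</a>",
    "unescaped less-than": "<a>1 < 2</a>",
    "well-formed control": '<a x="1"><b>AT&amp;T</b></a>',
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        status = "well-formed" if is_well_formed(text) else "malformed"
        print(f"{label}: {status}")
```

Only the control sample should be reported as well-formed. The exact wording of the underlying error messages depends on the parser, so the sketch reduces each result to a yes-or-no answer.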

Whether the error is small or large, likely or unlikely, an XML parser reading a document is required to report it. It may or may not report multiple well-formedness errors it detects in the document. However, the parser is not allowed to try to fix the document and make a best-faith effort to provide what it thinks the author really meant. It can't fill in missing quotes around attribute values, insert an omitted end-tag, or ignore the comment that's inside a start-tag. The parser is required to return an error. The objective here is to avoid the bug-for-bug compatibility wars that plagued early web browsers and continue to this day. Consequently, before you publish an XML document (whether that document is a web page, input to a database, or something else), you'll want to check it for well-formedness.

The simplest way to do this is by loading the document into a web browser that understands XML documents, such as Mozilla. If the document is well-formed, the browser will display it. If it isn't, then it will show an error message.

Instead of loading the document into a web browser, you can use an XML parser directly. Most XML parsers are not intended for end users. They are class libraries designed to be embedded into an easier-to-use program, such as Mozilla. They provide a minimal command-line interface, if that; this interface is often not particularly well documented. Nonetheless, it can sometimes be quicker to run a batch of files through a command-line interface than to load each of them into a web browser. Furthermore, once you learn about DTDs and schemas, you can use the same tools to validate documents, which most web browsers won't do.
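To illustrate what running a batch of files through a parser might look like, here is a minimal sketch of a command-line checker. It assumes Python's standard xml.etree.ElementTree module rather than any particular parser named in this chapter, and the function name check_files is a hypothetical choice. It checks well-formedness only; it does not validate against a DTD or schema.

```python
import sys
import xml.etree.ElementTree as ET


def check_files(paths):
    """Map each path to None if well-formed, else to an error message."""
    results = {}
    for path in paths:
        try:
            ET.parse(path)
            results[path] = None
        except ET.ParseError as err:
            # The document was read but broke a well-formedness rule.
            results[path] = f"not well-formed: {err}"
        except OSError as err:
            # The file is missing, unreadable, or not a regular file.
            results[path] = f"could not read: {err}"
    return results


if __name__ == "__main__":
    # Every command-line argument is treated as a filename to check.
    for path, problem in check_files(sys.argv[1:]).items():
        print(f"{path}: {problem or 'well-formed'}")
```

Saved under a name of your choosing, the script takes the filenames to check as its arguments and prints one line per file.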

There are many XML parsers available in a variety of languages. Here, we'll demonstrate checking for well-formedness with the Gnome Project's libxml, which you can download from http://xmlsoft.org. This open source package is written in fairly portable C and runs on most major platforms, including Windows, Linux, and Mac OS X. (It's preinstalled in many Linux distros.) The procedure should be similar for other parsers, although details will vary.

libxml is actually a library, but it includes a program called xmllint that uses this library to check files for well-formedness. xmllint is run from a Unix shell or DOS prompt like any other command-line program. The arguments are the URLs or filenames of the documents you want to check. Here's the result of running xmllint against an early version of Example 2-5. The very first line of output tells you where the first problem in the file is:

  % xmllint 2-5.xml
  2-5.xml:5: error: Unescaped '<' not allowed in attributes values
  <person born='1912/06/23'
  ^
  2-5.xml:5: error: attributes construct error
  <person born='1912/06/23'
  ^
  2-5.xml:5: error: error parsing attribute name
  <person born='1912/06/23'
  ^
  2-5.xml:5: error: attributes construct error
  <person born='1912/06/23'
  ^
  2-5.xml:5: error: xmlParseStartTag: problem parsing attributes
  <person born='1912/06/23'
  ^
  2-5.xml:5: error: Couldn't find end of Start Tag image line 3
  <person born='1912/06/23'
  ^

As you can see, it found an error. In this case the error message wasn't particularly helpful. The actual problem wasn't that an attribute value contained a < character; it was that the closing quote was missing from the height attribute value, so the parser kept reading that value until it ran into the < of a later tag. Still, that was enough information to locate and fix the problem. Despite the long list of output, xmllint only reports the first error in the document, so you may have to run it multiple times until all the mistakes are found and fixed. Once we fixed Example 2-5 to make it well-formed, xmllint simply printed the file it read:

  % xmllint 2-5.xml
  <biography xmlns:xlink="http://www.w3.org/1999/xlink/">
    <image source="http://www.turing.org.uk/turing/pi1/busgroup.jpg"
      width="152" height="345"/>
    <paragraph><person born='1912-06-23'
      died='1954-06-07'><first_name>Alan</first_name> ...

Now that the document has been corrected to be well-formed, it can be passed to a web browser, a database, or whatever other program is waiting to receive it. Almost any nontrivial document crafted by hand will contain well-formedness mistakes, which makes it important to check your work before publishing it.
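The way a missing closing quote leads a parser to report the error somewhere after the real mistake can be reproduced on a small scale. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module instead of xmllint; the sample document and the helper name first_error are hypothetical, and the message text will differ from xmllint's.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the height value on line 2 lacks its closing quote.
DOC = (
    "<biography>\n"
    '  <image width="152" height="345/>\n'
    "  <paragraph>Alan Turing</paragraph>\n"
    "</biography>\n"
)


def first_error(text: str):
    """Return (line, column, message) for the first error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # position is a (line, column) pair pointing at where parsing failed.
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    print(first_error(DOC))
```

The mistake is on line 2, but the parser keeps reading the attribute value until it meets the < that opens the next tag, so the reported line is later than the line that needs fixing. As with xmllint, only the first error is returned, so a document with several mistakes has to be checked again after each fix.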