Chapter 2. XML Fundamentals

CONTENTS

  •  2.1 XML Documents and XML Files
  •  2.2 Elements, Tags, and Character Data
  •  2.3 Attributes
  •  2.4 XML Names
  •  2.5 Entity References
  •  2.6 CDATA Sections
  •  2.7 Comments
  •  2.8 Processing Instructions
  •  2.9 The XML Declaration
  •  2.10 Checking Documents for Well-Formedness

This chapter shows you how to write simple XML documents. You'll see that an XML document is built from text content marked up with text tags such as <SKU>, <Record_ID>, and <author> that look superficially like HTML tags. However, in HTML you're limited to about a hundred predefined tags that describe web-page formatting. In XML you can create as many tags as you need. Furthermore, these tags will mostly describe the type of content they contain rather than formatting or layout information. In XML you don't say that something is italicized or indented or bold; you say that it's a book or a biography or a calendar.

Although XML is looser than HTML with regard to which tags it allows, it is much stricter about where those tags are placed and how they're written. In particular, all XML documents must be well-formed. Well-formedness rules specify constraints such as "Every start-tag must have a matching end-tag" and "Attribute values must be quoted." These rules are unbreakable, which makes parsing XML documents easy and writing them a little harder, but they still allow an almost unlimited flexibility of expression.

2.1 XML Documents and XML Files

An XML document contains text, never binary data. It can be opened with any program that knows how to read a text file. Example 2-1 is close to the simplest XML document imaginable. Nonetheless, it is a well-formed XML document. XML parsers can read it and understand it (at least as far as a computer program can be said to understand anything).

Example 2-1. A very simple yet complete XML document
<person>
  Alan Turing
</person>

In the most common scenario, this document would be the entire contents of a file named person.xml, or perhaps 2-1.xml. However, XML is not picky about the filename. As far as the parser is concerned, this file could be called person.txt, person, or Hey you, there's some XML in this here file! Your operating system may or may not like these names, but an XML parser won't care. The document might not even be in a file at all. It could be a record or a field in a database. It could be generated on the fly by a CGI program in response to a browser query. It could even be stored in more than one file, though that's unlikely for such a simple document. If it is served by a web server, it will probably be assigned the MIME media type application/xml or text/xml. However, specific XML applications may use more specific MIME media types such as application/mathml+xml, application/xslt+xml, image/svg+xml, text/vnd.wap.wml, or even text/html (in very special cases).

For generic XML documents, application/xml should be preferred to text/xml, although most web servers use text/xml by default. text/xml uses the ASCII character set as a default, which is incorrect for most XML documents.

2.2 Elements, Tags, and Character Data

The document in Example 2-1 is composed of a single element named person. The element is delimited by the start-tag <person> and the end-tag </person>. Everything between the start-tag and the end-tag of the element (exclusive) is called the element's content. The content of this element is the text string:

Alan Turing

The whitespace is part of the content, though many applications will choose to ignore it. <person> and </person> are markup. The string "Alan Turing" and its surrounding whitespace are character data. The tag is the most common form of markup in an XML document, but there are other kinds we'll discuss later.
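To see that the surrounding whitespace really is part of the content, you can hand Example 2-1 to a parser and look at what comes back. The following is only a minimal sketch; it assumes Python and its standard xml.etree.ElementTree module, neither of which this chapter otherwise relies on.

```python
import xml.etree.ElementTree as ET

# Example 2-1, with its line breaks and indentation preserved.
document = "<person>\n  Alan Turing\n</person>"

person = ET.fromstring(document)

element_name = person.tag   # the element's name, taken from its tags
content = person.text       # the character data, whitespace included

print(repr(element_name))
print(repr(content))
print(repr(content.strip()))  # what a whitespace-ignoring application might keep
```

The parser reports the line breaks and indentation faithfully; discarding them is a decision the application makes afterward.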

2.2.1 Tag Syntax

XML tags look superficially like HTML tags. Start-tags begin with < and end-tags begin with </. Both of these are followed by the name of the element and are closed by >. However, unlike HTML tags, you are allowed to make up new XML tags as you go along. To describe a person, use <person> and </person> tags. To describe a calendar, use <calendar> and </calendar> tags. The names of the tags generally reflect the type of content inside the element, not how that content will be formatted.

2.2.1.1 Empty elements

There's also a special syntax for empty elements, i.e., elements that have no content. Such an element can be represented by a single empty-element tag that begins with < but ends with />. For instance, in XHTML, an XMLized reformulation of standard HTML, the line-break and horizontal-rule elements are written as <br /> and <hr /> instead of <br> and <hr>. These are exactly equivalent to <br></br> and <hr></hr>. Which form you use for empty elements is completely up to you. However, what you cannot do in XML and XHTML (unlike HTML) is use only the start-tag, for instance <br> or <hr>, without the matching end-tag. That would be a well-formedness error.
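The equivalence of the two spellings, and the error caused by a lone start-tag, can be checked with a parser. This sketch assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is invented for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


short_form = ET.fromstring("<p>line one<br/>line two</p>")
long_form = ET.fromstring("<p>line one<br></br>line two</p>")

# Both spellings produce the same tree: a br child with no content.
same_tree = ET.tostring(short_form) == ET.tostring(long_form)

# A start-tag on its own, HTML style, is rejected.
html_style_ok = is_well_formed("<p>line one<br>line two</p>")

print(same_tree, html_style_ok)
```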

2.2.1.2 Case sensitivity

XML, unlike HTML, is case sensitive. <Person> is not the same as <PERSON> is not the same as <person>. If you open an element with a <person> tag, you can't close it with a </PERSON> tag. You're free to use upper- or lowercase or both as you choose. You just have to be consistent within any one element.
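A parser treats a change of case exactly like a change of name. As a small sketch, again assuming Python's xml.etree.ElementTree (first_error is a hypothetical helper):

```python
import xml.etree.ElementTree as ET


def first_error(text):
    """Return the parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


consistent = first_error("<Person>Alan Turing</Person>")
mismatched = first_error("<person>Alan Turing</PERSON>")

print(consistent)   # None: any mix of case is fine if both tags agree
print(mismatched)   # an error message: the end-tag does not match
```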

2.2.2 XML Trees

Let's look at a slightly more complicated XML document. Example 2-2 is a person element that contains more information suitably marked up to show its meaning.

Example 2-2. A more complex XML document describing a person
<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>

2.2.2.1 Parents and children

This XML document is still composed of one person element. However, now this element doesn't merely contain undifferentiated character data. It contains four child elements: a name element and three profession elements. The name element contains two child elements of its own, first_name and last_name.

The person element is called the parent of the name element and the three profession elements. The name element is the parent of the first_name and last_name elements. The name element and the three profession elements are sometimes called each other's siblings. The first_name and last_name elements are also siblings.
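These family relationships are exactly what a tree-building parser exposes. The sketch below assumes Python's xml.etree.ElementTree; because that library does not store parent links, the parent_of dictionary is built by hand and is purely illustrative.

```python
import xml.etree.ElementTree as ET

example_2_2 = """<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>"""

person = ET.fromstring(example_2_2)

# The children of person, in document order.
children = [child.tag for child in person]

# Map every child element to its one parent by walking the tree.
parent_of = {child: parent for parent in person.iter() for child in parent}

first_name = person.find("name/first_name")
parent_tag = parent_of[first_name].tag
siblings = [element.tag for element in parent_of[first_name]]

print(children)
print(parent_tag, siblings)
```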

As in human society, any one parent may have multiple children. However, unlike human society, XML gives each child exactly one parent, not two or more. Each element (with one exception I'll note shortly) has exactly one parent element. That is, it is completely enclosed by another element. If an element's start-tag is inside some element, then its end-tag must also be inside that element. Overlapping tags, as in <strong><em>this common example from HTML</strong></em>, are prohibited in XML. Since the em element begins inside the strong element, it must also finish inside the strong element.

2.2.2.2 The root element

Every XML document has one element that does not have a parent. This is the first element in the document and the element that contains all other elements. In Example 2-1 and Example 2-2, the person element filled this role. It is called the root element of the document. It is also sometimes called the document element. Every well-formed XML document has exactly one root element. Since elements may not overlap, and since all elements except the root have exactly one parent, XML documents form a data structure programmers call a tree. Figure 2-1 diagrams this relationship for Example 2-2. Each gray box represents an element. Each black box represents character data. Each arrow represents a containment relationship.

Figure 2-1. A tree diagram for Example 2-2

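The tree for Example 2-2 can also be printed as an indented outline. This is a rough sketch assuming Python's xml.etree.ElementTree; the outline function is made up for the purpose, and it ignores whitespace-only character data.

```python
import xml.etree.ElementTree as ET

example_2_2 = """<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>"""


def outline(element, depth=0):
    """Return one indented line per element, with its character data if any."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag
    if text:
        line += ": " + text
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(example_2_2)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one containment step, and the single unindented line is the root element.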

2.2.3 Mixed Content

In Example 2-2, the contents of the first_name, last_name, and profession elements were character data, that is, text that does not contain any tags. The contents of the person and name elements were child elements and some whitespace that most applications will ignore. This dichotomy between elements that contain only character data and elements that contain only child elements (and possibly a little whitespace) is common in documents that are data oriented. However, XML can also be used for more free-form, narrative documents such as business reports, magazine articles, student essays, short stories, web pages, and so forth, as shown by Example 2-3.

Example 2-3. A narrative-organized XML document
<biography>
  <name><first_name>Alan</first_name> <last_name>Turing</last_name>
  </name> was one of the first people to truly deserve the name
  <emphasize>computer scientist</emphasize>. Although his contributions
  to the field are too numerous to list, his best-known are the
  eponymous <emphasize>Turing Test</emphasize> and
  <emphasize>Turing Machine</emphasize>.
  <definition>The <term>Turing Test</term> is to this day the standard
  test for determining whether a computer is truly intelligent. This
  test has yet to be passed. </definition>
  <definition>The <term>Turing Machine</term> is an abstract finite
  state automaton with infinite memory that can be proven equivalent
  to any other finite state automaton with arbitrarily large memory.
  Thus what is true for a Turing machine is true for all equivalent
  machines no matter how implemented.
  </definition>
  <name><last_name>Turing</last_name></name> was also an accomplished
  <profession>mathematician</profession> and
  <profession>cryptographer</profession>. His assistance
  was crucial in helping the Allies decode the German Enigma
  machine. He committed suicide on <date><month>June</month>
  <day>7</day>, <year>1954</year></date> after being
  convicted of homosexuality and forced to take female
  hormone injections.
</biography>

The root element of this document is biography. The biography contains name, definition, profession, and emphasize child elements. It also contains a lot of raw character data. Some of these elements such as last_name and profession only contain character data. Others such as name contain only child elements. Still others such as definition contain both character data and child elements. These elements are said to contain mixed content. Mixed content is common in XML documents containing articles, essays, stories, books, novels, reports, web pages, and anything else that's organized as a written narrative. Mixed content is less common and harder to work with in computer-generated and processed XML documents used for purposes such as database exchange, object serialization, persistent file formats, and so on. One of the strengths of XML is the ease with which it can be adapted to the very different requirements of human-authored and computer-generated documents.
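Part of what makes mixed content harder to process is that the character data arrives in pieces, interleaved with the child elements. The following sketch shows how one tree API slices up a shortened definition element. It assumes Python's xml.etree.ElementTree, whose text and tail properties are specific to that library rather than part of XML itself.

```python
import xml.etree.ElementTree as ET

definition = ET.fromstring(
    "<definition>The <term>Turing Test</term> is to this day the standard "
    "test for determining whether a computer is truly intelligent.</definition>"
)

term = definition.find("term")

leading_text = definition.text   # character data before the first child
child_text = term.text           # character data inside the child
trailing_text = term.tail        # character data after the child's end-tag

# itertext() walks the subtree and yields every piece in document order.
full_text = "".join(definition.itertext())

print(repr(leading_text), repr(child_text), repr(trailing_text))
print(full_text)
```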

2.3 Attributes

XML elements can have attributes. An attribute is a name-value pair attached to the element's start-tag. Names are separated from values by an equals sign and optional whitespace. Values are enclosed in single or double quotation marks. For example, this person element has a born attribute with the value 1912-06-23 and a died attribute with the value 1954-06-07:

<person born="1912-06-23" died="1954-06-07">
  Alan Turing
</person>

This next element is exactly the same as far as an XML parser is concerned. It simply uses single quotes instead of double quotes, puts some extra whitespace around the equals signs, and reorders the attributes.

<person died = '1954-06-07'  born = '1912-06-23' >
  Alan Turing
</person>

The whitespace around the equals signs is purely a matter of personal aesthetics. The single quotes may be useful in cases where the attribute value itself contains a double quote. Attribute order is not significant.
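One way to convince yourself that the two spellings are equivalent is to parse both and compare the attributes the parser reports. This is a sketch assuming Python's xml.etree.ElementTree, which exposes attributes as an ordinary dictionary.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07">Alan Turing</person>'
)
single_quoted = ET.fromstring(
    "<person died = '1954-06-07'  born = '1912-06-23' >Alan Turing</person>"
)

# Quote style, spacing, and order all disappear once the tag is parsed.
same_attributes = double_quoted.attrib == single_quoted.attrib
born = single_quoted.get("born")

print(same_attributes, born)
```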

Example 2-4 shows how attributes might be used to encode much of the same information given in the data-oriented document of Example 2-2.

Example 2-4. An XML document that describes a person using attributes
<person>
  <name first="Alan" last="Turing"/>
  <profession value="computer scientist"/>
  <profession value="mathematician"/>
  <profession value="cryptographer"/>
</person>

This raises the question of when and whether one should use child elements or attributes to hold information. This is a subject of heated debate. Some informaticians maintain that attributes are for metadata about the element while elements are for the information itself. Others point out that it's not always so obvious what's data and what's metadata. Indeed, the answer may depend on where the information is put to use.

What's undisputed is that each element may have no more than one attribute with a given name. That's unlikely to be a problem for a birth date or a death date; it would be an issue for a profession, name, address, or anything else of which an element might plausibly have more than one. Furthermore, attributes are quite limited in structure. The value of the attribute is simply a text string. The division of a date into a year, month, and day with hyphens in the previous example is at the limits of the substructure that can reasonably be encoded in an attribute. Consequently, an element-based structure is a lot more flexible and extensible. Nonetheless, attributes are certainly more convenient in some applications. Ultimately, if you're designing your own XML vocabulary, it's up to you to decide when to use which.
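Both limitations show up as soon as a program touches the data. In the sketch below, which assumes Python's xml.etree.ElementTree, the date arrives as one opaque string that the application must split for itself, and a repeated attribute name is rejected outright. The profession attribute is hypothetical and used only to trigger the error.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person born="1912-06-23" died="1954-06-07"/>')

# The parser hands back a plain string; recovering year, month, and day
# is the application's job.
year, month, day = (int(part) for part in person.get("born").split("-"))

# Repeating an attribute name on one element is a well-formedness error.
try:
    ET.fromstring('<person profession="mathematician" profession="cryptographer"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(year, month, day, duplicate_accepted)
```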

Attributes are also useful in narrative documents, as Example 2-5 demonstrates. Here it's perhaps a little more obvious what belongs to elements and what to attributes. The raw text of the narrative is presented as character data inside elements. Additional information annotating that data is presented as attributes. This includes source references, image URLs, hyperlinks, and birth and death dates. Even here, however, there's more than one way to do it. For instance, the footnote numbers could be attributes of the footnote element rather than character data.

Example 2-5. A narrative XML document that uses attributes
<biography xmlns:xlink="http://www.w3.org/1999/xlink/namespace/">
  <image source="http://www.turing.org.uk/turing/pi1/bus.jpg"
  width="152" height="345"/>
  <person born='1912-06-23'
  died='1954-06-07'><first_name>Alan</first_name>
  <last_name>Turing</last_name> </person> was one of the first people
  to truly deserve the name <emphasize>computer scientist</emphasize>.
  Although his contributions to the field were too numerous to list,
  his best-known are the eponymous <emphasize xlink:type="simple"
  xlink:href="http://cogsci.ucsd.edu/~asaygin/tt/ttest.html">Turing
  Test</emphasize> and <emphasize  xlink:type="simple"
  xlink:href="http://mathworld.wolfram.com/TuringMachine.html">Turing
  Machine</emphasize>.
  <last_name>Turing</last_name> was also an accomplished
  <profession>mathematician</profession> and
  <profession>cryptographer</profession>. His assistance
  was crucial in helping the Allies decode the German Enigma
  machine.<footnote source="The Ultra Secret, F.W. Winterbotham,
  1974">1</footnote>
  He committed suicide on <date><month>June</month> <day>7</day>,
  <year>1954</year></date> after being convicted of homosexuality
  and forced to take female hormone injections.<footnote
  source="Alan Turing: the Enigma, Andrew Hodges, 1983">2</footnote>
</biography>

2.4 XML Names

The XML specification can be quite legalistic and picky at times. Nonetheless, it tries to be efficient where possible. One way it does that is by reusing the same rules for different items where possible. For example, the rules for XML element names are also the rules for XML attribute names, as well as for the names of several less common constructs. Generally, these are referred to simply as XML names.

Element and other XML names may contain essentially any alphanumeric character. This includes the standard English letters A through Z and a through z as well as the digits 0 through 9. XML names may also include non-English letters, numbers, and ideograms such as Ω and 串. They may also include these three punctuation characters:

_

the underscore

-

the hyphen

.

the period

XML names may not contain other punctuation characters such as quotation marks, apostrophes, dollar signs, carets, percent symbols, and semicolons. The colon is allowed, but its use is reserved for namespaces as discussed in Chapter 4. XML names may not contain whitespace of any kind, whether a space, a carriage return, a line feed, a nonbreaking space, and so forth. Finally, all names beginning with the string XML (in any combination of case) are reserved for standardization in W3C XML-related specifications.

XML names may only start with letters, ideograms, and the underscore character. They may not start with a number, hyphen, or period. There is no limit to the length of an element or other XML name. Thus these are all well-formed elements:

  • <Drivers_License_Number>98 NY 32</Drivers_License_Number>

  • <month-day-year>7/23/2001</month-day-year>

  • <first_name>Alan</first_name>

  • <_4-lane>I-610</_4-lane>

  • <téléphone>011 33 91 55 27 55 27</téléphone>


These are not acceptable elements:

  • <Driver's_License_Number>98 NY 32</Driver's_License_Number>

  • <month/day/year>7/23/2001</month/day/year>

  • <first name>Alan</first name>

  • <4-lane>I-610</4-lane>
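If you're unsure whether a name is legal, the quickest test is often to ask a parser. The sketch below assumes Python's xml.etree.ElementTree; the function is_legal_element_name is invented for the illustration, and it is a convenience check rather than a formal implementation of the name grammar. The non-ASCII name téléphone is included as one more illustrative case.

```python
import xml.etree.ElementTree as ET


def is_legal_element_name(name):
    """Ask the parser whether name works as an element name."""
    try:
        ET.fromstring("<{0}></{0}>".format(name))
        return True
    except ET.ParseError:
        return False


legal = ["Drivers_License_Number", "month-day-year", "first_name",
         "_4-lane", "téléphone"]
illegal = ["Driver's_License_Number", "month/day/year", "first name",
           "4-lane"]

results = {name: is_legal_element_name(name) for name in legal + illegal}

for name, accepted in results.items():
    print(name, "accepted" if accepted else "rejected")
```

Note that a parser of this kind does not enforce the reservation of names beginning with XML; that rule is a convention for authors rather than a well-formedness constraint.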

2.5 Entity References

The character data inside an element may not contain a raw unescaped opening angle bracket (<). This character is always interpreted as beginning a tag. If you need to use this character in your text, you can escape it using the &lt; entity reference. When a parser reads the document, it will replace the &lt; entity reference with the actual < character. However, it will not confuse &lt; with the start of a tag. For example:

<SCRIPT LANGUAGE="JavaScript">
  if (location.host.toLowerCase().indexOf("cafeconleche") &lt; 0) {
    location.href="http://www.cafeconleche.org/";
  }
</SCRIPT>

The character data inside an element may not contain a raw unescaped ampersand (&) either. This is always interpreted as beginning an entity or character reference. However, the ampersand may be escaped using the &amp; entity reference like this:

<publisher>O'Reilly &amp; Associates</publisher>

Entity references such as &amp; and &lt; are considered to be markup. When an application parses an XML document, it replaces this particular markup with the actual characters to which the entity reference refers.
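The replacement is easy to observe from the application's side of the parser. In this sketch, which assumes Python's xml.etree.ElementTree, the helper parses and the test element are hypothetical names introduced for the example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text is accepted as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


publisher = ET.fromstring("<publisher>O'Reilly &amp; Associates</publisher>")
comparison = ET.fromstring("<test>indexOf(x) &lt; 0</test>")

publisher_text = publisher.text     # the application sees a plain ampersand
comparison_text = comparison.text   # and a plain less-than sign

# Leaving either character unescaped makes the document malformed.
raw_ampersand_ok = parses("<publisher>O'Reilly & Associates</publisher>")
raw_less_than_ok = parses("<test>indexOf(x) < 0</test>")

print(publisher_text)
print(comparison_text)
print(raw_ampersand_ok, raw_less_than_ok)
```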

XML predefines exactly five entity references. These are:

&lt;

The less-than sign; a.k.a. the opening angle bracket (<)

&amp;

The ampersand (&)

&gt;

The greater-than sign; a.k.a. the closing angle bracket (>)

&quot;

The straight, double quotation marks (")

&apos;

The apostrophe; a.k.a. the straight single quote (')

Only &lt; and &amp; must be used instead of the literal characters in element content. The others are optional. &quot; and &apos; are useful inside attribute values where a raw " or ' might be misconstrued as ending the attribute value. For example, this image tag uses the &apos; entity reference to fill in the apostrophe in O'Reilly:

<image source='oreilly_koala3.gif' width='122' height='66'
       alt='Powered by O&apos;Reilly Books' />

An unescaped greater-than sign (>) cannot be misinterpreted as closing a tag it wasn't meant to close, so &gt; is allowed mostly for symmetry with &lt;.
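When a program generates XML, it is safer to let a library apply these escapes than to do it by hand. As one possible approach, Python's standard xml.sax.saxutils module provides escape for element content and quoteattr for attribute values; using Python here is an assumption made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() replaces &, <, and >; quotation marks are left alone by default.
element_content = escape("1 < 2 && 3 > 2")

# quoteattr() also supplies the surrounding quotes, picking whichever
# style avoids escaping where it can.
alt_attribute = quoteattr("Powered by O'Reilly Books")
both_kinds = quoteattr('He said "O\'Reilly"')

print(element_content)
print(alt_attribute)
print(both_kinds)
```

Choosing double quotes around a value that contains an apostrophe sidesteps the need for &apos; entirely, which is the same trade-off described above, made automatically.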

In addition to the five predefined entity references, you can define others in the document type definition. We'll discuss how to do this in Chapter 3.

2.6 CDATA Sections

When an XML document includes samples of XML or HTML source code, the < and & characters in those samples must be encoded as &lt; and &amp;. The more sections of literal code a document includes and the longer they are, the more tedious this encoding becomes. Instead you can enclose each sample of literal code in a CDATA section. A CDATA section is set off by <![CDATA[ and ]]>. Everything between the <![CDATA[ and the ]]> is treated as raw character data. Less-than signs don't begin tags. Ampersands don't start entity references. Everything is simply character data, not markup.

For example, in a Scalable Vector Graphics (SVG) tutorial written in XHTML, you might see something like this:

<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<![CDATA[
  <svg xmlns="http://www.w3.org/2000/svg"
       width="12cm" height="10cm">
    <ellipse rx="110" ry="130" />
    <rect x="4cm" y="1cm" width="3cm" height="6cm" />
  </svg>
]]>

The SVG source code has been included directly in the XHTML file without carefully replacing each < with &lt;. The result will be a sample SVG document, not an embedded SVG picture, as might happen if this example were not placed inside a CDATA section.

The only thing that cannot appear in a CDATA section is the CDATA section end delimiter ]]>.

CDATA sections exist for the convenience of human authors, not for programs. Parsers are not required to tell you whether a particular block of text came from a CDATA section, from normal character data, or from character data that contained entity references such as &lt; and &amp;. By the time you get access to the data, these differences will have been washed away.
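The following sketch makes the point concrete by parsing the same text written both ways. It assumes Python's xml.etree.ElementTree, which is one of the many APIs that does not report where a CDATA section began or ended.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    '<pre><![CDATA[<rect x="4cm" y="1cm"/> & more]]></pre>'
)
with_escapes = ET.fromstring(
    '<pre>&lt;rect x="4cm" y="1cm"/&gt; &amp; more</pre>'
)

# Once parsed, nothing records which notation the author used.
same_text = with_cdata.text == with_escapes.text

print(with_cdata.text)
print(same_text)
```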

2.7 Comments

XML documents can be commented so that coauthors can leave notes for each other and themselves, documenting why they've done what they've done or items that remain to be done. XML comments are syntactically similar to HTML comments. Just as in HTML, they begin with <!-- and end with the first occurrence of -->. For example:

<!-- I need to verify and update these links when I get a chance. -->

The double hyphen -- should not appear anywhere inside the comment until the closing -->. In particular, a three hyphen close like ---> is specifically forbidden.

Comments may appear anywhere in the character data of a document. They may also appear before or after the root element. (Comments are not elements, so this does not violate the tree structure or the one-root element rules for XML.) However, comments may not appear inside a tag or inside another comment.

Applications that read and process XML documents may or may not pass along information included in comments. They are certainly free to drop them out if they choose. Do not write documents or applications that depend on the contents of comments being available. Comments are strictly for making the raw source code of an XML document more legible to human readers. They are not intended for computer programs. To pass information to a program, you should use a processing instruction instead.
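As an illustration of how readily comments vanish, the default tree builder in Python's xml.etree.ElementTree discards them. The sketch below assumes that library and its default settings; other parsers and other configurations may behave differently.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text is accepted as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring("<note><!-- check this --><p>text</p></note>")

child_tags = [child.tag for child in root]
serialized = ET.tostring(root, encoding="unicode")

# A double hyphen inside a comment is a well-formedness error.
double_hyphen_ok = parses("<note><!-- a -- b --></note>")

print(child_tags)
print(serialized)
print(double_hyphen_ok)
```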

2.8 Processing Instructions

In HTML, comments are sometimes abused to support nonstandard extensions. For instance, the contents of the script element are sometimes enclosed in a comment to protect it from display by a nonscript-aware browser. The Apache web server parses comments in .shtml files to recognize server side includes. Unfortunately, these documents may not survive being passed through various HTML editors and processors with their comments and associated semantics intact. Worse yet, it's possible for an innocent comment to be misconstrued as input to the application.

XML provides the processing instruction as an alternative means of passing information to particular applications that may read the document. A processing instruction begins with <? and ends with ?>. Immediately following the <? is an XML name called the target, possibly the name of the application for which this processing instruction is intended or possibly just an identifier for this particular processing instruction. The rest of the processing instruction contains text in a format appropriate for the applications for which the instruction is intended.

For example, in HTML a robots META tag is used to tell search-engine and other robots whether and how they should index a page. The following processing instruction has been proposed as an equivalent for XML documents:

<?robots index="yes" follow="no"?>

The target of this processing instruction is robots. The syntax of this particular processing instruction is two pseudoattributes, one named index and one named follow, whose values are either yes or no. The semantics of this particular processing instruction are that if the index attribute has the value yes, then search-engine robots should index this page. If index has the value no, then it won't be. Similarly, if follow has the value yes, then links from this document will be followed.
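A parser delivers only the target and the raw text that follows it; interpreting pseudoattributes is left to the application. The sketch below assumes Python's standard xml.dom.minidom module, and the regular expression is a deliberately simple stand-in that handles only double-quoted values. The page element is a made-up root, since a processing instruction cannot form a document by itself.

```python
import re
from xml.dom import minidom

document = minidom.parseString('<?robots index="yes" follow="no"?><page/>')

instruction = document.firstChild
target = instruction.target   # the XML name right after <?
data = instruction.data       # everything after the target, as raw text

# Pull name="value" pairs out of the raw text.
pseudo_attributes = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', data))

should_index = pseudo_attributes.get("index") == "yes"
should_follow = pseudo_attributes.get("follow") == "yes"

print(target, pseudo_attributes, should_index, should_follow)
```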

Other processing instructions may have totally different syntaxes and semantics. For instance, processing instructions can contain an effectively unlimited amount of text. PHP includes large programs in processing instructions. For example:

<?php
  mysql_connect("database.unc.edu", "clerk", "password");
  $result = mysql("HR", "SELECT LastName, FirstName FROM Employees
    ORDER BY LastName, FirstName");
  $i = 0;
  while ($i < mysql_numrows ($result)) {
     $fields = mysql_fetch_row($result);
     echo "<person>$fields[1] $fields[0] </person>\r\n";
     $i++;
  }
  mysql_close( );
?>

Processing instructions are markup, but they're not elements. Consequently, like comments, processing instructions may appear anywhere in an XML document outside of a tag, including before or after the root element. The most common processing instruction, xml-stylesheet, is used to attach stylesheets to documents. It always appears before the root element, as Example 2-6 demonstrates. In this example, the xml-stylesheet processing instruction tells browsers to apply the CSS stylesheet person.css to this document before showing it to the reader.

Example 2-6. An XML document with an xml-stylesheet processing instruction
<?xml-stylesheet href="person.css" type="text/css"?>
<person>
  Alan Turing
</person>

The processing instruction names xml, XML, XmL, etc., in any combination of case, are forbidden to avoid confusion with the XML declaration. Otherwise, you're free to pick any legal XML name for your processing instructions.

2.9 The XML Declaration

XML documents should (but do not have to) begin with an XML declaration. The XML declaration looks like a processing instruction with the name xml and version, standalone, and encoding attributes. Technically, it's not a processing instruction though, just the XML declaration; nothing more, nothing less. Example 2-7 demonstrates.

Example 2-7. A very simple XML document with an XML declaration
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<person>
  Alan Turing
</person>

XML documents do not have to have an XML declaration. However, if an XML document does have an XML declaration, then that declaration must be the first thing in the document. It must not be preceded by any comments, whitespace, processing instructions, and so forth. The reason is that an XML parser uses the first five characters (<?xml) to make some reasonable guesses about the encoding, such as whether the document uses a single byte or multibyte character set. The only thing that may precede the XML declaration is an invisible Unicode byte-order mark. We'll discuss this further in Chapter 5.
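The placement rule is strict enough that even a single leading space breaks it. A brief sketch, assuming Python's xml.etree.ElementTree and a hypothetical helper named parses:

```python
import xml.etree.ElementTree as ET


def parses(data):
    """Return True if data is accepted as a well-formed document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


body = b"\n<person>Alan Turing</person>"

declaration_first = parses(b'<?xml version="1.0"?>' + body)
leading_space = parses(b' <?xml version="1.0"?>' + body)
leading_comment = parses(b'<!-- note --><?xml version="1.0"?>' + body)
no_declaration = parses(b"<person>Alan Turing</person>")

print(declaration_first, leading_space, leading_comment, no_declaration)
```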

2.9.1 encoding

So far we've been a little cavalier about encodings. We've said that XML documents are composed of pure text, but we haven't said what encoding that text uses. Is it ASCII? Latin-1? Unicode? Something else?

The short answer to this question is "Yes." The long answer is that by default XML documents are assumed to be encoded in the UTF-8 variable-length encoding of the Unicode character set. This is a strict superset of ASCII, so pure ASCII text files are also UTF-8 documents. However, most XML processors, especially those written in Java, can handle a much broader range of character sets. All you have to do is tell the parser which character encoding the document uses. Preferably this is done through metainformation, stored in the filesystem or provided by the server. However, not all systems provide character-set metadata so XML also allows documents to specify their own character set with an encoding declaration inside the XML declaration. Example 2-8 shows how you'd indicate that a document was written in the ISO-8859-1 (Latin-1) character set that includes letters like ö needed for many non-English Western European languages.

Example 2-8. An XML document encoded in Latin-1
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<person>
  Erwin Schrödinger
</person>

The encoding attribute is optional in an XML declaration. If it is omitted and no metadata is available, then the Unicode character set is assumed. The parser may use the first several bytes of the file to try to guess which encoding of Unicode is in use. If metadata is available and it conflicts with the encoding declaration, then the encoding specified by the metadata wins. For example, if an HTTP header says a document is encoded in ASCII but the encoding declaration says it's encoded in UTF-8, then the parser will pick ASCII.
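The effect of the encoding declaration is easiest to see at the byte level, with no external metadata involved. This sketch assumes Python's xml.etree.ElementTree reading raw bytes; the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

name = "Erwin Schr\u00f6dinger"   # U+00F6 is o with a diaeresis

# Latin-1 bytes with a matching encoding declaration.
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n<person>' + name + "</person>"
).encode("iso-8859-1")
declared = ET.fromstring(latin1_bytes).text

# The same bytes without the declaration: the parser falls back to UTF-8,
# and the lone 0xF6 byte is not valid UTF-8.
undeclared_bytes = ("<person>" + name + "</person>").encode("iso-8859-1")
try:
    ET.fromstring(undeclared_bytes)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# UTF-8 bytes need no declaration at all.
utf8_bytes = ("<person>" + name + "</person>").encode("utf-8")
utf8_default = ET.fromstring(utf8_bytes).text

print(declared == name, undeclared_ok, utf8_default == name)
```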

The different encodings and the proper handling of non-English XML documents will be discussed in greater detail in Chapter 5.

2.9.2 standalone

If the standalone attribute has the value no, then an application may be required to read an external DTD (that is, a DTD in a file other than the one it's reading now) to determine the proper values for parts of the document. For instance, a DTD may provide default values for attributes that a parser is required to report even though they aren't actually present in the document.

Documents that do not have DTDs, like all the documents in this chapter, can have the value yes for the standalone attribute. Documents that do have DTDs can also have the value yes for the standalone attribute if the DTD doesn't in any way change the content of the document or if the DTD is purely internal. Details for documents with DTDs are covered in Chapter 3.

The standalone attribute is optional in an XML declaration. If it is omitted, then the value no is assumed.

2.10 Checking Documents for Well-Formedness

Every XML document, without exception, must be well-formed. This means it must adhere to a number of rules, including the following:

  1. Every start-tag must have a matching end-tag.

  2. Elements may nest, but may not overlap.

  3. There must be exactly one root element.

  4. Attribute values must be quoted.

  5. An element may not have two attributes with the same name.

  6. Comments and processing instructions may not appear inside tags.

  7. No unescaped < or & signs may occur in the character data of an element or attribute.
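Each rule in this list can be exercised with a one-line document that breaks it. The sketch below assumes Python's xml.etree.ElementTree; the sample documents and the first_error helper are made up for the demonstration, and the exact wording of the messages depends on the parser.

```python
import xml.etree.ElementTree as ET


def first_error(text):
    """Return the parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# One deliberately malformed document per rule, keyed by rule number.
violations = {
    1: "<person><name>Alan Turing</person>",
    2: "<p><strong><em>text</strong></em></p>",
    3: "<person/><person/>",
    4: "<person born=1912-06-23/>",
    5: '<person born="1912" born="1954"/>',
    6: "<person <!-- comment --> />",
    7: "<publisher>O'Reilly & Associates</publisher>",
}

report = {rule: first_error(text) for rule, text in violations.items()}

for rule, message in report.items():
    print("Rule {}: {}".format(rule, message))
```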

This is not an exhaustive list. There are many, many ways a document can be malformed. You'll find a complete list in Chapter 20. Some of these involve constructs that we have not yet discussed such as DTDs. Others are extremely unlikely to occur if you follow the examples in this chapter (for example, including whitespace between the opening < and the element name in a tag).

Whether the error is small or large, likely or unlikely, an XML parser reading a document is required to report it. It may or may not report multiple well-formedness errors it detects in the document. However, the parser is not allowed to try to fix the document and make a best-faith effort of providing what it thinks the author really meant. It can't fill in missing quotes around attribute values, insert an omitted end-tag, or ignore the comment that's inside a start-tag. The parser is required to return an error. The objective here is to avoid the bug-for-bug compatibility wars that plagued early web browsers and continue to this day. Consequently, before you publish an XML document, whether that document is a web page, input to a database, or something else, you'll want to check it for well-formedness.

The simplest way to do this is to load the document into a web browser that understands XML documents, such as Mozilla. If the document is well-formed, the browser will display it. If it isn't, the browser will show an error message.

Instead of loading the document into a web browser, you can use an XML parser directly. Most XML parsers are not intended for end users. They are class libraries designed to be embedded into an easier-to-use program, such as Mozilla. They provide a minimal command-line interface, if that, and that interface is often not particularly well documented. Nonetheless, it can sometimes be quicker to run a batch of files through a command-line interface than to load each of them into a web browser. Furthermore, once you learn about DTDs and schemas, you can use the same tools to validate documents, which most web browsers won't do.

There are many XML parsers available in a variety of languages. Here, we'll demonstrate checking for well-formedness with the Apache XML Project's Xerces-J 1.4, which you can download from http://xml.apache.org/xerces-j. This open source package is written in pure Java so it should run across all major platforms. The procedure should be similar for other parsers, though details will vary.
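To give a sense of what embedding a parser in your own program involves, here is a rough sketch of a batch well-formedness checker. It is not SAXCount and it does not use Xerces-J 1.4 directly. Instead it assumes the JAXP API (javax.xml.parsers) included with current JDKs, and the class name WellFormedCheck is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical command-line checker built on the JDK's bundled parser.
class WellFormedCheck {

    // Returns null if the document is well-formed, otherwise a
    // "line:column: message" description of the first error.
    static String firstError(InputSource source) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(source, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return e.toString(); // unreadable file, bad URL, parser setup failure
        }
    }

    public static void main(String[] args) {
        for (String arg : args) {
            File file = new File(arg);
            // Treat the argument as a local file if one exists, otherwise as a URL.
            String systemId = file.exists() ? file.toURI().toString() : arg;
            String error = firstError(new InputSource(systemId));
            System.out.println(arg + ": " + (error == null ? "well-formed" : error));
        }
    }
}
```

With a recent JDK you could compile this class and pass it any number of filenames or URLs, one result line per document. Like most parsers, it stops at the first fatal error in each document.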

To use this parser, you'll first need a virtual machine compatible with Java 1.1 or later. Virtual machines for Windows, Solaris, and Linux are available from http://java.sun.com/. To install Xerces-J 1.4.4, just add the xerces.jar and xercesSamples.jar files to your Java class path. In Java 2 you can simply put those .jar files into your jre/lib/ext directory.

The class that actually checks files for well-formedness is called sax.SAXCount. It's run from a Unix shell or DOS prompt like any other standalone Java program. The command-line arguments are the URLs to or filenames of the documents you want to check. Here's the result of running SAXCount against an early version of Example 2-5. The very first line of output tells you where the first problem in the file is. The rest of the output is a more or less irrelevant stack trace.

D:\xian\examples\02>java sax.SAXCount 2-5.xml
[Fatal Error] 2-5.xml:3:30: The value of attribute "height" must not contain the '<' character.
Stopping after fatal error: The value of attribute "height" must not contain the '<' character.
        at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1282)
        at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:644)
        at org.apache.xerces.framework.XMLDocumentScanner.scanAttValue(XMLDocumentScanner.java:519)
        at org.apache.xerces.framework.XMLParser.scanAttValue(XMLParser.java:1932)
        at org.apache.xerces.framework.XMLDocumentScanner.scanElement(XMLDocumentScanner.java:1800)
        at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1223)
        at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1138)
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1177)
        at sax.SAXCount.print(SAXCount.java:135)
        at sax.SAXCount.main(SAXCount.java:331)

As you can see, it found an error. In this case the error message wasn't particularly helpful. The actual problem wasn't that an attribute value contained a < character. It was that the closing quote was missing from the height attribute value. Still, that was enough for us to locate and fix the problem. Despite the long list of output, SAXCount only reports the first error in the document, so you may have to run it multiple times until all the mistakes are found and fixed. Once we fixed Example 2-5 to make it well-formed, SAXCount simply reported how long it took to parse the document and what it saw when it did:
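A missing closing quote produces this kind of indirect message with other parsers too. The sketch below reproduces the situation with the JAXP parser bundled with a current JDK. The figure document is a made-up stand-in, not the actual text of Example 2-5, and the class name MissingQuoteDemo is likewise hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo; the document below is invented for illustration.
class MissingQuoteDemo {

    // The closing quote after 200 is missing, so the attribute value
    // runs on until the parser meets the next < character.
    static final String BROKEN =
            "<figure>\n"
          + "  <image source=\"portrait.jpg\" height=\"200/>\n"
          + "  <caption>A portrait</caption>\n"
          + "</figure>\n";

    static final String FIXED = BROKEN.replace("height=\"200/>", "height=\"200\"/>");

    // Returns the first well-formedness error, or null if there is none.
    static SAXParseException firstError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        SAXParseException error = firstError(BROKEN);
        if (error != null) {
            System.out.println("line " + error.getLineNumber()
                    + ", column " + error.getColumnNumber() + ": " + error.getMessage());
        }
        System.out.println("after the fix: "
                + (firstError(FIXED) == null ? "well-formed" : "still broken"));
    }
}
```

The position the parser reports is where it gave up, which is typically at or after the line containing the real mistake. When a message seems to make no sense, look backward from the reported location for an unclosed quote or tag.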

D:\xian\examples\02>java sax.SAXCount 2-5.xml
2-5.xml: 140 ms (17 elems, 12 attrs, 0 spaces, 564 chars)

Now that the document has been corrected to be well-formed, it can be passed to a web browser, a database, or whatever other program is waiting to receive it. Almost any nontrivial document crafted by hand will contain well-formedness mistakes. That makes it important to check your work before publishing it.

This example works with Xerces-J 1.0 through 1.4.4. The recently released Xerces-J 2.0 provides a similar program named sax.Counter.

XML in a Nutshell
XML in a Nutshell, 2nd Edition
ISBN: 0596002920
Year: 2001
Pages: 28