7.3. XML Document Syntax

Now let's look at some of the particulars of XML syntax using this simple XML document:

 <?xml version="1.0" encoding="US-ASCII" standalone="no"?>
 <!DOCTYPE accounts SYSTEM "simple.dtd">
 <accounts>
 <customer>
     <name>
         <firstname>Bobby</firstname>
         <lastname>Five</lastname>
     </name>
     <accountNumber>4456</accountNumber>
     <balance>111.32</balance>
 </customer>
 <!-- more customers will be added soon -->
 <?php print date('Fj, Y'); ?>
 </accounts>

Well-Formed Versus Valid

In short, well-formed documents comply with the rules for marking up documents according to XML, independent of a specific language. For instance, all elements must be correctly nested and may not overlap.

Valid documents are well-formed and abide by the rules of a DTD for a particular XML language. For instance, in XHTML, it is invalid to put a body element in an a element.

An XML document must be well-formed, and should be valid, but validity is not required.

Because XHTML is an XML application, all of the following syntax conventions apply to web documents written in XHTML.
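
The nesting rule can be checked with any XML parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration. Note that this parser checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the parser accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for overlapping tags, unclosed elements, and similar errors.
        return False


# Elements are correctly nested: each closes inside its parent.
nested = "<name><firstname>Bobby</firstname><lastname>Five</lastname></name>"

# firstname is still open when name closes, so the elements overlap.
overlapping = "<name><firstname>Bobby</name></firstname>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```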

7.3.1. XML Declaration

The first line of the example is the XML declaration.

 <?xml version="1.0" encoding="US-ASCII" standalone="no"?>

The XML declaration contains special information for the XML parser. First, the version attribute tells the parser that it is an XML document that conforms to Version 1.0 of the XML standard. (XML 1.1 also exists, but Version 1.0 remains the one used for nearly all documents.)

In addition, the encoding attribute specifies which character encoding the document uses. By default, XML uses the UTF-8 encoding of the Unicode character set (the most complete character set, covering characters from most of the world's languages). Alternate encodings may also be specified, such as ISO-8859-1 (Latin-1), which is a set containing characters from most Western European languages. Character encodings are discussed in more detail in Chapter 6.

Finally, the optional standalone="no" attribute informs the program that an outside DTD is needed to correctly interpret the document. If the value of standalone is yes, it means there is no DTD or the DTD is included in the document.

XML documents should begin with an XML declaration, but it is not required.
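
To see what a parser actually receives from the declaration, here is a small sketch in Python, assuming the standard library's xml.parsers.expat module and its XmlDeclHandler callback. The function name read_declaration is hypothetical. Expat reports standalone as 1 for yes, 0 for no, and -1 when the attribute is omitted.

```python
from xml.parsers import expat


def read_declaration(data: bytes) -> dict:
    """Collect the values from the XML declaration, if one is present."""
    found = {}

    def on_declaration(version, encoding, standalone):
        found.update(version=version, encoding=encoding, standalone=standalone)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return found


with_decl = read_declaration(
    b'<?xml version="1.0" encoding="US-ASCII" standalone="no"?><accounts/>'
)
without_decl = read_declaration(b"<accounts/>")

print(with_decl)
print(without_decl)
```

The second call still parses successfully, which reflects the point that the declaration is recommended but not required.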

In XHTML documents, the presence of an XML declaration will cause Internet Explorer 6 for Windows to render in Quirks mode, even when a proper DOCTYPE declaration is provided (see Chapter 9 for information on Quirks versus Standards mode and DOCTYPE switching). For this reason, it is commonly omitted. This problem has been fixed in IE 7. Some other browsers may render the XML declaration or have other problems. Avoid using the XML declaration in your XHTML documents if possible.

7.3.2. Document Type Declaration

The example also includes a document type (DOCTYPE) declaration.

  <!DOCTYPE accounts SYSTEM "simple.dtd">

The purpose of the DOCTYPE declaration is to refer to the DTD against which the document should be compared for validity. The declaration identifies the root element of the document (accounts, in the example). It also provides a pointer to the DTD itself. DOCTYPE declarations are discussed in the "DTD Syntax" section later in this chapter and again in Chapter 9 as they apply to XHTML.

Together, the XML declaration and DOCTYPE are often referred to as the document prolog. For XML languages that don't use DTDs, the entire prolog is optional. For languages with DTDs, the DOCTYPE declaration is required for the document to validate.
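
As an illustration of what the DOCTYPE declaration carries, the sketch below (Python, assuming the standard library's xml.dom.minidom) reads the root element name and the DTD pointer from the example's prolog. minidom does not fetch simple.dtd or validate against it; it only records what the declaration says.

```python
from xml.dom import minidom

source = (
    b'<?xml version="1.0" encoding="US-ASCII" standalone="no"?>'
    b'<!DOCTYPE accounts SYSTEM "simple.dtd">'
    b"<accounts/>"
)

doc = minidom.parseString(source)

declared_root = doc.doctype.name          # root element named in the DOCTYPE
dtd_pointer = doc.doctype.systemId        # system identifier of the DTD
actual_root = doc.documentElement.tagName  # root element actually present

print(declared_root, dtd_pointer, actual_root)
```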

7.3.3. Comments

You can leave notes within an XML document in the form of a comment. Comments begin with <!-- and end with -->. If you've used comments in HTML, this syntax should be familiar. The example document contains the comment:

 <!-- more customers will be added soon -->

Comments are not elements and, therefore, do not affect the structure of the document. They may be placed anywhere in a document except before an XML declaration or within a tag or another comment.
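
A short sketch in Python (assuming the standard library's xml.dom.minidom) shows both points: the comment appears in the tree as a comment node rather than an element, and a comment placed before the XML declaration is rejected by the parser.

```python
from xml.dom import Node, minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString(
    "<accounts><customer/><!-- more customers will be added soon --></accounts>"
)
root = doc.documentElement

# Separate the children of <accounts> by node type.
elements = [n.tagName for n in root.childNodes if n.nodeType == Node.ELEMENT_NODE]
comments = [n.data for n in root.childNodes if n.nodeType == Node.COMMENT_NODE]

# A comment ahead of the XML declaration makes the document ill-formed.
try:
    minidom.parseString('<!-- too early --><?xml version="1.0"?><accounts/>')
    early_comment_accepted = True
except ExpatError:
    early_comment_accepted = False

print(elements, comments, early_comment_accepted)
```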

7.3.4. Processing Instructions

A processing instruction is a method for passing information to applications that may read the document. It may also include the program or script itself. Unlike comments, which are intended for humans, processing instructions are for computer programs or scripts. Processing instructions are indicated by <? at the beginning and ?> at the end of the instruction.

The example document includes a processing instruction for a simple PHP command that displays the current date.

 <?php print date('Fj, Y'); ?>
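
An XML parser does not run a processing instruction; it hands the target name and the instruction text to the application. The sketch below (Python, assuming the standard library's xml.dom.minidom) shows how the PHP instruction from the example is exposed: the target is php and everything after it is passed along as opaque data.

```python
from xml.dom import Node, minidom

doc = minidom.parseString("<accounts><?php print date('Fj, Y'); ?></accounts>")
pi = doc.documentElement.firstChild

is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
target = pi.target               # the application the instruction is meant for
instruction = pi.data.strip()    # the text handed to that application

print(is_pi, target, instruction)
```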

7.3.5. Entity References

Isolated markup characters (such as <, &, and >) are not permitted in the flow of text in an XML document and must be escaped using either a Numeric Character Reference or a predefined character entity. This is to avoid having the XML parser interpret any < symbol as the beginning of a new tag. In addition to using entity references in the content of the document, you must use them in attribute values.

XML defines five character entities for use in all XML languages, listed in Table 7-1. Other entities may be defined in a DTD.

Table 7-1. Predefined character entities in XML
Entity	Char	Notes
&amp;	`&`	Must not be used inside processing instructions
&lt;	`<`	Use in normal text and inside attribute values
&gt;	`>`	Use after `]]` in normal text and inside processing instructions
&quot;	`"`	Use inside attribute values quoted with `"`
&apos;	`'`	Use inside attribute values quoted with `'`
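
In practice, escaping is usually left to a library. The following sketch in Python assumes the standard library's xml.sax.saxutils.escape, which handles &, <, and > by default and accepts a dictionary for extra replacements such as the two quote characters. The sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T says 3 < 5 and 7 > 2"

# escape() replaces &, <, and > with their predefined entities.
escaped = escape(raw)

# Quote characters are only replaced when requested explicitly.
escaped_quotes = escape("say \"hi\" & 'bye'", {'"': "&quot;", "'": "&apos;"})

# A parser turns the entity references back into the original characters,
# both in element content and in attribute values.
round_trip = ET.fromstring(f"<note>{escaped}</note>").text
attribute = ET.fromstring('<note title="&quot;x&quot; &amp; &apos;y&apos;"/>').get("title")

print(escaped)
print(escaped_quotes)
print(round_trip)
print(attribute)
```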

If you have a document that uses a lot of special characters, such as an example of source code, you can tell the XML parser that the text is simple character data (CDATA) and should not be parsed. To protect content from parsing, enclose it in a CDATA section, indicated by <![CDATA[ ... ]]>. This XHTML example uses a CDATA section to display sample markup on a web page without requiring every < and > character to be escaped:

 <p>This is sample SMIL markup:</p>
 <![CDATA[
 <audio src="/books/4/439/1/html/2/audio_file.mp3" begin="0s" />
     <seq>
          <img src="/books/4/439/1/html/2/image_1.jpg" begin="0s" />
          <img src="/books/4/439/1/html/2/image_2.jpg" begin="5s" />
     </seq>
 ]]>

The five reserved characters (listed in Table 7-1) are also put to frequent use when writing scripts (such as JavaScript), making it necessary to designate those blocks of content as CDATA so that XML parsers treat them as plain text rather than markup.
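
The effect of a CDATA section can be confirmed with a parser. This is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree; the file name and the script line are hypothetical. The parser returns the CDATA content as ordinary text, creates no child elements from it, and produces the same result as the entity-escaped form.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup sample and script line, used only for illustration.
sample = '<audio src="audio_file.mp3" begin="0s" />'
script = "if (a < b && b > 0) { run(); }"

with_cdata = ET.fromstring(f"<p><![CDATA[{sample}]]></p>")
with_entities = ET.fromstring(
    '<p>&lt;audio src="audio_file.mp3" begin="0s" /&gt;</p>'
)
script_element = ET.fromstring(f"<script><![CDATA[{script}]]></script>")

# The text inside the CDATA section comes back untouched, as text.
print(with_cdata.text)
# No <audio> child element was created from it.
print(len(with_cdata))
# Escaping with entities yields exactly the same text.
print(with_entities.text == with_cdata.text)
print(script_element.text)
```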