Components of an XML Document

So far, I have provided some background material and discussed some of the characteristics of XML, but you still don't really know what makes up an XML document. In this section, I'll discuss the components required to build an XML document and then present an XML document that illustrates these various components.

Types of Markup

An XML document is a text file that consists of character data and markup. In the case of XML, there are several different constructs that are considered to be markup. A summary of these different constructs is shown in Table A.1.

Table A.1. Summary of XML markup.

Markup	Description
Start tag	Used to mark the beginning of a non-empty XML element (for example, `<address>`). Note that in an XML document, every start tag must have a corresponding end tag.
End tag	Used to mark the end of a non-empty XML element (for example, `</address>`). Note the forward slash before the element name, which distinguishes an end tag from a start tag. Note that in XML, every end tag must have a corresponding start tag.
Empty element tag	Identifies an empty element (for example, `<address/>`). Note that the forward slash appears after the element name.
Character reference	Characters that cannot be typed or displayed with the standard ASCII character set can be inserted by referring to their Unicode code point. Unicode assigns a unique number to the characters of just about any language, and a character reference gives that number in decimal or hexadecimal (for example, `&#169;` or `&#xA9;` for the copyright sign). This enables you to insert native language characters into an XML document. For additional Unicode information, see http://www.unicode.org.
Entity reference	Certain characters cannot appear literally in the character data between the start and end tags of an element. For example, the less than character "<" delimits tags, so it would confuse XML parsers if it appeared within the character data. Inside an XML document, such characters are therefore written as entity references, and when the document is parsed, each entity reference is replaced with the character it stands for. For example, the string <inventory>Mark's computer</inventory> can be written in an XML document as <inventory>Mark&apos;s computer</inventory>. (Strictly, only "<" and "&" must always be escaped in character data; the apostrophe and double quote need escaping only inside attribute values delimited by the same quote character.) The five predefined entity references are `&amp;` (the ampersand character "&"), `&apos;` (the apostrophe character "'"), `&gt;` (the closing tag bracket and greater than sign ">"), `&lt;` (the opening tag bracket and less than sign "<"), and `&quot;` (the double quote character).
Comment	Comments can be inserted into XML documents by the authors as notes to themselves or notes to other users. A comment starts with `<!--` and ends with the first occurrence of `-->`. For example, the following would appear as a comment in an XML document: <!-- This is a comment. -->
`CDATA` Section	A `CDATA` section of an XML document enables a user to include markup data without having to use entity references. This is helpful when the XML document contains data from other formats. You can enclose an HTML or Scalable Vector Graphics (SVG) document within an XML document and not have to worry about entity references. For example, you would use the following notation to enclose an HTML document: <![CDATA[ <html> <head> <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"> </head> <body> <div align="Center"> <div align="Left">This is an HTML document.<br> </div> </div> </body> </html> ]]>
Document Type Declaration	The document type declaration (`<!DOCTYPE ...>`) associates an XML document with its Document Type Definition (DTD). A DTD defines the format and content of an XML document and can be either internal or external to the XML document. I'll discuss DTDs in more detail later in this appendix.
Processing Instruction	Processing instructions (PIs) are used to pass instructions to applications that are working with XML documents. All PIs start with the characters `<?` and end with the characters `?>`.
XML Declaration	When an XML declaration is present (and including one is recommended), it must appear at the very top of the XML document, starting at the first column. Any whitespace before the XML declaration means the document is no longer well formed. There is one required parameter and two optional parameters in the XML declaration. Because the `version` parameter is required, the smallest legal XML declaration is `<?xml version="1.0"?>`. Remember, this must be the first line of an XML document, and it must start in the first column. The parameters that can appear in the XML declaration are `version`, `encoding`, and `standalone`, in that order.
XML `version`	The required `version` parameter identifies the version of XML used in this XML document. Version 1.0 is the one used by virtually all XML documents (a later version, 1.1, exists but is rarely needed). An example of the XML `version` parameter (appearing with the XML declaration) is `<?xml version="1.0"?>`.
`encoding` Parameter	The optional `encoding` parameter identifies the character encoding used for the characters in this XML document. All XML processors are required to support both the UTF-8 and UTF-16 Universal Character Set (UCS) Transformation Formats (UTF). UTF-8 stores each 7-bit ASCII character in a single byte and uses multibyte sequences for all other characters, while UTF-16 stores roughly 63,000 characters as a single 16-bit unit each and the remaining characters as pairs of 16-bit units. If the `encoding` parameter isn't provided (and the file has no byte order mark), the parser assumes UTF-8. An example of the `encoding` parameter is `<?xml version="1.0" encoding="UTF-8"?>`.
`standalone` Parameter	The optional `standalone` parameter indicates whether this XML document relies on an external DTD. If the document has only an internal DTD, this value can be set to "yes"; if it requires an external DTD, the value is set to "no." External in this context means that the DTD resides in a different file. If the `standalone` parameter isn't provided, the default value is "no." An example of the `standalone` parameter is `<?xml version="1.0" encoding="UTF-8" standalone="no"?>`.
Text Declaration	A text declaration looks similar to the XML declaration; however, it has a different purpose. The text declaration is used to tell the parser if an external entity uses a different encoding scheme than the one used in the current XML document.
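
To make a few rows of Table A.1 concrete, here is a minimal sketch of how a parser resolves entity references, character references, and `CDATA` sections. It uses Python's standard `xml.etree.ElementTree` module purely for illustration; the sample document and its element names (`item`, `note`, `copyright`, `page`) are hypothetical and not part of the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document exercising several kinds of markup from Table A.1.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment. -->
<inventory>
  <item>Mark&apos;s computer</item>
  <note>5 &lt; 10 &amp; 10 &gt; 5</note>
  <copyright>&#169; 2002</copyright>
  <page><![CDATA[<div align="Left">This is an HTML document.<br></div>]]></page>
</inventory>"""

root = ET.fromstring(SAMPLE)

item_text = root.find("item").text            # &apos; is replaced by an apostrophe
note_text = root.find("note").text            # &lt;, &amp;, and &gt; are replaced
copyright_text = root.find("copyright").text  # &#169; becomes the copyright sign
page_text = root.find("page").text            # CDATA content is kept verbatim

print(item_text)
print(note_text)
print(page_text)
```

After parsing, the application sees only the real characters; the references and the `CDATA` delimiters are gone, and the comment is skipped by this parser's default settings.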

Elements

The most fundamental structure of an XML document is the element. A well-formed XML document must contain at least one element, although an XML document usually contains many elements. An element typically surrounds character data with start and end tags. A sample of an element is

 <address>1106 River Avenue</address>

This is a single element named <address>, and it contains the character data "1106 River Avenue." Note that in an XML document, the tags of an element always use the following syntax:

Start tags begin with a less than sign "<".
End tags begin with a less than sign followed by a forward slash "</".

Instead of containing character data, elements can also contain other elements called child elements. When we begin to discuss elements containing other elements (that is, nested elements), you can start to visualize the XML document almost as a tree. As you can see in Figure A.1, we have a tree that has a <record> element at the root and two child elements named <name> and <address>.

Figure A.1. Tree representation of a simple XML document.

The XML document that corresponds to the XML tree shown in Figure A.1 is

 <?xml version="1.0" encoding="UTF-8"?>
 <record>
    <name>Matthew Kolb</name>
    <address>1700 Grand Avenue</address>
 </record>

In this example, the <record> element has two child elements, <name> and <address>. The <name> element contains the character data "Matthew Kolb" and the <address> element contains the character data "1700 Grand Avenue." Note that the child elements follow the same rules for start and end tags (that is, each element must have matching start and end tags).
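
The tree view of this document can be sketched in code. The following is a minimal illustration, assuming Python's standard `xml.etree.ElementTree` module; the helper name `walk` is hypothetical. It parses the <record> document and lists each element with its depth in the tree.

```python
import xml.etree.ElementTree as ET

RECORD = b"""<?xml version="1.0" encoding="UTF-8"?>
<record>
   <name>Matthew Kolb</name>
   <address>1700 Grand Avenue</address>
</record>"""


def walk(element, depth=0):
    """Return (depth, tag, text) tuples for an element and all its descendants."""
    # Whitespace used only for indentation is stripped away.
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text)]
    for child in element:
        rows.extend(walk(child, depth + 1))
    return rows


tree_rows = walk(ET.fromstring(RECORD))
for depth, tag, text in tree_rows:
    # Indent each element by its depth to mimic the tree in Figure A.1.
    print("  " * depth + tag + (": " + text if text else ""))
```

The root element sits at depth 0 and its two child elements at depth 1, which mirrors the shape of the tree in Figure A.1.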

Elements can also be empty, and of course, they're called empty elements. We can use the standard notation without any character data, such as <book></book>, or we can use a shorthand notation that consolidates the start and end tags into one tag, such as <book/>. Note that the forward slash appears after the element name in an empty element tag; it appears before the element name in an end tag.

XML is case sensitive, so the start and end tags of an element must use the same case. You can use lowercase, uppercase, or even mixed case; however, you must be consistent. For example, <account>data</ACCOUNT> isn't well formed.
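
A parser enforces these rules for you. The following minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a hypothetical helper named `is_well_formed`, checks the mismatched-case example and confirms that the two notations for an empty element carry the same content.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


matching_case = is_well_formed("<account>data</account>")
mismatched_case = is_well_formed("<account>data</ACCOUNT>")

# Both notations for an empty element: no character data and no child elements.
long_form = ET.fromstring("<book></book>")
short_form = ET.fromstring("<book/>")
same_content = (
    (long_form.text or "") == (short_form.text or "")
    and len(long_form) == len(short_form) == 0
)

print(matching_case, mismatched_case, same_content)
```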

Attributes

Each element can also have one or more attributes associated with it. Attributes are usually used to store data that is specific to a particular instance of an element. An attribute has a name and a value associated with it, and it appears as part of the element's start tag. For example, the following element has a valid attribute:

 <book isbn="0735712891">XML and Perl</book>

As you can see, the <book> element has one attribute named isbn, and the value of the isbn attribute is "0735712891." As I mentioned earlier, the attribute is applicable to this particular element; another book element would have a different isbn attribute value. Elements can have several attributes if required. Attributes within the start tag must be separated by at least one space:

 <book isbn="0735712891" price="$39.99">XML and Perl</book>

Attribute values have to be quoted (either single or double quotes are allowed). The opening and closing quotes of each attribute value must match (that is, be the same character). For example, isbn="0735712891" price='$39.99' is fine, but isbn='0735712891" price="$39.99' isn't: the parser reads everything between the two single quotes as one isbn value.
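
The quoting rules are easy to check with a parser. This is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the variable names are illustrative only. It reads both attributes from a <book> element, shows what happens when the quotes are crossed, and shows that a value whose closing quote never arrives is rejected.

```python
import xml.etree.ElementTree as ET

# Each value is closed by the same kind of quote that opened it.
book = ET.fromstring("""<book isbn="0735712891" price='$39.99'>XML and Perl</book>""")
isbn = book.get("isbn")
price = book.get("price")

# Crossed quotes: the single-quoted value swallows everything up to the next
# single quote, so the parser sees one isbn attribute and no price attribute.
crossed = ET.fromstring("""<book isbn='0735712891" price="$39.99'>XML and Perl</book>""")
crossed_attributes = dict(crossed.attrib)

# A value opened with a double quote and "closed" with a single quote never
# ends before the next "<", which is not allowed inside an attribute value.
try:
    ET.fromstring("""<book isbn="0735712891'>XML and Perl</book>""")
    unterminated_accepted = True
except ET.ParseError:
    unterminated_accepted = False

print(isbn, price)
print(crossed_attributes)
print(unterminated_accepted)
```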

Use of Attributes Versus Elements

You could just as easily have created <isbn> and <price> elements in the previous section and stored the values as character data rather than as attributes. Why would you use an attribute instead of an element (or vice versa)? There isn't an easy answer; it is a perennial topic on newsgroups and forums, and it always provokes strong opinions. I can't provide a definitive answer, but I can offer a few suggestions. As you become more familiar with XML, you will get more comfortable designing XML documents, and the best approach will usually be obvious. Let's take a look at a few examples that will help you decide when to store data as an element and when to store it as an attribute.

Storing Data as an Element

If there is more than one occurrence of a data item, then you will need to store the data in an element rather than in an attribute. For example, let's say that you need to store a list of employee names. One option would be to use a root element named <employees> that has multiple <employee> elements. Each <employee> element has two child elements, <name> and <phone>, and each of these elements contains character data. Here is an example of that hierarchy:

 <?xml version="1.0" encoding="UTF-8"?>
 <employees>
    <employee>
       <name>Joseph</name>
       <phone>112</phone>
    </employee>
    <employee>
       <name>Kayla</name>
       <phone>114</phone>
    </employee>
    <employee>
       <name>Sean</name>
       <phone>116</phone>
    </employee>
    <employee>
       <name>Matthew</name>
       <phone>118</phone>
    </employee>
 </employees>

This example can easily be extended to include additional information for each employee, such as an employee number, department, or home address.
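
Because every <employee> element repeats the same structure, reading the list back is a simple loop. The following minimal sketch assumes Python's standard `xml.etree.ElementTree` module; the variable name `directory` is hypothetical. It collects each name and phone number into a list of pairs.

```python
import xml.etree.ElementTree as ET

EMPLOYEES = b"""<?xml version="1.0" encoding="UTF-8"?>
<employees>
   <employee><name>Joseph</name><phone>112</phone></employee>
   <employee><name>Kayla</name><phone>114</phone></employee>
   <employee><name>Sean</name><phone>116</phone></employee>
   <employee><name>Matthew</name><phone>118</phone></employee>
</employees>"""

root = ET.fromstring(EMPLOYEES)

# One (name, phone) pair per repeated <employee> element, in document order.
directory = [
    (employee.findtext("name"), employee.findtext("phone"))
    for employee in root.findall("employee")
]

for name, phone in directory:
    print(name, phone)
```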

Storing Data as an Attribute

A good example of when to use an attribute to store data is when you need to assign a unique identifier to each element, or when the data describes the element itself. Let's take a look at the list of employees again, and let's say that you want to associate an employee number with each name. An example is shown in the following:

 <?xml version="1.0" encoding="UTF-8"?>
 <employees>
    <employee id="100">
       <name>Joseph</name>
       <extension>112</extension>
    </employee>
    <employee id="101">
       <name>Kayla</name>
       <extension>114</extension>
    </employee>
    <employee id="102">
       <name>Sean</name>
       <extension>116</extension>
    </employee>
    <employee id="103">
       <name>Matthew</name>
       <extension>118</extension>
    </employee>
 </employees>

As you can see, without reorganizing your XML document, you uniquely associated each <employee> element with an employee identification number. What you plan to do with the XML document will help drive its design. For example, you would need a document structure similar to this if you plan to search the XML document and find employee <name> elements based on the employee identification numbers.
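
A search of that kind might look like the following minimal sketch, which assumes Python's standard `xml.etree.ElementTree` module and its limited XPath support; the helper name `name_for_id` is hypothetical.

```python
import xml.etree.ElementTree as ET

EMPLOYEES = b"""<?xml version="1.0" encoding="UTF-8"?>
<employees>
   <employee id="100"><name>Joseph</name><extension>112</extension></employee>
   <employee id="101"><name>Kayla</name><extension>114</extension></employee>
   <employee id="102"><name>Sean</name><extension>116</extension></employee>
   <employee id="103"><name>Matthew</name><extension>118</extension></employee>
</employees>"""

root = ET.fromstring(EMPLOYEES)


def name_for_id(employees_root, employee_id):
    """Return the <name> text of the employee with the given id, or None."""
    # The [@id='...'] predicate selects the <employee> whose id attribute matches.
    return employees_root.findtext(f"employee[@id='{employee_id}']/name")


print(name_for_id(root, "102"))
print(name_for_id(root, "999"))
```

The identifier stays out of the character data, so the lookup does not have to inspect or parse any element content to find the right employee.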

Another example of when it is beneficial to use attributes involves data that requires units (for example, kilograms, degrees Celsius, kilometers, and so forth). For example, the following XML element would require an additional step (a Perl split function call) to separate the data from the units of the data:

 <weight>75 kg</weight>

An alternative to mixing the data and units in the character data would be to store the data unit in an attribute. For example, the following element and attribute would be easier to parse:

 <weight unit="kg">75</weight>
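
The difference between the two <weight> forms shows up as soon as you read the value. Here is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module (Python's `str.split` stands in for the Perl split call mentioned above); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Value and unit mixed in the character data: an extra split step is needed,
# and it silently depends on the two parts being separated by whitespace.
mixed = ET.fromstring("<weight>75 kg</weight>")
mixed_value_text, mixed_unit = mixed.text.split()
mixed_value = float(mixed_value_text)

# Unit stored in an attribute: the character data is the number and nothing else.
separate = ET.fromstring('<weight unit="kg">75</weight>')
separate_value = float(separate.text)
separate_unit = separate.get("unit")

print(mixed_value, mixed_unit)
print(separate_value, separate_unit)
```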

Attributes can also be used when you want to restrict the possible values to an enumerated list or a range of valid values. This will be demonstrated a little later in this appendix when we discuss DTDs and XML schemas.