Before we start, it's important to understand what we mean by the term XML document. The term refers to a collection of content that meets XML construction rules. When we work with XML, the term document has a more general meaning than it does in software packages. In Flash, for example, a document is a physical file.
While an XML document can be one or more physical files, it can also refer to a stream of information that doesn't exist in a physical sense. You can create these streams using server-side files; you'll see how this is done later in this book. As long as the information is structured according to XML rules, it qualifies as an XML document.
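To make the file-versus-stream distinction concrete, here is a minimal sketch. It assumes Python and its standard xml.etree.ElementTree module rather than Flash, and the content and variable names are purely illustrative. The same XML content is parsed once from a string and once from an in-memory stream; neither exists as a physical file.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative content; it never needs to exist as a file on disk.
xml_text = "<phoneBook><contact id='1'><name>Sas Jacobs</name></contact></phoneBook>"

# Parse directly from a string.
root_from_string = ET.fromstring(xml_text)

# Parse from a file-like stream, much as you might read a server response.
stream = io.StringIO(xml_text)
root_from_stream = ET.parse(stream).getroot()

print(root_from_string.tag)
print(root_from_stream.find("contact/name").text)
```

Either way, the parser only cares that the content follows XML rules, not where it came from.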
XML documents contain information and markup. You can divide markup into elements, attributes, text, entities, comments, and CDATA.
Each XML document contains one or more elements. Elements identify and mark up content, and they make up the bulk of an XML document. Some people call elements nodes.
Here is an element:
<tag>Some text</tag>
This element contains two tags and some text. Elements can also include other elements. They can even be empty, i.e., they contain no text.
As in HTML, XML tags start and end with less-than and greater-than signs. The name of the tag is stored in between these signs: <tagName>.
The terms element and tag have slightly different meanings.
A tag looks like this:
<tag>
whereas an element looks like this:
<tag>Some text</tag>
If an element contains information or other elements, it will include both an opening and closing tag <tag></tag>. Empty elements can also be written in a single tag <tag/> so that
<tag></tag>
is equivalent to
<tag/>
There is no preferred way to write empty elements. Either option is acceptable.
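You can check the equivalence of the two forms with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<tag></tag>")
short_form = ET.fromstring("<tag/>")

# Both forms produce an element with no text and no children.
for element in (long_form, short_form):
    print(element.tag, element.text, len(element))

# Written back out, the two elements serialize identically.
print(ET.tostring(long_form, encoding="unicode") == ET.tostring(short_form, encoding="unicode"))
```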
You can split elements across more than one line as shown here:
<contact>
  Some text
</contact>
Each element has a name that must follow a standard naming convention. Names start with either a letter or the underscore character. They can't start with a number. Element names can contain any letter or number, but they can't include spaces. Although it's technically possible to include a colon (:) character in an element name, it's not a good idea as colons are used when referring to namespaces. You'll understand what that means a little later in the chapter.
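One way to try out these naming rules is to hand candidate names to a parser and see which it accepts. This is a rough sketch assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed and the sample names are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Names starting with a letter or underscore pass; a leading digit or a space fails.
candidates = ["<contact/>", "<_contact/>", "<contact2/>", "<2contact/>", "<my contact/>"]
for candidate in candidates:
    print(candidate, is_well_formed(candidate))
```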
You usually give elements meaningful names that describe the content inside the tags. A descriptive element name such as <fullName> is more useful than an arbitrary one.
You can't include a space between the opening bracket < and the element name. You are allowed to include space after the element name, and it's common to include a space before the /> for empty elements. In the early days of XHTML, older browsers required the extra space for tags such as <br /> and <hr />.
When an element contains another element, the container element is called the parent and the element inside is the child.
<tagname>
  <childTag>Text being marked up</childTag>
</tagname>
The family analogy continues with grandparent and grandchild elements as well as siblings.
Elements can also have mixed content, i.e., they contain text as well as child elements:
<tagname>
  Text being <childTag>marked up</childTag>
</tagname>
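The parent, child, and mixed-content ideas above can be explored in code. The sketch below assumes Python's standard xml.etree.ElementTree module and reuses the mixed-content element, written on one line so that no extra whitespace creeps into the text.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<tagname>Text being <childTag>marked up</childTag></tagname>")
child = root[0]  # the first (and only) child of the parent element

print(root.tag)          # the parent element
print(child.tag)         # its child element
print(repr(root.text))   # text in the parent before the child starts
print(repr(child.text))  # text inside the child
```

In this library, text that sits before the first child belongs to the parent, which is how mixed content stays distinguishable from the child's own text.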
The first element in an XML document is called the root element, document root, or root node. It contains all the other elements in the document. Each XML document can have only one root element. The last tag in an XML document will nearly always be the closing tag for the root element.
XML is case sensitive. For example, <phoneBook> and </phonebook> are not equivalent tags and can't be used in the same element. This is a big difference from HTML.
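Both rules, a single root element and case-sensitive tag names, are enforced by XML parsers. Here is a small sketch assuming Python's standard xml.etree.ElementTree module; the helper parse_error is hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_error(xml_text):
    """Return the parser's error message, or None if xml_text parses."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as error:
        return str(error)

print(parse_error("<phoneBook></phoneBook>"))  # tags match in case
print(parse_error("<phoneBook></phonebook>"))  # tags differ in case
print(parse_error("<phoneBook/><contact/>"))   # two root elements
```

Only the first call returns None; the other two return an error message from the parser.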
Elements serve many functions in an XML document:
Elements mark up content. The opening and closing tags surround text.
Tag names provide a description of the content they mark up. This gives you a clue about the purpose of the element.
Elements provide information about the order of data in an XML document.
The position of child elements can show their importance.
Elements show the relationships between blocks of information. Like databases, they show how one piece of data relates to others.
Attributes supply additional information about an element. They provide information that clarifies or modifies an element.
Attributes are stored in the start tag of an element after the element name. They are pairs of names and related values, and each attribute must include both the name and the value:
<tagname attributeName="attributeValue"> Text being marked up </tagname>
Attribute values appear within quotation marks and are separated from the attribute name with an equals sign. You can use either single or double quotes around the attribute value. Interestingly enough, you can also mix and match your quotes in the same element:
<tagname attribute1="value1" attribute2='value2'>
You might choose to use double quotes where a value contains an apostrophe. You would use single quotes where double quotes make up part of the value:
<photo caption='It was an "interesting" day'>
Keep in mind that tags can't be included within an attribute.
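The quoting rules can be tried out with a parser. This sketch assumes Python's standard xml.etree.ElementTree module; the person element and its name value are hypothetical, and the elements are written as empty elements so each string is a complete document.

```python
import xml.etree.ElementTree as ET

# Single quotes outside, double quotes inside the value.
photo = ET.fromstring("""<photo caption='It was an "interesting" day'/>""")
print(photo.get("caption"))

# Double quotes outside, apostrophe inside; this value is hypothetical.
person = ET.fromstring("""<person name="O'Malley"/>""")
print(person.get("name"))

# A tag inside an attribute value is rejected by the parser.
try:
    ET.fromstring('<photo caption="<b>bold</b>"/>')
    tag_in_attribute_ok = True
except ET.ParseError:
    tag_in_attribute_ok = False
print(tag_in_attribute_ok)
```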
An XHTML image tag provides an example of an element that contains attributes:
<img src="logo.gif" width="20" height="15" alt="Company logo"/>
There is no limit to the number of attributes within an element, but attributes inside the same element must have unique names. When you are working with multiple attributes in an element, the order isn't important.
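Here is a short sketch of both points, assuming Python's standard xml.etree.ElementTree module. It reuses the XHTML image tag; the duplicated src values are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

img = ET.fromstring('<img src="logo.gif" width="20" height="15" alt="Company logo"/>')
reordered = ET.fromstring('<img alt="Company logo" height="15" width="20" src="logo.gif"/>')

# Attributes are exposed as a dictionary, so their order does not affect comparison.
print(img.attrib == reordered.attrib)

# Repeating an attribute name within one element is an error.
try:
    ET.fromstring('<img src="a.gif" src="b.gif"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```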
Attribute names must follow the same naming conventions as elements. You can't start the name with a number, and you can't include spaces in the name. Some attribute names are reserved, and you shouldn't use them in your XML documents. These include xml:lang and xml:space; more generally, names that begin with xml are reserved.
You can rewrite attributes as nested elements. The following
<contact id="1">
  <name>Sas Jacobs</name>
</contact>
could also be written as
<contact>
  <id>1</id>
  <name>Sas Jacobs</name>
</contact>
There is no one right way to structure elements and attributes. The method you choose depends on your data. The way you're going to process the XML document might also affect your choices. Some software packages find it harder to work with attributes than with elements.
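To see how the choice affects processing code, compare how each form is read. This is only a sketch, assuming Python's standard xml.etree.ElementTree module; other tools will differ in the details.

```python
import xml.etree.ElementTree as ET

with_attribute = ET.fromstring('<contact id="1"><name>Sas Jacobs</name></contact>')
with_element = ET.fromstring('<contact><id>1</id><name>Sas Jacobs</name></contact>')

# An attribute is read from the element itself...
print(with_attribute.get("id"))

# ...while a nested element is located first, then its text is read.
print(with_element.findtext("id"))
```

Both approaches retrieve the same value, so the decision usually comes down to the shape of your data and the tools you use.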
Text refers to any information contained between opening and closing element tags. In the line that follows, the text Sas Jacobs is stored between the <fullName> and </fullName> tags:
<fullName>Sas Jacobs</fullName>
Unless you specify otherwise, the text between the opening and closing tags in an element will always be processed as if it were XML. This means that special characters such as < and > have to be replaced with the entities &lt; and &gt;. The alternative is to use CDATA to present the information, and I'll go into that a little later.
I've listed the common entities that you'll need to use in Table 2-1.
Character entities are symbols that represent a single character. In HTML, character entities are used for special symbols such as an ampersand (&amp;) and a nonbreaking space (&nbsp;).
Character entities replace reserved characters in XML documents. All tags start with a less-than sign so it would be confusing to include another one in your code.
<expression>3 < 5</expression>
This code would cause an error during processing. If you want to include a less-than sign in text, you can use the entity &lt;:
<expression>3 &lt; 5</expression>
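The difference between the two versions of the expression element shows up as soon as you parse them. The sketch below assumes Python's standard xml.etree.ElementTree module, plus the escape function from xml.sax.saxutils for producing the entities automatically.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A raw less-than sign in text is not well formed.
try:
    ET.fromstring("<expression>3 < 5</expression>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# With the entity, the parser hands back the original character.
expression = ET.fromstring("<expression>3 &lt; 5</expression>")
print(raw_accepted, expression.text)

# escape() replaces &, < and > in text that you want to embed in XML.
print(escape("3 < 5 & 2 > 0"))
```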
Some entities use Unicode numbers. You can use numbers to insert characters that you can't type on a keyboard. For example, the entity &#233; creates the character é, an e with an acute accent. The number 233 is the Unicode number for the character é.
You can also use a hexadecimal number to refer to a character. In that case, you need to include an x in the number so the reference would start with &#x. The hexadecimal entity reference for é is &#xE9;. The Character Map in Windows tells you what codes to use. Open it by choosing Start > All Programs > Accessories > System Tools > Character Map. Figure 2-1 shows the Character Map dialog box.
The bottom left of the window shows the hexadecimal value. Don't forget to remove the leading zeroes and add &#x to the beginning of the value. The right side shows the Unicode number. Again, you'll need to remove the first 0 from the code.
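If you don't have the Character Map to hand, you can work out both forms of the reference from the character itself. This is a small sketch assuming Python and its standard xml.etree.ElementTree module; the name element and its content are hypothetical.

```python
import xml.etree.ElementTree as ET

# The decimal and hexadecimal references resolve to the same character.
decimal_form = ET.fromstring("<name>Ren&#233;</name>")
hex_form = ET.fromstring("<name>Ren&#xE9;</name>")
print(decimal_form.text == hex_form.text)

# Build both reference forms for a character from its Unicode number.
character = "\u00e9"  # e with an acute accent
code_point = ord(character)
decimal_reference = f"&#{code_point};"
hex_reference = f"&#x{code_point:X};"
print(decimal_reference, hex_reference)
```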
Comments in XML work the same as in HTML. They begin with the characters <!-- and end with --> :
<!-- here is a commented line -->
Comments are a useful way to leave messages for other users of an XML document without affecting the way the XML document is processed. In fact, processing software always ignores comments in XML documents. You can also use comments to hide a single line or a block of code.
The only requirements for comments in XML documents are that
A comment can't appear before the XML declaration on the first line.
Comments can't be nested or included within tag names.
You can't include the characters -- inside a comment.
Comments shouldn't split tags, i.e., you shouldn't comment out just a start or ending tag.
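The sketch below shows a parser skipping a comment and rejecting a malformed one. It assumes Python's standard xml.etree.ElementTree module with its default settings; the phoneBook wrapper is only there to give the comment a home.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<phoneBook><!-- here is a commented line --><contact/></phoneBook>")

# The default parser drops the comment, leaving only the contact element.
child_tags = [child.tag for child in root]
print(child_tags)

# A double hyphen inside a comment is not well formed.
try:
    ET.fromstring("<phoneBook><!-- first -- second --></phoneBook>")
    double_hyphen_accepted = True
except ET.ParseError:
    double_hyphen_accepted = False
print(double_hyphen_accepted)
```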
CDATA stands for character data. CDATA blocks mark text so that it isn't processed as XML. For example, you could use CDATA for information containing characters such as < and >. Any < or > character contained within CDATA won't be processed as part of a tag name.
CDATA sections start with <![CDATA[ and finish with ]]>. The character data is contained within the square brackets inside the section:
<![CDATA[ 3 < 5 or 2 > 0 ]]>
Entities will display literally in a CDATA section so you shouldn't include them. For example, if you add &lt; to your CDATA block, it will display the same way when the XML document is processed.
The end of a CDATA section is marked with the ]]> characters so you can't include these inside CDATA.
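Here is a brief sketch of how a parser treats CDATA, assuming Python's standard xml.etree.ElementTree module. The surrounding expression and note elements are illustrative wrappers, since a CDATA section has to live inside an element.

```python
import xml.etree.ElementTree as ET

# The < and > characters inside CDATA come through as plain text.
expression = ET.fromstring("<expression><![CDATA[ 3 < 5 or 2 > 0 ]]></expression>")
print(repr(expression.text))

# An entity inside CDATA is not converted; it stays exactly as typed.
literal = ET.fromstring("<note><![CDATA[&lt;]]></note>")
print(literal.text)
```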
The listing that follows shows a simple XML document. I'll explain this in detail a little later in the chapter. You can see elements, attributes, and text:
<?xml version="1.0"?>
<phoneBook>
  <contact id="1">
    <name>Sas Jacobs</name>
    <address>123 Some Street, Some City, Some Country</address>
    <phone>123 456</phone>
  </contact>
</phoneBook>
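As a preview of what processing this document involves, the sketch below loads the listing and reads its elements, attribute, and text. It assumes Python's standard xml.etree.ElementTree module rather than the Flash tools covered in this book.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<phoneBook>
  <contact id="1">
    <name>Sas Jacobs</name>
    <address>123 Some Street, Some City, Some Country</address>
    <phone>123 456</phone>
  </contact>
</phoneBook>"""

root = ET.fromstring(document)

# Walk each contact, reading the id attribute and the text of two child elements.
for contact in root.findall("contact"):
    print(contact.get("id"), contact.findtext("name"), contact.findtext("phone"))
```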