XML Syntax | Beginning XML Databases (Wrox Beginning Guides)

The basic syntax rules of XML are simple but also very strict. This section goes through those basic syntax rules one by one:

The XML tag: The first line in an XML document declares the XML version in use:
```
   <?xml version="1.0"?>   
```
Including style sheets: The optional second line contains a style sheet reference, if a style sheet is in use:
```
   <?xml-stylesheet type="text/xsl" href="cities.xsl"?>
```
The root node: The next line will contain the root node of the XML document tree structure. The root node contains all other nodes in the XML document, either directly or indirectly (through child nodes):
```
   <root>   
```
A single root node: An XML document must have a single root tag, such that all other tags are contained within that root tag. All subsequent elements must be contained within the root tag, each nested within its parent tag.

An XML tag is usually called an element.
The ending root tag: The last line will contain the ending element for the root element. All ending elements have exactly the same name as their corresponding starting elements, except that the name of the node is preceded by a forward slash (/):
```
   </root>   
```
Opening and closing elements: All XML elements must have a closing element. Omitting a closing element causes an error. The exceptions to this rule are the XML declaration at the beginning of the document, which declares the version of XML in use, and the optional style sheet reference:
```
   <root>
      <branch_1>
         <leaf_1>
         </leaf_1>
      </branch_1>
      <branch_2>
      </branch_2>
   </root>
```
HTML tags do not always require a closing tag. Examine the first HTML code example in this chapter, in the section "Comparing HTML and XML." The first paragraph does not have a </P> paragraph end tag, whereas the second paragraph does. Some closing tags in HTML are optional, meaning that a closing tag can be included or not.
Case sensitive: XML elements are case sensitive; HTML tags are not. The XML element <root> in the previous example is a completely different element from the XML element <Root> in the next example. Although the following document has the same structure as the previous one, it is a different document because the case of the <Root> and <BRANCH_1> elements differs:
```
   <Root>
      <BRANCH_1>
         <leaf_1>
         </leaf_1>
      </BRANCH_1>
      <branch_2>
      </branch_2>
   </Root>
```
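Case sensitivity is easy to confirm by handing both spellings to a parser. The following is a minimal sketch that assumes Python's standard `xml.etree.ElementTree` module as the parser (the book does not prescribe one); the two documents are cut-down versions of the examples above:

```python
import xml.etree.ElementTree as ET

lower = ET.fromstring("<root><branch_1/></root>")
mixed = ET.fromstring("<Root><BRANCH_1/></Root>")

# Element names are compared character for character, so these differ.
names_differ = lower.tag != mixed.tag

# Looking up a child by the wrong case finds nothing.
wrong_case_lookup = lower.find("BRANCH_1")

# A closing tag whose case does not match its opening tag is an error.
try:
    ET.fromstring("<root></Root>")
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True

print(names_differ, wrong_case_lookup, mismatch_rejected)
```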
HTML does not require proper nesting of elements, such as in this example:
```
   <FONT COLOR="red"><B><I>This is bold italic text in red</FONT></B></I>   
```
XML, on the other hand, treats the preceding code as an error. For example, in XML the following code is invalid because </tag2> should appear before </tag1>:
```
   <tag1><tag2>some tags</tag1></tag2>   
```
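The closing and nesting rules can be checked the same way. This sketch again assumes Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` is invented for the illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

overlapping = "<tag1><tag2>some tags</tag1></tag2>"   # closed in the wrong order
nested = "<tag1><tag2>some tags</tag2></tag1>"        # properly nested
unclosed = "<root><branch_1></root>"                  # branch_1 never closed

for sample in (overlapping, nested, unclosed):
    print(sample, "->", is_well_formed(sample))
```

Only the properly nested sample is accepted; the other two are rejected before any content is returned to the caller.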
Element attributes: Like HTML tags, XML elements can have attributes. An attribute refines some aspect of an element. An attribute and its value are together called a name-value pair. An XML element can have one or more name-value pairs, and the value must always be quoted. HTML attribute values do not always have to be quoted, although quoting them is advisable. In the following XML document sample (the complete document is not shown here), populations for continents (including the name of the continent) are contained as attributes of the <continent> element. Populations are stored in thousands, so the value 748,927 means that the continent of Africa had a population of 748,927,000 people (roughly 749 million) in 1998.

It follows that projected populations for the African continent are 1.3 billion (1,298,311) for the year 2025, and 1.8 billion (1,766,082) for the year 2050. Also in this example, the name of the country is stored in the XML document as an attribute of the <country> element:
```
   <?xml version="1.0"?>
   <?xml-stylesheet type="text/xsl" href="791202 fig0105.xsl" ?>
   <populationInThousands>
      <world>
         <continents>
            <continent name="Africa" year1998="748,927" year2025="1,298,311" year2050="1,766,082">
               <countries>
                  <country name="Burundi">
                     <year1998>6457</year1998>
                     <year2025>11569</year2025>
                     <year2050>15571</year2050>
                  </country>
                  <country name="Comoros">
                     <year1998>658</year1998>
                     <year2025>1176</year2025>
                     <year2050>1577</year2050>
                  </country>
                  ...
               </countries>
               ...
            </continent>
            ...
         </continents>
         ...
      </world>
      ...
   </populationInThousands>
```
Attribute values can include space characters, but element and attribute names cannot. Names such as year1998 in the <continent> element shown in the preceding sample XML document are therefore written without spaces.
As shown in Figure 1-7, the previous sample XML document does include a style sheet, making the XML document display with only the names of continents and countries.
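
Reading attributes back out of a document makes the name-value pair idea concrete. The sketch below assumes Python's standard `xml.etree.ElementTree` module and uses a shortened copy of the population sample (only the 1998 figures are kept for the countries):

```python
import xml.etree.ElementTree as ET

doc = """<continent name="Africa" year1998="748,927"
           year2025="1,298,311" year2050="1,766,082">
  <countries>
    <country name="Burundi"><year1998>6457</year1998></country>
    <country name="Comoros"><year1998>658</year1998></country>
  </countries>
</continent>"""

continent = ET.fromstring(doc)

# Attribute values always arrive as strings; the thousands separators
# have to be stripped before the figure can be used as a number.
thousands_1998 = int(continent.get("year1998").replace(",", ""))
people_1998 = thousands_1998 * 1000

# The country name is an attribute too, so it is read with get().
country_names = [c.get("name") for c in continent.iter("country")]

print(continent.get("name"), people_1998, country_names)
```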

And here is an HTML equivalent of the XML document for the previous example as shown in Figure 1-7. Notice how much more raw code there is for each population region and country:

   <HTML><BODY>
   <TABLE CELLPADDING="2" CELLSPACING="0" BORDER="1">
      <TR>
         <TH BGCOLOR="silver">Continent
         <TH BGCOLOR="silver">Country
         <TH BGCOLOR="silver">1998
         <TH BGCOLOR="silver">2025
         <TH BGCOLOR="silver">2050
      </TR>
      <TR ALIGN="right">
         <TD BGCOLOR="#D0FFFF" ALIGN=left>Africa</TD>
         <TD BGCOLOR=#FFFFD0>&nbsp;</TD>
         <TD>748,927</TD>
         <TD>1,298,311</TD>
         <TD>1,766,082</TD>
      </TR>
      <TR ALIGN="right">
         <TD BGCOLOR="#D0FFFF">&nbsp;</TD>
         <TD BGCOLOR=#FFFFD0 ALIGN=left>Burundi</TD>
         <TD>6,457</TD>
         <TD>11,569</TD>
         <TD>15,571</TD>
      </TR>
      <TR ALIGN="right">
         <TD BGCOLOR="#D0FFFF">&nbsp;</TD>
         <TD BGCOLOR=#FFFFD0 ALIGN=left>Comoros</TD>
         <TD>658</TD>
         <TD>1,176</TD>
         <TD>1,577</TD>
      </TR>
      ...
   </TABLE>
   </BODY></HTML>

Figure 1-7: Using XML element attributes to change the display of an XML document

Figure 1-8 shows the browser display of the preceding HTML page, for comparison with the XML document display in Figure 1-7 (the previous example).

Figure 1-8: HTML embeds the code and is less flexible than XML.

Comments: Both XML and HTML use the same character strings to indicate commented out code:

   <!-- This is a comment and will not be processed by the HTML or XML parser -->

Elements

As you have already seen in the previous section, an XML element is the equivalent of an HTML tag. A few rules apply explicitly to elements:

Element naming rules: The names of elements (XML tags) can contain all alphanumeric characters, as long as the name begins with a letter or an underscore and not with a number or another punctuation character. Also, names cannot contain any spaces, because XML uses a space character to separate an element name from its attributes. Do not begin an element name with the letters XML, in any combination of uppercase or lowercase characters. In other words, XML_1, xml_1, xML_1, and so on, are all reserved. The hyphen (-) and the period (.) are permitted after the first character, but their use is inadvisable because software that processes the names can confuse them with operators such as subtraction. Other operator characters, such as + (addition), are not allowed in names at all. Elements least likely to cause any problems are those whose names contain only letters, numbers, and underscores. Stay away from odd characters.
Relationships between elements: The root node has no parent, only children. All other nodes have exactly one parent node, as well as zero or more child nodes. Nodes that share the same parent are related on the same hierarchical level, as siblings. In the code example that follows, the following apply:
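The naming rules can be tried out against a real parser. This is a rough sketch, assuming Python's standard `xml.etree.ElementTree` module; the helper `parses_as_single_element` and the candidate names are invented for the illustration, and reserved names beginning with XML are deliberately left out because a parser is not required to reject them:

```python
import xml.etree.ElementTree as ET

def parses_as_single_element(name):
    """Return True if <name/> parses as one empty element called name."""
    try:
        return ET.fromstring("<" + name + "/>").tag == name
    except ET.ParseError:
        return False

candidates = ["leaf_1", "_leaf", "leaf-1", "leaf.1", "1leaf", "leaf 1", "leaf+1"]
results = {name: parses_as_single_element(name) for name in candidates}

for name, accepted in results.items():
    print(repr(name), accepted)
```

Names starting with a digit, containing a space, or containing a plus sign are rejected, while the hyphen and period are accepted even though they are best avoided.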
- The root node element is called <root> .
- The root node has two child node elements: <branch_1> and <branch_2> .
- The node <branch_1> has one child element called <leaf_1_1> .
- The node <branch_2> has three child elements called <leaf_2_1> , <leaf_2_2> , and <leaf_2_3> .
- The nodes <leaf_2_1> , <leaf_2_2> , and <leaf_2_3> are all siblings, having the same parent node element in common (node <branch_2> ):
```
   <root>
      <branch_1>
         <leaf_1_1>
         </leaf_1_1>
      </branch_1>
      <branch_2 name="branch two">
         <leaf_2_1>
         </leaf_2_1>
         <leaf_2_2>
         </leaf_2_2>
         <leaf_2_3>This is a leaf</leaf_2_3>
      </branch_2>
   </root>
```
The content of elements: XML elements can have simple content (text only), attributes, and child elements. Node <branch_2> in the preceding example has an attribute called name (with a value of branch two). The node <leaf_1_1> contains no text and no child elements. The node <leaf_2_3> contains the text string This is a leaf.
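
The parent, child, and sibling relationships described above can be walked in code. This sketch assumes Python's standard `xml.etree.ElementTree` module and a copy of the preceding example in which the empty leaf elements are written without whitespace between their tags:

```python
import xml.etree.ElementTree as ET

doc = """<root>
  <branch_1>
    <leaf_1_1></leaf_1_1>
  </branch_1>
  <branch_2 name="branch two">
    <leaf_2_1></leaf_2_1>
    <leaf_2_2></leaf_2_2>
    <leaf_2_3>This is a leaf</leaf_2_3>
  </branch_2>
</root>"""

root = ET.fromstring(doc)

# Iterating over an element yields its direct children in document order.
children_of_root = [child.tag for child in root]

# The children of branch_2 are siblings of one another.
branch_2 = root.find("branch_2")
siblings = [leaf.tag for leaf in branch_2]

# An empty element reports no text; leaf_2_3 carries simple text content.
empty_leaf_text = root.find("branch_1/leaf_1_1").text
leaf_text = branch_2.find("leaf_2_3").text

print(children_of_root, siblings, branch_2.get("name"), leaf_text)
```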

Extensible elements: XML documents can be altered without necessarily altering what is delivered by an application. Examine Figure 1-7. The following is the XSL code used to apply the reduced template to get the result shown in Figure 1-7:

   <xsl:template>
      <xsl:apply-templates select="@*"/>
      <xsl:if test="@name[.='Africa']"><HR/></xsl:if>
      <xsl:if test="@name[.='Asia']"><HR/></xsl:if>
      <xsl:if test="@name[.='Europe']"><HR/></xsl:if>
      <xsl:if test="@name[.='Latin America and the Caribbean']"><HR/></xsl:if>
      <xsl:if test="@name[.='North America']"><HR/></xsl:if>
      <xsl:if test="@name[.='Oceania']"><HR/></xsl:if>
      <xsl:value-of select="@name"/>
      <xsl:apply-templates/>
   </xsl:template>
   </xsl:stylesheet>

XSL (the eXtensible Stylesheet Language) has not been covered yet, so the details of the preceding script are not important here. The point to note is that the <xsl:value-of select="@name"/> instruction in the preceding code finds only the name attribute values, from all elements, ignoring everything else. Therefore all population numbers are discarded and only the names of continents and countries are returned. The XML document might as well look like the one shown next, with all population numbers removed. The result in Figure 1-7 will still be exactly the same:

   <?xml version="1.0"?>
   <?xml-stylesheet type="text/xsl" href="791202 fig0107.xsl" ?>
   <populationInThousands>
      <world>
         <continents>
            <continent name="Africa">
               <countries>
                  <country name="Burundi"></country>
                  <country name="Comoros"></country>
               </countries>
               ...
            </continent>
         </continents>
      </world>
   </populationInThousands>

Attributes

Elements can have attributes. An element is allowed to have zero or more attributes that describe it. Attributes are often used when the attribute is not part of the textual data set of an XML document, or when not using attributes is simply awkward. As a rule of thumb, store data as individual elements and metadata as attributes.

Metadata is the data about the data. In a database environment, the data is the names of your customers and the invoices you send them. The metadata is the definition of the customer and invoice tables used to store those records. In the case of XML and HTML, the metadata is the tags or elements (everything written between < and > characters) contained within a web page. The values between the tags are the actual data.

Once again, the now familiar population example:

   <populationInThousands>
      <world>
         <continents>
            <continent name="Africa" year1998="748,927" year2025="1,298,311" year2050="1,766,082">
               <countries>
                  <country name="Burundi">
                     <year1998>6457</year1998>
                     <year2025>11569</year2025>
                     <year2050>15571</year2050>
                  </country>
                  <country name="Comoros">
                     <year1998>658</year1998>
                     <year2025>1176</year2025>
                     <year2050>1577</year2050>
                  </country>
                  ...
               </countries>
               ...
            </continent>
            ...
         </continents>
         ...
      </world>
      ...
   </populationInThousands>

The information held in attributes can instead be stored as child elements. The example you just saw can be altered as in the next script, removing all element attributes. The following script looks busier and is perhaps a little more complex for the naked eye to decipher. The more important point to note is that the physical size of the XML document is larger because additional closing tags are introduced. In very large XML documents this can be a significant performance factor:

   <populationInThousands>
      <world>
         <continents>
            <continent>
               <name>Africa</name>
               <year1998>748,927</year1998>
               <year2025>1,298,311</year2025>
               <year2050>1,766,082</year2050>
               <countries>
                  <country>
                     <name>Burundi</name>
                     <year1998>6457</year1998>
                     <year2025>11569</year2025>
                     <year2050>15571</year2050>
                  </country>
                  <country>
                     <name>Comoros</name>
                     <year1998>658</year1998>
                     <year2025>1176</year2025>
                     <year2050>1577</year2050>
                  </country>
                  ...
               </countries>
               ...
            </continent>
            ...
         </continents>
         ...
      </world>
      ...
   </populationInThousands>
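
The size difference between the two styles can be measured directly. The sketch below assumes Python's standard `xml.etree.ElementTree` module and compares a single Burundi record written both ways; the all-attribute variant is hypothetical (the samples above keep the yearly figures for countries as child elements) and is used only to make the comparison like for like:

```python
import xml.etree.ElementTree as ET

as_attributes = (
    '<country name="Burundi" year1998="6457" '
    'year2025="11569" year2050="15571"/>'
)
as_elements = (
    "<country><name>Burundi</name><year1998>6457</year1998>"
    "<year2025>11569</year2025><year2050>15571</year2050></country>"
)

attr_form = ET.fromstring(as_attributes)
elem_form = ET.fromstring(as_elements)

# Both forms carry the same four values ...
values_from_attrs = dict(attr_form.attrib)
values_from_elems = {child.tag: child.text for child in elem_form}

# ... but every child element needs a closing tag, so it takes more bytes.
size_attrs = len(as_attributes.encode("utf-8"))
size_elems = len(as_elements.encode("utf-8"))

print(size_attrs, size_elems, values_from_attrs == values_from_elems)
```

The overhead per record is small, but it is repeated for every country in the document.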

From a purely programming perspective, it could be stated that attributes should not be used because of the following reasons:

- Elements help to define structure and attributes do not.
- Attributes are not allowed to have multiple values whereas elements can.
- Programming is more complex using attributes.
- Attributes are more difficult to alter in XML documents at a later stage.

As already stated, the preceding reasons are all sensible from a purely programming perspective. From a database perspective, and XML in databases, the preceding points need some refinement and perhaps even some contradiction:

Elements define structure and attributes do not. I prefer not to put too much structure into data, particularly in a database environment because the overall architecture of data can become too complex to manage and maintain, both for administrators and the database software engine. Performance can become completely disastrous if a database gets large because there is simply too much structure to deal with.
Attributes are not allowed multiple values. If attributes need to have multiple values then those attributes should probably become child elements anyway. This book is after all about XML databases (and XML in databases). Therefore it makes sense to say that an attribute with multiple values is effectively a one-to-many relationship.

You send many invoices to your customers. There is a one-to-many relationship between each customer and all of their respective invoices. A one-to-many relationship is also known as a master-detail relationship. In this case the customer is the master, and the invoices are the detail structural element. The many side of this relationship is also known as a collection, or even an array, in object methodology parlance.
Attributes make programming more complex. Programming is more complex when accessing attributes because code has to select specific values. Converting attributes to multiple contained elements allows a program to scan through array or collection structures. Once again, performance should always be considered as a factor. Scrolling through a multitude of elements contained within an array or collection is much less efficient than searching for exact attributes within exact elements. It is much faster to find a single piece of data than to search through lots of elements when you do not even know whether the element exists. An XML document can contain an element that is empty, or the element can simply not exist at all. From a database performance perspective, avoiding attributes in favor of contained, unreferenced collections (which is what a multitude of same-named elements is) is suicidal for your applications if your database grows to even a reasonable size. It will just be too slow.
Attributes are not expansion friendly. It is more difficult to change metadata than it is to change data. It should be. If you have to change metadata then there might be data structural design issues anyway. In a purely database environment (not using XML), changing the database model is the equivalent of changing metadata. In commercial environments metadata is usually not altered because it is too difficult and too expensive. All application code depends on database structure not being changed. Changing database metadata requires application changes as well. That's why it can get expensive. From a perspective of XML and XML in databases, you do not want to change attributes because attributes represent metadata, and that is a database modeling design issue, not a programming issue. Changing the data is much, much easier.

Try It Out: Using XML Syntax

The following data represents four regions, containing six countries, as in the previous Try It Out sections in this chapter. In this example, currencies are now added:

   Africa           Zambia           Kwacha
   Africa           Zimbabwe         Zimbabwe Dollars
   Asia             Burma
   Australasia      Australia        Dollars
   Caribbean        Bahamas          Dollars
   Caribbean        Barbados         Dollars

In this example, you use what you have learned about the difference between XML document elements and attributes.

The following script is the XML document created in the first Try It Out section in this chapter:

   <?xml version="1.0"?>
   <regions>
      <region>Africa</region>
         <country>Zambia</country>
         <country>Zimbabwe</country>
      <region>Asia</region>
         <country>Burma</country>
      <region>Australasia</region>
         <country>Australia</country>
      <region>Caribbean</region>
         <country>Bahamas</country>
         <country>Barbados</country>
   </regions>

You will use the preceding XML document and add the currencies for each country. Do not create any new elements in this XML document.

Change the XML document as follows:

Open the XML document. You can copy the existing XML text into a new text file if you want.
All you do is add an attribute name-value pair to each opening <country> tag:
```
   <country currency="Kwacha">Zambia</country>   
```

The final XML document looks something like this:

   <?xml version="1.0"?>
   <regions>
      <region>Africa</region>
         <country currency="Kwacha">Zambia</country>
         <country currency="Zimbabwe Dollars">Zimbabwe</country>
      <region>Asia</region>
         <country>Burma</country>
      <region>Australasia</region>
         <country currency="Dollars">Australia</country>
      <region>Caribbean</region>
         <country currency="Dollars">Bahamas</country>
         <country currency="Dollars">Barbados</country>
   </regions>

Figure 1-9 shows the result when executed in a browser.

Figure 1-9: Adding attributes to elements in an XML document

How It Works

All you did was edit an XML document containing the XML tag, a single root node, and various regions of the world with some of their respective countries. You then added a currency attribute to the <country> elements for which a currency is known.
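
The same edit can be made programmatically rather than by hand. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; it uses only the first three countries from the data, and the `currencies` lookup table is a stand-in for wherever the currency data would really come from:

```python
import xml.etree.ElementTree as ET

doc = """<regions>
  <region>Africa</region>
    <country>Zambia</country>
    <country>Zimbabwe</country>
  <region>Asia</region>
    <country>Burma</country>
</regions>"""

# Currency lookup taken from the data listing; Burma has no entry.
currencies = {"Zambia": "Kwacha", "Zimbabwe": "Zimbabwe Dollars"}

regions = ET.fromstring(doc)
for country in regions.iter("country"):
    currency = currencies.get(country.text)
    if currency is not None:
        # set() adds a name-value pair to the opening tag;
        # no new element is created.
        country.set("currency", currency)

result = ET.tostring(regions, encoding="unicode")
print(result)
```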

Reserved Characters in XML

Escaping a character prevents a programming language or parser from interpreting that character as part of its own syntax. Thus the < and > characters must be escaped (using an escape sequence) if they are used in an XML document anywhere other than delimiting tags (elements). In XML, an escape sequence is a sequence of characters known to the XML parser to represent a special character. These escape sequences are exactly the same as those used by HTML. The following XML code is invalid:

   <country name="Germany">West < East</country>

The preceding code can be made into valid XML by replacing the < character with the escape sequence string &lt; as follows:

   <country name="Germany">West &lt; East</country>

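The <, >, and & characters have special meaning in XML and will be interpreted as markup unless they are escaped as &lt;, &gt;, and &amp; respectively. Quotation characters of all forms are best avoided or replaced with an escape sequence (&quot; for a double quote and &apos; for a single quote).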
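
Escaping does not have to be done by hand. The sketch below assumes Python's standard library, where `xml.sax.saxutils` provides `escape`, `unescape`, and `quoteattr`; the sample strings are invented for the illustration:

```python
from xml.sax.saxutils import escape, quoteattr, unescape
import xml.etree.ElementTree as ET

raw = "West < East & North > South"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw)

# unescape() reverses the substitution.
round_trip = unescape(escaped)

# quoteattr() escapes a value and wraps it in quotes so that it can be
# dropped straight into an opening tag as an attribute value.
attribute = quoteattr('the "old" border')

element = "<country name=%s>%s</country>" % (attribute, escaped)
parsed = ET.fromstring(element)

print(escaped)
print(element)
```

After parsing, the element text and the attribute value are the original unescaped strings, which is why the escaping step is invisible to code that reads the document.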

Ignoring the XML Parser with CDATA

There is a special section in an XML document called the CDATA section. The XML parser does not interpret anything within the CDATA section as markup, so no errors are raised and no syntax checking is performed on its content. The CDATA section can be used to include scripts written in other languages such as JavaScript. The CDATA section is the equivalent of a <SCRIPT> </SCRIPT> tag enclosed section in an HTML page. The CDATA section begins with the string <![CDATA[ and ends with the string ]]>, as in the following script example:

   <SCRIPT>
   <![CDATA[
   function F_To_C(F) {
      return ((F - 32) * (5 / 9));
   }
   ]]>
   </SCRIPT>
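
To see what a parser does with a CDATA section, consider this sketch, which assumes Python's standard `xml.etree.ElementTree` module. The `<page>` and `<script>` element names and the `isColder` function are made up for the example; the function body was chosen because it contains characters that would otherwise need escaping:

```python
import xml.etree.ElementTree as ET

doc = """<page>
  <script><![CDATA[
function isColder(a, b) {
   return a < b && b > 0;
}
]]></script>
</page>"""

page = ET.fromstring(doc)

# The parser hands back the CDATA content verbatim as the element's text;
# the <, > and & characters inside it are not treated as markup.
script_text = page.find("script").text

# The same characters outside a CDATA section are rejected.
try:
    ET.fromstring("<script>return a < b && b > 0;</script>")
    unescaped_rejected = False
except ET.ParseError:
    unescaped_rejected = True

print(script_text)
print(unescaped_rejected)
```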

What Are XML Namespaces?

Two different XML documents containing elements with the same name, where those names have different meanings, could cause conflict. This XML document contains weather forecasts for three different cities. The <name> element represents the name of each city:

   <?xml version="1.0"?>
   <WeatherForecast date="2/1/2004">
      <city>
         <name>Frankfurt</name>
         <temperature><min>43</min><max>52</max></temperature>
      </city>
      <city>
         <name>London</name>
         <temperature><min>31</min><max>45</max></temperature>
      </city>
      <city>
         <name>Paris</name>
         <temperature><min>20</min><max>74</max></temperature>
      </city>
   </WeatherForecast>

This next XML document also contains <name> elements but those names are of countries and not of cities. Adding these two XML documents together could cause a semantic (meaning) conflict between the <name> elements in the two separate XML documents:

   <?xml version="1.0"?>
   <WeatherForecast date="2/1/2004">
      <country>
         <name>Germany</name>
         <temperature><min>22</min><max>45</max></temperature>
      </country>
      <country>
         <name>England</name>
         <temperature><min>24</min><max>39</max></temperature>
      </country>
      <country>
         <name>France</name>
         <temperature><min>22</min><max>85</max></temperature>
      </country>
   </WeatherForecast>

Namespaces can be used to resolve this type of conflict by assigning a separate prefix to each XML document, adding the prefix to tags in each XML document as follows for the XML document containing cities:

   <?xml version="1.0"?>
   <i:WeatherForecast xmlns:i="http://www.mywebsite.com/nsforcities" date="2/1/2004">
      <i:city>
         <i:name>Frankfurt</i:name>
         <i:temperature><i:min>43</i:min><i:max>52</i:max></i:temperature>
      </i:city>
      <i:city>
         <i:name>London</i:name>
         <i:temperature><i:min>31</i:min><i:max>45</i:max></i:temperature>
      </i:city>
      <i:city>
         <i:name>Paris</i:name>
         <i:temperature><i:min>20</i:min><i:max>74</i:max></i:temperature>
      </i:city>
   </i:WeatherForecast>

And for the XML document containing countries, you use a different prefix:

   <?xml version="1.0"?>
   <o:WeatherForecast xmlns:o="http://www.mywebsite.com/nsforcountries" date="2/1/2004">
      <o:country>
         <o:name>Germany</o:name>
         <o:temperature><o:min>22</o:min><o:max>45</o:max></o:temperature>
      </o:country>
      <o:country>
         <o:name>England</o:name>
         <o:temperature><o:min>24</o:min><o:max>39</o:max></o:temperature>
      </o:country>
      <o:country>
         <o:name>France</o:name>
         <o:temperature><o:min>22</o:min><o:max>85</o:max></o:temperature>
      </o:country>
   </o:WeatherForecast>

Creating the preceding XML documents using prefixes has actually created separate elements in separate documents. This is done by using an attribute (xmlns) and a URL. Also, when using a namespace, you don't have to assign the prefix to every child element. If the parent node declares a default namespace, by using the xmlns attribute without a prefix, that node and all of its unprefixed child elements belong to the namespace. So with the first XML document previously listed you can do this:

   <?xml version="1.0"?>
   <WeatherForecast xmlns="http://www.mywebsite.com/nsforcities" date="2/1/2004">
      <city>
         <name>Frankfurt</name>
         <temperature><min>43</min><max>52</max></temperature>
      </city>
      <city>
         <name>London</name>
         <temperature><min>31</min><max>45</max></temperature>
      </city>
      <city>
         <name>Paris</name>
         <temperature><min>20</min><max>74</max></temperature>
      </city>
   </WeatherForecast>

You could also use a namespace for the weather forecast for the countries.
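
How a parser keeps same-named elements apart can be seen in this sketch, which assumes Python's standard `xml.etree.ElementTree` module. The `<forecasts>` wrapper element and the second namespace URI (`nsforcountries`) are hypothetical, invented so that both kinds of `<name>` element can sit in one document:

```python
import xml.etree.ElementTree as ET

CITIES_NS = "http://www.mywebsite.com/nsforcities"
# Hypothetical second URI, invented for this sketch.
COUNTRIES_NS = "http://www.mywebsite.com/nsforcountries"

doc = """<forecasts xmlns:i="%s" xmlns:o="%s">
  <i:city><i:name>Frankfurt</i:name></i:city>
  <o:country><o:name>Germany</o:name></o:country>
</forecasts>""" % (CITIES_NS, COUNTRIES_NS)

root = ET.fromstring(doc)

# ElementTree expands each prefix to its URI, written as {uri}localname,
# so the two name elements end up with different tags.
all_names = [el.tag for el in root.iter() if el.tag.endswith("}name")]

# Searching with a prefix map picks out the names from one namespace only.
prefixes = {"i": CITIES_NS, "o": COUNTRIES_NS}
city_names = [el.text for el in root.findall(".//i:name", prefixes)]
country_names = [el.text for el in root.findall(".//o:name", prefixes)]

# With a default namespace, unprefixed children inherit the parent's URI.
inherited = ET.fromstring('<city xmlns="%s"><name>Paris</name></city>' % CITIES_NS)
inherited_child_tag = inherited[0].tag

print(all_names, city_names, country_names, inherited_child_tag)
```

The prefix itself is only shorthand; it is the URI that distinguishes the two kinds of name.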

XML in Many Languages

Storing XML documents in a language other than English requires some characters not used in the English language. These characters are encoded if not stored in Unicode. Notepad allows you to store text files, in this case XML documents, in Unicode. In Notepad on Win2K, select the Encoding option under the Save As menu option.

When reloading the XML document in a browser you simply have to alter the XML tag at the beginning of the script, to indicate that an encoding other than the default is used. Win2K (SP3) Notepad will allow storage as ANSI (the default), Unicode, Unicode big endian, and UTF-8. To allow the XML parser in a browser to interpret the contents of an XML document stored as UTF-8 change the XML tag as follows:

   <?xml version="1.0" encoding="UTF-8"?>
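
The effect of the encoding declaration can be demonstrated without a browser. This sketch assumes Python's standard `xml.etree.ElementTree` module; the `<city>` element and the ISO-8859-1 variant are illustrative choices, not taken from the text above:

```python
import xml.etree.ElementTree as ET

text = "<city>M\u00fcnchen</city>"   # "München", with a non-ASCII character

# The same document stored twice, each time with a matching declaration.
utf8_bytes = ('<?xml version="1.0" encoding="UTF-8"?>' + text).encode("utf-8")
latin1_bytes = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + text).encode("iso-8859-1")

# The parser recovers identical text from both, because each declaration
# names the encoding actually used for the bytes that follow it.
from_utf8 = ET.fromstring(utf8_bytes).text
from_latin1 = ET.fromstring(latin1_bytes).text

# A declaration that does not match the stored bytes is an error.
mislabelled = ('<?xml version="1.0" encoding="UTF-8"?>' + text).encode("iso-8859-1")
try:
    ET.fromstring(mislabelled)
    mislabel_rejected = False
except ET.ParseError:
    mislabel_rejected = True

print(from_utf8, from_latin1, mislabel_rejected)
```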