2.3 The very basics of XML

XML always deals with data within the context of documents. Data (i.e., content) are included in an XML document as strings of text. The data are bracketed by XML text markup, which sets out to describe these data (i.e., give them context). The basic building blocks of an XML document are called elements, where an element is a specific unit of data along with the XML markup describing these data. An XML element is made up of a name and some content.

The XML markup, in much the same way as HTML, is in the form of tags. A tag appears within angle brackets (e.g., <tag>, <name>, <price>, <wife>, and so forth). The big difference between XML and HTML when it comes to tags is that, unlike HTML, XML does not come with its own set of tags. Instead, in XML you make up custom tags specific to the data to be described.

Though there are no predefined tags per se, XML does have the concept of optional processing instructions, which can be used, in the vein of comment statements, to convey information about an XML document. Processing instructions appear within tags that have ? at the start and end (i.e., <? ?>). Closely associated with these is the XML declaration that, though optional, is nonetheless often inserted at the top of an XML document. This XML declaration, which takes the form <?xml version="1.0"?> for the initial and most widely used version of XML, specifies the version of XML being used to describe the contents of that document. (Strictly speaking, the XML specification treats the declaration as a construct in its own right that shares the processing-instruction syntax.) These constructs are the closest thing that XML has to predefined markup.
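As a rough illustration of the <? ?> syntax, the following Python sketch parses a small document with the standard library's xml.dom.minidom module and inspects the processing instruction that follows the XML declaration. The xml-stylesheet instruction, the style.css file name, and the note element are assumptions chosen for the example; they do not come from the text above.

```python
from xml.dom import minidom

# A tiny document: an XML declaration, one processing instruction,
# and a root element. The stylesheet name is purely illustrative.
source = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    '<note>hello</note>'
)

document = minidom.parseString(source)

# The declaration is consumed by the parser; the first node in the
# document is therefore the processing instruction.
first_node = document.childNodes[0]
is_pi = first_node.nodeType == minidom.Node.PROCESSING_INSTRUCTION_NODE
pi_target = first_node.target  # the name right after "<?"
pi_data = first_node.data      # everything else up to "?>"

print(is_pi, pi_target, pi_data)
```

The parser hands the instruction to the application untouched, which is why such instructions can carry application-specific hints without affecting the element structure.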

An XML element is delimited by two tags: a start tag and an end tag, where an end tag corresponds to a start tag but has a forward slash (i.e., /) before the name (e.g., </tag>, </name>, </price>, </wife>, and so forth). The element will typically consist of these two tags with text (i.e., data) in the middle (e.g., <name>Nelson Mandela</name>, <price>$120.00</price>, <wife>yes</wife>).
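The name-plus-content structure of an element can be seen by handing one of the examples above to a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Parse a single element: start tag, text content, end tag.
element = ET.fromstring("<name>Nelson Mandela</name>")

element_name = element.tag      # the name carried by the start and end tags
element_content = element.text  # the data between the two tags

print(element_name, "->", element_content)
```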

2.3.1 XML elements and element names

The start tag is what gives the element its name. In the example <e-mail>anu@wownh.com</e-mail>, the element name is e-mail and the content of this element is anu@wownh.com. XML, in its quest to be extensible, does not specify the tags, and hence the element names, you can use within an XML document. XML gives users carte blanche when it comes to defining elements. There are, however, strict rules as to the composition of the XML names that can be given to an XML element.

First, XML names must begin with either an alphabetic letter (e.g., A to Z or a to z) or the underscore character (i.e., _). It is important to remember that XML names cannot start with numeric digits. XML names can be of any length. Following the first character, the remainder of the name can be made up of the following:

Alphabetic letters
Numeric digits
Underscore character
Dot character (i.e., .)
Hyphen character (i.e., -)
Colon character (i.e., :); however, this is a special, reserved character associated with XML namespaces, which are described later

Note that spaces are not valid within XML names. The only other restriction is that names cannot start with the string xml (in any mix of uppercase and lowercase), which is reserved for use by the XML specification per se. The string xml may, however, occur elsewhere within a name.
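The naming rules above can be approximated with a short check. The following Python sketch is a simplification under stated assumptions: it only considers ASCII letters, whereas the XML specification admits a much wider range of Unicode letters, and the function name is_valid_xml_name is hypothetical rather than part of any library.

```python
import re

# ASCII-only approximation of the rules described above:
# first character is a letter or underscore; the rest may be letters,
# digits, underscores, dots, hyphens, or colons.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.:\-]*")


def is_valid_xml_name(name: str) -> bool:
    """Return True if name satisfies the simplified naming rules."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    # Names beginning with "xml" (in any case) are reserved.
    return not name.lower().startswith("xml")


for candidate in ["e-mail", "_private", "2fast", "given name", "xmlData", "my-xml"]:
    print(candidate, is_valid_xml_name(candidate))
```

A production system would normally rely on the parser itself to reject bad names; a check like this is mainly useful when element names are generated from user input.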

In marked contrast to HTML, XML tags (and hence XML names) are case sensitive. Hence, <name>, <NAME>, and <Name> represent three very different tags. Thus, <Name>Deanna Guruge</name> would not be a valid element in XML. <Zipcode> and <ZipCode> would also be treated as different names. Given this case sensitivity, there are two popular conventions, as opposed to rules, when it comes to XML names. The first is to stick to lowercase wherever possible. If the name consists of multiple words and it helps to separate them, then hyphens are used between the words (e.g., <product-item-number>). Others prefer what is referred to as the Camel Case convention, in which the first letter of each word is capitalized (e.g., <ProductItemNumber>). But these, one should not forget, are conventions and not rules.
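Case sensitivity is easy to confirm with a parser. The sketch below uses Python's xml.etree.ElementTree; the helper name is_well_formed is hypothetical and simply reports whether the parser accepts the text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start and end tags agree exactly, including case.
matched = is_well_formed("<Name>Deanna</Name>")

# The end tag differs from the start tag only by case, so it does not match.
mismatched = is_well_formed("<Name>Deanna</name>")

print(matched, mismatched)
```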

In some cases there can be further XML markup nested between the original start and end tags, for example:

 <name>
   <given-name>Nelson</given-name>
   <surname>Mandela</surname>
 </name>

However, nested elements cannot overlap each other. In other words, the tags making up a nested element cannot be interleaved with those of another element. Thus, in the previous example, the given-name element must be terminated prior to the start of the surname element; that is, the </given-name> tag must appear before the <surname> tag. In the same way, the surname element must be completed before the name element is terminated.
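The difference between proper nesting and overlapping can be checked mechanically. In this Python sketch (standard library only), the first document nests its child elements correctly, while the second closes given-name after surname has already been opened; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Properly nested: each child closes before the next one opens.
nested = ET.fromstring(
    "<name><given-name>Nelson</given-name><surname>Mandela</surname></name>"
)
child_names = [child.tag for child in nested]

# Overlapping: </given-name> appears while <surname> is still open.
overlapping = (
    "<name><given-name>Nelson<surname></given-name>Mandela</surname></name>"
)
try:
    ET.fromstring(overlapping)
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False

print(child_names, overlap_accepted)
```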

It is possible to have attributes associated with XML elements, as was shown in Figure 2.2, where the N="value" in each <LINE N="n"> tag is an attribute. Attributes provide a mechanism whereby small amounts of data can be quickly and easily associated with an element. Thus, in the case of the XML used to describe the Wordsworth poem in Figure 2.2, the N attribute is used to associate a line number with each line in the poem. If the N attribute were not used in this instance, then one would have to use a separate line number element to realize the same effect, which, as can be seen here, would be slightly more cumbersome:

 <STANZA>
   <LINE-NUMBER>1</LINE-NUMBER>
   <LINE>I heard a thousand blended notes,</LINE>
   <LINE-NUMBER>2</LINE-NUMBER>
   <LINE>While in a grove I sate reclined,</LINE>
   ...
 </STANZA>
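For comparison, here is a sketch of how the attribute form might be read programmatically. It assumes a STANZA element holding LINE elements that each carry an N attribute, in the spirit of the markup described for Figure 2.2; the exact layout of that figure is not reproduced here, so treat the document string as an assumption.

```python
import xml.etree.ElementTree as ET

# Each LINE carries its line number as an attribute instead of
# requiring a separate LINE-NUMBER element.
stanza = ET.fromstring(
    "<STANZA>"
    '<LINE N="1">I heard a thousand blended notes,</LINE>'
    '<LINE N="2">While in a grove I sate reclined,</LINE>'
    "</STANZA>"
)

# Attribute values are always strings, so convert the number explicitly.
numbered_lines = [(int(line.get("N")), line.text) for line in stanza.findall("LINE")]

for number, text in numbered_lines:
    print(number, text)
```

With the attribute form, the number travels with the line it describes, so there is no need to pair up sibling elements by position.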

The ability to have attributes results in one of the very few syntactical oddities in XML: the so-called empty elements, written using empty tags. An empty element does not contain any content. You could create one using an opening and closing tag with no content between them. This would be somewhat redundant were it not for the fact that you can still include information within such an element using attributes. The concept of empty tags makes this useful XML construct that much simpler to use. With an empty tag you do not need a corresponding closing tag. Instead, just one tag serves as both the start and end tag. An empty tag is enclosed in angle brackets (< >) per the norm but contains a forward slash right in front of the closing angle bracket (/>), for example:

 <wife name="Deanna Guruge" birthday="June 27" phone="555-2293"/>

With the exception of empty tags, all other XML tags have to appear in matching start and end pairs. This is another big difference between XML and HTML, given that HTML does permit certain tags to be used without corresponding end tags, with the <br>, <p>, <hr>, <col>, and <img> HTML tags being classic examples.
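A parser treats an empty tag and an explicit start/end pair with nothing between them as the same thing. The following Python sketch, using the standard xml.etree.ElementTree module, parses both forms of the wife element and compares them; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The same element written as a single empty tag and as a tag pair.
empty_tag = ET.fromstring(
    '<wife name="Deanna Guruge" birthday="June 27" phone="555-2293"/>'
)
tag_pair = ET.fromstring(
    '<wife name="Deanna Guruge" birthday="June 27" phone="555-2293"></wife>'
)

attributes = dict(empty_tag.attrib)

# Both forms carry identical attributes...
same_attributes = empty_tag.attrib == tag_pair.attrib
# ...and the empty tag has neither text nor child elements.
no_content = empty_tag.text is None and len(empty_tag) == 0

print(attributes["name"], same_attributes, no_content)
```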

2.3.2 Special characters in XML and XML entities

By carefully restricting the characters that may appear within names, XML deftly avoids the problem of restricted characters (e.g., the angle brackets) turning up in XML names. However, XML cannot, and does not want to, control the characters that may appear within the content (or data portion) of an XML element. Given XML's tag-oriented syntax, the appearance of restricted characters, such as the angle brackets, within the content of an element would wreak havoc with the XML-related structure of that document. Consequently, there are five special symbols in XML that have to be entered differently. In essence, there are escape sequences assigned to these five special symbols so that their presence does not disrupt the syntax of XML. XML handles this mapping of escape sequences to special symbols via a generalized reference insertion mechanism known as entities.

In XML, an entity is a symbol that represents, or identifies, a predefined resource, where this resource may be a file or a piece of text. Entities are included within a document via entity references. An entity reference is written with an ampersand (&) at the beginning and a semicolon (;) at the end (e.g., &copyright;, &UK;, &NH;, or &WSTP;). An XML parser will automatically replace the entity reference with the value assigned to that entity. Values are assigned to entities via an entity declaration, which has the form:

 <!ENTITY entityname "entitydefinition">

Thus, some of the entities mentioned previously could be defined as:

 <!ENTITY UK "United Kingdom">
 <!ENTITY NH "New Hampshire">
 <!ENTITY WSTP "IBM WebSphere Transcoding Publisher">
 <!ENTITY copyright "&#169;">

Unlike the others, the copyright entity is less intuitive, because it illustrates another XML feature: the ability to enter characters directly as Unicode character references. In this example, 169 is the character code for the copyright symbol (©). If you are using Windows, use the Character Map utility, found under Programs and then Accessories off the Windows Start button, to find these character codes. Character references are prefaced by &# and terminated by a semicolon; hence &#169;. The character code for an e with an acute accent (i.e., é) is 233. So you could define an XML entity for it that reads:

 <!ENTITY eacute "&#233;">

These entity references could be used in an XML document as follows:

 <article-body>
   What you always wanted to know about XML
   &copyright; Anura Gurug&eacute;
 </article-body>
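To see entity replacement in action, the declarations can be placed in a document's internal DTD subset and the result handed to a parser. This is a minimal sketch using Python's xml.etree.ElementTree; wrapping the declarations in a DOCTYPE block and the exact text of the element are assumptions made for the example, since the text above shows the declarations on their own.

```python
import xml.etree.ElementTree as ET

# Entity declarations live in the DOCTYPE's internal subset;
# the references in the content are replaced during parsing.
document_text = """<!DOCTYPE article-body [
  <!ENTITY copyright "&#169;">
  <!ENTITY eacute "&#233;">
  <!ENTITY UK "United Kingdom">
]>
<article-body>&copyright; Anura Gurug&eacute;, &UK;</article-body>"""

article = ET.fromstring(document_text)

# By the time the application sees the text, no references remain.
expanded_text = article.text

print(expanded_text)
```

Because expansion happens inside the parser, applications processing untrusted XML usually restrict or disable custom entities; that concern is outside the scope of this sketch.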

Once the workings of entity references are understood, the way that XML handles the five special characters becomes obvious. XML predefines five entities to represent these special characters, as follows:

Left angle bracket or less than symbol (<) as &lt;
Right angle bracket or greater than symbol (>) as &gt;
Ampersand symbol (&) as &amp;
Double quote symbol (") as &quot;
Apostrophe symbol (') as &apos;

Thus, in XML, the company name Johnson & Johnson will have to appear as:

 <CompanyName>Johnson &amp; Johnson</CompanyName>
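In practice the escaping is rarely done by hand. The sketch below uses the escape function from Python's standard xml.sax.saxutils module to produce the escaped form, then parses it back with xml.etree.ElementTree to show that the original text is recovered; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_name = "Johnson & Johnson"

# escape() replaces &, <, and > with their predefined entity references.
escaped_name = escape(raw_name)
markup = f"<CompanyName>{escaped_name}</CompanyName>"

# Parsing reverses the substitution.
parsed_name = ET.fromstring(markup).text

# All five predefined entities are understood without any declaration.
all_five = ET.fromstring("<t>&lt; &gt; &amp; &quot; &apos;</t>").text

print(markup)
print(parsed_name)
print(all_five)
```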

There are two other related concepts that should be dealt with in this context of names and entities: language specification and how to denote space characters that are meaningful. XML provides a special attribute, xml:lang, that enables one to specify the language in which the content of an element is written. In essence, you can have an xml:lang attribute per element. Thus, you could distinguish between U.K. English and U.S. English as follows:

 <articlebody xml:lang="en-GB">The foreground color is green.</articlebody>
 <articlebody xml:lang="en-US">The foreground color is green.</articlebody>

The codes that can be used to specify the root language, in this case English (i.e., en), though it could equally well have been French (i.e., fr) or Hindi (i.e., hi), are defined by ISO 639. Visit http://lcweb.loc.gov/standards/iso639-2/englangn.html for more information. In the case of languages with dialects, a subcode, in our example GB and US, can be used to get very specific. In the case of French, Canadian French can be specified as xml:lang="fr-CA". The subcodes for this purpose are defined by ISO 3166. Visit http://www-old.ics.uci.edu/pub/ietf/http/related/iso3166.txt for more details.

The other special attribute that can be used within an element is xml:space . It can have one of two values: preserve or default (with default, obviously, being the default if this attribute is not included within an element). Stating xml:space="preserve" instructs XML applications that the spaces appearing within the content of this element are meaningful and should be preserved.
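Both special attributes can be read like any other attribute once parsed. One detail worth knowing, shown in this Python sketch, is that xml.etree.ElementTree reports the xml: prefix as the reserved namespace URI http://www.w3.org/XML/1998/namespace. The articles wrapper element and the listing element are hypothetical additions needed to make a single well-formed document.

```python
import xml.etree.ElementTree as ET

# The reserved namespace bound to the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"

snippet = ET.fromstring(
    "<articles>"
    '<articlebody xml:lang="en-GB">The foreground color is green.</articlebody>'
    '<articlebody xml:lang="en-US">The foreground color is green.</articlebody>'
    '<listing xml:space="preserve">  indented   text  </listing>'
    "</articles>"
)

# Attribute keys take the form {namespace-uri}local-name.
languages = [body.get(f"{{{XML_NS}}}lang") for body in snippet.findall("articlebody")]
space_mode = snippet.find("listing").get(f"{{{XML_NS}}}space")

print(languages, space_mode)
```

Note that xml:space is only a signal to the receiving application; it is up to that application to honor it when it decides how to treat the spaces.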

2.3.3 The need for mutual understanding

XML thus takes a flat stream of text, which represents data of some sort, and transforms it into a set of self-describing objects. This is what XML is all about. It provides a flexible, open-ended mechanism for describing any type of data. The goal of XML is to ensure that a recipient of an XML document is able to easily, unambiguously, and consistently determine the nature and structure of the data contained within that document. This then enables the recipient to correctly manipulate and process the data without mistaking what the data are supposed to represent (albeit subject to a caveat discussed shortly). Thus, if XML were used to describe Web pages, search engines would be able to better identify the context and meaning of keywords, since they would now contain XML-based descriptive tags. Since XML documents are always in text form, they can invariably be read and deciphered by people, but people have innate intelligence. Computer applications do not, and that is a significant issue when it comes to XML.

In order for an application to be able to successfully process an XML document, it needs to know what the various elements represent. In other words, the recipient application needs to know what the tags mean. Since the meaning of a tag can differ significantly between different organizations, countries, and industry sectors, an application really needs to know what each tag means within a specific application domain. This is the problem when it comes to XML. Just because you have a well-formed XML document does not guarantee that it can and will be correctly interpreted by any and all applications.

An analogy widely used in the mid-1980s to explain the need for networking protocols and architectures can now be conveniently reused to highlight XML's reliance on shared understanding at both ends of a transaction. This analogy relates to making a direct-dial phone call between London and Moscow. Though you will, with luck, get a connection, there is no guarantee that you will be able to hold a meaningful conversation unless both of you happen to know a common language, whether it be Russian, Esperanto, English, or French. The same is true with XML.

To successfully process XML, you need a mutual understanding at both ends of the transaction. XML provides mechanisms to facilitate this mutual understanding. The two main schemes for this are called Document Type Definitions (DTDs) and XML schema. In some special cases it is possible to have DTD-less XML documents, provided the elements are structured in some type of self-explanatory manner. DTDs and XML schema are described in Section 2.5.

2.3.4 XML namespaces

Extensibility is the beauty and the bane of XML. Enabling users to define tags and, as such, element names at will, though desirable, can lead to ambiguity and misunderstandings, especially if the same names appear to mean different things in different XML documents. DTDs and XML schema are not the answer here, since they tend to be document specific. One option would be to implement a global naming registry, as IBM tried to do to ensure unique network identifiers (i.e., NETIDs) for SNA and APPN networks in the mid-1980s, or the Domain Name System (DNS) now used for Web addresses (i.e., URLs). This would be unwieldy and impractical and compromise the underlying principles and flexibility of XML.

Let us assume that you create an XML document listing your favorite PC games and post this on a Web site as a public domain document. One of your friends could take this list and decide to rate the various games. To do this your friend might add a new rating element. Another one of your friends might decide to rate the games by their suitability for various age groups using the General (G), Parental Guidance (PG), PG-13, and so forth rating system used in North America for movies, TV shows, and even some PC and video games. Given that this is also a rating system, they too may decide to add this new classification using a rating element. However, these two rating elements, though they apply to the same base document, mean very different things.

One way to overcome such a conflict is to qualify (or prefix) the various element names (e.g., fun-rating and age-rating). This works but can be limiting, since qualifying the element names is contingent on one anticipating potential conflicts. XML's strategic solution for preventing this type of conflict is the use of namespaces. Namespaces are an elegant and nonintrusive mechanism to enable unique identification of XML elements without in any way restricting the flexibility or extensibility of XML. Namespaces are implemented by attaching a prefix, separated from the name by a colon (:), to each element and possibly even to each attribute. Thus, with namespaces, the rating elements would be written as fun:rating and age:rating, where the prefixes now refer to namespaces. These prefixes are mapped to what is called a Uniform Resource Identifier (URI). These URI mappings typically appear near the start of the XML document and have the following form:

 <pc-games xmlns:fun="http://www.wownh.com/funrating"
           xmlns:age="http://www.wownh.com/agerating">

where xmlns identifies this as an XML namespace (ns) declaration. Note that this URI scheme is similar to how you define an external DTD.

The exact use of URIs in XML is somewhat complex and confusing, especially since there is no requirement that a URI be valid in the sense of being resolvable. In other words, a URI does not have to point to anything. In the context of namespaces, URIs are only used as identifiers. Since URIs do not need to resolve, they are essentially treated as case-sensitive text strings. After stating these caveats, it is fair to say that for practical purposes most XML developers use valid URLs as their URIs. These URLs may then point to a file that contains the exact definition of the element being qualified. The Internet Engineering Task Force (IETF) is also working on another form of URI known as Uniform Resource Names (URNs). Whereas URLs usually start with a protocol designation such as http or ftp, URNs start with a urn: prefix. URNs are supposed to define a unique, location-independent name for a resource that then typically maps to one or more URLs. The XML namespace recommendation can be found at http://www.w3.org/TR/REC-xml-names/.
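The effect of the namespace declarations can be observed by parsing a small document that uses both rating elements. In this Python sketch (standard library only), the game and title elements and the rating values are hypothetical; only the two namespace URIs and the pc-games element come from the example above. Notice that the parser identifies each element by its URI, not by the prefix, and never attempts to fetch the URI.

```python
import xml.etree.ElementTree as ET

FUN_URI = "http://www.wownh.com/funrating"
AGE_URI = "http://www.wownh.com/agerating"

catalog = ET.fromstring(
    f'<pc-games xmlns:fun="{FUN_URI}" xmlns:age="{AGE_URI}">'
    "<game>"
    "<title>Example Game</title>"
    "<fun:rating>9</fun:rating>"
    "<age:rating>PG-13</age:rating>"
    "</game>"
    "</pc-games>"
)
game = catalog.find("game")

# Look up by fully qualified name: {namespace-uri}local-name.
fun_rating = game.find(f"{{{FUN_URI}}}rating").text

# Or supply a prefix-to-URI mapping and use the prefixed form.
age_rating = game.find("age:rating", {"age": AGE_URI}).text

# The two rating elements have distinct qualified names.
rating_tags = [child.tag for child in game if child.tag.endswith("rating")]

print(fun_rating, age_rating)
print(rating_tags)
```

Because the URI is compared purely as a string, two prefixes bound to the same URI denote the same namespace, while URIs differing only in case denote different ones.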