XML

In prehistoric times (well, at least 10 to 15 years ago), if two groups of people wanted to share some proprietary data over a network, they usually had nothing from which to start. So, these groups of people sat down, spent hours in conference rooms, and eventually came up with some format via which data could be shared. Soon, a large specification document came out of this effort, and all those who wanted to work with this format were provided a copy.

If others wanted to also consume or serve the data, they had to talk to the people who came up with the format, get a copy of the specification, and either acquire a code library to work with the new data or write their own new code to manipulate the data. This tended to be error prone, expensive, and frustrating.

It was to solve these problems and many others that the eXtensible Markup Language (XML) was invented.

What Is XML?

XML is an easily manipulated and customizable markup language for describing data. Whereas HTML is a markup language to describe how content should appear, XML is strictly concerned with describing the structure of data and its relationships to other data. It is a plain-text language designed to help share structured data between computers.

Unlike HTML, tags in XML are not predefined; you are responsible for all of them. In fact, apart from rules about how XML documents are structured, their exact contents are left entirely up to the document author. However, the basic rules concerning document format end up being sufficient to make the language extremely easy to read, use, and manipulate. There are standard implementations of tools to read XML documents (sometimes called parsers), modify them, and regenerate them. There are also systems for describing how you want your data to be structured, which lets the standard XML implementations verify on your behalf that data is properly formatted.

The advantages of XML are many and worth noting:

Plain-text files are used for storage, meaning they are both human and machine readable, and no proprietary data formats are required.
Unicode support is excellent, meaning that character data from any of the world's writing systems can be represented easily. (Binary data can be included too, provided it is first encoded as text, for example with Base64.)
There are no platform dependencies in XML. It is truly a cross-platform technology.
It is also an open standard, meaning nobody owns the XML specification and no royalties or licenses are required.
Its strict document format makes parsing and manipulation fast and efficient.
The existence of standard implementations on nearly all major platforms means you need to spend little if any time adding support for XML into your applications.
An increasing number of tools are generating their output in XML for consumption in your applications. For example, many database servers now return the results of queries as XML data.

With all of these benefits and hype surrounding it, some people can be forgiven for having the impression that XML is expected to solve all of the world's problems. It is, however, worth cutting through the marketing to realize that XML is nothing more than a data-description language. These documents do not actually do anything. They exist strictly to help those who know what to do with them do it more easily.

One common myth concerning XML (and other open data formats) that we will dispel is that, because it is human readable and just plain text, it is less secure and more easily stolen than binary data. As with the "hidden" nature of HTTP POST variables shown back in Chapter 7, "Interacting with the Server: Forms," assuming that binary data is more secure than text data gives only a false sense of security.

Binary protocols, especially logically structured ones, can be reverse engineered in a matter of hours. If you want to protect your data, you should be using encryption technologies, in which case it does not matter whether what you are encrypting is text or binary.

XML is also not intended to be a replacement for HTML. It should be thought of more as a complement. If we could make HTML (which in its historical format is very complicated and difficult to parse fully) take on many of the advantages of XML, we would have the best of both worlds. (See the section "XHTML.")

Finally, it would be beneficial to spare a moment of humility to look at some of the weaknesses of XML:

Being based on text files and having potentially large numbers of tags with the same names means that XML files can be significantly larger than well-designed binary data. The ability to compress data and the increasing average user bandwidth mitigate some of these problems.
XML is really only designed for describing hierarchical data, not data of a more random or overlapping nature.
Having to support all of the common features means that some XML implementations are not as fast or efficient as you might like.

As we will see throughout this chapter, however, these limitations are not serious, and we will be able to make great use of this technology in our web applications.

Why Use XML?

XML's flexibility and extensibility make it ideal for use in a wide variety of places:

Simple structured data, such as configuration files, address books, or other small data stores: Increasing numbers of programs use XML to store their configuration and user option information so that they do not have to write large amounts of code to manipulate these.
Data exchange, particularly business-to-business (B2B) applications: Companies that want to share data, such as warehousing companies and those that distribute their products, can just use XML to transfer information back and forth. These documents can be self-validating (see the section "Validating XML"), and no new code libraries have to be written for them.
Application data sharing: If a broad range of word processing programs stored their data in well-documented XML data files, you could manipulate your documents on a wide variety of platforms and programs. This is already starting to happen with productivity suites such as OpenOffice.org, which stores documents in XML (that is then compressed before being written to disk).
Creation of new markup languages, sometimes called metalanguages: Because XML is so flexible and configurable, it can be used to define new markup languages for a variety of purposes. The section "Related Technologies" discusses some of these.

For many of the examples in this chapter, we look at the Business to Business (B2B) use of XML, specifically that of the interaction between software in a doctor's office and that of a health insurance company's billing system. This system works around the concept of claims that the office files with the insurance provider, with information in these describing the patient and what exactly was done by the medical professionals.

Basic Terminology

Before describing the exact format of XML documents, we define a few of the common terms that people use when discussing them so as to avoid any confusion.

Tags: A tag is a markup identifier surrounded by angle brackets (< and >). Examples include <Doctor>, <Books>, and </WankelRotaryEngine> (the last of which is an example of a closing tag).

Elements: An element is a unit of information in the XML document. Elements, sometimes referred to as nodes, are identified by having both an opening tag and a closing tag (for example, <ElementName>...</ElementName>). They can contain one of four things:

Other elements; for example, an element representing a collection of books might contain individual book elements, such as the following: <Books><Book>...</Book></Books>
Simple text content, such as in the case of an element called Title: <Title>A Tale of Two Cities</Title>
Mixed content (both content and other elements), such as <Person>Mary<Gender>Female</Gender></Person>
Nothing

Elements that have no content between the opening and closing tags are called empty elements. They can be abbreviated as <EmptyElementName />.

Document elements: Also known as document nodes, or root nodes, a document element is the topmost element in an XML document.

Attributes: Elements can have attributes, which are extra pieces of information, associated with their opening tag. These attributes are always specified in the format name='value', and the value must be in single or double quotes. Examples include <User name='chippy_the_chipmunk'>, <DontShowPublic style='flat' />, and <Font name='helvetica' size='15pt'>. Attributes cannot appear in closing tags.

Parent, child, sibling: Elements in XML documents form a hierarchy. A node that contains other nodes is referred to as the parent of those nodes. The contained nodes are the parent node's children, and they are each other's siblings (sometimes referred to as sister nodes).
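
To make this vocabulary concrete, here is a minimal sketch in Python (chosen only because its standard library includes an XML parser, xml.etree.ElementTree). The Books document and its isbn values are made up for illustration; they do not come from any real system.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document used only to illustrate the terms above.
doc = """<Books>
  <Book isbn='hypothetical-123'>
    <Title>A Tale of Two Cities</Title>
    <OutOfPrint />
  </Book>
  <Book isbn='hypothetical-456'>
    <Title>Bleak House</Title>
  </Book>
</Books>"""

root = ET.fromstring(doc)      # the document element (root node)
books = list(root)             # its children, which are siblings of each other
first = books[0]

print(root.tag)                         # name of the document element
print(first.attrib)                     # attributes from the opening tag
print(first.find("Title").text)         # simple text content
print(len(first.find("OutOfPrint")))    # an empty element has no children
```

Here Books is the parent of both Book elements, and each Book is the parent of its own Title.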

Before we bog ourselves down completely in jargon, let us look at the actual document structure in XML.

The Structure of XML Documents

We start by describing the basic layout of an XML document.

Basic Structure

Before we look at a sample XML document or indeed think about writing our own document, we should sit down and think about the data we want to represent. Is the data hierarchical in nature? Does it have clearly identified properties that we can use to create subelements or attributes?

In the health insurance example, we can imagine that a doctor's office is going to submit to the insurance company a bunch of claims. Therefore, we are probably already thinking that we are going to have a Claims element containing individual Claim elements. These claims will probably have data associated with them, such as the patient ID, the doctor who performed the service, the procedure that was performed, and the cost.

With this in mind, we present a sample XML document:

   <?xml version="1.0" encoding="utf-8"?>
   <Claims>
     <Claim>
       <Patient name='Thomas Fignon'>
         <HealthCareID>5546-345-29384A</HealthCareID>
         <PrimaryPhysician>Dr. Nutson</PrimaryPhysician>
       </Patient>
       <Code>45A66</Code>
       <Amount>$468.50</Amount>
       <ActingPhysicianID>44-539-299</ActingPhysicianID>
       <Treatment>
           Routine Physical Examination, blood work and
           analysis.
       </Treatment>
     </Claim>
     <Claim>
       <Patient name='Samuela Nortone'>
         <HealthCareID>5546-923-29391D</HealthCareID>
         <PrimaryPhysician>Dr. Huang, M.D.</PrimaryPhysician>
       </Patient>
       <Code>45G87</Code>
       <Amount>$180</Amount>
       <ActingPhysicianID>45-667-324</ActingPhysicianID>
       <Treatment>
           Followup examination for treatment of fractured
           clavicle.
       </Treatment>
     </Claim>
   </Claims>

The structure of an XML document consists of the following parts, in the order listed:

An XML declaration
Optional directives to the XML parser
The element hierarchy, starting with a document element (only one permitted)

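The XML declaration, when present, must be the very first thing in the document, with nothing (not even whitespace) before it. XML 1.0 does not strictly require it, but including it is recommended. It has the following basic structure: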

   <?xml version="1.0" encoding="utf-8"?>

The exact character set in the encoding parameter is up to you, but we will mostly be using UTF-8 for maximal interoperability.

The second part of the XML file is made up of optional directives to describe things such as Document Type Definitions (DTDs; see the section "Validating XML").

Finally, the element hierarchy starts with the document element (the root of all of the document's content). As mentioned before, there can be only one document node, and the items that come before it (such as the XML declaration or DTD instructions) are not considered part of the document.

Throughout the XML document may come comments, which are just ignored text that you can use to help annotate your data. These are written as follows:

   <!-- This is an XML comment -->

Comments may not contain the sequence of characters --, and they may not be nested as follows:

   <!-- This is <!-- Not permitted!!! --> nor is this --: -->
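
The layout rules above can be tried out with a short sketch. It uses Python's standard xml.etree.ElementTree parser, and the two small documents in it are assumptions written for this illustration. Note that this particular parser discards comments by default; other implementations may expose them.

```python
import xml.etree.ElementTree as ET

# Declaration first, then comments, then exactly one document element.
good = b"""<?xml version="1.0" encoding="utf-8"?>
<!-- Comments may appear before the document element -->
<Claims>
  <!-- ...and inside it -->
  <Claim><Code>45A66</Code></Claim>
</Claims>"""

root = ET.fromstring(good)
child_tags = [child.tag for child in root]
print(root.tag, child_tags)

# A second top-level element breaks the "only one document element" rule.
bad = b"<Claims></Claims><Claims></Claims>"
try:
    ET.fromstring(bad)
    second_root_rejected = False
except ET.ParseError as err:
    second_root_rejected = True
    print("not well formed:", err)
```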

Rules for Forming Documents

You need to follow a number of rules to generate what are called well-formed XML documents. Only well-formed documents can be read or manipulated by the various XML implementations.

XML element names are case sensitive, and the opening and closing tag must match exactly. Therefore, the following is not well formed: <Name> ... </NAME>.
The document can contain only one document element.
As mentioned previously, elements may be empty or contain other elements, simple content, or a combination of elements and content.
All elements must have a closing tag or be empty elements with the appropriate syntax (for example, <moo/>).
XML elements must be properly nested, and any overlapping or crossing of elements is strictly prohibited. For example, the following is incorrect:
```
 <bold>This is some <italic>text</bold></italic> 
```
To write it correctly, we would write it as follows:
```
 <bold>This is some <italic>text</italic></bold> 
```
Elements can contain attributes. Attribute values must be enclosed in single or double quotes, as in <Aircraft type='jet' engines='4'>. There can be only one attribute with a given name in an element node, so the following would not be well formed: <Aircraft type='jet' engines='4' engines='Trent Turbofans'/>.
Whitespace in XML documents is preserved. It is part of a node's content, and it is up to you to remove it later on if you so want. For example:
```
 <Root>
   <Element>
             This is some content

     that spans many lines
   </Element>
 </Root>
```
If you were to inspect the value of the Element node in PHP, you would see its text content having the value "\n This is some content\n\n that spans many lines\n ".
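
The same behavior can be seen in other XML implementations. The following is a minimal sketch using Python's standard xml.etree.ElementTree module rather than PHP; the whitespace-collapsing step is just one possible cleanup and is our own choice, not something XML mandates.

```python
import xml.etree.ElementTree as ET

doc = """<Root>
  <Element>
        This is some content

    that spans many lines
  </Element>
</Root>"""

element = ET.fromstring(doc).find("Element")

# The parser hands back the text exactly as written, newlines and all.
raw = element.text
print(repr(raw))

# Removing the extra whitespace is left to us: here every run of
# whitespace is collapsed to a single space.
normalized = " ".join(raw.split())
print(normalized)
```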

Entities

One of the limitations in XML is that a number of characters are reserved for use by the markup itself, most notably the < and & characters (the > character is conventionally escaped as well). Trying to include a literal < in the text for an element results in an ill-formed document, as follows:

 <ArrowNode> Arrows are cool!!! ----> <---- </ArrowNode>

Trying to load an XML document with the preceding in it results in an error from the XML parser.
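
As a quick illustration, the sketch below feeds the ArrowNode element to Python's standard xml.etree.ElementTree parser. The helper name is_well_formed is our own invention for this example; the exact wording of the error message varies between parsers.

```python
import xml.etree.ElementTree as ET

bad = "<ArrowNode> Arrows are cool!!! ----> <---- </ArrowNode>"
good = "<ArrowNode> Arrows are cool!!! ----&gt; &lt;---- </ArrowNode>"


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError as err:
        print("parse error:", err)
        return False


print(is_well_formed(bad))    # the bare "<" starts a tag that never forms
print(is_well_formed(good))   # escaped brackets are ordinary text
```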

Fortunately, much as in HTML, there exists a solution to this problem in the form of what are called entities. Entity references are named placeholders in your document of the format &entity-name;. Every entity reference begins with an ampersand character (&) and ends with a semicolon (;), and in between these, we specify which entity to use.

XML 1.0 has five core entities, as shown in Table 23-1.

Table 23-1. The Core XML Entities
Entity	Example	Output
Left-angle bracket	`&lt;`	`<`
Right-angle bracket	`&gt;`	`>`
Ampersand	`&amp;`	`&`
Single quote (apostrophe)	`&apos;`	`'`
Double quote	`&quot;`	`"`

Here is an example of how these might be used. If we want to have an element in our XML document that contains a small logic statement such as (A > B) && (A < C), we might write it as follows:

 <Logic>  (A &gt; B) &amp;&amp; (A &lt; C) </Logic>

It is quite common to forget to use entities for the ampersand character when inserting text into a document, as in the following:

 <Desc>   Yesterday Delph & I went to the store and bought some wine. </Desc>

Leaving the ampersand unescaped then creates errors on load that are often vague and difficult to identify:

 Warning: xmldocfile(): xmlParseEntityRef: no name in    /home/http/www/index.php on line 36

The correction to the previous <Desc> element is, of course, to replace the & with &amp;.
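
In practice, it is safer to let a library do this escaping than to do it by hand. The sketch below shows one way, using the escape function from Python's standard xml.sax.saxutils module; the variable names are ours, and other platforms offer equivalent helpers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "Yesterday Delph & I went to the store and bought some wine."

# escape() replaces &, < and > with their entity references.
xml_doc = "<Desc>" + escape(raw_text) + "</Desc>"
print(xml_doc)

# On the way back in, the parser turns the entities into characters again.
parsed = ET.fromstring(xml_doc)
print(parsed.text)

logic = ET.fromstring("<Logic>(A &gt; B) &amp;&amp; (A &lt; C)</Logic>")
print(logic.text)
```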

Attributes Versus Elements

One of the key decisions you must make when writing XML documents, and for which there is unfortunately no single right answer, is whether to make a particular piece of information a child element of a node or an attribute on it. For example, if we have information for a patient, we might choose to do one of the following:

   <Patient name='Navin Parmar' id='3942-4329-14'
            gender='male' height='190cm' weight='85kg'/>

   <Patient>
     <Name>Navin Parmar</Name>
     <ID>3942-4329-14</ID>
     <Gender>Male</Gender>
     <Height>190cm</Height>
     <Weight>85kg</Weight>
   </Patient>

We must ultimately decide which of the preceding two to use, or formulate some combination of them to best meet our needs.

In general, we will try to use attributes sparingly, for the following reasons:

There may be only one attribute of a given name per element. (So, if you wanted to list children of a patient, you would have problems.)
You cannot define a structure for attributes. If you wanted to list a doctor and all of his information for a patient, you would not be able to present it in a structured way inside an attribute.
Attributes can be more difficult to manipulate from within code, and require some extra work.
Validation of attributes via DTDs or XML Schemas can be more difficult.

Against this comes the reality that as we work our way through a patient list, opening all the child nodes of each element might itself be less efficient than desired. If we look at how we search through a patient list, we might realize that we mostly look at their health provider ID. We could thus restructure our patient data as follows:

   <Patient id='3942-4329-14'>
     <Name>Navin Parmar</Name>
     <Gender>Male</Gender>
     <Height>190cm</Height>
     <Weight>85kg</Weight>
   </Patient>

Needless to say, XML is a system for describing hierarchically structured data. Therefore, it sure seems a waste not to use any of this structure and just put all the information in attributes on an element.
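
To see how the two kinds of information are reached from code, consider the following minimal sketch using Python's standard xml.etree.ElementTree module. The enclosing Patients element and the second patient are hypothetical additions made only so that there is a list to search.

```python
import xml.etree.ElementTree as ET

doc = """<Patients>
  <Patient id='3942-4329-14'>
    <Name>Navin Parmar</Name>
    <Gender>Male</Gender>
    <Height>190cm</Height>
    <Weight>85kg</Weight>
  </Patient>
  <Patient id='0000-0000-00'>
    <Name>Hypothetical Patient</Name>
  </Patient>
</Patients>"""

root = ET.fromstring(doc)

# Index patients by their id attribute; no child elements are examined here.
by_id = {p.get("id"): p for p in root.findall("Patient")}

patient = by_id["3942-4329-14"]
name = patient.findtext("Name")    # a child element requires a lookup
print(patient.get("id"), name)
```

The identifier we search on most often lives in an attribute, while the descriptive data keeps its structure as child elements.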

Namespaces

If we are writing an application for a doctor's office, in many countries (particularly in the United States) we must interact with more than one insurance company. If Insurance Company A has defined its claim format to be

   <Claims>
     <Claim>
       <Patient id='...'>
         <Name>...</Name>
         <PrimaryPhysician>...</PrimaryPhysician>
       </Patient>
       <Code>...</Code>
       <Amount>...</Amount>
       <ActingPhysician>...</ActingPhysician>
       <Treatment>...</Treatment>
     </Claim>
     ...
   </Claims>

and Insurance Company B has defined its claim format as

   <Claims>
     <Claim>
       <Patient name='...'>
         <InsuranceID>...</InsuranceID>
         <PhysicianID>...</PhysicianID>
       </Patient>
       <Code>...</Code>
       <SubCode>...</SubCode>
       <Amount>...</Amount>
       <Description>...</Description>
       <TendingPhysicianID>...</TendingPhysicianID>
     </Claim>
   </Claims>

you can clearly see that we will have a problem. When our code is given a document element named Claims, it has a difficult time determining from whom it came. We would ideally like to avoid having to dig through child nodes looking for hints as to the source.

Fortunately, XML has a solution to this problem in a feature called namespaces. In effect, namespaces are a way of associating a domain, or some prefix, with a set of elements, so as to differentiate them from possibly similar nodes coming from other sources. We can associate a namespace with our Claims document hierarchy as follows:

   <hc:Claims xmlns:hc="healthclaim">
     <hc:Claim>
       <hc:Patient id='...'>
         <hc:Name>...</hc:Name>
         <hc:PrimaryPhysician>...</hc:PrimaryPhysician>
       </hc:Patient>
       <hc:Code>...</hc:Code>
       <hc:Amount>...</hc:Amount>
       <hc:ActingPhysician>...</hc:ActingPhysician>
       <hc:Treatment>...</hc:Treatment>
     </hc:Claim>
     ...
   </hc:Claims>

By adding the xmlns:hc="healthclaim" to our document element, we are announcing the creation of a new namespace with a shortened prefix name of hc, and that its full identifying name is "healthclaim".

Even though it looks like we have made our XML significantly more complicated with the addition of the preceding namespace, most XML implementations in web application platforms (such as PHP) enable you to get the name of elements without these prefixes. We can also just declare a default namespace for all elements by omitting the short prefix name:

   <Claims xmlns="healthclaim">
     <Claim>
       <Patient id='...'>
         <Name>...</Name>
         <PrimaryPhysician>...</PrimaryPhysician>
       </Patient>
       <Code>...</Code>
       <Amount>...</Amount>
       <ActingPhysician>...</ActingPhysician>
       <Treatment>...</Treatment>
     </Claim>
     ...
   </Claims>

All of these elements are still in the same namespace, fully identified as "healthclaim".

One problem we can imagine for this is that if we have a large number of health-care insurance providers, at least two of them might choose to use the namespace name of "healthclaim". To get around this, we need to use another way to specify a namespace domain. The most common mechanism is to use a domain URI, such as the URL to a web site. The full reference within that domain often points to some page with information about the structured data being represented in the namespace:

   <Claims xmlns="http://www.hcproviderA.com/schema/healthclaim">
     <Claim>
       <Patient id='...'>
         <Name>...</Name>
         <PrimaryPhysician>...</PrimaryPhysician>
       </Patient>
       <Code>...</Code>
       <Amount>...</Amount>
       <ActingPhysician>...</ActingPhysician>
       <Treatment>...</Treatment>
     </Claim>
     ...
   </Claims>

Thus, even if our second insurance company were to want to use the namespace name "healthclaim", its unique namespace name would be something like "http://www.healthcareproviderB.com/schemas/healthclaim" and would not interfere with the first one.
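
How a program tells the sources apart depends on the XML implementation. As one hedged illustration, Python's standard xml.etree.ElementTree module reports each element name together with its namespace URI. The cut-down document and the Code value below are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

NS_A = "http://www.hcproviderA.com/schema/healthclaim"

doc = """<hc:Claims xmlns:hc="http://www.hcproviderA.com/schema/healthclaim">
  <hc:Claim>
    <hc:Code>45A66</hc:Code>
  </hc:Claim>
</hc:Claims>"""

root = ET.fromstring(doc)

# ElementTree reports names in the form "{namespace-uri}local-name".
print(root.tag)

# Split the qualified name into its namespace and local parts.
namespace, _, local_name = root.tag[1:].partition("}")
print(namespace, local_name)

# The prefix used in a query is local to this mapping; it does not need
# to match the prefix that the document itself uses.
codes = [c.text for c in root.findall("a:Claim/a:Code", {"a": NS_A})]
print(codes)
```

Comparing the namespace part against the URI of each known insurance company is then enough to decide whose claim format we are looking at.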

Validating XML

We have mentioned the concept of a well-formed XML document, which basically implies that it is a properly constructed XML document: all tags are closed, all attributes are enclosed in quotes, there is a single document node, and so on. However, one of the strengths of the XML standard is that there are ways to actually describe what the structure of the data should be, and have the XML implementation validate the data against that description for you. This enables us to define, in addition to well-formed documents, valid documents.

There are two ways in which this can be done. The first (and older) mechanism is via a DTD, which is a set of declarations you can include with your XML document to describe its layout. This method lets you describe a hierarchy of elements for your document, and whether a node is to contain content, other nodes, or both.

The second (and newer) method of doing this is known as XML Schemas. It is a more powerful and flexible system to describe the structure of your documents, at the cost of being much more difficult to learn and write. It does, however, support more features than DTDs, including the ability to specify data types for element contents, sequences of elements, and flexible limits on the number of elements that can appear. Furthermore, XML Schemas are well-formed XML documents.

Although we cannot provide much in the way of description for either of these technologies (this would be a large book indeed), we show you one example of each for our health-care claim documents written earlier so that you know how to recognize these documents when you see them.

A DTD for our claims document could look like this:

   <!DOCTYPE Claims [
     <!ELEMENT Claims (Claim*)>
     <!ELEMENT Claim (Patient,Code,Amount,ActingPhysicianID,Treatment)>
     <!ELEMENT Patient (HealthCareID,PrimaryPhysician)>
     <!ELEMENT HealthCareID (#PCDATA)>
     <!ELEMENT PrimaryPhysician (#PCDATA)>
     <!ELEMENT Code (#PCDATA)>
     <!ELEMENT Amount (#PCDATA)>
     <!ELEMENT ActingPhysicianID (#PCDATA)>
     <!ELEMENT Treatment (#PCDATA)>
     <!ATTLIST Patient name CDATA #REQUIRED>
   ]>

These are often inserted in the XML document directly between the starting <?xml ... ?> declaration and the document element, making them easy to transport.

An XML Schema Definition (XSD) for the same structure could look like this:

   <?xml version="1.0"?>
   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

     <!-- Declare basic element types first -->
     <xsd:element name='HealthCareID' type='xsd:string' />
     <xsd:element name='PrimaryPhysician' type='xsd:string' />
     <xsd:element name='Code' type='xsd:string' />
     <xsd:element name='Amount' type='xsd:string' />
     <xsd:element name='ActingPhysicianID' type='xsd:string' />
     <xsd:element name='Treatment' type='xsd:string' />

     <!-- Declare any complex types next -->
     <xsd:element name='Patient'>
       <xsd:complexType>
         <xsd:sequence>
           <xsd:element ref="HealthCareID" />
           <xsd:element ref="PrimaryPhysician" />
         </xsd:sequence>
         <xsd:attribute name='name' type='xsd:string' use='required'/>
       </xsd:complexType>
     </xsd:element>

     <!-- The Claims node (shown below) is a list of Claim
          nodes, described here:
       -->
     <xsd:element name='Claim'>
       <xsd:complexType>
         <xsd:sequence>
           <xsd:element ref='Patient'/>
           <xsd:element ref='Code'/>
           <xsd:element ref='Amount'/>
           <xsd:element ref='ActingPhysicianID'/>
           <xsd:element ref='Treatment'/>
         </xsd:sequence>
       </xsd:complexType>
     </xsd:element>

     <!-- Finally, declare the Claims node, which is the root of
          our document -->
     <xsd:element name='Claims'>
       <xsd:complexType>
         <xsd:sequence>
           <xsd:element ref='Claim' maxOccurs='unbounded'/>
         </xsd:sequence>
       </xsd:complexType>
     </xsd:element>
   </xsd:schema>

Fortunately, both of these technologies are extremely well covered in documentation in books and in tutorials found on the Internet, and both can be learned in a reasonably short period of time.
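
To give a feel for what a validating parser checks on our behalf, the following sketch hand-codes a few of the same rules for the claims structure described above. It is deliberately not DTD or XML Schema validation: Python's standard library does not include a validating parser, so the rules here (the EXPECTED_CHILDREN table and the check_claims function, both our own inventions) are transcribed by hand. A real application would hand this job to a validating XML implementation instead.

```python
import xml.etree.ElementTree as ET

# Expected child order for each structured element, transcribed by hand.
EXPECTED_CHILDREN = {
    "Claim": ["Patient", "Code", "Amount", "ActingPhysicianID", "Treatment"],
    "Patient": ["HealthCareID", "PrimaryPhysician"],
}


def check_claims(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    root = ET.fromstring(xml_text)   # raises ParseError if not well formed
    if root.tag != "Claims":
        return ["document element must be Claims, found " + root.tag]

    problems = []
    # Claims may contain only Claim elements.
    for child in root:
        if child.tag != "Claim":
            problems.append("unexpected element in Claims: " + child.tag)

    # Claim and Patient must contain exactly the listed children, in order.
    for element in root.iter():
        expected = EXPECTED_CHILDREN.get(element.tag)
        if expected is not None:
            found = [child.tag for child in element]
            if found != expected:
                problems.append(
                    f"{element.tag}: expected {expected}, found {found}")

    # Patient must carry a name attribute.
    for patient in root.iter("Patient"):
        if "name" not in patient.attrib:
            problems.append("Patient is missing the required name attribute")
    return problems
```

A document can therefore be well formed (it parses) and still fail these checks, which is exactly the distinction between well-formed and valid documents.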

Related Technologies

One of the most powerful features of XML is that it is simply a document description language, and thus extensible for many other purposes. This has led to a good number of extensions to the XML specification and new content description languages based on the basic principles of XML. Because XML itself takes care of parsing and "well formedness," the implementers of these languages are free to focus on the details specific to their domain.

Although we could go on for some time describing some of these extensions and meta-languages, including ones for describing sheet music, Chinese characters, and genealogy structures, we focus instead on some of the more common ones you will encounter when writing web applications:

XPath: This is a small extension to the XML specification that allows for the identification of specific content within an XML document. You can use it to query for the existence of certain elements within your document or as the basis of other XML extensions, most notably XSLT.
XSL/XSLT: The eXtensible Stylesheet Language (XSL) is a family of languages that allow for formatting and transformation of data in XML documents. The most notable of these is XSLT (XSL Transformations), which takes documents and transforms their content into something else. This is used to great effect in web sites to take XML data and generate XHTML. (See the section "XHTML.")
XQuery: Many people have noted that the data in an XML document is not completely unlike that in a database, apart from the hierarchical nature of it versus the database's flat relational model. This has led to the development of languages to query the data inside the XML documents, most notably the XQuery programming language.
XML-RPC: This is a protocol for calling methods or functions on remote machines (RPC stands for Remote Procedure Call) using XML as the means via which the function data is transmitted and returned.
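
As a small taste of the first of these, the sketch below runs two XPath-style queries against a cut-down version of our claims document. It uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath; a full XPath implementation offers considerably more. The shortened document is an assumption made to keep the example brief.

```python
import xml.etree.ElementTree as ET

doc = """<Claims>
  <Claim>
    <Patient name='Thomas Fignon'/>
    <Code>45A66</Code>
    <Amount>$468.50</Amount>
  </Claim>
  <Claim>
    <Patient name='Samuela Nortone'/>
    <Code>45G87</Code>
    <Amount>$180</Amount>
  </Claim>
</Claims>"""

root = ET.fromstring(doc)

# Every Amount element anywhere below the document element.
amounts = [a.text for a in root.findall(".//Amount")]

# The Patient of the claim whose Code child has the text 45G87.
patient = root.find("./Claim[Code='45G87']/Patient")

print(amounts, patient.get("name"))
```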

We do not have much opportunity to use these technologies in this book, but many web application authors incorporate some or all of them into their larger enterprise systems.