In prehistoric times (well, at least 10 to 15 years ago), if two groups of people wanted to share some proprietary data over a network, they usually had nothing from which to start. So, these groups of people sat down, spent hours in conference rooms, and eventually came up with some format via which data could be shared. Soon, a large specification document came out of this effort, and all those who wanted to work with this format were provided a copy.
If others wanted to also consume or serve the data, they had to talk to the people who came up with the format, get a copy of the specification, and either acquire a code library to work with the new data or write their own new code to manipulate the data. This tended to be error prone, expensive, and frustrating.
It was to solve these problems and many others that the eXtensible Markup Language (XML) was invented.
What Is XML?
XML is an easily manipulated and customizable markup language for describing data. Whereas HTML is a markup language to describe how content should appear, XML is strictly concerned with describing the structure of data and its relationships to other data. It is a plain-text language designed to help share structured data between computers.
Unlike HTML, tags in XML are not predefined; you are responsible for all of them. In fact, apart from rules about how XML documents are structured, their exact contents are left entirely up to the document author. However, the basic rules concerning document format end up being sufficient to make the language extremely easy to read, use, and manipulate. There are standard implementations of tools to read XML documents (sometimes called parsers), modify them, and regenerate them. There are also systems for describing how you want your data to be structured, which lets the standard XML implementations verify on your behalf that data is properly formatted.
The advantages of XML are many and worth noting:
With all of these benefits and hype surrounding it, some people can be forgiven for having the impression that XML is expected to solve all of the world's problems. It is, however, worth cutting through the marketing to realize that XML is nothing more than a data-description language. These documents do not actually do anything. They exist strictly to help those who know what to do with them do it more easily.
One common myth concerning XML (and other open data formats) that we will dispel is that, because it is human readable and just plain text, it is less secure and more easily stolen than binary data. As with the "hidden" nature of HTTP POST variables shown back in Chapter 7, "Interacting with the Server: Forms," assuming that binary data is more secure than text data gives only a false sense of security.
Binary protocols, especially logically structured ones, can be reverse engineered in a matter of hours. If you want to protect your data, you should be using encryption technologies, in which case it does not matter whether what you are encrypting is text or binary.
XML is also not intended to be a replacement for HTML. It should be thought of more as a complement. If we could make HTML (which in its historical format is very complicated and difficult to parse fully) take on many of the advantages of XML, we would have the best of both worlds. (See the section "XHTML.")
Finally, it would be beneficial to spare a moment of humility to look at some of the weaknesses of XML:
As we will see throughout this chapter, however, these limitations are not serious, and we will be able to make great use of this technology in our web applications.
Why Use XML?
XML's flexibility and extensibility make it ideal for use in a wide variety of places:
For many of the examples in this chapter, we look at the Business to Business (B2B) use of XML, specifically the interaction between software in a doctor's office and a health insurance company's billing system. This system is built around the concept of claims that the office files with the insurance provider; each claim describes the patient and exactly what the medical professionals did.
Before describing the exact format of XML documents, we define a few of the common terms that people use when discussing them so as to avoid any confusion.
Tags: A tag is a markup identifier surrounded by angle brackets (< and >). Examples include <Doctor>, <Books>, and </WankelRotaryEngine> (the latter of which is an example of a closing tag).
Elements: An element is a unit of information in the XML document. Elements, sometimes referred to as nodes, are identified by having both an opening tag and a closing tag (for example, <ElementName>...</ElementName>). They can contain one of four things:
Elements that have no content between the opening and closing tags are called empty elements. They can be abbreviated as <EmptyElementName />.
Document elements: Also known as document nodes, or root nodes, a document element is the topmost element in an XML document.
Attributes: Elements can have attributes, which are extra pieces of information associated with their opening tag. These attributes are always specified in the format name='value', and the value must be in single or double quotes. Examples include <User name='chippy_the_chipmunk'>, <DontShowPublic style='flat' />, and <Font name='helvetica' size='15pt'>. Attributes cannot appear in closing tags.
Parent, child, sibling: Elements in XML documents form a hierarchy. A node that contains other nodes is referred to as those nodes' parent node. The contained nodes are the parent node's children, and they are each other's siblings (sometimes referred to as sister nodes).
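To make these terms concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The snippet reuses the element and attribute names from the definitions above, but the <Users> wrapper and the way the elements are nested are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document assembled from the example names used in the definitions.
SNIPPET = """
<Users>
  <User name='chippy_the_chipmunk'>
    <Font name='helvetica' size='15pt'>Hello</Font>
    <DontShowPublic style='flat' />
  </User>
</Users>
"""

root = ET.fromstring(SNIPPET)   # the document element (root node)
user = root[0]                  # a child of the root; root is its parent
children = list(user)           # Font and DontShowPublic are siblings


def is_empty(element):
    """An empty element has no child elements and no text content."""
    return len(element) == 0 and not (element.text or "").strip()


print(root.tag)                              # tag name of the document element
print(user.attrib)                           # attributes from the opening tag
print([child.tag for child in children])     # sibling elements under <User>
print(is_empty(children[1]))                 # <DontShowPublic ... /> has no content
```

Note that ElementTree exposes attributes as a dictionary on each element and children through ordinary iteration, which mirrors the parent and child vocabulary directly.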
Before we bog ourselves down completely in jargon, let us look at the actual document structure in XML.
The Structure of XML Documents
We start by describing the basic layout of an XML document.
Before we look at a sample XML document or indeed think about writing our own document, we should sit down and think about the data we want to represent. Is the data hierarchical in nature? Does it have clearly identified properties that we can use to create subelements or attributes?
In the health insurance example, we can imagine that a doctor's office is going to submit to the insurance company a bunch of claims. Therefore, we are probably already thinking that we are going to have a Claims element containing individual Claim elements. These claims will probably have data associated with them, such as the patient ID, the doctor who performed the service, the procedure that was performed, and the cost.
With this in mind, we present a sample XML document:
<?xml version="1.0" encoding="utf-8"?>
<Claims>
  <Claim>
    <Patient name='Thomas Fignon'>
      <HealthCareID>5546-345-29384A</HealthCareID>
      <PrimaryPhysician>Dr. Nutson</PrimaryPhysician>
    </Patient>
    <Code>45A66</Code>
    <Amount>$468.50</Amount>
    <ActingPhysicianID>44-539-299</ActingPhysicianID>
    <Treatment>
      Routine Physical Examination, blood work and analysis.
    </Treatment>
  </Claim>
  <Claim>
    <Patient name='Samuela Nortone'>
      <HealthCareID>5546-923-29391D</HealthCareID>
      <PrimaryPhysician>Dr. Huang, M.D.</PrimaryPhysician>
    </Patient>
    <Code>45G87</Code>
    <Amount>$180</Amount>
    <ActingPhysicianID>45-667-324</ActingPhysicianID>
    <Treatment>
      Followup examination for treatment of fractured clavicle.
    </Treatment>
  </Claim>
</Claims>
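As a rough illustration of how a program might consume the sample document, the following Python sketch parses it with the standard library's xml.etree.ElementTree and pulls out a few fields per claim. The function name summarize and the dictionary keys are our own inventions; only the element and attribute names come from the sample.

```python
import xml.etree.ElementTree as ET

CLAIMS_XML = """<?xml version="1.0" encoding="utf-8"?>
<Claims>
  <Claim>
    <Patient name='Thomas Fignon'>
      <HealthCareID>5546-345-29384A</HealthCareID>
      <PrimaryPhysician>Dr. Nutson</PrimaryPhysician>
    </Patient>
    <Code>45A66</Code>
    <Amount>$468.50</Amount>
    <ActingPhysicianID>44-539-299</ActingPhysicianID>
    <Treatment>
      Routine Physical Examination, blood work and analysis.
    </Treatment>
  </Claim>
  <Claim>
    <Patient name='Samuela Nortone'>
      <HealthCareID>5546-923-29391D</HealthCareID>
      <PrimaryPhysician>Dr. Huang, M.D.</PrimaryPhysician>
    </Patient>
    <Code>45G87</Code>
    <Amount>$180</Amount>
    <ActingPhysicianID>45-667-324</ActingPhysicianID>
    <Treatment>
      Followup examination for treatment of fractured clavicle.
    </Treatment>
  </Claim>
</Claims>
"""


def summarize(xml_bytes):
    """Return one dictionary per <Claim> element (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for claim in root.findall("Claim"):
        patient = claim.find("Patient")
        rows.append({
            "patient": patient.get("name"),                       # attribute
            "health_care_id": patient.findtext("HealthCareID"),   # child element
            "code": claim.findtext("Code"),
            "amount": claim.findtext("Amount"),
            "treatment": claim.findtext("Treatment").strip(),     # drop layout whitespace
        })
    return rows


# Parse from bytes so that the parser honors the declared encoding.
claims = summarize(CLAIMS_XML.encode("utf-8"))
for row in claims:
    print(row["patient"], row["code"], row["amount"])
```

The whitespace around the treatment text is part of the element's content as far as the parser is concerned, which is why the sketch strips it explicitly.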
The structure of an XML document consists of the following elements, in the order listed:
The XML declaration, when present, must be the very first thing in an XML document (XML 1.0 makes it optional, but including it is recommended). It has the following basic structure:
<?xml version="1.0" encoding="utf-8"?>
The exact character set in the encoding parameter is up to you, but we will mostly be using UTF-8 for maximal interoperability.
The second part of the XML file is made up of optional directives to describe things such as Document Type Definitions (DTDs; see the section "Validating XML").
Finally, the element hierarchy starts with the document element (the root of all of the document's content). As mentioned before, there can be only one document node, and the items that come before it (such as the XML declaration or DTD instructions) are not considered part of the document.
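The ordering described above (declaration first, then a single document element) is usually handled for you when a document is generated by a library. The sketch below, which assumes Python's xml.etree.ElementTree, builds a reduced version of the claims hierarchy and serializes it with a declaration. The helper names build_claims and serialize and the input dictionary keys are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def build_claims(claims):
    """Build a <Claims> document element from a list of dictionaries."""
    root = ET.Element("Claims")                     # the single document element
    for claim in claims:
        claim_el = ET.SubElement(root, "Claim")
        patient = ET.SubElement(claim_el, "Patient", name=claim["patient"])
        ET.SubElement(patient, "HealthCareID").text = claim["health_care_id"]
        ET.SubElement(claim_el, "Code").text = claim["code"]
        ET.SubElement(claim_el, "Amount").text = claim["amount"]
    return root


def serialize(root):
    """Write the tree as UTF-8 bytes, with the XML declaration at the start."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()


document = serialize(build_claims([
    {"patient": "Thomas Fignon", "health_care_id": "5546-345-29384A",
     "code": "45A66", "amount": "$468.50"},
]))
print(document.decode("utf-8"))
```

Only a subset of the sample's fields is written here to keep the sketch short; the remaining elements would be added the same way.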
Comments, which are just ignored text that you can use to help annotate your data, may appear throughout the XML document. These are written as follows:
<!-- This is an XML comment -->
Comments may not contain the sequence of characters --, and they may not be nested as follows:
<!-- This is <!-- Not permitted!!! --> nor is this --: -->
Rules for Forming Documents
You need to follow a number of rules to generate what are called well-formed XML documents. Only well-formed documents can be read or manipulated by the various XML implementations.
One of the limitations in XML is that a number of characters are reserved for use by XML itself, specifically the < and > characters. Trying to include these in the text for an element results in an ill-formed document, as follows:
<ArrowNode> Arrows are cool!!! ----> <---- </ArrowNode>
Trying to load an XML document with the preceding in it results in an error from the XML parser.
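To see what such a failure looks like in practice, here is a small sketch that assumes Python's xml.etree.ElementTree as the parser. The helper name is_well_formed is our own, and the exact wording of the error message depends on the parser in use.

```python
import xml.etree.ElementTree as ET

BAD = "<ArrowNode> Arrows are cool!!! ----> <---- </ArrowNode>"
GOOD = "<ArrowNode> Arrows are cool!!! ----&gt; &lt;---- </ArrowNode>"


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        # The message includes the line and column where parsing stopped.
        print("parse error:", error)
        return False
    return True


print(is_well_formed(BAD))
print(is_well_formed(GOOD))
```

With this particular parser it is the unescaped < that triggers the error, because it is read as the start of a tag.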
Fortunately, much as in HTML, there exists a solution to this problem in the form of what are called entities. Entities are named references in your document of the format &entity-name;. All entities begin with an ampersand character (&) and end with a semicolon (;), and in between these, we specify which entity to use.
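XML 1.0 has five core entities, as shown in Table 23-1.
Table 23-1
Entity    Character
&lt;      < (less than)
&gt;      > (greater than)
&amp;     & (ampersand)
&apos;    ' (apostrophe)
&quot;    " (double quote)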
Here is an example of how these might be used. If we want to have an element in our XML document that contains a small logic statement such as (A > B) && (A < C), we might write it as follows:
<Logic> (A &gt; B) &amp;&amp; (A &lt; C) </Logic>
It is quite common to forget to use entities for the ampersand character when inserting text into a document, as in the following:
<Desc> Yesterday Delph & I went to the store and bought some wine. </Desc>
Leaving the ampersand unescaped then creates errors on load that are often vague and difficult to identify:
Warning: xmldocfile(): xmlParseEntityRef: no name in /home/http/www/index.php on line 36
The correction to the previous <Desc> element is, of course, to replace the & with &amp;.
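Rather than replacing reserved characters by hand, most platforms provide a helper that does it for you. As one possible illustration, Python's standard library offers xml.sax.saxutils.escape, which replaces &, <, and > with the corresponding entities. The sketch below applies it to the two examples from this section.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Yesterday Delph & I went to the store and bought some wine."

# Escape the text before placing it between the tags.
document = "<Desc>" + escape(raw) + "</Desc>"
print(document)

# The parser turns the entity back into a plain ampersand.
parsed = ET.fromstring(document)
print(parsed.text)

# The logic statement from earlier, escaped the same way.
logic = escape("(A > B) && (A < C)")
print(logic)
```

Libraries that build documents from an element tree generally perform this escaping automatically when they serialize, so manual escaping is mostly needed when XML is assembled by string concatenation.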
Attributes Versus Elements
One of the key decisions you must make when writing XML documents, and for which there is unfortunately no good answer, is whether to make a particular piece of information a child element of a node or an attribute on it. For example, if we have information for a patient, we might choose to do one of the following:
<Patient name='Navin Parmar' id='3942-4329-14' gender='male'
         height='190cm' weight='85kg'/>

<Patient>
  <Name>Navin Parmar</Name>
  <ID>3942-4329-14</ID>
  <Gender>Male</Gender>
  <Height>190cm</Height>
  <Weight>85kg</Weight>
</Patient>
We must ultimately decide which of the preceding two to use, or formulate some combination of them to best meet our needs.
In general, we will try to use attributes sparingly, for the following reasons:
Against this comes the reality that as we work our way through a patient list, opening all the child nodes of each element might itself be less efficient than desired. If we look at how we search through a patient list, we might realize that we mostly look at their health provider ID. We could thus restructure our patient data as follows:
<Patient id='3942-4329-14'>
  <Name>Navin Parmar</Name>
  <Gender>Male</Gender>
  <Height>190cm</Height>
  <Weight>85kg</Weight>
</Patient>
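The difference between the two lookups can be sketched as follows, again assuming Python's xml.etree.ElementTree. The <Patients> wrapper element, the second patient record, and the helper names are hypothetical additions for the sake of a runnable example.

```python
import xml.etree.ElementTree as ET

# The <Patients> wrapper and the second record are hypothetical.
PATIENTS_XML = """
<Patients>
  <Patient id='3942-4329-14'>
    <Name>Navin Parmar</Name>
    <Gender>Male</Gender>
    <Height>190cm</Height>
    <Weight>85kg</Weight>
  </Patient>
  <Patient id='0000-0000-00'>
    <Name>Example Patient</Name>
    <Gender>Female</Gender>
    <Height>170cm</Height>
    <Weight>60kg</Weight>
  </Patient>
</Patients>
"""

root = ET.fromstring(PATIENTS_XML)


def find_by_id(patients, patient_id):
    """Match on the id attribute of the opening tag; no child is examined."""
    for patient in patients.iter("Patient"):
        if patient.get("id") == patient_id:
            return patient
    return None


def find_by_name(patients, name):
    """Match on a child element, which means descending into every Patient."""
    for patient in patients.iter("Patient"):
        if patient.findtext("Name") == name:
            return patient
    return None


match = find_by_id(root, "3942-4329-14")
print(match.findtext("Name"))
```

With an in-memory tree like this one the cost difference is small; it matters more with streaming parsers, where an attribute is available as soon as the opening tag has been read.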
Needless to say, XML is a system for describing hierarchically structured data. Therefore, it sure seems a waste not to use any of this structure and just put all the information in attributes on an element.
If we are writing an application for a doctor's office, in many countries (particularly in the United States) we must interact with more than one insurance company. If Insurance Company A has defined its claim format to be
<Claims>
  <Claim>
    <Patient id='...'>
      <Name>...</Name>
      <PrimaryPhysician>...</PrimaryPhysician>
    </Patient>
    <Code>...</Code>
    <Amount>...</Amount>
    <ActingPhysician>...</ActingPhysician>
    <Treatment>...</Treatment>
  </Claim>
  ...
</Claims>
and Insurance Company B has defined its claim format as
<Claims>
  <Claim>
    <Patient name='...'>
      <InsuranceID>...</InsuranceID>
      <PhysicianID>...</PhysicianID>
    </Patient>
    <Code>...</Code>
    <SubCode>...</SubCode>
    <Amount>...</Amount>
    <Description>...</Description>
    <TendingPhysicianID>...</TendingPhysicianID>
  </Claim>
</Claims>
you can clearly see that we will have a problem. After our code is given a document element named Claims, it has a difficult time determining from whom it came. We would ideally like to avoid having to dig through child nodes looking for hints as to the source.
Fortunately, XML has a solution to this problem in a feature called namespaces. In effect, namespaces are a way of associating a domain, or some prefix, with a set of elements, so as to differentiate them from possibly similar nodes coming from other sources. We can associate a namespace with our Claims document hierarchy as follows:
<hc:Claims xmlns:hc="healthclaim">
  <hc:Claim>
    <hc:Patient id='...'>
      <hc:Name>...</hc:Name>
      <hc:PrimaryPhysician>...</hc:PrimaryPhysician>
    </hc:Patient>
    <hc:Code>...</hc:Code>
    <hc:Amount>...</hc:Amount>
    <hc:ActingPhysician>...</hc:ActingPhysician>
    <hc:Treatment>...</hc:Treatment>
  </hc:Claim>
  ...
</hc:Claims>
By adding the xmlns:hc="healthclaim" to our document element, we are announcing the creation of a new namespace with a shortened prefix name of hc, and that its full identifying name is "healthclaim".
Even though it looks like we have made our XML significantly more complicated with the addition of the preceding namespace, most XML implementations in web application platforms (such as PHP) enable you to get the name of elements without these prefixes. We can also just declare a default namespace for all elements by omitting the short prefix name:
<Claims xmlns="healthclaim">
  <Claim>
    <Patient id='...'>
      <Name>...</Name>
      <PrimaryPhysician>...</PrimaryPhysician>
    </Patient>
    <Code>...</Code>
    <Amount>...</Amount>
    <ActingPhysician>...</ActingPhysician>
    <Treatment>...</Treatment>
  </Claim>
  ...
</Claims>
All of these elements are still in the same namespace, fully identified as "healthclaim".
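One way to convince yourself that the prefixed and default forms are equivalent is to look at the names a parser reports. In the sketch below we assume Python's xml.etree.ElementTree, which reports each element name as {namespace}local-name regardless of the prefix used in the source. The shortened documents and the helper names are our own.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the two documents above; the Code value is a placeholder.
PREFIXED = ("<hc:Claims xmlns:hc='healthclaim'><hc:Claim>"
            "<hc:Code>45A66</hc:Code></hc:Claim></hc:Claims>")
DEFAULT = ("<Claims xmlns='healthclaim'><Claim>"
           "<Code>45A66</Code></Claim></Claims>")


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


def qualified_names(text):
    """List (namespace, local name) pairs for every element, in document order."""
    return [split_tag(element.tag) for element in ET.fromstring(text).iter()]


print(qualified_names(PREFIXED))
print(qualified_names(DEFAULT))
```

The prefix itself never appears in the result: it is only shorthand in the source text, while the full namespace name is what identifies the element.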
One problem we can imagine for this is that if we have a large number of health-care insurance providers, at least two of them might choose to use the namespace name of "healthclaim". To get around this, we need to use another way to specify a namespace domain. The most common mechanism is to use a domain URI, such as the URL to a web site. The full reference within that domain often points to some page with information about the structured data being represented in the namespace:
<Claims xmlns="http://www.hcproviderA.com/schema/healthclaim">
  <Claim>
    <Patient id='...'>
      <Name>...</Name>
      <PrimaryPhysician>...</PrimaryPhysician>
    </Patient>
    <Code>...</Code>
    <Amount>...</Amount>
    <ActingPhysician>...</ActingPhysician>
    <Treatment>...</Treatment>
  </Claim>
  ...
</Claims>
Thus, even if our second insurance company were to want to use the namespace name "healthclaim", its unique namespace name would be something like "http://www.healthcareproviderB.com/schemas/healthclaim" and would not interfere with the first one.
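With unique namespace names in place, code can tell the two claim formats apart by inspecting only the document element. The following is a minimal sketch assuming Python's xml.etree.ElementTree; the two URIs are the ones used in the text, while the helper names, the lookup table, and the shortened sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

NS_A = "http://www.hcproviderA.com/schema/healthclaim"
NS_B = "http://www.healthcareproviderB.com/schemas/healthclaim"

# Map the fully qualified name of the document element to its source.
SOURCES = {
    "{" + NS_A + "}Claims": "Insurance Company A",
    "{" + NS_B + "}Claims": "Insurance Company B",
}


def claim_source(xml_text):
    """Identify the sender from the document element alone."""
    root = ET.fromstring(xml_text)
    return SOURCES.get(root.tag, "unknown")


def claim_codes(xml_text, namespace):
    """Find Code elements in a given namespace using a local prefix map."""
    root = ET.fromstring(xml_text)
    prefixes = {"c": namespace}   # 'c' is our own shorthand, not the document's
    return [code.text for code in root.findall("c:Claim/c:Code", prefixes)]


# Shortened sample documents with placeholder codes.
FROM_A = '<Claims xmlns="' + NS_A + '"><Claim><Code>45A66</Code></Claim></Claims>'
FROM_B = '<Claims xmlns="' + NS_B + '"><Claim><Code>45G87</Code></Claim></Claims>'

print(claim_source(FROM_A))
print(claim_source(FROM_B))
print(claim_codes(FROM_A, NS_A))
```

Searching FROM_A with NS_B finds nothing, which is exactly the separation namespaces are meant to provide.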
We have mentioned the concept of a well-formed XML document, which basically implies that it is a properly constructed XML document: all tags are closed, all attributes are enclosed in quotes, there is a single document node, and so on. However, one of the strengths of the XML standard is that there are ways to actually describe what the structure of the data should be, and have the XML implementation validate the data against that description for you. This enables us to define, in addition to well-formed documents, valid documents.
There are two ways in which this can be done. The first (and older) mechanism is via a DTD, which is a set of declarations you can include with your XML document to describe its layout. This method lets you describe a hierarchy of elements for your document, and whether a node is to contain content, other nodes, or both.
The second (and newer) method of doing this is known as XML Schemas. It is a more powerful and flexible system to describe the structure of your documents, at the cost of being much more difficult to learn and write. It does, however, support more features than DTDs, including the ability to specify data types for element contents, sequences of elements, and flexible limits on the number of elements that can appear. Furthermore, XML Schemas are well-formed XML documents.
Although we cannot provide much in the way of description for either of these technologies (this would be a large book indeed), we show you one example of each for our health-care claim documents written earlier so that you know how to recognize these documents when you see them.
A DTD for our claims document could look like this:
<!DOCTYPE Claims [
  <!ELEMENT Claims (Claim*)>
  <!ELEMENT Claim (Patient,Code,Amount,ActingPhysicianID,Treatment)>
  <!ELEMENT Patient (HealthCareID,PrimaryPhysician)>
  <!ELEMENT HealthCareID (#PCDATA)>
  <!ELEMENT PrimaryPhysician (#PCDATA)>
  <!ELEMENT Code (#PCDATA)>
  <!ELEMENT Amount (#PCDATA)>
  <!ELEMENT ActingPhysicianID (#PCDATA)>
  <!ELEMENT Treatment (#PCDATA)>
  <!ATTLIST Patient name CDATA #REQUIRED>
]>
These are often inserted directly in the XML document between the starting <?xml ... ?> declaration and the document element, making them easy to transport.
An XML Schema Definition (XSD) for the same structure could look like this:
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Declare basic element types first -->
  <xsd:element name='HealthCareID' type='xsd:string' />
  <xsd:element name='PrimaryPhysician' type='xsd:string' />
  <xsd:element name='Code' type='xsd:string' />
  <xsd:element name='Amount' type='xsd:string' />
  <xsd:element name='ActingPhysicianID' type='xsd:string' />
  <xsd:element name='Treatment' type='xsd:string' />

  <!-- Declare any complex types next -->
  <xsd:element name='Patient'>
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="HealthCareID" />
        <xsd:element ref="PrimaryPhysician" />
      </xsd:sequence>
      <xsd:attribute name='name' type='xsd:string' use='required'/>
    </xsd:complexType>
  </xsd:element>

  <!-- The Claims node (shown below) is a list of Claim nodes, described here: -->
  <xsd:element name='Claim'>
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref='Patient'/>
        <xsd:element ref='Code'/>
        <xsd:element ref='Amount'/>
        <xsd:element ref='ActingPhysicianID'/>
        <xsd:element ref='Treatment'/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Finally, declare the Claims node, which is the root of our document -->
  <xsd:element name='Claims'>
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref='Claim' maxOccurs='unbounded'/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
Fortunately, both of these technologies are extremely well covered in documentation in books and in tutorials found on the Internet, and both can be learned in a reasonably short period of time.
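To illustrate the difference between a well-formed document and a valid one, the sketch below hand-codes a few of the rules that the DTD shown earlier expresses: which children a Claim and a Patient must have, in what order, and that Patient requires a name attribute. This is not a DTD or XML Schema validator, only a stand-in written with Python's xml.etree.ElementTree to show the idea. The rule table, function name, and placeholder documents are all assumptions of ours.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of two content models from the DTD; not a real DTD parser.
CONTENT_MODELS = {
    "Claim": ["Patient", "Code", "Amount", "ActingPhysicianID", "Treatment"],
    "Patient": ["HealthCareID", "PrimaryPhysician"],
}


def validation_errors(xml_text):
    """Return a list of rule violations; raises ParseError if not well formed."""
    root = ET.fromstring(xml_text)
    errors = []
    if root.tag != "Claims":
        errors.append("document element must be Claims, found " + root.tag)
    for element in root.iter():
        child_tags = [child.tag for child in element]
        if element.tag == "Claims":
            if any(tag != "Claim" for tag in child_tags):
                errors.append("Claims may contain only Claim elements")
        elif element.tag in CONTENT_MODELS:
            if child_tags != CONTENT_MODELS[element.tag]:
                errors.append(element.tag + " has unexpected children: "
                              + ", ".join(child_tags))
        if element.tag == "Patient" and "name" not in element.attrib:
            errors.append("Patient requires a name attribute")
    return errors


# Both documents are well formed; only the first follows the rules above.
VALID = ("<Claims><Claim><Patient name='Thomas Fignon'>"
         "<HealthCareID>5546-345-29384A</HealthCareID>"
         "<PrimaryPhysician>Dr. Nutson</PrimaryPhysician></Patient>"
         "<Code>45A66</Code><Amount>$468.50</Amount>"
         "<ActingPhysicianID>44-539-299</ActingPhysicianID>"
         "<Treatment>Routine Physical Examination</Treatment></Claim></Claims>")
INVALID = ("<Claims><Claim><Patient>"
           "<HealthCareID>5546-345-29384A</HealthCareID>"
           "<PrimaryPhysician>Dr. Nutson</PrimaryPhysician></Patient>"
           "<Code>45A66</Code>"
           "<ActingPhysicianID>44-539-299</ActingPhysicianID>"
           "<Treatment>Routine Physical Examination</Treatment></Claim></Claims>")

print(validation_errors(VALID))
for problem in validation_errors(INVALID):
    print(problem)
```

In a real application you would hand the DTD or schema to a validating XML implementation rather than writing such checks yourself; the point here is only that a document can parse cleanly and still fail to match the agreed structure.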
One of the most powerful features of XML is that it is simply a document description language, and thus extensible for many other purposes. This has led to a good number of extensions to the XML specification and new content description languages based on the basic principles of XML. Because the implementers of these languages need not worry about parsing and "well formedness," they are free to focus on the details specific to their domain.
Although we could go on for some time describing some of these extensions and meta-languages, including ones for describing sheet music, Chinese characters, and genealogy structures, we focus instead on some of the more common ones you will encounter when writing web applications:
We do not have much opportunity to use these technologies in this book, but many web application authors incorporate some or all of them into their larger enterprise systems.