Chapter 1. Introducing XML | XML in a Nutshell, 2nd Edition

CONTENTS

1.1 The Benefits of XML
1.2 Portable Data
1.3 How XML Works
1.4 The Evolution of XML

XML, the Extensible Markup Language, is a W3C-endorsed standard for document markup. It defines a generic syntax used to mark up data with simple, human-readable tags. It provides a standard format for computer documents. This format is flexible enough to be customized for domains as diverse as web sites, electronic data interchange, vector graphics, genealogy, real-estate listings, object serialization, remote procedure calls, voice-mail systems, and more.

You can write your own programs that interact with, massage, and manipulate the data in XML documents. If you do, you'll have access to a wide range of free libraries in a variety of languages that can read and write XML so that you can focus on the unique needs of your program. Or you can use off-the-shelf software, such as web browsers and text editors, to work with XML documents. Some tools are able to work with any XML document. Others are customized to support a particular XML application in a particular domain, such as vector graphics, and may not be of much use outside that domain. But in all cases, the same underlying syntax is used, even if it's deliberately hidden by the more user-friendly tools or restricted to a single application.

1.1 The Benefits of XML

XML is a metamarkup language for text documents. Data is included in XML documents as strings of text. The data is surrounded by text markup that describes the data. XML's basic unit of data and markup is called an element. The XML specification defines the exact syntax this markup must follow: how elements are delimited by tags, what a tag looks like, what names are acceptable for elements, where attributes are placed, and so forth. Superficially, the markup in an XML document looks a lot like the markup in an HTML document, but there are some crucial differences.

Most importantly, XML is a metamarkup language. That means it doesn't have a fixed set of tags and elements that are supposed to work for everybody in all areas of interest for all time. Any attempt to create a finite set of such tags is doomed to failure. Instead, XML allows developers and writers to define the elements they need as they need them. Chemists can use elements that describe molecules, atoms, bonds, reactions, and other items encountered in chemistry. Real-estate agents can use elements that describe apartments, rents, commissions, locations, and other items needed for real estate. Musicians can use elements that describe quarter notes, half notes, G-clefs, lyrics, and other objects common in music. The X in XML stands for Extensible. Extensible means that the language can be extended and adapted to meet many different needs.

Although XML is quite flexible in the elements it allows to be defined, it is quite strict in many other respects. It provides a grammar for XML documents that says where tags may be placed, what they must look like, which element names are legal, how attributes are attached to elements, and so forth. This grammar is specific enough to allow the development of XML parsers that can read any XML document. Documents that satisfy this grammar are said to be well-formed. Documents that are not well-formed are not allowed, any more than a C program that contains a syntax error is allowed. XML processors will reject documents that contain well-formedness errors.
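As a rough illustration of what "rejecting" a malformed document looks like in practice, the following Python sketch uses the standard library's xml.etree.ElementTree parser. The helper name check_well_formed and the two sample documents are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, error message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical sample documents.
good = "<product><name>DataLife MF 2HD</name><color>black</color></product>"
bad = "<product><name>DataLife MF 2HD</name><color>black</product>"  # <color> is never closed

for label, doc in (("good", good), ("bad", bad)):
    ok, message = check_well_formed(doc)
    print(label, "is well-formed" if ok else "is rejected: " + message)
```

The parser does not try to guess what the author of the second document meant; it simply reports the error and gives up.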

For reasons of interoperability, individuals or organizations may agree to use only certain tags. These tag sets are called XML applications. An XML application is not a software application that uses XML, such as Mozilla or Microsoft Word. Rather, it's an application of XML in a particular domain like vector graphics or cooking.

The markup in an XML document describes the structure of the document. It lets you see which elements are associated with which other elements. In a well-designed XML document, the markup also describes the document's semantics. For instance, the markup can indicate that an element is a date or a person or a bar code. In well-designed XML applications, the markup says nothing about how the document should be displayed. That is, it does not say that an element is bold or italicized or a list item. XML is a structural and semantic markup language, not a presentation language.^[1]

The markup permitted in a particular XML application can be documented in a schema. Particular document instances can be compared to the schema. Documents that match the schema are said to be valid. Documents that do not match are invalid. Validity depends on the schema. That is, whether a document is valid or invalid depends on which schema you compare it to. Not all documents need to be valid. For many purposes it is enough that the document merely be well-formed.

There are many different XML schema languages, with different levels of expressivity. The most broadly supported schema language and the only one defined by the XML 1.0 specification itself is the document type definition (DTD). A DTD lists all the legal markup and specifies where and how it may be included in a document. DTDs are optional in XML. On the other hand, DTDs may not always be enough. The DTD syntax is quite limited and does not allow you to make many useful statements such as "This element contains a number" or "This string of text is a date between 1974 and 2032." The W3C XML Schema Language (which sometimes goes by the misleadingly generic label schemas) does allow you to express constraints of this nature. Besides these two, there are many other schema languages from which to choose, including RELAX NG, Schematron, Hook, and Examplotron, and this is hardly an exhaustive list.

All current schema languages are purely declarative. However, there are always some constraints that cannot be expressed in anything less than a Turing complete programming language. For example, given an XML document that represents an order, a Turing complete language is required to multiply the price of each order_item by its quantity, sum them all up, and verify that the sum equals the value of the subtotal element. Today's schema languages are also incapable of verifying extra-document constraints such as "Every SKU element matches the SKU field of a record in the products table of the inventory database." If you're writing programs to read XML documents, you can add code to verify statements like these, just as you would if you were writing code to read a tab-delimited text file. The difference is that XML parsers present you with the data in a much more convenient format and do more of the work for you before you have to resort to your own custom code.
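The subtotal check described above might be sketched in Python as follows. The order_item and subtotal element names come from the text; the price and quantity child elements, the sample order, and the function name subtotal_is_consistent are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical order document; only order_item and subtotal are named in the text.
ORDER = """<order>
  <order_item><price>4.50</price><quantity>2</quantity></order_item>
  <order_item><price>10.00</price><quantity>1</quantity></order_item>
  <subtotal>19.00</subtotal>
</order>"""


def subtotal_is_consistent(xml_text):
    """Check that the sum of price * quantity over all items equals the subtotal."""
    order = ET.fromstring(xml_text)
    total = sum(
        (Decimal(item.findtext("price")) * int(item.findtext("quantity"))
         for item in order.findall("order_item")),
        Decimal("0"),
    )
    return total == Decimal(order.findtext("subtotal"))


print(subtotal_is_consistent(ORDER))
```

The parser hands the program a tree of elements, so the custom code is only the arithmetic. Decimal is used rather than float so that prices compare exactly.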

1.1.1 What XML Is Not

XML is a markup language, and it is only a markup language. It's important to remember that. The XML hype has gotten so extreme that some people expect XML to do everything up to and including washing the family dog.

First of all, XML is not a programming language. There's no such thing as an XML compiler that reads XML files and produces executable code. You might perhaps define a scripting language that used a native XML format and was interpreted by a binary program, but even this application would be unusual.^[2] XML can be used as a format for instructions to programs that do make things happen, just as a traditional program may read a text config file and take different actions depending on what it sees there. Indeed, there's no reason a config file can't be XML instead of unstructured text. Some more recent programs are beginning to use XML config files, but in all cases it's the program taking action, not the XML document itself. An XML document by itself simply is. It does not do anything.

Secondly, XML is not a network transport protocol. XML won't send data across the network, any more than HTML will. Data sent across the network using HTTP, FTP, NFS, or some other protocol might happen to be encoded in an XML format, but again there has to be some software outside the XML document that actually does the sending.

Finally, to mention the example where the hype most often obscures the reality, XML is not a database. You're not going to replace an Oracle or MySQL server with XML. A database can contain XML data, either as a VARCHAR or a BLOB or as some custom XML data type, but the database itself is not an XML document. You can store XML data into a database on a server or retrieve data from a database in an XML format, but to do this, you need to be running software written in a real programming language such as C or Java. To store XML in a database, software on the client side will send the XML data to the server using an established network protocol such as TCP/IP. Software on the server side will receive the XML data, parse it, and store it in the database. To retrieve an XML document from a database, you'll generally pass through some middleware product like Enhydra that makes SQL queries against the database and formats the result set as XML before returning it to the client. Indeed, some databases may integrate this software code into their core server or provide plug-ins to do it such as the Oracle XSQL servlet. XML serves very well as a ubiquitous, platform-independent transport format in these scenarios. However, it is not the database, and it shouldn't be used as one.

1.2 Portable Data

XML offers the tantalizing possibility of truly cross-platform, long-term data formats. It's long been the case that a document written on one platform is not necessarily readable on a different platform, or by a different program on the same platform, or even by a future or past version of the same program on the same platform. When the document can be read, there's no guarantee that all the information will come across. Much of the data from the original moon landings in the late 1960s and early 1970s is now effectively lost. Even if you can find a tape drive that can read the now obsolete tapes, nobody knows in what format the data is stored on the tapes!

XML is an incredibly simple, well-documented, straightforward data format. XML documents are text and can be read with any tool that can read a text file. Not just the data, but also the markup is text, and it's present right there in the XML file as tags. You don't have to wonder whether every eighth byte is random padding, guess whether a four-byte quantity is a two's complement integer or an IEEE 754 floating point number, or try to decipher which integer codes map to which formatting properties. You can read the tag names directly to find out exactly what's in the document. Similarly, since element boundaries are defined by tags, you aren't likely to be tripped up by unexpected line-ending conventions or the number of spaces that are mapped to a tab. All the important details about the structure of the document are explicit. You don't have to reverse-engineer the format or rely on incomplete and often unavailable documentation.

A few software vendors may want to lock in their users with undocumented, proprietary, binary file formats. However, in the long term we're all better off if we can use the cleanly documented, well-understood, easy to parse, text-based formats that XML provides. XML lets documents and data be moved from one system to another with a reasonable hope that the receiving system will be able to make sense out of it. Furthermore, validation lets the receiving side check that what it gets is what it expects. Java promised portable code; XML delivers portable data. In many ways, XML is the most portable and flexible document format designed since the ASCII text file.

1.3 How XML Works

Example 1-1 shows a simple XML document. This particular XML document might be seen in an inventory-control system or a stock database. It marks up the data with tags and attributes describing the color, size, bar-code number, manufacturer, name of the product, and so on.

Example 1-1. An XML document

<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>

This document is text and might well be stored in a text file. You can edit this file with any standard text editor such as BBEdit, jEdit, UltraEdit, Emacs, or vi. You do not need a special XML editor. Indeed, we find most general-purpose XML editors to be far more trouble than they're worth and much harder to use than simply editing documents in a text editor.

Programs that actually try to understand the contents of the XML document (that is, do more than merely treat it as any other text file) will use an XML parser to read the document. The parser is responsible for dividing the document into individual elements, attributes, and other pieces. It passes the contents of the XML document to an application piece by piece. If at any point the parser detects a violation of the well-formedness rules of XML, then it reports the error to the application and stops parsing. In some cases the parser may read further in the document, past the original error, so that it can detect and report other errors that occur later in the document. However, once it has detected the first well-formedness error, it will no longer pass along the contents of the elements and attributes it encounters.
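To make the piece-by-piece hand-off concrete, here is a minimal Python sketch using the standard library's xml.sax module. The EventLogger class and the deliberately broken sample document are invented for this illustration, and other parsers may report errors differently.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record the name of each element as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)


# Hypothetical document: </product> arrives while <size> is still open,
# which is a well-formedness error.
BROKEN = b"<product><name>DataLife MF 2HD</name><size>3.5</product>"

logger = EventLogger()
error = None
try:
    xml.sax.parseString(BROKEN, logger)
except xml.sax.SAXParseException as exc:
    error = exc

print("elements reported before the error:", logger.started)
print("parser stopped with:", error)
```

The application receives the elements that precede the error, then an exception, and nothing after it.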

Individual XML applications normally dictate more precise rules about exactly which elements and attributes are allowed where. For instance, you wouldn't expect to find a G_Clef element when reading a biology document. Some of these rules can be precisely specified with a schema written in any of several languages including the W3C XML Schema Language, RELAX NG, and DTDs. A document may contain a URI indicating where the schema can be found. Some XML parsers will notice this and compare the document to its schema as they read it to see if the document satisfies the constraints specified there. Such a parser is called a validating parser. A violation of those constraints is called a validity error, and the whole process of checking a document against a schema is called validation. If a validating parser finds a validity error, it will report it to the application on whose behalf it's parsing the document. This application can then decide whether it wishes to continue parsing the document. However, validity errors are not necessarily fatal (unlike well-formedness errors), and an application may choose to ignore them. Not all parsers are validating parsers. Some merely check for well-formedness.

The application that receives data from the parser may be:

A web browser such as Netscape Navigator or Internet Explorer that displays the document to a reader
A word processor such as StarOffice Writer that loads the XML document for editing
A database such as Microsoft SQL Server that stores the XML data in a new record
A drawing program such as Adobe Illustrator that interprets the XML as two-dimensional coordinates for the contents of a picture
A spreadsheet such as Gnumeric that parses the XML to find numbers and functions used in a calculation
A personal finance program such as Microsoft Money that sees the XML as a bank statement
A syndication program that reads the XML document and extracts the headlines for today's news
A program that you yourself wrote in Java, C, Python or some other language that does exactly what you want it to do
Almost anything else

XML is an extremely flexible format for data. It is used for all of this and a lot more. These are real examples. In theory, any data that can be stored in a computer can be stored in XML format. In practice, XML is suitable for storing and exchanging any data that can plausibly be encoded as text. It's only really unsuitable for multimedia data such as photographs, recorded sound, video, and other very large bit sequences.

1.4 The Evolution of XML

XML is a descendant of SGML, the Standard Generalized Markup Language. The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by several hundred people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves in much the same way XML solves them. It is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.

SGML's biggest success was HTML, which is an SGML application. However, HTML is just one SGML application. It does not have or offer anywhere near the full power of SGML itself. Since it restricts authors to a finite set of tags designed to describe web pages (and describes them in a fairly presentationally oriented way at that), it's really little more than a traditional markup language that has been adopted by web browsers. It doesn't lend itself to use beyond the single application of web-page design. You would not use HTML to exchange data between incompatible databases or to send updated product catalogs to retailer sites, for example. HTML does web pages, and it does them very well, but it only does web pages.

SGML was the obvious choice for other applications that took advantage of the Internet but were not simple web pages for humans to read. The problem was that SGML is complicated: very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implemented or relied on different subsets of SGML were often incompatible with each other. The special feature one program considered essential would be considered extraneous fluff and omitted by the next program.

In 1996, Jon Bosak, Tim Bray, C. M. Sperberg-McQueen, James Clark, and several others began work on a "lite" version of SGML that retained most of SGML's power while trimming a lot of the features that had proven redundant, too complicated to implement, confusing to end users, or simply not useful over the previous 20 years of experience with SGML. The result, in February of 1998, was XML 1.0, and it was an immediate success. Many developers who knew they needed a structural markup language but hadn't been able to bring themselves to accept SGML's complexity adopted XML whole-heartedly. It was used in domains ranging from legal court filings to hog farming.

However, XML 1.0 was just the beginning. The next standard out of the gate was Namespaces in XML, an effort to allow markup from different XML applications to be used in the same document without conflicting. Thus a web page about books could have a title element that referred to the title of the page and title elements that referred to the title of a book, and the two would not conflict.
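The two kinds of title element might coexist as in the following small Python sketch, which uses xml.etree.ElementTree. The page content, the bk prefix, and the http://example.com/ns/books namespace URI are hypothetical; only the XHTML namespace URI is a real, published one.

```python
import xml.etree.ElementTree as ET

# Hypothetical web page about books that mixes two vocabularies.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:bk="http://example.com/ns/books">
  <head><title>Recommended Reading</title></head>
  <body>
    <bk:book><bk:title>XML in a Nutshell</bk:title></bk:book>
  </body>
</html>"""

# Prefixes used in the queries below; they need not match those in the document.
NS = {"h": "http://www.w3.org/1999/xhtml", "bk": "http://example.com/ns/books"}

root = ET.fromstring(PAGE)
page_title = root.find("h:head/h:title", NS).text
book_titles = [t.text for t in root.findall(".//bk:title", NS)]

print("page title:", page_title)
print("book titles:", book_titles)
```

Because each title is qualified by a different namespace, a program can ask for one kind without accidentally picking up the other.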

Next up was the Extensible Stylesheet Language (XSL), an XML application for transforming XML documents into a form that could be viewed in web browsers. This soon split into XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO). XSLT has become a general-purpose language for transforming one XML document into another, whether for web-page display or some other purpose. XSL-FO is an XML application for describing the layout of both printed pages and web pages that rivals PostScript for its power and expressiveness.

However, XSL is not the only option for styling XML documents. The Cascading Style Sheets (CSS) language was already in use for HTML documents when XML was invented, and it proved to be a reasonable fit to XML as well. With the advent of CSS Level 2, the W3C made styling XML documents an explicit goal for CSS and gave it equal importance to HTML. The pre-existing Document Style Sheet and Semantics Language (DSSSL) was also adopted from its roots in the SGML world to style XML documents for print and the Web.

The Extensible Linking Language, XLink, began by defining more powerful linking constructs that could connect XML documents in a hypertext network that made HTML's A tag look like it is an abbreviation for "anemic." It also split into two separate standards: XLink for describing the connections between documents and XPointer for addressing the individual parts of an XML document. At this point, it was noticed that both XPointer and XSLT were developing fairly sophisticated yet incompatible syntaxes to do exactly the same thing: identify particular elements of an XML document. Consequently, the addressing parts of both specifications were split off and combined into a third specification, XPath.
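As a hedged illustration of what "identifying particular elements" means, the Python sketch below runs two XPath-style location paths against a small inventory document. It relies on xml.etree.ElementTree, which supports only a limited subset of XPath. The inventory wrapper element and the second product are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory; the first product echoes Example 1-1.
INVENTORY = """<inventory>
  <product barcode="2394287410"><name>DataLife MF 2HD</name><color>black</color></product>
  <product barcode="1111111111"><name>Hypothetical Zip Disk</name><color>blue</color></product>
</inventory>"""

root = ET.fromstring(INVENTORY)

# Select the name of the product with a given barcode attribute.
names = [n.text for n in root.findall("./product[@barcode='2394287410']/name")]

# Select the products whose color child contains the text "black".
black = [p.get("barcode") for p in root.findall("./product[color='black']")]

print(names)
print(black)
```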

Another piece of the puzzle was a uniform interface for accessing the contents of the XML document from inside a Java, JavaScript, or C++ program. The simplest API was merely to treat the document as an object that contained other objects. Indeed, work was already underway inside and outside the W3C to define such a Document Object Model (DOM) for HTML. Expanding this effort to cover XML was not hard.
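The idea of a document as an object that contains other objects can be sketched with Python's xml.dom.minidom, a lightweight DOM implementation in the standard library. The document here is a shortened copy of Example 1-1, and the variable names are our own.

```python
from xml.dom import minidom

# Shortened copy of Example 1-1.
DOC = """<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
</product>"""

document = minidom.parseString(DOC)

# The root element is an object with attributes and child nodes.
product = document.documentElement
barcode = product.getAttribute("barcode")

# Each child element is itself an object that contains a text node.
name_node = product.getElementsByTagName("name")[0]
name = name_node.firstChild.data

child_elements = [
    node.tagName for node in product.childNodes
    if node.nodeType == node.ELEMENT_NODE
]

print(barcode, name, child_elements)
```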

Outside the W3C, David Megginson, Peter Murray-Rust, and other members of the xml-dev mailing list recognized that third party XML parsers, while all compatible in the documents they could parse, were incompatible in their APIs. This led to the development of the Simple API for XML, SAX. In 2000, SAX2 was released to add greater configurability in parsing, namespace support, and a cleaner API.

One of the surprises during the evolution of XML was that developers used it more for data-oriented structures such as serialized objects and database records than for the narrative structures for which SGML had traditionally been used. DTDs worked very well for narrative structures, but had some limits when faced with the data-oriented structures developers were actually creating. In particular, the lack of data typing and the fact that DTDs were not themselves XML documents were perceived as major problems. A number of companies and individuals began working on schema languages that addressed these deficiencies. Many of these proposals were submitted to the W3C, which formed a working group to try to merge the best parts of all of these and come up with something greater than the sum of its parts. In 2001, this group released Version 1.0 of the W3C XML Schema Language. Unfortunately, they produced something overly complex and burdensome. Consequently, several developers went back to the drawing board to invent cleaner, simpler, more elegant schema languages, including RELAX NG and Schematron.

Eventually, it became apparent that XML 1.0, XPath, the W3C XML Schema Language, SAX, and DOM all had similar but subtly different conceptual models of the structure of an XML document. For instance, XPath and SAX don't consider CDATA sections to be anything more than syntax sugar, but DOM does treat them differently than plain-text nodes. Thus the W3C XML Core Working Group began work on an XML Information Set that all these standards could rely on and refer to.

Development of extensions to the core XML specification continues. Future directions include:

XML Query Language: A fourth-generation language for extracting information that meets specified criteria from one or more XML documents
Canonical XML: A standard algorithm for determining whether two XML documents are the same after insignificant details, such as whether single or double quotes delimit attribute values, are accounted for
XInclude: A means of building a single XML document out of multiple well-formed, potentially valid XML documents and pieces thereof
XML Signatures: A standard for digitally signing XML documents, embedding those signatures in XML documents, and authenticating the resulting documents
XML Encryption: A standard XML syntax for encrypted digital content, including portions of XML documents
SAX 2.1: A set of small extensions to SAX2 that provides extra information about an XML document recommended by the Infoset, including the XML declaration
DOM Level 3: Many additional classes, interfaces, and methods that build on top of DOM2 to provide schema support, standard means of loading and saving XML documents, and many more additional capabilities
XFragment: An effort to make sense out of pieces of XML documents that may not be well-formed documents when considered in isolation
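To give a flavor of the Canonical XML item in the list above, the following Python sketch compares documents by their canonical form. It assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize function implements a later revision of the standard (C14N 2.0) than the one under development when this list was written. The helper name same_canonical_form and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET


def same_canonical_form(doc_a, doc_b):
    """Return True if both documents serialize to the same canonical XML."""
    return ET.canonicalize(doc_a) == ET.canonicalize(doc_b)


# Hypothetical documents that differ only in quoting style, attribute order,
# and empty-element syntax.
single = "<product barcode='2394287410' color='black'/>"
double = '<product color="black" barcode="2394287410"></product>'

# A document that differs in actual content.
other = '<product color="blue" barcode="2394287410"></product>'

print(same_canonical_form(single, double))
print(same_canonical_form(single, other))
```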

Doubtless, many new extensions of XML remain to be invented. XML has proven itself a solid foundation for many diverse technologies.

[1] A few XML applications, such as XSL Formatting Objects, are designed to describe the presentation of text. However, these are exceptions that prove the rule. Although XSL-FO does describe presentation, you'd never write an XSL-FO document directly. Instead, you'd write a more semantically structured XML document, then use an XSL Transformations stylesheet to change the structure-oriented XML into presentation-oriented XML.

[2] At least one XML application, XSL Transformations, has been proven to be Turing complete by construction. See http://www.unidex.com/turing/utm.htm for one universal Turing machine written in XSLT.
