Introducing XML | XML Programming Bible

The Web gives you unlimited amounts of information. Or does it? You can buy a ticket for a flight from New York to Miami over the Internet, but can you search for the cheapest fare? Can you price a different route when disconnected from the Web? Can you load the fare into your expense report? The answer is always no. The Web and the Hypertext Markup Language (HTML) give you access to information, but they do not enable you to leverage it.

For example, say you want to purchase a ticket. You go to http://www.myfavoriteairlines.com and download the Extensible Hypertext Markup Language (XHTML) page, mfa-index.html, shown in Listing 1-1.

Listing 1-1 mfa-index.html: An airline ticket purchase page.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Welcome to My Favorite Airlines</title>
  </head>
  <body>
    <h1>Connecting flights to Miami</h1>
    <p>Buy Direct from us and Save $$$! New York - Orlando - Miami..$95</p>
    <h1>Direct flights to Miami</h1>
    <p>Buy Direct from us and Save $$$! New York - Miami..$195</p>
    <h1>Your Selection</h1>
    <p>Your flight with tax and headphones: $210</p>
  </body>
</html>

This page, shown in Figure 1-1, displays information about flights to Miami, but you can't use it for much more. If you wanted to search for the fare itself, a simple search engine might look for a dollar sign, but in this example you would erroneously get a hit on "Save $$$." If you wanted to list only direct flights, a filter could try to interpret the text by looking for the word "Direct." Screen-scraping algorithms, however, are limited by how cleverly they can interpret the author's use of language. So, XML to the rescue!
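The limits of screen scraping can be sketched in a few lines of Python. This is only an illustration, assuming the visible paragraph text of Listing 1-1 has already been extracted; the variable names are hypothetical.

```python
import re

# Visible paragraph text from Listing 1-1 (markup already stripped).
paragraphs = [
    "Buy Direct from us and Save $$$! New York - Orlando - Miami..$95",
    "Buy Direct from us and Save $$$! New York - Miami..$195",
    "Your flight with tax and headphones: $210",
]
page_text = " ".join(paragraphs)

# Naive fare search: every dollar sign counts as a hit,
# including the ones in the "Save $$$" slogan.
naive_hits = [m.start() for m in re.finditer(r"\$", page_text)]

# A slightly smarter pattern: a dollar sign followed by digits.
# It still cannot tell a fare from the total with tax and headphones.
fares = re.findall(r"\$(\d+)", page_text)

# Naive "direct flight" filter: the slogan "Buy Direct" makes the
# connecting flight match as well.
direct_matches = [p for p in paragraphs if "Direct" in p]

print(len(naive_hits), fares, len(direct_matches))
```

Each refinement of the pattern fixes one false hit and leaves another, which is the weakness the text describes.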

Figure 1-1 Viewing our simple XHTML document in a browser.

XML Compared to HTML

XML is a Recommendation from the World Wide Web Consortium, commonly referred to as the W3C (http://www.w3.org), the multicompany group that defined XHTML and its predecessor, HTML. XML is a vehicle for information that brings usable data to the desktop and is a universal data format that does for data what HTML does for Web content—it provides the necessary markup. Because the source code of languages defined in XML looks like HTML, it's useful to compare the two.

Like HTML, XML consists of hierarchically nested fields; it is just as easy to read, and it is portable. However, where HTML contains titles, headings, and italics, XML can contain customers, order numbers, prices, or any data element you need. XML is fully extensible, so you can add new tags and new elements to support your application.

The Core of XML

Structured information contains both content (words, pictures, and so forth) and an indication of the role that content plays. For example, content in a section heading has a different meaning from content in a footnote, which is different from content in a figure caption or a database table. The XML specification defines a standard way to structure the markup of documents.

Why XML?

Programmers created XML so that richly structured documents could be used over the Web. The only viable alternatives, HTML and the Standard Generalized Markup Language (SGML), are not practical for this purpose. HTML comes bound with a set of semantics and does not provide arbitrary structure. SGML provides arbitrary structure, but implementing SGML is too difficult for a Web browser to do on its own. XML specifies neither semantics nor a tag set. It is a metalanguage for describing markup languages and provides a facility for defining tags and the structural relationships between them. Because there's no predefined tag set, there aren't any preconceived semantics. The semantics of an XML document are defined either by the applications that process it or by style sheets.

XML Documents

If you use HTML or SGML, XML documents will look familiar. Let's revisit the airline example. If the Web page included XML data as in Listing 1-2, the information in mfa-sample.xml could be sent with the page.

Listing 1-2 mfa-sample.xml: An XML version of our airline ticket information.

<?xml version="1.0"?>
<flightdata>
  <ny_mia_flights>
    <direct>
      <cost>195</cost>
    </direct>
    <connecting>
      <layover duration="90" durationtype="minutes">Orlando</layover>
      <cost>95</cost>
    </connecting>
  </ny_mia_flights>
</flightdata>

The document, which is displayed in Figure 1-2, begins with an XML declaration: <?xml ...?>. While not required, its presence unequivocally identifies the document as an XML document and indicates the version of XML to which it conforms.

Figure 1-2 Loading our sample XML document in Microsoft Internet Explorer.

Elements are the most common form of markup. Their tags are delimited by angle brackets, and most elements identify the nature of the content they surround. Some elements might be empty, in which case they have no content. If an element is not empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.

Attributes are name-value pairs that occur inside start-tags after the element name. For example, <layover duration="90" durationtype="minutes"> is a <layover> element where the attributes duration and durationtype have the values "90" and "minutes". All attribute values must be quoted. Either single or double quotes can be used, as long as they are used in matching pairs.
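As a rough illustration of these attribute rules, the following Python sketch parses the <layover> element with the standard library's xml.etree.ElementTree module. The variable names are illustrative only and are not part of the airline example.

```python
import xml.etree.ElementTree as ET

# The same element written with double-quoted and single-quoted values.
double_quoted = ET.fromstring(
    '<layover duration="90" durationtype="minutes">Orlando</layover>'
)
single_quoted = ET.fromstring(
    "<layover duration='90' durationtype='minutes'>Orlando</layover>"
)

# Attributes are exposed as a name-to-value mapping; values are strings.
print(double_quoted.attrib["duration"])
print(double_quoted.attrib["durationtype"])
print(double_quoted.text)

# Both quoting styles yield the same attributes.
same_attributes = double_quoted.attrib == single_quoted.attrib

# An unquoted attribute value is rejected by the parser.
try:
    ET.fromstring("<layover duration=90>Orlando</layover>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
```

Note that the parser returns "90" as text. Treating it as a number of minutes is left to the application.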

If the HTML page described earlier included this data, you could easily identify the price of a flight because it is delimited with <cost> tags. Identifying which flights are direct is also simple because the <direct> element is nested within <ny_mia_flights>.
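A minimal sketch of that lookup, assuming the data from Listing 1-2 is available as a string and using Python's standard xml.etree.ElementTree module. The constant and variable names are illustrative.

```python
import xml.etree.ElementTree as ET

MFA_SAMPLE = """<?xml version="1.0"?>
<flightdata>
  <ny_mia_flights>
    <direct>
      <cost>195</cost>
    </direct>
    <connecting>
      <layover duration="90" durationtype="minutes">Orlando</layover>
      <cost>95</cost>
    </connecting>
  </ny_mia_flights>
</flightdata>"""

root = ET.fromstring(MFA_SAMPLE)

# The path mirrors the nesting of the elements, so no text guessing is needed.
direct_cost = int(root.findtext("ny_mia_flights/direct/cost"))
connecting_cost = int(root.findtext("ny_mia_flights/connecting/cost"))

# How much more the direct flight costs than the connecting one.
difference = direct_cost - connecting_cost
print(direct_cost, connecting_cost, difference)
```

Unlike the dollar-sign search on the XHTML page, the lookup depends only on element names and nesting, not on how the author worded the page.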

Usable data is shipped with the Web page, so you can calculate how much more a direct flight would have cost while you're stranded on the runway at Orlando on your connecting flight in the middle of a hurricane. XML is great for customers, and it makes Web sites easier to build and maintain. If My Favorite Airlines used XML for its data-driven Web site, the company could use the same applet to calculate the total fare on every page. And should the tax rate change, with appropriately structured XML you need only update the database, not every HTML page. XML is also extensible, so My Favorite Airlines can add a new element for meal preference (for example, <meal>Vegetarian</meal>) without disrupting the rest of the site.
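The extensibility claim can be sketched as follows. This is an assumption-laden example: the text does not say where the <meal> element would go, so placing it inside <direct> is a guess, and the direct_cost helper is hypothetical.

```python
import xml.etree.ElementTree as ET

def direct_cost(xml_text: str) -> int:
    """Return the cost of the direct flight from a flightdata document."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("ny_mia_flights/direct/cost"))

ORIGINAL = (
    "<flightdata><ny_mia_flights>"
    "<direct><cost>195</cost></direct>"
    "</ny_mia_flights></flightdata>"
)

# Same data with a new <meal> element added (its position is an assumption).
EXTENDED = (
    "<flightdata><ny_mia_flights>"
    "<direct><meal>Vegetarian</meal><cost>195</cost></direct>"
    "</ny_mia_flights></flightdata>"
)

# Code written before <meal> existed keeps working on the extended document.
print(direct_cost(ORIGINAL), direct_cost(EXTENDED))
```

Because the lookup addresses <cost> by name, elements it does not know about are simply ignored.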

Document Type Declarations

For any given application, however, elements occurring in a completely arbitrary order are meaningless. Consider the flight data example in Listing 1-3. Would the contents of the following mfa-bad.xml be meaningful?

Listing 1-3 mfa-bad.xml: A bad example of using XML.

<flightdata>
  <meal>
    <layover> orlando  vegetarian </layover>
  </meal>
  <cost>
    <direct> 195 </direct>
  </cost>
</flightdata>

This example document is so far outside the bounds of what we expect that it's absurd. It doesn't mean anything, as you can see in Figure 1-3. From a strictly syntactic point of view, however, there's nothing wrong with this document. Therefore, if the document is to have meaning, and certainly if you need an application to process it, there must be some constraint on the sequence and nesting of tags. These constraints can be expressed in a Document Type Definition (DTD) or the newer XML Schemas (XSD).
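To make the distinction concrete, the sketch below shows that a parser accepts mfa-bad.xml, while a simple application-level rule rejects it. The ALLOWED_CHILDREN table and the nesting_errors function are hypothetical stand-ins for what a DTD or XML Schema would express. Python's standard library does not validate against either, and this check covers nesting only, not sequence.

```python
import xml.etree.ElementTree as ET

MFA_BAD = """<flightdata>
  <meal>
    <layover> orlando  vegetarian </layover>
  </meal>
  <cost>
    <direct> 195 </direct>
  </cost>
</flightdata>"""

# Hypothetical rule table: which child elements each parent may contain.
ALLOWED_CHILDREN = {
    "flightdata": {"ny_mia_flights"},
    "ny_mia_flights": {"direct", "connecting"},
    "direct": {"cost"},
    "connecting": {"layover", "cost"},
    "layover": set(),
    "cost": set(),
}

def nesting_errors(element):
    """Recursively collect every child that its parent does not allow."""
    errors = []
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        errors.extend(nesting_errors(child))
    return errors

root = ET.fromstring(MFA_BAD)   # parsing succeeds: the syntax is fine
errors = nesting_errors(root)   # the structure is what is wrong
for message in errors:
    print(message)
```

A declarative DTD or schema replaces this kind of hand-written checking and lets any validating parser enforce the same rules.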

Figure 1-3 Meaninglessly defining our content in a semi-XML format.

XML 1.0 and XML Schema

The W3C released the XML 1.0 Recommendation in February 1998. The full text can be accessed at http://www.w3.org/TR/1998/REC-xml-19980210. The W3C issued an update, with minor corrections, in October 2000, which is located at http://www.w3.org/TR/2000/REC-xml-20001006.

The XML Schema specification reached full W3C Recommendation status in May 2001. It has two normative parts. Part 1 (Structures) is at http://www.w3.org/TR/xmlschema-1/. Part 2 (Datatypes) is at http://www.w3.org/TR/xmlschema-2/. In addition to the two normative documents, the W3C has provided a useful non-normative Primer to XML Schema, Part 0, at http://www.w3.org/TR/xmlschema-0/.

Generally, DTDs and XML Schemas allow a document to communicate metadata to the parser about its content. Meta-information includes the allowed sequence and nesting of tags, attribute values and their types and defaults, the names of external files that might be referenced and whether or not they contain XML, the formats of some external (non-XML) data that might be referenced, and the entities that might be encountered.

Well-Formed and Valid Documents

There are two categories of XML documents: well-formed and valid. A document is well-formed only if it obeys the syntax of XML. A document that includes sequences of markup characters that cannot be parsed or are invalid cannot be well-formed. In addition, the document must meet all of the following conditions:

There can be one, and only one, root element.
All tags that are opened must be closed.
Tag names are case-sensitive.

Additional rules include:

No attribute can appear more than once on the same start-tag.
Non-empty tags must be properly nested.
Parameter entities must be declared before they are used.

If a document is not well-formed, it is not XML. This means that all XML documents are well-formed, and XML processors are not required to do anything with documents that are not.
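Several of the rules listed above can be checked with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module, which raises ParseError on input that is not well-formed. The is_well_formed helper and the sample snippets are illustrative, not taken from the airline example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = {
    "ok": "<a><b/></a>",
    "two root elements": "<a/><b/>",
    "unclosed tag": "<a><b></a>",
    "case mismatch": "<a></A>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "improper nesting": "<a><b></a></b>",
}

# Only the first sample obeys all of the rules.
results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

This checks well-formedness only. Whether a document is valid against a DTD or XML Schema is a separate question that this parser does not answer.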

A well-formed document is valid only if it contains or refers to a proper DTD or XML Schema and if the XML document obeys the constraints of that DTD or XML Schema (that is, element sequence and nesting are valid, required attributes are provided, attribute values are of the correct type, and so forth). We will talk more about validity in Chapter 2.