24.1. A Bit of Background

XML and HTML are called markup languages because of the way they add structure to plain-text documents: by surrounding parts of the text with tags that indicate structure or meaning, much as someone with a pen might highlight a sentence and add a note. While HTML predefines a set of tags and their structure, XML is a blank slate in which the author gets to define the tags, the rules, and their meanings.

Both XML and HTML owe their lineage to Standard Generalized Markup Language (SGML), the mother of all markup languages. SGML has been used in the publishing industry for many years (including at O'Reilly). But it wasn't until the Web captured the world that it came into the mainstream through HTML. HTML started as a very small application of SGML, and if HTML has done anything at all, it has proven that simplicity reigns.

HTML flourished but eventually showed its limitations. Documents using HTML have an unhealthy mix of both structural information (such as <head> and <body>) and presentation information (for an egregious example, <blink>). Mixing the model and the user interface in this way limits the usefulness of HTML as a format for data exchange; it's hard for a machine to understand. XML documents consist purely of structure, and it is up to the reader of the document to apply meaning. As we'll see in this chapter, several related languages exist to help interpret and transform XML for presentation or further processing.

24.1.1. Text Versus Binary

When Tim Berners-Lee began postulating the Web back at CERN in the late 1980s, he wanted to organize project information using hypertext.^[1] When the Web needed a protocol, HTTP (a simple, text-based client-server protocol) was invented. So, what exactly is so enchanting about the idea of plain text? Why, for example, didn't Tim turn to the Microsoft Word format as the basis for Web documents? Surely a binary, non-human-readable format and protocol would be more efficient? Since the Web's inception, there have been literally trillions of HTTP transactions. Was it really a good idea for them to use (English) words like "GET" and "POST"?

^[1] To read Berners-Lee's original proposal to CERN, go to http://www.w3.org/History/1989/proposal.html.

The answer, as we've all seen, is yes! What humans can read, human developers can work with more easily. There is a time and place for a high level of optimization (and obscurity), but when the goal is universal acceptance and cross-platform portability, simplicity and transparency are paramount. This is the first, fundamental proposition of XML.

24.1.2. A Universal Parser

Using text to exchange data is not exactly a new idea, either, but historically, for every new document format that came along, a new parser would have to be written. A parser is an application that reads a document and understands its formatting conventions, usually enforcing some rules about the content. For example, the Java Properties class has a parser for the standard properties file format (Chapter 11). In our simple spreadsheet in Chapter 18, we wrote a parser capable of understanding basic mathematical expressions. As we've seen, depending on complexity, parsing can be quite tricky.
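As a reminder of what a format-specific parser looks like from the caller's side, here is a minimal sketch that uses the Properties class to read the standard properties file format from a string. The class name PropertiesParseSketch and the host and port keys are hypothetical, chosen only for illustration; the load(Reader) method assumes Java 6 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class; not part of the book's example code.
class PropertiesParseSketch {

    // Parses text in the standard properties format (key=value lines, # comments).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            // Properties.load(Reader) contains the parser for this one format.
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Made-up settings, used only to exercise the parser.
        String text = "# sample settings\nhost=example.com\nport = 8080\n";
        Properties props = parse(text);
        System.out.println(props.getProperty("host"));
        System.out.println(props.getProperty("port"));
    }
}
```

The parser inside Properties understands only this one format, which is exactly the limitation the paragraph above describes: each new format has historically needed its own parsing code.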

With XML, we can represent data without having to write this kind of custom parser. This isn't to say that it's reasonable to use XML for everything (e.g., typing math expressions into our spreadsheet), but for the common types of information that we exchange on the Net, we should no longer have to write parsers that deal with basic syntax and string manipulation. In conjunction with document-verifying components (Document Type Definitions [DTDs] or XML Schema), much of the complex error checking is also done automatically. This is the second fundamental proposition of XML.
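To make this concrete, the following sketch reads a small XML document with the DOM parser bundled with Java and pulls out the text of one element, without any hand-written syntax handling. The class name DomParseSketch and the inventory, animal, and name elements are assumptions made up for this illustration; getTextContent() assumes Java 5.0 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the book's example code.
class DomParseSketch {

    // Returns the text of the first element named tagName, or null if none exists.
    static String firstText(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The bundled parser handles tags, nesting, and well-formedness checks.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed XML surfaces here as a SAXException.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up document, used only to exercise the parser.
        String xml = "<inventory><animal><name>Gonzo</name></animal></inventory>";
        System.out.println(firstText(xml, "name"));
    }
}
```

Note that the sketch still has to interpret the element's content as a string on its own; the parser takes care of syntax, not meaning.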

24.1.3. The State of XML

The APIs we'll discuss in this chapter are powerful and popular. They are being used around the world to build enterprise-scale systems today. Unfortunately, the current slate of XML tools bundled with Java only partially removes the burden of parsing from the developer. Although we have taken a step up from low-level string manipulation to a common, structured document format, the standard tools still generally require the developer to write relatively low-level code to traverse the content and interpret the string data manually. The resulting programs remain somewhat fragile, and much of the work can be tedious. The next step, as we'll discuss later in this chapter, is to begin to use generating tools that read a description of an XML document (an XML DTD or Schema in some form) and generate Java classes or bind existing classes to XML data automatically. As these APIs grow, so does their complexity, and XML begins to lose some of its charm. Nonetheless, the promise of a seamless blending of XML documents and Java objects is one worth pursuing.

24.1.4. The XML APIs

In Java 1.4, all the basic APIs for working with XML were bundled with the standard release of Java. This included the javax.xml standard extension packages for working with Simple API for XML (SAX), Document Object Model (DOM), and Extensible Stylesheet Language (XSL) transforms. In Java 5.0, several enhancements were made and some important new APIs were introduced, including validation, XPath, and XInclude. These new APIs expose more of the power of XML through Java in a completely portable way. If you are using an older version of Java, you can still use many of these tools, but you will have to download packages separately from http://java.sun.com/xml/ or find alternative implementations.
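As a small taste of the Java 5.0 additions, here is a minimal sketch that uses the javax.xml.xpath package to select a value from a document with a single XPath expression. The class name XPathSketch and the sample document and expression are hypothetical, invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class XPathSketch {

    // Evaluates an XPath expression against the XML text and returns the result as a string.
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // This overload parses the source and evaluates the expression in one call.
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up document, used only to exercise the API.
        String xml = "<inventory>"
            + "<animal class=\"mammal\"><name>Song Fang</name></animal>"
            + "<animal class=\"bird\"><name>Cocoa</name></animal>"
            + "</inventory>";
        System.out.println(evaluate(xml, "/inventory/animal[@class='bird']/name"));
    }
}
```

Compared with walking the document tree by hand, the expression states which node is wanted and leaves the traversal to the library.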

24.1.5. XML and Web Browsers

Microsoft Internet Explorer (IE) was the first web browser to support XML explicitly. If you load an XML document in IE 5.0 or greater, it is displayed as a tree using a special stylesheet. The stylesheet uses dynamic HTML to allow you to collapse and expand nodes (like an outline) while viewing the document. Displaying the XML is mainly for debugging, but IE also supports client-side XSL transformation directly in the browser. XSL is a language for transforming XML into other documents; we'll talk about it later in this chapter. Recent versions of Netscape and Firefox support XML viewing and transformation as well.
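The same kind of transformation a browser performs can be run from Java with the javax.xml.transform API. The following is only a rough sketch under stated assumptions: the class name XslTransformSketch, the animal document, and the tiny stylesheet are all made up for illustration, and the stylesheet emits plain text rather than the HTML a browser would typically render.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of the book's example code.
class XslTransformSketch {

    // Applies the XSL stylesheet text to the XML text and returns the output.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up stylesheet: writes the animal's name as plain text.
        String xsl =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">Name: "
            + "<xsl:value-of select=\"/animal/name\"/></xsl:template>"
            + "</xsl:stylesheet>";
        String xml = "<animal><name>Gonzo</name></animal>";
        System.out.println(transform(xml, xsl));
    }
}
```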

A few exceptions remain. At the time of this writing, on the otherwise stellar Mac OS X operating system, the native Safari browser does not display XML in this way, nor does it apply stylesheets for transformation. Browsers that do not explicitly format XML for viewing simply display the text of the document with all the tags (structural information) stripped off. This is the prescribed behavior for working with unknown XML markup in a viewing environment. Remember that you can always use the "view source" option to display the text of a file in your browser, if you want to see the original source.