9.1 The Origins of XML | Internet-Enabled Business Intelligence



	Internet-Enabled Business Intelligence By William A. Giovinazzo

	Chapter 9. eXtensible Markup Language

XML is more than YAML (Yet Another Markup Language). Figure 9.1 presents the XML family tree. The XML standard shares the same heritage as HyperText Markup Language (HTML); both are descendants of the Standard Generalized Markup Language (SGML). XML, while combining many features of both its parent and its sibling, overcomes their drawbacks. Where SGML is complex, XML is simple. Where HTML is limited, XML is powerful. To understand this difference, let's begin our examination of XML by looking at the entire family.

Figure 9.1. Markup languages family tree.


SGML became an international standard in 1986. It is described in ISO 8879, published by the International Organization for Standardization (ISO). You may recall from Chapter 6 that ISO is one of our friendly standards bodies. Like other markup languages, SGML lets document designers define their own document formats independent of the destination device or system. The strength of SGML is that it can be used to format large, complex documents or vast repositories of information. It does more than simply describe a visual image; it actually creates a structure for the document. This makes the standard ideal for large mission-critical systems. Companies that produce large documents are able to take advantage of SGML's power and flexibility. The disadvantage is that the ability to handle these large, complex documents and repositories has made the SGML standard itself large and complex.

As we have seen, cost is typically a by-product of complexity. The complexity of SGML made the standard the language of large corporations. The required initial investment for SGML document processing was too large for smaller organizations. The issue, therefore, remained: How do companies, especially smaller organizations, deliver information over the Web, an environment with a plethora of destination system types? The easiest solution to this problem was to create a subset of SGML specifically designed for the World Wide Web. This subset should be simple yet flexible enough to display the information in an independent manner. This was the birth of HTML.

Tim Berners-Lee first developed HTML at CERN, following his 1989 proposal for what became the World Wide Web. The greatest difference between HTML and SGML lay in their objectives. HTML simply created a way to express how a document's text and images were to be displayed. It did not provide a structure to the document, as did SGML. Note that HTTP, the protocol used to transmit HTML, is concerned with the transmission not just of documents but of hypertext documents. These are documents with links to other parts of the same document as well as to other documents.

This simplicity made HTML inexpensive enough for any organization to create and transmit hypertext documents. A hypertext document was not all that different from a word processing document with embedded tags. This simplification also had its liabilities. HTML, while simple, lacked structure. Without the ability to distinguish between the different parts of a document, systems were unable to manipulate the subsections. Simple tasks such as numbering section headings or locating section titles in a document became much more difficult. As we shall see later, this lack of structure created other issues that could not be resolved by HTML. While one might be tempted to call this a deficiency of the language, it must be remembered that structure was not necessary to HTML's objectives.

As we can see in Figure 9.1, the XML standard is not a replacement for SGML or HTML but a complement to them. The objectives of the XML metalanguage recognize SGML's complexity and structure as well as HTML's simplicity and lack of structure. XML defined three simple objectives: extensibility, structure, and validation. One might call the XML standard the Goldilocks of markup languages. SGML is too complex, HTML is too simple, but XML is just right.

As the name implies, XML is an extensible markup language. HTML has a fixed set of tags and attributes, which are defined by the World Wide Web Consortium (W3C). XML, on the other hand, allows the definition of tags. Document authors can create new tags and attribute names as they see fit. This empowers XML designers to create sets of tags that address specific needs. Examples of such extensions include the Chemical Markup Language (CML) for the chemical industry and the Mathematical Markup Language (MathML), which provides tags and attributes for the representation of math formulas. Other extensions range from the Resource Description Framework (RDF), which provides for the integration of metadata, to the XML User Interface Language (XUL), which provides for the customization of user interfaces.
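
As a minimal sketch of what extensibility means in practice, the following Python snippet parses a small document whose tags were invented for the occasion. The tag names (molecule, atom) and attribute names (name, element, count) are hypothetical, chosen in the spirit of a chemical vocabulary, and are not taken from CML itself. The parser is Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical author-defined vocabulary; these tags are illustrative
# and are NOT taken from the actual CML specification.
DOCUMENT = """
<molecule name="water">
  <atom element="H" count="2"/>
  <atom element="O" count="1"/>
</molecule>
"""

def formula(xml_text):
    """Return the molecule name and a simple formula string."""
    root = ET.fromstring(xml_text)
    parts = []
    for atom in root.findall("atom"):
        count = int(atom.get("count", "1"))
        # Omit the count when it is 1, as chemists usually do.
        parts.append(atom.get("element") + (str(count) if count > 1 else ""))
    return root.get("name"), "".join(parts)

name, text = formula(DOCUMENT)
print(name, text)
```

The parser needs no prior knowledge of these tags. The meaning of molecule and atom lives entirely in the application that reads them, which is what lets a designer introduce a vocabulary for a specific need.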

The second objective of XML is that of structure. Eskimos have hundreds of different words for snow, but none for palm trees. The simplicity of HTML makes it wonderful for the construction of Web pages, which is the intent of the language. Web pages, however, don't necessarily require any real structure. Structure to HTML is like a palm tree to an Eskimo. XML makes it possible to support structures such as hierarchies and data associations. We shall see in Chapter 10 how XML is used by the Common Warehouse Metadata Interchange (CWMI) to provide such a structure for the exchange of metadata.

Structure allows a document to be divided into its component parts, making it much easier to process. Consider a book, for example. It is composed of sections, chapters, paragraphs, and diagrams. Breaking these apart for storage in a database becomes a much simpler task if the structure of the document is understood. It is also possible to take parts of a document and process them differently. Perhaps a report designer would like to send the executive summary to the information portals of the organization's C-level executives, with a link to the actual report stored somewhere else on the Web. It may be desirable to store the contents of the report in a database, where specific parts, such as the executive summary or conclusions, can be viewed periodically and all the detail archived elsewhere. If we assume that the report is generated frequently, the task becomes burdensome with an unstructured document. In a structured environment, the component parts of the report can be distributed as the author sees fit. This structure also allows the author to store these component parts in a database.
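
The report scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (report, summary, section, conclusions) and the title attribute are hypothetical, and a real report vocabulary would be defined by its author.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout; the element names are assumptions
# chosen for illustration.
REPORT = """
<report title="Quarterly Sales">
  <summary>Revenue grew in all regions.</summary>
  <section title="Method">Figures come from the warehouse.</section>
  <section title="Results">Growth was strongest in the West.</section>
  <conclusions>Continue the current strategy.</conclusions>
</report>
"""

def split_report(xml_text):
    """Return the summary, the conclusions, and numbered section titles."""
    root = ET.fromstring(xml_text)
    summary = root.findtext("summary")
    conclusions = root.findtext("conclusions")
    # Numbering headings is simple once sections are explicit elements.
    headings = [
        f"{number}. {section.get('title')}"
        for number, section in enumerate(root.findall("section"), start=1)
    ]
    return summary, conclusions, headings

summary, conclusions, headings = split_report(REPORT)
print(summary)
print(headings)
```

Because each part is a named element, the summary can be routed to a portal and the sections stored elsewhere without any guessing about where one part ends and the next begins.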

The third objective in the design of XML is that of validation. HTML does not have a way for the application working with the HTML document to validate the syntax of the document it receives. When we think about it, this makes sense. HTML has no need for validation. HTML documents are either valid HTML documents or they are not. The originating author should be able to easily validate the document. XML, however, is extensible. The author of an XML document can extend XML to fit his or her particular needs. As such, the application receiving the XML document must be able to validate the document.

There are two levels of validation. The first is what is known as simply a well-formed document. A well-formed document defines itself. This is the case where the author has defined his or her own set of tags, which are used within the document. The second level of validation is that of being valid. A valid document strictly complies with the markup and syntax of a particular environment. Although it is possible for this environment to be defined within the document, as we shall see in the next section, it is not necessary to do so. While this may seem a bit nebulous, we will elaborate on document validation later in this chapter.
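
The two levels can be illustrated with a short Python sketch. The first function checks well-formedness using the standard xml.etree.ElementTree parser, which reports a ParseError on malformed input. That parser does not validate against a document definition, so the second function is only a hand-written stand-in for the idea of validity: the ALLOWED_CHILDREN table and the order and item tags are hypothetical and do not represent a real validation mechanism.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical rules standing in for an external document definition:
# each known tag maps to the set of child tags it may contain.
ALLOWED_CHILDREN = {"order": {"item"}, "item": set()}

def conforms(xml_text):
    """Return True if every element and child is permitted by the rules."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        if parent.tag not in ALLOWED_CHILDREN:
            return False
        for child in parent:
            if child.tag not in ALLOWED_CHILDREN[parent.tag]:
                return False
    return True

print(is_well_formed("<order><item>widget</item></order>"))
print(is_well_formed("<order><item>widget</order></item>"))
print(conforms("<order><note>rush</note></order>"))
```

The last document is well formed, since its tags nest properly, yet it fails the second check because note is not part of the assumed vocabulary. That is the distinction between the two levels.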

