XML is a mechanism for describing data. It can be used with any type of data, irrespective of the nature of the data. XML describes data using markup language conventions. The term "markup," in this context, has its roots in pre-electronic document publishing (i.e., prior to the mid-1980s). In that era, authors, editors, and proofreaders used a standardized set of marks (typically written in red) to make all necessary corrections to a manuscript, as well as to convey formatting instructions to the typesetter. Figure 2.4 shows some of the editing and proofreading marks that were once widely used, and Figure 2.5 shows a facsimile of a book manuscript containing edit markups.
Markup was a way to add descriptive and instructional annotation around the content of a document without overtly interfering with the original content. This is the motivation and philosophy behind today's electronic markup languages: the ability to annotate a document (e.g., a Web page) for a particular purpose (e.g., formatting or collaborative review) without impacting the content of that document. Most of today's electronic markup languages have evolved from the Standard Generalized Markup Language (SGML), which grew out of the Generalized Markup Language (GML) developed at IBM.
HyperText Markup Language (HTML) is by far the most widely known and most widely used of the contemporary electronic markup languages. HTML is a classic markup language in the true sense of the term. It comes with its own particular set of markups (e.g., <B> for bold text, <H1> for a level 1 heading, <P> for a paragraph, and so forth). Though it could have been used to do more, HTML from the start gravitated toward being a formatting-specific markup language. It is an electronic version of the markup notation previously used to instruct typesetters on how to lay out and present the contents of a manuscript. HTML, as such, has remained true to the roots of markup languages and has propagated their legacy.
XML, on the other hand, is not a true markup language in the conventional sense. For one thing, it does not come with its own set of markups, as HTML does. Instead, XML is a meta-markup language. It allows one to create markup languages to address any particular need. There are no restrictions as to what can be marked up (i.e., annotated) with XML. In essence, it allows you to create application-specific markups. This is what the "extensible" part of its name alludes to. It affirms that with XML there are no preset bounds as to what you can deal with. It is flexible and pliable. Thus, you could use XML, à la HTML, to describe the layout and presentation of data. There is even an emerging XML-based, W3C-sanctioned standard known as Extensible Stylesheet Language (XSL), discussed later in this chapter, that does exactly this; that is, it provides an XML vocabulary for specifying data (or document) formatting semantics.
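To make the idea of application-specific markup concrete, the following is a minimal Python sketch that builds a small document in an invented vocabulary. The element and attribute names (`invoice`, `customer`, `total`, `currency`) and the helper `build_invoice` are hypothetical, chosen only for illustration; XML itself defines none of them, which is precisely the point.

```python
import xml.etree.ElementTree as ET

def build_invoice(number, customer, total, currency):
    """Build an element tree in a made-up 'invoice' vocabulary (hypothetical names)."""
    invoice = ET.Element("invoice", {"number": number})
    ET.SubElement(invoice, "customer").text = customer
    amount = ET.SubElement(invoice, "total", {"currency": currency})
    amount.text = total
    return invoice

# Sample values are illustrative only.
invoice = build_invoice("1001", "Acme Corp", "250.00", "USD")

# Serialize the tree to a plain text string of XML.
xml_text = ET.tostring(invoice, encoding="unicode")
print(xml_text)
```

A different application could just as easily define `patient`, `order`, or `stanza` elements; nothing in XML privileges one vocabulary over another.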
XML is thus a generalized, no-holds-barred meta-markup language for describing data. Though it can be used with any type of data, it is most effective in describing data that have some type of structure associated with them. Much of the data used by people and computers have some level of innate structure, particularly if they are thought of in terms of a document. A poem, as shown in Figure 2.2, has a structure made of stanzas and lines. A book, at a minimum, is made of chapters, chapter headings, and paragraphs. A spreadsheet, as illustrated in Figure 2.3, is made up of cells.
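As a rough illustration of how such innate structure might be expressed, the Python sketch below marks up a placeholder poem as stanzas and lines and then walks that structure. The tag names (`poem`, `stanza`, `line`) and the placeholder text are assumptions made for this example, not a vocabulary taken from the figures.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a two-stanza poem; tag names are illustrative only.
POEM_XML = """
<poem title="Untitled">
  <stanza>
    <line>First line of the first stanza</line>
    <line>Second line of the first stanza</line>
  </stanza>
  <stanza>
    <line>First line of the second stanza</line>
  </stanza>
</poem>
""".strip()

poem = ET.fromstring(POEM_XML)

# The nesting of elements mirrors the structure of the poem itself.
stanzas = poem.findall("stanza")
lines_per_stanza = [len(stanza.findall("line")) for stanza in stanzas]

print(len(stanzas), lines_per_stanza)
```

The same approach would apply to a book (chapters, headings, paragraphs) or a spreadsheet (rows and cells): the element hierarchy records the structure, independent of how the data are later displayed.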
Consequently, one can think of any document as consisting of data that are structured in some manner. The information contained in a document is invariably made up of content (e.g., text, graphics) and context (e.g., headings, tables, captions). XML excels in describing this type of structured data. XML describes the context of the data relative to a document, where "document" is just an arbitrary and generic file (or even just a placeholder) containing the data in question. The bottom line here is that XML deals with the context of data. HTML, by marked contrast, deals purely with how data should be formatted and presented for visual consumption. This fundamental difference between XML and HTML will be demonstrated further in Sections 2.3 and 2.4. For the time being, it suffices to note that XML was never intended to be just an enhanced version of HTML.
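The contrast between context and formatting can be sketched in a few lines of Python. Both fragments below are hypothetical and carry the same three facts; the HTML fragment is deliberately written to be well-formed so that the same XML parser can read it, which real-world HTML does not guarantee.

```python
import xml.etree.ElementTree as ET

# The same facts marked up two ways. Both fragments are illustrative only.
AS_XML = "<book><title>Moby-Dick</title><author>Herman Melville</author><year>1851</year></book>"
AS_HTML = "<p><b>Moby-Dick</b> <i>Herman Melville</i> 1851</p>"

def tag_names(markup):
    """Return the element names of a well-formed fragment in document order."""
    return [element.tag for element in ET.fromstring(markup).iter()]

xml_tags = tag_names(AS_XML)    # names describe what each value is
html_tags = tag_names(AS_HTML)  # names describe only how each value looks

# With contextual markup, a program can ask for a value by its meaning.
author = ET.fromstring(AS_XML).findtext("author")

print(xml_tags, html_tags, author)
```

In the XML version a program can request the author directly; in the HTML version it can only learn that some text is italic and would have to guess what that text means.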
XML, though a meta-markup language as opposed to a specific markup language in its own right, is nonetheless derived from SGML. SGML became an ISO standard (ISO 8879) in 1986. SGML has become the de facto standard for defining the structure of different types of electronic documents. It has been widely used by the U.S. military, the U.S. government, and the aerospace industry over the last decade. HTML is also a derivative of SGML. SGML, by design, is very detailed, powerful, and complex. It was too unwieldy to be easily adopted for the Web and e-business.
XML is, in essence, SGML lite. XML retains enough of the SGML functionality to make it useful and powerful but removes many of the optional and redundant features that can make SGML somewhat convoluted and unwieldy.
Before moving on, it is salutary to list some of the things that XML is not and that XML cannot do, just to dispel any lingering misconceptions. XML is not a programming language, a database scheme, or a networking protocol. XML documents can be, and are, transported across networks using standard, widely used protocols, such as FTP, HTTP(S), SOAP, and SMTP. Since they are standard text documents, there are really no limitations or caveats as to how they can be transported, installed, or viewed. Databases (e.g., DB2 V7) will support XML data and even permit their data to be described in XML form, but the database itself will not take the form of a large XML document.
The bottom line is that XML really does address what has been the Holy Grail of networking right from the very early days: universal data interchange that transcends platforms, networking protocols, and programming languages. Now, with the Web providing global connectivity, XML does what previous networking schemes (e.g., IBM's SNA and OSI) always wanted to provide: a consistent, universal data interchange capability, without barriers.