Introducing Markup Languages

You can use markup languages to manage structured documents in a standard format. To "mark up" a document means to embed text within the content of the document in order to perform procedures on the data and convey information about it. Today's markup languages are based on the procedural markup languages of the 1960s, when typesetters used proprietary coding for fine control over the font, size, and spacing of printed copy, with the intention of formatting documents destined for paper. For example, some book publishers required authors to submit their manuscripts using proprietary procedural markup tags embedded in the text. Procedural coding leaves the task of formatting to the publisher, who uses proprietary software to process the tags embedded in the document. This enables the author to concentrate on the content of the document.

Figure 7-1 gives a sample excerpt from the beginning of this book.

Figure 7-1. A Sample Book Excerpt

Note

Common word processing packages, such as Microsoft Word or Corel WordPerfect, use procedural markup internally to format documents. Each package has its own set of proprietary procedural markup tags for formatting, which is why you normally require a conversion tool to read a document created by one package in the other.

To produce the formatting in Figure 7-1 using procedural markup, you can insert the imaginary procedural marks !SkipLine=n!, !Center!, !Bold!, and !Indent=n! into the text to provide the correct formatting instructions to a word processor, browser, or printer. As you can see from Figure 7-2, each procedural markup tag specifies a particular procedure to apply to the text that it references.

Figure 7-2. A Basic Procedural Markup Example
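
As a rough illustration of how a program might act on such marks, the following Python sketch interprets the chapter's imaginary !SkipLine=n!, !Center!, !Bold!, and !Indent=n! marks for a plain-text device. The function name render, the 40-column width, and the use of uppercase as a stand-in for bold are assumptions made for this example only.

```python
import re

# The four imaginary procedural marks used in this chapter.
MARK = re.compile(r"!(SkipLine|Center|Bold|Indent)(?:=(\d+))?!")

def render(line, width=40):
    """Apply each procedural mark found on one line, in the order written."""
    text = MARK.sub("", line).strip()   # the content with the marks removed
    blank_lines = ""
    for name, arg in MARK.findall(line):
        if name == "SkipLine":
            blank_lines += "\n" * int(arg or 1)
        elif name == "Indent":
            text = " " * int(arg or 1) + text
        elif name == "Bold":
            text = text.upper()         # assumed stand-in for bold on a text device
        elif name == "Center":
            text = text.center(width).rstrip()
    return blank_lines + text

print(render("!SkipLine=2!!Center!!Bold!Introducing Markup Languages"))
print(render("!Indent=4!You can use markup languages to manage documents."))
```

Because every mark names a formatting procedure, an output device with a different width or without a notion of bold needs either a different interpreter or a document that is marked up again.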

Descriptive markup languages, also known as generic coding, differ from procedural markup languages by describing the structure of the document, leaving the parsers (programs used to display, print, or store the information) to perform the desired procedure. For example, you can use exactly the same text document for printing, monitor display, or even Braille. In contrast, with procedural markup, you would require a separate document for each.
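
The following Python sketch shows the idea of one descriptive document feeding several output procedures. It uses XML syntax as a convenient stand-in for descriptive markup. The element names and the two renderer functions are hypothetical and are not taken from the book's figures.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive document: it says what each part is, not how it looks.
DOC = """<chapter title="Introducing Markup Languages">
  <goal>Understand procedural markup</goal>
  <goal>Understand descriptive markup</goal>
</chapter>"""

def to_plain_text(root):
    """One possible procedure: a text-only rendering for a terminal or printer."""
    lines = [root.get("title").upper()]
    lines += ["  * " + goal.text for goal in root.findall("goal")]
    return "\n".join(lines)

def to_html(root):
    """Another procedure applied to the very same document: an HTML rendering."""
    items = "".join(f"<li>{goal.text}</li>" for goal in root.findall("goal"))
    return f"<h1>{root.get('title')}</h1><ul>{items}</ul>"

root = ET.fromstring(DOC)
print(to_plain_text(root))
print(to_html(root))
```

The document itself never changes. Only the program that interprets the structure differs for each output.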

Procedural markup languages have the following disadvantages compared with descriptive languages for publishing documents on the web:

Inflexible: The number of commands needed to indicate how the text should be formatted (skip line, indent, and so on) is often too large to use effectively. For example, most word processors internally use thousands of tags, which are hidden from you. This number of tags is unmanageable for web publishing.
Requires Multiple Document Formats: If you require different styles of documents, the text needs to be marked up again for each style. For example, if a document destined for a printer needs to be displayed on a monitor, a new marked-up document is necessary, because even the slightest offset of margins requires reformatting.

Stanley Rice and William Tunnicliffe first conceived of the separation of content from its formatting in 1967. The first formal descriptive markup language, GenCode, was also the first general markup specification for the typesetting industry. GenCode recognized the need for different codes for different types of documents. Together, Rice and Tunnicliffe formed the Graphic Communications Association (GCA) GenCode Committee to further develop their ideas into a nonproprietary generic coding markup standard.

IBM then took the ideas behind GenCode to produce Generalized Markup Language (GML, also the initials of its creators, Charles Goldfarb, Edward Mosher, and Raymond Lorie) in 1969, to organize IBM's legal documents into a searchable form. GML automated functions performed specifically on IBM's legal documents. The ANSI committee on Computer Languages for the Processing of Text expanded GML in 1980 into the Standard Generalized Markup Language (SGML). In 1984, the International Organization for Standardization (ISO) joined the ANSI committee, and in 1986 SGML was published as an international standard (ISO 8879) based on the sixth working draft. The first important users of the standard were the US Internal Revenue Service (IRS) and Department of Defense (DoD).

Figure 7-3 illustrates how to mark up the example described previously using SGML.

Figure 7-3. A Basic Structural Markup Example Using SGML

The tags shown in Figure 7-3 group the document into elements, such as <Book>, <Chapter>, and <Goal>. Each element has the general form of a start-tag (for example, <Book>), followed by the content, followed by an end-tag (for example, </Book>). The title of the <Book> element is an attribute of the element, as opposed to being an element itself.
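
To make the distinction between elements and attributes concrete, the following Python sketch parses a small document that uses the <Book>, <Chapter>, and <Goal> tags. The sample content is hypothetical. An XML parser is used here as an assumption of convenience, because the Python standard library does not include an SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample using the element names mentioned in the text.
SAMPLE = """<Book title="Sample Book">
  <Chapter>
    <Goal>Describe procedural markup</Goal>
    <Goal>Describe structural markup</Goal>
  </Chapter>
</Book>"""

book = ET.fromstring(SAMPLE)

# The title is an attribute stored on the Book element itself.
book_title = book.get("title")

# Goal is an element: it has its own start-tag, content, and end-tag.
goals = [goal.text for goal in book.iter("Goal")]

print(book.tag, book.attrib)
print(goals)
```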

Because the elements, attributes, and overall structure of Figure 7-3 are specific to the application, you should formally define them for the application. This need for formally declaring elements, attributes, and structure gave birth to the Document Type Definition (DTD) file, against which you can validate markup language files. DTD files define the syntax of the markup language. That is, they declare all tags within the markup file and specify the order in which they should appear, which ones are optional or repeatable, and how they must be nested. You can use DTD files to establish portability and interoperability and to exchange data between organizations with different file formats.

The DTD file for the SGML sample in Figure 7-3 is in Table 7-1. The <!ELEMENT> declaration defines each element and is followed by an <!ATTLIST> declaration if the element has any attributes (for example, title and author).

Table 7-1. A Sample Book.dtd DTD File
DTD Information	Description
<!ELEMENT book (chapter+)> <!ATTLIST book title CDATA #REQUIRED author CDATA #IMPLIED>	The book element contains one or more chapter elements; the + indicates one or more occurrences. The book element has a required title attribute and an optional (#IMPLIED) author attribute.
<!ELEMENT chapter (chaptertitle,chaptergoals?,section+)>	Each chapter contains one chapter title, an optional "Chapter Goals" area (as indicated by the question mark), and one or more section elements.
<!ELEMENT chaptertitle (#PCDATA)>	The chapter title contains parsed character data (#PCDATA), that is, text that the parser scans for markup such as entity references. Declared this way, the element holds text only, with no child elements.
<!ELEMENT chaptergoals (goal+)> <!ATTLIST chaptergoals title CDATA #REQUIRED>	The Chapter Goals area has a required title attribute and contains one or more goal elements.
<!ELEMENT goal (#PCDATA) >	Each goal is parsable data.
<!ELEMENT section (subsection*)> <!ATTLIST section title CDATA #REQUIRED>	A section contains a required title attribute and zero or more subsections; the * indicates zero or more occurrences.
<!ELEMENT subsection (subsubsection*)> <!ATTLIST subsection title CDATA #REQUIRED>	A subsection contains a title attribute and zero or more subsubsections.
<!ELEMENT subsubsection (#PCDATA)> <!ATTLIST subsubsection title CDATA #REQUIRED>	A subsubsection contains a title attribute and parsable text.
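
One way to see what the content models in Table 7-1 mean is to translate each one by hand into a regular expression over the sequence of child element names. The Python sketch below does this. It is an illustrative shortcut under stated assumptions, not a DTD validator: the dictionaries, the check function, and the sample documents are all hypothetical, and a real application would rely on a validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Content models from Table 7-1, hand-translated into regular expressions.
# Each child name is followed by one space, so "chapter " cannot match "chaptertitle ".
CONTENT_MODELS = {
    "book": r"(chapter )+",
    "chapter": r"chaptertitle (chaptergoals )?(section )+",
    "chaptertitle": r"",                 # (#PCDATA): text only, no child elements
    "chaptergoals": r"(goal )+",
    "goal": r"",
    "section": r"(subsection )*",
    "subsection": r"(subsubsection )*",
    "subsubsection": r"",
}
# Attributes declared #REQUIRED in Table 7-1.
REQUIRED_ATTRS = {name: ["title"] for name in
                  ("book", "chaptergoals", "section", "subsection", "subsubsection")}

def check(elem):
    """Return a list of violations found in elem and all of its descendants."""
    errors = []
    children = "".join(child.tag + " " for child in elem)
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        errors.append(f"<{elem.tag}>: element is not declared")
    elif not re.fullmatch(model, children):
        errors.append(f"<{elem.tag}>: unexpected children [{children.strip()}]")
    for attr in REQUIRED_ATTRS.get(elem.tag, []):
        if attr not in elem.attrib:
            errors.append(f"<{elem.tag}>: missing required attribute '{attr}'")
    for child in elem:
        errors.extend(check(child))
    return errors

GOOD = ('<book title="Sample"><chapter><chaptertitle>Intro</chaptertitle>'
        '<section title="One"/></chapter></book>')
BAD = '<book><chapter><section title="One"/></chapter></book>'

print(check(ET.fromstring(GOOD)))
print(check(ET.fromstring(BAD)))
```

The second document fails twice: the book element lacks its required title, and the chapter has no chaptertitle before its section.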

Hypertext Markup Language

HTML was developed as a simple means to publish hyperlinked documents in a standard fashion. HTML enables you to avoid proprietary formats and thus promotes interoperability between the various devices expected to connect to the web. At the time, SGML was considered too bulky and complicated for such a "simple" environment. HTML is an application of SGML and includes a small set of simple tags for organizing content on the web.

Note

You can use DTD files to define not only applications of certain markup languages but entire markup languages themselves. For example, just as in the book example above, HTML has its own SGML-compliant structure and subset of tags, which are used to mark up hypertext on the WWW.

In 1990, Tim Berners-Lee, then working at the Organisation Européenne pour la Recherche Nucléaire (CERN), published the first version of the HTML DTD (that is, HTML 1.0). Berners-Lee developed the first prototype browser, supporting HTML transported in HTTP over a TCP/IP network, resulting in the birth of the World Wide Web. The computer community received HTML 1.0 with open arms, and many early browsers, such as Viola, Cello, and Lynx, became available shortly after its release.

The IETF published HTML version 2.0 as RFC 1866 in 1995. Version 2.0 included many new features and fixes to version 1.0, such as support for images and forms. The National Center for Supercomputing Applications (NCSA) developed Mosaic, the first widely used graphical browser, in 1993. The developers of Mosaic soon decided that leaving NCSA to form Netscape would be a profitable endeavor.

Note

Numerous parties with vested interests in web protocols formed the World Wide Web Consortium (W3C) in 1994 to take web standardization into a nonprofit and unincorporated setting. Soon after, work began on HTML version 3.0 at W3C.

In 1993, Netscape developed HTML+ for its Mosaic browser, based on HTML version 2.0, but it included many additional practical features over HTML version 2.0. Numerous competing companies followed suit with various browsers that supported HTML version 2.0, the largest being Microsoft with Internet Explorer. Although the two largest browser companies developed browsers with close interpretations of the HTML specification, they each developed new and incompatible tags. The browser manufacturers quickly diverged from one another, creating a highly competitive browser market. As a result, the differences between HTML 2.0 and 3.0 and Netscape's HTML+ were so vast that W3C decided to avoid standardizing 3.0 and instead to include the version 3.0 and HTML+ updates in HTML version 3.2, among various fixes and other new features. Thus, HTML 3.2 was released in 1997 and included the generally accepted practices at the time, or as general as possible given the major explosion of web applications developed with HTML. HTML 4.0 and 4.01 are the most recent versions of the specification. The HTML 4.01 specification comprises three separate DTDs maintained and published by W3C.

HTML is an excellent markup language for displaying content for humans to read on a screen and for navigation between documents. However, even though the use of HTML is widespread, it soon proved to be insufficient in abstraction and structure for today's increasingly complex content-based applications. As with its procedural markup predecessors, the drafters of the HTML DTD gave presentation and formatting higher priority than structure and organization, a shortcoming that became more apparent once style sheets were popularized for HTML.

Note

You can use style sheets to further separate content from its presentation in web documents written in HTML or Extensible HTML (XHTML). You will learn about Cascading Style Sheets (CSS) and XHTML later in this chapter.

Although HTML is a structural markup language based on SGML, the HTML tags do not sufficiently describe the content, and the specification is very loose in terms of syntax and structure as compared to SGML. Due to the explosion of the web, content providers quickly thirsted for control over formatting that was similar to that used with printed copy. Browser developers in conjunction with W3C responded with numerous HTML presentational controls. As HTML matured, its procedural markup features were replaced by mechanisms, such as converting text to images, using proprietary HTML extensions, and style sheets, as simple ways to separate presentation from content without severely changing the markup language specification.

Example 7-1 shows how an HTML file is structured.

Example 7-1. A Sample HTML Document to Print "Hello World" to a Web Browser

 <HTML>
   <HEAD>
     <TITLE>Hello World Page</TITLE>
   </HEAD>
   <BODY>
     Hello World!
   </BODY>
 </HTML>

To fully overcome the limitations of HTML, developers have come to favor more robust descriptive markup languages coupled with separate presentational languages as successors to HTML. To bridge the gap between the structured, self-descriptive nature of SGML and the usability of HTML for visual, interactive web applications, the W3C created the XML family of markup languages.

Extensible Markup Language

Like HTML, Extensible Markup Language (XML) derives from SGML. Strictly speaking, it is a simplified subset of SGML rather than an application of it, and it retains more of the semantic aspects of SGML than HTML does. XML is a true structural markup language in that it does not do anything to the data but just describes it. In contrast to HTML, which achieved a certain level of structure with abstractions such as headings, paragraphs, emphasis, and numbered lists, in XML you can create custom XML tags to describe content (for example, Book, Section, and Goal are custom XML tags). HTML has only a specific set of tags available to describe content (for example, Head, Body, and so on), and vast numbers of tags to perform actions on the data.

Note

In contrast to the way you use HTML, you use XML to carry data only, not both data and presentation information. In order to present the data, you must use a style sheet or transform the XML document into HTML or XHTML. You will learn how to present XML later in this chapter.

XML was published as a W3C Recommendation in early 1998. Many of SGML's out-of-date features are excluded from XML. For example, SGML typewriter directives are no longer pertinent today. XML also extends SGML with its internationalization features and typing of elements using XML schemas.

With HTML, user agents can accept any syntax and try to make sense of it without reporting errors. User agents are therefore difficult to write, because an enormous number of erroneous pages exist on the web. The validation of XML is much more deliberate, easing the pressure on user agent developers to perform the complex error correction required for poorly written HTML. The downside is that authors must conform to the strict rules imposed by XML to avoid errors in their documents.
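
The contrast can be observed with the two parsers in the Python standard library. In the sketch below, the lenient html.parser module accepts a fragment with an unclosed tag, whereas the XML parser rejects it. The fragment and the TagLogger class are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

BROKEN = "<p>Hello <b>World</p>"   # the <b> element is never closed

class TagLogger(HTMLParser):
    """Record each start-tag; the HTML parser raises no error for the missing </b>."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

logger = TagLogger()
logger.feed(BROKEN)
logger.close()
print("HTML parser saw start-tags:", logger.tags)

try:
    ET.fromstring(BROKEN)
    xml_accepted = True
except ET.ParseError as err:
    xml_accepted = False
    print("XML parser rejected the fragment:", err)
```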

Note

The term user agent refers to any program that fetches, parses, and optionally displays web pages. Search engine robots are user agents, which is why you will not often see the term web browser used in most web texts, journals, and standardization documents to refer to all such agents.

Drawing on the ability to create custom elements in XML, numerous associated XML-based languages are available to you for extending the basic functionality of XML. Each requires a special application to recognize and act on its custom-defined elements.

Extensible Stylesheet Language (XSL) and XSL Transformations (XSLT): You can use the XSL and XSLT languages to display and transform XML documents for devices such as Cisco IP phones, WAP cell phones, and PDAs.
XSL Formatting Objects (XSL-FO): You can use XSL-FO to format documents for print, such as Adobe PDF files and barcodes.
XPath: Use XPath to specify locations within XML documents, similar to the way paths locate files on a standard computer file system. XPath is not an application of XML, but it is a major component of XSLT. You will see how XPath works with XSLT later in this chapter.
XLink: Use XLink for hyperlinking between XML documents. XLink is similar to HTML links but includes many extensions, such as bidirectional, typed, one-to-many, and many-to-many links. You can also use XLink to specify whether links are followed automatically or on user request.
XQuery: You can perform queries on XML files using XQuery, similar to the way in which you use Structured Query Language (SQL) queries in database systems.
Synchronized Multimedia Integration Language (SMIL): Use SMIL for structured multimedia markup. You will learn about SMIL in Chapter 9, "Introducing Streaming Media."
Scalable Vector Graphics (SVG): Use SVG for structuring graphics.
Resource Description Framework (RDF): Use RDF for structured metadata markup.
MathML: You can use MathML for structured markup of mathematical equations.
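
As a small taste of XPath from the list above, the following Python sketch selects parts of a book document by location. It relies on the ElementTree module, which supports only a limited subset of XPath, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the element names from Table 7-1.
DOC = """<book title="Sample Book">
  <chapter>
    <chaptertitle>Introducing Markup Languages</chaptertitle>
    <section title="Hypertext Markup Language"/>
    <section title="Extensible Markup Language"/>
  </chapter>
</book>"""

root = ET.fromstring(DOC)

# A path of steps, much like directories in a file system path.
section_titles = [s.get("title") for s in root.findall("./chapter/section")]

# A predicate in square brackets narrows the match by attribute value.
xml_section = root.find(".//section[@title='Extensible Markup Language']")

print(section_titles)
print(xml_section.get("title"))
```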

Figure 7-4 shows the sample document described previously in Figure 7-1, structured in XML. Notice that the simple SGML example discussed previously in Figure 7-3 is identical when written in XML, except for the required <?xml version ...?> declaration at the top of the file.

Figure 7-4. Sample XML File

Note

XML with correct syntax is well-formed XML. XML validated against a DTD is valid XML. You can optionally specify the DTD to validate an XML file against in the header of the XML document, as shown in Figure 7-4. You can also use XML Schemas as an XML-based alternative to the standard DTDs. RELAX NG is another schema language, standardized by OASIS and ISO.

Extensible Hypertext Markup Language

Extensible HTML (XHTML) is the next step in the evolution of web documents, and its creation was motivated by the need to deliver content to many different types of devices, such as mobile phones, PDAs, and web kiosks. As the name suggests, it is a combination of XML and HTML. More specifically, the XHTML DTDs are a reformulation of the three HTML 4.0 DTDs as an application of XML (recall that HTML 4.0 is an application of SGML). In other words, the HTML DTDs were rewritten according to the rules of XML, creating the new XHTML DTDs. With the new definitions, the old HTML syntax must follow the same strict rules as XML. This leads the way to a more standardized language, as user agents gradually transition to XHTML.

Note

Because documents in XHTML conform to both XML and HTML 4, you can view them in user agents supporting either type.

Although HTML may never totally retire as a web markup language, it will become much more extensible and standardized under the guise of XML. It will be extensible in that you can create your own tags, and standardized in that user agents concern themselves with the standards of the XML specification, not the complex HTML error-correction methods stemming from a lack of standard syntax. Important differences between HTML and XHTML are as follows:

You must nest XHTML elements properly.
XHTML documents must be well-formed.
Tag names must be in lowercase.
You must close all XHTML elements.
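
Several of these rules can be checked mechanically. The Python sketch below is a rough illustration, assuming a fragment without XML namespaces: it uses an XML parser to detect improper nesting and unclosed elements, and then looks for tag names that are not lowercase. The function name xhtml_problems is hypothetical, and the sketch is not a substitute for validation against the XHTML DTDs.

```python
import xml.etree.ElementTree as ET

def xhtml_problems(fragment):
    """Return a list of rule violations found in an XHTML fragment."""
    try:
        root = ET.fromstring(fragment)
    except ET.ParseError as err:
        # Improper nesting and unclosed elements both surface as parse errors.
        return [f"not well-formed: {err}"]
    return [f"tag <{el.tag}> is not lowercase"
            for el in root.iter() if el.tag != el.tag.lower()]

print(xhtml_problems("<p>Line one<br/>Line two</p>"))   # empty element closed with />
print(xhtml_problems("<p>Line one<br>Line two</p>"))    # <br> left open, as old HTML allowed
print(xhtml_problems("<P>Uppercase tag names</P>"))
```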

Wireless Application Protocol Markup Languages

New business potential in mobile browsing has fostered the development of the Wireless Application Protocol (WAP). You can use WAP to supply web content to mobile devices, such as cell phones, pagers, and PDAs. Just as the W3C is responsible for web protocols, the WAP Forum is responsible for their wireless counterparts. The WAP 1.0 protocol is composed of the following specifications:

Wireless Markup Language (WML) 1.0: Use the WML structural markup language for WAP content rendering. WML is an application of XML and as such strictly adheres to the XML specification.
WMLScript: A scaled-down scripting language for wireless devices, similar to JavaScript or VBScript for HTML client or server scripting or both.
Wireless Telephony Application Interface (WTAI): An API for making phone calls from data connections.

Using the analogy of playing cards, WML pages are called decks, and each deck contains one or more cards. WAP devices download all the cards in a deck at once but display them to the user one at a time. Figure 7-5 illustrates how to publish an online book in WML.

Figure 7-5. A Sample WML File
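
The deck-and-card structure can also be sketched in a few lines of Python. The deck below is a hypothetical example and is not the content of Figure 7-5. It uses the wml, card, and p elements of WML 1.0 and omits the XML declaration and document type declaration for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical deck: one downloaded file that holds two cards (chapters).
DECK = """<wml>
  <card id="ch1" title="Chapter 1">
    <p>Introducing Markup Languages</p>
  </card>
  <card id="ch2" title="Chapter 2">
    <p>Hypothetical second chapter</p>
  </card>
</wml>"""

deck = ET.fromstring(DECK)
cards = deck.findall("card")        # the whole deck arrives in one download

def show(card):
    """Render a single card, as a device displays one card at a time."""
    body = " ".join(p.text.strip() for p in card.findall("p"))
    return f"[{card.get('title')}] {body}"

for card in cards:
    print(show(card))
```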

Figure 7-6 shows how you can navigate between individual cards, or chapters, using the WAP device controls.

Figure 7-6. Navigating a WML Document on a WAP Device

The W3C specifies a subset of XHTML 1.1 for small devices, called XHTML Basic. The WAP Forum, however, defined a language for WAP 2.0 that includes the XHTML Basic features plus some of the features from the full XHTML 1.1 specification. This language is called the XHTML Mobile Profile (XHTML MP), or Wireless Markup Language 2.0 (WML 2.0). WAP 2.0 was motivated by advancements in wireless transmission technologies, such as GSM, GPRS, 2.5G, and 3G.

WAP 2.0 also introduced support for special WAP versions of TCP/IP protocols in order to leverage the same languages and tools for mobile and standard web content (in contrast, WAP 1.0 uses the WAP protocol stack and does not support connectivity to TCP/IP networks). The wTCP/IP protocol supports TCP/IP, HTTP for content transport, and PKI for content security. Additionally, CSS is available to you on WAP 2.0-enabled devices to control a document's layout, including text fonts, text attributes, borders, margins, padding, text alignment, text colors, and background colors, to name a few. WAP 2.0 also supports XSLT to transform between WML 1.0 and WML 2.0 documents.
