3.2 HTML, SGML, and XML | Java Network Programming, Third Edition

HTML is the primary format used for Web documents. As I said earlier, HTML is a simple standard for describing the semantic content of textual data. The idea of describing a text's semantics rather than its appearance comes from an older standard called the Standard Generalized Markup Language (SGML). Standard HTML is an instance of SGML. SGML was invented in the mid-1970s by Charles Goldfarb, Edward Mosher, and Raymond Lorie at IBM. SGML is now an International Organization for Standardization (ISO) standard, specifically ISO 8879:1986.

SGML and, by inheritance, HTML are based on the notion of design by meaning rather than design by appearance. You don't say that you want some text printed in 18-point type; you say that it is a top-level heading (<H1> in HTML). Likewise, you don't say that a word should be placed in italics. Rather, you say it should be emphasized (<EM> in HTML). It is left to the browser to determine how best to display headings or emphasized text.

The tags used to mark up the text are case-insensitive. Thus, <STRONG> is the same as <strong> is the same as <Strong> is the same as <StrONg>. Some tags have a matching end-tag to define a region of text. An end-tag is the same as the start-tag, except that the opening angle bracket is followed by a /. For example: <STRONG>this text is strong</STRONG>; <EM>this text is emphasized</EM>. The entire text from the beginning of the start-tag to the end of the end-tag is called an element. Thus, <STRONG>this text is strong</STRONG> is a STRONG element.

HTML elements may nest but they should not overlap. The first line in the following example is standard-conforming. The second line is not, though many browsers accept it nonetheless:

 <STRONG><EM>Jack and Jill went up the hill</EM></STRONG>
 <STRONG><EM>to fetch a pail of water</STRONG></EM>
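
The difference between nesting and overlapping can be checked mechanically with a stack: each start-tag is pushed, and each end-tag must match the most recently opened element. The following is only a minimal sketch, not a real HTML parser. The class name `NestingCheck` and the method `isProperlyNested` are hypothetical, and the code assumes that every element in the input has an explicit end-tag, which is not true of HTML in general (for example, `<BR>` has none).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; assumes every start-tag has an end-tag.
class NestingCheck {

    // Matches a start-tag or an end-tag; group 1 is the optional slash, group 2 the tag name.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns true if each end-tag closes the most recently opened element. */
    static boolean isProperlyNested(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // Tag names are case-insensitive, so normalize before comparing.
            String name = m.group(2).toUpperCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                open.push(name);                      // start-tag
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false;                         // end-tag closes the wrong element
            }
        }
        return open.isEmpty();                        // nothing may be left unclosed
    }

    public static void main(String[] args) {
        // prints "true"
        System.out.println(isProperlyNested(
            "<STRONG><EM>Jack and Jill went up the hill</EM></STRONG>"));
        // prints "false"
        System.out.println(isProperlyNested(
            "<STRONG><EM>to fetch a pail of water</STRONG></EM>"));
    }
}
```

A program that has to cope with real-world pages would need to be more forgiving than this, since many browsers accept overlapping elements.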

Some elements have additional attributes that are encoded as name-value pairs on the start-tag. The <H1> tag and most other paragraph-level tags may have an ALIGN attribute that says whether the heading should be centered, left-aligned, or right-aligned. For example:

 <H1 ALIGN=CENTER> This is a centered H1 heading </H1>

The value of an attribute may be enclosed in double or single quotes, like this:

 <H1 ALIGN="CENTER"> This is a centered H1 heading </H1>
 <H2 ALIGN='LEFT'> This is a left-aligned H2 heading </H2>

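Quotes may be omitted only when the value consists solely of letters, digits, hyphens, periods, underscores, and colons; a value that contains embedded spaces, for example, must be quoted. When processing HTML, you need to be prepared for attribute values both with and without quotes.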
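
One way to handle all three forms is a regular expression with one alternative each for double-quoted, single-quoted, and unquoted values. The following is a rough sketch rather than a complete HTML tokenizer. The class `AttributeReader` and its method `readAttributes` are hypothetical names, and the code assumes it is handed a single start-tag as a string.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; expects one start-tag per call.
class AttributeReader {

    // name = "double quoted" | 'single quoted' | unquoted
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Returns the attributes of a start-tag, keyed by upper-cased attribute name. */
    static Map<String, String> readAttributes(String startTag) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(startTag);
        while (m.find()) {
            // Exactly one of groups 2, 3, and 4 is non-null for each match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            // Attribute names are case-insensitive; values are kept as written.
            attributes.put(m.group(1).toUpperCase(Locale.ROOT), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(readAttributes("<H1 ALIGN=CENTER>"));   // prints "{ALIGN=CENTER}"
        System.out.println(readAttributes("<H1 ALIGN=\"CENTER\">")); // prints "{ALIGN=CENTER}"
        System.out.println(readAttributes("<H2 ALIGN='LEFT'>"));   // prints "{ALIGN=LEFT}"
    }
}
```

All three spellings of the start-tag yield the same name-value pair, so the rest of the program does not need to care which form the page author used.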

There have been several versions of HTML over the years. The current standard is HTML 4.0, most of which is supported by current web browsers, with occasional exceptions. Furthermore, several companies, notably Netscape, Microsoft, and Sun, have added nonstandard extensions to HTML. These include blinking text, inline movies, frames, and, most importantly for this book, applets. Some of these extensions (for example, the <APPLET> tag) are allowed but deprecated in HTML 4.0. Others, such as Netscape's notorious <BLINK>, come out of left field and have no place in a semantically oriented language like HTML.

HTML 4.0 may be the end of the line, aside from minor fixes. The W3C has decreed that HTML has grown too bulky to keep layering more features on top of it. Instead, new development will focus on XML, a semantic language that allows page authors to create the elements they need rather than relying on a few fixed elements such as P and LI. For example, if you're writing a web page with a price list, you would likely have an SKU element, a PRICE element, a MANUFACTURER element, a PRODUCT element, and so forth. That might look something like this:

 <PRODUCT MANUFACTURER="IBM">
   <NAME>Lotus Smart Suite</NAME>
   <VERSION>9.8</VERSION>
   <PLATFORM>Windows</PLATFORM>
   <PRICE CURRENCY="US">299.95</PRICE>
   <SKU>D05WGML</SKU>
 </PRODUCT>

This looks a lot like HTML, in much the same way that Java looks like C. There are elements and attributes. Tags are set off by < and >. Attribute values are enclosed in quotation marks, and so forth. However, instead of being limited to a finite set of tags, you can create all the new and unique tags you need. Since no browser can know in advance all the different elements that may appear, a stylesheet is used to describe how each of the items should be displayed.
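
Because the element and attribute names carry the meaning, a program can pull individual values out of such a document by name. The sketch below uses the DOM parser from the standard `javax.xml.parsers` package to read the price-list entry. The class `ProductReader`, its `describe` method, and the format of the string it returns are illustrative assumptions, and the code assumes the document has a NAME and a PRICE element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; assumes NAME and PRICE are present.
class ProductReader {

    // The price-list entry from the text, written as a Java string.
    static final String SAMPLE =
        "<PRODUCT MANUFACTURER=\"IBM\">"
        + "<NAME>Lotus Smart Suite</NAME>"
        + "<VERSION>9.8</VERSION>"
        + "<PLATFORM>Windows</PLATFORM>"
        + "<PRICE CURRENCY=\"US\">299.95</PRICE>"
        + "<SKU>D05WGML</SKU>"
        + "</PRODUCT>";

    /** Builds a one-line summary from the PRODUCT element's attributes and children. */
    static String describe(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element product = doc.getDocumentElement();
            Element name = (Element) product.getElementsByTagName("NAME").item(0);
            Element price = (Element) product.getElementsByTagName("PRICE").item(0);
            return product.getAttribute("MANUFACTURER") + " " + name.getTextContent()
                + ": " + price.getTextContent() + " " + price.getAttribute("CURRENCY");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not read product", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE)); // prints "IBM Lotus Smart Suite: 299.95 US"
    }
}
```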

XML has another advantage over HTML that may not be obvious from this simple example. HTML can be quite sloppy. Elements are opened but not closed. Attribute values may or may not be enclosed in quotes. XML tightens all this up. It lays out very strict requirements for the syntax of a well-formed XML document, and it requires that browsers reject all malformed documents. Browsers are not allowed to fix the problem and make a good-faith effort to display what they think the author meant. They must simply report the error. Furthermore, an XML document may have a Document Type Definition (DTD), which can impose additional constraints on valid documents. For example, a DTD may require that every PRODUCT element contain exactly one NAME element. This has a number of advantages, but the key one here is that XML documents are far easier to parse than HTML documents. As a programmer, you will find it much easier to work with XML than HTML.
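
The strictness is easy to observe from Java. The standard `javax.xml.parsers` API throws a `SAXException` as soon as it meets a document that is not well-formed, rather than guessing at the author's intent. The following is a minimal sketch. The class `WellFormedCheck` and the method `isWellFormed` are hypothetical names, and the check covers well-formedness only, not validity against a DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration; checks well-formedness, not DTD validity.
class WellFormedCheck {

    /** Returns true if the parser accepts the text as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow problems instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignore warnings */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Quoted attribute value, matching end-tags: prints "true"
        System.out.println(isWellFormed(
            "<PRODUCT MANUFACTURER=\"IBM\"><NAME>Lotus Smart Suite</NAME></PRODUCT>"));
        // Unquoted attribute value: prints "false"
        System.out.println(isWellFormed(
            "<PRODUCT MANUFACTURER=IBM><NAME>Lotus Smart Suite</NAME></PRODUCT>"));
        // Overlapping elements: prints "false"
        System.out.println(isWellFormed(
            "<PRODUCT><NAME>Lotus Smart Suite</PRODUCT></NAME>"));
    }
}
```

The second and third inputs contain the kinds of sloppiness that are common in HTML, and the XML parser refuses both.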

XML can be used both for pure XML pages and for embedding new kinds of content in HTML and XHTML. For example, the Mathematical Markup Language, MathML, is an XML application for including mathematical equations in web pages. SMIL, the Synchronized Multimedia Integration Language, is an XML application for including timed multimedia such as slide shows and subtitled videos on web pages. More recently, the W3C has released several versions of XHTML. This language uses the familiar HTML vocabulary ( p for paragraphs, tr for table rows, img for pictures, and so forth) but requires the document to follow XML's stricter rules: all attribute values must be quoted; every start-tag must have a matching end-tag; elements can nest but cannot overlap; etc. For a lot more information about XML, see XML in a Nutshell by Elliotte Rusty Harold and W. Scott Means (O'Reilly).