HTML, the standard on which the World Wide Web (WWW) is founded and a descendant of Standard Generalized Markup Language (SGML), provides the basic mechanism by which documents on the Web are presented and linked to each other. Tim Berners-Lee is credited with the first proposals for creating HTML in 1989. His goal was to simplify the maintenance and publishing of research documents on the computers of the Centre Européen pour la Recherche Nucléaire (CERN) in Geneva, Switzerland, where he worked, while also making these documents more accessible to the rest of the research community. Berners-Lee and colleague Robert Cailliau wrote the first program to render HTML documents in 1991.
Berners-Lee derived the basic rules for HTML from the International Organization for Standardization (ISO) 8879 standard of 1986 for SGML. For instance, the tags used in HTML to denote the elements of a document come directly from SGML. A tag is a keyword enclosed within a pair of angle brackets (<, >), whereas an element consists of a start tag, an end tag, and the content between them. The end tag carries the same keyword as the start tag, but the keyword is preceded by a slash (/). Here is an example:
<title>This is the title of the document</title>
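The start tag, content, and end tag of this element can also be picked apart programmatically. The following is a minimal sketch, assuming Python and its standard html.parser module; the TagLogger class and the parse_events helper are hypothetical names introduced only for this illustration and are not part of any browser or SGML tool.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag, end tag, and run of text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for a start tag such as <title>
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for an end tag such as </title>
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between the tags
        self.events.append(("text", data))


def parse_events(markup):
    """Hypothetical helper: return the list of events found in markup."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = "<title>This is the title of the document</title>"
    for kind, value in parse_events(sample):
        print(kind, value)
```

Run against the example above, the sketch should report the start tag, then the enclosed text, then the end tag, which mirrors the way an element is built up from its two tags and its content.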
SGML was designed for human editors, who would manually compose the text and include tags to identify the semantic function of each text element (such as <title> or <body>). That is why HTML documents are readable with any plain-text editor such as Notepad. However, the layout of most of today's HTML documents has become so complex that the amount of markup within an HTML document easily surpasses the amount of actual text.
Originally, HTML was meant to represent text within a document structure, rather than accounting for any formatting or attributes of the presentation. Nevertheless, as the implementers of Web browsers and the authors of HTML documents were looking for methods to control the visual representation of HTML documents, more and more tags were added allowing exactly that. As a result, HTML has become a very rich markup language that allows the representation of the most complex layouts.
Because most of the documents that CERN maintained and published were written in multiple Western European languages, a basic international feature was present in the original versions of HTML: the capability to render documents written in any Western European language that uses the Latin 1 character set. Latin 1 is, in fact, the default character set for HTML documents.
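To make the reach and the limits of Latin 1 concrete, the following minimal sketch checks whether a piece of text can be represented in that character set. It assumes Python, whose "latin-1" codec corresponds to ISO 8859-1; the fits_latin1 helper is a hypothetical name used only for this illustration.

```python
def fits_latin1(text):
    """Return True if every character in text exists in Latin 1 (ISO 8859-1)."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        # At least one character has no Latin 1 code point
        return False
    return True


if __name__ == "__main__":
    french = "Centre Européen"
    encoded = french.encode("latin-1")

    # Latin 1 stores each character in exactly one byte
    print(len(french), len(encoded))

    # Decoding with the same character set restores the original text
    print(encoded.decode("latin-1") == french)

    # Western European text fits; Hebrew text does not
    print(fits_latin1("Straße"), fits_latin1("שלום"))
```

Text in French or German passes the check, whereas Hebrew text does not, which is why documents in languages outside Western Europe need the additional international features that later versions of HTML introduced.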
Today's popular Web browsers implement HTML in the form that is standardized by the World Wide Web Consortium (W3C). HTML 4 was the first W3C-standardized version to include a host of international features. These features far surpassed those present in earlier versions of HTML and, for the first time, bidirectional support was added, as required for documents written in Hebrew, Arabic, or Farsi, for example. (For more information on international features, see "International Features" later in this chapter.)
Microsoft Internet Explorer makes use of the dynamic-link libraries (DLLs) Shdocvw.dll and Mshtml.dll (associated with MSHTML, Microsoft's own superset of HTML tags) to handle all its reading and rendering of HTML documents. Any Microsoft Windows application can host either one of these libraries and use its interfaces for managing HTML or fragments of an HTML document. Both DLLs allow access down to the individual elements of the document. Applications can use the simple browser control of Shdocvw.dll to implement what is essentially another version of a Web browser. Alternatively, applications can directly host Mshtml.dll to gain access to and manipulate the document's elements. (For more information on hosting MSHTML and its associated DLL, go to http://msdn.microsoft.com/workshop/browser/hosting/Hosting.asp.)
Before taking a closer look at HTML's international features, you should be familiar with some basic terms. These terms, which are used frequently (and sometimes interchangeably) throughout various Requests for Comments (RFCs), ISO standards, and other documents, include "encoding," "character encoding," "character set," and "charset." (For more information on these terms, see Chapter 3, "Unicode.")