6.1. Character Sets and Encoding

The first challenge in internationalization is dealing with the staggering number of unique character shapes (called glyphs) that occur in the writing systems of the world. This includes not only alphabets, but also all ideographs (characters that indicate a whole word or concept) for such languages as Chinese, Japanese, and Korean. There are also invisible characters that indicate particular functionality within a word or a line of text, such as characters that indicate that adjacent characters should be joined. To understand character encoding as it relates to HTML, XHTML, and XML, you must be familiar with some basic terms and concepts.
Many character sets and their encodings have been standardized for worldwide interoperability. The character set most relevant to the Web is the comprehensive Unicode (ISO/IEC 10646-1), which includes more than 50,000 characters from all active modern languages. Unicode is discussed in detail in the next section. Web documents may also be encoded with more specialized encodings appropriate to their authoring languages. Some common encodings are listed in Table 6-1. Note that all of these encodings are 8-bit (256-character) subsets of Unicode.
HTML 2.0 and 3.2 were based on the 8-bit Latin-1 (ISO 8859-1) character set. Even as HTML 2.0 was being penned, the W3C was aware that 256 characters were not adequate to exchange information on a global scale, and it had its sights set on a super-character set called Unicode. Unfortunately, Unicode wasn't ready for inclusion in an HTML Recommendation until Version 4.0 (1997). Without further ado, it's time to talk Unicode.

6.1.1. Unicode (ISO/IEC 10646-1)

SGML-based markup languages are required to define a document character set that serves as the basis for interpreting characters. The document character set for HTML (4 and 4.01), XHTML, and XML is the Universal Character Set (UCS), which is a superset of all widely used standard character sets in the world. The UCS is defined by both the Unicode and ISO/IEC 10646 standards. The code points in Unicode and ISO/IEC 10646 are identical, and the standards are developed in parallel. The difference is that Unicode adds some rules about how characters should be used. Unicode also serves as a reference for such issues as the bidirectional text algorithm for handling reading direction within text. The Unicode Standard is defined by the Unicode Consortium (www.unicode.org).
Because Unicode is the document character set for all (X)HTML documents, numeric character references in web documents will always be interpreted according to Unicode code points, regardless of the document's declared encoding.

6.1.1.1. Unicode code points

Unicode was originally intended to be a 16-bit encoded character set, but it was soon recognized that 65,536 code positions would not be enough, so it was extended to include more than a million available code points (not all of them are assigned, of course) on supplementary planes. The first 65,536 positions in Unicode (those that fit in 16 bits) are referred to as the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use, such as character sets for Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, Cherokee, and others, as well as mathematical and other miscellaneous characters. Most ideographs are there, too, but due to their large numbers, many have been placed in a Supplementary Ideographic Plane. Unicode was created with backward compatibility in mind. The first 256 code points in the BMP are identical to the Latin-1 character set, with the first 128 matching the established ASCII standard.

6.1.1.2. Unicode encodings

Many character sets have only one encoding method, such as the ISO 8859 series. Unicode, however, may be encoded a number of ways. So although the code points never change, they may be stored in units of 1, 2, or 4 bytes. The encoding forms for Unicode are:

    UTF-8: uses 8-bit (1-byte) units. A character takes from one to four bytes, and ASCII characters take exactly one.
    UTF-16: uses 16-bit (2-byte) units. Characters in the BMP take one unit, and characters on the supplementary planes take two.
    UTF-32: uses one 32-bit (4-byte) unit for every character.
So while the code point for the percent sign is U+0025, it would be represented by the byte value 25 (hexadecimal) in UTF-8, 00 25 in UTF-16, and 00 00 00 25 in UTF-32. There are other things at work in the encoding as well, but this gives you a feel for the difference in encoding forms.

6.1.1.3. Choosing an encoding

The W3C recommends the UTF-8 encoding for all (X)HTML and XML documents because it can accommodate the greatest number of characters and is well supported by servers. It allows wide-ranging languages to be mixed within a single document. Not all web documents need to be encoded using UTF-8, however. If you are authoring a document in a language that uses a lot of non-ASCII characters, you may want to choose an encoding that minimizes the need to numerically represent ("escape") these special characters. Bear in mind, however, that regardless of the encoding, all characters in the document will be interpreted relative to Unicode code points.
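The relationship between a code point, its encoded bytes, and the need to escape can be checked with a few lines of Python. The sketch below is only an illustration: the helper names are made up for this example, the character é (U+00E9) is added alongside the percent sign to show a non-ASCII case, and the big-endian codecs (utf-16-be, utf-32-be) are assumed so that no byte order mark appears in the output.

```python
def encoded_hex(char, encoding):
    """Return the bytes of char in the given encoding as spaced hex pairs."""
    data = char.encode(encoding)
    return " ".join("{:02X}".format(byte) for byte in data)


def numeric_reference(char):
    """Build a hexadecimal numeric character reference from the code point."""
    return "&#x{:X};".format(ord(char))


def needs_escape(char, encoding):
    """Report whether char cannot be written directly in the encoding."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


# The percent sign from the text, plus an illustrative Latin-1 character.
for char in ("%", "\u00e9"):
    print("U+{:04X}".format(ord(char)), numeric_reference(char))
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print("   ", encoding, encoded_hex(char, encoding))
    for encoding in ("ascii", "iso-8859-1", "utf-8"):
        print("    escape needed in", encoding, "?", needs_escape(char, encoding))
```

The numeric reference is built from the code point alone, so it stays the same whichever encoding the document is saved in. Only the raw bytes, and whether the character can be typed directly, depend on the encoding.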
6.1.2. Specifying Character Encoding

The W3C encourages authors to specify the character encoding for all web documents, even those that use the default UTF-8 Unicode encoding, but it is particularly critical if an alternate encoding is used. There are several ways to declare the character encoding for documents: in the HTTP header delivered by the server, in the XML declaration (for XHTML and XML documents only), or in a meta element in the head of the document. This section looks at each method and provides guidelines for their use.

6.1.2.1. HTTP headers

When a server sends a document to a user agent (such as a browser), it also sends information about the document in a portion of the response called the HTTP header. A typical HTTP header looks like this:

    HTTP/1.x 200 OK
    Date: Mon, 14 Nov 2005 19:45:33 GMT
    Server: Apache/2.0.46 (Red Hat)
    Accept-Ranges: bytes
    Connection: close
    Transfer-Encoding: chunked
    Content-Type: text/html; charset=UTF-8

Notice that one of the bits of information that the server sends along is the Content-Type of the document using a MIME type label. For example, HTML documents are always delivered as type text/html. (The MIME types for XHTML documents aren't as straightforward, as discussed in the sidebar, "Serving XHTML.") The Content-Type entry may also contain the character encoding of the document using the charset parameter, as shown in the example.

The method for setting up a server with your preferred character encoding varies with different server software, so it is best to consult the server administrator for assistance. For Apache servers, the default character encoding may be set for all documents with the .html extension by adding this line to the .htaccess file:

    AddType 'text/html; charset=UTF-8' html

The advantages to setting character encodings in HTTP headers are that the information is easily accessible to user agents and the header information has the highest priority in case of conflict.
On the downside, it is not always easy for authors to access the server settings, and it is possible for the default server settings to be changed without the author's knowledge. It is also possible for the character encoding information to get separated from the document, which is why it is recommended that the character encoding be provided within the document as well, as described by the next two methods.
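Because server defaults can change without notice, it can be useful to check what a Content-Type header actually declares. The following Python sketch shows one way to pull the MIME type and charset parameter out of a header value using the standard library's email.message module. The function name is hypothetical, and fetching the header from a live server is left out.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return (media type, charset) parsed from a Content-Type header value.

    The charset is None when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value it finds.
    return msg.get_content_type(), msg.get_content_charset()


# The header value from the sample response, then one with no charset.
print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

A result of None for the charset means the server is leaving the decision to the document itself or to the user agent's defaults.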
6.1.2.2. XML declaration

XHTML (and other XML) documents often begin with an XML declaration before the DOCTYPE declaration. The XML declaration is not required. The declaration may include the encoding of the document, as shown in this example:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

The XML declaration may be provided even for XHTML documents served as text/html. Because the default encoding for all XML documents is UTF-8 or UTF-16, encoding information in the XML declaration is not required for these encodings, and thus can be omitted as a space-saving optimization. In addition, although it is technically correct to include the XML declaration in such documents, Appendix C of the XHTML 1.0 specification, "HTML Compatibility Guidelines," recommends avoiding it, and many authors choose to omit it because of browser-support issues. For example, when Internet Explorer 6 for Windows detects a line of text before the DOCTYPE declaration, it switches to Quirks Mode (see Chapter 9 for details), which can have a damaging effect on how the document's styles are rendered. (This is reportedly fixed in IE 7.) The XML declaration is required only if your document uses an encoding other than UTF-8 or UTF-16 and the encoding has not been set on the server.

6.1.2.3. The meta element

For HTML documents as well as XHTML documents served as text/html, the encoding should always be specified using a meta element in the head of the document. The http-equiv attribute passes information along to the user agent as though it appeared in the HTTP header.
Again, the encoding is provided with the charset value as shown here:

    <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <title>Document Title</title>
    </head>

Although the meta element declaring the content type is not a required element in the HTML and XHTML DTDs, it is strongly recommended for the purpose of clearly identifying the character encoding and keeping that information with the document. This is particularly helpful for common text editors (such as BBEdit), which use the meta element to identify the character encoding of the document when opening the document for editing. With this method, all character encodings must be explicitly specified, including UTF-8 and UTF-16.

6.1.2.4. Choosing the declaration method

The declaration method you use depends on the type of document you are authoring and its encoding method.
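To see how the three declarations in this section relate to one another, the following Python sketch reports which declaration a given document would rely on. It is a rough illustration, not a description of how any particular browser works. The function and pattern names are hypothetical, the regular expressions assume the attribute order used in the examples in this section, and checking the XML declaration before the meta element is an assumption (the text above states only that the HTTP header has the highest priority).

```python
import re

# Matches an encoding pseudo-attribute in an XML declaration at the very
# start of the document.
XML_DECL = re.compile(
    r'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)

# Matches a meta element with http-equiv before content, as in the example
# in this section; other attribute orders are not handled by this sketch.
META_CHARSET = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']content-type["\'][^>]*'
    r'content\s*=\s*["\'][^"\']*charset=([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)


def declared_encoding(http_charset, document_text):
    """Return (encoding, source) for the first declaration found.

    Returns (None, None) when nothing declares an encoding.
    """
    # The HTTP header has the highest priority in case of conflict.
    if http_charset:
        return http_charset.lower(), "HTTP header"
    # Assumed order for the in-document declarations: XML declaration first.
    match = XML_DECL.match(document_text)
    if match:
        return match.group(1).lower(), "XML declaration"
    match = META_CHARSET.search(document_text)
    if match:
        return match.group(1).lower(), "meta element"
    return None, None


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<head>\n'
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n'
    '<title>Document Title</title>\n'
    '</head>\n'
)
print(declared_encoding("ISO-8859-1", sample))
print(declared_encoding(None, sample))
```

Running a check like this against your own pages is one way to spot a document whose server setting and in-document declaration disagree.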