6.1. Character Sets and Encoding

The first challenge in internationalization is dealing with the staggering number of unique character shapes (called glyphs) that occur in the writing systems of the world. This includes not only alphabets, but also the ideographs (characters that indicate a whole word or concept) used in such languages as Chinese, Japanese, and Korean. There are also invisible characters that indicate particular functionality within a word or a line of text, such as characters that indicate that adjacent characters should be joined.

To understand character encoding as it relates to HTML, XHTML, and XML, you must be familiar with some basic terms and concepts.

Character set

A character set is any collection or repertoire of characters that are used together for a particular function. Many character sets have been standardized, such as the familiar ASCII character set that includes 128 characters mostly from the Roman alphabet used in modern English.

Coded character set

When a specific number is assigned to each character in a set, it becomes a coded character set. Each position (or numbered unit) in a coded character set is called a code point (or code position). In Unicode (discussed in more detail later), the code point of the greater-than symbol (>) is 3E in hexadecimal or 62 in decimal. Unicode code points are typically denoted as U+hhhh, where hhhh is a sequence of at least four and sometimes as many as six hexadecimal digits.
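As a small illustration of code points, the following Python sketch looks up the code point of the greater-than symbol and formats it in U+hhhh notation. The helper name code_point_label is made up for this example and is not part of any standard.

```python
def code_point_label(ch: str) -> str:
    """Return the U+hhhh label for a single character (hypothetical helper)."""
    # ord() gives the Unicode code point; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"


gt = ">"
print(ord(gt))                 # the code point in decimal
print(hex(ord(gt)))            # the same code point in hexadecimal
print(code_point_label(gt))    # four-digit form for a BMP character
print(code_point_label("\U00020000"))  # a supplementary character needs five digits
```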

Character encoding

Character encoding refers to the way characters and their code points are converted to bytes for use by computers. The character encoding transforms the character stream in a document to a byte stream that is interpreted by user agents and reassembled again as a character stream for the user.

The number of characters available in a character set is limited by the bit depth of its encoding. For example, 8 bits are capable of describing 256 (2⁸) unique characters, 16 bits can describe 65,536 (2¹⁶) different characters, and so on.

Many character sets and their encodings have been standardized for worldwide interoperability. The most relevant character set to the Web is the comprehensive Unicode (ISO/IEC 10646-1), which includes more than 50,000 characters from all active modern languages. Unicode is discussed in appropriate detail in the next section.

Web documents may also be encoded with more specialized encodings appropriate to their authoring languages. Some common encodings are listed in Table 6-1. Note that the ISO 8859 encodings are 8-bit (256-character) encodings, while the Japanese encodings use multibyte sequences. In every case, the characters they cover are a subset of Unicode.

Table 6-1. Common character encodings
Encoding	Description
ISO 8859-1 (a.k.a. Latin-1)	Latin characters used in most Western languages (includes ASCII)
ISO 8859-5	Cyrillic
ISO 8859-6	Arabic
ISO 8859-7	Greek
ISO 8859-8	Hebrew
ISO-2022-JP	Japanese
SHIFT_JIS	Japanese
EUC-JP	Japanese

HTML 2.0 and 3.2 were based on the 8-bit Latin-1 (ISO 8859-1) character set. Even as HTML 2.0 was being penned, the W3C was aware that 256 characters were not adequate to exchange information on a global scale, and it had its sights set on a super-character set called Unicode. Unfortunately, Unicode wasn't ready for inclusion in an HTML Recommendation until Version 4.0 (1997). Without further ado, it's time to talk Unicode.

6.1.1. Unicode (ISO/IEC 10646-1)

SGML-based markup languages are required to define a document character set that serves as the basis for interpreting characters. The document character set for HTML (4 and 4.01), XHTML, and XML is the Universal Character Set (UCS), which is a superset of all widely used standard character sets in the world.

The UCS is defined by both the Unicode and ISO/IEC 10646 standards. The code points in Unicode and ISO/IEC 10646 are identical, and the standards are developed in parallel. The difference is that Unicode adds some rules about how characters should be used. It also serves as a reference for such issues as the bidirectional text algorithm for handling reading direction within text. The Unicode Standard is defined by the Unicode Consortium (www.unicode.org).

In common practice, and throughout this book, the Universal Character Set is referred to simply as "Unicode."

Because Unicode is the document character set for all (X)HTML documents, numeric character references in web documents will always be interpreted according to Unicode code points, regardless of the document's declared encoding.
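To make this concrete, here is a minimal Python sketch that uses the standard html module to resolve numeric character references. The sample markup bytes and the list of encodings are illustrative choices; the point is only that the reference &#233; resolves to the same Unicode character whichever encoding was used to read the file.

```python
import html

# Decimal and hexadecimal references both name Unicode code points.
print(html.unescape("&#62;"))    # greater-than symbol, code point 62
print(html.unescape("&#x3E;"))   # the same character, written in hex

# The markup itself contains only ASCII bytes, so it decodes identically
# under each of these encodings, and &#233; always means U+00E9.
source = b"caf&#233;"
for encoding in ("latin-1", "iso8859_5", "utf-8"):
    text = html.unescape(source.decode(encoding))
    print(encoding, text, f"U+{ord(text[-1]):04X}")
```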

6.1.1.1. Unicode code points

Unicode was originally intended to be a 16-bit encoded character set, but it was soon recognized that 65,536 code positions would not be enough, so it was extended to include more than a million available code points (not all of them are assigned, of course) on supplementary planes.

The first 16 bits, or 65,536 positions, in Unicode are referred to as the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use, such as character sets for Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, Cherokee, and others, as well as mathematical and other miscellaneous characters. Most ideographs are there, too, but due to their large numbers, many have been moved to a Supplementary Ideographic Plane.

Unicode was created with backward compatibility in mind. The first 256 code points in the BMP are identical to the Latin-1 character set, with the first 128 matching the established ASCII standard.
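The plane layout and the backward compatibility described above can be checked with a short Python sketch. The function name plane is an assumption made for this example; it simply treats each plane as a block of 65,536 code points.

```python
def plane(code_point: int) -> int:
    """Return the plane number of a code point (hypothetical helper).

    Each plane holds 65,536 (2**16) code points, so the plane number is
    whatever remains after dropping the low 16 bits.
    """
    return code_point >> 16


print(plane(ord(">")))      # 0: the Basic Multilingual Plane
print(plane(0x20000))       # 2: the Supplementary Ideographic Plane
print(plane(0x10FFFF))      # 16: the last plane

# Every Latin-1 byte value decodes to the Unicode character with the
# same number, and the first 128 of those agree with ASCII.
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))
print(latin1_matches, ascii_matches)
```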

6.1.1.2. Unicode encodings

Many character sets have only one encoding method, such as the ISO 8859 series. Unicode, however, may be encoded a number of ways. So although the code points never change, they may be represented by 1, 2, 3, or 4 bytes, depending on the encoding form. The encoding forms for Unicode are:

UTF-8: This is an expanding format that uses 1 byte for characters in the ASCII set, 2 bytes for additional character ranges, and 3 bytes for the rest of the BMP. Supplementary planes use 4 bytes. UTF-8 is the recommended Unicode encoding for web documents and other Internet technologies.
UTF-16: Uses 2 bytes for BMP characters and 4 bytes for supplementary characters. UTF-16 is another option for web documents.
UTF-32: Uses 4 bytes for all characters.

So while the code point for the percent sign is U+0025, it would be represented by the byte value 25 in UTF-8, 00 25 in UTF-16, and 00 00 00 25 in UTF-32 (shown here in big-endian byte order). There are other things at work in the encoding as well, but this gives you a feel for the difference in encoding forms.
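The byte counts for the three encoding forms can be seen with Python's built-in codecs. This is only a sketch: the sample characters are arbitrary picks from different ranges, and the big-endian codec names are used so that no byte order mark is added to the output.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of ch as space-separated hex pairs."""
    return ch.encode(encoding).hex(" ")


samples = {
    "percent sign (U+0025)": "%",
    "e with acute (U+00E9)": "\u00e9",
    "euro sign (U+20AC)": "\u20ac",
    "ideograph (U+20000)": "\U00020000",  # a supplementary-plane character
}

for name, ch in samples.items():
    print(name)
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"  {encoding:10} {hex_bytes(ch, encoding)}")
```

Note how the UTF-8 length grows from 1 to 4 bytes across the samples, while UTF-16 stays at 2 bytes until the supplementary character and UTF-32 is always 4 bytes.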

6.1.1.3. Choosing an encoding

The W3C recommends the UTF-8 encoding for all (X)HTML and XML documents because it can accommodate the greatest number of characters and is well supported by servers. It allows wide-ranging languages to be mixed within a single document.

Not all web documents need to be encoded using UTF-8, however. If you are authoring a document in a language that uses a lot of non-ASCII characters, you may want to choose an encoding that minimizes the need to numerically represent ("escape") these special characters.

Bear in mind, however, that regardless of the encoding, all characters in the document will be interpreted relative to Unicode code points.
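The trade-off can be sketched in Python, whose xmlcharrefreplace error handler substitutes a numeric character reference for any character the target encoding cannot represent. The sample words are arbitrary, and the byte counts are only meant to show the general effect.

```python
# A French word with one non-ASCII character.
word = "caf\u00e9"
print(word.encode("ascii", "xmlcharrefreplace"))  # the accented letter must be escaped
print(word.encode("latin-1"))                     # Latin-1 stores it as a single byte
print(word.encode("utf-8"))                       # UTF-8 stores it as two bytes

# A six-letter Russian word (written with escapes to keep this file ASCII).
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"
as_cyrillic = russian.encode("iso8859_5")                     # one byte per letter
as_utf8 = russian.encode("utf-8")                             # two bytes per letter
as_escaped = russian.encode("latin-1", "xmlcharrefreplace")   # every letter escaped
print(len(as_cyrillic), len(as_utf8), len(as_escaped))
```

In an encoding that lacks the characters, every letter turns into a reference such as &#1055;, which is why an encoding suited to the authoring language (or UTF-8) keeps the source far more readable.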

For more information on how character sets and character encodings should be handled for web documents, see the W3C's Character Model for the World Wide Web 1.0 Recommendation at www.w3.org/TR/charmod/.

6.1.2. Specifying Character Encoding

The W3C encourages authors to specify the character encoding for all web documents, even those that use the default UTF-8 Unicode encoding, but it is particularly critical if an alternate encoding is used. There are several ways to declare the character encoding for documents: in the HTTP header delivered by the server, in the XML declaration (for XHTML and XML documents only), or in a meta element in the head of the document. This section looks at each method and provides guidelines for their use.

6.1.2.1. HTTP headers

When a server sends a document to a user agent (such as a browser), it also sends information about the document in a portion of the response called the HTTP header. A typical HTTP header looks like this:

     HTTP/1.x 200 OK
     Date: Mon, 14 Nov 2005 19:45:33 GMT
     Server: Apache/2.0.46 (Red Hat)
     Accept-Ranges: bytes
     Connection: close
     Transfer-Encoding: chunked
     Content-Type: text/html; charset=UTF-8

Notice that one of the bits of information that the server sends along is the Content-Type of the document using a MIME type label. For example, HTML documents are always delivered as type text/html. (The MIME types for XHTML documents aren't as straightforward, as discussed in the sidebar, "Serving XHTML.") The Content-Type entry may also contain the character encoding of the document using the charset parameter, as shown in the example.
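As a rough sketch of how a program might read the charset parameter, the following Python uses the standard library's email.message.Message class, which understands MIME-style headers. The function name is hypothetical, and real user agents apply more elaborate rules than this.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if present.

    Hypothetical helper: it borrows the MIME header parser from the
    standard library, which reports the charset in lowercase.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))  # no charset parameter was sent
```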

The method for setting up a server with your preferred character encoding varies with different server software, so it is best to consult the server administrator for assistance. For Apache servers, the default character encoding may be set for all documents with the .html extension by adding this line to the .htaccess file:

     AddType 'text/html; charset=UTF-8' html

The advantages to setting character encodings in HTTP headers are that the information is easily accessible to user agents and the header information has the highest priority in case of conflict. On the downside, it is not always easy for authors to access the server settings, and it is possible for the default server settings to be changed without the author's knowledge.

It is also possible for the character encoding information to get separated from the document, which is why it is recommended that the character encoding be provided within the document as well, as described by the next two methods.

Serving XHTML

XHTML 1.0 documents may be served as either XML or HTML documents. Although XML is the proper method, many authors choose to deliver XHTML 1.0 files with the text/html MIME type used for HTML documents for reasons of backward compatibility, lack of browser support for XML files, and other problems with XHTML interpretation. When XHTML documents are served in this manner, they may not be parsed as XML documents.

XHTML 1.0 files may also be served as XML, and XHTML 1.1 files must always be served as XML. XHTML documents served as XML may use the MIME types application/xhtml+xml, application/xml, or text/xml. The W3C recommends that you use application/xhtml+xml only.

Whether you serve an XHTML document as an HTML or XML file type changes the way you specify the character encoding, as covered in the upcoming "Choosing the declaration method" section.

6.1.2.2. XML declaration

XHTML (and other XML) documents often begin with an XML declaration before the DOCTYPE declaration. The XML declaration is not required. The declaration may include the encoding of the document, as shown in this example.

     <?xml version="1.0" encoding="UTF-8"?>
     <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

The XML declaration may be provided even for XHTML documents served as text/html.

Because the default encoding for all XML documents is UTF-8 or UTF-16, encoding information in the XML declaration is not required for these encodings, and thus can be omitted as a space-saving optimization.

In addition, although it is technically correct to include the XML declaration in XHTML documents served as text/html, Appendix C of the XHTML 1.0 specification, "HTML Compatibility Guidelines," recommends avoiding it, and many authors choose to omit it because of browser-support issues. For example, when Internet Explorer 6 for Windows detects a line of text before the DOCTYPE declaration, it switches to Quirks Mode (see Chapter 9 for details), which can have a damaging effect on how the document's styles are rendered. (This is reportedly fixed in IE 7.) The XML declaration is required only if your document uses an encoding other than UTF-8 or UTF-16 and the encoding has not been set on the server.
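For illustration, here is a minimal Python sketch that reads the encoding from an XML declaration at the very start of a document. The function name and the regular expression are assumptions made for this example; the sketch handles only ASCII-compatible encodings and does not attempt the UTF-16 detection that a real XML parser performs.

```python
import re
from typing import Optional

# Matches an XML declaration at the start of the data and captures the
# value of the optional encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb"""^<\?xml\s+version=["'][^"']+["'](?:\s+encoding=["']([A-Za-z][A-Za-z0-9._-]*)["'])?"""
)


def xml_declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(data)
    if match and match.group(1):
        return match.group(1).decode("ascii")
    return None


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<html/>'))
print(xml_declared_encoding(b'<?xml version="1.0"?>\n<html/>'))  # default applies
```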

6.1.2.3. The meta element

For HTML documents as well as XHTML documents served as text/html, the encoding should always be specified using a meta element in the head of the document. The http-equiv attribute passes information along to the user agent as though it appeared in the HTTP header. Again, the encoding is provided with the charset value as shown here:

     <head>
         <meta http-equiv="content-type" content="text/html; charset=utf-8" />
         <title>Document Title</title>
     </head>

Although the meta element declaring the content type is not a required element in the HTML and XHTML DTDs, it is strongly recommended for the purpose of clearly identifying the character encoding and keeping that information with the document. This is particularly helpful for common text editors (such as BBEdit), which use the meta element to identify the character encoding of the document when opening the document for editing. With this method, all character encodings must be explicitly specified, including UTF-8 and UTF-16.
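A tool that wants to honor the meta element could locate it along the following lines. This Python sketch uses the standard html.parser module; the class name MetaCharsetFinder is made up for the example, and it assumes the document has already been decoded well enough to read its ASCII markup.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first content-type meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if (attributes.get("http-equiv") or "").lower() == "content-type" and content:
            # Reuse the MIME header parser to pull out the charset parameter.
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()


document = """<head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <title>Document Title</title>
</head>"""

finder = MetaCharsetFinder()
finder.feed(document)
print(finder.charset)
```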

6.1.2.4. Choosing the declaration method

The declaration method you use depends on the type of document you are authoring and its encoding method.

HTML documents: The encoding should be specified on the server and again in the document with a meta element. This makes sure the encoding is easily accessible and stays with the document should it be saved for later use.
XHTML 1.0 documents served as HTML: The encoding should be specified on the server and again in the document with a meta element. If the encoding is something other than UTF-8 or UTF-16, and the document is likely to be parsed as XML (not just HTML), then also include the encoding in an XML declaration. Be aware that the inclusion of the XML declaration may cause rendering problems for some browsers.
XHTML (1.0 and 1.1) documents served as XML: The encoding should be specified on the server and by using the encoding attribute in the XML declaration. Although not strictly required for UTF-8 and UTF-16 encodings, it doesn't hurt to include it anyway.
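The guidelines above can be read as a lookup order for whoever consumes the document. The following Python sketch is one possible interpretation, not a rule taken from any specification: it assumes the HTTP header wins when present (as noted earlier), that documents served as XML then rely on the XML declaration with UTF-8 as a simplified default, and that documents served as text/html then rely on the meta element. The function and parameter names are hypothetical.

```python
from typing import Optional


def effective_encoding(
    http_charset: Optional[str],
    xml_encoding: Optional[str],
    meta_charset: Optional[str],
    served_as_xml: bool,
) -> Optional[str]:
    """Pick an encoding from the available declarations (illustrative only)."""
    if http_charset:
        # The HTTP header has the highest priority in case of conflict.
        return http_charset.lower()
    if served_as_xml:
        # Assumption: fall back to UTF-8 and ignore UTF-16 detection.
        return (xml_encoding or "utf-8").lower()
    if meta_charset:
        return meta_charset.lower()
    return None  # nothing declared; the user agent would have to guess


# The header overrides a conflicting meta element.
print(effective_encoding("ISO-8859-1", None, "utf-8", served_as_xml=False))
# A saved copy with no header still carries its meta declaration.
print(effective_encoding(None, None, "utf-8", served_as_xml=False))
```

The second call shows why the encoding is declared inside the document as well: once the file is separated from its HTTP header, the in-document declaration is all that remains.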

This strategy for declaring character encodings is outlined in a tutorial on the W3C's Internationalization site (www.w3.org/International/tutorials/tutorial-char-enc/). For another approach, see the article "WaSP Asks the W3C: Specifying Character Encoding" on the Web Standards Project site (webstandards.org/learn/askw3c/dec2002.html).