6.9. Choosing an Encoding

The Unicode standard explicitly says that the Unicode Consortium "fully endorses the use of any of the three Unicode encoding forms [UTF-8, UTF-16, and UTF-32] as a conformant way of implementing the Unicode Standard." As far as the Unicode standard is concerned, it expresses no preference and leaves the choice up to you. The forms are not equally suitable in practice, though. For use on the Internet, the Internet Engineering Task Force (IETF) has expressed a strong preference for UTF-8. In programming, you may find UTF-16 (or sometimes UTF-32) most suitable due to its simplicity. There are also efficiency differences.
6.9.1. Storage Requirements

The storage requirements for the encodings in octets are summarized in Table 6-5. If almost all characters in the text are Basic Latin characters, as in English, UTF-8 is clearly the most compact. The second class of characters, range U+0080 to U+07FF, currently contains the following blocks: Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek and Coptic, Cyrillic, Cyrillic Supplement, Armenian, Hebrew, Arabic, Syriac, Arabic Supplement, and Thaana. Thus, for this collection of alphabetic scripts, UTF-8 and UTF-16 use the same number (2) of octets per character.
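The size differences are easy to observe directly. The following Python sketch (with hypothetical sample strings of my own choosing) compares the octet counts of the three encoding forms; the "-le" codec variants are used so that no byte order mark is included in the count:

```python
# Compare the storage requirements of the three Unicode encoding forms.
# The sample strings are illustrative, not from any particular corpus.
samples = {
    "English": "Hello, world",     # Basic Latin only
    "Russian": "Привет, мир",      # mostly U+0080..U+07FF
    "Japanese": "こんにちは",       # BMP characters above U+07FF
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))   # "-le" avoids counting a BOM
    utf32 = len(text.encode("utf-32-le"))
    print(f"{name}: {len(text)} characters -> "
          f"UTF-8: {utf8}, UTF-16: {utf16}, UTF-32: {utf32} octets")
```

For the English sample, UTF-8 uses one octet per character; for the Russian sample, UTF-8 and UTF-16 are nearly equal (the Cyrillic letters take two octets in both, though the comma and space still take only one in UTF-8); for the Japanese sample, UTF-16 is the most compact. UTF-32 always uses four octets per character.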
Storage requirements naturally affect data transfer time as well. For example, for a document distributed on the Internet, the use of disk space (on a server, and on users' workstations) is relatively unimportant, unless the document is very large. However, the time required for transmission over the network is almost proportional to file size, at least unless some compression is applied. Transfer time matters especially on slow connections and for files that are requested very often. On the other hand, the size of text files is often a relatively small factor in material that contains images, videos, and other nontext files.

6.9.2. Efficiency of Processing

What you lose in storage might be gained in processing simplicity and speed. In UTF-32, you have one character per code unit, and a 32-bit code unit typically corresponds to the integer type in modern computer architectures. If you process BMP characters only, as you probably do, UTF-16 sounds tempting, especially since UTF-16 is the representation form of characters in many programming languages, such as Java. However, when dealing with arbitrary data, you cannot really be sure of never getting any characters beyond the BMP.

UTF-16 is used internally in all modern versions of Windows. This makes it efficient for system-oriented programming, or generally for programming that uses the built-in functions of Windows. Subroutine libraries have often been written to assume UTF-16 (or perhaps just UCS-2) representation of character data.

In processing, UTF-32 has the benefit of using exactly one code unit per Unicode character. UTF-16 shares this property for BMP characters, which constitute the vast majority of all characters that you process. However, the simple correspondence between code units and characters is somewhat illusory. Even in UTF-32 and UTF-16, something that constitutes a character in the user's thinking need not correspond to a single code unit.
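Both caveats can be demonstrated in a few lines of Python. The first assertion pair shows a supplementary-plane character taking one UTF-32 code unit but two UTF-16 code units (a surrogate pair); the second shows that even a single code point is not always a whole "character" in the user's sense:

```python
import unicodedata

# A character beyond the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
clef = "\U0001D11E"
assert len(clef.encode("utf-32-le")) // 4 == 1   # one 32-bit code unit
assert len(clef.encode("utf-16-le")) // 2 == 2   # two 16-bit code units (surrogate pair)

# Even with one code unit per code point, a user-perceived character
# may consist of several code points: é decomposed into "e" + accent.
decomposed = unicodedata.normalize("NFD", "\u00E9")
assert len(decomposed) == 2
```

A library that indexes by 16-bit code units (as Java's `String` does) would report the clef as having length 2, which is exactly the kind of surprise arbitrary input can spring on code written for the BMP only.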
For example, the character é might be represented in decomposed form, as two code points (for the letter "e" and a combining mark), hence as two code units. Thus, even a simple operation like "move one character forward" might need to be more complicated than just proceeding to the next code unit.

UTF-8 is also suitable for work with old programming languages like C, where the character data type is identified with an octet (byte) concept. When you use a string in such a language, you can store UTF-8 encoded data as such, but you need to handle the interpretation (decoding) of octet sequences as characters yourself.

6.9.3. Specific Limitations

In any of UTF-8, UTF-16, and UTF-32, octets with the most significant bit set may appear. Thus, they cannot be safely transmitted over connections or through software that are not "eight-bit clean" but may mask out the most significant bit, interpret it as a sign bit or parity bit, or otherwise process it incorrectly. In such situations, you could use UTF-7, but it is usually better to use one of the standard UTF encodings together with an additional transfer encoding, usually Base64 or Quoted-Printable.

The software you use may impose restrictions on the use of encodings. However, if a program can handle any of UTF-8, UTF-16, and UTF-32, it can probably handle the others as well. Some old software, reflecting the original 16-bit design of Unicode, might effectively support UCS-2 only, which means that you can use UTF-16 but need to limit the character repertoire to the BMP.

6.9.4. Favoring UTF-8 on the Internet

UTF-8 is typically the preferred encoding form for Unicode data on the Internet, including web pages in HTML format. UTF-8 is explicitly recommended by the Internet Engineering Task Force (IETF) in the document "IETF Policy on Character Sets and Languages," published in 1998 as RFC 2277 and also labeled as Best Current Practice (BCP) 18, which is written basically as a policy on Internet protocols.
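The workaround described under "Specific Limitations" above, a standard UTF encoding wrapped in an additional transfer encoding, can be sketched as follows (using Base64, as MIME does; the sample string is hypothetical):

```python
import base64

# Sending UTF-8 data over a channel that is not "eight-bit clean":
# wrap the octets in Base64 so only 7-bit ASCII crosses the channel.
text = "naïve café"
utf8_octets = text.encode("utf-8")
assert any(b > 0x7F for b in utf8_octets)     # high bit set in some octets

wire = base64.b64encode(utf8_octets)          # pure 7-bit ASCII on the wire
assert all(b < 0x80 for b in wire)

received = base64.b64decode(wire).decode("utf-8")
assert received == text                        # round-trips losslessly
```

The cost is roughly a one-third increase in size, which is why Quoted-Printable is often preferred when the data is mostly ASCII.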
In practice, web browsers generally accept both UTF-8 and UTF-16, if they handle Unicode at all (as the great majority of browsers do). However, important software like the Google search engine has been reported to fail to recognize UTF-16 properly. UTF-32 is not suitable for use on the Internet. For example, Internet Explorer 6 does not recognize it at all. Moreover, UTF-32 wastes storage and transfer time. However, there is nothing wrong with using UTF-16 or even UTF-32 internally in databases, for example. If desired, you can store the data in such a format and operate on it internally, while accepting user input and presenting results in UTF-8, or in any encoding that suits the user.
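The pattern suggested above, an internal UTF-32 representation with UTF-8 at the user-facing boundary, can be sketched as follows (the "database field" here is just a bytes value standing in for real storage):

```python
# Hypothetical round trip: UTF-8 at the boundary, UTF-32 internally.
user_input_utf8 = "Résumé".encode("utf-8")     # octets arriving from the user

text = user_input_utf8.decode("utf-8")         # decode once, at the boundary
stored = text.encode("utf-32-le")              # internal (database) storage form

result = stored.decode("utf-32-le").upper()    # operate on the internal data
output_utf8 = result.encode("utf-8")           # re-encode for presentation
assert output_utf8.decode("utf-8") == "RÉSUMÉ"
```

The key design choice is to confine encoding and decoding to the edges of the system, so that internal code always works with a single, known representation.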