6.9. Choosing an Encoding

The Unicode standard explicitly says that the Unicode Consortium "fully endorses the use of any of the three Unicode encoding forms [UTF-8, UTF-16, and UTF-32] as a conformant way of implementing the Unicode Standard." As far as the Unicode standard is concerned, it expresses no preference and leaves the choice up to you. The forms are not equally suitable in practice, though. For use on the Internet, the Internet Engineering Task Force (IETF) has expressed a strong preference for UTF-8. In programming, you may find UTF-16 (or sometimes UTF-32) most suitable due to its simplicity. There are also efficiency differences.
6.9.1. Storage Requirements

The storage requirements for the encodings in octets are summarized in Table 6-5. If almost all characters in the text are Basic Latin characters, as in English, UTF-8 is clearly the most compact. The second class of characters, range U+0080 to U+07FF, currently contains the following blocks: Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek and Coptic, Cyrillic, Cyrillic Supplement, Armenian, Hebrew, Arabic, Syriac, Arabic Supplement, and Thaana. Thus, for this collection of alphabetic scripts, UTF-8 and UTF-16 use the same number (2) of octets per character.
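The size differences are easy to observe directly. The following Python sketch (with hypothetical sample strings of my own choosing) compares the octet counts of the three encoding forms; the "-le" codec variants are used so that no byte order mark is included in the count:

```python
# Compare the storage requirements of the three Unicode encoding forms.
# The sample strings are illustrative, not from any particular corpus.
samples = {
    "English": "Hello, world",     # Basic Latin only
    "Russian": "Привет, мир",      # mostly U+0080..U+07FF
    "Japanese": "こんにちは",       # BMP characters above U+07FF
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))   # "-le" avoids counting a BOM
    utf32 = len(text.encode("utf-32-le"))
    print(f"{name}: {len(text)} characters -> "
          f"UTF-8: {utf8}, UTF-16: {utf16}, UTF-32: {utf32} octets")
```

For the English sample, UTF-8 uses one octet per character; for the Russian sample, UTF-8 and UTF-16 are nearly equal (the Cyrillic letters take two octets in both, though the comma and space still take only one in UTF-8); for the Japanese sample, UTF-16 is the most compact. UTF-32 always uses four octets per character.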
Storage requirements naturally affect data transfer time as well. For example, for a document distributed on the Internet, the use of disk space (on a server, and on users' workstations) is relatively unimportant, unless the document is very large. However, the time required for transmission over the network is almost proportional to file size, at least unless some compression is applied. Transfer time matters especially on slow connections and for files that are requested very often. On the other hand, the size of text files is often a relatively small factor in material that contains images, videos, and other nontext files.

6.9.2. Efficiency of Processing

What you lose in storage might be gained in processing simplicity and speed. In UTF-32, you have one character per code unit, and a 32-bit code unit typically corresponds to the integer type in modern computer architectures. If you process BMP characters only, as you probably do, UTF-16 sounds tempting, especially since UTF-16 is the representation form of characters in many programming languages, such as Java. However, when dealing with arbitrary data, you cannot really be sure of never getting any characters beyond the BMP.

UTF-16 is used internally in all modern versions of Windows. This makes it efficient for system-oriented programming, or generally for programming that uses the built-in functions of Windows. Subroutine libraries have often been written to assume UTF-16 (or perhaps just UCS-2) representation of character data.

In processing, UTF-32 has the benefit of using exactly one code unit per Unicode character. UTF-16 shares this property for BMP characters, which constitute the vast majority of all characters that you process. However, the simple correspondence between code units and characters is somewhat illusory. Even in UTF-32 and UTF-16, something that constitutes a character in the user's thinking need not correspond to a single code unit.
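Both caveats can be demonstrated in a few lines of Python. The first assertion pair shows a supplementary-plane character taking one UTF-32 code unit but two UTF-16 code units (a surrogate pair); the second shows that even a single code point is not always a whole "character" in the user's sense:

```python
import unicodedata

# A character beyond the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
clef = "\U0001D11E"
assert len(clef.encode("utf-32-le")) // 4 == 1   # one 32-bit code unit
assert len(clef.encode("utf-16-le")) // 2 == 2   # two 16-bit code units (surrogate pair)

# Even with one code unit per code point, a user-perceived character
# may consist of several code points: é decomposed into "e" + accent.
decomposed = unicodedata.normalize("NFD", "\u00E9")
assert len(decomposed) == 2
```

A library that indexes by 16-bit code units (as Java's `String` does) would report the clef as having length 2, which is exactly the kind of surprise arbitrary input can spring on code written for the BMP only.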
For example, the character é might be represented in decomposed form, as two code points (for the letter "e" and a combining mark), hence as two code units. Thus, even a simple operation like "move one character forward" might need to be more complicated than just proceeding to the next code unit.

UTF-8 is also suitable for work with old programming languages like C, where the character data type is identified with an octet (byte) concept. When you use a string in such a language, you can store UTF-8 encoded data as such, but you need to handle the interpretation (decoding) of octet sequences as characters yourself.

6.9.3. Specific Limitations

In any of UTF-8, UTF-16, and UTF-32, octets with the most significant bit set may appear. Thus, they cannot be safely transmitted over connections or through software that are not "eight-bit clean" but may mask out the most significant bit, interpret it as a sign bit or parity bit, or otherwise process it incorrectly. In such situations, you could use UTF-7, but it is usually better to use one of the standard UTF encodings together with an additional transfer encoding, usually Base64 or Quoted-Printable.

The software you use may impose restrictions on the use of encodings. However, if a program can handle any of UTF-8, UTF-16, and UTF-32, it can probably handle the others as well. Some old software, reflecting the original 16-bit design of Unicode, might effectively support UCS-2 only, which means that you can use UTF-16 but need to limit the character repertoire to the BMP.

6.9.4. Favoring UTF-8 on the Internet

UTF-8 is typically the preferred encoding form for Unicode data on the Internet, including web pages in HTML format. UTF-8 is explicitly recommended by the Internet Engineering Task Force (IETF) in the document "IETF Policy on Character Sets and Languages," published in 1998 as RFC 2277 and also labeled as Best Current Practice (BCP) 18, which is written basically as a policy on Internet protocols.
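The workaround described under "Specific Limitations" above, a standard UTF encoding wrapped in an additional transfer encoding, can be sketched as follows (using Base64, as MIME does; the sample string is hypothetical):

```python
import base64

# Sending UTF-8 data over a channel that is not "eight-bit clean":
# wrap the octets in Base64 so only 7-bit ASCII crosses the channel.
text = "naïve café"
utf8_octets = text.encode("utf-8")
assert any(b > 0x7F for b in utf8_octets)     # high bit set in some octets

wire = base64.b64encode(utf8_octets)          # pure 7-bit ASCII on the wire
assert all(b < 0x80 for b in wire)

received = base64.b64decode(wire).decode("utf-8")
assert received == text                        # round-trips losslessly
```

The cost is roughly a one-third increase in size, which is why Quoted-Printable is often preferred when the data is mostly ASCII.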
In practice, web browsers generally accept both UTF-8 and UTF-16, if they handle Unicode at all (as the great majority of browsers do). However, important software like the Google search engine has been reported to fail to recognize UTF-16 properly. UTF-32 is not suitable for use on the Internet. For example, Internet Explorer 6 does not recognize it at all. Moreover, UTF-32 wastes storage and transfer time. However, there is nothing wrong with using UTF-16 or even UTF-32 internally in databases, for example. If desired, you can store the data in such a format and operate on it internally, while accepting user input and presenting results in UTF-8, or in any encoding that suits the user.
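The pattern suggested above, an internal UTF-32 representation with UTF-8 at the user-facing boundary, can be sketched as follows (the "database field" here is just a bytes value standing in for real storage):

```python
# Hypothetical round trip: UTF-8 at the boundary, UTF-32 internally.
user_input_utf8 = "Résumé".encode("utf-8")     # octets arriving from the user

text = user_input_utf8.decode("utf-8")         # decode once, at the boundary
stored = text.encode("utf-32-le")              # internal (database) storage form

result = stored.decode("utf-32-le").upper()    # operate on the internal data
output_utf8 = result.encode("utf-8")           # re-encode for presentation
assert output_utf8.decode("utf-8") == "RÉSUMÉ"
```

The key design choice is to confine encoding and decoding to the edges of the system, so that internal code always works with a single, known representation.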