6.3. UTF-16 and UCS-2

The UCS-2 and UTF-16 encodings use 16-bit code units. In these encodings, all characters in the Basic Multilingual Plane (BMP), and hence most characters that people use these days, are represented directly: each character is represented as one code unit, which contains the code number of the character as an unsigned 16-bit integer. Thus, the encodings are structurally simpler than UTF-8.

6.3.1. UCS-2 Is BMP Only

UCS-2 is by definition limited to the BMP. It is therefore not a full Unicode encoding: you cannot represent all Unicode data in UCS-2. On the other hand, UTF-16 is basically UCS-2 enhanced with a mechanism (surrogate pairs) for representing Unicode characters outside the BMP. If you don't use such characters, UTF-16 effectively behaves as UCS-2.

Thus, UCS-2 can be regarded as mainly historical. It is, however, still part of the ISO 10646 standard but not part of the Unicode standard. The registered MIME name of UCS-2 is ISO-10646-UCS-2.

6.3.2. Surrogate Pairs in UTF-16

UTF-16 uses surrogate pairs to overcome the 16-bit limitation. This means that some 16-bit values have been reserved for use as a high (leading) or low (trailing) value in a pair of code units. Together these values denote a Unicode character outside the BMP. The word "surrogate" is not very descriptive, and it has caused much confusion; in reality, the "surrogates" are simply an extension mechanism.

More exactly, a high surrogate is a code unit in the range D800 to DBFF, and a low surrogate is a code unit in the range DC00 to DFFF. We use hexadecimal numbers here without the "U+" prefix to emphasize that the surrogates are code units, not code points. A high surrogate immediately followed by a low surrogate denotes one code point, which is outside the BMP, i.e., in the range U+10000 to U+10FFFF.

Surrogate code units have a defined meaning only when they appear in a pair consisting of a high surrogate followed by a low surrogate. Otherwise, they have no defined meaning, and they are data errors.
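As a rough illustration of these ranges, the following Python sketch classifies 16-bit code units and reports surrogates that do not belong to a well-formed pair. The function names and the representation of a string as a list of integers are assumptions made for this example; they do not come from any standard library.

```python
# Surrogate ranges as given in the text (code units, not code points).
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit):
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit):
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are data errors,
    i.e. not part of a high surrogate immediately followed by a low one."""
    errors = []
    i = 0
    while i < len(units):
        if is_high_surrogate(units[i]):
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                i += 2  # well-formed pair: skip both code units
                continue
            errors.append(i)  # high surrogate without a following low one
        elif is_low_surrogate(units[i]):
            errors.append(i)  # low surrogate without a preceding high one
        i += 1
    return errors
```

For instance, `find_unpaired_surrogates([0x0041, 0xD835, 0xDC05])` returns an empty list, whereas a lone `0xD835` followed by `0x0041` is reported at index 0.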
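A surrogate code unit pair is constructed by the following algorithm:

1. Write the code number as a 21-bit binary integer and split it into three parts: u1 (the 5 most significant bits), u2 (the next 6 bits), and u3 (the 10 least significant bits).
2. Subtract 1 from u1. Since the code point is in the range U+10000 to U+10FFFF, the result fits in 4 bits.
3. The high surrogate consists of the bits 110110 followed by the 4 bits of u1 and the 6 bits of u2.
4. The low surrogate consists of the bits 110111 followed by the 10 bits of u3.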
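As an illustration only, the construction can be sketched in Python as follows. The variable names u1, u2, and u3 follow the worked example in the text; the function name `to_surrogate_pair` is a hypothetical choice for this sketch.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF have surrogate pairs")
    # View the code point as a 21-bit integer and split it into 5 + 6 + 10 bits.
    u1 = code_point >> 16            # 5 most significant bits
    u2 = (code_point >> 10) & 0x3F   # next 6 bits
    u3 = code_point & 0x3FF          # 10 least significant bits
    u1 -= 1                          # now in the range 0..15, i.e. 4 bits
    high = (0b110110 << 10) | (u1 << 6) | u2   # 110110 + u1 + u2
    low = (0b110111 << 10) | u3                # 110111 + u3
    return high, low


print([hex(unit) for unit in to_surrogate_pair(0x1D405)])
# prints ['0xd835', '0xdc05']
```

The result can be cross-checked against Python's built-in codec: `"\U0001D405".encode("utf-16-be").hex()` gives `'d835dc05'`.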
For example, consider the code point U+1D405. (It denotes mathematical bold capital "F," but its meaning is irrelevant here.) Writing it as a 21-bit binary integer, we get 000011101010000000101. When split into groups of 5, 6, and 10 bits, this gives u1 = 00001, u2 = 110101, and u3 = 0000000101. After subtracting 1, u1 = 0000. Now we can construct the surrogate code units: 1101100000110101 and 1101110000000101. In hexadecimal, they are D835 and DC05.

The example calculation was performed only to illustrate the algorithm. In practice, we don't do such calculations by hand or even write program code for them, except perhaps as an assignment when learning programming. We use existing software such as conversion programs and routines.

The algorithm implies the following arithmetic relationship between a code number U and the corresponding surrogate pair consisting of H (high surrogate) and L (low surrogate):

U = (H - D800) × 400 + (L - DC00) + 10000
Here all numbers are expressed in hexadecimal. Although the formula involves multiplication, the multiplier is a constant power of two. Such a multiplication can be implemented as a shift that moves bits to the left, which is typically faster than general multiplication on a computer.

6.3.3. Some Properties of UTF-16

Using UTF-16, you cannot access the nth character of a string directly. You need to scan the string and count the characters, since some characters (those in the BMP) occupy one code unit and others take two code units.

UTF-16 is robust, though, in the same sense as UTF-32: if a code unit is corrupted, then only one character is corrupted. If a normal code unit is corrupted so that it becomes a high surrogate, for example, the next code unit will still be interpreted correctly. Since that next code unit is not a low surrogate, we know that the preceding code unit is erroneous data.

If you access a code unit in a UTF-16 string, you can immediately recognize it as a BMP character or as a component of a surrogate pair, simply by checking whether it falls within the ranges for surrogates. If it is a high surrogate, you need to read the next code unit to determine the character. If it is a low surrogate, you need to read the preceding code unit.

Since conformance to the Unicode standard does not require support for all Unicode characters, it is quite permissible for an implementation to be ignorant of all characters outside the BMP. It could be incapable of rendering any of them or processing them in any useful way. However, for conformance, an implementation must be able to recognize a surrogate code unit pair in UTF-16 encoded data. It must not treat the code units in it as two characters but as a representation of one character, although perhaps a completely unknown character.
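The scanning described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: a string is modeled as a list of 16-bit integers, the function names are hypothetical, and an unpaired surrogate is replaced by U+FFFD, which is one common convention rather than something the text prescribes. The pair is combined with the relationship U = (H - D800) × 400 + (L - DC00) + 10000, with the multiplication written as a 10-bit left shift.

```python
def code_points(units):
    """Yield the code points encoded by a sequence of UTF-16 code units.

    Assumption of this sketch: an unpaired surrogate is yielded as U+FFFD.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Multiplying by hexadecimal 400 is the same as shifting left by 10 bits.
            yield ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00) + 0x10000
            i += 2  # a surrogate pair is one character in two code units
        elif 0xD800 <= unit <= 0xDFFF:
            yield 0xFFFD  # lone surrogate: data error
            i += 1
        else:
            yield unit  # BMP character: the code unit is the code number
            i += 1


def nth_code_point(units, n):
    """Return the nth character (0-based) by scanning from the start."""
    for index, cp in enumerate(code_points(units)):
        if index == n:
            return cp
    raise IndexError("the string has fewer than n + 1 characters")
```

With the code units `[0x0041, 0xD835, 0xDC05, 0x0042]`, the character at position 2 is U+0042 even though it is stored at code unit index 3, which is why direct indexing by character number is not possible.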