6.3. UTF-16 and UCS-2

The UCS-2 and UTF-16 encodings use 16-bit code units. In these encodings, all characters in the Basic Multilingual Plane (BMP), and hence most characters that people use these days, are represented directly: each such character is one code unit, which holds the code number of the character as an unsigned 16-bit integer. In this respect, the encodings are structurally simpler than UTF-8.

6.3.1. UCS-2 Is BMP Only

UCS-2 is by definition limited to the BMP. It is therefore not a full Unicode encoding: you cannot represent all Unicode data in UCS-2. UTF-16, on the other hand, is basically UCS-2 enhanced with a mechanism (surrogate pairs) for representing Unicode characters outside the BMP. If you don't use such characters, UTF-16 effectively behaves as UCS-2.
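The difference can be observed with a few lines of code. The following is a minimal sketch that relies on Python's built-in `utf-16-be` codec; the helper name `utf16_code_units` is made up for this illustration and is not part of any standard library.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

bmp_text = "A\u00E4\u20AC"    # three BMP characters
mixed_text = "F\U0001D405"    # a BMP character followed by U+1D405

# BMP-only text: one code unit per character, just as in UCS-2.
print(len(bmp_text), len(utf16_code_units(bmp_text)))      # 3 3
# A character outside the BMP takes two code units.
print(len(mixed_text), len(utf16_code_units(mixed_text)))  # 2 3
```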

Thus, UCS-2 can be regarded as mainly historical. It is, however, still part of the ISO 10646 standard, but not part of the Unicode standard. The registered MIME name of UCS-2 is ISO-10646-UCS-2.

6.3.2. Surrogate Pairs in UTF-16

UTF-16 uses surrogate pairs to overcome the 16-bit limitation. This means that some 16-bit values have been reserved for use as a high (leading) or low (trailing) value in a pair of code units. Together these values denote a Unicode character outside the BMP. The word "surrogate" is not very descriptive, and it has caused much confusion; in reality, the "surrogates" are simply an extension mechanism.

More exactly, a high surrogate is a code unit in the range D800 to DBFF, and a low surrogate is in the range DC00 to DFFF. We use hexadecimal numbers here without the "U+" prefix to emphasize that the surrogates are code units, not code points. Two consecutive surrogate code units, a high one followed by a low one, together denote one code point, which is outside the BMP, i.e., in the range U+10000 to U+10FFFF.

Surrogate code units have a defined meaning only when they appear in a pair of a high surrogate and a low surrogate. Otherwise, they have no defined meaning, and they are data errors.
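A check for such data errors might look like the following sketch. It assumes that the data has already been split into a list of 16-bit integers; the function names are invented for this example.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a pair."""
    errors = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high_surrogate(unit):
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                i += 2           # well-formed pair: skip both code units
                continue
            errors.append(i)     # high surrogate not followed by a low one
        elif is_low_surrogate(unit):
            errors.append(i)     # low surrogate not preceded by a high one
        i += 1
    return errors

print(find_unpaired_surrogates([0x41, 0xD835, 0xDC05, 0x42]))  # []
print(find_unpaired_surrogates([0xDC05, 0xD835]))              # [0, 1]
```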

A surrogate code unit pair is constructed by the following algorithm:

  1. Given a Unicode code point outside the BMP (i.e., with value > FFFF), represent it as a 21-bit integer, with leading zeros as necessary.

  2. Divide this sequence of 21 bits into parts with 5, 6, and 10 bits; denote the parts with u1, u2, and u3, respectively.

  3. Subtract 1 from u1, and consider the result as a 4-bit sequence. Note that this loses no information, since the original u1 is at most 10000 binary (because the Unicode range ends at 10FFFF hexadecimal, 100001111111111111111 binary).

  4. Construct the high surrogate code unit as 110110u1u2 (by simple catenation of bit sequences).

  5. Construct the low surrogate code unit as 110111u3.

For example, consider the code point U+1D405. (It denotes mathematical bold capital "F," but its meaning is irrelevant here.) Writing it as a 21-bit binary integer, we get 000011101010000000101. When split, this gives u1 = 00001, u2 = 110101, and u3 = 0000000101. After subtraction, u1 = 0000. Now we can construct the surrogate code units: 1101100000110101 and 1101110000000101. In hexadecimal, they are D835 and DC05.
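The five steps can be sketched in code as follows. This is only an illustration of the algorithm described above, with a made-up function name; real programs would normally rely on an existing conversion routine.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    u1 = code_point >> 16            # first 5 bits of the 21-bit value
    u2 = (code_point >> 10) & 0x3F   # next 6 bits
    u3 = code_point & 0x3FF          # last 10 bits
    u1 -= 1                          # now fits in 4 bits
    high = (0b110110 << 10) | (u1 << 6) | u2   # 110110 + u1 + u2
    low = (0b110111 << 10) | u3                # 110111 + u3
    return high, low

print("%04X %04X" % to_surrogate_pair(0x1D405))  # D835 DC05
```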

The example calculation was performed only to illustrate the algorithm. In practice, we don't do such calculations by hand or even write program code for them, except perhaps as an assignment when learning programming. We use existing software such as conversion programs and routines.

The algorithm implies the following arithmetic relationship between a code number U and the corresponding surrogate pair consisting of H (high surrogate) and L (low surrogate):

U = (H - D800) x 400 + (L - DC00) + 10000

Here all numbers are expressed in hexadecimal. For the example above, (D835 - D800) x 400 = D400, adding DC05 - DC00 = 5 gives D405, and adding 10000 gives 1D405. Although the formula contains a multiplication, the multiplier is a constant and a power of two (400 hexadecimal is 2 to the power 10). Such a multiplication can be implemented efficiently as a shift that moves bits to the left, which is typically faster than a general multiplication on a computer.
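As a minimal sketch of the decoding direction, the following function computes U as H - D800 shifted left by ten bits, plus L - DC00, plus 10000 hexadecimal. The function name is hypothetical, and the range checks are an addition made for this example.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate code unit into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    # Shifting left by 10 bits is the same as multiplying by 400 hexadecimal.
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000

print("U+%04X" % from_surrogate_pair(0xD835, 0xDC05))  # U+1D405
```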

6.3.3. Some Properties of UTF-16

Using UTF-16, you cannot access the nth character of a string directly. You need to scan the string and count the characters, since some characters (those in the BMP) occupy one code unit, while others take two code units. UTF-16 is robust, though, in the same sense as UTF-32: if a code unit is corrupted, then only one character is corrupted. If an ordinary code unit is corrupted so that it becomes a high surrogate, for example, the next code unit will still be interpreted correctly. Since it is not a low surrogate, we know that the preceding code unit is erroneous data.
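The scanning can be sketched as follows, assuming the string is available as a list of 16-bit code units. The function name is invented for this example, and returning an unpaired surrogate unchanged is a simplification chosen here, not a recommended way of handling erroneous data.

```python
def nth_character(units, n):
    """Return the code point of character number n (counting from 0)."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        paired = (0xD800 <= unit <= 0xDBFF
                  and i + 1 < len(units)
                  and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if count == n:
            if paired:
                return ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00) + 0x10000
            return unit          # BMP character (or an unpaired surrogate, as is)
        i += 2 if paired else 1  # a pair counts as one character
        count += 1
    raise IndexError("the string has too few characters")

units = [0x0046, 0xD835, 0xDC05, 0x0047]
print("U+%04X" % nth_character(units, 1))  # U+1D405
print("U+%04X" % nth_character(units, 2))  # U+0047
```

The cost of the lookup grows with n, which is the price paid for the variable length of characters.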

If you access a code unit in a UTF-16 string, you can immediately recognize it as a BMP character or as a component of a surrogate pair, simply by checking whether it falls within the ranges for surrogates. If it is a high surrogate, you need to read the next code unit to determine a character. If it is a low surrogate, you need to read the preceding code unit.
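This local check can be illustrated with a short sketch. The names `classify` and `character_span` are made up for this example, and the string is again assumed to be a list of 16-bit code units.

```python
def classify(unit):
    """Tell what kind of code unit this is, by range check alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"

def character_span(units, index):
    """Return (start, end) so that units[start:end] is the character at index."""
    kind = classify(units[index])
    if (kind == "high surrogate" and index + 1 < len(units)
            and classify(units[index + 1]) == "low surrogate"):
        return index, index + 2       # read the next code unit as well
    if (kind == "low surrogate" and index > 0
            and classify(units[index - 1]) == "high surrogate"):
        return index - 1, index + 1   # read the preceding code unit as well
    return index, index + 1           # BMP character or unpaired surrogate

units = [0x0046, 0xD835, 0xDC05, 0x0047]
print(character_span(units, 1))  # (1, 3)
print(character_span(units, 2))  # (1, 3)
```

At most one neighboring code unit needs to be inspected, in either direction.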

Since conformance to the Unicode standard does not require support for all Unicode characters, it is quite permissible for an implementation to be ignorant of all characters outside the BMP. It could be incapable of rendering any of them or processing them in any useful way. However, for conformance, an implementation must be able to recognize that there is a surrogate code unit pair in UTF-16 encoded data. It must not treat the code units in it as two characters but as a representation of one character, although perhaps a completely unknown character.
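As a rough sketch of such an implementation, the following decoder knows nothing about characters outside the BMP but still treats each surrogate pair as one character. Using U+FFFD (the replacement character) as the placeholder, and using it for unpaired surrogates too, are assumptions made for this example only.

```python
UNKNOWN = "\uFFFD"  # placeholder chosen for this sketch

def decode_bmp_only(units):
    """Decode UTF-16 code units, showing each non-BMP character as one placeholder."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(UNKNOWN)    # one unknown character for the whole pair
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            out.append(UNKNOWN)    # unpaired surrogate: erroneous data
            i += 1
        else:
            out.append(chr(unit))  # ordinary BMP character
            i += 1
    return "".join(out)

text = decode_bmp_only([0x0046, 0xD835, 0xDC05, 0x0047])
print(len(text))  # 3
```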



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139