Section 6.6. Conversions Between Unicode Encodings

When you need to convert data between the UTF-8, UTF-16, and UTF-32 encodings, you normally use programs or routines that can read and write text in the different encodings, as described in Chapter 3. To clarify how the encodings relate to each other, however, we will discuss the nature of the conversions here. A conversion from UTF-32 to UTF-16 means the following:

Characters in the BMP are represented by omitting the two most significant octets (which are zero in UTF-32 for BMP characters).
Other characters are replaced by surrogate code unit pairs. This means replacing one 32-bit code unit by two 16-bit code units.
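
The two cases above can be sketched in Python as follows. This is a minimal illustration, not a replacement for a library routine; the function name code_point_to_utf16 is hypothetical, and the code works on integer values of code units, leaving byte order aside.

```python
def code_point_to_utf16(cp: int) -> list[int]:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0xFFFF:
        # BMP character: the 16 low bits are the single code unit.
        return [cp]
    # Other characters: split the 20-bit offset from 0x10000 into two
    # 10-bit halves and place them in the surrogate ranges.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]
```

For example, code_point_to_utf16(0x20AC) gives the single code unit 0x20AC, whereas code_point_to_utf16(0x1D11E) gives the pair 0xD834, 0xDD1E.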

A conversion in the opposite direction naturally means extension with two zero octets for BMP characters and decoding a surrogate code unit pair into a code number, to be represented as a 32-bit quantity.
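
The opposite direction might look like the following sketch. The name utf16_to_code_points is hypothetical, the input is assumed to be a sequence of 16-bit code unit values already extracted from the data (so byte order is not considered), and unpaired surrogates are treated as errors, which is one possible policy among several.

```python
def utf16_to_code_points(units: list[int]) -> list[int]:
    """Decode a sequence of UTF-16 code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            # BMP character: the code unit is the code point itself.
            code_points.append(unit)
            i += 1
    return code_points
```

Each resulting integer can then be stored as a 32-bit quantity, which for BMP characters amounts to adding two zero octets.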

A conversion from UTF-32 to UTF-8 simply means that the UTF-8 encoding algorithm, as presented in Table 6-1, is applied. The reverse conversion is straightforward, too, since it can operate octet by octet, using the first few bits of an octet to determine its role.
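
As an illustration, both directions can be sketched in Python using the standard UTF-8 bit patterns. The function names are hypothetical, and the decoder is deliberately simplified: it checks the role of each octet from its leading bits, but it does not reject overlong forms or encoded surrogates, as a conforming decoder should.

```python
def code_point_to_utf8(cp: int) -> bytes:
    """Encode one code point as one to four UTF-8 octets."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_to_code_points(data: bytes) -> list[int]:
    """Decode UTF-8 octets into code points (simplified validation)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The leading bits of the first octet tell how many
        # continuation octets follow and which bits carry data.
        if lead < 0x80:
            cp, n = lead, 0
        elif 0xC0 <= lead < 0xE0:
            cp, n = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:
            cp, n = lead & 0x0F, 2
        elif 0xF0 <= lead < 0xF8:
            cp, n = lead & 0x07, 3
        else:
            raise ValueError(f"unexpected octet {lead:#04x} at index {i}")
        tail = data[i + 1:i + 1 + n]
        if len(tail) != n or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"bad continuation after index {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)   # append six data bits
        code_points.append(cp)
        i += 1 + n
    return code_points
```

For example, code_point_to_utf8(0x20AC) yields the three octets E2 82 AC, and decoding those octets gives back the single code point 0x20AC.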

Conversions between UTF-8 and UTF-16 are best performed via an intermediate representation that corresponds to UTF-32. This does not require the creation of an actual UTF-32 encoded representation of the file or data stream. Instead, you can operate on one code point at a time: read as many code units from the UTF-8 (or UTF-16) encoded data as are needed to determine the Unicode code number that they represent, and then encode this number in UTF-16 (or, respectively, in UTF-8).
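
The following Python generator sketches this approach for the UTF-8 to UTF-16 direction. The name utf8_to_utf16_units is hypothetical; the code holds only one code point at a time, produces UTF-16 code units as integers (byte order is left aside), and, as a simplification, does not reject overlong UTF-8 forms.

```python
from typing import Iterable, Iterator


def utf8_to_utf16_units(octets: Iterable[int]) -> Iterator[int]:
    """Convert UTF-8 octets to UTF-16 code units, one code point at a time."""
    it = iter(octets)
    for lead in it:
        # Step 1: read as many octets as needed for one code point.
        if lead < 0x80:
            cp, n = lead, 0
        elif 0xC0 <= lead < 0xE0:
            cp, n = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:
            cp, n = lead & 0x0F, 2
        elif 0xF0 <= lead < 0xF8:
            cp, n = lead & 0x07, 3
        else:
            raise ValueError(f"unexpected octet {lead:#04x}")
        for _ in range(n):
            t = next(it, None)
            if t is None or (t & 0xC0) != 0x80:
                raise ValueError("missing or bad continuation octet")
            cp = (cp << 6) | (t & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not an encodable code point: {cp:#x}")
        # Step 2: encode that code point in UTF-16.
        if cp <= 0xFFFF:
            yield cp
        else:
            offset = cp - 0x10000
            yield 0xD800 | (offset >> 10)
            yield 0xDC00 | (offset & 0x3FF)
```

Since a bytes object yields integers when iterated, the generator can be applied directly, as in list(utf8_to_utf16_units("a€".encode("utf-8"))). Because it consumes its input lazily, it can also be fed from a stream without holding the entire data in memory.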