6.2. UTF-32 and UCS-4

UTF-32 uses a 32-bit code unit to represent a code number (and hence a character). That is, a code unit is simply a sequence of 32 bits (four octets) that represents the code number as an integer in binary notation. Since Unicode code numbers are guaranteed to fit into 21 bits, this wastes space; the most significant 11 bits in a code unit are always zero.
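As a minimal sketch of this layout, the following Python fragment packs each code number into one 32-bit unit and checks that the 11 most significant bits are zero. The helper name `encode_utf32_be` and the choice of big-endian octet order without a byte order mark are assumptions made for illustration only.

```python
import struct

def encode_utf32_be(text: str) -> bytes:
    """Encode each code number as one big-endian 32-bit code unit (no BOM)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

# Sample string: LATIN CAPITAL LETTER A, EURO SIGN, and a character beyond the BMP.
sample = "A\u20ac\U0001F600"
encoded = encode_utf32_be(sample)

# Read the octets back as three unsigned 32-bit integers.
units = struct.unpack(">3I", encoded)

# 0xFFE00000 selects the 11 most significant bits of a 32-bit unit.
top_bits_zero = all((u & 0xFFE00000) == 0 for u in units)
```

Each of the three characters occupies exactly four octets, so `encoded` is 12 octets long regardless of which characters are used.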

On the other hand, addressing of 32-bit units is efficient in modern computers. Because every character occupies exactly one fixed-size code unit, UTF-32 also allows fast data access and is well suited for data processing. To address the nth character of a string, a program just adds 4 × (n − 1) octets to the base (start) address of the string.
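The address calculation can be sketched in Python as follows. The function name `nth_char` is hypothetical, and the data is assumed to be big-endian UTF-32 without a byte order mark.

```python
import struct

def nth_char(data: bytes, n: int) -> str:
    """Return the nth character (counting from 1) of big-endian UTF-32 data."""
    offset = 4 * (n - 1)  # octet offset of the nth code unit
    if n < 1 or offset + 4 > len(data):
        raise IndexError("character index out of range")
    # Read one unsigned 32-bit integer directly at the computed offset.
    (code_number,) = struct.unpack_from(">I", data, offset)
    return chr(code_number)

data = "na\u00efve \U0001F600".encode("utf-32-be")
third = nth_char(data, 3)    # the i with diaeresis, U+00EF
seventh = nth_char(data, 7)  # U+1F600, which lies outside the BMP
```

No scanning of preceding characters is needed. In UTF-8 or UTF-16, where characters have variable length, the same lookup would generally require walking through the string from the start.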

UTF-32 is robust in the sense that if a code unit is corrupted, all the rest of the data remains intact. Each code unit represents a code number, independently of other code units.
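The following sketch illustrates this independence by damaging one code unit and decoding the data unit by unit. It assumes that the corruption changes octet values without inserting or deleting octets, and the choice to substitute U+FFFD for an invalid unit is an illustrative convention, not something UTF-32 itself prescribes. The name `decode_units` is hypothetical.

```python
import struct

def decode_units(data: bytes) -> list:
    """Decode big-endian UTF-32 unit by unit; invalid units become U+FFFD."""
    out = []
    for (unit,) in struct.iter_unpack(">I", data):
        valid = unit <= 0xD7FF or 0xE000 <= unit <= 0x10FFFF
        out.append(chr(unit) if valid else "\ufffd")
    return out

data = bytearray("abcde".encode("utf-32-be"))
data[8] = 0xFF  # corrupt the most significant octet of the third code unit

decoded = "".join(decode_units(bytes(data)))
```

Only the third character is lost; the two characters before it and the two after it decode exactly as before.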

Since the Unicode coding space is limited to 21 bits, and since UTF-32 does not use surrogate code units (only UTF-16 does), UTF-32 encoded data contains code units from the following ranges only (expressed in hexadecimal): 0000 to D7FF and E000 to 10FFFF. This can be used as a basis for a rough check: if you take a reasonably large file that contains data other than UTF-32 and interpret it as a sequence of 32-bit units, the odds are that many of the values fall outside these two ranges.
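A rough check of this kind might look like the sketch below. The function name `fraction_invalid_utf32` and its `byteorder` parameter are assumptions for illustration, and any threshold for deciding that data "is not UTF-32" is left to the caller.

```python
import struct

def fraction_invalid_utf32(data: bytes, byteorder: str = ">") -> float:
    """Return the share of 32-bit units outside 0000..D7FF and E000..10FFFF.

    byteorder is a struct prefix: ">" for big-endian, "<" for little-endian.
    Trailing octets that do not fill a whole unit are ignored.
    """
    usable = len(data) - len(data) % 4
    if usable == 0:
        return 0.0
    invalid = 0
    for (unit,) in struct.iter_unpack(byteorder + "I", data[:usable]):
        if not (unit <= 0xD7FF or 0xE000 <= unit <= 0x10FFFF):
            invalid += 1
    return invalid / (usable // 4)

utf32_share = fraction_invalid_utf32("hello".encode("utf-32-be"))
ascii_share = fraction_invalid_utf32(b"plain ASCII text, not UTF-32 at all")
```

Genuine UTF-32 data yields a share of zero, whereas ASCII text read as 32-bit units yields values far above 10FFFF, since the first octet of each unit is nonzero. The check is only a heuristic: a share of zero does not prove that the data is meaningful UTF-32 text.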

UCS-4 is effectively the ISO 10646 equivalent of UTF-32. The registered MIME name of UCS-4 is ISO-10646-UCS-4. Previously, UCS-4 and UTF-32 were different in principle, since UCS-4 operated on a 31-bit coding space and UTF-32 on a 21-bit coding space. The decision to stick to a 21-bit coding space removed the distinction. The difference is now nominal, and it is more natural to use the name UTF-32.