6.1. Unicode Encodings in General

As described in Chapter 3, an encoding is a mapping from code numbers (which represent characters) to sequences of code units. A code unit is in practice an octet (8-bit byte), a double octet (16-bit quantity), or a quadruple octet (32-bit quantity). The reason for using such units is that modern computers have been designed to work on such data objects efficiently.

Thus, the simplest encoding for Unicode maps each code number to a quadruple octet that represents the number as a single integer in binary notation. Such an encoding, UTF-32, is however too space-inefficient for most practical purposes, since it spends four octets on every character, including each ASCII character.
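The direct mapping described above can be sketched in a few lines of Python. This is only an illustration: the helper name utf32be_encode is hypothetical, big-endian octet order is assumed, and the result is compared against Python's built-in "utf-32-be" codec.

```python
import struct

def utf32be_encode(text: str) -> bytes:
    """Write each code number directly as one 32-bit unsigned integer
    (big-endian order assumed for this sketch)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

sample = "Unicode"
manual = utf32be_encode(sample)

# The hand-written mapping agrees with Python's built-in codec.
assert manual == sample.encode("utf-32-be")

# Four octets per character, even for plain ASCII text.
print(len(manual))                   # 28
print(len(sample.encode("utf-8")))   # 7
```

The size comparison shows why UTF-32 is rarely chosen for storage or transfer of mostly ASCII text.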

Within a code unit of 16 or 32 bits, the order in which the octets are interpreted depends on "endian-ness," which belongs to the level of encoding scheme in the Unicode terminology. Often the encoding scheme is coupled with the encoding even in the name, so that we use, for example, the name "UTF-16LE" to refer to the UTF-16 encoding represented with a little-endian (LE) encoding scheme.
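As a small illustration of endianness, the following Python sketch encodes one character with the codec names "utf-16-le" and "utf-16-be" (Python's spellings of UTF-16LE and UTF-16BE). The code unit is the same 16-bit number in both cases; only the order of its two octets differs.

```python
text = "\u00e2"  # the character â, U+00E2

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(le.hex())  # e200  (least significant octet first)
print(be.hex())  # 00e2  (most significant octet first)

# Interpreted with the matching byte order, both give the same code unit.
assert int.from_bytes(le, "little") == int.from_bytes(be, "big") == 0x00E2
```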

The names of the encodings contain abbreviations "UCS" for "Universal Character Set" and "UTF" for "Unicode Transformation Format." These expansions should not be taken too seriously; treat the names as historical oddities.

For illustration, Figure 6-1 shows the string "pâté" in some encodings. The string is represented using precomposed characters, so that there are just four characters in it. As a sequence of code points, the string is U+0070 U+00E2 U+0074 U+00E9. Each box indicates a code unit, with its content expressed in hexadecimal digits, paired so that each pair corresponds to one octet.

Figure 6-1. Some encodings of the string "pâté"

Figure 6-2. Encodings of a character (U+1D405) as displayed by FileFormat.info
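The code units of "pâté" can also be listed programmatically. The sketch below is not a reproduction of the figure (its exact selection of encodings is not assumed); it simply splits the encoded octets into code units for UTF-8, UTF-16, and UTF-32, with big-endian order assumed. The helper name code_units is hypothetical.

```python
def code_units(text: str, encoding: str, unit_size: int) -> list:
    """Encode text and split the octets into code units of unit_size octets,
    each shown as hexadecimal digits (two digits per octet)."""
    data = text.encode(encoding)
    return [data[i:i + unit_size].hex().upper()
            for i in range(0, len(data), unit_size)]

s = "p\u00e2t\u00e9"  # "pâté" with precomposed characters

print([f"U+{ord(c):04X}" for c in s])
# ['U+0070', 'U+00E2', 'U+0074', 'U+00E9']
print(code_units(s, "utf-8", 1))
# ['70', 'C3', 'A2', '74', 'C3', 'A9']
print(code_units(s, "utf-16-be", 2))
# ['0070', '00E2', '0074', '00E9']
print(code_units(s, "utf-32-be", 4))
# ['00000070', '000000E2', '00000074', '000000E9']
```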

To check the representation of a character in one or more Unicode encodings, you can use the service http://www.fileformat.info/info/unicode/char/search.htm. It lets you type in the code number and get information that contains the encodings, among other data, as illustrated in Figure 6-2.
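If no online service is at hand, a similar lookup can be approximated locally. The following Python sketch uses a hypothetical helper, show_encodings, that takes a code number and returns its octets in three encodings (big-endian order assumed); it is not related to the FileFormat.info service.

```python
def show_encodings(code_number: int) -> dict:
    """Return the encoded octets of one code point, as hexadecimal strings."""
    ch = chr(code_number)
    return {name: ch.encode(name).hex().upper()
            for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, octets in show_encodings(0x1D405).items():
    print(f"{name:10} {octets}")
# utf-8      F09D9085
# utf-16-be  D835DC05
# utf-32-be  0001D405
```

Note that U+1D405 needs four octets in UTF-8 and two 16-bit code units in UTF-16.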

Technically, Unicode encodings are defined as representations of Unicode scalar values as sequences of code units. The somewhat odd (and practically rare) term "Unicode scalar value" refers to all Unicode code numbers except those corresponding to surrogates. This means in practice the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The in-between range from U+D800 to U+DFFF is the surrogates area, and those code points need not be represented, since they are not meant to appear in Unicode data.
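The two ranges translate directly into a range test. The function name is_scalar_value below is an illustrative choice, not a standard library function; the last lines show that Python's own UTF-8 codec refuses to encode a lone surrogate code point.

```python
def is_scalar_value(code_number: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF, False for surrogates
    and for numbers outside the Unicode code space."""
    return 0 <= code_number <= 0xD7FF or 0xE000 <= code_number <= 0x10FFFF

assert is_scalar_value(0xD7FF)
assert not is_scalar_value(0xD800)   # first surrogate code point
assert not is_scalar_value(0xDFFF)   # last surrogate code point
assert is_scalar_value(0xE000)
assert not is_scalar_value(0x110000)

# A surrogate code point is not a scalar value, so encoding it fails.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as err:
    print("rejected:", err.reason)
```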

On the other hand, the Unicode encodings are defined for noncharacters and for unassigned code points, too. If some data contains, for example, the code point U+FFFF, which is defined to be a noncharacter, the data is incorrect as Unicode character data. However, it is processed in a well-defined way when encoding the data in UTF-8, UTF-16, or UTF-32. This guarantees that conversions between Unicode encodings do not remove such errors but allow them to be detected.
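A short Python sketch of this behavior: the noncharacter U+FFFF survives a round trip through each encoding, so a later check can still find it. The helper is_noncharacter is hypothetical; it assumes the usual definition of noncharacters (U+FDD0 to U+FDEF, plus every code point whose last four hexadecimal digits are FFFE or FFFF).

```python
def is_noncharacter(code_number: int) -> bool:
    """Noncharacters: U+FDD0..U+FDEF and code points ending in FFFE or FFFF."""
    return (0xFDD0 <= code_number <= 0xFDEF
            or (code_number & 0xFFFE) == 0xFFFE)

nonchar = "\uffff"

for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    data = nonchar.encode(enc)            # encoding is well defined
    assert data.decode(enc) == nonchar    # and the round trip is lossless
    print(f"{enc:10} {data.hex().upper()}")
# utf-8      EFBFBF
# utf-16-be  FFFF
# utf-32-be  0000FFFF

# After any conversion, the error in the data is still detectable.
print(any(is_noncharacter(ord(c)) for c in "abc" + nonchar))  # True
```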

The encodings UTF-8, UTF-16, and UTF-32 are all self-synchronizing. This feature, also known as auto-synchronization, means that if malformed data (i.e., data that is not possible according to the definition of the encoding) is encountered, only one code point needs to be rejected. The start of the representation of the next code point can be recognized easily. This helps guard against errors caused by data corruption in transfer or storage: the effects of errors are local. If you have data like "Foobar" and the character "b" is corrupted in storage or transfer, the data appears as "Foo?ar" (where ? indicates corrupted data). In some other encodings, all data following a corrupted character might appear as corrupted.

Sample program code, in the C language, for conversions between the Unicode encoding forms is available at http://www.unicode.org/Public/PROGRAMS/CVTUTF/.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
