6.4. UTF-8

UTF-8 uses 8-bit code units, and it represents characters in the Basic Latin (ASCII) range U+0000 to U+007F efficiently, one code unit per character. On the other hand, this implies that all other characters use at least two code units, all of which have the most significant bit set, i.e., they are in the range 80 to FF (hexadecimal). More exactly, they are in the range 80 to F4, and the values C0 and C1 are never used. This means that when there is a code unit in the range 00 to 7F in UTF-8 data, we can know that it represents a Basic Latin character and cannot be part of the representation of some other character.

These structural decisions imply that UTF-8 is relatively inefficient, since it leaves many simple combinations unused. There is yet another principle that has a similar effect. In a representation of any character other than Basic Latin characters, the first (leading) code unit is from a specific range, and all the subsequent (trailing) code units are from a different range.

6.4.1. UTF-8 Encoding Algorithm

For a character outside the Basic Latin block, UTF-8 uses two, three, or four octets. You might encounter specifications that describe UTF-8 as using up to six octets per character, but they reflect definitions that did not restrict the Unicode coding space the way it has now been restricted.

The UTF-8 algorithm is described in Table 6-1. The first column specifies a bit pattern, in 16 or 21 bits, grouped for readability. The other columns indicate how the pattern is mapped to code units (octets), represented here as bit patterns.

Table 6-1. UTF-8 encoding algorithm

Code number in binary      Octet 1     Octet 2     Octet 3     Octet 4
00000000 0xxxxxxx          0xxxxxxx
00000yyy yyxxxxxx          110yyyyy    10xxxxxx
zzzzyyyy yyxxxxxx          1110zzzz    10yyyyyy    10xxxxxx
uuuww zzzzyyyy yyxxxxxx    11110uuu    10wwzzzz    10yyyyyy    10xxxxxx
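
The rows of Table 6-1 can be turned into code almost mechanically. The following Python function is a minimal sketch of that mapping, not a reference implementation; the name utf8_encode is hypothetical, and the explicit rejection of surrogate code points is an assumption taken from the Unicode standard rather than from the table itself.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the bit layouts of Table 6-1 (sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode coding space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110uuu 10wwzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x00E2).hex())  # c3a2
```

The result for U+00E2 (â) can be compared with Python's built-in encoder, "\u00e2".encode("utf-8"), which gives the same two octets.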


Thus, the UTF-8 encoding uses bit combinations of very specific types in the octets. If you pick up an octet from UTF-8 encoded data, you can immediately see its role. If the first bit is 0, the octet is a single-octet representation of a (Basic Latin) character. Otherwise, you look at the second bit as well. If it is 0, you know that you have a second, third, or fourth octet of a multioctet representation of a character. Otherwise, you have the first octet of such a representation, and the initial bits 110, 1110, or 11110 reveal whether the representation is two, three, or four octets long.
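
The bit tests described above can be sketched as a small classifier. The function name octet_role and the role labels are hypothetical. The function looks only at the leading bits, so it does not by itself reject octets such as C0 or F5 that match a lead pattern but never occur in valid UTF-8.

```python
def octet_role(octet: int) -> str:
    """Classify one octet by its leading bits (pattern check only)."""
    if (octet & 0b1000_0000) == 0:
        return "single"      # 0xxxxxxx: a Basic Latin character
    if (octet & 0b0100_0000) == 0:
        return "trailing"    # 10xxxxxx: second, third, or fourth octet
    if (octet & 0b1110_0000) == 0b1100_0000:
        return "lead-2"      # 110xxxxx: starts a two-octet sequence
    if (octet & 0b1111_0000) == 0b1110_0000:
        return "lead-3"      # 1110xxxx: starts a three-octet sequence
    if (octet & 0b1111_1000) == 0b1111_0000:
        return "lead-4"      # 11110xxx: starts a four-octet sequence
    return "invalid"         # 11111xxx: matches no pattern in Table 6-1


# "pâ" is encoded as the octets 70 C3 A2.
print([octet_role(b) for b in b"\x70\xc3\xa2"])
# ['single', 'lead-2', 'trailing']
```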

Interpreting (decoding) UTF-8 is straightforward, too. You take an octet, match it with the patterns in column "Octet 1" in Table 6-1, and read zero to three additional octets accordingly. Then you construct the binary representation of the code number from the bit sequences you extract from the octets. Naturally, nobody wants to do this by hand, but the point is that this can be implemented efficiently, as operations on bit fields. A correct implementation of Unicode has to signal an error if there is data that does not match any of the defined patterns.
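
As an illustration of the decoding steps, here is a minimal sketch in Python. The name utf8_decode is hypothetical. Besides matching the bit patterns, it also rejects overlong forms, surrogates, and values above U+10FFFF; those extra checks are an assumption based on the Unicode standard's definition of well-formed UTF-8, and a production decoder would normally be taken from a library instead.

```python
def utf8_decode(data: bytes) -> list:
    """Decode UTF-8 octets into a list of code points; raise on bad input."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Match the lead octet against the "Octet 1" patterns of Table 6-1.
        # "minimum" is the smallest code number that needs this many octets.
        if lead < 0x80:
            value, trailing, minimum = lead, 0, 0
        elif (lead & 0xE0) == 0xC0:
            value, trailing, minimum = lead & 0x1F, 1, 0x80
        elif (lead & 0xF0) == 0xE0:
            value, trailing, minimum = lead & 0x0F, 2, 0x800
        elif (lead & 0xF8) == 0xF0:
            value, trailing, minimum = lead & 0x07, 3, 0x10000
        else:
            raise ValueError(f"octet {lead:02X} at offset {i} cannot start a character")
        tail = data[i + 1 : i + 1 + trailing]
        if len(tail) != trailing:
            raise ValueError(f"truncated sequence at offset {i}")
        for octet in tail:
            if (octet & 0xC0) != 0x80:
                raise ValueError(f"octet {octet:02X} is not a trailing octet")
            # Append the six payload bits of the trailing octet.
            value = (value << 6) | (octet & 0x3F)
        if value < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"code number {value:X} is not encodable in UTF-8")
        code_points.append(value)
        i += 1 + trailing
    return code_points


print([hex(cp) for cp in utf8_decode(b"p\xc3\xa2t\xc3\xa9")])
# ['0x70', '0xe2', '0x74', '0xe9']
```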

A quick way to find out the UTF-8 encoding of a string is to visit http://www.google.com on any modern browser, type the string into the keyword box, and hit Search. Then just look at the address field of the browser. For example, if you type pâté, the address field will contain http://www.google.com/search?hl=en&lr=&q=p%C3%A2t%C3%A9, so you can see that â is encoded as the octets C3 A2 and é as octets C3 A9. (In some situations, this does not work since Google does not use UTF-8. In that case, use the URL http://www.google.com/webhp?ie=UTF-8 to force the input encoding to UTF-8.)
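
The same percent-encoded form can be produced locally, without a browser. This short sketch uses Python's standard urllib.parse.quote, which encodes text as UTF-8 by default. The string is written with escapes so that the example does not depend on how the source file itself is encoded.

```python
from urllib.parse import quote

text = "p\u00e2t\u00e9"  # pâté

# Percent-encoded form, as it would appear in a query string.
print(quote(text))  # p%C3%A2t%C3%A9

# The raw octets in hexadecimal.
print(text.encode("utf-8").hex(" ").upper())  # 70 C3 A2 74 C3 A9
```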

6.4.2. UTF-8 Versus ISO-8859-1

UTF-8 is not compatible with ISO-8859-1, and still less with windows-1252 (which is often, but incorrectly, called "ANSI"). The Basic Latin (ASCII) range is treated the same way, but the Latin 1 Supplement (the upper half of ISO-8859-1) is represented as one octet per character in ISO-8859-1, and two octets per character in UTF-8. The octets that denote Latin 1 Supplement characters in ISO-8859-1 have their first bit set to 1, and such octets are used as components of multioctet representations of characters in UTF-8.
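
The difference in octet counts is easy to check. The following sketch uses Python's built-in codecs; the variable names are illustrative only.

```python
ascii_text = "resume"
accented = "\u00e9"  # é, U+00E9, in the Latin 1 Supplement

# Basic Latin text gives identical octets in both encodings.
print(ascii_text.encode("iso-8859-1") == ascii_text.encode("utf-8"))  # True

# A Latin 1 Supplement character: one octet versus two.
print(accented.encode("iso-8859-1").hex())  # e9
print(accented.encode("utf-8").hex())       # c3a9

# The ISO-8859-1 octet E9 on its own is not valid UTF-8.
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError as error:
    print("not valid UTF-8:", error.reason)
```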

If UTF-8 encoded data is by mistake interpreted as ISO-8859-1 encoded, a Latin 1 Supplement character will appear as Â or Ã followed by another character. The reason is that the first octet of the encoded form is 11000010 or 11000011 in binary, C2 or C3 in hexadecimal, which means Â or Ã in ISO-8859-1. The second octet has "10" as the first 2 bits, so it would be interpreted as some Latin 1 Supplement character or as a C1 Control. For example, if you type the text "Here is my résumé." and send it with a program that UTF-8 encodes it but does not adequately specify the encoding, the recipient may well assume ISO-8859-1 or windows-1252 encoding and display your text as "Here is my rÃ©sumÃ©." The text looks strange, but with some guesswork and experience, it is legible.
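
The misinterpretation can be reproduced directly. This sketch encodes the sample sentence as UTF-8 and then decodes the octets as ISO-8859-1. It also shows that the damage is reversible as long as no octets were lost along the way.

```python
original = "Here is my r\u00e9sum\u00e9."  # "Here is my résumé."

# Encode as UTF-8, then (wrongly) decode as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(garbled)  # Here is my rÃ©sumÃ©.

# Undo the mistake: recover the octets, then decode them as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

Each é (octets C3 A9) turns into the two characters Ã (C3) and © (A9).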

6.4.3. Some Properties of UTF-8

Due to the algorithm, the octets appearing in UTF-8 are limited to certain ranges, as shown in Table 6-2. In particular, octets C0 and C1 and F5 through FF do not appear in UTF-8. Other octets may appear in specific contexts only. This means that if you have a large file that is not, in fact, character data in UTF-8 and you try to read it as UTF-8, it is very likely that errors will be signaled.
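
A strict decoder makes this easy to observe. The sketch below relies on Python's built-in UTF-8 codec, which raises UnicodeDecodeError on malformed input; is_valid_utf8 is a hypothetical helper name, and the PNG file signature is used merely as a small sample of non-text data.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Octets that never occur in UTF-8: C0, C1, and F5 through FF.
never_used = [0xC0, 0xC1, *range(0xF5, 0x100)]

# Even when followed by plausible trailing octets, each one is rejected.
rejected = [not is_valid_utf8(bytes([octet, 0x80, 0x80, 0x80]))
            for octet in never_used]
print(all(rejected))  # True

# The first eight octets of a PNG file: 0x89 is a trailing octet
# with no lead octet before it, so decoding fails at once.
png_signature = b"\x89PNG\r\n\x1a\n"
print(is_valid_utf8(png_signature))  # False
```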

Table 6-2. Octet ranges in UTF-8

Code range             Octet 1    Octet 2    Octet 3    Octet 4
U+0000..U+007F         00..7F
U+0080..U+07FF         C2..DF     80..BF
U+0800..U+0FFF         E0         A0..BF     80..BF
U+1000..U+CFFF         E1..EC     80..BF     80..BF
U+D000..U+D7FF         ED         80..9F     80..BF
U+E000..U+FFFF         EE..EF     80..BF     80..BF
U+10000..U+3FFFF       F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF       F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF     F4         80..8F     80..BF     80..BF
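
Table 6-2 can also be used directly as a validity check, without reconstructing any code numbers. The following table-driven sketch copies the ranges row by row; the names ROWS and is_well_formed are hypothetical, and the approach is one possible design, chosen here because it mirrors the table.

```python
# One entry per row of Table 6-2: a (low, high) pair for each octet.
ROWS = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def is_well_formed(data: bytes) -> bool:
    """Check that data splits into sequences that each match a table row."""
    i = 0
    while i < len(data):
        for row in ROWS:
            chunk = data[i : i + len(row)]
            if len(chunk) == len(row) and all(
                low <= octet <= high for octet, (low, high) in zip(chunk, row)
            ):
                i += len(row)  # consume the matched sequence
                break
        else:
            return False  # no row matches at this offset
    return True


print(is_well_formed(b"p\xc3\xa2t\xc3\xa9"))  # True
print(is_well_formed(b"\xed\xa0\x80"))        # False: second octet above 9F
```

Because the ranges for Octet 1 do not overlap between rows, at most one row can match at any offset, so the order of the rows does not matter.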


Similarly to UTF-16, UTF-8 makes it impossible to access the nth character of a string directly. UTF-8 is robust, though: if a code unit is corrupted, other characters will be processed correctly. The reason is that UTF-8 has been designed so that a code unit starting the representation of a character can be recognized as such, even if the preceding code unit is in error.
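
Both properties can be sketched in a few lines. Finding the nth character requires a scan from the start of the data, but the scan only has to skip trailing octets (bit pattern 10xxxxxx), and the same test lets a decoder pick up again after a damaged octet. The function name char_start is hypothetical; the recovery part relies on the errors="replace" mode of Python's built-in decoder.

```python
def char_start(data: bytes, n: int) -> int:
    """Return the octet offset of the nth character (0-based) by scanning."""
    seen = -1
    for offset, octet in enumerate(data):
        if (octet & 0xC0) != 0x80:  # not a trailing octet: a character starts here
            seen += 1
            if seen == n:
                return offset
    raise IndexError("fewer than n + 1 characters in data")


data = "p\u00e2t\u00e9".encode("utf-8")  # pâté: 70 C3 A2 74 C3 A9
print(char_start(data, 3))  # 4, the offset where é begins

# Corrupt the trailing octet of â and decode leniently.
damaged = bytearray(data)
damaged[2] = 0xFF
recovered = bytes(damaged).decode("utf-8", errors="replace")

# Only the damaged character is lost; p, t, and é are still decoded.
print(recovered.startswith("p"), recovered.endswith("t\u00e9"))  # True True
```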

Although the authoritative definition of UTF-8 is in the Unicode standard, with content as described here, there is also a description of UTF-8 as an Internet standard, STD 63. It is currently RFC 3629, "UTF-8, a transformation format of ISO 10646," and available at http://www.ietf.org/rfc/rfc3629.txt. It contains additional recommendations (by the IETF) regarding the use of UTF-8 on the Internet, especially with regard to protocol design.



Unicode Explained
ISBN: 059610121X
EAN: 2147483647
Year: 2006
Pages: 139
