6.7. Other Encodings

In addition to the encodings defined in the Unicode and ISO 10646 standards, several other encodings have been used, or at least proposed, for Unicode data. We will discuss some of them, summarized in Table 6-3 in alphabetic order by name. For completeness, the table also contains the previously discussed UTF and UCS encodings.

Table 6-3. Encodings used for Unicode data

Name of encoding   Nature and usage of the encoding
Base64             General-purpose encoding, used as a "transfer encoding"
BOCU-1             A compression scheme for Unicode; not used much
CESU-8             A mixture of UTF-8 and UTF-16 for special usage
GB18030            "Chinese Unicode," technically a separate character code
Modified UTF-8     Used in Java programming; CESU-8 with an additional change
Punycode           An encoding for Internationalized Domain Names (IDN)
Quoted Printable   Transfer encoding, used especially for email
SCSU               A standardized compression scheme for Unicode; little used
UCS-2              A two-octet encoding, restricted to the Basic Multilingual Plane
UCS-4              ISO 10646 equivalent of UTF-32
URL Encoding       Special encoding for URLs and form data on the Web
UTF-1              Obsolete; of historic interest only
UTF-7              Obsolete encoding; little used; not part of the Unicode standard
UTF-8              A standard Unicode encoding, very widely used
UTF-16             A standard Unicode encoding, widely used
UTF-16BE           As UTF-16, but with Big Endian byte order fixed
UTF-16LE           As UTF-16, but with Little Endian byte order fixed
UTF-32             A standard Unicode encoding; wastes space, easy to process
UTF-32BE           As UTF-32, but with Big Endian byte order fixed
UTF-32LE           As UTF-32, but with Little Endian byte order fixed
UTF-EBCDIC         Designed to be compatible with IBM computers using EBCDIC
Uuencode           General-purpose encoding of data; sometimes used for text


6.7.1. SCSU Compression

SCSU is defined in Unicode Technical Standard (UTS) #6, "A Standard Compression Scheme for Unicode," http://www.unicode.org/reports/tr6/. SCSU was designed to achieve compactness comparable to language-specific 8-bit encodings. It has not been widely adopted, but some organizations use it internally.

SCSU works best when the text contains mostly alphabetic characters from one or a few scripts. It can be described as switching between blocks of characters and using efficient one-octet references to characters within a block. SCSU internally switches to UTF-16 to handle non-alphabetic languages.

Although SCSU is registered as a character encoding in the MIME sense, it is not suitable for subtypes of the MIME type text. For example, SCSU cannot be used directly in email and similar protocols. Moreover, for good performance, SCSU requires an implementation with a lookahead in the character stream.

This encoding, like the next one, has been designed as a compression method rather than an ordinary encoding. However, the usefulness of both is limited by the fact that widely used general-purpose compression mechanisms, such as zip and bzip2, can produce better results, largely independently of encoding issues. SCSU is useful for short strings of text, where general compression mechanisms would impose many octets of overhead.

6.7.2. BOCU-1 Compression

BOCU-1 is also a compression scheme for Unicode, and it has been registered as an encoding in the MIME sense. It is defined and described in the Unicode Technical Note (UTN) #6, "BOCU-1: MIME-Compatible Unicode Compression," available at http://www.unicode.org/notes/tn6/. Thus, its official status is lower than that of SCSU.

The name "BOCU" comes from "Binary Ordered Compression for Unicode." The encoding preserves code point order.

6.7.3. CESU-8

CESU-8 mixes UTF-8 and UTF-16 so that it uses UTF-8 for all characters in the Basic Multilingual Plane (BMP) but switches to UTF-16 for other characters. CESU-8 is oriented toward systems that internally process characters as 16-bit entities. It is defined in Unicode Technical Report #26, "Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)," http://www.unicode.org/reports/tr26/. The report says about CESU-8:

It is not intended nor recommended as an encoding used for open information exchange. The Unicode Consortium does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical report to clearly define the format and to distinguish it from UTF-8. This encoding does not replace or amend the definition of UTF-8.

Instead of encoding a character outside the BMP as a sequence of four octets according to the UTF-8 algorithm, CESU-8 first represents it as a pair of surrogate code points (as in UTF-16), and then encodes these individually, each with three octets. This implies that CESU-8 uses six octets for any non-BMP character. More exactly, CESU-8 encoding consists of the following:

  1. Replace any character outside the BMP with the surrogate pair that represents it according to UTF-16.

  2. Encode the data according to the UTF-8 algorithm as presented in Table 6-1. Note that only mappings that result in one, two, or three octets will be used, since there are only 16-bit values to be encoded.

For example, consider the three-character string U+004D U+0061 U+10000. In UTF-8, its encoding is 4D 61 F0 90 80 80, since the two characters in the Basic Latin block are represented each as one octet, and the non-BMP character U+10000 is mapped to a sequence of four octets by the algorithm. In CESU-8, the first two characters are treated the same way, but U+10000 is first replaced by the surrogate pair U+D800 U+DC00. (Here we speak of surrogates as if they were code points and denote them that way, and this reflects the thinking behind CESU-8, but in principle, they are just code units in an intermediate representation.) The components of the pair are then each encoded by the UTF-8 algorithm: U+D800 gives ED A0 80 and U+DC00 gives ED B0 80. Thus, the final CESU-8 encoded string is 4D 61 ED A0 80 ED B0 80.
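To make the two steps concrete, here is a minimal sketch in Python. The function name encode_cesu8 is ours, not from any standard library; the "surrogatepass" error handler is used so that the UTF-8 codec will emit the three-octet forms of the surrogates, which it would otherwise reject:

    def encode_cesu8(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            if ord(ch) > 0xFFFF:
                # Step 1: replace a non-BMP character with its UTF-16 surrogate pair.
                hi, lo = divmod(ord(ch) - 0x10000, 0x400)
                # Step 2: encode each surrogate by the UTF-8 algorithm (3 octets each).
                out += chr(0xD800 + hi).encode("utf-8", "surrogatepass")
                out += chr(0xDC00 + lo).encode("utf-8", "surrogatepass")
            else:
                out += ch.encode("utf-8")  # BMP characters: plain UTF-8 (1 to 3 octets)
        return bytes(out)

    print(encode_cesu8("\u004D\u0061\U00010000").hex(" "))
    # 4d 61 ed a0 80 ed b0 80

Running the sketch on the example string reproduces the octet sequence derived above.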

CESU-8 has the same binary collation as UTF-16. That is, if you compare strings by comparing their encoded representations as raw data, as bit sequences, you get the same order in CESU-8 as in UTF-16. CESU-8 is designed and recommended only for systems where such collation equivalence is important.

6.7.4. Modified UTF-8

Although UTF-8 could be modified in different ways, the phrase "Modified UTF-8" is a term that denotes a specific modification. It differs from UTF-8 in two ways: it mixes UTF-16 into UTF-8 the same way as CESU-8, and it has special treatment for U+0000.

Modified UTF-8 is used in the Java programming language. Java uses UTF-16 internally, but it supports a nonstandard modification of UTF-8 for writing and reading text data "serialized" to an octet stream.

Modified UTF-8 represents the null character (NUL) U+0000 in a special way, as the two octets C0 80, i.e., 11000000 10000000 in binary. This combination does not appear in UTF-8, but as you can see from Table 6-1, it is what you would get if you encoded U+0000 according to the branch of the UTF-8 algorithm that applies to the range U+0080..U+07FF. In UTF-8, the null character is encoded as one octet with value 0.

Such a representation of the null character means that there are no octets with value 0 ("null bytes") in the encoded data. This guarantees that the encoded string can be processed by routines that treat an octet with value 0 as a string terminator, according to the old convention in the C language and its many derivatives.

The second difference is that Modified UTF-8 represents characters outside the BMP the same way as CESU-8. The reason behind this is the difference between modern Unicode and the Java character model. In Java, a character is 16 bits long, reflecting the design of Unicode before the merge with ISO 10646 and expansion of the coding space. Thus, in Java, you process "Java characters," which are identical with Unicode characters for the BMP but cannot directly correspond to anything outside the BMP. In effect, Java treats surrogate code points as "Java characters." When a Java program reads a string in Modified UTF-8, the decoding process produces a string of "Java characters." Additional program logic is then needed to deal with them by Unicode rules, since a program needs to recognize any surrogate pair and treat it as indicating one Unicode character.
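As an illustrative sketch (assuming the encode_cesu8 function from the CESU-8 section above), Modified UTF-8 encoding can be expressed as CESU-8 plus the special case for NUL:

    def encode_modified_utf8(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            if ch == "\x00":
                out += b"\xc0\x80"       # NUL as two octets; no zero bytes in the output
            else:
                out += encode_cesu8(ch)  # otherwise exactly as in CESU-8
        return bytes(out)

    print(encode_modified_utf8("A\x00B").hex(" "))  # 41 c0 80 42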

The Java routines that write or read in Modified UTF-8 format also produce or recognize a byte count before the start of the data itself (see Chapter 11).

6.7.5. Base64 Encoding of Data

Base64 is not really a character encoding. It is a general encoding mechanism, which can be used to represent any data (any sequence of octets) as a string of characters from a subset of ASCII. Since those characters in turn are represented as octets, by the ASCII encoding, Base64 logically defines a mapping from sequences of octets to sequences of octets. As you may guess, the length of the sequence increases, by the ratio 4:3.

The role of Base64 in the representation of characters is that it can be used as an encoding applied to data that is already in an encoding, such as UTF-8, UTF-16, ISO-8859-1, or ASCII. Base64 lets you represent data in a format that can safely be transmitted and processed in situations where, for example, some octets used in UTF-8 might cause trouble. Base64 is used especially in email. Technically, it is not regarded (or registered) as a character encoding but as a "content transfer encoding."

The name "Base64" reflects the idea of using a positional number system with base 64. To convert data to Base64, you take three octetsi.e., 24 bits of dataand represent the 24-bit integer in a base 64 number system. As digits, you use basic Latin letters (uppercase and lowercase), digits, and two other characters.

To express the idea in another way, without reference to number systems, and somewhat more exactly, we can say that data is encoded into Base64 as follows:

  1. Pick up the next 24 bits (three octets) from the input. If there is not enough data left to encode, fill the missing bit positions with zeros.

  2. Divide the bits into four groups of 6 bits.

  3. Interpret each of the groups, in succession, as a 6-bit unsigned integer (in the range 0 to 63) and map it to a character by using it as an index to the (64-character) string "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".

  4. If only one or two octets (instead of three) were available at the last step of processing the input data, replace the last two characters or the last character, respectively, generated in step 3 with = characters.

  5. Represent the characters according to the ASCII code.

For example, if you take the string "Here's my résumé." followed by two CR LF line break pairs, encode it in UTF-8, and then apply Base64 encoding and interpret the result as ASCII, you get the following:

SGVyZeKAmXMgbXkgcsOpc3Vtw6kuDQoNCg==
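You can verify the result with Python's standard base64 module; a minimal sketch, with the two trailing CR LF pairs appended explicitly:

    import base64

    data = "Here\u2019s my r\u00e9sum\u00e9.".encode("utf-8") + b"\r\n\r\n"
    encoded = base64.b64encode(data)
    print(encoded.decode("ascii"))                    # SGVyZeKAmXMgbXkgcsOpc3Vtw6kuDQoNCg==
    print(base64.b64decode(encoded).decode("utf-8"))  # round-trips back to the original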

When interpreted as ASCII data, a Base64-encoded string looks like a random alphanumeric string, perhaps interspersed with the occasional + or / and possibly terminated by one or two = characters. Therefore, Base64 encoding is sometimes used as a poor man's encryption method. It is of course trivial for experts to break the "encryption." Moreover, email programs are typically capable of decoding Base64 automatically.

The choice of the number 64 is based on the fact that 64 is a power of two, and this makes the algorithm fast, since it essentially works with shift operations. The next higher power of two is 128, which is too large, since there are not that many printable ASCII characters. The characters used in Base64 are very "safe": they belong to the invariant subset of ASCII. Naturally, the method relies on the distinction between uppercase and lowercase letters.

Many programs can do Base64 encoding and decoding, but there are also online tools for the purpose. You can find them by entering the search string "base64 converter".

There are several variations of the Base64 encoding, including the following:

  • In MIME email, a line break is inserted after every 76 characters of Base64 encoded data, to keep the line length acceptable to all email software.

  • The padding = characters at the end may be omitted, when the length of the data is known to the recipient from other information.

  • The characters + and /, which might be unsafe in some contexts where Base64 is used (e.g., in filenames), are replaced by other characters in some variations.

  • In particular, the "URL and filename safe" Base64 alphabet uses the hyphen-minus - instead of + and the underline _ instead of / (see the sketch after this list).

  • When Base64 is used to produce encoded strings that will be used as XML name tokens, the underline _ and the colon : might be used instead of + and / in order to meet the requirements of XML name syntax. However, the colon has a special meaning in XML names.
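Python's base64 module implements both the standard and the "URL and filename safe" alphabets, which makes the difference easy to see. A minimal sketch; the sample octets are our own, chosen so that only the two "special" digits appear in the output:

    import base64

    print(base64.b64encode(b"\xfb\xef\xbe"))          # b'++++'
    print(base64.urlsafe_b64encode(b"\xfb\xef\xbe"))  # b'----'
    print(base64.b64encode(b"\xff\xff\xff"))          # b'////'
    print(base64.urlsafe_b64encode(b"\xff\xff\xff"))  # b'____'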

The Base64 encoding and some similar encodings are described in the informational RFC 3548, "The Base16, Base32, and Base64 Data Encoding."

6.7.6. Quoted Printable Encoding

Quoted Printable (QP), too, is a content transfer encoding, not a primary encoding of characters. It is widely used especially for delivery of non-ASCII data by email. QP is defined in the MIME specifications, namely in RFC 2045, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies."

Like Base64, QP encodes any data, any octet stream. When used for character data, this means that the data is already in some encoding, and QP applies another encoding on top of it. In particular, you can have UTF-8 encoded data but encode it with QP to make it safer for sending it through software that might munge octets with the first bit set.

Logically, QP maps an octet string to an octet string, but we usually describe the result string in terms of ASCII characters. If the original data is ASCII encoded, QP leaves most printable characters intact. Similarly, if the data is UTF-8 encoded, most printable characters in the ASCII range remain unchanged.

QP uses an escape notation of the form =xx, where xx is a pair of hexadecimal digits, for representing octets outside the ASCII range as well as some ASCII characters. The digits xx indicate the numeric value of the octet. The escape notation must be applied even to many ASCII characters (all code values are expressed here in hexadecimal):

  • Most control characters must be escaped. For example, the ASCII form feed, code value 0C, must appear as =0C.

  • If the data contains a line break, it shall be represented as CR LF (carriage return, line feed), as such (octets in ASCII encoding, not encoded).

  • The horizontal tab character (code 9) need not be escaped (as =09), unless it appears at the end of a line.

  • The space character (code 20) may be represented as such, except at the end of a line, where it must be escaped (as =20).

  • The equals sign = (code 3D) must be escaped (as =3D), to avoid confusing its use as a data character with its use in escape notations.

The maximum line length in QP coded data is 76 characters (counted by characters, or octets, in the encoded form). Therefore, QP has a special "soft line break" convention: a line can be ended with an equals sign = alone, and neither that character nor the line break after it will be treated as part of the data itself.

For example, suppose you configure your email program to send messages as UTF-8 encoded, using QP as the transfer encoding. You could write a message body that contains just "Here's my résumé." (with a typographically correct apostrophe ' U+2019 instead of the ASCII apostrophe ' U+0027). A recipient who looks at the raw data of your message interpreted as all ASCII characters would see the body as follows:

Here=E2=80=99s my r=C3=A9sum=C3=A9.
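Python's standard quopri module performs this encoding; a minimal sketch that reproduces the line above:

    import quopri

    raw = "Here\u2019s my r\u00e9sum\u00e9.".encode("utf-8")
    encoded = quopri.encodestring(raw)
    print(encoded.decode("ascii"))                        # Here=E2=80=99s my r=C3=A9sum=C3=A9.
    print(quopri.decodestring(encoded).decode("utf-8"))   # decodes back to the original string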

Looking at the message headers, the recipient would see, among other things:

Content-Type: text/plain;charset="utf-8"
Content-Transfer-Encoding: quoted-printable

This contains information for adequate interpretation of the message. Of course, most people would never directly apply such information. We normally use email programs that do such things for us, recognizing the headers, decoding the data, and displaying just the characters for us. Mostly we would know nothing about the encoding issues, unless something goes wrong. (However, too often something really goes wrong.)

In the example, the letter é (U+00E9) appears as =C3=A9, which is the QP encoded form of the two octets C3 and A9 that constitute the UTF-8 encoded form of U+00E9. As you probably remember, UTF-8 uses at least two octets for any character outside the ASCII range, even for Latin 1 Supplement characters. (If you had sent email as ISO-8859-1 encoded, with QP encoding, the letter é would appear as =E9.) The character ' (U+2019) appears as =E2=80=99, which is the QP encoded form of the three octets that constitute the UTF-8 encoded form of U+2019.

QP has often been criticized for being "quoted unreadable" and for unnecessarily messing things up. There is a good point here: quite often, QP is used in the wrong contexts, such as Usenet messages, where 8-bit characters work better. However, much of the criticism is unjust. When a message is viewed in a program that does not support QP, the data looks messy, but so much of it remains readable that you may still get a fairly good picture of the content. Base64, by contrast, is completely unreadable if not interpreted properly.

6.7.7. Uuencode

QP and Base64 are just examples of content transfer encodings, but they were selected due to their relatively common use for character data, especially in MIME email. Other transfer encodings, such as Uuencode, Binhex, and yEnc, are typically best known for their use in embedding binary data such as images or executable programs into text. However, they can also be used for text data. You could, for example, first encode text as UTF-8, and then apply Uuencode to the octets, to get a representation that can safely be transmitted over connections, gateways, and software that might mess up UTF-8 as such.

Here we will only consider Uuencode, which has lost importance but can still be found as one option for data transmission, e.g., in email programs. The "uu" in the name "Uuencode" does not refer to Unicode but to Unix: the name is originally short for "Unix to Unix encode." Uuencode was designed to make it possible to send any data from one Unix computer to another with tools like old email systems, which process only ASCII data (octets in the range 0 to 7F hexadecimal) reliably. On virtually any Unix system, you can find a command uuencode for performing the encoding and uudecode for decoding it.

Uuencoded data appears as a block of the following form:

begin mode filename
data lines
end

Here mode is the "file mode" in the Unix sense, specifying the file's read, write, and execute permissions as three octal digits, and filename is the name to be used when saving the decoded data into a file. Although there is no indication of the media type or primary encoding of the data, some guesses can be based on the filename extension that was chosen when generating the encoded data.

The encoded data itself is first constructed as follows (compare with the Base64 encoding):

  1. Pick up the next 24 bits (three octets) from the input. If there is not enough data left to encode, fill the missing bit positions with zeros.

  2. Divide the bits into four groups of 6 bits and interpret the groups as integers in the range 0 to 63 (decimal).

  3. Add 32 (decimal) to each of the integers. After this, the range is thus 32 to 95 in decimal, 20 to 5F in hexadecimal.

  4. Represent the characters according to the ASCII code.

ASCII characters with code values greater than 95 may also be used in encoded data; only the six rightmost bits are relevant. This means that the number 64 decimal (40 hexadecimal) may be added to the ASCII code. For example, instead of a space (20 hexadecimal), a grave accent (60 hexadecimal) may be used.

When all the data has been processed that way, the algorithm continues as follows:

  1. Write each group of 60 output characters (corresponding to 45 input octets) as a separate line preceded by an encoded character that gives the number of octets in the original data that are represented on that line. For all lines except the last, this will be the letter "M" (ASCII code 77 = 32+45).

  2. Finally, a line containing just a single space (or a grave accent `) is output, followed by one line containing the string end that terminates the encoded data.

Sometimes each data line has extra dummy characters (often the grave accent) added to avoid problems with software that strips trailing spaces. These characters are ignored when decoding the data.

For example, if you have a file that contains the string "Hello world!" and you Uuencode it, specifying hello.txt as the filename to be used, you get the following:

begin 644 hello.txt
,2&5L;&\@=V]R;&0A
`
end
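The encoding of a data line is available in Python's standard binascii module: binascii.b2a_uu handles the steps above for up to 45 octets at a time, including the leading length character. A minimal sketch, with the begin and end lines added by hand:

    import binascii

    data = b"Hello world!"
    line = binascii.b2a_uu(data)        # length character + encoded data + newline
    print("begin 644 hello.txt")
    print(line.decode("ascii"), end="")
    print("`")                          # terminating line (grave accent variant)
    print("end")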

Thus, Uuencode produces an encoded form that is completely unintelligible without decoding. On the other hand, the initial and final line indicate the presence of encoded data in a recognizable way, and some email programs can recognize Uuencoded data embedded into the body of a message.

6.7.8. UTF-7

UTF-7 is an obsolete encoding, which is not part of the Unicode standard. However, it is a registered encoding, and you might still encounter it somewhere.

Analogously with UTF-8, UTF-16, and UTF-32, we can regard UTF-7 as an encoding that uses 7-bit code units. In practice, the code units are stored and transmitted as 8-bit bytes (octets), usually with the first bit set to zero. In principle, the first bit could be used for other purposes, e.g., as a parity bit for checking. In any case, it is considered external to the encoding.

The idea was to define an encoding that can be safely transmitted over 7-bit connections, notably data transfer systems that cannot be trusted to pass 8-bit bytes correctly. Such connections existed, in particular, for transmitting ASCII data. You could even send UTF-7 data over an old email connection that had been designed to work with ASCII only. Of course, UTF-7 is not ASCII, but since UTF-7 uses octets in the ASCII range only, the transfer works fine. It is then up to the recipient to know how to interpret it.

UTF-7 uses up to eight octets per character. Characters in the ASCII range remain unchanged, except for the plus sign +, which is escaped as +- due to its special role in the encoding. Other characters are represented using modified Base64 encoding and surrounded by octets corresponding to characters + and -.

For example, the string "£500" is "+AKM-500" in UTF-7 (when we represent the octets of the UTF-7 representation as ASCII characters). The characters "500" are unchanged, but the pound sign £ (U+00A3) becomes "+AKM-" as follows: the code point 00A3 is first represented by the octets 00 A3, which means 00000000 10100011 in binary. The bits are grouped, and the 6-bit groups are mapped to ASCII characters according to the Base64 algorithm, giving 000000 (decimal 0) = "A", 001010 (decimal 10) = "K", and 001100 (decimal 12) = "M". The last zeros in 001100 are fill bits.
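Python ships a utf-7 codec, so the example is easy to verify; a minimal sketch:

    print("\u00a3500".encode("utf-7"))   # b'+AKM-500'
    print(b"+AKM-500".decode("utf-7"))   # £500
    print("+".encode("utf-7"))           # b'+-' (the escaped plus sign)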

The UTF-7 encoding is defined in the informational RFC 2152, "UTF-7: A Mail-Safe Transformation Format of Unicode."

6.7.9. UTF-1

UTF-1 was the first transfer encoding for the Universal Character Set (hence the number "1"). It was defined in the ISO 10646 standard, and it was formally registered as an encoding in the MIME sense, under the name ISO-10646-UTF-1. It never gained much use; it was removed from ISO 10646, and it has been obsolete for years.

UTF-1 used one to five octets per character. One of the reasons for its failure was inefficiency: the algorithm required integer divisions, which are much slower than operations on bit fields. It also lacked the "self-synchronizing" feature.

6.7.10. UTF-EBCDIC

The EBCDIC code, briefly described in Chapter 3, has been widely used on large IBM computers. To facilitate the use of Unicode on such computers, which use EBCDIC as their "native" character code, UTF-EBCDIC was designed. It is defined in Unicode Technical Report #16, "UTF-EBCDIC," http://www.unicode.org/reports/tr16/.

UTF-EBCDIC is "EBCDIC-friendly Unicode." It is similar to UTF-8 but uses EBCDIC codes for some characters and handles code points U+0080 to U+009F in a special way, in order to make the control characters used in EBCDIC have the same representations as in EBCDIC. More exactly, the algorithm is:

  1. Starting from a sequence of Unicode code points, construct first an intermediate format, called "UTF-8-Mod" or "I8," using a special mapping that resembles the UTF-8 algorithm. The mapping represents U+0000 to U+009F each as one octet and other code points as two to five octets.

  2. Map the octets 00 to 9F to the octets that represent the characters U+0000 to U+009F in the EBCDIC code (with some modifications on line break conventions), and map other octets to remaining octets according to a specifically designed table. As a whole, this step is a simple table-driven operation.

This allows some old EBCDIC applications to handle Unicode data to some extent. To them, UTF-EBCDIC looks like EBCDIC, and although the meanings of some octets are different, the printable characters in the ASCII repertoire as well as the EBCDIC control characters are the same. Problems may still arise due to differences between variants of EBCDIC.

UTF-EBCDIC is intended for use in homogeneous systems and networks that use EBCDIC. It is not meant for use in public networks. In reality, UTF-EBCDIC is not used much. EBCDIC-based IBM mainframes generally use UTF-16 for Unicode support.

6.7.11. GB 18030, "Chinese Unicode"

GB 18030 has been characterized as the Chinese equivalent of UTF-8, with the capability of representing all Unicode code points while maintaining compatibility with GB 2312 and GBK, older character codes for Chinese. However, GB 18030 also defines a character code (code points) in a manner that differs from Unicode. In practice, due to the well-defined mappings, we can informally describe GB 18030 as "Chinese Unicode."

GB 18030 is formally called "Chinese National Standard GB 18030-2000: Information Technology -- Chinese ideograms coded character set for information interchange -- Extension for the basic set." The letters GB are short for "Guojia Biaozhun," which is a transcription of the Chinese words for "National Standard." Support for GB 18030 is mandatory for all computer operating systems sold in the People's Republic of China.

The MIME name of the encoding has no space: "GB18030."
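Since GB 18030 can represent all Unicode code points, conversion to and from Unicode is lossless. Python's gb18030 codec demonstrates this; a minimal sketch (the sample string is our own):

    text = "Unicode \u4e2d\u6587 \U00010000"
    encoded = text.encode("gb18030")
    assert encoded.decode("gb18030") == text   # lossless round-trip
    print(encoded.hex(" "))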

There is a more detailed description of GB 18030 and its background available at http://examples.oreilly.com/cjkvinfo/pdf/GB18030_Summary.pdf.

6.7.12. Punycode, Encoding for Domain Names

Punycode is an encoding, or an escape scheme (depending on how you look at it), for a specific purpose: implementing Internationalized Domain Names (IDN). The idea is that people can use Unicode characters in Internet domain names through special conventions that map strings to ASCII strings. Software that supports IDN is expected to recognize certain constructs in domain names as indicating that the name should not be interpreted literally but according to the special conventions.

Suppose, for example, that we would like to register the Internet domain name "härmä.fi," reflecting the Finnish name "Härmä." Previously, such issues were resolved simply by dropping the diacritic marks (e.g., "harma.fi") or by using some replacement notation (e.g., writing "muenchen" instead of "München"). This is rather unsatisfactory, if the diacritics really make a difference in a language. For languages that use a non-Latin script, the situation was even more problematic.

Since it would not have been realistic to change the entire domain name system to use Unicode as such, a tricky method was developed. Special notations that start with "xn--" (letters "x" and "n" and two hyphen-minus characters) are used to signal that the method, Punycode, is used. You would register, for example, the domain name "xn--hrm-qlac.fi," which contains ASCII characters only and therefore does not create technical problems. Web browsers are expected to behave so that if the user types "härmä.fi," the browser internally applies Punycode to it, producing "xn--hrm-qlac.fi." Then the browser uses this name to ask a domain name server for the numeric IP address to be used. The browser is expected to show "härmä.fi" in the address field, so that from the user's point of view, the non-ASCII characters seem to work smoothly in the domain name.

There is no reason to use two consecutive hyphen-minus characters in a normal domain name. Therefore, the Punycode convention will hardly clash with meaningful non-Punycode domain names.

Technically, Punycode converts a sequence of Unicode characters to a form that contains only characters that are allowed in components of domain names: ASCII letters, digits, and hyphen-minus. For example, in the Punycode form "xn--hrm-qlac.fi," the string "xn--" and the hyphen "-" are delimiters, and between them, you have the ASCII characters of the field "härmä." The "qlac" part is the Punycode way of representing the two occurrences of the non-ASCII character ä and their positions within the string. As you may guess, this involves some relatively sophisticated computation.
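Python contains both a punycode codec (the raw RFC 3492 algorithm) and an idna codec (which adds the "xn--" convention and other IDNA processing), so the example can be checked directly; a minimal sketch:

    print("h\u00e4rm\u00e4".encode("punycode"))   # b'hrm-qlac'
    print("h\u00e4rm\u00e4.fi".encode("idna"))    # b'xn--hrm-qlac.fi'
    print(b"xn--hrm-qlac.fi".decode("idna"))      # härmä.fi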

Punycode is defined in RFC 3492, which carries a long name: "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)."

Old browsers may need an update in order to support Punycode. Partly for such reasons, organizations that acquire an internationalized domain name also keep or acquire a simplified, pure ASCII domain name (such as "harma.fi").

There is an online service for Punycode conversions at http://mct.verisign-grs.com/.

Punycode has raised some serious security issues, as any method of using Unicode in domain names would. There have long been attempts to mislead users by reserving Internet domain names that resemble others. For example, someone might try to register the domain name "orei11y.com" and send bulk email containing a link to a web site in that domain. Users might think they are visiting oreilly.com, especially if they see the domain name in a font that does not make a clear distinction between "1" (digit one) and "l" (lowercase letter "l"). When the character repertoire is extended, there are many more possibilities for such tricks. For example, if you wrote "oreilly.com" so that the first "o" is the Cyrillic small letter "o," it would look exactly the same as "oreilly.com" in all Latin letters, since no usual font distinguishes between Latin "o" and Cyrillic "o." Yet, the characters are distinct, and so are the domain names. In Chapter 10, we will discuss attempts at preventing abuse of IDN without restricting ease of use too much.

6.7.13. URL Encoding

URL Encoding relates to Uniform Resource Locators (URL), often loosely called "web addresses," but it is not limited to them. It has an important role in encoding form data, when the user has filled out a form on a web page and submits it to processing. The encoded form data may in fact constitute a URL, but it need not.

6.7.13.1. Introduction: URL Encoding for form data

Suppose that you use Google search and enter the word Dürst into the text box. (You can do this even if your keyboard has no ü key; see Chapter 2 for some methods.) Looking at the result page that Google produces, you might see its address (URL) as:

 http://www.google.com/search?hl=en&q=D%C3%BCrst&btnG=Google+Search

You might be somewhat disappointed at the results, since by default Google treats "Dürst" and "Durst" as basically the same (when the user language is set to English; the matching principles of Google vary by language). To make Google look for "Dürst" only and not for "Durst," you would prefix the string with a plus sign, which means "exact match" to Google: +Dürst. But this was a digression, although perhaps a useful one.

The point in mentioning the URL is that the letter ü appears as %C3%BC in it. To be honest, this depends on your browser and its settings, but what we discuss here is the most common case in modern browsers. The browser has actually encoded your string according to UTF-8 (namely as octets C3 and BC), and then applied another encoding to the result.

6.7.13.2. The original URL Encoding

Originally, URL Encoding was defined for data that is restricted to ASCII, and the reason for the encoding was that not even all ASCII characters are "safe" in all contexts. In addition to national use variation (described in Chapter 3), some characters were deemed "unsafe" because some software was known to use them for special purposes. The encoding mechanism is simple: for an "unsafe" ASCII character, use the notation %xx, where xx is the ASCII code number of the character in two hexadecimal digits. Naturally, this implies that the percent sign % itself needs to be escaped (as %25). In a %xx notation, uppercase and lowercase letters are equivalent; e.g., %5B is equivalent to %5b.

URL Encoding is meant to be applied to all use of URLs, both in plain text and elsewhere, e.g., in HTML and in HTTP. For example, if a URL contains a space, the space must always appear as URL Encoded, as %20. When a browser follows a link containing such a URL, the browser should not decode %20 in any way but keep it in the request it sends to the server. Only the server is allowed to interpret %20 as a space, e.g., when mapping a URL to a filename.

URL Encoding was also used as a basis for defining the format in which form data is sent by default. A browser is supposed to collect all the relevant fields of a form and their values and construct a data set from them, and then URL Encode the data set. However, there is one modification: before applying URL Encoding, the browser is required to replace any occurrence of a space by a plus sign, +.
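In Python, urllib.parse.urlencode applies exactly this convention to a set of form fields; a minimal sketch (the field names follow the earlier Google example):

    from urllib.parse import urlencode

    print(urlencode({"hl": "en", "q": "D\u00fcrst"}))   # hl=en&q=D%C3%BCrst
    print(urlencode({"q": "exact match"}))              # q=exact+match (space becomes +)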

6.7.13.3. To encode or not to encode?

During the history of URL specifications, which have been issued as RFCs, the definitions have become more permissive. Fewer characters are declared as "unsafe" than in the original specification. Moreover, what is "safe" depends on the contexti.e., the part of a URL where a character appears. The situation has stabilized, since now the general syntax of URLs, including the URL Encoding mechanism, is defined in an Internet Standard, STD 66, "Uniform Resource Identifiers (URI): Generic Syntax." Currently STD 66 is RFC 3986. "URI" is a theoretical concept that is a generalization of URL.

According to STD 66, the characters that are always "safe" in URLs are letters "A" to "Z" and "a" to "z," digits 0 to 9, hyphen-minus -, period ., underline _, and tilde ~. These characters need not, and should not, be encoded using a %xx notation. For historical and practical reasons, the tilde is still often encoded (as %7E). Characters outside the "safe" set may need to be encoded, depending on context.

URL Encoding is special in the sense that the need for encoding characters depends on the context, and the same character might even appear as such or as encoded, with a difference in meaning. When a character is defined as constituting part of URL syntax, as a punctuation character in it, it need not and it must not be encoded. For example, a URL may contain a query part that begins with ? and consists of parts of the form name=value, separated from each other by ampersand & (as in our previous Google example). In such constructs, the characters ?, =, and & must not be encoded, since they appear in special meanings. If, however, a value in such a construct needs to contain one of those characters (e.g., because the user input in a Google search contained such a character), it needs to be encoded; otherwise, it could be mistakenly regarded as part of the syntax and not part of the value.

6.7.13.4. Generalized URL Encoding

There is an obvious way to generalize URL Encoding to strings in an 8-bit encoding such as ISO-8859-1 or windows-1252. You would just use %xx for values of xx up to FF, instead of the upper limit of 7E (as defined by the range of printable ASCII characters). This means that you would encode, for example, ü (U+00FC) as %FC, using its code number in ISO-8859-1. Although such a technique works in many situations, the problem is that the character encoding of a URL is unspecified, and we don't want to give ISO-8859-1 a special status. Besides, ISO-8859-1 is insufficient for true internationalization.

6.7.13.5. Modern, UTF-8-based URL Encoding

The modern approach to allowing a wide repertoire of characters in URLs uses UTF-8 together with URL Encoding of octets. The proposed convention, generally supported by modern browsers, is the following:

  1. Encode the characters in a URL using UTF-8. This of course leaves ASCII characters intact, but for example, ü becomes the octet pair C3 BC.

  2. Encode octets from 80 to FF (as well as "unsafe" ASCII characters) using the %xx mechanism. For example, octets C3 BC become encoded as %C3%BC.
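This two-step process is what urllib.parse.quote in Python performs by default; a minimal sketch:

    from urllib.parse import quote, unquote

    print(quote("D\u00fcrst"))     # D%C3%BCrst  (step 1: UTF-8; step 2: %xx notation)
    print(unquote("D%C3%BCrst"))   # Dürst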

You may wonder how it is possible that both this modern way and the old way, implying ISO-8859-1 or some other encoding, can work in browsers. How can the browser know how to interpret the data? The HTML specification recommends that upon processing a link with a URL with a %xx notation outside the ASCII range, browsers should first try to interpret it the modern, UTF-8-based way. If the result does not resolve to a working address, the browser could try to interpret the notation according to the character encoding of the document in which the link appears.


