Character Encodings | Real World Web Services

Let us start by considering basic facts about binary data. Generally speaking, binary data consists of streams of bytes. Each byte is composed of 8 bits (zeros or ones) and therefore has a numerical value from 0 to 255. The meaning of the bytes depends on the application.
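As a small illustration of the 0 to 255 range in Java specifically, note that Java's byte type is signed (-128 to 127), so a mask is needed to see the unsigned value described above. The class and method names below are made up for this sketch.

```java
public class ByteValues {
    // Converts a signed Java byte (-128..127) to its unsigned value (0..255).
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // Three sample bytes: one below 128 and two with the high bit set.
        byte[] data = { 0x58, (byte) 0xA3, (byte) 0xFF };
        for (byte b : data) {
            System.out.println("signed " + b + " -> unsigned " + unsigned(b)
                + " (bits " + Integer.toBinaryString(unsigned(b)) + ")");
        }
    }
}
```

The same 8 bits are stored either way; only the numeric interpretation differs.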

Now, compare binary data with character data. At first, most English speakers think of character data as ASCII, an international standard for converting byte data (numbers from 0 to 255) to characters. However, ASCII only defines the meaning of the lower 128 values; the remaining 128 possible numbers (also known as high-bit characters) don't correspond to a specific character. On some systems, these high-bit characters are used to represent locale-specific information. For example, a French system might use high-bit characters to represent French-specific information, whereas an English system might use the high-bit characters to represent special graphics (such as the trademark symbol).
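The following sketch shows the point about high-bit values: the same byte decodes to a different result depending on which character set is assumed. It uses only charsets that every Java platform is required to support; the class name and the helper method decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HighBitDemo {
    // Decodes a single byte using the given charset.
    static String decode(byte value, Charset charset) {
        return new String(new byte[] { value }, charset);
    }

    // Formats the first char of a string as a U+XXXX code point label.
    static String label(String s) {
        return String.format("U+%04X", (int) s.charAt(0));
    }

    public static void main(String[] args) {
        byte low = 0x58;          // within the 0..127 range that ASCII defines
        byte high = (byte) 0xAE;  // high-bit value, undefined in ASCII

        // The low value means 'X' in both charsets.
        System.out.println(label(decode(low, StandardCharsets.US_ASCII)));
        System.out.println(label(decode(low, StandardCharsets.ISO_8859_1)));

        // The high value is the registered sign in ISO-8859-1, but ASCII
        // has no mapping, so Java substitutes the replacement character.
        System.out.println(label(decode(high, StandardCharsets.ISO_8859_1)));
        System.out.println(label(decode(high, StandardCharsets.US_ASCII)));
    }
}
```

The code points are printed as U+XXXX labels rather than glyphs so that the result does not depend on the console's own encoding.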

As if high-bit characters weren't complicated enough, many languages have far more than 256 characters. To support these languages, two or more bytes may be required to store an individual character. Therefore, to properly read and understand the character data represented by a stream of bytes, you must know the proper encoding for the data. In other words, the character encoding is a bit of information that lets your application know how to interpret the binary representation of the data as character data.
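To make this concrete, here is a minimal sketch that counts how many bytes UTF-8 needs for a few characters and then decodes UTF-8 bytes with the wrong encoding. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes needed to store the string in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encodes as UTF-8, then (incorrectly) decodes as ISO-8859-1.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("X      : " + utf8Length("X") + " byte(s)");
        System.out.println("U+00A3 : " + utf8Length("\u00a3") + " byte(s)"); // pound sign
        System.out.println("U+20AC : " + utf8Length("\u20ac") + " byte(s)"); // euro sign

        // One character was written, but a reader that assumes the wrong
        // encoding sees two characters.
        String wrong = misread("\u00a3");
        System.out.println("Misread length: " + wrong.length());
    }
}
```

The misread string contains U+00C2 followed by U+00A3, which is the familiar stray capital A with circumflex that appears in front of a pound sign when UTF-8 text is treated as Latin-1.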

Java uses Unicode, a unified international standard for representing character data, to store character data internally. So, a String in Java represents a series of Unicode characters (held internally as 16-bit char values), not bytes in any particular encoding.
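A short sketch of what this means in practice: String.length() counts 16-bit char values rather than bytes, and a character outside the 16-bit range occupies two char values. The class and method names are assumptions made for this illustration.

```java
public class StringUnits {
    // Number of Unicode code points (whole characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String simple = "X\u00a3";       // letter X and the pound sign
        String clef = "\uD834\uDD1E";    // U+1D11E, musical symbol G clef

        // Two characters, two char values.
        System.out.println(simple.length() + " chars, "
            + codePoints(simple) + " code points");

        // One character, stored as a surrogate pair of two char values.
        System.out.println(clef.length() + " chars, "
            + codePoints(clef) + " code point");

        // Each char is a number; here, the value of the pound sign.
        System.out.println(Integer.toHexString(simple.charAt(1)));
    }
}
```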

Listing 12-2 shows a bit of code that illustrates the various interpretations of a bit of character data. The string highBitCharacters is used as our test case. This string contains a normal ASCII character and also several Unicode code points, expressed with the \u0000 format, that represent characters not common to all languages. The \u introduces a Unicode escape sequence, and the four hexadecimal digits that follow (shown here as 0000) identify the specific character.

Listing 12-2. Encoding Example Source

    // Requires java.io.UnsupportedEncodingException and
    // org.apache.commons.codec.binary.Hex. The printHeader
    // helper is not shown in this listing.
    public static void encodingDemo() {
        // These strings are used to refer to specific
        // encoding types.
        String UTF8 = "UTF-8";
        String WindowsLatin1 = "Cp1252";

        // These characters are, in order, the ordinary ASCII letter X,
        // a pound sign (the UK currency), a latin capital letter C
        // with cedilla, a latin small letter e with diaeresis, and
        // registered sign (typically used to indicate a registered
        // trademark).  These characters are represented internally
        // in the JVM as Unicode.  Their output as bytes is
        // determined by the default encoding for the system, or
        // potentially by an overridden setting on a per-method basis.
        String highBitCharacters = "X\u00a3\u00c7\u00eb\u00ae";

        printHeader("Encoding Demo");

        System.out.print("Default JVM encoding: ");
        System.out.println(System.getProperty("file.encoding"));

        System.out.print("Test high bit character output: ");
        System.out.println(highBitCharacters);

        System.out.print("It takes ");
        System.out.print(highBitCharacters.getBytes().length);
        System.out.println(" bytes to store this string.");

        try
        {
            System.out.print("Windows Cp1252 Characters: ");
            String winstring =
                new String(
                    highBitCharacters.getBytes(),
                    WindowsLatin1);
            System.out.println(winstring);

            System.out.println();
            System.out.println("Everything looks fine, but some");
            System.out.println("systems automatically convert the");
            System.out.println("data.  Here, note the lossy ");
            System.out.println("conversion to ? characters:");
            System.out.println();

            System.out.print("ASCII Characters: ");
            String asciiString =
                new String(highBitCharacters.getBytes(), "ASCII");
            System.out.println(asciiString);

            System.out.println();
            System.out.println(
                "Commons Encoding supports converting");
            System.out.println("bytes to two-digit 0-f hex codes.");
            System.out.println();

            System.out.print("Hex Characters: ");
            byte[] hexresults;
            hexresults =
                new Hex().encode(highBitCharacters.getBytes());
            System.out.println(new String(hexresults));
        } catch (UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }
    }

As we can see in the output shown in Listing 12-3, converting the string from Unicode code points to Cp1252 (the default encoding on a US English Windows system) works fine, but the conversion to pure ASCII loses information: each of the four high-bit characters is replaced by a ? character.

Listing 12-3. Encoding Example Results

    ================================
    Encoding Demo
    ================================
    Default JVM encoding: Cp1252
    Test high bit character output: X£Çë®
    It takes 5 bytes to store this string.
    Windows Cp1252 Characters: X£Çë®

    Everything looks fine, but some
    systems automatically convert the
    data.  Here, note the lossy
    conversion to ? characters:

    ASCII Characters: X????

    Commons Encoding supports converting
    bytes to two-digit 0-f hex codes.

    Hex Characters: 58a3c7ebae

This example shows us the first use of the Apache Commons project: the ability to translate a series of raw bytes to a hex representation. In a hex representation, each byte is transformed to a two-character ASCII representation using the digits 0-9 and the letters a-f. The hex format, while potentially useful for reading and writing binary data as ASCII text, is not as well known or popular as the Base64 format.
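To show what the hex conversion does to each byte, here is a minimal hand-written sketch that uses only the JDK. It is not the Commons implementation; the class name HexSketch and the method toHex are hypothetical. ISO-8859-1 is named explicitly so that the result does not depend on the platform default encoding (for these five characters it yields the same bytes as Cp1252).

```java
import java.nio.charset.StandardCharsets;

public class HexSketch {
    // Turns each byte into two lowercase hex digits.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high 4 bits
            sb.append(Character.forDigit(b & 0xF, 16));        // low 4 bits
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String highBitCharacters = "X\u00a3\u00c7\u00eb\u00ae";
        byte[] bytes =
            highBitCharacters.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(toHex(bytes)); // prints 58a3c7ebae
    }
}
```

Because every byte always becomes exactly two characters, hex output is twice the size of its input.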