4.1. Text Encoding

Java is a language for the Internet. Since the people of the Net speak and write in many different human languages, Java must be able to handle a large number of languages as well. One of the ways in which Java supports internationalization is through the Unicode character set. Unicode is a worldwide standard that supports the scripts of most languages.^[*] Java bases its character and string data on the Unicode 4.0 standard. A Java char is a 16-bit value, which covers the most commonly used characters directly; characters outside that range are represented as a pair of char values (a surrogate pair).
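As a rough illustration of the 16-bit char, the sketch below prints the size of a char and compares an ordinary letter with a character that lies outside the 16-bit range. The class name CharWidthSketch and the helper charsNeeded are hypothetical names chosen for this example; the Character and String methods are standard library calls.

```java
public class CharWidthSketch {
    // Number of 16-bit char values needed to store one Unicode code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // Width of the char type in bits.
        System.out.println(Character.SIZE);

        // An ordinary letter fits in a single char.
        System.out.println(charsNeeded('A'));

        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range,
        // so it is stored as two char values.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // char values
        System.out.println(clef.codePointCount(0, clef.length()));  // characters
    }
}
```

For most text the distinction never comes up, because every character fits in one char; it matters mainly when code indexes into strings that may contain such supplementary characters.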

^[*] For more information about Unicode, see http://www.unicode.org. Ironically, one of the scripts listed as "obsolete and archaic" and not supported as of Unicode 4.0 is Javanese, a historical language of the people of the island of Java.

Java source code can be written using Unicode and stored in any number of character encodings, ranging from its full 16-bit form to ASCII-encoded Unicode character values. This makes Java a friendly language for non-English-speaking programmers, who can use their native language for class, method, and variable names just as they can for the text displayed by the application.
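A small sketch of a non-English identifier follows. To keep the file pure ASCII, the field name (the Greek letter pi) is spelled in the ASCII-encoded \uxxxx form; the class name UnicodeIdentifierSketch and the method circleArea are hypothetical names used only for illustration.

```java
public class UnicodeIdentifierSketch {
    // The field name is the Greek small letter pi, written in ASCII-encoded form.
    static final double \u03C0 = 3.14159;

    // Uses the Greek-named constant like any other identifier.
    static double circleArea(double radius) {
        return \u03C0 * radius * radius;
    }

    public static void main(String[] args) {
        System.out.println(circleArea(2.0));
    }
}
```

In a source file saved in an encoding such as UTF-8, the same identifier could be typed directly as the Greek letter, provided the compiler is told which encoding the file uses.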

The Java char type and String objects natively support Unicode values. But if you're concerned about having to labor with two-byte characters, you can relax. The String API makes the character encoding transparent to you. Unicode is also very ASCII-friendly (ASCII is the most common character encoding for English). The first 256 characters are defined to be identical to the first 256 characters in the ISO 8859-1 (Latin-1) character set, so Unicode is effectively backward-compatible with the most common English character sets. Furthermore, the most common encoding for Unicode, called UTF-8, preserves ASCII values in their single-byte form. A slight variant of this encoding is used for strings in compiled Java class files, so for English text, storage remains compact.
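The following sketch shows these compatibility claims in practice, assuming Java 7 or later for StandardCharsets. The class name EncodingSizeSketch and the helper utf8Length are hypothetical; the sample strings are arbitrary.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "Hello";
        String accented = "caf\u00E9";  // ends with e-acute, Latin-1 value 0xE9

        // Pure ASCII text takes one byte per character in UTF-8.
        System.out.println(utf8Length(english));   // 5

        // Three ASCII bytes plus two bytes for the accented letter.
        System.out.println(utf8Length(accented));  // 5

        // The Unicode value of the accented letter equals its Latin-1 value.
        byte[] latin1 = accented.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((int) accented.charAt(3));  // 233
        System.out.println(latin1[3] & 0xFF);          // 233
    }
}
```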

Most platforms can't display all currently defined Unicode characters. As a result, Java programs can be written with special Unicode escape sequences. A Unicode character can be represented with this escape sequence:

     \uxxxx

xxxx is a sequence of exactly four hexadecimal digits. The escape sequence indicates an ASCII-encoded Unicode character. Some Java tools also use this form to write Unicode characters in an environment that doesn't otherwise support them; for example, the store() method of java.util.Properties writes non-ASCII characters as escapes. Java also comes with classes to read and write Unicode character streams in specific encodings, including UTF-8.
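A minimal sketch of both ideas follows: a string literal containing a Unicode escape, and a round trip through the character stream classes using UTF-8. It assumes Java 8 or later (for UncheckedIOException); the class name UnicodeEscapeSketch and the helpers writeUtf8 and readUtf8 are hypothetical, and in-memory byte arrays stand in for files or sockets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class UnicodeEscapeSketch {
    // Encode text as UTF-8 bytes by writing it through a character stream.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decode UTF-8 bytes back into a String by reading a character stream.
    static String readUtf8(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // GREEK SMALL LETTER PI, written as an escape in an ASCII source file.
        String message = "pi = \u03C0";

        byte[] encoded = writeUtf8(message);
        System.out.println(encoded.length);                     // 7
        System.out.println(readUtf8(encoded).equals(message));  // true
    }
}
```

The five ASCII characters take one byte each and the Greek letter takes two, which accounts for the seven bytes. Naming the encoding explicitly, rather than relying on the platform default, keeps the result the same on every machine.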