Text Files and Character Sets | Building an On Demand Computing Environment with IBM: How to Optimize Your Current Infrastructure for Today and Tomorrow (MaxFacts Guidebook series)

As you know, the Java programming language itself is fully Unicode based. However, operating systems typically have their own character encoding, such as ISO-8859-1 (an 8-bit code sometimes called the "ANSI" code) in the United States, or Big5 in Taiwan.

When you save data to a text file, you should respect the local character encoding so that the users of your program can open the text file with their other applications. Specify the character encoding when you construct the writer. An OutputStreamWriter wrapped around a FileOutputStream accepts the encoding name:

 out = new OutputStreamWriter(new FileOutputStream(filename), "ISO-8859-1");

You can find a complete list of the supported encodings in Volume 1, Chapter 12.
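As a minimal sketch of the effect of that choice, the following writes the same string with two encodings and reports the resulting file size. The class name EncodedWriteSketch and the helper writeText are hypothetical names introduced here for illustration; the temporary file stands in for whatever file your program actually writes.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

class EncodedWriteSketch {
    // Writes text to the file in the named encoding (overwriting any previous
    // content) and returns the size of the file in bytes.
    static long writeText(Path file, String text, String encoding) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file.toFile()), encoding)) {
            out.write(text);
        }
        return Files.size(file);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("haeuser", ".txt");
        String text = "H\u00E4user"; // six characters, one of them non-ASCII
        System.out.println(writeText(file, text, "ISO-8859-1")); // 6: one byte per character
        System.out.println(writeText(file, text, "UTF-8"));      // 7: the umlaut takes two bytes
        Files.delete(file);
    }
}
```

The same text produces different bytes on disk, which is why a file written in one encoding can look garbled in an application that assumes another.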

Unfortunately, there is currently no connection between locales and character encodings. For example, if your user has selected the Taiwanese locale zh_TW, no method in the Java programming language tells you that the Big5 character encoding would be the most appropriate.
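Although a locale does not map to an encoding, later Java releases (5.0 and up) do let you ask which encoding the virtual machine uses by default, through java.nio.charset.Charset. The sketch below is only an illustration under that assumption; the class name DefaultCharsetSketch and the method describe are hypothetical. Note that it reports the default of the running virtual machine, not the best encoding for an arbitrary locale such as zh_TW.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class DefaultCharsetSketch {
    // Describes the default locale and the default charset of this virtual machine.
    // The two values are obtained independently; neither is derived from the other.
    static String describe() {
        return Locale.getDefault() + " / " + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // Before using a specific encoding, you can check that this runtime supports it.
        System.out.println(Charset.isSupported("Big5"));
    }
}
```

The output depends on the platform and on how the virtual machine was started, so treat the result as a hint about the user's environment rather than a guarantee.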

Character Encoding of Source Files

It is worth keeping in mind that you, the programmer, will need to communicate with the Java compiler. And you do that with tools on your local system. For example, you can use the Chinese version of Notepad to write your Java source code files. The resulting source code files are not portable because they use the local character encoding (GB or Big5, depending on which Chinese operating system you use). Only the compiled class files are portable: they will automatically use the "modified UTF-8" encoding for identifiers and strings. That means that even when a program is compiling and running, three character encodings are involved:

Source files: local encoding
Class files: modified UTF-8
Virtual machine: UTF-16

(See Volume 1, Chapter 12 for a definition of the modified UTF-8 and UTF-16 formats.)
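To make the difference between these encodings concrete, here is a small sketch that measures one string in each of them. It uses ISO-8859-1 as a stand-in for "the local encoding" and relies on DataOutputStream.writeUTF, whose documented output format is modified UTF-8 preceded by a two-byte length. The class name ThreeEncodingsSketch and the helper modifiedUtf8Length are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ThreeEncodingsSketch {
    // Number of bytes that s occupies in modified UTF-8, as written by
    // DataOutputStream.writeUTF, not counting its two-byte length prefix.
    static int modifiedUtf8Length(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.size() - 2;
    }

    public static void main(String[] args) throws IOException {
        String s = "H\u00E4user";
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 6
        System.out.println(modifiedUtf8Length(s));                          // 7
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 12
        // One place where "modified" matters: the NUL character takes two
        // bytes in modified UTF-8 but one byte in standard UTF-8.
        System.out.println(modifiedUtf8Length("\0"));                       // 2
        System.out.println("\0".getBytes(StandardCharsets.UTF_8).length);   // 1
    }
}
```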

TIP

You can specify the character encoding of your source files with the -encoding flag, for example,

 javac -encoding Big5 Myfile.java

To make your source files portable, restrict yourself to using the plain ASCII encoding. That is, you should change all non-ASCII characters to their equivalent Unicode escapes. For example, rather than using the string "Häuser", use "H\u00E4user". The JDK contains a utility, native2ascii, that you can use to convert the native character encoding to plain ASCII. This utility simply replaces every non-ASCII character in the input with a \u followed by the four hex digits of the Unicode value. To use the native2ascii program, provide the input and output file names.

 native2ascii Myfile.java Myfile.temp
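The replacement rule is simple enough to sketch in a few lines of Java. This is an illustration of the idea only, not the source of the native2ascii tool; the class name AsciiEscapeSketch and the method toAscii are hypothetical, and the choice of lowercase hex digits is an assumption of this sketch.

```java
class AsciiEscapeSketch {
    // Replaces every non-ASCII char with a backslash, the letter u, and the
    // four hex digits of its UTF-16 code unit. ASCII chars are copied as-is.
    static String toAscii(String input) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                result.append(c);
            } else {
                result.append(String.format("\\u%04x", (int) c));
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Prints the letter H, then the six-character escape for the umlaut, then "user".
        System.out.println(toAscii("H\u00E4user"));
    }
}
```

Because the loop works on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two consecutive escapes, which is the form the Java compiler expects in source files.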

You can convert the other way with the -reverse option:

 native2ascii -reverse Myfile.temp Myfile.java

You can specify another encoding with the -encoding option. The encoding name must be one of those listed in the encodings table in Volume 1, Chapter 12.

 native2ascii -encoding Big5 Myfile.java Myfile.temp

TIP

It is a good idea to restrict yourself to plain ASCII class names. Because the name of the class also turns into the name of the class file, you are at the mercy of the local file system to handle any non-ASCII coded names. Here is a depressing example. Windows 95 uses the so-called Code Page 437, or original PC encoding, for its file names. If you make a class Bär and try to run it in Windows 95, you get an error message "cannot find class Br".