Recipe 10.11 Reading/Writing a Different Character Set

Problem

You need to read or write a text file using a particular encoding.

Solution

Convert the text to or from internal Unicode by specifying a converter when you construct an InputStreamReader or OutputStreamWriter.

Discussion

The InputStreamReader and OutputStreamWriter classes are the bridge from byte-oriented Streams to character-based Readers and Writers. These classes read or write bytes and translate them to or from characters according to a specified character encoding. Inside Java, the char and String types hold Unicode text in 16-bit units. But most character sets, such as ASCII and those for Swedish, Spanish, Greek, Turkish, and many other languages, cover only a small subset of Unicode. In fact, many European language character sets fit nicely into 8-bit characters. Even the larger character sets (script-based and pictographic languages) don't all use the same bit values for each particular character. The encoding, then, is a mapping between Unicode characters and an external storage format for characters drawn from a particular national or linguistic character set.
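To make the idea of an encoding as a mapping concrete, the following minimal sketch prints the bytes that one character becomes under three different encodings. The class name EncodingDemo and the helper hex are made up for this illustration. The three encodings were chosen only because every Java platform is required to support them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: EncodingDemo and hex() are hypothetical names.
class EncodingDemo {

    /** Returns the bytes of text in the given encoding as space-separated hex. */
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+00E5 (a with ring above), written as an escape so that the
        // encoding of this source file itself does not matter.
        String ch = "\u00e5";
        System.out.println(hex(ch, StandardCharsets.ISO_8859_1)); // e5
        System.out.println(hex(ch, StandardCharsets.UTF_8));      // c3 a5
        System.out.println(hex(ch, StandardCharsets.UTF_16BE));   // 00 e5
    }
}
```

The same character occupies one, two, or two bytes depending on the encoding, and the byte values differ, which is why a reader must be told which mapping a file was written with.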

To simplify matters, the InputStreamReader and OutputStreamWriter constructors are the only places where you can specify the name of an encoding to be used in this translation. If you do not, the platform's (or user's) default encoding is used. PrintWriters, BufferedReaders, and the like all use whatever encoding the InputStreamReader or OutputStreamWriter class uses. Since these bridge classes only accept Stream arguments in their constructors, the implication is that if you want to specify a nondefault converter to read or write a file on disk, you must start by constructing not a FileReader or FileWriter, but a FileInputStream or FileOutputStream!

// UseConverters.java
// Read a Japanese (EUC_JP) file through a decoding bridge.
BufferedReader fromKanji = new BufferedReader(
     new InputStreamReader(new FileInputStream("kanji.txt"), "EUC_JP"));
// Write a file in the Cp278 encoding through an encoding bridge.
PrintWriter toSwedish = new PrintWriter(
     new OutputStreamWriter(new FileOutputStream("sverige.txt"), "Cp278"));

Not that it would necessarily make sense to read a file in a Kanji encoding and write it out in a Swedish one; for one thing, most fonts would not have all the characters of both character sets, and, at any rate, the Swedish encoding certainly has far fewer characters in it than the Kanji encoding. Besides, if that were all you wanted, you could use a JDK tool with the ill-fitting name native2ascii (see its documentation for details). A list of the supported encodings is also in the JDK documentation, in the file docs/guide/internat/encoding.doc.html. A more detailed description is found in Appendix B of O'Reilly's Java I/O.
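The bridge-class pattern described in this recipe can be exercised end to end with a small round trip. The sketch below is one possible way to do it, not the book's own program: the class name RoundTrip, the helpers write and read, and the temporary file are all invented for the example. It uses ISO-8859-1 rather than EUC_JP or Cp278 on the assumption that you want something that runs on any Java installation, since those two encodings are not guaranteed to be present.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// Illustrative only: RoundTrip, write() and read() are hypothetical names.
class RoundTrip {

    /** Writes text to f, encoding characters with the named encoding. */
    static void write(File f, String text, String enc) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(f), enc))) {
            out.print(text);
        }
    }

    /** Reads the first line of f, decoding bytes with the named encoding. */
    static String read(File f, String enc) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), enc))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("sverige", ".txt");
        f.deleteOnExit();

        // Eleven characters, two of them outside ASCII (o-umlaut, a-ring).
        String text = "Sm\u00f6rg\u00e5sbord";
        write(f, text, "ISO-8859-1");

        System.out.println(f.length());                          // 11
        System.out.println(text.equals(read(f, "ISO-8859-1")));  // true
        // Decoding with the wrong encoding does not restore the text.
        System.out.println(text.equals(read(f, "UTF-8")));       // false
    }
}
```

The last line shows why the encoding name matters on both sides: the bytes on disk are the same, but a reader constructed with a different converter turns them into different characters.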