5.8 Converting Between Character Sets | XML in a Nutshell, Third Edition

The ultimate solution to this character set morass is to use Unicode in either UTF-16 or UTF-8 format for all your XML documents. An increasing number of tools support one of these two formats natively; even the unassuming Notepad offers an option to save files in Unicode in Windows NT 4.0, 2000, and XP. Microsoft Word 97 and later saves the text of its documents in Unicode, although unlike XML documents, Word files are hardly pure text. Much of the binary data in a Word file is not Unicode or any other kind of text. However, Word 2000 and later can save plain text files in Unicode. To save as plain Unicode text in Word 2000, select the format Encoded Text from the Save As Type: choice menu in Word's Save As dialog box. Then select one of the four Unicode formats in the resulting File Conversion dialog box. In Word 2003, select the plain text format. When you save, Word will pop up a dialog box that prompts you for the encoding. Choose Other Encoding and then select one of the four Unicode formats in the list box on the right.

Most current tools are still adapted primarily for vendor-specific character sets that can't handle more than a few languages at one time. Thus, learning how to convert your documents from proprietary to more standard character sets is crucial.

Some of the better XML and HTML editors let you choose the character set you wish to save in and perform automatic conversions from the native character set you use for editing. On Unix, the native character set is likely one of the standard ISO character sets, and you can save into that format directly. On the Mac, you can avoid problems if you stick to pure ASCII documents. On Windows, you can go a little further and use Latin-1, if you're careful to stay away from the extra characters that aren't part of the official ISO-8859-1 specification. Otherwise, you'll have to convert your document from its native, platform-dependent encoding to one of the standard platform-independent character sets.
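One way to check whether a Windows-authored file is safe to label as Latin-1 is to look for the bytes that differ between the two encodings. The sketch below assumes that the "extra characters" are the ones Windows Cp1252 places in the byte range 0x80 through 0x9F, which ISO-8859-1 reserves for control characters. The function name find_non_latin1_bytes is hypothetical and chosen only for illustration.

```python
def find_non_latin1_bytes(data: bytes):
    """Return (offset, byte) pairs for every byte in the range 0x80-0x9F.

    Assumption: the input is a single-byte Windows encoding such as Cp1252,
    where this range holds printable extras (curly quotes, dashes, and so on)
    that have no printable meaning in ISO-8859-1.
    """
    return [(offset, byte) for offset, byte in enumerate(data)
            if 0x80 <= byte <= 0x9F]


if __name__ == "__main__":
    sample = b"<p>\x93quoted\x94</p>"  # Cp1252 curly quotes around a word
    for offset, byte in find_non_latin1_bytes(sample):
        print(f"offset {offset}: byte 0x{byte:02X} is outside ISO-8859-1")
```

An empty result suggests the file can be declared as ISO-8859-1 without changes; any hits mean the document should be converted or the offending characters replaced.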

François Pinard has written an open source character-set conversion tool called recode for Linux and Unix, which you can download from http://recode.progiciels-bpi.ca/, as well as GNU mirror sites. Wojciech Galazka has ported recode to DOS. You can also use the Java Development Kit's native2ascii tool at http://java.sun.com/j2se/1.4.2/docs/tooldocs/win32/native2ascii.html. First, convert the file from its native encoding to Java's special ASCII-encoded Unicode format, then use the same tool in reverse to convert from the Java format to the encoding you actually want. For example, to convert the file myfile.xml from the Windows Cp1252 encoding to UTF-8, execute these two commands in sequence:

  % native2ascii -encoding Cp1252 myfile.xml myfile.jtx
  % native2ascii -reverse -encoding UTF-8 myfile.jtx myfile.xml
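If neither recode nor the JDK is at hand, the same Cp1252 to UTF-8 conversion can be sketched in a few lines of Python. This is an illustrative alternative rather than part of either tool: the function name convert_encoding and its default arguments are assumptions made for this example, and the codec names cp1252 and utf-8 are the ones Python's standard library uses for these encodings.

```python
from pathlib import Path


def convert_encoding(src, dst, from_encoding="cp1252", to_encoding="utf-8"):
    """Re-encode the file at src and write the result to dst.

    Hypothetical helper: decodes the raw bytes with from_encoding, then
    encodes the resulting text with to_encoding. Returns the decoded text.
    """
    # Read raw bytes and decode explicitly so the platform's default
    # encoding never enters the picture.
    text = Path(src).read_bytes().decode(from_encoding)
    # Encode into the target character set and write the bytes unchanged.
    Path(dst).write_bytes(text.encode(to_encoding))
    return text


if __name__ == "__main__":
    # Writes to a separate (hypothetical) output name so the original survives.
    convert_encoding("myfile.xml", "myfile-utf8.xml")
```

Like the native2ascii commands, this sketch changes only the bytes of the file. If the document carries an encoding declaration, that declaration still has to be edited to match the new encoding.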