3.7. Converters and Transcoding

As the preceding discussion of some encodings has shown, many character codes and encodings are in use now and will remain in use. Unicode is a tool that helps to deal with this complexity, rather than a once-and-for-all solution that replaces all other codes. Even if you use Unicode for everything you can, you still have reasons to know about conversions between encodings:

Old data often exists in various other encodings ("legacy data")
Old software, which you may need to use or to interface with, often requires input or writes output in some non-Unicode encoding ("legacy software")
Other people still use and prefer other encodings, and you may need to cope with that in email, exchange of text files, web page design, etc.

The process of converting character data from one encoding into another was previously often called recoding, but nowadays the term transcoding is more common. A program or part of a program that has been specifically designed to perform transcodings can be called a converter. Transcoding is often performed by programs that do something quite different as their main job.

3.7.1. Transcoding Tools

For example, a text editor can often read and write data in several encodings, including the possibility of reading data in one encoding and saving it in another, as discussed in Chapter 1. This means that the program has to transcode, i.e., to contain a built-in converter. For an occasional conversion task, the simplest way is usually to open a text file in a suitable editor or word processor and to use the Save As function to save the content under a different filename and with a different encoding. For repeated and often bulky conversions, something more efficient is needed.

When appropriate, a converter can be very simple. Transcoding between 8-bit codes is a matter of mapping each code number to another code number according to a table, and this can be implemented rather efficiently. If you need to write such a converter, the main challenge is to find the relevant mapping table, or tables, from a reliable source. Beware that many codes exist in slightly different versions.
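As a minimal sketch of such a table-driven converter, the following Python code builds a 256-entry table and applies it with bytes.translate. It takes a shortcut and derives the table from Python's built-in codecs instead of a published mapping file. The function names and the "?" fallback byte are illustrative assumptions, and both encodings are assumed to be single-byte.

```python
def build_table(src_encoding, dst_encoding, fallback=b"?"):
    """Return a 256-byte translation table between two 8-bit encodings."""
    table = bytearray(256)
    for code in range(256):
        try:
            # Go through Unicode: source byte -> character -> target byte.
            target = bytes([code]).decode(src_encoding).encode(dst_encoding)
        except UnicodeError:
            # Undefined in the source or unrepresentable in the target.
            target = fallback
        table[code] = target[0]
    return bytes(table)


def transcode_8bit(data, table):
    """Map every byte of data through the table."""
    return data.translate(table)


cp437_to_1252 = build_table("cp437", "cp1252")
# 0x82 is the letter e with acute accent in CP-437.
converted = transcode_8bit(b"caf\x82", cp437_to_1252)
```

Once the table exists, each conversion is a single pass over the bytes with one lookup per byte, which is why this kind of transcoding can be implemented efficiently.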

There are cross-mapping tables available at http://www.unicode.org/Public/MAPPINGS for transcoding between various encodings and Unicode. They are plain text files but in a format that can easily be read and parsed to construct suitable data structures. For example, the document http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT, which is about the Russian KOI8-R encoding, contains lines like the following:

0xBF    0x00A9    #    COPYRIGHT SIGN
0xC0    0x044E    #    CYRILLIC SMALL LETTER YU
0xC1    0x0430    #    CYRILLIC SMALL LETTER A

This says, for example, that code number BF (in hexadecimal) in KOI8-R denotes the same character as code number 00A9 in Unicode. Anything after the # character in this format is to be treated as comment, and the names of the characters are just for human readers. They are not needed in any way in the transcoding process.
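Reading this format takes only a few lines of code. The following Python sketch parses such lines into a dictionary and uses it to decode a byte string. The names parse_mapping and decode_with are hypothetical, the sample lines are the three quoted above, and a byte missing from the table simply raises KeyError here, which a real converter would have to handle.

```python
def parse_mapping(lines):
    """Build a {source code: Unicode code point} dict from mapping-file lines."""
    mapping = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()  # drop the comment, then split
        if len(fields) < 2:
            continue  # blank line, comment-only line, or undefined code
        mapping[int(fields[0], 16)] = int(fields[1], 16)
    return mapping


def decode_with(mapping, data):
    """Convert bytes to a Unicode string using the parsed table."""
    return "".join(chr(mapping[byte]) for byte in data)


sample = [
    "0xBF\t0x00A9\t#\tCOPYRIGHT SIGN",
    "0xC0\t0x044E\t#\tCYRILLIC SMALL LETTER YU",
    "0xC1\t0x0430\t#\tCYRILLIC SMALL LETTER A",
]
koi8r = parse_mapping(sample)
text = decode_with(koi8r, b"\xc0\xc1")  # Cyrillic small letters yu and a
```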

General purpose subroutine libraries often contain transcoding routines. Typically, if you pass a string and the names of two encodings as parameters, you will get the transcoded string as output.
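In Python, for instance, the standard decode and encode methods can be combined into such a routine. This is a sketch: the wrapper name transcode is an assumption, while the encoding names are ones that Python's codec registry recognizes.

```python
def transcode(data, from_encoding, to_encoding):
    """Reinterpret bytes in one encoding as bytes in another, via Unicode."""
    # Both steps are strict by default, so undecodable input or a character
    # that the target encoding lacks raises UnicodeError.
    return data.decode(from_encoding).encode(to_encoding)


# The KOI8-R bytes C0 C1 (Cyrillic yu, a) converted to UTF-8.
utf8_bytes = transcode(b"\xc0\xc1", "koi8-r", "utf-8")
```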

3.7.2. Free Recode

Probably the best known general purpose converter is Free Recode, available from http://recode.progiciels-bpi.ca/. It has been designed for use as a so-called filter (in the Unix sense), i.e., as a program that takes input from the standard input stream (called stdin in Unix) and writes the output to the standard output stream (stdout). This means that it is typically used as a component of a chain of programs (a pipe), where data is processed in phases. In such usage, each component is more or less assumed to work correctly. Therefore, Free Recode plays fast and loose. If the input is correct, so that all data actually represents characters in the source encoding and has a representation in the target encoding, the output is fine. If there are errors in the data (e.g., a character that is unrepresentable in the target encoding), no error message is given, and the output is more or less unpredictable. Moreover, when you pass a filename as an argument to Free Recode, the program performs an "in situ" conversion, i.e., it replaces the old content of the file with the new, transcoded version. It is your responsibility as the user to create a backup copy of the original content if you need to, and usually you do.
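For comparison, a filter that reports errors instead of passing them on silently can be sketched in Python with incremental codecs. This is not how Free Recode works internally; the function name transcode_stream and the chunk size are assumptions, and the in-memory streams stand in for stdin and stdout.

```python
import codecs
import io


def transcode_stream(src, dst, from_encoding, to_encoding, chunk_size=65536):
    """Copy a binary stream to another, transcoding it chunk by chunk."""
    decoder = codecs.getincrementaldecoder(from_encoding)("strict")
    encoder = codecs.getincrementalencoder(to_encoding)("strict")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers a multi-byte sequence split across chunks.
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush both codecs; a truncated trailing sequence is reported here.
    dst.write(encoder.encode(decoder.decode(b"", final=True), final=True))


source = io.BytesIO(b"caf\xe9\n")  # ISO-8859-1 input
sink = io.BytesIO()
transcode_stream(source, sink, "iso-8859-1", "utf-8")
```

Passing sys.stdin.buffer and sys.stdout.buffer as the two streams would turn the function into a filter usable in a pipe. With the "strict" setting, bad input raises UnicodeError and stops the conversion.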

Free Recode is available as an executable (.exe) file for Windows. When installing it, it is best to add the name of the folder where you put it into the default path. (You do not need to do this if you put the file into the same folder as your data files, but that gets rather awkward if you perform many transcodings.) Then you can use Free Recode via the command-line interface (DOS prompt) using a command like:

recode cp-437..windows-1252 test.txt

This command takes the content of the file test.txt, interprets it as CP-437 encoded, and transcodes it into windows-1252. The result overwrites the original content of test.txt.
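Because the original is overwritten, a wrapper that makes the backup for you can be useful. The following Python sketch does the transcoding itself and does not call Free Recode; the function name and the ".bak" suffix are assumptions. It converts in memory first, so that a failed conversion leaves the file untouched.

```python
import shutil


def transcode_in_place(path, from_encoding, to_encoding, backup_suffix=".bak"):
    """Overwrite a file with its transcoded content, keeping a backup copy."""
    with open(path, "rb") as f:
        original = f.read()
    # Convert before touching the disk; errors raise UnicodeError here.
    converted = original.decode(from_encoding).encode(to_encoding)
    backup_path = path + backup_suffix
    shutil.copy2(path, backup_path)  # preserve the original content
    with open(path, "wb") as f:
        f.write(converted)
    return backup_path


# Hypothetical usage, mirroring the command above:
# transcode_in_place("test.txt", "cp437", "cp1252")
```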

There are several converters available commercially, too. You may find them more suitable, perhaps because of a graphical user interface or wider support for different encodings. Searching Google for "character * converter" can be useful in finding them.

3.7.3. The iconv Converter

Unix systems normally contain a converter called iconv, which has a simple interface, where you specify the source ("from") encoding after the switch -f, the destination ("to") encoding after the switch -t, and then the source file. The result is written to standard output, which you can direct to a file as usual on Unix. For example:

iconv -f iso-8859-1 -t utf-8 demo.txt >demo.utf

Check man iconv for more instructions. Beware that your system might have an old version of iconv, with rather limited support for different encodings. With some expertise, you could download and install GNU iconv to improve the situation. GNU iconv contains the libiconv library, which you can use when writing programs. For more information, consult http://www.gnu.org/software/libiconv/.