Character Sets and Encodings

	Practical Programming in Tcl & Tk, Third Edition By Brent B. Welch

	Chapter 15. Internationalization

If you are from the United States, you've probably never thought twice about character sets. Most computers use the ASCII encoding, which has 128 characters. That is enough for the 26 letters in the English alphabet, upper case and lower case, plus numbers, various punctuation characters, and control characters like tab and newline. ASCII needs only 7 bits per character, so it fits easily in 8-bit characters, which can represent 256 different values.

European alphabets include accented characters like è, ñ, and ä. The ISO Latin-1 encoding is a superset of ASCII that encodes 256 characters. It shares the ASCII encoding in values 0 through 127 and uses the "high half" of the encoding space to represent accented characters as well as special characters like ©. There are several ISO Latin encodings to handle different alphabets, and these share the trick of encoding ASCII in the lower half and other characters in the high half. You might see these encodings referred to as iso8859-1, iso8859-2, and so on.

Asian character sets are simply too large to fit into 8-bit encodings. There are a number of 16-bit encodings for these languages. If you work with these, you are probably familiar with the "Big 5" or ShiftJIS encodings.

Unicode is an international standard character set encoding. There are both 16-bit Unicode and 32-bit Unicode standards, but Tcl and just about everyone else use the 16-bit standard. Unicode has the important property that it can encode all the important character sets without conflicts or overlap. By converting all characters to the Unicode encoding, Tcl can work with different character sets simultaneously.

The System Encoding

Computer systems are set up with a standard system encoding for their files. If you always work with this encoding, then you can ignore character set issues. Tcl will read files and automatically convert them from the system encoding to Unicode. When Tcl writes files, it automatically converts from Unicode to the system encoding. If you are curious, you can find out the system encoding with:

 encoding system
 => cp1252

The "cp" is short for "code page," the term that Windows uses to refer to different encodings. On my Unix system, the system encoding is iso8859-1.

Do not change the system encoding.

You could also change the system encoding with:

 encoding system encoding

But this is not a good idea. It immediately changes how Tcl passes strings to your operating system, and it is likely to leave Tcl in an unusable state. Tcl automatically determines the system encoding for you. Don't bother trying to set it yourself.

The encoding names command lists all the encodings that Tcl knows about. The encodings are kept in files stored in the encoding directory under the Tcl script library. They are loaded automatically the first time you use an encoding.

 lsort [encoding names]
 => ascii big5 cp1250 cp1251 cp1252 cp1253 cp1254 cp1255 cp1256 cp1257 cp1258 cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp932 cp936 cp949 cp950 dingbats euc-cn euc-jp euc-kr gb12345 gb1988 gb2312 identity iso2022 iso2022-jp iso2022-kr iso8859-1 iso8859-2 iso8859-3 iso8859-4 iso8859-5 iso8859-6 iso8859-7 iso8859-8 iso8859-9 jis0201 jis0208 jis0212 ksc5601 macCentEuro macCroatian macCyrillic macDingbats macGreek macIceland macJapan macRoman macRomania macThai macTurkish macUkraine shiftjis symbol unicode utf-8

The encoding names reflect their origin. The "cp" refers to the "code pages" that Windows uses to manage encodings. The "mac" encodings come from the Macintosh. The "iso," "euc," "gb," and "jis" encodings come from various standards bodies.
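Because the list of encodings depends on the installation, it can be worth checking for a name before relying on it. The following is a minimal sketch; `EncodingKnown` is a hypothetical helper name, not a Tcl built-in.

```tcl
# EncodingKnown is a hypothetical helper, not part of Tcl.
# It returns 1 if [encoding names] lists the given name, and 0 otherwise.
proc EncodingKnown {name} {
    expr {[lsearch -exact [encoding names] $name] >= 0}
}

puts [EncodingKnown utf-8]
puts [EncodingKnown no-such-encoding]
```

The comparison is exact and case-sensitive, so the name must be spelled the way Tcl lists it.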

File Encodings and `fconfigure`

The conversion to Unicode happens automatically in the Tcl C library. When Tcl reads and writes files, it translates between the current system encoding and Unicode. If you have files in different encodings, you can use the fconfigure command to set the encoding. For example, to read a file in the ISO Cyrillic encoding used for Russian (iso8859-5):

 set in [open README.russian]
 fconfigure $in -encoding iso8859-5
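The same pattern can be wrapped in a small procedure. This is only a sketch under stated assumptions: `ReadFileAs` and the sample file name are hypothetical, and iso8859-5 (the ISO Latin/Cyrillic encoding) is used purely as an illustration. The script first writes a small file in that encoding so that it is self-contained.

```tcl
# ReadFileAs is a hypothetical helper: read a whole file using the named encoding.
proc ReadFileAs {path encoding} {
    set in [open $path r]
    fconfigure $in -encoding $encoding
    set data [read $in]
    close $in
    return $data
}

# Create a sample file in iso8859-5 (the file name is made up for this example).
set path [file join [pwd] sample-iso8859-5.txt]
set text "\u041f\u0440\u0438\u0432\u0435\u0442"   ;# six Cyrillic letters
set out [open $path w]
fconfigure $out -encoding iso8859-5
puts -nonewline $out $text
close $out

puts [file size $path]                                  ;# one byte per letter on disk
puts [string equal $text [ReadFileAs $path iso8859-5]]  ;# 1 if the text survived
file delete $path
```

Reading the same file without the fconfigure call would interpret the bytes in the system encoding, which generally yields different characters.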

Example 15-1 shows a simple utility I use in exmh,^[*] a MIME-aware mail reader. MIME has its own convention for specifying the character set encoding of a mail message that differs slightly from Tcl's naming convention. The procedure launders the name and then sets the encoding. Exmh was already aware of MIME character sets, so it could choose fonts for message display. Adding this procedure and adding two calls to it was all I had to do to adapt exmh to Unicode.

^[*] The exmh home page is http://www.beedub.com/exmh/. It is a wonderful tool that helps me manage tons of e-mail. It is written in Tcl/Tk, of course, and relies on the MH mail system, which limits it to UNIX.

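Example 15-1 MIME character sets and file encodings.

 proc Mime_SetEncoding {file charset} {
    # Lowercase first so that names like ISO-8859-1 match the patterns below
    set charset [string tolower $charset]
    # Drop the hyphen after iso, jis, or us (iso-8859-1 becomes iso8859-1)
    regsub -all {(iso|jis|us)-} $charset {\1} charset
    regsub usascii $charset ascii charset
    fconfigure $file -encoding $charset
 }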
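The name laundering can also be separated from the fconfigure call, which makes it easy to fall back when Tcl does not recognize the result. The sketch below is an assumption-laden variant, not exmh code: `Mime_TclEncodingName` is a hypothetical name, and falling back to the system encoding is just one possible policy.

```tcl
# Mime_TclEncodingName is a hypothetical helper, not part of exmh or Tcl.
# It maps a MIME charset name to a Tcl encoding name, and returns the
# system encoding when Tcl does not list the laundered name.
proc Mime_TclEncodingName {charset} {
    set name [string tolower $charset]
    regsub -all {(iso|jis|us)-} $name {\1} name
    regsub usascii $name ascii name
    if {[lsearch -exact [encoding names] $name] < 0} {
        return [encoding system]
    }
    return $name
}

puts [Mime_TclEncodingName ISO-8859-1]
puts [Mime_TclEncodingName US-ASCII]
```

A caller could then write `fconfigure $file -encoding [Mime_TclEncodingName $charset]` without risking an error on an unknown charset.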

Scripts in Different Encodings

If you have scripts that are not in the system encoding, then you cannot use source to load them. However, it is easy to read the files yourself under the proper encoding and use eval to process them. Example 15-2 adds a -encoding flag to the source command. This is likely to become a built-in feature in future versions of Tcl so that commands like info script will work properly:

Example 15-2 Using scripts in nonstandard encodings.

 proc Source {args} {
    # The file name is always the last argument
    set file [lindex $args end]
    if {[llength $args] == 3 &&
          [string equal -encoding [lindex $args 0]]} {
       set encoding [lindex $args 1]
       set in [open $file]
       fconfigure $in -encoding $encoding
       set script [read $in]
       close $in
       # Evaluate the script in the caller's scope, as source would
       return [uplevel 1 $script]
    } elseif {[llength $args] == 1} {
       return [uplevel 1 [list source $file]]
    } else {
       return -code error \
          "Usage: Source ?-encoding encoding? file"
    }
 }

Unicode and UTF-8

UTF-8 is an encoding for Unicode. While Unicode represents all characters with 16 bits, the UTF-8 encoding uses either 8, 16, or 24 bits to represent one Unicode character. This variable-width encoding is useful because it uses 8 bits to represent ASCII characters. This means that a pure ASCII string, one with character codes all less than 128, is also a UTF-8 string. Tcl uses UTF-8 internally to make the transition to Unicode easier. It allows interoperability with Tcl extensions that have not been made Unicode-aware. They can continue to pass ASCII strings to Tcl, and Tcl will interpret them correctly.
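The variable width is easy to observe from a script. In this small sketch, `Utf8Bytes` is a hypothetical helper name; it converts a string to UTF-8 with encoding convertto and counts the resulting bytes.

```tcl
# Utf8Bytes is a hypothetical helper: the number of bytes a string occupies in UTF-8.
proc Utf8Bytes {s} {
    string length [encoding convertto utf-8 $s]
}

puts [Utf8Bytes "A"]        ;# an ASCII letter: 1 byte
puts [Utf8Bytes "\u00e9"]   ;# e with an acute accent: 2 bytes
puts [Utf8Bytes "\u4e2d"]   ;# a Chinese ideograph: 3 bytes
```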

As a Tcl script writer, you can mostly ignore UTF-8 and just think of Tcl as being built on Unicode (i.e., full 16-bit character set support). If you write Tcl extensions in C or C++, however, the impact of UTF-8 and Unicode is quite visible. This is explained in more detail in Chapter 44.

Tcl lets you read and write files in UTF-8 encoding or directly in Unicode. This is useful if you need to use the same file on systems that have different system encodings. These files might be scripts, message catalogs, or documentation. Instead of using a particular native format, you can use Unicode or UTF-8 and read the files the same way on any of your systems. Of course, you will have to set the encoding properly by using fconfigure as shown earlier.

The Binary Encoding

If you want to read a data file and suppress all character set transformations, use the binary encoding:

 fconfigure $in -encoding binary

Under the binary encoding, Tcl reads in each 8-bit byte and stores it into the lower half of a 16-bit Unicode character with the high half set to zero. During binary output, Tcl writes out the lower byte of each Unicode character. You can see that reading in binary and then writing it out doesn't change any bits. Watch out if you read something in one encoding and then write it out in binary. Any information in the high byte of the Unicode character gets lost!

Tcl actually handles the binary encoding more efficiently than just described, but logically the previous description is still accurate. As described in Chapter 44, Tcl can manage data in several forms, not just strings. When you read a file in binary format, Tcl stores the data as a ByteArray that is simply 8 bits of data in each byte. However, if you ask for this data as a string (e.g., with the puts command), Tcl automatically converts from 8-bit bytes to 16-bit Unicode characters by setting the high byte to all zeros.

The binary command also manipulates data in ByteArray format. If you read a file with the binary encoding and then use the binary command to process the data, Tcl will keep the data in an efficient form.

The string command also understands the ByteArray format, so you can do operations like string length, string range, and string index on binary data without suffering the conversion cost from a ByteArray to a UTF-8 string.
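As a rough illustration of the points above, the following sketch writes four bytes to a file, reads them back without any character set transformation, and inspects them with the string and binary commands. The file name is made up for the example. It uses `-translation binary`, which sets the binary encoding and also turns off end-of-line translation, so the bytes pass through untouched.

```tcl
set path [file join [pwd] sample.bin]          ;# hypothetical file name
set bytes [binary format c* {0 127 128 255}]   ;# four arbitrary byte values

set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out $bytes
close $out

set in [open $path r]
fconfigure $in -translation binary
set data [read $in]
close $in
file delete $path

puts [string length $data]   ;# number of bytes read
binary scan $data H* hex     ;# hexadecimal dump of the bytes
puts $hex
```

The hexadecimal dump should match the four values that were written, which shows that no bits changed on the way out or back in.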

Conversions Between Encodings

The encoding command lets you convert strings between encodings. The encoding convertfrom command converts data in some other encoding into a Unicode string. The encoding convertto command converts a Unicode string into some other encoding. For example, the following two sequences of commands are equivalent. They both read data from a file that is in the gb12345 encoding and convert it to Unicode:

 fconfigure $input -encoding gb12345
 set unicode [read $input]

 fconfigure $input -encoding binary
 set unicode [encoding convertfrom gb12345 [read $input]]

In general, you can lose information when you go from Unicode to any other encoding, so you ought to be aware of the limitations of the encodings you are using. In particular, the binary encoding may not preserve your data if it starts out from an arbitrary Unicode string. Similarly, an encoding like iso8859-2 may simply not have a representation of a given Unicode character.
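One way to find out whether a particular string survives a particular encoding is to convert it there and back and compare the result. This is a minimal sketch; `EncodingRoundTrips` is a hypothetical helper name. The catch is there in case a conversion raises an error rather than substituting a replacement character.

```tcl
# EncodingRoundTrips is a hypothetical helper. It returns 1 if the string
# is unchanged after converting to the encoding and back, and 0 otherwise.
proc EncodingRoundTrips {encoding s} {
    if {[catch {
        set back [encoding convertfrom $encoding \
                      [encoding convertto $encoding $s]]
    }]} {
        return 0
    }
    return [string equal $s $back]
}

puts [EncodingRoundTrips iso8859-1 "caf\u00e9"]   ;# accented e is in ISO Latin-1
puts [EncodingRoundTrips iso8859-1 "\u4e2d"]      ;# a Chinese ideograph is not
puts [EncodingRoundTrips utf-8 "\u4e2d"]          ;# UTF-8 can represent it
```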

The `encoding` Command

Table 15-1 summarizes the encoding command:

Table 15-1. The `encoding` command.
`encoding convertfrom ?encoding?` `data`	Converts binary `data` from the specified `encoding`, which defaults to the system encoding, into Unicode.
`encoding convertto ?encoding?` `string`	Converts `string` from Unicode into data in the `encoding` format, which defaults to the system encoding.
`encoding names`	Returns the names of known encodings.
`encoding system ?encoding?`	Queries or changes the system encoding.
