Encodings


When storing or reading text, encodings are used to specify precisely which bytes represent which characters. Ultimately, when data gets written to disk or transmitted over a network, everything is just a sequence of bytes. In order for those bytes to mean anything to the end user, there needs to be a standard way to interpret them. This is the goal of encodings.

The ASCII (American Standard Code for Information Interchange) encoding has been used for years as the basis for plain-text documents, as well as several extended character sets. ASCII itself uses only 7 bits per character (with 1 bit reserved for extensibility) to represent 128 possible unique characters. Unfortunately, 128 characters aren't quite enough to represent the world's character sets, such as Arabic, Kanji, and so forth. Most extensions to ASCII, such as the ANSI and OEM code pages in Windows, add support for Latin or other culture-specific characters by utilizing the extra bit for the upper range of characters (i.e., from 128 to 255). Still, this isn't sufficient for many languages, especially if you intend to draw characters from diverse languages on screen simultaneously.

As international software has become more pervasive, the world has shifted toward 16-bit character encodings, also called double-byte encodings. Unicode is the de facto standard. Employing 2 full bytes per character means roughly 65,000 individual code-points (a fancy name for characters) can be represented. (This is an oversimplification; surrogate pairs, which combine two 16-bit code units, push the real number of available code-points much higher.) Unicode's wide adoption makes sharing Unicode-encoded data between applications and across platforms straightforward.
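To make the surrogate point concrete, here is a small C# sketch of my own (the class name is mine; the APIs are standard System.Char members) showing that a code-point within the 16-bit range occupies a single char, while one beyond U+FFFF is stored as a surrogate pair of two chars:

    using System;

    class SurrogateSample
    {
        static void Main()
        {
            // U+20AC (the euro sign) fits in a single 16-bit char.
            string euro = "\u20AC";
            Console.WriteLine(euro.Length);                              // 1

            // U+1D11E (musical G clef) lies beyond U+FFFF, so it is stored
            // as a surrogate pair: two 16-bit chars.
            string clef = char.ConvertFromUtf32(0x1D11E);
            Console.WriteLine(clef.Length);                              // 2
            Console.WriteLine(char.IsSurrogatePair(clef[0], clef[1]));   // True
        }
    }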

UTF-8, described by RFC 2279, is a variable-length encoding of Unicode that favors efficiency in the size of the encoded stream. It uses a single byte for the lower code-points (i.e., the ASCII base character set) and multiple bytes for characters above 127. A single Unicode character in the range U+0000 through U+FFFF requires between one and three bytes in UTF-8; code-points beyond U+FFFF require four (and are represented as surrogate pairs of 16-bit characters in UTF-16). The maximum encoding length permitted by RFC 2279 is 6 bytes, although no Unicode code-points requiring more than 4 bytes in UTF-8 are currently defined. UTF-7, described by RFC 2152 (A Mail-Safe Transformation Format of Unicode), is a related encoding that restricts its output to characters safe for 7-bit transports such as e-mail. The CLR metadata format uses UTF-8 for most text encoding (string tables, for example), although full-fledged 16-bit Unicode (i.e., wchar_t) characters are used to represent text at runtime.
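To see the variable-length behavior in practice, here is a sketch of my own (the class name is mine) that counts the bytes various code-points consume in UTF-8 versus UTF-16 using GetByteCount:

    using System;
    using System.Text;

    class Utf8Lengths
    {
        static void Main()
        {
            string[] samples = { "A", "\u00E9", "\u20AC", char.ConvertFromUtf32(0x1D11E) };

            foreach (string s in samples)
            {
                // ASCII takes 1 byte in UTF-8; é takes 2; the euro sign takes 3;
                // a code-point beyond U+FFFF takes 4. UTF-16 always uses 2 bytes
                // per char (so 4 bytes for a surrogate pair).
                Console.WriteLine("U+{0:X4}: UTF-8 = {1} bytes, UTF-16 = {2} bytes",
                    char.ConvertToUtf32(s, 0),
                    Encoding.UTF8.GetByteCount(s),
                    Encoding.Unicode.GetByteCount(s));
            }
        }
    }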

Encodings are quite complicated beasts. Unfortunately, there are many more than just ASCII, Unicode, UTF-8, and UTF-7, although these are thankfully the most common. Converting to and from the various encodings is easy, thanks to the Encoding classes in the Framework's System.Text namespace.

BCL Support

The System.Text.Encoding type is an abstract class from which a set of concrete Encoding types derives, each exposed as a static property on Encoding itself. These include ASCII, the traditional 7-bit ASCII encoding (supporting code-points U+0000 through U+007F); Default, the system's current ANSI code-page, retrieved through a Win32 call to GetACP; Unicode, little endian UTF-16, the default Unicode encoding on Windows; BigEndianUnicode, big endian UTF-16 (sometimes needed when interoperating with non-Windows platforms, UNIX for example); UTF8; UTF7; and UTF32.
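A quick sketch of my own (the class name is mine) that simply walks these static properties and prints each one's descriptive name and underlying code-page number:

    using System;
    using System.Text;

    class ListEncodings
    {
        static void Main()
        {
            Encoding[] encodings = {
                Encoding.ASCII, Encoding.Default, Encoding.Unicode,
                Encoding.BigEndianUnicode, Encoding.UTF8, Encoding.UTF7, Encoding.UTF32
            };

            foreach (Encoding e in encodings)
            {
                // EncodingName is a human-readable description; CodePage is the
                // Windows code-page number that backs the encoding.
                Console.WriteLine("{0} (code-page {1})", e.EncodingName, e.CodePage);
            }
        }
    }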

In addition to the static properties, you can obtain an Encoding instance by passing a Windows code-page name or number to Encoding.GetEncoding. For example, the default ANSI code-page for Latin-based language geographies is Western European (Windows), or code-page 1252. You can obtain that Encoding using the following code:

 Encoding westernEuropean = Encoding.GetEncoding(1252); 

The Encoding class has a number of interesting members. For example, you can take a CLR string or char[], pass it to GetBytes, and obtain a byte[] containing that text in the target encoding. Various types in the System.IO namespace use this functionality to ensure stream contents are serialized according to the selected encoding. Similarly, GetChars goes in the reverse direction: given a byte[], it decodes the contents into a char[]. These methods rely on the Encoder and Decoder types, which we will not discuss here for the sake of brevity.
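For example, the following sketch (mine, not the book's) round-trips a string through UTF-8 with GetBytes and GetChars, and shows how the encoded sizes differ between UTF-8 and UTF-16:

    using System;
    using System.Text;

    class RoundTrip
    {
        static void Main()
        {
            string text = "Héllo, wörld";

            // GetBytes produces the byte representation in the target encoding.
            byte[] utf8 = Encoding.UTF8.GetBytes(text);
            byte[] utf16 = Encoding.Unicode.GetBytes(text);
            Console.WriteLine("UTF-8: {0} bytes, UTF-16: {1} bytes",
                utf8.Length, utf16.Length);

            // GetChars decodes the bytes back into characters, provided the
            // same encoding is used in both directions.
            char[] chars = Encoding.UTF8.GetChars(utf8);
            Console.WriteLine(new string(chars) == text);   // True
        }
    }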

IO Integration

Encodings come into play primarily when performing IO. For example, many of the File, Stream, and reader and writer APIs in the System.IO namespace enable you to pass in an Encoding instance to indicate how data should be read or written. It's important to understand that whatever encoding was used to create textual data must also be used to read it. Otherwise, bytes will be interpreted incorrectly, resulting in garbage. If you ever read a file, for example, and it appears garbled and unintelligible, it's very likely you have run into an encoding mismatch problem.
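The following sketch of my own demonstrates the mismatch problem with the File helpers (the temporary file is just a convenience for the example): text written as code-page 1252 reads back correctly with the same encoding, but decoding it as UTF-8 garbles the accented character.

    using System;
    using System.IO;
    using System.Text;

    class EncodingMismatch
    {
        static void Main()
        {
            string path = Path.GetTempFileName();   // scratch file for the demo
            Encoding latin1252 = Encoding.GetEncoding(1252);

            // Write the text using the Western European code-page...
            File.WriteAllText(path, "Café au lait", latin1252);

            // ...reading with the same encoding gives the original text back.
            Console.WriteLine(File.ReadAllText(path, latin1252));

            // Reading with a mismatched encoding misinterprets the 0xE9 byte,
            // producing garbage (a U+FFFD replacement character here).
            Console.WriteLine(File.ReadAllText(path, Encoding.UTF8));
        }
    }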

Byte-Order-Marks

In many cases, the System.IO classes will actually pick up the correct encoding by default, due to something called a byte-order-mark (BOM). A BOM is a small signature, the code-point U+FEFF, written at the beginning of a Unicode-encoded file to indicate things such as endian-ness (i.e., whether the reader sees the bytes 0xFE 0xFF or 0xFF 0xFE). Encoding-specific forms of the BOM are also used to signal encodings such as UTF-8 and UTF-7.

The System.IO.StreamReader constructors have a detectEncodingFromByteOrderMarks parameter to specify whether BOMs should be automatically detected and responded to or not. The default value for this is true.
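Here is a sketch (again my own, using a temporary file for convenience) showing the detection at work: the file is written as little endian UTF-16 with a BOM, and even though the reader is handed UTF-8, it notices the 0xFF 0xFE signature and switches encodings.

    using System;
    using System.IO;
    using System.Text;

    class BomDetection
    {
        static void Main()
        {
            string path = Path.GetTempFileName();   // scratch file for the demo

            // StreamWriter emits the UTF-16 LE BOM (0xFF 0xFE) before the text.
            using (StreamWriter writer = new StreamWriter(path, false, Encoding.Unicode))
            {
                writer.WriteLine("Detected via BOM");
            }

            // detectEncodingFromByteOrderMarks is true here (and by default), so
            // the reader sniffs the BOM and ignores the UTF-8 hint we passed in.
            using (StreamReader reader = new StreamReader(path, Encoding.UTF8, true))
            {
                Console.WriteLine(reader.ReadLine());
                Console.WriteLine(reader.CurrentEncoding.EncodingName);   // Unicode
            }
        }
    }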



