2.2 Encode a String Using Alternate Character Encoding


Problem

You need to exchange character data with systems that use character-encoding schemes other than UTF-16, the character-encoding scheme used internally by the CLR.

Solution

Use the System.Text.Encoding class and its subclasses to convert characters between different encoding schemes.

Discussion

Unicode is not the only character-encoding scheme, nor is UTF-16 the only way to represent Unicode characters. When your application needs to exchange character data with external systems (particularly legacy systems), you must convert the data between UTF-16 and the encoding scheme supported by the other system.

The abstract class Encoding, and its concrete subclasses, provide the functionality to convert characters to and from a variety of encoding schemes. Each subclass instance supports the conversion of characters between UTF-16 and one other encoding scheme. You obtain instances of the encoding-specific classes using the static factory method Encoding.GetEncoding, which accepts either the name or the code page number of the required encoding scheme.
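The two ways of calling GetEncoding can be compared with a minimal sketch. The class name GetEncodingSketch and the helper SameScheme are hypothetical names introduced only for illustration. The sketch sticks to UTF-8, ASCII, and UTF-16 because these are available on every .NET runtime; on .NET Core and later, legacy code pages such as 1252 are only available after registering CodePagesEncodingProvider.

```csharp
using System;
using System.Text;

public static class GetEncodingSketch
{
    // Hypothetical helper: returns true when the name and the code page
    // number resolve to the same encoding scheme.
    public static bool SameScheme(string name, int codePage)
    {
        Encoding byName = Encoding.GetEncoding(name);
        Encoding byNumber = Encoding.GetEncoding(codePage);
        return byName.CodePage == byNumber.CodePage;
    }

    public static void Main()
    {
        Console.WriteLine(SameScheme("utf-8", 65001));    // prints True
        Console.WriteLine(SameScheme("us-ascii", 20127)); // prints True
        Console.WriteLine(SameScheme("utf-16", 1200));    // prints True
    }
}
```

If the name or number is not recognized, GetEncoding throws an exception (ArgumentException or NotSupportedException), so code that accepts an encoding name from an external source may want to catch those.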

Table 2.1 lists some commonly used character encoding schemes and the code page number you must pass to the GetEncoding method to create an instance of the appropriate encoding class. The table also shows static properties of the Encoding class that provide shortcuts for obtaining the most commonly used types of encoding object.

Table 2.1: Character Encoding Classes

Encoding Scheme                              Class            Create Using
-------------------------------------------  ---------------  --------------------------------------------------
ASCII                                        ASCIIEncoding    GetEncoding(20127) or the ASCII property
Default (current Microsoft Windows default)  Encoding         GetEncoding(0) or the Default property
UTF-7                                        UTF7Encoding     GetEncoding(65000) or the UTF7 property
UTF-8                                        UTF8Encoding     GetEncoding(65001) or the UTF8 property
UTF-16 (Big Endian)                          UnicodeEncoding  GetEncoding(1201) or the BigEndianUnicode property
UTF-16 (Little Endian)                       UnicodeEncoding  GetEncoding(1200) or the Unicode property
Windows OS                                   Encoding         GetEncoding(1252)
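As a quick check of the table, the following sketch reads the CodePage property of the objects returned by some of the shortcut properties. The class name EncodingShortcutSketch and the helper ShortcutCodePages are hypothetical. The Default and UTF7 properties are left out on purpose: Default depends on the platform, and UTF7 is flagged as obsolete on recent .NET versions.

```csharp
using System;
using System.Text;

public static class EncodingShortcutSketch
{
    // Hypothetical helper: code pages reported by four shortcut properties,
    // in the order ASCII, UTF8, BigEndianUnicode, Unicode.
    public static int[] ShortcutCodePages()
    {
        return new int[]
        {
            Encoding.ASCII.CodePage,
            Encoding.UTF8.CodePage,
            Encoding.BigEndianUnicode.CodePage,
            Encoding.Unicode.CodePage
        };
    }

    public static void Main()
    {
        // prints 20127, 65001, 1201, 1200
        Console.WriteLine(string.Join(", ", ShortcutCodePages()));
    }
}
```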

Once you have an Encoding object of the appropriate type, you convert a UTF-16 encoded Unicode string to a byte array of encoded characters using the GetBytes method and convert a byte array of encoded characters to a string using the GetString method. The following code demonstrates the use of some encoding classes.

    using System;
    using System.IO;
    using System.Text;

    public class CharacterEncodingExample {

        public static void Main() {

            // Create a file to hold the output
            using (StreamWriter output = new StreamWriter("output.txt")) {

                // Create and write a string containing the symbol for Pi
                string srcString = "Area = \u03A0r^2";
                output.WriteLine("Source Text : " + srcString);

                // Write the UTF-16 encoded bytes of the source string
                byte[] utf16String = Encoding.Unicode.GetBytes(srcString);
                output.WriteLine("UTF-16 Bytes: {0}",
                    BitConverter.ToString(utf16String));

                // Convert the UTF-16 encoded source string to UTF-8 and ASCII
                byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
                byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

                // Write the UTF-8 and ASCII encoded byte arrays
                output.WriteLine("UTF-8  Bytes: {0}",
                    BitConverter.ToString(utf8String));
                output.WriteLine("ASCII  Bytes: {0}",
                    BitConverter.ToString(asciiString));

                // Convert UTF-8 and ASCII encoded bytes back to UTF-16
                // encoded string and write
                output.WriteLine("UTF-8  Text : {0}",
                    Encoding.UTF8.GetString(utf8String));
                output.WriteLine("ASCII  Text : {0}",
                    Encoding.ASCII.GetString(asciiString));

                // Flush and close the output file
                output.Flush();
                output.Close();
            }
        }
    }

Running CharacterEncodingExample will generate a file named output.txt. If you open this file in a text editor that supports Unicode, you will see the following content:

    Source Text : Area = Πr^2
    UTF-16 Bytes: 41-00-72-00-65-00-61-00-20-00-3D-00-20-00-A0-03-72-00-5E-00-32-00
    UTF-8  Bytes: 41-72-65-61-20-3D-20-CE-A0-72-5E-32
    ASCII  Bytes: 41-72-65-61-20-3D-20-3F-72-5E-32
    UTF-8  Text : Area = Πr^2
    ASCII  Text : Area = ?r^2

Notice that using UTF-16 encoding, each character occupies 2 bytes, but because most of the characters are standard ASCII characters, the high-order byte is 0. (The use of little-endian byte ordering means that the low-order byte appears first.) This means that most of the characters are encoded using the same numeric values across all three encoding schemes. However, the numeric value for the symbol pi (the bytes A0-03, CE-A0, and 3F in the preceding output) is different in each of the encodings. The character code for pi (U+03A0) does not fit in a single byte: UTF-16 and UTF-8 each use 2 bytes for it, but ASCII has no equivalent and so replaces pi with the code 3F. As you can see in the text version of the string, 3F is the code for a question mark (?).
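The byte counts described above can be checked without writing a file. The following is a minimal sketch that uses the GetByteCount method; the class name ByteCountSketch and the helper Describe are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class ByteCountSketch
{
    // Hypothetical helper: reports how many bytes each encoding produces
    // for the given text.
    public static string Describe(string text)
    {
        return string.Format("UTF-16: {0}, UTF-8: {1}, ASCII: {2}",
            Encoding.Unicode.GetByteCount(text),
            Encoding.UTF8.GetByteCount(text),
            Encoding.ASCII.GetByteCount(text));
    }

    public static void Main()
    {
        // A standard ASCII character
        Console.WriteLine(Describe("A"));      // prints UTF-16: 2, UTF-8: 1, ASCII: 1

        // The symbol pi; the single ASCII byte is the replacement code 3F
        Console.WriteLine(Describe("\u03A0")); // prints UTF-16: 2, UTF-8: 2, ASCII: 1
    }
}
```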

Warning  

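If you convert Unicode characters to ASCII or a specific code page encoding scheme, you risk losing data. Any Unicode character with a character code that can't be represented in the scheme is replaced with a substitute character (by default, a question mark), and the original character can't be recovered from the encoded bytes.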
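One way to detect this kind of data loss before it happens is to request an encoding object that throws an exception instead of substituting a character. The following is a sketch, not part of the original recipe: it relies on the GetEncoding overload that accepts EncoderFallback and DecoderFallback arguments, and the class name LossyCheckSketch and the helper IsLossless are hypothetical.

```csharp
using System;
using System.Text;

public static class LossyCheckSketch
{
    // Hypothetical helper: returns true if every character in the text can
    // be represented in the named encoding scheme.
    public static bool IsLossless(string text, string encodingName)
    {
        // Ask for an encoding that throws rather than substituting "?"
        Encoding strict = Encoding.GetEncoding(encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsLossless("Area = r^2", "us-ascii"));       // prints True
        Console.WriteLine(IsLossless("Area = \u03A0r^2", "us-ascii")); // prints False
    }
}
```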

The Encoding class also provides the static method Convert to simplify the conversion of a byte array from one encoding scheme to another without the need to manually perform an interim conversion to UTF-16. For example, the following statement converts the ASCII encoded bytes contained in the asciiString byte array directly from ASCII encoding to UTF-8 encoding:

    byte[] utf8String = Encoding.Convert(Encoding.ASCII, Encoding.UTF8,
        asciiString);
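For a self-contained illustration, the sketch below converts the UTF-16 bytes of the example string straight to UTF-8 and compares the result with the manual two-step conversion through a string. The class name ConvertSketch and the helper Utf16ToUtf8 are hypothetical names.

```csharp
using System;
using System.Text;

public static class ConvertSketch
{
    // Hypothetical helper: converts little-endian UTF-16 bytes to UTF-8 bytes.
    public static byte[] Utf16ToUtf8(byte[] utf16Bytes)
    {
        return Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
    }

    public static void Main()
    {
        byte[] utf16 = Encoding.Unicode.GetBytes("Area = \u03A0r^2");

        // Direct conversion with Encoding.Convert
        byte[] direct = Utf16ToUtf8(utf16);

        // Equivalent manual conversion through an interim string
        byte[] twoStep = Encoding.UTF8.GetBytes(Encoding.Unicode.GetString(utf16));

        // prints 41-72-65-61-20-3D-20-CE-A0-72-5E-32
        Console.WriteLine(BitConverter.ToString(direct));

        // prints True
        Console.WriteLine(
            BitConverter.ToString(direct) == BitConverter.ToString(twoStep));
    }
}
```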



C# Programmer's Cookbook
ISBN: 735619301
EAN: N/A
Year: 2006
Pages: 266
