Recipe 5.23. Converting Strings Between Encoding Systems

Problem

You need to convert string data to and from byte arrays using an encoding method matched to your data, environment, or culture.

Solution

Sample code folder: Chapter 05\Encoding

Use the shared encoding properties of System.Text.Encoding (UTF7, UTF8, Unicode, or UTF32, as appropriate) and their GetBytes and GetString methods to convert between strings and byte arrays.

Discussion

The following code starts with a sample string and then converts it to four byte arrays, one for each type of encoding. The length of each byte array will vary as a function of the encoding (to be explained in more detail later), so the Length property of each array is formatted into a StringBuilder for display at the end of the code. The four byte arrays are then converted back to Strings, using the same encoding in each case, and a quick check is made to verify that the resulting strings match the original:

    Dim quote As String = "The important thing is not to " & _
       "stop questioning. --Albert Einstein"
    Dim result As New System.Text.StringBuilder

    ' ----- Convert a string to various formats.
    Dim bytesUTF7 As Byte() = _
       System.Text.Encoding.UTF7.GetBytes(quote)
    Dim bytesUTF8 As Byte() = _
       System.Text.Encoding.UTF8.GetBytes(quote)
    Dim bytesUnicode As Byte() = _
       System.Text.Encoding.Unicode.GetBytes(quote)
    Dim bytesUTF32 As Byte() = _
       System.Text.Encoding.UTF32.GetBytes(quote)

    ' ----- Show the converted results.
    result.Append("bytesUTF7.Length = ")
    result.AppendLine(bytesUTF7.Length.ToString())
    result.Append("bytesUTF8.Length = ")
    result.AppendLine(bytesUTF8.Length.ToString())
    result.Append("bytesUnicode.Length = ")
    result.AppendLine(bytesUnicode.Length.ToString())
    result.Append("bytesUTF32.Length = ")
    result.AppendLine(bytesUTF32.Length.ToString())

    ' ----- Convert everything back to standard strings.
    Dim fromUTF7 As String = _
       System.Text.Encoding.UTF7.GetString(bytesUTF7)
    Dim fromUTF8 As String = _
       System.Text.Encoding.UTF8.GetString(bytesUTF8)
    Dim fromUnicode As String = _
       System.Text.Encoding.Unicode.GetString(bytesUnicode)
    Dim fromUTF32 As String = _
       System.Text.Encoding.UTF32.GetString(bytesUTF32)

    ' ----- Check for conversion issues.
    If (fromUTF7 <> quote) Then _
       Throw New Exception("UTF7 Conversion Error")
    If (fromUTF8 <> quote) Then _
       Throw New Exception("UTF8 Conversion Error")
    If (fromUnicode <> quote) Then _
       Throw New Exception("Unicode Conversion Error")
    If (fromUTF32 <> quote) Then _
       Throw New Exception("UTF32 Conversion Error")

    MsgBox(result.ToString())

All strings in .NET are stored internally as sequences of two-byte UTF-16 code units. An encoding determines how those characters are laid out as bytes. When every character of a string falls within the ASCII range, some encodings need only one byte per character.

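UTF7 encoding produces bytes that use only the lower seven bits, leaving the highest-order bit as zero in all cases. Letters, digits, and common punctuation in the ASCII range 0 to 127, which covers most English-language text, are written directly as one byte each. Characters outside that directly encoded set are written as a run of modified Base64 text between + and - markers, so they take more than one byte. UTF7 was designed for mail systems that could not pass eight-bit data. It is seldom needed now, and recent .NET releases mark Encoding.UTF7 as obsolete.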

UTF8 matches ASCII for characters in the range 0 to 127, storing each one in a single byte, so for plain English text it typically produces the same number of bytes as UTF7. Unlike UTF7, it uses all eight bits of each byte: every character above 127 is stored as a sequence of two to four bytes, each with its high bit set. This lets UTF8 represent any Unicode character, not just the 256 values of an extended ASCII character set.
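A quick way to see how many bytes UTF8 produces for a given character is to ask the encoder for a byte count. The following is a minimal sketch, not part of the recipe's sample code; the module name and the three sample characters (A, e with an acute accent, and the euro sign) are choices made for this illustration.

```vbnet
Imports System
Imports System.Text

Module Utf8LengthDemo
    Sub Main()
        ' ChrW builds each character from its code point, so the
        ' example does not depend on the source file's own encoding.
        Dim samples() As String = {"A", ChrW(&HE9), ChrW(&H20AC)}

        For Each sample As String In samples
            Dim byteCount As Integer = Encoding.UTF8.GetByteCount(sample)
            Console.WriteLine("U+{0:X4} -> {1} byte(s)", AscW(sample), byteCount)
        Next
        ' Prints:
        '   U+0041 -> 1 byte(s)
        '   U+00E9 -> 2 byte(s)
        '   U+20AC -> 3 byte(s)
    End Sub
End Module
```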

The Unicode encoding in .NET is UTF-16 in little-endian byte order, the same form that .NET strings use in memory. Most characters take two bytes. Standard ASCII characters keep their 0 to 127 values in Unicode, so the second byte of each character in this range is zero. Characters from other languages and cultures often have Unicode values greater than 255, and Visual Basic strings handle them just fine.
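The zero second byte is easy to see by dumping the bytes that the Unicode encoding produces for a short ASCII string. This is a small sketch for illustration only; the module name and the sample string "Hi" are assumptions, not part of the recipe's sample code.

```vbnet
Imports System
Imports System.Text

Module UnicodeBytesDemo
    Sub Main()
        ' Encoding.Unicode writes the low-order byte of each character first.
        Dim bytes() As Byte = Encoding.Unicode.GetBytes("Hi")

        ' BitConverter.ToString formats the array as hyphen-separated hex.
        Console.WriteLine(BitConverter.ToString(bytes))   ' 48-00-69-00
    End Sub
End Module
```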

UTF32 is not widely used, because it requires four bytes for every character. Its advantage is uniformity. Some characters have code points above 65,535 and need a pair of two-byte values (a surrogate pair) in the Unicode encoding, whereas UTF32 stores every character in exactly four bytes, which can simplify internal processing. String data is rarely stored on external media as UTF32. When it is used at all, it is usually only for processing in memory.
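The case where one character needs two sequential two-byte values can be sketched as follows. The example is an illustration under stated assumptions: the module name is hypothetical, and the musical G clef symbol (U+1D11E) is simply one convenient character whose code point lies above 65,535.

```vbnet
Imports System
Imports System.Text

Module SurrogatePairDemo
    Sub Main()
        ' U+1D11E lies above 65,535, so a .NET string needs
        ' two Char values (a surrogate pair) to hold it.
        Dim clef As String = Char.ConvertFromUtf32(&H1D11E)

        Console.WriteLine("Char count:    {0}", clef.Length)                          ' 2
        Console.WriteLine("Unicode bytes: {0}", Encoding.Unicode.GetByteCount(clef))  ' 4
        Console.WriteLine("UTF32 bytes:   {0}", Encoding.UTF32.GetByteCount(clef))    ' 4
    End Sub
End Module
```

Code that counts characters by String.Length sees two items here, while the UTF32 bytes describe a single four-byte value.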

For text that is mostly ASCII, UTF8 is a good choice: it typically requires the same number of bytes as UTF7 for ASCII characters while still handling every Unicode character. If squeezing bytes down to a minimum is not a mandate, Unicode is a safe bet, since it matches the way .NET already stores strings.
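The size trade-off depends on the text itself, and a byte-count comparison makes that visible. The sketch below is illustrative only: the Report helper, the module name, and the two sample strings (an ASCII phrase and three kanji built from code points) are assumptions introduced for this example.

```vbnet
Imports System
Imports System.Text

Module EncodingSizeDemo
    ' Hypothetical helper: prints the encoded size of one string
    ' in three of the encodings discussed in this recipe.
    Sub Report(ByVal label As String, ByVal text As String)
        Console.WriteLine("{0}: UTF8 = {1}, Unicode = {2}, UTF32 = {3}", _
            label, _
            Encoding.UTF8.GetByteCount(text), _
            Encoding.Unicode.GetByteCount(text), _
            Encoding.UTF32.GetByteCount(text))
    End Sub

    Sub Main()
        ' Sixteen ASCII characters.
        Report("ASCII", "stop questioning")

        ' Three kanji (U+65E5, U+672C, U+8A9E), built from code points.
        Dim kanji As New String(New Char() {ChrW(&H65E5), ChrW(&H672C), ChrW(&H8A9E)})
        Report("Kanji", kanji)

        ' Prints:
        '   ASCII: UTF8 = 16, Unicode = 32, UTF32 = 64
        '   Kanji: UTF8 = 9, Unicode = 6, UTF32 = 12
    End Sub
End Module
```

UTF8 is the most compact for the ASCII phrase, while the Unicode encoding is smaller for the kanji, which take three bytes each in UTF8.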
