Character Encodings and Unicode | Cross-Platform GUI Programming with wxWidgets

There are more characters around on Earth than can fit into the 256 possible byte values that the classical 8-bit character represents. In order to be able to display more than 256 different glyphs, another layer of indirection has been added: the character encoding or character set. (The "new and improved" solution, Unicode, will be presented later in this section.)

Thus, what is represented by the byte value 161 is determined by the character set. In the ISO 8859-1 (Latin-1) character set, this is ¡, an inverted exclamation mark. In ISO 8859-2 (Latin-2), it represents Ą (A ogonek).
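To make the ambiguity concrete, here is a minimal, library-free sketch that maps a byte to a Unicode code point under each of the two character sets. The names Charset and DecodeByte are illustrative assumptions and are not part of wxWidgets. The Latin-2 branch covers only the bytes up to 161 and reports everything above that as "not covered."

```cpp
#include <cassert>

// Illustrative names only; these are not wxWidgets types.
enum class Charset { Latin1, Latin2 };

// Map one byte to the Unicode code point it stands for in 'cs'.
// Latin-1 maps every byte to the code point with the same value.
// For Latin-2 this sketch only knows bytes up to 0xA1 (161); anything
// above that yields U+FFFD to signal "not covered here".
char32_t DecodeByte(unsigned char byte, Charset cs)
{
    const char32_t kNotCovered = 0xFFFD;

    if (cs == Charset::Latin1)
        return byte;                 // 161 -> U+00A1, inverted exclamation mark

    assert(cs == Charset::Latin2);
    if (byte < 0xA1)
        return byte;                 // identical to Latin-1 in this range
    if (byte == 0xA1)
        return 0x0104;               // 161 -> U+0104, A with ogonek
    return kNotCovered;
}
```

The same byte therefore yields two different characters, which is why text without an encoding indication cannot be displayed reliably.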

When you are drawing text on a window, the system must know about the encoding used. This is called the "font encoding," although it is just an indication of a character set. Creating a font without indicating the character set means "use the default encoding." This is fine in most situations because the user is normally using the system in his or her language.

But if you know that something is in a different encoding, such as ISO 8859-2, then you need to create the appropriate font. For example:

 wxFont myFont(10, wxFONTFAMILY_DEFAULT, wxNORMAL, wxNORMAL,
               false, wxT("Arial"), wxFONTENCODING_ISO8859_2);

Otherwise, the text will not be displayed properly on a western system, such as one whose default encoding is ISO 8859-1.

Note that there may be situations where an optimal encoding is not available. In these cases, you can try to use an alternative encoding, and if one is available, you must convert the text into this encoding. The following snippet shows this sequence: a string text in the encoding enc should be shown in the font facename. The use of wxCSConv will be explained shortly.

 // We have a string in an encoding 'enc' which we want to
 // display in a font called 'facename'.
 //
 // First we must find out whether there is a font available for
 // rendering this encoding
 wxString text; // Contains the text in encoding 'enc'

 if (!wxFontMapper::Get()->IsEncodingAvailable(enc, facename))
 {
     // We don't have an encoding 'enc' available in this font.
     // What alternative encodings are available?
     wxFontEncoding alternative;
     if (wxFontMapper::Get()->GetAltForEncoding(enc, &alternative,
                                                facename, false))
     {
         // We do have a font in an 'alternative' encoding,
         // so we must convert our string into that alternative.
         wxCSConv convFrom(wxFontMapper::GetEncodingName(enc));
         wxCSConv convTo(wxFontMapper::GetEncodingName(alternative));
         text = wxString(text.wc_str(convFrom), convTo);

         // Create font with the encoding alternative
         wxFont myFont(10, wxFONTFAMILY_DEFAULT, wxNORMAL, wxNORMAL,
                       false, facename, alternative);
         dc.SetFont(myFont);
     }
     else
     {
         // Unable to convert; attempt a lossy fallback to
         // ISO 8859-1 (Latin-1)
         wxFont myFont(10, wxFONTFAMILY_DEFAULT, wxNORMAL, wxNORMAL,
                       false, facename, wxFONTENCODING_ISO8859_1);
         dc.SetFont(myFont);
     }
 }
 else
 {
     // The font with that encoding exists, no problem.
     wxFont myFont(10, wxFONTFAMILY_DEFAULT, wxNORMAL, wxNORMAL,
                   false, facename, enc);
     dc.SetFont(myFont);
 }

 // Finally, draw the text with the font we've selected.
 dc.DrawText(text, 100, 100);

Converting Data

The previous code example needs a chain of bytes to be converted from one encoding to another. There are two ways to achieve this. The first, using wxEncodingConverter, is deprecated and should not be used in new code. Unless your compiler cannot handle wchar_t, you should use the character set converters (wxCSConv, base class wxMBConv).

wxEncodingConverter

This class supports only a limited subset of encodings, but if your compiler doesn't recognize wchar_t, it is the only solution you have. For example:

 wxEncodingConverter converter(enc, alternative, wxCONVERT_SUBSTITUTE);
 text = converter.Convert(text);

wxCONVERT_SUBSTITUTE indicates that it should try some lossy substitutions if it cannot convert a character strictly. This means that, for example, acute capitals might be replaced by ordinary capitals and en dashes and em dashes might be replaced by "-", and so on.
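The idea of a lossy substitution can be sketched without any library. The table below is a hypothetical one made up for illustration; it is not the table wxEncodingConverter uses, and the function names SubstituteForAscii and ToAsciiLossy are assumptions as well. The target here is plain ASCII to keep the sketch short.

```cpp
#include <cassert>
#include <string>

// Hypothetical substitution table, NOT the one used by wxEncodingConverter.
// Returns an ASCII stand-in for a Unicode code point.
char SubstituteForAscii(char32_t cp)
{
    if (cp < 0x80)
        return static_cast<char>(cp);   // already representable, keep as-is

    switch (cp)
    {
        case 0x00C1: return 'A';        // A with acute -> ordinary capital
        case 0x00C9: return 'E';        // E with acute -> ordinary capital
        case 0x00D3: return 'O';        // O with acute -> ordinary capital
        case 0x2013:                    // en dash
        case 0x2014: return '-';        // em dash
        default:     return '?';        // nothing sensible to substitute
    }
}

// Convert a whole string, one output character per input code point.
std::string ToAsciiLossy(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        out += SubstituteForAscii(cp);
    assert(out.size() == text.size());
    return out;
}
```

A strict conversion would instead fail on the first character that has no exact counterpart in the target encoding.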

wxCSConv (wxMBConv)

Unicode solves the ambiguity problem mentioned earlier by using 16 or even 32 bits in a wide character (wchar_t) to store all characters in a "global encoding." This means that you don't have to deal with encodings unless you need to read or write data in an 8-bit format, which as we know does not have enough information and needs an indication of its encoding.

Even when you don't compile wxWidgets in Unicode mode (where wchar_t is used internally to store the characters in a string), you can use these wide characters for conversions, if available. You convert from one encoding into wide character strings and then back to a different encoding. This is also used in the wxString class to offer you convenient conversions. Just bear in mind that in non-Unicode builds, wxString itself uses 8-bit characters and does not know how this string is encoded.

To transfer a wxString into a wide character array, you use the wxString::wc_str function, which takes a multi-byte converter class as its parameter. This parameter tells a non-Unicode build which encoding the string is in, but it is ignored by the Unicode build because the string is already using wide characters internally.

In a Unicode build, we can then build a string directly from these characters, but in a non-Unicode build, we must indicate which character set this should be converted to. So in the line below, convTo is ignored in Unicode builds.

 text = wxString(text.wc_str(convFrom), convTo);

The character set encoding offers more possibilities than font encodings, so you'd have to convert from font encoding to character set encoding using

 wxFontMapper::GetEncodingName(fontencoding);

This means that our previous task would be written as follows using character set encoding:

 wxCSConv convFrom(wxFontMapper::GetEncodingName(enc));
 wxCSConv convTo(wxFontMapper::GetEncodingName(alternative));
 text = wxString(text.wc_str(convFrom), convTo);

There are situations where you output 8-bit data directly instead of a wxString, and this can be done using a wxCharBuffer instance. So the last line would read as follows:

 wxCharBuffer output = convTo.cWC2MB(text.wc_str(convFrom));

And if your input data is not a string but rather 8-bit data as well (a wxCharBuffer named input below), then you can write:

 wxCharBuffer output = convTo.cWC2MB(convFrom.cMB2WC(input));

A few global converter objects are available; for example, wxConvISO8859_1 is an object, and wxConvCurrent is a pointer to a converter that uses the C library locale. There are also subclasses of wxMBConv that are optimized for certain encoding tasks, namely wxMBConvUTF7, wxMBConvUTF8, wxMBConvUTF16LE/BE, and wxMBConvUTF32LE/BE. The latter two are also available as wxMBConvUTF16 and wxMBConvUTF32, which are typedefs for the variant that matches the byte order native to the machine. For more information, see the topic "wxMBConv Classes Overview" in the wxWidgets reference manual.
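The difference between the LE and BE variants is only the order in which the two bytes of each 16-bit unit are written. The following library-free sketch shows this for UTF-16; ByteOrder, AppendUnit, and EncodeUTF16 are illustrative names assumed for this example and do not come from wxWidgets.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative names only; these are not wxWidgets types.
enum class ByteOrder { LittleEndian, BigEndian };

// Append one 16-bit code unit as two bytes in the requested order.
static void AppendUnit(std::vector<std::uint8_t>& out, std::uint16_t unit,
                       ByteOrder order)
{
    const std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
    const std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
    if (order == ByteOrder::LittleEndian)
    {
        out.push_back(lo);
        out.push_back(hi);
    }
    else
    {
        out.push_back(hi);
        out.push_back(lo);
    }
}

// Encode a single Unicode code point as UTF-16 bytes.
std::vector<std::uint8_t> EncodeUTF16(char32_t cp, ByteOrder order)
{
    // Surrogate values and anything above U+10FFFF are not valid input.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::vector<std::uint8_t> out;
    if (cp < 0x10000)
    {
        AppendUnit(out, static_cast<std::uint16_t>(cp), order);
    }
    else
    {
        // Code points beyond the 16-bit range need a surrogate pair.
        const std::uint32_t v = cp - 0x10000;
        AppendUnit(out, static_cast<std::uint16_t>(0xD800 + (v >> 10)), order);
        AppendUnit(out, static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)), order);
    }
    return out;
}
```

For U+0104 (A ogonek), the little-endian form is the bytes 04 01 and the big-endian form is 01 04, so both sides of an exchange must agree on the variant.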

Converting Outside of a Temporary Buffer

As just discussed, the conversion classes allow you to easily convert from one string encoding to another. However, most conversions return either a newly created wxString or a temporary buffer. There are instances where we might need to perform a conversion and then hold the result for later processing. This is done by copying the results of a conversion into separate storage.

Consider the case of sending strings between computers, such as over a socket. We should agree on a protocol for what type of string encoding to use; otherwise, platforms with different default encodings would garble received strings. The sender could convert to UTF-8, and the receiver could then convert from UTF-8 into its default encoding.

The following short example demonstrates how to use a combination of techniques to convert a string of any encoding into UTF-8, store the result in a char* for sending over the socket, and then later convert that raw UTF-8 data back into a wxString.

 // Convert the string to UTF-8
 const wxCharBuffer ConvertToUTF8(wxString anyString)
 {
     return wxConvUTF8.cWC2MB(anyString.wc_str(*wxConvCurrent));
 }

 // Use the raw UTF-8 data passed to build a wxString
 wxString ConvertFromUTF8(const char* rawUTF8)
 {
     return wxString(wxConvUTF8.cMB2WC(rawUTF8), *wxConvCurrent);
 }

 // Test our wxString<->UTF-8 conversion
 void StringConversionTest(wxString anyString)
 {
     // Convert to UTF-8, keep the char buffer around
     const wxCharBuffer bUTF8 = ConvertToUTF8(anyString);

     // wxCharBuffer has an implicit conversion operator for char *
     const char *cUTF8 = bUTF8;

     // Rebuild the string
     wxString stringCopy = ConvertFromUTF8(cUTF8);

     // The two strings should be equal
     wxASSERT(anyString == stringCopy);
 }
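To see what actually travels over the socket, it can help to look at UTF-8 itself without the wxWidgets converters. The sketch below is a plain C++ illustration, not what wxConvUTF8 does internally; EncodeUTF8 and DecodeUTF8 are names assumed for this example, and the decoder assumes well-formed input and performs no validation beyond a length check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode Unicode code points as UTF-8 bytes (one to four bytes each).
std::string EncodeUTF8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
    {
        // Surrogate values and anything above U+10FFFF are not valid input.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode UTF-8 bytes back into code points.
// Assumption: the input is well-formed UTF-8; malformed data is not handled.
std::u32string DecodeUTF8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes that follow
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }

        assert(i + extra < bytes.size());   // sequence must not be truncated
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);

        out += cp;
        i += extra + 1;
    }
    return out;
}
```

Because every code point has exactly one UTF-8 byte sequence, the receiver recovers the same characters regardless of either machine's default encoding.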

Help Files

You will want to distribute a separate help file for each supported language. Your help controller initialization will select the appropriate help file name according to the current locale, perhaps using wxLocale::GetName to form the file name, or simply using _() to translate to the appropriate file name. For example:

 m_helpController->Initialize(_("help_english"));
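If you prefer to derive the file name from the locale name rather than translate it, the selection logic might look like the following sketch. It is plain C++ and makes several assumptions: the locale name has the form "de_DE", help files are named "help_" plus the language code, and HelpFileForLocale is a hypothetical helper, not a wxWidgets function.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: pick a help file base name for a locale name
// such as "de_DE". 'available' lists the language codes for which a
// help file is shipped; 'fallback' is used when there is no match.
std::string HelpFileForLocale(const std::string& localeName,
                              const std::set<std::string>& available,
                              const std::string& fallback)
{
    assert(!fallback.empty());

    // Keep only the language part before '_' ("de_DE" -> "de").
    // If there is no '_', the whole name is used.
    const std::string lang = localeName.substr(0, localeName.find('_'));

    if (available.count(lang) != 0)
        return "help_" + lang;
    return fallback;
}
```

The result would then be passed to the help controller's Initialize function in place of the translated literal.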

If you are using wxHtmlHelpController, you need to make sure that all the HTML files contain the META tag, for example:

 <meta http-equiv="Content-Type" content="text/html; charset=iso8859-2">

You also need to make sure that the project file (extension HHP) contains one additional line in the OPTIONS section:

 Charset=iso8859-2

This additional entry tells the HTML help controller what encoding is used in contents and index tables.