5.4 Coding for Internationalization | Applied C++: Practical Techniques for Building Better Software

If you want your application to be used in other countries, you have to make provisions that allow you to translate your user interface into other languages. If your application processes textual data, you also have to make provisions to handle data in other languages. These provisions are called internationalization.

When building applications, most developers don't think about internationalization issues unless there are immediate business requirements that force them to do so. When the requirements show up later, they force the developers to remediate the code -- a process that is often far more complex and expensive than planning for internationalization from the start.

Almost all significant business software ends up internationalized sooner or later, and the technical requirements posed by internationalization are becoming more complex. For example, the People's Republic of China (PRC) now requires that all software sold in China support GB18030, a very large and complex character set.

Many people only think about the user-interface component of internationalization, which includes translating such items as messages, menus, and labels. The translation process can be fairly straightforward, regardless of the target market. Straightforward, however, is not the same thing as trivial. You cannot simply replace the strings in the original language with translated strings and expect a working result. The problems with such an approach include:

The translated strings no longer make sense grammatically.
Monetary values are printed incorrectly.
Dates are semantically incorrect due to differences in locales.
The translated strings no longer fit into the GUI (or dialog boxes), which were designed for Latin characters running under the English or European versions of the operating system.

In addition to translation issues, there are issues related to handling text inside your application. Almost all applications read, write, parse, or otherwise process some text. Code that works with text can require very significant adaptation to handle multiple languages. This is especially true for those languages with large and complex character sets, such as Chinese, Japanese, and Korean. See [Lunde99]. All kinds of things can and will go wrong when your code encounters text in other character sets, including:

Overflow errors occur in those variables that store string values.
The database underlying the product, which worked fine for Latin characters, starts to produce errors.
Third-party components start to produce errors when they encounter double-byte strings.
The performance of the system noticeably degrades.

To avoid these and other problems, you need a little forward thinking to allow for handling international text, even if it is not required for your initial release. In this section, we touch on a few of the largest issues involved in making your code ready for internationalization. Getting on the right track from the beginning is the key to making this process work. Toward that goal, Figure 5.3 highlights some of the issues you should consider.

Figure 5.3. Internationalization Checklist


5.4.1 Unicode

If your program processes any significant amount of text, the most important issue is to decide how to represent text in your code - what character encoding to use. A character encoding defines the mapping from a set of characters to binary values in your code. As soon as you look outside of Western Europe, you will discover that there are many character encodings other than ASCII and Latin-1, sometimes several for each language. For example, the Japanese language appears in four different encodings in common applications: Shift-JIS, EUC-JP, UTF-8, and ISO-2022-JP.

You can write your code to process text in one or more of the hundreds of defined encodings. However, if you do this, you are likely to encounter defects and complexity that must be debugged one language at a time. The alternative is to follow the example of Oracle, Microsoft Windows, and Java, and use Unicode as your internal representation for text.

Unicode is not a font, nor is it a software program. Unicode is a standard means of representing text in all of the world's languages in computer systems. It defines a very large set of characters, and then offers several encoding methods for representing these characters in memory. These encoding methods, UTF-8, UTF-16, and UTF-32, replace character encodings like ASCII or Shift-JIS.
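To make the difference between the encoding methods concrete, the following sketch encodes a single Unicode code point as UTF-8 bytes and as UTF-16 code units. The function names encodeUtf8() and encodeUtf16() are hypothetical and are not part of the book's source code; a production application would more likely rely on a library such as ICU.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode code point as UTF-16 (1 or 2 sixteen-bit units).
// Code points above U+FFFF become a surrogate pair.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // invalid input
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else if (cp <= 0x10FFFF) {
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

The same character can therefore occupy a different number of storage units depending on the encoding, which is one reason fixed-size character buffers tend to break when an application is internationalized.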

Microsoft Windows supports UTF-16 as a native data type in C, C++, C#, and Visual Basic. Various UNIX systems and compilers support different Unicode encoding methods. You will almost certainly have to support data sources in the many legacy encodings, but by converting to Unicode for your internal processing, you can avoid the complexity of handling all of these encodings in the bulk of your code.

The Unicode Standard has been adopted by such industry leaders as Microsoft, Apple, Oracle, Sun, HP, IBM, Sybase, and many others. Unicode is required by such standards as Java, JavaScript, XML, LDAP, CORBA 3.0, and WML. In addition, Unicode is the official way to implement ISO/IEC 10646 and GB18030, which have important business ramifications.

You can add Unicode support to your application either directly or by using a third-party library, such as Basis Technology's Rosette Core Library for Unicode (http://www.basistech.com) or International Components for Unicode's (ICU) open source software package (http://oss.software.ibm.com/icu/index.html). For detailed information about Unicode and the Unicode Standard, visit the web site: http://www.unicode.org.

5.4.2 A Simple Resource Manager for Strings

Handling strings is more complicated than making a new code base for each language and translating the existing strings in place. There are a number of issues when dealing with strings in the scope of internationalization: the size of the buffers, their placement on the GUI, and their use within the code, to name a few.

There are differing approaches to what strings should be localized (or translated). For example, some people feel that all strings should be localized, including all error messages. We take a simpler approach and consider the intended audience when determining whether or not to localize a message. For example, the message "File not found" should be localized because the user can take some corrective action when this message is received. However, a string like "Fatal Error #1004. Stack Trace: ..." mostly contains information useful only to the developer. In this case, we would split the string into two parts. The "Fatal Error #1004" part would be localized because it tells the user that something bad happened. However, any other detailed information, such as a stack trace or internal dump of the system, can stay in the developer's native language.

Designing the Resource Manager

In this section, we design a simple resource manager to handle all strings used within the application. The goal of our resource manager is to make it easy to replace strings with a different list of strings, depending upon the desired language. We create a repository for all displayed strings, with a mechanism for replacing that repository with an alternate version. The design is flexible because few assumptions are made regarding where this list of strings is stored. Our application refers to strings with a unique ID, which is the key to the design.

Our resource manager, apResourceMgr, has the following features:

Returns a string given the ID of the string.
Exports all managed strings to a file, which is usually done during development after all the strings have been defined. This file is then given to translators to produce a localized version for another language.
Imports strings from a file. This is usually done when the application starts running to load a set of translated strings.
Stores the string files in XML. This permits the file to be edited using an XML-aware editor, or even a generic text editor, as long as it can display wchar_t characters.

It is important to note that our resource manager does not address the myriad of GUI issues that arise (due to localization and the rendering issues specific to native operating systems). We only briefly address the issue of string length. The CD-ROM contains the full source code for the resource manager.

STRING REPOSITORY

apResourceMgr keeps all strings in a std::map object, which also does most of the work.

  struct apStringData {
    std::wstring str;     // Current string value
    std::wstring notes;   // Optional notes describing value
  };

  typedef std::map<unsigned int, apStringData> stringmap;

RETRIEVING A STRING

apResourceMgr controls an instance of stringmap called strings_. Regardless of what interface is used to access a string, the member function fetchString() is called. This function returns either the string associated with the id passed to it or, if the id is not found, an empty string, as shown.

  std::wstring apResourceMgr::fetchString (unsigned long id)
  {
    stringmap::iterator i = strings_.find (id);
    if (i == strings_.end())
      return L"";   // Unknown id
    return i->second.str;
  }

ADDING A STRING

Inserting a string into our map is only a little more complicated, as shown here.

  unsigned int apResourceMgr::addString (const std::wstring& str,
                                         const std::wstring& notes,
                                         unsigned long id,
                                         bool overlay)
  {
    if (id == 0)
      id = hash (str);    // Get a hash value
    if (id == 0)
      return id;  // null string. These can be safely ignored.

    // Ignore the string if overlay is false
    // and the string already exists
    if (!overlay) {
      stringmap::iterator i = strings_.find (id);
      if (i != strings_.end()) {
        return id;
      }
    }

    apStringData data;
    data.str   = str;
    data.notes = notes;
    strings_[id] = data;
    return id;
  }

Only the first argument of addString() is required. The notes field is optional and is needed only when you want to keep track of any requirements or translation notes. If id is not specified, a function hash() is called to compute an id based on the string itself. It is usually considered an error if the id is already found in our map, so by default the new string is ignored and the existing definition is kept. Some may consider this a bit harsh, so the overlay argument can be set to true to allow duplicate strings to replace any existing definition.
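The add and fetch behavior described above can be tried out in isolation with a small stand-alone sketch. MiniResourceMgr and simpleHash() are hypothetical names introduced only for illustration; in particular, the book's hash() function is not shown in this section, so an FNV-1a-style hash is assumed here as a stand-in.

```cpp
#include <map>
#include <string>

struct StringData {
    std::wstring str;     // Current string value
    std::wstring notes;   // Optional notes for the translator
};

// Assumed stand-in for hash(): an FNV-1a-style hash over the characters.
// An empty string maps to 0, which the manager treats as "no id".
unsigned int simpleHash(const std::wstring& s) {
    if (s.empty()) return 0;
    unsigned int h = 2166136261u;
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        h ^= static_cast<unsigned int>(s[i]);
        h *= 16777619u;
    }
    return h;
}

class MiniResourceMgr {
public:
    // Returns the id under which the string is stored (0 if ignored).
    unsigned int addString(const std::wstring& str,
                           const std::wstring& notes,
                           unsigned int id,
                           bool overlay) {
        if (id == 0) id = simpleHash(str);
        if (id == 0) return 0;                    // empty string, ignored
        if (!overlay && strings_.find(id) != strings_.end())
            return id;                            // keep existing definition
        StringData data;
        data.str = str;
        data.notes = notes;
        strings_[id] = data;
        return id;
    }

    // Returns the stored string, or an empty string for an unknown id.
    std::wstring fetchString(unsigned int id) const {
        std::map<unsigned int, StringData>::const_iterator i = strings_.find(id);
        if (i == strings_.end()) return std::wstring();
        return i->second.str;
    }

private:
    std::map<unsigned int, StringData> strings_;
};
```

Adding a second string under an existing id leaves the first one in place unless overlay is true, which is the behavior an import of translated strings would rely on.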

Using the Resource Manager

Our resource manager can be used, depending upon the style of the developer, in the following ways:

Accessing strings using IDs
Accessing strings using names

ACCESSING STRINGS USING IDS

We have found it cumbersome to refer to strings by an ID value, because these IDs must be unique within the application. However, for developers who are comfortable with this method of localizing strings, our implementation supports it, as shown in the following example:

  std::wstring msg = apResourceMgr::gOnly().fetchString (0x12345);
  std::wcout << msg << std::endl;

In production code, using symbolic constants will help improve readability:

  #define STRING_R 0x12345
  std::wstring msg = apResourceMgr::gOnly().fetchString (STRING_R);

A Singleton object, apResourceMgr, is used to access a string value, given its id (unsigned int). The returned string is then used for display or other purposes. Notice that we are using a std::wstring object, and not a std::string object. std::wstring is a string of 16-bit characters (or 32-bit on some platforms), while a std::string is a string of 8-bit characters. We designed apResourceMgr to manage strings of 16-bit Unicode characters (UTF-16), as opposed to strings of multibyte characters (MBCS). Working with wchar_t characters instead of char is not difficult, but it does require some changes in coding style, as follows:

  std::wstring message = L"Hello World";      // Note the L prefix
  const wchar_t* p = message.c_str();         // c_str() returns a const pointer

If your compiler does not have reasonable wide-character or Unicode support, you can convert our resource manager to a multibyte (MBCS) manager without too much difficulty. GNU's gcc compiler lacks support for wide characters prior to version 3. Earlier GNU versions, like 2.95, are still very popular, so you should check on your requirements. Compiling and running this simple example is an adequate test of a compiler's support:

  #include <string>
  #include <iostream>

  int main()
  {
    std::string astr = "Hello World";
    std::wstring wstr = L"Hello World";
    std::cout << astr << std::endl;
    std::wcout << wstr << std::endl;
    return 0;
  }

ACCESSING STRINGS USING NAMES

The second way you can use apResourceMgr is to encapsulate all your strings inside another object, apStringResource. You can choose either to define these objects in each source file, or you can define them all in a single resource file. One nice feature of apStringResource is that you do not have to worry about string ids. You access the string using the variable name you created. Another advantage of using apStringResource is that your code contains a default string to display. If no translation is available for a particular string, the default string is shown.

EXAMPLE

Let's look at a simple example:

  apStringResource r1 (L"Hello World", L"Misc. Notes");

  int main()
  {
    std::wcout << r1.string() << std::endl;
  }

In this example, a global instance of apStringResource, r1, is defined. It has an initial value of L"Hello World", and this is what is displayed on the console. The second parameter contains optional notes you can add. These notes are never displayed to the user, but they are seen by whoever translates the strings. For instance, you can specify how long the translated string can be before it adversely affects the user interface.

The definition for apStringResource is shown here.

  class apStringResource
  {
  public:
    static std::wstring sNullWString;

    apStringResource (const std::wstring& str,
                      const std::wstring& notes=sNullWString,
                      unsigned int id=0);

    std::wstring string () const;
    operator std::wstring () const { return string ();}

    unsigned int id () const { return id_;}

  private:
    unsigned int id_;        // Id of this string

    // Copying is disabled by making these private
    apStringResource (const apStringResource&) {}
    apStringResource& operator= (const apStringResource&) { return *this;}
  };

As you can see, apStringResource is a very simple object. When you construct an instance of apStringResource, the string is actually stored in apResourceMgr. When string() or the conversion operator accesses the string, the string is fetched from apResourceMgr. There is an optional third argument to the apStringResource constructor, the id. If no id is specified, a hash function is run on the string to compute an ID. In this case, you should not change the text of the default string, because it was used to compute the id. This is especially true once the strings have been localized.
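The relationship between a named string object and the repository behind it can be sketched as follows. MiniMgr and MiniStringResource are hypothetical, heavily simplified stand-ins for apResourceMgr and apStringResource; to keep the sketch short, the id is always passed explicitly rather than computed by a hash.

```cpp
#include <map>
#include <string>

// Hypothetical singleton repository (stand-in for apResourceMgr).
class MiniMgr {
public:
    static MiniMgr& gOnly() { static MiniMgr only; return only; }

    // Store a string; an existing entry is replaced only when overlay is true.
    void set(unsigned int id, const std::wstring& s, bool overlay) {
        if (overlay || strings_.find(id) == strings_.end())
            strings_[id] = s;
    }

    // Returns the stored string, or an empty string for an unknown id.
    std::wstring fetch(unsigned int id) const {
        std::map<unsigned int, std::wstring>::const_iterator i = strings_.find(id);
        return i == strings_.end() ? std::wstring() : i->second;
    }

private:
    MiniMgr() {}
    std::map<unsigned int, std::wstring> strings_;
};

// Hypothetical named string (stand-in for apStringResource).
// The object holds only the id; the text lives in the repository.
class MiniStringResource {
public:
    MiniStringResource(const std::wstring& str, unsigned int id) : id_(id) {
        MiniMgr::gOnly().set(id_, str, false);   // register the default text
    }
    std::wstring string() const { return MiniMgr::gOnly().fetch(id_); }
    unsigned int id() const { return id_; }

private:
    unsigned int id_;
};
```

Because the object stores only the id, loading a translation into the repository with overlay enabled changes what string() returns without touching any code that uses the named object. If no translation is loaded, the default text passed to the constructor is what the user sees.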

5.4.3 Saving and Restoring Strings from Files

The other important part of our implementation is saving and restoring strings from a file as shown here.

  bool exportStrings (const std::string& file);
  bool importStrings (const std::string& file);

Instead of choosing a binary format, we chose XML because it means we don't have to write any special tools to allow a translator to edit these files; any text or XML editor will do.

To keep this subject brief we will not go into every aspect of the XML tools we use. We wrote a simple XML parser because we support a limited number of tags and do not require a comprehensive package. There are many XML parsers available, including the open source Expat library (see http://www.jclark.com/xml/expat.html). These are somewhat large packages and we simply did not need all this functionality. However, if your application uses XML for other purposes, we encourage you to write your own version of exportStrings() and importStrings() to take advantage of the parser you already use.

EXAMPLE

Our use of XML can best be seen from an example:

  apStringResource r1 (L"This is string 1", L"Notes");
  apStringResource r2 (L"This is string 2", L"Notes 2");
  apResourceMgr::gOnly().exportStrings ("test.xml");

The file, test.xml, contains the following data:

  <?xml version="1.0" encoding="UTF-16" ?>
  <resources>
    <phrase>
      <id>324936760</id>
      <string>This is string 1</string>
      <notes>Notes</notes>
    </phrase>
    <phrase>
      <id>324936761</id>
      <string>This is string 2</string>
      <notes>Notes 2</notes>
    </phrase>
  </resources>

For more information on XML, refer to http://www.xml.com. In our example, this file is written using wchar_t characters and many text editors will be unable to display it because most editors are only able to display 8-bit characters. We recommend that you use Microsoft's Internet Explorer to view the XML file.

The first line of the file, <?xml version="1.0" encoding="UTF-16" ?>, declares the data to be XML in UTF-16 format. What is not shown is that two bytes precede the first printed data. These two bytes are called the BOM, or byte-order mark. Because machines store data in either little-endian or big-endian order, the BOM specifies which order is used in the file. This permits your localized string files to be used on any machine, regardless of whether the endian order of the file matches that of the machine.
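A minimal sketch of BOM handling follows. It assumes the file is already known to be UTF-16, so only the two-byte marks FF FE (little-endian) and FE FF (big-endian) are checked. The names detectUtf16Bom(), hostIsLittleEndian(), and swapBytes() are hypothetical and do not come from the book's apWideTools object.

```cpp
#include <cstddef>
#include <cstdint>

enum ByteOrder { kNoBom, kLittleEndian, kBigEndian };

// Inspect the first two bytes of a buffer that is assumed to hold UTF-16.
ByteOrder detectUtf16Bom(const unsigned char* data, std::size_t size) {
    if (size < 2) return kNoBom;
    if (data[0] == 0xFF && data[1] == 0xFE) return kLittleEndian;
    if (data[0] == 0xFE && data[1] == 0xFF) return kBigEndian;
    return kNoBom;
}

// Determine the byte order of the machine running the code by looking
// at the first byte of a known 16-bit value.
bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

// Swap the two bytes of a 16-bit code unit. This is applied to every
// unit when the file's byte order differs from the host's.
std::uint16_t swapBytes(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}
```

A reader would compare the result of detectUtf16Bom() with hostIsLittleEndian() and call swapBytes() on each 16-bit unit only when the two disagree.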

Use the tag, <resources>...</resources>, once per file. This must surround all other elements in the file.

The tag, <phrase>...</phrase>, contains information for a single string. It relates an id with a string and a notes field. A phrase contains two or three nested elements (the notes field is optional), as shown here.

  <id>id</id>
  <string>string</string>
  <notes>notes</notes>

Our simple parser can also accept comments, but these are removed when the file is read. A comment looks like this:

 <!-- This is a comment -->

That's it! Our XML parsing uses the search capability of std::wstring. The entire XML document is stored in a single std::wstring when it is read from the file. Comments and other XML declarations are removed, and the BOM is compared with the endian order of the system to see if the bytes must be swapped before parsing. The find() method is used to find the next element, or to skip forward to the end of an existing element. Because our XML is so simple, we do not have to worry about any complicated nesting of elements. We have also defined a Singleton object, apWideTools, to encapsulate endian detection and byte swapping. The XML functionality is kept inside an object, apXMLTools. You can use this for other simple parsing needs, or you can replace it with a comprehensive XML parser.
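The find()-based approach can be illustrated with a short sketch. The helper nextElement() is a hypothetical function written for this illustration, not the interface of apXMLTools. It assumes tags without attributes and no nesting of an element inside an element of the same name, which holds for the simple format shown above.

```cpp
#include <string>

// Find the next <tag>...</tag> element in doc, starting at pos.
// On success, content receives the text between the tags and pos is
// advanced past the closing tag. Returns false if no element is found.
bool nextElement(const std::wstring& doc, const std::wstring& tag,
                 std::wstring::size_type& pos, std::wstring& content) {
    const std::wstring open  = L"<" + tag + L">";
    const std::wstring close = L"</" + tag + L">";

    std::wstring::size_type start = doc.find(open, pos);
    if (start == std::wstring::npos) return false;
    start += open.size();

    std::wstring::size_type end = doc.find(close, start);
    if (end == std::wstring::npos) return false;

    content = doc.substr(start, end - start);
    pos = end + close.size();
    return true;
}
```

Calling nextElement() in a loop with the tag L"phrase" walks through every phrase in the document. The same function can then be applied to each phrase's content to pull out the id, string, and optional notes elements.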

5.4.4 An Alternate Approach to Handling Strings

There is a package that works in a similar way to our resource manager. The gettext package was originally released on UNIX systems but is now available on most platforms. This package includes:

An API to retrieve messages from a file. Macros are used to replace the user string with a version that looks up, and returns, a translated string.
Programs to extract existing strings from your code base.
Programs to edit and manipulate lists of translated strings.

You can find its full documentation by going to http://www.gnu.org/manual/gettext.

5.4.5 Locales

You can't discuss internationalization in the context of C++ without mentioning locales. Although we don't recommend using locales for managing large-scale commercial internationalization efforts, we provide a very brief overview here. For an extensive discussion, see [Stroustrup00] or [Langer00].

The standard library includes a complicated package called locales. Although it is easy to understand the intention of std::locale, its difficulty lies in its extensibility. std::locale was intended to be used by the stream classes, although it is a general purpose object. You can think of a locale as a collection of preferences, usually with regard to how something is displayed.

A locale is a collection of these preferences and a single preference is called a facet. Examples of locales include not only English, French, and Japanese, but also more specific ones, like US English or French Canadian. Facets are provided for date and time display, monetary values, and numeric quantities.

For example, time and date formatting depends upon standards that vary around the world. The date December 31st, 2002 would be represented as follows:

In the US: 12/31/2002
In Europe: 31/12/2002
In Japan: 2002/12/31
ISO specification: 20021231

Fortunately, when an application runs, these preferences are typically specified by the user.

C++ allows facets to be reused by different locales, and you can take an existing locale and change one or more facets. Delving deeper into locales and facets is beyond the scope of this book. However, we do recommend the following:

Even if you do not use locales directly, the underlying stream package and many run-time library functions do. When you write locale-type data to a buffer (for example, strftime to format a time string), be generous about the size of any temporary buffers you allocate. If you use a fixed-size buffer and the function asks for a maximum buffer size, you should pass sizeof() of the buffer instead of a hard-coded value.
If you display any string whose length might be affected by a locale, you should verify that the string will fit before displaying it.
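The first recommendation can be sketched with strftime. The function name formatLocalDate() and the 128-byte buffer size are assumptions made for this illustration; the %x conversion asks for the date representation preferred by the current C locale.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Format a calendar date using the current C locale's preferred date
// representation. Returns an empty string if formatting fails.
std::string formatLocalDate(int year, int month, int day) {
    std::tm t = std::tm();          // zero-initialize all fields
    t.tm_year  = year - 1900;       // years since 1900
    t.tm_mon   = month - 1;         // months are 0-based
    t.tm_mday  = day;
    t.tm_isdst = -1;                // let the library decide
    std::mktime(&t);                // fills in the day of week and day of year

    char buf[128];                  // generous: locale output length varies
    // sizeof(buf) keeps the limit in sync if the buffer size ever changes.
    std::size_t n = std::strftime(buf, sizeof(buf), "%x", &t);
    return std::string(buf, n);     // n is 0 if the result did not fit
}
```

An application that wants the user's preferences rather than the default "C" locale would typically call std::setlocale(LC_ALL, "") from <clocale> once at startup. The length of the returned string then depends on the locale, so it should be checked before it is placed in a fixed-width field.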