11.8. Using Locales

Computer technology has mostly been developed in English-speaking environments, and much of the way in which it handles characters and notations reflects the conventions of English. However, the majority of people speak languages other than English as their native language. As computers become a popular commodity, it is increasingly important to let people use them in their own language and according to their cultural conventions. To big software companies, this is essential, since they aim at a worldwide market. It is also important to small companies, due to the competitive advantage it can give.

There are many aspects to making computing technology usable to people with different backgrounds, and part of this is the translation of software user interfaces. This includes traditional translation work but also new challenges. Increasingly, programs generate texts dynamically, as immediate responses to user queries and actions. Of course, such texts cannot be translated on the fly by human translators.

Suppose that you are designing a program that accepts a search string as input from a user, searches for data in a bibliographic database (i.e., a database containing information about books, serials, etc.), and displays some results to the user. Naturally, the explanations (like "Found 42 hits") should appear in a language of the user's choice, if possible. This is typically straightforward, since it is mostly a matter of translating fixed texts. The book titles may be in any language, and this is a character-level challenge. But the information also contains data such as date of issue and language of the book.

In a well-designed database, data like date and document language is expressed in an unambiguous, easily machine-processable format. For example, the date might be in a format that conforms to the ISO 8601 standard, in year-month-day notation like "1985-11-06," and the language might be expressed using a two- or three-letter code as defined in the ISO 639 family of standards, e.g., with "de" indicating German. When the data is to be presented to a user, however, it should be expressed in a format that the user finds understandable and natural. To some people, this might mean "November 6, 1985" and "German." To some other people, it might mean "6. marraskuuta 1985" and "saksa," or perhaps forms written in another script entirely, such as Cyrillic. The goal is to achieve this without forcing software designers to know about the language-dependent conventions and strings.

11.8.1. The Locale Concept

The data presentation conventions of a language constitute a locale. More technically, a locale is an exact, usually formalized specification of some data presentation conventions. Typically, a locale is about a language, so the name "locale" is somewhat misleading, and so is the rather common way of presenting locale settings to a user under a name that primarily refers to country or regional settings. The word "locale" is of course related to the word "local," though there is a difference in meaning as well as in pronunciation. (In "locale," the stress is on the second syllable.)

There is sometimes some locality in a locale, though, since some conventions depend on the country or other area, too. For example, the British English locale differs from the U.S. English locale somewhat, e.g., in the conventions for quotation marks. Even then, language is the primary choice, and the country selection is secondary and optional.

Technically, locales are identified by structured strings with components for language, script (writing system), country or other territory, and variant. The underscore "_" is used as a separator between the components. Only the first component is obligatory, and it is a two- or three-letter language identifier (see Chapter 7). Naming conventions take care of unambiguity when components are omitted. For example, in the identifiers "en_GB" (British English) and "fr_CA" (Canadian French), the second component is a country identifier, since it consists of two letters. A four-letter component is a script identifier; for example, "zh_Hans" means Chinese written in the Simplified script.

In practice, locales are mostly identified by a language code only or by a language code and a country code. This means that they are very similar to language codes with an optional country specifier, though with different punctuation. In principle, the locale "en_US" indicates the notational conventions used by English-speaking people in the United States, whereas "en-US" is a language code for English as spoken in the U.S. In practice, the line between locales and languages is fuzzy.

A locale can be very specific, even relating to the conventions applied by some specific ethnic or cultural group. Ultimately, a locale can even be a personal locale: as a user, you could select a locale according to your native language, then perhaps a specific variant of the language, and add some cultural preferences (e.g., the use of "AD" or "CE" in year denotations), and finally some purely personal preferences, if you like. For practical reasons, though, most of the work revolves around language locales for now, though they may allow some variation.

Previously, different companies (and even different groups within one company), associations, groups of volunteers, and even individuals have decided on locale settings independently of each other, and without asking language authorities or representative groups of people using a language. Consequently, if you look at the different language versions of different programs, you can see incompatibilities and even errors. For example, language-dependent names for countries may vary within a language. For usability, it would be better if a U.K. citizen could see his country under the same name (and hence in the same place in alphabetic order) in country selection menus in different programs and services. Whether it is "United Kingdom" or "Great Britain" is less important from the practical point of view.

Some localization decisions in programs have been outright wrong, giving localization a somewhat bad reputation in some circles. Many people who do not speak English as their native language prefer an English version of a program to a poorly localized version. All too often, a "localized" version is actually a mixed-language version, perhaps even so that the program asks a question in the user's language but presents the options for an answer in English, or some commands in a menu in one language, others in another.

11.8.2. CLDR

The Common Locale Data Repository (CLDR) is about making user-oriented presentation of data easier, so that system designers and programmers can implement it easily. Ease of implementation is essential, since software vendors, let alone individual programmers, cannot be expected to find out and implement all the possible conventions used in the hundreds of written languages of the world. Moreover, such conventions are sometimes debatable or subject to interpretation. Suppose you are designing programs that might be used throughout the world, with user interfaces in different languages. Would you like to take a position on some heated question about the orthography, date format, or names of countries in Swahili or Thai? You would probably prefer applying the rules decided by authorities and experts on the languages.

The general idea is to collect reliable data based on consensus about language-dependent conventions, present it in a rigorously defined (XML-based) format, and make it available worldwide. Ideally, the data is used when building general-purpose subroutine libraries. Thus, a programmer need not know anything specific about the conventions, or even see them. She would just call, for example, a library routine to print a date, passed as a parameter in some standard format, according to the conventions of a language. The language would be specified by using a standardized language code, and it could be passed as a parameter to the output routine. Preferably, however, the routine would get the language code from user settings in the computer where the program runs. Of course, more primitive tools could be used, too. The mere availability of reliable data on cultural conventions of data presentation will help a lot, even if the information is implemented in programming "by hand," i.e., by coding it separately for the supported languages.

At the cultural and social level, the CLDR approach makes it possible to support small languages and ethnic groups, even very small ones, at an acceptable cost. Once the data about the conventions of a language has been produced and stored in CLDR, there is no extra cost in supporting that language along with others, as regards the scope of CLDR. Of course, there would still be the cost of translating application-specific texts, such as command menus, instructions, and error message texts.

Dynamic adaptability to the user's locale is particularly important in modern online services, such as those based on the web services concept. When a request may come from any source, it is essential to try to recognize the user's preferred language and present the answer in the conventions of that language. This of course applies to situations where you communicate with a human user, rather than just a program. The localization is often best left to the user interface, e.g., so that in a server/client architecture, the server sends the response in internationalized format and the client presents it to the user according to the user's locale.

The CLDR activity was launched in 2004 by the Unicode Consortium, continuing the work of a joint effort by IBM, Sun, and OpenOffice.org. The activity has produced an extensive and growing database. The CLDR database is independent of the Unicode standard but related to it in many ways. Naturally, it uses Unicode as the character code, but many of the definitions in CLDR relate directly to the use of Unicode characters, e.g., the rules for using quotation marks in different languages and the language-specific collation rules that are to be superimposed on general Unicode rules. The main page of the CLDR activity is http://www.unicode.org/cldr/.

For discussion on CLDR, the public Unicode discussion list (email list), described at http://www.unicode.org/consortium/distlist.html, can be used. The list exists for all discussions related to the activities of the Unicode Consortium.

ICU is the best-known implementation of CLDR definitions, but a clear distinction should be made between them. CLDR specifies types of data that can be localized and specific values for such data in different locales. It does not prescribe any particular implementation. ICU, on the other hand, is a collection of software that implements the CLDR definitions, or part of them, among other things. It is quite possible to implement CLDR in other ways, e.g., using your own code that directly reads the CLDR data and converts it to suitable tables and algorithms. If you need or decide to implement just a small part of CLDR, you might even do it "by hand." As support for CLDR becomes more mature in software libraries, you will probably want to use their built-in CLDR support even for trivial tasks, just because it's easier.

11.8.2.1. CLDR versus Unix/Linux/POSIX locale concept

There have been some predecessors of CLDR, but their scope of application remained rather limited. In particular, although especially Unix and Linux systems have a "locale" concept, defined in the POSIX specifications and allowing user-selected presentation format for some data, it covers only a few features of presentation. CLDR is much wider, and growing even wider. Moreover, it is supported by major software companies, which have technological and economic motives for promoting and implementing the ideas. As an indication of this, they have permitted the creation of comparison tables, which compare CLDR definitions with the actual settings in software from different vendors.

Although CLDR owes much to the previous work, there will also be conflicts between old and new concepts and techniques. In particular, the POSIX-style locale concept involves character code and encoding in addition to language and country.

The POSIX specification has been merged into the Single Unix Specification, Version 3, by The Open Group, and it is available via http://www.unix.org/version3/. A POSIX locale contains the following categories, each identified with an environment variable:


LC_CTYPE

Character classification and case conversion


LC_COLLATE

Collation order


LC_MONETARY

Monetary formatting


LC_NUMERIC

Numeric formatting (other than monetary)


LC_TIME

Date and time formats


LC_MESSAGES

Formats of informative and diagnostic messages and interactive responses; in practice, strings that are to be interpreted as affirmative (yes) or negative (no) answers

Typically, the overall (POSIX) default values correspond to the locale "C," also known as "POSIX," which is a programming-oriented locale that in practice implies the English language. Setting the environment variable LC_ALL (e.g., with the shell command export LC_ALL=fr or setenv LC_ALL fr) is supposed to set all the above-mentioned variables to suitable values. In practice, the system-wide default for LC_CTYPE often carries the name of a character encoding (e.g., export LC_CTYPE=iso8859-1, documented as "country setting"), as if encoding implied classification and conversion rules. Similarly, the available full locale names may carry the encoding, for example, en_GB.iso8859-1, en_US.UTF-8, etc.

Consider the following C program, which is very trivial: it simply prints the value "42.01" as formatted text. However, it has been localized in the POSIX sense. It calls the standard library routine setlocale in a manner that makes the program use the locale settings as defined by the environment variables. If the value of LC_ALL does not correspond to any locale known to the system, setlocale returns a null pointer, and our program recognizes this and issues an error message:

#include <stdio.h>
#include <locale.h>
int main() {
  if(!setlocale(LC_ALL, "")) {
    fprintf(stderr, "Unknown locale\n");
  }
  printf("%6.2lf\n", 42.01);
  return 0;
}

The following demonstration shows how the program (stored in print.c) is compiled with the gcc compiler and executed, then executed again after setting the locale (to French). Recompilation is not needed, since the locale selection takes place at runtime:

% gcc print.c
% ./a.out
 42.01
% setenv LC_ALL fr
% ./a.out
 42,01
%

Although this may look nice, localization has been rather problematic. The repertoire of available locales is usually rather limited, there can be errors in their values, and locale settings via environment variables might be used when they shouldn't. In testing the simple program, I made the mistake of having LC_ALL set to the value en (English) when trying to compile the program, and I got the error message "couldn't set locale correctly" from the compiler. Apparently, the compiler checked the locale settings, theoretically to adapt its own behavior to them, but did not recognize the locale name.

You can view the list of available locales with the locale -a command. The list may contain a mixture of primary language codes like "fr," language codes with country specifier like "fr_FR," and locale names that additionally contain the name of an encoding, such as "fr_FR.ISO8859-1." For some languages, there might be no simple, general locale like "fr" or "en."

The repertoire of available locales in a system varies greatly. It may cause surprises. Even "en" for English might be missing.


Moreover, most users are probably unaware of the possibilities of setting the locale. Those who have tried to set the locale have often been disappointed with the effects. For example, you might expect that by doing the above in C, you would also make the standard isalpha function work according to a localized definition of what is an alphabetic character, but this probably won't happen.

11.8.3. Using CLDR

Using software modules that output data in localized formats according to CLDR, a programmer can create programs that adapt to users' preferences in data presentation. Ideally, the programmer need not know the different conventions, though she needs to be aware of the fact that output formats vary. In particular, assumptions about any fixed or maximum length or character repertoire in, for example, date and time denotations should be avoided.

Do not localize everything. In the past, many mistakes have been made, for example, by writing numeric data to temporary files as formatted text. Suppose a number is written using an English-language locale as the string "1.234" (meaning a number somewhat larger than one). When the data is read by the same program in another environment, or just with a different locale setting, serious problems may arise. If the program uses a locale where the decimal separator is the comma ",", it will fail to read "1.234" properly. An error might be reported or, worse still, the data might be silently misread: it might be interpreted as 1234, for example, with the period "." treated as a thousands separator.

Localize output presented to users, but not in the internal format inside a program or in interchange formats between programs.


Since CLDR is a relatively new invention, it will take time before sufficiently high-level routines are available. The programming environment that can best be expected to keep up with the development is Java, since much of the CLDR work adopts notations and definitions from the Java environment.

In the absence of library routines that print, say, a monetary amount according to each user's locale, you may need to write such routines yourself. You will probably want to deal with a few locales only, according to the expected user base. Even in such somewhat boring work, CLDR can help you by specifying the exact format of output for the locales. If someone criticizes you for a wrong output format for some locale, you can always say that you have been using the most up-to-date publicly collected information on it.

The CLDR data is primarily meant to be used in automatic data processing when a program generates menus, diagnostic messages, reports, tabulated data, date stamps, etc. It could also be used for data to be inserted into running text (paragraphs of normal text), though this involves many complications that are currently not addressed in CLDR, such as word inflection. As a large collection of information, CLDR can also be useful to translators, editors, and writers in "manual" work, e.g., in translating rarely used names of languages and in estimating what characters will probably be needed in texts in some language.

11.8.4. Internationalization and Localization

Before you can localize software reasonably well, it must be internationalized. The software must internally use data formats that can easily be mapped to various presentations. This typically means adherence to some published international standard or specification. Moreover, the software must perform input and output operations by using subroutines that know how to find the current locale settings.

This mostly applies to output, since localization of input has not been addressed much yet. However, localization is important in menu-based input. For example, if the user is prompted to select a currency, typically from a short list, the currencies should usually be specified by names in the user's preferred language.

For example, localizable software would process monetary data in a standardized internal format, normally with the sum and the currency in separate fields, and always carrying the currency information, with no implied currency. Only on output (and input) should the monetary data be converted, via a general purpose routine, into a language-dependent format, such as "$42.50" or "42,50 $" or "42:50 dollar."

This approach avoids many character-level problems, since the internal data formats typically use a limited repertoire of characters only, often just ASCII. For example, monetary data would be represented as a combination of a number (represented as a binary integer or floating-point number, or perhaps as an ASCII string) and currency identifier (represented as a string of ASCII letters or as an integer). Only the output routine would need to deal with special currency symbols, digits of different scripts, etc.

Localizable software uses universal, exact, and easily processable data formats internally. It converts to language-dependent format on output only.


Existing software that stores monetary data as strings like "$42.50" (to take a somewhat artificial example) might need considerable changes to become localizable. However, using a language-dependent format for the internal storage and processing of data does not as such prevent localization. You would just need to make sure that the format is well-defined and consistently used, so that it can be converted to some international format that can be passed to a localized output routine. It would, however, be a real obstacle to localization if the software had been coded to perform output in many different places, directly using the internal format as the output format. In that case, it would need substantial modularization of output.

11.8.5. CLDR Description and Data

Currently CLDR contains definitions for data formats like the following:

  • Names of languages (e.g., for use in language menus or bibliographic information)

  • Names of scripts (such as "Latin," "Cyrillic," etc.)

  • Names of countries and some other territories, such as continents

  • Names of calendar systems (e.g., "Gregorian calendar")

  • Names of time zones (e.g., "East European normal time")

  • Different (short, medium, long) formats for dates (in different calendars)

  • Formats for time of the day (e.g., "2 PM" versus "14.00")

  • Format of decimal numbers (e.g., 2,50 versus 2.50) and percentages (e.g., 7% versus 7 %)

  • Format of monetary data (e.g., €1.23 versus 1,23 €)

  • Names of currencies (e.g., for use in explanations and menus)

Currently there is no data for localized names of characters, although there would surely be need for them, for example, in character maps, in some error messages, and in user interface components for asking "which character is this?" There have been some discussions on such data, but it would be a major effort to compose a consensus-based list of names for characters in some language, even if we limit ourselves to a small subset of Unicode.

The CLDR database uses an XML-based format called Locale Data Markup Language (LDML), which has been defined as Unicode Technical Standard #35 at http://www.unicode.org/reports/tr35/.

For quick access to files containing data for particular locales, use the index page http://unicode.org/cldr/data/common/. It is divided into directories:


collation

Locale-specific exceptions and additions to Unicode collating order (which was described in Chapter 5)


main

This contains most of the locale data, i.e., everything that has no specific directory


posix

Locale settings for POSIX compatibility


supplemental

Information that is needed for some formatting of data but is not itself localizable, e.g., information about the use of historical currencies


test

Generated test data for checking implementations against CLDR (described in http://unicode.org/cldr/data/common/test/readme.html)

For example, most of the data for the French language locale (code: fr) is available at http://unicode.org/cldr/data/common/main/fr.xml. An extract of that data is shown in Figure 11-3, containing information about decimal and group (thousands) separator, currency format, and the start of data that contains French names for currencies. There is some additional data for country-specific French locales, for example, for Canadian (country code: CA) French at http://unicode.org/cldr/data/common/main/fr_CA.xml.

Figure 11-3. Extract of CLDR data for French, in XML format

The data just mentioned is in LDML format, and if you access it with a web browser, you will see it as text with XML markup. Although it is readable to some extent, at least to people who have a basic knowledge of XML and who can guess the meanings of element and attribute names, it's really not for a common user. The data is also available in a more formatted and more readable form, though partly as very large documents, at http://www.unicode.org/cldr/comparison_charts.html, and specifically in the by-type chart index at http://www.unicode.org/cldr/data/diff/by_type/index.html, illustrated in Figure 11-4. It shows different patterns for the presentation of percentages (producing, for example, "42%", "%42", "42 %", etc.) and the codes of the languages for which they apply.

11.8.6. Problems with Aspects of Localization

As mentioned earlier in this chapter, locales are mostly about languages, not locality. However, the selection of a locale is very often presented to the user as a matter of choosing a country or area. Yet it is currently impossible to specify locale settings as applying to a country or other geographic area independently of language. Territory codes can only be used as a subcode after a language code.

This will hopefully be fixed somehow, making language and territory orthogonal aspects of "localeness." Few localization-relevant things can be reasonably described as belonging to a form of a language as spoken in a particular country, as opposed to the language in general. Such features include the different rules for quotation marks in U.S. English and British English. On the other hand, there are things that should depend on the geographic position alone. The default time zone might be one of them. For some large countries, the country code alone would not imply a meaningful default. The point is that the time zone is not derivable from language, even when a specific variant of language is specified.

Figure 11-4. CLDR Sideways Data for percent formats

Language selection menus often contain country-specific variants of languages for no good reason: the choice between them usually has no effect. The language forms could be different, but not in a manner that affects the behavior of programs. Spellchecks are probably the most common (potential) area where the country may matter. For example, Brazilian Portuguese has somewhat different spelling than Portuguese in Portugal.

Ideally, language codes such as en_GB and en_US should be kept separate from the territory setting. After all, an American living in the U.K. might prefer to see quotation marks used in the U.S. English style, yet see times displayed in the time zone used in Britain, even if the display format is in U.S. English style (assuming it differs from British English).

Some people prefer dates and times as 2005-09-15 and 23:54 (i.e., in ISO 8601 format), especially if they read texts in different languages. There is no locale that matches such preferences. It would be possible to define such a locale, of course, and distribute it for use by people who prefer such presentation. They would not need to understand how a locale definition is written in LDML. Naturally, this would work only in programs that allow the use of locale definitions outside a predefined set like CLDR. Even then, users would have the problem of combining their language preferences with the specific preferences they have selected. This will probably imply that good-quality implementations of CLDR-based localization will offer a way to superimpose locales: set a locale, and then set one or more other locales, which override some of the settings.

The conclusion is that whenever possible, language and country selection should be kept logically separate. Both of them should be derived from user-supplied data. They should affect different settings, such as date and number formats, in a manner that is overridable by the user.

Globalization is more than making things global. Adequately globalized software adapts to varying conditions of use, including the user's language, country, cultural habits, and personal preferences.



Unicode Explained
ISBN: 059610121X
Year: 2006