Challenges with Culture-By-Default


The .NET Framework was designed to support automatic culture-specific formatting in many APIs. In most cases, this is desirable. For example, the fact that dates are formatted correctly when your application is deployed in another culture, without any explicit code on your part, is usually a welcome feature. String sorting and comparison that vary automatically with culture are usually welcome too, until they cause problems for you. Such problems range from benign bugs that only show up on international platforms to exploitable security holes in your code. The former are much more common than the latter.

String Manipulation (ToString, Parse, and TryParse)

The canonical examples of culture-specific APIs in the Framework are the methods for transforming a primitive into its string representation and the reverse, parsing a string into a primitive. These are the ToString, Parse, and TryParse methods, and they are available on all of the primitive data types, such as Int32 (C# int), Int64 (C# long), Double (C# double), DateTime, and Decimal (C# decimal).

What you might not already know is that the primitive types also offer overloads of ToString, Parse, and TryParse that accept an IFormatProvider. We mentioned above that IFormatProvider is the common interface for passing around formatting specifications. This enables you to pass a specific CultureInfo or XxxFormatInfo instance (for example, a NumberFormatInfo or DateTimeFormatInfo) to override the default behavior.
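As a minimal sketch of these overloads, the following passes a provider explicitly to Parse and ToString. The class FormatProviderSketch and its method names are hypothetical, and the hand-built NumberFormatInfo merely imitates comma-decimal separators for illustration; it is not taken from any real culture.

```csharp
using System;
using System.Globalization;

public static class FormatProviderSketch
{
    // Hypothetical helper: a NumberFormatInfo that uses ',' for the decimal
    // point and '.' for digit groups (an assumption made for illustration).
    public static NumberFormatInfo CommaDecimalFormat()
    {
        NumberFormatInfo nfi = new NumberFormatInfo();
        nfi.NumberDecimalSeparator = ",";
        nfi.NumberGroupSeparator = ".";
        return nfi;
    }

    // Parse using the supplied provider instead of CultureInfo.CurrentCulture.
    public static double ParseWith(string text, IFormatProvider provider)
    {
        return double.Parse(text, provider);
    }

    // Format using the supplied provider. "N2" is the standard numeric
    // format with group separators and two decimal places.
    public static string FormatWith(double value, IFormatProvider provider)
    {
        return value.ToString("N2", provider);
    }
}
```

For instance, FormatWith(1234.5, FormatProviderSketch.CommaDecimalFormat()) yields "1.234,50", while the same call with CultureInfo.InvariantCulture yields "1,234.50".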

The default implementations of these methods use the CultureInfo.CurrentCulture property as the IFormatProvider. This means that by default you'll get behavior that varies depending on the culture of the executing thread. If you're parsing a list of numbers from a text file, for example, the parse could fail unpredictably when the program runs under a different culture:

    string numberString = "33,200.50";
    double number = double.Parse(numberString);

This is an innocent piece of code which (in an English culture, at least) works just fine. But try running it after the following line of code (which might happen implicitly based on the current user's Windows profile, or without your knowledge in some code that your coworker haphazardly wrote):

 Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("de-DE"); 

Your code will fail at runtime with a FormatException (or return false if you used TryParse). Why? Well, de-DE represents the German culture, where the comma (,) is used instead of the period (.) to separate the decimal portion of a number. Likewise, the period (.) is used instead of the comma (,) to separate groups of three digits. So in Germany, the number 33,200.50 doesn't make any sense. If the original string were, say, "200.50" something even worse would happen: The application would silently parse the number successfully, but it would mean something entirely different (twenty thousand fifty instead of two hundred and one-half). If this were part of a bank transaction routine, somebody would be quite unhappy upon finding out about this bug!
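The behavior described above can be reproduced without touching the thread's culture by handing the culture to TryParse directly. This is a minimal sketch; the class GermanParseSketch and its method are hypothetical names, and it assumes the named cultures are available on the machine.

```csharp
using System;
using System.Globalization;

public static class GermanParseSketch
{
    // Returns null when the text is not a valid number for the named culture.
    public static double? TryParseFor(string text, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        double value;
        // Float | AllowThousands is the style double.Parse uses by default.
        bool ok = double.TryParse(
            text, NumberStyles.Float | NumberStyles.AllowThousands, culture, out value);
        return ok ? value : (double?)null;
    }
}
```

TryParseFor("33,200.50", "en-US") returns 33200.5, TryParseFor("33,200.50", "de-DE") returns null, and TryParseFor("200.50", "de-DE") quietly returns 20050.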

CultureInfo.InvariantCulture is the solution to this problem. As long as the textual representation of your numbers (or whatever data type you are dealing with) was serialized using the InvariantCulture, it will be deserialized correctly. The original two lines of code change to:

    string numberString = "33,200.50";
    double number = double.Parse(numberString, CultureInfo.InvariantCulture);

Regardless of which culture is selected in your user's Control Panel (or whether your coworker accidentally left some weird culture on your thread), your code will work predictably.

ToString has similar problems. Depending on what you expect to do with the resulting string, you may or may not want the culture-friendly version. The culture-friendly version is useful in the above example: although you parsed the string using the InvariantCulture, you can simply call ToString with no arguments to get the correct representation for display on German platforms. If instead you want ToString to ignore the current culture, for instance when writing data that will be parsed again later, you have to pass InvariantCulture to it. Note that if you are working with ADO.NET, XML, or the web services APIs, they always deal natively with invariant representations for you.
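One way to keep the two uses of ToString apart is to separate storage from display, as in this minimal sketch. The class RoundTripSketch and its method names are hypothetical, and the choice of the "R" and "N2" format strings is an assumption made for illustration.

```csharp
using System;
using System.Globalization;

public static class RoundTripSketch
{
    // Storage form: always invariant, so it can be parsed back anywhere.
    public static string Serialize(double value)
    {
        return value.ToString("R", CultureInfo.InvariantCulture);
    }

    // Reads the storage form back, again ignoring the current culture.
    public static double Deserialize(string text)
    {
        return double.Parse(text, CultureInfo.InvariantCulture);
    }

    // Display form: deliberately culture-sensitive, for showing to users.
    public static string Display(double value, CultureInfo culture)
    {
        return value.ToString("N2", culture);
    }
}
```

Serialize(33200.5) produces "33200.5" regardless of the thread's culture, whereas Display(33200.5, CultureInfo.GetCultureInfo("de-DE")) produces "33.200,50".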

To summarize, testing the correctness of operations with culture-specific behavior is tricky business, especially because a lot of it is hidden throughout the Framework. Where possible, try to work natively with InvariantCulture-based data. Where you do want culture-sensitive behavior, consider passing CultureInfo.CurrentCulture explicitly as a parameter to these APIs so that your intent is clear in the code. Then, when somebody is reviewing your code later on, it will be clear that you meant for culture to be taken into account.

String Comparisons and Sorting

There is a set of String-related functions whose behavior varies depending on the current culture. This should be recognized as a recurring theme by now, but this particular case is often problematic. Comparing two strings can yield different results in different cultures due to differing sorting and casing rules. We'll see why this matters in this section.

First, it's worth noting that the String.Equals method (and the op_Equality (==) and op_Inequality (!=) operator overloads), by far the most widely used String comparison API, is not culture sensitive. Equals simply compares two Strings for pure lexical equivalence, called an ordinal compare, meaning that it looks at the raw character values and doesn't try to perform culture-specific comparisons. Thus, an ordinal compare avoids the culture-related issues altogether. Likewise, String.CompareOrdinal orders strings by the ordinal value of their characters rather than alphabetically, and Equals(string, string, StringComparison) with a value of StringComparison.Ordinal tests equality the same way. Both avoid any culture-specific behavior.
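The difference between ordinal and alphabetic ordering can be sketched as follows. OrdinalSketch and its method names are hypothetical, and the invariant-culture comparison is included only as a contrast.

```csharp
using System;

public static class OrdinalSketch
{
    // Equality on raw character values; no culture data is consulted.
    public static bool SameOrdinal(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    // Ordering by character value: uppercase Latin letters precede lowercase.
    public static int OrderOrdinal(string a, string b)
    {
        return string.CompareOrdinal(a, b);
    }

    // Linguistic ordering under the invariant culture, shown for contrast.
    public static int OrderInvariant(string a, string b)
    {
        return string.Compare(a, b, StringComparison.InvariantCulture);
    }
}
```

OrderOrdinal("a", "B") is positive because 'a' (U+0061) has a larger value than 'B' (U+0042), while OrderInvariant("a", "B") is negative because "a" comes first alphabetically.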

Compare and Sort Ordering

The static Compare and instance CompareTo methods, however, do rely on culture. These use a native alphabetic ordering based on the culture's language. Specifically, these methods operate on two strings a and b (this takes the place of a in the CompareTo case), and return < 0 if a comes before b, 0 if a and b are equal, or > 0 if a comes after b. Similarly, calling Equals with a StringComparison.CurrentCulture value will rely on this same compare behavior.

These are appropriate for situations where a sorting algorithm needs to compare and reorder a list of strings. Using culture to determine this ordering often makes sense. But similar to the parsing issues noted above, this means that sorting behavior might differ when run on varying platforms. The instance CompareTo method doesn't offer an overload that lets you specify the culture; to control the comparison, use the static String.Compare, which has overloads accepting a CultureInfo or a StringComparison value (or better yet, use CompareOrdinal).
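A minimal sketch of choosing the comparison explicitly rather than inheriting it from the thread follows. CompareSketch and its method names are hypothetical, and the example assumes the named cultures are installed.

```csharp
using System;
using System.Globalization;

public static class CompareSketch
{
    // Culture-sensitive, case-sensitive ordering under an explicitly named
    // culture, independent of the current thread's culture.
    public static int CompareFor(string a, string b, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return string.Compare(a, b, false, culture);
    }

    // Ordering that never consults culture data.
    public static int CompareByCharValue(string a, string b)
    {
        return string.CompareOrdinal(a, b);
    }
}
```

With typical culture data, CompareFor("\u00E4", "z", "sv-SE") is positive because Swedish places "ä" after "z", whereas the same call with "de-DE" is negative. CompareByCharValue gives the same answer on every machine.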

Several types and methods that deal with sorted lists of Strings are affected. Specifically, System.Collections.SortedList and System.Collections.Generic.SortedList<TKey,TValue> both maintain their internal arrays in sorted order. This relies on the default IComparable.CompareTo (or IComparable<T>.CompareTo) method implementation. For String, this is the culture-friendly CompareTo method mentioned above. Similarly, the System.Array type's BinarySearch and Sort methods use the element type's CompareTo method by default. In all of these cases, the sort order will vary based on culture.

All of these types and methods provide an overload that takes an IComparer instance for custom comparisons, which can be used to ask for invariant sorting behavior. Simply pass the static Comparer.DefaultInvariant field. It uses the InvariantCulture for all comparisons:

    string[] strings = /*...*/;

    // I can specify the DefaultInvariant for SortedList:
    SortedList list = new SortedList(Comparer.DefaultInvariant);
    foreach (string s in strings)
        list.Add(s, null);

    // Alternatively, I can sort the array in place using DefaultInvariant:
    Array.Sort(strings, Comparer.DefaultInvariant);

This brief snippet shows both a SortedList constructed with the DefaultInvariant Comparer and a call to Array.Sort with a specific Comparer.
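For the generic collections, a comparable sketch can use StringComparer, a class the text above does not mention; it is swapped in here as an assumption because it implements IComparer<string>. OrdinalSortSketch and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalSortSketch
{
    // Returns a sorted copy of the input; the original array is left untouched.
    public static string[] SortedCopy(string[] input)
    {
        string[] copy = (string[])input.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    // Builds a generic SortedList whose keys are kept in ordinal order.
    // Each key maps to its last position in the input array.
    public static SortedList<string, int> IndexByKey(string[] input)
    {
        SortedList<string, int> list =
            new SortedList<string, int>(StringComparer.Ordinal);
        for (int i = 0; i < input.Length; i++)
            list[input[i]] = i;
        return list;
    }
}
```

Sorting { "b", "A", "a", "B" } this way gives A, B, a, b on every machine, because ordinal ordering places uppercase Latin letters before lowercase ones. StringComparer.InvariantCulture could be substituted where linguistic ordering is wanted.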

Compare and Sort Casing

Another tricky subject is that of casing. Many programs will normalize strings by forcing them to upper- or lowercase. In most cases, the result is then used for a case-insensitive comparison. But in some alphabets, most notably Turkish, portions of the Latin alphabet are mixed with custom characters, and casing rules conflict. The so-called "Turkish I" problem stems from the fact that Turkish uses the characters "i" and "I" in its alphabet, but the two are not related as far as casing goes. 'i' (Unicode code-point U+0069) has a separate uppercase character 'İ' (U+0130), while 'I' (U+0049) has a separate lowercase character 'ı' (U+0131).

This bit of code demonstrates these subtle differences:

    CultureInfo[] cultures = new CultureInfo[] {
        CultureInfo.GetCultureInfo("en-US"),
        CultureInfo.GetCultureInfo("tr-TR")
    };

    char lower = 'i';
    char upper = 'I';

    foreach (CultureInfo culture in cultures)
    {
        Thread.CurrentThread.CurrentCulture = culture;
        Console.WriteLine("{0}", culture.DisplayName);

        // do conversion from lower case 'i' to upper
        char toUpper = Char.ToUpper(lower);
        Console.WriteLine("  Lower->Upper: {0} ({1:X}) -> {2} ({3:X})",
            lower, (int)lower, toUpper, (int)toUpper);

        // do conversion from upper case 'I' to lower
        char toLower = Char.ToLower(upper);
        Console.WriteLine("  Upper->Lower: {0} ({1:X}) -> {2} ({3:X})",
            upper, (int)upper, toLower, (int)toLower);
    }

The output of running this is:

    English (United States)
      Lower->Upper: i (69) -> I (49)
      Upper->Lower: I (49) -> i (69)
    Turkish (Turkey)
      Lower->Upper: i (69) -> İ (130)
      Upper->Lower: I (49) -> ı (131)

Notice that under the Turkish culture, uppercasing "i" yields "İ" rather than "I", and lowercasing "I" yields "ı" rather than "i". The two ASCII letters are no longer a case pair. This can cause obvious problems if you are normalizing text in order to perform equality checks or comparisons on it, and the consequences can be serious. Consider what might happen if we did a text-based comparison to enforce a security check; for example, if we passed URLs to some system for retrieving data and needed to guarantee that users could not ask for files on disk:

    void DoSomething(string path)
    {
        if (String.Compare(path, 0, "FILE:", 0, 5, true) == 0)
            throw new SecurityException("Hey, you can't do that!");

        // Proceed with the operation
    }

If somebody were using the tr-TR culture, however, they could pass a URL of file://etc/ and trick the DoSomething API into permitting an operation against a file on disk. That's because the string, when normalized, would be turned into FİLE://ETC/ due to the capitalization issue noted above. (Note the difference between I and İ.)
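One possible way to close this hole is to make the check ordinal so that it never consults the thread's culture. This is a minimal sketch under that assumption; PathGuardSketch and IsFileScheme are hypothetical names, and the use of StartsWith with StringComparison.OrdinalIgnoreCase is a substitute for the Compare call above rather than the only fix.

```csharp
using System;
using System.Security;

public static class PathGuardSketch
{
    // Ordinal, case-insensitive prefix test: 'i' and 'I' match
    // regardless of the current culture.
    public static bool IsFileScheme(string path)
    {
        return path != null
            && path.StartsWith("FILE:", StringComparison.OrdinalIgnoreCase);
    }

    public static void DoSomething(string path)
    {
        if (IsFileScheme(path))
            throw new SecurityException("Hey, you can't do that!");

        // Proceed with the operation
    }
}
```

With this version, file://etc/ is rejected even when the thread is running under tr-TR.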

The ToUpper and ToLower methods on both String and Char are, by default, culture sensitive. Both types supply ToUpperInvariant and ToLowerInvariant versions that implicitly use the InvariantCulture. The noninvariant versions also supply overloads that accept a CultureInfo, meaning that you can supply a specific culture if you want a certain casing behavior.
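The two flavors can be compared side by side in a minimal sketch. CasingSketch and its method names are hypothetical, and the example assumes the tr-TR culture is available.

```csharp
using System;
using System.Globalization;

public static class CasingSketch
{
    // Culture-independent normalization, suitable for keys and comparisons.
    public static string NormalizeKey(string text)
    {
        return text.ToUpperInvariant();
    }

    // Culture-specific casing, for text that will be shown to users.
    public static string UpperFor(string text, string cultureName)
    {
        return text.ToUpper(CultureInfo.GetCultureInfo(cultureName));
    }
}
```

NormalizeKey("file") always returns "FILE", while UpperFor("file", "tr-TR") returns "FİLE" with the dotted capital (U+0130).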

There are other odd casing rules that might affect your programs. Please refer to the "Further Reading" section at the end of this chapter for some resources that will help you to identify individual cases. In general, it's recommended that you not worry about specific cases but rather avoid the issue altogether by relying on InvariantCulture to as great an extent as possible. Where you don't, budget extra testing effort to ferret out such hard-to-reproduce bugs.




Professional .NET Framework 2.0 (Programmer to Programmer)
ISBN: 0764571354
Pages: 116
Authors: Joe Duffy
