Glossary
Even though you might think you understand all the issues involved with sorted lists, users of world-ready applications might have very different expectations of what constitutes a "sorted" list. Not only does alphabetical order vary among languages, but conventions for sequencing items in dictionaries and phone books can also be quite different.
In Swedish, for example, some vowels with an accent sort after "Z," whereas in other European countries the same accented vowel comes right after the nondiacritic vowel. Languages that include characters outside the Latin script have special sorting rules. The Asian languages have several different sort orders depending on phonetics, radical order, number of pen strokes, and so on.
String sorting and comparison are language-specific. Even within languages based on the Latin script, there are different composition and sorting rules. Thus do not rely on code points to do proper sorting and string comparison.
The lstrcmp and lstrcmpi Win32 functions compare two character strings in a case-sensitive or case-insensitive manner, respectively, following the sorting rules and standards that are defined for the currently selected user locale. The lstrcmp and lstrcmpi functions compare two strings by checking the first characters against each other, the second characters against each other, and so on until they find an inequality or reach the ends of the strings. These functions return the difference of the values of the first unequal characters they encounter. For example, lstrcmp determines that "abcz" is greater than "abcdefg" and returns the difference of the code-point numbers that correspond to "z" and "d." The selected user locale determines which string is greater (or whether the strings are the same).
The big disadvantage of these APIs is that you cannot specify which locale-specific rules should be used to perform the comparison, and you are always limited to the currently selected user-locale standards. That can be a problem in the following cases:
Windows XP introduced a new locale ID named the "invariant locale"-LOCALE_INVARIANT is the predefined flag for it-that allows you to do this type of comparison. A locale-independent string comparison must use this new locale ID.
CompareString(LOCALE_INVARIANT, NORM_IGNORECASE, TEXT("HorzScroll"), -1, TEXT("HorzScroll"), -1);
By using the invariant locale, you are always sure to retrieve the same comparison result no matter what the user locale and its associated sorting rules are. However, the invariant locale is only available on Windows XP. Therefore, for downlevel platforms you'll need to define your own locale-such as English-and then in this instance use basic comparison rules of the English language that follow the ASCII mapping of characters. Here is how the code should look:
LCID Locale; Locale = MAKELCID(MAKELANDIG (LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT); ComapareString(Locale, ..., ..., ..., ...);
In this approach, all non-English characters, for which the English (United States) has no sorting rule defined, will be sorted by their default sort orders. By doing this, you're guaranteed to retrieve a consistent result.
Important Whenever performing a comparison by using CompareString, the following flags should be used with extra care:
Glossary
You can take advantage of the support for sorting that the operating system provides whenever you represent your data in a list-view control that has the sorting attribute set. Edit controls will sort items by following the linguistic sorting rules specific to the currently selected user locale. But if you want to do your own sorting, the real solution depends on your needs.
If you are sorting only a small number of elements, the easiest and quickest solution is to use the string comparison techniques explained previously, and then to sort elements based on the result of those comparisons. However, this technique does not apply if you are sorting a large number of elements. Because CompareString 'salgorithm is complex, it is faster to sort long lists of words by retrieving sort keys.
Your advanced sorting algorithm should follow this general model:
A sort key is a composite representation of a sort element. (See Figure 4-21.) Each component of the sort key has a different weight, depending on the sort rules associated with the locale. Elements are sorted based on script (that is, all Greek characters sort together, all Cyrillic characters sort together, and so on); alphanumeric and symbolic characters; diacritics; case; and other special rules, such as two characters sorting as a single character (for example, the Spanish "CH"). The predefined constant SORT_DEFAULT refers to whichever sort algorithm associated with a given language ID is commonly preferred.
Figure 4-21 - The Win32 sort key for the German word "öde," which means "wasteland."
The first three values in the sort key represent the Unicode sort weights2 for each character in the string. The next three values represent the diacritic weights, and the final three values represent the case weights. Separator values fall between sections. The sort key ends with a null terminator.
The code example that follows contains a search-engine procedure that maps lexical tokens into sort keys. You'll notice that the search engine customizes each sort key by prefixing it with an application-defined flag to ensure that words will sort first, followed by numbers and punctuation. (The system's default behavior is to sort punctuation first, followed by numbers and letters.)
// sort key generation from Unicode BYTE* pbSource; // multibyte string buffer int cbSource; // size of pbSource buffer int LCSortKey( LCID lcid, // locale ID LPCTSTR pwSource, // address of source string int cwSource, // number of characters in source string LPTSTR pwDest, // address of destination buffer int cwDest) // size of destination buffer { int nRet; // size of sort key in words nRet = LCMapString(lcid, LCMAP_SORTKEY, pwSource, cwSource, pwDest+1, (cwDest-1)<<1) >> 1; if (nRet) { nRet++; if (cwDest && pwDest) // Set a sort key's prefix so tokens // group first by alphabetics, then // by numerics, then by punctuation. { BYTE bCharType = char_types(*pwSource); *pwDest = ~(bCharType & WORD_TYPE); } } return nRet; }
Once the search engine has a sorted list of words, it can quickly look for matching strings using a binary search algorithm. Using sort keys is very convenient because they automatically handle language-specific and locale-specific sorting preferences. One example includes matching ligatures. (In English, " " is equivalent to "AE," but this is not true in Danish.) Another example involves distinguishing between single letters and two-character combinations. (In some Spanish locales, "l" and the combination "ll" are distinct letters.)
Some features of linguistic sorting are:
The best thing to remember is that by using Windows 2000 and Windows XP system support, all of these sorting issues are automatically handled for you.
Glossary
In the following sections, you'll see which methods and properties to use for string comparison. You'll also see what to do when a culture supports more than one sort order, how to use the Array class for string sorting, and how to use sort keys.
The CompareInfo class provides a set of methods you can use to perform culture-sensitive string comparisons. The CultureInfo.CompareInfo property, which is an instance of the CultureInfo class, defines how to compare and sort strings for a specific culture. The String.Compare method uses the information in the CultureInfo.CompareInfo property to compare strings. This method returns a negative integer if string1 is less than string2, zero (0) if string1 and string2 are equal, and a positive integer if string1 is greater than string2.
The following code example illustrates how two strings can be evaluated differently by the String.Compare method depending upon the culture used to perform the comparison. First, the CurrentCulture is set to Danish in Denmark and the strings "Apple" and " ble" are compared. The Danish language treats the character " " as an individual letter, sorting it after "Z" in the alphabet. Therefore, the string " ble" is greater than "Apple" for the Danish culture. Next, the CurrentCulture is set to English in the United States and the strings "Apple" and " ble" are compared again. This time, the string " ble" is determined to be less than "Apple." The English language treats the character " " as a special symbol, sorting it before the letter "A" in the alphabet.
using System; using System.Globalization; using System.Threading; public class CompareStringSample { public static void Main() { string str1 = "Apple"; string str2 = " ble"; // Set the CurrentCulture to Danish in Denmark. Thread.CurrentThread.CurrentCulture = new CultureInfo("da-DK"); // Compare the two strings. int result1 = String.Compare(str1, str2); Console.WriteLine("\nWhen the CurrentCulture is \"da-DK\",\n"); Console.WriteLine("the result of comparing {0} with {1} is: {2}", str1, str2, result1); // Set the CurrentCulture to English in the U.S. Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US"); // Compare the two strings. int result2 = String.Compare(str1, str2); Console.WriteLine( "\nWhen the CurrentCulture is \"en-US\",\n"); Console.WriteLine( "the result of comparing {0} with {1} is: {2}", str1, str2, result1); } }
If you execute this code, the output appears as follows:
When the CurrentCulture is "da-DK", the result of comparing Apple with ble is: -1 When the CurrentCulture is "en-US", the result of comparing Apple with ble is: 1
The InvariantCulture is useful for storing data that will not be displayed directly to end users. Storing data in a culture-independent format guarantees a known format that does not change. When users from different cultures access the data, it can be formatted appropriately based on the user. You can initialize CultureInfo with the invariant culture by using an empty string ("") or by using CultureInfo.InvariantCulture as shown:
// The following lines are equivalent. CultureInfo Invc = New CultureInfo(""); CultureInfo Invc = CultureInfo.InvariantCulture;
Glossary
Some cultures support more than one sort order. For example, the culture "zh-CN" (Chinese in China) supports a sort by pronunciation (default) and a sort by stroke count. When you create a CultureInfo object using a culture name, such as "zh-CN," the default sort order is used. To specify the alternate sort order, create a CultureInfo object using the LCID for the alternate sort order. Then, obtain a CompareInfo object to use in string comparisons from CultureInfo.CompareInfo. As an alternative, you can create a CompareInfo object directlyby using the Com-pareInfo.GetCompareInfo method (Int32), specifying the LCID for the alternate sort order.
Table 4-6 lists the cultures that support alternate sort orders and the LCIDs for the default and alternate sort orders.
Table 4-6 Cultures with alternate sort orders and their corresponding LCIDs.
Culture Name | Language- Country/Region | Default Sort Name and LCID | Alternate Sort Name and LCID |
es-ES | Spanish (Spain) | International: 0x00000C0A | Traditional: 0x0000040A |
zh-TW | Chinese (Taiwan) | Stroke Count: 0x00000404 | Bopomofo: 0x00030404 |
zh-CN | Chinese (China) | Pronunciation: 0x00000804 | Stroke Count: 0x00020804 |
zh-HK | Chinese (Hong Kong SAR) | Stroke Count: 0x00000c04 | Stroke Count: 0x00020c04 |
zh-SG | Chinese (Singapore) | Pronunciation:0x00001004 | Stroke Count:0x00021004 |
zh-MO | Chinese (Macau SAR) | Pronunciation: 0x00001404 | Stroke Count:0x00021404 |
de-DE | German (Germany) | Dictionary:0x00000407 | Phone Book Sort DIN: 0x00010407 |
hu-HU | Hungarian (Hungary) | Default:0x0000040e | Technical Sort: 0x0001040e |
ka-GE | Georgian (Georgia) | Traditional:0x00000437 | Modern Sort:0x00010437 |
The Array class provides an overloaded Array.Sort method that allows you to sort arrays based on the CultureInfo.CurrentCulture property. In the following example, an array of three strings is created. First, the CurrentCulture is set to "en-US" and the Array.Sort method is called. The resulting sort order is based on sorting conventions for the "en-US" culture. Next, the CurrentCulture is set to "da-DK" and the Array.Sort method is called again. Notice how the resulting sort order differs from the "en-US" results because the sorting conventions for the "da-DK" culture are used.
using System; using System.Threading; using System.Globalization; public class ArraySort { public static void Main(String[] args) { String str1 = "Apple"; String str2 = " ble"; String str3 = "Zebra"; // Create and initialize a new Array instance to store // the strings. Array stringArray = Array.CreateInstance( typeof(String), 3 ); stringArray.SetValue(str1, 0 ); stringArray.SetValue(str2, 1 ); stringArray.SetValue(str3, 2 ); // Display the values of the Array. Console.WriteLine( "\nThe Array contains the following strings:"); PrintIndexAndValues(stringArray); // Set the CurrentCulture to "en-US". Thread.CurrentThread.CurrentCulture = new CultureInfo("enUS"); // Sort the values of the Array. Array.Sort(stringArray); // Display the values of the Array. Console.WriteLine( "\nAfter sorting for the culture \"en-US\":" ); PrintIndexAndValues(stringArray); // Set the CurrentCulture to "da-DK". Thread.CurrentThread.CurrentCulture = new CultureInfo("daDK"); // Sort the values of the Array. Array.Sort(stringArray); // Display the values of the Array. Console.WriteLine( "\nAfter sorting for the culture \"da-DK\":" ); PrintIndexAndValues(stringArray); } public static void PrintIndexAndValues(Array myArray) { for ( int i = myArray.GetLowerBound(0); i <= myArray.GetUpperBound(0); i++ ) Console.WriteLine( "\t[{0}]:\t{1}", i, myArray.GetValue( i ) ); } }
This code produces the following output:
The Array initially contains the following strings:
[0]: Apple [1]: ble [2]: Zebra
After sorting for the culture "en-US":
[0]: ble [1]: Apple [2]: Zebra
After sorting for the culture "da-DK":
[0]: Apple [1]: Zebra [2]: ble
Once again, sort keys are used to support sorting methods that vary from one culture to another. According to sort-key methodology, each character in a string is given several categories of sort weights, including alphabetic, case, and diacritic weights. A sort key serves as the repository of these weights for a particular string. For example, a sort key might contain a string of alphabetic weights, followed by a string of case weights, and so on.
In the .NET Framework, the SortKey class maps strings to their sort keys and vice versa. You can use the CompareInfo.GetSortKey method to create a sort key for a string that you specify. The resulting sort key for a specified string is a sequence of bytes that can differ depending upon the CurrentCulture and the CompareOptions that you specify. For example, if you specify IgnoreCase when creating a sort key, a string comparison operation using the sort key will ignore case.
After you create a sort key for a string, you can pass it as a parameter to methods provided by the SortKey class. The SortKey.Compare method allows you to compare sort keys. Because SortKey.Compare performs a simple byte-by-bytecomparison, it is much faster than using String.Compare. In applications that are sorting-intensive, you can improve performance by generating and storing sort keys for all the strings that the application uses. When a sort or comparison operation is required, you can use the sort keys rather than the strings.
The following code example creates sort keys for two strings when the CurrentCulture is set to "da-DK". It compares the two strings using the Sort-Key.Compare method and displays the results. The SortKey.Compare method returns a negative integer if string1 is less than string2, zero (0) if string1 and string2 are equal, and a positive integer if string1 is greater than string2. Next, the CurrentCulture is set to "en-US" culture, and sort keys are created for the same strings. The sort keys for the strings are compared and the results are displayed. Notice that the sort results differ based upon the CurrentCulture. Although the results of the following code example are identical to the results of comparing these strings in the String.Compare example given earlier, the SortKey.Compare method is faster than the String.Compare method.
using System; using System.Threading; using System.Globalization; public class SortKeySample { public static void Main(String[] args) { String str1 = "Apple"; String str2 = " ble"; // Set the CurrentCulture to "da-DK". CultureInfo dk = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = dk; // Create a culturally sensitive sort key for str1. SortKey sc1 = dk.CompareInfo.GetSortKey(str1); // Create a culturally sensitive sort key for str2. SortKey sc2 = dk.CompareInfo.GetSortKey(str2); // Compare the two sort keys and display the results. int result1 = SortKey.Compare(sc1, sc2); Console.WriteLine( "\nWhen the CurrentCulture is \"da-DK\",\n"); Console.WriteLine("\the result of comparing {0} with {1} is: {2}", str1, str2, result1); // Set the CurrentCulture to "en-US". CultureInfo enus = new CultureInfo("en-US"); Thread.CurrentThread.CurrentCulture = enus ; // Create a culturally sensitive sort key for str1. SortKey sc3 = enus.CompareInfo.GetSortKey(str1); // Create a culturally sensitive sort key for str1. SortKey sc4 = enus.CompareInfo.GetSortKey(str2); // Compare the two sort keys and display the results. int result2 = SortKey.Compare(sc3, sc4); Console.WriteLine( "\nWhen the CurrentCulture is \"en-US\",\n"); Console.WriteLine( "\the result of comparing {0} with {1} is: {2}", str1, str2, result2); } }
This code produces the following output:
When the CurrentCulture is "da-DK", the result of comparing Apple with ble is: -1 When the CurrentCulture is "en-US", the result of comparing Apple with ble is: 1
In addition to dealing with varying sort orders, representing numbers according to locale conventions is another challenge that you'll encounter. The sections that follow show how you can best meet this challenge.