Sorting and String Comparison | Developing International Software

Glossary

Radicals: A group of strokes in a Chinese character that are treated as a unit for the purposes of sorting, indexing, and classification. A character can contain more than one element that is recognized as a radical, but each character contains only one element, called the "main radical," that is used as the indexing radical. The main radical often gives a hint as to the general meaning of the character, and other radicals in the character might indicate how the character is pronounced.
Sort keys: Numeric representations of a sort element based on locale-specific sorting rules. A sort key consists of several weighted components that represent a character's script, diacritics, case, and so on.

Even though you might think you understand all the issues involved with sorted lists, users of world-ready applications might have very different expectations of what constitutes a "sorted" list. Not only does alphabetical order vary among languages, but conventions for sequencing items in dictionaries and phone books can also be quite different.

In Swedish, for example, some vowels with an accent sort after "Z," whereas in other European countries the same accented vowel comes right after the nondiacritic vowel. Languages that include characters outside the Latin script have special sorting rules. The Asian languages have several different sort orders depending on phonetics, radical order, number of pen strokes, and so on.

String Comparison and Sorting in Win32

String sorting and comparison are language-specific. Even within languages based on the Latin script, there are different composition and sorting rules. Thus do not rely on code points to do proper sorting and string comparison.

String Comparison

The lstrcmp and lstrcmpi Win32 functions compare two character strings in a case-sensitive or case-insensitive manner, respectively, following the sorting rules and standards that are defined for the currently selected user locale. The lstrcmp and lstrcmpi functions compare two strings by checking the first characters against each other, the second characters against each other, and so on until they find an inequality or reach the ends of the strings. These functions return the difference of the values of the first unequal characters they encounter. For example, lstrcmp determines that "abcz" is greater than "abcdefg" and returns the difference of the code-point numbers that correspond to "z" and "d." The selected user locale determines which string is greater (or whether the strings are the same).

The big disadvantage of these APIs is that you cannot specify which locale-specific rules should be used to perform the comparison, and you are always limited to the currently selected user-locale standards. That can be a problem in the following cases:

You want to display data in a client context and, therefore, you need to perform your comparison within the client-side, user-locale specifications. You want to do a string comparison that's locale-aware, but you'd like to use the rules of a locale different from those of your current user locale. The CompareString API allows you to compare two character strings and to specify a locale to be used to do this comparison.
To prevent locale-specific composition or sorting rules from affecting the result of your comparison, you do a comparison that is not dependent upon locale. Imagine that you retrieve some kind of data fromthe registry (such as "HorzScroll") and want to compare that data against a predefined list of items (such as "HORZSCROLL") by using lctrcmpi. This performance being internal to your application, you always want to perform your comparison within the same condition. But unfortunately, "sc" is a special composition within the Hungarian language (as opposed to "Sc"). Thus your comparison would always fail if the user locale is set to Hungary! In this case, what you should be doing is a locale-insensitive (or locale-independent) comparison.

Windows XP introduced a new locale ID named the "invariant locale"-LOCALE_INVARIANT is the predefined flag for it-that allows you to do this type of comparison. A locale-independent string comparison must use this new locale ID.

 CompareString(LOCALE_INVARIANT, NORM_IGNORECASE, TEXT("HorzScroll"),    -1, TEXT("HorzScroll"), -1);

By using the invariant locale, you are always sure to retrieve the same comparison result no matter what the user locale and its associated sorting rules are. However, the invariant locale is only available on Windows XP. Therefore, for downlevel platforms you'll need to define your own locale-such as English-and then in this instance use basic comparison rules of the English language that follow the ASCII mapping of characters. Here is how the code should look:

 LCID Locale; Locale = MAKELCID(MAKELANDIG (LANG_ENGLISH, SUBLANG_ENGLISH_US),             SORT_DEFAULT); ComapareString(Locale, ..., ..., ..., ...);

In this approach, all non-English characters, for which the English (United States) has no sorting rule defined, will be sorted by their default sort orders. By doing this, you're guaranteed to retrieve a consistent result.

Important Whenever performing a comparison by using CompareString, the following flags should be used with extra care:

NORM_IGNOREKANATYPE. This flag causes corresponding hiragana and katakana characters to compare as equal.
NORM_IGNORENONSPACE. This flag causes combined characters (diacritics) to compare as equal to noncombined characters. For example, in Japanese would be equal to . For Latin scripts, "ö" would be equal to "o."

String Sorting

Glossary

Alphanumeric: Consisting of either letters or numbers, or both.

You can take advantage of the support for sorting that the operating system provides whenever you represent your data in a list-view control that has the sorting attribute set. Edit controls will sort items by following the linguistic sorting rules specific to the currently selected user locale. But if you want to do your own sorting, the real solution depends on your needs.

If you are sorting only a small number of elements, the easiest and quickest solution is to use the string comparison techniques explained previously, and then to sort elements based on the result of those comparisons. However, this technique does not apply if you are sorting a large number of elements. Because CompareString 'salgorithm is complex, it is faster to sort long lists of words by retrieving sort keys.

Your advanced sorting algorithm should follow this general model:

Generate a sort key for each new element.
Compare strings by comparing the value of their sort keys using memcmp.
Store sort key values along with your string to avoid extra processing time to re-generate them each time they are needed.

A sort key is a composite representation of a sort element. (See Figure 4-21.) Each component of the sort key has a different weight, depending on the sort rules associated with the locale. Elements are sorted based on script (that is, all Greek characters sort together, all Cyrillic characters sort together, and so on); alphanumeric and symbolic characters; diacritics; case; and other special rules, such as two characters sorting as a single character (for example, the Spanish "CH"). The predefined constant SORT_DEFAULT refers to whichever sort algorithm associated with a given language ID is commonly preferred.

figure 4-21 the win32 sort key for the german word öde, which means wasteland.

Figure 4-21 - The Win32 sort key for the German word "öde," which means "wasteland."

The first three values in the sort key represent the Unicode sort weights² for each character in the string. The next three values represent the diacritic weights, and the final three values represent the case weights. Separator values fall between sections. The sort key ends with a null terminator.

The code example that follows contains a search-engine procedure that maps lexical tokens into sort keys. You'll notice that the search engine customizes each sort key by prefixing it with an application-defined flag to ensure that words will sort first, followed by numbers and punctuation. (The system's default behavior is to sort punctuation first, followed by numbers and letters.)

 // sort key generation from Unicode BYTE* pbSource;      // multibyte string buffer int cbSource;        // size of pbSource buffer int LCSortKey(    LCID lcid,        // locale ID    LPCTSTR pwSource, // address of source string    int cwSource,     // number of characters in source string  LPTSTR pwDest,    // address of destination buffer    int cwDest)       // size of destination buffer {    int nRet;         // size of sort key in words    nRet = LCMapString(lcid, LCMAP_SORTKEY, pwSource, cwSource,              pwDest+1, (cwDest-1)<<1) >> 1;    if (nRet)    {       nRet++;  if (cwDest && pwDest) // Set a sort key's prefix so tokens                             //    group first by alphabetics, then                             //    by numerics, then by punctuation.       {          BYTE bCharType = char_types(*pwSource);          *pwDest = ~(bCharType & WORD_TYPE);       }    } return nRet; }

Once the search engine has a sorted list of words, it can quickly look for matching strings using a binary search algorithm. Using sort keys is very convenient because they automatically handle language-specific and locale-specific sorting preferences. One example includes matching ligatures. (In English, " " is equivalent to "AE," but this is not true in Danish.) Another example involves distinguishing between single letters and two-character combinations. (In some Spanish locales, "l" and the combination "ll" are distinct letters.)

Some features of linguistic sorting are:

A language's writing system will determine what influences the sort order of the language. For example, a sort order for Russian would be based on Cyrillic letters and possibly diacritics, but a sort order for Japanese might be based on the number of strokes it takes to draw a character.
Linguistic sort orders are different than the Unicode code point order.
Languages that use the same script often have different linguistic sort orders.
A sorting element (such as a character) can be the combination of more than one Unicode code point. For example, U+0B95 + U+0BCD + U+0BB7 = Tamil Ksha, a single sorting element in Tamil.

The best thing to remember is that by using Windows 2000 and Windows XP system support, all of these sorting issues are automatically handled for you.

String Comparison and Sorting in the .NET Framework

Glossary

Invariant culture: Whereas the neutral culture is associated with a language, but not with any particular country or region, the invariant culture is neither associated with a language nor with a particular country or region. It can be used in almost any method in the Globalization namespace that requires a culture. The invariant culture must be used only by processes that require culture- and language-independent results, such as system services; otherwise, it produces results that might be linguistically incorrect or culturally inappropriate.

In the following sections, you'll see which methods and properties to use for string comparison. You'll also see what to do when a culture supports more than one sort order, how to use the Array class for string sorting, and how to use sort keys.

String Comparison

The CompareInfo class provides a set of methods you can use to perform culture-sensitive string comparisons. The CultureInfo.CompareInfo property, which is an instance of the CultureInfo class, defines how to compare and sort strings for a specific culture. The String.Compare method uses the information in the CultureInfo.CompareInfo property to compare strings. This method returns a negative integer if string1 is less than string2, zero (0) if string1 and string2 are equal, and a positive integer if string1 is greater than string2.

The following code example illustrates how two strings can be evaluated differently by the String.Compare method depending upon the culture used to perform the comparison. First, the CurrentCulture is set to Danish in Denmark and the strings "Apple" and " ble" are compared. The Danish language treats the character " " as an individual letter, sorting it after "Z" in the alphabet. Therefore, the string " ble" is greater than "Apple" for the Danish culture. Next, the CurrentCulture is set to English in the United States and the strings "Apple" and " ble" are compared again. This time, the string " ble" is determined to be less than "Apple." The English language treats the character " " as a special symbol, sorting it before the letter "A" in the alphabet.

 using System; using System.Globalization; using System.Threading; public class CompareStringSample {    public static void Main()    {       string str1 = "Apple";       string str2 = " ble";       // Set the CurrentCulture to Danish in Denmark.       Thread.CurrentThread.CurrentCulture = new CultureInfo("da-DK");       // Compare the two strings.       int result1 = String.Compare(str1, str2);       Console.WriteLine("\nWhen the CurrentCulture is \"da-DK\",\n");       Console.WriteLine("the result of comparing {0} with {1} is: {2}",   str1, str2, result1);       // Set the CurrentCulture to English in the U.S.       Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");       // Compare the two strings.       int result2 = String.Compare(str1, str2);       Console.WriteLine(   "\nWhen the CurrentCulture is \"en-US\",\n");       Console.WriteLine(  "the result of comparing {0} with {1} is: {2}",   str1, str2, result1);    } }

If you execute this code, the output appears as follows:

 When the CurrentCulture is "da-DK", the result of comparing Apple with  ble is: -1 When the CurrentCulture is "en-US", the result of comparing Apple with  ble is: 1

The InvariantCulture is useful for storing data that will not be displayed directly to end users. Storing data in a culture-independent format guarantees a known format that does not change. When users from different cultures access the data, it can be formatted appropriately based on the user. You can initialize CultureInfo with the invariant culture by using an empty string ("") or by using CultureInfo.InvariantCulture as shown:

 // The following lines are equivalent. CultureInfo Invc = New CultureInfo(""); CultureInfo Invc = CultureInfo.InvariantCulture;

Alternate Sort Orders

Glossary

Stroke count: The number of strokes it takes to draw an ideographic character.

Some cultures support more than one sort order. For example, the culture "zh-CN" (Chinese in China) supports a sort by pronunciation (default) and a sort by stroke count. When you create a CultureInfo object using a culture name, such as "zh-CN," the default sort order is used. To specify the alternate sort order, create a CultureInfo object using the LCID for the alternate sort order. Then, obtain a CompareInfo object to use in string comparisons from CultureInfo.CompareInfo. As an alternative, you can create a CompareInfo object directlyby using the Com-pareInfo.GetCompareInfo method (Int32), specifying the LCID for the alternate sort order.

Table 4-6 lists the cultures that support alternate sort orders and the LCIDs for the default and alternate sort orders.

Table 4-6 Cultures with alternate sort orders and their corresponding LCIDs.

Culture Name	Language- Country/Region	Default Sort Name and LCID	Alternate Sort Name and LCID
es-ES	Spanish (Spain)	International: 0x00000C0A	Traditional: 0x0000040A
zh-TW	Chinese (Taiwan)	Stroke Count: 0x00000404	Bopomofo: 0x00030404
zh-CN	Chinese (China)	Pronunciation: 0x00000804	Stroke Count: 0x00020804
zh-HK	Chinese (Hong Kong SAR)	Stroke Count: 0x00000c04	Stroke Count: 0x00020c04
zh-SG	Chinese (Singapore)	Pronunciation:0x00001004	Stroke Count:0x00021004
zh-MO	Chinese (Macau SAR)	Pronunciation: 0x00001404	Stroke Count:0x00021404
de-DE	German (Germany)	Dictionary:0x00000407	Phone Book Sort DIN: 0x00010407
hu-HU	Hungarian (Hungary)	Default:0x0000040e	Technical Sort: 0x0001040e
ka-GE	Georgian (Georgia)	Traditional:0x00000437	Modern Sort:0x00010437

String Sorting

The Array class provides an overloaded Array.Sort method that allows you to sort arrays based on the CultureInfo.CurrentCulture property. In the following example, an array of three strings is created. First, the CurrentCulture is set to "en-US" and the Array.Sort method is called. The resulting sort order is based on sorting conventions for the "en-US" culture. Next, the CurrentCulture is set to "da-DK" and the Array.Sort method is called again. Notice how the resulting sort order differs from the "en-US" results because the sorting conventions for the "da-DK" culture are used.

 using System; using System.Threading; using System.Globalization; public class ArraySort {  public static void Main(String[] args)    {       String str1 = "Apple";       String str2 = " ble";       String str3 = "Zebra";       // Create and initialize a new Array instance to store       //    the strings.       Array stringArray = Array.CreateInstance( typeof(String), 3 );       stringArray.SetValue(str1, 0 );       stringArray.SetValue(str2, 1 );       stringArray.SetValue(str3, 2 );    // Display the values of the Array.       Console.WriteLine( "\nThe Array contains the following strings:");       PrintIndexAndValues(stringArray);       // Set the CurrentCulture to "en-US".       Thread.CurrentThread.CurrentCulture = new CultureInfo("enUS");       // Sort the values of the Array.       Array.Sort(stringArray);       // Display the values of the Array.       Console.WriteLine(   "\nAfter sorting for the culture \"en-US\":" );       PrintIndexAndValues(stringArray);       // Set the CurrentCulture to "da-DK".       Thread.CurrentThread.CurrentCulture = new CultureInfo("daDK");       // Sort the values of the Array.       Array.Sort(stringArray);       // Display the values of the Array.       Console.WriteLine(  "\nAfter sorting for the culture \"da-DK\":" );       PrintIndexAndValues(stringArray);    } public static void PrintIndexAndValues(Array myArray)    {     for ( int i = myArray.GetLowerBound(0);  i <= myArray.GetUpperBound(0); i++ )       Console.WriteLine( "\t[{0}]:\t{1}", i, myArray.GetValue( i ) );    } }

This code produces the following output:

The Array initially contains the following strings:

  [0]:  Apple  [1]:   ble  [2]:  Zebra

After sorting for the culture "en-US":

  [0]:   ble  [1]:  Apple  [2]:  Zebra

After sorting for the culture "da-DK":

  [0]:  Apple  [1]:  Zebra  [2]:   ble

Using Sort Keys

Once again, sort keys are used to support sorting methods that vary from one culture to another. According to sort-key methodology, each character in a string is given several categories of sort weights, including alphabetic, case, and diacritic weights. A sort key serves as the repository of these weights for a particular string. For example, a sort key might contain a string of alphabetic weights, followed by a string of case weights, and so on.

In the .NET Framework, the SortKey class maps strings to their sort keys and vice versa. You can use the CompareInfo.GetSortKey method to create a sort key for a string that you specify. The resulting sort key for a specified string is a sequence of bytes that can differ depending upon the CurrentCulture and the CompareOptions that you specify. For example, if you specify IgnoreCase when creating a sort key, a string comparison operation using the sort key will ignore case.

After you create a sort key for a string, you can pass it as a parameter to methods provided by the SortKey class. The SortKey.Compare method allows you to compare sort keys. Because SortKey.Compare performs a simple byte-by-bytecomparison, it is much faster than using String.Compare. In applications that are sorting-intensive, you can improve performance by generating and storing sort keys for all the strings that the application uses. When a sort or comparison operation is required, you can use the sort keys rather than the strings.

The following code example creates sort keys for two strings when the CurrentCulture is set to "da-DK". It compares the two strings using the Sort-Key.Compare method and displays the results. The SortKey.Compare method returns a negative integer if string1 is less than string2, zero (0) if string1 and string2 are equal, and a positive integer if string1 is greater than string2. Next, the CurrentCulture is set to "en-US" culture, and sort keys are created for the same strings. The sort keys for the strings are compared and the results are displayed. Notice that the sort results differ based upon the CurrentCulture. Although the results of the following code example are identical to the results of comparing these strings in the String.Compare example given earlier, the SortKey.Compare method is faster than the String.Compare method.

 using System; using System.Threading; using System.Globalization; public class SortKeySample {    public static void Main(String[] args)    {       String str1 = "Apple";       String str2 = " ble";       // Set the CurrentCulture to "da-DK".       CultureInfo dk = new CultureInfo("da-DK");       Thread.CurrentThread.CurrentCulture = dk;       // Create a culturally sensitive sort key for str1.       SortKey sc1 = dk.CompareInfo.GetSortKey(str1);       // Create a culturally sensitive sort key for str2.       SortKey sc2 = dk.CompareInfo.GetSortKey(str2);       // Compare the two sort keys and display the results.       int result1 = SortKey.Compare(sc1, sc2);       Console.WriteLine(  "\nWhen the CurrentCulture is \"da-DK\",\n");       Console.WriteLine("\the result of comparing {0} with {1} is: {2}",          str1, str2, result1);    // Set the CurrentCulture to "en-US".       CultureInfo enus = new CultureInfo("en-US");       Thread.CurrentThread.CurrentCulture = enus ;  // Create a culturally sensitive sort key for str1.       SortKey sc3 = enus.CompareInfo.GetSortKey(str1);       // Create a culturally sensitive sort key for str1.       SortKey sc4 = enus.CompareInfo.GetSortKey(str2);       // Compare the two sort keys and display the results.       int result2 = SortKey.Compare(sc3, sc4);       Console.WriteLine(  "\nWhen the CurrentCulture is \"en-US\",\n");       Console.WriteLine(  "\the result of comparing {0} with {1} is: {2}",          str1, str2, result2);    } }

This code produces the following output:

 When the CurrentCulture  is "da-DK", the result of comparing Apple with  ble is: -1 When the CurrentCulture  is "en-US", the result of comparing Apple with  ble is: 1

In addition to dealing with varying sort orders, representing numbers according to locale conventions is another challenge that you'll encounter. The sections that follow show how you can best meet this challenge.