Section 5.6. Collation and Sorting

5.6. Collation and Sorting

Sorting is a well-known concept: we put data into a specific order, such as alphabetical order. Collating order is a more technical concept, but closely related: the collating order of characters and strings is the order by which sorting of character data takes place. The collating order says, for example, that "a" < "b" or that "&" < ".", using the less than sign to mean "precedes (in the ordering)." Sorting is often called "alphabetizing," although it generally operates on strings in general, not just alphabetic characters.

Sorting is relevant when we present a large amount of text data to users and the data has some key component, such as a person's name in a telephone catalog or a term in a glossary. People are used to scanning through lists and tables, expecting them to be in an alphabetic order (or, more generally, collating order) they have learned at school. In the global context, it is important that different people have learned different orders.

The relative importance of sorting has diminished due to the advance of automatic searching tools. When you use a CD-ROM encyclopedia, you can type a word and expect a program to show you the corresponding entry; alphabetic order is irrelevant here. However, you might be uncertain of the spelling and would like to browse through consecutive entries.

Sorting still has many other uses as well. For example, when you need to present countries in some order, difficult political problems may arise unless you can apply some reasonably neutral or traditional order, such as alphabetic order by the name of the country in French. Moreover, even for small amounts of entriese.g., in a list of linksalphabetic order is often the best, when there is no other natural order. It is then easy to a user who is looking for a particular entry to check whether it is the list. Sorting is also relevant to operations that extract a range of values from some data.

5.6.1. Sorting Characters Versus Sorting Strings

For sorting, we need an order for characters, but this may not be sufficient. A trivial method for ordering strings, once a character order has been established, is to compare the first characters of the strings, then the second characters, until a difference is found. The first difference found will determine string order. Thus, "AAB" < "AAC", since "B" < "C". This simple method is often called lexicographic order or dictionary order .

However, when you look at a real dictionary, you may notice that entries are often ordered in a more complex manner. As an alternative to "letter by letter" ordering, a "word by word" ordering can be applied. Technically, such matters are reflected in the treatment of spaces, hyphens, and other punctuation marks. It is often better to ignore them, in order to produce an order that corresponds to readers' expectations. For example, it might be better to treat "cat eyed," "cat-eyed," and "cateyed" as basically the same in sorting.

For such reasons, the Unicode definitions related to collation are somewhat complex. They specify methods for applying special rules in addition to the simple lexicographic order, in a manner that allows different sets of rules to be used.

5.6.2. Collation and Unicode

The Unicode standard does not define a collating order. Thus, Unicode characters have no inherent order, but they can be sorted according to different orders.

The Unicode Consortium has issued a separate standard on collation: "Unicode Collation Algorithm," Unicode Technical Standard #10, http://www.unicode.org/reports/tr30/. It is an independent standard: conformance to it is not required for conformance to the Unicode standard.

Figure 5-6. Viewing Unicode Collation Charts

The Consortium has also prepared collation charts, which present the collating order visually. The charts are at http://www.unicode.org/charts/collation/, and they can be used as a handy checking tool, as is illustrated in Figure 5-6. There you can see some of the characters that are historically based on the letter "A," in collating order. Much of this order is a matter of arbitrary decisions that just had to be made. (Due to font limitations, not all characters are displayed correctly when viewing the charts; this is why there are question marks and boxes.)

5.6.3. Layered Model of Collation

Collation order is language-dependent, and it may vary even within a language. For example, in German, the letter ö is placed after "o," and the difference between them is made only for words that are otherwise the same. In Swedish, ö is the last letter in the alphabet, after "z," å, and ä. Thus, when looking for a word like "Öhman" in an alphabetic list, different people start at different places. Moreover, in some countries, different sorting principles are applied in different contextse.g., in dictionaries versus telephone catalogs.

Technically, we can say that collating order depends on a locale, which can be defined as a cultural environment, or as a collection of cultural conventions. For example, the combination "ch" was traditionally regarded as a single letter in Spanish, so that all words beginning with "ch" were placed after the set of all other words beginning with "c." People might still prefer the old stylei.e., to use a Spanish locale with the old sorting rules. Part of the Common Locale Data Repository (CLDR) activity (described in Chapter 11) is the collection of data on such variation, for use in implementing tools that automatically perform collation in a locale-dependent manner.

Ultimately, a locale is a matter of user preferences, based on a user's cultural and personal background, habits, views, and decisions. For example, should "foo" appear before or after "Foo" in collation? There might be standards on such matters, but in practice, they depend on what you are accustomed to or you find most natural.

Collating order should meet user expectations, rather than the standards of the information producer. Ideally, it would be customizable by users.

The variation between locales is described as exceptions to a default collating order, which should thus be understood as a useful tool for defining language-specific orders, rather than an attempt at a universal order to be used everywhere. On the other hand, definitions of collating orders often deal with many special characters that are usually not very relevant in sorting. For example, indexes in books are usually alphabetic but may contain entries with special characters, such as "% operator" in a book on programming. They need to put somewhere, and different authors, publishers, and standardizers have handled them differently. We can expect that the situation will become more uniform: language-specific orders will mostly cover only characters commonly used in a language, and all the rest is easiest to order according to default collating order.

More generally, the modern approach to collation is based on a layered model, where each layer may modify or replace the rules set on lower layers. For example, the layers could consist of the following, in descending order of priority:

Application-specific rules, such as "treat v and w as identical" (for some particular reason)
Company-specific rules; e.g., reflecting the traditions and decisions of a publishing house in their glossaries
Locale-specific rules; e.g., describing the collation rules of Swiss German, to the extent that they deviate from pan-European rules
Rules common to many locales by tradition or convention, such as pan-European rules
Universal default rules, such as Unicode default collating order

Thus, Unicode collating order is not meant to unify sorting rules across languages and cultures. On the contrary, it constitutes an essential part of a model designed to support cultural variation, providing the lowest layer of rules. It simplifies the definition of locale-specific sorting, since only deviations from default rules need to be specified. It also ensures that the collating order is defined for all characters, no matter how rare and little known they are.

5.6.4. Code Point Order Versus Collating Order

Code points are numbers (integers), and therefore they have an order defined by the normal ordering of numbers (0, 1, 2,...). However, this order is not meant to be used as collating order, except for special purposes.

5.6.4.1. Code point order is unnatural

It would be impossible to allocate code points so that their order, as numbers, would match the collating order of characters. Different languages have different collating orders. For example, the character ö is treated as a separate letter at the end of the alphabet in some languages but as a variant of "o" in many other languages. Moreover, the structure of the Unicode coding space imposes serious limitations, since the grouping of characters into blocks largely reflects old practices and other character standards.

Some small subsets of Unicode have code points that correspond to the mutual order of the characters. For example, the Latin letters A through Z are in consecutive code points in the "right" order in Unicode, as well as in ASCII and most other character codes. However, even for the basic Latin letters, the code point order is not suitable as a collating order, since all lowercase letters appear after all uppercase letters. In code point order, A < B < ... < Z < a < b ... < z, but the normal collating order has A < a < B < b < ... < Z < z (so that "A" and "a" are equivalent at the primary level, etc.). As we proceed to Latin letters with diacritics, it becomes even more obvious that code point order differs from collating order.

5.6.4.2. Using code point order as a fallback in definitions

Some definitions of collating order have defined things so that they specify meaningful order only for characters that are expected to appear in normal data. They use Unicode code point order for the rest, just to have it sorted out somehow, in some known order.

Although you can define your collating order as you like, it is usually not necessary to use anything as arbitrary as code point order. You can use Unicode default collating order instead. Both of these orders are more or less arbitrary, but the default collating order tries to pay attention to sorting principles in human languages.

5.6.4.3. Code point order sorting for technical reasons

Sometimes you might decide to use code point order as the collating order simply because you need some known order. For example, some algorithms require that data be placed in some order, but it does not matter which. Code point order is easy to implement, and it can be described easily. The simplicity of description is essential, if data sorted by a program will be processed by another program that expects sorted data.

Although code point order is technically very simple, it poses some problems when the data is in encoded form, as opposed to just a sequence of code points. Some Unicode encodings, described in Chapter 6, are very easy in this respect, since they use fixed-size storage units for all characters. The encodings used in practice tend to be more complicated. Methods for performing code point order sorting on UTF-8 and other Unicode encodings are described in the Unicode standard in section 5.17 "Binary Order." It discusses the even more technical orders based on numeric ordering of code units (such as octets that constitute UTF-8 encoded data) rather than code numbers.

5.6.4.4. Problems of legacy software

In simple programming tasks, comparisons of character and string data are sometimes based on comparisons of code points. This applies basically to basic Latin letters in contexts where it can be assumed (or it just is assumed) that we need not deal with any other letters and that the case of letters is fixed (e.g., due to previous case folding). This explains code like if((ch >= 'A') & (ch <= 'Z')) for testing whether the value of ch is an (uppercase) letter. Such code can be efficient, but nowadays it is usually better to use library subroutines (e.g., if(isletter(ch))), making the code more readable and more portable without sacrificing efficiency. We will discuss such methods in Chapter 11.

As a user of programs, you may encounter sorted data and sorting routines that apply to simple code point order. For example, if you use a tool for automatic generation of an index for a publication, you might notice that the index will be sorted that way. If the entries are dominantly English words, the result may look mostly OK, but the handling of spaces, punctuation marks, special characters, and case of letters may differ from the applicable rules of sorting. Therefore, you may need to fix the order separately, "by hand." Make sure you first know the rules; there are differences that depend on language, style, and publisher.

5.6.5. Unicode Collation Algorithm

The Unicode Collation Algorithm (UCA) uses multilevel comparison in order to deal with the complexities of sorting. Instead of simply putting all characters in a single order and defining the collating order of strings according to their first difference (simple lexicographic ordering), UCA defines different levels of ordering between characters. For example, you can define the difference between "o" and ö as primary (as in Swedish) or as secondary (as in German). By default, UCA works with three levels:

Alphabetic ordering (e.g., "a" < "b")
Diacritic ordering (e.g., "a" < "á" < "à")
Case ordering (e.g., "a" < "A")

Surprisingly, UCA uses by default a case ordering where lowercase precedes uppercase, resulting in, for example, mime < Mime < MIME. The usual dictionary is opposite to this.

Canonical equivalents are essentially different ways of representing the same character (e.g., ö as a precomposed character or as an "o" followed by a combining dieresis). It is therefore natural to expect them to behave identically in collation. However, when desired, a distinction between them can be made, when everything else is the same ("tie-break situation").

A collation algorithm needs to be able to work on text elements larger than individual characters, in order to be suitable for general use. For example, in some languages, the letter æ might be treated as a separate letter, different from other letters at the primary level, but in other languages or contexts, it might be treated as equivalent to the letter pair "ae." In English, æ is often understood as just a ligature of "a" and "e," even though Unicode defines it differently. Thus, in sorting material in English, you might wish to make "Cæsar" appear where "Caesar" would appear, not after "cat."

Formally, UCA is described as an algorithm that takes as input a string and aCollation Element Table, which contains mapping data for characters. The output is a sort key, which is a sequence of 16-bit integers. These integers are not code numbers (though they may coincide with the character's code number) but weights that describe the position of the input string in the collation order, in a manner that lets us sort strings by their sort keys. Comparison of sort keys means simple comparison by numeric values. By convention, the 16-bit integers are written in hexadecimal, using four digits.

The algorithm can be used for different collating orders by using different Collation Element Tables. The UCA definition contains theDefault Unicode Collation Element Table (DUCET), which you may choose to use as such or as a basis for defining your own table. In particular, you may define collating order for characters that are important in your application and leave all rest as in DUCET. This ensures that the collating order is defined for all Unicode characters.

A Collation Element Table maps characters (code points) tocollation elements . A collation element is a sequence of three or more weights (16-bit integers), and the order of the weights corresponds to their levels. Thus, the first integer indicates the primary weight, which typically corresponds to the basic alphabetic order. The principles of interpreting collation elements are the following:

The order of elements is the order of their primary (first level) weights, if these weights are different. If the weights are equal, the order of the secondary (second level) weights is used, etc.
However, a weight of 0000 means that the collation element is ignorable at the level of that weight.

The DUCET table is available at http://www.unicode.org/Public/UCA/latest/allkeys.txt. It does not contain mappings for all characters, since UCA defines weight derivationi.e., calculation rules for weights that are not explicitly listed in a table.

A simple example of an entry in DUCET:

 0041  ; [.0F6C.0020.0008.0041] # LATIN CAPITAL LETTER A

The example defines the mapping for "A" U+0041. The weights are given in square brackets, separated with periods. The primary weight is 0F6C, etc. The text starting with # is a comment, as in the Unicode database in general.

The first character inside brackets has a special meaning. If it is a period . as above, the collation element is a normal one. If it is an asterisk *, the collation element is variable. Variable elements include spaces, punctuation marks, and most symbols. They have defined weights, but an implementation of UCA may support alternate weightings. This means that a program switch called avariable weighting tag can change the status of variable elements so that they are ignored, except in the absence of other differences between strings.

For example, consider the following entry in DUCET:

 002D  ; [*0221.0020.0002.002D] # HYPHEN-MINUS

This means that the hyphen-minus character "-" has primary weight 0221, putting it among many other punctuation marks and symbols, before any letters. This makes, for example, "X-men" sort before any string that begins with an "X" and a letter, such as "Xanadu." When alternate weighting is used, the hyphen-minus is more or less ignored, making "X-men" sort basically the same way as "Xmen." The setting of the variable weighting tag affects this behavior as follows:

Blanked (ignorable): This setting sets the weights of variable collation elements to zero at the first three levels. This makes collation work as if variable collation elements were not present at all. However, as the last resort, Unicode code number order of character will be used for tie-breaking. Therefore, this would result, for example, in the order "X men" < "X-men" < "Xmen" < "Xmen" (with en dash) while all of these would be treated as "Xmen" with respect to other strings.
Non-ignorable: Variable collation elements have their weights unchanged. Thus, for this value, an implementation supporting alternate weightings behaves the same way as an implementation that does not.
Shifted: The first three weights are set to zero, as for Blanked, but the original primary weight is made the fourth-level weight. In this case, all non-variable collation elements get the maximal fourth-level weight of FFFF. Therefore, we would get an order like "X men" < "X-men" < "Xmen" < "Xmen," since now the mutual order is not by code number order but by the original first-level weight (which is 0209 for space, 0221 for hyphen-minus, and 0227 for en dash).
Shift-trimmed: This is the same as Shifted except that all trailing FFFF weights are trimmed from the sort key. This feature is intended to simulate POSIX behavior. The effect is similar to that of Shifted, but with the strings containing variable collation elements placed after an otherwise identical string without them. For example, "Xmen" < "X men" < "X-men" < "Xmen."

Control and formatting characters, except line breaks and horizontal tabs (which count as whitespace), are completely ignored in the default order.

We have discussed some general principles of UCA only, and you need to consult UTS #10 for the detailed algorithm and notations. In practice, you will probably not work with Collation Element Tables directly. Rather, you will use higher-level constructs, such as the Collator and RuleBasedCollator classes in Java programming. These classes let you specify your modifications of the default collating order using simple, readable notations like "c < ch < d," which says that "ch" is to be treated as if it were a character between "c" and "d."