Character Sorts

   

"Let justice be done, though the heavens fall."

Attributed to an ancient Roman statesman

Sorts of character strings can be fast and wrong, or somewhat slower and less wrong, or much slower and 100 percent right. Three points along this continuum are labeled: binary sort , dictionary sort , and dictionary sort with tie-breaking . If it's a simple matter of knowing what's right and wrong, and knowing how much wrongness is tolerable, then you have some choice about which type of sort you want. Okay, maybe that's not simple. Yet we must try.

Binary sortsalso called repertoire order sorts or codeset sorts or (in IBM-speak) sorts of bit data are merely sorts of the codes that are used to store the characters . For example, the Windows 1252 code page has this order:

  • Control characters

  • Some punctuation characters

  • Digits

  • Some more punctuation characters

  • Uppercase letters

  • More punctuation characters

  • Lowercase letters

  • Yet more punctuation characters and anything with accent marks

Since digits are in order 0 to 9 and letters are in order A to Z; code page 1252 is correct if the input is carefully controlled. Suppose you have a list of people's names. Can you ensure that (a) all names begin with a capital letter and contain only one capital letter, (b) accented names like Chr tien will be entered as Chr tien, and (c) the code page is not Unicode? Then use binary sort. Binary string comparisons are twice as fast as dictionary-order string comparisons on average, and that affects sort time (GAIN: 8/8).

Dictionary sorts are much closer to what a user would expect to see in an English dictionary. For example, the basic Sybase dictionary sort has this order:

  • All punctuation, but each punctuation mark still has a different value. (Note that this is not the way real dictionaries work. They treat all punctuation the same, and multiple spaces map to one space.)

  • Digits.

  • Alphabet, with accented characters mapped to unaccented equivalents (for example ' ' is treated like 'i' ).

Dictionary sorts require a conversion step before comparison, because no code page stores characters in the desired order. It's also fairly easy to add one additional step and say that lowercase characters shall map to uppercase equivalents, which makes for a variant called case-insensitive dictionary sorts. Basic dictionary sorts are a middle ground, reasonably right but with some compromises for the sake of speed.

A Lilliputian Difficulty with Unicode

"Quinbus Flestrin being afterwards commanded to destroy and put to death not only all the Big-Endian Exiles, but likewise all the People of that Empire, who would not immediately forsake the Big-Endian Heresy ."

Jonathan Swift, Gulliver's Travels

Big-endian: A binary data transmission/storage format in which the most significant bit (or byte) comes first.

Little-endian: A binary data transmission/storage format in which the least significant bit (or byte) comes first.

A binary comparison is a byte-by-byte comparison. That's fine as long as one character equals one byte. But in Unicode, one character equals two bytes, or more. Take the Turkish character I, whose Unicode representation is 0130 hexadecimal. If storage is big-endian, it looks like this:

 01 30 

That's still fine, but if storage is little-endian, it looks like this:

 30 01 

And that's not finethis letter will incorrectly sort in front of A (Unicode 0041 hexadecimal)!

Now suppose we refuse to compromise. Let's say we've gathered a list of nouns, and proper nouns have a capital letter. The list contains words that are the same except for capitalization and accents, such as smith/Smith and naive/na ve . How can we sort this list? Before we start, we must know about two gotchas:

  • Although we want words with accents to follow words without accents, we want to find them either way. That is, the search condition WHERE word = 'naive' should return both 'naive' and 'na ve' sincefor search purposesthese are two variant ways to represent the same word; they are not DISTINCT.

  • Although we want words with lowercase letters to follow words with uppercase letters, we don't want 'smith' to follow 'Soho' . Correct order would be:

     Smith smith Soho 

For both these reasons, the following rules about accents and capitals can apply only when the unaccented and uncapitalized forms are equal. That's why the uncompromising sort method is called a dictionary sort "with tiebreaking." There can be up to three passes in this type of sort:

  • Primary sort

    The primary sort orders by the letters as mapped to some canonical formfor example, A comes before B comes before C and so on.

  • Secondary sort

    The secondary sort looks for accent distinction if primary sort values are equal.

  • Tertiary sort

    The tertiary sort looks for case distinction if primary and secondary sort values are equal.

A dictionary sort without tiebreaking is the same thing as a primary sort alone.

Indexes Too

Column definitions will affect sorting for ORDER BY, but also sorting for indexes. Watch out for two implications:

Implication #!:

If column1 has a dictionary-order sort, it won't merely mean that ORDER BY column1 is slower; it also means that all index updates are slower (GAIN: 8/8 if binary sort is used instead of dictionary sort).

Implication #2:

Because indexes exist mainly to support WHERE rather than ORDER BY, keys in indexes don't need to be secondary-sorted. They are in order by primary sort and by ROWID.

Character Sort Support

For a reality check, take a look at Table 3-2, which shows the DBMSs that support the principal modes of character ordering.

Table 3-2. DBMS Support for Character Sorts
  Case Sensitive Nonaccent Preferred (Secondary) Uppercase Preferred (Tertiary) Optional Dictionary Sort Optional Binary Sort
IBM Yes* No No No Yes*
Informix Yes* No No No Yes*
Ingres Yes* No No Yes* No
InterBase Yes* No No Yes Yes*
Microsoft No* Yes Yes Yes* Yes
MySQL No* No No Yes* Yes
Oracle Yes* Yes Yes Yes Yes*
Sybase Yes* Yes Yes No Yes*

Notes on Table 3-2:

  • In all columns , "Yes" means the DBMS supports the feature. An asterisk following ( "Yes*") means the DBMS treats the feature as the default. We call an option supported if there is any way at all to specify it in either the column definition or the SELECT statement. However, the column is

    "No" if the only way to specify the feature is by reinstalling the DBMS or rebuilding the database. For example, for Sybase we say "Yes*" under binary sort, but nowhere else, because Sybase's excellent character-sort support is presented only as an installation option that cannot be changed. Table 3-2 applies only for English, with an 8-bit character set.

  • Case-Sensitive column

    Are string comparisons case sensitive? That is, are SMITH and smith equal or are they unequal strings?

  • Nonaccent Preferred (Secondary) column

    If all other characters are equal, does the DBMS sort nonaccented characters before accented characters? That is, does naive come before na ve ?

  • Uppercase Preferred (Tertiary) column

    If all other characters are equal, does the DBMS sort uppercase letters before lowercase letters? That is, does Na ve come before na ve ?

  • Optional Dictionary Sort column

    This column is "Yes" if the DBMS can do a dictionary sort at runtime.

    • IBM and Ingres do a dictionary sort on CHAR columns by default.

    • Informix does a dictionary sort on NCHAR columns if you install with the EN_GB.8859-1 db_locale.

    • InterBase supports the SQL Standard COLLATE clause, which lets you specify various collations at runtime.

    • Microsoft does a dictionary sort on CHAR columns by default and supports the SQL Standard COLLATE clause, which lets you specify various collations at runtime.

    • MySQL does a dictionary sort on all character and TEXT columns by default.

    • Oracle provides an NLSSORT function, which lets you specify various collations at runtime.

    • Sybase has an installation option that causes the DBMS to do a dictionary sort on all character columns. If this option is chosen , Sybase won't do binary sorts.

  • Optional Binary Sort column

    This column is "Yes" if the DBMS can do a binary sort at runtime.

    • IBM supports a FOR BIT DATA attribute for character columns. CHAR() FOR BIT DATA columns are sorted with a binary sort by default.

    • Informix does a binary sort on CHAR columns by default.

    • Ingres has an installation option that lets you set up a collation to do a binary sort on all character columns. If this option is chosen, Ingres won't do dictionary sorts.

    • InterBase does a binary sort on NCHAR columns by default and supports the SQL Standard COLLATE clause, which lets you specify various collations at runtime.

    • Microsoft supports the SQL Standard COLLATE clause, which lets you specify various collations at runtime.

    • MySQL does a binary sort on BLOB columns by default and also supports a BINARY attribute for character columns. CHAR() BINARY columns are sorted with a binary sort by default.

    • Oracle does a binary sort on CHAR columns by default.

    • Sybase does a binary sort on all character columns by default. If the default isn't changed during installation, Sybase won't do dictionary sorts.

Table 3-2 shows that support for character sorts is sometimes weak. Let's see what we can do about that.

Collations

Another way to measure DBMS support of ORDER BY is to look at collations. A collation , or collating sequence , is a set of rules that determines the result when character strings are compared. In standard SQL, the default collation for a character set is the collation that will normally be used to compare and sort stringsbut you can override this by adding an explicit COLLATE clause either to a column definition or to an ORDER BY clause. For example, suppose your DBMS provides a collation called German , which handles strings according to the rules of German rather than English. If you will always want the values of a character column to be collated with the German collation, you can define the column to have German as its default collation to enforce this:

 CREATE TABLE Table1 (    german_column VARCHAR(10) COLLATE German,    ... ) 

On the other hand, if you'll only want the column's strings to be handled as German strings occasionally, you can define the column in the usual manner and just use the COLLATE clause in ORDER BY when necessary:

 CREATE TABLE Table1 (    german_column VARCHAR(10),    ... ) SELECT * FROM Table1      ORDER BY german_column COLLATE German 

In this example, collation German is an object in the catalog, but ultimately the collation depends on vendor support.

It's good to have a choice of collations for supporting rules that apply outside North American English. The Big Eight don't provide consistent support for collation syntax; however, we can observe common threads.

  • IBM

    Makes two principal collations available: a dictionary sort and a binary sort. IBM does dictionary-order string comparisons using the default character set (IBM-1252) for all character columns unless the FOR BIT DATA attribute is specified.

  • Informix

    Makes two principal collations available. The main collation is associated with CHAR/VARCHAR; the optional collation is associated with NCHAR/NCHAR VARYING. It's possible to change a CHAR's collation at runtime with:

     ... ORDER BY CAST(char_column AS NCHAR...) 

    Incidentally, it's also possible to use CAST to cast to the same character set you started with, as an alternative to the TRIM function.

  • Ingres and Sybase

    Provide no ability to change a character column's collation at runtime.

  • InterBase and Microsoft

    Support the SQL Standard COLLATE clause, with several collations available for each character set supported. Microsoft uses the Windows locale-specific API functions to support choices and thus offers excellent collation support.

  • MySQL

    Makes two principal collations available: a dictionary sort and a binary sort. MySQL does dictionary-order string comparisons using the default character set (ISO-8859-1 Latin1) for all character columns other than BLOB, unless the BINARY attribute is specified.

  • Oracle

    Provides excellent collation support by allowing you to specify collations with the Oracle-specific scalar function:

     ... NLSSORT(<string>,'NLS_SORT=<collation name>') 

In most cases, the support for secondary and tertiary sorting is weaker than in English. (Microsoft and Oracle are the exceptions.) For example, IBM, Ingres, and MySQL do not put the German "Sharp S" character ( ) in the right place. (Neither do InterBase and Oracle unless you specify the German collation; neither does Informix unless you specify db_locale EN_GB.8859-1.) And in all cases, the use of a non-English collation leads to a slowdown unless your list is very small. That's not attributable to bias, mindit's just a fact that the English alphabetization rules are simpler.

Table 3-3 shows the built-in collations supported by the Big Eight. As usual, we're showing only those options that come with the DBMS as regularly installed and are therefore always available. Built-in collations are better because the optimizer knows about them.

Collations Shouldn't Depend on Character Sets

DBMSs tend to promote the idea that a given collation can only apply to a single character set. This idea is based on the mistaken belief that characters cannot be compared with each other if they come from different sets.

How is this belief mistaken? Well, we've all heard of "the set of even numbers" and "the set of odd numbers ," and everyone knows that they are distinct setsthey have no members in common. However, everyone also knows that the expression "3 > 2" is undoubtedly true. It's possible to perform odd and even number comparisons using the rules for "the set of integers" since the even and odd sets are subsets of the set of integers.

The same applies for character sets. All character sets are subsets of The One Big Character Set, which in standard SQL is called SQL_TEXT. In practice, SQL_TEXT is usually Unicode.

Table 3-3. DBMS Support for Built-in Collations (West European)
  Spanish I Spanish II German I German II Danish/Norwegian Swedish/Finnish Icelandic Irish
IBM No Yes No No No No No No
Informix No No No No No No No No
Ingres No No No No No No No No
InterBase Yes No Yes No Yes Yes* Yes No
Microsoft Yes Yes Yes Yes Yes Yes Yes No
MySQL No No No No No Yes* No No
Oracle Yes Yes Yes Yes Yes* Yes Yes No
Sybase No No Yes No No No No No

Notes on Table 3-3:

  • In all columns, "Yes" means the DBMS provides full and correct support for the collation, "Yes*" means the DBMS's support for the collation has a flawed primary sort, and "No" means there is no support at all for the collation.

  • Spanish I and Spanish II columns

    Spanish I or "Traditional Spanish" treats CH and LL as separate letters, but Spanish II or "Modern Spanish" does not. Currently, some Latin American phone books still use Spanish I, but the latest Castilian dictionaries use Spanish II. is a separate letter in both.

  • German I and German II columns

    German I treats , , and ¼ as if they're a , o , and u , but German II treats them as ae , oe , and ue . German I is for lists of words; German II is for lists of names. For that reason, the two collations are sometimes called "German dictionary order" and "German phone book order," respectively.

  • Danish/Norwegian and Swedish/Finnish columns

    Danish/Norwegian and Swedish/Finnish collations differ at the primary level. If you see a collation labeled "Nordic" or "Scandinavian," it is inadequate.

    • InterBase's and MySQL's support for Swedish/Finnish shows a flawed primary sort. V and W should not be differentiated, but both DBMSs treat them as separate letters. [1]

      [1] A personal experienceWe presented a paper on "Finnish Collation" at the Helsinki Business Polytechnic in March 2002. During the presentation, we polled our Finnish audience on the question, Should V = W in a Finnish primary sort? Ninety-five percent of the respondents voted "Yes." But on the same day, we found that the Helsinki telephone book ignores this rule; it has separate sections for names beginning with V and with W.

    • Oracle's support for Danish/Norwegian shows a flawed primary sort. Aa should sort with after Z . Instead, Oracle puts Aa at the beginning of a list.

  • Icelandic column

    The Icelandic alphabet contains two characters more than the other Nordic alphabets.

  • Irish column

    Folk in Ireland are generally happy with the English collation, but Irish phone books have a quirk. Names that begin with M' or Mc or Mac come before all other names that begin with M . This convention was once the habit of North American phone books too, but the habit appears to be fading.

The Bottom Line: Character Sorts

Sorts of character strings can be fast and wrong, or somewhat slower and less wrong, or much slower and 100 percent right. Decide how much wrongness is tolerable in a specific situation, because you do have some choice about which type of sort you end up with.

Binary sorts are sorts of the codes that are used to store the characters. Binary sorts are fast but are prone to error unless your input is carefully controlled.

Dictionary sorts give you a result that is very close to what you'd see in an English list, which is no good if you're sorting non-English words. They're slower than binary sortsthey require a conversion step before comparisonso are a middle ground, reasonably right but with some compromises for the sake of speed.

A dictionary sort with tiebreaking is the slowestbut also the most accuratesort. These sorts take up to three passes to come up with a properly sorted list.

One way to force the right kind of sort is to specify a specific collation for the ORDER BY clause. Other options are declaring a column with a binary sort attribute, such as FOR BIT DATA with IBM and BINARY with MySQL. If all else fails, define your column as a binary string columnfor example, with Sybase:

 CREATE TABLE Table1 (    column1 BINARY(10),    ...) 
   


SQL Performance Tuning
SQL Performance Tuning
ISBN: 0201791692
EAN: 2147483647
Year: 2005
Pages: 125

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net