Collation | Inside Microsoft SQL Server 2000

The term collation refers to a set of rules that determine how data is compared and sorted. Character data is sorted using rules that define the correct sequence of characters and options that specify case-sensitivity, accent-sensitivity, and kana/width-sensitivity. During installation, you must decide which collation will be the default for your server. The value you choose determines the default code page for non-Unicode characters, the default sort order for non-Unicode characters, and the Unicode collation for your server. The collation you choose during setup determines the collation of the system databases (master, model, tempdb, msdb, and distribution).

Earlier versions of SQL Server required that you make three separate decisions regarding collation during the installation process. These decisions applied to all the data stored in all the databases on a single SQL Server and were irrevocable. To change your decision, you had to rebuild your server and completely reload all your data. The three decisions were:

The character set to use for non-Unicode characters, which was determined by the code page
The sort order to use for non-Unicode characters, which included information on case sensitivity and accent sensitivity
The sort order to use for Unicode characters, which was called the "Unicode Collation"

For SQL Server 2000, you must make these same three decisions, but you need to choose only one collation value, which includes all three of your choices. I'll talk about how to specify your choice after I discuss the issues with each of the three decisions.

Character Sets

Most non-Unicode characters are stored in SQL Server as a single byte (8 bits), which means that 256 (2⁸) different characters can be represented. But the combined total of characters in all the world's languages is more than 256. So if you don't want to store all your data as Unicode, which takes twice as much storage space, you must choose a character set that contains all the characters (referred to as the repertoire) that you need to work with. For installations in the Western Hemisphere and Western Europe, the ISO character set (also often referred to as Windows Characters, ISO 8859-1, Latin-1, or ANSI) is the default and is compatible with the character set used by all versions of Windows in those regions. (Technically, there is a slight difference between the Windows character set and ISO 8859-1.) If you choose ISO, you might want to skip the rest of this section on character sets, but you should still familiarize yourself with sort order issues.

You should also ensure that all your client workstations use a character set that is consistent with the characters used by your installation of SQL Server. If, for example, the character ¥, the symbol for the Japanese yen, is entered by a client application using the standard Windows character set, internally SQL Server stores a byte value (also known as a code point) of 165 (0xA5). If an MS-DOS-based application using code page 437 retrieves that value, that application displays the character Ñ. (MS-DOS uses the term code pages to mean character sets; you can think of the terms as interchangeable.) In both cases, the internal value of 165 is stored, but the Windows character set and the MS-DOS code page 437 render it differently. You must consider whether the ordering of characters is what is semantically expected (discussed later in the "Sort Orders" section) and whether the character is rendered on the application monitor (or other output device) as expected.

SQL Server provides services in the OLE DB provider and the ODBC driver that use Windows services to perform character set conversions. Conversions cannot always be exact, however, because by definition each character set has a somewhat different repertoire of characters. For example, there is no exact match in code page 437 for the Windows character Õ, so the conversion must give a close, but different, character.

Windows NT and Windows 2000 and SQL Server 2000 also support 2-byte characters that allow representation of virtually every character used in any language. This is known as Unicode, and it provides many benefits—with some costs. The principal cost is that 2 bytes instead of 1 are needed to store a character. SQL Server allows storage of Unicode data by using the three datatypes nchar, nvarchar, and ntext. I'll discuss these in more detail when I talk about datatypes in Chapter 7. The use of Unicode characters for certain data does not affect the character set chosen for storing the non-Unicode data, which is stored using the datatypes char, varchar, and text.

The ASCII Character Set
It is worth pointing out that the first 128 characters are the same for the character sets ISO, code page 437, and code page 850. These 128 characters make up the ASCII character set. (Standard ASCII is only a 7-bit character set. ASCII was simple and efficient to use in telecommunications because the character and a "stop bit" for synchronization could all be expressed in a single byte.) If your application uses ASCII characters but not the so-called extended characters (typically characters with diacritical marks, such as à, Å, and ä) that differentiate the upper 128 characters among these three character sets, it probably doesn't matter which character set you choose. In this situation (only), whether you choose one of these three character sets or use different character sets on your client and server machines doesn't matter because the rendering and sorting of every important character uses the same byte value in all cases.

Earlier versions of SQL Server did not support Unicode, but they did support double-byte character sets (DBCS). DBCS is a hybrid approach and is the most common way for applications to support Asian languages such as Japanese and Chinese. With DBCS encoding, some characters are 1 byte and others are 2 bytes. The first bit in the character indicates whether the character is a 1-byte or a 2-byte character. (In non-Unicode datatypes, each character is considered 1 byte for storage; in Unicode datatypes, every character is 2 bytes.) To store two DBCS characters, a field would need to be declared as char(4) instead of char(2). SQL Server correctly parses and understands DBCS characters in its string functions. The DBCSs are still available for backward compatibility, but for new applications that need more flexibility in character representation, you should consider using the Unicode datatypes exclusively.

Sort Orders

The specific character set used might not be important in many sites where most of the available character sets can represent the full range of English and Western European characters. However, in nearly every site, whether you realize it or not, the basics of sort order (which is more properly called collating sequence) is important. Sort order determines how characters compare and assign their values. It determines whether your SQL operations are case sensitive. For example, is an uppercase A considered identical to a lowercase a? If your sort order is case sensitive, these two characters are not equivalent, and SQL Server will sort A before a because the byte value for A is less than that for a. If your sort order is case insensitive, whether A sorts before or after a is unpredictable unless you also specify a preference value. Uppercase preference can be specified only with a case-insensitive sort order, and it means that although A and a are treated as equivalent for comparisons, A is sorted before a.

If your data will include extended characters, you must also decide whether your data will be accent insensitive. For our purposes, accents refer to any diacritical marks. So an accent-insensitive sort order treats a and ä as equivalent.

For certain character sets, you also need to specify two other options. Double-byte character sets can specify width insensitivity, which means that equivalent characters represented in either 1 or 2 bytes are treated the same. For a Japanese sort order, you can specify kana insensitive, which means that katakana characters are always unequal to hiragana characters. If you use only ASCII characters and no extended characters, you should simply decide whether you need case sensitivity and choose your collation settings accordingly.

Sort order affects not only the ordering of a result set but also which rows of data qualify for that result set. If a query's search condition is

 WHERE name='SMITH'

the case sensitivity determines whether a row with the name Smith qualifies.

Character matching, string functions, and aggregate functions (MIN, MAX, COUNT (DISTINCT), GROUP BY, UNION, CUBE, LIKE, and ORDER BY) all behave differently with character data depending on the sort order specified.

Sort order semantics

You can think of characters as having primary, secondary, and tertiary sort values, which are different for different sort order choices. A character's primary sort value is used to distinguish it from other characters, without regard to case and diacritical marks. It is essentially an optimization: if two characters do not have the same primary sort value, they cannot be considered identical for either comparison or sorting and there is no reason to look further. The secondary sort value distinguishes two characters that share the same primary value. If the characters share the same primary and secondary values (for example A = a), they are treated as identical for comparisons. The tertiary value allows the characters to compare identically but sort differently. Hence, based on A = a, apple and Apple are considered equal. However, Apple sorts before apple when a sort order with uppercase preference is used. If there is no uppercase preference, whether Apple or apple sorts first is simply a random event based on the order in which the data is encountered when retrieved.

For example, a sort order specification of "case insensitive, accent sensitive, uppercase preference" defines all A-like values as having the same primary sort value. A and a not only have the same primary sort value, they also have the same secondary sort value because they are defined as equal (case insensitive). The character à has the same primary value as A and a but is not declared equal to them (accent sensitive), so à has a different secondary value. And although A and a have the same primary and secondary sort values, with an uppercase preference sort order, each has a different tertiary sort value, which allows it to compare identically but sort differently. If the sort order didn't have uppercase preference, A and a would have the same primary, secondary, and tertiary sort values.

Some sort order choices allow for accent insensitivity. This means that extended characters with diacritics are defined with primary and secondary values equivalent to those without. If you want a search of name ='Jose' to find both Jose and José, you should choose accent insensitivity. Such a sort order defines all E-like characters as equal:

 E=e=è=É=é=ê=ë

NOTE
When deciding on the case sensitivity for your SQL Server databases, you should be aware that the case sensitivity applies to object names as well as user data because object names (the metadata) are stored in tables just like user data. So if you have a case-insensitive sort order, the table CustomerList is seen as identical to a table called CUSTOMERLIST and to one called customerlist. If your server is case sensitive and you have a table called CustomerList, SQL Server will not find the table if you refer to it as customerlist.

Binary sorting

As an alternative to specifying case or accent sensitivity, you can choose a binary sorting option, in which characters are sorted based on their internal byte representation. If you look at a chart for the character set, you see the characters ordered by this numeric value. In a binary sort, the characters sort according to their position by value, just as they do in the chart. Hence, by definition, a binary sort is always case sensitive and accent sensitive and every character has a unique byte value.

Binary sorting is the fastest sorting option because all that is required internally is a simple byte-by-byte comparison of the values. But if you use extended characters, binary sorting is not always semantically desirable. Characters that are A-like, such as Ä, ä, Å, and å, all sort after Z because the extended character A's are in the top 128 characters and Z is a standard ASCII character in the lower 128. If you deal with only ASCII characters (or otherwise don't care about the sort order of extended characters), you want case sensitivity, and you don't care about "dictionary" sorting, binary sorting is an ideal choice. In case-sensitive dictionary sorting, the letters abcxyzABCXYZ sort as AaBbCcXxYyZz. In binary sorting, all uppercase letters appear before any lowercase letters—for example, ABCXYZabcxyz.

Specifying a collation

During the installation of SQL Server 2000, you are asked to specify a default collation for your server. This is the collation that your system databases will use, and any new database you create will use this collation by default. In Chapter 5, I'll describe how you can create a database with a different collation than the one chosen during installation. Two types of collation are available; the documentation calls them SQL Collation and Windows Collation. Figure 4-3 shows the installation dialog box in which you choose your collation. The two choices are described during installation as Windows Locale Collation Designator and SQL Collation.

click to view at full size.

Figure 4-3. The dialog box for selecting the default collation.

SQL Collation is intended for backward compatibility and basically addresses all three collation issues in a single description string: the code page for non-Unicode data, the sort order for non-Unicode data, and the sort order for Unicode data. Figure 4-3 shows the default for an installation on a US English SQL Server, and the description of the SQL Collation is "Dictionary Order, Case-Insensitive, For Use With 1252 Character Set". The code page for non-Unicode data is 1252, the sort order for non-Unicode data is case insensitive, and the sort order for Unicode is Dictionary Order. The actual name of this collation is SQL_Latin1_General_CP1_CI_AS, and you can use this name when referring to this collation using SQL statements, as we'll see in later chapters. The prefix SQL indicates that this is a SQL Collation as opposed to a Windows Collation. The Latin1_General part indicates that the Unicode sort order corresponds to dictionary order, CP1 indicates the default code page 1252, and CI_AS indicates case insensitive, accent sensitive. Other collation names use the full code page number; only 1252 has the special abbreviation CP1. For example, SQL_Scandinavian_CP850_CS_AS is the SQL Collation using Scandinavian sort order for Unicode data, code page 850 for non-Unicode data, and a case-sensitive, accent-sensitive sort order.

The SQL Collations are intended only for backward compatibility with previous SQL Server versions. There is a fixed list of possible values for SQL Collations, and not every combination of sort orders, Unicode collations, and code pages is possible. Some combinations don't make any sense, and others are omitted just to keep the list of possible values manageable.

As an alternative to a SQL Collation for compatibility with a previous installation, you can define a collation based on a Windows locale. In Windows Collation, the Unicode data types always have the same sorting behavior as non-Unicode data types. A Windows Collation is the default for non-English SQL Server installations, with the specific default value for the collation taken from your Windows operating system configuration. You can choose to always sort by byte values for a binary sort, or you can choose case sensitivity and accent sensitivity. When using SQL code to specify a Windows collation for a database, a column, or an expression, you can also specify a case preference for sorts, and, where applicable, the sensitivity of the collation to width and kana. I'll talk about the SQL statements for specifying collations when I discuss character datatypes in Chapter 7.

Performance Considerations

Binary sorting uses significantly fewer CPU instructions than sort orders with defined sort values. So binary sorting is ideal if it fulfills your semantic needs. However, you won't pay a noticeable penalty for using either a simple case-insensitive sort order (for example, Dictionary Order, Case Insensitive) or a case-sensitive choice that offers better support than binary sorting for extended characters. Most large sorts in SQL Server tend to be I/O-bound, not CPU-bound, so the fact that fewer CPU instructions are used by binary sorting doesn't typically translate to a significant performance difference. When you sort a small amount of data that is not I/O-bound, the difference is minimal; the sort will be fast in both cases.

A more significant performance difference results if you choose a sort order of Case Insensitive, Uppercase Preference. Recall that this choice considers all values as equal for comparisons, which also includes indexing. Characters retain a unique tertiary sort order, however, so they might be treated differently by an ORDER BY clause. Uppercase Preference, therefore, can often require an additional sort operation in queries.

Consider a query that specifies WHERE last_name >= 'Jackson' ORDER BY last_name. If an index exists on the last_name field, the query optimizer will likely use it to find rows whose last_name value is greater than or equal to Jackson. If there is no need to use Uppercase Preference, the optimizer knows that because it is retrieving records based on the order of last_name, there is also no need to physically sort the rows because they were extracted in that order already. Jackson, jackson, and JACKSON are all qualifying rows. All are indexed and treated identically, and they simply appear in the result set in the order in which they were encountered. If Uppercase Preference is required for sorting, a subsequent sort of the qualifying rows is required to differentiate the rows even though the index can still be used for selecting qualifying rows. If many qualifying rows are present, the performance difference between the two cases (one needs an additional sort operation) can be dramatic. This doesn't mean that you shouldn't choose Uppercase Preference. If you require those semantics, performance might well be a secondary concern. But you should be aware of the trade-off and decide which is most important to you.

Another performance issue arises when you decide between SQL Collation and Windows Collation. Because any character data defined using a Windows Collation is sorted based on the Unicode sort table for the collation, all non-Unicode data must be converted to Unicode during the sort. I ran some simple tests with two 10,000-row tables. The only difference between the tables was in a column called last_name. In one table, last_name was defined using a SQL Collation, and in the other, it was defined using an equivalent Windows Collation. Repeated sorts of the column in both tables showed a 15 to 30 percent increase in CPU time for the sort on the Unicode column.