Glossary
In order to store characters from a wide variety of languages, SQL Server relies on various data types. Collations also play an important role in handling multilingual data, with their ability to sort both Unicode and non-Unicode data in a consistent manner. The following sections differentiate the data types in SQL Server 2000, explain the significance of collations, and show how to retrieve information about collations.
Glossary
The data types you use to store international data depend on whether you are dealing with non-Unicode data, Unicode data, date and time values, or a combination of these categories. For non-Unicode data, you would either use char, varchar, or text data types. The data types nchar, nvarchar, and ntext are used in SQL Server for Unicode data, whereas datetime and smalldatetime are used to represent date and time values. There are several areas to consider when using SQL Server 2000 data types for storing international data.
When you deal with text data that is stored in the cha r, varcha r, or text data types, the most important limitation to consider is that only information from a single code page can be stored. The exact code page depends on the collation of the column. If there is no column-level collation, the collation of the database is used. (For more information on collations, see "Collation" later in this chapter.) To determine the code page that is used for a given column, you can use the COLLA-TIONPROPERTY function, as shown in the following examples:
SELECT COLLATIONPROPERTY('Chinese_PRC_Stroke_CI_AI_KS_WS','CodePage') 936 SELECT COLLATIONPROPERTY('Latin1_General_CI_AI','CodePage') 1252 SELECT COLLATIONPROPERTY('Hindi_CI_AI_WS','CodePage') 0
In the last example, Hindi was added to point out that many locales (such as Georgian and Hindi) do not have code pages, since they are "Unicode-only" collations, which are not appropriate for non-Unicode data types.
Any time Unicode data must be inserted into these columns, the columns will be internally converted from Unicode using the WideCharToMultiByte application programming interface (API) and the code page associated with the collation. Any time a character cannot be represented on the given code page, it will be replaced by a question mark (?); therefore, random question marks are a good indication of data that has been corrupted due to this conversion. The appearance of question marks also signals that you should have used a Unicode data type. If you use a string literal of a non-Unicode type, it will be converted using the database's default code page (derived from its collation). If you want the string literal to be interpreted by the server as Unicode, you must put "N" before the string literal.
You might encounter another problem if you try to store data when not all of the characters you wish to support are contained in the code page. One of the best examples of this is the Arabic script. It supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language, which is the basis for Microsoft Windows code page 1256. These extra characters are thus converted into question marks if they are stored in a non-Unicode column with the Arabic collation. This problem occurs because, in many cases, Windows will consider a particular code page to be a "best fit" code page. That means there is no guarantee you will be able to rely on the code page to handle all characters, but it is still the code page with the closest match to the characters you are trying to support.
The SQL-92 specification defines these "N" (stands for "national") data types, but does not specifically require them to be used for Unicode; the actual definition of these data types is left to the database platform or developer. In Microsoft SQL Server 7 and SQL Server 2000, these data types are defined as being equivalent to UCS-2. Remember, however, that this equivalency is specific to Microsoft SQL Server. When you work with other database servers (such as Sybase SQL Server), "N" data types do not necessarily mean Unicode.
For the storage of complex scripts, such as those used in Hindi and Tamil, it is important to note that the data is expected to be in the proper order. Many languages such as Tamil will actually specify that certain letters be reordered when the text is rendered, thus making the logical order of text that is stored in memory different from the visual order that will be seen in a UI. (For more information, see Chapter 5, "Text Input, Output, and Display.") Data should always be stored in the proper logical order for any complex-script language, which includes all of the Indic languages, Arabic, Farsi, Hebrew, and many others. The actual rendering of such data is a separate issue.
Although the "N" columns can indeed support data of any language or combination of languages, the actual sorting of this data can only be in a single collation. (For more information on the meaning and consequences of this issue, see "Collation" later in this chapter.) None of the code-page limitations mentioned previously applies to Unicode columns.
These data types have no actual international meaning; they represent a date or time value with the following definitions:
SQL Server rejects data that falls outside of these ranges. The actual data is stored internally as two integers (4-byte integers for datetime, and 2-byte integers for smalldatetime) that represent the date and time in question. Because the actual value has no actual connotations in regard to locale-specific formatting, it is up to the developer to define such conversions as needed.
SQL Server 2000 supports many different locale-specific conversions that can be performed at the server instead of relying on custom solutions from developers. These date styles can be accessed through the CONVERT function, which takes a data type, an expression, and an optional style.
The following example shows how the CONVERT function is used:
SELECT CONVERT(char, GETDATE(), 100) AS [100] Aug 16 2000 11:50AM
You can then convert the data from a string to a date value in much the same manner:
SELECT CONVERT(datetime, 'Aug 16 2000 11:50AM', 100) AS [100]
It is worth noting that when you convert such dates with Style 130 (Kuwaiti or Hijri), converting to the char data type can result in data corruption if the collation is not one of the Arabic collations that use code page 1256 for Unicode conversions. Figure 22-1 shows the results of this type of conversion.
Figure 22.1 - Converting character encoding in SQL Query Analyzer.
Note that on a U.S. client computer, attempting to use the char data type results in the Arabic characters being converted into question marks, and the nchar data type rendering Arabic characters. This particular string at the top of Figure 22-1 is still not properly formatted (as it would be on an Arabic client computer) due to limitations in the SQL grid in SQL Query Analyzer.
In addition to data types, collation is another important component in handling multilingual data, whether Unicode or non-Unicode. One of the things that many people take for granted is sorted data; after all, what could be more basic than the alphabet? However, you must remember that users of non-English languages can have different sorting expectations. For example, a Spanish user might expect the letter combination "ch" to sort as a single character after the letter "h." In general, one of the most effective ways to alienate the user of an application is to be unable to correctly handle basic sorting operations.
The sorting of data is accomplished through collations, or sort orders, and a technique known as "string normalization." This type of normalization is different from the regular concept of normalization that database developers are used to, since it is a not a design issue; string normalization refers to comparing two strings so that they can be sorted, a comparison that can be optimized through the creation of indexes.
SQL Server 2000 supports objects that have different collations being stored in a single database. Separate SQL Server 2000 collations can be specified down to the level of columns, and each column in a table can be assigned different collations.
For non-Unicode columns, collation has a second meaning that is very important. Collation specifies the code page for the data and, therefore, what characters can be represented. Data can be moved between Unicode columns seamlessly, while data moved between non-Unicode columns cannot.
In SQL Server 2000, a single, consistent model was designed to handle both Unicode and non-Unicode sorts. This model supports the languages displayed in the following list.
Table 22-1 Collation in SQL Server 2000.
Collation in SQL Server 2000 | ||
Albanian | Arabic | Chinese_PRC |
Chinese_PRC_Stroke | Chinese_Taiwan_Bopomofo | Chinese_Taiwan_Stroke |
Croatian | Cyrillic_General | Czech |
Danish_Norwegian | Estonian | Finnish_Swedish |
French | Georgian_Modern_sort | German_PhoneBook |
Greek | Hebrew | Hindi |
Hungarian | Hungarian_Technical | Icelandic |
Japanese | Japanese_Unicode | Korean_Wansung |
Korean_Wansung_Unicode | Latin1_General | Latvian |
Lithuanian | Lithuanian_Classic | FYRO Macedonian |
Polish | Romanian | Slovak |
Slovenian | Spanish (Spain) | Thai |
Traditional_Spanish | Turkish | Ukrainian |
Vietnamese |
Each of these collations is combined with a series of suffixes that help define whether there is case, accent, kana, or width sensitivity. (Note that the default setting of kana, or width sensitivity is set to insensitive.) The exact suffixes that are possible are shown in Table 22-2. Each of the 40 languages in Table 22-1 supports the 17 suffixes in Table 22-2, for a total of 680 Windows collations.
Table 22-2 Collation suffixes.
Suffix for Collation | Meaning |
_BIN | Binary sort |
_CI_AI | Case-insensitive, accent-insensitive, kanatype-insensitive,width-insensitive |
_CI_AI_WS | Case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive |
_CI_AI_KS | Case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive |
_CI_AI_KS_WS | Case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive |
_CI_AS | Case-insensitive, accent-sensitive, kanatype-insensitive,width-insensitive |
_CI_AS_WS | Case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive |
_CI_AS_KS | Case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive |
_CI_AS_KS_WS | Case-insensitive, accent-sensitive, kanatype-sensitive, width- sensitive |
_CS_AI | Case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive |
_CS_AI_WS | Case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive |
_CS_AI_KS | Case-sensitive, accent-insensitive, kanatype-sensitive, width- insensitive |
_CS_AI_KS_WS | Case-sensitive, accent-insensitive, kanatype-sensitive, width- sensitive |
_CS_AS | Case-sensitive, accent-sensitive, kanatype-insensitive, width- insensitive |
_CS_AS_WS | Case-sensitive, accent-sensitive, kanatype-insensitive, width- sensitive |
_CS_AS_KS | Case-sensitive, accent-sensitive, kanatype-sensitive, width- insensitive |
_CS_AS_KS_WS | Case-sensitive, accent-sensitive, kanatype-sensitive, width- sensitive |
To ensure that code pages from earlier versions of SQL Server are properly supported, many backward-compatible, SQL-specific sort orders are also included in SQL Server 2000. These SQL-specific sort orders are listed in Table 22-3. Many of these sort orders support various parts of the suffixes listed previously, but not all of the suffixes are supported.
Table 22-3 SQL-specific sort orders.
SQL-Specific Sort Orders | ||
SQL_1xCompat_CP850 | SQL_AltDiction_CP850 | SQL_AltDiction_CP1253 |
SQL_AltDiction_Pref_CP850 | SQL_Croatian_CP1250 | SQL_Czech_CP1250 |
SQL_Danish_Pref_CP1 | SQL_EBCDIC037_CP1 | SQL_EBCDIC273_CP1 |
SQL_EBCDIC277_CP1 | SQL_EBCDIC278_CP1 | SQL_EBCDIC280_CP1 |
SQL_EBCDIC284_CP1 | SQL_EBCDIC285_CP1 | SQL_Estonian_CP1257 |
SQL_Hungarian_CP1250 | SQL_Icelandic_Pref_CP1 | SQL_Latin1_General_CP1 |
SQL_Latin1_General_CP1251 | SQL_Latin1_General_CP1253 | SQL_Latin1_General_CP1254 |
SQL_Latin1_General_CP1255 | SQL_Latin1_General_CP1256 | SQL_Latin1_General_CP1257 |
SQL_Latin1_General_CP437 | SQL_Latin1_General_CP850 | SQL_Latin1_General_CP1250 |
SQL_Latin1_General_Pref_CP1 | SQL_Latin1_General_Pref _CP437 | SQL_Latin1_General_Pref _CP850 |
SQL_Latvian_CP1257 | SQL_Lithuanian_CP1257 | SQL_MixDiction_CP1253 |
SQL_Polish_CP1250 | SQL_Romanian_CP1250 | SQL_Scandinavian_CP850 |
SQL_Scandinavian_Pref_CP850 | SQL_Slovak_CP1250 | SQL_Slovenian_CP1250 |
SQL_SwedishPhone_Pref_CP1 | SQL_SwedishStd_Pref_CP1 | SQL_Ukrainian_CP1251 |
You can retrieve the actual information about collations by using the COLLATIONPROPERTY function. In addition to the CodePage value used previously (see "Non-Unicode Text Types: char, varchar, text " earlier in this chapter), you can also pass other information types, such as LCID, which returns the Windows locale ID (returns NULL for SQL collations). You can also specify Windows ComparisonStyle (returns NULL for both Binary and SQL collations). This information can be used to verify that there is indeed an equivalence between string normalization in Microsoft Windows 2000 and SQL Server 2000 for all of the Windows collations. All of the available collations can be returned by using the fn_helpcollations() function, as shown here:
SELECT * FROM ::fn_helpcollations()
This query returns 753 rows in SQL Server 2000. Additional collations cannot be added, unless they are added in future versions of SQL Server.
In SQL Server 2000, collations can be specified at multiple levels, including the server level, the database level, the column level, and in an expression.
When installing an instance of SQL Server, the default collation of that instance can be configured during setup. The default collation of the instance also becomes the default collation of the system databases: master, model, tempdb, msdb,and distribution.
After a collation has been assigned to any object other than a column or database, you cannot change the collation except by dropping and re-creating the object. This can be a complex operation. To change the default collation for an instance of SQL Server 2000 you must:
Note
To query the server collation, use the Transact-SQL SERVERPROPERTY function as shown here:
SELECT CONVERT (char, SERVERPROPERTY('collation'))
When a database is created, the COLLATE clause of the CREATE DATABASE statement can be used to specify the default collation of the database. If no collation is specified during database creation, the database is assigned the default collation of the model database. The default collation of the model database is the same as the default collation of the instance of SQL Server.
The collation of the database can be changed by using the ALTER DATABASE statement. For example:
ALTER DATABASE myDB COLLATE Greek_CS_AI
The current collation of the database can be retrieved by utilizing the following statement:
SELECT CONVERT (char, DATABASEPROPERTY('database_name', 'collation'))
When creating a table, collations for each character-string column can be specified by using the COLLATE clause of the CREATE TABLE statement. If no collation is explicitly specified, the column is assigned the default collation of the database.
Column collation can be changed by the ALTER TABLE statement. For example:
Page: 1 ALTER TABLE myTable ALTER COLUMN mycol NVARCHAR(10) COLLATE Greek_CS_AI
There can be many instances when data needs to be displayed to people in different countries or regions with the required locale-appropriate sorting. SQL Server 2000 can specify collations in expressions, which allows sorting in a particular manner so that the ORDER BY clause can be language-specific. This is done by the COLLATE clause. The syntax for using the COLLATE keyword is:
COLLATE [<Windows_Collation_name> | <SQL_Collation_Name>].
For example:
SELECT name FROM customer ORDER BY name COLLATE Latin1_General_CS_AI