International Features

Glossary

  • Collation: Refers to a set of rules that determine how data is sorted and compared.

In order to store characters from a wide variety of languages, SQL Server relies on various data types. Collations also play an important role in handling multilingual data, with their ability to sort both Unicode and non-Unicode data in a consistent manner. The following sections differentiate the data types in SQL Server 2000, explain the significance of collations, and show how to retrieve information about collations.

Data Types in SQL Server 2000

Glossary

  • SQL-92: Designed by what was formerly the National Committee for Information Technology Standards (NCITS)-now known as the "Inter-National Committee for Information Technology Standards" (INCITS)-Technical Committee H2 on Database to be a standard for relational database management systems (RDBMSs).
  • Transact-SQL: A method to communicate with and access data on SQL Server. Applications that communicate with SQL Server do so by sending Transact-SQL statements to the server, regardless of an appli-cation's user interface (UI).
  • SQL Query Analyzer: A graphical user interface for designing and testing Transact-SQL statements, batches, and scripts interactively.
  • SQL Server Enterprise Manager: Allows enterprise-wide configuration and management of SQL Server and SQL Server objects.

The data types you use to store international data depend on whether you are dealing with non-Unicode data, Unicode data, date and time values, or a combination of these categories. For non-Unicode data, you would either use char, varchar, or text data types. The data types nchar, nvarchar, and ntext are used in SQL Server for Unicode data, whereas datetime and smalldatetime are used to represent date and time values. There are several areas to consider when using SQL Server 2000 data types for storing international data.

Non-Unicode Text Types: char, varchar, text

When you deal with text data that is stored in the cha r, varcha r, or text data types, the most important limitation to consider is that only information from a single code page can be stored. The exact code page depends on the collation of the column. If there is no column-level collation, the collation of the database is used. (For more information on collations, see "Collation" later in this chapter.) To determine the code page that is used for a given column, you can use the COLLA-TIONPROPERTY function, as shown in the following examples:

 SELECT COLLATIONPROPERTY('Chinese_PRC_Stroke_CI_AI_KS_WS','CodePage') 936 SELECT COLLATIONPROPERTY('Latin1_General_CI_AI','CodePage') 1252 SELECT COLLATIONPROPERTY('Hindi_CI_AI_WS','CodePage') 0 

In the last example, Hindi was added to point out that many locales (such as Georgian and Hindi) do not have code pages, since they are "Unicode-only" collations, which are not appropriate for non-Unicode data types.

Any time Unicode data must be inserted into these columns, the columns will be internally converted from Unicode using the WideCharToMultiByte application programming interface (API) and the code page associated with the collation. Any time a character cannot be represented on the given code page, it will be replaced by a question mark (?); therefore, random question marks are a good indication of data that has been corrupted due to this conversion. The appearance of question marks also signals that you should have used a Unicode data type. If you use a string literal of a non-Unicode type, it will be converted using the database's default code page (derived from its collation). If you want the string literal to be interpreted by the server as Unicode, you must put "N" before the string literal.

You might encounter another problem if you try to store data when not all of the characters you wish to support are contained in the code page. One of the best examples of this is the Arabic script. It supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language, which is the basis for Microsoft Windows code page 1256. These extra characters are thus converted into question marks if they are stored in a non-Unicode column with the Arabic collation. This problem occurs because, in many cases, Windows will consider a particular code page to be a "best fit" code page. That means there is no guarantee you will be able to rely on the code page to handle all characters, but it is still the code page with the closest match to the characters you are trying to support.

Unicode Text Types: nchar, nvarchar, ntext

The SQL-92 specification defines these "N" (stands for "national") data types, but does not specifically require them to be used for Unicode; the actual definition of these data types is left to the database platform or developer. In Microsoft SQL Server 7 and SQL Server 2000, these data types are defined as being equivalent to UCS-2. Remember, however, that this equivalency is specific to Microsoft SQL Server. When you work with other database servers (such as Sybase SQL Server), "N" data types do not necessarily mean Unicode.

For the storage of complex scripts, such as those used in Hindi and Tamil, it is important to note that the data is expected to be in the proper order. Many languages such as Tamil will actually specify that certain letters be reordered when the text is rendered, thus making the logical order of text that is stored in memory different from the visual order that will be seen in a UI. (For more information, see Chapter 5, "Text Input, Output, and Display.") Data should always be stored in the proper logical order for any complex-script language, which includes all of the Indic languages, Arabic, Farsi, Hebrew, and many others. The actual rendering of such data is a separate issue.

Although the "N" columns can indeed support data of any language or combination of languages, the actual sorting of this data can only be in a single collation. (For more information on the meaning and consequences of this issue, see "Collation" later in this chapter.) None of the code-page limitations mentioned previously applies to Unicode columns.

Date and Time Types: datetime, smalldatetime

These data types have no actual international meaning; they represent a date or time value with the following definitions:

  • datetime. Date and time data in the Gregorian calendar from January 1, 1753 through December 31, 9999, to an accuracy of one three-hundredth of a second (equivalent to 3.33 milliseconds or 0.00333 seconds).
  • smalldatetime. Date and time data in the Gregorian calendar from January 1, 1900 through June 6, 2079, with accuracy to the minute. smalldatetime values with 29.998 seconds or lower are rounded down to the nearest minute; values with 29.999 seconds or higher are rounded up to the nearest minute.

SQL Server rejects data that falls outside of these ranges. The actual data is stored internally as two integers (4-byte integers for datetime, and 2-byte integers for smalldatetime) that represent the date and time in question. Because the actual value has no actual connotations in regard to locale-specific formatting, it is up to the developer to define such conversions as needed.

SQL Server 2000 supports many different locale-specific conversions that can be performed at the server instead of relying on custom solutions from developers. These date styles can be accessed through the CONVERT function, which takes a data type, an expression, and an optional style.

The following example shows how the CONVERT function is used:

SELECT CONVERT(char, GETDATE(), 100) AS [100] Aug 16 2000 11:50AM

You can then convert the data from a string to a date value in much the same manner:

SELECT CONVERT(datetime, 'Aug 16 2000 11:50AM', 100) AS [100]

It is worth noting that when you convert such dates with Style 130 (Kuwaiti or Hijri), converting to the char data type can result in data corruption if the collation is not one of the Arabic collations that use code page 1256 for Unicode conversions. Figure 22-1 shows the results of this type of conversion.

figure 22.1 converting character encoding in sql query analyzer.

Figure 22.1 - Converting character encoding in SQL Query Analyzer.

Note that on a U.S. client computer, attempting to use the char data type results in the Arabic characters being converted into question marks, and the nchar data type rendering Arabic characters. This particular string at the top of Figure 22-1 is still not properly formatted (as it would be on an Arabic client computer) due to limitations in the SQL grid in SQL Query Analyzer.

Collation

In addition to data types, collation is another important component in handling multilingual data, whether Unicode or non-Unicode. One of the things that many people take for granted is sorted data; after all, what could be more basic than the alphabet? However, you must remember that users of non-English languages can have different sorting expectations. For example, a Spanish user might expect the letter combination "ch" to sort as a single character after the letter "h." In general, one of the most effective ways to alienate the user of an application is to be unable to correctly handle basic sorting operations.

The sorting of data is accomplished through collations, or sort orders, and a technique known as "string normalization." This type of normalization is different from the regular concept of normalization that database developers are used to, since it is a not a design issue; string normalization refers to comparing two strings so that they can be sorted, a comparison that can be optimized through the creation of indexes.

SQL Server 2000 supports objects that have different collations being stored in a single database. Separate SQL Server 2000 collations can be specified down to the level of columns, and each column in a table can be assigned different collations.

For non-Unicode columns, collation has a second meaning that is very important. Collation specifies the code page for the data and, therefore, what characters can be represented. Data can be moved between Unicode columns seamlessly, while data moved between non-Unicode columns cannot.

In SQL Server 2000, a single, consistent model was designed to handle both Unicode and non-Unicode sorts. This model supports the languages displayed in the following list.

Table 22-1 Collation in SQL Server 2000.

Collation in SQL Server 2000

Albanian

Arabic

Chinese_PRC

Chinese_PRC_Stroke

Chinese_Taiwan_Bopomofo

Chinese_Taiwan_Stroke

Croatian

Cyrillic_General

Czech

Danish_Norwegian

Estonian

Finnish_Swedish

French

Georgian_Modern_sort

German_PhoneBook

Greek

Hebrew

Hindi

Hungarian

Hungarian_Technical

Icelandic

Japanese

Japanese_Unicode

Korean_Wansung

Korean_Wansung_Unicode

Latin1_General

Latvian

Lithuanian

Lithuanian_Classic

FYRO Macedonian

Polish

Romanian

Slovak

Slovenian

Spanish (Spain)

Thai

Traditional_Spanish

Turkish

Ukrainian

Vietnamese

Each of these collations is combined with a series of suffixes that help define whether there is case, accent, kana, or width sensitivity. (Note that the default setting of kana, or width sensitivity is set to insensitive.) The exact suffixes that are possible are shown in Table 22-2. Each of the 40 languages in Table 22-1 supports the 17 suffixes in Table 22-2, for a total of 680 Windows collations.

Table 22-2 Collation suffixes.

Suffix for Collation

Meaning

_BIN

Binary sort

_CI_AI

Case-insensitive, accent-insensitive, kanatype-insensitive,width-insensitive

_CI_AI_WS

Case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive

_CI_AI_KS

Case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive

_CI_AI_KS_WS

Case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive

_CI_AS

Case-insensitive, accent-sensitive, kanatype-insensitive,width-insensitive

_CI_AS_WS

Case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive

_CI_AS_KS

Case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive

_CI_AS_KS_WS

Case-insensitive, accent-sensitive, kanatype-sensitive, width- sensitive

_CS_AI

Case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive

_CS_AI_WS

Case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive

_CS_AI_KS

Case-sensitive, accent-insensitive, kanatype-sensitive, width- insensitive

_CS_AI_KS_WS

Case-sensitive, accent-insensitive, kanatype-sensitive, width- sensitive

_CS_AS

Case-sensitive, accent-sensitive, kanatype-insensitive, width- insensitive

_CS_AS_WS

Case-sensitive, accent-sensitive, kanatype-insensitive, width- sensitive

_CS_AS_KS

Case-sensitive, accent-sensitive, kanatype-sensitive, width- insensitive

_CS_AS_KS_WS

Case-sensitive, accent-sensitive, kanatype-sensitive, width- sensitive

To ensure that code pages from earlier versions of SQL Server are properly supported, many backward-compatible, SQL-specific sort orders are also included in SQL Server 2000. These SQL-specific sort orders are listed in Table 22-3. Many of these sort orders support various parts of the suffixes listed previously, but not all of the suffixes are supported.

Table 22-3 SQL-specific sort orders.

SQL-Specific Sort Orders

SQL_1xCompat_CP850

SQL_AltDiction_CP850

SQL_AltDiction_CP1253

SQL_AltDiction_Pref_CP850

SQL_Croatian_CP1250

SQL_Czech_CP1250

SQL_Danish_Pref_CP1

SQL_EBCDIC037_CP1

SQL_EBCDIC273_CP1

SQL_EBCDIC277_CP1

SQL_EBCDIC278_CP1

SQL_EBCDIC280_CP1

SQL_EBCDIC284_CP1

SQL_EBCDIC285_CP1

SQL_Estonian_CP1257

SQL_Hungarian_CP1250

SQL_Icelandic_Pref_CP1

SQL_Latin1_General_CP1

SQL_Latin1_General_CP1251

SQL_Latin1_General_CP1253

SQL_Latin1_General_CP1254

SQL_Latin1_General_CP1255

SQL_Latin1_General_CP1256

SQL_Latin1_General_CP1257

SQL_Latin1_General_CP437

SQL_Latin1_General_CP850

SQL_Latin1_General_CP1250

SQL_Latin1_General_Pref_CP1

SQL_Latin1_General_Pref _CP437

SQL_Latin1_General_Pref _CP850

SQL_Latvian_CP1257

SQL_Lithuanian_CP1257

SQL_MixDiction_CP1253

SQL_Polish_CP1250

SQL_Romanian_CP1250

SQL_Scandinavian_CP850

SQL_Scandinavian_Pref_CP850

SQL_Slovak_CP1250

SQL_Slovenian_CP1250

SQL_SwedishPhone_Pref_CP1

SQL_SwedishStd_Pref_CP1

SQL_Ukrainian_CP1251

You can retrieve the actual information about collations by using the COLLATIONPROPERTY function. In addition to the CodePage value used previously (see "Non-Unicode Text Types: char, varchar, text " earlier in this chapter), you can also pass other information types, such as LCID, which returns the Windows locale ID (returns NULL for SQL collations). You can also specify Windows ComparisonStyle (returns NULL for both Binary and SQL collations). This information can be used to verify that there is indeed an equivalence between string normalization in Microsoft Windows 2000 and SQL Server 2000 for all of the Windows collations. All of the available collations can be returned by using the fn_helpcollations() function, as shown here:

SELECT * FROM ::fn_helpcollations()

This query returns 753 rows in SQL Server 2000. Additional collations cannot be added, unless they are added in future versions of SQL Server.

In SQL Server 2000, collations can be specified at multiple levels, including the server level, the database level, the column level, and in an expression.

At the Server Level

When installing an instance of SQL Server, the default collation of that instance can be configured during setup. The default collation of the instance also becomes the default collation of the system databases: master, model, tempdb, msdb,and distribution.

After a collation has been assigned to any object other than a column or database, you cannot change the collation except by dropping and re-creating the object. This can be a complex operation. To change the default collation for an instance of SQL Server 2000 you must:

  1. Make sure you have all of the information or scripts needed to re-create your user databases and all of the objects in them.
  2. Export all of your data using a tool such as a bulk copy.
  3. Drop all of the user databases.
  4. Rebuild the master database specifying the new collation.
  5. Create all of the databases and the objects in them.
  6. Import all of your data.

Note


Instead of changing the default collation of an instance of SQL Server 2000, you can specify a default collation for each new database you create.

To query the server collation, use the Transact-SQL SERVERPROPERTY function as shown here:

SELECT CONVERT (char, SERVERPROPERTY('collation'))

At the Database Level

When a database is created, the COLLATE clause of the CREATE DATABASE statement can be used to specify the default collation of the database. If no collation is specified during database creation, the database is assigned the default collation of the model database. The default collation of the model database is the same as the default collation of the instance of SQL Server.

The collation of the database can be changed by using the ALTER DATABASE statement. For example:

 ALTER DATABASE myDB COLLATE Greek_CS_AI 

The current collation of the database can be retrieved by utilizing the following statement:

 SELECT CONVERT (char, DATABASEPROPERTY('database_name',  'collation')) 

At the Column Level

When creating a table, collations for each character-string column can be specified by using the COLLATE clause of the CREATE TABLE statement. If no collation is explicitly specified, the column is assigned the default collation of the database.

Column collation can be changed by the ALTER TABLE statement. For example:

 Page: 1 ALTER TABLE myTable ALTER COLUMN mycol NVARCHAR(10) COLLATE  Greek_CS_AI 

In an Expression

There can be many instances when data needs to be displayed to people in different countries or regions with the required locale-appropriate sorting. SQL Server 2000 can specify collations in expressions, which allows sorting in a particular manner so that the ORDER BY clause can be language-specific. This is done by the COLLATE clause. The syntax for using the COLLATE keyword is:

 COLLATE [<Windows_Collation_name> | <SQL_Collation_Name>]. 

For example:

 SELECT name FROM customer ORDER BY name COLLATE Latin1_General_CS_AI 


Microsoft Corporation - Developing International Software
Developing International Software
ISBN: 0735615837
EAN: 2147483647
Year: 2003
Pages: 198

Similar book on Amazon

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net