International Best Practices | Developing International Software

Among the guidelines that you should follow are to always use Unicode data types-if you use multilingual data-and to standardize the site collation where possible. You should also try to deploy a Unicode server and client if you can, consider storage space and performance, and delimit identifiers for objects in system tables.

Always Use Unicode Data Types

If you must store character data that reflects multiple languages, you should always use Unicode nchar, nvarchar, and ntext data types instead of the non-Unicode char, varchar, and text data types. There are significant limitations associated with char, varchar, and text data types, since only information from a single code page can be stored.

The other benefit of using Unicode data types is performance gain, which is due to fewer code-page conversions occurring. Here are some representative scenarios:

If you are running Windows 2000 or Windows XP, the kernel is expecting Unicode data; thus non-Unicode columns will have to be converted in many cases, such as when you display data or use the operating-system services.
If you access data via Microsoft OLE DB or Microsoft Open Database Connectivity (ODBC) version 3.7 and later, these data access methods are expecting Unicode and will convert to and from Unicode and non-Unicode during transition. This conversion will consume additional resources and time.

Note

To realistically evaluate performance issues, you must test to get conclusive data about the situation.

Standardize the Site Collation Where Possible

If you always choose collations with the same case and accent sensitivity, the user will experience consistent behavior across all systems. Most SQL Server 2000 sites choose to be case-insensitive and accent-sensitive. Case sensitivity also applies to the names of SQL Server objects; therefore, if you specify case- sensitive collations, all users must specify the correct case when querying the database.

Always keep in mind that standardizing the site collation helps eliminate the need for explicitly specifying a collation in a character or Unicode expression, because the default collation in an expression can vary depending on the current database setting for the connection. This means that the result of a Transact-SQL statement can have different collations when executed in the context of different databases. If you inevitably have to work with objects that have different collations and code pages, you must code your queries to comply with the rules of collation coercion.

Deploy a Unicode Server and Client If Possible

In a very small number of cases, everything you do with a SQL Server will happen on the same computer on which the server resides, and only SQL Server tools such as SQL Query Analyzer or SQL Server Enterprise Manager will be used. In most cases, however, the server will be interacting with other servers or clients, and might be using one or more data-access standards. SQL Server has basically two types of clients:

Unicode clients-OLE DB, and ODBC versions 3.7 and later
Non-Unicode clients-ODBC version 3.6 and earlier, and DB-Library

The possible scenarios of server and client connections, along with certain areas you will need to consider, are presented in the sections that follow.

Unicode Server and Client

This is the ideal type of configuration. By keeping data in Unicode throughout the process, you can guarantee the best performance and protection from corruption of retrieved data. This is the case with Microsoft ActiveX Data Objects (ADO) and OLE DB.

Unicode Server and One or More Non-Unicode Clients

In this type of configuration, you might not have any problems storing data, but there is obviously a serious limitation when it comes to bringing the data to the client and using it. The client code page must be used to convert the Unicode data at some point.

Non-Unicode Server and Unicode Client

This is not an ideal configuration for multilingual data, because you will not be able to keep such data on the server. At the very least, however, you can be sure the data will show up properly. This configuration has all the limitations of the previous case, without the risk of corrupting received data due to invalid conversions. An example is when a SQL Server 2000 database defines a linked server to a Microsoft SQL Server 6.5 database. All information that is received from the server running SQL Server 6.5 will be valid, but do not try to insert any data that is outside of the code page!

Non-Unicode Server and Client

This is the most limiting configuration, because you are basically restricted to a single code page at all times.

Consider Storage Space and Performance

Ideally, every column is defined with one of the Unicode data types; however, doing so when you do not need to support multilingual data can create issues related to storage space and speed.

Storage Space Issues

The actual amount of space required for Unicode data types is 2 bytes per character; the amount of space for non-Unicode data types is 1 byte for all non-DBCS text and 2 bytes for Asian languages that use DBCS. Therefore, unless your data is on one of the Asian code pages, you will be using twice as much space to store the data. This must be considered when you upgrade existing databases or when you are deciding on the proper data types of new projects. If you are storing data only in a column that is on a single (non-Asian) code page, you might prefer not to use Unicode so you can save disk space and memory.

Speed Issues

In addition to storage space, you will also need to consider how performance is affected when you use Unicode data types unnecessarily. The speed issue is a complicated one. Here are some of the issues:

Indexes on columns within tables or views cannot be made collation-specific. If you use the COLLATE keyword in an "ORDER BY" expression, the SQL Server query optimizer cannot use a colllation-specific index to execute this query, thus affecting the server's performance.
Additional time is needed to load the larger amount of data when you are dealing with DBCS data.
If you are dealing with a Microsoft Windows 95 or Microsoft Windows 98 client or server, much of the information might need to be converted from Unicode when operating-system services, such as data display, are needed.
If you are working between servers, database server products, or other products, the number of conversions can also play a great role in the performance battle.
If you are dealing with Asian languages, Unicode will actually be faster than using the language-specific DBCS code page. This is because DBCS data does not have a fixed width; it is a mixture of double-byte and single-byte characters.
If you are dealing with non-Asian languages, sorting Unicode data can be slower than non-Unicode data. This can be considered one cost of being able to represent global data.

Important

To realistically evaluate a performance issue, you must test to get conclusive data about the situation.

Delimit Identifiers for Objects in System Tables

The system tables in SQL Server 2000 store all the data they contain as Unicode. This minimizes the problems that can occur with different collations between databases and columns. There is no other way to deal with the fact that different databases on the same server can have any mix of Unicode and non-Unicode column names. Even if you are supporting only a single language now, SQL Server must be ready to support any language you might choose to support in the future.

When you are converting databases and servers from SQL Server 6.5 or earlier, it is easy to become concerned about the metadata that is being converted; however, there is no need to worry. The conversion to Unicode is straightforward because these earlier versions of SQL Server are only in one code page or collation, each of which is defined at the server level.

One important issue involves using identifiers for objects in the system tables. SQL Server 2000 uses the Unicode 2.0 character property definition to create the list of valid characters in identifiers. (Unicode 3.0 was not released when SQL Server 2000 development was completed.) To avoid running into issues with international characters that are not defined in the Unicode 2.0 character property definition, you should delimit your identifiers with brackets ([]) or double quotation marks ("). This prevents the server from checking for valid characters.