Characters

The SQL Standard provides four data types for columns that contain character string data: CHAR, VARCHAR, NATIONAL CHARACTER (or NCHAR), and NATIONAL CHARACTER VARYING (or NCHAR VARYING). DBMSs that support these data types may set their own maximum lengths for each. Table 7-1 shows the SQL Standard requirements and the level of support (data type and maximum size in characters or bytes) the Big Eight have for character data types, while Table 7-2 shows how the SQL Standard and the DBMSs treat trailing spaces and data truncation for character columns.

Notes on Table 7-1:

The figures in Table 7-1 show the maximum-possible defined length of an SQL character column. For example, the largest CHAR column possible with Ingres is CHAR(2000), which lets you insert strings up to 2,000 characters long, the largest possible VARCHAR column with MySQL is VARCHAR(255), which lets you insert strings up to 255 characters long, and so on. The maximum length often depends on the page size, as with Sybase. The SQL Standard lets vendors set the maximum sizes of each data type.

Table 7-1. ANSI/DBMS Support for Character Data Types

	CHAR	VARCHAR	NCHAR	NCHAR VARYING	Other
ANSI SQL	Yes	Yes	Yes	Yes	No
IBM	254 bytes	32KB	No	No	No
Informix	32KB	255 bytes	32KB	255 bytes	2KB
Ingres	2KB	2KB	No	No	No
InterBase	32KB	32KB	32KB	32KB	No
Microsoft	8KB	8KB	8KB	8KB	No
MySQL	255 bytes	255 bytes	255 bytes	255 bytes	64KB
Oracle	2KB	4KB	2KB	4KB	No
Sybase	16KB	16KB	16KB	16KB	No

The defined length and the length in bytes is the same ifand only ifthe character size is one byte. This, of course, depends on the character set being used.
Informix calls the NCHAR VARYING data type NVARCHAR. Informix also supports LVARCHAR, for variable-length character data up to 2KB long.
MySQL also supports BLOB (for case-sensitive character values larger than the CHAR/VARCHAR maximum) and TEXT (for case-insensitive character values larger than the CHAR/VARCHAR maximum). BLOB and TEXT maxima are 65,535 characters.

Notes on Table 7-2:

Strips Trailing Space from VARCHAR column

This column is "No" if the DBMS follows the SQL Standard and does not strip trailing spaces from a string assigned to a variable-length column.

Table 7-2. ANSI/DBMS Support for Character String Operations

	Strips Trailing Space from VARCHAR	Warning on Truncate
ANSI SQL	No	Yes
IBM	No	Yes
Informix	No	No
Ingres	No	Yes
InterBase	No	Yes
Microsoft	No	Yes
MySQL	Yes	No
Oracle	No	Yes
Sybase	Yes	No

Informix does not strip trailing spaces on INSERT, but it does strip them when you SELECT from a VARCHAR column; not a very useful trait.

Warning on Truncate column

This column is "Yes" if the DBMS follows the SQL Standard and returns a truncation warning or error if you assign a too-long string to a character column.
- Sybase will provide a truncation error if SET STRING_RTRUNCATION is on .

Because we're trying to determine whether it's best to define columns with a fixed-size data type or a variable-length data type, we'll ignore the distinctions between, say, CHAR and NCHAR and concentrate on the questionShould CHAR or VARCHAR be used to define columns? The main difference between the two is that CHAR is fixed-size while VARCHAR is variable-lengtha consideration we've already dealt with. But there are five other differences to consider as well.

One: Changes to maximum size

Until recently, VARCHAR's maximum length was often shorter than CHAR's. For example, in Microsoft 6.5, a VARCHAR column couldn't be longer than 255 characters. If you're upgrading an old application, it's time to see whether more columns should be VARCHAR now.

Two: Trailing spaces

If you insert this string:

'X'

into a CHAR(5) column, the DBMS will pad with four trailing spaces and store this string:

 'X    '

On the other hand, if you insert:

 'X    '

into a VARCHAR(5) column, an ANSI SQL DBMS will keep the trailing spaces and store:

 'X    '

That is, a VARCHAR column does not lose meaningful information about trailing spaces. However, several DBMSs are not SQL Standard-compliant (see Table 7-2). With those DBMSs, it's impossible to determine how many spaces were inserted originally.

Three: Concatenation

Because of the space padding requirement, concatenating CHAR columns involves extra work. For example, suppose you have two columns, forename and surname , each defined as CHAR(20). The concatenation of forename with surname , for example:

 SELECT forename  surname   FROM Table1 ...

would result in a string that looks like this (except with MySQL and Sybase):

 'Paul             Larue        '

Probably not the result you're looking for! To get the DBMS to return the logical result:

 'Paul Larue'

you'd need to amend your query to:

 SELECT TRIM(forename)  ' '  TRIM(surname)    FROM Table1 ...

Four: Data type conversion

CHAR and VARCHAR are two different data types, so expressions of the form char_column = varchar_column involve data type conversions. Some DBMSs won't use indexes if conversions are necessary, so be consistent when defining columns that might be compared or joined.

Five: Truncation

If you make a mistake and insert ABCDE into a CHAR(4) column, don't expect to see an error or warning message. Some DBMSs will silently truncate your string to ABCD and you'll lose possibly vital data (see Table 7-2).

Length Specification

Whether you choose to use CHAR or VARCHAR, you'll also have to decide on a length specification for the column. When defining length, most people err on the side of caution. For example, they find the longest value in the original input data and then double that value for the resulting column length. Here are some length-specification hints, relevant to columns that contain names :

The longest name in the Bible is Maher-shalal-hash-baz (21 characters; Isaiah 8:1).
The longest legal URL is 63 characters.
The longest taxonomic name, Archaeosphaerodiniopsidaceae, has 28 characters. The famous long words antidisestablishmentarianism and floccinaucinihilipilification have 28 and 29 characters, respectively.
The names of American states and Canadian provinces can be abbreviated to two-letter codes, and the names of countries can be abbreviated using the three-letter ISO codes for nations. (It's best not to use the ISO two-letter nation codes because of duplication that could cause confusion. For example, the two-letter ISO code for the Cayman Islands is KY, which is also the abbreviation for Kentucky.)
That famous long Welsh name, Llanfairpwllgwyngyllgogerychwyrndrobwll Llantysiliogogogoch, can be abbreviated to Llanfair PG.

Variant Character Sets

You can use CHARACTER SET clauses to support variant character sets, but it's more convenient to use this convention:

Use CHAR/VARCHAR for the basic Western European character set (variously called ISO 8859-1 or Windows 1252 or Latin1).
Use NCHAR/NCHAR VARYING for the alternate national-language set.

The obvious alternate NCHAR/NCHAR VARYING set is Unicode, and that's what you'll always get with Microsoft, Oracle, and Sybase. Unicode is the Java default, so conversions to/from Java host variables are easy, but for most ODBC and ActiveX Data Objects (ADO) drivers, there's extra overhead. Here are some other Unicode considerations:

There are actually two different Unicode sets. One uses two bytes for each character; the other uses up to three bytes per character.
With Intel processors, 16-bit operations are penalized . For example it takes about three times longer to compare two 16-bit values than to compare two 8-bit values.
Responsible DBMSs will use Unicode for system table names.
There is no guarantee that you can use collation options with Unicode.

Speaking of collationthe fastest collation is binary. A binary collation is usually managed by either defining a column with a FOR BIT DATA/BINARY attribute or by forcing a binary collation in some other waypreferably with the ANSI SQL COLLATE clause. (For a discussion of collations, see Chapter 3, "ORDER BY.") You should not, however, use a binary collation if there are Unicode characters and the storage is big-endian.

If you do use a national-language character set, you'll be happy to know that all DBMSs except InterBase support the LOWER and UPPER functions correctly despite a strange SQL-92 requirement that only "simple Latin letters " should be converted.

The Bottom Line: Characters

Some DBMSs won't give you an error or warning message if you insert a too-large value into a character column. They'll just silently truncate, and you'll lose possibly vital data.

Use CHAR/VARCHAR for the basic Western European character set (ISO 8859-1 or Windows 1252). Use NCHAR/NCHAR VARYING for the alternate national-language set, which will often be Unicode.

In the past, VARCHAR's maximum length was often shorter than CHAR's. If you're upgrading an old application, it's time to see whether more columns should be VARCHAR now.

An SQL-compliant DBMS will pad with trailing spaces, if necessary, when inserting into a CHAR column. It will not strip trailing spaces when inserting into a VARCHAR column, so a VARCHAR column does not lose meaningful information about trailing spaces.

Because of the space padding, concatenating CHAR columns involves extra work.

Expressions of the form char_column = varchar_column involve data type conversions. Be data type consistent when defining columns that might be compared or joined.

Recommendation: Prefer VARCHAR/NCHAR VARYING for character string data.