Creating Tables

To create a table, SQL Server uses the ANSI SQL standard CREATE TABLE syntax. SQL Enterprise Manager provides a front-end, fill-in-the-blanks table designer that can make your job easier. Ultimately, the SQL syntax is always sent to SQL Server to create a table. You can create a table directly using a tool such as OSQL, ISQL, or SQL Query Analyzer; from SQL Enterprise Manager; or using a third-party data-modeling tool (such as ERwin or Microsoft Visual InterDev) that transmits the SQL syntax under the cover of a friendly interface.

In this chapter, I'll emphasize direct use of the data definition language (DDL) rather than discussing the interface tools. You should keep all DDL commands in a script so that you can run them easily at a later time to re-create the table. (Even if you use one of the friendly front-end tools, it's critical that you be able to re-create the table later.) SQL Enterprise Manager and other front-end tools can create and save operating system files with the SQL DDL commands necessary to create the object. This DDL is essentially source code, and you should treat it as such. Keep a backup copy. You should also consider keeping these files under version control using a source control product such as Microsoft Visual SourceSafe or PVCS from MERANT (formerly INTERSOLV).

At the basic level, creating a table requires little more than knowing what you want to name it, what columns it will contain, and what range of values (domain) each column will be able to store. Here's the basic syntax for creating the customer table, with three fixed-length character (char) columns. (Note that this table definition isn't necessarily the most efficient way to store data because it always requires 46 bytes per entry plus a few bytes of overhead, regardless of the actual length of the data.)

 CREATE TABLE customer
 (
     name      char(30),
     phone     char(12),
     emp_id    char(4)
 )

This example shows each column on a separate line, for readability. As far as the SQL Server parser is concerned, white spaces created by tabs, carriage returns, and the spacebar are identical. From the system's standpoint, the following CREATE TABLE example is identical to the one above; but it's harder to read from a user's standpoint:

 CREATE TABLE customer (name char(30), phone char(12), emp_id char(4)) 

We'll see many more detailed examples of tables later in the chapter.

Naming Tables and Columns

A table is always created within a database and is owned by one particular user. Normally, the owner is the user who created the table, but anyone with the sysadmin, db_ddladmin, or db_owner role can create a table that is owned by another user. A database can contain multiple tables with the same name, as long as the tables have different owners. The full name of a table has three parts, in the following form:

 database.owner.tablename 

For example, say that a user (with the username Kalen) creates a sample customer table in the pubs sample database. The three-part name of this user's table is pubs.kalen.customer. (If this user is also the database owner, the table's name will be pubs.dbo.customer, because dbo is the special username for the database owner in every database.)

The first two parts of the three-part name specification have default values. The default for the name of the database is whatever database context you're currently working in. The table owner actually has two possible defaults. If no table owner name is specified when you reference a table, SQL Server assumes that either you or the owner of the database owns the table. For example, if our hypothetical user owns the customer table and her database context is pubs, she can refer to the table simply as customer.

NOTE

To access a table owned by anyone other than yourself or the database owner, you must include the owner name along with the table name.

You should make column names descriptive, and because you'll use them repeatedly, you should avoid wordiness. The name of the column (or any object in SQL Server, such as a table or a view) can be whatever you choose, as long as it conforms to the SQL Server rule for identifiers: it must consist of a combination of 1 through 128 letters, digits, or the symbols #, $, @, or _. (For more on identifier rules, see "Using Identifiers" in SQL Server Books Online. The discussions there apply to all SQL Server object names, not just column names.)

In some cases, a table can be addressed using a four-part name, in which the first part is the name of the SQL Server. However, you can refer to a table using a four-part name only if the SQL Server has been defined as a linked server. You can read more about linked servers in SQL Server Books Online; I won't discuss them further here.

Reserved Keywords

Certain reserved keywords, such as table, create, select, and update, have special meaning to the SQL Server parser, and collectively they make up the SQL language implementation. You should avoid using reserved keywords for your object names. In addition to the SQL Server reserved keywords, the SQL-92 standard has its own list of reserved keywords. In some cases, this list is more restrictive than SQL Server's list; in other cases, it's less restrictive. SQL Server Books Online includes both lists.

Watch out for SQL-92's reserved keywords. Some of the words aren't reserved keywords in SQL Server yet, but they might become reserved keywords in a future SQL Server version. If you use a SQL-92 reserved keyword, you might end up having to alter your application before upgrading it if the word has become a SQL Server reserved keyword.

Delimited Identifiers

You can't use keywords in your object names unless you use a delimited identifier. In fact, if you use a delimited identifier, not only can you use keywords as identifiers but you can also use any other string as an object name—whether or not it follows the rules for identifiers. This includes spaces and other nonalphanumeric characters that are normally not allowed. Two types of delimited identifiers exist:

  • Bracketed identifiers are delimited by square brackets ([object name]).
  • Quoted identifiers are delimited by double quotation marks ("object name").

You can use bracketed identifiers in any environment, but to use quoted identifiers, you must enable a special option using SET QUOTED_IDENTIFIER ON. With QUOTED_IDENTIFIER on, anything enclosed in double quotation marks is interpreted as an object identifier rather than a string. To delimit string or date constants, you must use single quotes.

Let's look at some examples. Because column is a reserved keyword, the first statement that follows is illegal in all circumstances, and the second statement is illegal unless QUOTED_IDENTIFIER is on. The third statement is legal in any circumstance.

 CREATE TABLE customer(name char(30), column char(12), emp_id char(4))

 CREATE TABLE customer(name char(30), "column" char(12), emp_id char(4))

 CREATE TABLE customer(name char(30), [column] char(12), emp_id char(4))

The ODBC driver that comes with SQL Server sets the QUOTED_IDENTIFIER option to ON by default, but some of the SQL Server-specific tools set it to OFF. You can determine whether this option is on or off for your session by executing the following command:

 DBCC USEROPTIONS 

If you're using SQL Query Analyzer, you can check the setting by running the preceding command or by choosing Options from the Tools menu and examining the Connection Properties tab.

Theoretically, you can always use delimited identifiers with all object and column names, so you never have to worry about reserved keywords. However, I don't recommend this. Many third-party tools for SQL Server don't handle quoted identifiers well. Using quoted identifiers might make upgrading to future versions of SQL Server more difficult. For example, if you're using the SQL Server Upgrade Wizard to upgrade a database directly to SQL Server 2000 from SQL Server 6.5, objects such as views and stored procedures are automatically dropped and re-created so that they include the structures of the latest version. If your tables have names that were reserved keywords, they have to be re-created using quoted identifiers.

When you use quoted identifiers, upgrading can't be fully automated and you must tell the SQL Server Upgrade Wizard whether it should assume that QUOTED_IDENTIFIER is on. If you don't know, or if you have a mixture of objects, you can tell the wizard to use a mixed mode. It will then try to interpret all objects with double quotes as if the option is on; if that's unsuccessful, it will assume that the option is off for that object. This interpretation might not always result in the stored procedure or view having the same meaning as was intended.

Rather than using delimited identifiers to protect against reserved keyword problems, you should simply adopt some simple naming conventions. For example, you can precede column names with the first few letters of the table name and an underscore. This naming style makes the column or object name more readable and also greatly reduces your chances of encountering a keyword or reserved word conflict.
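For example, the customer table shown earlier might look like this under such a convention (the cust_ prefix and exact column names here are purely illustrative):

 -- The table-name prefix keeps column names readable in multitable
 -- queries and steers clear of reserved keyword conflicts.
 CREATE TABLE customer
 (
     cust_name    char(30),
     cust_phone   char(12),
     cust_emp_id  char(4)
 )

A name like cust_name will never collide with a keyword, and its origin is obvious wherever it appears.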

Naming Conventions

Many organizations and multiuser development projects adopt standard naming conventions. This is generally good practice. For example, assigning a standard moniker of cust_id to represent a customer number in every table clearly shows that all the tables share common data. If an organization instead uses several monikers in the tables to represent a customer number, such as cust_id, cust_num, customer_number, and customer_#, it won't be as obvious that these monikers represent common data.

One naming convention is the Hungarian-style notation for column names. Hungarian-style notation is a widely used practice in C programming, whereby variable names include information about their datatypes. This notation uses names such as sint_nn_custnum to indicate that the custnum column is a small integer (smallint of 2 bytes) and is NOT NULL (doesn't allow nulls). Although this practice makes good sense in C programming, it defeats the datatype independence that SQL Server provides; therefore, I recommend against using it.

Suppose, for example, that after the table is built and applications have been written, you discover that the custnum column requires a 4-byte integer (int) instead of a 2-byte small integer. You can use the ALTER TABLE command to modify the datatype of the existing table to int instead of smallint. SQL Server stored procedures handle the different datatypes automatically. Applications using ODBC, OLE DB, or DB-Library that bind the retrieved column to a character or integer datatype will be unaffected. The applications would need to change if they bound the column to a small integer variable because the variable's type would need to be larger. Therefore, you should try not to be overly conservative with variable datatypes, especially in your client applications. You should be most concerned with the type on the server side; the type in the application can be larger and will automatically accommodate smaller values. By overloading the column name with datatype information, which is readily available from the system catalogs, the insulation from the underlying datatype is compromised. If you changed the datatype from a smallint to an int, the Hungarian-style name would no longer accurately reflect the column definition. Changing the column name would then result in the need to change application code, stored procedures, or both.
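As a sketch of the scenario just described (assuming a hypothetical customer table whose custnum column was created as smallint), the server-side change is a single statement:

 -- Widen the custnum column from smallint to int. A Hungarian-style
 -- name such as sint_nn_custnum would now describe the column incorrectly.
 ALTER TABLE customer
     ALTER COLUMN custnum int NOT NULL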

Datatypes

SQL Server provides many datatypes, most of which I'll discuss in this section. Choosing the appropriate datatype is simply a matter of mapping the domain of values you need to store to the corresponding datatype. In choosing datatypes, you want to avoid wasting storage space while allowing enough space for a sufficient range of possible values over the life of your application.

Datatype Synonyms

SQL Server syntactically accepts both the synonyms and the base datatypes shown in Table 6-1, but internally it uses only the base datatype. For example, you can define a column as character(1), character, or char(1), and SQL Server will accept any of them as valid syntax. Internally, however, the expression is considered char(1), and subsequent querying of the SQL Server system catalogs for the datatype will show it as char(1), regardless of the syntax that you used when you created it.

Table 6-1. SQL Server datatype synonyms.

 Synonym                          Mapped to System Datatype
 binary varying                   varbinary
 char varying                     varchar
 character                        char(1)
 character(n)                     char(n)
 character varying(n)             varchar(n)
 dec                              decimal
 double precision                 float
 float[(n)] for n = 1 to 7        real
 float[(n)] for n = 8 to 15       float
 integer                          int
 national character(n)            nchar(n)
 national char(n)                 nchar(n)
 national character varying(n)    nvarchar(n)
 national char varying(n)         nvarchar(n)
 national text                    ntext
 numeric                          decimal
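One way to see this mapping for yourself is to create a throwaway table using a synonym and then check the catalog; this sketch uses the INFORMATION_SCHEMA.COLUMNS view:

 -- The column is declared with the synonym character...
 CREATE TABLE synonym_test (col1 character)

 -- ...but the catalog reports the base datatype, char, with length 1.
 SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_NAME = 'synonym_test'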

Choosing a Datatype

The decision about what datatype to use for each column depends primarily on the nature of the data the column will hold. There are five basic datatypes in SQL Server 2000: numeric, character, datetime, LOB (large object), and miscellaneous. SQL Server 2000 also supports a variant datatype called sql_variant. Values stored in a sql_variant column can be of any datatype. I'll discuss LOB and sql_variant columns later in this chapter when I discuss the internal storage of data on a page. I'll also provide more details about several of the datatypes later in the book when I illustrate programming techniques using the various datatypes. In this section, I'll examine some of the issues related to storing data of different datatypes.

Numeric Datatypes

You should use numeric datatypes for data on which you want to perform numeric comparisons or arithmetic operations. Your main decisions are the maximum range of possible values you want to be able to store and the accuracy you need. The tradeoff is that datatypes that can store a greater range of values take up more space.

Numeric datatypes can also be classified as either exact or approximate. Exact numeric values are guaranteed to store exact representations of your numbers. Approximate numeric values have a far greater range of values, but the values are not guaranteed to be stored precisely. Exact numeric values can store data in the range -10^38 + 1 to 10^38 - 1. Unless you need numbers with greater magnitude, I recommend that you don't use the approximate numeric datatypes.

The exact numeric datatypes can be divided into two groups: integers and decimals. Integer types range in size from 1 to 8 bytes, with a corresponding increase in the range of possible values. The money and smallmoney datatypes are frequently included among the integer types because internally they are stored in the same way. For the money and smallmoney types, it is just understood that the rightmost four digits are after the decimal point. For the other integer types, no digits come after the decimal point. Table 6-2 lists the integer datatypes along with their storage size and range of values.

Table 6-2. SQL Server integer datatypes.

 Datatype      Range                                                   Storage (bytes)
 bigint        -2^63 to 2^63 - 1                                       8
 int           -2^31 to 2^31 - 1                                       4
 smallint      -2^15 to 2^15 - 1                                       2
 tinyint       0 to 255                                                1
 money         -922,337,203,685,477.5808 to 922,337,203,685,477.5807,
               with accuracy of one ten-thousandth of a monetary unit  8
 smallmoney    -214,748.3648 to 214,748.3647, with accuracy of
               one ten-thousandth of a monetary unit                   4

The decimal and numeric datatypes allow quite a high degree of accuracy as well as a large range of values. For these two synonymous datatypes, you can specify a precision (the total number of digits stored) and a scale (the maximum number of digits to the right of the decimal point). The maximum number of digits that can be stored to the left of the decimal point is (precision - scale). Two different decimal values can have the same precision and very different ranges. For example, a column defined as decimal(8,4) can store values from -9999.9999 to 9999.9999, and a column defined as decimal(8,0) can store values from -99,999,999 to 99,999,999.
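Here's a quick sketch of that difference (the table name is hypothetical):

 CREATE TABLE decimal_demo
 (
     d1 decimal(8,4),   -- accepts -9999.9999 through 9999.9999
     d2 decimal(8,0)    -- accepts -99,999,999 through 99,999,999
 )

 INSERT decimal_demo VALUES (9999.9999, 99999999)  -- succeeds
 -- INSERT decimal_demo VALUES (10000, 0)          -- fails: overflows d1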

Table 6-3 shows the storage space required for decimal and numeric data based on the defined precision.

Table 6-3. SQL Server decimal and numeric datatype storage requirements.

 Precision    Storage (bytes)
 1 to 9       5
 10 to 19     9
 20 to 28     13
 29 to 38     17

Datetime Datatypes

SQL Server supports two datatypes for storing date and time information: datetime and smalldatetime. Again, the difference between these types is the range of possible dates and the number of bytes needed for storage. Both types have a date component and a time component. The range and storage requirements are shown in Table 6-4. If no date is supplied, the default of January 1, 1900 is assumed; if no time is supplied, the default of 00:00:00.000 (midnight) is assumed. In Chapter 7, I'll address many issues dealing with how SQL Server interprets datetime values entered by an application and how you can control the formatting of the displayed datetime value.

Internally, datetime and smalldatetime values are stored completely differently from how you enter them or how they are displayed. They are stored as two separate components, a date component and a time component. The date is stored as the number of days before or after the base date of January 1, 1900. For datetime values, the time is stored as the number of clock ticks after midnight, with each tick representing 3.33 milliseconds, or 1/300 of a second. For smalldatetime values, the time is stored as the number of minutes after midnight. You can actually see these two parts if you convert a datetime value to a binary string of 8 hexadecimal bytes. The first four hexadecimal bytes are the number of days before or after the base date, and the second four bytes are the number of clock ticks after midnight. You can then convert these four-byte hexadecimal strings to integers. I'll provide more details about hexadecimal strings later in the chapter.

The following example shows how to see the component parts of the current date and time, which is available using the parameterless system function current_timestamp. The example stores the current date and time in a local variable so that we know we're using the same value for both computations. I'll talk about local variables in detail in Chapter 10.

 DECLARE @today datetime
 SELECT @today = current_timestamp
 SELECT @today
 SELECT CONVERT (varbinary(8), @today)
 SELECT CONVERT (int, SUBSTRING (CONVERT (varbinary(8), @today), 1, 4))
 SELECT CONVERT (int, SUBSTRING (CONVERT (varbinary(8), @today), 5, 4))

Table 6-4. SQL Server datetime datatypes.

 Datatype        Range                                               Storage (bytes)
 datetime        January 1, 1753, through December 31, 9999,
                 with an accuracy of three-hundredths of a second    8
 smalldatetime   January 1, 1900, through June 6, 2079,
                 with an accuracy of one minute                      4

Character Datatypes

Character datatypes come in four varieties. They can be fixed-length or variable-length single-byte character strings (char and varchar) or fixed-length or variable-length Unicode character strings (nchar and nvarchar). Unicode character strings need two bytes for each stored character; use them when you need to represent characters that can't be stored in the single-byte characters that are sufficient for storing most of the characters in the English and European alphabets. Single-byte character strings can store up to 8000 characters, and Unicode character strings can store up to 4000 characters. You should know the type of data you'll be dealing with in order to decide between single-byte and double-byte character strings.

Deciding to use a variable-length or a fixed-length datatype is a more difficult decision, and it isn't always straightforward or obvious. As a general rule, variable-length datatypes are most appropriate when you expect significant variation in the size of the data for a column and when the data in the column won't be frequently changed.

Using variable-length datatypes can yield important storage savings. They can sometimes result in a minor performance loss, and at other times they can result in improved performance. A row with variable-length columns requires special offset entries to be maintained internally; these entries keep track of the actual length of each such column. Calculating and maintaining the offsets takes a few more addition and subtraction operations than a pure fixed-length row requires, but this extra overhead is generally inconsequential and by itself would rarely make a measurable difference on any system.

Another potential performance issue with variable-length columns is the cost of increasing the size of a row on an almost-full page. If a row with variable-length columns uses only part of its maximum length and is later updated to a longer length, the enlarged row might no longer fit on the same page. If the table has a clustered index, the row must stay in the same position relative to the other rows, so the solution is to split the page and move some of the rows from the page with the enlarged row onto a newly linked page. This can be an expensive operation. If the table has no clustered index, the row can move to a new location and leave a forwarding pointer in the original location. I'll talk about the details of page splitting and moving rows when I discuss data modification operations in Chapter 9.

On the other hand, using variable-length columns can sometimes improve performance because they can allow more rows to fit on a page. But the efficiency results from more than simply requiring less disk space. A data page for SQL Server is 8 KB (8192 bytes), of which 8096 bytes are available to store data. (The rest is for internal use to keep track of structural information about the page and the object to which it belongs.) One I/O operation brings back the entire page. If you can fit 80 rows on a page, a single I/O operation brings back 80 rows. But if you can fit 160 rows on a page, one I/O operation is essentially twice as efficient. In operations that scan for data and return lots of adjacent rows, this can amount to a significant performance improvement. The more rows you can fit per page, the better your I/O and cache-hit efficiency will be.

For example, consider a simple customer table. Suppose you could define it in two ways, fixed-length and variable-length, as shown in Figures 6-1 and 6-2.

Figure 6-1. A customer table with all fixed-length columns.

Figure 6-2. A customer table with variable-length columns.

Columns that contain addresses, names, or URLs all have data that varies significantly in length. Let's look at the differences between choosing fixed-length columns vs. choosing variable-length columns. In Figure 6-1, which uses all fixed-length columns, every row uses 304 bytes for data regardless of the number of characters actually inserted in the row. SQL Server also needs an additional 10 bytes of overhead for every row in this table, so each row needs a total of 314 bytes for storage. But let's say that even though the table must accommodate addresses and names up to the specified size, the average entry is only half the maximum size.

In Figure 6-2, assume that for all the variable-length (varchar) columns, the average entry is actually only about half the maximum. Instead of a row length of 304 bytes, the average length is 184 bytes. This length is computed as follows: the smallint and char(2) columns total 4 bytes. The varchar columns' maximum total length is 300, half of which is 150 bytes. And a 2-byte overhead exists for each of nine varchar columns, for 18 bytes. Add 2 more bytes for any row that has one or more variable-length columns. In addition, such rows require another 10 bytes of overhead, regardless of the presence of variable-length fields. (This is the same 10 bytes of overhead needed in the case of all fixed-length columns; in other words, all rows have these same 10 bytes of constant overhead.) So the total is 4 + 150 + 18 + 2 + 10, or 184. (I'll discuss the actual meaning of each of these bytes of overhead later in this chapter.)

In the fixed-length example in Figure 6-1, you always fit 25 rows on a data page (8096/314, discarding the remainder). In the variable-length example in Figure 6-2, you can fit an average of 44 rows per page (8096/184). The table using variable-length columns will consume about half as many pages in storage, a single I/O operation will retrieve almost twice as many rows, and a page cached in memory is twice as likely to contain the row you're looking for.

When choosing lengths for columns, don't be wasteful—but don't be cheap, either. Allow for future needs, and realize that if the additional length doesn't change how many rows will fit on a page, the additional size is free anyway. Consider again the examples in Figures 6-1 and 6-2. The cust_id was declared as a smallint, meaning that its maximum positive value is 32,767 (unfortunately, SQL Server doesn't provide any unsigned int or unsigned smallint datatypes), and it consumes 2 bytes of storage. Although 32,767 customers might seem like a lot to a new company, the company might be surprised by its own success and, in a couple of years, find out that 32,767 is too limited.

The database designers might regret that they tried to save 2 bytes and didn't simply make the datatype an int, using 4 bytes but with a maximum positive value of 2,147,483,647. They'll be especially disappointed if they realize they didn't really save any space. If you compute the rows-per-page calculations just discussed, increasing the row size by 2 bytes, you'll see that the same number of rows still fit on a page. The additional 2 bytes are free—they were simply wasted space before. They never cause fewer rows per page in the fixed-length example, and they'll rarely cause fewer rows per page even in the variable-length case.

So which strategy wins? Potentially better update performance? Or more rows per page? Like most questions of this nature, no one answer is right. It depends on your application. If you understand the tradeoffs, you'll be able to make the best choice. Now that you know the issues, this general rule merits repeating: variable-length datatypes are most appropriate when you expect significant variation in the size of the data for that column and when the column won't be updated frequently.

Miscellaneous Datatypes

I'll end this part of the discussion by showing you a few additional datatypes that you might have use for.

Binary These datatypes are binary and varbinary. They are used to store strings of bits, and the values are entered and displayed using their hexadecimal (hex) representation, which is indicated by a prefix of 0x. So a hex value of 0x270F corresponds to a decimal value of 9999 and a bit string of 10011100001111. In hex, each two displayed characters represent a byte, so the value 0x270F represents 2 bytes. You need to decide whether you want your data to be fixed or variable length, and you can use some of the same considerations I discussed for deciding between char and varchar to make your decision. The maximum length of binary or varbinary data is 8000 bytes.
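You can verify the 0x270F example directly:

 SELECT CONVERT(int, 0x270F)           -- returns 9999
 SELECT CONVERT(varbinary(4), 9999)    -- returns 0x0000270F, the 4-byte form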

Bit The bit datatype can store a 0 or a 1 and can consume only a single bit of storage space. However, if there is only one bit column in a table, it will take up a whole byte. Up to 8 bit columns are stored in a single byte.

Large Object These datatypes are text, ntext, and image. The text datatype can store up to 2^31 - 1 non-Unicode characters, ntext can store up to 2^30 - 1 (half as many) Unicode characters, and image can store up to 2^31 - 1 bytes of binary data. We'll look at the storage mechanisms used for these large object (LOB) datatypes later in the chapter.

Cursor The cursor datatype can hold a reference to a cursor. Although you can't declare a column in a table to be of type cursor, I've included cursor in this list for completeness. I'll talk about cursors and cursor variables in detail in Chapter 13.

Rowversion The rowversion datatype is a synonym in SQL Server 2000 for what was formerly called a timestamp. I like the new name because people always assumed that the timestamp datatype had something to do with dates or times, and it doesn't. A column of type rowversion holds an internal sequence number that SQL Server automatically updates every time the row is modified. The value of any rowversion column is actually unique within an entire database, and a table can have only one column of type rowversion. Any operation that modifies any rowversion column in the database generates the next sequential value. The actual value stored in a rowversion column is seldom important by itself. The column is used to detect whether a row has been modified since the last time it was accessed by finding out whether the rowversion value has changed.
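Here's a minimal sketch (the table and column names are hypothetical) showing the value advancing on every modification:

 CREATE TABLE rv_demo
 (
     id   int        NOT NULL,
     val  char(10)   NOT NULL,
     rv   rowversion    -- maintained by SQL Server; you can't insert or update it directly
 )

 INSERT rv_demo (id, val) VALUES (1, 'first')
 SELECT rv FROM rv_demo WHERE id = 1    -- note the value

 UPDATE rv_demo SET val = 'second' WHERE id = 1
 SELECT rv FROM rv_demo WHERE id = 1    -- the value has changed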

Sql_variant The sql_variant datatype allows a column to hold values of any other datatype except text, ntext, image, or rowversion (timestamp). I'll describe the internal storage of sql_variant data later in this chapter, and I'll show you some examples of using sql_variant data in Chapter 7.

Table The table datatype can be used to store the result of a function and can be used as the datatype of local variables. Columns in tables cannot be of type table. I'll talk about the table datatype in more detail in Chapter 7 when I discuss user-defined functions.

Uniqueidentifier The uniqueidentifier datatype is sometimes referred to as a globally unique identifier (GUID) or universal unique identifier (UUID). A GUID or UUID is a 128-bit (16-byte) value generated in a way that, for all practical purposes, guarantees uniqueness worldwide, even among unconnected computers. It is becoming an important way to identify data, objects, software applications, and applets in distributed systems. Since I won't be talking about uniqueidentifier data anywhere else in the book, I'll give you a bit more detail about it here.

The Transact-SQL language supports the system function NEWID, which can be used to generate a uniqueidentifier value. A column or variable of datatype uniqueidentifier can be initialized to a value in one of the following two ways:

  • Using the system-supplied function NEWID.
  • Using a string constant in the following form (32 hexadecimal digits separated by hyphens): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. (Each x is a hexadecimal digit in the range 0 through 9 or a through f.)
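Here's a minimal sketch of both approaches (the variable names are arbitrary, and the GUID constant is merely an example of the required form):

 DECLARE @guid1 uniqueidentifier, @guid2 uniqueidentifier

 -- 1. Using the NEWID function
 SET @guid1 = NEWID()

 -- 2. Using a string constant of the required form
 SET @guid2 = '6F9619FF-8B86-D011-B42D-00C04FC964FF'

 SELECT @guid1, @guid2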

This datatype can be quite cumbersome to work with, and the only operations that are allowed against a uniqueidentifier value are comparisons (=, <>, <, >, <=, >=) and checking for NULL. However, using this datatype internally can offer several advantages.

One advantage is that the values are guaranteed to be globally unique for any machine on a network because the last six bytes of a uniqueidentifier value make up the node number for the machine. On a machine without a NIC (such as a home computer that connects to the Internet via modem), the node is a pseudorandom 48-bit value that isn't guaranteed to be unique now but is highly likely to be unique in the near future.

Another advantage is that the list of uniqueidentifier values can't be exhausted. This is not the case with other datatypes frequently used as unique identifiers. In fact, SQL Server uses this datatype internally for row-level merge replication. A uniqueidentifier column can have a special property called the ROWGUIDCOL property; at most one uniqueidentifier column per table can have this property. The ROWGUIDCOL property can be specified as part of the column definition in CREATE TABLE and ALTER TABLE ADD column or can be added or dropped for an existing column using ALTER TABLE ALTER COLUMN.

You can reference a uniqueidentifier column with the ROWGUIDCOL property using the keyword ROWGUIDCOL in a query. This is similar to referencing an identity column using the IDENTITYCOL keyword. The ROWGUIDCOL property does not imply any automatic value generation, and if automatic value generation is needed, the NEWID function should be defined as the default value of the column. You can have multiple uniqueidentifier columns per table, but only one of them can have the ROWGUIDCOL property. You can use the uniqueidentifier datatype for whatever reason you come up with, but if you're using one to identify the current row, an application must have a generic way to ask for it without needing to know the column name. That's what the ROWGUIDCOL property does.
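Putting those pieces together, here's a hypothetical sketch: ROWGUIDCOL marks the row identifier, NEWID as the default supplies values, and the ROWGUIDCOL keyword retrieves the column without naming it:

 CREATE TABLE orders_demo
 (
     order_guid  uniqueidentifier DEFAULT NEWID() ROWGUIDCOL NOT NULL,
     order_desc  varchar(30)      NOT NULL
 )

 INSERT orders_demo (order_desc) VALUES ('first order')

 -- ROWGUIDCOL refers to the marked column, much as IDENTITYCOL
 -- refers to an identity column.
 SELECT ROWGUIDCOL FROM orders_demo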

Much Ado About NULL

The issue of whether to allow NULL has become an almost religious one for many in the industry, and no doubt the discussion here will outrage a few people. However, my intention isn't to engage in a philosophical debate. Pragmatically, dealing with NULL brings added complexity to the storage engine because SQL Server keeps a special bitmap in every row to indicate which nullable columns actually are NULL. If NULLs are allowed, SQL Server must decode this bitmap for every row accessed. Allowing NULL also adds complexity in application code, which can often lead to bugs. You must always add special logic to account for the case of NULL.

As the database designer, you might understand the nuances of NULL and three-valued logic in aggregate functions, when doing joins, and when searching by values. However, you must consider whether your development staff also understands. I recommend, if possible, that you use all NOT NULL columns and define default values (discussed later in this chapter) for missing or unknown entries (and possibly make such character columns varchar if the default value is significantly different in size from the typical entered value).

In any case, it's good practice to explicitly declare NOT NULL or NULL when creating a table. If no such declaration exists, SQL Server assumes NOT NULL. (In other words, no NULLs are allowed.) However, you can set the default to allow NULLs by using a session setting or a database option. The ANSI SQL standard says that if neither is specified, NULL should be assumed, but as I mentioned, this isn't SQL Server's default. If you script your DDL and then run it against another server that has a different default setting, you'll get different results if you don't explicitly declare NULL or NOT NULL in the column definition.
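A sketch of that practice, using the customer table from earlier in the chapter:

 -- Nullability is stated explicitly for every column, so this table builds
 -- identically regardless of any session or database default. The 'Unknown'
 -- default (an arbitrary choice here) stands in for missing phone numbers.
 CREATE TABLE customer
 (
     name    char(30)     NOT NULL,
     phone   varchar(12)  NOT NULL DEFAULT 'Unknown',
     emp_id  char(4)      NOT NULL
 )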

Several database options and session settings can control SQL Server's behavior regarding NULL values. You can set database options using the ALTER DATABASE command, as I showed you in Chapter 5. And you can enable session settings for one connection at a time using the SET command.

NOTE

The database option ANSI null default corresponds to the two session settings ANSI_NULL_DFLT_ON or ANSI_NULL_DFLT_OFF. When the ANSI null default database option is false (the default setting for SQL Server), new columns created with the ALTER TABLE and CREATE TABLE statements are, by default, NOT NULL if the nullability status of the column isn't explicitly specified. SET ANSI_NULL_DFLT_OFF and SET ANSI_NULL_DFLT_ON are mutually exclusive options that indicate whether the database option should be overridden. When on, each option forces the opposite option off. Neither option, when set off, turns the opposite option on—it only discontinues the current on setting.

You use the function GETANSINULL to determine the default nullability for your current session. This function returns 1 when new columns allow null values and the column or datatype nullability wasn't explicitly defined when the table was created or altered. I strongly recommend declaring NULL or NOT NULL explicitly when you create a column. This removes all ambiguity and ensures that you're in control of how the table will be built, regardless of the default nullability setting.
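Checking takes one line:

 SELECT GETANSINULL()   -- 1 means new columns default to allowing NULLs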

WARNING

Although internally SQL Server's default behavior is to not allow NULLs unless they are specifically declared in the CREATE TABLE statement, you might never see this behavior in action. Because SQL Query Analyzer—the basic tool for submitting SQL code to SQL Server—is ODBC-based, it automatically turns on the ANSI_NULL_DFLT_ON option. This setting means that all your new columns will allow NULLs by default. I can't overemphasize that your best bet for avoiding confusion is to always state explicitly in your table definition whether NULLs should be allowed.

The database option concat null yields null corresponds to the session setting SET CONCAT_NULL_YIELDS_NULL. When CONCAT_NULL_YIELDS_NULL is on, concatenating a NULL value with a string yields a NULL result. For example, SELECT 'abc' + NULL yields NULL. When SET CONCAT_NULL_YIELDS_NULL is off, concatenating a NULL value with a string yields the string itself. In other words, the NULL value is treated as an empty string. For example, SELECT 'abc' + NULL yields abc. If the session level setting isn't specified, the value of the database option concat null yields null applies.
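You can see both behaviors in a single session:

 SET CONCAT_NULL_YIELDS_NULL ON
 SELECT 'abc' + NULL    -- returns NULL

 SET CONCAT_NULL_YIELDS_NULL OFF
 SELECT 'abc' + NULL    -- returns abc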

The database option ANSI nulls corresponds to the session setting SET ANSI_NULLS. When true, all comparisons to a null value evaluate to FALSE. When this option is set to false, comparisons of non-Unicode values to a null value evaluate to TRUE if both values are NULL. In addition, when this option is set to true, your code must use the condition IS NULL to determine whether a column has a NULL value. When this option is set to false, SQL Server allows "= NULL" as a synonym for "IS NULL" and "<> NULL" as a synonym for "IS NOT NULL."

You can see this behavior yourself by looking at the titles table in the pubs database. The titles table has two rows with a NULL price. The first batch of statements that follows, when executed from SQL Query Analyzer, should return two rows, and the second batch should return no rows.

 -- First batch will return 2 rows
 USE pubs
 SET ANSI_NULLS OFF
 GO
 SELECT * FROM titles WHERE price = NULL
 GO

 -- Second batch will return no rows
 USE pubs
 SET ANSI_NULLS ON
 GO
 SELECT * FROM titles WHERE price = NULL
 GO

A fourth session setting is ANSI_DEFAULTS. Setting this to ON is a shortcut for enabling both ANSI_NULLS and ANSI_NULL_DFLT_ON as well as other session settings not related to NULL handling. SQL Server's ODBC driver and the SQL Server OLE DB provider automatically set ANSI_DEFAULTS to ON. You can change the ANSI_NULLS setting when you define your data source name (DSN). Two of the client tools supplied with SQL Server (SQL Query Analyzer and the text-based OSQL) use the SQL Server ODBC driver but then internally turn off some of these options. To see which options are enabled for the tool you're using, you can run the following command:

 DBCC USEROPTIONS 

Here's a sample of the output it might return:

 Set Option                 Value
 -------------------------- -------------------------
 textsize                   64512
 language                   us_english
 dateformat                 mdy
 datefirst                  7
 ansi_null_dflt_on          SET
 ansi_warnings              SET
 ansi_padding               SET
 ansi_nulls                 SET
 concat_null_yields_null    SET

WARNING

The relationship between the database options and the corresponding session level SET options is not clearly documented for the concat null yields null and ANSI nulls options. The SET option always overrides the database option, whether the SET option has been turned ON or OFF. The only time the database option applies is when the SET option has not been touched. So you can think of the ANSI_NULLS and CONCAT_NULL_YIELDS_NULL options as having three possible values: ON, OFF and "never been set." If the SET option is either ON or OFF, the database option is ignored. Since the ODBC driver (used by both SQL Query Analyzer and the OSQL command tool) as well as the OLE DB provider for SQL Server set both of these options to ON, the database option is never taken into account.

In addition, the command DBCC USEROPTIONS shows you which options have been set to ON, but it doesn't show you which options have been explicitly set to OFF. I recommend that you try to write your queries so that it doesn't matter which option is in effect. When that is impossible, you should rely on the session-level option only and not the database-level option.

The database compatibility level, which you set with the system procedure sp_dbcmptlevel, controls two additional aspects of how SQL Server handles NULL values. If the compatibility level is set to 70 or 80, the nullability of bit columns without explicit nullability is determined by either the session setting of SET ANSI_NULL_DFLT_ON or SET ANSI_NULL_DFLT_OFF or the database setting of ANSI null default. In 60 or 65 compatibility mode, bit columns created without an explicit NULL or NOT NULL option in CREATE TABLE or ALTER TABLE are created as NOT NULL.

The database compatibility level also controls whether SQL Server interprets an empty string (two single quotes with nothing between them) as either a single space or a true empty string. In compatibility level 60 or 65, SQL Server interprets empty strings as single spaces. If the compatibility level is 70 or 80, SQL Server interprets empty strings as truly empty—that is, a character string with no characters in it. Sometimes this empty string is referred to as a NULL, but SQL Server doesn't treat it like a NULL. SQL Server marks NULLs internally as NULLs, but an empty string is actually stored as a variable-length character field of 0 length.
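You can confirm this in a database at compatibility level 70 or 80:

 SELECT DATALENGTH('')    -- returns 0: a true empty string
 SELECT DATALENGTH(' ')   -- returns 1: a single space
 SELECT CASE WHEN '' IS NULL THEN 'NULL' ELSE 'not NULL' END   -- returns not NULL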

In 60 or 65 compatibility mode, the empty string is interpreted as a single space in INSERT or assignment statements on varchar data. When you are concatenating varchar, char, or text data, the empty string is interpreted as a single space. This means that you can never have a truly empty string. The only alternative in 60 or 65 compatibility mode is to define the field as allowing NULLs and to use a NULL in place of an empty string.

As you can see, you can configure and control the treatment and behavior of NULL values in several ways, and you might think it would be impossible to keep track of all the variations. If you try to control every aspect of NULL handling separately within each individual session, you can cause immeasurable confusion and even grief. However, most of the issues become moot if you follow a few basic recommendations:

  • Never allow NULL values in your tables.
  • Include a specific NOT NULL qualification in your table definitions.
  • Make sure all your databases are running in 80 compatibility mode.

If you must use NULLs in some cases, you can minimize problems by always following the same rules, and the easiest rules to follow are the ones that ANSI already specifies.


