Creating Tables


To create a table, SQL Server uses the ANSI SQL standard CREATE TABLE syntax. SQL Server Management Studio provides a front-end, fill-in-the-blanks table designer that can sometimes make your job easier. Ultimately, the SQL syntax is always sent to SQL Server to create a table, no matter what interface you use. In this chapter, I'll emphasize direct use of the data definition language (DDL) rather than discuss the graphical interface tools. You should keep all DDL commands in a script so you can run them easily at a later time to re-create the table. (Even if you use one of the friendly front-end tools, it's critical that you be able to re-create the table later.) SQL Server Management Studio and other front-end tools can create and save operating system files using the SQL DDL commands necessary to create any object. This DDL is essentially source code, and you should treat it as such. Keep a backup copy. You should also consider keeping these files under version control using a source control product such as Microsoft Visual SourceSafe.

At the basic level, creating a table requires little more than knowing what you want to name it, what columns it will contain, and what range of values (domain) each column can store. Here's the basic syntax for creating the customer table in the dbo schema, with three fixed-length character (char) columns. (Note that this table definition isn't necessarily the most efficient way to store data because it always requires 46 bytes per entry for data plus a few bytes of overhead, regardless of the actual length of the data.)

CREATE TABLE dbo.customer
(
    name         char(30),
    phone        char(12),
    emp_id       char(4)
)


This example shows each column on a separate line, for readability. As far as the SQL Server parser is concerned, white space created by tabs, carriage returns, and spaces is treated identically. From the system's standpoint, the following CREATE TABLE example is identical to the preceding one, but it's harder for a person to read:

CREATE TABLE customer (name char(30), phone char(12), emp_id char(4))


Naming Tables and Columns

A table is always created within one schema of one database. Tables also have owners, but unlike in earlier versions of SQL Server, the table owner is not used to access the table. The schema is used for all object access. Normally, a table is created in the default schema of the user who is creating it, but the CREATE TABLE statement can indicate the schema in which the object is to be created. A user can create a table only in a schema for which the user has ALTER permissions. Any user in the sysadmin, db_ddladmin, or db_owner roles can create a table in any schema. A database can contain multiple tables with the same name, as long as the tables are in different schemas. The full name of a table has three parts, in the following form:

database.schema.tablename

The first two parts of the three-part name specification have default values. The default for the name of the database is whatever database context you're currently working in. The table schema actually has two possible defaults when querying. If no schema name is specified when you reference a table, SQL Server first checks for an object in your default schema. If there is no such table in your default schema, SQL Server then checks to see if there is an object of the specified name in the dbo schema.
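A short example illustrates this resolution order. Here I assume the current user's default schema is a hypothetical schema named sales that already exists in the database:

```sql
-- Assumes the current user's default schema is sales (hypothetical).
CREATE TABLE sales.customer (name char(30))
CREATE TABLE dbo.customer   (name char(30))

SELECT name FROM customer       -- resolves to sales.customer first
SELECT name FROM dbo.customer   -- explicitly references the dbo table
```

If sales.customer did not exist, the first SELECT would fall through to dbo.customer.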

Note

To access a table in a schema other than your default schema or the dbo schema, you must include the schema name along with the table name. In fact, you should get in the habit of always including the schema name when referring to any object in SQL Server 2005. Not only does this remove any possible confusion about which schema you are interested in, but it can also yield some performance benefits.


You should make column names descriptive, and because you'll use them repeatedly, you should avoid wordiness. The name of the column (or any object in SQL Server, such as a table or a view) can be whatever you choose, as long as it conforms to the SQL Server rule for regular identifiers: it must consist of a combination of 1 through 128 letters, digits, or the symbols #, $, @, or _. (Alternatively, you can use a delimited identifier that includes any characters you like. For more about identifier rules, see "Using Identifiers" in Books Online. The discussion there applies to all SQL Server object names, not just column names.)

In some cases, you can access a table using a four-part name, in which the first part is the name of the SQL Server instance. However, you can refer to a table using a four-part name only if the SQL Server instance has been defined as a linked server. You can read more about linked servers in Books Online; I won't discuss them further here.
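As a sketch, assuming a linked server named SALESSRV has already been defined (the server name is hypothetical; the table is the customer table from earlier in the chapter), a four-part reference looks like this:

```sql
-- SALESSRV is a hypothetical linked server that must already be defined.
SELECT name, phone
FROM SALESSRV.testdb.dbo.customer
```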

Reserved Keywords

Certain reserved keywords, such as TABLE, CREATE, SELECT, and UPDATE, have special meaning to the SQL Server parser, and collectively they make up the SQL language implementation. You should avoid using reserved keywords for your object names. In addition to the SQL Server reserved keywords, the SQL-92 standard has its own list of reserved keywords. In some cases, this list is more restrictive than the SQL Server list; in other cases, it's less restrictive. Books Online includes both lists.

Watch out for the SQL-92 reserved keywords. Some of the words aren't reserved keywords in SQL Server yet, but they might become reserved keywords in a future SQL Server version. If you use a SQL-92 reserved keyword, you might end up having to alter your application before upgrading it if the word becomes a SQL Server reserved keyword.

Delimited Identifiers

You can't use keywords in your object names unless you use a delimited identifier. In fact, if you use a delimited identifier, not only can you use keywords as identifiers but you can also use any other string as an object name, whether or not it follows the rules for identifiers. This includes spaces and other non-alphanumeric characters that are normally not allowed. Two types of delimited identifiers exist:

  • Bracketed identifiers, which are delimited by square brackets ([object name])

  • Quoted identifiers, which are delimited by double quotation marks ("object name")

You can use bracketed identifiers in any environment, but to use quoted identifiers, you must enable a special option using SET QUOTED_IDENTIFIER ON. If you turn on QUOTED_IDENTIFIER, double quotes are interpreted as referencing an object. To delimit string or date constants, you must use single quotes.

Let's look at some examples. Because column is a reserved keyword, the first statement that follows is illegal in all circumstances. The second statement is illegal unless QUOTED_IDENTIFIER is on. The third statement is legal in any circumstance.

CREATE TABLE dbo.customer(name char(30), column char(12), emp_id char(4))

CREATE TABLE dbo.customer(name char(30), "column" char(12), emp_id char(4))

CREATE TABLE dbo.customer(name char(30), [column] char(12), emp_id char(4))


The SQL Native Client ODBC driver and SQL Native Client OLE DB Provider for SQL Server automatically set QUOTED_IDENTIFIER to ON when connecting. You can configure this in ODBC data sources, ODBC connection attributes, or OLE DB connection properties. You can determine whether this option is on or off for your session by executing the following query:

SELECT quoted_identifier
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID


A result value of 1 indicates that QUOTED_IDENTIFIER is ON. If you're using SQL Server Management Studio, you can check the setting by running the preceding command in a query window or by choosing Options from the Tools menu and then expanding the Query Execution/SQL Server node and examining the ANSI Properties information, as shown in Figure 6-2.

Figure 6-2. Examining the ANSI properties for a connection in SQL Server Management Studio


Theoretically, you can always use delimited identifiers with all object and column names, so you never have to worry about reserved keywords. However, I don't recommend this. Many third-party tools for SQL Server don't handle quoted identifiers well, and they can make your code difficult to read. Using quoted identifiers might also make upgrading to future versions of SQL Server more difficult.

Rather than using delimited identifiers to protect against reserved keyword problems, you should simply adopt some simple naming conventions. For example, you can precede column names with the first few letters of the table name and an underscore. This naming style makes the column or object name more readable and also greatly reduces your chances of encountering a keyword or reserved word conflict.
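For example, a customer table might prefix every column with cust_, as in the following sketch (the column names are illustrative):

```sql
-- The cust_ prefix keeps names readable and avoids collisions
-- with reserved keywords such as name or state.
CREATE TABLE dbo.customer
(
    cust_id    int          NOT NULL,
    cust_name  varchar(50)  NULL,
    cust_state char(2)      NULL
)
```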

Naming Conventions

Many organizations and multi-user development projects adopt standard naming conventions. This is generally a good practice. For example, assigning a standard moniker of cust_id to represent a customer number in every table clearly shows that all the tables share common data. If an organization instead uses several monikers in the tables to represent a customer number, such as cust_id, cust_num, customer_number, and customer_#, it won't be as obvious that these monikers represent common data.

One naming convention is the Hungarian-style notation for column names. Hungarian-style notation is a widely used practice in C programming, whereby variable names include information about their data types. This notation uses names such as sint_nn_custnum to indicate that the custnum column is a small integer (smallint of 2 bytes) and is NOT NULL (doesn't allow nulls). Although this practice makes good sense in C programming, it defeats the data type independence that SQL Server provides; therefore, I recommend against using it. (SQL Server 2005 gives you the ability to create DDL triggers, so you can consider a trigger for CREATE TABLE that enforces your chosen naming conventions. DDL triggers are discussed in Inside Microsoft SQL Server 2005: T-SQL Programming.)

Data Types

SQL Server provides many data types, most of which are straightforward. Choosing the appropriate data type is simply a matter of mapping the domain of values you need to store to the corresponding data type. In choosing data types, you want to avoid wasting storage space while allowing enough space for a sufficient range of possible values over the life of your application. Some problematic issues can arise when you work with a few of the available data types, and these issues are covered in detail in Inside Microsoft SQL Server 2005: T-SQL Programming. In this volume, I'll just cover some of the basic issues related to dealing with the various data types.

Choosing a Data Type

The decision about what data type to use for each column depends primarily on the nature of the data the column will hold and the operations you will want to perform on the data. The five basic data type categories in SQL Server 2005 are numeric, character, date and time, LOB (large object), and miscellaneous. SQL Server 2005 also supports a variant data type called sql_variant. Values stored in a sql_variant column can be of almost any data type. I'll discuss LOB and sql_variant columns later in this chapter when I discuss the internal storage of data on a page. Inside Microsoft SQL Server 2005: T-SQL Programming provides details and examples of working with many of the more interesting or possibly problematic data types. In this section, I'll examine some of the issues related to storing data of different data types.

Numeric Data Types

You should use numeric data types for data on which you want to perform numeric comparisons or arithmetic operations. Your main decisions are the maximum range of possible values you want to be able to store and the accuracy you need. The tradeoff is that data types that can store a greater range of values take up more space.

Numeric data types can also be classified as either exact or approximate. Exact numeric data types are guaranteed to store exact representations of your numbers. Approximate numeric data types have a far greater range of values, but the values are not guaranteed to be stored precisely. The greatest range that the exact numeric data types can cover is -10^38 + 1 to 10^38 - 1. Unless you need numbers of greater magnitude, I recommend that you not use the approximate numeric data types.

The exact numeric data types can be divided into two groups: integers and decimals. Integer types range in size from 1 to 8 bytes, with a corresponding increase in the range of possible values. The money and smallmoney data types are frequently included among the integer types because internally they are stored in the same way. For the money and smallmoney data types, it is just understood that the rightmost four digits are after the decimal point. For the other integer types, no digits come after the decimal point. Table 6-1 lists the integer data types along with their storage size and range of values.
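You can see the fixed four-digit scale of the money data type by assigning a value with more decimal places; SQL Server rounds it to the nearest ten-thousandth. (This is a quick sketch; the variable name is arbitrary.)

```sql
DECLARE @m money
SELECT @m = $1.23456            -- more than four decimal places
SELECT @m                       -- rounds to 1.2346
-- Conceptually, the stored value is the integer 12346:
-- the monetary amount scaled by 10,000.
SELECT CONVERT(bigint, @m * 10000)
```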

Table 6-1. SQL Server Integer Data Types

Data Type    Range                                                     Storage (Bytes)
bigint       -2^63 to 2^63 - 1                                         8
int          -2^31 to 2^31 - 1                                         4
smallint     -2^15 to 2^15 - 1                                         2
tinyint      0 to 255                                                  1
money        -922,337,203,685,477.5808 to 922,337,203,685,477.5807,    8
             with accuracy of one ten-thousandth of a monetary unit
smallmoney   -214,748.3648 to 214,748.3647, with accuracy of one       4
             ten-thousandth of a monetary unit


The decimal and numeric data types allow quite a high degree of accuracy as well as a large range of values. For those two synonymous data types, you can specify a precision (the total number of digits stored) and a scale (the maximum number of digits to the right of the decimal point). The maximum number of digits that can be stored to the left of the decimal point is (precision - scale). Two different decimal values can have the same precision and very different ranges. For example, a column defined as decimal(8,4) can store values from -9999.9999 to 9999.9999, and a column defined as decimal(8,0) can store values from -99,999,999 to 99,999,999.
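A quick sketch shows both declarations side by side (the table name is hypothetical):

```sql
CREATE TABLE dbo.decimal_test
(
    d1 decimal(8,4),   -- range -9999.9999 to 9999.9999
    d2 decimal(8,0)    -- range -99,999,999 to 99,999,999
)
INSERT dbo.decimal_test VALUES (1234.5678, 12345678)   -- succeeds
-- INSERT dbo.decimal_test VALUES (12345.0, 0)
-- would fail with an arithmetic overflow error, because decimal(8,4)
-- allows only four digits to the left of the decimal point.
```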

Table 6-2 shows the storage space required for decimal and numeric data based on the defined precision.

Table 6-2. SQL Server Decimal and Numeric Data Type Storage Requirements

Precision    Storage (Bytes)
1 to 9       5
10 to 19     9
20 to 28     13
29 to 38     17


Date and Time Data Types

SQL Server supports two data types for storing date and time information: datetime and smalldatetime. Again, the difference between these types is the range of possible dates and the number of bytes needed for storage. Both types have both a date and a time component. The range and storage requirements are shown in Table 6-3. If no date is supplied, the default of January 1, 1900, is assumed; if no time is supplied, the default of 00:00:00.000 (midnight) is assumed. Inside Microsoft SQL Server 2005: T-SQL Programming addresses many issues related to how SQL Server interprets datetime values entered by an application and how you can control the formatting of the displayed datetime value.

Table 6-3. SQL Server Date and Time Data Types

Data Type      Range                                                    Storage (Bytes)
datetime       January 1, 1753, through December 31, 9999, with an      8
               accuracy of one three-hundredth of a second
smalldatetime  January 1, 1900, through June 6, 2079, with an           4
               accuracy of 1 minute


Internally, datetime and smalldatetime values are stored completely differently from how you enter them or how they are displayed. They are stored as two separate components, a date component and a time component. The date is stored as the number of days before or after the base date of January 1, 1900. For datetime values, the time is stored as the number of clock ticks after midnight, with each tick representing 3.33 milliseconds, or 1/300 of a second. For smalldatetime values, the time is stored as the number of minutes after midnight. You can actually see these two parts if you convert a datetime value to a binary string of 8 hexadecimal bytes. The first 4 hexadecimal bytes are the number of days before or after the base date, and the second 4 bytes are the number of clock ticks after midnight. You can then convert these 4-byte hexadecimal strings to integers. I'll provide more details about hexadecimal strings later in the chapter.

The following example shows how to see the component parts of the current date and time by using the parameterless system function CURRENT_TIMESTAMP. The example stores the current date and time in a local variable so we can know we're using the same value for both computations.

DECLARE @today datetime
SELECT @today = CURRENT_TIMESTAMP
SELECT @today
SELECT CONVERT (varbinary(8), @today)
SELECT CONVERT (int, SUBSTRING (CONVERT (varbinary(8), @today), 1, 4))
SELECT CONVERT (int, SUBSTRING (CONVERT (varbinary(8), @today), 5, 4))


Character Data Types

Character data types come in four varieties. They can be fixed-length or variable-length strings of single-byte characters (char and varchar) or fixed-length or variable-length strings of Unicode characters (nchar and nvarchar). Unicode character strings need two bytes for each stored character; use them when you need to represent characters that can't be stored in the single-byte characters that are sufficient for storing most of the characters in the English and European alphabets. Single-byte character strings can store up to 8,000 characters, and Unicode character strings can store up to 4,000 characters. You should know the type of data you'll be dealing with in order to decide between single-byte and double-byte character strings. Keep in mind that the catalog view sys.types reports length in number of bytes, not in number of characters. In SQL Server 2005, you can also define a variable-length character string with a MAX length. Columns defined as varchar(max) will be treated as normal variable-length columns when the actual length is less than or equal to 8,000 bytes, and they will be treated as a large object value (discussed later in this section) when the actual length is greater than 8,000 bytes.
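You can observe the storage differences directly with the DATALENGTH function, which returns the number of bytes (not characters) in a value. A minimal sketch:

```sql
DECLARE @c char(30), @v varchar(30), @n nvarchar(30)
SELECT @c = 'abc', @v = 'abc', @n = N'abc'
SELECT DATALENGTH(@c)   -- 30: fixed length, padded with trailing spaces
SELECT DATALENGTH(@v)   -- 3:  variable length, actual data size only
SELECT DATALENGTH(@n)   -- 6:  Unicode needs 2 bytes per character
```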

Deciding whether to use a variable-length or a fixed-length data type is a more difficult decision, and it isn't always straightforward or obvious. As a general rule, variable-length data types are most appropriate when you expect significant variation in the size of the data for a column and when the data in the column won't be frequently changed.

Using variable-length data types can yield important storage savings, and depending on the workload it can result in either a minor performance loss or a performance gain. A row with variable-length columns requires special offset entries that keep track of the actual length of each such column. Calculating and maintaining these offsets takes a few extra addition and subtraction operations compared with a pure fixed-length row, which needs no offsets. However, this extra overhead is generally inconsequential and alone would not make a significant difference on most systems, if any.

Another potential performance issue with variable-length fields is the cost of increasing the size of a row on an almost full page. If a row with variable-length columns uses only part of its maximum length and is later updated to a longer length, the enlarged row might no longer fit on the same page. If the table has a clustered index, the row must stay in the same position relative to the other rows, so the solution is to split the page and move some of the rows from the page with the enlarged row onto a newly linked page. This can be an expensive operation. If the table has no clustered index, the row can move to a new location and leave a forwarding pointer in the original location. I'll talk about the details of page splitting and moving rows when I discuss data modification operations in Chapter 7.

On the other hand, using variable-length columns can sometimes improve performance because it can allow more rows to fit on a page. But the efficiency results from more than simply requiring less disk space. A data page for SQL Server is 8 KB (8,192 bytes), of which 8,096 bytes are available to store data. (The rest is for internal use to keep track of structural information about the page and the object to which it belongs.) One I/O operation brings back the entire page. If you can fit 80 rows on a page, a single I/O operation brings back 80 rows. But if you can fit 160 rows on a page, one I/O operation is essentially twice as efficient. In operations that scan for data and return lots of adjacent rows, this can amount to a significant performance improvement. The more rows you can fit per page, the better your I/O and cache-hit efficiency will be.

For example, consider a simple customer table. Suppose that you could define it in two ways, fixed length and variable length, as shown in Figures 6-3 and 6-4.

Figure 6-3. A customer table with all fixed-length columns

USE testdb
GO
CREATE TABLE customer_fixed
(
    cust_id            smallint     NULL,
    cust_name          char(50)     NULL,
    cust_addr1         char(50)     NULL,
    cust_addr2         char(50)     NULL,
    cust_city          char(50)     NULL,
    cust_state         char(2)      NULL,
    cust_postal_code   char(10)     NULL,
    cust_phone         char(20)     NULL,
    cust_fax           char(20)     NULL,
    cust_email         char(30)     NULL,
    cust_web_url       char(100)    NULL
)



Figure 6-4. A customer table with variable-length columns


USE testdb
GO
CREATE TABLE customer_var
(
    cust_id            smallint     NULL,
    cust_name          varchar(50)  NULL,
    cust_addr1         varchar(50)  NULL,
    cust_addr2         varchar(50)  NULL,
    cust_city          varchar(50)  NULL,
    cust_state         char(2)      NULL,
    cust_postal_code   varchar(10)  NULL,
    cust_phone         varchar(20)  NULL,
    cust_fax           varchar(20)  NULL,
    cust_email         varchar(30)  NULL,
    cust_web_url       varchar(100) NULL
)



Columns that contain addresses, names, or URLs all have data that varies significantly in length. Let's look at the differences between choosing fixed-length columns vs. choosing variable-length columns. In Figure 6-3, which uses all fixed-length columns, every row uses 384 bytes for data regardless of the number of characters actually inserted in the row. SQL Server also needs an additional 10 bytes of overhead for every row in this table, so each row needs a total of 394 bytes for storage. But let's say that even though the table must accommodate addresses and names up to the specified size, the average row is only half the maximum size.

In Figure 6-4, assume that for all the variable-length (varchar) columns, the average entry is actually only about half the maximum. Instead of a row length of 394 bytes, the average length is 224 bytes. This length is computed as follows: The smallint and char(2) columns total 4 bytes. The varchar columns' maximum total length is 380, half of which is 190 bytes. And a 2-byte overhead exists for each of nine varchar columns, for 18 bytes. Add 2 more bytes for any row that has one or more variable-length columns. In addition, these rows require the same 10 bytes of overhead that the fixed-length rows from Figure 6-3 require, regardless of the presence of variable-length fields. So the total is 4 + 190 + 18 + 2 + 10, or 224. (I'll discuss the actual meaning of each of these bytes of overhead later in this chapter.)

In the fixed-length example in Figure 6-3, you always fit 20 rows on a data page (8,096/394, discarding the remainder). In the variable-length example in Figure 6-4, you can fit an average of 36 rows per page (8,096/224). The table using variable-length columns will consume about half as many pages in storage, a single I/O operation will retrieve almost twice as many rows, and a page cached in memory is twice as likely to contain the row you're looking for.

Note

Additional overhead bytes are needed for each row if you are using snapshot isolation. I'll discuss this concurrency option as well as the extra row overhead needed to support it in Chapter 8.


When you choose lengths for columns, don't be wasteful, but don't be cheap, either. Allow for future needs, and realize that if the additional length doesn't change how many rows will fit on a page, the additional size is free anyway. Consider again the examples in Figures 6-3 and 6-4. The cust_id is declared as a smallint, meaning that its maximum positive value is 32,767 (unfortunately, SQL Server doesn't provide any unsigned int or unsigned smallint data types), and it consumes 2 bytes of storage. Although 32,767 customers might seem like a lot to a new company, the company might be surprised by its own success and, in a couple of years, find out that 32,767 is too limited.

The database designers might regret that they tried to save 2 bytes and didn't simply make the data type an int, using 4 bytes but with a maximum positive value of 2,147,483,647. They'll be especially disappointed if they realize they didn't really save any space. If you compute the rows-per-page calculations just discussed, increasing the row size by 2 bytes, you'll see that the same number of rows still fit on a page. The additional 2 bytes are free; they were simply wasted space before. They never cause fewer rows per page in the fixed-length example, and they'll rarely cause fewer rows per page even in the variable-length case.

So which strategy wins? Potentially better update performance? Or more rows per page? Like most questions of this nature, no one answer is right. It depends on your application. If you understand the tradeoffs, you'll be able to make the best choice. Now that you know the issues, this general rule merits repeating: Variable-length data types are most appropriate when you expect significant variation in the size of the data for that column and when the column won't be updated frequently.

Miscellaneous Data Types

I'll end this part of the discussion by showing you a few additional data types that you might have use for.

Binary Data Types

These data types are binary and varbinary. They are used to store strings of bits, and the values are entered and displayed using their hexadecimal (hex) representation, which is indicated by a prefix of 0x. So a hex value of 0x270F corresponds to a decimal value of 9999 and a bit string of 10011100001111. In hex, each two displayed characters represent a byte, so the value of 0x270F represents 2 bytes. You need to decide whether you want your data to be fixed or variable length, and you can use some of the same considerations we discussed for deciding between char and varchar to make your decision. The maximum length of binary or varbinary data is 8,000 bytes.
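You can verify the 0x270F example with a quick conversion (the variable name is arbitrary):

```sql
DECLARE @b varbinary(2)
SELECT @b = 0x270F
SELECT DATALENGTH(@b)     -- 2: each pair of hex characters is one byte
SELECT CONVERT(int, @b)   -- 9999
```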

bit Data Type

The bit data type can store a 0 or a 1 and can consume only a single bit of storage space. However, if there is only one bit column in a table, it will take up a whole byte. Up to 8 bit columns are stored in a single byte.
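As a sketch (the table and column names are hypothetical), all four flag columns in the following table share a single byte of storage in each row:

```sql
-- Up to 8 bit columns are packed into one byte per row,
-- so the four flags below together consume a single byte.
CREATE TABLE dbo.customer_flags
(
    cust_id      int NOT NULL,
    is_active    bit,
    is_preferred bit,
    has_credit   bit,
    is_taxable   bit
)
```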

Large Object Data Types

These data types are text, ntext, and image. The text data type can store up to 2^31 - 1 non-Unicode characters, ntext can store up to 2^30 - 1 (half as many) Unicode characters, and image can store up to 2^31 - 1 bytes of binary data. We'll look at the storage mechanisms used for these large object (LOB) data types later in the chapter.

cursor Data Type

The cursor data type can hold a reference to a cursor. Although you can't declare a column in a table to be of type cursor, this data type can be used for output parameters and local variables. I've included the cursor data type in this list for completeness, but I won't be talking more about it. Cursors are discussed in detail in Inside Microsoft SQL Server 2005: T-SQL Programming.

rowversion Data Type

The rowversion data type is a synonym for what was formerly called a timestamp. When using the timestamp data type name, many people might assume that the data has something to do with dates or times, but it doesn't. A column of type rowversion holds an internal sequence number that SQL Server automatically updates every time the row is modified. The value of any rowversion column is actually unique within an entire database, and a table can have only one column of type rowversion. Any operation that modifies any rowversion column in the database generates the next sequential value. The actual value stored in a rowversion column is seldom important by itself. The column is used to detect whether a row has been modified since the last time it was accessed, by finding out whether the rowversion value has changed.
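A minimal sketch shows the automatic update behavior (the table and data are hypothetical):

```sql
CREATE TABLE dbo.inventory
(
    item_id  int NOT NULL,
    qty      int NOT NULL,
    rv       rowversion         -- maintained automatically by SQL Server
)
INSERT dbo.inventory (item_id, qty) VALUES (1, 10)
SELECT rv FROM dbo.inventory    -- note the current value

UPDATE dbo.inventory SET qty = 9 WHERE item_id = 1
SELECT rv FROM dbo.inventory    -- the value has changed
```

An application can compare the rowversion value it read earlier with the current value to detect whether the row was modified in the meantime.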

sql_variant Data Type

The sql_variant data type allows a column to hold values of any data type except text, ntext, image, XML, user-defined data types, variable-length data types with the MAX specifier, or rowversion (timestamp). I'll describe the internal storage of sql_variant data later in this chapter, and you can see some examples of using sql_variant data in Inside Microsoft SQL Server 2005: T-SQL Programming.

table Data Type

The table data type can be used to store the result of a function and can be used as the data type of local variables. Columns in tables cannot be of type table. Table variables are discussed in detail in Inside Microsoft SQL Server 2005: T-SQL Programming.

xml Data Type

The xml datatype lets you store XML documents and fragments in a SQL Server database. You can use the xml datatype as a column type when you create a table, or as the data type for variables, parameters, and the return value of a function. XML data has its own methods for retrieval and manipulation, and these are discussed in Inside Microsoft SQL Server 2005: T-SQL Programming. I will not be covering details of working with XML data in this volume.

uniqueidentifier Data Type

The uniqueidentifier data type is sometimes referred to as a globally unique identifier (GUID) or universal unique identifier (UUID). A GUID or UUID is a 128-bit (16-byte) value generated in a way that, for all practical purposes, guarantees uniqueness worldwide, even among unconnected computers. It is becoming an important way to identify data, objects, software applications, and applets in distributed systems. Because I don't talk about uniqueidentifier data anywhere else in any of the Inside Microsoft SQL Server 2005 volumes, I'll give you a bit more detail about it here.

The Transact-SQL language supports the system functions NEWID and NEWSEQUENTIALID (which is new in SQL Server 2005), which you can use to generate uniqueidentifier values. A column or variable of data type uniqueidentifier can be initialized to a value in one of the following two ways:

  • Using the system-supplied function NEWID or NEWSEQUENTIALID as a default value.

  • Using a string constant in the following form (32 hexadecimal digits separated by hyphens): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. (Each x is a hexadecimal digit in the range 0 through 9 or a through f.)
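Both initialization styles can be sketched as follows. The table name is hypothetical, and the string constant is just a sample GUID; note that in SQL Server 2005, NEWSEQUENTIALID can be used only in a DEFAULT constraint, not as a general expression:

```sql
DECLARE @g1 uniqueidentifier, @g2 uniqueidentifier
SELECT @g1 = NEWID()
SELECT @g2 = '6F9619FF-8B86-D011-B42D-00C04FC964FF'   -- sample constant
SELECT @g1, @g2

-- NEWSEQUENTIALID is allowed only as a column default:
CREATE TABLE dbo.guid_test
(
    id  uniqueidentifier DEFAULT NEWSEQUENTIALID(),
    val int
)
```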

This data type can be quite cumbersome to work with, and the only operations that are allowed against a uniqueidentifier value are comparisons (=, <>, <, >, <=, >=) and checking for NULL. However, using this data type internally can offer several advantages.

One reason to use the uniqueidentifier data type is that the values generated by NEWID or NEWSEQUENTIALID are guaranteed to be globally unique for any machine on a network because the last 6 bytes of a uniqueidentifier value make up the node number for the machine. When the SQL Server machine does not have an Ethernet/Token Ring (IEEE 802.x) address, there is no node number and the generated GUID is guaranteed to be unique among all GUIDs generated on that computer. However, the possibility exists that another computer without an Ethernet/Token Ring address will generate the identical GUID. The GUIDs generated on computers with network addresses are guaranteed to be globally unique.

The primary reason that SQL Server needed a way to generate a GUID was for use in merge replication, in which identifier values for the same table could be generated on any one of many different SQL Server machines. There needed to be a way to determine whether two rows really were the same row and there had to be no way that two rows not referring to the same entity would have the same identifier. Using GUID values provides that functionality. Two rows with the same GUID value must indicate that they really are the same row.

The difference between the NEWSEQUENTIALID and NEWID functions is that NEWSEQUENTIALID creates a GUID that is greater than any GUID previously generated by that function on a specified computer, so it can be used to introduce an ordering to your GUID values. This turns out to greatly increase the scalability of systems using merge replication. If uniqueidentifier values generated by NEWID are used as the clustered key for the replicated tables, new rows are inserted on random disk pages. (You'll see the details in Chapter 7, when I discuss clustered indexes in detail.) If the machines involved are performing a large number of I/O operations, the nonsequential GUIDs generated by NEWID result in lots of random B-tree lookups and inefficient insert operations. The new function, NEWSEQUENTIALID, which is a wrapper around the Windows function UuidCreateSequential, does some byte scrambling and creates an ordering to the generated UUID values.
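Note that NEWSEQUENTIALID can be invoked only through a DEFAULT constraint on a table column, not called directly in a query. A minimal sketch (the table and column names are hypothetical):

```sql
-- NEWSEQUENTIALID is allowed only as a column default
CREATE TABLE dbo.orders
(
    order_guid  uniqueidentifier NOT NULL
                DEFAULT NEWSEQUENTIALID(),
    order_date  datetime NOT NULL
)

-- Each insert receives a GUID greater than any previously generated
-- on this machine, so inserts land on sequential pages
INSERT INTO dbo.orders (order_date) VALUES (GETDATE())
```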

The list of uniqueidentifier values can't be practically exhausted. This is not the case with other data types frequently used as unique identifiers, such as int. In fact, SQL Server uses this data type internally for row-level merge replication. A uniqueidentifier column can have a special property called the ROWGUIDCOL property; at most, one uniqueidentifier column can have this property per table. The ROWGUIDCOL property can be specified as part of the column definition in CREATE TABLE and ALTER TABLE ADD column, or it can be added or dropped for an existing column using ALTER TABLE ALTER COLUMN.

You can reference a uniqueidentifier column with the ROWGUIDCOL property using the keyword ROWGUIDCOL in a query. This is similar to referencing an identity column using the IDENTITYCOL keyword. The ROWGUIDCOL property does not imply any automatic value generation, and if automatic value generation is needed, the NEWID function should be defined as the default value of the column. You can have multiple uniqueidentifier columns per table, but only one of them can have the ROWGUIDCOL property. You can use the uniqueidentifier data type for whatever reason you come up with, but if you're using one to identify the current row, an application must have a generic way to ask for it without needing to know the column name. That's what the ROWGUIDCOL property does.
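The ROWGUIDCOL property and its generic reference can be sketched like this; the table and column names are illustrative only:

```sql
-- At most one uniqueidentifier column per table can carry ROWGUIDCOL;
-- ROWGUIDCOL itself does not generate values, so NEWID is the default
CREATE TABLE dbo.mytable
(
    row_guid  uniqueidentifier ROWGUIDCOL NOT NULL
              DEFAULT NEWID(),
    descr     varchar(30) NULL
)

-- An application can retrieve the row identifier generically,
-- without knowing the column's actual name
SELECT ROWGUIDCOL FROM dbo.mytable
```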

Much Ado About NULL

The issue of whether to allow NULL has become an almost religious one for many in the industry, and no doubt the discussion here will outrage a few people. However, my intention isn't to engage in a philosophical debate. Pragmatically, dealing with NULL brings added complexity to the storage engine because SQL Server keeps a special bitmap in every row to indicate which nullable columns actually are NULL. If NULLs are allowed, SQL Server must decode this bitmap for every row accessed. Allowing NULL also adds complexity in application code, which can often lead to bugs. You must always add special logic to account for the case of NULL.

As the database designer, you might understand the nuances of NULL and three-valued logic in aggregate functions when you do joins and when you search by values. In addition, you must also consider whether your development staff really understands how to work with NULLs. I recommend, if possible, that you use all NOT NULL columns and define default values for missing or unknown entries (and possibly make such character columns varchar if the default value is significantly different in size from the typical entered value).
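This recommendation can be sketched as follows; the table and default values are hypothetical. The phone column is made varchar because the default ('unknown', 7 characters) differs noticeably in size from a typical 12-character entry:

```sql
-- All columns NOT NULL; a default stands in for unknown entries
CREATE TABLE dbo.employee
(
    emp_id  char(4)     NOT NULL,
    name    char(30)    NOT NULL,
    phone   varchar(12) NOT NULL DEFAULT 'unknown'
)

-- Omitting phone stores the default rather than NULL
INSERT INTO dbo.employee (emp_id, name) VALUES ('E001', 'Kalen')
```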

In any case, it's good practice to explicitly declare NOT NULL or NULL when you create a table. If no such declaration exists, SQL Server assumes NOT NULL. (In other words, no NULLs are allowed.) This might surprise many people who assume that the default for SQL Server is to allow NULLs. The reason for this misconception is that most of the tools and interfaces for working with SQL Server enable a session setting that makes allowing NULLs the default. You can change the default yourself by using a session setting or a database option. If you script your DDL and then run it against another server that has a different default setting, you'll get different results if you don't explicitly declare NULL or NOT NULL in the column definition.

Several database options and session settings can control SQL Server's behavior regarding NULL values. You can set database options using the ALTER DATABASE command, as I showed you in Chapter 4. And you can enable session settings for one connection at a time using the SET command.

Note

The database option ANSI null default corresponds to the two session settings ANSI_NULL_DFLT_ON and ANSI_NULL_DFLT_OFF. When the ANSI null default database option is false (the default setting for SQL Server), new columns created with the ALTER TABLE and CREATE TABLE statements are, by default, NOT NULL if the nullability status of the column isn't explicitly specified. SET ANSI_NULL_DFLT_OFF and SET ANSI_NULL_DFLT_ON are mutually exclusive options that indicate whether the database option should be overridden. When on, each option forces the opposite option off. Neither option, when off, turns the opposite option on; it only discontinues the current on setting.


You use the function GETANSINULL to determine the default nullability for your current session. This function returns 1 when new columns allow null values and the column or data type nullability wasn't explicitly defined when the table was created or altered. I strongly recommend declaring NULL or NOT NULL explicitly when you create a column. This removes all ambiguity and ensures that you're in control of how the table will be built, regardless of the default nullability setting.
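Basic usage of GETANSINULL looks like this; the result (1 or 0) depends on your current settings, so no particular output is assumed here:

```sql
-- Default nullability for the current database in this session
SELECT GETANSINULL()

-- Default nullability for a named database
SELECT GETANSINULL('pubs')
```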

The database option concat null yields null corresponds to the session setting SET CONCAT_NULL_YIELDS_NULL. When CONCAT_NULL_YIELDS_NULL is on, concatenating a NULL value with a string yields a NULL result. For example, SELECT 'abc' + NULL yields NULL. When SET CONCAT_NULL_YIELDS_NULL is off, concatenating a NULL value with a string yields the string itself. In other words, the NULL value is treated as an empty string. For example, SELECT 'abc' + NULL yields abc. If the session-level setting isn't specified, the value of the database option concat null yields null applies.
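The two behaviors described above can be seen side by side:

```sql
SET CONCAT_NULL_YIELDS_NULL ON
SELECT 'abc' + NULL     -- yields NULL

SET CONCAT_NULL_YIELDS_NULL OFF
SELECT 'abc' + NULL     -- yields abc
```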

The database option ANSI nulls corresponds to the session setting SET ANSI_NULLS. When this option is set to true, all comparisons to a null value evaluate to FALSE. When it is set to false, comparisons of non-Unicode values to a null value evaluate to TRUE if both values are NULL. In addition, when this option is set to true, your code must use the condition IS NULL to determine whether a column has a NULL value. When this option is set to false, SQL Server allows = NULL as a synonym for IS NULL and <> NULL as a synonym for IS NOT NULL.

You can see this behavior yourself by looking at the titles table in the pubs database. The titles table has two rows with a NULL price. The first batch of statements that follows, when executed from a query window, should return two rows, and the second batch should return no rows.

-- First batch will return 2 rows
USE pubs
SET ANSI_NULLS OFF
GO
SELECT * FROM titles WHERE price = NULL
GO

-- Second batch will return no rows
USE pubs
SET ANSI_NULLS ON
GO
SELECT * FROM titles WHERE price = NULL
GO


A fourth session setting is ANSI_DEFAULTS. Setting this to ON is a shortcut for enabling both ANSI_NULLS and ANSI_NULL_DFLT_ON as well as other session settings not related to NULL handling. The SQL Server ODBC driver and the SQL Server OLE DB provider automatically set ANSI_DEFAULTS to ON. You can change the ANSI_NULLS setting when you define your data source name (DSN). You should be aware that the tool you are using to connect to SQL Server might set certain options ON or OFF. To see which options are enabled in your current connection, you can run the following command:

DBCC USEROPTIONS


Here's a sample of the output it might return:

Set Option                 Value
-------------------------- -------------------------
textsize                   64512
language                   us_english
dateformat                 mdy
datefirst                  7
ansi_null_dflt_on          SET
ansi_warnings              SET
ansi_padding               SET
ansi_nulls                 SET
concat_null_yields_null    SET


Warning

A SET option always overrides the corresponding database option, whether the SET option has been turned ON or OFF. The only time the database option applies is when the SET option has not been touched. So you can think of the ANSI_NULLS and CONCAT_NULL_YIELDS_NULL options as having three possible values: ON, OFF, and "never been set." If the SET option is either ON or OFF, the database option is ignored. Because the SQL Native Client ODBC driver and SQL Native Client OLE DB Provider for SQL Server automatically set both of these options to ON, the database options are never taken into account.

I recommend that you try to write your queries so it doesn't matter which option is in effect. When that is impossible, you should rely on the session-level option only and not the database-level option.


Note that the DBCC USEROPTIONS command shows only options that have been explicitly set in your connection. Any option that just relies on the default will not be shown. In addition, you cannot see the session settings for any other connections. In SQL Server 2005, an alternative is to use the sys.dm_exec_sessions dynamic management view to see all the session option values.

The following query shows the values for your current session, but if you have VIEW SERVER STATE permission you can change or remove the WHERE clause to return information about other sessions:

SELECT * FROM sys.dm_exec_sessions WHERE session_id = @@spid


As you can see, you can configure and control the treatment and behavior of NULL values in several ways, and you might think it would be impossible to keep track of all the variations. If you try to control every aspect of NULL handling separately within each individual session, you can cause immeasurable confusion and even grief. However, most of the issues become moot if you follow a few basic recommendations:

  • Never allow NULL values in your tables.

  • Include a specific NOT NULL qualification in your table definitions.

  • Don't rely on database properties to control the behavior of NULL values.

If you must use NULLs in some cases, you can minimize problems by always following the same rules, and the easiest rules to follow are the ones that ANSI already specifies.



Inside Microsoft SQL Server 2005: The Storage Engine (Solid Quality Learning)
ISBN: 0735621055
Year: 2004
Pages: 115
Author: Kalen Delaney
