SQL Server uses the ANSI SQL standard CREATE TABLE syntax. SQL Server Enterprise Manager provides a front-end, fill-in-the-blanks table designer, which might make your job easier. Ultimately, the SQL syntax is always sent to SQL Server to create a table. You can create a table directly using a tool such as OSQL, ISQL, or the Query Analyzer; from SQL Server Enterprise Manager; or using a third-party data modeling tool (such as ERwin or Microsoft Visual InterDev) that transmits the SQL syntax under the cover of a friendly interface.
In this chapter, we emphasize direct use of the data definition language (DDL) rather than discussing the interface tools. You should keep all DDL commands in a script so that you can run them easily at a later time to re-create the table. (Even if you use one of the friendly front-end tools, it's critical that you can later re-create the table.) SQL Server Enterprise Manager and other front-end tools can create and save operating system files with the SQL DDL commands necessary to create the object. This DDL is essentially source code, and you should treat it as such. Keep a backup copy. You should also consider keeping these files under version control using a source control product such as Microsoft Visual SourceSafe or PVCS from Micro Focus (formerly INTERSOLV).
At the basic level, creating a table requires little more than knowing what you want to name it, what columns it will contain, and what range of values (domain) each column will be able to store. Here's the basic syntax for creating the customer table, with three fixed-length character ( char ) columns. (Note that this table definition isn't necessarily the most efficient way to store data because it always requires 46 bytes per entry plus a few bytes of overhead regardless of the actual length of the data.)
CREATE TABLE customer
(
name      char(30),
phone     char(12),
emp_id    char(4)
)
This example shows each column on a separate line for readability. As far as the SQL Server parser is concerned, whitespace created by tabs, carriage returns, and spaces is treated identically. From the system's standpoint, the following CREATE TABLE example is identical to the one above, but it's harder to read from a user's standpoint:
CREATE TABLE customer (name char(30), phone char(12), emp_id char(4))
This simple example shows just the basics of creating a table. We'll see many more detailed examples later in this chapter.
A table is always created within a database and is owned by one particular user. Normally, the owner is the user who created the table, but anyone with the sysadmin or db_owner role can create a table owned by another user. A database can contain multiple tables with the same name, as long as the tables have different owners. The full name of a table has three parts, in this form:

database.owner.table_name
For example, say that a user (with the username Kalen) created a sample customer table in the pubs sample database. This user's table would have the three-part name pubs.kalen.customer. (If this user is also the database owner, the table's name would be pubs.dbo.customer, because dbo is the special username for the database owner in every database.)
The first two parts of the three-part name specification have default values. The default for the name of the database is whatever database context you're currently working in. The table owner actually has two possible defaults. If no table owner name is specified when referencing a table, SQL Server will assume that either you or the owner of the database owns the table. For example, if our hypothetical user owns the customer table and her database context is pubs, she can refer to the table simply as customer.
To access a table owned by anyone other than yourself or the database owner, you must include the owner name along with the table name.
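For example, using the hypothetical pubs.kalen.customer table described above, all of the following references are equivalent when the current database context is pubs; the last form works only if you are kalen or if kalen is the database owner:

SELECT * FROM pubs.kalen.customer
SELECT * FROM kalen.customer
SELECT * FROM customer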
Column names should be descriptive, and because you'll use them repeatedly, you should avoid wordiness. The name of the column (or any object in SQL Server, such as a table or a view) can be whatever you choose, as long as the name conforms to the SQL Server rules for identifiers: it must consist of a combination of 1 through 128 letters, digits, or the symbols #, $, @, or _. (For more specific identifier rules, see "Using Identifiers" in SQL Server Books Online. The discussions there and in the Microsoft SQL Server Transact-SQL and Utilities Reference are true for all SQL Server object names, not just for column names.)
Certain reserved keywords, such as table, create, select, and update, have special meaning to the SQL Server parser, and collectively they make up the SQL language implementation. If at all possible, you shouldn't use reserved keywords for your object names. In addition to the SQL Server reserved keywords, the SQL-92 standard has its own list of reserved keywords. In some cases, this list is more restrictive than SQL Server's list; in other cases, it's less restrictive. SQL Server Books Online has a complete list of both SQL Server's reserved keywords and the SQL-92 standard's list.
Watch out for SQL-92's reserved keywords. They aren't reserved keywords in SQL Server yet, but they could become reserved keywords in a future SQL Server version. Using a SQL-92 reserved keyword might require that you alter your application before upgrading it if the word has become a SQL Server reserved keyword.
You can't use keywords in your object names unless you use a delimited identifier. In fact, if you use a delimited identifier, not only can you use keywords as identifiers, but you can use any other string as an object name, whether or not it follows the rules for identifiers. This includes spaces and other nonalphanumeric characters normally not allowed. Two types of delimited identifiers exist: quoted identifiers, which are enclosed in double quotation marks ("column"), and bracketed identifiers, which are enclosed in square brackets ([column]).
You can use bracketed identifiers in any environment, but to use quoted identifiers, you must enable a special option using SET QUOTED_IDENTIFIER ON. Once you have turned on QUOTED_IDENTIFIER, double quotes will always be interpreted as referencing an object. To delimit string or date constants, you have to use single quotes. Let's look at some examples. Because column is a reserved keyword, the first statement that follows would be illegal in all circumstances, and the second one would be illegal unless QUOTED_IDENTIFIER was on. The third statement would be legal in any circumstances.
CREATE TABLE customer (name char(30), column char(12), emp_id char(4))

CREATE TABLE customer (name char(30), "column" char(12), emp_id char(4))

CREATE TABLE customer (name char(30), [column] char(12), emp_id char(4))
The ODBC driver that comes with SQL Server sets the QUOTED_IDENTIFIER option to ON by default, but some of the SQL Server-specific tools set it to OFF. You can determine whether this option is on or off for your session by executing the following command:

DBCC USEROPTIONS
Prior to version 7.0 of SQL Server, the documentation discussed two categories of special words, which were called keywords and reserved words . The keywords were already part of the product, and reserved words were listed as having a good chance of becoming a part of the product in a future version.
The rationale behind reserved words was that for version 6.0, the developers needed to add many new keywords to support the latest functionality. In general, these words hadn't been reserved in version 4.21, and adding so many new keywords made upgrading tough for many sites, because they had to modify their applications to change those keywords. Although it probably seemed capricious and unwarranted to reserve words that merely might be used in the future, the developers knew that some new functionality would be added in future versions.
Through version 6.5, the documentation continued to designate reserved words for future use. Now in version 7.0, SQL Server is following the ANSI SQL guidelines much more closely. So instead of documenting separate lists of reserved words and keywords, the documentation can just rely on what ANSI has already specified.
If you're using the Query Analyzer, you can check your setting by either running the command on the previous page or choosing the Configure command from the File menu and examining the Connection tab.
Theoretically, you could always use delimited identifiers with all object and column names, and then you'd never have to worry about reserved keywords. However, we don't recommend this. Many third-party tools for SQL Server don't handle quoted identifiers well. Using quoted identifiers might make upgrading to future versions of SQL Server more difficult. During the upgrade process to SQL Server 7, objects such as views and stored procedures are automatically dropped and re-created so that they include the structures of the latest version.
When you use quoted identifiers, upgrading can't be fully automated and you must tell the Upgrade Wizard whether it should assume that QUOTED_IDENTIFIER is on. If you don't know, or if you have a mixture of objects, you can tell the wizard to use a mixed mode. This means that the Upgrade Wizard will first try interpreting all objects with double quotation marks as though the option is on, and only if that's unsuccessful will it assume that the option is off for that object. This interpretation might not always result in the stored procedure or view having the same meaning as was intended.
Rather than using delimited identifiers to protect against reserved keyword problems, you should simply adopt some simple naming conventions. For example, you can precede column names with the first few letters of the table name and an underscore (_). Not only does this naming style make the column or object name more readable, but it also greatly reduces your chances of encountering a keyword or reserved word conflict.
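For example, a hypothetical customer table using this convention might be declared as follows; the cust_ prefix makes the origin of each column obvious and keeps the names well clear of any reserved keyword:

CREATE TABLE customer
(
cust_id      int,
cust_name    varchar(30),
cust_phone   char(12)
)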
Many organizations and multiuser development projects adopt standard naming conventions, which is generally good practice. For example, assigning a standard moniker of cust_id to represent a customer number in every table clearly shows that all the tables share common data. On the other hand, if an organization used several monikers in the tables to represent a customer number, such as cust_id , cust_num , customer_number , and customer_# , it wouldn't be as obvious that these monikers represented common data.
One naming convention is the Hungarian-style notation for column names. Hungarian-style notation is a widely used practice in C programming, whereby variable names include information about their datatypes. Hungarian-style notation uses names such as sint_nn_custnum to indicate that the custnum column is a small integer ( smallint of 2 bytes) and is NOT NULL (doesn't allow nulls). Although this practice makes good sense in C programming, it defeats the datatype independence that SQL Server provides; therefore, we recommend against using it.
Suppose, for example, that after the table is built and applications have been written, you discover that the custnum column requires a 4-byte integer ( int ) instead of a 2-byte small integer. You can re-create the table relatively easily and define the column as an int instead of a smallint . (Alternatively, you can use the ALTER TABLE command to modify the datatype of the existing table.) SQL Server stored procedures will handle the different datatype automatically. Applications using ODBC, OLE DB, or DB-Library that bind the retrieved column to a character or integer datatype will be unaffected. The applications would need to change if they bound the column to a small integer variable because the variable's type would need to be larger. Therefore, you should try not to be overly conservative with variable datatypes, especially in your client applications. You should be most concerned with the type on the server side; the type in the application can be larger and will automatically accommodate smaller values. By overloading the column name with datatype information, which is readily available from the system catalogs, the insulation from the underlying datatype is compromised. (You could, of course, change the datatype from a smallint to an int , but then the Hungarian-style name would no longer accurately reflect the column definition. Changing the column name would then result in the need to change application code, stored procedures, or both.)
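A sketch of widening the column in place, assuming the table and column names from the example above (customer and custnum):

ALTER TABLE customer ALTER COLUMN custnum int

Stored procedures and applications that bound the column to a character or larger integer variable continue to work unchanged.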
SQL Server provides many datatypes, as shown in Table 6-1. Choosing the appropriate datatype is simply a matter of mapping the domain of values you need to store to the corresponding datatype. In choosing datatypes, you want to avoid wasting storage space while allowing enough space for a sufficient range of possible values over the life of your application.
SQL Server syntactically accepts as datatypes both the words listed as synonyms and the base datatypes shown in Table 6-1, but it uses only the type listed as the base datatype. For example, you can define a column as character(1), character, or char(1), and SQL Server will accept them all as valid syntax. Internally, however, the expression is considered char(1), and subsequent querying of the SQL Server system catalogs for the datatype will show it as char(1), regardless of the syntax that you used when you created it.
Most available datatypes in SQL Server are well documented, so we'll leave it up to you to find additional details in the product documentation. However, one special datatype merits a bit more discussion.
Using a globally unique identifier (GUID), also called a universal unique identifier (UUID), is becoming an important way to identify data, objects, software applications, and applets in distributed systems. A GUID is a 128-bit (16-byte) value generated in a way that, for all practical purposes, guarantees uniqueness worldwide, even among disconnected computers. SQL Server 7 supports a datatype called uniqueidentifier for storing a globally unique identifier. The Transact-SQL language supports the system function NEWID(), which can be used to generate a uniqueidentifier value. A column or variable of datatype uniqueidentifier can be initialized to a value in one of the following two ways: by using the NEWID function to generate a new value, or by assigning a string constant in the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, where each x is a hexadecimal digit.
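A sketch of both initialization methods; the variable names and the particular string constant are illustrative only:

DECLARE @guid1 uniqueidentifier
DECLARE @guid2 uniqueidentifier
SET @guid1 = NEWID()
SET @guid2 = '6F9619FF-8B86-D011-B42D-00C04FC964FF'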
Table 6-1. SQL Server datatypes.
This datatype can be quite cumbersome to work with, and the only operations that are allowed against a uniqueidentifier value are comparisons (=, <>, <, >, <=, >=) and checking for NULL. (The online product documentation states that only the equality and inequality operators, = and <>, are allowed for uniqueidentifier columns, but this is incorrect. In fact, you can even sort by a uniqueidentifier column.) However, using this datatype internally can have several advantages.
One advantage is that the values are guaranteed to be globally unique for any machine on a network, because the last six bytes of a uniqueidentifier value make up the node number for the machine. On a machine with a network interface card (NIC), the node is the unique IEEE 802 identifier of that card. On a machine without a NIC (for example, a home computer that connects to the Internet via modem), the node is a pseudo-random, 48-bit value that isn't guaranteed to be unique now but is highly likely to be unique for the near future.
Another advantage is that the list of uniqueidentifier values can't be exhausted. This is not the case with other datatypes frequently used as unique identifiers. In fact, this datatype is used internally by SQL Server for row-level merge replication. A uniqueidentifier column can have a special property called the ROWGUIDCOL property; at most, one uniqueidentifier column can have this property per table. The ROWGUIDCOL property can be specified as part of the column definition in CREATE TABLE and ALTER TABLE ADD column or added or dropped for an existing column through ALTER TABLE ALTER COLUMN.
A uniqueidentifier column with the ROWGUIDCOL property can be referenced using the keyword ROWGUIDCOL in a query. This is similar to referencing an identity column through the IDENTITYCOL keyword. The ROWGUIDCOL property does not imply any automatic value generation, and if automatic value generation is needed, the NEWID() function should be defined as the default value of the column. You can have multiple uniqueidentifier columns per table, but only one of them can have the ROWGUIDCOL property. You can use the uniqueidentifier datatype for whatever reason you come up with, but if you're using one to identify the current row, an application must have a generic way to ask for it without needing to know the column name. That's what the ROWGUIDCOL property does.
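A minimal sketch, with assumed table and column names, showing a ROWGUIDCOL column that generates its own values through NEWID() and a query that references it without knowing the column name:

CREATE TABLE orders
(
order_id     int               NOT NULL,
order_guid   uniqueidentifier  ROWGUIDCOL DEFAULT NEWID()
)

SELECT ROWGUIDCOL FROM orders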
Deciding to use a variable-length or a fixed-length datatype isn't always straightforward or obvious. As a general rule, variable-length datatypes are most appropriate when you expect significant variance in the size of the data for a column and when the data in the column won't be frequently changed.
Using variable-length datatypes can yield important storage savings. Choosing them can sometimes result in a minor performance loss and at other times can result in improved performance. A row with variable-length columns requires special offset entries to be internally maintained. These entries keep track of the actual length of each variable-length column. Calculating and maintaining the offsets requires slightly more overhead than a pure fixed-length row, which needs no such offsets at all; each change involves a few extra addition and subtraction operations to maintain the offset values. However, this extra overhead is generally inconsequential and alone would not make a significant difference on most, if not all, systems.
On the other hand, using variable-length columns can sometimes improve performance because they can allow more rows to fit on a page. But the efficiency results from more than simply requiring less disk space. A data page for SQL Server is 8 KB (8192 bytes), of which 8096 bytes are available to store data. (The rest is for internal use to keep track of structural information about the page and the object to which it belongs.) One I/O operation brings back the entire page. If you can fit 80 rows on a page, a single I/O operation brings back 80 rows. But if you can fit 160 rows on a page, one I/O operation is essentially twice as efficient. In operations that scan for data and return lots of adjacent rows, this can amount to a significant performance improvement. The more rows you can fit per page, the better your I/O and cache-hit efficiency will be.
For example, consider a simple customer table. Suppose you could define it in two ways, fixed-length and variable-length, as shown in Figure 6-1 and in Figure 6-2.
Figure 6-1. A customer table with fixed-length columns.
Figure 6-2. A customer table with variable-length columns.
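The figures aren't reproduced here, but based on the sizes discussed in the text (a smallint, a char(2), and nine character columns totaling 300 bytes), the two definitions might look roughly like this; the column names and the exact split of the 300 bytes are assumptions:

-- Fixed-length version: 304 bytes of data in every row
CREATE TABLE customer
(
cust_id      smallint,
cust_state   char(2),
cust_name    char(30),
cust_addr1   char(40),
cust_addr2   char(40),
cust_city    char(30),
cust_zip     char(10),
cust_phone   char(20),
cust_fax     char(20),
cust_email   char(40),
cust_url     char(70)
)

-- Variable-length version: 304 bytes maximum, about 184 on average
CREATE TABLE customer
(
cust_id      smallint,
cust_state   char(2),
cust_name    varchar(30),
cust_addr1   varchar(40),
cust_addr2   varchar(40),
cust_city    varchar(30),
cust_zip     varchar(10),
cust_phone   varchar(20),
cust_fax     varchar(20),
cust_email   varchar(40),
cust_url     varchar(70)
)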
Columns that contain addresses, names, or Internet URLs all have data that varies significantly in length. Let's look at the differences between choosing fixed-length columns vs. choosing variable-length columns. In Figure 6-1, using all fixed-length columns, every row uses 304 bytes for data, regardless of the number of characters actually inserted into the row. Furthermore, SQL Server needs an additional 10 bytes of overhead for every row in this table, so the rows will need a total of 314 bytes for storage. But assume that even though the table must accommodate addresses and names up to the specified size, on average, the actual entries are only half the maximum size.
In Figure 6-2, assume that for all the variable-length ( varchar ) columns, the average entry is actually only about half the maximum. Instead of a row length of 304 bytes, the average length is 184 bytes. This length is computed as follows: the smallint and char(2) columns total 4 bytes. The varchar columns' maximum total length is 300, half of which is 150 bytes. And a 2-byte overhead exists for each of nine varchar columns, for 18 bytes. Add 2 more bytes for any row that has one or more variable-length columns. In addition, this row requires another 10 bytes of overhead, regardless of the presence of variable-length fields. (This is the same 10 bytes of overhead needed in the case of all fixed-length columns; in other words, all rows have these same 10 bytes of constant overhead.) So the total is 4 + 150 + 18 + 2 + 10, or 184. (The actual meaning of each of these bytes of overhead will be discussed later in this chapter.)
In the fixed-length example in Figure 6-1, you always fit 25 rows on a data page (8096/314, discarding the remainder). In the variable-length example in Figure 6-2, you can fit an average of 44 rows per page (8096/184). The table using variable-length columns will consume about half as many pages in storage, a single I/O operation will retrieve almost twice as many rows, and a page cached in memory is twice as likely to contain the row you're looking for.
When choosing lengths for columns, don't be wasteful, but don't be cheap, either. Allow for future needs, and realize that if the additional length doesn't change how many rows will fit on a page, the additional size is free anyway. Consider again the examples in Figures 6-1 and 6-2. The cust_id column was declared as a smallint, meaning that its maximum positive value is 32,767 (unfortunately, SQL Server doesn't provide unsigned int or unsigned smallint datatypes), and it consumes 2 bytes of storage. Although 32,767 customers might seem like a lot to a new company, the company might be surprised by its own success and, in a couple of years, find out that 32,767 is too limited.
The database designers might regret that they tried to save 2 bytes and didn't simply make the datatype an int, using 4 bytes but with a maximum positive value of 2,147,483,647. They'll be especially disappointed if they realize they didn't really save any space. If you redo the rows-per-page calculations just discussed, increasing the row size by 2 bytes, you'll see that the same number of rows still fit on a page. The additional 2 bytes are free; they were simply wasted space before. They never cause fewer rows per page in the fixed-length example, and they'll rarely cause fewer rows per page even in the variable-length case.
So which strategy wins? Potentially better update performance, or more rows per page? Like most questions of this nature, no one answer is right. It depends on your application. If you understand the tradeoffs, you'll be able to make the best choice. Now that you know the issues, this general rule merits repeating: variable-length datatypes are most appropriate when you expect significant variance in the size of the data for that column and when the column won't be updated frequently.
The issue of whether to allow NULL has become an almost religious one for many in the industry, and no doubt the discussion here will outrage a few people. However, our intention isn't to engage in the philosophical debate. Pragmatically, dealing with NULL brings added complexity to the storage engine, because SQL Server keeps a special bitmap in every row to indicate which nullable columns actually are NULL. If NULLs are allowed, SQL Server must decode this bitmap for every row accessed. Allowing NULL also adds complexity in application code, which can often lead to bugs. You must always add special logic to account for the case of NULL.
You, as the database designer, might understand the nuances of NULL and three-valued logic when used in aggregate functions, when doing joins, and when searching by values. However, you need to consider whether your development staff understands as well. We recommend, if possible, that you use all NOT NULL columns and define default values (discussed later in this chapter) for missing or unknown entries (and possibly make such character columns varchar if the default value is significantly different in size from the typical entered value).
In any case, it's good practice to explicitly declare NOT NULL or NULL when creating a table. If no such declaration exists, SQL Server assumes NOT NULL. (In other words, no NULLs will be allowed.) However, you can set the default to allow NULLs by using a session setting or a database option. The ANSI SQL standard says that if neither is specified, NULL should be assumed, but as mentioned, this isn't SQL Server's default. If you script your DDL and then run it against another server that has a different default setting, you'll get different results if you don't explicitly declare NULL or NOT NULL in the column definition.
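For example, a script that declares nullability explicitly creates the same table on any server, regardless of that server's default settings:

CREATE TABLE customer
(
cust_id     int       NOT NULL,
cust_name   char(30)  NOT NULL,
cust_fax    char(12)  NULL
)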
Several database options and session settings can control SQL Server's behavior regarding NULL values. You set database options using the system procedure sp_dboption . And you enable session settings for one connection at a time using the SET command.
The database option ANSI null default corresponds to the two session settings ANSI_NULL_DFLT_ON and ANSI_NULL_DFLT_OFF. When the ANSI null default database option is false (the default setting for SQL Server), new columns created with the ALTER TABLE and CREATE TABLE statements are, by default, NOT NULL if the nullability status of the column isn't explicitly specified. SET ANSI_NULL_DFLT_OFF and SET ANSI_NULL_DFLT_ON are mutually exclusive options, yet both exist so that you can determine whether the database option should be overridden. When set on, each option forces the opposite option off. Setting an option off doesn't turn the opposite option on; it merely discontinues the current on setting.
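A sketch of this interaction, assuming the database option is at its default of false; the table names are illustrative:

SET ANSI_NULL_DFLT_ON ON
CREATE TABLE t1 (col1 int)    -- col1 allows NULLs
SET ANSI_NULL_DFLT_OFF ON     -- forces ANSI_NULL_DFLT_ON off
CREATE TABLE t2 (col1 int)    -- col1 is NOT NULL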
Use the function GETANSINULL() to determine the default nullability for your current session. This function returns 1 when new columns will allow null values and the column or datatype nullability isn't explicitly defined when the table is created or altered. We strongly recommend declaring NULL or NOT NULL explicitly when you create a column. This removes all ambiguity and ensures that you're in control of how the table will be built regardless of the default nullability setting.
The database option concat null yields null corresponds to the session setting SET CONCAT_NULL_YIELDS_NULL. When CONCAT_NULL_YIELDS_NULL is on, concatenating a NULL value with a string yields a NULL result. For example, SELECT 'abc' + NULL yields NULL. When SET CONCAT_NULL_YIELDS_NULL is off, concatenating a NULL value with a string yields the string itself. (The NULL value is treated as an empty string.) For example, SELECT 'abc' + NULL yields abc. If the session-level setting isn't specified, the value of the database option concat null yields null applies.
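Both behaviors can be seen from a single session:

SET CONCAT_NULL_YIELDS_NULL ON
SELECT 'abc' + NULL     -- returns NULL
SET CONCAT_NULL_YIELDS_NULL OFF
SELECT 'abc' + NULL     -- returns abc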
The database option ANSI nulls corresponds to the session setting SET ANSI_NULLS. When true, all comparisons to a null value evaluate to NULL (unknown). When false, comparisons of non-Unicode values to a null value evaluate to TRUE if both values are NULL. In addition, when this value is TRUE, your code must use the condition IS NULL to determine whether a column has a NULL value. When this value is FALSE, SQL Server allows "= NULL" as a synonym for "IS NULL" and "<> NULL" as a synonym for "IS NOT NULL." You can see this behavior yourself by looking at the titles table in the pubs database. The titles table has two rows with a NULL price. The first batch of statements that follows, when executed from the Query Analyzer, should return two rows, and the second batch should return no rows:
-- First batch will return 2 rows
use pubs
set ansi_nulls off
go
select * from titles where price = null
go

-- Second batch will return no rows
use pubs
set ansi_nulls on
go
select * from titles where price = null
go
A fourth session setting is ANSI_DEFAULTS. Setting this to ON is a shortcut for enabling both ANSI_NULLS and ANSI_NULL_DFLT_ON, as well as other session settings not related to NULL handling. SQL Server's ODBC driver and the SQL Server OLE DB provider automatically set ANSI_DEFAULTS to ON. You can change the ANSI_NULLS setting when defining your DSN. Two of the client tools supplied with SQL Server (Query Analyzer and the text-based osql) use the SQL Server ODBC driver but then internally turn off some of these options. To see which options are enabled for the tool you're using, you can run the following command:

DBCC USEROPTIONS
Here's a sample of the output it might return:
Set Option                 Value
-------------------------- -------------------------
textsize                   64512
language                   us_english
dateformat                 mdy
datefirst                  7
ansi_null_dflt_on          SET
ansi_warnings              SET
ansi_padding               SET
ansi_nulls                 SET
concat_null_yields_null    SET
Although internally SQL Server's default behavior is to not allow NULLs unless specifically declared in the CREATE TABLE statement, you might never see this behavior in action. Because SQL Server Query Analyzer, the basic tool for submitting SQL code to SQL Server, is ODBC-based, it automatically turns on the ANSI_NULL_DFLT_ON option. This setting means that all your new columns will allow NULLs by default. We can't overemphasize that your best bet for avoiding confusion is to always state explicitly in your table definition whether NULLs should be allowed.
The database compatibility level, which you set using the system procedure sp_dbcmptlevel, controls two additional aspects of how SQL Server handles NULL values. If the compatibility level is set to 70, the nullability of bit columns without explicit nullability is determined by the session setting SET ANSI_NULL_DFLT_ON or SET ANSI_NULL_DFLT_OFF, or by the database setting ANSI null default. In 60 or 65 compatibility mode, bit columns created without an explicit NULL or NOT NULL option in CREATE TABLE or ALTER TABLE are created as NOT NULL.
The database compatibility level also controls whether SQL Server interprets an empty string (two single quotes with nothing between them) as a single space or as a true empty string. In compatibility level 60 or 65, SQL Server interprets empty strings as single spaces. If the compatibility level is 70, SQL Server interprets empty strings as truly empty, that is, a character string with no characters in it. Sometimes this empty string is referred to as a NULL, but SQL Server doesn't treat it like a NULL. SQL Server marks NULLs internally as NULLs, but an empty string is actually stored as a variable-length character field of 0 length.
In 60 or 65 compatibility mode, the empty string ('') is interpreted as a single space in INSERT or assignment statements on varchar data. In concatenating varchar , char , or text data, the empty string is interpreted as a single space. This means that you can never have a truly empty string. The only alternative in 60 or 65 compatibility mode would be to define the field as allowing NULLs, and use a NULL in place of an empty string.
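You can see the difference with the DATALENGTH function, assuming you have permission to change the compatibility level of the pubs database:

EXEC sp_dbcmptlevel pubs, 70
SELECT DATALENGTH('')    -- 0: a true empty string
EXEC sp_dbcmptlevel pubs, 65
SELECT DATALENGTH('')    -- 1: the empty string is interpreted as a single space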
As you can see, you can configure and control the treatment and behavior of NULL values in several ways, and you might think it would be impossible to keep track of all the variations. If you try to control every aspect of NULL handling separately within each individual session, you can cause immeasurable confusion and even grief. However, you'll notice that most of the issues become moot if you follow a few basic recommendations: always declare NULL or NOT NULL explicitly when you create a column, prefer NOT NULL columns with default values for missing or unknown entries, and leave the ANSI-related session settings at the values your tools establish rather than changing them from session to session.
If you must use NULLs in some cases, you can minimize problems by always following the same rules, and the easiest rules to follow are the ones that ANSI already specifies.