Designing Data Integrity


The ability to ensure you can retrieve persisted data without error is of primary importance. Without data confidence, no query answer can be guaranteed to be correct, and there is not much point in worrying about data availability or performance.

Thus far in the chapter you have seen methods to review and query your data for quality using implicit means. In this section, you will take a proactive approach to data integrity through explicit definitions.

Your first concern is to make certain that you have exactly one master copy of the data that is used throughout the enterprise. Then, to be proactive, you must explicitly define the data using entity and attribute values.

Warning 

It goes without saying, but we will say it anyway: make certain to log all your changes and maintain up-to-date records.

Reconciling Data Conflicts

If all data came from one source and was stored in one source, you wouldn’t need to worry about data conflicts. However, that is not the nature of today’s data. Most of our data is reusable; it’s pulled from a variety of sources and distributed widely. It is important to generate a master copy of your data and retain its integrity.

Sometimes data comes from a variety of sources and must be transformed into the schema of the master database. SQL Server Integration Services provides the means to import data, transform it, and verify it as it is imported into the master database. Even if your master database does not have a solidified schema but is a result of a variety of inputs, SSIS can assist in the master database creation process.

In other enterprises, the master database just needs to be copied to locations nearer the users without using any transformations. This process is replication. In this scenario, the master database is the publisher, and the databases in the locations near the users are the subscribers. If all modifications to the master database are made at the master, the replication paradigm consists of deciding how much out of synchronization the subscriptions are (the latency) and how much independence, or autonomy, you give the subscribers. If updates are allowed at the subscribing locations, data conflict rules need to be devised.

Consequently, no one solution fits all. Companies use a variety of approaches to maintain a master copy of their data. Let’s look at each approach and how it handles data conflicts.

The Single Copy Approach

In this approach, only one master copy of the data exists. All modifications and additions are made directly to the master data. To keep consistency, any application that uses the data must also be specifically written to use the master data. This approach is certainly not conducive to data reuse; it is practical only when you are matching a particular database with a given application and using it locally, and it works best when the application is written in-house. In a larger environment, this approach would either cause contention issues or require the creation of copies of the master data, which takes you to the next approach.

The Multiple Copies, Single Maintenance Approach

In this approach, there is a single master copy of the data and multiple copies that are distributed for local use. All modifications or additions are made to the single master copy of the data; however, changes are sent out to the distributed copies so that they remain up-to-date. This is the classic publisher-subscriber paradigm with a single publisher updating multiple subscribers. The updating can be accomplished through a number of methodologies.

Real World Scenario: Reconciling Conflicting Data

This sidebar highlights a scenario I heard recently involving data conflicts.

Jay is a credit card customer who relocated and immediately notified the company of his new address. Even though he used the credit card, Jay did not receive a bill at his new location. He really didn’t think about this too much until he received a call from the billing department of the credit card company stating that his account was in arrears. Jay verified his new address while on the phone with the billing department; the company did indeed have his new location of Trenton, Maine, on file. Jay asked the representative to send out a copy of the bill so he could settle the account. When he had not received a bill after another few weeks, Jay decided to call the billing department of the credit card company. The billing department told him that the company had now turned the delinquent account over to a collection agency. However, while on the phone, Jay managed to get to the root of the problem. It seemed that although the address in the file was Trenton, Maine, the address being billed was Trenton, New Jersey. It took many more phone calls and paperwork for Jay to finally resolve this problem. And, needless to say, Jay no longer had confidence in the company.

In this scenario, the master copy of the data was accurate, but another copy of it was flawed. Data needs to be reusable, but the consistency of the data cannot suffer. You need to ensure that methods are in place to maintain data consistency when the data resides in many locations.

Jay associated the integrity of a company with the integrity of the data. Wouldn’t you?

-Marilyn Miller-White


The master copy of the data still remains the only copy that is able to be updated. Any changes that have to be made to the locally distributed subscriptions need to be sent to the master copy for updating. The changes will then be published to the subscribers from the single source.

The actual methodology chosen for data distribution depends on the amount of site autonomy and latency a subscriber desires. Transactional replication fills the need for subscribers that do not need to remain autonomous but need very low latency. With transactional replication, it is the changes, or transactions, in the database log file that are distributed to the local subscribers. On the other hand, snapshot replication works well in situations where updates are few and the subscriber needs to be autonomous and can tolerate moderate to high latency. A snapshot is a point-in-time picture of the data that is distributed to the local subscribers.

Note 

See Chapter 8, “Designing a Strategy to Manage Replication,” for more in-depth information about replication.

The Updatable Copies, Single Master Approach

In this approach, there is a single master copy of the data and multiple copies that are distributed for local use; however, the local copies are now able to update the master copy. This approach is better suited to reusing data and distributing it among several applications throughout your organization. But with these benefits comes the problem of data conflicts: what if data is updated at two different locations and then sent to the master copy? Which piece of data is the winner?

This approach uses either transactional replication or snapshot replication, but with updating subscribers. When you create a publication that accepts updates from the subscribers, two communication mechanisms can propagate the changes: two-phase commit, often referred to as immediate updating, and queues, termed queued updating.

To handle the changes, SQL Server inserts a new column into each table that is included in the publication article, as shown in Figure 7.7.

Figure 7.7: Addition of the uniqueidentifier column in the New Publication Wizard

A new globally unique identifier (GUID) is placed in the uniqueidentifier column each time the data row is modified. At the time of synchronization between the publisher and subscriber, the queue reader agent compares the values of the GUID to see whether they are the same; if a change has occurred, the synchronization process updates the publisher. Figure 7.8 shows the new uniqueidentifier column added to the NewClient table.

Figure 7.8: The uniqueidentifier column in the published table
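If you prefer to confirm the added column from Transact-SQL rather than the table designer, a query such as this minimal sketch works. The dbo.NewClient table comes from the figure, and the column name (conventionally msrepl_tran_version) is an assumption you should verify against your own publication.

     --Look for the uniqueidentifier column replication added to the published table
     --(dbo.NewClient is the table from the figure; verify the column name,
     -- typically msrepl_tran_version, in your own publication)
     SELECT c.name AS column_name, t.name AS data_type
     FROM sys.columns AS c
     JOIN sys.types AS t ON c.user_type_id = t.user_type_id
     WHERE c.object_id = OBJECT_ID('dbo.NewClient')
       AND t.name = 'uniqueidentifier' ;
     GO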

Data Conflicts When Using Snapshot Replication with Updating Subscribers

Snapshot replication with updating subscribers can be configured only by using the system stored procedures sp_addpublication and sp_addsubscription to create the publication and subscription. For snapshot replication with immediate updating, the @update_mode argument of the sp_addsubscription procedure equals sync tran. There is never any concern for data conflicts in this process because it uses the Distributed Transaction Coordinator. At worst the process is slow or, if the connection is lost, the transaction is rolled back, and the reconnect and resend are handled appropriately by the application involved.

For snapshot replication with queued updating, the @update_mode argument of the sp_addsubscription procedure equals queued tran. The modifications made at the subscription are stored in a queue, using a trigger to log the changes. The queue is read by the Queue Reader Agent, which applies the transactions at the publisher. The changes then remain on the publisher and are sent out to the subscribers at the time of the next snapshot. In SQL Server 2005, the queue service is provided entirely by SQL Server. Queued updating opens the possibility for updates to occur at both servers involved in the process at the same time, and data conflicts can occur when applying these changes. Therefore, you must configure a conflict resolution policy when you create the publication.

The conflict resolution policy resides on the publication (see Figure 7.9). The default policy is Keep the Publisher Change. The other two options that can be selected are Keep the Subscriber Change and Reinitialize the Subscriber.

Figure 7.9: The conflict resolution policy for updatable subscriptions
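Putting these pieces together, here is a minimal Transact-SQL sketch of a publication and subscription that allow queued updating, with the conflict policy set explicitly. The publication, subscriber, and database names are hypothetical, and only the arguments relevant to updating subscribers are shown; a real script requires additional arguments (articles, agents, security settings) as documented in Books Online.

     --Publication that allows updating subscribers (run at the publisher)
     EXEC sp_addpublication
         @publication = N'SalesPub',          --hypothetical name
         @repl_freq = N'snapshot',
         @allow_sync_tran = N'true',          --immediate updating
         @allow_queued_tran = N'true',        --queued updating
         @conflict_policy = N'pub wins' ;     --Keep the Publisher Change
     GO
     --Subscription that uses queued updating
     EXEC sp_addsubscription
         @publication = N'SalesPub',
         @subscriber = N'SUBSERVER1',         --hypothetical subscriber
         @destination_db = N'SalesDB',
         @update_mode = N'queued tran' ;      --use N'sync tran' for immediate updating
     GO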

Data Conflicts When Using Transactional Replication with Updating Subscribers

Transactional replication with updating subscribers can be configured either by using the New Publication Wizard and New Subscription Wizard in SQL Server Management Studio or by using the system stored procedures sp_addpublication and sp_addsubscription. Again, there is no concern for data conflicts when using the immediate updating process because it uses the Distributed Transaction Coordinator. It does require a very reliable link; if the connection is lost for more than 20 seconds, the transaction is rolled back and the subscriber must reconnect.

However, queued updating allows for the possibility of updates occurring simultaneously at both servers involved in the process, so here you have a chance for data conflicts to occur. Therefore, when you create the publication, you must also configure a conflict resolution policy.

Note 

For more information about updatable subscriptions for transactional replication, see http://msdn2.microsoft.com/en-us/library/ms151718.aspx.

The Multiple Master Approach: Merge Replication

Using this approach, changes can be made at both the publisher and the subscriber. When the Merge Agent runs, it synchronizes the changes, which are tracked by triggers on the replicated tables, so that the publisher and subscriber converge to the same values. Each replicated table contains a GUID column to track each replicated row uniquely.

If the data has been modified at both locations, the Merge Agent detects the change as a data conflict. How the conflict is resolved is based on how you have defined the conflict resolution policy for merge replication. By default, the publisher is the winner. However, you can choose from a series of resolution options, create your own resolver using stored procedures, or create a Replication Management Objects (RMO) procedure. Figure 7.10 shows the Resolver tab of the Article Properties page in SSMS.

Figure 7.10: The Resolver options page for a merge article

You should choose to use the default resolver or an article resolver based on the data and the application’s business logic needs.

The Multiple Master Approach: Peer-to-Peer Replication

We would be remiss if we did not mention the new algorithm in SQL Server 2005 for data merging: peer-to-peer replication. In this approach, each node in the peer-to-peer replication propagates its changes to all other nodes. Every node is both a publisher and a subscriber of every other node using the same schema and data. Hence, every node holds a master copy. In this algorithm, the transactional replication relationship is a peer relationship, in contrast to the hierarchical relationship found in the previous forms of transactional replication discussed earlier. Figure 7.11 shows a peer-to-peer replication solution.

Figure 7.11: A peer-to-peer replication solution

In this solution, you have multiple copies of the database, with each database engine maintaining its own copy. This implementation allows for updatability. The replication technology recognizes a change made at any hosting node and propagates it appropriately to the peer members. It is important to understand that the peer-to-peer approach does not use the immediate or queued updating publication options.

In database applications employing a peer-to-peer scale-out solution, you need to ensure that data modifications processed at one node are synchronized with all peers prior to being processed at another node. You can use a few techniques to accomplish this. One is data stewardship, which means you can update only the data you own. Let’s use a retail store chain as an example. If you allow each store to have full access to its own data and read access to all other stores’ data, you have made them stewards of their data. This concept avoids update conflicts, eliminates the overhead of merge replication, and prevents potential server bottlenecks.

Peer-to-peer replication provides high availability of your data while allowing local maintenance of the participating databases. If another store is opened or closed, you simply add or remove a node in the replication topology. If you have a situation where data stewardship is not a practical solution, you can still use merge replication to handle the conflicts.

Another nice feature of this approach is that you can use it for the entire database, just a table, or even just parts of a table. Because it is flexible and can be combined with other solutions that require multiple copies of the data, you should definitely keep peer-to-peer replication in mind as a solution.

The ETL Master Data Approach

It is more than likely that your organization has more than one set of master data. This is common because many companies grow through mergers and acquisitions. With each new merger comes a new set of data for you to handle: another company's master data copy. If no customers or products overlapped in the two data sets, the merger might not be so bad. However, this is not usually the case, because companies of the same type merge and therefore might share some of the same clients or product base. The job would not be so bad if you could just find a common thread, such as a Social Security number; however, at times even that is not available. Also, the records can be in a different format and use different key values. So, most often the new data must be compared through a series of filters and then cleansed through a series of fuzzy lookups and transformations to determine whether the records can be merged appropriately. Finally, an ETL-derived master data copy emerges.

Assigning Explicit Constraints on Your Data

Essentially, data consists of entities and attributes. Thus, data integrity consists of entity integrity and domain integrity. Domain integrity includes user-defined integrity and referential integrity.

Constraints are the ANSI-standard method for enforcing data integrity. Each type of data integrity is enforced with separate types of constraints, as shown here:


Type of Integrity         Constraint Type

Domain (user-defined)     CHECK

Domain (referential)      FOREIGN KEY

Entity                    PRIMARY KEY, UNIQUE

In SQL Server 2005 Microsoft suggests that you use constraints as the preferred way to enforce your business rules.

You can create or modify constraints on your tables by right-clicking the table in SQL Server Management Studio and choosing Modify or by using the ALTER TABLE Transact-SQL statement.

Enforcing Entity Integrity

When you specify entity integrity, you require that all rows in a table have a unique identifier. Each row must have a unique value that distinguishes it from all other rows in the table. You specify this type of integrity using either a primary key constraint or a unique constraint.

Primary Key Constraint

You use a primary key constraint to provide a means to uniquely identify each row of a table. In a design that follows third normal form, every attribute in the row depends on the primary key. You can create a primary key constraint on multiple columns; however, a table can have only one primary key. When you create a primary key constraint, a unique index is automatically created for the primary key column(s).

A primary key must always contain a value; it cannot contain a NULL.

A primary key can be natural (that is, meaningful to the user) or surrogate (autogenerated). In either case, the primary key constraint ensures that new data meets the specified criteria and that data modification operations are checked against the key's definition.

The primary key builds the relationship between tables by providing the referencing object for a foreign key.
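Here is a minimal sketch of a primary key constraint on a hypothetical dbo.Customer table; the table and constraint names are ours, not from AdventureWorks.

     --A surrogate primary key; a unique index is created automatically
     CREATE TABLE dbo.Customer
         (CustomerID int IDENTITY(1,1) NOT NULL
              CONSTRAINT PK_Customer PRIMARY KEY,
          LastName varchar(50) NOT NULL) ;
     GO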

Unique Constraint

You use a unique constraint to specify that each value in a column is unique. A unique constraint can contain a NULL value, but only one!

Unique constraints can also be applied to multiple columns; unlike the primary key constraint, you can have more than one unique constraint per table.
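Continuing the hypothetical dbo.Customer table from the previous sketch, the following adds a column governed by a unique constraint:

     --Each e-mail address must be unique; a single NULL is allowed
     ALTER TABLE dbo.Customer
         ADD EmailAddress varchar(100) NULL
             CONSTRAINT UQ_Customer_Email UNIQUE ;
     GO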

Enforcing Domain Integrity

When you specify domain integrity, you determine a set of values for your data that are valid for a column or for references to that column. The referencing element of domain integrity is often studied separately as referential integrity; for our discussion, we are including both the referencing and the validity aspects of a column under the umbrella of domain integrity. You can enforce domain integrity by restricting the data type, by requiring a particular format, or by constraining the column to a range of allowed values or to a rule or association. The check constraint and the foreign key constraint fall into that last category.

Check Constraint

You use a check constraint to specify a rule that values being placed in a column must obey. The check constraint is actually an expression defining the permitted values for the column. For example, a check constraint could require a date earlier than a certain date or a number greater than a given value.
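Concretely, a minimal sketch (again using the hypothetical dbo.Customer table) might constrain a column to nonnegative values:

     --Reject any negative credit limit
     ALTER TABLE dbo.Customer
         ADD CreditLimit money NULL
             CONSTRAINT CK_Customer_CreditLimit CHECK (CreditLimit >= 0) ;
     GO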

Foreign Key Constraint

When you are defining domain referential integrity, you are enforcing a relationship between a column entry and a primary key or unique constraint in the same or another table. You use the foreign key constraint to enforce the relationship, with the foreign key residing in the referencing table. The relationship is made with the primary key or unique constraint in the referenced table, which might be the same table. Enforcing referential integrity means that the data in the referenced table is protected from deletion if that deletion would orphan data in the referencing table (the one containing the foreign key). You may have up to 253 foreign key constraints in a SQL Server table.

Figure 7.12 is an example of a foreign key constraint from the AdventureWorks database. As noted in the figure, the Production.ProductModel and Production.Product tables are linked using a foreign key constraint.

Figure 7.12: The Production.ProductModel and Production.Product tables

The ProductModelID column of the Production.Product table is linked using a foreign key constraint to the ProductModelID column, which is the primary key of the Production.ProductModel table. Referential integrity guarantees that the ProductModel row in the Production.ProductModel table cannot be deleted as long as any row in the Production.Product table references that particular ProductModelID. Figure 7.13 shows the column properties of the Production.Product.ProductModelID column in SSMS.

Figure 7.13: The Production.Product.ProductModelID column properties
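For reference, here is a hedged sketch of how such a foreign key is declared in Transact-SQL. The constraint name is illustrative; AdventureWorks ships with its own name for this relationship.

     --Each product must reference an existing product model
     ALTER TABLE Production.Product
         ADD CONSTRAINT FK_Product_ProductModel
             FOREIGN KEY (ProductModelID)
             REFERENCES Production.ProductModel (ProductModelID) ;
     GO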

Using DEFAULT Definitions

You use a DEFAULT definition to explicitly place a value in a column that would otherwise be empty. If the column accepts null values, you are basically replacing the null values with the DEFAULT value. A DEFAULT is particularly useful in data entry applications where a specific value is most often entered. If that value is defaulted, it can be tabbed over easily. You cannot create DEFAULT definitions on timestamp data or on columns defined using the IDENTITY or ROWGUIDCOL property. You can have only one default value per column.

You create a DEFAULT definition as part of the table definition when you create a table. If your table already exists, you can use ALTER TABLE or the SSMS Modify option to apply the DEFAULT definition. Each column of a table can have a DEFAULT definition.

If you have existing data in a column and add the DEFAULT definition, the existing values are not affected; the DEFAULT is applied only to rows added thereafter. If you are adding a new column to an existing table that already contains data, you can specify that the DEFAULT be applied to the existing rows as well as to new ones.
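A minimal sketch of the latter case, reusing the hypothetical dbo.Customer table from earlier: the WITH VALUES clause applies the default to the rows that already exist.

     --Existing rows receive 'NA' instead of NULL in the new column
     ALTER TABLE dbo.Customer
         ADD Region char(2) NULL
             CONSTRAINT DF_Customer_Region DEFAULT 'NA' WITH VALUES ;
     GO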

Tip 

Avoid using the CREATE DEFAULT statement in new development because it is being deprecated. Instead, you should use default definitions created using the DEFAULT keyword in the CREATE TABLE or ALTER TABLE statement.

In Exercise 7.9, you will create a table in the AdventureWorks database and create a DEFAULT definition on one of the columns.

Exercise 7.9: Using the DEFAULT Definition in the AdventureWorks Database


This exercise demonstrates how the DEFAULT definition works with existing and new data.

  1. Create a new table, dbo.default_ex, in AdventureWorks with two integer columns:

     --Start the exercise using AdventureWorks Database
     USE AdventureWorks;
     GO
     --Create a new table with two integer columns
     CREATE TABLE dbo.default_ex
         ( col_a INT,
           col_b INT) ;
     GO

  2. Insert a value, 2, into column a:

     --Insert a value into column a
     INSERT INTO dbo.default_ex (col_a)
     VALUES ( 2 ) ;
     GO

  3. Alter the table to place a default of 100 in column b:

     --Alter the table to place a default of 100 in column b
     ALTER TABLE dbo.default_ex
         ADD CONSTRAINT col_b_default DEFAULT 100 FOR col_b ;
     GO

  4. Insert a value, 3, into column a:

     --Insert a value in column a
     INSERT INTO dbo.default_ex (col_a)
     VALUES ( 3 ) ;
     GO

  5. Display the results:

     --First row contains a NULL in column b
     --Second row contains the default in column b
     SELECT *
     FROM dbo.default_ex ;
     GO

Your output will look like the following:

     col_a       col_b
     ----------- -----------
     2           NULL
     3           100

Notice that only the second row, since it was created after the DEFAULT was defined, contains the default value of 100. The first row contains the expected NULL.

  6. Drop the created table:

     --Drop the created table
     DROP TABLE dbo.default_ex ;
     GO


Assigning Data Types

Setting the proper data type on data can go a long way to ensure its integrity. In applications that require precise calculations, you need to make certain that not only have you chosen the most appropriate data type but that you understand and have used the correct amount of accuracy and precision needed for the implementation.

Real World Scenario: Creating Clinical Trial Deliverables

When creating deliverable database solutions for clinical trials, we used constraints whenever possible. Our studies included Phase I trials that required a lot of data entry over a very short time span. To make the data entry faster and more accurate, we used defaults abundantly. Data entry personnel could easily tab over the fields, just stopping to enter the required data. We also put check constraints in place to keep the initial entry as clean as possible.

-Marilyn Miller-White


You might find it more appropriate to define your own data type based on one of the built-in types available in SQL Server 2005. These are called alias data types and will be discussed later in this chapter.

In the “Using User-Defined CLR Data Types” section, you will learn how to integrate your own data types created using the CLR within the security store of SQL Server.

Assigning Transact-SQL Data Types

We’ll begin with the system-defined data types in SQL Server 2005 Transact-SQL. You will notice that instead of the general listing of data types that you might have seen in many books, we have broken the data types into categories of use, since that is how they are categorized in SQL Server 2005. You need to understand not only the data type but also how it is used in your applications.

Using Exact Numerics

These data types represent exact numbers. The choice of data type depends on the range of your data and its precision.

bit   This data type contains a 1, 0, or NULL, if allowed. It is useful to show status such as true/false or on/off.

int   This data type contains integer data from –2^31 (–2,147,483,648) through 2^31–1 (2,147,483,647). It occupies 4 bytes of disk space and is used in mathematical functions.

bigint   This data type contains integer data from –2^63 (–9,223,372,036,854,775,808) through 2^63–1 (9,223,372,036,854,775,807). It uses 8 bytes of disk space and is useful for especially large numbers.

smallint   This data type contains integer data from –2^15 (–32,768) through 2^15–1 (32,767). It occupies 2 bytes of disk space and is useful for slightly smaller numbers than you would store in an int field, because of its 2-byte storage size.

tinyint   This data type contains integer data from 0 through 255. It takes only 1 byte of disk space. When you decide to use this data type, make certain you will never exceed the 255 limit!

decimal   This data type contains fixed-precision and scale-numeric data from –10^38+1 through 10^38–1. It uses two parameters: precision and scale.

Precision is the total count of digits that can be stored in the field.

Scale is the number of digits that can be stored to the right of the decimal point. For instance, if you have a precision of 6 and a scale of 2, your field has the format XXXX.YY. The default precision is 18 if none is specified.
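For example, decimal(6,2) stores at most four digits before the decimal point and two after, and SQL Server rounds extra fractional digits on assignment; a minimal sketch:

     --Precision 6, scale 2: format XXXX.YY
     DECLARE @Price decimal(6,2) ;
     SET @Price = 1234.567 ;
     SELECT @Price AS Price ;   --returns 1234.57 (rounded to scale 2)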

numeric   This data type is the same as decimal.

money   This data type includes monetary data values from –2^63 (–922,337,203,685,477.5808) through 2^63–1 (922,337,203,685,477.5807), with accuracy to a ten-thousandth of a monetary unit. It consumes 8 bytes of disk space. money is useful for storing sums larger than 214,748.3647; otherwise, you can use smallmoney.

smallmoney   This data type includes monetary data values from –214,748.3648 through 214,748.3647, with accuracy to a ten-thousandth of a monetary unit. It uses 4 bytes of space and can be used to store smaller sums of money than would be stored in a money field.

Using Approximate Numerics

It is often necessary to provide data types for numerics that cannot be measured exactly. For these data, you can use the following approximate numbers:

  • float   This data type includes floating-precision number data from –1.79E+308 through 1.79E+308. A floating-point number is a scientific (mantissa and exponent) representation of a number, which is an approximate representation. The number of bytes used for storage depends on the precision specified for the mantissa: 4 bytes for up to 24 bits of mantissa and 8 bytes for 25 through 53 bits.

  • real   This data type includes floating-precision number data from –3.40E+38 through 3.40E+38 using 4 bytes of disk storage. real is also an approximate representation of a number and should be used accordingly.

Using Date and Time

Depending on the accuracy needed, the following data types are used for date and time:

  • datetime   This data type includes date and time data from January 1, 1753, through December 31, 9999, with values rounded to increments of .000, .003, or .007 seconds using 8 bytes of disk space. It should be used to track specific dates and times.

  • smalldatetime   This data type includes date and time data from January 1, 1900, through June 6, 2079, using an accuracy of one minute and taking 4 bytes of disk space. It should be used for less specific and smaller ranges of dates and times than you would store in datetime.

Using Character Strings

Non-Unicode character data can be represented by either an exact length or a variable-length data type. The new varchar(max) can be used for largely varying or very large character strings.

char   This data type includes fixed-length, non-Unicode character data with a maximum length of 8,000 characters. It’s useful for character data that will always be the stated length, such as a country code or state code that you will confine to only two characters. This data type uses the exact amount of disk space as is defined when created, even if the entire amount is not filled with data.

varchar   This data type includes variable-length, non-Unicode data with a maximum of 8,000 characters. It is used when the data varies in length, such as in a name column where each name has a different number of characters. This uses the disk space of the stored values. For example, if you have a column of varchar(30) but you are storing a name containing only 12 characters, the column will use only that amount of space plus overhead.

varchar(max)   This data type has the same attributes as varchar; however, when you use the max specifier, it can hold 2^31–1 (2,147,483,647) bytes of data. It is used when the data column entries vary greatly and the data size might exceed 8,000 bytes.

Using Unicode Character Strings

Likewise, Unicode character data can be represented by either an exact length or a variable-length data type. The new nvarchar(max) can be used for largely varying or very large character strings.

nchar   This data type includes fixed-length, Unicode data with a maximum length of 4,000 characters. Like all Unicode data types, it is used for storing small amounts of text that will be read by clients that use different languages.

nvarchar   This data type includes variable-length, Unicode data with a maximum length of 4,000 characters. It is the same as nchar except that it uses less disk space when there are fewer characters than the defined amount.

nvarchar(max)   This data type has the same attributes as nvarchar; however, when (max) is specified, the data type holds 2^31–1 (2,147,483,647) bytes of data. It is used when the data column entries vary greatly and the data size might exceed 4,000 bytes.

Using Binary Strings

Binary objects and image files use the binary string data type:

  • binary   This data type includes fixed-length, binary data with a maximum length of 8,000 bytes. It is interpreted as a string of bits and is used for storing data that is more purposeful in binary or hexadecimal shorthand, such as a security identifier.

  • varbinary   This data type includes variable-length, binary data with a maximum length of 8,000 bytes. It is just like binary, except that it uses less disk space when fewer bits are stored in the column.

  • varbinary(max)   This has the same attributes as varbinary; however, when the (max) specifier is stated, the data type can hold 2^31–1 (2,147,483,647) bytes of data. This is useful for storing binary objects such as image files or Microsoft Word documents where column data entries exceed 8,000 bytes.

Using Other Data Types

Here are some other data types you use in specific situations, as noted in their definitions:

  • cursor   This data type is for variables or stored procedure OUTPUT parameters. The cursor data type cannot be used to store data in a column of a table. It allows you to return a cursor from a stored procedure or a variable. Cursors are often used with applications that cannot use entire data sets and must query records one at a time.

  • sql_variant   This data type allows you to store certain other data types. The only values it cannot store are varchar(max), nvarchar(max), text, image, sql_variant, varbinary(max), xml, ntext, timestamp, and user-defined data types.

  • table   This special data type returns tables from stored procedures or stores tables in variables for later processing. You cannot use this data type to store a table in a column of another table. It is primarily used to temporarily store sets of rows returned as the result of a table-valued function.

  • timestamp   This data type is an automatically generated, unique, binary number used as a mechanism for version stamping a row of data. Unlike previous versions, the Transact-SQL timestamp data type does not preserve date and time; it is an incrementing 8-byte counter. If you need to record the date and time of an update, use a datetime data type.

  • uniqueidentifier   This data type is a 16-byte GUID. It is most often used in merge replication and in transactional replication with updating subscribers to guarantee the uniqueness of a row of data. You initialize a value in the data column by using the NEWID function (see the sketch following this list).

  • xml   New to SQL Server 2005, this data type stores entire Extensible Markup Language (XML) documents or fragments.

    Note 

    Remember, the text, ntext, and image data types have been deprecated in this version of SQL Server. You should replace these data types with varchar(max), nvarchar(max), and varbinary(max).
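As promised in the uniqueidentifier entry, here is a minimal sketch of a column initialized with the NEWID function; the table name is hypothetical.

     --Each inserted row receives an automatically generated GUID
     CREATE TABLE dbo.Shipment
         (ShipmentGUID uniqueidentifier NOT NULL DEFAULT NEWID(),
          ShippedDate datetime NULL) ;
     GO
     INSERT INTO dbo.Shipment (ShippedDate) VALUES (GETDATE()) ;
     SELECT ShipmentGUID, ShippedDate FROM dbo.Shipment ;
     GO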

Using Alias Data Types

You might find it useful to create your own data type built on a system data type as a common method for referring to a particular type of data. For example, this could be the two-character representation for a state, the five-character representation of a ZIP code, or a particular representation for a measurement that you need to standardize for an application. This data type is called a Transact-SQL user-defined data type or alias.

When you create an alias data type, it is available for the entire database but not outside the created database. If you do want this new data type to be used throughout your environment, you need to create it in all existing databases and add it to the model database for all future databases to be created on that server.

The benefit of the alias is that you are creating a type that is ensuring column integrity when it is used. The downside is that an alias is built on an existing system type, so it has very little flexibility.

When you create an alias data type, you must supply three parameters:

  • Name

  • System data type on which you are building the alias

  • Its nullability (if not stated, connection default is assigned)

You would use the following code to create, for example, an alias PostalCode data type:

 --Define: Name, base system data type, nullability
 CREATE TYPE PostalCode FROM char(5) NULL;

You can use the PostalCode alias in SQL Server Management Studio using the following steps:

  1. Expand the AdventureWorks database.

  2. Expand the Types folder.

  3. Expand the User-Defined Data Types folder. dbo.PostalCode should be visible. Right-click it, and choose Properties to check its definition.

You can use the PostalCode alias to define an existing column data type or for future column data definitions. Here is an example of using the PostalCode alias:

  1. Expand the AdventureWorks database.

  2. Expand the Tables folder.

  3. Right-click the Person.Address table, and choose Modify.

  4. Select the PostalCode column, and using the Data Type drop-down list, scroll to and choose the PostalCode alias you recently created. To allow null values, you would check the Allow Nulls column. You do not need to save this change; we merely wanted you to see how to use the alias.
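If you prefer Transact-SQL, the alias can be used directly in a column definition. This sketch uses a hypothetical table rather than modifying Person.Address:

     --The PostalCode column is typed with the alias created earlier
     CREATE TABLE dbo.MailingList
         (Name varchar(50) NOT NULL,
          PostalCode PostalCode NULL) ;
     GO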

Using User-Defined CLR Data Types

Working with managed code in SQL Server 2005 gives an added dimension to your data types. With the CLR, you are no longer restricted to building your data type on an existing system type.

CLR user-defined types (UDTs) allow you to create data types you never were able to create within your database before, and they can include multiple elements. For example, UDTs are instrumental for creating types in environments that work with geospatial data; they can also implement date and time functionality or perform data encryption. However, it is good to remember that their use is specialized, and if you can use a Transact-SQL data type for the same result, doing so is usually more efficient in the overall query process. When you create a UDT, you are creating your own methods to serialize and deserialize all data being stored in your new data type. That is, every time SQL Server uses your UDT, it must call your code to process the request. This technique does not work well for large result sets or for complex code logic. So, keep algorithms discrete, result sets small, and code minimal.

Note 

Remember that the CLR user-defined types are stored within the security of your SQL Server.

Here is the process you use to develop a UDT in SQL Server 2005:

  1. To use a CLR UDT, you need to enable the CLR for your server’s instance by using the Surface Area Configuration utility.

  2. You must code and build the assembly defining the UDT using any language supported by Microsoft .NET Framework CLR, such as Microsoft Visual C# or Microsoft Visual Basic .NET.

  3. You must register the assembly. This can be accomplished either by deploying the UDT in a database project in Microsoft Visual Studio or by using the Transact-SQL CREATE ASSEMBLY statement.

  4. You must create the UDT in SQL Server by using the CREATE TYPE statement within the database in which it will be used.

  5. You can then use the UDT to define columns in tables, as well as parameters and variables in functions and stored procedures.
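The Transact-SQL side of steps 1, 3, and 4 looks roughly like the following. This is a hedged sketch: the assembly name, file path, and class name are hypothetical placeholders for your own compiled UDT, and enabling the CLR through sp_configure is an alternative to the Surface Area Configuration utility.

     --Step 1 (alternative): enable the CLR for the instance
     EXEC sp_configure 'clr enabled', 1 ;
     RECONFIGURE ;
     GO
     --Step 3: register the compiled assembly (hypothetical name and path)
     CREATE ASSEMBLY GeoTypes
     FROM 'C:\Assemblies\GeoTypes.dll'
     WITH PERMISSION_SET = SAFE ;
     GO
     --Step 4: create the UDT from a class in that assembly
     CREATE TYPE dbo.Point
     EXTERNAL NAME GeoTypes.[GeoTypes.Point] ;
     GO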

Using Constraints Versus Triggers

This chapter would not be complete without a discussion of when to use a constraint and when to use a programmable object to control your data as it follows a business process.

If you can enforce domain integrity by restricting the data type, by requiring a particular format, or by constraining the column to a range of allowed values or to a rule or association, you should use one of the constraints mentioned in the earlier sections. Constraints carry very little overhead in the processing of your data; most often they are applied before the data even enters your database. However, a time comes when the relationship between the data is just not that simple. Perhaps a business rule needs to be followed. For example, when a customer places an order from your inventory, a chain of events must occur and be verified: Is there sufficient inventory? Will any items need to be reordered to replenish inventory? Is the customer's credit good? What are the terms of payment? How does the order get shipped? These are all events that need to be handled programmatically through stored procedures that implement the business rules. Stored procedures that fire automatically in response to an event are called triggers. Triggers require a greater amount of overhead on your system because they process the business rules.
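As a minimal sketch of that kind of programmable check, the following trigger rejects an order line when on-hand inventory is insufficient. All table and column names here are hypothetical; a real order-entry system would involve considerably more logic.

     --Fires after each insert into the order detail table
     CREATE TRIGGER trg_OrderDetail_CheckInventory
     ON dbo.OrderDetail
     AFTER INSERT
     AS
     BEGIN
         IF EXISTS (SELECT 1
                    FROM inserted AS i
                    JOIN dbo.Inventory AS inv
                        ON inv.ProductID = i.ProductID
                    WHERE inv.QuantityOnHand < i.OrderQty)
         BEGIN
             RAISERROR ('Insufficient inventory for one or more items.', 16, 1) ;
             ROLLBACK TRANSACTION ;
         END
     END ;
     GO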

If you are able to enforce the business rule by using a constraint, that is the way to go. If not, then you need to create programmable objects such as triggers to enforce the integrity of your data.


