Maintaining Data Integrity


  • Design attribute domain integrity. Considerations include CHECK constraints, data types, and nullability.

    • Specify scale and precision of allowable values for each attribute.

    • Allow or prohibit NULL for each attribute.

    • Specify allowable values for each attribute.

Whether you have already implemented a data model or are still in the process of implementing one, you need to keep data integrity in mind as a key factor in ensuring the correctness and uniqueness of data. Data integrity means preserving the correctness and consistency of data. When incorrect or inconsistent values or records are stored in a database, the database is said to have lost data integrity. For instance, a product in a Products table may have a Product_ID of 67; it is up to the database designer to ensure that the same value is not inserted again. Similarly, if a product ID is to contain only numbers, it is the responsibility of the integrity rules to prevent characters from being entered.

IN THE FIELD: GIGO

There is an old saying in the data processing environment that the end result of any operation is determined by the data entering it: Garbage In gives Garbage Out, or GIGO. The process of maintaining integrity is an attempt to eliminate errors entering the system. No matter how good the design, how well data integrity is maintained determines how productive and usable a system is.

Many SQL Server techniques can be used to ensure data integrity is not lost. Before you can examine some of the techniques for the physical implementation, you need to break down the large topic of data integrity into some smaller, more manageable subtopics. Three types of data integrity can be enforced in your database: entity integrity, domain integrity, and referential integrity. All are discussed in detail in the sections that follow.

Entity Integrity

Entity integrity defines a unique attribute (or set of attributes) as the identifier for each instance of an entity. Generally, the rules of this type of data integrity are easy to follow. Simple ways of enforcing it are Primary Keys, UNIQUE constraints, and unique indexes when the entity design moves into the physical stage of design. Entity integrity specifies that the Primary Key of every instance of an entity must be present, must be unique, and must contain no null content.

Although SQL Server does not explicitly dictate that a table must have a Primary Key, it is good practice to design a Primary Key for each table in the model. Many procedures in SQL Server require a unique identification for each row in an entity, which is no longer a concern once you establish a Primary Key. SQL Server creates a unique index to enforce a Primary Key's uniqueness. The key also prohibits null content, so every row in the table has a value that denotes it as a unique element of the entity.
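As a minimal sketch, a Primary Key can be declared when the table is created. The Products table and its columns here are illustrative, echoing the Product_ID example earlier in the section:

    CREATE TABLE Products (
        Product_ID  int         NOT NULL PRIMARY KEY,  -- unique, non-null identifier
        ProductName varchar(40) NOT NULL,
        UnitPrice   money       NULL
    )

    -- A second row with Product_ID 67 is now rejected:
    INSERT INTO Products (Product_ID, ProductName) VALUES (67, 'Gauze')
    INSERT INTO Products (Product_ID, ProductName) VALUES (67, 'Tape')  -- fails: duplicate key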

Although indexing is discussed in Chapters 3 and 10, it is worth noting at this time that by default the Primary Key represents a clustered index. If left this way, the Primary Key places the data into sequence based on the key values.
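The clustered default can be overridden when the key is declared, as a brief sketch shows (the Visits table is illustrative):

    CREATE TABLE Visits (
        VisitID int NOT NULL PRIMARY KEY NONCLUSTERED  -- opt out of the clustered default
    )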

Unique indexing, unique constraints, Primary Keys, and, in some cases, stored procedures can all be used to help you maintain entity integrity. There is also a mechanism for assigning a Primary Key or other attribute a sequential value. This sequential value is often referred to as an auto-numbering attribute. In SQL Server the appropriate term is an Identity column.

In SQL Server, the IDENTITY property generates a sequential number each time a new row is inserted. A column of this type is often used as a surrogate Primary Key, because surrogate keys can be created easily. Note, however, that the IDENTITY property by itself does not guarantee uniqueness: the sequence can be reseeded, and SET IDENTITY_INSERT allows explicit values to be supplied. For this reason, an identity column is normally paired with a PRIMARY KEY or UNIQUE constraint, which provides the uniqueness guarantee.

The IDENTITY property creates a column containing sequential numbers that distinguish one record from another, and thus behaves like a surrogate key. A good example of a column using the IDENTITY property is an ID column, such as CustomerID. Remember, the IDENTITY property is most appropriate when no natural Primary Key candidate is in mind. For example, consider the Course entity in the Lloyd's Hospital data model.

No existing attribute is unique throughout the entity. Therefore, it is proper to create your own candidate in this example: the CourseID attribute. This attribute has no meaning except that it enables you to uniquely identify a particular course. Primary Keys for the model are identified as shown in Figure 2.11.
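A sketch of how the CourseID surrogate key might be defined; the CourseName column is an assumption for illustration:

    CREATE TABLE Course (
        CourseID   int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key: seeded at 1, incremented by 1
        CourseName varchar(50) NOT NULL            -- hypothetical descriptive attribute
    )

    -- CourseID is omitted from the INSERT; SQL Server generates 1, 2, 3, ...
    INSERT INTO Course (CourseName) VALUES ('First Aid')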

Figure 2.11. Primary keys identified for the Lloyd's Hospital data model.


Domain Integrity

Domain integrity restricts the data entered to a domain, or range, of values, preventing incorrect data from entering the system. You may, for example, want to prevent values greater than 2,500 from being entered into the database, or to prevent the entry of non-numeric data. There are many other types of domain integrity, and a large number of SQL Server technologies can help you maintain it.

We have already defined and shown the implementation of Foreign Keys. Foreign Keys help maintain integrity by ensuring that all rows of data in a child entity have a matching row of data in the parent entity. When a Foreign Key constraint is defined, it is not possible to add a new row to a child table that does not have an appropriate and related parent.
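As an illustration, suppose a hypothetical OrderDetails child table references the Products table sketched earlier; the Foreign Key could be added like this:

    ALTER TABLE OrderDetails
        ADD CONSTRAINT FK_OrderDetails_Products
        FOREIGN KEY (Product_ID) REFERENCES Products (Product_ID)

    -- An INSERT into OrderDetails with a Product_ID not present in Products now fails.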

Many column properties help you maintain domain integrity. The data type chosen for a column enforces integrity by permitting only appropriate data in the column. Precision, the number of digits in a number, and scale, the number of digits to the right of the decimal place, can be used in conjunction with numeric data types to maintain accurate data. A complete listing of data types, their descriptions, and uses is given later in this chapter. Other column properties that assist in domain integrity are nullability, which determines whether an attribute is optional or mandatory, and default value, which can supply appropriate default input.

A default provides a predefined value that is used when no value is specified for a column during data entry. Defaults can be used in place of NULL where no data is entered, to prevent errors or simply to make data entry easier. Default values are attached to a column with the DEFAULT column constraint. If most customers in the case study come from the same region, a default value could be used and overridden by data entry when necessary.

Nullability is an aspect of database design that enforces integrity through a constraint determining whether null values are allowed in a column. A null value is not a zero-length string (""); it indicates that absolutely no data is present. The NOT NULL option is useful when a user must enter data for a column and cannot leave it blank. Nullability is set at the time of table creation using the NULL and NOT NULL options: if data is required for an attribute, assign it the NOT NULL property; if the value is optional, assign NULL.
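A sketch combining both properties, using the region default mentioned above; the table and column names, and the 'WA' region code, are illustrative assumptions:

    CREATE TABLE Customers (
        CustomerID   int IDENTITY(1,1) PRIMARY KEY,
        CustomerName varchar(50) NOT NULL,               -- mandatory: NULL is rejected
        Region       char(2)     NOT NULL DEFAULT 'WA',  -- supplied when data entry omits a value
        Fax          varchar(24) NULL                    -- optional: may be left null
    )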

Other aspects of SQL Server that enforce domain integrity require a little more work on the part of the designer, but can be implemented using a number of other mechanisms discussed in the next section.

Domain integrity can also involve the checking of data entered to ensure that it is valid based on business rules or other conditions of an application. A business rule is a logical rule that is used to run a business. Business rules can be enforced by a database system through the use of rules, stored procedures, triggers, and constraints. Each of these objects interacts in a different manner to prevent invalid data from entering the system.

A rule is a database object that is not connected directly to a column or table; instead, it is bound to columns or user-defined data types to specify which data values are acceptable in a column. CHECK constraints provide the same functionality with less overhead and are preferred because they are part of the SQL-92 standard and are connected directly to a column or table.
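For instance, the 2,500 ceiling mentioned earlier could be enforced with a CHECK constraint on the illustrative Products table (the UnitPrice column is an assumption carried over from the earlier sketch):

    ALTER TABLE Products
        ADD CONSTRAINT CK_Products_UnitPrice
        CHECK (UnitPrice >= 0 AND UnitPrice <= 2500)  -- values outside the domain are rejected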

Use of stored procedures allows actions to be performed in a controlled manner. They are precompiled collections of T-SQL statements stored under a name and processed as a single unit. Stored procedures can perform modifications in a predictable manner and maintain integrity through the logic developed by the designer.
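A minimal sketch of a stored procedure guarding an insert with its own logic; the procedure name and friendly error handling are assumptions (the PRIMARY KEY alone would also reject the duplicate, but less gracefully):

    CREATE PROCEDURE usp_AddProduct
        @Product_ID  int,
        @ProductName varchar(40)
    AS
    -- Reject a duplicate with a controlled, descriptive error
    IF EXISTS (SELECT * FROM Products WHERE Product_ID = @Product_ID)
    BEGIN
        RAISERROR ('Product_ID %d already exists.', 16, 1, @Product_ID)
        RETURN
    END
    INSERT INTO Products (Product_ID, ProductName)
    VALUES (@Product_ID, @ProductName)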

A trigger provides a high degree of functionality but at a cost. A trigger is a stored procedure that executes when data in a specified table is modified. The data modification occurs and the trigger must react to the change. This means that a great deal more overhead is associated with a trigger, and it is therefore used only when other techniques lack the capability. Triggers are often created to enforce integrity or consistency among logically related data in different tables. Triggers are a complex mechanism and are covered in detail in Chapter 8, "Triggers."
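As a sketch of the technique (Chapter 8 covers the details), a trigger on the hypothetical OrderDetails table could verify that every inserted row has a matching parent, in the style of checks written before declarative cascading options existed:

    CREATE TRIGGER trg_OrderDetails_CheckParent
    ON OrderDetails
    FOR INSERT, UPDATE
    AS
    -- The modification has already occurred; inspect it via the inserted pseudo-table
    IF EXISTS (SELECT * FROM inserted i
               WHERE NOT EXISTS (SELECT * FROM Products p
                                 WHERE p.Product_ID = i.Product_ID))
    BEGIN
        RAISERROR ('Every order detail must reference an existing product.', 16, 1)
        ROLLBACK TRANSACTION  -- undo the change that fired the trigger
    END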

Many types of constraints exist in SQL Server. A constraint is a property assigned to a table or column that prevents invalid data values from being entered. A UNIQUE or PRIMARY KEY constraint prevents you from inserting a duplicate value; a CHECK constraint prevents you from inserting a value that does not match a specified condition; and NOT NULL mandates a data value must be provided for the column. A constraint is processed on the server and catches an error before the data gets into the system. This is an important facet of handling data integrity because it is important to catch errors at the earliest moment to minimize the use of resources and processing overhead. Further expansion on this topic is found in Chapter 3, "Physical Database Design."

Referential Integrity

Referential integrity specifies that every Foreign Key value must map to a Primary Key or other unique value in the related table. Referential integrity guarantees smooth navigation from entity to entity, so it is extremely important to ensure proper referential integrity definition. Referential integrity also ensures that no orphaned records exist; orphaned data is data in a child entity that points to a non-existent parent.

Setting up a Foreign Key constraint enforces referential integrity, and stored procedures and triggers can also be used to maintain it. Past versions of SQL Server relied on triggers much more heavily than SQL Server 2000, which provides several new advanced features to maintain referential integrity. These new features of the physical implementation are discussed at length in Chapter 3.

Using Advanced Referential Integrity Options

Two new options exist for defining relationships to be used in a SQL Server database system. These options allow more control over the actions that can affect key values. Cascade Update Related Fields and Cascade Delete Related Records enable updates or deletions of key values to cascade through the tables defined to have Foreign Key relationships that can be traced back to the table on which the modification is made.

A cascading deletion occurs when you delete a row with a key referenced by Foreign Keys in existing rows in other tables. In that case, all rows containing those Foreign Keys are also deleted.

A cascading update occurs when you update a key value in a row, where the key value is referenced by Foreign Keys in existing rows in other tables. All the Foreign Key values are also updated to the new value specified for the key.

If cascading referential actions have also been defined on the target tables, the specified cascading actions are also taken for the rows in those tables.
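In Transact-SQL these options appear as ON UPDATE CASCADE and ON DELETE CASCADE on the Foreign Key definition. As a sketch, the illustrative constraint from earlier could be redefined with cascading actions:

    ALTER TABLE OrderDetails DROP CONSTRAINT FK_OrderDetails_Products

    ALTER TABLE OrderDetails
        ADD CONSTRAINT FK_OrderDetails_Products
        FOREIGN KEY (Product_ID) REFERENCES Products (Product_ID)
        ON UPDATE CASCADE  -- key changes in Products propagate to OrderDetails
        ON DELETE CASCADE  -- deleting a product removes its order detail rows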

Applications Maintaining Integrity

In many environments, it is possible to catch data integrity violations without using resources on the database server. Particularly in Internet applications, developers try to minimize the return trips needed to obtain accurate data entry. If errors can be caught at the user interface and corrected before the data is submitted to the database, then a more efficient system will result.

Some simple checks that can be performed include basic error checks to ensure that required fields contain data and that numerical and date entries fall within valid ranges. You'll learn more about some of this functionality later in the chapter, which describes the different application architectures.

Using Appropriate Data Types

Traditionally, data is divided into three basic categories: numeric; alphanumeric, or character; and binary. Other data types exist, however, and several require special handling in applications and design considerations.

Assigning a data type to an object defines four attributes of the object:

  • The kind of data contained by the object

  • The length of the data stored

  • The precision of numeric data types

  • The scale of numeric data types

Selecting an appropriate data type is an important decision that must take into account the application, the usage of the data, and future trends. With numeric data in particular, you must be careful to select a type large enough to store the data. With character data, you want to avoid wasting storage by configuring a size larger than necessary.
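A sketch of how these decisions surface in a column definition; the Payments table and its columns are illustrative:

    CREATE TABLE Payments (
        PaymentID int IDENTITY(1,1) PRIMARY KEY,
        Amount    decimal(9,2)  NOT NULL,  -- precision 9, scale 2: up to 9,999,999.99 in 5 bytes
        PaidOn    smalldatetime NOT NULL   -- minute accuracy is sufficient here
    )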

Numeric Data Types

Numeric data takes the form either of integer data, in a variety of sizes for whole numbers, or of one of several types that accommodate decimal-point storage for real numbers.

The bigint data type is an 8-byte integer type that is new to SQL Server 2000 and allows for the storage of very large integer numbers. A bigint supports values in the range from -9,223,372,036,854,775,808 through +9,223,372,036,854,775,807.

NOTE

Support for bigint Many of the functions, statements, and system stored procedures that accepted int expressions for their parameters in previous SQL Server versions have not been changed to accept bigint expressions for those parameters. Thus, SQL Server converts bigint to int only when the bigint value is within the range supported by the int data type. A conversion error occurs at run time if the bigint expression contains a value outside the range supported by the int data type.
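A small illustration of the run-time conversion error the note describes:

    DECLARE @big bigint
    SET @big = 5000000000     -- within bigint range, outside int range

    SELECT CAST(@big AS int)  -- fails at run time: arithmetic overflow converting to int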


Other integer data types support a variety of smaller integers. The int data type, using 4 bytes of storage, supports values in the range from -2,147,483,648 through +2,147,483,647. The smallint data type uses 2 bytes and supports from -32,768 through +32,767, and the smallest integer data type, tinyint, at only a single byte, supports values from 0 through 255.

Decimal and numeric are functionally the same data type and allow for the storage of fixed-precision-and-scale numeric data from -10^38 + 1 through 10^38 - 1. Depending on the precision chosen, this data type uses from 5 to 17 bytes for the storage of values, as summarized in Table 2.2.

Table 2.2. Decimal and Numeric Storage Sizes

Precision        Bytes
1-9              5
10-19            9
20-28            13
29-38            17

Two data types are used for storage of monetary values: smallmoney and money. These values are stored with a scale of four decimal places. The smallmoney data type consumes 4 bytes and allows for the storage of monetary values from -214,748.3648 through +214,748.3647, and the money data type, at 8 bytes, stores values from -922,337,203,685,477.5808 through +922,337,203,685,477.5807.

Character Data Types

Character data types are subdivided into two categories depending on the byte size of the characters being stored. Traditionally, character data consumed one byte per character, allowing for 256 different characters. This standard was found to be inflexible, so the Unicode standard was developed, in which each character uses two bytes of storage. This standard allows for more than 65,000 different characters.

Each of these two categories has three variations for the storage of data: fixed-length, variable-length, and large character data. Non-Unicode data uses the char, varchar, and text data types, whereas Unicode data is stored in the nchar, nvarchar, and ntext types.

The char data type allows for the storage of fixed-length non-Unicode character data with lengths from 1 through 8,000. The varchar data type allows for the same sizes of data. The primary difference between the two is that varchar uses storage space more efficiently, consuming only the space necessary to hold the data value rather than the declared maximum size. The nchar and nvarchar types are essentially the Unicode implementations of char and varchar, and allow for storage of up to 4,000 characters.
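The storage difference is easy to demonstrate with DATALENGTH, which returns the number of bytes a value occupies:

    DECLARE @fixed char(10), @variable varchar(10), @unicode nvarchar(10)
    SET @fixed    = 'abc'
    SET @variable = 'abc'
    SET @unicode  = N'abc'

    SELECT DATALENGTH(@fixed)    AS FixedBytes,     -- 10: padded to the declared length
           DATALENGTH(@variable) AS VariableBytes,  -- 3: only the bytes actually used
           DATALENGTH(@unicode)  AS UnicodeBytes    -- 6: two bytes per character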

The text and ntext data types are used to store large variable-length character data. The text data type can store up to 2,147,483,647 characters, whereas ntext can store up to 1,073,741,823 characters.

Binary Data Types

A number of data types are used to store binary data. The smallest is the bit data type, which supports Boolean operations and stores values of 0 or 1. Other binary data types are used to store binary strings, which are entered and displayed as hexadecimal values.

Binary data is stored using the binary, varbinary, and image data types. A column assigned the binary data type holds values of a fixed length of up to 8,000 bytes, whereas entries in a varbinary column can vary in size. Columns of image data can be used to store variable-length binary data exceeding 8KB, such as Microsoft Word documents, Microsoft Excel spreadsheets, and graphic files, including bitmaps.
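The fixed/variable distinction mirrors char and varchar, as a quick sketch shows:

    DECLARE @fixed binary(4), @variable varbinary(8)
    SET @fixed    = 0x1A      -- padded with zeros to the full 4 bytes: 0x1A000000
    SET @variable = 0x1A2B3C  -- stored as entered, 3 bytes

    SELECT DATALENGTH(@fixed)    AS FixedBytes,    -- 4
           DATALENGTH(@variable) AS VariableBytes  -- 3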

Specialty Data Types

Many data types are used in special circumstances to store data that does not directly qualify as numeric, character, or binary. Data types are available to store time and date information, globally unique identifiers (GUIDs), cursors, and tables.

Three data types are associated with time and date information: datetime, smalldatetime, and timestamp. The first two store dates and times; the timestamp data type, despite its name, stores an automatically generated binary value using 8 bytes of storage and is not used to store date or time data.

NOTE

SQL-92 Standard Timestamp The T-SQL timestamp data type is not the same as the timestamp data type defined in the SQL-92 standard. The SQL-92 timestamp data type is equivalent to the T-SQL datetime data type.


Values with the datetime data type are stored as two 4-byte integers. The first 4 bytes store the number of days before or after the base date, January 1, 1900, which is the system reference date. Values for datetime earlier than January 1, 1753, are not permitted. The other 4 bytes store the time of day, represented as the number of milliseconds after midnight.

The smalldatetime data type stores dates and times of day with less precision than datetime . It stores them as two 2-byte integers. The first 2 bytes store the number of days after January 1, 1900. The other 2 bytes store the number of minutes since midnight. Dates range from January 1, 1900, through June 6, 2079.
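Because smalldatetime keeps only minutes, seconds are rounded away, as this sketch shows:

    DECLARE @dt smalldatetime
    SET @dt = '2003-06-15 14:30:45'

    SELECT @dt  -- returns 2003-06-15 14:31:00; the 45 seconds are rounded to the nearest minute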

The uniqueidentifier data type stores a 16-byte data value known as a globally unique identifier, or GUID. The GUID takes the string format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, in which each x is a hexadecimal digit. A GUID is long and obscure, but has the advantage of being guaranteed to be unique throughout the world.
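GUIDs are typically generated with the NEWID function, often as a column default; the Sessions table here is an illustrative assumption:

    CREATE TABLE Sessions (
        SessionID uniqueidentifier NOT NULL DEFAULT NEWID(),  -- globally unique value per row
        StartedAt datetime         NOT NULL DEFAULT GETDATE()
    )

    INSERT INTO Sessions DEFAULT VALUES  -- both columns filled in by their defaults

    SELECT SessionID FROM Sessions      -- e.g. 6F9619FF-8B86-D011-B42D-00C04FC964FF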

The sql_variant is a generic data type that stores values of the various SQL Server-supported data types, except text, ntext, image, timestamp, and sql_variant. It may be used in column definitions as well as in parameters, variables, and return values of user-defined functions. A sql_variant can have a maximum length of 8,016 bytes.
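The base type carried inside a sql_variant can be inspected with SQL_VARIANT_PROPERTY, as a brief sketch:

    DECLARE @v sql_variant

    SET @v = 42                                  -- holds an int this time
    SELECT SQL_VARIANT_PROPERTY(@v, 'BaseType')  -- returns: int

    SET @v = 'forty-two'                         -- now holds character data
    SELECT SQL_VARIANT_PROPERTY(@v, 'BaseType')  -- returns: varchar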

The table data type is new to SQL Server in the 2000 release. It can be used to temporarily store a resultset for later use. The table data type is not used for defining column types within a structure; rather it is used in functions, stored procedures, and batches.
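A minimal sketch of a table variable holding an intermediate resultset; the variable name and column list are illustrative:

    DECLARE @Recent table (
        Product_ID  int,
        ProductName varchar(40)
    )

    INSERT INTO @Recent
        SELECT Product_ID, ProductName FROM Products

    SELECT * FROM @Recent  -- the stored resultset can be queried like a table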

The cursor data type is another data type that is used only in functions, stored procedures, and batches. Its primary purpose is to allow the storage of a pointer to a resultset. Attributes of a T-SQL server cursor, such as its scrolling behavior and the query used to build the resultset on which the cursor operates, are set up using a DECLARE CURSOR operation within the procedure.
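A sketch of the typical pattern, stepping through the illustrative Products table one row at a time:

    DECLARE @name varchar(40)

    DECLARE product_cursor CURSOR FOR  -- the query that builds the resultset
        SELECT ProductName FROM Products

    OPEN product_cursor
    FETCH NEXT FROM product_cursor INTO @name

    WHILE @@FETCH_STATUS = 0           -- 0 means the fetch succeeded
    BEGIN
        PRINT @name
        FETCH NEXT FROM product_cursor INTO @name
    END

    CLOSE product_cursor
    DEALLOCATE product_cursor          -- release the cursor's resources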

User-Defined Data Types

User-defined data types are stored as database objects and are based on any of the system data types. User-defined data types can be used when several tables must store the same type of data in a column and you must ensure that these columns have exactly the same data type, length, and nullability.

Using these data types can help you create tables more quickly and can also help you control the data in a predictable manner. Often a user-defined data type is created in the model database; it will then exist in all new user-defined databases created.
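User-defined data types are created with the sp_addtype system stored procedure; the phone_number type and Staff table below are illustrative assumptions (the sketch uses the current database, though as the text notes, the type is often created in model so new databases inherit it):

    EXEC sp_addtype 'phone_number', 'varchar(12)', 'NOT NULL'

    CREATE TABLE Staff (
        StaffID   int IDENTITY(1,1) PRIMARY KEY,
        HomePhone phone_number  -- type, length, and nullability all come from the definition
    )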

Data Types in a Logical Model

Data definition is a step that follows ER modeling as the database system nears the stage at which it can be designed for the actual physical implementation. If present at all, the data definition in the logical model is a general characterization of the data rather than the specifics needed for the implementation. At this stage it is more important to simply categorize the data.

Between the logical modeling and the physical implementation, the actual software is chosen (though in the example case it was really predetermined). If you are developing a system from scratch where no software exists, the business needs are laid out, the logical model is represented, and the system architecture is designed before any decisions are made about the physical design, software, and hardware. If certain software is already in use in a given business, then some of the logical design stages may be adjusted toward specific products. The next section looks at the architecture of different types of database systems, from the smallest desktop environments to the large-scale Internet applications becoming prevalent in today's IT environment.

REVIEW BREAK: Models and Uses

Now you have a model, and even some ideas of how to control the model to maintain accuracy and prevent system breakdown due to loss of integrity. Many tools are available to keep out the bad data that could produce incorrect results. Imagine the result if a government agency based decisions affecting an entire country on data that is inaccurate or misrepresented. (Governments have enough trouble making decisions with accurate data.)

So now you have a high-performance model that has been made as efficient as possible through normalization principles. You may even have applied some planned redundancy (denormalization) in an effort to improve performance and make the data more meaningful. But what about the storage? Now that you have attributes of all shapes and sizes, you need something to put them in. It is now time to look into the data itself.


