Chapter 12: Data Integrity | Microsoft SQL Server 2005: The Complete Reference: Full Coverage of all New and Improved Features

Overview

Perhaps no other topic of discussion in database development, modeling, and management circles draws more attention, and often heated debate, than that of data integrity. It is astonishing that, despite the distance we have come in understanding, practice, and technology, so many database gurus (Celko, Codd, Date, Riordan, et al.) vary so widely in their respective philosophies. As a result, administrators and developers often handle integrity modeling and thus integrity programming by the seat of their pants. Even SQL Server Books Online defines data integrity in its glossary in a way more confused than a bat in broad daylight.

This book is certainly not the forum for a discussion of data integrity, and this is about as far as I want to venture in discussing relational database theory. But without exploring some concepts and accepting the only feasible definition of data integrity, you will not benefit from all the tools and new features that SQL Server 2005 supports with respect to data integrity modeling and programming.

Data integrity definitely is not a practice, a discipline, that ensures that data stored in a database is correct, only that is it believable or plausible. There is no way between this life and the hereafter that SQL Server 2005, or any other RDBMS, can guarantee that data in a database is correct. Get correct out of your vocab now. SQL Server 2005 has no way of knowing and thus ensuring that my area code is not 209 but rather 299, or that my last name is Shapiro and not Schapiro. I have even heard of a girl named Jeffrey. You need to start thinking, modeling, and programming SQL Server in terms of data plausibility, not in terms of data being right or wrong.

Only if you accept this definition will you be able to use the tools and techniques supported by SQL Server 2005 to ensure the integrity of your data, and thus its value as an asset to your enterprise. And after you start focusing on integrity in scalar terms and not correctness in absolute terms, you will have a lot more faith in the data in your database, and you will be able to afford it the trust and respect it deserves. After all, data that is not plausible or believable is a liability

As I discussed in Chapter 1, human error caused my wife extreme grief when, after changing medical insurance companies, she was denied coverage for some time because the last name of her doctor, instead of Shapiro, was entered in the spouse’s last name field, To my wife, the data integrity issue thus became a life-threatening one. To the medical insurance company, the issue almost exploded into a liability problem.

What would or could cause a last name, or surname, to be incorrect?

The wife goes by her maiden name.
A spouse mistakenly provides a pseudonym.
The couple just got divorced but agreed to maintain the coverage.
A child is covered by a stepfather but still goes by the last name of his or her biological father.
The first name is entered into the last name field.
The last name is typed incorrectly (Shapiro becomes Ahaoeuei with just a few slips of the finger).
The handwriting on the application form is poor, or the last name is omitted and the data entry person makes a wrong assumption.

This list could go on and on. And I am sure you could come up with dozens of scenarios that would also create questionable data, not only in last name values but also in many other places. Numbers, for example, present incredible opportunities to enter problematic data into a database.

But is this a question of integrity? If we accept that we program the DBMS to ensure that the data is as believable as possible, then it is. If we try to ensure that the data is correct, then it is not. Any value may in fact be correct when it is assumed to be wrong, and it may in fact be wrong when it is assumed to be correct. The only thing you can do to help ensure that data is believable is to help ensure that it was believable when it was entered into the database.

The best I can think of doing at the data tier to help ensure that a value, such as the spouse’s last name, is believable is to force the client to go back and check the data before it can be entered, or to compare the data against known values. It is possible to even refer the record back to the client and request it to be entered by another user, possibly a supervisor who would take the fact checking to the next level. Asking Web surfers to fill in application forms over the Internet is a good idea because it cuts out the middle data entry person, the paper trail, and delay And it puts the onus of ensuring the data plausibility on the client, who is more likely to ensure that his or her information can be relied upon.

I recently watched a horrifying story on CNN about an American pharmacist who gave a child a f fatal overdose of a drug contrary to what had been correctly prescribed by the pediatrician. The excuse was human error, failure of the supervisor to double-check prescriptions, filling hundreds of prescriptions a day Why, in heaven’s name, in this day and age, are pharmacists still using typewriters and word processors to provide instructions about dosage and administration of dangerous drugs? A database should have been used to check that the dosage did not exceed safely levels for the drug prescribed. No computer program checked the dosage, and so a mother sent her child to bed and he never woke up. Now, whenever we buy drugs, we check the label and wonder “can we trust our lives to this data?”

Obviously, the subject of human error is beyond the scope of this book, other than to discuss what possible means we might have of preventing humans from entering questionable data into a database. Joe Celko touched on the subject in his marvelous book, Joe Celko’s Data & Databases: Concepts in Practice (New York: Morgan Kaufmann, 1999). In a section titled “Models Versus Reality,” he talks about errors in models, describing Type I and Type II error levels. A Type I error is accepting as false something that is true, and a Type II error is accepting as true something that it false.

I agree without equivocation that the subject of errors in models is very important for database people to understand. Generations of people have been wiped out because of this problem. Sub-Saharan Africa, where I spent my childhood, is going to be wiped out because of AIDS. This could have been prevented, but the population there still believes, by and large, that AIDS is not sexually transmitted and that the publicity is just “Western propaganda.” The fraud is in fact self-perpetuating or self-fulfilling, because millions of Africans still have unprotected sex.

Yes, we can use fancy programming tricks and system features such as triggers and stored procedures to lessen the likelihood of implausible data; we can even build more advanced human integrity checking into the client applications. How can we avoid problems like the one just described and still program SQL Server 2005 as wisely as possible? To arrive at a possible solution, let’s first explore the integrity assurance features and functions of SQL Server 2005. After this discussion, we can redress the last name integrity issue and offer my medical insurance company some ideas before they get sued.