Preface | Data Quality: The Accuracy Dimension (The Morgan Kaufmann Series in Data Management Systems)

This book is about data accuracy. Data accuracy is part of the larger topic of data quality. The quality of data is measured against a number of dimensions: accuracy, relevance, timeliness, completeness, trust, and accessibility. The accuracy dimension is the foundation measure of the quality of data. If the data is just not right, the other dimensions are of little importance.

The book has many goals. The first is to demonstrate and characterize the data quality problem that affects all large organizations. Hopefully it will increase awareness of the problem and help motivate corporations to spend more dollars to address it. The next goal is to outline the functions of a data quality assurance group that would bring considerable value to any corporation. The third goal is to promote the use of data-intensive analytical techniques as a valuable tool for executing data quality assurance activities.

Data quality is getting more and more attention in corporations. Companies are discovering that their data is generally of poor quality. This, coupled with the fact that companies are trying to squeeze more and more value from their data, leads to greater concern. Poor data quality is costing them money and opportunities. Improving the quality of data produces valuable financial returns for those who diligently pursue it.

The technology for identifying quality problems and dealing with them has lagged behind technology advances in other areas. Because of the awakening of corporations to this problem, newer technologies are coming forward to help address this important topic. One of them is data profiling: the use of analytical techniques to discover the true content, structure, and quality of data.
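To make the idea of data profiling concrete, here is a minimal sketch of the kind of column-level summary such analysis might start from. It is not the method of any particular profiling product; the function name profile_column, the sample values, and the choice of statistics are all assumptions made for illustration.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, empty count, distinct values, top values.

    Hypothetical helper for illustration; real profiling tools go much further.
    """
    # Treat None and blank strings as missing values.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

# Made-up sample of a state-code column, including some suspicious entries.
states = ["TX", "TX", "CA", "tx", "", None, "ZZ", "CA"]
summary = profile_column(states)
print(summary)
```

Even a summary this simple surfaces candidates for inaccuracy: missing values, an inconsistent spelling ("tx"), and a code ("ZZ") that an analyst who knows the business would question.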

This book discusses the use of data profiling technology as the central strategy for a data quality assurance program. It defines inaccurate data, demonstrates how data profiling is used to ferret out inaccurate data, and shows how this is put together in a larger data quality program to achieve meaningful results.

Data quality assurance groups fundamentally operate by identifying quality problems and then fabricating remedies. They can do the first part through either an outside-in approach or an inside-out approach. The outside-in approach looks within the business for evidence of negative impacts on the corporation that may be a derivative of data quality problems. The types of evidence sought are returned merchandise, modified orders, customer complaints, lost customers, delayed reports, and rejected reports. The outside-in approach then takes the problems to the data to determine if the data caused the problems, and the scope of the problems.

The inside-out approach starts with the data. Analytical techniques are used to find inaccurate data. The inaccurate data is then studied to determine impacts on the business that either have already occurred or that may occur in the future. This strategy then leads to remedies much like those of the outside-in approach.

Much of the literature on data quality discusses what I refer to as the outside-in approach. This book covers the inside-out approach. To make the inside-out approach work, you need good analytical tools and a talented and experienced staff of data analysts that understands the business and can dig problems out of the data. You also need a thorough understanding of what the term inaccurate data means.

This book is divided into three parts. The first part defines inaccurate data, shows the scope of problems that exist in the real world, and covers how data becomes inaccurate. The intent is to provide a thorough understanding of the basic concepts of data inaccuracies and how they fit into the larger data quality topic.

The second part covers how a data quality assurance program is constructed with the inside-out approach. It covers the methodology used, the skills needed, and the general business cases for justifying projects targeting corporate databases.

The third part focuses on the technology of data profiling. It describes the basic concept of the technology and shows how individual parts contribute to the overall result. It covers some of the techniques used to expose inaccurate data and gives examples from the real world.

This topic is applicable to all sorts of organizations. Corporations, government organizations, educational organizations, and large nonprofit organizations all have information system departments that build and maintain large databases. They all depend on these databases for executing their daily tasks and for making large and small decisions that can have huge impacts on the success or failure of their enterprises. The importance of these databases and the accuracy of the data in them is no different for any of them. Although I use the term corporation in this book, you can assume that what I am saying applies equally to all types of organizations.

The target data for this book is structured data captured in corporate databases. This is predominantly record-keeping data dealing with orders, invoices, expenditures, personnel, payroll, inventory, customer data, supplier data, and much more. Data enthusiasts frequently note that only 20% of the data in a corporation is stored in corporate databases. The other 80% of the data consists of reports, memos, letters, e-mails, diagrams, blueprints, and other noncoded, nonstructured information. Although this is true, the structured 20% in the databases generally constitutes the heart and soul of the company's operations. Its value is certainly greater than 20%.

The audience for this book includes several groups of people. The primary target is those practitioners who are directly involved in data quality assurance or improvement programs. This gives them a strong framework for defining and executing a program based on the inside-out strategy. Other data management professionals who will be interacting with the data quality staff on a regular basis should also know this material. This includes data analysts, business analysts, database administrators, data administrators, data stewards, data architects, and data modelers.

Directly related to this group are the information management managers and executives who support this effort. The book specifically provides information on the scope of the quality problem and the potential for gaining back some of the lost value. This should be of direct interest to IT managers, chief information officers, chief technology officers, chief knowledge officers, and their staffs. It should also appeal to managers in data consumption areas who are heavily involved in the information process.

This book provides considerable benefit to application developers, system designers, and others who help build and maintain the information systems we use. Understanding the concepts of accurate data will likely result in better-designed systems that produce better data.

This book is also applicable to students of computer science. They should gain an appreciation of the topic of data quality in their studies in order to better prepare them to assume roles in industry for creating and maintaining information systems. Knowledge of this topic is becoming more valuable all the time, and those information systems professionals who are schooled in the concepts will be more valuable than those who are not. Data quality is becoming more commonplace as part of the curriculum of computer science. This book could serve as a textbook or a reference book for those studies.

Data quality problems are not relegated exclusively to large corporations with formal information system departments and professional data management staff. Inaccuracies occur in databases of all sizes. Although most small companies and some mid-range companies do not have formal organizations, they all have data that drives their operations and on which they base important business decisions. The general concepts in this book would be valuable to this audience as well. Although the employees of these companies would not normally immerse themselves in this topic, the consulting and software development firms that support their applications and operations would be better off if they understood these concepts and incorporated them into their practices.

Acknowledgments

Most of my knowledge on this topic has come from working in the field of data management. I have been involved in the design and development of commercial software for database or data management for most of my career. As such, I have been in countless IT shops and talked with countless application developers, database administrators, data analysts, business analysts, IT directors, chief information officers, chief technology officers, and chief executive officers about data management issues. Most of these conversations consisted of them telling me what was wrong with products I had developed or what problems I had not yet solved. All of this has helped shape my thinking about what is important.

I have also read many books on the topic of data management over the years. I have read books by, interacted with, and absorbed as much as I could from the gurus of data quality. Those I have particular admiration for are Richard Wang, Ph.D., from Massachusetts Institute of Technology and co-director for the Total Data Quality Management program at M.I.T.; Larry English, president of Information Impact; Thomas Redman, Ph.D., president of Navesink Consulting Group; Peter Aiken, Ph.D., from Virginia Commonwealth University; and David Loshin, president of Knowledge Integrity Incorporated. Anyone interested in the field of data quality needs to become familiar with the work of these experts.

Those closer to me who have helped bring this book to completion through their professional interaction with me at Evoke Software include Lacy Edwards, Art DeMaio, Harry Carter, Jim Lovingood, Bill Rose, Maureen Ellis, Ed Lindsey, Chris Bland, John Howe, Bill Bagnell, Shawn Wikle, Andy Galewsky, Larry Noe, and Jeffrey Millman. They have all, in some way, contributed to the content, and I thank them for the many hours of conversations on this topic. If I have left out the name of anyone who has helped me understand this topic, please forgive me. There are so many that I could not list them all, let alone remember them all.