The quality of data is very important for company business (Stoker, 2000). The ability of a company to make critical decisions in a complex business environment largely depends on how the company uses information. If the quality of company data is poor, also the information produced by data mining tools is not correct, due to the well-known "garbage in, garbage out" principle.
Usually, a large part of the data of a company are stored in legacy databases, which complicate the task of building a data warehouse. When information coming from heterogeneous sources is gathered and integrated, it is likely that data quality problems are present. This is because there may be inconsistencies due to the fact that different sources contain redundant data with different representations. Moreover, problems may arise within a single source: possible problems are missing information, misspellings, or any kind of invalid data.
Data cleaning (Bouzeghoub & Comyn-Wattiau, 1990; Rahm & Do, 2000; Quass, 1999) consists in detecting and removing inconsistencies from data, in order to improve data quality. It constitutes the most important step of the so-called ETL (extraction, transformation, loading) process. The task of data cleaning is orthogonal to that of semantic data integration. Even if query answering problems are solved in a data warehouse, retrieved data may still be dirty and inconsistent.
The process of data cleaning is disregarded by many IT managers (Maydanchick, 2000). One of the reasons of this is the high and unpredictable cost of this operation. Moreover, it is very difficult to plan a data cleaning process, due to the overwhelming complexity of business rules in a modern data warehouse; without such a plan, not only is the data cleaning procedure unlikely to end on time, but measuring the success of the operation becomes impossible.
Henceforth, we will assume that data are represented in the relational model. There are many reasons for data to be dirty; Rahm & Do (2000) and Quass (1999) present a number of them. We group data problems in two main categories: invalid data and differences in representation of the same data.
Invalid data can be caused by extracting data from multiple sources, or they can exist in a single source, due to incorrect data entries. For example, a "City" field may contain the value "Italy," or a simple misspelling may occur, e.g., the field "Country" contains the value "Brazik" instead of "Brazil." A slightly more complex problem is due to inconsistencies among different fields of the same record; for example, a record regarding a person may have the value "12 December 1973" for the date of birth and the value "12" for the age. Violation of functional dependencies within a table is another typical example of such inconsistencies. In general, having a global schema for a data warehouse means also having integrity constraints expressed in the schema itself. Such constraints are not related to the underlying data sources; instead they are derived from the semantics of the global schema, or, in other words, from the real world. We cannot expect independent data sources to produce data which respect these constraints. Therefore, a data warehouse has to cope with many violations of constraints over the global schema (also called "business rules"). An effective strategy for dealing with violation of foreign key constraints is presented in Calì et al. (2001).
When gathering information from different data sources, in order to build a data warehouse, it is likely that the same information is represented in different ways in different sources. Rahm & Do (2000) identify three main categories of such conflicts. Naming conflicts arise when the same name is used for different objects, or when different names are used for the same object. Structural conflicts are more general, and occur when the difference in the representation lies in the structure of the representation itself. A simple case is that of the format of a field: for example, in a source the gender of a customer is denoted with "M" or "F" while another source uses the values "0" and "1". Also the format of dates is a common problem. Moreover, data that are represented by more than one field in a source may be represented in a single field in another source; Borkar, Deshmukh, & Sarawagi (2000) present an approach to give a field structure to free-text addresses. Other structural conflicts, in the relational model, are due to different relational modeling in different sources, e.g., attribute vs. relation representation, different data types, different integrity constraints, etc. Finally, data conflicts appear only at the instance level, and are mainly related to different value representations. For example, the currency may be expressed in Japanese Yen in one source and in German Marks in another. Another typical case is the different representation of the same value; for instance, "Simon M. Sze" versus "Sze, S.M."
A major problem in data cleaning is that of overlapping data (Hernàndez & Stolfo, 1995; Monge & Elkan, 1996, 1997; Roychoudhury, Ramakrishnan, & Swift, 1997; Monge & Elkan, 2000), also referred as duplicate elimination problem or merge/purge problem. This problem arises when different records of data representing the same real-world entity are gathered in a data warehouse. In this case, duplicates have to be detected and merged. Most efforts in data cleaning have been devoted to the solution of the duplicate detection problem, which proves to be a central issue. In fact, in order to be able to detect different records related to the same real-world entity, a system has to resolve all conflicts relating that entity, due to different representations in different sources. In the following, we will focus our attention on the duplicate elimination problem.
Before starting the actual data cleaning process, a pre-processing is usually performed. Pre-processing consists in standardization operations (domain-dependent standardization techniques are presented, e.g., in Borkar, Deshmukh, & Sarawagi (2000) and Roychoudhury, Ramakrishnan, & Swift (1997), for example simple format conversions, or discovering abbreviations. The aim of this preliminary step is to make candidate duplicate records more similar, in order to make their identification easier during the duplicate detection step.
Two metrics (Low, Lee, & Ling, 2001) are usually defined to measure the effectiveness of a data cleaning system: recall and false-positive error. Recall is also known as percentage hits, true positives, or true merges, and it is defined as the percentage of the duplicate records which are correctly identified, out of all duplicate records. The false-positive error is the percentage of records which are wrongly identified as duplicates, out of all identified records. This metrics, also known as false merges, is related to the precision metrics, being precision = 100% − falsepositive errors.
Determining exact duplicates is an easy task (Bitton & DeWitt, 1983), and can be done by sorting all records and checking whether there are identical neighbors. Detecting inexact duplicates relating to the same real-world entity is a quite complex task instead. Once we have a technique to compare two records, if we have N records, the naïve approach would require one to make N2 comparisons, which is obviously inefficient.
In Hernàndez & Stolfo (1995), a more efficient technique is presented, called sorted neighborhood method. This method consists of three steps:
First, a key is computed for every set of homogeneous records in the database (or data warehouse); the key selection process is strongly domain-dependent, and it is crucial for the effectiveness of the technique. This highly knowledge-intensive step is to be performed by a specialist who has very deep knowledge of the characteristics of the data, and also of possible errors in the data.
The second step is a simple sorting of the records, according to the key computed in the previous step.
The final step consists of merging the duplicate records. A fixed size, sliding window is moved along the ordered list of records. At each step, the window, which contains w records, goes one record ahead, and the newly entered record is compared with the other w-1 records in the window. When two records are found to have the same key, then they are merged into a single record. The scan is performed in O(wN) steps.
The main drawback of the sorted neighborhood method lies in the fact that the merging heavily depends on the sorting, and therefore on the key. If two inexact duplicates fall far apart after sorting, they will never be recognized as duplicate. This problem can be partly solved by enlarging the size of the window, thus improving the recall; on the other hand, this increases the computational complexity of the merging phase.
As the choice of the key is crucial in sorted neighborhood method, in order to increase recall, a good strategy is that of performing multiple passes (Monge & Elkan, 1997; Hernàndez & Stolfo, 1995), each one with a different key, and computing the transitive closure of the relations computed at each step. That is, if record A matches with record B in a scan, and record B matches with record C in another, then we deduce that A matches with C. This approach solves the problem of the window size, as multiple passes with a small window size have shown to perform better, and to be more efficient, than a single scan with a large window size. On the other hand, introducing transitivity increases the false-positive errors, lowering the overall precision.
An alternative to window scan consists of partitioning the records in clusters, where in each cluster records are stored that match each other according to a certain criterion. Duplicate elimination can be performed on each cluster separately, and in parallel. This approach suffers from the same drawbacks as the usual sorted neighborhood method; moreover, if the database is small, many very small (or even singleton) clusters are likely to occur.
A clustering technique, presented in Monge & Elkan (1997), makes use of a priority queue, containing a set of clusters. Records not already appearing in the clusters are compared with representative records of each cluster: if the match succeeds, then the record is added to the corresponding cluster. If the record cannot be put in any cluster, then a new cluster is created, containing only that record. As the priority queue could increase significantly, a maximum number of clusters is fixed; if the limit is exceeded, then the cluster with lowest priority is removed. At any time, the cluster that was most recently created has the highest priority. In this technique, the choice of the representative records is crucial, and it strongly influences the results. Heuristics have to be developed in order to optimize such choice.
Finally, knowledge-based approaches have been used in data cleaning. In Galhardas et al. (2001), an attempt of separating the logical and the physical level is done: the logical level supports the design of the data transformation procedures, while the physical level supports the implementation and optimization of such procedures. In this framework a declarative language for data cleaning transformations is presented. Also, a declarative specification of user interaction and a declarative way to specify properties of the matching operations are introduced. These techniques have been implemented in an actual data cleaning system.
A framework for knowledge-based data cleaning, based on a set of rules, has been proposed in Low, Lee, & Ling (2001). In the approach presented in this paper, the processing stage is performed according to a set of rules: duplicate identification rules, merge/purge rules, update rules, alert rules (specifying some interaction with the user). Also, a pre-processing step is defined, in which data standardization is executed on the data. At the end of the data cleaning process, a framework for human validation and verification of the results is proposed.
In conclusion, the problem of cleaning data in the context of data warehousing is still far from having a solution. The identification of inexact duplicates is inherently characterized by a certain degree of uncertainty, and this means that human validation is needed in order to evaluate the performance of a data cleaning system. On the other hand, human inspection is infeasible when the size of the data is very large, a case that is quite common in practice. Another serious problem is that of the dependence on the domain: an algorithm performing very well on a database may behave badly on another. Declarative approaches seem to introduce some flexibility, which may produce good results on several domains. Anyway, the results in this case are again strongly dependent on the knowledge provided to the system, which in turn is produced by human reasoning.