Andrea Calì, Università di Roma "La Sapienza", Italy
Domenico Lembo, Università di Roma "La Sapienza", Italy
Maurizio Lenzerini, Università di Roma "La Sapienza", Italy
Riccardo Rosati, Università di Roma "La Sapienza", Italy
While the main goal of a data warehouse is to provide support for data analysis and management's decisions, a fundamental aspect in the design of a data warehouse system is the process of acquiring the raw data from a set of relevant information sources. We call source integration system the component of a data warehouse system dealing with this process. The main goal of a source integration system is to transfer data from the set of sources constituting the application-oriented operational environment to the data warehouse. Since sources are typically autonomous, distributed, and heterogeneous, this task involves cleaning, reconciling, and integrating the data coming from the sources. The design of a source integration system is a very complex task, which comprises several different issues. The purpose of this chapter is to discuss the most important problems arising in the design of a source integration system, with special emphasis on schema integration, query processing for data integration, and data cleaning and reconciliation.
The typical architecture of a data warehouse system is constituted by two different components, usually called back-end and front-end, respectively. While the latter is intended to provide support for the main task of the system, namely, data analysis and management's decisions, the former is responsible for acquiring the raw data from a set of relevant information sources. One of the basic assumptions in data warehouse architectures is that a correct modularization requires that these two components be decoupled. It is the task of the back-end component to free the front-end from any knowledge of where data are stored and how they are structured at the sources.
The goal of this chapter is to discuss the most important issues in the design of the back-end component of a data warehouse. We call this component the "source integration system," because its main goal is to transfer data from the set of sources constituting the application-oriented operational environment to the data warehouse. When data pass from one environment to the other, possible inconsistencies and redundancies should be resolved, so that the warehouse can provide an integrated and reconciled view of the organization's data (Inmon, 1996).
The constraints that are typical of data warehouse applications restrict the large spectrum of approaches that have been proposed for information integration (Hull & Zhou, 1996; Inmon, 1996; Jarke et al., 1999). In particular, since the data warehouse should reflect the informational needs of the organization, it should be based on a unified, corporate view of data, called global schema. Without the definition of a global schema, the risk arises of concentrating on what is in the sources at the operational level, rather than on what is really needed in order to perform the required analysis on data (Devlin, 1997).
The architecture of a source integration system is usually described in terms of two types of modules: wrappers and mediators (Wiederhold, 1994; Ullman, 1997). The goal of a wrapper is to access a source, extract the relevant data, and present such data in a specified format. The role of a mediator is to collect, clean, and combine data produced by different wrappers (or mediators), so as to meet a specific information need of the data warehouse. The specification and the realization of mediators is the core problem in the design of a source integration system.
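The wrapper/mediator architecture described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (all class and field names are invented, not from the chapter): each wrapper accesses one source and presents its data in a common record format, and the mediator collects and combines the records produced by the wrappers.

```python
class Wrapper:
    """Wraps one source and presents its data in a uniform record format."""
    def __init__(self, source_rows, field_map):
        self.source_rows = source_rows   # raw rows as stored at the source
        self.field_map = field_map       # source field name -> global field name

    def extract(self):
        # Rename source fields into the common format, dropping unmapped ones
        return [{self.field_map[k]: v for k, v in row.items() if k in self.field_map}
                for row in self.source_rows]

class Mediator:
    """Collects and combines data produced by different wrappers."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def answer(self):
        seen, result = set(), []
        for w in self.wrappers:
            for rec in w.extract():
                key = tuple(sorted(rec.items()))
                if key not in seen:      # drop duplicates across sources
                    seen.add(key)
                    result.append(rec)
        return result

# Two heterogeneous sources storing the same fact under different field names
w1 = Wrapper([{"cname": "ACME", "town": "Rome"}],
             {"cname": "name", "town": "city"})
w2 = Wrapper([{"company": "ACME", "location": "Rome"}],
             {"company": "name", "location": "city"})
med = Mediator([w1, w2])
```

Here the mediator's only combination policy is duplicate elimination; real mediators also clean and reconcile conflicting values, as discussed later in the chapter.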
The design of a source integration system is a very complex task, which comprises several different issues, including the following:
1. Definition of both the global schema and the relationships between the global schema and the sources.
2. Specification of how to load and refresh data according to the global schema.
3. Cleaning and reconciliation of the data to be transferred from the sources to the data warehouse.
Problem (1) deals with the fact that sources are typically heterogeneous, meaning that they adopt different models and systems for storing data. This poses challenging problems in specifying the global schema. We call schema integration the activities to be carried out for this purpose. The goal of schema integration is to design the global schema so as to provide an appropriate abstraction of all the data residing at the sources. However, the specification of the global schema is not the only issue in schema integration. Indeed, another crucial aspect is the specification of the relationship between the sources and the global schema. To this purpose, two basic approaches have been proposed in the literature. The first approach, called global-as-view (or simply GAV), requires that a view over the sources be associated with each element of the global schema, so as to specify the global schema in terms of the data residing at the sources. Conversely, the second approach, called local-as-view (or simply LAV), requires that a view over the global schema be associated with each element of the sources. Thus, in the LAV approach, we specify the meaning of the sources in terms of the elements of the global schema.
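A toy example may help to contrast the two mapping directions. In the following hypothetical sketch (the global element `person`, the sources `s1` and `s2`, and all data are invented for illustration), the GAV mapping *defines* the global element as an executable view over the sources, while the LAV mapping *describes* each source as a query over the global schema, here written as Datalog-style strings.

```python
# Global schema element: person(name, city).

# Source extensions (toy data)
s1 = [("Anna", "Rome")]            # s1 stores persons with their city
s2 = [("Marco", "Milan", 1975)]    # s2 additionally stores a birth year

# GAV: a view over the sources is associated with the global element,
# so the global relation can be computed directly from source data.
def gav_person():
    return [(n, c) for (n, c) in s1] + [(n, c) for (n, c, _) in s2]

# LAV: each source is described as a view over the global schema
# (declarative rules; answering queries requires reasoning on them).
lav = {
    "s1": "s1(n, c) :- person(n, c)",
    "s2": "s2(n, c, y) :- person(n, c), born(n, y)",
}
```

Note the asymmetry: the GAV view is immediately executable, whereas the LAV rules only constrain what the sources contain, which is why query answering under LAV is generally harder.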
In the lifetime of a data warehouse, the explicit representation of relationships between the sources and the materialized data in the data warehouse is useful in several tasks: from the initial loading, where the identification of the relevant data within the sources is critical, to the refreshment process, which may require a dynamic adaptation depending on the availability of the sources, as well as on their reliability and quality that may change over time. Moreover, the extraction of data from a primary data warehouse for data mart applications, where the primary warehouse is now regarded as a data source, can be treated in a similar way. In addition, even though the sources within an enterprise are not as dynamic as in other information integration frameworks, they are nonetheless subject to changes; in particular, creation of new sources and deletion of existing ones must be taken into account. Consequently, the maintenance of the data warehouse requires several upgrades of the data flows towards the data warehouse. In other words, a data warehouse, especially in large organizations, should be regarded as an incremental system, which critically depends upon the relationships between the sources and the data warehouse.
Whereas the construction of the global schema concerns the intensional level of the source integration system, problems (2) and (3) refer to a number of issues arising when considering the integration at the extensional/instance level. Problem (2) is concerned with one of the most important issues in a source integration system, namely, the choice of the method for computing the answer to queries posed in terms of the global schema. Being able to carry out query processing on the global schema is indeed crucial for data integration, i.e., both for loading and refreshing the data warehouse according to the global schema. The main issue is that the system should be able to reformulate queries posed on the global schema in terms of a suitable set of queries posed to the sources. In this reformulation process, the crucial step is deciding how to decompose the query on the global schema into a set of subqueries on the sources, based on the meaning of the sources in terms of the concepts in the global schema. The computed subqueries are then shipped to the sources, and the results are assembled into the final answer.
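The decompose/ship/assemble process can be sketched for the simple GAV case, where reformulation amounts to unfolding the global elements into their source-level definitions. In the hypothetical example below (all names and data invented), a global query asking for persons in a given city is decomposed into one subquery per source, and the source answers are then assembled.

```python
# Toy source extensions
s1 = [("Anna", "Rome")]
s2 = [("Marco", "Milan", 1975), ("Luca", "Rome", 1980)]

# Decomposition: one subquery per source, obtained by unfolding
# the global element person(name, city) into its GAV definition.
def subquery_s1(city):
    return [n for (n, c) in s1 if c == city]

def subquery_s2(city):
    return [n for (n, c, _) in s2 if c == city]

def answer(city):
    """Global query q(n) :- person(n, city): ship the subqueries
    to the sources and assemble their results into the final answer."""
    return sorted(set(subquery_s1(city) + subquery_s2(city)))
```

Under LAV mappings this step is considerably more involved, since the subqueries cannot be obtained by simple unfolding and require view-based query rewriting.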
Problem (3) deals with the fact that sources are generally heterogeneous not only at the intensional level, but also at the instance level. In particular, the data returned by various sources during query processing need to be converted, reconciled, and combined to provide the data warehouse with high-quality data. The complexity of this reconciliation step is due to several problems, such as possible mismatches between data referring to the same real-world object, possible errors in the data stored in the sources, or possible inconsistencies between values representing the properties of the real-world objects in different sources (Galhardas et al., 1999). The above task is known in the literature as data cleaning and reconciliation.
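A minimal sketch of this step, under invented names and data, might look as follows: records from two sources that refer to the same real-world object are first normalized (to repair formatting mismatches), then matched on the normalized key, and finally a conflict-resolution policy is applied to inconsistent property values. The "keep the maximum" policy used here is purely an assumption for the example; real systems choose policies based on source reliability and quality.

```python
import re

def normalize(name):
    """Repair formatting mismatches: case, punctuation, extra spaces."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

# Two sources describing the same company with mismatched names
# and inconsistent property values
src_a = [{"name": "ACME Corp.", "revenue": 100}]
src_b = [{"name": "acme corp", "revenue": 120}]

def reconcile(a_rows, b_rows):
    merged = {}
    for row in a_rows + b_rows:
        key = normalize(row["name"])     # match records on the cleaned name
        if key not in merged:
            merged[key] = dict(row, name=key)
        else:
            # conflict resolution policy (an assumption): keep the larger value
            merged[key]["revenue"] = max(merged[key]["revenue"], row["revenue"])
    return list(merged.values())
```

In practice, matching rarely reduces to exact equality of a cleaned key; approximate matching and domain-specific rules are typically needed, as discussed in the final section.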
In the rest of this chapter we discuss the above three problems in more detail. After presenting some basic definitions, we provide a survey on the problem of schema integration, starting from the early work found in the database literature more than two decades ago, and pointing out the recent developments in this area. We then deal with data integration, by describing the most advanced approaches to query processing in the context of source integration systems. The final section discusses the most important approaches that have been proposed for carrying out the data cleaning and reconciliation task.