Data integration activities concern with the problem of acquiring data from different sources, making them available in an integrated and reconciled form to users' applications. This is a central issue in several contexts: applications requiring accessing or re-engineering legacy systems, information integration on the World Wide Web, data mining or enterprise resource planning, or data warehouse systems.
With regard to the data explicitly managed by the integration systems, two different approaches exist, called materialized and virtual.
In the materialized approach, data at the sources are replicated in the integration system and can be directly queried by the users. In this setting, the problem arises of refreshing the materialized views maintained in the system in order to cope with updates at the sources. A naïve way to deal with this problem is to recompute views entirely from scratch in the presence of changes at the sources. This is expensive and makes frequent refreshing of views impractical. The study of materialized view management is an active research topic, and it is concerned both with the problem of choosing the views to be materialized into the data warehouse, and reducing the overhead in view recomputation. We do not discuss this further in this chapter and refer the reader to Jarke et al. (1999) for more details on the topic. The materialized approach to data integration is the most closely related to data warehousing. In this context, data integration activities are relevant for the initial loading and for the refreshing of the warehouse according to the global schema produced by the schema integration activities. Notice that, whereas the global schema provides a reconciled intentional representation of all source schemas, data integration deals with the problem of computing its instances, thus pertaining to the extensional level of the source integration system of the data warehouse.
In the virtual approach to data integration, data residing at the sources are not replicated, and the systems provide the user with a virtual global schema, i.e., a global schema whose extension is not materialized, e.g., in a data warehouse. Hence, sources are accessed each time a user query is posed to the system, rather than in the process of loading the data warehouse. Despite such differences, most of the issues that arise, for the case in which the views are virtual, are also relevant in the case of materialized views, which can be, therefore, conceived as a specialization of the virtual views. Hence, in the following we focus on the problem of processing user requests for new reconciled data, without distinguishing between the cases in which such data have to be materialized or not.
Generally speaking, the main goal of the data integration process is to answer a query expressed over the global schema only on the basis of the data stored at the sources, i.e., to solve the query processing problem. Given a user query, the system should be able to decompose it in terms of a suitable set of queries posed to the sources that are considered relevant to the original query, send such queries to the sources, and assemble the results into the final answer. With reference to the formal framework for source integration in data warehousing described earlier, query processing amounts to computing the set of certain answers to the query.
Query processing has been studied under different assumptions on the model adopted to represent the global schema and the sources (conceptual, object-oriented, semistructured, etc.), the language used to express both queries and views (conjunctive queries, datalog queries, etc.), the nature of constraints imposed on the global schema, and the way in which the mapping between data at the sources and the elements of the global schema is specified. As we have already said, two basic approaches exist to specify the mapping. The first approach, called global-as-view (GAV), requires that a view over the data sources is associated to every element of the global schema. On the contrary, in the second approach, called local-as-view (LAV), the relationships between the global schema and the sources are established by associating a view over the global schema to every element in the sources. Calì et al. (2002b) discussed the relationship between the expressive power of LAV and GAV in a setting where the global schema and the sources are relational, and either queries or views are conjunctive queries. In that paper the authors first show that, when no integrity constraints are allowed in the global schema, the two approaches are incomparable. Then they propose techniques that exploit the presence of suitable constraints in the global schema to transform a system following the LAV (GAV) approach into a query-preserving GAV (respectively, LAV) system, i.e., a system in which processing the same query, or an equivalent suitable transformation of it, produces the same answer.
In the rest of this section we describe several solutions proposed in the literature to address the problem of query processing, either in LAV or in GAV. For the sake of simplicity, we assume to deal with relational databases.
Since in LAV the mapping between the sources and the global schema is described as a set of views over the global schema, query processing amounts to finding a way to answer a query posed over a database using a set of views over the same database. This problem, known in the literature as query answering using views, is widely studied, since it has applications in many areas. In query optimization (Chaudhuri et al., 1995), the problem is relevant because using materialized views may speed up query processing. In database design, using views provides means for maintaining the physical perspective of the data independent from its logical perspective (Tsatalos, Solomon, & Ioannidis, 1986). It is also relevant for query processing in distributed databases (Keller & Basu, 1996) and in federated databases (Levy, Srivastava, & Kirk, 1995). Finally, since the views provide partial knowledge on the database, answering queries using views can be seen as a special case of query answering with incomplete information (Imielinski & Lipski, 1984; Abiteboul & Duschka, 1998).
The most common approach proposed in the literature to deal with the problem of query answering using views is by means of query rewriting. In query rewriting, a query over the global schema, and a set of views defining the sources, expressed in turn over the global schema, are provided. The goal is to reformulate the query into an expression, the rewriting, which refers only to the views and supplies the answer to the query. Query answering via query rewriting is divided into two steps: (1) reformulating the query in terms of a given query language over the alphabet of the view names, i.e., the alphabet of the sources, and (2) evaluating the rewriting over the view extensions, i.e., the data at the sources.
In order to be able to compare different reformulations of queries, we first introduce the notion of containment between queries. Given two queries q1 and q2, we say that q1 is contained in q2 if for all databases we have that . We say that q1 and q2 are equivalent if q1 is contained in q2 and q2 is contained q1. The problem of query containment has been studied in various settings. In particular, the containment of conjunctive queries is established to be an NP-complete problem in Chandra & Merlin (1977), and in the presence of inequalities, it is proved to be a Πp2-complete problem in Klug (1988) and van der Meyden (1992).
Let us turn our attention to the problem of rewriting. Formally, given a query q and a set of views , both expressed over the global schema, the query qr is a rewriting of q using if: (i) qr refers only to the alphabet of the sources, and (ii) the expansion of qr, i.e., the query obtained by replacing the views in it with their definitions in terms of the global schema is contained in q.
In general, the set of sources available in a source integration system may not store all the data needed to answer a query, and therefore the goal is to find a query expression that provides all the answers that can be obtained from the views. So, whereas in different contexts, e.g., query optimization or maintaining physical data independence, the focus is on finding rewritings that are logically equivalent to the original query, in data integration one is interested in maximally contained rewritings (Halevy, 2000). Formally, given a query q, a set of views , and a query language , the query qm is a maximally contained rewriting of q using with respect to if: (i) qm is a rewriting of q using , and (ii) there is no q′ rewriting of q using in such that q′ ≠ qm and qm is contained in q′.
An important theoretical result concerning the problem of query rewriting, in the case that views and queries are conjunctive queries, is presented in Levy et al. (1995). In that paper the authors demonstrate that, when a query q is a union of conjunctive queries, if an equivalent conjunctive rewriting of q exists, then such a rewriting has at most as many atoms as q. Such a result leads immediately to nondeterministic polynomial-time algorithms to find either equivalent conjunctive rewritings or maximally contained rewritings that are the union of conjunctive queries. In both cases it is sufficient to consider each possible conjunction of views that produces a candidate rewriting whose size is less or equal to the size of the original query, and then check the correctness of the rewriting. Note that the number of candidate rewritings is exponential in the size of the query.
In order to compute all the rewritings that are contained in (and not necessarily equivalent to) the original query, the bucket algorithm, presented in Levy, Rajaraman, & Ordille (1996a), improves the technique described above since it exploits a suitable heuristic that enables one to prune the space of candidate rewritings. The algorithm was proposed in the context of the Information Manifold (IM) system (Levy, Srivastava, & Kirk, 1995), a project developed at AT&T. To compute the rewriting of a query q, the bucket algorithm proceeds in two steps: (i) for each atom g in q, it creates a bucket that contains the views from which tuples of g can be retrieved; then (ii) it considers as a candidate rewriting each conjunctive query obtained by combining one view from each bucket, and checks by means of a containment algorithm whether such a query is contained in q. If so, the candidate rewriting is added to the answer. If the candidate rewriting is not contained in q, before discarding it, the algorithm checks if it can be modified by adding comparison predicates in such a way that it is contained in q.
The proof that the bucket algorithm generates the maximally contained rewriting, when the query language is a union of conjunctive queries, is given in Grahne & Mendelzon (1999). Note that, on the basis of the results in Levy et al. (1995), the algorithm considers only rewritings that have at most the same number of atoms as the original query. As shown in the same paper and in Rajaraman, Sagiv, & Ullman (1995), the given bound on the size of the rewriting does not hold in the presence of arithmetic comparison predicates, functional dependencies expressed in the global schema, or limitations in accessing the sources. In such cases we have to consider rewritings that are longer than the original query. According to Levy, Rajaraman, & Ordille (1996b), the bucket algorithm, in practice, does not miss solutions because of the length of the rewritings it considers; but other results (Duschka, Genesereth, & Levy, 2000; Gryz, 1998a, 1998b) show that in the presence of functional dependencies and limitations in accessing the sources, the union of conjunctive queries does not suffice to obtain the maximal contained rewritings, and it is necessary to make use of recursive rewritings.
Two improved versions of the bucket algorithm are the MiniCon algorithm (Pottinger & Levy, 2000) and the shared-variable bucket algorithm (Mitra, 1999). In both the algorithms the basic idea is to examine the interaction among variables of the original query and variables in the views, in order to reduce the number of views inserted in the buckets, and hence the number of candidate rewritings to be considered. Experimental results related to the performance of MiniCon and shared-variable bucket show that these algorithms scale out and outperform the bucket algorithm.
Differently from the algorithms described above, the inverse rules algorithm (Duschka & Genesereth, 1997), developed in the context of the Infomaster system, an information integration tool of Stanford University, does not search the space of candidate rewritings, but generates a rewriting in time that is polynomial in the size of the query, and does not require any containment test. The algorithm constructs a set of rules that "invert" the view definitions and show how to obtain the instances of the global schema from the data stored at the sources. Given a query q and a set of views , the rewriting is the Datalog program consisting of both the query and the inverse rules obtained from . The inverse rules algorithm is proved to return a maximally contained rewriting with respect to union of conjunctive queries, in a time that is polynomial in the size of the query. Furthermore, Duschka, Genesereth, & Levy (2000) show that the algorithm can also handle recursive datalog queries, the presence of functional dependencies in the global schema, or the presence of limitations in accessing the sources, by extending the obtained query plan with other specific rules.
Finally, we cite the unification-join algorithm (Qian, 1996), an exponential-time query answering algorithm based on a skolemization phase and on the representation of conjunctive queries by means of hypergraphs. Qian (1996) also describes a polynomial time version of the algorithm developed for the case in which queries are acyclic conjunctive queries. The unification join algorithm is extended in Gryz (1998a, 1998b) in order to deal with inclusion and functional dependencies expressed over the global schema.
Other studies are concerned with the problem of query rewriting using views, under the different assumptions that queries are recursive queries (Afrati, Gergatsoulis, & Kavalieros, 1999), description logics queries (Beeri, Levy, & Rousset, 1997; Calvanese et al., 2001), queries for semi-structured data (Calvanese et al., 1999; Papakonstantinou & Vassalos, 1999), queries with aggregates (Grumbach, Rafanelli, & Tininini, 1999; Cohen, Nutt, & Serebrenik, 1999), or in presence of limitations in accessing the views (Rajaraman, Sagiv, & Ullman, 1995; Kwok & Weld, 1996; Levy, Rajaraman, & Ullman, 1996; Florescu et al., 1999).
In the query rewriting approach, the query reformulation step is carried out, not taking into account the source database, i.e., the extensions of the views. Only in a second step the rewritten query is evaluated over such a database. A more general approach, simply called query answering (Abiteboul & Duschka, 1998; Grahne & Mendelzon, 1999; Calvanese et al., 2001), is to consider, besides the query and the view definitions, the extensions of the views. In such a case we do not pose any limit to query processing, and the only goal is to compute the answer to the query by exploiting all available information, in particular the view extensions.
The complexity of answering queries using views for different languages (both for the query and for the views) is studied in Abiteboul & Duschka (1998). In that paper, the authors consider two different assumptions, called open-world assumption and closed-world assumption, respectively. In the first case, each view stores only, but not necessarily all, the tuples that satisfy the view definition, whereas in the second case, each view stores exactly the tuples that satisfy the view definition. In the literature such views are also called sound and exact, respectively. Under the open-world assumption, Abiteboul & Duschka show practical cases in which the complexity of query answering using views is polynomial, and prove that the problem becomes coNP-hard already in the case where the views are conjunctive queries and the query is a conjunctive query with inequality, or where the query is a conjunctive query and the views are positive queries. Actually, as shown in Calvanese et al. (2000b), the complexity in the case in which the views are sound is coNP-hard also for very simple query languages containing union. Under the closed-world assumption, Abiteboul & Duschka prove that the problem is coNP-hard already in the case in which views and queries are conjunctive queries.
The problem of query answering using views in the context where constraints over the global schema are expressed in terms of description logics, and queries and views are unions of conjunctive queries, is addressed in Calvanese, De Giacomo, & Lenzerini (1998, 2000). Finally, the case of queries for semi-structured data is dealt with in Calvanese et al. (2000b, 2000c).
When the mapping is specified following the GAV approach, each relation of the global schema is expressed in terms of a view over the sources. Hence, it is in general assumed that, in order to answer a query posed over the global schema on the basis of data stored at the sources, it is sufficient to unfold each atom of the original query with the corresponding view (Ullman, 1997). It is worthwhile to stress that, even if the process of unfolding is a simple mechanism, defining the views associated with the global elements implies precisely understanding the relationships among the sources, that is in general a non-trivial task. We can say that, in the GAV approach, query processing by unfolding is realized at design time, and the main issue turns out to be the specification of mediators that synthesize the definition of views associated to global elements.
The reason why simple unfolding is assumed sufficient to process a query is that the GAV mapping, which has been commonly considered so far in the literature, essentially specifies a single database satisfying the global schema, i.e., the database obtained by populating the global schema according to the view definitions. We call this database the retrieved database. Evaluating the query over the retrieved database is equivalent to evaluating its unfolding over the sources.
This is, for example, the case of TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) (Chawathe et al., 1994), a joint project of Stanford University and IBM's Almaden Research Center, based on an architecture that presents a hierarchy of wrappers and mediators. In TSIMMIS the global schema is simply the collection of objects exported by the mediators, and no rules or integrity constraints are expressed among them, in such a way that the mediators only realize an explicit layer between the user' applications and the data sources. Hence, user' queries are posed in terms of objects synthesized at a mediator or directly exported by a wrapper, and are processed by suitable modules, called Mediator Specification Interpreters (MSIs) (Papakonstantinou, Garcia-Molina, & Ullman, 1996; Yerneni et al., 1999), each performing integration independently.
Differently from TSIMMIS, the Garlic system (Carey et al., 1995), developed at IBM Almaden Research Center, provides the user with an integrated data perspective by means of an architecture comprising a middleware layer for query processing and data access software. The middleware layer presents an object-oriented data model based on the ODMG standard (Cattell & Barry, 1997) that allows data from various information sources to be represented uniformly. Given a query over the middleware layer, its execution plan is simply the set of sub-queries produced by expanding the objects in the original query with their definitions provided by suitable wrappers. Each wrapper translates the sub-queries into the source native query language, taking into account the query processing power of the source. Note that many of the tasks done by the mediators in TSIMMIS, such as the integration of objects from different sources, in Garlic are performed by wrappers, and the notion of mediator is actually missing.
In MOMIS (Bergamaschi et al., 2001), a system jointly developed at the University of Milano and the University of Modena and Reggio Emilia, query processing and optimization are the tasks of the query manager, a component which is part of the mediator. The query manager exploits extensional relationships to first identify all sources whose data are needed to answer a user query posed over the global schema. Then it reformulates the original query into queries to the single sources, and sends the obtained sub-queries to the wrappers, which execute them and report the results to the query manager. Finally, it combines the single results to provide the answer to the original query. During the process, query optimization is performed taking into account intentional intra- and inter-source relationships, both for the original query and the local queries to the sources.
The Squirrel project, developed at the University of Colorado, provides a framework for data integration that supports either the virtual or the materialized approaches to data integration. The framework is based on the integration mediators, special mediators originally constructed to support only fully materialized views (Zhou et al., 1995a, 1995b), and then generalized in order to support a hybrid of virtual and materialized views (Zhou et al., 1996; Hull & Zhou, 1996). The architecture of a general Squirrel mediator consists of components that deal with the problem of refreshing the materialized portion of the supported view, and components related to the problem of query processing, namely the query processor (QP) and the virtual attribute processor (VAP). With regard to the two last components, the QP provides the integrated view interface to the user. When it receives a user query, it tries to answer the query on the basis of the materialized portion of the view maintained in a local store. If the QP needs virtual data to answer the query, the VAP constructs temporary relations containing the relevant data. To obtain temporary relations, the VAP uses information provided by a virtual decomposition plan (VDP) maintained in the local store. The notion of VDP is analogous to that of query decomposition plan in query optimization. More specifically, the VDP specifies the global relations (materialized, virtual, or hybrid) that the mediator maintains, and provides the basic structure for retrieving data from the sources or supporting incremental maintenance.
As we already said at the beginning of this section, all the systems described above assume that the GAV mapping specifies a single database for the global schema. Furthermore, during query processing, they do not take into account constraints imposed by the data model used to specify the global schema. Actually, as shown in Lenzerini (2001), when the global schema contains integrity constraints, its semantics is best described in terms of a set of databases rather than a single one, and this implies that, as in the LAV approach, query processing in GAV is intimately connected to the notion of querying incomplete databases (van der Meyden, 1998). Hence, to process a query in GAV, unfolding is in general not sufficient, and it is necessary to make use of reasoning mechanisms. A technique to effectively process queries when both global schema and sources are relational, and key and foreign key constraints are expressed in the global schema, is described in Calì et al. (2002a), where it is shown that, starting from the retrieved database, it is possible to build a canonical database that has the property of faithfully representing all the databases that satisfy the global schema. Furthermore, in that paper an algorithm is given that makes it possible to find the answers to a query over the given canonical database without actually building it. The algorithm constructs a suitable reformulation of the query, called query expansion, such that the answer to the expansion computed over the retrieved database is equal to the answer to the original query over the canonical database. In Calì et al. (2001), the same technique is used to deal with the mandatory participation and functional attribute constraints imposed by a conceptual model adopted to represent the global schema.
Other studies deal with the problem of query processing in GAV in the presence of limitations on how sources can be accessed (Li & Chang, 2001). To answer queries over such sources, one generally needs to start from a set of constants (provided, e.g., by the user filling in a form, or taken from a source without access limitations) to bind attributes. Such bindings are used to access sources and there obtain new constants which in turn can be used for new accesses. Hence, query processing in GAV in the presence of access limitations in general requires the evaluation of a recursive query plan. It is worth noticing that also in this case, unfolding is not sufficient to answer the query.