In this section we set up a formal framework for source integration in data warehousing. Our main goal is to define the notion of a source integration system, which represents the component of a data warehouse system responsible for integrating the sources of information. We characterize a source integration system in terms of three elements: the global schema, the sources, and the mapping between the two. Finally, we provide the semantics of both the system and query answering.
The formal definition of a source integration system is given below.
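Definition 1: A source integration system $\mathcal{I}$ is a triple $\langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where $\mathcal{G}$ is the global schema, $\mathcal{S}$ is the source schema, and $\mathcal{M}$ is the mapping between $\mathcal{G}$ and $\mathcal{S}$.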
The following comments on the above formal definition are in order.
The global schema $\mathcal{G}$ is intended to specify the structure of the information needed in the data warehouse. From a methodological point of view, such a schema is a reconciled view of the information stored in the sources. In what follows, we denote by $\mathcal{A}_{\mathcal{G}}$ the finite alphabet for the elements of the global schema. According to Devlin (1997), a conceptual data model, e.g., the entity-relationship model, is generally used for expressing the global schema. However, our formalization is completely independent of the particular data model used.
The source schema $\mathcal{S}$ provides the specification of the structure of the various data sources. Such a schema contains the intensional description of all the sources of the data warehouse application. Although in principle the various source schemas may be expressed using different data models and notations, it is common to define suitable wrappers that present all the schemas of the sources in a predefined form, e.g., in terms of the relational model. Therefore, the source schema is usually expressed as a set of relation schemas. In what follows, we denote by $\mathcal{A}_{\mathcal{S}}$ the finite alphabet for the elements of the source schema.
The mapping $\mathcal{M}$ establishes a relationship between elements of the global schema $\mathcal{G}$ and those of the source schema $\mathcal{S}$. As we already said in the introduction, two basic approaches, namely GAV and LAV, have been proposed for specifying the mapping, and we will distinguish between these two types of mappings when specifying the semantics of a source integration system.
Let us turn our attention to the semantics of a source integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$. We assume that the databases involved in our framework (both global databases and source databases) are defined over a fixed (infinite) alphabet $\Gamma$ of symbols. In order to assign semantics to $\mathcal{I}$, we start by considering a source database for $\mathcal{I}$, i.e., a database $\mathcal{D}$ for the source schema $\mathcal{S}$. Based on $\mathcal{D}$, we can specify the information content of the global schema $\mathcal{G}$ at the extensional level. We call global database for $\mathcal{I}$ any database for $\mathcal{G}$.
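Definition 2: Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a source integration system, and $\mathcal{D}$ a source database for $\mathcal{I}$. A global database $\mathcal{B}$ for $\mathcal{I}$ is said to be legal for $\mathcal{I}$ with respect to $\mathcal{D}$ if:
$\mathcal{B}$ is legal with respect to $\mathcal{G}$, i.e., $\mathcal{B}$ satisfies all the constraints of $\mathcal{G}$;
$\mathcal{B}$ satisfies the mapping $\mathcal{M}$ with respect to $\mathcal{D}$.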
The notion of $\mathcal{B}$ satisfying the mapping $\mathcal{M}$ with respect to $\mathcal{D}$ depends on the type of the mapping considered, GAV or LAV.
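GAV mapping. In the GAV approach, the mapping $\mathcal{M}$ associates to each element $r$ in $\mathcal{G}$ a view, i.e., a query, over $\mathcal{S}$, denoted by $\rho(r)$. We say that $\mathcal{B}$ satisfies $\mathcal{M}$ with respect to $\mathcal{D}$ if, for each element $r$ of $\mathcal{G}$, the set of tuples $r^{\mathcal{B}}$ that $\mathcal{B}$ assigns to $r$ contains the set of tuples that satisfy the query $\rho(r)$ in $\mathcal{D}$, i.e., $r^{\mathcal{B}} \supseteq \rho(r)^{\mathcal{D}}$.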
Note that this means that the view associated to r is sound: the data provided by the sources satisfy the corresponding element of the global schema, but are not necessarily complete.
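The GAV condition can be illustrated with a minimal Python sketch. It assumes a simplified encoding that is not part of the formal framework: a database is a dictionary from element names to sets of tuples, and each view ρ(r) is a Python function from a source database to a set of tuples. The names `satisfies_gav`, `customer`, `s1`, and `s2` are hypothetical and chosen only for the example.

```python
from typing import Callable, Dict, Set, Tuple

# Assumed encoding: a database maps each element name to its set of tuples.
Database = Dict[str, Set[Tuple]]
# A view is modelled as a function that evaluates a query over a database.
Query = Callable[[Database], Set[Tuple]]


def satisfies_gav(global_db: Database, source_db: Database,
                  rho: Dict[str, Query]) -> bool:
    """Return True if, for every global element r, the tuples that the
    global database assigns to r contain the tuples returned by rho(r)
    evaluated over the source database (soundness of the view)."""
    return all(view(source_db) <= global_db.get(r, set())
               for r, view in rho.items())


# Hypothetical source database with two source relations.
source_db = {
    "s1": {("alice", "rome"), ("bob", "milan")},
    "s2": {("carol", "turin")},
}

# Hypothetical GAV mapping: customer is defined as the union of s1 and s2.
rho = {"customer": lambda d: d["s1"] | d["s2"]}

# Contains every tuple retrieved by the view, plus one extra tuple.
sound_db = {"customer": {("alice", "rome"), ("bob", "milan"),
                         ("carol", "turin"), ("dave", "naples")}}
# Misses tuples that the view retrieves from the sources.
incomplete_db = {"customer": {("alice", "rome")}}

gav_ok = satisfies_gav(sound_db, source_db, rho)
gav_bad = satisfies_gav(incomplete_db, source_db, rho)
```

The extra tuple in `sound_db` does not violate the condition, which reflects the fact that sound views need not be complete.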
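LAV mapping. In the LAV approach, instead, the mapping $\mathcal{M}$ associates to each source $s$ in $\mathcal{S}$ a view, i.e., a query, over $\mathcal{G}$, denoted by $\rho(s)$. In this case, we say that $\mathcal{B}$ satisfies $\mathcal{M}$ with respect to $\mathcal{D}$ if, for each source $s$ of $\mathcal{S}$, the set of tuples $s^{\mathcal{D}}$ that $\mathcal{D}$ assigns to $s$ is contained in the set of tuples that satisfy the query $\rho(s)$ in $\mathcal{B}$, i.e., $s^{\mathcal{D}} \subseteq \rho(s)^{\mathcal{B}}$.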
Note that, analogously to the previous case, this means that the view associated to s is sound.
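The LAV condition can be sketched in the same spirit. The encoding is again an assumption made for illustration: databases are dictionaries from names to sets of tuples, and each view ρ(s) is a Python function evaluated over a global database. The names `satisfies_lav`, `customer`, and `s_rome` are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

# Assumed encoding: a database maps each element name to its set of tuples.
Database = Dict[str, Set[Tuple]]
# A view is modelled as a function that evaluates a query over a database.
Query = Callable[[Database], Set[Tuple]]


def satisfies_lav(global_db: Database, source_db: Database,
                  rho: Dict[str, Query]) -> bool:
    """Return True if, for every source s, the tuples stored in s are
    contained in the tuples returned by rho(s) evaluated over the
    global database (soundness of the view)."""
    return all(source_db.get(s, set()) <= view(global_db)
               for s, view in rho.items())


# Hypothetical LAV mapping: source s_rome is described as the customers
# of the global schema whose city is "rome".
rho = {"s_rome": lambda b: {t for t in b["customer"] if t[1] == "rome"}}

# Hypothetical source database.
source_db = {"s_rome": {("alice", "rome")}}

# Contains the source tuple, so the inclusion holds.
global_db_ok = {"customer": {("alice", "rome"), ("bob", "milan")}}
# Lacks the source tuple, so the inclusion fails.
global_db_bad = {"customer": {("bob", "milan")}}

lav_ok = satisfies_lav(global_db_ok, source_db, rho)
lav_bad = satisfies_lav(global_db_bad, source_db, rho)
```

Note the change of direction with respect to GAV: here the view is evaluated over the global database and must cover what the source actually stores.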
Queries posed to a source integration system $\mathcal{I}$ are expressed in terms of a query language over the alphabet $\mathcal{A}_{\mathcal{G}}$, i.e., over the global schema. In the following, if $\mathcal{B}$ is a database and $q$ is a query, then $q^{\mathcal{B}}$ denotes the result of evaluating $q$ over $\mathcal{B}$.
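Definition 3: Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a source integration system, $\mathcal{D}$ a source database for $\mathcal{I}$, and $q$ a query of arity $n$ to $\mathcal{I}$. The answer $q^{\mathcal{I},\mathcal{D}}$ to $q$ with respect to $\mathcal{D}$ is the set of tuples $(c_1, \ldots, c_n) \in \Gamma^n$ such that $(c_1, \ldots, c_n) \in q^{\mathcal{B}}$ for each global database $\mathcal{B}$ legal for $\mathcal{I}$ with respect to $\mathcal{D}$.
Since, in general, several global databases exist that are legal for $\mathcal{I}$ with respect to $\mathcal{D}$, in the terminology of source integration, $q^{\mathcal{I},\mathcal{D}}$ is often called the set of certain answers of $q$ with respect to $\mathcal{D}$.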
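To make the notion of certain answers concrete, the following Python sketch intersects the results of a query over an explicitly given, finite collection of legal global databases. This is only an illustration under strong assumptions: in general the family of legal global databases is infinite, so enumerating it is not a practical algorithm. The function name `certain_answers` and the example data are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Assumed encoding: a database maps each element name to its set of tuples.
Database = Dict[str, Set[Tuple]]
# A query is modelled as a function from a database to a set of tuples.
Query = Callable[[Database], Set[Tuple]]


def certain_answers(query: Query,
                    legal_global_dbs: Iterable[Database]) -> Set[Tuple]:
    """Return the tuples that belong to the query result in every one
    of the given legal global databases."""
    dbs = list(legal_global_dbs)
    if not dbs:
        # With no legal database the definition holds vacuously for every
        # tuple over the alphabet; this sketch does not represent that case.
        raise ValueError("at least one legal global database is required")
    result = set(query(dbs[0]))
    for db in dbs[1:]:
        result &= query(db)
    return result


# Two hypothetical legal global databases; the second has an extra tuple.
legal_dbs = [
    {"customer": {("alice", "rome"), ("bob", "milan")}},
    {"customer": {("alice", "rome"), ("bob", "milan"), ("dave", "naples")}},
]

# Hypothetical unary query: the names of all customers.
names = lambda b: {(name,) for (name, _city) in b["customer"]}

answers = certain_answers(names, legal_dbs)
```

Here `("dave",)` is returned by the query in only one of the two databases, so it is not a certain answer.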
As we said in the introduction, the main activities that are carried out in the design of a source integration system are: schema integration, data integration, and data cleaning. To relate these activities to the formalization presented in this section, we observe that:
Schema integration aims to provide the specification of the three main components of the system, namely, the global schema, the source schema, and the mapping.
Data integration aims at defining the correct method for acquiring data from the sources, so as to populate (either virtually or physically) the elements of the global schema. In other words, the purpose of data integration is to come up with a suitable method for answering queries over the global schema, by accessing the data at the sources.
The goal of data cleaning is to design the mapping of the source integration system in such a way that, when acquiring data at the sources, suitable conversion, transformation, and reconciliation actions are performed on these data.