Schema integration is the activity of integrating the schemas of the various sources in order to produce a homogeneous description of the data of interest.
The work on schema integration is relevant to any information integration approach, and in particular to the context of data warehousing, since integrating data requires that schema integration be carried out, either implicitly or explicitly.
Schema integration is divided into several methodological steps (Batini, Lenzerini, & Navathe, 1986). Such steps are essentially independent of the integration strategy adopted, and aim at relating the different components of schemas, finding and resolving conflicts in the representation of the same data among the different schemas, and eventually merging the conformed schemas into a global one.
In particular, the following methodological steps are singled out:
preintegration
schema comparison
schema conforming
schema merging and restructuring
Traditionally, schema integration is a "one-shot" activity, resulting in a global schema in which all data are represented uniformly (Batini, Lenzerini, & Navathe, 1986). However, more recently, in order to deal with autonomous and dynamic information sources, an incremental approach has emerged (Catarci & Lenzerini, 1993). Such an approach consists of building a collection of independent partial schemas, formalizing the relationships among entities in the partial schemas by means of so-called interschema assertions. In principle, under the assumption that the various information sources remain unchanged, the incremental approach would eventually result in a global schema similar to those obtained through a traditional one-shot approach, although in practice, due to the dynamics of the sources, such a result is never achieved. Additionally, the integration may be partial, taking into account only certain aspects or components of the sources (Catarci & Lenzerini, 1993).
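As a rough illustration of the incremental approach, the following sketch keeps two partial schemas separate and formalizes their relationships through interschema assertions rather than merging them; all entity names and the assertion vocabulary are invented for the example.

```python
# Minimal sketch of the incremental approach: partial schemas stay separate,
# and interschema assertions relate entities across them. All names below
# (the entities and the relation vocabulary) are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    """An interschema assertion relating entities of two partial schemas."""
    left: str      # entity in the first schema, e.g. "S1.Employee"
    relation: str  # e.g. "equivalent", "subset", "overlaps", "disjoint"
    right: str     # entity in the second schema, e.g. "S2.Person"

# Two autonomous partial schemas, each represented here as a set of entities.
s1 = {"Employee", "Department"}
s2 = {"Person", "Project"}

# Relationships are formalized by assertions instead of a merged schema:
assertions = [
    Assertion("S1.Employee", "subset", "S2.Person"),
    Assertion("S1.Department", "disjoint", "S2.Project"),
]

for a in assertions:
    print(f"{a.left} {a.relation} {a.right}")
```

New sources can then be accommodated by adding further partial schemas and assertions, without rebuilding a single global schema.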
In the rest of this section, we review recent studies on schema integration, according to the steps they address. Furthermore, we analyze the following key aspects: i) whether a global schema is produced or not; ii) which of the methodological steps mentioned above the work addresses; iii) which formalism is used for representing data schemas.
We refer to Batini, Lenzerini, & Navathe (1986) for a comprehensive survey on previous work in this area.
Preintegration consists of an analysis of the schemas to decide the general integration policy: choosing the schemas to be integrated, deciding the order of integration, and possibly assigning preferences to entire schemas or portions thereof. The choices made in this phase influence the usefulness and relevance of the data corresponding to the global schema. During this phase, additional information relevant to integration is also collected, such as assertions or constraints among views in a schema. Such a process is sometimes referred to as semantic enrichment (García-Solaco, Saltor, & Castellanos, 1995a, 1995b; Blanco, Illarramendi, & Goñi, 1994; Reddy et al., 1994). It is usually performed by translating the source schemas into a richer data model that allows for representing information about dependencies, null values, and other semantic properties, thus increasing the interpretability and believability of the source data.
For example, Johannesson (1994) defines a collection of transformations on schemas represented in a first-order language augmented with rules to express constraints. Such transformations are correct with respect to a given notion of information preservation, and constitute the core of a "standardization" step in a new schema integration methodology. This step is performed before schema comparison and logically subsumes the schema conforming phase, which is not necessary in the new methodology.
In Blanco, Illarramendi, & Goñi (1994), relational schemas are enriched using a class-based logical formalism, a description logic (DL), available in the terminological system BACK (Peltason, 1991). García-Solaco, Saltor, & Castellanos (1995a, 1995b) instead use, as a unifying model, a specific object-oriented model with different types of specialization and aggregation constructs.
The creation of a knowledge base (terminology) in the preintegration step is proposed in Sheth, Gala, & Navathe (1993). More precisely, a hierarchy of attributes is generated, thus representing the relationship among attributes in different schemas. Then, source schemas are classified: the terminology thus obtained corresponds to a partially integrated schema. Such a terminology is then restructured by using typical reasoning services of class-based logical formalisms. The underlying data model is hence the formalism used for expressing the terminology: more precisely, a description logic (CANDIDE) is used.
Schema comparison (also called schema matching) is the phase in which the correlations among concepts of different schemas are determined and possible conflicts are detected. Moreover, interschema properties are typically discovered during this phase.
There has been a considerable amount of research in studying the types of conflicts that arise when comparing source schema components (see, e.g., Batini, Lenzerini, & Navathe, 1986; Krishnamurthy, Litwin, & Kent, 1991; Spaccapietra, Parent, & Dupont, 1992; Ouksel & Naiman, 1994; Reddy et al., 1994), and consensus has arisen on their classification, which can be summarized as follows:
Heterogeneity conflicts arise when different data models are used for the source schemas.
Naming conflicts arise because different schemas may refer to the same data using different terminologies. Typically one distinguishes between homonyms, where the same name is used to denote two different concepts, and synonyms, where the same concept is denoted by different names.
Semantic conflicts arise due to different choices in the level of abstraction when modeling similar real-world entities.
Structural conflicts arise due to different choices of constructs for representing the same concepts.
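To make the classification concrete, the following fragment sketches two hypothetical relational source schemas exhibiting naming, semantic, and structural conflicts; all table and attribute names are invented for the example.

```python
# Hypothetical fragments of two source schemas, illustrating the conflict
# classes above. All names are invented for the example.

# Source A: a customer's address is a single attribute of one relation.
schema_a = {
    "Client": ["ssn", "name", "address"],      # "Client"/"Customer": synonyms
}

# Source B: the same concept is modeled at a finer level of abstraction
# (a semantic conflict) with a separate Address relation (a structural
# conflict), under partly different terminology.
schema_b = {
    "Customer": ["ssn", "name", "addr_id"],
    "Address":  ["addr_id", "street", "city"],
}

# A homonym: the name "Rate" denotes different concepts in the two sources.
homonyms = {"Rate": ("exchange rate in source A", "tax rate in source B")}

print(sorted(schema_a) + sorted(schema_b))
```

If one source were relational and the other object-oriented or XML-based, the same example would additionally exhibit a heterogeneity conflict.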
In general, this phase requires a strong knowledge of the semantics underlying the concepts represented by the schemas. The more formally the semantics is represented in the schemas, the easier it is to automatically detect similar concepts in different schemas, possibly with the help of specific CASE tools that support the designer. Traditionally, schema comparison was performed manually (Batini, Lenzerini, & Navathe, 1986). However, recent methodologies and techniques emphasize automatic support for this phase.
For example, Blanco, Illarramendi, & Goñi (1994) exploit the reasoning capabilities of the terminological system to classify relational schema components and derive candidate correspondences between them expressed in the description logic BACK.
In Miller, Ioannidis, & Ramakrishnan (1994), the problem of deciding equivalence and dominance between schemas is analyzed, based on the formal notion of information capacity given in Hull (1986). Specifically, schemas are expressed in a graph-based data model which allows for the representation of inheritance and simple forms of integrity constraints. It is proven that such a problem is undecidable for schemas that occur in practice; moreover, sufficient conditions for schema dominance are defined, based on a set of schema transformations that preserve schema dominance. A schema S1 is dominated by a schema S2 if there is a sequence of such transformations that converts S1 to S2.
Krishnamurthy, Litwin, & Kent (1991) discuss reconciliation of semantic discrepancies in the relational context due to information represented as data in one database and as meta-data in another. The paper proposes a solution based on reifying relations and databases by transforming them into a structured representation.
An architecture in which schema comparison and the subsequent phase of schema conforming are iterated is proposed in Gotthard, Lockemann, & Neufeld (1992). At each cycle, the system proposes correspondences between concepts that can be confirmed or rejected by the designer. The system uses newly established correspondences both to conform the schemas and to guide its proposals in the following cycle. A data model that essentially corresponds to an entity-relationship model extended with complex objects is used to express both the component schemas and the resulting global schema.
In Bouzeghoub & Comyn-Wattiau (1990), schema comparison in an extended entity-relationship model is performed by analyzing structural analogies between subschemas through the use of similarity vectors. Subsequent conforming is achieved by transforming the structures into a canonical form.
Palopoli, Saccà, & Ursino (1999) present semi-automatic techniques for detecting synonym and homonym relationships between objects belonging to different entity-relationship schemas. In particular, such techniques are based on algorithms whose input is a set of weighted synonym, homonym, and inclusion relationships between objects. The weight represents the "plausibility factor" for the relationship to hold. Based on such relationships, new weighted relationships are automatically derived, which in turn are expected to hold with a plausibility degree corresponding to the computed weight. The method for deriving new relationships consists of pairwise comparison of schema objects E1 and E2, which measures the similarity of all objects related to E1 and E2 in the respective schemas. Such techniques have been implemented in the system DIKE (Palopoli, Terracina, & Ursino, 2000).
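The flavor of such plausibility-based derivation can be sketched as follows. The scheme below, which averages, for each neighbor of the first object, the weight of its best-matching neighbor of the second object, is a simplified assumption for illustration, not the actual DIKE algorithm; all object names are invented.

```python
# Simplified sketch of deriving a new weighted synonym relationship between
# two objects from the known weighted relationships among their neighbors.
# The averaging scheme is an illustrative assumption, not the DIKE method.

def derived_plausibility(e1, e2, neighbors, synonyms):
    """Plausibility that e1 and e2 are synonyms, from their neighborhoods.

    neighbors: dict mapping an object to the objects related to it
    synonyms:  dict mapping unordered pairs to known plausibility weights
    """
    n1, n2 = neighbors.get(e1, []), neighbors.get(e2, [])
    if not n1 or not n2:
        return 0.0
    # For each neighbor of e1, take its best match among the neighbors of e2.
    best = [max(synonyms.get(frozenset((a, b)), 0.0) for b in n2) for a in n1]
    return sum(best) / len(best)

# Known weighted synonym relationships between objects of two ER schemas:
synonyms = {
    frozenset(("S1.Name", "S2.FullName")): 0.9,
    frozenset(("S1.Salary", "S2.Wage")): 0.7,
}
neighbors = {
    "S1.Employee": ["S1.Name", "S1.Salary"],
    "S2.Worker":   ["S2.FullName", "S2.Wage"],
}

w = derived_plausibility("S1.Employee", "S2.Worker", neighbors, synonyms)
print(round(w, 2))  # → 0.8
```

The derived pair ("S1.Employee", "S2.Worker") would then itself be added to the pool of weighted relationships, allowing further derivations in subsequent iterations.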
In Madhavan, Bernstein, & Rahm (2001), an algorithm is presented for the detection of correspondences between schema elements in a very general data model that is able to capture relational, object-oriented, and XML schemas. The algorithm takes into account several aspects of schema elements (names, data types, constraints), integrating linguistic and structural matching techniques. Moreover, the algorithm is able to cope with mappings of shared types and with some forms of schema constraints (e.g., foreign key constraints).
Finally, Bergamaschi et al. (2001) present a schema comparison technique that computes, for each pair of objects A1, A2 in the schemas, a value corresponding to the "affinity" between A1 and A2. Such an affinity is obtained as the weighted sum of three kinds of affinity: name, data type, and structural affinity. Name affinity is obtained by resorting to thesauri that specify relationships (e.g., synonym, hypernym) between object names; data type affinity is obtained by means of a table that defines compatibilities between data types, while structural affinity is obtained by analyzing the similarity of the relationships the objects participate in, in the respective schemas. In this framework, schemas are represented using description logics. Such techniques have been implemented in ARTEMIS, a tool for schema integration, which is used as a component of the MOMIS system (Beneventano et al., 2000) for integration of relational, object-oriented, and semi-structured source schemas.
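The weighted-sum computation can be sketched as follows. The weights, the thesaurus, and the data type compatibility table below are illustrative placeholders, not those actually used by ARTEMIS, and the structural measure is a crude stand-in for comparing the relationships two objects participate in.

```python
# Sketch of a global affinity computed as a weighted sum of name, data type,
# and structural affinity. All weights and tables are illustrative.

THESAURUS = {frozenset(("client", "customer")): 0.8}   # synonym strength
TYPE_COMPAT = {frozenset(("varchar", "string")): 1.0,
               frozenset(("int", "float")): 0.6}

def name_affinity(n1, n2):
    return 1.0 if n1 == n2 else THESAURUS.get(frozenset((n1, n2)), 0.0)

def type_affinity(t1, t2):
    return 1.0 if t1 == t2 else TYPE_COMPAT.get(frozenset((t1, t2)), 0.0)

def structural_affinity(rels1, rels2):
    # Fraction of shared relationship names (Jaccard similarity), a crude
    # stand-in for comparing the relationships the objects participate in.
    if not rels1 and not rels2:
        return 1.0
    shared = len(set(rels1) & set(rels2))
    return shared / max(len(set(rels1) | set(rels2)), 1)

def affinity(o1, o2, w_name=0.5, w_type=0.2, w_struct=0.3):
    return (w_name * name_affinity(o1["name"], o2["name"])
            + w_type * type_affinity(o1["type"], o2["type"])
            + w_struct * structural_affinity(o1["rels"], o2["rels"]))

a1 = {"name": "client",   "type": "varchar", "rels": ["orders", "address"]}
a2 = {"name": "customer", "type": "string",  "rels": ["orders"]}
print(round(affinity(a1, a2), 2))  # → 0.75
```

Object pairs whose affinity exceeds a threshold would then be proposed to the designer as candidate correspondences.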
The phase of schema conforming has the goal of conforming or aligning schemas to make them compatible for integration. Conflict resolution is the most challenging aspect of this phase. Typically, semi-automatic solutions to schema conforming are proposed, in which intervention of the designer is requested by the system when conflicts have to be resolved. Recent methodologies and techniques also emphasize the automatic resolution of specific types of conflicts (e.g. structural conflicts). However, a logical reconstruction of conflict resolution is far from being accomplished and is still an active topic of research.
For example, in Vidal & Winslett (1994), a general methodology for schema integration is presented, in which the semantics of updates is preserved during the integration process. Specifically, three steps are defined: combination, restructuring, and optimization. In the first phase, a combined schema is generated, which contains all source schemas and assertions (constraints) expressing relationships among entities in different schemas. The restructuring step is devoted to normalizing (through schema transformations) and merging views, thus obtaining a global schema, which is refined in the optimization phase. Such a methodology is based on a semantic data model which allows for declaring constraints containing indications on what to do when an update violates them. A set of schema transformations is defined that is update-semantics preserving, in the sense that any update specified against the transformed schema has the same effect as if it were specified against the original schema.
Qian (1996) presents a formal analysis of the problem of establishing correctness of schema transformations. More specifically, schemas are modeled as abstract data types, and schema transformations are expressed in terms of signature interpretations. The notion of schema transformation correctness is based on a refinement of Hull's notion of information capacity (Hull, 1986). In particular, such a refinement allows for a formal study of schema transformations between schemas expressed in different data models.
During the schema merging and restructuring phase, the conformed schemas are superimposed, thus obtaining a (possibly partial) global schema. Such a schema is then tested against qualities such as completeness, correctness, minimality, and understandability. This analysis may give rise to further transformations of the obtained schema.
Geller et al. (1992a, 1992b) present an integration technique (structural integration) which allows for the integration of entities that have structural similarities, even if they differ semantically. An object-oriented model, called DUAL model, is used, in which structural aspects are represented as object types, and semantic aspects are represented as classes. Two notions of correspondence between classes are defined: full structural correspondence and partial structural correspondence. The (partial) integration of two schemas is then obtained through a generalization of the classes representing the original schemas.
In Spaccapietra, Parent, & Dupont (1992), a methodology for schema integration is presented. Such a methodology allows for automatic resolution of structural conflicts and building of the integrated schema without requiring conforming of the initial schemas. The methodology is applicable to various source data models (relational, entity-relationship, and object-oriented models), and is based on an expressive language to state interschema assertions that may involve constructs of schemas expressed in different models. Data model independent integration rules that correspond to the interschema assertions are defined in the general case and are also specialized to the various classical data models. Quality issues are addressed in an informal way: specifically, correctness is achieved by selecting, in case of conflicts, the constructs with weaker constraints. The methodology includes strategies that avoid introducing redundant constructs in the generated schema. However, completeness is not guaranteed, since the model adopted for the global schema lacks a generalization construct.
Once the global schema is generated, the various source schemas are related to the global schema. The way in which the data at the sources are related to elements of the global schema, i.e., the way in which the mapping is specified, may assume different forms.
As explained in the previous sections, two basic approaches, GAV and LAV, have been used to specify the mapping between the sources and the global schema (Lenzerini, 2001; Levy, 1999, 2000; Li & Chang, 2000). The GAV approach, also called query-based approach, requires that the global schema is expressed in terms of the data sources. More precisely, to every element of the global schema, a view over the data sources is associated, so that its meaning is specified in terms of the data residing at the sources. The LAV approach, also called source-based approach, requires the global schema to be specified independently from the sources. In turn, the sources are defined as views over the global schema. The relationships between the global schema and the sources are thus established by specifying the information content of every source in terms of a view over the global schema.
Intuitively, the GAV approach provides a method for source integration with a more procedural flavor than the LAV approach. Indeed, whereas in LAV the designer may concentrate on specifying the content of the sources in terms of the global schema, in the GAV approach one is forced to specify how to obtain the data of the global schema through queries over the sources.
A comparison of the LAV and the GAV approaches is reported in Ullman (1997). The LAV approach ensures an easier extensibility of the integration system, and provides a more appropriate setting for its maintenance. For example, adding a new source to the system requires only providing the definition of the source, and does not necessarily involve changes in the global schema. On the contrary, in the GAV approach, adding a new source may in principle require changing the definition of the concepts in the global schema.
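To illustrate the two mapping styles, the following sketch uses a Datalog-like notation over an invented global relation product(C, P) and invented sources s1, s2; it also shows why adding a source is a local change under LAV.

```python
# Illustrative GAV vs. LAV mappings over the same setting: a global relation
# product(C, P) and two sources s1(C, P), s2(C, P). The relation names and
# the Datalog-like strings are invented for the example.

# GAV: every element of the GLOBAL schema is associated with a view over
# the sources; here product is obtained as the union of the two sources.
gav_mapping = {
    "product(C, P)": ["s1(C, P)", "s2(C, P)"],
}

# LAV: every SOURCE is characterized as a view over the global schema;
# here each source is declared to contain (a subset of) product.
lav_mapping = {
    "s1(C, P)": "product(C, P)",
    "s2(C, P)": "product(C, P)",
}

# Adding a new source s3 under LAV adds one local definition, leaving the
# global schema and the existing mappings untouched; under GAV, the view
# defining product would have to be revised to mention s3 as well.
lav_mapping["s3(C, P)"] = "product(C, P)"

print(sorted(lav_mapping))
```

The asymmetry visible here is exactly the extensibility argument above: LAV localizes the knowledge about each source in its own view definition.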