In this section we briefly report on data models that have been proposed for multidimensional databases, in relation to the requirements reported in the previous section. A more thorough examination and comparison of many such models can be found in several survey papers appearing in the literature (Blaschka et al., 1998; Pedersen, 2000; Rafanelli, 1995; Vassiliadis & Sellis, 1999). General discussion on OLAP, multidimensional analysis, and data warehousing can be found in Chaudhuri & Dayal (1997), Codd, Codd, & Salley (n.d.), Colliat (1996), Inmon (1996), and Samos et al. (1998). Mendelzon (n.d.) has published a rather comprehensive on-line bibliography on this subject. Further up-to-date information can be found on specialized websites, for instance Greenfield (n.d.) and Pendse & Creeth (n.d.).
It should be said that some of the models cited in this section cannot be classified as "conceptual" in the sense specified earlier. However, they are mentioned to provide a general overview of the state of the art in both the research community and commercial systems.
According to the classification proposed by Pedersen (2000), data warehousing models can be divided into three main categories: cube models, multidimensional models, and statistical models. The first category contains simple models that provide the notion of cube, but in which the concept of dimension is modeled to only a limited extent. Conversely, multidimensional models allow dimensions to be represented in structured (although different) ways. Finally, the term statistical models denotes the large body of work in the area of statistical database modeling, which is closely related to the multidimensional approach (Shoshani, 1997).
Simple cube models (Datta & Thomas, 1997; Gray et al., 1996; Gyssens & Lakshmanan, 1997; Kimball, 1996) treat data in the form of n-dimensional cubes. They all have a more or less explicit notion of fact, measure, and dimension. However, the hierarchy between the various levels of aggregation in a dimension is not explicitly captured by the schema, so the user cannot infer from the schema that, for instance, City rolls up to State and not the opposite. In the star schema approach (Kimball, 1996), a central relational table represents the fact on which the analysis is focused, and a number of tables, usually de-normalized, represent the dimensions of analysis. The star schema and its variants, such as the snowflake schema, should also be considered cube models, as they are semantically equivalent, although at a lower level of abstraction.
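The point about the missing hierarchy can be illustrated with a minimal sketch. The tables, column names, and the helper `determines` below are hypothetical and are not taken from any of the cited models. The dimension table is just a flat list of columns, so the direction of the roll-up between `city` and `state` can only be recovered by inspecting the data (or from outside knowledge), not from the schema itself.

```python
# Hypothetical star schema: one fact table and one de-normalized dimension table.
store_dim = [
    {"store_id": 1, "city": "San Jose", "state": "CA"},
    {"store_id": 2, "city": "Los Angeles", "state": "CA"},
    {"store_id": 3, "city": "Austin", "state": "TX"},
]
sales_fact = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 3, "amount": 7},
]

# The schema of the dimension is only a set of column names;
# nothing in it says which level rolls up to which.
store_dim_columns = sorted(store_dim[0].keys())


def determines(rows, source, target):
    """Return True if each value of `source` maps to exactly one value of `target`."""
    seen = {}
    for row in rows:
        # setdefault stores the first target seen for this source value
        # and returns the stored one on later occurrences.
        if seen.setdefault(row[source], row[target]) != row[target]:
            return False
    return True


# The roll-up direction has to be inferred from the instance.
city_to_state = determines(store_dim, "city", "state")
state_to_city = determines(store_dim, "state", "city")
```

On this sample, `city_to_state` holds and `state_to_city` does not, but this is a property of the data at hand rather than a constraint declared by the schema.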
The majority of models adopted by commercial systems (Pendse & Creeth, n.d.; Oracle Corporation, 1998) should also be included in this category. Modeling aspects are covered by commercial systems in a pragmatic way. The representation used in ROLAP (Relational OLAP) systems is the star schema (Kimball, 1996), whose limits in representing the multidimensional aspects of OLAP applications at the right level of abstraction have already been discussed. In MOLAP (Multidimensional OLAP) systems (Colliat, 1996), information is represented directly in multidimensional form, but the structure of a dimension is usually hard-coded in the physical index structures used to access data.
Multidimensional models (Agrawal, Gupta, & Sarawagi, 1997; Cabibbo & Torlone, 1997, 1998a; Dyreson, 1996; Franconi & Sattler, 1999; Jagadish, Lakshmanan, & Srivastava, 1999; Lehner, 1998; Li & Wang, 1996; Mendelzon & Vaisman, 2000; Microsoft Corporation, 2000; Nguyen, Tjoa, & Wagner, 2000; Pedersen & Jensen, 1999; Vassiliadis, 1998) capture the hierarchies in the dimensions explicitly, providing a better understanding of the application and support for easy data cube manipulation. This information may also be useful for query formulation and optimization.
Interestingly, while the basic features are more or less covered by these models, each of them represents the dimension structure very differently, e.g., by using grouping relations (Li & Wang, 1996), dimension merging functions (Agrawal, Gupta, & Sarawagi, 1997), measure graphs (Dyreson, 1996), roll-up functions (Cabibbo & Torlone, 1998a; Mendelzon & Vaisman, 2000), level lattices (Vassiliadis, 1998), hierarchy schemes and instances (Jagadish, Lakshmanan, & Srivastava, 1999), or an explicit tree-structured hierarchy as part of the cube (Lehner, 1998; Microsoft Corporation, 2000).
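As an illustration of what capturing the hierarchy explicitly can mean, the following is a minimal sketch loosely inspired by the idea of roll-up functions. It does not reproduce the formal definition of any cited model; the names `ROLLS_UP_TO`, `ROLL_UP`, `roll_up_value`, and `roll_up_cube`, as well as the sample data, are assumptions made for the example.

```python
from collections import defaultdict

# Explicit order between levels: each level names the level it rolls up to.
ROLLS_UP_TO = {"city": "state", "state": "country"}

# One roll-up function per pair of adjacent levels, given here as a mapping.
ROLL_UP = {
    ("city", "state"): {"San Jose": "CA", "Los Angeles": "CA", "Austin": "TX"},
    ("state", "country"): {"CA": "USA", "TX": "USA"},
}


def roll_up_value(value, from_level, to_level):
    """Map a member of `from_level` to its ancestor at `to_level`."""
    level = from_level
    while level != to_level:
        parent = ROLLS_UP_TO.get(level)
        if parent is None:
            # The schema itself rules out roll-ups in the wrong direction.
            raise ValueError(f"{from_level} does not roll up to {to_level}")
        value = ROLL_UP[(level, parent)][value]
        level = parent
    return value


def roll_up_cube(cube, from_level, to_level):
    """Aggregate a one-dimensional cube (member -> additive measure) to a coarser level."""
    result = defaultdict(int)
    for member, measure in cube.items():
        result[roll_up_value(member, from_level, to_level)] += measure
    return dict(result)


sales_by_city = {"San Jose": 10, "Los Angeles": 5, "Austin": 7}
sales_by_state = roll_up_cube(sales_by_city, "city", "state")
sales_by_country = roll_up_cube(sales_by_city, "city", "country")
```

Because the order between levels is part of the schema in this sketch, a request such as rolling `state` up to `city` is rejected without looking at the data, which is the kind of information a query formulation or optimization step could exploit.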
A number of data models have also been defined by extending traditional conceptual data models (Sapia et al., 1998). Others have used known paradigms, e.g., object-orientation (Abello, Samos, & Saltor, 2000) and nested structure models (Dekeyser et al., 1998), or specific metaphors, e.g., tapes (Gebhardt, Harke, & Jacobs, 1997). Finally, several data models have been proposed mainly to study specific data warehousing application problems, and are therefore well suited to them. These problems include incomplete information (Dyreson, 1998; Pedersen, Jensen, & Dyreson, 1999), efficiency issues (Harinarayan, Rajaraman, & Ullman, 1996; Jagadish, Lakshmanan, & Srivastava, 1999), heterogeneous dimensions (Hurtado & Mendelzon, 2001), dimension updates (Hurtado, Mendelzon, & Vaisman, 1999), and temporal OLAP queries (Mendelzon & Vaisman, 2000).
The last group consists of statistical database models (Bezenchek, Rafanelli, & Tininini, 1996; Rafanelli, 1995; Rafanelli & Ricci, 1993; Rafanelli & Shoshani, 1990; Shoshani, 1997; Tininini, Bezenchek, & Rafanelli, 1996).
A great deal of relevant work has already been done in this area. Shoshani (1997) compared the work done on statistical and multidimensional databases. The comparison revealed that, once differences in terminology are set aside, the two areas overlap considerably, even though each has emphasized different aspects. In particular, research in statistical databases has focused on the treatment of complex classification structures, the management of certain special dimensions (e.g., spatial and geographic), and the issues of privacy and summarizability, which are especially important from the statistical point of view. The OLAP literature, on the other hand, has emphasized data warehouse design, query processing, and, above all, efficiency. It is clear, however, that despite the difference in emphasis, work done in one area can greatly benefit the other (Shoshani, 1997).
A statistical data model is usually based on the notions of summary table, summary attribute, and category attribute. Actually, there is a close correspondence between these notions and the concepts used in multidimensional data models. Specifically, a summary table corresponds essentially to a data cube, a summary attribute to a measure, and a category attribute to a dimension. As in multidimensional models, a category attribute is always associated with a hierarchy of concepts. A number of operators are usually introduced in statistical models to manipulate, concatenate, and aggregate summary tables.
Notable examples of conceptual statistical models are STORM (Rafanelli & Shoshani, 1990) and Mefisto (Rafanelli & Ricci, 1993). In particular, Mefisto introduces the important notion of statistical entity, the conceptual counterpart of the notion of summary table.
In statistical models, a structured classification hierarchy is almost always coupled with an explicit aggregation function on a single measure, producing a pre-defined object capable of answering a specific set of queries. This approach is sometimes less flexible than those usually taken by multidimensional models but, unlike most of them, it can provide an effective way to avoid incorrect results from queries.
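The text does not spell out how such a pre-defined object avoids incorrect results. One plausible reading, assumed here, is a summarizability check: the object refuses to aggregate when a category does not have exactly one parent in the classification hierarchy, since the measure would otherwise be lost or counted twice. The class `SummaryObject` and its data are hypothetical and do not reproduce STORM, Mefisto, or any other cited model.

```python
class SummaryObject:
    """Hypothetical pre-defined object: one measure, one hierarchy, one fixed aggregation."""

    def __init__(self, cells, parents_of, aggregate=sum):
        self.cells = cells            # category -> measure value
        self.parents_of = parents_of  # category -> list of parent categories
        self.aggregate = aggregate    # aggregation function fixed at definition time

    def roll_up(self):
        """Aggregate to the parent level, refusing hierarchies that are not summarizable."""
        groups = {}
        for child, value in self.cells.items():
            parents = self.parents_of.get(child, [])
            if len(parents) != 1:
                # Zero parents would drop the value; several would count it more than once.
                raise ValueError(
                    f"{child!r} has {len(parents)} parents; roll-up is not summarizable"
                )
            groups.setdefault(parents[0], []).append(value)
        return {parent: self.aggregate(values) for parent, values in groups.items()}


# Each city has exactly one parent, so the roll-up is allowed.
population = SummaryObject(
    cells={"Rome": 3, "Milan": 1},
    parents_of={"Rome": ["Italy"], "Milan": ["Italy"]},
)
population_by_country = population.roll_up()

# Here a city is classified under two parents, so the roll-up is refused.
overlapping = SummaryObject(
    cells={"Rome": 3},
    parents_of={"Rome": ["Lazio", "Centre"]},
)
```

Fixing the aggregation function and the hierarchy in the object is what makes the check possible up front. The price, as noted above, is that only the queries anticipated by the object can be answered.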