DIFFERENT USE OF MULTIDIMENSIONAL DATA: OLAP APPLICATIONS | Multidimensional Databases: Problems and Solutions

Recently, other proposals for modeling multidimensional databases have been presented. These proposals are based on the cube metaphor, which conveys the concept of multidimensionality through a three-dimensional representation (see Agrawal, Gupta, & Sarawagi, 1997; Gyssens & Lakshmanan, 1997; Lehner, 1998; Pedersen & Jensen, 1999; Nguyen, Tjoa, & Wagner, 2000). In fact, there has been significant interest in multidimensional database systems for developing business analysis and decision support applications. E.F. Codd (Codd, Codd, & Salley, 1993) proposed the term On-Line Analytical Processing (OLAP) for rendering enterprise data in multidimensional perspectives, performing on-line analysis of data using mathematical formulas or more sophisticated statistical analyses, and consolidating and summarizing data according to multiple dimensions.

The new data models for OLAP, based on a multidimensional view of data, typically categorize data as either measurable business facts (measures) or dimensions; the latter are mostly textual and characterize the facts. In practice, most OLAP research has concentrated on performance issues. Techniques have been developed for computing the data cube (see Agarwal et al., 1996), for deciding what subset of a data cube to pre-compute (see Harinarayan, Rajaraman, & Ullman, 1996; Gupta, Harinarayan, Rajaraman, & Ullman, 1997), for estimating the size of multidimensional aggregates (see Shukla, Deshpande, Naughton, & Ramasamy, 1996), and for indexing pre-computed summaries. It has been suggested that researchers who study the conceptual model should try to combine the OLAP models with the more advanced and consolidated data model concepts from the field of statistical multidimensional aggregate databases.

OLAP databases have the following conceptual structure: summary measure, summary function, dimensions, and classification hierarchies. As claimed by Shoshani (1997), statistical databases have exactly the same components. In spite of the different origins of these two areas (socioeconomic applications in one case, business applications that collect and analyze information for decision making in the other), there are many similarities in the problems they tackle.
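The four components listed above can be made concrete with a minimal sketch. The records, the city-to-country hierarchy, and the function name `summarize` are hypothetical and chosen only for illustration; they do not come from any of the cited models.

```python
from collections import defaultdict

# Hypothetical base records: (city, year, units_sold).
# Dimensions: geography and time. Summary measure: units_sold.
records = [
    ("Rome", 1996, 10),
    ("Milan", 1996, 20),
    ("Paris", 1996, 5),
    ("Rome", 1997, 7),
]

# Classification hierarchy on the geography dimension: city -> country.
city_to_country = {"Rome": "Italy", "Milan": "Italy", "Paris": "France"}


def summarize(records, classify, summary_function):
    """Group the measure by (country, year) and apply the summary function."""
    groups = defaultdict(list)
    for city, year, units in records:
        groups[(classify[city], year)].append(units)
    return {cell: summary_function(values) for cell, values in groups.items()}


# Summary function: SUM. Any other function over a list of values could be passed.
total_units = summarize(records, city_to_country, sum)
```

Here `total_units` holds one summarized value per (country, year) cell; passing `max` or `len` instead of `sum` changes only the summary function, not the dimensions or the hierarchy.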

Apart from the different emphasis placed on the use of data in the two environments and on the research done (modeling and privacy for statistical databases, efficiency of access and of data analysis for OLAP), another distinction between the two is that, while statistical databases are usually derived (summarized) from other base data, OLAP databases often represent the base data directly.

Similarities and differences between OLAP and statistical databases (SDBs), that is, the non-obvious connection between analyzing business data and analyzing socioeconomic data, are discussed in Shoshani (1997). Both deal with multidimensional data sets, and both are concerned with statistical summarizations over the dimensions of those data sets.

The data about individuals or original objects from which statistical databases are derived is referred to in the statistical database literature as "microdata," and the summarized dataset as "macrodata." In addition, the data associated with classification structures is referred to as "metadata." Metadata can be quite extensive and is often managed by specialized systems or by general-purpose database systems, such as relational systems. Statistical databases mostly present macrodata, either for reasons of privacy or because the original dataset is of no interest (i.e., only the summaries are needed for statistical analysis). In OLAP, summaries may obscure the phenomena we wish to discover, so we start with the original dataset. These generalizations do not always hold but, as observed in Shoshani (1997), by and large most examples bear them out. In any case, more or less formal models have been proposed in the literature. The more significant proposals are the following.

Agrawal, Gupta, & Sarawagi (1997) propose a data model (and a few algebraic operators) that provides a semantic foundation for multidimensional databases. The distinguishing feature of the proposed model, shared with the models of other authors, is the symmetric treatment of dimensions and measures. The model also provides support for multiple hierarchies along each dimension. The data model is a multidimensional cube with a set of basic operations defined on it, each of which produces a new cube as output (that is, the operators are closed). The proposed model is a logical model, so it does not impose any particular storage mechanism.
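The closure property can be illustrated with a small sketch in which every operation takes a cube and returns a cube, so operations compose freely. The class `Cube` and the method names `restrict` and `merge` are illustrative assumptions; they are loosely inspired by the kind of operators the authors describe and are not a faithful rendering of their algebra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cube:
    dims: tuple   # dimension names, e.g. ("product", "month")
    cells: dict   # coordinate tuple -> element stored in the cell

    def restrict(self, dim, predicate):
        """Return a new cube keeping only cells whose `dim` value passes `predicate`."""
        i = self.dims.index(dim)
        kept = {k: v for k, v in self.cells.items() if predicate(k[i])}
        return Cube(self.dims, kept)

    def merge(self, dim, mapping, combine):
        """Return a new cube where `dim` values are coarsened by `mapping`
        and cells that collide are combined pairwise with `combine`."""
        i = self.dims.index(dim)
        merged = {}
        for key, value in self.cells.items():
            new_key = key[:i] + (mapping(key[i]),) + key[i + 1:]
            if new_key in merged:
                merged[new_key] = combine(merged[new_key], value)
            else:
                merged[new_key] = value
        return Cube(self.dims, merged)


# Hypothetical sales cube.
sales = Cube(
    ("product", "month"),
    {("tea", "Jan"): 10, ("tea", "Feb"): 12, ("coffee", "Jan"): 7, ("coffee", "Mar"): 5},
)

# Because each operator yields a Cube, calls can be chained.
tea_q1 = (
    sales.restrict("product", lambda p: p == "tea")
         .merge("month", lambda m: "Q1", lambda a, b: a + b)
)
```

Nothing in the sketch fixes how cells are stored; a dictionary is used only because it is the shortest way to write a logical cube in Python.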

Gyssens & Lakshmanan (1997) propose an n-dimensional table as the fundamental data structure of the multidimensional database. Drawing on the terminology of statistical databases, the authors classify the attribute set associated with the schema of a table into two kinds: parameters and measures. As in Agrawal, Gupta, & Sarawagi (1997), there is no a priori distinction between parameters and measures, in that any attribute can play either role. The actual contents of a table are essentially orthogonal to the associated structure, i.e., the distribution of attributes over dimensions and measures. Separating the two leads to a relational view of a table. The cells of an n-dimensional table can have more than one value.
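The separation between contents and structure can be sketched as follows: the same set of rows (the relational view) is laid out under two different assignments of attributes to parameters and measures. The rows and the helper `tabulate` are hypothetical, and the sketch simplifies the authors' model by using one attribute per dimension.

```python
from collections import defaultdict

# Relational view: the contents, independent of any tabular structure.
rows = [
    {"part": "bolt", "year": 1996, "region": "east", "sales": 50},
    {"part": "bolt", "year": 1996, "region": "west", "sales": 60},
    {"part": "nut", "year": 1997, "region": "east", "sales": 70},
]


def tabulate(rows, parameters, measures):
    """Lay the rows out as a table: parameters give the cell coordinates,
    measures give the cell entries. A cell may collect several entries."""
    table = defaultdict(list)
    for row in rows:
        coordinate = tuple(row[a] for a in parameters)
        table[coordinate].append(tuple(row[a] for a in measures))
    return dict(table)


# Structure 1: part and year are parameters, sales is the measure.
by_part_year = tabulate(rows, ("part", "year"), ("sales",))

# Structure 2: same contents, but part now plays the role of a measure.
by_region = tabulate(rows, ("region",), ("part", "sales"))
```

In `by_part_year`, the cell at `("bolt", 1996)` holds two entries, which mirrors the remark that a cell can have more than one value.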

Lehner (1998) proposes a modeling approach, declaring explicitly that the proposed nested multidimensional data model is not yet another data model but provides necessary extensions in different directions. The multidimensional context of the proposed model uses dimensional structures, which model the business terms of the user's world in a rich and powerful way, for gaining analytical access to the measures or facts. The author gives the formal definition of "Primary and Secondary Multidimensional Objects," reflecting the multidimensional view of classification and dimensional attributes, in order to represent a consistent and intuitive view of nested multidimensional data cubes. The author also explains that, as introduced in Rafanelli & Ricci (1983), the aggregation type describes the aggregation operators that are applicable to the modeled data (Σ: data can be summarized; Φ: data may be used for average calculations; c: constant data, which implies no application of aggregation operators). The cardinality of the context descriptor reflects the dimensionality of the corresponding data cube.

In Pedersen & Jensen (1999) and, subsequently, in Pedersen, Jensen, & Dyreson (2001), multidimensional data modeling for complex data is proposed. For every part of the model, the authors define the intension and the extension. An n-dimensional fact schema is defined as a two-tuple S = (Y, D), where Y is a fact type and D = {τ_i, i = 1,…, n} is its corresponding set of dimension types. A dimension type τ is a four-tuple (C, ≤_τ, ⊤_τ, ⊥_τ), where C = {C_j, j = 1,…, k} are the category types of τ, and ≤_τ is a partial order on the C_j's, with ⊤_τ ∈ C and ⊥_τ ∈ C being the top and bottom elements of the ordering, respectively. Thus, they deduce that the category types form a lattice. The intuition is that one category type is "greater than" another category type if members of the former's extension logically contain members of the latter's extension, i.e., they have a larger element size. The top element of the ordering corresponds to the largest possible element size, that is, there is only one element in its extension, which logically contains all other elements. C_j is a category type of τ, written C_j ∈ τ, if C_j ∈ C. The authors assume a function Pred: C → 2^C that gives the set of immediate predecessors of a category type C_j. The authors observe that many types of data, e.g., ages or sales amounts, can be added together to produce meaningful results. This data has an ordering on it, so computing the average, minimum, and maximum values makes sense. For other types of data, e.g., dates of birth or inventory levels, the user may not find it meaningful in the given context to add them together. However, the data has an ordering on it, so taking the average or computing the maximum or minimum values does make sense. Some types of data do not have an ordering on them, and so it does not make sense to compute the average, etc. Instead, the only meaningful aggregation is to count the number of occurrences.
Therefore, they affirm that it is possible to support correct aggregation of data by keeping track of what types of aggregate functions can be applied to what data. This information can then be used either to prevent users entirely from performing "illegal" calculations on the data, or to warn them that the result might be "wrong." Following this line of reasoning and previous works (Rafanelli & Ricci, 1983; Lehner, 1998), they distinguish between three types of aggregate functions: Σ, applicable to data that can be added together; Φ, applicable to data that can be used for average calculations; and c, applicable to data that is constant, i.e., it can only be counted. Considering only the standard SQL aggregation functions, they deduce that Σ = {SUM, COUNT, AVG, MIN, MAX}, Φ = {COUNT, AVG, MIN, MAX}, and c = {COUNT}. The aggregation types are ordered, c ⊂ Φ ⊂ Σ, so data with a higher aggregation type, e.g., Σ, also possesses the characteristics of the lower aggregation types.
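The bookkeeping described above can be sketched in a few lines: each data item carries an aggregation type, and a function is applied only if it belongs to the set allowed for that type. The three sets are taken from the text; the names `AggType`, `ALLOWED`, and `aggregate`, and the choice to raise an error rather than issue a warning, are assumptions made for this illustration.

```python
from enum import Enum


class AggType(Enum):
    SIGMA = "Σ"  # data that can be added together
    PHI = "Φ"    # data usable for average calculations
    C = "c"      # constant data: can only be counted


# Allowed SQL aggregate functions per aggregation type, as given in the text.
ALLOWED = {
    AggType.SIGMA: {"SUM", "COUNT", "AVG", "MIN", "MAX"},
    AggType.PHI: {"COUNT", "AVG", "MIN", "MAX"},
    AggType.C: {"COUNT"},
}

# Plain-Python stand-ins for the SQL functions.
FUNCTIONS = {
    "SUM": sum,
    "COUNT": len,
    "AVG": lambda xs: sum(xs) / len(xs),
    "MIN": min,
    "MAX": max,
}


def aggregate(name, values, agg_type):
    """Apply aggregate `name` to `values`, refusing combinations
    that the aggregation type does not permit."""
    if name not in ALLOWED[agg_type]:
        raise ValueError(f"{name} is not meaningful for data of type {agg_type.value}")
    return FUNCTIONS[name](values)


sales_total = aggregate("SUM", [10, 20, 30], AggType.SIGMA)   # additive data
mean_stock = aggregate("AVG", [3, 5], AggType.PHI)            # ordered, non-additive data
```

A call such as `aggregate("SUM", [3, 5], AggType.PHI)` raises `ValueError`; a system that prefers to warn rather than block could log the message and proceed instead.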

In Nguyen, Tjoa, & Wagner (2000), the authors introduce a conceptual multidimensional data model that facilitates a precise, rigorous conceptualization for OLAP. OLAP systems organize data using the multidimensional paradigm in the form of data cubes, each of which is a combination of multiple dimensions with multiple levels per dimension. Summarized data is preaggregated and stored with the main purpose of exploring the relationship between independent, static variables (dimensions) and dependent, dynamic variables (measures). Moreover, dimensions always have structures and contain one or more natural hierarchies, together with other attributes that have no hierarchical relationship to any of the attributes in the dimensions. The first goal of the paper is to propose a model able to represent and capture natural hierarchical relationships among members within a dimension. The model allows the handling of dimensions with complex structures, such as unbalanced and multi-hierarchical structures. Moreover, the proposed data model permits the representation of relationships between dimension members and measure data values by means of cube cells. The second goal of the paper is the modeling of the conceptual multidimensional data model in terms of classes by using UML. Based on the formal representation of the class specifications in UML, the design and implementation of the data model for object-oriented databases are straightforward.
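To make the notion of an unbalanced hierarchy concrete, the following sketch stores the members of one dimension as a child-to-parent map in which leaves sit at different depths, and rolls cube cells up to every ancestor. The member names, the cell values, and the functions `ancestors` and `roll_up` are hypothetical; the sketch covers a single hierarchy only and does not reproduce the authors' UML classes.

```python
# Geography dimension as a child -> parent map (None marks the root).
# "Monaco" hangs directly under the root while the other leaves sit one
# level deeper, so the hierarchy is unbalanced.
parent = {
    "Vienna": "Austria",
    "Graz": "Austria",
    "Austria": "Europe",
    "Monaco": "Europe",
    "Europe": None,
}

# Cube cells link leaf members (plus a year) to a measure value.
cells = {("Vienna", 2000): 5, ("Graz", 2000): 3, ("Monaco", 2000): 2}


def ancestors(member):
    """Yield the member itself and then each ancestor, bottom-up."""
    while member is not None:
        yield member
        member = parent[member]


def roll_up(cells):
    """Sum each cell value into the member and all of its ancestors."""
    totals = {}
    for (member, year), value in cells.items():
        for node in ancestors(member):
            totals[(node, year)] = totals.get((node, year), 0) + value
    return totals


totals = roll_up(cells)
```

Because the roll-up follows parent links rather than fixed level numbers, it does not matter that the path from "Monaco" to the root is shorter than the path from "Vienna".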