KEY FEATURES OF MULTIDIMENSIONAL QUERYING | Multidimensional Databases: Problems and Solutions

In this section we briefly describe some of the distinctive features of OLAP querying with respect to traditional querying on transactional systems. We first make a comparison on some general aspects, e.g., the number of involved records, the proper indexing techniques, the typical user. Then, we consider some issues related to aggregate functions in query languages. Finally, the role of dimensions is illustrated, and it is shown that trade-offs are sometimes necessary to overcome the limitations of the relational model without excessively sacrificing the simplicity of the relational algebra and calculus.

OLTP vs. OLAP Queries

Number of records involved and the role of aggregation. One of the key differences between OLTP and multidimensional queries is the number of records required to compute the answer. OLTP queries typically involve a rather limited number of records, accessed through primary key or other specific indexes, which need to be processed for short, isolated transactions or to be issued on a user interface. In contrast multidimensional queries usually require the classification and aggregation of a huge amount of data.

Indexing techniques. Transaction processing is mainly based on the access of few records through primary key or other indexes on highly selective attribute combinations. The efficient access is easily achieved by well-known and established indexes, particularly B+-tree indexes. In contrast, multidimensional queries require a more articulated approach, since different techniques are required and each index performs well only on some categories of queries. An overview of indexing techniques for OLAP queries, particularly bitmap, projection, and bit-sliced indexes, will be addressed later in this chapter.

Current state vs. historical DBs. In OLTP operations the latest version of the database is required, the concurrent access/update of information is a critical issue, and the database usually represents only the current state of the system. In OLAP systems data do not need to be the latest available and should be time-stamped, thus enabling the user to perform historical analyses with trend forecasts. On the other hand, the presence of the temporal dimension poses relevant problems to the formulation and processing of queries, since schemes may evolve over time and conventional query languages are not adequate to cope with them. The problems related to queries over evolving schemes in OLAP systems are covered in detail in Chapter 6.

Target users. The typical users of OLTP systems are clerks, and the type of queries is rather limited and foreseeable. In contrast, multidimensional databases are usually the core of decision support systems, targeted for the enterprise executives. The types of queries are only partly predictable and often require highly expressive (and complex) query languages to be expressed. On the other hand, the user usually has little confidence even with "easy" query languages like basic SQL: the typical interaction paradigm is a spreadsheet-like environment based on iconic interfaces and on the graphical metaphor of the multidimensional cube.

Aggregate Functions

As already stressed above, the presence of aggregations is one of the most important distinctive features of OLAP systems with respect to conventional transactional systems. An important issue related to the definitions of aggregate functions in the relational model is that, while all standard relational operators take sets of tuples as input and produce sets of tuples as output, aggregate functions require a careful consideration of duplicates. Let us consider for example the sum of the durations of all calls in the Fcalls table. It is evident that the table may contain several tuples sharing the same value for Dur_sum. If we define the SUM aggregate function on the set of values of Dur_sum, we would incorrectly have a single contribution for each distinct value of Dur_sum. On the contrary, in order to correctly compute the global duration of calls, the contribution of each tuple should be considered, even if it is the same as that of another tuple. In other words the sum should not be applied on the set, but rather on the multiset (bag) of values of Dur_sum, i.e., on a collection of values where the same value can appear more than once (or, equivalently, where each value has an associated multiplicity).

The first important work on the extension of the relational model to include aggregate functions is by Klug (1982). In order to avoid multisets, the author considers a definition of aggregate function, which is based on a family of parameterized functions: for example the SUM aggregate function is defined by a family of SUM functions SUM₁, SUM₂,…, SUM_i,…, one for each column of the relation. Since the functions apply on the whole relation rather than on a single column, each value to be summed is part of a distinct tuple and there is no need to consider duplicates. Both an algebra and a calculus (with equivalent expressive power) are defined, having standard grouping capabilities.

The results were extended to relations containing complex objects in Özsoyoglu, Özsoyoglu, & Matos (1987), where a particular operator aggregation-by-template is also introduced, which can be considered as the first multidimensional operator: the aggregates are formed according to the groups defined in a particular table (representing the template), conceptually analogous to what we call now a dimension level.

More recently, the important results on query languages for bags (e.g., those in Albert, 1991; Libkin & Wong, 1993; Grumbach & Milo, 1996) have led to a more "natural" characterization of aggregate functions, closer to practical languages like SQL. In Gupta, Harinarayan, & Quass (1995), a new relational operator is proposed, generalized projection, which represents an extension of the classical relational projection (producing sets of tuples). Generalized projection captures both duplicate- eliminating projection (corresponding to classical projection and SQL DISTINCT projection) and duplicate-preserving projection (corresponding to SQL standard projection and useful to represent aggregations). In Libkin & Wong (1997), some query languages for bags and aggregate functions are analyzed, by investigating the relative complexity and expressive power. In Grumbach, Rafanelli, & Tininini (1999), a first-order language with real arithmetic and aggregation operators is proposed, offering a good trade-off between expressive power (enabling the user to express complex aggregations like median and average interest rates) and a tractable complexity.

Finally, in Cabibbo & Torlone (1999), a general framework for the description and investigation of aggregate functions in query languages is proposed. The abstract description of an aggregate function is obtained by considering a family of scalar functions g₀, g₁,…, g_i,…, corresponding to the computation of the aggregation function on a multiset of 0, 1,…, i, … elements and such that the single g_is can be "easily" built (uniform construction). Depending on the complexity of this construction, different classes of aggregate functions and a hierarchy among them is defined.

The Role of Dimensions

The fact that the standard relational model and operators are inadequate to effectively represent and query multidimensional data was already shown in the early research on statistical databases (Chan & Shoshani, 1981; Su, 1983; Shoshani & Wong, 1985; Ghosh, 1986). This led to the distinction between category attributes (the dimensions) and summary attributes (the measures). One classical observation used to show the inadequacy of the relational model is that category attributes are "special" attributes which cannot be projected out as conventional attributes. For example the "number of calls by day" cannot be obtained by simply removing the category attribute Customer from the "number of calls by day and customer." On the contrary, this operation requires a summarization (roll-up) (Lenz & Shoshani, 1997; Hurtado & Mendelzon, 2001) of the category attribute Customer. Summarization may require a simple sum on the corresponding grouped values in some cases, or a complex series of operations in other cases, or may be even impossible to be performed without accessing the initial raw data, e.g., if the aggregation is a function like median.

The distinction between dimensions and measures is also at the basis of most models for OLAP systems. However, as noted by several authors, this distinction has some drawbacks, mainly because some operations that are easily expressible in the relational algebra become cumbersome in multidimensional models. Moreover, there are cases where a clear distinction between dimensions and measures prevents the user from expressing particular forms of queries. Consider for instance the query, "Give the number of calls classified by day and load level of the connecting antenna" where the load level of an antenna is classified as high, medium, or low depending on the number of calls that were routed through it in that day (let us suppose [0-1000] = low, [1001 - 5000] = medium, and over 5000 = high). It is obvious that the load level is a sort of dimension level for the dimension Antenna, but the actual instances of this dimension level depend on the measure "number of calls," so that in some sense the query transforms the measure into a dimension. Some authors have proposed a multidimensional model with a symmetric treatment of measures and dimensions to cope with this kind of problem.

A further issue related to the handling of dimension levels is that in some cases relations that were not in the initial design of the multidimensional schema, may be used to define additional dimension levels for querying. Consider for instance a relation Ant_type on the schema (antenna_id, type) which specifies the type of each antenna (e.g., the brand and model). This relation could be joined with the multidimensional data cube to express a query, "Give me the number of calls by day and type of antenna." This operation is straightforward in a relational environment (it is a simple join operation), but can be rather difficult to express in multidimensional models.

Finally, note that multidimensional querying is often an exploratory process, performed by navigating along the dimensions, increasing or decreasing the level of detail of the displayed data. So the user may decide to roll-up to less-detailed dimension levels to have a general view of the phenomenon under consideration (for example requesting the total number of calls by month) and then drill-down to investigate in more detail specific subsections of the cube, e.g., to determine which are the call plans that have produced a decrease in the total number of calls in June.

Expressing Multidimensional Queries by Extended Relational Query Languages

Several approaches to the problem of querying multidimensional data have been based on extensions of the relational algebra and calculus and/or of SQL, the most common relational query language. In Özsoyoglu, Özsoyoglu, & Matos (1987), an extension of the relation algebra is proposed to deal with set-valued attributes and aggregation functions. The authors propose an aggregate formation operator, which is a typical aggregate-group by operator and an aggregation-by-template operator, which anticipates some ideas of multidimensional operators and schemes. In practice a template is a table representing pre-specified groupings of attribute values. For instance a template may be a table defining groups of antennas of the same model, and the aggregation-by-template allows the user to perform aggregations on the groups of tuples specified by the template. The operator has evident analogies with aggregations on star schemas, with (specific columns of) dimension tables playing the role of templates.

In Gray et al. (1996), the CUBE operator is proposed as an extension of the conventional SQL GROUP BY clause. Since multidimensional applications often require the computation of several "points" of the n-dimensional data cube, the authors introduce a new operator which computes simultaneously all points of the n-dimensional cube and all possible roll-ups (note that this corresponds to 2ⁿ distinct GROUP BYs!). In order to represent roll-up values, the polymorphic value ALL is associated with each dimension: ALL represents the union of all values in the dimension. As a consequence, in our fact table Fcalls, the tuple (ALL, Jan-4-2001, ALL, 20.5M, 945K) represents the information that in the indicated day 945K calls were globally made for a total duration of 20.5M seconds. The authors also propose an extended version of the GROUP BY operator, allowing the user to compute aggregations over computed categories (dimensions), e.g., to easily define the Day category of Fcalls from the time in seconds attribute of the original Calls table. Finally, for the (rather frequent) case, where the complete computation of the CUBE is prohibitive, the ROLLUP operator is introduced, which produces the result of n consecutive GROUP BYs, by progressively rolling-up each dimension of the data cube.

Simultaneous computation of the whole cube (roll-up) is advantagous, as some expensive operations required to calculate a GROUP BY with an aggregation can be "recycled" when computing the cube or the roll-up. Moreover, for many common aggregation functions, high-level aggregates can be computed by aggregating the lower level ones, thus avoiding the costly access to raw data, which are usually several orders of magnitudes larger than the aggregate data. Both the CUBE and the ROLLUP operators have been included in the SQL99 standard and are now implemented in the main commercial RDBMSs.

The CUBE operator is particularly useful when computing large collections of views to be materialized, but it is of little use when expressing single ad-hoc queries. In Chatziantoniou & Ross (1996), it is shown that some aggregate queries with groupings are very difficult to express with standard relational query languages, particularly SQL, even though conceptually simple and easily computable. What is particularly relevant is that, due to the complex syntax required to express the queries, the query optimizer is very unlikely to be able to compute them efficiently. A trivial example of this kind of query on the table Fcalls is "For each customer find the longest call and the day(s) when it was made." In standard SQL a nested query is required with an inner aggregate block which is very difficult to optimize. For this reason, the authors propose an additional operator for the relational algebra and an extension of SQL, namely the introduction of grouping variables ranging over the tuples of each group, and of a SUCHTHAT clause, defining the range of such grouping variables. It is shown that many aggregate queries requiring a complex syntax in standard SQL can be easily expressed in the extended language and that efficient query plans can be easily generated by the optimizer. The results are further elaborated in Ross, Srivastava, & Chatziantoniou (1998), where an extension of SQL is proposed combining both the characteristics of the CUBE operator and those of the grouping variables with SUCHTHAT clause.

Another interesting extension of SQL is the one adopted by Microsoft SQL Server and called MDX (for multidimensional expressions). The language is based on a symmetric treatment of dimensions and measures, and in fact the collection of measures constitutes a particular form of dimension, accessed through the reserved keyword Measures. Several keywords and clauses are introduced to access the single values (called members) of a dimension level, to refer to the hierarchical descendants/ancestors of a member, to compute new measures (e.g., a profit from a price and a cost measures) and sets of values, to manipulate time series, etc.