We have seen in the previous section that the adoption of a particular query language can be useful not only to express the desired query in a more straightforward manner, but also to speed up the computation process. In this section we investigate the problem of efficiently computing the answer to a multidimensional query, possibly expressed in one of the languages introduced above.
A fundamental requirement of OLAP systems is the ability to perform multidimensional analyses with on-line response times. Since multidimensional queries usually involve huge amounts of data to be aggregated, the only way to achieve this is to pre-compute some queries, store the answers permanently in the database, and reuse them (almost exclusively) to compute queries over the multidimensional database. These pre-computed queries are commonly referred to as materialized views, and several issues related to them, particularly which views to materialize and how to maintain them, have been analyzed in Chapter 8. In this section we show how materialized views can be used to process multidimensional queries efficiently. We also briefly outline some techniques for their efficient computation, based on specifically designed indexing structures.
The problem of computing the answer to a query by using some pre-computed (materialized) views has been extensively studied in the literature and is generically known as the problem of answering queries using views (see for instance Levy et al., 1995; Duschka & Genesereth, 1997; Halevy, 2000). Informally speaking, the problem can be stated as follows: given a query Q and a collection of views V over the same schema s, is it possible to compute the answer to Q by using (only) the information provided by the views in V? A more rigorous distinction has also been made between view-based query rewriting and query answering, corresponding to two distinct approaches to the general problem (Calvanese et al., 2000c, 2000d; Halevy, 2000).
Query rewriting is based on the use of the view definitions to produce a new rewritten query, expressed in terms of the available view names and equivalent to the original one. The answer to the query can then be obtained by evaluating the rewritten query on the view extensions (instances). Query answering is based instead on the exploitation of both the view definitions and the view extensions, trying to determine the best possible answer (possibly a subset of the exact one) that can be extracted from the view extensions. Several authors have studied the query answering problem for various classes of queries and views, both under the Closed and the Open World Assumption (Abiteboul & Duschka, 1998; Grahne & Mendelzon, 1999; Calvanese et al., 2000a, 2000b).
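The rewriting idea can be sketched in a few lines of Python over an in-memory SQLite database. This is a toy illustration, not an actual rewriting algorithm: the "materialized view" is simply a pre-computed table, the aggregate (SUM) is distributive, and all table names and data are invented.

```python
import sqlite3

# Toy example: a materialized view holding monthly totals is used to
# rewrite a yearly-total query, so the fact table is never re-scanned.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales(day TEXT, amount INTEGER);
INSERT INTO sales VALUES ('2023-01-05', 10), ('2023-01-20', 5),
                         ('2023-02-11', 7), ('2024-03-02', 4);
-- "Materialized view": totals by month, stored as an ordinary table.
CREATE TABLE v_monthly AS
  SELECT substr(day, 1, 7) AS month, SUM(amount) AS total
  FROM sales GROUP BY month;
""")

# Original query, expressed over the base table ...
original = conn.execute(
    "SELECT substr(day,1,4) AS year, SUM(amount) FROM sales GROUP BY year"
).fetchall()

# ... and its rewriting, expressed only in terms of the view:
# yearly totals are sums of monthly totals, since SUM is distributive.
rewritten = conn.execute(
    "SELECT substr(month,1,4) AS year, SUM(total) FROM v_monthly GROUP BY year"
).fetchall()

print(sorted(original) == sorted(rewritten))  # True
```

The rewritten query touches only the (much smaller) view, which is precisely the source of the performance gain in OLAP settings.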
In general, query answering techniques are preferable in contexts where exact answers are rather unlikely to be obtained (e.g., integration of heterogeneous data sources, like websites), and where requirements on response times are not very stringent. On the other hand, as noted in Grahne & Mendelzon (1999), query answering methods can be extremely inefficient, because it is difficult or even impossible to process only the "useful" views and to apply optimization techniques such as pushing selections and joins. As a consequence, the rewriting approach is more appropriate in contexts like OLAP systems, where the amount of data is very large and fast response times are required (Goldstein & Larson, 2001), and for query optimization, where different query plans need to be maintained in main memory and efficiently compared (Albrecht et al., 2000; Afrati, Li, & Ullman, 2001).
As shown above, a typical elementary multidimensional query is an aggregate group-by query applied to the join of the fact table with two or more dimension tables. This has motivated the interest of many researchers in the rewriting of queries and views of this form.
In Gupta, Harinarayan, & Quass (1995), an algorithm is proposed to rewrite conjunctive queries with aggregations using views of the same form. The technique is based on the concept of generalized projection (GP) and on some transformation rules which can be used by an optimizer and which enable the query and the views to be put in a particular normal form, based on GPSJ (Generalized Projection/ Selection/Join) expressions. The query and the views are analyzed in terms of their query tree, i.e., the tree representing the way to compute them by applying selections, joins, and generalized projections on the base relations. By using the transformation rules, the algorithm tries to produce a match between one or more view trees and subtrees of the query tree (and consequently to replace the computations with accesses to the corresponding materialized views). The results are extended to NGPSJ (Nested GPSJ) expressions in Golfarelli & Rizzi (2000a, 2000b).
In Srivastava et al. (1996), an algorithm is proposed to rewrite a single block (conjunctive) SQL query with GROUP BY and aggregations using some views of the same form. The considered aggregate functions are MIN, MAX, COUNT, and SUM. The algorithm is based on the detection of homomorphisms from the view to the query, as in the non-aggregate context (Levy et al., 1995). It is shown, however, that more restrictive conditions need to be considered when dealing with aggregates, because the view has to produce not only the right tuples, but also the correct multiplicities of those tuples.
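The role of multiplicities can be seen in a toy sketch (data and names invented for illustration): a view that returns the right tuples but eliminates duplicates still supports MIN and MAX, yet produces a wrong SUM.

```python
# Why aggregate rewriting must preserve multiplicities: a view computed
# under set semantics (duplicates removed) returns the "right tuples",
# but SUM over it no longer matches SUM over the base relation.
fact = [("a", 10), ("a", 10), ("b", 5)]   # base table, bag semantics: duplicates count

view_as_set = set(fact)                   # duplicate-eliminating view

def sum_by_key(rows):
    """SUM(value) GROUP BY key, over an iterable of (key, value) pairs."""
    totals = {}
    for k, v in rows:
        totals[k] = totals.get(k, 0) + v
    return totals

print(sum_by_key(fact))         # {'a': 20, 'b': 5}  -- correct
print(sum_by_key(view_as_set))  # {'a': 10, 'b': 5}  -- wrong for 'a'
```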
In Cohen, Nutt, & Serebrenik (1999, 2000), a rather different approach is proposed: the original query to be rewritten, the usable views, and the rewritten query are all expressed using an extension of Datalog with aggregate functions (again COUNT, SUM, MIN, and MAX) as the query language. Queries and views are assumed to be conjunctive. Several rewriting candidates of particular forms are considered, and for each rewriting candidate, the views in its body are unfolded (i.e., replaced by their body in the view definition). Finally, the unfolded candidate is compared with the original query to verify equivalence by using known equivalence criteria for aggregate queries, particularly those proposed in Nutt, Sagiv, & Shurin (1998) for COUNT, SUM, MIN, and MAX queries. The technique can be extended by using the equivalence criteria for AVG queries presented in Grumbach, Rafanelli, & Tininini (1999), based on the syntactic notion of isomorphism modulo a product.
An important issue in query rewriting is that of identifying the views that may actually be useful in the rewriting process: this is often referred to as the view usability problem. In the non-aggregate context, it is shown (Levy et al., 1995) that a conjunctive view is usable to produce a conjunctive rewritten query if a homomorphism exists from the body of the view to the body of the query. In Grumbach, Rafanelli, & Tininini (1999), it is shown that more restrictive (necessary and sufficient) conditions, based on the concept of sound homomorphisms, are needed for the usability of conjunctive count views to rewrite conjunctive count queries. It is also shown that in the presence of aggregations, considering rewriting candidates of conjunctive form is not sufficient, and more complex forms of rewritings may be required, particularly those based on the concept of isomorphism modulo a product.
All rewriting algorithms proposed in the literature try to obtain a rewritten query of a particular form by using (possibly only) the available views. An interesting question is: "Can I rewrite more by considering rewritten queries of more complex form?" and the even more ambitious one: "Given a collection of views, is the information they provide sufficient to rewrite a query?" In Grumbach & Tininini (2000a), the problem is investigated in a general framework based on the concept of query subsumption. Basically, the information content of one or more queries is characterized by their distinguishing power, i.e., by their capability to determine that two database instances are different. Hence a collection of views subsumes a query if it is able to distinguish any pair of instances that is distinguishable by the query, and it is shown that a rewriting of a query using some views exists if the views subsume the query. In the particular case of count and sum queries defined over the same fact table, an algorithm is proposed and demonstrated to be complete. In other words, even if the algorithm (as any algorithm of practical use) considers rewritten queries of particular forms, it is shown that no improvement could be obtained by considering rewritten queries of more complex forms.
Finally, in Grumbach & Tininini (2000b), a completely different approach to the problem of aggregate rewriting is proposed. The technique is based on the idea of formally expressing the relationships (metadata) existing between raw and aggregate data, and also among aggregate data of different types and/or levels of detail. The data are stored in standard relations, while the metadata are represented by numerical dependencies, namely Horn clauses that formally express the semantics of the aggregate attributes. The mechanism is tested by transforming the numerical dependencies into Prolog rules and then exploiting the Prolog inference engine to produce the rewriting.
In this section we briefly analyze the techniques proposed to compute cubes and, more generally, materialized views containing aggregates. We focus only on the problem of computing the views from scratch, since the problem of maintaining a collection of views in the presence of insertions, deletions, and updates has been covered in Chapter 8.
As already pointed out, a typical multidimensional query consists of an aggregate group-by query applied to the join of the fact table with two or more dimension tables, and consequently has the form of an aggregate conjunctive query, e.g.:
SELECT D1.dim1, D2.dim2, AGG(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1 AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2
where AGG is an aggregation function, like SUM, MIN, AVG, etc.
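As a concrete (toy) illustration, the template can be instantiated with SUM and executed on an in-memory SQLite database; the schema follows the template above, while all table contents are invented for the example.

```python
import sqlite3

# A runnable instance of the typical multidimensional query:
# aggregate group-by over the join of a fact table with two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_table(dimKey1 INTEGER, dimKey2 INTEGER, measure INTEGER);
CREATE TABLE dim_table1(dimKey1 INTEGER, dim1 TEXT);
CREATE TABLE dim_table2(dimKey2 INTEGER, dim2 TEXT);
INSERT INTO dim_table1 VALUES (1, 'north'), (2, 'south');
INSERT INTO dim_table2 VALUES (1, '2023'), (2, '2024');
INSERT INTO fact_table VALUES (1, 1, 10), (1, 1, 20), (2, 1, 5), (2, 2, 7);
""")

rows = conn.execute("""
SELECT D1.dim1, D2.dim2, SUM(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1 AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2
""").fetchall()

print(sorted(rows))  # [('north', '2023', 30), ('south', '2023', 5), ('south', '2024', 7)]
```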
Traditional query processing systems first perform all joins expressed in the FROM and WHERE clauses, and only afterwards the grouping on the result of the join and the aggregation on each group. The algorithms to produce the groups can be broadly classified as techniques based on a) sorting on the GROUP BY attributes and b) hash tables (see Graefe, 1993). However, there are many common cases where an early evaluation of the GROUP BY is possible, and it can significantly reduce the computation time since: i) it reduces the size of the input of the join (usually very large in the context of multidimensional databases); and ii) it enables the query processing engine to use indexes to perform the (early) GROUP BY on the base tables.
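The effect of an early GROUP BY can be sketched as follows (a toy SQLite example with invented data, assuming a distributive aggregate such as SUM): the fact table is pre-aggregated on its foreign key before the join, shrinking the join input without changing the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_table(dimKey INTEGER, measure INTEGER);
CREATE TABLE dim_table(dimKey INTEGER, dim TEXT);
INSERT INTO dim_table VALUES (1, 'north'), (2, 'south');
INSERT INTO fact_table VALUES (1, 10), (1, 20), (1, 5), (2, 7);
""")

# Late aggregation: join first, then group and aggregate.
late = conn.execute("""
SELECT D.dim, SUM(F.measure)
FROM fact_table F, dim_table D
WHERE F.dimKey = D.dimKey GROUP BY D.dim
""").fetchall()

# Early aggregation: pre-aggregate the fact table on dimKey, then join
# the (smaller) partial result and combine the partial sums.
early = conn.execute("""
SELECT D.dim, SUM(G.partial)
FROM (SELECT dimKey, SUM(measure) AS partial
      FROM fact_table GROUP BY dimKey) G, dim_table D
WHERE G.dimKey = D.dimKey GROUP BY D.dim
""").fetchall()

print(sorted(late) == sorted(early))  # True
```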
In Chaudhuri & Shim (1995, 1996), some techniques and applicability conditions are proposed for transforming execution plans into equivalent (more efficient) ones. As is typical in query optimization, the technique is based on pull-up transformations, which delay the execution of a costly operation, e.g., a group by on a large dataset, by moving it towards the root of the query tree, and on push-down transformations, which are used, for example, to anticipate an aggregation, thus decreasing the size of a join.
Several transformations for multidimensional queries are also proposed in Gupta, Harinarayan, & Quass (1995), based on the concept of generalized projection (GP, see above). The transformations enable the optimizer: i) to push a GP down the query tree; ii) to pull a GP up the query tree; iii) to coalesce two GPs into one, or conversely to split up one GP into two. Query tree transformations are also used in the rewriting process.
In Agarwal et al. (1996), some algorithms are proposed and compared to extend the traditional techniques for the computation of GROUP BY queries to the processing of the CUBE operator. The algorithms are applicable to distributive aggregate functions and exploit the property that, for such functions, higher-level aggregates can be computed from lower-level ones.
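The distributivity property these algorithms rely on can be sketched in a few lines of Python (toy data, invented dimension names): every coarser group-by of the cube is computed from the finest-level SUMs, never by re-scanning the raw data.

```python
from itertools import combinations

ALL = '*'  # marker for a rolled-up ("all values") dimension

def roll_up(cells, dims_kept):
    """Aggregate finest-level cells, keeping only the dimensions in dims_kept."""
    out = {}
    for key, value in cells.items():
        coarse = tuple(k if i in dims_kept else ALL for i, k in enumerate(key))
        out[coarse] = out.get(coarse, 0) + value
    return out

# Finest-level aggregates: SUM(measure) by (product, year).
finest = {("tv", "2023"): 30, ("tv", "2024"): 7, ("pc", "2023"): 5}

# The CUBE: one roll-up per subset of the two dimensions,
# each computed from the finest level (valid because SUM is distributive).
cube = {}
for r in range(3):
    for subset in combinations(range(2), r):
        cube.update(roll_up(finest, set(subset)))

print(cube[("tv", ALL)], cube[(ALL, "2023")], cube[(ALL, ALL)])  # 37 35 42
```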
In MOLAP systems, however, the above methods for the computation of the CUBE are inadequate, since they are essentially based on sorting and hashing techniques, which cannot be applied to multidimensional arrays. In Zhao, Deshpande, & Naughton (1997), an algorithm is proposed for the computation of CUBEs in a MOLAP environment. It is shown that the computation can be made significantly more efficient (and even more efficient than ROLAP-based computations) by exploiting the inherently compressed representation of data in MOLAP systems. In particular, the algorithm benefits from the compactness of multidimensional arrays, enabling the query processor to transfer larger "chunks" of data to main memory and to process them efficiently.
In many practical cases GROUP BYs at the finest level of granularity correspond to sparse data cubes, i.e., cubes where a high percentage of points correspond to null values. Consider for instance the CUBE corresponding to the query "Number of calls by customer, day, and antenna": it is rather obvious that a considerable number of combinations correspond to zero. Fast techniques to compute sparse data cubes are proposed in Ross & Srivastava (1997). They are based on: i) decomposing the fact table into fragments that can be stored in main memory; ii) computing the data cube in main memory for each fragment; and finally iii) combining the partial results.
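The fragment-based strategy can be sketched as follows (a simplified Python illustration with invented data; in the actual technique the fragments are chosen to fit in main memory, and the combination step is valid because the aggregate, here a count of calls, is distributive).

```python
# i) split the fact table into fragments; ii) aggregate each fragment
# in memory; iii) combine the partial results.
fact_rows = [("c1", "d1", 1), ("c2", "d1", 1), ("c1", "d1", 1), ("c2", "d2", 1)]

def aggregate(rows):
    """In-memory group-by: SUM(calls) by (customer, day)."""
    partial = {}
    for cust, day, calls in rows:
        partial[(cust, day)] = partial.get((cust, day), 0) + calls
    return partial

def merge(acc, partial):
    """Combine a fragment's partial result into the accumulated cube."""
    for key, value in partial.items():
        acc[key] = acc.get(key, 0) + value
    return acc

fragment_size = 2  # stands in for "what fits in main memory"
partials = [aggregate(fact_rows[i:i + fragment_size])
            for i in range(0, len(fact_rows), fragment_size)]

result = {}
for p in partials:
    merge(result, p)

print(result)  # {('c1', 'd1'): 2, ('c2', 'd1'): 1, ('c2', 'd2'): 1}
```

Note that only the non-null combinations ever appear as dictionary keys, which is exactly why this representation suits sparse cubes.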
Probably the most common indexes used in traditional DBMSs are B+-trees (Comer, 1979): a particular form of tree labeled with the index key values and having a list of record identifiers at the leaf level, i.e., a list of elements specifying the actual position of each record on disk. The index key values may consist of one or more columns of the indexed table. OLTP queries usually retrieve a very limited number of tuples (or even a single tuple accessed through the primary key index), and in these cases B+-trees have been shown to be particularly efficient.
In contrast, OLAP queries typically involve large groups of tuples to be aggregated, requiring specifically designed indexing structures. Unlike in the OLTP context, there is no "universally good" index for multidimensional queries, but rather a variety of techniques, each of which can perform well for specific types of data and forms of queries while being inappropriate for others.
Let us again consider the typical multidimensional query expressed by the SQL query above. The core operations related to its evaluation are: 1) the joins of the fact table with two or more dimension tables; 2) the grouping of tuples according to some dimensional values; 3) the application of an aggregation function on each group of tuples. An interesting type of index, which can be used to efficiently perform operation 1, is the join index (Valduriez, 1987): while conventional indexes map column values to records in one table, join indexes map values to records of two (or more) joined tables, thus constituting a particular form of materialized view. Join indexes in their original version are not directly usable to evaluate OLAP queries efficiently, but can be very effective in combination with other indexing techniques, like bitmaps and partitioning.
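The core idea of a join index can be sketched in Python (a deliberately simplified, in-memory model with invented data; a real join index is a disk-resident structure): for each value of a dimension attribute, the index stores the identifiers of the matching fact-table records, so the join result is located without scanning either table at query time.

```python
dim_table = {1: "north", 2: "south"}     # dimKey -> region
fact_table = [(1, 10), (1, 20), (2, 5)]  # (dimKey, measure); position = record id

# Build the join index: region value -> identifiers of joining fact records.
join_index = {}
for rid, (dim_key, _) in enumerate(fact_table):
    join_index.setdefault(dim_table[dim_key], []).append(rid)

# Evaluate SUM(measure) for region 'north' by fetching only indexed records.
total = sum(fact_table[rid][1] for rid in join_index["north"])
print(join_index, total)  # {'north': [0, 1], 'south': [2]} 30
```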
Bitmap indexes (O'Neil, 1987; O'Neil & Graefe, 1995; Chan & Ioannidis, 1998) are useful to perform the grouping of tuples and some forms of aggregation. In practice, these indexes use a bitmap representation for the list of record identifiers at the leaf level of the tree: if the table t contains n records, then each leaf of the bitmap index (corresponding to a specific value c of the indexed column C) contains a sequence of n bits, where the i-th bit is set to 1 if ti.C=c, and to 0 otherwise. Bitmap representations are advantageous when the number of distinct key values is low and when several predicates of the form (Column = value) are to be combined in AND/OR, since the combination can be performed by bitwise AND-ing/OR-ing the corresponding bitmaps, an operation which can be parallelized and performed very efficiently on modern processors. Finally, bitmap indexes can be used for a fast computation of count queries, since counting can be performed directly on the bitmap representation, without even accessing the selected tuples.
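The mechanics can be sketched using Python integers as bit vectors (toy data, invented column values): bit i of the bitmap for value c is 1 exactly when record i has value c, predicates combine by bitwise OR/AND, and COUNT is a population count on the result.

```python
# A low-cardinality column, one value per record.
records = ["red", "blue", "red", "green", "blue", "red"]

# Build one bitmap (an int used as a bit vector) per distinct value.
bitmaps = {}
for i, value in enumerate(records):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Predicate (color = 'red' OR color = 'blue'): a single bitwise OR ...
matches = bitmaps["red"] | bitmaps["blue"]

# ... and COUNT(*) is a population count on the resulting bitmap,
# with no access to the records themselves.
print(bin(matches), bin(matches).count("1"))  # 0b110111 5
```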
In projection indexes (French, 1995; O'Neil & Quass, 1997), the tree access structure is coupled with a sort of materialized view, representing the projection of the table on the indexed column. The technique has some analogies with vertically partitioned tables and is well suited when the aggregate operations need to be performed on one or more indexed columns.
Bit-sliced indexes (O'Neil & Quass, 1997) can be considered as a combination of the two previous techniques: the values of the projected column are encoded, and a bitmap is associated with each bit component of the encoding. The technique has some analogies with bit transposed files, which were proposed in Wong et al. (1985, 1986) to compute queries on very large scientific and statistical databases. These indexes achieve their best performance for SUM and AVG aggregations, but are not well suited for aggregations involving more than one column.
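A minimal sketch of the idea, assuming a plain binary encoding of small non-negative measures (toy data): slice j is a bitmap holding bit j of every value in the column, and SUM becomes a weighted population count over the slices.

```python
values = [5, 3, 7, 2]   # measure column, one value per record

# Build the bit slices: bit i of slices[j] is bit j of the i-th value.
NBITS = 3               # enough bits for the values above
slices = [0] * NBITS
for i, v in enumerate(values):
    for j in range(NBITS):
        if (v >> j) & 1:
            slices[j] |= 1 << i

# SUM(measure) = sum over j of 2^j * popcount(slice j),
# computed from the bitmaps alone.
total = sum((1 << j) * bin(s).count("1") for j, s in enumerate(slices))
print(total)  # 17
```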
In Chan & Ioannidis (1998, 1999), some variations on the general idea of bitmap indexes are presented, by studying encoding schemes, time-optimal and space-optimal indexes, as well as trade-off solutions. A comparative study on the use of STR-tree-based indexes (a particular form of spatial index) versus some variations of bitmap indexes in the context of OLAP range queries can be found in Jürgens & Lenz (1999).