

When using materialized views in a MDDB, the most important problem to solve is to find the set of materializations that maximizes the performances of the MDDB in answering a given set of representative queries. The tradeoff consists of choosing a set of materializations able to speed up query response time without requiring too much work to keep the materializations current with respect to the modifications on the tables of the MDDB.
Definition 13: Given an MDDB DB, a set of queries Q, and a set of frequency values F of queries in Q and updates on the tables of DB, the MDmatproblem (multidimensional database materialization problem) is represented by Θ(DB, Q, F). A solution to the problem Θ(DB, Q, F) is a set of views of the MDlattice M (which can contain views that are not associated with queries in Q). A trivial solution, M = Ø, is always possible, which represents the situation where no additional materialization is available and all the queries must be answered directly by the fact table (the root of the hierarchy).
To identify the optimal M, we must define a cost function. The optimal solution is the one characterized by the minimum cost. The cost function is composed of two parts: the query cost and the update cost.
Definition 14: Consider a set of queries Q specified by the user. Let F be a set of frequencies f _{q i} , each associated with a query q_{i} ∊ Q, representing the frequency with which query q_{i} is asked. Let c_{qi} (M) be the cost to compute q_{i} from the set of materializations M (discussed in the second section).
Then, the total query cost C_{Q}(Q, M, F) is given by:
Definition 15: Consider the same set of materialized views M. Let f _{m i} be the frequency with which the materialized view m_{i} ∊ M is modified and c_{u} (m_{i} ) its update cost. Then, the total update cost C_{M}(M, F) is given by:
Definition 16: Given an MDmatproblem described by Θ(DB, Q, F), the cost of a solution M is the sum of query and update costs:
We do not present here a complete cost model. The techniques we describe are applicable to a wide choice of cost models, from simple to complex ones. Very few restrictions have to be imposed on the cost formulas to permit the adoption of our results.
The function c _{q i}(M) returns the cost of computing query q_{i} given a set of materializations M. We make two hypothesis about the query cost function: that each query cost depends on a unique element in M, and that the cost is monotonic with the size of the materialization on which the query depends.
Definition 17: A query cost function c_{qi} (M) is restrictible if it is always equal to the least among the values obtained by considering c _{q i} (m_{j} ), for all the m_{j} ∊ M on which query q_{i} depends.
Every query of the type introduced in the fourth section of this chapter is restrictible (since the fact table is always present in the set of materializations).
Definition 18: Given a query q_{i} , a restrictible query cost function c_{qi} , and a set of materializations M, the materialization m_{j} such that c_{qi}(M)= c_{qi}(m_{j}) is the least expensive materialization (for query q_{i} among the elements of M).
Definition 19: A query cost function cq_{i} (MDlattice) is monotonic if for all v_{j} , v_{k} ∊ MDlattice on which q_{i} depends, v_{j} < v_{k} → cq_{i}(v_{j}) ≦ cq_{i}(v_{k}), where v represents the cardinality of v and the arrow denotes logical implication. From the observation on the cardinalities of the views in the lattice, a monotonic function also guarantees that v_{j}≼v_{k} → cq_{i}(v_{j}) ≦ c_{qi}(v_{k}).
The simple cost model introduced in Harinarayan, Rajaraman, & Ullman (1996) is, for example, both restrictible and monotonic: the cost of answering a query q is set equal to the number of tuples read to return the answer. This cost model, although very simple, well characterizes the problem. An extension of this model permits consideration of the availability of indexes to accelerate the execution of queries, as described in Gupta et al. (1997). We do not treat indexes here, but a technique similar to Gupta et al.'s (1997) can be applied in this context as well, as it is also demonstrated in Agrawal, Chaudhuri, & Narasayya (2000).
The extension to an MDDB and join operations may require modification of this simple metric, which is still adequate when it is possible to keep in memory all the dimensions involved in the join. In fact, in this case the join requires reading only once all the tuples, and since the cardinality of the fact table is greater than that of any dimension, often by several orders of magnitude (Kimball, 1996), the cost of performing a join operation between any dimension and the fact table F can be modeled as the cost of reading F. When the join is more complex, a different cost function may be required. Sophisticated cost functions can be designed which adequately model this aspect of the system and still respect the requisites of restrictibility and monotonicity.
The cost of performing a groupby operation can be modeled as the cost of reading the table or materialized view on which the grouping is computed. This hypothesis does not consider the sort order which is used to store each materialization (see Agarwal et al., 1996), which has a relevant impact on the complexity of computing the aggregations; this aspect can also be modeled by a restrictible and monotonic cost function.
Compared to the number of nodes that form the MDlattice, the number of representative queries is extremely small. This considerable sparsity of queries among the views of an MDlattice suggests that only some of these views are relevant when deciding which views to materialize in order to minimize the total cost. The idea of our reduction technique is to consider only those views of an MDlattice that, when materialized, can provide some contribution to reduce the total cost. We call them candidate views, and define them as follows.
Definition 20: A view v_{i} belonging to an MDlattice is a candidate view if one of the following two conditions holds:
View v_{i} is associated with some query q_{i}.
There exist two candidate views v_{j} and v_{k}, such that v_{i} is the least upper bound (l.u.b.) of v_{j} and v_{k}.
Note that, from the previous definition, it follows that noncandidate views do not have an associated query.
Definition 21: Given a noncandidate view v_{i} with at least one candidate view depending on it, we call most directly dependent candidate view the unique candidate view v_{j} such that all the remaining candidate views that depend on v_{i} also depend on v_{j}.
We can determine the most directly dependent candidate view of v_{j} by taking the set V′ of all the candidate views that depend on v_{j} and computing the l.u.b. of all the views in V′. This view is unique (because an l.u.b. is always determined), is a candidate view (because it is an l.u.b. of candidate views), and belongs to V′ (because v_{a} ≼ v_{i} ∧ v_{b} ≼ v_{i} → (v_{a} ⊕ v_{b}) ≼ v_{i}).
Intuitively, it is not difficult to see that candidate views may provide some benefit if they are chosen to be materialized. If the view is associated with a query, materializing it will decrease the total cost if the increment of the update cost is compensated by the reduction on the query cost. When the view is the l.u.b. of other candidate views v_{j} and v_{k}, which may be associated with queries q_{j} and q_{k}, the total cost will be minor if the cost needed to answer q_{j} and q_{k} using a materialization of v_{i} is compensated by the reduction on the update cost of the materializations of v_{i} and v_{j}. We can analyze more formally these properties.
Let v_{i} be a candidate view of an MDlattice. Then, choosing v_{i} for materialization may provide some benefit when looking for the solution that reduces the total cost. There are in fact two cases:
v_{i} has an associated query q_{i}.
It is trivial to show that the materialization of a view associated with a query can help the computation of the query. Starting from the definition of the cost of a solution, we obtain the following formula, which identifies the query frequency value f _{q i} that makes convenient the materialization of view v_{i} when a set of views M is already materialized:
v_{t} ∊ M represents the least expensive materialization that can be used to answer q_{i}.
There exist at least two candidate views, v_{j} and v_{k}, such that v_{i} is the l.u.b. of v_{j} and v_{k}.
It is enough to show that there exists at least one case in which materializing v_{i} provides some benefit. Assume that there exist two queries q_{j} and q_{k} associated with views v_{j} and v_{k}, respectively. The contribution of queries q_{j} and q_{k} and views v_{j} and v_{k} to the cost C(Q, M, F), when v_{j} and v_{k} are materialized and v_{i} is not, is:
The contribution to the cost C(Q, M, F) if v_{i} is materialized and v_{j} and v_{k} are not, with v_{i} being the least expensive materialization for both q_{j} and q_{k}, is:
Choosing v_{i} for materialization will decrease the total cost if C_{1} > C_{2}, i.e., when:
with the hypothesis that c_{u}(v_{j})+ c_{u}(v_{k})− c_{u}(v_{i}) > 0. The intuition is that if the cost of updating views v_{j} and v_{k} is greater than the cost of updating v_{i}, for a high enough update frequency f_{u,}, it may be convenient to materialize view v_{i} instead of v_{j} and v_{k}.
Moreover, if we want to ensure that candidate views are the only relevant views for the process of deciding which views to materialize, we must prove also that materializing a noncandidate view may never decrease the total cost. This is done in Theorem 1.
Theorem 1: Let v_{i} be a noncandidate view of an MDlattice. Then, the choice of v_{i} for materialization is always dominated by the choice of a candidate view.
Proof: We will assume that there exists a noncandidate view v_{i} that belongs to the set of materializations M of the optimal solution, and we will get a contradiction. We distinguish two cases.
There is no candidate view depending on v_{i}.
The contribution of view v_{i} to the cost C_{1} = C(Q, M, F) of the solution M for Θ(DB, Q, F) (v_{i} ∊ M) is represented only by the contribution to the update cost f_{u} · c_{u}(v_{i}), since no query can use v_{i} (otherwise there would be candidate views depending on v_{i}). The cost of the solution M−v_{i} is equal to C_{2} = C(Q, M−v_{i} , F), where C_{1} = C_{2}+f_{u} · c_{u}(v_{i}). Since materializing v_{i} must provide some benefit, it must happen that C_{1} < C_{2}, i.e., f_{u} · c_{u}(v_{i})< 0. Then, we get a contradiction because both f_{u} and c_{u}(v_{i}) must be positive.
There exists at least one candidate view depending on v_{i}.
We first identify view v_{j}, the most directly dependent candidate view of v_{i}. We then consider separately two subcases, represented in Figures 4 and 5.
Figure 4: A Configuration of Materializations
Figure 5: A Configuration of Materializations
Case 1: Both v_{j} and v_{i} are materialized (represented in Figure 4). Let M′ = M −v_{i} −v_{j}. The cost of this solution is:
We observe that view v_{i} is not used by any query q_{i} ∊ Q, because all the queries that depend on v_{i} also depend on v_{j}, and since v_{i}≼ v_{j}, it follows that c_{qi}(v_{i})> c_{qi}(v_{j}). We can then remove it from the first term of the formula for C_{3} obtaining:
The cost of the solution with materialization M′ ∪ v_{j} is:
The difference between C_{3} and C_{4} is represented by the single term f_{u} · c_{u}(v_{i}). In order for C_{3} to be the optimum (i.e., C_{3} < C_{4}), it must happen that f_{u} · c_{u}(v_{i})< 0. We get a contradiction since both f_{u} and c_{u}(v_{i}) must be positive.
Case 2: v_{i} is materialized and v_{j} is not (represented in Figure 5). Let M′ = M − v_{i}. The cost of this solution is:
In particular, this solution must be better than the solution where v_{j} is materialized but v_{i} is not, which has the cost C_{4} defined in the above formula. But each term c _{q i} (M′ ∪ v_{j} ) in C_{4} must be less than c_{qi} (M′ ∪ v_{i} ) in C_{5} , because all the queries that depend on v_{i} must also depend on v_{j} ; v_{j} is smaller than v_{i} and the query cost function is monotonic. Term c_{u} (v_{j} ) must also be smaller than c_{u} (v_{i} ), for the monotonicity of update cost with view size. It is then impossible for C_{5} to be smaller than C_{4}.
Once we have proven that candidate views are the only views that are relevant to the process of deciding which materializations minimize the total cost, we can define the sublattice obtained by considering only candidate views.
Definition 22: Given an MDlattice and a set of queries Q, the set of its candidate views identifies also a lattice where the join and meet operations are identical to the ones defined for the MDlattice. We call this sublattice the MDredlattice.
Given a set Q of queries and the MDDB attribute hierarchy, we can identify all the elements of the MDredlattice. This is performed by means of the following algorithm.
Algorithm 3: The MDredLattice Construction Algorithm
Algorithm 3 iteratively extends the set L. L is initially equal to the set of views in Q. The l.u.b. of all the pairs of elements in L are added to L, and the process iterates by considering the l.u.b. of all the pairs of views obtained by combining the new elements of L with all the elements of L, until a fixpoint is reached where no new l.u.b. can be added to L.
It is now convenient to describe another technique based on statistical estimates of the size of the materializations. Given its statistical foundation, the next technique is not as exact as the one just presented.
When building the MDredlattice, a simple heuristic technique can be used to further reduce the size of the lattice, removing views which are not expected to contribute to the optimal solution. This heuristics is based on the estimate of the size of the materialization. These estimates can be done following traditional query estimation techniques for aggregates. If the skewness of data is of concern, it is also possible to use the technique, described in Shukla et al. (1996), which obtains an estimate of the number of tuples that is quite precise for a wide range of skewness in the distribution.
According to the estimates on the size of views, it is possible to determine when the level of aggregation used is too detailed and the materialization offers very limited help in answering the query with respect to a materialization of a higher level view.
Example 6: Consider a dimension A with 1,000 tuples. Let a view contain an aggregation for the pair of attributes {A_{1} , A_{2}}, where each attribute has 100 distinct values. There are 10,000 possible pairs of values of the attributes, but since there are only 1,000 tuples, at most 1,000 tuples can be present in the view. If the data is uniformly distributed, the estimate of the size of the view is quite close to 1,000. Instead of materializing this view, it could be convenient to use the view which has the key of dimension A as aggregating attribute. The great advantage is that it will be easier to reuse this view in the computation of other aggregates and the number of elements of the lattice will be, in general, reduced.
We can easily modify the MDredlattice construction algorithm to estimate at each step the dimension of a view, and substituting to it a higher level view if the reduction criteria are not met. In the next section we illustrate a few experimental results. We now describe more formally this technique.
Definition 23: A size estimating function size(A)is a function that, applied to a set of attributes A of an MDDB, returns an estimate of the number of tuples of the result of an MDDB query q characterized by A (i.e., having A as grouping attributes).
Definition 24: A size estimating function is correct if for all the functional dependencies fd_{i} ∊ FD_{DB}, size(A^{r}_{i}) ≠ size(A^{l}_{i}).
Intuitively, since there exists a functional dependency A^{l}_{i} → A^{r}_{i}, for each combination of values of A^{l}_{i}, more than one combination of values for the attributes in A^{r}_{i} cannot exist. Thus, a query grouping on A^{l}_{i} cannot have more tuples than a query grouping on A^{r}_{i}.
Algorithm 4: Heuristic Reduction Algorithm
Algorithm 4 considers all the functional dependencies to identify when the attributes in A that appear on the right side (or a subset of them) of a functional dependency produce a size estimate which does not differ more than a ratio ρ from the estimate of the size of the attributes on the left side. When this happens, the attributes on the right side are replaced by the attributes on the left side, effectively moving from a view v_{i} characterized by the set of attributes A to a view v_{j}≼ v_{i}.
The parameter ρ represents the threshold on the amount of reduction in size above which the left side of a dependency should be used in place of the right side. The user should determine this value, depending on a tradeoff between accuracy of the solution and reduction in the size of the solution space. Even high levels (e.g., r = 0,95) can be quite effective in simplifying the solution space.

