MAINTENANCE OF MATERIALIZED VIEWS | Multidimensional Databases: Problems and Solutions

Maintaining a materialized view means to update it according to the changes which occurred to the base tables it refers to. Many techniques have been created for the efficient maintenance of materialized views. This has recently been an extremely active research area, which is well summarized in Gupta & Mumick (1999), which collects the most significant contributions to the area. The problem of efficient view maintenance can be separated in the two problems of computing the updates that have to be applied to the materialized view, and the problem of applying the update to the materialization. In general, maintenance can be immediate (updates are propagated from the base tables to the materializations as soon as they have occurred) or may be deferred (following one of many policies).

In this context we pay attention to the deferred choice, as we refer to the data warehouse environment. Among the many existing proposals, we succinctly describe the most important ones. The book (Gupta & Mumick, 1999) is fully dedicated to the topic of view maintenance and presents all of these solutions in detail.

One of the first proposals appears in Blakeley, Larson, & Tompa (1986), where a technique for the incremental maintenance of select-project-join views is presented. The paper first identifies criteria that identify irrelevant updates, i.e., updates on the base tables that cannot possibly have an impact on the view materialization. Then, it presents the use of counters to determine the number of computations from the base tables that justify the presence of a derived tuple in a view. The technique then demonstrates that the variations to the counters is derivable by substituting the sets of inserted/deleted tuples into the view definition formula. Many solutions have been identified for the incremental maintenance of recursive views (see, for example, Dong & Su, 2000). Other situations that received the interest of many researchers are views with duplicates (Griffin & Libkin, 1995), mediated data (Lu et al., 1995), chronicles (Jagadish, Mumick, & Silberschatz, 1995), and hypertext views (Sindoni, 1998; Labrinidis & Roussopoulos, 1999).

Focusing on our environment, the important contributions are the ones offered by Gupta, Mumick, & Subrahmanian (1993) and by Mumick, Quass, & Mumick (1997). In the first contribution, the authors present two techniques: a counting technique applicable to generic views, and a technique for recursive views. The counting technique extends the proposal of Blakeley, Larson, & Tompa (1986) to views defined using selection, projection, negation, union, and aggregations. The idea of the counting technique is to associate with every tuple in the materialized view a counter which represents the number of derivations for that tuple. Decomposability of operations guarantees that the determination of the update on the view can be obtained by comparing the updates to a base table with the content of all the other tables.

Mumick, Quass, & Mumick (1997) focus on the problem of the efficient maintenance of materialized views in a context similar to the one we depict in this chapter, where views are called summary tables. The solutions presented are all based on the observation that the regular structure of MDDB queries can be exploited to design efficient view maintenance techniques. The paper first identifies the two main issues that characterize the propagation of updates in the multidimensional data model we have introduced. The first problem is the propagation to a single summary view of the updates on the base table. The second problem is how to manage the presence of many summary tables.

The maintenance of a node of the MD-lattice can be based on the observation that only updates to the fact table are significant. We may assume that referential integrity constraints forbid the occurrence of tuples in the fact table that present a value for the dimension key which is not present in the corresponding dimension. This means tuples removed from a dimension cannot correspond with tuples in the fact table, and insertions into the fact table have to correspond to tuples in the dimension. Given the restrictions on the views appearing in the MD-lattice, operations on the dimension will not have an impact on the view, as long as updates on the fact table are always correctly managed. Consider dimension Store in the MDDB example and a generic view q_A. If referential integrity holds, insertions into Store will not contribute to q_A, as they have no corresponding element in the fact table; dually, deletions will only occur when no corresponding tuple is present in the fact table, thus they will not have an impact on q_A. We thus focus on updates to the fact table. To keep the model simple, we only consider insertions and deletions; an update is represented as a deletion followed by an update.

We now reconsider the classification of aggregate functions introduced in Chapter 2. Distributive and algebraic functions are immediately usable and permit an efficient recomputation of the view. Holistic functions limit the family of updates that can be propagated. An important property that distributive and algebraic functions offer is self-maintainability, i.e., the view can be updated based only on the analysis of its current state and of the updates to the fact table. In the example the sum function was used to compute the total sales amount. Since the sum function is distributive, it is possible to self-maintain a node of the lattice. Consider the node q_A that aggregates on a subset of the keys of the dimensions: the view can be updated considering the inserted/deleted tuples one by one and adding/subtracting the sales value represented in the generic inserted tuple t to the total stored in the view tuple t_v characterized by the same values of t for all the dimension keys. For example, an insertion of a tuple Sale(StoreId: 15, Product: 58, Time: 1253, Promotion: 35, Amount: 120) can be applied on MD-lattice node q_{Store,Promotion} as an increase of 120 of the content currentTotal of the view tuple (Store: 15, Product: 58, Total: currentTotal). It may happen that a tuple with the searched values is not present in the view. In this situation, a new tuple will be inserted into the view. In the example, a tuple (Store: 15, Product: 58, Total: 120) will be inserted into q_{Store, Promotion}.

For holistic functions, the situation is more variegated. There are aggregate functions that do not permit self-maintainance of views and always require consideration of the content of the Sales table (like the median and count(distinct)); the max and min functions instead exhibit a peculiar behavior, as they allow self- maintenance for insertions, whereas they may require access to the base table for deletions. This behavior derives directly from the analysis of the properties of min and max: if the materialization keeps only the min (or max) value combined with the counter, when a new value is inserted, it is sufficient to increment the counter and to determine if the new value is smaller (respectively, greater) than the value currently stored; in this case, the view is updated. For deletions, when the deleted value is different from the one stored in the view, the update does not require propagation; when the value to remove is exactly the one stored in the view, the value has to be recomputed considering all the possible values on the base tables. In some contexts, deletions may be considered as non-applicable and min or max views become self-maintainable.

A technique improving update propagation, proposed in Mumick, Quass, & Mumick (1997), requires the computation of a synthesis of updates from all the updates that are propagated up in the MD-lattice. We illustrate the idea with a simple example. Let us suppose that in our Sales MD-lattice, a set of sales is inserted.

Suppose we materialized the queries of Example 2. The propagation of this update to each materialized view is performed by the execution of two functions: a propagate function and a refresh function. The propagate function creates a view-delta table from the above set of changes: it represents the net changes to the materialized view due to the changes in the fact table. The refresh function applies the net changes to the materialized view. This also has the advantage of leaving the warehouse available for querying by clients during the propagate phase.

The maintenance of multiple materialised views is optimized by this technique by using a view-delta to compute view-deltas for other materialised views. In particular, once the view-delta for the view which is in the higher position in the lattice is computed, the remaining deltas are computed using their ancestor delta. So, in our example, the view delta for q₄ is computed, then the view-deltas for q₃ and q₂ are computed in terms of the one for q₄, and finally the view-delta for q₁ can be computed either in terms of that for q₃ or that for q₂.

Product	Store	Time	Promotion	Amount
Britney Spears GH	Milan	2002/10/10	Buy Music	14
Mozart K441	Milan	2002/10/20	Classic4all	9
Britney Spears GH	Rome	2002/10/10	Buy Music	14

Update Cost

The cost of performing the insertion of a tuple into a base table may be split into two parts: a) the cost of writing the inserted tuple in the affected table, and b) the cost of propagating the insertion to all the materialized views that are affected by the change. When the insertion is performed on the fact table, part "b" requires either the insertion of a new tuple, or the update of the aggregate value(s) for one tuple in each group-by. We assume that when a tuple is inserted on the fact table, it is immediately joined with all the dimensions. All the materializations are then accessed to apply the effect of the new fact on the materializations.

If we follow the model of Harinarayan, Rajaraman, & Ullman (1996), we can use the number of tuples read or written also as a metric for the update cost. We may assume that an access to an index costs as a tuple read. Then, the update cost c_u(m_j) for a materialization m_j can be modeled considering an index read to identify the position of the tuple (requiring in general log_x(|m_j |) reads, where x is a numerical value depending on the implementation) and a write of the updated tuple. When instead a tuple is inserted into a dimension table, the referential integrity constraint guarantees that no tuple is present in the fact table with the same identifier value. Hence, this insertion does not need to be propagated on materialized views.

Update and delete operations on both dimension and fact table are very rare. Hence, we do not include them in our cost model. When any such operation is performed, recomputation of several materialized views may be necessary.

Since the updates on the dimensions are not propagated and insertions into the fact table generate the update of a tuple in all the materializations, we may simplify our cost model assuming a unique update frequency f_u valid for all the materializations, where f_u represents the frequency of insertions into the fact table. The update cost function is:

Definition 12: An update cost function c_u is monotonic if for all m_j , m_k ∊ M, |m_j| < |m_k| → c_u(m_j) ≤ c_u(m_k), where |m_j| represents the number of tuples of m_j . The update cost function we have described is monotonic.