MULTIDIMENSIONAL DATABASES

At the abstract level, considering its typical requirements, a multidimensional database (MDDB) is a data repository that provides an integrated environment for decision support queries that require complex aggregations on huge amounts of historical data (e.g., to identify trends in revenues). At the technical level, we assume for this chapter that an MDDB is a relational data warehouse, in which the information is organized following the star-model. Its basic structure may be represented with the simple entity-relationship diagram depicted in Figure 1, in which all the Di entities represent the dimensions of the MDDB, while the connecting relationship F is the fact table.

click to expand
Figure 1: Entity-Relationship Representation of an MDDB

Each dimension table Di contains all the information that is specific only to the dimension itself (for example, product information), while the fact table F correlates all dimensions and contains information on the attributes of interest for the intersection of all the dimensions (for example, sales by product by day).

Multidimensional queries usually require the computation of aggregate functions on data grouped on an appropriate set of attributes. For multidimensional data processing, data is usually represented as a single (huge) fact table, on which all the interesting aggregates may be computed.

As shown in the practical example below, in addition to the fact table, MDDBs may have several dimensions, each of which is characterized by a considerable number of attributes, most of which may also be relevant for grouping computation.

A Practical Example

We consider as a practical example the MDDB for a large grocery store chain, taken from Kimball (1996). The company manages a large number of stores, each of which is a supermarket selling a wide variety of different products (e.g., grocery, frozen foods, bakery, etc.).

We can identify the following dimensions:

  • Product, which can be characterized by more than 50 different attributes describing various features of the products, such as brand, category, diet-type (diet vs. non-diet), package type, weight, case size. Products belong to a merchandise hierarchy, which includes sub-category and category. This means that all the products that belong to the same sub-category also have the same value for attribute category.

  • Store, which characterizes each point of sale with attributes such as complete store address and telephone, manager, description of store services, sizes of different departments. Stores are organized in a geographic hierarchy (zip code, county, state). The store dimension may have more than 20 attributes.

  • Time, which provides the appropriate detail to allow accurate analysis of the MDDB data, that is indeed a time series. Time is represented with the granularity of day; days may be described as day in the month, in the quarter, and in the year. Furthermore, the indication of which days are holidays, weekend days, days in which special events took place (and many more) can be given. The time dimension may have more than 15 attributes.

  • Promotion, which describes the characteristics of product promotions, such as the promotion type (ad, coupon, etc.) and several pieces of information on each promotion type, the promotion cost, and start/end date. Overall the promotion dimension is characterized by at least 10 attributes.

The fact table provides the sales information on which the actual financial analysis is performed. It includes the identifiers of all the dimensions and several attributes describing sales revenues (e.g., income, number of units sold, number of customer transactions performed, etc.). In this chapter we consider a subset of the attributes of each dimension as relevant attributes for grouping computations: we assume 15 attributes for dimensions Product and Store, nine attributes for Time, and 11 attributes for Promotion.

Materialized views in MDDBs

MDDBs have three features that make them an ideal environment for the adoption of materialized views.

  • MDDBs are implemented upon data warehouse architectures, where updates occur with a rigid low frequency (daily, weekly, or even monthly), in periods where the system load is low; thus, ratio fQ / fU is particularly high.

  • MDDBs have a data model that permits the easy adoption of incremental view maintenance techniques; often, materialized views can be maintained considering only the updates and the current state of the materialization, with no need to access base tables.

  • Queries in MDDBs can be characterized by a model that permits a precise and manageable definition of the view selection problem.

We now describe the multidimensional data model and the query model. We will then present the most important techniques for the maintenance of materialized views and the MD-mat problem, that is the problem of chosing an optimal set of views to materialize.

Multidimensional Database Model

We repeat for convenience some definitions which were previously introduced.

Definition 1: A Multidimensional Database is a collection of relations D1, , Dn, F, where 0

  • Each Di is a dimension table , i.e., a relation characterized by an identifier di that uniquely identifies each tuple (di is the primary key of Di).

  • F is a fact table , i.e., a relation connecting all dimension tables D1 , , Dn ; the identifier of F is given by the foreign keys d1 , , dn of all the dimension tables it connects; the schema of F contains a set of additional attributes V (representing the values on which the aggregate functions are applied).

The dimension tables may contain hierarchies on the data.

Definition 2: Let D be a dimension table with identifier d. An attribute hierarchy on D is a set of functional dependencies FDD = {fd0 , fd1 , , fdn }, where each fdi is characterized by two sets of attributes A li Attr(D) and Ari Attr(D) (respectively called left side and right side of the dependency); the dependency is represented as fdi : A li Ari.

Each functional dependency fdi is a constraint on the content of the dimension table D: for any pair of tuples t1, t2 D, t1[Ali]= t2[Ali] t1[Ari]= t2[Ari]. A dependency fd0 with A l0 = {d} and A r0 = {Attr(D) d} will always be present in FDD. A given set of attributes Ali may appear only once as the left side of a dependency. Functional dependencies must be acyclic, i.e., the graph obtained by drawing an arc from ax to ay, if fdi FDD | ax Ali ay Ari, must be acyclic.

Definition 3: An attribute hierarchy is tree-like (FD T ) if all the attributes appear at most once in the right side of a functional dependency and the left side always contains a single attribute.

In this chapter we consider only tree-like attribute hierarchies.

Definition 4: An MDDB attribute hierarchy FDDB is the union of the attribute hierarchies FDDj of all the dimensions Dj appearing in the MDDB.

Business analysis queries usually require the computation of aggregate functions on data grouped on an appropriate set of attributes.

Example 1: Consider again the multidimensional database of our example, MDDB = {Product, Store, Time, Promotion, F}, where each dimension table has its corresponding attributes and with fact table F ={p, s, d, r, f}, having p, s, d (time dimension is represented with the granularity of day), and r as foreign keys of the dimension tables, and f representing the amount of sales. The following queries can be requested on the MDDB:

  • q1 = total sales per product

  • q2 = total sales per product and store

  • q3 = total sales per product and day

  • q4 = total sales per product, store and day

The queries we consider are select-join-group-by queries, with some restrictions on the allowed selection and join predicates. In particular, selection predicates are simple comparison predicates between a dimension attribute and a constant value (i.e., are of the type attr <operator> const-value). These predicates select a "slice" of data on which aggregates are actually computed. We envision a framework where the user provides a set of queries representative of the queries that the system must answer. For select queries we assume that only the attribute used is important and not the constants used in the comparison (which will generally change from query to query). To simplify the framework, we derive from a selection condition on an attribute a request for aggregating on that attribute. Consider a query q that returns the total sales of a particular store grouped by products. Query q can be answered by accessing the result of a query that returns the sales grouped by product and store, and selecting from them only the tuples relative to the specified store. In this way we can focus on queries computing aggregates. These considerations lead to the restriction that all the attributes appearing in a selection predicate must also appear as grouping attributes.

Join operations may be performed only between the fact table and any of its dimensions. Allowed predicates are equality predicates between a dimension identifier and the corresponding fact table foreign key, while join predicates involving non-key attributes are disallowed.

We consider the standard SQL notion of group-by and aggregate function (considered functions are count, sum, avg, min, max). Grouping attributes may be drawn both from dimension and fact table attributes.

Definition 5: Given a query q, we define the query as characterized by the set of its group-by attributes A. We may represent the query as q A.

Data-cube

A fundamental characteristic of MDDB queries is that it is often convenient to reuse the results of queries to answer other queries. In Example 1, we can use the result of query q2 to answer query q1, adding the sales across all stores to get the result. In a similar way, q1 can also be computed in terms of the answers to queries q3 and q4, while q2 and q3 can be computed in terms of query q4.

The reuse of queries is strictly related to the data-cube (Gray et al., 1996) operator. This operator, receiving as input a table T, a set of aggregating attributes A, and an aggregate function f, computes the union of the results of the queries evaluating function f, having as grouping attributes all possible combinations of attributes in A.

Consider the situation where only the fact table F is present. Applying the data cube operator on the foreign keys of F, we obtain a data cube lattice.

Definition 6: Given an MDDB = {D1 , , Dn , F}, the lattice Cube-lattice of MDDB is the lattice of the set of all possible grouping queries that can be defined on the foreign keys of F. This lattice is characterized by the following elements:

  • an ordering relation defined as the comparison between the sets of grouping attributes (i.e., qA1 ” qA2 A1 A2);

  • meet operator as union of the grouping attributes;

  • join operator as intersection of the grouping attributes;

  • the query grouping on all the foreign keys as top element;

  • the query computing the aggregate function on all the tuples of F as bottom element (empty set of grouping attributes).

Requested queries are associated with a subset of the views of the data-cube lattice of MDDB. In Agrawal, Chaudhuri, & Narasayya (2000). the subset of views is considered which appears in the execution plans of representative queries.

Example 2: Consider again the MDDB of Example 1. Figure 2 represents the data-cube lattice derived from the fact table F and shows the elements that have a query associated.

click to expand
Figure 2: Data-Cube Lattice with Associated Queries

The Multidimensional Lattice

The presence of dimensions significantly increases the size of the problem. The first aspect is the growth in the number of potential grouping attributes, which exponentially increases the number of elements of the lattice. The second aspect is the presence of hierarchies, which permits removal of some elements from the lattice. In fact, consider a query grouping on a dimension key di and also on an attribute aj of the same dimension Di. Since there exists a functional dependency from di to aj, a query grouping on {di , aj} must produce the same result of the query grouping on {di}. This observation is the basis for the following generalization.

Definition 7: Let qiAi and qjAj be two queries and FDDB the MDDB attribute hierarchy. The operator ancestor (represented by the symbol ) is defined by the following algorithm:

Algorithm 1: Ancestor of two queries.

Algorithm 1 operates by building the union of the attributes characterizing the queries and eliminating all the elements for which there exists a functional dependency in FDDB.

The result of applying the operator to queries qx and qy is the smallest view that contains all the information necessary for answering qx as well as qy. If applied in a reflexive way (i.e., qxAx qxAx), it eliminates all redundant attributes.

Example 3: Consider the Store dimension of the MDDB described earlier, with key s and restricted to a set of attributes {z, c, st, n} representing respectively zip code, county, state, and number of sale clerks. The dimension has the following tree-like attribute hierarchy: { fd0 : s {z, c, st, n}, fd1 : z c, fd2 : c st}.

The queries q {n} , q {c} , and q {st} can for example be computed from q {c,n} = q{n} q{c} q {st}.

Definition 8: The operator descendent (represented by symbol ) is defined by the following algorithm:

Algorithm 2: Descendent of two queries.

Algorithm 2 first extends the parameters Ax and Ay with all the attributes that can possibly be derived from them. It then considers their intersection Az and finally removes from it the right sides of dependencies whose left side is contained in Az.

The descendent operator computes the "greatest" among the set of attributes characterizing the queries that can be computed by both qx and qy.

Definition 9: Let {D1 , , Dn , F} be a multidimensional database and FDDB the MDDB attribute hierarchy. Consider the set of queries characterized by all the combinations among the attributes of {D1 , , Dn , F}, except the combinations that contain elements on both sides of a functional dependency fdi FDDB . This set of queries identifies a lattice where:

  • is the meet operation;

  • is the join operation;

  • the ordering relation is given by the following definition: qA1 qA2 (A1 A2) = A1 (A1 A2) =A2;

  • the query grouping on all the foreign keys of F, {d1 , , dn}, is the top element;

  • the query computing the aggregate function on all the tuples of F is the bottom element (empty set of grouping attributes).

We call this lattice the MD-lattice (multidimensional lattice).

Comparing Definitions 6 and 9, it is easy to observe that without hierarchies the MD-lattice is equivalent to the Cube-lattice built on all the schema attributes.

Example 4: Consider the Store dimension of Example 3. The MD-lattice for an MDDB where Store is the only dimension is represented in Figure 3.


Figure 3: MD-Lattice of the Store Dimension

An MD-lattice defines all possible ways of computing queries that can be defined on an MDDB in terms of other queries defined also on MDDB. In fact, if a query qi qj then qi can be answered using qj. This property can be generalized.

Definition 10: Let qi and qj be two queries on an MDDB. Then, the least upper bound (l.u.b.) of qi and qj in the MD-lattice is the query obtained as a result of evaluating qi qj , and it represents a query (the most specialized) which can be used to answer both qi and qj.

We now distinguish between queries and views. In the following we will use the term query to refer to the representative queries provided by the user, while we will use the term view to refer to the elements of the lattices. When a query q computes the content of a view v of a lattice, we say that the query q is associated with view v. We also extend to views the notation for the queries and say that a view is characterized by a set of attributes A if A is the set of grouping attributes of v, represented as vA.

Definition 11: A query qi is said to depend on view vj if the query can be answered using as only inputs the content of vj together with any appropriate dimension Di.

There is a strict relationship between dependence of a query on a view and the ordering relation on an MD-lattice. In fact, if we consider a query qi depending on view vj and the view vi associated with the query, it must happen that vi vj.

An important relationship exists between the ordering of the elements in the lattice given by relation and the cardinalities (i.e., number of tuples) of the views. In fact, for each pair of views vi , vj MD-lattice, if vj vi, vj can be computed by vi with a further grouping. Since grouping can only reduce the number of tuples, it follows that |vj| |vi|, where |v| represents the cardinality of v.

Determining the Number of Nodes of an MD-lattice

The number of views of an MD-lattice depends on the number of attributes of the dimensions of the MDDB and on the number and structure of the attribute hierarchies.

In the absence of hierarchies on the dimensional tables, i.e., when each dimension has a hierarchy that contains only the functional dependency between the key of the dimension and all its attributes, the number of views of the MD-lattice is given by the formula:

where i is the number of dimensions of MDDB and ni is the number of non-key attributes of each dimension Di. In fact, the number of views of the MD-lattice is obtained as the product of the number of views of the lattices that can be built on every dimension. Each dimension Di generates a lattice which has 2ni elements representing all the possible combinations of ni attributes (exactly like a Cube-lattice), plus one view which represents the view characterized by the key di.

In the presence of hierarchies, the number of views in the local lattices is reduced and the above formula for ntotal constitutes an upper bound. A formal treatment of the cardinality of the MD-lattice in the presence of hierarchies is beyond the scope of this chapter. We only provide as an example the determination of the number of elements for the MD-lattice of the MDDB previously described.

Example 5: Consider the MDDB described earlier in this chapter. The number of views of the lattices that can be built on every dimension are: nProduct = 12,289; nStore = 8,193; nTime = 129; nPromotion = 1025. The total number of views of the resulting MD-lattice is then the product of all these values, obtaining | MD-lattice | = 1.3313 · 1013.



Multidimensional Databases(c) Problems and Solutions
Multidimensional Databases: Problems and Solutions
ISBN: 1591400538
EAN: 2147483647
Year: 2003
Pages: 150

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net