INCOMPLETE GROUPING ATTRIBUTES

Incompleteness in a grouping attribute is a different problem than incompleteness in a measure. In this section we focus on problems and techniques for grouping of exclusive and inclusive disjunctive values only. The techniques that are presented can be adapted to cover other kinds of incomplete information in grouping attributes.

Exclusive Disjunctive Values in the Data

An exclusive disjunctive value is similar to an imprecise value, but the potential values do not have to be in a sequence. The primary problem in handling an exclusive disjunctive value is ascertaining the group it belongs to since it potentially could belong to several base nodes. The techniques discussed below are in the context of a count aggregate that is counting the number of apples sold at a store. There are many different kinds of apples, and apples are just one kind of product sold at the store.

Discard the unknown value: This alternative is unattractive because an exclusive disjunctive value could have a very limited number of alternatives. Figure 5(a) shows a fragment of a multidimensional database for counting the number of apples sold. One McIntosh, one Fuji, and one additional apple are sold, but the cashier could not decide if the last apple sold was a McIntosh or a Fuji. The incomplete sale information is represented by the exclusive disjunctive value {McIntosh, Fuji}. In Figure 5(a) the incomplete information is discarded during the computation of the count aggregate, resulting in an undercount of the number of apples sold. This technique has a straightforward and easy implementation, but it at best underestimates the data.

click to expand
Figure 5: Options for Handling an Exclusive Disjunctive Value (in a Max Aggregate)

Maintain min-max bounds: An exclusive disjunctive value potentially belongs in several groups at base nodes. With respect to a base node, the maximum bound assumes that the value is in the group for that node. The minimum bound, on the other hand, assumes that the value is in some other group. The aggregate for a base node is computed with respect to both the minimum bound group and the maximum bound group. Figure 5(b) illustrates the technique. The base nodes contain a min-max pair. The exclusive disjunctive value could belong to the McIntosh group or the Fuji group. So the maximum bound is two in each base node, whereas the minimum bound is one. Note that this technique introduces an imprecise value into the hierarchy.

Introduce unknown nodes: Min-max bounds maintain the limits on the possible undercount and overcount in a multidimensional database. But the limits become artificially exaggerated higher in the hierarchy. The distance between the min-max bounds in the base nodes is at most n, where n is the number of exclusive disjunctive values. But at the next level higher up in the hierarchy, the distance becomes n*g, where g is the number of base nodes. A refinement is to introduce a new base node to count the number of exclusive disjunctive values. An example is shown in Figure 5(c). The additional base node is labeled "unknown." It records the number of apples that could not be precisely identified, i.e., it counts the number of exclusive disjunctive values. When the count of all apples is needed, the unknown base node adds its contribution to the overall count.

Introduce XOR nodes: The main problem with introducing an unknown node is that it discards some information. The exclusive disjunctive value is almost always just a small number of potential values, one of which is the actual value. Pedersen et al. (1999) develop a clever strategy that loses less information. The technique is to add a node for each unique exclusive disjunctive value rather than a single unknown node. The additional node "fixes" the problem of overcounting and undercounting at the next higher level in the hierarchy. We will call the new node an XOR node (an exclusive-or node). Figure 5(d) illustrates the method. An XOR node labeled "McIntosh or Fuji" is introduced to count the number of {McIntosh, Fuji} exclusive disjunctive values. XOR nodes are hidden from the user since they are introduced by the system. In Figure 5(d) the node is shaded gray to indicate that it is hidden.

Exclusive disjunctive values are grouped only at XOR base nodes for computing aggregates. Only complete information is used to compute non-XOR base nodes. A further modification of the hierarchy allows min-max bounds, or other derived incomplete information, to be kept. Each non-XOR base node is split into a hidden and a visible component. The hidden component stores the count of complete information. The visible component stores the min-max computations (they become terminal measures that do not contribute up the hierarchy). Figure 5(d) illustrates the method. The base nodes for McIntosh and Fuji apples are split; the hidden component of each node is shaded in gray. The exclusive disjunctive value contributes to the visible component, but only the complete data values contribute to the hidden component. The hidden components are used for aggregating up the hierarchy.

The chief disadvantage of this strategy is the space cost. Potentially one new unit is added for each exclusive disjunctive value in a dimension. This increases the size of the dimension, which in turn makes the multidimensional space larger since the number of points in the multidimensional space is a product of the size of each dimension. Correctly managing incomplete information can be expensive.

Inclusive Disjunctive Values in the Data

An inclusive disjunctive value represents a potential set of values, more than one of which could be the actual value. For example, assume that a color scheme for apples is introduced. Experts judge apples by a "primary" color. Newton apples are a mix of red and green, and either color may predominate in the apple. So both red and green are possible colors of a Newton apple. The dilemma comes about when we want to count the number of red (or green) apples. Should Newton apples be counted? If so, then if we similarly count green apples and combine the count of red and green apples to get an idea of how many apples we have, Newton apples will be counted twice. If, however, Newton apples are not counted, then there is an undercount of the number of apples. Techniques for handling inclusive disjunctive values are similar to those for managing exclusive disjunctive values. Min-max bounds, however, tend to be even larger. As in the previous section, a good technique is to gerrymander the metadata to avoid the both overcounting and undercounting, but the solution has high space cost.



Multidimensional Databases(c) Problems and Solutions
Multidimensional Databases: Problems and Solutions
ISBN: 1591400538
EAN: 2147483647
Year: 2003
Pages: 150

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net