INCOMPLETE INFORMATION IN MULTIDIMENSIONAL DATABASES | Multidimensional Databases: Problems and Solutions

While multidimensional data models that store and manage complete information have been studied extensively, only a few papers address incomplete information in multidimensional databases.

Shoshani (1997) compares multidimensional databases with statistical data models. In the tradition of the influential SUBJECT data model (Chan & Shoshani, 1981), multidimensional data models support only two kinds of nodes: cluster nodes (for units) and cross-product nodes (for combining dimensions). The greater sophistication in describing statistical data found in later statistical data models such as SAM* (Su, 1983) and STORM (Rafanelli & Shoshani, 1990) is absent from most multidimensional data models because of a difference in data modeling requirements. Statistical data models are designed to model complicated, nonstandard, heterogeneous, real-world data sets, whereas multidimensional database models create their own simple, standard, homogeneous statistical data set; consequently a simpler data model suffices. The statistical database aspect of multidimensional databases is best understood as an extension of the work done by Malvestuto (and others) on data integration in statistical databases (Malvestuto, 1991). Data integration creates a unified view of a set of different, but homogeneous, statistical tables.

Dyreson (1996) first wrote about incomplete information in a multidimensional database context. He developed a data cube (a multidimensional database) that can contain regions of unknown values. Queries on the unknown regions are either redirected to the nearest complete information regions or computed along with completeness measures. The completeness measure is a percentage of how much complete information is used in the evaluation of a query. The key to the incomplete data cube is a high-level specification of which regions in multidimensional space are complete, an idea that first gained acceptance in semantic, statistical data models (Sato, 1991). Efficient algorithms to query, update, and reorganize the cube are given. Dyreson (1997) describes a research prototype that automates the loading of data from log files into an incomplete data cube.
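The idea of a completeness measure can be illustrated with a minimal sketch. It assumes that regions are axis-aligned boxes of integer cells and that completeness is the fraction of queried cells lying inside a region declared complete; both are assumptions made here for illustration, and the precise definition used by Dyreson (1996) may differ. The names cells, query_with_completeness, values, and complete_regions are hypothetical.

```python
from itertools import product


def cells(region):
    """Enumerate the cells of a box given as per-dimension (lo, hi) inclusive bounds."""
    return set(product(*(range(lo, hi + 1) for lo, hi in region)))


def query_with_completeness(values, complete_regions, query_region):
    """Sum the queried cells that lie in complete regions.

    Returns (total, completeness), where completeness is the fraction of
    queried cells covered by some complete region (an assumed definition).
    """
    queried = cells(query_region)
    covered = set()
    for region in complete_regions:
        covered |= cells(region) & queried
    # A covered cell with no stored value is known to be empty, so it adds 0.
    total = sum(values.get(cell, 0) for cell in covered)
    completeness = len(covered) / len(queried) if queried else 1.0
    return total, completeness


# Hypothetical 2-D cube: days 0..3 by products 0..1; only days 0..1 are complete.
values = {(0, 0): 5, (0, 1): 3, (1, 0): 2, (1, 1): 4}
complete_regions = [((0, 1), (0, 1))]
total, completeness = query_with_completeness(
    values, complete_regions, ((0, 3), (0, 1))
)
print(total, completeness)
```

In this sketch the query spans eight cells, of which four lie in the complete region, so the sum is reported together with a completeness of one half. Cells outside every complete region are treated as unknown and are left out of the sum rather than read as zero.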

Making use of incomplete regions within a complete cube is also the motivation behind quasi-cubes (Barbará & Sullivan, 1997). A quasi-cube trades accuracy for space. In a quasi-cube, regions of an eager (fully materialized) multidimensional database are replaced with a single approximated value. The approximated value is subsequently used in operations to quickly provide an estimate of an actual value. Complete, accurate values can be computed from the base data when desired. The approximation is a kind of incomplete information. Quasi-cubes have techniques for differentiating between approximated data and complete data in a query.
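The trade-off can be sketched as follows. This is an illustration under stated assumptions, not the method of Barbará & Sullivan (1997): the approximated value is taken to be the mean of the replaced cells, the cube is assumed dense inside a compressed region, and the class QuasiCube with its methods compress, lookup, and exact_value is a hypothetical name introduced here.

```python
class QuasiCube:
    """Toy cube in which a region can be replaced by one approximated value."""

    def __init__(self, base):
        self.base = dict(base)   # base data, kept so exact values can be recomputed
        self.exact = dict(base)  # cells that are still materialized exactly
        self.regions = []        # list of (bounds, approximated value)

    @staticmethod
    def _inside(cell, bounds):
        return all(lo <= x <= hi for x, (lo, hi) in zip(cell, bounds))

    def compress(self, bounds):
        """Replace the materialized cells inside bounds with their mean (assumed model)."""
        covered = [c for c in self.exact if self._inside(c, bounds)]
        if not covered:
            return
        mean = sum(self.exact[c] for c in covered) / len(covered)
        for c in covered:
            del self.exact[c]
        self.regions.append((bounds, mean))

    def lookup(self, cell):
        """Return (value, is_exact) so callers can tell estimates from real data."""
        if cell in self.exact:
            return self.exact[cell], True
        for bounds, value in self.regions:
            if self._inside(cell, bounds):
                return value, False
        raise KeyError(cell)

    def exact_value(self, cell):
        """Recompute the accurate value from the base data when it is needed."""
        return self.base[cell]


cube = QuasiCube({(0, 0): 10, (0, 1): 12, (1, 0): 11, (1, 1): 13, (2, 0): 100})
cube.compress(((0, 1), (0, 1)))
print(cube.lookup((0, 1)), cube.lookup((2, 0)), cube.exact_value((0, 1)))
```

Returning a flag alongside each value is one simple way to differentiate approximated data from complete data in a query; the space saving comes from storing one number per compressed region instead of one per cell.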

Pedersen et al. (1999, 2001) describe a complete multidimensional data model that supports both incomplete data and metadata (among other things). One culprit that leads to uncertainty in aggregate values is an incompletely specified hierarchy in the metadata. Pourabbas & Rafanelli (2000) refer to incomplete hierarchies as partial classification hierarchies. For example, consider the problem of non-strictness, that is, many-to-many mappings between categories in a hierarchy. Aggregating a node that contributes to more than one parent category can count its value more than once. If a tomato is categorized as both a fruit and a vegetable, an aggregate value for produce (which includes both fruits and vegetables) computed by combining the fruit and vegetable totals will be incorrect because tomatoes will have been counted twice. In some sense the problem is caused by an incomplete, but very human, specification of the hierarchy. A complete specification would have a category for vegetable-fruits that would contribute exactly once to each of the produce, fruit, and vegetable categories. Research in this area has proposed efficient techniques to automatically translate between such specifications.
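The tomato example can be worked through in a short sketch. The sales figures, the category names "pure fruit" and "pure vegetable", and the helper rollup are hypothetical and introduced only for illustration; the sketch shows the double counting and the effect of a vegetable-fruit category, not the translation techniques proposed in the cited work.

```python
def rollup(amounts, contributes_to):
    """Add each amount to every target that its key contributes to."""
    totals = {}
    for key, amount in amounts.items():
        for target in contributes_to[key]:
            totals[target] = totals.get(target, 0) + amount
    return totals


sales = {"apple": 10, "carrot": 5, "tomato": 7}

# Non-strict hierarchy: tomato maps to two categories.
non_strict = {
    "apple": ["fruit"],
    "carrot": ["vegetable"],
    "tomato": ["fruit", "vegetable"],
}
by_category = rollup(sales, non_strict)
naive_produce = sum(by_category.values())  # tomato is counted twice here

# Complete specification: every item has exactly one base category, and each
# base category contributes exactly once to each aggregate it belongs to.
item_category = {
    "apple": ["pure fruit"],
    "carrot": ["pure vegetable"],
    "tomato": ["vegetable-fruit"],
}
category_targets = {
    "pure fruit": ["fruit", "produce"],
    "pure vegetable": ["vegetable", "produce"],
    "vegetable-fruit": ["fruit", "vegetable", "produce"],
}
totals = rollup(rollup(sales, item_category), category_targets)

print(naive_produce, totals["produce"])
```

With these figures the naive produce total exceeds the true total by exactly the tomato sales, while the fruit and vegetable totals are the same under both specifications.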

Jagadish et al. (1999) also discuss techniques for handling problems in the specification of metadata and how to aggregate with imprecise values in the grouping attributes (those that map to somewhere above the base of the hierarchy).