In the area of scientific databases, the aspects of multidimensionality that are of interest depend on the application areas and the purpose of organizing or viewing the data as a multidimensional dataset. Scientific databases are growing in number and in size at a fast pace as a result of superior instrument automation and faster parallel computers. Instruments are now capable of measuring physical phenomena with higher precision and in shorter time intervals. For example, satellites are now sending large quantities of measurements of the upper atmosphere. Similarly, runs on parallel supercomputers with thousands of processors generate large simulated datasets. These advances require better data management techniques, especially for multidimensional data. In a previous paper (Shoshani, Olken, & Wong, 1984), we identified multiple problem areas common to various scientific databases. In this sub-section, we discuss in some detail four main aspects of dealing with large multidimensional data: space-time representation, clustering in multidimensional space, indexing in multidimensional space, and supporting classification structures.
Since many scientific databases represent physical phenomena, the need to represent space-time is common to many applications. A few examples are climate, aerodynamics, combustion, and fluid dynamics. In such applications, the typical method of conducting research is to develop theoretical models, to simulate these models, and to compare the simulation results to real observed data. Simulation models typically partition space into some cell structure, the most common of which is a rectangular cell structure. For example, in climate simulations a point in the cell structure represents a certain longitude, latitude, and height position (X,Y,Z). The simulations are typically run in time steps, where each time step contains a collection of X,Y,Z points, and each point has a set of tens of "variables" (such as "temperature," "wind velocity," "pressure," etc.) associated with it. These types of datasets are so common that specialized file formats have been developed for them, such as HDF5 (NCSA/HDF5) and NetCDF (UCAR/Unidata/NetCDF).
The file organization of these datasets typically follows the way they are generated. It is a linearized representation of the multidimensional dataset, in the order of time-space variables (i.e., the multidimensional space is linearized in the order of T,X,Y,Z, and for each point in this space the values for (V1,V2,…,Vn) are stored). Each file format representation has an access library associated with it to extract desired subsets of the data for subsequent analysis. The main problem associated with this format organization is that for large simulations that span multiple files, the access requires reading small pieces from many files, making the reading of a subset of data quite inefficient. For example, if a climate analyst wishes to visualize the temperature pattern around the equator for the last 10 years, this implies that all the files containing the time steps for the last 10 years have to be accessed, and for each file the equator points have to be accessed, and for each of these points, only the "temperature" has to be extracted. Consequently, organizing the data according to the intended access pattern is desirable.
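The cost of subset access under such a linearized layout can be made concrete with a small sketch. Grid sizes, the variable count, and the variable index used below are illustrative assumptions, not taken from any real file format:

```python
# Sketch of why subset reads are costly in a T,X,Y,Z-linearized layout;
# grid sizes and variable count are hypothetical.

NT, NX, NY, NZ = 120, 360, 180, 20   # time steps and spatial grid sizes
NVARS = 10                           # variables stored per grid point

def point_offset(t, x, y, z, var):
    """Element offset of one variable value when the dataset is linearized
    in T,X,Y,Z order with all variables stored together at each point."""
    return (((t * NX + x) * NY + y) * NZ + z) * NVARS + var

# Extracting one variable along a single latitude band touches widely
# scattered offsets -- one small read per (t, x) combination:
equator_offsets = [point_offset(t, x, NY // 2, 0, 3)
                   for t in range(NT) for x in range(NX)]
```

Consecutive entries in `equator_offsets` are separated by the full stride of the inner dimensions (NY * NZ * NVARS elements), which is why reading a thin slice forces many small, non-contiguous reads.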
The problem of matching the file organization to the predicted access pattern has been studied (see, for example, Chen, et al., 1995), but in practice is not easily applied. There are two reasons for that: 1) the access patterns may suggest physical organizations that conflict with each other, or can change over time, and this requires automated reorganization of the data; and 2) for large datasets (some simulation datasets can be in the order of a few terabytes), it is very expensive to re-organize the data. Another approach is to cluster the data that is being accessed (referred to as "hot clustering"), but this approach is not in wide use, since it requires managing duplication of data. The most useful technique used currently is to partition the simulation dataset into one variable at a time (sometimes referred to as "vertical partitioning"), so that access to a particular variable does not require accessing the other variables. This simple solution is effective for tasks that access only a few variables at a time, such as visualization.
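The vertical-partitioning idea above can be sketched in a few lines. The variable names, values, and in-memory arrays are illustrative stand-ins for on-disk partitions:

```python
# A minimal sketch of "vertical partitioning": each variable is stored as
# its own contiguous partition, so a task that reads only "temperature"
# never touches the bytes of the other variables.

import array

n_points = 1000
partitions = {
    "temperature": array.array("f", (20.0 for _ in range(n_points))),
    "pressure":    array.array("f", (101.3 for _ in range(n_points))),
    "wind_u":      array.array("f", (0.0 for _ in range(n_points))),
}

def read_variable(name, start, count):
    """Read a contiguous slice of a single variable's partition."""
    return partitions[name][start:start + count]

subset = read_variable("temperature", 100, 50)
```

A visualization task that needs only one or two variables reads only those partitions, which is the effectiveness argument made above.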
Many physical phenomena have regions of high activity and regions of little activity. For example, an ignition of fuel in a combustion cavity has regions of high turbulence close to where the fuel is injected. Similarly, a stress model of an airplane wing has more stress activity in the joints, and climate models have more activity in certain storm regions. For this reason, many simulations use "Adaptive Mesh Refinement" (AMR) models. The main idea is to use meshes that are increasingly refined in the active regions, thus saving a great deal of computation and space in the non-active regions. Such representation requires specialized data structures. This is a very important and active field of research in mathematics, but is not addressed at all by the database research community. See Plewa (2001) for a comprehensive set of pointers to such work by mathematicians around the world.
Accessing space-time datasets requires the support for specialized operators. Temporal operators include the specification of time spans (e.g., summer months), partial overlap of time periods, etc. Similarly, spatial operators may include distance functions (e.g., find all the points within a given radius) and spatial containment or overlap. The support for such operations requires specialized data structures that depend on the intended application. These have been covered extensively in the literature. There are several survey papers on the subject (Güting, 1994; Abraham & Roddick, 1999; Boehm, Berchtold, & Keim, 2001).
We note that space-time databases also have a natural dimensionality hierarchy. The time dimension is usually used to summarize the data in order to reduce them to a degree that can be handled by an analysis or visualization tool. For example, in climate simulation datasets, it is common to summarize the variables to the level of "monthly means." Such summarization is also useful for a high-level visualization of the data, which can lead to "zooming into" regions of interest. Similarly, the summarization over the spatial regions (which requires support of methods for spatial averaging) is also used in order to reduce the amount of data that the user has to deal with.
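The summarization along the time dimension described above can be sketched as a simple aggregation. The daily series below is synthetic and the grouping function is illustrative:

```python
# Sketch: reducing a daily time series to "monthly means", i.e.,
# summarizing along the time dimension. Data is synthetic.

from collections import defaultdict

# (year, month, day) -> temperature; a tiny synthetic series
daily = {(2000, 1, d): 10.0 + d for d in range(1, 4)}
daily.update({(2000, 2, d): 20.0 + d for d in range(1, 3)})

def monthly_means(series):
    """Group values by (year, month) and average each group."""
    sums = defaultdict(lambda: [0.0, 0])
    for (year, month, _day), value in series.items():
        acc = sums[(year, month)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

means = monthly_means(daily)
```

Spatial summarization would follow the same pattern, with the grouping key drawn from a region identifier instead of (year, month).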
Given that objects are represented as points in a multidimensional space, it is natural to think about the object clusters as an indication of some common phenomenon. Similarly, an outlier (a point by itself) in the multidimensional space indicates an unusual phenomenon. Various data mining techniques are based on these observations, and numerous methods for data clustering have been proposed (see surveys by Keim & Hinneburg, 1999; Jain, Murty, & Flynn, 1999). In this sub-section, we will describe an example that illustrates the importance of high-dimensional clustering for very large scientific databases, and the reasons that most of the proposed clustering methods are inadequate.
Our example draws from our experience with large databases in High Energy Physics (see Shoshani et al., 1999, for more details). High Energy Physics experiments consist of accelerating sub-atomic particles to nearly the speed of light and forcing their collision. The particles that collide produce a large number of additional particles. Each such collision (called an "event") generates on the order of 1–10 MB of raw data collected by a detector. A typical rate of data collection is about 10^8–10^9 events/year. Thus, the total amount of data collected is very large, on the order of 300 terabytes to 1 petabyte per year. This is why it is important to have efficient indexing and clustering analysis techniques.
In order to be able to find interesting clusters, summary data is extracted for each event. Each event is analyzed to determine the particles it produced, and summary properties for each event are generated (such as the total energy of the event, its momentum, and number of particles of each type). The number of summary elements extracted per event is typically quite large (100–200). Thus, the problem is one of finding clusters over a billion elements, each having 100 descriptors or more. This problem can be thought of as finding clusters in the 100-dimensional space over a billion points. The total amount of space required, assuming 4 bytes per value, is 400 GB.
There are two aspects to this difficult problem: size and high dimensionality. The large size implies that the clustering analysis method must be linear to be practical. Since clusters cannot be detected in very high-dimensional space, the high-dimensionality aspect requires dimensionality-reduction methods, that is, methods that select the fewer dimensions that best represent the clustering properties. However, the known dimensionality-reduction techniques are either non-linear or effective only for datasets small enough to fit in memory. Similarly, the known linear clustering methods on large datasets are inadequate because they are based on cell-partitioning methods where the dimensions to be analyzed and the binning of each dimension are pre-selected. Therefore, hybrid techniques have been proposed: running dimension-reduction techniques on a sample of the data, and applying the results of this step to the linear clustering methods. We describe such a methodology in a recent paper (Otoo, Shoshani, & Hwang, 2001). Another effective approach reported recently is based on using different cutting planes for each dimension and finding the optimal grid partitioning (Hinneburg & Keim, 1999).
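The hybrid strategy can be illustrated with a toy sketch: use a small sample to decide which dimensions show concentrated, cluster-like structure, then make one linear grid-partitioning pass over the full data using only those dimensions. The density heuristic, bin counts, and synthetic data below are illustrative assumptions; this is not the algorithm of Otoo, Shoshani, & Hwang (2001):

```python
# Toy hybrid clustering: sample-based dimension selection, then one
# linear grid-partitioning pass. Heuristics and data are illustrative.

import random
from collections import Counter

def concentrated_dims(sample, k, bins=4):
    """Score each dimension by its densest 1-D histogram bin over the
    sample; keep the k most concentrated dimensions."""
    ndims = len(sample[0])
    scores = []
    for d in range(ndims):
        counts = Counter(min(int(p[d] * bins), bins - 1) for p in sample)
        scores.append((max(counts.values()), d))
    return sorted(d for _score, d in sorted(scores, reverse=True)[:k])

def grid_cluster(points, dims, bins=4):
    """Single linear pass: count points per grid cell over the kept dims."""
    cells = Counter()
    for p in points:
        cells[tuple(min(int(p[d] * bins), bins - 1) for d in dims)] += 1
    return cells

random.seed(1)
# Synthetic data in [0, 1): two dimensions clustered near 0.2,
# the other eight uniform noise.
points = [[min(max(random.gauss(0.2, 0.02), 0.0), 0.99) if d < 2
           else random.random() for d in range(10)] for _ in range(500)]

dims = concentrated_dims(points[:100], k=2)   # reduction on a sample
cells = grid_cluster(points, dims)            # linear pass on full data
```

The sample step is cheap because it sees only 100 points; the expensive full pass is linear and looks at only the two selected dimensions.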
Indexing multidimensional data has been recognized as a specialized area for a long time now. If each dimension is indexed separately, then searching for objects in the multidimensional space will require the intersection of the results of all the index searches, which is an inefficient operation. Therefore, specialized multidimensional indexing methods have been developed. A recent survey covers such indexing methods (Boehm, Berchtold, & Keim, 2001). The specialized methods include such well-known indexing structures as the R-tree (Guttman, 1984), and various improvements to it, such as the R+ tree (Sellis, Roussopoulos, & Faloutsos, 1987), Grid Files (Nievergelt, Hinterberger, & Sevcik, 1984), Quad-trees (Samet, 1984), and Pyramid Indexes (Berchtold, Böhm, & Kriegel, 1998), just to name a few. Such indexing methods work especially well for low-dimensional applications (below six to seven dimensions) and for queries that involve all the dimensions. However, if we need to access only a subset of the dimensions of a high-dimensional dataset, such methods become inefficient because too many branches of the index structures are involved in the search.
As a case in point, consider the high-dimensional space described in the High Energy Physics application in the previous sub-section. Recall that we described the problem as having a very large number of objects (many millions to a billion), and the number of dimensions (or attributes) in the hundreds. The typical search query is a range query over a few of the dimensions. This is referred to as a "partial range multidimensional query." For example, a query might be "find all objects (events) that have energy between 5 and 7 GeV and where the total number of particles produced was more than 10,000." There are two problems with using the conventional multidimensional indexing methods: the high dimensionality (100 or more dimensions), and the fact that the query is a "partial range query."
The solution to this problem requires another approach. We can take advantage of the fact that scientific databases are typically "append only," and organize the index into "vertical partitions" where each partition represents a single dimension over all the objects. Further, each vertical partition can be organized as a set of compressed bitmaps. The operation of a partial range query can then be transformed into logical operations over the compressed bitmaps. We describe this technique in Shoshani et al. (1999) and developed it further in Wu, Otoo, & Shoshani (2001). This discussion is illustrative of the special indexing needs of scientific high-dimensional databases, especially as such databases are growing in size.
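A minimal sketch of the bitmap idea follows. Each attribute value range ("bin") gets one bitmap, and a partial range query becomes bitwise AND/OR over the bitmaps of the queried dimensions only. The bin edges, attribute values, and uncompressed bitmaps are illustrative; real systems compress the bitmaps, which is omitted here:

```python
# Bitmap index sketch over vertically partitioned, append-only data.
# One bitmap (stored here as a Python int) per value bin per attribute.

energies  = [3.1, 5.5, 6.2, 8.0, 5.9, 4.4]        # one value per event
particles = [9000, 12000, 15000, 8000, 11000, 500]

def build_bitmaps(values, edges):
    """One bitmap per bin [edges[i], edges[i+1]); bit r = row r is in bin."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        for b in range(len(edges) - 1):
            if edges[b] <= v < edges[b + 1]:
                bitmaps[b] |= 1 << row
    return bitmaps

energy_bm   = build_bitmaps(energies, [0, 5, 7, 10])    # bins [0,5),[5,7),[7,10)
particle_bm = build_bitmaps(particles, [0, 10000, 20000])

# Partial range query "energy in [5,7) AND particles >= 10,000"
# touches only two bitmaps, regardless of how many dimensions exist:
hits = energy_bm[1] & particle_bm[1]
rows = [r for r in range(len(energies)) if hits >> r & 1]
```

Because the data is append-only, new events only append bits; the bitmaps never need in-place updates, which is what makes this organization attractive.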
It is a common practice to use specialized terms to describe concepts or objects in scientific domains. If the number of such terms is large, then they are organized as "classification structures," usually hierarchies of terms. For example, in biology organisms are classified by Kingdom, Phylum, Class, Order, Family, Genus, and Species. In medicine, there is a long history of international classification of diseases, contained in a large book and updated every 10–20 years (the latest is ICD-10). Similar large classifications exist in pharmacology, chemistry, material science, etc. These are often referred to as "ontologies." In fact, there are general products that are designed to allow the user to develop his/her classification. An example in the area of clinical information is an open-source system called OpenGALEN (http://www.opengalen.org). Hierarchical organization of concepts is not unique to scientific data. It is a natural way of classifying and categorizing concepts in any domain. For example, in the business world, products are naturally organized into hierarchical categories. This requirement is the same as the category hierarchies we discussed in statistical and OLAP databases, but can be more complex in scientific databases, since the classification structures can be many levels deep.
Scientists who need classification structures in their work find it extremely difficult to use commercial relational databases. This is because the tabular organization of information is not convenient for representing these classification hierarchies. To illustrate this difficulty, consider a simple example of organizing products as shown in Figure 5a. In this figure, each row represents a single product as well as the higher-level categories it belongs to. The problem with this representation is that all the higher-level categories have to be repeated for all the products. This causes several difficulties: 1) For deep hierarchies, the number of columns is as large as the hierarchy depth. Even for the simple example used in Figure 5a, one can imagine refinement of products to additional levels, such as "milk-products" organized into "cheese," "milk," etc. sub-categories, and each of these into "non-fat milk," "low-fat milk," etc. Thus, a 10-deep hierarchy will require 10 columns, where the values in nine of the columns repeat for each instance of the "leaf" column. 2) This repetition is problematic not only because of space considerations, but also for data entry, and the potential for errors. 3) Categories often have a "description" item or a "code" associated with them. In the representation of Figure 5a, these have to be repeated as well.
Figure 5a: Representing a Category Hierarchy in a Tabular Form
Figure 5b: An Alternative, More Compact Representation
Figure 5c: A Normalized Version of the Category Hierarchy
As a result of the above difficulties, an alternative, more "normalized" representation is often used, as shown in Figure 5b. In this representation each category instance at any level is represented only once, eliminating the repetition. To support the category hierarchy, each row points to its parent category, i.e., pointing to the appropriate row ID. The problem with this representation is that, while the table is more compact, the operation needed to traverse it, referred to as the "transitive closure" operation, is not supported by relational systems and languages such as SQL. For example, suppose that we wish to summarize sales for all food products in some organization. Using the table in Figure 5b, one would have to start with row 2, then find for it all the "children" by performing a "join" with the same table. This will produce rows 10, 11,…, etc. For each of these, the next level of "children" has to be found. This will then produce rows 100, 101,…, 200, 201,…, etc. Writing SQL to generate that is too difficult for the user, so the work is delegated to programmers to incorporate into special-purpose user interfaces.
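The repeated "find children" expansion over the parent-pointer table of Figure 5b can be sketched as a breadth-first traversal. The row IDs and category names below mirror the example above but are otherwise illustrative:

```python
# Transitive closure over a parent-pointer table (Figure 5b style).
# Each row: id -> (category name, parent row id); names are illustrative.

rows = {
    2:   ("food", None),
    10:  ("milk-products", 2),
    11:  ("baked-goods", 2),
    100: ("cheese", 10),
    101: ("milk", 10),
    200: ("bread", 11),
}

def descendants(table, root):
    """Breadth-first expansion: all rows anywhere below `root`."""
    frontier, found = [root], []
    while frontier:
        children = [rid for rid, (_name, parent) in table.items()
                    if parent in frontier]
        found.extend(children)
        frontier = children
    return found

food_subtree = descendants(rows, 2)
```

Each iteration of the loop corresponds to one self-join in the SQL formulation, which is why the number of joins depends on the (possibly unknown) depth of the hierarchy.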
We note that the "transitive closure" operation is straightforward in the case of Figure 5a. For the above example, one has only to specify "select products where product-class=food." This is the reason that many implementations end up using this representation in spite of its repetitiveness.
Finally, we should also consider the "fully normalized" alternative shown in Figure 5c. In this representation each category level is represented in its own table. An entry in one table has a pointer to an entry in its parent category table. This representation is similar to that of Figure 5b, but by splitting the category types into separate tables, it eliminates the duplication of the "product-type" column. We added in Figure 5c a "description" column for each of the category levels to show that this "normalized" representation eliminates their repetition as well.
The advantage of this representation is that a "transitive closure" is expressible by specifying "joins" between the tables. The disadvantage is the need to deal with multiple tables, and the cost of performing a large number of "joins."
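The join chain over the fully normalized layout of Figure 5c can be sketched directly. One table per category level, each entry pointing to its parent table; the table contents are illustrative:

```python
# "Fully normalized" layout (Figure 5c style): one table per category
# level, joined via parent pointers. Contents are illustrative.

product_class = {1: "food"}                                        # id -> name
product_type  = {10: ("milk-products", 1), 11: ("baked-goods", 1)} # id -> (name, class id)
products      = {100: ("cheese", 10), 200: ("bread", 11)}          # id -> (name, type id)

def products_in_class(class_name):
    """Chain of joins: products -> product_type -> product_class."""
    class_ids = {cid for cid, name in product_class.items() if name == class_name}
    type_ids  = {tid for tid, (_name, cid) in product_type.items() if cid in class_ids}
    return sorted(name for _pid, (name, tid) in products.items() if tid in type_ids)

food_products = products_in_class("food")
```

Here the number of joins equals the number of levels between the two tables, which is fixed by the schema; the cost of these joins is the disadvantage noted above.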
Our purpose in discussing these alternatives of tabular representations of category hierarchies is to show why relational databases are rarely used for such structures, or used in a limited way. It points to a need to support such structures as specialized objects. Actually, this problem is more complex in real systems, since the hierarchy structures are not always as well organized as our example implies. In reality, there are cases where a category of one level may point to a parent category two or more levels above it. A recent paper by Pedersen, Jensen, & Dyreson (1999) addresses this problem.
Another aspect of classification structures that makes them even more complex to support in a tabular form is that a collection of objects can be classified in more than one category hierarchy. For example, the product category hierarchy of Figure 5 could be complemented with a category hierarchy "manufacturer," where manufacturer could be further organized by location (city, state). Thus, the same base objects (in our case, "products") can be classified according to several properties they may have. These properties that can be used to classify a collection of objects are referred to as "facets" (Batty, 1998). A database system that supports category hierarchies should permit the search of the objects with category specifications in multiple facets. This is equivalent to having a multidimensional space of facets, where each facet can be represented as a category hierarchy.
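Multi-facet search can be sketched as set intersection over per-facet selections. The product identifiers, facet names, and category paths below are all illustrative:

```python
# Sketch of faceted classification: the same products classified under
# both a "product-type" hierarchy and a geographic "location" hierarchy.
# All identifiers and category paths are illustrative.

facets = {
    "product-type": {"p1": ["food", "milk-products", "cheese"],
                     "p2": ["food", "milk-products", "milk"],
                     "p3": ["food", "bread"]},
    "location":     {"p1": ["WI", "Madison"],
                     "p2": ["WI", "Madison"],
                     "p3": ["IL", "Chicago"]},
}

def faceted_search(facets, criteria):
    """Return objects whose classification path contains the requested
    category in every specified facet (intersection across facets)."""
    result = None
    for facet, category in criteria.items():
        hits = {obj for obj, path in facets[facet].items() if category in path}
        result = hits if result is None else result & hits
    return result if result is not None else set()

matches = faceted_search(facets, {"product-type": "milk-products",
                                  "location": "WI"})
```

Each facet behaves as one dimension of the multidimensional space of facets described above; adding a criterion narrows the result by intersecting one more per-facet selection.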
It should be evident from the above discussion that the two concepts of multidimensionality and category hierarchies are fundamental to summary and scientific data. In fact, it is our belief that they are fundamental to any domain that requires classification of objects or concepts into manageable collections. Further, since any collection of objects can have multiple "facets" (or attributes), these facets define a multidimensional space. There are two implications to these observations: the need for explicit conceptual modeling of these concepts, and efficient data structures to support these two fundamental concepts.
As far as conceptual modeling is concerned, it is necessary to support multidimensional classes and category-hierarchy classes as first-class concepts. By "first-class concepts" we mean that they are visible to the user. In the area of object modeling, the two dominant concepts are "object classes," which represent a collection of objects, such as "products," and "associations," which represent the associations between object classes, such as the association between "products" and "manufacturers." Object classes have attributes associated with them (such as "product-cost" or "product-weight"). We argue here that these have to be complemented with the concepts of multidimensional classes and category-hierarchy classes. For example, it should be possible to define a "product-type" as a category-hierarchy class and associate the object class "products" with it. This is shown schematically in Figure 6a.
Figure 6a: A Schematic of an Object Model for Supporting Category-Hierarchy Class
Furthermore, it should be possible to define a multidimensional class with two dimensions, one being "product-type" and one being "manufacturer," and associate this multidimensional class with the "product" object class. This is shown schematically in Figure 6b.
Figure 6b: A Schematic of an Object Model to Support a Multidimensional Class Whose Dimensions Are Two Category-Hierarchy Classes
The technology to support efficient data structures for the multidimensional class and the category-hierarchy class may vary depending on the application. For high-dimensional classes a special index may be necessary; for summary databases an efficient summarization technique may be necessary; and for very rich and deep category hierarchies, a specialized software package may be necessary. The main issue here is what infrastructure can support such diversity. Extending the relational paradigm, the "object-relational" approach permits plugging in "data blades" or "data cartridges" to support specialized data types. This approach was adopted by commercial vendors, especially Informix and ORACLE. However, this approach provides a table-driven view of the conceptual model. Furthermore, it does not permit one data type to be built from other data types; this would require a "data blade" that uses other "data blades" as building blocks.
We believe that federation technology and the ability to associate object classes using "links" provide a more flexible approach. Federated databases should permit putting together sub-systems that support the object-based data model with various other sub-systems as needed, so that the external conceptual model can support richer classes of objects and operators over them. In particular, we advocate a federated architecture to support sub-systems for object, multidimensional, and category-hierarchy classes. This architecture could exist under a single database system, or could be applied to federate multiple database sub-systems.