0996-0999 | Oracle Development Unleashed (3rd Edition)

Previous Table of Contents Next

Page 996

Figure 43.4. Designing a data cube.

We will assume that we sell in three locations: New York, Los Angeles, and Chicago. In this case, we will have a 3 3 n cube, where n is the number of months (since we know there are three distinct prices in our data and three distinct locations). Look at the summer months: June, July, and August. We now have a 3 3 3 cube, shown in Figure 43.5.

Figure 43.5.
A 3 3 3 data cube for the sale of shoes for the summer months.

NOTE

The reason a multidimensional database is called multidimensional is because it has additional dimensions to an n-dimensional cube, based on data values, not on a database schema. In a traditional relational database, the dimension of the table is based on the number of columns and the amount of data in the database.

In this cube, the actual data that populates the sales cube will determine its dimensions. For instance, if we opened a new store in Denver, we would have a 3 3 4 cube.

Page 997

On the other hand, by adding a location in a relational database, we will not change the dimensions of the original sales table unless we add data. The table will still have seven columns, only more distinct values in the location column.

The power of using an n-dimensional cube is that around this paradigm we can define a multidimensional database. If our database is truly multidimensional, by definition users can look at a single cube or any cross-sectional slicing of cubes.

For instance, if the top buyer in New York wanted a quantity of units sold in New York at the lowest price in the month of June, our multidimensional database would only return one building block within the cube.

If the vice-president of sales wanted the total quantity sold at all locations and all months at the highest pricing of the shoe, we would return him the slice of the complete top layer of our cube. That "area" within the cube would represent total quantity.

Inside the cube is a simple computed value: the quantity sold. Each cube represents a given price, location, and month, and contains aggregate sums for each price, location, and month.

This data cube can be translated as an encapsulation of many aggregate SQL commands. For example, to retrieve the front face of the cube, total sales for New York, regardless of price, SQL would need to be written as such:

 select sum(quantity_sold) from sale where item = `Tennis Shoes' and location = `New York'

By storing these values in a data cube and updating the cube dynamically, we avoid the many costly SQL statements that might be needed to give the business analyst the information he needs. Interestingly enough, Oracle's Discoverer 3.0 offers the administrator the option of dynamically maintaining values in a data cube or generating SQL for each cell in the cube.

Discoverer 3.0 will also allow for a mix of options so seldom used subsets of the cube will not be stored but retrieved at runtime.

A Semi-Formal Definition of OLAP

E.F. Codd, the founder of relational database theory, saw that OLAP needed to be supported by a multidimensional data model. In 1993 in the research paper "Providing OLAP to User -Analysts: An IT Mandate," Codd and Salley laid down the 12 rules that they perceived were needed to design multidimensional databases with OLAP capabilities:

Multidimensional Conceptual View
This requires that users be able to manipulate and view multidimensional models. Just as a relational database appears as a collection of tables, multidimensional databases should appear as a collection of simple business objects.

Page 998

Transparency
Even though data can be viewed multidimensionally, it should also be able to interface with common tools like spreadsheets and word processors, without appearing different in any way at all, just as another feed of data.
Accessibility
OLAP must map the business universe of objects needed for analysis from

multiple data feeds and different types of databases. This would imply that a true OLAP database could receive feeds from a mainframe nonrelational database, a relational database, and a series of flat files, and still appear to the user as a multidimensional series of cubes.
Consistent Reporting Performance
As the number of dimensions or the size of the database increases , the user should experience no significant degradation of performance. This is insisted upon because a true multidimensional database stores aggregate cell data dynamically, and no large computational query needs to be run as the user looks at the data cube.
Client/Server Architecture
Tools should be able to operate and store data in a client/server architecture. I question this rule because Oracle is planning multidimensional analysis for the NC computer. I don't see any imperative that the architecture should be restricted to Client/Server for OLAP processing.
Generic Dimensionality
All cells of the multidimensional database or larger subsets must be treated in the same way when the user wishes to view, analyze, or manipulate them.
Dynamic Sparse Matrix Handling
This means that NULL values should be stored in an efficient way that saves space. For instance if you sold new cars and bicycles and defined an OLAP cube as having price, item, units_sold as its dimensions, certain cells of this cube will never be filled. Even though a $100 bicycle might be sold, you probably are not going to sell a $100 car.

Along with this space compression, depending on the data, the OLAP tool should choose different access methods to retrieve the data quickly. If you stored one dimension as sorted data, the OLAP system might use a binary tree to index the data, yet if the dimension stored large random numbers , the OLAP system should adjust and use a hashing algorithm.

Page 999

Multiuser Support
This only means that multiple users can view the same data cubes and perform operations on them to analyze the data. It would make no sense to have a system where one analyst could tie up a complete branch of your corporate data.
Unrestricted Cross-Dimensional Operations
In an OLAP multidimensional database, methods of aggregating or deriving values by formula need to be consistent for any subset of the multidimensional database.
Intuitive Data Manipulation
This is highly subjective as a rule and doesn't say much. A nonintuitive data manipulation is never the goal of an OLAP vendor. What I think Codd meant when he wrote this was that OLAP data manipulation should be visually based with drill-down capabilities and ways of zooming in and out of cells.
Flexible Reporting
Reports should allow for rows and columns to be displayed for any number of dimensions in the data cube and that these values could also appear on headings; again we see a rather vague rule.
Unlimited Dimensions and Aggregation Levels
Codd uses the word "unlimited" and then goes on to state that the recommended limit in dimensions is between 15 to 20. This means that I could expand my tennis shoe analysis to contain location, price, purchase date, current weather, astrological sign of buyer, and so on. This might give me more detailed information, but I could run into the overfit problem in statistics where I end up seeing patterns in random data that really do not exist in the long run.

Multidimensional Storage Strategies ”A Star Schema Is Born

The multidimensional database is responsible for a type of storage that a normalized relational database does not allow, and that is aggregations of data. In a normalized database, aggregate data is a function of the number and characteristics of existing rows and is computed at runtime in the form of aggregate SQL statements. In a multidimensional database, a value in a cell can represent an aggregation of the different dimensions of the database.

In the previous example we had a 3 3 3 cube that contained total sales by location, month of sale, and price. These are the three dimensions of our OLAP database. To store these in a multidimensional cube so that we don't have to aggregate data whenever a user wants to see a subset of the data cube, the multidimensional database uses the star schema. Here we have the dimensions of data as tables circling the center of the star, and we will see a table called tennis_shoe_sales, which contains only the aggregate value for all of the cell combinations.

Previous Table of Contents Next