OLAP

only for RuBoard - do not distribute or recompile

The term on-line analytical processing, or OLAP, was first used in the early 1990s. We could be forgiven for imagining that this was when, so called, OLAP products first started to appear. Not so. All OLAP products are founded on the principle of the dimensional model that was described and discussed at length in the early chapters of this book, and analysis products based upon dimensional models have been around since the 1970s. In the mid-1980s there was quite a strong set of competing products although, at the time, they were generally being marketed under the heading of executive information systems (EIS). Some of the products that existed in those days have survived, by a certain amount of evolution and adaptation by new owners , to today. OLAP is supposed to have delivered a kind of paradigm shift in that all business people should now be using sophisticated analysis tools to understand their businesses instead of relying on dedicated analysts to provide them with information. Oddly enough, that was precisely the same objective of the much earlier EIS products.

A typical OLAP architecture consists of an OLAP server that sits between the data warehouse and the user . This is shown in Figure 10.2. The warehouse itself is normally dimensional in nature and can be relational, proprietary, or a combination of both, depending on the type of OLAP product chosen .

Figure 10.2. Typical OLAP architecture.

There are three main flavors of OLAP, and each OLAP product falls into one of the three:

MOLAP: Multidimensional OLAP. The term MOLAP is used to describe a set of products that utilize a proprietary database management system. By this we mean that the database does not use the relational, or any other standard, as its underlying database. These systems often have a spreadsheet basis much the same as described in the introduction to Dot Modeling in Chapter 5. There are two main problems with these products.

The first concerns their ability to scale up. Generally, we have to be able to tell the system, at the outset, how big the dimensions are likely to be. This can be done by simply entering, say, the number of customers, products, etc., that we need the system to be able to hold or by specifying a range of identifiers and allowing the system to deduce the size of the dimension itself. Once the system knows the size of each dimension, it simply multiplies all the numbers together and the result is the number of spreadsheet cells that it needs to make room for in its matrix. For example: If we have two million customers and 1,000 products and we want to store 10 years of daily sales, then the matrix would contain:

The result is that the position of individual cells can be calculated from the value of the dimensional identifiers. This makes these systems very fast indeed when it comes to accessing the data and, of course, there are no expensive joins to be performed. However, there is a kind of implicit assumption that we intend to sell every product to every customer every day. As attractive a proposition as this might be, most would agree that it is a little unrealistic and tends to lead to what is known as sparcity. Sparcity is a measure of the unused cells in a multidimensional database. Often, the sparcity can be more than 90 percent. In the example above, that's over 20 terabytes on a 32-bit architecture for what is a fairly modest application. Most of these systems have sparcity reduction algorithms, but scalability remains an issue for these products.

The second problem is that there are no standards regarding query languages. In relational databases, SQL has been the standard for a long time. It may have shortcomings, but at least we can be comfortable that SQL that works on one RDBMS can be made to work on most others with relatively little modification so long as we have stuck to the standard features.

ROLAP: Relational OLAP. The ROLAP solution is to locate the data model inside an RDBMS. There are minor variations in the data models depending on the vendor. Some allow a full snowflake schema while others restrict us to a star schema. Not surprisingly, each aggressively defends their solution against the other. We've had this discussion in Chapter 3. Basically, what we have to do is define the dimensions and the facts and the ROLAP engine will create the underlying schema and provide a means of querying the model with a query tool. This approach appears not to suffer from the two problems associated with the standard MOLAP approach. First, because the data is stored, conventionally, in a relational database, the scaling and sparcity issues don't usually apply. The primary players in the RDBMS market are well sorted in terms of space management and, although they do tend to waste a lot of space, it is nowhere near as bad as the MOLAP approach. Second, the query language is SQL. The query tool that is integrated with the ROLAP product generates SQL so we don't have to.

Don't be fooled into thinking everything is standard, because it isn't. While the underlying DBMS is relational and is, therefore, standard, the ROLAP bit is anything but standard. In the relational model, we can take the DDL (data definition language “create table statements, etc.) and use it, more or less, in another RDBMS and it will work. The same is not true of the statements and parameters used to build the ROLAP model.

There is another thing, too. One if the big benefits of MOLAP is performance because, as we said, we can find records by simple calculation and there are no joins to be done. In a ROLAP solution this is not the case. We are back with a traditional-looking model with all its inherent performance issues.

You might be wondering what the benefit is. If the ROLAP tool simply builds a dimensional model, whether a star or snowflake, and allows us to query it, why don't we just design our own dimensional model and just buy a query tool to analyze it (and save lots of money)? If you think of an answer “let me know.

HOLAP: Hybrid OLAP. This is the compromise solution. The idea behind HOLAP is that we can still get the benefits of a MOLAP architecture for the summary information, but when we need detail, the system will reach back into a relational database. So the performance gains are available, albeit only at the summary level, and the scalability issue can be resolved because all the detail is held in conventional relational tables.

There is a general issue with OLAP ”that is, its architecture is founded on the dimensional model. That means it is only useful for the behavioral part of the GCM. It cannot support a customer-centric model. This problem is somewhat exacerbated when we try to use an OLAP solution in our CRM system because we want a fully integrated model of which we can ask questions about changing circumstances as well as behavior. If you find yourself in a situation where you need only behavioral information, then an OLAP solution, in one of its three main guises, might be suitable. If you need more, then be wary.

only for RuBoard - do not distribute or recompile