OLAP to the Rescue | Practical Business Intelligence with SQL Server 2005

Relational databases have become so popular and ubiquitous that many IT professionals think that every data storage and querying problem can (and should) be solved by a relational database. Similarly, when XML was first popularized, many people thought exactly the same thing about XML. The reality of course is that although structures such as relational databases and XML files have a wide range of uses, we should follow a practical rather than dogmatic approach and apply the right tool for the job.

Any BI solution that we put in place should ideally be available across the whole company, follow a multidimensional approach that matches up with the real-world concepts, be easy to use by nontechnical users, and have really great performance. This is quite a tall order, but the technology to achieve all of this is available.

On-Line Analytical Processing (OLAP) is a different kind of database technology designed specifically for BI. Instead of organizing information into tables with rows and columns like a relational database, an OLAP database stores data in a multidimensional format. Rather than trying to get a relational database to meet all the performance and usability needs we described previously, we can build an OLAP database that the users can query instead and periodically load it with data from the relational data warehouse, as shown in Figure 1-4. SQL Server includes an OLAP database engine called Analysis Services.

Figure 1-4. Source to DW to OLAP to users flow

The central concept in an OLAP database is the cube. An OLAP cube consists of data from one or more fact tables and presents information to the users in the form of measures and dimensions. OLAP database technology also generally includes a calculation engine for adding complex analytical logic to the cube, as well as a query language. Because the standard relational query language, SQL, is not well suited to working with cubes and dimensions, an OLAP-specific query language has been developed called MDX (Multidimensional Expressions), which is supported by several OLAP database engines.

The term cube comes from the general idea that the data structure can contain many dimensions rather than just a two-dimensional table with rows and columns. Because a real-life geometric cube is a three-dimensional object, it is tempting to try and explain OLAP technology using that metaphor, but it quickly becomes confusing to many people (including the authors!) because most OLAP cubes contain more than three dimensions. Suffice to say, a cube is a data structure that allows numeric measures to be analyzed across many different dimensions.

Loading Information into OLAP Databases

As you have seen in the section on ETL, data from source systems is transformed and loaded into the relational data warehouse. To make this data available to users of the OLAP database, we need to periodically process the cube. When a cube is processed, the OLAP engine issues a set of SQL queries against the relational data warehouse and loads the resulting records into an OLAP cube structure.

In principle, an OLAP cube could be loaded directly from the source systems and instantly provide a dimensional model for accessing the information and great performance. In that case, why do we need a relational data warehouse as well? The most important reason is data quality. The data warehouse contains consolidated, validated, and stable information from many source systems and is always the best source of data for an OLAP cube.

Getting Information out of OLAP Databases

Users usually interact with a relational database (including a data warehouse) by running predefined reports that are either created for them by IT departments or built by the users themselves using a report writing application. Reports can often take several minutes to run even in a well-designed star schema data warehouse, which doesn't lend itself to the kinds of interactive queries that can really allow the user to understand new information.

The key to the success of using OLAP databases in an interactive, user-friendly way is their performance. Queries against an OLAP cube, even ones that summarize years of history and huge amounts of transactions, typically return results in a couple of seconds at most, which is orders of magnitude faster than similar relational queries. This makes it feasible to build client applications that allow users to build queries by dragging and dropping measures and dimension attributes and see results almost instantly.

Many users, especially analysts and other power users, have conventionally used rich BI client applications specifically designed for querying OLAP databases. These tools typically include features such as charting and visualization and can really improve the effectiveness of analytical tasks. As the need for access to information becomes more widespread across the organization, BI capabilities are being included in tools that most people have access to, such as Web portals and Excel spreadsheets.

Information is often presented at a summarized level with the ability to drill down to see more details (that is, to pick a particular area of interest and then expand it). For example, someone may begin with looking a list of sales revenue against quota for all the geographic regions in a country and see that a particular region has not reached their target for the period. They can highlight the row and drill down to see all the individual cities within that region, to try and understand where the problem may be.

Why Is OLAP So Fast?

So how does an OLAP database engine achieve such great performance? The short answer is pretty simple: It cheats. When somebody runs an adhoc query that asks for a total of all sales activity in a certain region over the past three years, it is very unlikely that a database engine could sum billions of records in less than a second. OLAP solves this problem by working out some of the answers in advance, at the time when the cube is processed.

In addition to the detailed fact data, OLAP cubes also store some precalculated summaries called aggregates. An example of an aggregate is a set of totals by product group and month, which would contain far fewer records than the original set. When a query is executed, the OLAP database engine decides whether there is an appropriate aggregate available or whether it needs to sum up the detailed records themselves. A properly tuned OLAP database can respond to most queries using aggregates, and this is the source of the performance improvement.

If you try to work out the total possible number of different aggregates in a cube with a reasonable number of dimensions, you will quickly realize that the number of combinations is staggering. It is clear that OLAP database engines cannot efficiently store all possible aggregations; they must pick and choose which ones are most effective. To do this, they can take advantage of the situation shown in Figure 1-5. Because products roll up to product categories, and months roll up to quarters and years, if an aggregate on product by month is available, several different queries can quickly be answered. If a query is executed that calls for totals by year and product category, the OLAP database engine can sum up the records in the product by month aggregate far more quickly than using the detailed records.

Figure 1-5. OLAP aggregations

Another key to OLAP database engine performance is where and how they store the detailed and aggregated data. There are a few different approaches to this question, but the most common answer is for the OLAP database engine to create optimized structures on disk. This approach is known as MOLAP, or Multidimensional OLAP. Modern platforms such as Analysis Services can support billions of records in highly optimized, compressed MOLAP structures.

Regardless of whether a query is answered from a precalculated aggregate or the detail-level records themselves, every query is answered completely from the MOLAP structure. In fact, after the daily cube processing has loaded data into the cube, you could even stop the relational database server without affecting cube users, because the relational database is never used for end-user queries when using MOLAP structures.

In some older technologies, MOLAP did not scale well enough to meet everybody's needs, so some analytical solutions stored the detailed information and the aggregates in relational database tables instead. In addition to a fact table, there would also be many tables containing summaries. This approach is known as ROLAP, or Relational OLAP. Although these solutions scaled relatively well, their performance was typically not as good as MOLAP solutions, so HOLAP (Hybrid OLAP) solutions were introduced that stored some of the information in relational tables and the rest in MOLAP structures.

The good news is that as far as Analysis Services is concerned, the preceding discussion is no longer a real issue. Analysis Services supports all three approaches by simply changing a setting on the cube, and the current version will have no trouble supporting huge data volumes with excellent performance, leaving you free to concentrate on a more interesting question: How should you structure the information in your particular BI solution?