PERFORMANCE CONSIDERATIONS

only for RuBoard - do not distribute or recompile

This book is primarily concerned with accuracy of information and not with performance. However, it is recognized that performance considerations have to be made at some point and that performance may be considered to be more important than accuracy when design decisions are taken. For this reason we will briefly explore the subject of performance.

A query involving the fact table such as The sum of sales for the year 2000 grouped by Sales Area using a Type 2 approach, with surrogate keys, would be expressed as follows :

 Select SalesAreaCode, sum(s.value)        From sales s, customer c, time t        Where s.CustomerSurrogate = c.CustomerSurrogate        And s.timecode = t.timecode        And t.year = 2000        Group by SalesAreaCode

Using our new schema, the same query requires a more complex join, as the following query shows:

 Select SalesAreaCode, sum(s.Value)        From sales s, CustomerSalesArea csa, time t        Where s.customer code = csa.customer code        And s.time code = t.time code        And t.time code between csa.start and csa.end        And t.year = 2000        Group by SalesAreaCode

The join between the time and customer sales area dimensions is not a natural join (it's known as a theta join) and is unlikely to be handled efficiently by most RDBMS query optimizers. A practical solution to solving the performance issue in these cases, while retaining the benefit of existence attributes, is to copy the sales area code to the fact table. As has been stated previously, the attributes of the fact table always have the property of permanent retrospection. So the accuracy of the results would not be compromised, and performance would be improved considerably, even better than the original, because a whole table is omitted from the join. This is shown by the next query:

 Select SalesAreaCode, sum(s.Value)        From sales s, time t        where s.time code = t.time code        And t.year = 2000        Group by SalesAreaCode

The relationship between the customer and sales area dimensions must be left intact in order to facilitate dimensional browsing. This is because the decomposition of the hierarchy is not reversible. In other words, the hierarchy cannot be reconstructed from the fact table, since, if there is no sale, there is no relationship between a customer and a sales area. So it would not be a nonloss decomposition.

The model enables dimensions to be queried and for time series types of analysis to be performed by using the time dimension to allow the sales area and customer dimensions to be grouped by any time attribute.

Queries involving dimensions only, such as counting the number of customers by year, would be expressed as follows:

 Select year,count(*)        From CustomerExistence ce, time t        Where t.timecode between ce.start and ce.end        And t.timecode in               (2000/12/31, 1999/12/31, 1998/12/31)        Group by year

This appears to be a reasonably efficient query. At this point it is worth reiterating that browse queries represent about 80 percent of queries executed on the data warehouse.

At this level in the model, there is no reason to distinguish between the requirements of false and permanent retrospection. The main reason for these classifications is concerned with issue of population of the data warehouse and the capture of changed data values. In any case, the method handles these requirements quite naturally.

only for RuBoard - do not distribute or recompile