THE USE OF THE TIME DIMENSION

only for RuBoard - do not distribute or recompile

In this section let's explore how the use of the composite existence attribute that was introduced in the previous section, together with the time dimension, may allow the expression of some complex queries to be simplified.

The purpose of the time dimension is to provide a mechanism for constraining and grouping the facts, as do the other dimensions in the model. We are examining methods for properly representing customers; this means providing support for time in the customer circumstances, dimensions, and dimensional hierarchies as well as the facts. The four queries listed in Query Listings 6.2 “6.5 show that similar time constraints apply to the dimensions as to the facts. Therefore, it seems appropriate to allow the users of the data warehouse to express time constraints on other components using the same approach as they do with the facts.

Kimball proscribes the use of the time dimension with other dimensions because he is concerned that the semantics of time in the dimensions is different from that of facts and is potentially misleading. The view that the time dimension should not be used in dimensional browse queries is supported implicitly by the conventional star schema and snowflake schema data models that show the time dimension as being related to the fact table alone. There is no relationship between the time dimension and any other dimension on any dimensional model that I have seen.

In considering this matter, two points emerge:

Firstly, the time dimension provides a simple interface to users when formulating queries. Preventing them from using the time dimension with other entities means that the users will be able to place time constraints by selecting terms such as 2nd Quarter 2000 in queries involving the fact table but not other entities such as customer circumstances and dimensions. In these types of queries, the explicit time values have to be coded.

Second, some dimensional browsing queries are much easier to express if a join is permitted between these other entities and the time dimension. Further, I have discovered that some useful but complex queries, such as those in the previous section, can be generalized if a join to the time dimension is permitted. This is described below.

Referring to the queries expressed in Listings 6.2 “6.5, the first query (Listing 6.2) is: How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? Using the time dimension, it can be expressed as shown in Listing 6.6.

Listing 6.6 Count of customers lost during Q4, using the time dimension.

 select t.Quarter, count(*)        from CustomerExist ce, Time t        where ce.ExistenceEnd = t.TimeCode        and t.Quarter in ('Q42000', 'Q41999', 'Q41998')        group by t.Quarter

Changes to the temporal scope of the query can be effected simply by altering one line of the predicate instead of creating additional discrete queries.

The query from Listing 6.3 is: Of the customers who were lost, how many had been customers continuously for at least one year? This can be expressed as follows :

Listing 6.7 Count of long-standing customers lost, using the time dimension.

 select t.Quarter, count(*)        from CustomerExist ce, Time t        where ce.ExistenceEnd = t.TimeCode        and t.Quarter in ('Q42000', 'Q41999', 'Q41998')        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        group by t.Quarter

The third query, from Listing 6.4, is: How many of the customers experienced a change of administration because they moved in the year that they left? Using the same assumptions as before, the query can be expressed as in Query Listing 6.8.

Listing 6.8 Lost customers who moved, using the time dimension.

 select t1.Quarter, count(*)        from CustomerExist ce,        CustomerSalesAreaExist csa, Time t1, Time t2        where ce.ExistenceEnd = t1.TimeCode          and t1.Quarter in ('Q42000', 'Q41999', 'Q41998')          and (ce.ExistenceEnd - ce.ExistenceStart) > 365          and ce.CustomerCode = csa.CustomerCode          and csa.ExistenceStart = t2.TimeCode          and t2.Year = t1.Year        group by t1.Quarter

Finally, the fourth query, from Query Listing 6.5, is: How many price changes did they experience in the year that they left? This can be expressed as shown in Query Listing 6.9.

Listing 6.9 Lost customers affected by price increases , using the time dimension.

 select t1.Quarter, ce.CustomerCode,               count(distinct spe.WineCode)        from CustomerExist ce,               SalesPriceExist spe, Sales s,        Time t1, Time t2, Time t3        where ce.ExistenceEnd = t1.TimeCode          and t1.Quarter in ('Q42000', 'Q41999', 'Q41998')          and (ce.ExistenceEnd - ce.ExistenceStart) > 365          and ce.CustomerCode = s.CustomerCode          and s.WineCode = spe.WineCode          and spe.ExistenceStart = t2.TimeCode          and s.TimeCode = t3.TimeCode          and t2.Year = t1.Year          and t3.Year = t2.Year        group by t1.Quarter, ce.CustomerCode        having count(distinct spe.WineCode) > 5

Thus, we can conclude that allowing the time dimension to be joined to other dimensions, when existence attributes are used, enables a simpler expression of some temporal queries

In order to adopt a change of approach whereby joins are allowed between the time dimension and other dimensions, we have to alter the data model. There now exists a relationship between the time dimension and some of the other entities.

Only those dimensions that need true support for time will be related to the time dimension. Part of the Wine Club model has been reproduced in Figure 6.1.

Figure 6.1. ER diagram showing new relationships to the time dimension.

Figure 6.1 shows time having a relationship with sales, as before, and also with the customer and wine dimensions. In fact, as Figure 6.1 shows, there are two relationships between the time dimension and other dimensions, one for the start time of a period and one for the end time.

A problem caused by this change in the conventional approach is that dimensional models will immediately lose their simple shape, and the overall model will become horribly complex. Simplicity is one of the requirements placed on the model. In creating this new idea, I have introduced a further problem to be solved . It is important that the dimensional shape is not lost. Clearly, we cannot present the users with such a diagram.

The solution could be that the time dimension is removed altogether from the model. It has been said before that the time dimension is always included as part of a dimensional model because time is always a dimension of analysis in a data warehouse that records history. In further recognition of the fact the data warehouses are temporal databases, the explicit inclusion of a time dimension could be regarded as unnecessary. So, on the assumption that the time dimension is a given requirement, I have adopted the view that its inclusion is implicit. This means that the diagram does not need to model the time dimension explicitly. However, this causes a problem with the entity relationship modeling methodology in that it would be misleading to have implicit entities that are deemed to exist but are excluded from the diagram. The dot modeling methodology does not have this limitation and will be adapted to accommodate the new requirement.

However, the time dimension does have attributes that are specific to each application, and so it is not something that can be ignored altogether. For instance, data warehouses in some types of organization require specific information about time, such as:

Half-day closing
Prevailing weather conditions
Effect of late opening due to staff training
Whether the store was open for 24 hours

This type of information cannot be obtained through any type of derivation. So there is a need for some means of specifying the attributes for time on a per application basis. In the dot modeling methodology we can solve this problem by the introduction of a table that will satisfy the requirements previously handled by the explicit time dimension as well as the requirements covered in this section.

The table could be given a standard table name for use in all applications. The use of Time as a name for the table is likely to conflict with some RDBMS reserved word list so, for our purposes, the name dot_time will be use to describe the table.

Each application will have its own requirements as to the columnar content of the dot_time table, although some columns , such as the following, would almost always be required:

Date
Day name
Week number
Month name
Month number
Quarter
Year

As practitioners we could add value to our customers by bringing a starter dot_time table that might contain, say, 10 years of history and 10 years of future dates. This seems like a large amount of data, but in reality it is fewer than 8,000 rows where the granularity is daily. For finer levels of granularity, for example, seconds, it is sensible to provide two time tables. The first contains all the days required, as before, and the other would contain an entry for each second of a single day (i.e., from 00:00:00 to 23:59:59). It is then a simple matter to join to one table, in the case of dimensional changes, or both tables in the case of, say, telephone calls. In this way, multiple grains of time can be accommodated.

Practitioners could also provide standard programmed procedures, perhaps using the user -defined functions capability that is available as an extension to some RDBMS products, to add optional columns such as weekends and bank holidays, etc., although some customization of the dot_time table is almost inevitable for every application.

The removal of the explicit time dimension from the conceptual model to the logical model is a step forward in the approach to the design of data warehouses. It also goes some way to recognizing that data warehouses are true temporal applications and the support for time is implicit in the solution, rather than having to be made explicit on the data model.

only for RuBoard - do not distribute or recompile