THE IMPLEMENTATION OF RETROSPECTION

only for RuBoard - do not distribute or recompile

Introduction

We begin this section with a general rule:

Every query executed in a data warehouse must have a time constraint.

If an executed query does not have an explicit time constraint, then the inferred time period is for all time. Queries that embrace all of time, insofar as the data warehouse is concerned , can be generally regarded as nonsensical because for all time simply means the arbitrary length of time that the database has been in existence. Whereas it may be sensible to aggregate across all customers or all products in order to ascertain some information about, say, total revenue for a period of time, it does not make sense to apply the same approach to time under normal circumstances. The following query has been quoted extensively:

How many customers do we have?

Even this query has an implicit time constraint in that it is really asking: How many customers do we have at this precise point in time, or how many customers currently exist?

It is likely that readers of this book may be able to think of circumstances where the absence of a time constraint makes perfect sense, which is why the axiom is offered as a general rule rather than a fixed one. The principle still holds.

The Use of Existence Attributes

We'll now explore the use of existence attributes as one approach to the implementation of retrospection. The concept of existence has already been raised, but it's general application to all data warehouse components might not be entirely clear. The reasoning is as follows :

The temporal requirement for any entity relates to its existence. The existence of an entity may, in some circumstances, be discontinuous. For instance a wine may be sold by the Wine Club for a period of time and then be withdrawn. Thereafter, it may again come into existence for future periods of time. This discontinuity in the lifecycle of the wine may be important when queries such as How many product lines do we sell currently? are asked. A further example concerns customers. The regaining of lost customers is increasingly becoming an important business objective especially in the telecommunication and financial services industries. When customers are won back, the companies prefer to reinstate them with their previous history intact. It is now becoming more common for customers to have discontinuous existences.

A relationship that requires temporal support can be modelled as an m:n relationship. Using the entity relationship (ER) standard method for resolving m:n relationships, a further entity, an intersection entity, would be placed between the related entities. So there now exists another entity (a weak entity). As the temporal requirement for an entity relates to the existence of the entity, the treatment of relationships is precisely the same as the treatment of entities. It is a question of existence once again.

An attribute should really be regarded as a property of an entity. The entity is engaged in a relationship with a domain of values. Over time, that relationship would necessarily be modeled as an m:n relationship. An attribute that requires temporal support, therefore, would be resolved into another intersection entity between the entity and the domain, although the domain is not actually shown on the diagram. The treatment of attributes is, therefore, the same as entities and relationships and relates to the existence of the value of the attribute for a period of time.

It is proposed that temporal support for each element, where required, within customer circumstances, segmentation, or the dimensional structure of a data warehouse should be implemented by existence attributes. Various ways of representing such an existence attribute may be considered . At the simplest level for an entity, an existence attribute could be added to record whether each occurrence is currently active or not. However, if full support for history is needed, then this requires a composite existence attribute consisting of a start time and an end time that records each period of existence. Some special value should be assigned to the end time to denote a currently active period. Each element may have such an existence attribute implemented as follows:

For the temporal support of an entity that may have a discontinuous existence, a separate table is required, consisting of the primary key of the entity and an existence period. If discontinuous existence is not possible, the existence period may be added to the entity type.
For the temporal support of a relationship between entities, a separate table is required, consisting of the primary keys of both participating entities together with an existence period.
For the temporal support of an attribute of an entity, a separate table is required, consisting of the primary key of the entity, an existence period, and the attribute value.

It should be noted that the concept of existence attributes is not new. In fact, the approach could be described as a kind of selective attribute timestamping. However, use of attribute timestamping has largely been limited to research into temporal database management systems and has not before been associated with data warehousing. It is the selective adoption of attribute timestamps, in the form of existence attributes, for data warehousing purposes that is new.

The use of existence attributes solves the problem of cascaded extraneous inserts into the database caused by the use of the Type 2 solution with a slowly changing hierarchy that was described in Chapter 4. The reason is that there is no need to introduce a generalized key for a dimension, because changes to an attribute are kept in a separate table. However, the performance and usability of the data warehouse need to be considered. The use of existence attributes in the form of additional tables does add to the complexity, so it should be allowed only for those elements where there is a clearly identified need for temporal support. It is for the decision makers to choose which elements are sufficiently important to justify such treatment.

To explore the benefits of existence attributes, let us consider that the user wishes to know the number of customers that a particular sales executive is responsible for. The intuitive query, reproduced from Query Listing 4.6 in Chapter 4, is shown in Query Listing 6.1.

Listing 6.1 Nonexpert query to count customers (reproduced from 4.6).

 Select count(*)        from Sales_Exec S,Customer C        where S. SalesExecNum=C.SalesExecNum        And S.Name = 'Tom Sawyer'

If Type 2 is implemented, the result would definitely be wrong. Each customer would have one or more rows in the table depending on the number of changes that had occurred to the customer's record. The result would substantially overstate the real situation.

Type 2 works by creating a new dimension record with a different generalized key. One simple solution is that the dimension is given an additional attribute that signifies whether the row is the latest row for the customer. The value of this attribute declares the existence of the dimension and remains true until the row is superseded. In its simplest form the existence of a dimension can be implemented using a Boolean attribute. This means that when a change is implemented, the previous latest row is updated and the existence attribute is set to false. The new row has its existence attribute set to true.

The new query to determine the number of customers, that a particular sales executive is responsible for, is as follows:

 Select count(*)        from Sales_Exec S,Customer C        where S. SalesExecNum=C.SalesExecNum        And S.Name = 'Tom Sawyer'        And C.Existence = TRUE

The end-user query software could be configured to add this last line to all dimension tables, so the user need not be aware of it. However, this does not entirely solve the problem, because it is not answering the question: How many customers is Tom Sawyer responsible for now? Rather, it is answering the question: How many customers has Tom Sawyer ever been responsible for? One method toward solving this problem would be to also set the existence attribute to false when the customer ceased to be a customer. Whether or not this is possible depends on the ability of the data warehouse processing to detect that a customer's status had become inactive. For instance, if the customer's record in the source system were to be deleted, the data warehouse processing could infer that the customer's existence attribute in the data warehouse should be updated to false.

Another variation is to make use of null values in the existence attribute, as follows:

Null for not existing

Not null for existing (i.e., current)

If the column was called existence, to identify all customers who were still active customers, then the query would be as shown here:

 Select count(existence)        from Sales_Exec S,Customer C        where S. SalesExecNum=C.SalesExecNum        And S.Name = 'Tom Sawyer'

Due to the way nulls are treated (i.e., they are ignored unless explicitly coded for), this expression of the query is almost as simple as the original, intuitive query phrased by the user in Query Listing 6.1.

Furthermore, if the true value of existence was 1, then the following query would also return the correct result:

 Select sum(existence)        from Sales_Exec S,Customer C        where S. SalesExecNum=C.SalesExecNum        And S.Name = 'Tom Sawyer'

This appears to effectively solve the problem of determining who the current customers are. Thus, even the simplest existence attribute can improve the original Type 2 method considerably.

The existence attribute could be implemented using a single effective date. This has the advantage that we can determine when a change occurred. However, such a method does not allow us to answer the previous query ( How many customers is Tom Sawyer responsible for? ), because, again, there is no means to determine inactive customers. The use of row timestamping, using a pair of dates, does enable the question to be answered so long as the end date is updated when the customer becomes inactive. However, there are many questions, such as state duration and transition detection questions, that are very difficult to express and even some that are impossible to express using this approach.

We have already explored the concept of customer churn. The loss of valuable customers is a subject close to the hearts of many, many business people. It is a problem that could equally apply to the Wine Club. It is now becoming a common practice to contact customers that have churned in sales campaigns in an attempt to attract them back. Where organizations are successful in doing this, the customers are reinstated with their previous identifiers so that their previous history is available to the customer service staff. The incidence of discontinuous existences is, therefore, becoming more common.

The need to monitor churn and to establish the reasons for it tends to create a need for queries that return results in the form of a time series. The following exemplifies the type of questions we would like to express:

How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? The result of such a query would be a time series containing three periods and a number attached to each period. This is an example of a temporal selection query.
Of the customers who were lost, how many had been customers continuously for at least one year? The loss of long-standing customers might be considered to be as a result of worsening service. The result from this question is also a time series, but the query contains an examination of durations of time. So this query is a state duration query.
How many of these customers experienced a change of administration because they moved in the year that they left? Perhaps they are unhappy with the level of service provided by the new area. This question is concerned with the existence of the relationship between the customer and the sales area. It is also an example of a transition detection query.
How many price changes did they experience in the year that they left? Perhaps they were unhappy with the number of price rises imposed. This is similar to question 3 in that it is a transition detection query, but it is applied to the value of an attribute, in this case the selling price of a bottle of wine, instead of a relationship.

These requirements cannot be satisfied using a single Boolean attribute to describe the existence of a dimension, as there is a requirement to make comparisons between dates. Neither can the queries be expressed using a single date attribute for the reason previously explained. It seems clear that the expression of such queries requires a composite existence attribute that is, logically, a period comprising a start time and an end time.

It has been shown that row timestamping can provide a solution in many cases, but not all, and the resulting queries are complex to write. A simpler solution is sought.

The approach to be adopted will be to use a separate existence attribute, in the form of a composite start time and end time, for each dimension, relationship, and attribute where retrospection has been defined to be true. It is assumed that these existence attributes will be held in separate tables. So the existence attribute for customers' existence will be held in a table called CustomerExist and the existence attribute for the period during which a customer lives in a sales area will be called CustomerSalesAreaExist.

Using the composite existence attribute, the first query can be expressed as in Query Listing 6.2.

Listing 6.2 Count of customers who have left during Q4 year on year.

 select 'Q4 2000' as quarter, count(*)        from CustomerExist ce        where ce.ExistenceEnd               between '2000/10/01' and '2000/12/31' union select 'Q4 1999' as quarter, count(*)        from CustomerExist ce        where ce.ExistenceEnd               between '1999/10/01' and '1999/12/31' union select 'Q4 1998' as quarter, count(*)        from CustomerExist ce        where ce.ExistenceEnd               between '1998/10/01' and '1998/12/31'

The query in Listing 6.2 could have been answered using row timestamping only if the end timestamp was updated to show that the customer was no longer active. The distinction is made between the existence attribute and the row timestamp because the existence attribute is a single-purpose attribute that purely records the existence of the customer. The row timestamp is, as has been stated, a multipurpose attribute that records other types of changes as well. In order to express the query using row timestamps, it would have to be written as a correlated subquery to ensure that only the latest record for the customer was evaluated. This means that discontinuous existences could not be detected .

The second query can be expressed as in Query Listing 6.3.

Listing 6.3 Count of long-standing customers lost.

 select 'Q4 2000' as quarter, count(*)        from CustomerExist ce        where ce.ExistenceEnd               between '2000/10/01' and '2000/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365 union select 'Q4 1999' as quarter, count(*)        from CustomerExist ce        where ce.ExistenceEnd               between '1999/10/01' and '1999/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365 union select 'Q4 1998' as quarter, count(*)        from CustomerExist ce        where ce.ExistenceEnd               between '1998/10/01' and '1998/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365

In order to express the third query, it is assumed that there is a separate existence attribute for the relationship between the customer and the sales area. This is shown in Query Listing 6.4.

Listing 6.4 Lost customers who moved.

 select 'Q4 2000' as quarter, count(*)        from CustomerExist ce, CustomerSalesAreaExist csa        where ce.ExistenceEnd               between '2000/10/01' and '2000/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        and ce.CustomerCode = csa.CustomerCode        and csa.ExistenceStart               between '2000/01/01' and '2000/12/31' union select 'Q4 1999' as quarter, count(*)        from CustomerExist ce, CustomerSalesAreaExist csa        where ce.ExistenceEnd               between '1999/10/01' and '1999/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        and ce.CustomerCode = csa.CustomerCode        and csa.ExistenceStart               between '1999/01/01' and '1999/12/31' union select 'Q4 1998' as quarter, count(*)        from CustomerExist ce, CustomerSalesAreaExist csa        where ce.ExistenceEnd               between '1998/10/01' and '1998/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        and ce.CustomerCode = csa.CustomerCode        and csa.ExistenceStart               between '1998/01/01' and '1998/12/31'

The query in Listing 6.4 is an example of a combined state duration and transition detection query.

As with the other queries, in order to express the fourth query, it is assumed that there is a separate existence attribute for the bottle price (Listing 6.5).

Listing 6.5 Lost customers affected by price increases .

 select 'Q4 2000' as quarter, ce.CustomerCode,               count(distinct spe.WineCode)        from CustomerExist ce, SalesPriceExist spe,               Sales s, Time t        where ce.ExistenceEnd               between '2000/10/01' and '2000/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        and ce.CustomerCode = s.CustomerCode        and s.WineCode = spe.WineCode        and s.TimeCode = t.TimeCode        and t.year = 1998        and spe.ExistenceStart               between '2000/01/01' and '2000/12/31'        group by quarter, ce.CustomerCode        having count(distinct spe.WineCode) > 5 union select 'Q4 1999' as quarter, ce.CustomerCode,               count(distinct spe.WineCode)        from CustomerExist ce, SalesPriceExist spe,                      Sales s, Time t        where ce.ExistenceEnd               between '1999/10/01' and '1999/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        and ce.CustomerCode = s.CustomerCode        and s.WineCode = spe.WineCode        and s.TimeCode = t.TimeCode        and t.year = 1997        and spe.ExistenceStart               between '1999/01/01' and '1999/12/31'        group by quarter, ce.CustomerCode        having count(distinct spe.WineCode) > 5 union select 'Q4 1998' as quarter, ce.CustomerCode,               count(distinct spe.WineCode)        from CustomerExist ce, SalesPriceExist spe,                      Sales s, Time t        where ce.ExistenceEnd               between '1998/10/01' and '1998/12/31'        and (ce.ExistenceEnd - ce.ExistenceStart) > 365        and ce.CustomerCode = s.CustomerCode        and s.WineCode = spe.WineCode        and s.TimeCode = t.TimeCode        and t.year = 1996        and spe.ExistenceStart               between '1998/01/01' and '1998/12/31'        group by quarter, ce.CustomerCode        having count(distinct spe.WineCode) > 5

The query in Query Listing 6.5 shows customers who left the club in the last quarter of the year and who had experienced more than five price changes during the year. This is another example of a combined state duration and transition detection query.

In dealing with the issue of churn, this approach of trying to detect patterns of behavior of customers is typical.

It is accepted that the queries have drawbacks in that they are quite complex and would prove difficult for the average data warehouse user to write. Also, each query actually consists of a set of smaller queries, and each of the smaller queries is responsible for processing a discrete point in time or a discrete duration. Any requirement to increase the overall timespan, or to break the query into smaller discrete timespans, would result in many more queries being added to the set. So the queries cannot be generalized to deal with a wide range of times. In the next section, we explore and propose an approach for making the queries much easier to express and generalize.

In this section we have seen that the use of existence periods does provide a practical solution to the implementation of true retrospection. This enables time to be properly represented in the customer circumstances and dimensional structures of the data warehouse and, therefore, satisfies one of the major requirements in the design of data warehouses.

only for RuBoard - do not distribute or recompile