CAPTURING CHANGES

only for RuBoard - do not distribute or recompile

Let's now examine how changes are identified in the source systems and can be subsequently captured into the data warehouse and the problems that can occur.

Capturing Behavior

As has been previously stated, the behavioral facts relate to the business transactions of the organization. Facts are usually derived from some entity having been frozen and captured at a particular status in its lifecycle. The process by which this status is achieved is normally triggered by an event. What do we mean by the term event?

There are two ways of considering the definition of an event. If the data warehouse is viewed in isolation so that the facts that it records are not perceived as related to the source systems from which they were derived, then they can be viewed purely as events that occurred at a single point in time. If, however, the data warehouse is perceived as part of the enterprise database systems, then the facts should be viewed within the wider context, and they become an entity preserved at a frozen state, having been triggered by an event. Either way, the distinguishing feature of facts is that they do not have a lifespan. They are associated with just one time attribute. For the purpose of clarity the following definition of facts will be adopted:

A fact is a single state entity that is created by the occurrence of some event.

In principle, the processes involved in the capture of behavior are relatively straightforward. The extraction of new behavioral facts for insertion into the data warehouse is performed on a periodic, very often daily, basis. This tends to occur during the time when the operational processing systems are not functioning. Typically this means during the overnight batch processing cycle. The main benefit of this approach is that all of the previous day's data can be collected and transferred at one time. The process of identifying the facts varies from one organization to another from between being very easy to almost impossible to accomplish. For instance, the fact data may come from:

Telephonic network switches or billing systems, in the case of telecommunications companies
Order processing systems as in the case of mail order companies such as the Wine Club
Cash receipts in the case of retail outlets

Once the facts have been identified, they are usually stored into sequential files or streams that are appended to during the day. As the data warehouse is usually resident on a hardware platform that is separate from the operational system, the files have to be moved before they can be processed further.

The next step is to validate and modify each record to ensure that it conforms to the format and semantic integration rules that were described in Chapter 2. The actual loading of the data is usually performed using the bulk load utility that most database management systems provide.

Once recorded, the values of fact attributes never change so they should be regarded as single state or stateless. There is a time element that applies to facts, but it is simply the time that the event occurred. It is usually implemented in the form of a single timestamp. The timestamp will vary, in granularity, from one application to another. For instance, in the Wine Club, the timestamp of a sale records the date of the sale. In a telecommunications application, the timestamp would record not only the date but also the hour , minute, and second that the call was placed.

Capturing Circumstances

The circumstances and dimensions are derived from what has been referred to as the reference entities within the organization. This is information such as customer, product, and market segment. Unlike the facts in a data warehouse, this type of information does have a lifespan. For instance, products may have various states during their lifespan from new to fast moving to slow moving to discontinued to deleted.

The identification and capture of new or changed dimensional information are usually quite different to the capture of facts. For instance, it is often the case that customer details are captured in the operational systems some time after the customer starts using the services of the organization. Also, the date at which the customer is enrolled as a customer is often not recorded in the system. Neither is the date when they cease to become a customer.

Similarly, when a dimensional attribute changes, such as the address of a customer, the new address is duly recorded in such a way as to replace the existing address. The date of the change of address is often not recorded. The dates of changes to other dimensional attributes are also, usually, not recorded. This is only a problem if the attribute concerned is one for which there is a requirement to record the historic values faithfully. In the Wine Club, for instance, the following example attributes need to have their historic values preserved:

Customers' addresses
Customers' sales areas
Wine sales prices and cost prices
Suppliers of wines
Managers of sales areas

If the time of the change is not recorded in the operational systems, then it is impossible to determine the valid time that the change occurred. Where the valid time of a change is not available, then it may be appropriate to try to ascertain the transaction time of the change event. This would be the time that the change was recorded in the database, as opposed to the time the change actually occurred.

However, in the same way that the valid time of changes is not recorded, the transaction time of changes is usually not recorded explicitly as part of the operational application. In order for the data warehouse to capture the time of the changes, there are methods available that can assist us in identifying the transaction times:

Make changes to the operational systems. The degree to which this is possible is dependent on a number of factors. If the system has been developed specifically for the organization, either by an organization's own IT staff or by some third party, as long as the skills are available and the costs and timescales are not prohibitive, then the operational systems can be changed to accommodate the requirements of the data warehouse. Where the application is a standard package product, it becomes very much more difficult to make changes to the system without violating commercial agreements covering such things as upgrades and maintenance. If the underlying database management system supporting the application is relational, then it is possible to capture the changes by the introduction of such things as database triggers. Experience shows that most organizations are reluctant to alter operational applications in order to service informational systems requirements for reasons of cost and complexity. Also, the placing of additional processing inside of operational systems is often seen as a threat to the performance of those systems.

Interrogation of audit trail. Some operational applications maintain audit trails to enable changes to be traced. Where these exist, they can be a valuable source of information to enable the capture of transaction time changes.

Interrogation of DBMS log files. Most database management systems maintain log files for system recovery purposes. It is possible, if the right skills are available, to interrogate these files to identify changes and their associated transaction times. This practice is discouraged by the DBMS vendors, as log files are intended for internal use by the DBMS. If the files are damaged by unauthorized access, the ability of the DBMS to perform a recovery may be compromised. Also, the DBMS vendors always reserve the right to alter the format of the log files without notice. If this happens, processes that have been developed to capture changes may stop working or may produce incorrect results. Obviously, this approach is not available to non-DBMS applications.

File comparison. This involves the capture of an entire file, or table, of dimensional data and the copying of the file so that it can be compared to the data already held in the data warehouse. Any changes that are identified can then be applied to the warehouse. The time of the change is taken to be the system time of the detection of the change, that is, the time the file comparison process was executed.

Experience shows that the file comparison technique is the one most frequently adopted when data warehouses are developed. It is the approach that has the least impact on the operational environment, and it is the least costly to implement. It should also be remembered that some dimensions are created by the amalgamation of data from several operational systems and some external systems. This will certainly exacerbate an already complex problem.

Where the dimensions in a dimensional model are large (some organizations have several million customers), the capture of the data followed by the transfer to the data warehouse environment and subsequent comparison is a process that can be very time-consuming . Consequently, most organizations place limits on the frequency with which this process can be executed. At best, the frequency is weekly. The processing can then take place over the weekend when the systems are relatively quiet and the extra processing required to perform this exercise can be absorbed without too much of an impact on other processing. Many organizations permit only monthly updates to the dimensional data, and some are even less frequent than that.

The problem is that the only transaction time available, against which the changes can be recorded, is the date upon which the change was discovered (i.e., the file comparison date). So, for example, let us assume that the frequency of comparison is monthly and the changes are captured at the end of the month. If a customer changes address, and geographic region, at the beginning of the month, then any facts recorded for the customer during the month will be credited permanently to the old, incorrect region.

It is possible that, during a single month, more than one change will occur to the same attribute. If the data is collected by the file comparison method, the only values that will be captured are those that are in existence at the time of capture. All intermediate changes will be missed completely.

The degree to which this is a problem will vary from application to application. It is accepted that, in general, valid time change capture for dimensions is not, practically speaking, realistic. However, it is important that practitioners recognize the issue and try to keep the difference between transaction time and valid time as small as possible. The fact that some data relating to time as well as other attributes is found to be absent from the source systems can come to dominate data warehouse developments. As a result of these problems, the extraction of data is sometimes the longest and riskiest part of a data warehouse project.

Summary of the Problems Involving Time

So far in this chapter we have seen that maintaining accuracy in a data warehouse presents a challenging set of problems that are summarized below:

Identifying and capturing the temporal requirements. The first problem is to identify the temporal requirements. There is no method to do this currently. The present data warehousing modeling techniques do not provide any real support for this.
Capture of dimensional updates. What happens when a relationship changes (e.g., a salesperson moves from one department to another)? What happens when a relationship no longer exists (e.g., a salesperson leaves the company)? How does the warehouse handle changes in attribute values (e.g., a product was blue, now it is red)? Is there a need to be able to report on its sales performance when it was red or blue, as well as for the product throughout the whole of its lifecycle?
The timeliness of capture. It now seems clear that absolute accuracy in a data warehouse is not a practical objective. There is a need to be able to assess the level of inaccuracy so that a degree of confidence can be applied to the results obtained from queries.
Synchronization of changes. When an attribute changes, a mechanism is required for identifying dependent attributes that might also need to be changed. The absence of synchronization affects the credibility of the results.

We have seen that obtaining the changed data can involve complex processing and may require sophisticated design to implement in a way that provides for both accuracy of information and reasonable performance.

Also in this chapter we have explored the various problems associated with time in data warehousing. Some of these problems are inherent in the standard dimensional model, but it is possible to overcome these problems by making changes to the way dimensional models are designed.

Some of the problems relate to the way data warehouses interact with operational systems. These problems are more difficult to solve and, sometimes, impossible to solve. Nevertheless, data warehouse designers need to be fully aware of the extent of the problems and familiar with the various approaches to solving them. These are discussed in the coming chapters.

The biggest set of problems lies in the areas of the capture and accurate representation of historical information. The problem is most difficult when changes occur in the lifespan of dimensions and the relationships within dimensional hierarchies, also where attributes change their values and there is a requirement to faithfully reflect those changes through history.

Having exposed the issues and established the problems, let's have a look at some of the conventional ways of solving them.

only for RuBoard - do not distribute or recompile