The solution for the Credit Union data warehouse will focus on the ETL process required to handle SCDs and on the changes to the relational tables and Analysis Services dimensions required to support this. Updating Dimensions with Changing AttributesWe will start with the Channel dimension, which does not track historical information because any new values for attributes will overwrite existing values. (This is a Type 1 SCD.) There will not be any changes required for the Channel dimension table because we have followed the best practices of using a surrogate key and creating a unique key constraint on the business key. The Slowly Changing Dimension transformation in Integration Services automates much of what used to be a fairly complex and time-consuming job for ETL designers. In the past, you would usually need to copy all the source records into a staging area, compare the staging table to the dimension table and insert any new records, compare all the attribute values that might have changed, and then retrieve and update the record using the business key. To use the Slowly Changing Dimension transformation, you need to specify the source of the dimension records as well as the destination table in the data warehouse. The wizard also needs to know information such as which column is the business key and which columns may have changed values.
The wizard will create two separate output destinations for new or updated records. Any new records that are not already in the dimension table (which is determined by looking up the record using the business key) are inserted into the table from the New Output data flow path. For dimension records that already exist, the Changing Attribute Updates Output path will execute an SQL command to retrieve the dimension record using the business key and update the changing attributes, which are ChannelName and ChannelType in our example. Because the SQL command uses the business key, it can take advantage of the clustered index that we recommended you create on all dimension tables, as described in Chapter 3, "Building a Data Warehouse." Tip: Deleted Members You might notice that the outputs from the SCD transformation do not cater for deleted records. Remember that you are trying to track history in the data warehouse. Even if a member is no longer used in new activities or transactions, it is still used by facts in your data warehouse. For this reason, you normally don't want to delete any dimension rows from the data warehouse.
Preserving Information for Dimensions with Historical AttributesThe next dimension that we will tackle is the Customer dimension. Even though some attributes, such as customer name, can be overwritten in the same way as channel, this dimension still needs to be able to preserve history because of the credit rating: Older facts still need to roll up under the correct credit rating, so we need to make some changes to the Customer dimension table. Changing the Dimension Table to Support Historical InformationWhen a customer record has an updated value for credit rating, we will be inserting a new dimension record that has the same business key but a new surrogate key. The first table change we need to make is to remove the unique key constraint for the Customer business key column, because each business key may now have multiple records. As discussed in the "Data Model" section, any new facts that are added will be using the new surrogate key and so they will automatically link to the correct credit rating. However, when we are processing the facts, we only know the business key, so how do we know which one of the two dimension records to pick? The answer is to add some columns to the dimension table that enable us to indicate which one of the records is the latest dimension record, so that we can look up the correct surrogate key. One possible approach to this is to add a Status column that contains a value such as Current for the latest record and Expired for all the older records. The problem with this approach is that it only tells you which record is the latest one. If at any point you need to reload all your historical facts (which should not happen if you have good backups), you need to know which dimension record is appropriate for each fact record. For example, a transaction from January 12, 1998, should use the dimension record that was in effect at that point. In our Customer dimension, we will be using the approach of adding StartDate and EndDate columns to the table. These are simply DateTime columns, and only EndDate is nullable. When the dimension table is first loaded, all records will have the same initial StartDate and a null EndDate. When a changed dimension record is received, the existing record has its EndDate set to the date of the load, and a new record is created with a null EndDate and the StartDate set to the current load date, as shown in Table 8-6.
We can then see which dimension record is the "current" record for each business key by looking for all the records with null end dates. Also, we can reload historical facts by joining to the dimension table using the transaction date and selecting dimension records that fall into the correct range. Loading Dimension Records with Historical InformationIn Integration Services, the logic needed to detect changed attributes, to set the end date for the old record, and to add a new record is all provided for you by the same type of SCD transformation that we used for the Channel table. To implement the Customer load package, we can just follow the same steps as for the Channel table; but when the SCD wizard wants to know the type of changes that are required for Credit Rating, we can select Historical Attribute rather than Changing Attribute. This lets the wizard know that it will need to generate the steps that are required to support Type 2 dimension changes. One thing to note is that we can still select Changing Attribute for the Customer Name column, which we always want to overwrite, even though the Credit Rating column has Historical Attribute. What will happen with this is that if the customer name changes, the Integration Services package will simply update the current dimension record with the correct customer name. However, if the credit rating changes, because it was marked as a Historical Attribute, the package will go through the steps required to expire the old record and create a new one with the changed credit rating. Updating All the Existing Records with a Changed AttributeBy default, when a customer name or other field marked Changing Attribute changes, only the current dimension record is updated by the Integration Services package. This means that when you look at older facts records that link to expired dimension records, you will still see the old customer name. This is often not the behavior that you wantif an attribute is marked as Changing rather than Historical, the reason is usually that you always want to see the most current information for that attribute. To make sure that the SCD transformation updates all the records for a changing attribute, when running the SCD wizard, you need to check the "Change all the matching records, including outdated records" check box in the Options page, as shown in Figure 8-4. Figure 8-4. Change all matching records optionRunning the PackageTo understand how the SCD transformation for historical attributes actually works, it is helpful to load your dimension once and then go back and make a few changes to the source table to see what happens. For example, try updating a customer name on one record, changing a credit rating on another record, and adding a record with a new business key. When you run the Integration Services package, you can see the counts of records that flow to each of the steps, as shown in Figure 8-5. Figure 8-5. Slowly changing dimension packageAll the records in the source dimension table flow from the source to the SCD transformation and then are split up depending on their characteristics. If changing attributes such as Customer Name have changed, the record is sent to the Changing Attribute Updates Output to update the existing customer record. If a historical attribute has changed, the record is sent to the Historical Attribute Inserts Output, which will set a variable to determine the end date and then update the end date on the current record. The record is then sent on to the Union All transformation, which combines any new records from the source and sets a value for the start date. Finally, one record is inserted for the new source record, and another record is inserted for the new version of the record with changed historical attribute. Detecting Changes in Source Dimension DataThe packages that we have built so far have used all the records in the underlying source dimension tables as the input. This is easy to implement, and Integration Services provides good performance for most scenarios. However, if you have a large number of records in a source dimension that do not change often, it is inefficient to throw all the source records at the Slowly Changing Dimension transformation and let it figure out that almost none of them have been changed. Ideally, you want to limit the list of source records to only those ones that have been added or changed since the last time you ran the load. This section describes some techniques that you can use to determine which records are affected if the source system supports them. Time Stamps and Sequence NumbersApplications often include a "Last Update Time" time stamp on any records that are updated through the application. This can prove helpful for determining which records should be loaded into the data warehouse. At the start of the load operation, you check the latest data warehouse load date and time and then load any source records with a Last Update Time since the last load. This approach only works if the application strictly enforces the updating of the time stamp and no manual processes can bypass this process. If you are going to rely on a time stamp, it is worth spending extra time confirming the behavior with the owners of the source system. If there is any doubt at all, we recommend simply loading all records; otherwise, the "missing updates" will become a data quality issue for the data warehouse. If an application database includes transaction sequence numbers such as automatically generated identity columns, these are an efficient method to detect additions to the fact tables. You can check for the highest sequence number in the fact table being loaded and then load any source facts with a higher transaction sequence number. TriggersA trigger on the source table is a good way to capture changes because it doesn't rely on developers to maintain time stamps or other indicators on the rows. They fire every time, regardless of which application loaded the data. Because every change is captured, even if your data source does not preserve history, your triggers can capture historical rows. In the code for a trigger, you can insert the affected record into one of three tables depending on whether the operation is an insert, update, or delete. Or you can put all affected rows into one table with an operation type indicator. Triggers can only be applied to data sources that are relational databases, not flat files or spreadsheets. Sometimes, the database administrator for the data source won't let you add triggers for a variety of reasons, including application performance, so a trigger isn't always an option.
Inserting Facts with Surrogate Key Lookups from Changing DimensionsThe next area of the Credit Union solution that we need to look at is the fact records. Each transaction record from the source system has, as usual, the business key for dimension such as channel and customer. This is easy to handle for the Channel dimension because each business key maps to exactly one surrogate key, so it can be handled as described in Chapter 4, "Integrating Data," using the Integration Services Lookup transformation. The lookup for the Customer dimension is more challenging. Because there can be multiple records in the customer table for each business key, we need to implement some logic that can locate the correct customer record for each incoming fact. If all the facts that we load are always new (that is, we never have to reload historical facts), we can simply use a query that returns only the current, unexpired customer records as the source for the Lookup transformation, as shown in Figure 8-6. SELECT CustomerKey, CustomerBusinessKey FROM DimCustomer WHERE EndDate IS NULL Figure 8-6. Specifying the source for the Customer lookupBecause this will return a single record for each customer business key, we can use this as the lookup source to translate business keys to surrogate keys. However, it is more likely that we will want to support reloading historical facts. In this case, we need to check the transaction date on each fact record and figure out which customer record was applicable at that time. Because SQL is the best performing tool for big joins such as this, the approach usually involves copying the source fact records that you want to load into the temporary staging area and then joining to the dimension table to retrieve the surrogate key in the data source query, rather than in a lookup transform. An example data source query is shown below: SELECT C.CustomerKey, F.Measure1, F.Measure2, ... FROM FactStaging F LEFT JOIN DimCustomer C ON C.CustomerBusinessKey = F.CustomerBusinessKey AND F.TransactionDate >= C.StartDate AND (C.EndDate IS NULL OR F.TransactionDate < C.EndDate) You need to explicitly check for null end dates in the join; otherwise, fact records for the latest time period will not be matched. You can see that we used a left-outer join to connect the dimension, so we will not lose any facts if the dimension member is missing. In a derived column transform, you can check to see if CustomerKey is null, and substitute a missing member key for CustomerKey. Dealing with Very Wide Dimension TablesIn some CRM systems, you find a large number of attributes in a single table such as Customer. These attributes exist either because they came "out of the box" because the vendor wanted to cover as many situations as possible, or system owners kept adding attributes without much regard to an efficient data model. If you are trying to support a Type 2 changing dimension, a wide row can quickly inflate the space consumed by a dimension table. Not all dimensions will be a problem, but a dimension such as Customer can have millions of members and those members will have multiple rows. It is possible that each year each member will be updated several times, and each update will generate a new row. Some of these attributes are only of interest for their current value. If they change, it is either to correct an error or the historical value isn't relevant. The history of other attributes will be of interest, which is what makes a Type 2 SCD. If you find that you have only a few attributes where tracking history is important, and many attributes where replacing history is acceptable, you might want to split the physical dimension table into two parts, one for the Type 2 attributes and one for Type 1 attributes. In this way, you will be creating multiple versions of a much smaller row. Analysis Services Dimension ChangesThe approach that we have followed in earlier chapters for creating Analysis Services dimensions works pretty well for slowly changing dimensions, too. Returning to the example shown again in Table 8-7, if a user is looking at all time periods that include transactions from when Emily Channing had a High credit rating, as well as transactions where she had the updated Medium credit rating, the query will return the information that the user usually is expecting to see.
For example, if a user drags the Credit Rating attribute onto the rows, he will see totals for both High and Medium credit ratings correctly allocated to the relevant credit ratings. This is essentially the point of SCDs: The revenue earned from this customer when she had a High credit rating is shown correctly, regardless of the fact that she currently has a Medium credit rating. If we use the Customer Business Key attribute (which contains the customer number from the source system) in a query, everything still works perfectly. Even though there are now two physical dimension records for customer number 1000 with different surrogate keys, the transactions for Emily Channing are correctly grouped together into a single row for customer number 1000 in the results. Things get interesting when we drag the Customer attribute onto the rows. Remember that in our usual approach for dimensions, we remove the separate Customer Name attribute that the wizard creates, rename the Customer Key attribute to Customer, and use the customer name as the NameColumn property setting. So, we have a single Customer attribute that is based on the unique surrogate key (that is, Customer Key) and displays the customer name when it is used in a query. Because we now have two records for Emily Channing with different surrogate keys, when we drag the Customer attribute onto the rows, we get two separate rows, as shown in Figure 8-7. All transactions from the original when she had a High credit rating are allocated to the first row, and all the later transactions are allocated to the last row. This is actually technically correct, but because the user can only see the customer name and not the surrogate key (nor would the user understand the surrogate key if he could see it), the user is likely to complain that the dimension contains duplicates. Figure 8-7. Multiple rows for a customerThis might be exactly the behavior that the users want, especially because they can disambiguate the apparently duplicate records by showing the member properties for that attribute if their client tool supports it, which will show them the changes in member property values that are the reason for the separate rows. However, if this behavior is not what they want, you can fix the issue. The first change to make to the dimension is to change the name of the Customer attribute that is based on the surrogate key to something else, such as Customer Key. You can then drag the CustomerBusinessKey column onto the list of dimension attributes to create a new attribute. Rename this new attribute to be Customer and specify the CustomerName field as the NameColumn property for this attribute. Now when the user drags the Customer attribute onto a query, the data will be grouped so that there is a single row for each unique customer number rather than separate rows for each surrogate key. You can also hide the Customer Key attribute based on the surrogate key if you want by setting its AttributeHierarchyVisible property to false, although that means that the users won't be able to use the member properties of this attribute to see the different states that the customers have gone through.
|