The Slowly Changing Dimension Wizard and Transform (SCD)


The Slowly Changing Dimension transform makes it easy to quickly generate complex Slowly Changing Dimension transform data flow. Table 22.1 contains the transform profile.

Table 22.1. The Slowly Changing Dimension Transform Profile

Property

Value

Description

Component Type

Transform

 

Has Custom Designer

Yes

Slowly Changing Dimension Wizard

Internal File I/O

No

 

Output Types

Synchronous

 

Threading

Single

 

Managed

No

 

Number Outputs

Up to 6

Depends on change types selected

Number Inputs

1

 

Requires Connection Manager

Yes

 

Supports Error Routing

No

 

Constraints

Nonoptimized

Not optimized, must be tuned to handle larger loads


A full discussion about the nature of dimensions and slowly changing dimensions is beyond the scope of this book, but a brief discussion here will help establish a context for discussing the Slowly Changing Dimension Wizard and transform.

A Typical Slowly Changing Dimension

Generally speaking, Business Intelligence is about capturing, analyzing, and understanding the trends, patterns, and history of a business so that companies can make more informed decisions. Facts record transactions and dimensions contain information about the transactions, such as customer name, time, weather, product, and advertising promotions. Slowly changing dimensions (SCD) result from the fact that over time, the world changes and dimension values must also be updated to reflect those changes. Not every dimension is a slowly changing dimension, and not every slowly changing dimension should be handled in the same way. One example of a dimension might be time. Time doesn't change. October 3, 1963, will always be October 3, 1963. However, marriage status, sales district, last names, and, nowadays, even gender dimension attributes do. Even then, you don't necessarily want to update every change in the same way. You might want to maintain the history of some changes, overwrite others, or simply detect when a change is attempted. For example, if the sales districts for a company are reorganized, to ensure that sales reports and queries remain accurate, you might want to keep a record of the previous sales districts for all salespersons. But, because customer name changes happen so rarely and because names are simply identifiers for the same entity, you might choose to overwrite last name attributes of customer dimensions. This is a fundamental problem with dimensions, and the Slowly Changing Dimension Wizard provides a way to implement the logic to handle these changes.

"Type 1" Slowly Changing Dimensions

Consider a simple example of an existing customer, Tom Brokenjaw, who is moving from Washington to New York on July 21st. If you don't really need to track the fact that, prior to moving to New York, Tom once lived in Washington, you can simply overwrite the state dimension. Table 22.2 shows how the dimension would appear before updating the dimension and Table 22.3 shows the change.

Table 22.2. Customer Dim Record Before Type 1 Change

Customer
Dimension Key

Customer
Business Key

Customer
Last Name

Customer State

42

TB2015447815

Brokenjaw

WA


Table 22.3. Customer Dim Record After Type 1 Change

Customer
Dimension Key

Customer
Business Key

Customer
Last Name

Customer State

42

TB2015447815

Brokenjaw

NY


In this case, the state value is overwritten. This is known as a Type 1 dimension, and in Integration Services is referred to as a Changing Attribute Change.

"Type 2" Slowly Changing Dimensions

Now, consider a scenario in which it is important to track the fact that Tom lived in Washington before moving to New York on July 21st. To accomplish this, you would create a new record for Tom in the customer dimension table and mark the old record as follows. Table 22.4 shows the record before updating. Table 22.5 shows the additional record.

Table 22.4. Customer Dim Record Before Type 2 Change

Customer
Dimension Key

Customer
Business Key

Customer
Last Name

Customer
State

Effective
Date

Current
Record

42

TB2015447815

Brokenjaw

WA

06-01-2000

True


Table 22.5. Customer Dim Record After Type 2 Change

Customer
Dimension Key

Customer
Business Key

Customer
Last Name

Customer
State

Effective
Date

Current
Record

42

TB2015447815

Brokenjaw

WA

06-01-2000

False

1001

TB2015447815

Brokenjaw

NY

07-21-2006

True


Notice that there are two more attributes in the dimension for Type 1 records because Type 2 changes preserve history, providing a way to determine when a particular change took effect. Notice also that the old record now has the current record flag set to FALSE and the new record is flagged as the current record indicating that Tom now resides in New York, thus preserving the history of Tom's state of residence. Slowly changing dimensions using this technique are referred to as Type 2 dimensions and are, by far, the most common type of slowly changing dimension. In Integration Services, this type of change is called a Historical Attribute Change.

Additional Slowly Changing Dimension Types

Type 1 or Type 2 are the most common change types, but occasionally you might have late-arriving dimension information or you might want to detect when a dimension attribute changes. Although the SCD does not directly support Type 3 changes, it is possible to detect when a Type 3 change might be appropriate, as described in the following section.

Inferred Member

Traditionally, late-arriving dimensions have been exceptional cases, such as data-entry errors or back-dated, real-world events. For example, occasionally, customers don't alert the company of address changes until months after they occurred or product manufacturing processes or parts change but are not correctly updated.

With the emergence of new technologies and business models, late-arriving dimensions are becoming more common. For example, grocery stores offer membership cards that you can use for discounts on items in the store. The membership card has a unique ID that tracks the buying habits of the card holder even if the card holder hasn't provided any information about herself to the grocery store. The customer might purchase groceries, which represent facts, but might not provide any information about herself at the time she makes her initial purchase with the membership card. Even without the customer information, the grocery store still needs to capture the transaction information as fact records and the fact records need to reference the membership card dimension record. However, the card holder might later provide her personal information, which then needs to be added to the dimension record as the late-arriving dimension.

Another example is furnaces and other home appliances that carry warranties. Large builders sometimes purchase such appliances in bulk to get discounted pricing. When a builder purchases the appliance, they often do not know who the ultimate owner will be or even where the utility will be installed. Only when the owner purchases the house, condo, or other building and registers the appliance will that information become available. When that happens, they will provide new information that needs to be inserted into the existing dimension record for that appliance warranty.

Inferred member changes are dimension records created in anticipation of such late-arriving dimension data.

Fixed Attribute

A Fixed Attribute change type enforces no changes. This change type detects when a record has changed so that it can be recorded and handled in the downstream data flow.

"Type 3" Slowly Changing Dimensions

Type 3 Slowly Changing Dimensions (SCD) are very rare and occur when a business must track the original and current values of only certain specific attributes of a dimension record. Because Type 3 Slowly Changing Dimensions are uncommon and require that the dimension table be modified, the Slowly Changing Dimension Wizard doesn't support them.

Tip

If you want to implement a Type 3 SCD, you can mark the column you want to update with the Fixed Attribute change type and then capture the Fixed Attribute output rows to discover Type 3 attribute candidates.


The Slowly Changing Dimension Wizard

OK, let's put the Slowly Changing Dimension Wizard through its paces. The package you're about to build is the SCD.dtsx package in the book samples in the S21-Advanced solution. The wizard takes you through six dialog boxes and a summary dialog box, as described in the following sections.

The Slowly Changing Dimension Wizard Steps

The following are the general steps the Slowly Changing Dimension Wizard walks you through for defining the additional data flow and the runtime behavior of the Slowly Changing Dimension transform.

  1. Select a Dimension Table and Keys This is where you select which dimension table you want to modify and the keys for matching the source records with dimension table records.

  2. Select the Slowly Changing Dimension Columns This is where you select the dimension attributes and the change type for each attribute.

  3. Select Fixed and Changing Attribute Options This is where you set the SCD transform to fail if a fixed attribute has changed and whether to update all records when a change is detected in a changing attribute.

  4. Select Historical Attribute Options This is where you tell the Slowly Changing Dimension Wizard how to detect current and expired records.

  5. Select Inferred Dimension Members This is where you turn on inferred member support and how to detect that an incoming record is for a previously inserted inferred dimension record.

  6. Finish the Slowly Changing Dimension Wizard This is where you can review the settings before finalizing the wizard.

Running the Slowly Changing Dimension Wizard

The preceding steps show the dialog boxes the wizard walks you through to generate the data flow. Let's run through the wizard and see what it produces. Make sure you have the AdventureWorks and AdventureWorksDW sample databases installed. To run the Slowly Changing Dimension Wizard, open the designer and perform the following steps:

1.

Create a new Integration Services project.

2.

Click on the package in the new solution.

3.

Drop a Data Flow Task onto the workflow.

4.

Switch to the Data Flow tab.

5.

Drop an OLE DB Source Adapter and double-click on it.

6.

Click the New button to create a new OLE DB connection manager.

7.

Connect to the AdventureWorks database.

8.

Select Table or View for the Data Access mode. (Default)

9.

Select [HumanResources].[Employee] for the name of the table.

10.

Click on the Columns node.

11.

Click the OK button to close the OLE DB Source Adapter dialog box.

12.

Drop a Slowly Changing Dimension transform onto the Data Flow designer.

13.

Drag the output from the OLE DB Source Adapter to the SCD transform.

Caution

The SCD transform doesn't support null business keys. If the source data includes rows in which the Business Key column is null, you should filter those rows from the flow with a conditional split before attempting to process them in the SCD transform.

14.

Double-click the Slowly Changing Dimension transform to open the Slowly Changing Dimension Wizard.

15.

You should see the Welcome screen shown in Figure 22.1 Click the Next button, if you do.

Figure 22.1. The Slowly Changing Dimension Wizard welcome screen


16.

Click the New button to create a connection manager to AdventureWorksDW or select an existing one.

17.

From the Table or View drop down, select [dbo].[DimEmployee].

18.

In the grid, in the Input Column for EmployeeNationalIDAlternateKey Dimension Column, select NationalIDNumber and set the Key Type to Business Key. The wizard should appear as Figure 22.2 shows. This tells the SCD transform how to match the dimension table record with the source system record and is usually a match between the source table primary key and the dimension table alternate or surrogate key.

Figure 22.2. Matching records from the source to the dimension tables


19.

Click the Next button to take you to the Slowly Changing Dimension Columns dialog box, as shown in Figure 22.3.

Figure 22.3. Setting up the attribute change types


20.

In the Dimension Columns column of the grid, select BirthDate.

21.

Leave it as a Fixed Attribute because someone's birth date should never change. Later in the wizard, you'll tell the wizard what to do if there is a row with a birth date value that has changed.

22.

In the Dimension Columns column of the grid, select MaritalStatus.

23.

Set this to the Changing Attribute change type to simply overwrite the old value with the new one.

24.

In the Dimension Columns column of the grid, select SalariedFlag.

25.

Set the change type to Historical Attribute, assuming you want to keep a history of when employees become salaried.

Note

The Slowly Changing Dimension Wizard creates outputs for fixed, changing, and historical change types only if there is at least one dimension with one of those change types. If you do not select a certain change type, for example a fixed change type, for any of the dimension table columns, there will be no Fixed Attribute option.

26.

Click the Next button; the Fixed and Changing Attribute Options dialog box opens, as shown in Figure 22.4. If, in the preceding dialog box, you did not create any fixed or changing attributes, these two options will be disabled.

Figure 22.4. Setting options for fixed and changing attribute types


27.

Ensure that the changes detected in fixed attributes do not fail the transform by deselecting the top option. This allows the record with the changed fixed attribute to be routed to the error output.

28.

Put a check in the Changing Attributes check box to enable Changing Attributes. This ensures that there is only one active dimension record for each employee. As you'll see in a minute, this check box affects the update query by adding an additional condition to the WHERE clause.

29.

Click the Next button, which takes you to the Historical Attribute Options dialog box, shown in Figure 22.5.

Figure 22.5. Setting up the Historical Attribute options


30.

Select the Use a Single Column to Show Current and Expired Records option because the DimEmployee table has one called CurrentFlag. Alternately, you could use the Use Start and End Dates to Identify Current and Expired Records option.

Tip

If you use the start and end dates option, the Slowly Changing Dimension Wizard populates the start and end date column drop downs with any date-time columns available in the dimension table for you to select. In addition, the Variable to Set Date Values drop down will contain all date-time variables in scope with the Data Flow Task. So, you can create your own and populate it with a date of your choosing, or you can use one of the system date-time variables.

31.

In the Column to Indicate Current Record drop down, select CurrentFlag.

32.

In the Value when Current drop down, select True.

33.

In the Expiration Value drop down, select False.

34.

Click the Next button, which takes you to the Inferred Dimension Members dialog box, shown in Figure 22.6.

Figure 22.6. Enabling inferred dimension members


35.

Check the Enable Inferred Member Support check box.

36.

Leave the All Columns with a Change Type Are Null option selected. This tells the SCD transform how to detect if a dimension record is an inferred record.

37.

Click the Next button, which takes you to the summary page.

38.

Click the Finish button.

When you finish the wizard, it takes all the settings you've input and creates a data flow. If you've followed along with the preceding steps, the Slowly Changing Dimension Wizard generates a data flow similar to what is shown in Figure 22.7.

Figure 22.7. The data flow resulting from the Slowly Changing Dimension Wizard


The Generated Data Flow

When you complete the Slowly Changing Dimension Wizard, you end up with a data flow that contains a Slowly Changing Dimension transform, with varying numbers of outputs going to various transforms and destination adapters. At execution time, the Slowly Changing Dimension transform plays the role of row diverter. By applying the rules established in the wizard, the SCD transform decides which output the source rows should flow down, letting the downstream data flow process the rows appropriately. Let's take a look at each of the resulting data flows that result from the settings you entered into the wizard.

The New Output Data Flow

The New Output data flow handles all new dimension rows. Depending upon whether your dimension table supports historical attribute change types, this output might vary. This is basically a straight insert into the dimension table after ensuring that the Current column is set correctly. In the previous example, the DimEmployee table does support the historical attribute change type, so the resulting data flow for the New Output has a Derived Column transformation to set the CurrentFlag column to trUE.

The Historical Attribute Output Data Flow

The Historical Output data flow is similar to the New Output data flow because it is essentially doing the same thinginserting a new record. The difference is that the historical attribute must change the last record current column to indicate that it is now expired. So, the additional data flow has an OLE DB Command transform that sets the current record to be expired with the following command:

UPDATE [dbo].[DimEmployee] SET [CurrentFlag] = ? WHERE [EmployeeNationalIDAlternateKey] = ? 


This sets the current dimension's CurrentFlag to FALSE and the key is the business key.

If you selected the option to set all historical attribute records to be expired in the wizard, the command has the following condition added:

UPDATE [dbo].[DimEmployee]  SET [CurrentFlag] = ?  WHERE [EmployeeNationalIDAlternateKey] = ?  AND [CurrentFlag] = '1' 


This ensures that when the new dimension record is inserted, only the new one will be considered current.

The rest of the data flow for the historical attribute is identical to the New Output and so is merged into the New Output data flow with a Union All.

Changing Attribute Updates Data Flow

The Changing Attribute Updates data flow is quite simple because all it is required to do is update the existing attribute in the given dimension record. Recall that you set the MaritalStatus attribute as the Changing Attribute change type. The resulting data flow is simply an OLE DB Command transform with the following update command:

UPDATE [dbo].[DimEmployee] SET [MaritalStatus] = ? WHERE [EmployeeNationalIDAlternateKey] = ? 


The MaritalStatus value is the new column value from the source and the key is the current business key.

Inferred Member Updates Data Flow

The records flowing down this output are bound for existing dimension records, so an update is in order. The OLE DB Command transform on this output performs the following UPDATE command:

UPDATE [dbo].[DimEmployee] SET [BirthDate] = ?,     [CurrentFlag] = ?,     [Gender] = ?,     [HireDate] = ?,     [LoginID] = ?,     [MaritalStatus] = ?,     [SalariedFlag] = ?,     [SickLeaveHours] = ?,     [Title] = ?,     [VacationHours] = ? WHERE [EmployeeNationalIDAlternateKey] = ? 


The attribute values are the new values from the source and the key is the current business key.

Fixed Attribute Output Data Flow

The Slowly Changing Dimension Wizard does not automatically generate any data flow for this output, but it might be sent to a destination or other transform to detect possible Type 3 change candidates or as a way to audit columns that should not be changing.

Unchanged Output Data Flow

The Slowly Changing Dimension

Wizard does not automatically generate any data flow for this output because the SCD transform doesn't do any work for these rows. You can connect this output and the SCD transform pushes all the rows that were current to the output.

How the SCD Processes Rows

When you first run the Slowly Changing Dimension Wizard, you might be a bit overwhelmed when it generates up to six outputs, each with their own data flow. Previously, you examined the various data flows that the Slowly Changing Dimension Wizard automatically generated. Now let's take a look at how the SCD transform processes each incoming row. Figure 22.8 graphically shows the process the SCD transform uses for diverting rows down its various outputs and the expected processing that should occur for each output.

Figure 22.8. Slowly Changing Dimension transformation processing steps


Performance Considerations

The primary design goals for the Slowly Changing Dimension Wizard and transform were to make the process of building robust SCD handling data flow quick and easy. As you can see, the Slowly Changing Dimension Wizard makes it easy to quickly build data flow for handling slowly changing dimensions and is useful for those wanting to quickly generate such packages. However, after you start dealing with larger dimensional datasets, you'll probably want to modify the automatically generated data flow to improve performance. The following are some optimizations you might consider for optimizing the generated SCD data flow.

Limiting the Dimension Recordset

When processing Type 2 slowly changing dimensions, the SCD transform performs a lookup for every incoming row to determine if there is an existing dimension. You can speed up the lookup by eliminating the number of records in the lookup. In fact, you can have the lookup performed on a different table, perhaps a temporary table with only one indexed column containing the subset of the alternate or surrogate keys that are currently active.

Staging Outputs

The Slowly Changing Dimension Wizard automatically generated data flow makes heavy use of the OLE DB Command transform. Although powerful and useful, this transform performs row-by-row operations, which can be very expensive. To eliminate the OLE DB Command transform, replace it with an OLE DB Destination and stage the data for both update rows and insert rows. Then, later in the workflow, execute separate INSERT and UPDATE commands using the staged data as the source. Because both INSERT and UPDATE statements allow you to join and compare two or more tables, you can conditionally update and insert rows as necessary. Ensure that you run the updates before the inserts. Running them concurrently can create locking issues.

Use a Lookup Transform Before the SCD Transform

Consider using a stock Lookup transform with partial caching turned on right before the SCD to quickly identify rows that don't exist in the dimension table. Those rows that do not exist can be directly inserted. This effectively eliminates the rows flowing through the New Output from the SCD transform and increases the efficiency of the SCD transform lookup step.

New Dimension Tables

When building a new dimension table, the SCD transform has a property named IncomingRowChangeType with the default value Detect. If the dimension table is empty, you can set this value to AllNew the first time you run the package and the SCD redirects all rows to the New Output.

The Slowly Changing Dimension transform provides a way to quickly and easily build complex dimension update data flow logic that is robust and customizable. The SCD transform supports four dimension change types, including changing, historic, fixed, and inferred member. For many applications, the Slowly Changing Dimension Wizard generated data flow is sufficient. However, if needed for increasingly large data loads, the SCD transform also lends itself well to optimizations.



Microsoft SQL Server 2005 Integration Services
Microsoft SQL Server 2005 Integration Services
ISBN: 0672327813
EAN: 2147483647
Year: 2006
Pages: 200
Authors: Kirk Haselden

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net