SSIS Slowly Changing Dimension Wizard


Now it’s time to look at the built-in support for dimension ETL, called the Slowly Changing Dimension (SCD) Wizard. The SCD Wizard is a data-flow transformation and initially works like all the other transformations: simply drag and drop the transformation into the data flow and connect it to the upstream source or transformation. Figure 4-16 shows the data flow that was used earlier in the chapter with the SCD transformation now connected to the output of the Union All.

Figure 4-16: Data flow with the SCD transformation connected to the output of the Union All

Double-clicking the transformation will invoke the wizard. Like other user-interface wizards, it presents several windows that prompt you for the configuration needed to build the dimension ETL process. One of the nice advantages of the SCD Wizard is that it allows for very rapid ETL development.

The SCD Wizard supports Type 1 changing attributes, Type 2 historical attributes, inferred members, and Type 0 fixed attributes, all out-of-the-box.

When the source is connected to the SCD component, and the wizard invoked, the first screen will prompt you to identify the target dimension table, then the mapping of source columns from the data flow pipeline to the dimension columns in the dimension table, and, finally, the business keys in the source and dimension table. Figure 4-17 shows the mapping between the source rows generated for the product dimension and the dimension columns themselves.

Figure 4-17: Mapping between source rows generated for the product dimension and the dimension columns

Note that the data types must match in order for the source columns to be compared with the dimension table columns, which may require use of the Data Conversion transformation in the upstream data-cleansing logic. For the wizard to handle the matching automatically, the column names must be the same. Matching names is not a requirement, however, because the matching can be done manually. Furthermore, if you have any columns in the dimension table that are not attributes, but rather management or metadata columns (such as StartDate and EndDate columns that identify when the row is active), these will not be matched. Later screens in the wizard give you the opportunity to specify how these columns are used.

The final step in the first screen is to identify the business key or keys. In Figure 4-17, the ProductAlternateKey is the business key manually matched to the source column Product ID (which has been aliased as ProductAlternateKey). The business keys help identify dimension records that need to be added as new members. The business keys also provide part of the equation on identifying matching records that need to be evaluated for changes.

In the next screen of the wizard, you associate each dimension column that has just been matched with a change type. The wizard does not use the common dimension change type numbers (Type 0, Type 1, or Type 2); rather, it uses descriptive terms to identify the type of change (fixed, changing, or historical).

Figure 4-18 shows the matching of the dimension changes. The column on the left contains any non-business key matches identified in the prior screen, and the column on the right is a drop-down of the three different change types.

Figure 4-18: Matching of the dimension changes

In this example, the following groupings were made:

  • Fixed attributes: Size (a change in size generates a new product, so this column is fixed)

  • Historical attributes: Class, ModelName, ListPrice

  • Changing attributes: Color, DaysToManufacture, EnglishProductName, FinishedGoodsFlag, FrenchProductName, ProductLine, ProductSubcategoryKey, ReorderPoint, SafetyStockLevel, SizeRange, SizeUnitMeasureCode, SpanishProductName, StandardCost, Style, Weight, WeightUnitMeasureCode

Important 

The matching shown in Figure 4-18 can be a tiresome process if you have dozens of dimension columns and you are using your mouse to select the column in the drop-down. A quicker way to fill in these values is to use the Tab key and the up and down arrow keys. After finishing a row, the Tab key will send the cursor to the next row and automatically select the next column in the list.

Now that you have defined the column matches and identified the change types, the next few screens will help you to manage the advanced requirements for the dimension ETL process.

The next screen, shown in Figure 4-19, asks how Fixed attribute and Changing attribute members should be handled.

Figure 4-19: Fixed and Changing Attribute Options screen

If you do not identify any Fixed attribute or Changing attribute columns, then the respective detail questions shown in the screen shot will be grayed out.

The option for fixed attribute asks, “If there is a change in a fixed attribute, what should happen?”

  • By leaving the box unchecked, the change will be ignored and the value in the dimension table will stay the same as it was originally.

  • By selecting the box, if a change is detected in a Fixed attribute column, then the transformation will intentionally generate an error, so the data can be reviewed. This may be useful if, when a change is identified, it indicates a bigger problem.

The option for changing (Type 1) attributes identifies which records to update when a change happens. In other words, if the dimension record has any Type 2 historical attributes, there may be multiple records for the same business key.

  • When the check box is selected, all the related records (the current dimension member and the outdated members) are updated.

  • When the option is cleared, only the current record is updated with the new value of the column. For example, if Color changes from blue to aqua, and the business key has five historical records, only the current record is updated to aqua; the historical records remain blue. The sketch following this list makes the difference concrete.
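To illustrate the two update scopes, here is a minimal T-SQL sketch, assuming the StartDate/EndDate approach used in this example (where the current row has a NULL EndDate); the business key value is hypothetical:

  -- Check box selected: update every row for the business key, history included
  UPDATE [dbo].[DimProduct]
     SET [Color] = 'Aqua'
   WHERE [ProductAlternateKey] = 'BK-1234'        -- hypothetical business key

  -- Check box cleared: update only the current row
  UPDATE [dbo].[DimProduct]
     SET [Color] = 'Aqua'
   WHERE [ProductAlternateKey] = 'BK-1234'        -- hypothetical business key
     AND [EndDate] IS NULL                        -- current-row indicator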

If any historical attributes have been selected, then the next screen will prompt to identify how the current row is identified. Figure 4-20 shows the Historical Attribute Options screen.

Figure 4-20: Historical Attribute Options screen

The two choices to identify a current row are:

  • Use a flag column (usually Boolean) to identify whether a row is current. Besides selecting the column that should be used, the SCD Wizard also supports defining which values identify a current record and an expired record. Note also that the columns in the drop-down list are the dimension columns that have not been matched from the source, because unmatched columns are considered metadata columns used for management purposes such as this.

  • Use a StartDate and EndDate combination to manage when a dimension record is active. The StartDate and EndDate column selections must be dimension table columns defined with a datetime data type. One additional option exists when taking this approach: choosing which package variable supplies the value used to update the records. In other words, when a new dimension member needs to be added because of a change in a Type 2 historical attribute, the prior record must first be updated with a new EndDate, and the new record initialized with a new StartDate. Any system or user variable can be used; one good choice is the System::StartTime variable, which is automatically updated when the package is executed. In this example, the StartDate and EndDate columns are used in conjunction with the System::StartTime variable.

Important 

Choosing to use StartDate and EndDate columns will give you the advantage of knowing exactly when a particular dimension member was active, either for reporting purposes or even to handle the scenario when a fact record arrives late and a historical dimension record needs to be selected instead of the current member.
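For example, a late-arriving fact row can be matched to the dimension member that was active on the transaction date with a range lookup along these lines (a sketch; the variable values are hypothetical):

  DECLARE @BusinessKey nvarchar(25)
  DECLARE @TransactionDate datetime
  SET @BusinessKey = 'BK-1234'          -- hypothetical late-arriving fact values
  SET @TransactionDate = '20040315'

  SELECT [ProductKey]                   -- the member active on the transaction date
    FROM [dbo].[DimProduct]
   WHERE [ProductAlternateKey] = @BusinessKey
     AND [StartDate] <= @TransactionDate
     AND ([EndDate] > @TransactionDate OR [EndDate] IS NULL)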

The SSIS inferred member support is found in the next screen. As a reminder, inferred members are added during the fact ETL and updated during the dimension ETL. The built-in support for inferred members revolves around the dimension update. Figure 4-21 shows the options available for inferred members.

Figure 4-21: Options available for inferred members

First of all, inferred member support is either enabled or disabled. If enabled, the SCD Wizard needs to know how to identify whether a record is an inferred member. The two choices are to leave all the dimension columns in the dimension table as NULL, or to use an Inferred Member Boolean column in the table that identifies which rows are inferred members.

Since the product dimension table in the AdventureWorksDW database does not have an inferred member column, the first choice is selected.

Important 

Using the first choice for inferred members (all attributes contain NULL values) is often not practical, because, first of all, it assumes that the columns in the dimension table allow NULL values, and secondly, it makes for difficult querying. Displaying an unknown value, for example, is often a better way to present data in a reporting system, and if Analysis Services is used for analytics, NULL values are also a poor choice. The best practice is to define an inferred member column and handle the identification by setting the Boolean value to True or False.
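Following that best practice, the schema change and the placeholder insert performed during fact ETL might look like this sketch (the InferredMemberFlag column and the sample values are assumptions, since AdventureWorksDW does not include such a column, and the column list is trimmed):

  -- One-time schema change: add the inferred member indicator
  ALTER TABLE [dbo].[DimProduct]
    ADD [InferredMemberFlag] bit NOT NULL DEFAULT (0)

  -- During fact ETL: insert the placeholder member flagged as inferred
  INSERT [dbo].[DimProduct]
      ([ProductAlternateKey], [EnglishProductName], [InferredMemberFlag], [StartDate])
  VALUES
      ('BK-9999', 'Unknown', 1, GETDATE())        -- hypothetical business key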

The final screen, not shown, presents a summary of the outputs that will be created. By selecting Finish on this screen, the SCD Wizard will take all the configuration options and create several downstream transformations and destinations. Figure 4-22 shows the end result of the SCD Wizard-the SCD transformation remains, but it contains several outputs to handle the different attribute types.

Figure 4-22: End result of the SCD Wizard

Since SSIS dynamically builds the data flow, the resulting layout in this example may not look exactly like what you see in your own testing. However, the functionality is the same. For this example, on the left are changing attributes. Down the center are new members and historical attributes. On the right are inferred members. Starting with the SCD transformation, when a dimension row is processed, the SCD determines which (if any) attribute changes occur, and whether there are new members to be added or inferred members to update. Consider the SCD transformation to be like a Conditional Split; it evaluates every row, one at a time, and routes the records to different outputs.

The simplest output to understand is the Changing Attributes Updates Output, which is linked to the OLE DB Command 2 transformation, connected to the AdventureWorksDW database. The following code shows the SQLCommand property, which defines the UPDATE statement:

  UPDATE [dbo].[DimProduct]
     SET [Color] = ?
        ,[DaysToManufacture] = ?
        ,[EnglishProductName] = ?
        ,[FinishedGoodsFlag] = ?
        ,[FrenchProductName] = ?
        ,[ProductLine] = ?
        ,[ProductSubcategoryKey] = ?
        ,[ReorderPoint] = ?
        ,[SafetyStockLevel] = ?
        ,[SizeRange] = ?
        ,[SizeUnitMeasureCode] = ?
        ,[SpanishProductName] = ?
        ,[StandardCost] = ?
        ,[Style] = ?
        ,[Weight] = ?
        ,[WeightUnitMeasureCode] = ?
   WHERE [ProductAlternateKey] = ?

Note that only the columns defined as changing attributes (Type 1) are included in the UPDATE statement, simply because this output is only for the Type 1 changing attributes. Also notice that the SQL statement is an OLE DB parameterized statement with question marks, which is the way the OLE DB provider handles parameterization. Figure 4-23 shows the Column Mappings tab, which maps (in order) the pipeline input columns to the parameterized query.

Figure 4-23: Column Mappings tab

The order of the question marks defines the order of the mappings.

The second output is the Inferred Member Updates Output. This output is very similar to the Changing Attributes Update Output because it also performs an UPDATE statement. Just like the first output, the inferred member output uses an OLE DB Command transformation to handle the updates (in this case, the OLE DB Command 1 transformation). The UPDATE statement defined in the SQLCommand property is as follows:

  UPDATE [dbo].[DimProduct]
     SET [Class] = ?
        ,[Color] = ?
        ,[DaysToManufacture] = ?
        ,[EnglishProductName] = ?
        ,[FinishedGoodsFlag] = ?
        ,[FrenchProductName] = ?
        ,[ListPrice] = ?
        ,[ModelName] = ?
        ,[ProductLine] = ?
        ,[ProductSubcategoryKey] = ?
        ,[ReorderPoint] = ?
        ,[SafetyStockLevel] = ?
        ,[Size] = ?
        ,[SizeRange] = ?
        ,[SizeUnitMeasureCode] = ?
        ,[SpanishProductName] = ?
        ,[StandardCost] = ?
        ,[Style] = ?
        ,[Weight] = ?
        ,[WeightUnitMeasureCode] = ?
   WHERE [ProductAlternateKey] = ?

The difference, as you would expect, is that there are more columns in the UPDATE statement. Not only are the Type 1 changing attributes updated, but also the Type 2 historical attributes. Included in the UPDATE statement are the Class, ModelName, and ListPrice columns, which were defined as Type 2 historical attributes. These are updated because of the nature of an inferred member, which requires updates to all the columns without generating a new record. Furthermore, if you had defined an Inferred Member Flag column, this is where it would also be updated. Just like the first OLE DB Command transformation, the order of question marks defines the order of the mapping.

The next two outputs to consider are also related. Both the New Output (new members to be added) and the Historical Attribute Inserts Output add rows to the dimension table. What you can see in Figure 4-22 earlier is that a Union All transformation is used to bring these two outputs together for a single destination insert.

If the business key from the source does not exist in the dimension table, it’s identified as a new member that needs to be inserted. The New Output that handles new members goes directly to the Union All transformation. With the Historical Attribute Insert Output, before the records are brought together in the Union All, a couple of metadata management tasks need to happen. Whether the new historical attribute record is marked as current through a combination of dates or a separate column, the old record needs to be updated before the insert can happen. Either the End Date column is updated, or a current flag column is updated, which is handled in a two-step process:

  1. The Derived Column transformation that is attached to the Historical Attribute Insert Output adds either an EndDate column to the data flow (as in this example), or it adds the expired flag value. Figure 4-24 shows the Derived Column editor that defines an EndDate column and then also uses the System::StartTime variable as specified in the SCD Wizard.

    Figure 4-24: Derived Column editor

  2. Another OLE DB Command transformation is used to update the End Date for the expired record, based on the business key and the current indicator (in this case, the record to be expired will currently have an End Date of NULL). The SQLCommand property of this OLE DB Command transformation is as follows:

      UPDATE [dbo].[DimProduct]
         SET [EndDate] = ?
       WHERE [ProductAlternateKey] = ? AND [EndDate] IS NULL

One nice feature of the OLE DB Command transformation, as this example shows, is that the records from the pipeline can drive the UPDATE statement and still be available downstream for other purposes. After the expired record has been updated, the record is ready to be “union-ed” with the new member output.

Before the final insert for the new members and Type 2 historical attributes, one final step is necessary: adding the Start Date (or current record flag). This is handled by a Derived Column transformation, which adds a column to the pipeline called StartDate (in this example). Figure 4-25 shows the details of the Derived Column transformation that falls just below the Union All transformation and before the OLE DB destination (from the data flow shown earlier in Figure 4-22).

Figure 4-25: Details of the Derived Column transformation

Just like the EndDate, the StartDate for the new record uses the System::StartTime variable. New records, whether new members or historical attribute inserts, require the EndDate to be NULL. (If you have specified a current indicator instead, you set the indicator’s current value in this Derived Column transformation.)

When executed, the SCD transformation routes the rows from the source to the different outputs. Notice in Figure 4-26 that not all of the 504 input rows are sent to the outputs. This is because some of the records do not go through any changes, so they are effectively ignored.

Figure 4-26: Not all input rows are sent to the outputs

Advanced Properties and Additional Outputs of the SCD

There are two advanced properties of the SCD that can be used to customize how the SCD transformation handles the processing:

  • The CurrentRowWhere property identifies how the current dimension row for a source record is found. If you have configured the SCD to use a Current Flag column, then, by default, this property filters on the Current Flag with the value you defined in the wizard as current. Alternately, if you specified Start Date and End Date, then, by default, the property assumes that the End Date column IS NULL. This property can be changed if you need to redefine how the SCD transformation searches for the current record. For example, if your organizational standards do not allow NULL values, then you would have to modify this property to check the End Date for the default value set for the column (oftentimes, a date far out in the future, such as 1/1/2050, is used to accommodate this).

  • The SQLCommand property contains the SQL syntax used for the lookup against the dimension table, which determines whether a new member needs to be created, or whether there have been any changes in a Type 0 fixed attribute, Type 1 changing attribute, or Type 2 historical attribute. A representative sketch of both properties follows this list.
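As a point of reference, here is a hedged sketch of what these two properties might contain for the product dimension configured earlier (the wizard generates the actual text, which will differ in column list and formatting):

  -- SQLCommand: the per-row lookup issued against the dimension table
  SELECT [ProductKey], [Color], [Size], [Class], [ModelName], [ListPrice]
    FROM [dbo].[DimProduct]
   WHERE [ProductAlternateKey] = ?
     AND [EndDate] IS NULL

  -- CurrentRowWhere: the default filter when Start/End Dates are used is
  --   [EndDate] IS NULL
  -- If your standards disallow NULLs, it might instead become
  --   [EndDate] = '2050-01-01'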

Both of these properties can be found in the Properties window when the SCD transformation is selected, as shown in Figure 4-27, or in the advanced editor of the SCD transformation.

Figure 4-27: SCD Transformation Properties window

Furthermore, the SCD transformation contains two additional outputs that are not used by default, but are useful for auditing and data validation.

The first output enables you to capture the rows that have not gone through any change. Very likely, if you are pulling the entire dimension source (as opposed to just targeting new and changed records), you will have many dimension records from the source that are unchanged and completely in synch with the dimension table. Although a change has not happened, you may need to count the number of unchanged rows, or capture them in a table or file for review. The Unchanged Output is accessible by selecting the green path output from the SCD transformation and connecting it to another transformation or destination. When this is done, you will be prompted to choose the output from among the remaining unused outputs. In Figure 4-28, a Row Count transformation is used to capture the number of unchanged rows into a variable that is later used for auditing purposes.

Figure 4-28: Row count transformation used to capture the number of unchanged rows

The second additional output is the Fixed Attribute Output, which sends out any rows where a fixed attribute column has changed when it should not have. Rather than a Row Count, a better use of this output is to capture the records to a staging table for review, since a change was not supposed to happen.

Tip 

Only when the Ignore fixed attribute changes option is selected will this output be used. Otherwise, if a fixed attribute change occurred, the SCD would intentionally fail.

Slowly Changing Dimension Wizard Advantages and Disadvantages

The SCD Wizard is a very powerful tool, and will be appreciated by ETL developers who commonly deal with managing complicated ETL processes for dimensions. Several benefits will be achieved by using the built-in SCD support. However, there are also a few limitations surrounding the SCD that should be mentioned.

The advantages focus on management, development, and standardization, including the following:

  • Simplicity: The SCD Wizard can handle most dimension scenarios. It makes often complicated dimension processing straightforward, and helps standardize ETL development for dimensions.

  • Rapid development: The SCD can save time in the development lifecycle by reducing design and development time, and also easing management and testing. This leaves more availability for other areas of an ETL process.

  • Wizard allows changes: If the inputs entered into the SCD Wizard require changing, the wizard can be re-invoked, and the changes will propagate down to the downstream-generated components automatically. A caveat to this is presented in the limitations section.

  • Customized output transformations: Since the SCD Wizard generates transformations rather than taking a black-box approach, the output can be customized. For example, the OLE DB Command transformation used for Type 1 changing attributes can be removed and replaced with a staging table in order to achieve set-based updates, which often perform faster than row-by-row updates.

  • Beyond dimension processing: Although the name suggests that the SCD Wizard focuses exclusively on dimension processing, it can also be used for tasks such as table synchronization, leveraging just the Type 1 changing attribute support (and the included new member support).

The limitations of the SCD support focus mostly on scalability for large-dimension scenarios:

  • Dimension table lookup scalability: The dimension table is not cached in memory. Therefore, for every row coming in from the source, a separate lookup statement is sent to the dimension table in the underlying relational engine.

  • All updates are row-based: Relational updates are required for the Type 1 changing attribute output, the inferred member output, and the Type 2 historical attribute output (to expire the previous record). Because the OLE DB Command transformation is employed, every row coming through these transformations sends a separate UPDATE statement to the dimension table, in a cursor-like fashion. When dealing with several thousand updates, this can be limiting.

  • Customized outputs are overwritten by changes: Although the wizard can be re-run (with the prior run’s values remaining), when it finishes it regenerates the downstream transformations, so any customizations you have made are orphaned by the new set of transformations. Be careful when re-running the wizard after making customizations.

  • Locking issues and inserts: At the same time that data is queried from the dimension table for comparison, it may be updated to handle a Type 1 change, and inserted for new members and Type 2 historical records. All this simultaneous activity on the dimension table can slow down the dimension ETL process. Furthermore, the inserts cannot take advantage of the Fast Load option because of the locking contention, thus resulting in row-by-row inserts.

Optimizing the Built-in Slowly Changing Dimension Support

Later in this chapter, we examine writing an SSIS package for dimension ETL without using the built-in SCD Wizard support. However, if you are dealing with a large dimension table, a couple of techniques can be used to achieve better performance.

Index Optimizations

Since the dimension lookups and updates are both row-by-row operations, be sure to check the indexes on your dimension table to speed up the identification of the current record. If you are seeing very poor performance with the SCD (anything less than approximately 2,000 rows per minute), chances are the SCD lookups are requiring table scans or bookmark lookups after identifying the record key. For best ETL optimization, create your dimension table’s clustered index on the business key, rather than on the dimension surrogate key. Including the current indicator flag or end date as the second column in the index will improve the performance even more.
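A sketch of the suggested indexing, assuming the table is being created fresh and you are free to place the primary key as nonclustered:

  -- Surrogate key stays the primary key, but nonclustered
  ALTER TABLE [dbo].[DimProduct]
    ADD CONSTRAINT [PK_DimProduct] PRIMARY KEY NONCLUSTERED ([ProductKey])

  -- Cluster on the business key, with the current-row column second
  CREATE CLUSTERED INDEX [IX_DimProduct_BusinessKey]
      ON [dbo].[DimProduct] ([ProductAlternateKey], [EndDate])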

Important 

Index optimization must be balanced between the ETL and the query usage. Optimization for query patterns should take priority over optimizations for ETL. However, some situations may require ETL-focused optimization as a priority to achieve the service-level agreements (SLAs) identified for processing times. Also, be cautious that too many indexes can slow down operations.

Update Optimizations

Dimension table updates to handle Type 1 changing attributes are a common occurrence in any dimension table ETL process. Although the inferred member output also requires updates, the number of rows will typically be a fraction of the number of Type 1 changes, since inferred members are considered an exception to the rule. Because the updates are row by row (this is the way the OLE DB Command transformation works), dealing with thousands of updates will create a processing bottleneck. One way to improve performance is to replace the OLE DB Command update with a set-based update approach. In Figure 4-29, the OLE DB Command that handles the Type 1 changing output has been replaced with an OLE DB Destination to a staging table.

Figure 4-29: OLE DB Destination to a staging table

Using a staging table for this scenario improves performance, because it allows a single set-based UPDATE statement to be run. The UPDATE statement is handled with an Execute SQL Task in the control flow.

The set-based statement consists of an inner join between the staging table and the dimension table across the business key, where any matching rows (already identified as requiring a Type 1 changing update) have their attributes updated. The following TSQL code performs this operation:

  UPDATE [dbo].[DimProduct]
     SET [Color] = STG.[Color]
        ,[DaysToManufacture] = STG.[DaysToManufacture]
        ,[EnglishProductName] = STG.[EnglishProductName]
        ,[FinishedGoodsFlag] = STG.[FinishedGoodsFlag]
        ,[FrenchProductName] = STG.[FrenchProductName]
        ,[ProductLine] = STG.[ProductLine]
        ,[ProductSubcategoryKey] = STG.[ProductSubcategoryKey]
        ,[ReorderPoint] = STG.[ReorderPoint]
        ,[SafetyStockLevel] = STG.[SafetyStockLevel]
        ,[SizeRange] = STG.[SizeRange]
        ,[SizeUnitMeasureCode] = STG.[SizeUnitMeasureCode]
        ,[SpanishProductName] = STG.[SpanishProductName]
        ,[StandardCost] = STG.[StandardCost]
        ,[Style] = STG.[Style]
        ,[Weight] = STG.[Weight]
        ,[WeightUnitMeasureCode] = STG.[WeightUnitMeasureCode]
    FROM [dbo].[DimProduct]
   INNER JOIN [SSISOps].[dbo].[stgDimProductUpdates] STG
      ON [DimProduct].[ProductAlternateKey] = STG.[ProductAlternateKey]

Be sure to truncate your staging table for every ETL run (by adding an Execute SQL Task at the start of the control flow); otherwise, you will be updating data from old rows from a prior execution.
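The truncation itself is a single statement in that Execute SQL Task, using the staging table shown in the UPDATE above:

  TRUNCATE TABLE [SSISOps].[dbo].[stgDimProductUpdates]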

Handling Advanced Dimension Processing with the Slowly Changing Dimension Support

This section examines how to handle the ETL for advanced dimension forms, combining the functionality of the SCD with other out-of-the-box transformations, focusing on snowflake dimensions, parent-child dimensions, and date dimensions.

Snowflake Dimension Tables

A snowflake table, as briefly described in the beginning of this chapter with the product dimension (see Figure 4-2 earlier), requires some unique ETL handling. In a snowflake dimension, the higher-level tables (subcategory and category in this example) also have surrogate keys. With the product snowflake dimension, the ProductCategoryKey cascades down as a foreign key in the DimProductSubcategory table, and the ProductSubcategoryKey cascades down to a foreign key relationship in the DimProduct table. The ProductKey itself relates directly to the fact table, whereas the surrogate keys in the category and subcategory tables do not relate directly to the reseller fact table.

The design is called a snowflake because when viewed in relationship to the fact table, the table layout looks like a snowflake as opposed to a star. (A star schema has dimension tables one level out from the fact table. A single table dimension is often called a star dimension.) Generally, most dimensions are designed as a star dimension. However, there are two very valuable reasons to break out a dimension table into a snowflake design:

  • When a dimension table has several attributes that relate directly to a higher level within a dimension hierarchy, managing dimension changes can be much easier with a snowflake design. For example, say the product subcategory table contains the English, Spanish, and French names of the subcategory. If these columns were included in the base product table, and the subcategory changed for a dimension member, it would be much more difficult to ensure that the Spanish and French names stayed in synch with the subcategory. If they did not, reports on these attributes would show incorrect associations.

  • A second and more compelling reason to use a snowflake is when multiple fact tables relate to the same dimension at different levels. For example, if the Reseller Sales fact table tracks sales at the product level, but Sales Quota facts are assigned at the product category level, a single-table product dimension could not effectively join to both. With a snowflake design, the fact tables can be related to one another because they share a common table at different levels of the product dimension hierarchy (see the join sketch after this list). Sharing dimension tables between fact tables is a driving emphasis of dimensional modeling called conforming dimensions.
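As an illustration, the following sketch (using the AdventureWorksDW snowflake tables; the fact table and measure names are assumptions based on that sample) shows how reseller sales at the product grain can be rolled up to the shared category level:

  SELECT c.EnglishProductCategoryName
        ,SUM(f.SalesAmount) AS SalesAmount
    FROM dbo.FactResellerSales f
   INNER JOIN dbo.DimProduct p
      ON f.ProductKey = p.ProductKey
   INNER JOIN dbo.DimProductSubcategory s
      ON p.ProductSubcategoryKey = s.ProductSubcategoryKey
   INNER JOIN dbo.DimProductCategory c
      ON s.ProductCategoryKey = c.ProductCategoryKey
   GROUP BY c.EnglishProductCategoryName

A quota fact keyed on ProductCategoryKey could join directly to DimProductCategory and be compared with this result at the same grain.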

Important 

Analysis Services supports relating a standard dimension to different measure groups at different levels. (A measure group in Analysis Services is equivalent to a fact table in the database.) The Analysis Services engine naturally understands how to associate this data and perform the right aggregations.

A straightforward method for processing snowflake dimensions in SSIS is to use multiple SCDs embedded in different data flows, linked by precedence constraints in the control flow, starting at the top level of the snowflake tables and working down to the lowest level. Figure 4-30 shows the control flow of the product dimension package. Note that the very first task is an Execute SQL Task that truncates the staging table used for the set-based update, followed by the three Data Flow Tasks.

Figure 4-30: Control flow of the product dimension package

The first data flow shown is the product category dimension table, followed by the subcategory dimension table, and concluding with the product data flow. The final Execute SQL Task handles the set-based updates for Type 1 changing attributes as discussed earlier.

The product category data flow is the most straightforward because it only requires one Lookup transformation to get the category translation attributes, and, furthermore, it only contains Type 1 changing attributes. Figure 4-31 shows the product category data flow.

Figure 4-31: Product category data flow

Next, you process the product subcategory. When processing data within intermediate snowflake tables in the hierarchy, a lookup is also required to reference the parent table surrogate key. For example, as you process the product subcategory table, include a Lookup transformation to pull the surrogate key of the category, as shown in Figure 4-32.

Figure 4-32: Subcategory lookup to pull the surrogate key of the category

The ETL processing in the lowest-level product table has already been discussed. Not mentioned, but also included in the data preparation steps for the product dimension, is the need to pull the surrogate key of the subcategory table.

Parent-Child Dimension ETL

The next dimension design that we will consider is the parent-child dimension. A parent-child dimension is a self-referencing dimension table with special ETL requirements. Simply put, a parent-child dimension has a surrogate key and a parent surrogate key, which gives the dimension a self reference. In addition, parent-child dimensions usually have a business key and a parent business key. The self-referencing business key nicely illustrates the parent relationship in the table.
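A minimal sketch of that structure, with the column list trimmed (names follow the AdventureWorksDW Employee dimension, which has many more attributes):

  CREATE TABLE dbo.DimEmployee (
      EmployeeKey                          int IDENTITY(1,1) NOT NULL PRIMARY KEY,
      ParentEmployeeKey                    int NULL
          REFERENCES dbo.DimEmployee (EmployeeKey),          -- self reference
      EmployeeNationalIDAlternateKey       nvarchar(15) NULL, -- business key
      ParentEmployeeNationalIDAlternateKey nvarchar(15) NULL, -- parent business key
      LastName                             nvarchar(40) NULL
  )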

An organizational structure is a great example. An organization typically has common attributes for all the individuals in it, such as location, office, salary, and so on. All of these attributes relate to higher levels in the hierarchy. A parent-child dimension enables you to build a hierarchy where members at different levels have common attributes. It also allows the hierarchy to be unbalanced, where not every drill path in the hierarchy goes down to the same level.

The Employee dimension table as shown in the beginning of the chapter (see Figure 4-2) is an example of an organizational structure. Figure 4-33 shows a subset of data and columns within the Employee dimension table.

Figure 4-33: Subset of data and columns within the Employee dimension table

As you can see, some dimension members relate to other dimension members. For example, Dan Bacon reports to Jean Trenary (Dan’s parent employee key is 44, which is the employee key of Jean). At the top of the table, the Chief Executive Officer, Ken Sanchez, has no parent key and, therefore, no manager. Each member in the table sits at a particular level in the hierarchy. Taking this subset of data and building out the parent-child relationships produces the hierarchy shown in Figure 4-34.

Figure 4-34: Building the hierarchy for the parent-child relationship

Note a few points about this parent-child dimension:

  • This is an unbalanced hierarchy. The levels within this dimension don’t all extend to the lowest level (Laura Norman, for example, has no direct reports). In a USA-based geography dimension, for example, everything typically goes to the same state or city level. Within an unbalanced hierarchy, there are levels that don’t cascade all the way down.

  • A parent-child dimension can also be a ragged hierarchy. A ragged hierarchy has holes in the hierarchy; you could be at the top level, skip the next level, and go directly to the level below. You must be aware of these variations when processing ETL.

  • Parent-child dimensions have shared attributes. Most of the records, except at the top level, share common attributes, such as Employee Address.

When processing parent-child dimensions in SSIS, a couple of methods can be applied for acquiring the parent record key.

The first approach is to use a Lookup transformation to acquire the parent record for the parent-child relationship. Figure 4-35 shows the data flow used to process the Employee dimension table.

Figure 4-35: Data flow used to process the Employee dimension table

This data flow has similarities to the product dimension data flow, as the first several transformations are used to prepare the data for the SCD transformation. Note that the third data flow object is a Lookup transformation used to acquire the parent surrogate key. In this case, the Lookup transformation joins the parent employee business key of the source to the matching employee business key from the employee table. Figure 4-36 shows the Columns tab of the Lookup editor.

Figure 4-36: Columns tab of the Lookup editor

The join is from the parent business key in the source to the business key in the reference table, and the surrogate key that is returned is aliased as ParentEmployeeKey.

A second approach involves using an Execute SQL Task in the control flow and performing a set-based update to associate the employee with the employee’s manager record. This SQL Task needs to be run after the data flow that adds new dimension records. The following SQL code is used to perform the operation:

  UPDATE dbo.DimEmployee
     SET ParentEmployeeKey = Parent.EmployeeKey
    FROM dbo.DimEmployee
   INNER JOIN dbo.DimEmployee Parent
      ON DimEmployee.ParentEmployeeNationalIDAlternateKey =
         Parent.EmployeeNationalIDAlternateKey
     AND Parent.EndDate IS NULL
   WHERE DimEmployee.ParentEmployeeKey IS NULL

The query could be handled with a self join (as shown) or a correlated subquery. Here, the Employee table is joined to itself (aliased as Parent), matching each child employee with its parent manager record. The ParentEmployeeKey is updated with the EmployeeKey of the Employee table joined as the parent table. Also important to note is that the only records affected are those with the ParentEmployeeKey set to NULL, which targets only the new rows and, therefore, optimizes the update.
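For reference, here is the same update written as a correlated subquery (an equivalent sketch under the same assumptions):

  UPDATE dbo.DimEmployee
     SET ParentEmployeeKey =
         (SELECT Parent.EmployeeKey
            FROM dbo.DimEmployee Parent
           WHERE Parent.EmployeeNationalIDAlternateKey =
                 DimEmployee.ParentEmployeeNationalIDAlternateKey
             AND Parent.EndDate IS NULL)
   WHERE DimEmployee.ParentEmployeeKey IS NULL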

A final approach is to use a combination of a Lookup transformation and an Execute SQL Task. The Lookup transformation approach may fall short of handling all situations if an employee is added at the same time as the assigned manager. In this case, the parent employee key would not yet be available. This combined approach would give better scalability in higher-volume situations by reducing the rows affected by the update.

Date Dimension ETL

The date dimension is probably the most common and most widely conformed dimension in any data warehouse or data mart structure. In other words, most fact tables have relationships to a Date table. In the sample AdventureWorksDW, the date dimension table is called DimTime. Honestly, it’s poorly named. The word Time has connotations of time of day (hour or minute), not the date grain, which is what the DimTime table actually stores. DimDate would have been a better name.

A date dimension can have many different attributes, besides the year, month, and date. It might include the following:

  • Day of week

  • Week of year

  • Holiday

  • Fiscal hierarchy

The attributes combine to create different hierarchies. An example of a natural calendar hierarchy would be Year-Quarter-Month-Date or Year-Week-Date. In the natural calendar, weeks do not line up with months (a week can span two months). Therefore, there are two natural calendar hierarchies. Because of this challenge, organizations have come up with different, and often unique, fiscal hierarchies. Your organization may have its own fiscal hierarchy.

A common approach to a custom fiscal hierarchy is to break a quarter into three periods: four weeks in period 1, four weeks in period 2, and five weeks in period 3 (commonly called a 4-4-5 fiscal date hierarchy). A period essentially replaces a month to accommodate the week-month challenge. One organization we worked with had at least 10 different fiscal hierarchies; every branch of the organization wanted to see the sales data from a different perspective.
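As a hedged illustration of the 4-4-5 arithmetic, the following sketch derives the fiscal quarter and period from a fiscal week number, assuming 13-week quarters counted from the start of the fiscal year:

  DECLARE @FiscalWeek int, @WeekInQuarter int
  SET @FiscalWeek = 23                                -- example: week 23 of the fiscal year
  SET @WeekInQuarter = ((@FiscalWeek - 1) % 13) + 1   -- position within the 13-week quarter

  SELECT ((@FiscalWeek - 1) / 13) + 1 AS FiscalQuarter   -- quarter 2 for week 23
        ,CASE WHEN @WeekInQuarter <= 4 THEN 1            -- period 1: weeks 1-4
              WHEN @WeekInQuarter <= 8 THEN 2            -- period 2: weeks 5-8
              ELSE 3                                     -- period 3: weeks 9-13
         END AS FiscalPeriodInQuarter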

When it comes to handling the date dimension ETL, the dimension is relatively easy to process because once a new date record is added, it is usually never updated. The Date dimension is commonly not snowflaked into multiple tables.

Important 

The date dimension is not typically snowflaked even if you have higher-level fact tables and a multi-grained scenario. For example, account balances might be tracked at a week level, whereas sales come in at a day level, and inventory at a month level. The date dimension almost always uses the day for granularity. You would use the first period of that grain as the key. For example, if inventory is at the month level, use the first day of the month as the key for the month.

There are a few different ways to process date-dimension ETL. The following are some common mechanisms:

  1. Handle the date dimension population through a stored procedure. Whether you’re calculating one date at a time every day, or a year’s worth of dates into the future once a year, a stored procedure can load all of the records and handle all the date logic (a sketch of this approach follows the list).

  2. Use a data flow with Derived Column transformations and date part expressions. The data flow generates a set of dates, and additional attributes are added using the SSIS date functions.

  3. Use a For Loop Container to iterate through a range of dates with SQL inserts. The For Loop Container can be configured to increase a date variable by one on each iteration and then leverage an Execute SQL Task to manage the inserts, one row at a time.
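A minimal sketch of the first approach, assuming the DimTime table is keyed on FullDateAlternateKey, its surrogate key populates automatically, and the column list is trimmed to a few attributes:

  CREATE PROCEDURE dbo.uspLoadDimTime
      @ToDate datetime
  AS
  BEGIN
      DECLARE @d datetime
      -- start one day past the last date already loaded
      SELECT @d = DATEADD(DAY, 1, MAX(FullDateAlternateKey)) FROM dbo.DimTime

      WHILE @d <= @ToDate
      BEGIN
          INSERT dbo.DimTime (FullDateAlternateKey, CalendarYear,
                              MonthNumberOfYear, DayNumberOfWeek)
          VALUES (@d, YEAR(@d), MONTH(@d), DATEPART(WEEKDAY, @d))
          SET @d = DATEADD(DAY, 1, @d)
      END
  END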

All of these approaches are viable. Since the date dimension usually contains a small number of records, the package will process quickly regardless of the approach chosen.

As an example, if the second approach is taken, the process would involve identifying the first new date and the last date to add, followed by a data flow to handle the inserts. Figure 4-37 shows the control flow of a package with an Execute SQL Task followed by a data flow.

Figure 4-37: Control flow of a package with an Execute SQL Task followed by a data flow

Defined in the package are two user variables, FromDate and ToDate, which are populated by the Execute SQL Task using a single-row resultset. The FromDate is populated by querying the MAX date in the time dimension plus one day, and the ToDate is populated from the MAX date in the sales header table source (a sketch of these queries follows). As you would guess, the data flow only needs to run if the ToDate is greater than the FromDate. This is handled through the Precedence Constraint after the Execute SQL Task. Figure 4-38 shows the Precedence Constraint Editor.
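The queries behind that Execute SQL Task might look like the following sketch (two single-row resultsets; the source table name is an assumption based on the AdventureWorks sample):

  -- FromDate: the day after the last date already in the dimension (warehouse connection)
  SELECT DATEADD(DAY, 1, MAX(FullDateAlternateKey)) AS FromDate
    FROM dbo.DimTime

  -- ToDate: the most recent order date in the source (source connection)
  SELECT MAX(OrderDate) AS ToDate
    FROM Sales.SalesOrderHeader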

Figure 4-38: Precedence Constraint Editor

If the Execute SQL Task is successful and the @ToDate is greater than @FromDate, then the Precedence Constraint will allow the Data Flow Task to execute. At this point, if the criteria are met, then there are rows that need to be inserted into the time dimension table. The data flow, shown in Figure 4-39, contains five components:

  • A Script Component that generates a row for every day starting with the Start Date and ending with the End Date

  • A Derived Column transformation that adds the required date attributes

  • Two Lookup transformations to pull in some date translations from a translation table

  • A destination adapter that inserts the rows into the Time dimension table.

Figure 4-39: Five components of the data flow

Since the Script Component can be used as a source (as described in Chapter 2), it is a great candidate to generate rows from the FromDate to the ToDate values. The following code takes the SSIS variables and, using a Do While loop, iterates over the dates from the FromDate to the ToDate, adding a date for every loop:

  Imports System
  Imports System.Data
  Imports System.Math
  Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
  Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

  Public Class ScriptMain
      Inherits UserComponent

      Public Overrides Sub CreateNewOutputRows()
          Dim vFromDate As Date = Me.Variables.FromDate
          Do While vFromDate <= Me.Variables.ToDate
              With NewDatesOutputBuffer
                  .AddRow()
                  .FullDateAlternateKey = vFromDate
              End With
              vFromDate = DateAdd(DateInterval.Day, 1, vFromDate)
          Loop
      End Sub
  End Class

The second component to highlight is the Derived Column transformation. Figure 4-40 shows the Derived Column editor, which contains several additional columns using the DATEPART SSIS expression function to pull out common attributes.

Figure 4-40: Derived Column editor

Before the OLE DB Destination, which is just an insert into the DimTime table, two Lookups handle the translation, because the time dimension table supports multiple languages for the month- and day-related attributes.

Overall, the package is straightforward, performs well, and leverages SSIS capabilities.


