Dimension tables provide the structure and context for the fact tables and measurements in the DW/BI system. Dimension tables are much smaller than fact tables, which is good because their processing is complex.
Dimensions are updated over time, either by updating in place (Type 1), or by adding new rows to track history (Type 2). Sometimes you're lucky enough to be handed separate streams of new rows and changed rows. Often you need to figure out what changes have occurred, and handle them appropriately.
This section presents examples related to the AdventureWorks databases. You'll be able to follow along simply by reading the text and looking at the figures. But if you want to experiment with the packages from this chapter, you need to go to the book's web site. We're populating a modified version of the AdventureWorksDW database that ships with SQL Server 2005: We call our version MDWT_AdventureWorksDW. The web site (www.MsftDWToolkit.com) has copies of the packages that populate MDWT_AdventureWorksDW.
Kimball Method ETL Subsystem: This section describes techniques relevant to several Kimball Method ETL subsystems.
Start with a simple dimension: Promotions. In the Adventure Works Cycles case study, the DimPromotions table is sourced from the [Sales].[SpecialOffer] table, with no transformations necessary. You should be so lucky in the real world.
Even though this is the simplest possible example, we're still going to develop a data flow diagram in Visio, which is illustrated in Figure 6.1. As you can see, even a trivial example has several steps. Ignore the Audit dimension for now, and focus on the basics. We discuss the auditing subsystem later in this chapter.
First, define the extract from the source system. There's a grand total of 16 promotions in our source system, so simply select all the rows from the source table. In the section on Slowly Changing Dimensions, we describe how to compare them against the Promotion dimension table before deciding whether to insert them or update any changed values.
This section walks you step by step through the process of creating, modifying, and running the simple Promotions package. After this section, we abandon the step-by-step approach.
To extract the Promotions data from the SQL Server source in AdventureWorks, set up a connection and define the extraction query. Open the DimPromotion_ETL package that you created earlier and add a Data Flow task if you don't have one already. In the Data Flow task, add an OLE DB Source, and set the connection to your source database (AdventureWorks). Specify that you'll be extracting from a table, and choose [Sales].[SpecialOffer], as you see in Figure 6.2.
If you source from a table or view, Integration Services extracts all the rows in that object. To extract only the rows changed today, you need to change the Data Access Mode to a SQL command. We show how to do that later in this chapter.
Choose all the columns in the table except the last two: rowguid and ModifiedDate. This step is optional, but it's a good practice to extract only the data you really need. Remember that data is resident in memory within an Integration Services Data Flow task. Keep the data stream as small as possible. Memory is not a concern for our 16 promotions, but you should develop good habits from the outset.
Change column names to match target columns as soon as possible. You can do so in the OLE DB Source Editor, as illustrated in Figure 6.3.
You may wonder, in Figure 6.3, what the column names that end in _U are all about. These columns are in Unicode format, and our target tables use single-byte character strings. By creating the extract columns with the _U suffix, you're saving the correct column name for later on, when you perform this conversion.
You may never run into this Unicode to single-byte string conversion in the real world. Typically, source systems are defined correctly to support the data that is stored in them. Microsoft uses Unicode by default in all of its demo databases to support localization of the AdventureWorks database. But you will definitely use this trick: There's always something to convert.
The third section of the OLE DB Source Editor is where you configure error outputs. We're hard pressed to understand how a relational source could have malformed data. Error outputs make a lot of sense for a flat file source, but they're seldom useful for a relational source.
If you look closely back at Figure 6.2, you'll see that we renamed the OLE DB Source transform to Sales SpecialOffers. We like to name source transforms to indicate the data source table, in this case the SpecialOffers table in the Sales schema. You should rename all objects and descriptions in your packages, and add text annotations to the package surface as well. As we describe in Chapter 13, package object names and descriptions are a part of your metadata. There aren't many ways to decorate packages with custom metadata, and you should use every option you have.
Although we claimed that no transformations were necessary on the promotions data, a quick look at Figure 6.1 tells you that was an exaggeration. You want to get rid of the null values for the maximum quantity applicable to a promotion. If you leave this as null, business users will have to construct a complicated filter condition: where the quantity sold is between the minimum and maximum quantities, or where it's greater than the minimum and the maximum is null. Instead, let's take the simple step of replacing null with a large number. Because this is an integer column, use the maximum value an integer can take: 2,147,483,647.
The maximum value for a smallint is 32,767; the largest tinyint is 255. The largest bigint is a really big number: 9,223,372,036,854,775,807.
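These maxima all follow from the storage sizes: 2 to the power (bits - 1), minus 1, for the signed types, while tinyint is unsigned. A quick sanity check (we use Python here purely as illustration):

```python
# Signed SQL Server integer maxima are 2**(bits - 1) - 1.
# tinyint is the exception: it is unsigned, so its maximum is 2**8 - 1.
tinyint_max = 2**8 - 1    # 255
smallint_max = 2**15 - 1  # 32,767
int_max = 2**31 - 1       # 2,147,483,647
bigint_max = 2**63 - 1    # 9,223,372,036,854,775,807
```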
If you often replace null values, add a MaxInt variable to your template package.
Add a Derived Column transform to the data flow diagram, and hook it up to the SpecialOffers Source. Add two columns here. First, add a new column for the AuditKey, and set it equal to zero for now. We'll come back and fix this later. Next, add a derived column that replaces MaxQty. Set its value to the following expression:
ISNULL(MaxQty) ? 2147483647 : MaxQty
This expression checks to see if MaxQty is null. If so, it replaces its value with our new maximum value; otherwise it keeps the existing value of MaxQty.
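If it helps to see this logic outside the Integration Services expression language, here is the same null replacement sketched in Python (the function name is ours, purely illustrative):

```python
MAX_INT = 2_147_483_647  # the largest value a 32-bit signed integer can hold

def replace_null_max_qty(max_qty):
    """Mimic the Derived Column expression: ISNULL(MaxQty) ? 2147483647 : MaxQty."""
    return MAX_INT if max_qty is None else max_qty
```

With this in place, business users can always filter on "quantity between minimum and maximum" without a special case for null.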
The Books Online topic Data Types (Transact-SQL) contains information about relational data types.
The Books Online topic Integration Services Expression Reference is an important reference for creating expressions.
Figure 6.4 illustrates the Derived Column transform. In addition to the AuditKey and MaxQty columns, trim all of the string columns. String columns often have extra spaces, usually to the right. These spaces are undesirable, and are really hard to debug because they're invisible.
Trim all string columns immediately after the data source step.
You could convert the Unicode strings to single-byte in this same transform. This is probably what we'd do in a real package, trimming and converting in the same step. But in this case add a third transform to the data flow, a Data Conversion transform. Set it up to convert PromotionName_U, PromotionType_U, and PromotionCategory_U to single-byte strings of the appropriate lengths (50, 20, and 20, respectively). Remove the _U when you specify the Output Alias.
Readers who are working with a version of SQL Server 2005 localized to a double-byte language might find that the data in the demo AdventureWorks database has been translated and actually does use double-byte strings. In this case, the conversion step will throw an error.
We like to test packages incrementally, and although we usually build up a bit more logic before testing, let's start small. Test your logic at various points by setting up a flow to the Row Count transform. The only thing to set up in the Row Count transform is the package variable to write the row count to. Your template package has a RowCount variable already.
Package variable names are case sensitive.
The point of writing to the Row Count transform for debugging purposes is not so much for counting the rows, although that's useful. The greater value is that it lets you add a data visualizer to the pipeline. You want to examine the rows as they flow out of the Derived Column transform, and you need one additional transform in order to check that output. Any transform will do; we usually use the Row Count transform to serve the role of low-cost pipeline endpoint for debugging purposes.
Add the data visualizer by right-clicking the flow arrow between the Derived Column and Row Count transforms. Choose Data Visualizers → Add. There are several options for types of visualizers, including histograms, scatter plots, and column charts, but we use the grid most often.
Figure 6.5 illustrates our simple package at this point. Execute the package by highlighting it in the list of packages in the Solution Explorer, right-clicking, and choosing Execute Package. BI Studio automatically saves the package when you execute it.
Always execute packages by right-clicking in the Solution Explorer. Get into this habit from the outset. If you've done any Visual Studio development, you know that pressing the F5 key will start execution. But in an Integration Services project, this can, and usually does, launch a package other than the one you're looking at.
The BI Studio environment provides a lot of information to the developer while the package is executing. Transforms turn yellow while they're executing, green when they've succeeded, and red if they throw an error. You can see how many rows have flowed through each step. And you can actually see the data in any data visualizers in the package. The data visualizer pauses the package's execution; you need to detach it (or press the green arrow button) to proceed. When a package finishes executing in BI Studio, it's left in a debugging state. You need to stop the execution by pressing the blue square box in the debug toolbar, or by selecting Stop Debugging in the Debug menu, before you can edit any package.
Maximize the useful design surface by setting the Solution Explorer, Toolbox, Variables, and Properties windows to auto-hide. You can get them back by moving the mouse over the window tab.
If you followed our suggestion to add an OnPostExecute breakpoint to the package template, the package will pause for you to examine package variables. You can see the RowCount along with a list of system variables by selecting the Locals tab that opened up in the lower left corner during execution. When you are finished, simply continue or stop the package's execution.
If you close the Locals window, you can get it back by choosing Debug → Windows while the package is in Debug mode.
The next step is to write the result set into the target table. This is going to be another throwaway step because you still need to check whether these are new or changed rows. But it's a good development practice to test your package bit by bit. It's easiest to debug when you're doing something really simple.
Delete the Row Count transform and add an OLE DB Destination adapter to your data flow diagram. Hook it up to the output from the Derived Column transform, and configure it to point to the DimPromotion table in MDWT_AdventureWorksDW. Click on the Mappings tab, and you'll see that because you changed the column names in the source transform, the source-to-target mappings come up correctly by default. The surrogate key, PromotionKey, has no mapping. That's because it's an identity column that's populated as the data is inserted into the table. Don't worry about any of the advanced properties for now.
You have to go to the Mappings tab in the destination adapter, even if you know the default mappings are going to work. It's tempting to set up the connection and then click OK, but that doesn't set up the adapter correctly. You have to click on the Mappings tab.
Now, execute this package. You'd expect to see 16 rows processed and added to DimPromotion. After the package finishes executing, run a query on DimPromotion; the 16 rows should be in there.
This version of the Promotion package is available on the web site. It's called Promo v.1.dtsx.
You can execute the new package as often as you like. Each time, you add 16 rows to the target table DimPromotion. Obviously, this isn't what you want to happen. You need to modify the package to check whether rows already exist, and add them only if they are new. You also need to look for whether an existing row has been changed. We discuss these modifications in the following sections.
The example in this section uses the OLE DB destination adapter. You could have used the SQL destination instead. In Chapter 5 we discuss the tradeoffs between the two destination adapters. In general, use OLE DB during development. If the performance of the insert is a problem, consider switching to the SQL destination, which is sometimes faster.
Kimball Method ETL Subsystem: This section describes techniques relevant to Kimball Method ETL Subsystem #1, Extract System: source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to the ETL environment.
Whether you're processing dimensions or facts, you either need to be told which rows might have seen changes, or you need to figure it out in your ETL process. In Chapter 5 we discuss some of the many ways that source systems can be designed to flag changed rows. AdventureWorks, luckily for us, uses a ModifiedDate column in each table.
Explore the ModifiedDate columns or their equivalents in your organizations source systems. We have seen many systems where ModifiedDate is managed in application logic. Any bulk updates by the DBA do not necessarily set the ModifiedDate correctly. You really want ModifiedDate to be maintained by a database trigger, as is done in AdventureWorks.
Each of your packages should have two package-level variables to bracket the date range for which the package is to be run. We call these ExtractStartDate and ExtractEndDate. Set the initial values of the variables to a useful range for the initial stages of development and testing. In production, you'll set the variables' values dynamically at runtime. Because most or all of your packages will use these variables, you should add them to your package template. We use a convention that the date range for the package's run starts on ExtractStartDate, for example 2000-01-01 00:00:00.000, and runs to but excluding the ExtractEndDate, for example 2004-08-01 00:00:00.000. In other words, because we put the end date of 2004-08-01 in the variable's value, we use a less-than comparison. This guarantees that you'll get all transactions that occurred on July 31 no matter how close to midnight, and no transactions that occurred on August 1.
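To make the half-open interval concrete, here is the comparison sketched in Python (variable and function names are ours, not part of the package):

```python
from datetime import datetime

# Stand-ins for the ExtractStartDate / ExtractEndDate package variables.
extract_start = datetime(2000, 1, 1)
extract_end = datetime(2004, 8, 1)

def in_extract_window(modified_date, start, end):
    """Half-open interval: include the start, exclude the end. A row modified
    at 23:59:59.997 on July 31 is caught; nothing from August 1 sneaks in,
    and successive runs never double-load a row."""
    return start <= modified_date < end
```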
The variable editor taskpad tries to be helpful when you create a date variable. It puts up a calendar date dropdown and forces you to use it. We find that more annoying than helpful. If you go to the variable's property pane, you can just type in a date.
You could get clever and put a dynamic default value for the end date, equal to today's date at 00:00. That way, by default the package will pull all changes up until yesterday midnight. There are two tricks to accomplishing this that are worth discussing:
Set the variable's default value to an expression. Open the property pane for the variable, and set EvaluateAsExpression to True. Now you can enter an expression, like getdate(). The function getdate() includes the time part of today's date, so you need to strip that out.
Strip out the time part of the date by using some cast functions. One expression that works is (DT_DBTIMESTAMP)(DT_DBDATE)GETDATE(), which casts to a date, dropping the time, and then back to a timestamp.
In the real world, you don't always set the end date to yesterday. You need to be able to run the package for two days ago, in case you missed a load. Use a package configuration to set the value of the variable at runtime. When setting the date from a configuration, you don't want the variable to be set to an expression: The expression will override the configuration value. Don't get so clever that you hurt yourself.
If you have a reliable row modified date from the source system, and some package variables with the date range you want to extract, simply parameterize the query from the source system so that it filters on the row modified date. Figure 6.6 illustrates a parameterized source system query.
The Parameters button on the right side of the SQL command text box brings up a small window titled Set Query Parameters where you map package variables to the parameters in your query. The user interface is fine here, where there are only two parameters. But if your SQL statement has many parameters, you'll improve readability by making that SQL statement into a stored procedure.
This query will extract only the rows that changed during the date range of interest. For daily processing, that date range refers to yesterday. In a small dimension like Promotions in the demo database, we'd expect that most days would deliver zero changes to dimension rows. You can continue to execute the rest of your package with zero rows, which shouldn't take long, or add a check to halt package execution. Later in this chapter we discuss how to design your package so that it halts execution at appropriate points.
If no one tells you which rows have changed, you have to figure out the changes by brute force: by comparing the new rows with the existing data. For dimension tables, the easiest way to develop this logic is to use the Slowly Changing Dimension transform.
In Chapter 2, we talked about the two main types of slowly changing dimensions:
Type 1: Restate history by updating the dimension row when attributes change.
Type 2: Track history by propagating a new dimension row when attributes change.
Kimball Method ETL Subsystem: This section describes techniques relevant to several Kimball Method ETL subsystems.
Any dimension that contains a Type 2 attribute should track the date range for which each dimension row is valid. For any dimension with a Type 2 attribute, add three columns: RowStartDate, RowEndDate, and IsRowCurrent. For every dimension member like customer, there should be one and only one current row at any one time. Older rows have their RowStartDate and RowEndDate set appropriately. Figure 6.7 illustrates the logic for handling updates to a dimension with both Type 1 and Type 2 attributes.
We've seen companies get so intimidated by this complexity that they decide to manage all dimensions as Type 1, even if that's not what the users want. The Integration Services Slowly Changing Dimension transform is a great feature. It does most of this work for you.
When you drag the Slowly Changing Dimension (SCD) transform into the Data Flow design palette, it consists of a single rectangle like all the other transforms. When you edit it, it launches a wizard with several pages of questions. And when you finally press Finish, the wizard generates a bunch of transforms and flows for you. The generated transforms and flows do the work thats outlined in Figure 6.7.
The wizard starts by asking you to specify the dimension table you'll be loading. Next, identify the business key, used to tie together all the instances of a particular entity. In a customer dimension the business key is usually the account number or customer ID. Map the other columns in the input flow to the attributes in the target dimension.
The wizard next has you identify how to manage the changes in each attribute. In addition to the Types 1 and 2 (restate and track history) described previously, Integration Services includes a Fixed attribute, which should never be updated. Set the attribute change type for all the columns in your target table.
On the next screen you're asked several housekeeping questions. Do you want the processing to fail when you encounter a change to a fixed attribute? Answer No (you rarely want processing to fail outright). Do you want the bad row to go into an error flow? Answer Yes. You're also asked if a Type 1 change, when encountered, should update all the historical rows for the dimension entity, or just the current row. The textbook definition of a Type 1 attribute indicates you should update all the rows, and this is the recommended setting. This is a characteristic that's set for the whole table, not for each column, which is too bad.
If you have any Type 2 attributes in your dimension, you're next asked how to identify the current row. Do you use row start and end dates, or an indicator like IsRowCurrent?
You can't have the SCD Wizard maintain both the row start and end date and the row current indicator. It's one or the other. As we discuss in the next section, you can edit the results of the wizard, and you can generally make it do what you want. We find it easier to run the SCD Wizard using the row current indicator technique, and then edit the resulting transforms to add the row date handling.
When you reach the wizard's screen about inferred members, you are almost at the end. Inferred members are an Integration Services innovation, and help you handle the late arrival of a dimension member generated by an early arriving fact. Early arriving facts appear in your incremental processing stream before you have any dimensional data for them. If you expect to have early arriving facts, design your fact table load so that it will generate a dummy dimension row. That way you can load all of your fact data.
Creating that dummy dimension row creates a problem during dimension processing. If the dimension contains any Type 2 attributes, then as soon as you get the real information about the new dimension member, you'll automatically propagate a new dimension row. What you really wanted to happen was to update all the missing values in the original row with the new information. And that's exactly what the inferred dimension member support does: If all the attributes are missing it will update them in place even if it's a dimension with Type 2 attributes.
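Here is the inferred member behavior sketched in Python (the IsInferred flag and function name are hypothetical illustrations of the logic, not the transform's actual internals):

```python
def apply_inferred_member_update(existing_row, incoming, inferred_flag="IsInferred"):
    """If the existing row is the dummy generated by the fact load, overwrite
    its attributes in place (even in a Type 2 dimension) rather than creating
    a new history row. Otherwise, leave it to normal SCD processing."""
    if existing_row.get(inferred_flag):
        updated = dict(existing_row)
        updated.update(incoming)         # fill in the real attribute values
        updated[inferred_flag] = False   # the member is no longer inferred
        return updated, "update-in-place"
    return existing_row, "normal-scd-processing"
```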
Perhaps you're exasperated with the complexity of this wizard, although we'll point out that it's a complex problem. Your reward comes when you press the Finish button and see all the transforms that have been created for you, as illustrated in Figure 6.8. These objects are exactly what's been generated by running through the wizard for a hybrid dimension with Type 1 and Type 2 attributes and inferred member support. We rearranged and renamed the transforms to improve readability, and labeled the data flow diagram branches with 1, 2, 3, and 4 to correspond to the branches in our logical flow in Figure 6.7.
Set up the Slowly Changing Dimension Wizard on the Promotions package. Delete the target transform to DimPromotion, and set up the SCD transform in its place. Because the Promotion dimension attributes are all Type 1, the output isn't as interesting as the customer example illustrated in Figure 6.8. Go through the wizard again, setting up one of the attributes as Type 2. You can see how different flows are created.
The Slowly Changing Dimension transform will meet many projects' requirements without any further changes. However, there are circumstances where you need to do something tricky, or circumvent the wizard altogether. The next few sections discuss some advanced topics around handling dimension changes.
You will probably want to customize the output from the SCD Wizard. There are even a few cases where it makes sense for you to develop custom handling for dimension changes.
The SCD Wizard will identify the current row for an entity like customer in one of two ways: with a True/False (or Yes/No) indicator, or with a range of dates for which the row is valid. If you choose the date range approach, the SCD transform will look for the single row for each natural key that has a null end date.
We recommend that you use both the row current indicator technique and the valid date range technique. Also, set the end date to a date far in the future, rather than leave it null. Can you still use the SCD Wizard? Yes you can, but you need to modify the generated objects in order to populate your dimension the way you want. Use the row current indicator to identify the active row, and then modify the computed column transform that's between the Union All and the OLE DB transforms, in such a way that it calculates the row start and end dates the way you want them. This is easy to do, but be warned: If you need to go through the wizard again, perhaps to switch an attribute from Type 1 to Type 2 handling, you'll lose any customizations you've made to the generated objects.
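Here is the invariant you are maintaining when you use both techniques, sketched in Python (column names follow our conventions; the function is purely illustrative):

```python
from datetime import datetime

FAR_FUTURE = datetime(9999, 12, 31)  # sentinel RowEndDate instead of NULL

def current_row(history):
    """Return the single active row for one dimension member, checking that
    the IsRowCurrent flag and the far-future RowEndDate stay in agreement."""
    by_flag = [r for r in history if r["IsRowCurrent"] == "Y"]
    by_date = [r for r in history if r["RowEndDate"] == FAR_FUTURE]
    assert by_flag == by_date, "flag and date range techniques must agree"
    assert len(by_flag) == 1, "exactly one current row per member"
    return by_flag[0]
```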
The objects generated by the SCD Wizard are standard Integration Services transforms. You can edit them to do whatever you like.
The only logic that you can't modify is the comparison logic that's hidden in the SCD transform itself. Under the covers, this transform takes a row in the pipeline and compares it to the current row in the dimension table; determines if the pipeline row is new or an update; and sends the row to the correct downstream path or paths. Except by going through the wizard again, you can't change any of the logic in this transform.
The Type 2 Slowly Changing Dimension logic doesn't explicitly track which attributes changed when it creates a new row for an entity like customer. All you know is that something changed. Our Type 2 dimension tables include a Row Changed Reason column. This column should indicate which attributes changed, causing the creation of a new dimension row.
The SCD transform does not supply this information. If you use the wizard, which you almost always want to do, you can't populate this column. If this information is really important to your business users, you must do one of the following:
Develop a post-processing step that runs through the dimension rows added today and figures out what changed.
Abandon the SCD transform, and manage dimension changes using fundamental Integration Services transforms.
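The post-processing option can be sketched as follows (Python used purely as illustration; the column names are hypothetical):

```python
def row_changed_reason(old_row, new_row, tracked_attributes):
    """Compare the prior current row to the newly inserted row and record
    which tracked attributes differ, for the Row Changed Reason column."""
    changed = [a for a in tracked_attributes if old_row.get(a) != new_row.get(a)]
    return ", ".join(changed) if changed else None
```

A nightly step would apply this to each dimension member that gained a new row today, updating the new row's Row Changed Reason column.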
For additional detail on tracking the row changed reason, please refer to the following sources:
It makes sense that you can't modify the internal logic of the SCD transform, but this restriction presents a problem for large dimensions. Unfortunately, the comparison logic is not blazingly fast. Use the SCD transform for as many dimensions as you can. It's well thought out and tested, and using it will simplify your development process. For very large problems, you may have to custom code the dimension change logic.
The best way to find out if the SCD transform works for you is to test it against your data in your environment. But we can provide some guidance. There are two main parameters: the size of the data flow youre sending through the SCD transform, and the size of the target dimension.
Minimize the number of rows going into the SCD transform. The SCD transform can handle rows in the pipeline that are unchanged: neither inserts nor updates. But if you can efficiently filter out those rows in advance, do so.
If the flow and the target dimension are both small (say, less than 10,000 rows), use the Slowly Changing Dimension transform. If the input rowset is small and the dimension is large, the SCD transform should still work acceptably quickly.
If the input rowset is large and the dimension is small, it's probably because this is a first-time load of the dimension table. You might want to build a separate package to handle the historical load of a dimension with Type 2 changes, and simply circumvent the SCD Wizard during the historical load. Other circumstances where the input rowset is large yet the dimension is small are also probably better handled by cleaning up the input rowset using custom logic.
Finally, we reach the tough problem: The input rowset and the dimension are both large. The most obvious solution is better hardware, especially 64-bit with large memory. But that may be too expensive, or not speed the processing sufficiently. You may have to build dimension change logic by hand, as people have done for ages past.
To hand-craft the dimension change logic, use a hash or checksum function to speed the comparison process. Add two new housekeeping columns to your dimension table: HashType1 and HashType2. Soon you'll have more housekeeping columns than real attributes! Into the HashType1 column place a hash of a concatenation of the Type 1 attributes; do the same for HashType2 with the Type 2 attributes. Hashing algorithms are commonly used for encryption. The algorithms convert a very long string into a much shorter string that is guaranteed to be almost unique. (Computer scientists, forgive us.) This is a common technique for high performance dimension change management. Compute the hashes and store them in the dimension table. Then compute the hashes on the incoming rowset and compare them to the stored values. The comparison on a single, relatively short string column is far more efficient than a pair-wise comparison on dozens of separate columns.
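Here is the hashing idea sketched in Python (we use MD5 from the standard library purely as an illustration; the attribute lists are hypothetical):

```python
import hashlib

def attribute_hash(row, columns):
    """Concatenate the named columns with a delimiter and hash the result.
    One short string per row replaces dozens of column-by-column compares."""
    joined = "|".join(str(row.get(c, "")) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

TYPE1_COLS = ["CustomerName"]           # hypothetical Type 1 attribute list
TYPE2_COLS = ["City", "MaritalStatus"]  # hypothetical Type 2 attribute list
```

An incoming row has changed only if its computed hash differs from the HashType1 or HashType2 value stored in the dimension.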
By far the easiest way to compute the hashes is to use the Checksum transform available for download from www.sqlis.com.
Alternatively, use one of the many algorithms for computing hashes. Among those available to you from within Integration Services are:
These algorithms have different performance and uniqueness characteristics. Experiment in your own environment. If you use a System.Security algorithm, you'll implement it in a Script transform. Later in this chapter we walk through an example of creating a Script transform. Script transforms are easy to develop, extremely powerful, and not inherently slow (although you could certainly code a slow transform!).
Here is an outline of the logic flow for custom-coded dimension change management:
Filter out unchanged rows as early in the flow as possible.
Identify new rows as cheaply as possible. Some source systems keep track of the date an entity (like customer) was added to the system, let's hope in addition to the date the row was last modified. If this information is reliable, by all means use it to separate new rows into a separate stream for insert into the target dimension table.
Compute HashType1 and HashType2 on the incoming stream, as described previously.
Perform an Integration Services Lookup to the active dimension rows in the dimension table, matching on the natural key. Keep the columns from the incoming rowset, plus the surrogate key and hashes from the dimension table.
Create an Error Flow on the Lookup transform to handle failures from the lookup.
Lookup failures are new dimension members not already in the dimension table. Flow the error stream into the dimension table.
Write a script component that creates streams for Type 1 and Type 2 changes. The script component will create two output streams.
Compare the incoming HashType1 to the HashType1 that you picked up in the Lookup or Merge Join.
If the incoming and stored hashes differ, output to the Type 1 output flow.
Repeat for the Type 2 comparison.
You must write a script rather than use the Conditional Split transform because some rows will go to both the Type 1 and Type 2 output flows. Conditional Split sends a row to one output or the other, but not both. Writing this script component sounds really ugly, but it's approximately 20 lines of code.
Recreate the logic to perform the updates and inserts, similar to what is automatically generated by the SCD transform.
We can think of even more complicated strategies to improve dimension processing performance and functionality. If nothing else, we expect this discussion has convinced you what a great feature the SCD transform is. We didnt discuss Fixed attributes or Deferred Members in this custom example. That logic is left as an exercise for the reader.
The book Professional SQL Server 2005 Integration Services , by Knight, Chaffin, Barnes, Warren, and Young (Wrox, 2006), is an important reference if you need to further improve the performance for processing dimension changes.
An example package that implements the custom dimension change handling logic is available on the book's web site, www.MsftDWToolkit.com.
If your dimension has one or more Type 2 attributes, you should recreate history for the dimension's initial load. Some source systems keep track of all the historical addresses associated with a customer, for example, and the date ranges for which those addresses were active. Working with the business users, decide at the outset whether you will recreate dimension changes. Once the dimension is built and associated facts are loaded with the appropriate surrogate keys, it is usually too expensive to go back and rebuild.
As usual, there are many possible solutions to this problem. One alternative is to create a package with a loop that executes the incremental dimension load Data Flow task for each day in the historical time period.
Alternatively, construct a data flow that contains the current image for each dimension member and all the historical changes as well. In the case of a customer dimension that triggers a new row when the address changes, the data flow pipeline would contain one row for each combination of customer and address. Include the date the address became effective. Usually multiple attributes trigger Type 2 changes: include as many as you have historical data for.
Use the Sort transform to sort by natural key and by RowStartDate descending.
Next, develop a Script transform that works through the set of rows associated with each dimension entity, like customer. Because the data is sorted descending by RowStartDate, the first row that the script encounters for a new customer becomes the active row. Use one row's start date to set the end date for the next row in the sequence.
In the preceding sample, we blithely talked about creating a Script transform. This may sound intimidating, but it's really easy. Here is an example of the Script transform code to set the end dates and row current indicators:
Public Class ScriptMain
    Inherits UserComponent

    'These variables retain their values from row to row
    Dim CurrentEntity As String = ""   'Change data type to match BusinessKey
    Dim NextEndDate As Date

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        '
        ' Kimball Group, Microsoft Data Warehouse Toolkit
        ' Example script component used to set row end dates and row current
        ' indicators for the historical load of a dimension with type 2
        ' attributes.
        '
        If Row.BusinessKey <> CurrentEntity Then
            'We have a new group
            CurrentEntity = Row.BusinessKey
            NextEndDate = Row.StartDate.AddDays(-1)   'Set next end date
            Row.IsRowCurrent = "Y"
            'We set all row end dates to 9999-12-31 in a computed column
            'before we run this script transform. Leave that value for
            'this row, which is the currently active row.
        Else
            'Fix up end date and is current indicator for the inactive rows
            Row.EndDate = NextEndDate
            NextEndDate = Row.StartDate.AddDays(-1)   'Set end date for the next-older row
            Row.IsRowCurrent = "N"
        End If
    End Sub
End Class
As you can see, the code is simple. This same logic will work for any dimension, except you'll need to rename the column references like Row.BusinessKey as appropriate. You may want to do something more complex in your script, but this sample provides the basic structure.
It's more complicated to set up a Script transform than most Integration Services transforms. In the Input Columns setup tab, you must choose which columns will flow through the transform; if you don't choose a column, it's thrown away. Also note that you can choose whether an input column is ReadOnly or ReadWrite. In the example illustrated here, we made the EndDate column ReadWrite.
To add columns to your flow, go to the Column Properties setup tab. Add an output, and then add columns to that output. If you're simply replacing a ReadWrite input attribute you don't need to do this, but if you want a new column, you do.
Finally, go to the Script tab of the Script Transform Editor, and click the Design Script button. Integration Services launches a VB.NET editor for you, and autogenerates a bunch of code. Compare the sample script here to what's automatically generated; you can see how few lines of code (eleven!) we wrote in order to solve this little problem.
Integration Services keeps track of the code you've written, integrating it into the package. If you were ambitious enough to open an Integration Services package in Notepad or your favorite XML editor, you'd see it in there. The point is that you don't need to worry about where to store the source code; it's part of your package.
You can learn more about scripts and the Script transform in the book The Rational Guide to Scripting SQL Server 2005 Integration Services Beta Preview (Rational Guides, 2005) by Donald Farmer.
If you play around with the Column Properties tab, you'll come to realize that a Script transform can have multiple inputs and outputs. You can also adjust the grain of the flow by outputting a different number of rows than are input.
If your Script transform needs multiple outputs, or needs to adjust the grain of the flow, you'll need to break the connection between the input and output. Click on an output, like Output 0, in a Script transform. It has a SynchronousInputID property, which by default points to the ID of the first input. To break the synchronicity, set the output's SynchronousInputID property to zero. Now you'll have to add, by hand, an output column for each input column; make sure you get the data types correct.
In your script you'll now need to distinguish between the input buffer and the output buffer. You can refer to a single input buffer as Row, but you'll refer to your output buffer as Output0Buffer. To figure out the correct name for the buffer, poke around in the BufferWrapper module that Integration Services autogenerates for you. You'll need to explicitly add a row to the output buffer, using syntax like Output0Buffer.AddRow(). You will also need to explicitly set the value of each output column; they don't automatically get set as they do in the simple case.
For most scripts, the output is synchronized with the input, the columns flow automatically from the input to the output, and you don't need to worry about this technical note.
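Outside Integration Services, the synchronous versus asynchronous distinction amounts to whether the transform emits exactly one output row per input row. This small Python analogy (not SSIS code; the column names are illustrative) changes the grain of the flow the way an asynchronous Script transform does with Output0Buffer.AddRow():

```python
def explode_addresses(rows):
    """Asynchronous-style transform: the output grain differs from the
    input grain. Each input row carries a list of historical addresses;
    we emit one output row per (customer, address) pair, much as an
    asynchronous Script transform adds rows to its output buffer one
    at a time."""
    for row in rows:
        for addr in row["addresses"]:
            # Every output column must be set explicitly, just as in
            # the asynchronous SSIS case.
            yield {"customer": row["customer"], "address": addr}
```

A synchronous transform, by contrast, would simply modify each row in place and pass it through.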
A great ETL challenge is to de-duplicate information, like customer accounts. A common situation: a retail operation that has operated by phone and direct mail opens a web ordering application, and it's easy for the same customer or family to end up with many account numbers. Similarly, if your business has grown by acquisition, you're likely to have the same customer (and even product) reflected in the source systems of the different business entities. If your business users are interested in analyzing the history of a customer's purchases, some de-duplication may be necessary. De-duplication is also necessary when you integrate data from multiple source systems even within a single company, or combine multiple individuals' accounts into a family or household group.
Kimball Method ETL Subsystem
This section describes techniques relevant to Kimball Method ETL Subsystem #4, the Data Cleansing System: typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, and possibly also products or locations. De-duplication covers the identification and removal of duplicates, usually of individuals and organizations, possibly products or locations, and often uses fuzzy logic. Survivorship uses specialized data merge logic that preserves specified fields from certain sources as the final saved versions, and maintains back references (like natural keys) to all participating original sources.
The need for de-duplication is really an artifact of stovepipe transaction systems: systems that were developed independently but that deal with the same customers, products, employees, or other entities. Historically, the transaction systems were not required to make an effort to integrate their data; if payroll correctly generated the paychecks and benefits correctly tracked program enrollments, it didn't really matter that the two systems used different IDs for the same employee.
Most organizations have long known this is the wrong approach, but they were unwilling to invest in fixing it because they didn't understand the full cost. The emergence of DW/BI systems has shown that this lack of integration makes accurate analysis impossible. Organizations are beginning to see the direct cost of integrating the data downstream; it's an expensive, ongoing effort. Even worse, this stovepipe approach puts the burden of maintaining correct attribute values in multiple systems on the employee (or worse, the customer), a hidden cost that could be greater than the direct cost.
Many organizations are assuming the responsibility to create and maintain an integrated data store as part of the transaction system. Most of the major ERP vendors and several specialty software companies have built modules or systems to address data integration at the source, before it becomes a problem. Search the Internet for master data management to find some examples.
What's the difference between similarity and confidence? If you had a lot of U.S. customers and ran Jo Smith through the algorithm, you might find a match to Joe Smith with high similarity. The confidence rating would be relatively low, however, because J. Smith, John Smith, Joe Smith, and many other similar names are all very common.
De-duplication of existing data is usually done as an analytic project, often as part of the development of a customer-oriented DW/BI system. It's a labor-intensive process. Tools help a lot, but a person needs to look at the results and make judgments about where correlations exist and which techniques are working. At the end of this process of trial and error, you'll have a handful of techniques, and an order in which to apply them, that work well with your data. You'll make a one-time pass to de-duplicate existing data, and then integrate these de-duplication transforms into your incremental loads.
Integration Services is a productive environment in which to conduct a de-duplication project. The Data Flow task visualizers and Data Source View data viewers are hugely valuable for investigating the data. The Fuzzy Lookup and Fuzzy Grouping transforms are particularly valuable for de-duplication. These transforms find one or more matches between an incoming row and a second data set, comparing one or more columns in each. The underlying algorithm is sophisticated: unlike some similar algorithms, it's not specific to a language (like English), but instead is based on pattern matching. The Fuzzy transforms return both matches and measures of similarity and confidence.
Microsoft has published a number of excellent examples of using the Fuzzy transforms, so we won't repeat usage details here. However, we can't refrain from mentioning one important best practice. The algorithm used by the Fuzzy transforms is efficient, but it's not magic. Always use a standard Lookup first to find perfect matches. Then use a Conditional Split to divide your data flow, and use fuzzy matching only on the unmatched data.
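The best practice in the previous paragraph is easy to demonstrate outside Integration Services. In this Python sketch, the standard library's difflib stands in for the Fuzzy Lookup transform, and the 0.8 cutoff is an illustrative similarity threshold, not a recommendation; only the rows that miss the exact lookup pay for fuzzy matching:

```python
from difflib import get_close_matches

def match_customers(incoming_names, reference_names):
    """Exact-match first, fuzzy-match only the leftovers. Returns a
    dict mapping each incoming name to its best reference match, or
    None if nothing is close enough."""
    reference = set(reference_names)
    matched, fuzzy_candidates = {}, []
    for name in incoming_names:
        if name in reference:              # cheap exact lookup
            matched[name] = name
        else:
            fuzzy_candidates.append(name)  # only these reach the fuzzy step
    for name in fuzzy_candidates:          # expensive fuzzy step, small input
        hits = get_close_matches(name, reference_names, n=1, cutoff=0.8)
        matched[name] = hits[0] if hits else None
    return matched
```

The same shape applies in the SSIS data flow: Lookup, then Conditional Split on the lookup result, then Fuzzy Lookup on the no-match branch only.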
Search on MSDN (www.msdn.com) for fuzzy lookup to find articles and examples about the Fuzzy transforms and de-duplication.