Fact Processing

We've been describing how to build packages to perform dimension processing. Now it's time to turn our attention to fact processing. Extracting fact data, consolidating it, and loading it into the data warehouse is generally less complex than managing dimension changes. Most fact tables are at the grain of a transaction. During normal processing of a transaction grain fact table, rows are inserted but not updated. Later in this chapter, we discuss exceptions to this general rule.

There are three main things to worry about when processing facts:

  • Cleaning data and synthesizing data from multiple sources: This task is dependent on your environment. The greatest challenge is the design step: What data needs to be cleaned and synthesized, and how will you go about doing it? Integration Services has many tasks and transforms to use in building your custom logic. For example, the fact table in the MDWT_AdventureWorksDW dimensional model keeps orders data in both U.S. dollars and the original local currency. The source system holds only the local currency, so the fact table ETL process must pull conversion rates and calculate the U.S. dollar amount for each row (a SQL sketch of this calculation follows the list).

  • Performing the surrogate key lookups: The greatest burden of cleaning and conforming falls on dimension processing. By the time you process the facts, you have well-formed dimension tables or staging tables against which to perform the lookups that swap out the natural keys from the extracts for the surrogate keys that get stored in the fact table.

  • Moving and processing large data volumes: A characteristic of fact table processing is that the data volumes are typically much larger than for dimension table processing. Although you want all processing to be efficient, it's particularly important for fact tables. Techniques that work just fine for a dimension table with ten thousand rows, or even a million rows, just aren't practical for a fact table with billions of rows.
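For example, here is a minimal T-SQL sketch of the currency calculation mentioned in the first bullet. The staging table names (OrdersExtract, CurrencyRates) and column names are illustrative, and we assume EndOfDayRate converts the local currency amount to U.S. dollars; in the actual package this logic would more likely live in a Lookup plus a Derived Column transform.

 SELECT o.SalesOrderID,
        o.SalesAmount AS SalesAmountLocal,
        o.CurrencyCode,
        o.SalesAmount * r.EndOfDayRate AS SalesAmountUSD
 FROM OrdersExtract o
 JOIN CurrencyRates r
   ON r.CurrencyCode = o.CurrencyCode
  AND r.RateDate = o.OrderDate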

As with dimensions, you should set up an Integration Services package for each fact table's normal processing. You might include the fact package as another child package in the same master package that calls the dimension table packages. Alternatively, create a separate master package for fact processing. A complex enterprise DW/BI system covering many business processes might have several master packages, especially if some fact tables are populated on different schedules.

Extracting Fact Data

Because you're building transactional fact tables at an atomic grain, new facts are usually easy to find. Most source systems include a transaction date on a fact table record. Work closely with the source system DBAs to develop the fact table extract queries or reports, and the precise schedules and conditions under which the extracts are run. In an ideal world you can specify what you need and when, and it's the transactional DBA's job to provide it. More likely, it will be a joint effort. These issues are discussed in Chapter 5.

Kimball Method ETL Subsystem

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #1 Extract System: Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.

  • #2 Change Data Capture System: Source log file readers, source date and sequence number filters, and CRC-based record comparison in ETL system.

You could probably develop a single Data Flow task to perform all of the fact processing. While this is theoretically possible, it's not advised. You should extract data from the source system and immediately write it as a flat or raw file, with minimal or zero transformations. This staged data provides a restarting point, without having to go back to the source database. The staged data also provides a data quality milestone so the Data Steward can investigate the source of a data problem.

INTEGRATION SERVICES RAW FILES

The raw file format is unique to Integration Services: Only Integration Services can read and write a raw file.

The only required configuration property in the raw file destination is the name and location of the file. We set up a directory called c:\SSIStemp, where we dump temporary files. When debugging, use the default WriteOption, CreateAlways. If you use raw files in your real ETL system, perhaps to hold an image of the extracted data, you might prefer to use a different WriteOption to be sure you don't clobber existing valuable data.

The other trick to the Raw File Destination is that you have to specify the columns to go into the raw file. You might expect the default would be to write all columns from the pipeline into the raw file, but instead the default is to write no columns.

Integration Services reads and writes raw files so fast that you should generally use raw files instead of flat files, unless an external process needs to read the data. You can always set up a trivial Integration Services package to read the data if you need to. Once you've done it once or twice, it's much easier than using flat files, especially if you've ever tried to open a million-row flat file in Notepad!

There are circumstances in which reading and writing raw files is slower than using flat files. Raw files are bigger than the corresponding flat files because they allocate space that corresponds to the data types in the Integration Services data flow. If your data types are wide but the actual data is small, a flat file is smaller. Depending on your disk subsystem, the raw file's advantage in not having to parse the data can be eaten up by disk I/O from the larger file. Raw files are usually faster, but if your disk subsystem is slow and performance is an issue, test how flat files perform in your environment.


Figure 6.9 illustrates a Data Flow task that extracts from a source table, counts the rows, and then writes the flow into a raw file. The Row Count transform puts a count of rows into a package variable created for that purpose. You may recall that we recommend you use a template package that predefines, among other elements, a package variable called RowCount. Ignore the Multicast transform and the second output from it in Figure 6.9 for now. We discuss these elements in the next sections.

Figure 6.9: Data Flow task to extract and check data

You may use technology that's native to the source system, rather than Integration Services, for this extract step. For example, a mainframe-based source system may use reporting functionality to generate a flat file. Communicate clearly and completely between systems, so that you can be confident that the source system is extracting data for the date range that you're expecting in your packages. You can use the Integration Services Execute Process task to launch the extract. Alternatively, ensure the source system extract writes a clear message that you can pick up in Integration Services. The most common method of communication is to have the source system extract write a row to a metadata table that you design. But there are many alternative methods, including making a web service call or sending a message via MSMQ.

Extracting Fact Updates and Deletes

Although new rows to be inserted into the fact table are usually easy to identify, updated and deleted rows can be more challenging. For a transactional fact source table, as opposed to a snapshot fact table, true updates and deletes are, we hope, abnormal events.

In the best case, updates and deletes are handled in the source system as ledgered entries, as we described in Chapter 5. A ledgering source system would handle a change by making a change entry, like increasing quantity by +2 or canceling an order by entering an offsetting transaction. Your normal fact processing sees these offsetting entries as new fact rows, and should handle them smoothly and correctly as inserts.

It shouldn't surprise you to learn that not all transaction systems act this way. If your source system generates updates to historical transactions, let's hope it tags those rows as changed. Many transaction systems flag each row with a last modified date column, on which you can place a filter condition. You need to be diligent in your data exploration, as this column may not be properly maintained. We've often found the column to be well maintained through the source system application logic, but if a DBA performs a bulk operation the last modified date is not reset.
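As a hedged sketch, an incremental extract that relies on such a column might look like the following; the table, column, and parameter names here are illustrative, not part of any particular source system.

 SELECT SalesOrderID, SalesOrderDetailID, OrderDate, ProductID,
        OrderQty, UnitPrice, ModifiedDate
 FROM SalesOrderDetailSource
 WHERE ModifiedDate >= @LastSuccessfulExtractDatetime
   AND ModifiedDate <  @CurrentExtractDatetime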

Unless your fact data volume is tiny, these un-audited updates, the changes that slip past the last modified date filter, are a problem you cannot work around gracefully. The only solutions are extremely unattractive:

  • Ask the source system to fix the problem.

  • Grab the history of facts, and perform a column-by-column comparison to find rows that have changed.

  • Rebuild the entire fact table every load cycle.

The right solution is to push the problem onto the source system DBAs. If they can't properly maintain the last modified date in the transaction fact data, and are unwilling to set up triggers or replication, you should seriously consider refocusing your project on a different subject area.

Deletes present a similar problem. Most often, transaction systems perform a soft delete, which simply flags the row as inactive. It would be a poorly designed transaction system that performs hard deletes on business process data without logging that deleted information, but that doesn't mean it never happens. We like to ask the transactional DBAs if they perform non-logged hard deletes, and gauge their level of horror at the notion.

Note 

The source system in the AdventureWorks sample database does not behave nicely. The SalesOrderHeader table, which is a source for the Orders fact table, gets a new row when an order is placed. That row is updated with the ShipDate when the order ships. We can probably agree that this isn't a great design for a transaction system, but it's not terribly uncommon either. If you extract updates based on ModifiedDate, you'll get these changed rows, even though the Orders fact table does not contain shipping information. Evaluating these changed rows is an unavoidable cost of managing the DW/BI system, but it's lucky that you can expect to encounter only one update per row. In the case of AdventureWorks, we know from discussions with the source system DBAs that no changes relevant to the Orders subject area occur to a fact row, so we don't need to look for updates or deletes at all.

If your transaction grain source system allows updates and deletes, you should convert these into ledgered transactions for your dimensional model. In other words, even if the source system will physically update an order quantity in place, say from five to three, you should convert that update into an insert of -2 in the fact table. This is even more important for deletes. This means you will need a transaction date in addition to the order date in your fact table.
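To make that concrete, suppose the source physically updated a line's quantity from five to three, and you have staged before and after images of the changed rows. A sketch of the conversion into a ledgered entry follows; the table and column names are hypothetical, and the resulting rows then flow through the normal surrogate key pipeline as ordinary inserts.

 -- Emit an offsetting transaction (3 - 5 = -2) rather than updating the fact row in place
 INSERT INTO StagedLedgeredTransactions
        (SalesOrderNumber, SalesOrderLineNumber, ProductID, OrderDate, TransactionDate, OrderQty)
 SELECT n.SalesOrderNumber, n.SalesOrderLineNumber, n.ProductID, n.OrderDate,
        @ExtractDate, n.OrderQty - p.OrderQty
 FROM StagedNewImage n
 JOIN StagedPriorImage p
   ON p.SalesOrderNumber = n.SalesOrderNumber
  AND p.SalesOrderLineNumber = n.SalesOrderLineNumber
 WHERE n.OrderQty <> p.OrderQty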

If you don't convert hard updates and deletes into transactions, your DW/BI system will be un-auditable. In addition, hard updates and deletes create a terrible management problem for aggregates and Analysis Services databases. Analysis Services can't process hard updates or deletes in the facts except by reprocessing the entire associated partition. That may sound harsh, but think about what the logic would be for maintaining aggregate tables by hand. Dropping and rebuilding is a reasonable strategy. By contrast, ledgered entries are easy to manage for aggregate tables and Analysis Services cubes. Ledgered entries are simply incremental data.

Cleaning Fact Data

As with dimension data, cleaning fact data is idiosyncratic to your source systems. Initially, most DW/BI systems need relatively little cleaning of fact data. The fact data sources are usually the core systems used to run the business, and are in pretty good shape. The dimensions are often a mess, but by the time it comes to process the facts, the dimension cleaning has been done. Most of the work of the fact table processing is in looking up the surrogate keys from the cleaned dimensions, as we discuss in the next section.

Kimball Method ETL Subsystem

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #7 Quality Screen Handler: In-line ETL tests applied systematically to all data flows checking for data quality issues. One of the feeds to the error event handler (see subsystem 8).

  • #8 Error Event Handler: Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.

The exception to the rule that fact data needs little cleaning is when those facts come from some type of logging system rather than a transaction system. A clickstream system is a good example: The data source is web logs. In this case, you may need to perform significant filtering, cleansing, and scoring of the fact data to determine its quality and prepare it for loading.

Reference 

Chapter 4 of The Data Warehouse ETL Toolkit has an extensive discussion of these issues.

Checking Data Quality and Halting Package Execution

Kimball Method ETL Subsystem

This section describes techniques relevant to Kimball Method ETL Subsystem #28 Recovery and Restart System: Common system for resuming a job that has halted, or for backing out a whole job and restarting. Significant dependency on backup system (see subsystem 36).

A critical but often-neglected step in fact data processing is to verify that the data in today's extract is valid. Look for evidence that the extract executed correctly. The most obvious metric is the count of rows in the extract. If there are zero rows, abort fact table processing. Design the package to throw an error condition, unless this is a source for which zero activity is plausible.

You can apply increasingly stringent tests. It's probably not sufficient to test for zero rows: Instead, test whether the count of rows today is within several standard deviations of the expected daily row count. Another test is to compare row counts from the beginning and end of the transformation process to make sure you didn't drop any rows. You may also want to sum the sales activity for the day, for comparison to an expected value. Apply these simple tests immediately after the extract and transformation steps, before loading the data into the data warehouse database.

Reference 

Chapter 4 of The Data Warehouse ETL Toolkit includes a discussion on the kinds of reasonableness checks that might cause you to halt package execution.

The article Is Your Data Correct? by Ralph Kimball, Intelligent Enterprise, December 2000, includes a longer discussion of this technique. You can find a link to this article at www.kimballgroup.com/html/articlesbydate/articles2000.html.

Figure 6.10 shows the same Data Flow task as Figure 6.9, but we've moved the transforms around so you can see how to compute aggregate values to check for reasonableness. At the same time as the data flows into the raw file for staging and backup, a copy of that data flows into the Aggregate transform. That transform computes how many customers are in today's extract, how many products, and how many days of data, with rowcounts by day. You'd expect these counts to be fairly consistent over time, subject to some seasonal or cyclical variation.

Figure 6.10: Computing reasonableness checks

In Figure 6.10, we're planning to perform some more transformations on the data. A second Data Flow task will pick up the data from the raw file and continue processing it.

Figure 6.11 illustrates details of the Aggregate transform. The Aggregate transform is a high-performance method for computing multiple aggregate measures. Here, the Aggregate transform performs three calculations in parallel:

  • Count of rows by date (illustrated)

  • Count of distinct customers

  • Count of distinct products

Figure 6.11: Calculating row counts with the Aggregate transform

The row count by date will produce a stream with multiple rows; the other two aggregations are single-valued.

Note 

When you first open the Aggregate transform, it presents a simplified user interface to support a single aggregation. In Figure 6.11 you see the Advanced user interface that shows the different aggregations defined to run in parallel within this single transform.

The next step is to get the customer and product counts back to the Control Flow in order to make a decision about whether to continue processing or abort the package. Remember, the Control Flow is aptly named: It's where you can control the flow of the package's execution. There are several possible methods for transferring the information back out to the Control Flow, including putting the values into package variables, or writing the values out to an audit table. Figure 6.10 illustrates both of these options.

  • The Row Count transform writes the count of rows to a variable named RowCount.

  • The Recordset Destination transform writes the count of rows by day to a variable named RowCountByDay. This variable will hold a recordset rather than a single value.

  • The distinct count of customers and products is written to an audit table in the two OLE DB Command transforms called Update Audit ExtractValue1 and Update Audit ExtractValue2.

The OLE DB Command transform labeled Update Audit ExtractValue1 performs an update to an auditing table. For the customer count, that update statement is:

 UPDATE AuditTableProcessing SET ExtractCheckValue1 = ? WHERE TableProcessKey = ? 

The first question mark in the statement picks up the customer count value that is output from the Aggregate transform. The second question mark is the primary key for the AuditTableProcessing table, directing the update statement to the correct row. Later in this chapter we talk about the Audit dimension and its supporting tables, so we don't go into details here. The important point now is that the system maintains processing metadata about how many rows were extracted and inserted each time a package is run.

This Data Flow task ends after writing the aggregations to the database. The next step is to test these values for reasonableness, and decide whether to halt execution:

  • Read the values from the AuditTableProcessing table

  • Compare to the average counts for the past month

  • Determine whether todays counts are reasonable

Depending on the quality of your data and the workings of your business, the rule you use to evaluate reasonableness might be simple or complex. A good yet simple rule is to evaluate whether today's value is within three standard deviations of the past 90 days' values.

MORE ON THE OLE DB COMMAND TRANSFORM

The OLE DB Command transform is a little complicated to set up. Read Books Online carefully. You can find a helpful walkthrough of setting up the transform at www.sqlis.com. Examine the completed package FactOrders_ETL, available on the book's web site.

The Derived Column transforms in Figure 6.10 are a trick worth explaining. The OLE DB Command transform cannot access package variables. (We assume this is a product oversight that will be fixed in a future version.) In order to use a package variable (the primary key for AuditTableProcessing in this case), you need to create a derived column in the data flow and populate it with the package variable. It's inelegant, but it works.


The Control Flow is the place to gracefully stop package execution, and launch an event like notifying an operator. In this chapter we've worked with the Data Flow, which is just a task (a complex one) within the package's Control Flow. The design of a Data Flow task is analogous to a pipe manifold through which data will flow. In order to launch some event other than the flow of data, you need to put the data somewhere so you can exit to the Control Flow. In the current example, we've put the extracted data in a raw file, written some reasonableness checks to a database table, and stored other reasonableness checks in package variables.

Figure 6.12 illustrates the Control Flow for the fact table extract. The first Data Flow task, labeled Extract and compute rowcounts, is the one from Figure 6.10. In this Control Flow, the system stops all processing and emails Warren if the extract fails, or if it succeeds but contains too few rows. We also want to check the reasonableness of the extract by verifying whether the counts of distinct customers and products in the extracts are within three standard deviations of the recent history. If the reasonableness checks fail, the system stops and emails Warren.

Figure 6.12: Control Flow for fact extract, clean, and check

Notice that several of the precedence constraints, displayed as arrows in the Control Flow, have a function notation (fx). This notation signals that the precedence constraints are more complicated than simply Success, Failure, or Completion. Figure 6.13 shows the Precedence Constraint Editor for the precedence constraint at the top right of Figure 6.12 (with the dashed line). In this case, the system hands control to the Send Mail task if one of two things happened in the first Data Flow task: It failed, or it generated fewer than 5,000 rows. (In the real world, you should use a standard deviation calculation rather than hard-coding a row count limit.) The precedence constraint leading to the Sequence container is the opposite condition.

Figure 6.13: Precedence Constraint Editor
PRECEDENCE CONSTRAINT CONDITIONS

When you set up a conditional flow between tasks, you often want the control to pass to one and only one downstream task. In other words, you want the flow to be exclusive. When the precedence constraints are just success or failure, that's simple. When the precedence constraints have expressions in them, as illustrated in Figure 6.13, it's easy to introduce gaps where either no downstream task is executed, or where multiple tasks are executed. There's no built-in catch for this; you need to be careful and test thoroughly.


Next in the Control Flow of Figure 6.12 are two SQL tasks grouped together in a Sequence Container. The first reads today's counts and puts the results into package variables. The second is a bit more complicated. The task reads the series of customer and product rowcounts from the auditing table, and computes the reasonableness ranges as:

 SELECT
     isnull(convert(int, avg(Extract_Check_Value1) -
        3*stdev(Extract_Check_Value1)), 0) AS CustCountMin,
     isnull(convert(int, avg(Extract_Check_Value1) +
        3*stdev(Extract_Check_Value1)), 0) AS CustCountMax,
     isnull(convert(int, avg(Extract_Check_Value2) -
        3*stdev(Extract_Check_Value2)), 0) AS ProdCountMin,
     isnull(convert(int, avg(Extract_Check_Value2) +
        3*stdev(Extract_Check_Value2)), 0) AS ProdCountMax
 FROM AuditTableProcessing
 WHERE TableName = 'FactOrders'
   AND SuccessfulProcessingInd = 'Y'

Every time the package runs, it logs the count of distinct customers in the audit table. This query calculates a range of plus or minus three standard deviations from the mean, computed over all runs since the package started logging this information to the audit system. In the real world, you'd make this query more complex, perhaps limiting it to the last 60 days, or performing some kind of seasonal adjustment. Put the values of the range minimums and maximums into four package variables.

Placing the tasks inside a Sequence Container makes it easier to define a precedence constraint. This precedence constraint is similar to the one described earlier in this chapter. If the count of distinct customers or products is outside their expected ranges, control will pass to the Email task.

Fact data cleansing is as varied as the possible source systems. More often than not, you'll have data as clean as that in the Adventure Works sample, and you won't have any fact data cleansing to do. In that case, you can check row counts as we did here to confirm that the extract worked as expected.

In a more complex situation, you might not choose to fail processing when you encounter anomalies. You may be able to apply a logical rule to fix the data.

Reference 

Chapter 4 of The Data Warehouse ETL Toolkit talks about the kind of actions you might take when you encounter data anomalies.

The data quality checks that we walked through in this section were performed on the data as it was extracted. You should perform similar checks just before loading the data, to confirm your processing hasn't introduced anomalies.

Transforming Fact Data

We've already discussed the first part of the fact data ETL process: extracting, cleaning, and checking the fact data. In this section, we talk about a second set of work: performing transformations. Cleaning data requires transformations, of course. This section focuses on other kinds of transformations that you may make to fact data.

Aggregating Data

You may wonder why we're talking about aggregating data before loading it. After all, the Kimball Method strongly recommends that transaction fact tables be developed at the finest possible grain. Why would we aggregate data before loading it? Sometimes you just have to. As hardware and software grow more powerful, what we think of as extreme scale has grown substantially. At the time of this writing, a DW/BI system that captures telephone network call detail records is no longer extreme, although five years ago it was. But call detail records are created from call event details, which few telecommunications companies would keep for more than a short while. Similarly, a clickstream or RFID DW/BI system may throw away or aggregate a lot of data, and what it keeps may still be a very large data store.

In a lot of these cases, custom applications already exist to prune and aggregate the data stream. You should certainly use those applications if they do what you need them to do. But if your ETL system needs to do this work, you can use Integration Services to aggregate your data.

Use the Aggregate transform to aggregate the data. We've already used the Aggregate transform to perform distinct counts for determining whether the extract worked correctly. That same transform could be used to aggregate the entire stream of data that will flow into the fact tables. The Aggregate transform is high performance and automatically parallelizes its operations across multiple processors.

Warning 

The Aggregate transform, like the Sort transform, works in memory. Performance is excellent as long as the data flow fits in memory. Consider increasing server memory, or using an alternative method of aggregating data, if you see memory pressure. You may be able to find a third-party aggregation routine that plugs into the pipeline. Or, you could stage the data to a table and use GROUP BY.

Disaggregating Data

Disaggregating data, the opposite of aggregation, sounds like a strange concept, but we do it all the time. The Orders fact table has an example of disaggregating data. The Orders fact grain is at the order line item level, so that users can see sales information by product. But several dollar amounts are stored in the transaction system at the order header level, Freight and Tax in particular. You need to allocate these amounts to the order line item grain of the Orders fact table.

Your business users provide the business logic for allocating these amounts to the line item. There is certainly an algorithm for computing sales tax, usually as a flat rate on some categories of products. (In many states, food items are exempted from sales tax, unless they are intended to be consumed on the premises.) Freight is often calculated based on product weight, which is a common attribute in the Product dimension.

We'll illustrate an example here, allocating sales tax and freight to all the products in the order based on the line item total sale amount as a percentage of the order total sale amount. Here's the logic flow (a SQL sketch of the equivalent calculation follows the list):

  1. Pick up the fact data from the post-extract staging area.

  2. Aggregate by order number, summing total sale amount to serve as the divisor in the allocation computation.

  3. Join the aggregates back in to the original data flow.

  4. Compute line item freight and tax.
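Before walking through the Integration Services implementation, here is the same allocation expressed as a set-based T-SQL sketch. The staging table and column names (StagedOrderLines, StagedOrderHeaders, LineSaleAmount) are illustrative; this is an equivalent of the Data Flow shown in Figure 6.14, not the package itself.

 SELECT d.SalesOrderID, d.SalesOrderLineNumber, d.LineSaleAmount,
        h.Freight * d.LineSaleAmount / t.OrderSaleAmount AS LineFreight,
        h.TaxAmt  * d.LineSaleAmount / t.OrderSaleAmount AS LineTax
 FROM StagedOrderLines d
 JOIN StagedOrderHeaders h
   ON h.SalesOrderID = d.SalesOrderID
 JOIN (SELECT SalesOrderID, SUM(LineSaleAmount) AS OrderSaleAmount
       FROM StagedOrderLines
       GROUP BY SalesOrderID) t
   ON t.SalesOrderID = d.SalesOrderID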

Figure 6.14, from the FactOrders_ETL package, illustrates this common design pattern. Source the data from a raw file and then sort it by OrderId (these two steps are not shown). Then the data flows into the Multicast transform, which copies the pipeline into two or more identical streams. One stream goes into an Aggregate transform, which sums amounts by OrderId.

Figure 6.14: Data Flow for allocating tax and freight

The only slightly tricky element of this design pattern is the Merge Join transform. Merge Join works just like a database join, except it requires that the input data be sorted in the same order. You can do an inner or outer join; in this case an inner join is appropriate. The Merge Join transform, unlike most other transforms, does not automatically map all the input columns to the output flow. Instead, you need to select the columns you want by checking their boxes. This is a good opportunity to drop unneeded columns to keep the pipeline as small as possible.

Finally, calculate the Freight and Tax at the line item level. In the Derived Column transform, replace Freight and Tax with their order header amounts multiplied by each line item's share of the order's total sale amount.

Although this transformation occurs before the surrogate key lookup and assignment, sometimes it's more efficient to do the surrogate key assignment first. If you want to use a dimension attribute like Product Weight in the calculation, pick up that attribute during the ProductKey lookup step of the key assignment.

Other Transformations and Calculations

There are several kinds of common transformations and calculations that we make to fact data. These fall into the following categories:

  • Within-row calculations, like computing the ratio of two columns. The vast majority of computational requirements can be met with the Derived Column transform. Anything else can be met by writing a Script transform. A common transformation of this type is to replace nulls with a default value. This is particularly useful when the null value is in one of the natural key columns you need to join to a dimension. In some source systems, a column is null most of the time and has a code only if a specific event occurred. For example, a null promotion code means the sale did not use any promotions, or a null currency code means the transaction used the home country's currency. Although you could handle these null violations in the surrogate key pipeline, it's best to replace the null with a default value that is the natural key of the corresponding dimension member (see the sketch after this list).

  • Between-row calculations, like the tax allocation problem described in the previous section. Once again, any logic that can't be met by the expression language in the Derived Column transform can be handled via a custom Script transform.

  • Mapping, for example collapsing information about several accounts to a single household ID. Most of the complex mapping logic (determining which accounts are linked together in a household) is handled before the fact table processing, when you're working on the dimensions. The outcome of that dimension processing is a straightforward mapping table for use when processing the facts. The mapping table identifies which set of account numbers maps to a single household. The steps required to run the fact data against the mapping table in order to look up the household ID are essentially the same as you'll use for surrogate key assignment, described next.
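The null-replacement rule from the first bullet, sketched in T-SQL against an illustrative staging table. The default codes shown ('NO PROMOTION' and 'USD') are assumptions that stand in for whatever natural keys your Promotion and Currency dimensions use for those default members; the same logic could just as easily live in a Derived Column transform.

 SELECT SalesOrderID, SalesOrderLineNumber,
        ISNULL(PromotionCode, 'NO PROMOTION') AS PromotionCode,  -- natural key of the "no promotion" member
        ISNULL(CurrencyCode, 'USD')           AS CurrencyCode    -- assumed home currency
 FROM StagedOrderLines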

Surrogate Key Pipeline

Everyone who follows the Kimball Method must load fact data by looking up surrogate keys from the conformed dimensions. Your extract contains natural keys; the fact table contains the substituted surrogate keys. The dimension tables contain the mapping between natural keys and surrogate keys that you use to do the substitutions.

Kimball Method ETL Subsystem

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #19 Surrogate Key Pipeline: Pipelined, multithreaded process for replacing natural keys of incoming data with data warehouse surrogate keys.

  • #29 Parallelizing/Pipelining System: Common system for taking advantage of multiple processors, or grid computing resources, and a common system for implementing streaming data flows. Highly desirable (eventually necessary) that parallelizing and pipelining be invoked automatically for any ETL process that meets certain conditions, like not writing to the disk or waiting on a condition in the middle of the process.

Type 1 Dimensions

As usual, we'll start with the easiest possible case: dimensions whose attributes are all updated in place. Type 1 dimensions have a one-to-one mapping between natural keys and surrogate keys.

The heart of the surrogate key pipeline is to look up the natural key in the fact stream and replace it with the surrogate key from the dimension table. You won't be surprised to learn that we'll use the Lookup transform to do the heavy lifting.

The Lookup transform supports a single or multiple column equi-join between the data in the pipeline and data in an OLE DB source like a dimension. By default, the entire lookup table or query is cached. To minimize memory pressure, use a query that returns only the columns of interest; for dimensions that have only Type 1 attributes, this is usually just the natural key and the surrogate key. If you don't have enough memory, you can set up the transform to use partial or even zero caching.

Figure 6.15 illustrates a piece of the surrogate key pipeline that includes three lookups for Type 1 dimensions: Date, Promotion, and Product. These use the default settings in the Lookup transform, except we used a source query instead of the entire table, as we just described.
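For example, the source query behind the Product lookup might be as narrow as the following sketch; the column names are illustrative, following the BK prefix convention used elsewhere in this chapter.

 SELECT ProductKey, BKProductID
 FROM DimProduct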

Figure 6.15: Surrogate key pipeline, Type 1 dimensions

It looks like this pipeline will execute the lookups in series, one after the other. Not so: On a multi-processor system, multiple lookups will execute in parallel. You don't need to set up anything; Integration Services handles parallelism for you automatically.

Note the error flows to the right from each Lookup transform, labeled Lookup Error Output. These flows handle the cases where the fact has no corresponding value in the lookup table. Never insert a fact that has no corresponding dimension member. The way you choose to handle the error rows depends on your business requirements, technical environment, and data realities. We list several possibilities here, and Figure 6.15 illustrates the first three:

  • Throw away the fact rows. This is seldom a good solution, but we illustrated it for the Product dimension. The error flow for the Product dimension counts the error rows but doesn't transform them or store them anywhere.

  • Write the bad rows to a file or table. This is a common solution, and we illustrated it for the OrderDate dimension lookup.

  • Map all the bad fact rows to the Unknown member for that dimension by supplying a key of -1 (or whatever your Unknown Member key is). We illustrated this solution for the Promotions dimension. This is seldom a good solution because the fact row is difficult to fix if you receive better information about what the promotion was supposed to be.

  • Insert a dummy row into the dimension. This case is discussed in the following subsection, Early Arriving Facts.

  • Fail the package and abort processing. This is the default if you don't set up an error flow. It seems a bit draconian.

Logging to a file or mapping to the Unknown member (or both) are the most common approaches. Whatever you do, be sure to place each error row count into its own package variable. When you get back out to the Control Flow, you'll log these error counts, shoot off email about the events, and potentially halt processing.

Early Arriving Facts

An early arriving fact is a fact row that arrives before its associated dimension member has been created. If you always process dimensions before facts, how can you get a fact before its dimension member? In some systems this is a possible (non-error) scenario. The best example is when you sign up for a grocery store loyalty card. They generate a member number right there at the register and tag that first purchase to you. You take home the form, fill it out, and send it in. In this case your purchase facts will arrive long before the new customer dimension member.

If this scenario fits your business, you really want to create a skeleton dimension member and load the facts. In due time you'll get the attributes for this dimension member, and can update the appropriate dimension row. We discussed the dimension side of this issue earlier in this chapter, when we talked about inferred members in the Slowly Changing Dimension transform.

The logic flow for handling the early arriving facts is to:

  1. Send the lookup failure errors into an OLE DB Command transform.

  2. Set up the OLE DB Command transform to insert a row into the dimension table with only three columns populated: the natural key, the surrogate key (which is populated automatically if you use table identity columns), and the audit key for the current package. Once again, we're trying not to talk in detail about the audit subsystem until later in this chapter, but it makes sense that you'd want to flag this dimension row as having been inserted during a fact table's lookup process. (A SQL sketch of this insert follows the list.)

  3. If you defined the dimension table to generate default values for the non-key columns, these attributes will be populated automatically as you insert each row. The SCD transform identifies an inferred member in one of two ways: either all the non-key attributes are null, or you explicitly specify a column that identifies an inferred member.

  4. Send the data flow into a new Lookup transform. Don't cache this second Lookup, to ensure the data flow picks up the new dimension member that you just added. On the Advanced tab of the Lookup editor, choose Enable memory restriction.

  5. Add error handling to this second Lookup, just in case.

  6. Finally, Union this flow back with the package's main flow.
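Here is a sketch of the parameterized statement behind the OLE DB Command transform in step 2, assuming the dimension's surrogate key is an identity column and using illustrative column names; the two parameters are mapped to the natural key and audit key columns in the pipeline.

 -- CustomerKey (the surrogate key) is an identity column and is generated automatically
 INSERT INTO DimCustomer (BKCustomerID, AuditKey)
 VALUES (?, ?)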

Type 2 Dimensions

Fact table lookups against a dimension that tracks historical change are more complicated because the dimension table has multiple rows for each natural key. A simple lookup against the dimension table, joining on the natural key, will not give the results you need.

If today's load includes data only from yesterday, this is only slightly more complicated than the surrogate key pipeline for Type 1 dimensions. In this case, the source query for the lookup limits the results to the currently active row for the dimension, like:

 SELECT CustomerKey, BKCustomerID FROM DimCustomer WHERE RowIsCurrent = 'Y' 

All the error handling, and handling of early arriving facts, is exactly the same.

Loading Fact Data

The next step of fact processing is, at long last, to load the data into the fact table. This section describes how to load data for the three main kinds of fact tables: the common transaction grain, and the periodic and accumulating snapshot fact tables. We discuss some extra things to worry about during the initial load of historical data. And we discuss how you should modify your fact table packages to automate the creation of relational table partitions.

Kimball Method ETL Subsystem

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #16 Transaction Grain Fact Table Loader: System for updating transaction grain fact tables including manipulation of indexes and partitions. Normally append mode for most recent data. Uses surrogate key pipeline (see subsystem 19).

  • #17 Periodic Snapshot Grain Fact Table Loader: System for updating periodic snapshot grain fact tables, including manipulation of indexes and partitions. Includes frequent overwrite strategy for incremental update of current period facts. Uses surrogate key pipeline (see subsystem 19).

  • #18 Accumulating Snapshot Grain Fact Table Loader: System for updating accumulating snapshot grain fact tables including manipulation of indexes and partitions, and updates to both dimension foreign keys and accumulating measures. Uses surrogate key pipeline (see subsystem 19).

ADVANCED TOPIC: LATE ARRIVING FACTS

A late arriving fact is one that arrives after its transaction date. Under normal operations, you expect to load today the fact data from yesterday's transactions. Most systems work this way, but some sources have late data trickling in days or even months after the event date. If you have only Type 1 dimensions, late arriving facts don't need special processing. But if you have any Type 2 dimensions, you need to look up the dimension key that was active at the time the fact event occurred, not the key that's current today.

If you used SQL to perform the lookup, the join between a fact and dimension table would look like the following:

 SELECT d.DimSurrogateKey, d.DimNaturalKey
 FROM FactStream f
 INNER JOIN DimTable d
    ON (f.DimNaturalKey = d.DimNaturalKey
        AND f.EventDate BETWEEN d.RowStartDate AND d.RowEndDate)

But the Lookup transform doesn't support this kind of complex join.

We don't have a single recommended strategy to solve this problem because no single strategy dominates the others. Your solution is going to depend on business requirements, whether you have large data volumes or very large dimensions, and whether late arriving data trickles in over months or just a few days. We describe three techniques next, but we can think of dozens of variations of these techniques. This is a good opportunity for some design creativity.

STRATEGY 1: LOOP OVER DAYS

The first strategy is to create a loop out in the Control Flow that runs all or part of the fact table surrogate key pipeline for each date represented in the fact table's incremental load. If you look back to Figure 6.10, you'll see that when we were testing for reasonable row counts, we also ran a check to see how many dates contain data. In the Control Flow, you could set up a Foreach Loop container to iterate over this rowset.

To support late arriving facts, define a view on the Type 2 slowly changing dimensions, and update that view at the start of each iteration. The view selects the dimension rows that were active at a specific date in the past. For example, the iteration for June 1, 2004 would redefine the dimension view as:

 CREATE VIEW v_MyDimKeyLookup AS
 SELECT DimSurrogateKey, DimNaturalKey
 FROM DimTable
 WHERE '06/01/2004' BETWEEN RowStartDate AND RowEndDate

This view will return a single row for every dimension member active on June 1. In the Lookup transform for this dimension, use the view as the source. When you're processing fact rows for June 1, they'll get assigned the correct dimension surrogate keys that were active on that date.

As you can see, it's vital that you correctly maintain dimension row start and end dates. You need a complete set of rows for each dimension member, with date ranges that contain neither overlaps nor gaps.

At the beginning of the Data Flow step, use a Conditional Split to pull out the data you're working on for that iteration. Depending on relative data volumes, it may be efficient to write the older data to a raw file, rewriting over the same file each time. If you work through the data from newest to oldest, you'll rewrite the least data.

This technique will work well if your data arrives just a few days late. If you have late fact data arriving for dozens of dates, this technique will probably not perform well enough. This would be especially true if your dimensions are large.

STRATEGY 2: USE THE RELATIONAL DATABASE

Use a Conditional Split to split off the late arriving facts just before you're faced with the Type 2 lookups, and write them to a staging table. Complete the processing of yesterday's facts as normal.

Back out in the Control Flow, use an Execute SQL task to create indexes on the staging table. Index each natural key.

Write a second Data Flow task to handle the late arriving data. It has a single source: a query that joins the staging table to the Type 2 dimensions on the natural keys, using the correlated BETWEEN clause to pick up the correct dimension surrogate keys. One query can pick up all the dimension keys; this is effectively a star query on the natural keys, returning the surrogate keys. Write the resulting flow into the fact table.
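A sketch of that single star query on the natural keys, shown for two Type 2 dimensions with illustrative table and column names; each additional Type 2 dimension adds another join of the same shape.

 SELECT f.*, c.CustomerKey, p.ProductKey
 FROM LateArrivingFactStage f
 INNER JOIN DimCustomer c
    ON c.BKCustomerID = f.BKCustomerID
   AND f.OrderDate BETWEEN c.RowStartDate AND c.RowEndDate
 INNER JOIN DimProduct p
    ON p.BKProductID = f.BKProductID
   AND f.OrderDate BETWEEN p.RowStartDate AND p.RowEndDate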

This approach has the significant disadvantage of not handling errors well. You'll need to write code to handle referential integrity violations: late arriving facts whose dimension natural keys don't exist in the dimension tables.

STRATEGY 3: ACCUMULATE LATE FACTS AND PROCESS WEEKLY

Once your fact data is late, it may be less urgent to add that data into the DW/BI system immediately. It may be acceptable to the business users if you stage late arriving facts until the weekend. A week's worth of late data will benefit from economies of scale in processing, and weekends typically have longer load windows.

As a variation on this theme, you might immediately load the data that's one or two days late, and defer the rest until the weekend.


Loading Transaction Grain Facts

For the incremental load into a non-partitioned transaction grain fact table, use the OLE DB or SQL Server destination as the final step in the data flow. The fast load options are usually irrelevant for incremental loads. Of course you want the load to be fast, but the fast loading path requires either that you load into an empty table, or that you load into an unindexed table. For a daily incremental load, it usually doesn't make sense to drop and rebuild indexes because of the way SQL Server handles indexes during inserts, and obviously there's already data in the table. Unless you're using partitions, the only time you can realistically perform a fast load is during the initial historical load of the fact data.

As with the Lookup transform, always redirect errors generated by the Destination adapter. Count errors and dump the rows to a file or staging table. This is vital during package development; by the time you're in production you hope to have found and fixed the myriad errors that could cause the insert to fail. But you never know.

Aside from using relational table partitions, the only way to improve the performance of incremental loads is to not enforce referential integrity between the fact and dimensions. It's common practice to rely on the surrogate key pipeline to maintain the all-important referential integrity. Always try to finish processing within your load window with the foreign key constraints in force; remove them only if you must.

Most systems will be able to finish processing within the allotted time, even without the fast path for table inserts.

Incremental Loads and Table Partitions

The most common table partitioning strategy is to partition by month. A monthly partitioning strategy is easy to maintain (see the related discussion in Chapter 4), but it doesn't help with the performance of the daily incremental load. Only if you process facts monthly could you load into an empty monthly partition, and hence use the fast load path.

In this most common case, specify the partitioned table as the destination for the OLE DB or SQL Server destination adapter, just as if the table were not partitioned. The data will be inserted at approximately the same speed as into a non-partitioned table.

If you have extreme daily fact table volumes and a short load window, create daily partitions. In theory this is no more difficult to manage than monthly partitions. In practice, you can't have more than 1,000 partitions, so you'll need to create a process to consolidate partitions on a weekly or monthly basis.
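A hedged sketch of that consolidation, assuming a partition function named pfFactOrdersByDay with one boundary value per day; merging a boundary value combines the two partitions on either side of it.

 -- Remove the boundary between June 1 and June 2, combining those two daily partitions
 ALTER PARTITION FUNCTION pfFactOrdersByDay()
 MERGE RANGE ('20060602')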

The Historical Load

The historical fact table load is the process in which you are manipulating the greatest volume of data. If you have only Type 1 dimensions, or if you're not recreating history for your Type 2 dimensions, the historical load is simply a big version of the incremental load. It's common to run the historical load in chunks of months, quarters, or years.

In this simple case, you have two things in your favor: You can perform fast loads, and you will be loading the data before the DW/BI system goes online. So if it takes a week, it takes a week.

PARTITIONED FACT TABLES

If your fact table is partitioned, load partitions in parallel for best performance. There are many ways to load in parallel. You could design your package to load a month of data. Parameterize the month, and launch several versions of the package at the same time. Alternatively, set up a conditional split to create multiple streams within a single Data Flow task.

We discussed partitioned tables in Chapter 4. Table partitioning is a feature of SQL Server Enterprise Edition.


For the historical load or loads, do the following (a SQL sketch of the index and constraint steps follows the list):

  1. Completely load all dimensions.

  2. Back up the data warehouse database.

  3. Make sure you have enough disk space, and that your database files are set big enough, or are set to autogrow.

  4. Put the data warehouse database into Simple recovery mode.

  5. Drop the indexes on the fact table. Drop (or disable) any foreign key constraints.

  6. Use the OLE DB or SQL Server destinations as appropriate. With the OLE DB destination, make sure you choose the fast load data access mode (Table or view - fast load).

  7. Turn on Table Lock.

  8. Handle insert errors appropriately in your error flow.

  9. After all the historical data is loaded, rebuild the indexes.

  10. Recreate and check any foreign key constraints, if you're using them.
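A T-SQL sketch of steps 5, 9, and 10, using illustrative index and constraint names; adjust it to the indexes and constraints your fact table actually carries.

 -- Step 5: disable nonclustered indexes and foreign keys before the bulk load
 ALTER INDEX IX_FactOrders_ProductKey ON FactOrders DISABLE;
 ALTER TABLE FactOrders NOCHECK CONSTRAINT FK_FactOrders_DimProduct;

 -- Steps 9 and 10: rebuild the indexes and re-validate the constraints after the load
 ALTER INDEX ALL ON FactOrders REBUILD;
 ALTER TABLE FactOrders WITH CHECK CHECK CONSTRAINT FK_FactOrders_DimProduct;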

The historical load process is more likely to fail than the incremental loads, probably because of the inherent challenges of dealing with large data volumes and the reality that source systems change over time. And because you're dealing with so much data, restarting the load is so much more expensive.

With DTS 2000, we would commit every 10,000 rows as a best practice. With Integration Services, set up an error flow to catch bad data. The error flow will also catch data that arrives after the database runs out of room, in the unlikely event that you've miscalculated your space requirements. Error flows minimize the importance of periodic commits for the purpose of recoverability. However, if the batch size and commit level are set to zero, the relational database will consume valuable memory. It's still a good idea to batch and commit at reasonable intervals.

Tip 

The optimal commit level is application dependent; it depends on what else is happening on your machine. The best way to find the appropriate commit level is through trial and error: Set it at different levels, and watch memory usage. At some point, buffering the load in memory constrains the server's performance. Set the commit batch size a bit smaller than that level.

You can't set the commit level and batch size on the SQL Server Destination, but you can on the OLE DB Destination.

If you've been able to recreate history for your Type 2 dimensions, you now face the challenge of matching the historical fact data with the correct Type 2 dimension surrogate key. This is exactly the same problem we discussed at length with late arriving facts.

Because you'll be processing a full day of data for each date, your best bet is almost certainly to load historical data by using the looping strategy discussed previously.

Loading Periodic Snapshot Fact Tables

A periodic snapshot fact represents a span of time, regularly repeated. This style of table is well suited for tracking long-running processes like bank accounts. The facts are things like end-of-month account balances, and total deposits and withdrawals during the month.

Reference 

Issues around loading periodic snapshot fact tables are well described in Chapter 6 of The Data Warehouse ETL Toolkit .

Periodic snapshots, and periodic snapshots with a current hot rolling period, are particularly easy to load if the fact table is partitioned. In either case, load into an empty partition and then swap that loaded partition into the fact table. These processes can use fast path loading.
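A sketch of that swap, assuming the month was bulk loaded into a staging table that exactly matches the fact table's structure, indexes, and constraints and sits on the target partition's filegroup; the table names and partition number are illustrative.

 -- Metadata-only operation: the loaded month becomes partition 42 of the fact table
 ALTER TABLE FactAccountSnapshot_Load
 SWITCH TO FactAccountSnapshot PARTITION 42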

An Analysis Services measure group that's built from a periodic snapshot fact table can be incrementally processed each month. If the measure group also contains the current rolling period, the Analysis Services database should be partitioned by month. Plan to reprocess the current month's partition each day; older partitions need not be touched. See Chapter 15 for more information on Analysis Services partitions.

Loading Accumulating Snapshot Fact Tables

The accumulating snapshot fact table is used to describe processes that have a definite beginning and end, like order fulfillment, claims processing, and most workflows. The accumulating snapshot fact table is characterized by many date dimensions for different roles of the date. For example, an order fulfillment fact table might have dates for when the order was placed, shipped, delivered, and paid for.

The nature of the accumulating snapshot means that each fact row will be updated several times, as the underlying process moves through its lifecycle. Microsoft SQL Server offers no special magic for managing an accumulating snapshot. Here are two ways to improve performance in an environment with a high volume of updates:

  • Conditionally split the flow of updates in order to spin up several parallel update transforms. For example, split the stream into four flows based on an arbitrary column like product. Each of the four flows ends in an OLE DB Command transform that performs simultaneous updates on the fact table.

  • Write the set of fact rows to be updated into staging tables, and update in a bulk transaction rather than row by row (see the sketch after this list). This SQL-based approach cannot take advantage of the error handling that Integration Services provides. However, it's likely to perform far better than using the OLE DB Command transform.
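A sketch of the set-based alternative described in the second bullet, assuming the changed rows have been staged with their surrogate keys and new measure values already computed; the table and column names are illustrative.

 UPDATE f
 SET    f.ShipDateKey    = s.ShipDateKey,
        f.PaymentDateKey = s.PaymentDateKey,
        f.DaysToShip     = s.DaysToShip
 FROM   FactOrderFulfillment f
 JOIN   StagedFulfillmentUpdates s
   ON   s.SalesOrderNumber     = f.SalesOrderNumber
  AND   s.SalesOrderLineNumber = f.SalesOrderLineNumber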

An Analysis Services database or measure group that's built against an accumulating snapshot fact table is challenging to process. It's difficult to avoid a scenario where you reprocess the entire structure each load cycle.

Analysis Services Processing

The processing of Analysis Services databases is usually a part of the ETL system. You usually want to trigger Analysis Services database processing upon successful completion of the relational data warehouse loads.

Kimball Method ETL Subsystem

This section describes techniques relevant to Kimball Method ETL Subsystem #22 Multi-Dimensional Cube Builder: Creation and maintenance of star schema foundation for loading multidimensional (OLAP) cubes, including special preparation of dimension hierarchies as dictated by the specific cube technology.

Integration Services provides the Analysis Services Processing and Analysis Services Execute DDL Control Flow tasks. These are easy to integrate into your ETL system, at the master package or table-specific package level, and are described in Chapters 7 and 15.

Analysis Services provides several ways of processing its databases, including an automated proactive caching system that self-processes. We'll defer detailed discussion of Analysis Services processing to Chapter 7.


