Data Flow Optimization


Integration Services is a platform, meaning that its purpose is to support applications built on its features. The data flow, for example, has no value in and of itself; the value is gained when it is applied to a data-processing purpose (such as warehouse ETL, system integration processes, or DBA data management operations). What this means when it comes to optimization techniques for the data flow is that there are no guaranteed formulas that yield performance gains, because every data flow design has a unique scenario. This section includes principles and common areas for you to evaluate against your data flows to determine applicability.

Pipeline Architecture Review

The SSIS data flow engine uses data buffers to manage data as it flows through the pipeline. As data is extracted from the sources, it is consumed into reserved memory spaces, called buffers. These buffers are acted upon by the various transformations as the data moves through the data flow to the destination. A complete review of the SSIS pipeline engine is found in Chapter 10 of the Professional SQL Server 2005 Integration Services book (Indianapolis: Wiley, 2006). Here are key points to remember:

  • When data is loaded into a buffer, the transformation logic is applied to the data in place (where possible). In a sense, the transformations pass over the buffers, giving SSIS the capability to stream data through the pipeline. These in-place buffer changes are inexpensive compared to cases where the pipeline needs to copy data to new buffers. In some cases, the source data can still be in the process of extraction while, at the same time, data is already being landed at a destination.

  • Transformations can be categorized as blocking transformations, partially blocking, or streaming transformations (streaming transformations are sometimes called row transformations). Blocking transformations, like the Aggregate or Sort transformations, hold all the rows up in the pipeline before they are allowed downstream. Partially blocking transformations hold up a portion of the data, and streaming transformations allow the data to be available to the next transformation immediately.

  • Partially blocking and blocking transformations have asynchronous outputs, meaning that the output of the transformation is not synchronously connected to the inputs of the next transformation. The effect is that the data in the buffers is copied to a new buffer from the upstream transformation to the downstream transformation. This has obvious resource overhead.

  • The Conditional Split and Multicast are examples of streaming components, even though they appear to copy data. Their outputs are synchronous and, therefore, their operations perform logical multicasting and splitting, not physical. This helps optimize the data flow by limiting the places that require data copies, which are expensive. The Union All transformation has an asynchronous output and is, therefore, a semi-blocking transformation, even though it appears to be streaming.

  • Each data flow has execution trees, which are groupings of sources, transformations, and destinations based on whether data is copied to new buffers (asynchronous outputs). A single execution tree only has one process thread that can work on the transformations contained in it.

  • Sources also get their own process threads (or engine threads). Therefore, in SQL Server 2005 Integration Services, the total number of threads that can perform work on a data flow is the number of sources plus the number of execution trees. However, this can be throttled by the EngineThreads property at the data flow level (which is more of a hint to the thread scheduler than a rule). Sources always get their own thread, so this setting only applies to execution trees. If a data flow has seven execution trees but the EngineThreads property is set to 5, then some of the execution trees will have to share a process thread.

    Tip 

    Note that in the next release of SQL Server, the SSIS thread mapping model will be more free-threaded and, therefore, these direct thread mappings to sources and execution trees will not apply.

  • Execution trees and engine thread mappings are viewable by logging the PipelineExecutionTrees and PipelineExecutionPlan events, which are available in the SSIS Logging at the data flow level.

To demonstrate, the following two packages are significantly different in how many threads the engine can apply. In the first example, Figure 12-6 shows the data flow containing seven execution trees.

Figure 12-6: Data flow containing seven execution trees

In this example, since the Merge Join and Union All are both examples of semi-blocking transformations with asynchronous outputs, each use causes a new execution tree.

In contrast, the example shown in Figure 12-7 only has one execution tree for the entire data flow.

Figure 12-7: One execution tree for the entire data flow

In this example, the Lookup transformation, Derived Column, and Conditional Split are examples of streaming transformations (or transformations with synchronous outputs). Therefore, this second data flow only contains a single execution tree.

General Pipeline Optimization

This first data flow optimization section covers general design practices to be mindful of when building or optimizing a data flow. The ensuing sections will look at the data flow settings and the OLE DB destination.

Look for Data Flow Backpressure

Backpressure occurs when a transformation or destination cannot keep up with the rows being fed to it, forcing the upstream data feed to slow down. It is caused by resource bottlenecks. For example, if an Aggregate transformation is being used, and there is either not enough RAM to hold all the buffers in memory or not enough processor capacity to handle the aggregate calculations, then the upstream source extraction or transformations will be slowed down. If you are trying to extract data from a source as quickly as possible, you must resolve the backpressure.

Use Blocking Transformations Sparingly

Blocking transformations (such as Sort, Aggregate, and Fuzzy Grouping) can impact performance when dealing with large data sets in the pipeline. If you do not have enough memory to handle these operations, then SSIS will use temporary disk space to store data, which increases I/O and processor impact.

Pre-sort Sources as an Alternative to the Sort Transformation

You can tell a source adapter that its data is already sorted and, therefore, limit the use of the Sort transformation. As already mentioned, a relational database can apply an ORDER BY clause to reduce or eliminate the need for the Sort transformation, and flat files may also be created in a physical sort order. In either case, the source adapter can be configured to indicate that the data entering the pipeline is sorted. Transformations such as Merge and Merge Join require that their inputs be sorted, so eliminating the Sort preserves memory and enhances performance. Pre-sorting requires the source output and column properties to be modified; details are discussed in Chapter 4.
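As a minimal sketch of this technique, the source query below sorts the rows in the database engine rather than in the pipeline (the table and column names are hypothetical). The query alone is not enough; you also need to open the source adapter's Advanced Editor, set IsSorted to True on the output, and set SortKeyPosition on the sorted columns, as described in Chapter 4.

    -- Hypothetical OLE DB Source query: let the relational engine do the sorting
    -- so that no Sort transformation is needed before a downstream Merge Join.
    SELECT CustomerID, OrderDate, SalesAmount
    FROM   dbo.SalesOrder
    ORDER BY CustomerID;   -- must match the sort key the downstream Merge Join expects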

Limit Row-by-Row Operations

A row-by-row (or row-based) operation requires that independent actions be taken, one at a time, for each row going through the particular transformation. The worst offender is the OLE DB Command transformation, because it calls an OLE DB command (such as an UPDATE statement or stored procedure) for every row that is sent through it. The non-cached Lookup, Fuzzy Lookup, Import and Export Column, and Slowly Changing Dimension transformations all exhibit similar characteristics. To be sure, these transformations provide valuable functionality in moderate to low volumes, or where expected package execution time is flexible.
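One common way to limit this kind of row-by-row work, sketched below under hypothetical table names, is to land the changed rows in a staging table with the data flow and then issue a single set-based statement from an Execute SQL Task, rather than calling an OLE DB Command for every row. This is a general pattern, not a feature of the OLE DB Command itself.

    -- The data flow bulk-loads dbo.CustomerStage; one Execute SQL Task then applies
    -- all the updates in a single set-based statement instead of one UPDATE per row.
    UPDATE c
    SET    c.EmailAddress = s.EmailAddress,
           c.Phone        = s.Phone
    FROM   dbo.Customer AS c
    JOIN   dbo.CustomerStage AS s
           ON s.CustomerID = c.CustomerID;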

Optimizing the Lookup and Managing the Cache Size

When a Lookup requires a large reference table, there are a few methods to optimize its performance. First of all, a non-cached Lookup is a row-based operation, and when your input row count is large, the cost of every source row generating a separate database query is expensive. Therefore, it is better to use a partial-cache or full-cache Lookup. A partial cache builds the cache as the Lookup executes, so cache misses still generate individual database queries and the transaction impact remains high; it is a viable solution when you have a large number of input rows but the count of distinct values in the lookup key columns is low. However, the best option is to use the full-cache feature of the Lookup and filter the reference table.

Following are two approaches that can filter the reference table cache:

  1. Apply an 80/20 rule and load one Lookup transformation with 20 percent (or some smaller percentage) of the most common matching records, which will allow at least 80 percent matches in the full cache. Then, for any row that does not match, redirect the row to a non-cached lookup, which will catch the rest of non-matched rows. This approach would look like the data flow in Figure 12-8, and greatly reduce the memory requirement.

    Figure 12-8: Applying an 80/20 rule

  2. A second approach to filter the Lookup is to use a view or hard-coded SQL statement that only loads the rows that are needed in the Lookup cache. This assumes, however, that you know which rows need to be available in the Lookup cache. To do this, you must have the matching key columns from the source in a staging table on the same server. If you already have a staging table, then you can simply perform an INNER JOIN with the primary Lookup reference table (hard-coding the SQL statement or using a SQL view). If you don’t already have a staging table, then you must add a new data flow to your package (before the data flow that contains the Lookup cache) that creates a staging table with only the matching key columns. You may be surprised, but this approach often produces a package that will run faster than not filtering the Lookup, even though there is an added data flow. And, in many cases, this approach will allow the use of the full-cache Lookup, where before it was not possible.
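    As a rough illustration of this second approach, the view below (with hypothetical table and column names) joins the Lookup reference table to the staging table of source keys, so that the full-cache Lookup loads only the rows it can actually match.

        -- Hypothetical view that trims the Lookup reference table down to the keys
        -- present in a staging table loaded earlier in the package (same server).
        CREATE VIEW dbo.vw_ProductLookupFiltered
        AS
        SELECT p.ProductBusinessKey, p.ProductSurrogateKey
        FROM   dbo.DimProduct AS p
        JOIN   (SELECT DISTINCT ProductBusinessKey
                FROM   dbo.ProductStage) AS s
               ON s.ProductBusinessKey = p.ProductBusinessKey;

    The full-cache Lookup is then pointed at this view (or at the equivalent hard-coded SQL statement) instead of the full reference table.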

However, when all else fails and you need to use a non-cached lookup, then you should at least be sure that the underlying reference table has an index with the matching key columns as the first column in the index, followed by the Lookup return columns.
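If you do end up with a non-cached Lookup, an index shaped like the following sketch (the names are placeholders) lets each per-row query be satisfied with an index seek; the matching key column leads the index and the Lookup return columns are carried as included columns.

    -- Hypothetical covering index for a non-cached Lookup against dbo.DimProduct.
    CREATE NONCLUSTERED INDEX IX_DimProduct_BusinessKey
    ON dbo.DimProduct (ProductBusinessKey)
    INCLUDE (ProductSurrogateKey, ListPrice);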

Removing Unused Columns from the Data Flow

An often unheeded warning in the execution results is that there are columns in the data flow that are not used downstream in the pipeline. Sometimes this is dismissed as an annoyance, but in reality it is a very important warning. The pipeline engine reserves space in the buffers for every column, whether or not it is used downstream, and the extra columns can cause significant memory overhead: your data flow buffers take up more space than necessary. By removing the columns, you make more memory available for other operations. To be sure, you can only remove columns between execution trees. So, for the Union All, Aggregate, Sort, Merge Join, and other transformations with asynchronous outputs, be sure to edit the transformations and only let columns through if they are used downstream! This is similar to the DBA tenet of never using SELECT * and instead naming only the columns needed in a SQL SELECT statement so that the SQL engine can optimize the query.

Use the Union All to Break Up Execution Trees

Sometimes you will have a data flow with a very large execution tree due to the transformations that are used. The potential bottleneck here is that any single execution tree can only use one process thread. Therefore, if you have a lot of Lookup or Data Conversion transformations that don't create new execution trees, but do have higher processing requirements, then you may run into a bottleneck with a single engine thread trying to handle all the operations.

The solution is to add a dummy Union All transformation in the middle to break up the execution tree. The Union All would only have one input and output, but allow multiple threads to perform the operations. Figure 12-9 shows a modified version of the previous example shown in Figure 12-7. Previously, this data flow had a single execution tree, and now a second has been added.

Figure 12-9: Example data flow with a second execution tree added

You might think that the data copied to new buffers would negate the value of forcing a new execution tree. However, another value of this approach is that it enables you to remove the columns not needed in the downstream transformations, thus also reducing the buffer size below the Union All. For example, if you have a Lookup transformation before the Union All that uses a set of business keys to look up a primary key reference, once the Lookup is complete, those business keys are often not needed later in the data flow, and, therefore, they can be removed at the Union All.

Tip 

Note that the SSIS threading will change in the next release of SQL Server, therefore, this optimization will only apply to SQL Server 2005 SSIS.

Stage Data with a Multicast

As mentioned in Chapter 3, another data flow optimization technique is to leverage the Multicast transformation when you need to stage data, instead of using two separate data flows. This approach is further worked out in Chapter 3, but the advantage comes in the reduced disk I/O and the parallel or asynchronous nature of this approach.

Be Mindful of the SSIS Logging Impact

SSIS logging certainly provides a valuable tool for monitoring and troubleshooting packages, but be aware that if you log every event in a moderately complex or complex package, the logging itself will produce a lot of overhead activity for the package, as well as for the underlying connection that the logging is configured to use, especially if that connection is a database (as opposed to a file). By limiting the logging to only the OnPreExecute, OnPostExecute, OnWarning, OnError, and OnTaskFailed events, you reduce the possibility of your logging significantly impacting your package performance.

Transactions

You should be careful of the impact of transactions on your data flow. When you set the TransactionOption property of a package or task to Required, SSIS uses the Distributed Transaction Coordinator (DTC, a system service) to allow transactions to span various technologies. However, this comes with overhead, and it may be better (depending on your situation) to use native transactions in the database with which you are working. For example, with SQL Server, coordinating Execute SQL Tasks before and after your data flow with BEGIN TRAN and COMMIT TRAN allows you to use native transactions.
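A minimal sketch of the statements those Execute SQL Tasks would run is shown below. This assumes all of the tasks and the data flow share a single connection manager with its RetainSameConnection property set to True, so that the statements and the data flow run on the same session.

    -- Execute SQL Task placed before the data flow:
    BEGIN TRANSACTION;

    -- ...the data flow runs here, on the same connection...

    -- Execute SQL Task placed after the data flow, on the success path:
    COMMIT TRANSACTION;

    -- A third Execute SQL Task wired to the failure path would instead run:
    -- ROLLBACK TRANSACTION;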

The bigger impact of transactions is rollback time. If you have large volumes in your data flow and hit an error near the end of a large transaction, it may take just as much time (if not more) to roll back your transaction as it did to process the data flow before the error! And, if you set a transaction on the entire package or parent package, this is exacerbated. If you are working with SQL Server, review the Chapter 7 discussion on leveraging Database Snapshots as a transaction mechanism, which can greatly ease the rollback time and requirements for large volumes.

Data Flow Properties

Each Data Flow Task has a series of properties and settings that are important to know and understand so that you can tune your data flows. Figure 12-10 highlights a simple data flow with the Properties window showing.

Figure 12-10: Simple data flow with the Properties window showing

Highlighted in Figure 12-10 are six properties that each Data Flow Task has that can affect performance. Take the following into consideration when you are working with these properties.

Up the EngineThreads Value

Be sure to set the EngineThreads property on your data flow to at least the number of execution trees that the data flow contains. In fact, it may be easier to simply bump the EngineThreads value up to 20 or 30, just to ensure that each execution tree has its own thread. Even if your server only has two or four processors, many process threads can run efficiently in parallel, because each may not be working with the same level of intensity or require code execution for every processor cycle. However, remember that this property is just a hint to the scheduler; there is no danger of creating a set of unused threads by setting this value too high.

Optimize the Temporary Storage Locations

The BufferTempStoragePath and BLOBTempStoragePath properties tell the data flow where to store temporary data on disk if needed. If these are left blank, SSIS uses the system-defined temp path from the server's system properties. If the data flow ever needs to use temporary space, be sure these properties point to a volume or mount point with high I/O throughput. Setting them is especially important when dealing with large binary object (BLOB) data (either flowing through your data flow or through the use of the Import and Export Column transformations), because BLOB data takes up considerably more space than standard numeric or character data types.

Leave RunInOptimizedMode as True

RunInOptimizedMode helps performance by removing columns from execution trees that end in leaf components (such as destinations or other transformations that terminate a pipeline) when a column is not used by the leaf component. It does not remove columns from other execution trees where a column is no longer needed after being used in an earlier transformation. This means you should still apply the principle mentioned earlier: at components with asynchronous outputs, remove the columns that are not used downstream in the data flow.

Tuning Buffers

Two advanced properties of the data flow enable you to tune the size and number of rows that are in the allocated buffers. These settings, the DefaultBufferMaxRows and DefaultBufferSize, apply to all of the different buffers in a data flow, not just a single buffer type for one of the execution trees.

The DefaultBufferMaxRows setting defaults to 10,000, which means that no single buffer can contain more than 10,000 rows. This may not be optimal if your average row width is very small, because the buffers will be small and, in total, there will be many more buffers for the execution trees to manage. In these cases, you should consider revising this number upward and testing performance.

The DefaultBufferSize is also a threshold setting, specifying the number of bytes that a single buffer cannot exceed. If the max row setting multiplied by the width of a single row is greater than this value, then the number of rows used in the buffer is reduced to stay under the limit. For example, if DefaultBufferMaxRows is set to 10,000, DefaultBufferSize is set to 10,485,760, and a single row is 1,200 bytes, then max rows (10,000) times row width (1,200) equals 12,000,000, which is greater than the DefaultBufferSize (10,485,760). So, the pipeline engine scales back the count of rows in the buffers so that this threshold is not exceeded.
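To plug a realistic row width into that calculation, one rough starting point is the average record size of a table that already holds representative data, such as a prior load of the destination table. The sketch below uses the SSISOps.dbo.FullSalesLoad table referenced later in this chapter; keep in mind that the pipeline's row width will differ somewhat, because SSIS buffer column types are not identical to the table's storage types.

    -- Rough average row size (in bytes) at the leaf level of the table, as a starting
    -- point for estimating how many rows fit within DefaultBufferSize.
    SELECT avg_record_size_in_bytes
    FROM   sys.dm_db_index_physical_stats(
               DB_ID(N'SSISOps'),
               OBJECT_ID(N'SSISOps.dbo.FullSalesLoad'),
               NULL, NULL, 'DETAILED')
    WHERE  index_level = 0;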

An excellent treatment on this topic is the Microsoft white paper titled “Integration Services: Performance Tuning Techniques” (www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx). Note that there is no magical formula that determines how to set and balance these properties. It is a matter of testing, because each transformation that is used in a data flow is affected differently by the size and row count in the buffers. Buffer size and row counts do have an effect on batching rows into a destination, which is discussed in the next section.

Optimizing Destinations

To optimize a destination, the first thing you should attempt to understand is the destination’s impact on the package performance. It is not always the case that your data flow is slow because the destination cannot consume the rows fast enough to keep up. At times, it may just be that the data flow cannot pass the rows fast enough to the destination. To understand which case you have, perform the following test:

  1. Run your data flow and capture the execution time.

  2. Remove the destinations and replace them with Row Count transformations. Then, re-run your data flow and capture the execution time.

If the execution times of these two tests are identical (or very close), then your data flow cannot send rows fast enough to the destinations. If the second run of the data flow is a lot faster, then your destinations are impacting your data flow performance, and you should look at optimizing those destinations. Some optimizations for the OLE DB Destination adapter are discussed later in this chapter.

Here are a few other generalities when considering scaling your data flow destinations.

Database Growth

With SQL Server, when the underlying database files reach a certain saturation point, the server pauses operations and grows the files. This is a problem if you or your DBA are not performing file-management operations to ensure that file growth does not happen during peak hours or during ETL operations. Because ETL and other data-processing operations perform bulk inserts, database growth is more likely during these times. If you see your destinations chugging along and then suddenly pause for a long period of time, the cause may be a database growth process (or a locking/blocking condition).

For SQL Server, be sure to set your destination database Auto Shrink setting to False; otherwise, the files will shrink, and then more space will have to be re-allocated, which takes time.
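As a sketch of both points (the database name, logical file name, and size are placeholders; check sys.database_files for your own values), you can disable auto shrink and pre-grow the data file before a large load so that growth does not occur mid-load:

    -- Keep freed space allocated so it does not have to be re-grown during the next load.
    ALTER DATABASE SSISOps SET AUTO_SHRINK OFF;

    -- Optionally pre-grow the data file ahead of a large ETL run.
    ALTER DATABASE SSISOps
    MODIFY FILE (NAME = N'SSISOps_Data', SIZE = 20480MB);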

Consider Dropping and Re-creating Indexes on Destination Tables

When records are inserted into (or updated in) a relational table that has indexes, the engine has to reorganize the underlying indexes before the commit process is complete. This can add significant overhead to your data flow destinations.

To understand the overall impact of the indexes, in your test environment, drop all the indexes on the destination table and rerun the data flow, which will show you the delta of change and impact of the indexes.

A very common optimization technique for bulk operations is to first drop the indexes on the destination, run the data load, and then re-create the indexes. It is more often than not faster to drop and re-create the indexes, rather than to insert the records with the indexes in place and force the relational engine to manage the changes.
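A minimal sketch of the pattern is shown below, with placeholder index and table names; the DROP INDEX statement would typically run in an Execute SQL Task before the data flow, and the CREATE INDEX statement in another Execute SQL Task after it.

    -- Before the load: drop the nonclustered index on the destination table.
    DROP INDEX IX_FullSalesLoad_OrderDate ON dbo.FullSalesLoad;

    -- ...the data flow loads dbo.FullSalesLoad here...

    -- After the load: re-create the index in one pass over the loaded data.
    CREATE NONCLUSTERED INDEX IX_FullSalesLoad_OrderDate
    ON dbo.FullSalesLoad (OrderDate)
    INCLUDE (SalesAmount);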

SQL Server 2005 Partitioned Tables

Although it is possible to insert directly into a partitioned table in SQL Server 2005, doing so comes with relational engine overhead. A good approach for partitioned tables is to have your destination insert into a separate table, and then switch that table into the partitioned table. You could also first switch a partition out of the partitioned table, drop its indexes, load the table, re-create the indexes, and then switch the table back into the partitioned table as a partition. This is the approach that the Project REAL solution takes. The code is available, and the approach is written up in the paper titled "Project REAL: Business Intelligence ETL Design Practices," which is available at www.microsoft.com/technet/prodtechnol/sql/2005/realetldp.mspx.
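The switch itself is a metadata-only operation. A minimal sketch, with placeholder object names and partition number, and assuming the staging table has the same schema, sits on the same filegroup as the target partition, and carries a check constraint bounding the partition's range:

    -- Switch the fully loaded, indexed staging table in as a partition of the fact table.
    ALTER TABLE dbo.FactSales_Stage
    SWITCH TO dbo.FactSales PARTITION 5;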

OLE DB Destination Optimization

The most common destination adapter used to insert data into a relational database is the OLE DB Destination adapter, because an ADO.NET destination adapter does not exist out-of-the-box. This destination adapter has several features that control which table the data is inserted into and how it is inserted. Figure 12-11 shows the OLE DB Destination Editor.

Figure 12-11: OLE DB Destination Editor

The OLE DB Destination settings shown in Figure 12-11 are the default settings when configured to insert into a table. The Data access mode is set to the Table or view option and a table is selected. For the purpose of demonstration, this destination is used in the sample data flow shown in Figure 12-12.

Figure 12-12: Sample data flow

This adapter is configured to land data in a table called FullSalesLoad in a database called SSISOps. The data flow in this example contains a single source file with two million rows, connected directly to the OLE DB Destination adapter referenced earlier. The problem with this destination in its default configuration is that each row is inserted separately into the table. Figure 12-13 shows a SQL Server Profiler trace on the database while data is being loaded through this destination.

Figure 12-13: SQL Server Profiler trace

This Profiler session reveals that the insert statement would have run two million individual times and taken a couple of hours if the package had not been stopped prematurely. Therefore, when dealing with large volumes and scalability, using the Table or view option for the Data access mode is not the right approach to take with the OLE DB Destination.

If you are using SQL Server as a destination, a better approach when configuring the OLE DB Destination adapter is to use the Table or view – fast load option of the Data access mode. This allows the rows to be bulk inserted in a single batch or in multiple batches. Figure 12-14 shows the options available when using the fast load feature.

Figure 12-14: Fast load options

By choosing the fast load support, you are given several more advanced features to configure for the destination. These options align directly with the BULK INSERT statement in SQL Server, the mechanism actually used for the inserts. Following is a summary of the options:

  • Keep identity: By selecting this option, you are able to insert explicit values into an IDENTITY column. This is equivalent to using SET IDENTITY_INSERT in T-SQL.

  • Keep nulls: Checking this option will ignore DEFAULT assignments in the destination table when a NULL value is inserted, which can help performance slightly. However, this setting may adversely affect data integrity if default values were meant to prevent NULL values in certain columns.

  • Table lock: Enabling this will place an exclusive lock on the destination table so that the insert process can be optimized. However, it could cause locking or blocking problems if other processes are trying to update the same table simultaneously.

  • Check constraints: With check constraints enabled, the insert will still go through the process of checking the constraints on the destination table. De-selecting this will increase performance if you are handling the value checks in the data flow. However, the constraints will be marked as not trusted.

  • Rows per batch: This entry simply provides an estimated source row count to help the bulk insert optimize the insert.

  • Maximum insert commit size: The insert commit size drives the size of the batches that are inserted. When set to 0, the entire data set is committed in a single batch; otherwise, this value dictates how many rows are committed to the destination at one time. More details are provided later in this chapter, because committing rows in smaller batches than the entire statement can help optimize your destination inserts.

The bottom-line impact of these settings must be tested in your environment with your data and destination table. However, if you are looking to achieve the best scalability, begin by selecting the Keep nulls and Table lock options, and clearing the Check constraints option. Of course, these choices must be evaluated against the business rules behind your process. If at all possible, handle the constraints and NULL values in your data flow and allow the table to have an exclusive lock. The Maximum insert commit size setting can also have a profound impact on your inserts, and is discussed next.
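Because these options align with bulk-load hints, it can help to see roughly what a typical fast load configuration corresponds to in BULK INSERT terms. The statement below is only an illustration (the table, file path, and values are placeholders, and file-format options are omitted); the destination adapter issues the bulk load itself rather than running this statement.

    -- Illustrative BULK INSERT showing roughly what the fast load options map to.
    BULK INSERT dbo.FullSalesLoad
    FROM 'C:\Loads\FullSales.txt'
    WITH (
        KEEPIDENTITY,              -- Keep identity
        KEEPNULLS,                 -- Keep nulls
        TABLOCK,                   -- Table lock
        CHECK_CONSTRAINTS,         -- Check constraints (omit to skip constraint checking)
        ROWS_PER_BATCH = 2000000,  -- Rows per batch (estimated source row count)
        BATCHSIZE = 50000          -- Maximum insert commit size
    );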

Maximum Insert Commit Size

With large data volume processing in the multi-millions of records, you will be able to control the insert transaction batch sizes with the Maximum insert commit size setting (referenced in the prior section). Here are some considerations to make when setting and testing various values for optimization:

  • Any insert failures within a batch commit group will cause that entire batch to be unsuccessfully inserted.

  • Leaving the max commit size at 0 will require the entire data set to be committed in one batch. This has a couple of drawbacks. First of all, if there is a failure, the entire data set will be rolled back and no rows will be committed. Secondly, if you have indexes and foreign key relationships, the commit time can be very lengthy for a single large-volume insert, because the database must reorganize the indexes and check the foreign keys.

  • A setting greater than 0 will not necessarily mean that the engine will batch that exact number of rows. The commit size will be the lesser of the number of rows in this setting or the number of rows in the data flow being inserted. For example, if you set the max commit size to 50,000 but your pipeline buffers only contain 9,000 rows, then only 9,000 rows will be committed at a time. Similarly, if you set the commit size to 7,000 and you have 9,000 rows in your buffers, the first batch for every buffer will be 7,000 rows and the second will only be 2,000 rows.

  • For every commit batch, the database engine will perform the index reorganization and check any foreign keys, which spreads out work that would otherwise hit all at once and reduces the overall performance impact.

  • If your destination table doesn’t have indexes, foreign keys, or constraints, the commit time will be very rapid and negligible no matter what you set the commit size to be. The only difference is that a 0 setting will be an all-or-nothing insert.

Consider these two situations when evaluating this setting.

  1. If you set the max commit size, but do not want to lose all the rows in a batch because one row fails, you can redirect the failed batch through the error output to a second OLE DB Destination that inserts the data one row at a time. Figure 12-15 shows how this would look.

    Figure 12-15: Redirecting the batch and then doing a second OLE DB Destination

  2. To achieve much larger commit batches, you will need to modify the DefaultBufferMaxRows and DefaultBufferSize settings, which were reviewed at a high level earlier in this chapter. Simple data flows with limited row widths can benefit greatly from increasing the number of rows in a data flow buffer, which allows you to set the insert commit size higher.


