An Approach to Data Flow Optimization | Microsoft SQL Server 2005 Integration Services

The preceding section presents a pretty long list of methods to improve the performance of your data flows and addresses a large percentage of the data flow performance problems you'll run into. However, if after applying all those suggestions, you still need to squeeze some performance from your data flow, you might need to step back and take a more holistic approach to analyzing your packages. This section presents a disciplined and somewhat theoretical approach you can use to evaluate and incrementally optimize your data flows.

The Theory of Constraints

One of the more impressive books I read in college was The Goal by Eliyahu M. Goldratt. For me, it was an epiphany. In The Goal, Dr. Goldratt deftly articulates the kinds of problems businesses face when continually improving their processes and trying to get the most from their resources. Although the book was about how to improve the throughput of a manufacturing plant, the underlying concept is about optimizing processes, identifying resources and constraints, and consistently and significantly increasing the overall performance of systems. Sounds an awful lot like a process we could use for optimizing data flows, doesn't it?

Note

For more information regarding the Theory of Constraints and its Thinking Processes, or The Goldratt Institute's products and services, visit www.goldratt.com.

The Five Steps of the Theory of Constraints

Dr. Goldratt has codified his process of improvement into five steps, as follows:

Identify the constraint.
Decide how to exploit the constraint.
Subordinate and synchronize everything else to the preceding decisions.
Elevate the performance of the constraint.
If in any of the previous steps, the constraint has shifted, go back to Step 1.

Let's apply the Theory of Constraints to optimizing data flow.

Identifying and Prioritizing Resource Constraints

The first step is to identify the constraints, which are elements of a system that limit the overall system's performance. In this step, you should focus on educating yourself about your computing environment. What are your resources and in what ways are those resources constrained? For a simple example, evaluate your machine memory. How much is installed? Open Windows Task Manager and look at the Performance tab. How much total physical memory is available? A common mistake people make is to forget how much memory other processes on a machine consume. They might have 2GB of RAM, but only have 500MB available because they are running SQL Server and a plethora of other processes on the machine. If a package running on the machine must sort 1GB of data, even though they have 2GB of memory, the data flow will likely spool (swap buffers) to disk and the performance will suffer because only 500MB is available to SSIS. Try to understand your system's resources and how they are or potentially can be constrained.

The second part of this step is to prioritize the constraints in order of importance. Prioritizing constraints is important because, although a given constraint might exist on all your machines, it might not be significant enough to factor into your optimization for a given package. For example, recently a customer asked how to optimize a package's startup time. Their system was not concerned with throughput, or even data flow processing speed, as the loads were very small. Rather, they were interested in how to speed up the startup time of packages because they were running them on demand as part of a reporting system and the load time for the package constituted a large percentage of the overall report load time. Although memory might be a high-priority constraint for packages processing very large datasets, for these particular packages, memory was a low-priority constraint. However, task types, hard drive speed, configuration type, and package size were high-priority constraints. Understand the priorities of the constraints.

Some Common Constraints

The following is a partial list of constraints of which you should be aware. There are more, no doubt, but this list is a good start.

Amount of memory installed This is perhaps the most important factor for processing complex data.
Disk IO speed Using slow IDE or optimized RAID drives? This is important for big data throughput that involves hard drives.
Available memory Changes based on other processes should be monitored and understood. For example, how much memory SQL Server is using.
CPU cache This can increase processing for certain transforms that do repetitive, same-memory access, such as the Lookup transform.
Query performance Queries should be optimized for speed.
Package size Larger packages take longer to load.
CPU speed The raw processing power is important for processing throughput.
Amount of data to process Generally speaking, the more data, the longer the processing time.
Complexity of data flow More complex data flow can take longer and more memory, especially if it involves multiple blocking transforms.
Number of CPUs More CPUs widen the processing pipe.
Server availability Server availability impacts connection times and query performance.
Logging Although quite efficient, logging can impact performance when verbose.
Other software running on the same machine Other software reduces available memory, reads and writes to disk, and occupies the bus.
Order of data Sorted data might process faster for data flows with certain transforms, such as Merge and Merge Join.
Event handlers Event handlers are useful, but can be expensive if overused.
Indexes Table access is improved; joins and searching speed improves.
Designer environment The Business Intelligence Development Studio is a full-featured development platform, but does not function well as a package execution platform for production or performance testing purposes.
Bus speed Bus speed impacts memory and disk access speeds.
Network Speed Impacts cross network loads, extracts, and queries.

Most Common High-Priority Constraints

Having listed all these constraints, it's important to stress that a handful will be the most common high-priority constraints, in no particular order:

Memory
CPU speed
Number of CPUs
Hard drive speed
Network speed

How to Identify and Prioritize Constraints

To identify constraints, you need to witness the effect of the constraints. The priorities naturally fall out of the identification process. The most drastic decreases in performance are caused by the highest-priority constraints. To witness the effect of constraints, you need a baseline against which you can measure decreases in performance.

Get a Baseline

To understand where a given package spends its processing time, it's necessary to decompose the package into its processing time constituent parts. To do that, you must decompose the package and get a baseline measurement of its core processing time, then measure each incremental processing time increase as more components and other package parts are added to the core package. It is very important to isolate what you are measuring so that when you measure deltas, you know what causes them. Start by decomposing the package and eliminating unmeasured constraints.

Eliminate Unmeasured Constraints

To ensure that you have the most control over the system, try to get total control over everything happening on the machine. Eliminate as much outside influence as you can.

Shut down other applications that might consume memory or impact other resources Applications can do unexpected things, such as read from disk at unexpected times or launch a reminder message box.
Make sure you have enough memory for package execution If you swap to disk, it ruins your measurement.
Disconnect from networks, if possible If you're accessing flat files across the network, copy them locally to eliminate the network element. Later, you can add it back to measure it.
Shut down SQL Server, if possible If the package connects to SQL Server, replace the destination or source with a rowcount, trash, or Data Generation transform.

Tip

One of the biggest unmeasured constraints that is too often ignored is the Business Intelligence Development Studio. Don't use the designer for measuring performance. If you do, you're measuring the designer performance, not the package performance. Execute packages in DTExec.exe. It is ideal for measuring package performance because it is very lightweight and generally resembles how packages will be executed in production.

Build a Baseline

The following are some typical baseline measurements. You should pick one with which you are comfortable. If you want to measure source performance, you should choose the first one. If you're more interested in the destination performance, choose the second one, and so forth.

Source Multicast
Source Destination
Source Transforms Multicast

Note

For measuring an existing package, try to deconstruct it into simple parts that you can measure using this approach.

Tip

Because the Multicast does almost nothing, if there are no outputs, using it as a replacement for destinations is a great way to measure the performance impact of destinations or transforms. However, be sure that RunInOptimized mode is turned off. Otherwise, the Execution Engine trims the execution trees.

Note the differences in performance between the different baselines. After you've done that, you'll have a good idea where some of your bottlenecks are. For example, when you replace the Multicast with a real destination, the execution time might increase, even drastically. But, when you add in transforms, you might see very little performance degradation. This is what is meant by witnessing or measuring the impact of changes.

Note

What's the difference between a constraint and a bottleneck? Constraints always exist, but they might not be a factor in the performance of the system because they are hidden by bottlenecks or they are low-priority constraints. Bottlenecks are constraints that are actually limiting the overall system's performance in some way. Remove a bottleneck, and you'll expose another oneturning a previously hidden constraint into a bottleneck.

Add Elements Back to the Package and Measure Performance Impact

After you've established these baseline measurements, you can start incrementally adding previously removed elements to measure the impact each has on the overall system. Make sure you only add them back and measure them one by one, otherwise you won't know which element is causing the impact you measure. Continue this "Set, Test, and Measure" process until you have reconstructed the package and the environment. Also, don't do this in a straight line. Try adding components in a different order and in different combinations.

Measure with Performance Counters

Performance counters are useful for 'measuring internal or otherwise hidden processes. Integration Services provides the following useful performance counters:

Buffers in Use
Flat Buffers in Use
Private Buffers in Use
Rows Read
Rows Written
Buffers Spooled

The most useful of these for performance measurements is Buffers Spooled. Values greater than zero indicate that the Execution Engine has swapped buffers to disk. You should always try to avoid swapping out buffers to disk, which is usually an indication that you don't have enough memory or that you need to rearrange your data flow.

Rows Read and Rows Written help gauge the overall progress of the data flow and can be useful as sanity checks. For example, as you increase the number of rows a data flow processes, you would expect a commensurate increase in processing time. These measurements help' you quickly determine if the increased processing times are warranted.

Deciding How to Exploit the Constraint

This step is about making sure the bottleneck is never overutilized. Applying this to data flow, always try to match the data load at a transform's input with its throughput capacity. The Data Flow Task already attempts to do this using back pressure, but that only temporarily masks the problem. It doesn't improve throughput, it just slows down the data flow. This is a tough one to get right because throughput and processing loads at an input aren't constant, so you have to rely on the Data Flow Task for the most part.

Subordinating and Synchronizing Everything Else to the Preceding Decisions

After you've identified the bottleneck, and you've fully exploited it as described in the preceding step, you should apply all the other nonconstrained resources to breaking the bottleneck. If you've got resources that are not fully utilized in your environment and that are waiting around because of a bottleneck in your package, you're wasting resources. Try to get those underutilized resources to take on more of the work that the bottleneck is performing. You can decrease the data loads flowing into the bottleneck or ease the bottleneck processing burden using the following techniques:

Filtering the data before it flows into the bottleneck transform, for example, prebuilding a Lookup transform dimension reference table from the new facts so that only those dimensions that are required end up in the reference table.
Ignoring errors and using a conditional split to filter them, for example, when doing upserts, testing shows that it is generally faster to ignore errors from the Lookup transform and conditionally split the not-found rows into an insert than to try and handle the not-found rows with an error output.
Preprocessing the data the constrained transform will process, for example, placing a Lookup transform before the Slowly Changing Dimension transform to filter out nonexistent rows.
Postprocessing the data in lieu of a transformation, for example, the OLEDB Command transform performs row-by-row expensive operations, which you can optimize by eliminating the transform and flowing the data to a table. Later, use SQL Server to perform the set-based operations instead.
Cloning and parallelizing the data flow around the bottleneck. For example, if the bottleneck is a transform, segment the rows with a Conditional Split transform into clones of the bottleneck, then union all the rows back into the next transform.

In other words, off-load work from your overutilized bottleneck resources and load it onto the laggard, underutilized resources. Optimizing nonbottleneck resources doesn't help much if they're all constrained by the bottleneck.

Elevating the Bottlenecks

As you continue to increase the performance of the bottleneck by off-loading work from it and placing more of the load on other less-utilized resources, the system becomes more balanced and the bottleneck is eventually broken.

If in any of the Previous Steps the Constraint Has Shifted, Go Back to Step 1

When you take these steps, what you'll find is that you're in an iterative improvement loop. After you eliminate one bottleneck, you'll find others. So, you start the process again until you reach your performance goal.