An Approach to Data Flow Optimization
The preceding section presents a pretty long list of suggestions
to improve the performance of your data flows and
addresses a large percentage of the data flow performance problems
you'll run into. However, if after applying all those suggestions
you still need to squeeze some performance from your data flow, you
might need to step back and take a more holistic approach to
analyzing your packages. This section presents a disciplined and
somewhat theoretical approach you can use to evaluate and
incrementally optimize your data flows.
The Theory of Constraints
One of the more impressive books I have read is The Goal by Eliyahu M.
Goldratt. For me, it was an epiphany. In The Goal, Dr. Goldratt deftly articulates the
kinds of problems businesses face when continually improving their
processes and trying to get the most from their resources. Although
the book was about how to improve the throughput of a manufacturing
plant, the underlying concept is about optimizing processes,
identifying resources and constraints, and consistently and
significantly increasing the overall performance of systems. Sounds
a lot like a process we could use for optimizing data flows, doesn't it?
For more information regarding the Theory of
Constraints and its Thinking Processes, or The Goldratt Institute's
products and services, visit www.goldratt.com.
The Five Steps of the Theory of Constraints
Dr. Goldratt has codified his process of
improvement into five steps, as follows:

1. Identify the system's constraints.
2. Decide how to exploit the constraints.
3. Subordinate and synchronize everything else to the preceding decisions.
4. Elevate the performance of the constraint.
5. If, in any of the previous steps, the constraint has shifted, go back to Step 1.
Let's apply the Theory of Constraints to
optimizing data flow.
Prioritizing Resource Constraints
The first step is to identify the constraints,
which are elements of a system that limit the overall system's
performance. In this step, you should focus on learning
about your computing environment. What are your resources, and in
what ways are those resources constrained? For a simple example,
evaluate your machine's memory. How much is installed? Open Windows
Task Manager and look at the Performance tab. How much total
physical memory is available? A common mistake people make is to
forget how much memory other processes on a machine consume. They
might have 2GB of RAM, but only have 500MB available because they
are running SQL Server and a handful of other processes on the same
machine. If a package running on the machine must process 1GB of data,
even though they have 2GB of memory, the data flow will likely
spool (swap buffers) to disk and the performance will suffer
because only 500MB is available to SSIS. Try to understand your
system's resources and how they are or are not constrained.
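For example, if SQL Server shares the machine with your packages, you can keep it from claiming memory that the data flow needs by capping its max server memory setting. The following T-SQL is a minimal sketch; the 1536MB cap is purely illustrative, so size it for your own hardware and workload.

    -- Show the current cap on SQL Server's memory usage.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)';

    -- Cap SQL Server so the SSIS data flow has headroom.
    -- 1536MB is an illustrative value, not a recommendation.
    EXEC sp_configure 'max server memory (MB)', 1536;
    RECONFIGURE;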
The second part of this step is to prioritize
the constraints in order of importance. Prioritizing constraints is
important because, although a given constraint might exist on all
your machines, it might not be significant enough to factor into
your optimization for a given package. For example, recently a
customer asked how to optimize a package's startup time. Their
system was not concerned with throughput, or even data flow
processing speed, as the loads were very small. Rather, they were
interested in how to speed up the startup time of packages because
they were running them on demand as part of a reporting system and
the load time for the package constituted a large percentage of the
overall report load time. Although memory might be a high-priority
constraint for packages processing very large datasets, for these
particular packages, memory was a low-priority constraint. However,
task types, hard drive speed, configuration type, and package size
were high-priority constraints. Understand the priorities of the
constraints for your own packages and environment.
The following is a partial list of constraints
of which you should be aware. There are more, no doubt, but this
list is a good start.
Amount of memory
This is perhaps the most important factor for processing complex data.

Disk IO speed
Are you using slow IDE drives or optimized RAID arrays? This is important for big data throughput that involves hard disks.

Available memory
This changes based on other processes and should be monitored; for example, note how much memory SQL Server is consuming.

CPU cache
A larger cache can speed up certain transforms that do repetitive, same-memory access, such as the Lookup transform.

Query performance
Queries should be optimized for speed.

Package size
Larger packages take longer to load.

CPU speed
The raw processing power is important for processing throughput.

Amount of data to process
Generally speaking, the more data, the longer the processing time.

Complexity of the data flow
More complex data flows can take longer and require more memory, especially if they involve multiple blocking transforms.

Number of CPUs
More CPUs widen the processing pipe.

Server availability
Impacts connection times and query performance.

Logging
Although quite efficient, logging can impact performance when running on the same machine; it consumes memory, reads from and writes to disk, and occupies the bus.

Order of data
Sorted data might process faster for data flows with certain transforms, such as Merge and Merge Join.

Event handlers
Event handlers are useful, but they can be expensive if overused.

Indexes
Indexed access is faster; joins and searches speed up.

The designer
The Business Intelligence Development Studio is a development platform, but it does not function well as a package execution platform for production or performance testing.

Bus speed
Bus speed impacts memory and disk access speeds.

Network speed
Impacts cross-network loads and queries.
Having listed all these constraints, it's
important to stress that a handful of them are the most common
high-priority constraints, in no particular order:

Amount of memory
Number of CPUs
Hard drive speed
How to Identify and Prioritize Constraints
To identify constraints, you need to witness the
effect of the constraints. The priorities naturally fall out of the
identification process. The most drastic decreases in performance
are caused by the highest-priority constraints. To witness the
effect of constraints, you need a baseline against which you can
measure decreases in performance.
To understand where a given package spends its
processing time, it's necessary to decompose the package into its
parts. To do that, you must strip down
the package and get a baseline measurement of its core processing
time, then measure each incremental processing-time increase as
components and other package elements are added back to the
package. It is very important to isolate what you are measuring so
that when you measure deltas, you know what causes them. Start by
simplifying the package and eliminating unmeasured constraints.
To ensure that you have the most control over
the system, try to get total control over everything happening on
the machine. Eliminate as much outside influence as you can.
Shut down other applications
Applications that might consume memory or impact other resources should be closed. Applications can do unexpected things, such as read from disk at unexpected times or launch a reminder message box.

Make sure you have enough memory for package execution
If the data flow swaps buffers to disk, it ruins your measurement.

Eliminate networks, if possible
If you're accessing flat files across the network, copy them locally to eliminate the network element. Later, you can add it back to measure it.

Shut down SQL Server, if possible
If the package connects to SQL Server, replace the destination or source with a Row Count transform, a Trash destination, or a data generation source.
One of the biggest unmeasured constraints that
is too often ignored is the Business Intelligence Development
Studio. Don't use the designer for measuring performance. If you
do, you're measuring the designer performance, not the package
performance. Execute packages with DTExec, the command-line
package execution utility, instead. It is ideal
for measuring package performance because it is very lightweight
and generally resembles how packages will be executed in production.
The following are some typical baseline
measurements. You should pick one with which you are comfortable.
If you want to measure source performance, you should choose the
first one. If you're more interested in the destination
performance, choose the second one, and so forth.
For measuring an existing package, try to
deconstruct it into simple parts that you can measure using this
approach. Because the Multicast does almost nothing if
there are no outputs, using it as a replacement for destinations is
a great way to measure the performance impact of destinations or
transforms. However, be sure that RunInOptimizedMode is turned
off; otherwise, the Execution Engine trims the execution trees and
can remove the unused Multicast, which skews the measurement.
Note the differences in performance between the
different baselines. After you've done that, you'll have a good
idea where some of your bottlenecks are. For example, when you
replace the Multicast with a real destination, the execution time
might increase, even drastically. But, when you add in transforms,
you might see very little performance degradation. This is what is
meant by witnessing or measuring the impact of changes.
What's the difference between a constraint and a
bottleneck? Constraints always exist, but they might not be a
factor in the performance of the system because they are hidden by
bottlenecks or they are low-priority constraints. Bottlenecks are
constraints that are actually limiting the overall system's
performance in some way. Remove a bottleneck, and you'll expose
another one, turning a previously hidden constraint into a bottleneck.
Add Elements Back to the Package and Measure Performance Impact
After you've established these baseline
measurements, you can start incrementally adding previously removed
elements to measure the impact each has on the overall system. Make
sure you only add them back and measure them one at a time; otherwise,
you won't know which element is responsible for
the impact you measure.
Continue this "Set, Test, and Measure" process until you have
reconstructed the package and the environment. Also, don't do this
in a straight line. Try adding components in a different order and
in different combinations.
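If you've enabled SSIS logging to SQL Server while you run these tests, you can compute the deltas between runs directly from the log table. The following query is a minimal sketch that assumes the SQL Server log provider is writing to its default dbo.sysdtslog90 table (SQL Server 2005) and that the PackageStart and PackageEnd events are selected for logging. Keep in mind that logging itself is one of the constraints listed earlier, so keep it configured the same way across the runs you compare.

    -- Duration of each package execution, one row per run.
    SELECT source                                          AS package_name,
           executionid,
           MIN(starttime)                                  AS run_start,
           MAX(endtime)                                    AS run_end,
           DATEDIFF(second, MIN(starttime), MAX(endtime))  AS duration_seconds
    FROM   dbo.sysdtslog90
    WHERE  event IN ('PackageStart', 'PackageEnd')
    GROUP BY source, executionid
    ORDER BY MIN(starttime);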
Performance counters are useful for measuring
internal or otherwise hidden processes. Integration Services
provides the following useful performance counters:

Buffers in Use
Flat Buffers in Use
Private Buffers in Use
Buffers Spooled
Rows Read
Rows Written
The most useful of these for performance
measurement is Buffers Spooled. Values greater than zero indicate
that the Execution Engine has
swapped buffers to disk. You should always try to avoid swapping
out buffers to disk, which is usually an indication that you don't
have enough memory or that you need to rearrange your data flow.
Rows Read and Rows Written help gauge the
overall progress of the data flow and can be useful as sanity
checks. For example, as you increase the number of rows a data flow
processes, you would expect a commensurate increase in processing
time. These measurements help you quickly determine whether the
increased processing times are warranted.
Deciding How to Exploit the Bottleneck
This step is about making sure the bottleneck is
never overutilized. Applying this to data flow, always try to match
the data load at a transform's input with its throughput capacity.
The Data Flow Task already attempts to do this using back pressure,
but that only temporarily masks the problem. It doesn't improve
throughput, it just slows down the data flow. This is a tough one
to get right because throughput and processing loads at an input
aren't constant, so you have to rely on the Data Flow Task for the most part.
Subordinating and Synchronizing Everything Else to the Preceding Decisions
After you've identified the bottleneck, and
you've fully exploited it as described in the preceding step, you
should apply all the other nonconstrained resources to breaking the
bottleneck. If you've got resources that are not fully utilized in
your environment and that are waiting around because of a
bottleneck in your package, you're wasting resources. Try to get
those underutilized resources to take on more of the work that the
bottleneck is performing. You can decrease the data load flowing
into the bottleneck or ease the bottleneck's processing burden with
the following techniques:
Filtering the data before it flows into the
bottleneck transform; for example, prebuilding a Lookup transform's
dimension reference table from the new facts so that only those
dimensions that are required end up in the reference table (a minimal
SQL sketch follows this list).
Ignoring errors and using a conditional split to
filter them, for example, when doing upserts, testing shows that it
is generally faster to ignore errors from the Lookup transform and
conditionally split the not-found rows into an insert than to try
to handle the not-found rows with an error output.
Preprocessing the data the constrained transform
will process, for example, placing a Lookup transform before the
Slowly Changing Dimension transform to filter out nonexistent rows.
Postprocessing the data in lieu of a
transformation; for example, the OLE DB Command transform performs
expensive row-by-row operations, which you can optimize by
eliminating the transform and flowing the data to a table. Later,
use SQL Server to perform the set-based operations instead (a
set-based sketch appears below).
Cloning and parallelizing the data flow around
the bottleneck; for example, if the bottleneck is a transform,
segment the rows with a Conditional Split transform into multiple
parallel copies of the bottleneck transform, and then union all the
rows back into the original flow with a Union All transform.
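As an illustration of the filtering technique above, the trimming can often be done in the Lookup transform's reference query itself. The following is a minimal T-SQL sketch; DimCustomer, StageFact, and CustomerBusinessKey are hypothetical names standing in for your own schema.

    -- Hypothetical schema: cache only the dimension rows whose business keys
    -- actually appear in the incoming fact rows, so the Lookup transform's
    -- reference table stays small.
    SELECT d.CustomerKey,
           d.CustomerBusinessKey
    FROM   dbo.DimCustomer AS d
    WHERE  d.CustomerBusinessKey IN
           (SELECT DISTINCT s.CustomerBusinessKey
            FROM   dbo.StageFact AS s);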
In other words, off-load work from your
overutilized bottleneck resources and load it onto the laggard,
underutilized resources. Optimizing nonbottleneck resources doesn't
help much if they're all constrained by the bottleneck.
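To illustrate the postprocessing technique, instead of updating rows one at a time with the OLE DB Command transform, you can land the changed rows in a staging table with a fast destination and let SQL Server apply them in a single set-based statement afterward. The staging and dimension table names below are hypothetical.

    -- Hypothetical pattern: the data flow bulk-loads changed rows into
    -- dbo.StageCustomerUpdates; one set-based UPDATE then replaces the
    -- row-by-row OLE DB Command transform.
    UPDATE d
    SET    d.CustomerName = s.CustomerName,
           d.Phone        = s.Phone
    FROM   dbo.DimCustomer AS d
    JOIN   dbo.StageCustomerUpdates AS s
           ON s.CustomerBusinessKey = d.CustomerBusinessKey;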
As you continue to increase the performance of
the bottleneck by off-loading work from it and placing more of the
load on other less-utilized resources, the system becomes more
balanced and the bottleneck is eventually broken.
If in Any of the Previous Steps the Constraint Has Shifted, Go Back to Step 1
When you take these steps, what you'll find is
that you're in an iterative improvement loop. After you eliminate
one bottleneck, you'll find others. So, you start the process again
until you reach your performance goal.