An Approach to
Data Flow Optimization
The
preceding
section presents a pretty long
list of
methods
to improve the performance of your data flows and
addresses a large percentage of the data flow performance problems
you'll run into. However, if after applying all those suggestions,
you still need to squeeze some performance from your data flow, you
might need to step back and take a more holistic approach to
analyzing your packages. This section
presents
a disciplined and
somewhat theoretical approach you can use to evaluate and
incrementally optimize your data flows.
The Theory of
Constraints
One of the more impressive books I read in
college was
The Goal
by Eliyahu M.
Goldratt. For me, it was an epiphany. In
The Goal
, Dr. Goldratt deftly articulates the
kinds of problems businesses face when continually improving their
processes and trying to get the most from their resources. Although
the book was about how to improve the throughput of a manufacturing
plant, the underlying concept is about optimizing processes,
identifying resources and constraints, and consistently and
significantly increasing the overall performance of systems. Sounds
an
awful
lot like a process we could use for optimizing data flows,
doesn't it?
Note
For more information regarding the Theory of
Constraints and its Thinking Processes, or The Goldratt Institute's
products and services, visit www.goldratt.com.
The Five Steps of
the Theory of Constraints
Dr. Goldratt has codified his process of
improvement into five steps, as
follows
:
-
Identify
the
constraint.
-
Decide how to
exploit
the constraint.
-
Subordinate and
synchronize
everything else to the preceding decisions.
-
Elevate
the
performance of the constraint.
-
If in any of the previous steps, the constraint
has shifted,
go back to Step
1
.
Let's apply the Theory of Constraints to
optimizing data flow.
Identifying and
Prioritizing Resource Constraints
The first step is to identify the constraints,
which are elements of a system that limit the overall system's
performance. In this step, you should focus on
educating
yourself
about your computing environment. What are your resources and in
what ways are those resources constrained? For a simple example,
evaluate your machine memory. How much is installed? Open Windows
Task Manager and look at the Performance tab. How much total
physical memory is available? A common mistake people make is to
forget how much memory other processes on a machine
consume
. They
might have 2GB of RAM, but only have 500MB available because they
are running SQL Server and a
plethora
of other processes on the
machine. If a package running on the machine must
sort
1GB of data,
even though they have 2GB of memory, the data flow will likely
spool (swap buffers) to disk and the performance will suffer
because only 500MB is available to SSIS. Try to understand your
system's resources and how they are or
potentially
can be
constrained.
The second part of this step is to prioritize
the constraints in order of importance. Prioritizing constraints is
important because, although a given constraint might exist on all
your machines, it might not be significant enough to factor into
your optimization for a given package. For example, recently a
customer asked how to optimize a package's startup time. Their
system was not
concerned
with throughput, or even data flow
processing speed, as the loads were very small. Rather, they were
interested in how to speed up the startup time of packages because
they were running them on demand as part of a reporting system and
the load time for the package constituted a large percentage of the
overall report load time. Although memory might be a high-priority
constraint for packages processing very large datasets, for these
particular packages, memory was a low-priority constraint. However,
task types, hard drive speed, configuration type, and package
size
were high-priority constraints. Understand the priorities of the
constraints.
Some Common
Constraints
The following is a partial list of constraints
of which you should be aware. There are more, no doubt, but this
list is a good start.
-
Amount of memory
installed
This is perhaps the most important factor for
processing complex data.
-
Disk IO speed
Using slow IDE or optimized RAID drives? This is important for big
data throughput that involves hard
drives
.
-
Available
memory
Changes based on other processes should be
monitored
and
understood
. For example, how much memory SQL Server is
using.
-
CPU cache
This can increase processing for certain transforms that do
repetitive, same-memory access, such as the Lookup transform.
-
Query
performance
Queries should be optimized for speed.
-
Package size
Larger packages take longer to load.
-
CPU speed
The
raw processing power is important for processing throughput.
-
Amount of data to
process
Generally
speaking, the more data, the longer the
processing time.
-
Complexity of data
flow
More complex data flow can take longer and more memory,
especially
if it involves multiple blocking transforms.
-
Number of
CPUs
More CPUs widen the processing pipe.
-
Server
availability
Server availability impacts connection times
and query performance.
-
Logging
Although quite efficient, logging can impact performance when
verbose.
-
Other software
running on the same machine
Other software
reduces
available
memory, reads and
writes
to disk, and occupies the bus.
-
Order of data
Sorted data might process faster for data flows with certain
transforms, such as Merge and Merge Join.
-
Event
handlers
Event handlers are useful, but can be expensive if
overused
.
-
Indexes
Table
access is improved; joins and searching speed
improves
.
-
Designer
environment
The Business Intelligence Development Studio is
a
full-featured
development platform, but does not function well as
a package execution platform for production or performance testing
purposes.
-
Bus speed
Bus
speed impacts memory and disk access speeds.
-
Network Speed
Impacts cross network loads,
extracts
, and queries.
Most Common
High-Priority Constraints
Having listed all these constraints, it's
important to stress that a handful will be the most common
high-priority constraints, in no particular order:
-
Memory
-
CPU speed
-
Number of CPUs
-
Hard drive speed
-
Network speed
How to Identify
and Prioritize Constraints
To identify constraints, you need to witness the
effect of the constraints. The priorities naturally fall out of the
identification process. The most drastic decreases in performance
are caused by the highest-priority constraints. To
witness
the
effect of constraints, you need a baseline against which you can
measure decreases in performance.
Get a
Baseline
To understand where a given package spends its
processing time, it's necessary to decompose the package into its
processing time
constituent
parts. To do that, you must decompose
the package and get a baseline measurement of its core processing
time, then measure each incremental processing time increase as
more
components
and other package
parts
are added to the
core
package. It is very important to isolate what you are measuring so
that when you measure deltas, you know what causes them. Start by
decomposing
the package and eliminating unmeasured constraints.
Eliminate
Unmeasured Constraints
To ensure that you have the most control over
the system, try to get total control over everything happening on
the machine. Eliminate as much outside influence as you can.
-
Shut down other
applications that might consume memory or impact other
resources
Applications can do unexpected things, such as
read from disk at unexpected times or launch a reminder message
box.
-
Make sure you have
enough memory for package execution
If you swap to disk, it
ruins your measurement.
-
Disconnect from
networks, if possible
If you're accessing flat files across
the network, copy them locally to eliminate the network element.
Later, you can add it back to measure it.
-
Shut down SQL
Server, if possible
If the package connects to SQL Server,
replace the destination or source with a rowcount, trash, or Data
Generation transform.
Tip
One of the biggest unmeasured constraints that
is too often ignored is the Business Intelligence Development
Studio. Don't use the designer for measuring performance. If you
do, you're measuring the designer performance, not the package
performance. Execute packages in
DTExec.exe
. It is ideal
for measuring package performance because it is very lightweight
and generally resembles how packages will be executed in
production.
Build a
Baseline
The following are some typical baseline
measurements. You should pick one with which you are comfortable.
If you want to measure source performance, you should choose the
first one. If you're more interested in the destination
performance, choose the second one, and so forth.
Note
For measuring an existing package, try to
deconstruct it into simple parts that you can measure using this
approach.
Tip
Because the Multicast does almost nothing, if
there are no outputs, using it as a replacement for destinations is
a great way to measure the performance impact of destinations or
transforms. However, be sure that RunInOptimized mode is turned
off.
Otherwise
, the Execution Engine trims the execution trees.
Note the differences in performance between the
different baselines. After you've done that, you'll have a good
idea where some of your bottlenecks are. For example, when you
replace the Multicast with a real destination, the execution time
might increase, even drastically. But, when you add in transforms,
you might see very little performance degradation. This is what is
meant
by witnessing or measuring the impact of changes.
Note
What's the difference between a constraint and a
bottleneck? Constraints always exist, but they might not be a
factor in the performance of the system because they are hidden by
bottlenecks or they are low-priority constraints. Bottlenecks are
constraints that are actually limiting the overall system's
performance in some way. Remove a bottleneck, and you'll expose
another oneturning a previously hidden constraint into a
bottleneck.
Add Elements Back
to the Package and Measure Performance Impact
After you've established these baseline
measurements, you can start incrementally adding previously removed
elements to measure the impact each has on the overall system. Make
sure you only add them back and measure them one by one, otherwise
you won't know which element is
causing
the impact you measure.
Continue this "Set, Test, and Measure" process until you have
reconstructed the package and the environment. Also, don't do this
in a straight line. Try adding components in a different order and
in different combinations.
Measure with
Performance Counters
Performance counters are useful for 'measuring
internal or otherwise hidden processes. Integration Services
provides the following useful performance counters:
-
Buffers in Use
-
Flat Buffers in Use
-
Private Buffers in Use
-
Rows Read
-
Rows Written
-
Buffers Spooled
The most useful of these for performance
measurements is
Buffers Spooled
.
Values greater than zero
indicate
that the Execution Engine has
swapped buffers to disk. You should always try to avoid swapping
out buffers to disk, which is usually an indication that you don't
have enough memory or that you need to rearrange your data
flow.
Rows Read
and
Rows Written
help gauge the
overall progress of the data flow and can be useful as
sanity
checks. For example, as you increase the number of rows a data flow
processes, you would expect a commensurate increase in processing
time. These measurements help' you quickly determine if the
increased processing times are warranted.
Deciding How to
Exploit
the Constraint
This step is about making sure the bottleneck is
never overutilized. Applying this to data flow, always try to match
the data load at a transform's input with its throughput capacity.
The Data Flow Task already attempts to do this using back pressure,
but that only temporarily masks the problem. It doesn't improve
throughput, it just slows down the data flow. This is a tough one
to get right because throughput and processing loads at an input
aren't constant, so you have to rely on the Data Flow Task for the
most part.
Subordinating and Synchronizing
Everything
Else to the Preceding Decisions
After you've identified the bottleneck, and
you've fully exploited it as described in the preceding step, you
should apply all the other nonconstrained resources to breaking the
bottleneck. If you've got resources that are not fully utilized in
your environment and that are waiting around because of a
bottleneck in your package, you're wasting resources. Try to get
those
underutilized
resources to take on more of the work that the
bottleneck is performing. You can decrease the data loads
flowing
into the bottleneck or ease the bottleneck processing
burden
using
the following techniques:
-
Filtering the data before it flows into the
bottleneck transform, for example, prebuilding a Lookup transform
dimension reference table from the new facts so that only those
dimensions that are required end up in the reference table.
-
Ignoring errors and using a conditional split to
filter them, for example, when doing upserts, testing shows that it
is generally faster to ignore errors from the Lookup transform and
conditionally split the not-found rows into an insert than to try
and handle the not-found rows with an error output.
-
Preprocessing the data the constrained transform
will process, for example, placing a Lookup transform before the
Slowly Changing Dimension transform to filter out nonexistent
rows.
-
Postprocessing the data in lieu of a
transformation, for example, the OLEDB Command transform
performs
row-by-row expensive operations, which you can optimize by
eliminating the transform and flowing the data to a table. Later,
use SQL Server to perform the set-based operations instead.
-
Cloning and parallelizing the data flow around
the bottleneck. For example, if the bottleneck is a transform,
segment the rows with a Conditional Split transform into
clones
of
the bottleneck, then union all the rows back into the
next
transform.
In other words, off-load work from your
overutilized bottleneck resources and load it onto the laggard,
underutilized resources. Optimizing nonbottleneck resources doesn't
help much if they're all constrained by the bottleneck.
Elevating
the Bottlenecks
As you continue to increase the performance of
the bottleneck by off-loading work from it and placing more of the
load on other less-utilized resources, the system becomes more
balanced and the bottleneck is eventually broken.
If in any of the
Previous Steps the Constraint Has Shifted,
Go Back to Step 1
When you take these steps, what you'll find is
that you're in an iterative improvement loop. After you eliminate
one bottleneck, you'll find others. So, you start the process again
until you reach your performance goal.
|