SSIS Scalability Foundations


Over time, you will develop an innate sense of where to look when identifying bottlenecks in packages. Similarly, you will start designing your packages with scalability in mind, which comes from knowing what works best for different challenges.

Some of the basics of taking your packages to the next level of scalability include knowing where your performance hits are and choosing the right task to handle your data processing. When it comes to evaluating performance, the primary areas are identifying control flow task durations and checking memory utilization. However, before diving into these areas, the first thing to look at is your SSIS service.

SSIS Service Status

The first thing you should check when encountering a performance problem is whether the SSIS service is started and set up to auto-start. The service provides some core functionality, such as enumerating and caching the components, which allows packages to load and start up faster. The service name is SQL Server Integration Services, which uses the executable MsDtsSrvr.exe. To check the status of the SSIS service, open SQL Server Configuration Manager and browse to the SQL Server 2005 Services container, as Figure 12-1 shows.

Figure 12-1: The SQL Server 2005 Services container

The service State should be Running, and the Start Mode should be set to Automatic to ensure that when the server reboots, the service is restarted.

Identifying Task Durations

When you have a package that is taking a long time to run, the natural first step is to identify which tasks are taking the longest. If you are running a package in BIDS, you can roughly see how long each task takes. However, if your package runs via a command-line script, identifying the long-running tasks is more difficult than it sounds.

Even if you have turned on package logging, the events that are logged do not provide task-duration information in a single line. Instead, log events are written as they happen. For example, the OnPreExecute event is written with a timestamp of its occurrence; later, when the task finishes, the OnPostExecute event is written. Therefore, to display task durations, you need to associate each OnPreExecute event with its corresponding OnPostExecute event and compare the times.

The following T-SQL statement can be run against the dbo.sysdtslog90 table. It displays each execution of a package and the tasks contained within that execution, along with their execution times.

    WITH Task_Details_CTE (Source, SourceID, ExecutionID, StartTime, EndTime)
    AS                                        --provides the task exec. detail
    (
      SELECT sys1.Source, sys1.SourceID, sys1.ExecutionID
           , sys1.StartTime, sys2.EndTime
        FROM dbo.sysdtslog90 sys1
       INNER JOIN dbo.sysdtslog90 sys2
          ON sys1.Event = 'OnPreExecute'         --gets only the start time
         AND sys2.Event = 'OnPostExecute'        --gets only the completed time
         AND sys1.ExecutionID = sys2.ExecutionID --groups package execution
         AND sys2.SourceID = sys1.SourceID       --associates identical tasks
    )
    SELECT                                       --identifies the "package" tasks
      CASE WHEN sys.SourceID = CTE.SourceID THEN CTE.Source
           ELSE '   TASK: ' + CTE.Source
       END AS [Package|Task]                     --indents tasks to package execs
         , CTE.StartTime
         , CTE.EndTime
         , RIGHT('00' + CAST(DATEDIFF(SECOND, CTE.StartTime, CTE.EndTime)/60
                        AS VARCHAR(20)), 3) + ' min.' +
           RIGHT('0'  + CAST(DATEDIFF(SECOND, CTE.StartTime, CTE.EndTime)%60
                        AS VARCHAR(20)), 2) + ' sec.' AS TaskDuration
      FROM dbo.sysdtslog90 sys
     INNER JOIN Task_Details_CTE CTE
        ON sys.Event = 'PackageStart'            --lists single row pkg exec.
       AND sys.ExecutionID = CTE.ExecutionID
     ORDER BY CTE.ExecutionID, CTE.StartTime     --orders by package, task

This query uses a common table expression (CTE) to perform the task associations. It joins the CTE containing the task details to the package executions themselves, and performs some simple DATEDIFF calculations to create a duration in minutes and seconds. This, of course, can be written in several different ways, or extended to include error details (a sketch follows the output table). The following table shows the query output.

Package                     Start Time            End Time              Duration
--------------------------  --------------------  --------------------  ---------------
Processes_Daily_Orders      12/4/2006 14:20:52    12/4/2006 14:45:02    022 min.10 sec.
   TASK: Delete stage       12/4/2006 14:20:52    12/4/2006 14:21:01    000 min.09 sec.
   TASK: Filter detail      12/4/2006 14:21:01    12/4/2006 14:29:35    008 min.34 sec.
   TASK: Main data load     12/4/2006 14:29:35    12/4/2006 14:40:40    011 min.05 sec.
   TASK: Cleanup tasks      12/4/2006 14:40:40    12/4/2006 14:45:02    004 min.22 sec.

Although simplistic in its output, this query shows the duration of each control flow task, which is the first step in identifying areas that need attention for performance tuning.
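
As noted earlier, the query can be extended to include error details. A minimal sketch, assuming the SQL Server log provider is also capturing the OnError event, joins the package executions in dbo.sysdtslog90 to their OnError rows and counts them:

    --Counts OnError events per package execution (assumes OnError logging is enabled)
    SELECT pkg.Source      AS PackageName
         , pkg.StartTime
         , COUNT(err.ID)   AS ErrorCount
      FROM dbo.sysdtslog90 pkg
      LEFT JOIN dbo.sysdtslog90 err
        ON err.ExecutionID = pkg.ExecutionID
       AND err.Event = 'OnError'
     WHERE pkg.Event = 'PackageStart'
     GROUP BY pkg.Source, pkg.StartTime
     ORDER BY pkg.StartTime

This logic could also be folded into the CTE query above to flag which task executions reported errors.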

In addition, Microsoft has released a report pack that includes a SQL Server 2005 Reporting Services (SSRS) report that runs against the dbo.sysdtslog90 table. It can be downloaded from www.microsoft.com/downloads/details.aspx?FamilyID=&DisplayLang=en. Figure 12-2 shows a sample SSIS log report from the report pack download.

Figure 12-2: Sample SSIS log report

Memory Utilization

In many cases, as you look for areas of improvement, you will find that the Data Flow Tasks are taking the most time. This is expected, since the data flow is the core processing mechanism in a package. Furthermore, when it comes to the data flow, the most important resource to be aware of is your system memory. Therefore, one big area of focus for SSIS scalability is memory management: identifying when you exceed memory thresholds and when you have more memory available to leverage for performance.

Important 

When it comes to identifying the bottleneck of a system, the standard Performance Monitor counters can be used, such as memory usage and paging, disk throughput and queue lengths, and processor utilization. Although outside the scope of this discussion, they are the foundation for identifying which server resource is constraining the process.

With SSIS, you will often see bottlenecks in the areas of memory and disk input/output (I/O). If you are seeing both, and the only application running on your server is SSIS (and your destinations are on a different server), chances are that the disk I/O is being caused by the limited memory. Low memory causes the operating system to use virtual memory, swapping pages between physical RAM and disk. Low memory also causes SSIS to use temporary space on a drive to manage the size of its memory requests. This is called buffer spooling, and it can quickly lead to very poor package performance.

To demonstrate, Figure 12-3 shows a package that imports three identical flat files, each containing approximately four million records. They are first brought together through a Union All transformation and then sent into a Sort transformation, which applies an order across the five columns in the file. The rows are landed in a Row Count transformation for demonstration purposes.

Figure 12-3: Package importing three identical flat files

On a single-processor machine with 750 MB of memory (about 400 MB of it free), the package ran into performance problems when memory usage reached the 750 MB limit of physical RAM. The package completed in 16 minutes and 25 seconds. Visually, through the debugger, the package slowed significantly after about 10 million rows had been read. Figure 12-4 shows Performance Monitor counters for this execution, including Memory: Available MBytes, Physical Disk: Average Disk Queue Length, SSIS Pipeline: Rows read, and SSIS Pipeline: Buffers spooled.

Figure 12-4: Performance Monitor with counters

As you can see, as soon as Memory: Available MBytes reached 0, the server started experiencing disk I/O issues. The Physical Disk: Disk Queue Length began spiking over 6, which is 3 to 4 times what this single-disk subsystem can efficiently handle. In addition, the Sort transformation started to use temporary disk space to complete its operation; you can see this because SSIS Pipeline: Buffers spooled shows that SSIS needed to temporarily store more than 500 buffers to disk during this execution. With approximately 10,000 records per buffer, that is about five million rows temporarily stored to disk. Clearly, this package was unable to run efficiently on this system.

When the same machine was upgraded to 1.5 GB of RAM and the package was run again, it completed in 9 minutes and 45 seconds. Figure 12-5 shows the Performance Monitor counters during this execution.

Figure 12-5: Performance Monitor counters during execution

As you can see, during this execution, Memory: Available MBytes never reached 0, showing that the amount of physical RAM was sufficient; the Physical Disk: Disk Queue Length rarely rose above 0, showing that the disk was handling the I/O needs; and the SSIS Pipeline: Buffers spooled counter showed that SSIS never needed to spool buffers to disk.

The second execution ran almost twice as fast as the first. By looking at the Available MBytes captured during the second execution, you can see that this simple package needed approximately 700 MB of memory to run efficiently. Since the first execution started out with about 400 MB of available memory, the machine had a memory deficiency of approximately 300 MB, which translated into a heavy performance hit. This comparison demonstrates that a sufficient amount of system memory keeps your data flows working in memory and eliminates the need for temporary storage.

Furthermore, when the data flow started spooling buffers to disk, it caused a disk bottleneck on the system: the disk queue length jumped to well over 2, indicating that the disk could not keep up with the I/O requests. Although not shown, when this test was run with 500 MB of physical RAM (and 150 MB of memory available when the package started), the package ran for approximately one hour before it was stopped without completing.

Memory management is key to SSIS performance, and several of the data flow optimization techniques show ways to manage the amount of memory needed in order to optimize the buffers and package execution.

Balancing SQL Operations with the Data Flow

This section provides a framework for knowing when to apply SQL functionality (leveraging the underlying database) versus the SSIS data flow. This is a matter of scalability: applying SQL scripting and the data flow each where it is strongest enables your solutions to scale. In essence, this is about choosing the right tool for the job.

Data Flow Advantages as Compared with SQL

To be sure, the point of this discussion is not to downplay the value of SQL scripting, or of the data flow. Rather, it is to set the stage for determining when to use each technology in your ETL process. As you will see in the ensuing pages, the data flow also has some limitations when compared with using SQL statements.

  • From a core data transformation architecture standpoint, the data flow can perform true in-line transformation processing, meaning that while data is being extracted into the data flow pipeline, business transformation logic can be performed on the data before it is loaded to the destination. On the other hand, SQL-based processing is limited to an ELT-type architecture, meaning Extraction, Loading, and then Transforming: when using SQL code to process data, the data must first be moved from one location to another, and then transformed. In-line transformations are not possible with Transact-SQL; for example, with a bulk insert or a BCP import, transformation logic cannot be applied to the data as it is loaded (a sketch of this ELT pattern appears after this list). This has implications related to sequential operations, I/O impacts, and error handling, which are addressed later in this chapter.

  • The data flow has the capability to perform transformations asynchronously, or in parallel, on the same data stream being brought into the pipeline. This provides greater scalability by reducing the steps needed to transform your data. By contrast, SQL scripts are inherently synchronous processes, meaning that Step 1 must fully complete before Step 2, and Step 2 before Step 3. If you have a source flat file that must be brought into your database, first the BCP step must finish loading data to a staging table, then the update or transformation process must complete, and only then can the data be inserted into the destination. The SSIS data flow can be extracting that same data, transforming it in memory, and loading it, potentially while it is still being extracted from the file. To be sure, SQL can be executed in parallel through the SSIS control flow, but this does not remove the requirement that each SQL step finish before the data can be loaded to the destination.

  • The SSIS pipeline engine leverages system memory to scale its operations, and memory I/O is dramatically faster than disk I/O. Therefore, when your server has enough memory to handle the data flow process, your packages will perform extremely well. SQL inherently causes disk I/O, which, for large operations, is very intensive and often the bottleneck of an ETL process. The disk I/O overhead comes from the consistency and durability guarantees that any RDBMS must provide (the C and D in the ACID acronym of relational databases). This database dependence can also affect end-user querying or transactional system performance (depending on where the code is run), because the SQL operations use database engine resources to operate.

  • Included with SSIS is the ability to handle data flow errors without affecting the entire data set. For example, when a row being processed through the data flow causes a conversion error on a particular value, that row can be redirected and handled separately, then either brought back into the same pipeline or sent out to an error row table. SQL code, on the other hand, is an all-or-nothing proposition when it comes to processing data. SQL certainly has error handling, but it is at the batch level, not the row level: if one row fails a conversion, the entire statement fails.

  • Within a single SSIS data flow, the source-to-destination transformation can involve many-to-many mappings. This means that you are able to bring data together from many sources, perform transformations, do data correlation, multicast or conditionally split the data, and then land the data in multiple destinations simultaneously. A SQL INSERT statement, on the other hand, is fixed with one-to-one, source-to-destination mapping.

  • The SSIS data flow contains data cleansing, text mining, and data mining capabilities, which provide advanced data transformation logic beyond the standard capabilities that SQL scripting can perform. With a host of out-of-the-box basic and advanced transformations (including the Fuzzy Grouping, Fuzzy Lookup, Term Extraction, Term Lookup, and Data Mining Query transformations), the SSIS data flow can handle many unique and complicated scenarios not easily handled by SQL functionality.

  • When dealing with large insert operations, the data flow can handle large data sets by batching inserts through the destination adapter's advanced properties; a review of these features appears later in this chapter. SQL can also perform batching when doing bulk insert operations, but for normal INSERT statements, the entire data set being loaded must be handled in one transaction, which often comes with long commit times.

In all, the data flow provides some valuable benefits and features over and above what a SQL-based process can offer for many operations that are common in ETL.
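
To make the contrast concrete, here is a minimal sketch of the SQL-based ELT pattern described in the first two bullets (the BATCHSIZE option illustrates the bulk insert batching mentioned in the last bullet). The file, staging table, and destination table names are hypothetical; each statement must fully complete before the next one can begin, whereas a single data flow can extract, transform, and load the same data in one pass:

    --Step 1: Extract/Load - bulk copy the raw file into a staging table
    --(no transformation logic can be applied during this step)
    BULK INSERT dbo.stgDailyOrders
      FROM 'D:\Imports\DailyOrders.txt'
      WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 10000)

    --Step 2: Transform - apply business logic against the staged rows
    UPDATE dbo.stgDailyOrders
       SET OrderAmount = OrderAmount * ExchangeRate

    --Step 3: Load - insert the transformed rows into the destination
    INSERT INTO dbo.FactDailyOrders (OrderID, OrderDate, OrderAmount)
    SELECT OrderID, OrderDate, OrderAmount
      FROM dbo.stgDailyOrders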

SQL Advantages when Compared with the Data Flow

SQL has several advantages when compared with the data flow, and it provides significant value over and above the data flow for certain ETL operations.

  • The most notable advantage that SQL brings to the table is its ability to perform set-based updates. This concept has been discussed in several places in this book, but it is important to highlight here because SQL updates can scale significantly beyond what the data flow offers. To perform an update natively in the data flow requires the OLE DB Command transformation, which runs the update statement one row at a time; SQL can handle the update in a single operation by joining a temporary table with the table to be updated (see the sketch after this list).

  • Relational database engines can perform sorting operations efficiently when appropriate indexes exist. Even with large tables where scans are necessary and an ORDER BY requires an expensive query plan, the sort can often complete faster than the SSIS data flow Sort transformation. This is especially true when the SSIS server does not have enough memory to fit the sort set into its private buffers; when that happens, SSIS is forced to write temporary information to disk, severely impacting the performance of the transformation and of other packages and data flows being executed.

  • One advantage of SQL coding is its native relationship to the database engine. Because of this integration, code written in SQL requires fewer conversion operations and, therefore, has less chance of implicit conversion failures. The data flow requires data conversions into the pipeline engine and conversions out of the pipeline into the destination, and it relies on data access providers to connect and import or export, whereas data moving from one SQL table to another is much more integrated with the database engine.

  • Finally, SQL (specifically T-SQL) excels at advanced queries such as recursive self-joins, correlated subqueries, multi-table joins, and more. Although the data flow can perform various data correlation operations, some queries are too difficult to model with the available out-of-the-box transformations.
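
As a minimal sketch of the set-based update mentioned in the first bullet (the staging and dimension table names are hypothetical), the data flow lands the changed rows in a staging table, and a single joined UPDATE applies them, rather than the OLE DB Command transformation issuing one UPDATE per row:

    --Set-based update: one statement updates all matching rows,
    --instead of one UPDATE per row through the OLE DB Command transformation
    UPDATE dest
       SET dest.CustomerName = stg.CustomerName
         , dest.LastModifiedDate = GETDATE()
      FROM dbo.DimCustomer dest
     INNER JOIN dbo.stgCustomerUpdates stg
        ON stg.CustomerKey = dest.CustomerKey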

As you can see, SQL-based operations provide some valuable functionality that, when leveraged, can assist in providing scalability to your packages. Relative to the areas reviewed previously, SQL should be applied to the following key areas:

  • When updates (as opposed to inserts) are required and more than several thousand records need to be updated, leveraging SQL code for a set-based update will allow the operation to scale. If your current set-based operations are merely transformation logic that does not affect existing destination rows, look to integrate that logic into the data flow instead.

  • Storing auditing and logging information in a database enables easy querying and reporting. SQL, through the Execute SQL Task, provides the mechanism to integrate a relational database with SSIS for this purpose. This use of SQL and the database applies to auditing tables for capturing performance and status information; management tables for tracking metadata administration information (such as persisting incremental extraction information); and procedures and functions that drive the auditing and lineage.

  • Even within the data flow itself, custom SQL can enable faster performance and better scalability, such as source query filtering, ordering, and shaping, as the sketch below illustrates.
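
A minimal sketch of such a source query (the table and column names are hypothetical): it is used as the SQL command of an OLE DB Source so that filtering, shaping, and ordering are pushed down to the database rather than handled by downstream Conditional Split, Data Conversion, and Sort transformations. When rows are pre-ordered this way, the source output's IsSorted and SortKeyPosition properties can be set so that downstream components recognize the sort order:

    --Source query for an OLE DB Source: the database filters, shapes,
    --and orders the rows before they enter the pipeline
    SELECT OrderID
         , CustomerKey
         , CAST(OrderDate AS SMALLDATETIME) AS OrderDate   --shape the data type at the source
         , OrderAmount
      FROM dbo.SalesOrders
     WHERE OrderDate >= '20061201'                          --filter at the source
     ORDER BY CustomerKey, OrderDate                        --pre-sort for downstream operations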


