This Is Now: Integration Services


So here you are — a new tool and the opportunity to reconsider your data-processing architecture. In an ideal world, the tool would not be part of your conceptual design. That may never be fully achievable, but the nature of Integration Services allows solutions to be designed with much more flexibility and extensibility. Designing a data-processing solution requires more than just sending the source data into a black-box transformation engine with outputs that push the data into the destination. And of course, system requirements will dictate the final design of the process, including but not limited to the following:

  • Source and destination system impact

  • Processing time windows and performance

  • Destination system state consistency

  • Hard and soft exception handling and restartability needs

  • Environment architecture model, distributed hardware, or scaled-up servers

  • Solution architecture requirements such as flexibility of change or OEM targeted solutions

  • Modular and configurable solution needs

  • Manageability and administration requirements

In reviewing this list, you can quickly map several of these to what you have learned about Integration Services already. In most cases, a good architecture will leverage the built-in functionality of the tool, which in the end reduces administration and support requirements. The tool selection process, if it is not completed before a solution is developed, should include a consideration of the system requirements and functionality of the available products.

The focus of this chapter is not to provide an exhaustive Integration Services architecture but rather to provide a framework for design that models true ETL (in that order). Integration Services brings with it a platform that is able to meet a broad variety of business data-processing needs and handle the required volume and integration complexity within the confines of the engine itself.

Integration Services Design Practices

In the previous analysis of DTS, you looked at a few challenges to the architecture. In doing so, however, you've uncovered a few good principles to follow when designing an Integration Services solution:

  • Limit synchronicity

  • Reduce staging and disk IO

  • Reduce the reliance on an RDBMS

You've already looked at a few reasons why these are important principles to consider, and the result of following these will be more apparent as you compare your DTS solution to an Integration Services approach.

Keep in mind that solution requirements often drive design decisions, and there are situations where staging or the RDBMS are useful in data processing. Some of these are discussed in this section. Your goal, though, is to rethink your design paradigms with Integration Services.

Leveraging the Data Flow

Without question, the biggest value that Integration Services brings is the power of the Data Flow. This is not to minimize the out-of-the-box functionality of restartability, configurations, logging, Event Handlers, or other Control Flow tasks, but the primary goal of the engine is to "integrate," and the Data Flow is the key to realizing that goal. Accomplishing data-processing logic through Data Flow transformations brings both performance and flexibility.

Most data architects come from DBA backgrounds, which means that the first thing that comes to their minds when trying to solve a data integration, processing, or cleansing scenario is to use an RDBMS, such as SQL Server. People gravitate to areas they are comfortable with, so this is a natural response. When your comfort in SQL is combined with an easy-to-use and low-cost product like DTS, which in many ways relies on relational databases, the result is a widely adopted tool.

Moving to Integration Services in some ways requires thinking in different terms — Data Flow terms. In previous chapters, you looked at the different Data Flow transformations, so the focus in this section will be on applying some of those components into design decisions and translating the SQL-based designs into Data Flow processes.

The three architecture best practices relate directly to the value that the Data Flow provides:

  • Limit synchronicity. By bringing more of the processing logic into the Data Flow, the natural result is fewer process-oriented steps that must complete before the next one can begin. In the previous chapter, you looked at the general streaming nature of the Data Flow (streaming in terms of moving data, not audio or video streaming). This translates to reduced overall processing times.

  • Reduce staging and expensive IO operations. The Data Flow performs most operations in memory (with occasional use of temp folders and some interaction with external systems), and processing data that resides in RAM is more efficient. Disk IO operations rely on the performance of the drives, the throughput of the IO channels, and the overhead of the operating system to write and read information to disk. With the high volumes or bursting scenarios typical of data processing and ETL, disk IO is often a bottleneck.

  • Reduce reliance on RDBMS. Relational engines are powerful tools to use, and the point here is not to detract from their appropriate uses to store and manage data. By using the Data Flow to cleanse and join data rather than the RDBMS, the result is reduced impact on the relational system, which frees it up for other functions that may be higher priority. Reading data from a database is generally less expensive than performing complex joins or complicated queries. In addition, related to the first bullet, all RDBMS operations are synchronous. Set-based operations, while they are very useful and optimized in a relational database system, still require that the operation be complete before the data is available for other purposes. The Data Flow, on the other hand, can process joins and lookups and other cleansing steps in parallel while the data is flowing through the pipeline.

Data Integration and Correlation

The Data Flow provides the means to combine data from different source objects completely independent of the connection source where the data originates. The most obvious benefit of this is the ability to perform in-memory correlation operations against heterogeneous data without having to stage the data. Put another way, with Integration Services, you can extract data from a flat file and join it to data from a database table inside the Data Flow without first having to stage the flat file to a table and then perform a SQL join operation. This can be valuable even when the data is coming from the same source, such as a relational database engine; source data extractions are more efficient without complex or expensive joins, and data can usually begin to flow into the Data Flow immediately. In addition, single-table SELECT statements have less impact on the source systems than pulls where join logic is applied. Certainly there are situations where joining data in the source system may be useful and efficient; in many cases, however, focusing on data integration within the Data Flow will yield better performance. When different source systems are involved, the requirement to stage the data is reduced.
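To make the contrast concrete, here is a hedged sketch (the AdventureWorks-style table and column names are illustrative assumptions, not taken from the chapter's example): rather than pushing the join into the extraction query, each source can issue a simple single-table SELECT, and the correlation happens inside the Data Flow.

 -- Join pushed to the source RDBMS: a heavier extraction query
 SELECT h.SalesOrderID, h.OrderDate, h.CustomerID, d.ProductID, d.OrderQty
 FROM Sales.SalesOrderHeader AS h
 INNER JOIN Sales.SalesOrderDetail AS d ON d.SalesOrderID = h.SalesOrderID;

 -- Join performed in the Data Flow: two lightweight single-table extractions
 -- that feed a Merge Join (or Lookup) inside the pipeline
 SELECT SalesOrderID, OrderDate, CustomerID FROM Sales.SalesOrderHeader;
 SELECT SalesOrderID, ProductID, OrderQty FROM Sales.SalesOrderDetail;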

Several of the built-in transformations can perform data correlation similar to how a database would handle joins and other more complex data relationship logic. The following transformations provide data association for more than one data source:

  • Lookup

  • Merge Join

  • Merge

  • Union All

  • Fuzzy Lookup

  • Term Lookup

  • Term Extraction

Note

In the next section, you will also consider some of the features of the Data Flow that provide data cleansing and other in-line data operations.

Beyond the built-in capabilities of Integration Services, custom adapters and transformations allow more complex or unique scenarios to be handled. This will be discussed in Chapter 14.

The two most commonly used and powerful data correlation transformations are the Lookup and Merge Join. The functionality and properties of these transformations are described in Chapter 4, but for the purpose of applying Integration Services, a few descriptions and examples will be helpful to see how data integration can be applied.

Lookup Transformation

The Lookup transformation, particularly when it can be set up in fully cached mode, provides an efficient way to bring data objects together. Certainly there are many uses for the Lookup, but a few more common uses may be to look up a foreign key on a destination table or even check to see if a record exists in a related table to determine how to process downstream transformations. Figure 11-5 shows a Lookup transformation joined across a business key with the foreign key returned.

Figure 11-5

Using the Lookup to return foreign keys or other related columns in this way will relieve the need for joins across databases, or even distributed joins across servers, allowing scale-out models for processing — reducing resource needs and impact on the relational systems. Large set-based update statements can also be reduced by using a Lookup transformation to join data. When only a few columns are needed by the Lookup, such as a business key and one or two narrow return columns, a fully cached Lookup will scale very well.
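As a hedged sketch of such a narrow reference query (the AdventureWorks-style table and column names are assumptions), the Lookup's reference can be limited to just the business key and the returned foreign key rather than selecting every column from the table:

 -- Keep the cached reference set narrow: business key plus the returned key only
 SELECT ProductNumber, ProductID
 FROM Production.Product;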

The Lookup can also be used to check the existence of a related row in a table and provide information back to the Data Flow to inform downstream processes how to handle a row, such as whether to do an insert or update. You can take a couple of different approaches to accomplish this, both of which require changing the error output. When you edit the Lookup transformation, the Configure Error Output button is available on all the editor tabs. Figure 11-6 shows the Configure Error Output window, highlighting the drop-down options available for error handling.

Figure 11-6

The three error-handling options are Fail Component (the default), Redirect Row, and Ignore Failure. When set to Fail Component, the obvious result is a Data Flow failure when the Lookup table does not have a match for the input records. This option is not useful for the scenario you are considering. By setting the error handling to Redirect Row, any input row that does not match a record in the Lookup reference table is redirected out the error output (the red output path). This method can be useful when rows that do not match the Lookup reference table are needed immediately for other operations. The red path may be confusing, however, since part of the purpose of the Lookup is to identify records that do not match the reference table; renaming the path and adding annotations can help clarify its purpose.

The third error-handling option in the Lookup transformation, Ignore Failure, provides the most flexibility when you are trying to identify missing rows. When you set the error handling to Ignore Failure, the Lookup transformation will insert NULLs in all the return columns specified in the reference table when the reference table does not have a match. The flexibility comes when the Lookup output needs to be kept together (matches and non-matches) in the Data Flow either for the entirety of the downstream Data Flow path or when other transformations are required before checking the Lookup results. Figure 11-7 highlights a Data Flow containing a Lookup configured to ignore failures. Downstream of the Data Flow in this example are a few transformations before a Conditional Split, which evaluates the rows as they flow through this transformation based on the Lookup output.

Figure 11-7

The Conditional Split evaluates whether the prior Lookup had a match in the reference by using an ISNULL function applied against a column specified to be returned from the Lookup reference table. The Conditional Split from Figure 11-7, named "Handle Missing Products," uses the ISNULL function, ISNULL([ProductID]), to identify previous Lookup records that did not have a match. Therefore, any row in the pipeline where the ProductID is NULL will be redirected out the named output, and the rest will be sent down the default output of the Conditional Split.

Of course, be aware that if the return column used in the Lookup reference table allows NULLs, then the filter logic in the Conditional Split would also evaluate this result as a match in the criteria. In the example, the return column does not allow NULLs; therefore the criterion is applied correctly. A common scenario, as this example highlights, is the use of Lookup to determine whether an input record should be inserted into a destination table or whether an update should be applied.

Take special note of the Conditional Split and Multicast transformations. While these transformations do not bring data together, they are useful for splitting the data up once it is in the Data Flow. Some source data may be used in multiple ways during your data processing. Whether you need to use the entire data set for multiple purposes (Multicast) or you need to break out each row for different purposes (Conditional Split), these transformations will alleviate the need to pull from the same source more than once.

Merge Join Transformation

The second most useful data association transformation example is the Merge Join transformation, which can take two different inputs and merge them together by comparing a set of key columns. The Merge Join functionality is also described in detail in Chapter 4, so the discussion will focus on applying the Merge Join in scenarios.

Because the Merge Join transformation allows three different types of joins (Left Outer Join, Inner Join, and Full Join), the output row count will usually not equal the sum of the row counts of the two inputs. In Figure 11-8, the right and left sides of the merge are joined by a single column, SalesOrderID, and a select number of columns are returned by each side of the join. Furthermore, the join is identified as a "Left outer join," meaning that all the rows from the left side of the join will flow through the transformation even if a matching record is not available on the right side. Rows on the right side of the join that do not match a record on the left side of the merge, on the other hand, are left out of the output.

Figure 11-8

As an example of the output of this transformation, the following tables show some sample records within a range of OrderIDs.

Assume that the left input looks like the following table.

 SalesOrderID   OrderDate   CustomerID
 13452          8/1/2005    AW9987
 18600          9/15/2005   AW540
 21977          9/30/2005   AW2333

And the right input looks like the following table.

 SalesOrderID   OrderQty   UnitPrice   ProductID   CarrierTracking
 13452          1          $435.99     BK1055      A12227499
 13452          1          $15.99      SH344       G977642
 21977          3          $34.00      AC53        A12223445
 23445          1          $799.95     BK1077      A12338890

When you apply the "Left outer join" logic as defined in Figure 11-9, you'll get the following results.

Figure 11-9

 SalesOrderID   OrderDate   CustomerID   OrderQty   UnitPrice   ProductID   CarrierTracking
 13452          8/1/2005    AW9987       1          $435.99     BK1055      A12227499
 13452          8/1/2005    AW9987       1          $15.99      SH344       G977642
 18600          9/15/2005   AW540        <NULL>     <NULL>      <NULL>      <NULL>
 21977          9/30/2005   AW2333       3          $34.00      AC53        A12223445

You'll notice a few things about the output result set:

  • Rows on the right side of the data set that do not have a match on the left are ignored, such as SalesOrderID 23445.

  • Rows on the left that do not have matches on the right are in the output result set with NULLs in the columns returned from the right side, such as SalesOrderID 18600.

  • When a row on either the left or right side of the Merge Join has multiple matches, multiple rows are returned in the output results. In other words, a row from either side may be used more than once in the output result set if the join results in more than one match. SalesOrderID 13452 has two matches on the right input and therefore produces two matches in the output.

As you can see, the Merge Join can be applied much like the Lookup transformation, because it integrates two different sources. There are, however, some important differences between the Merge Join and the Lookup, and some specific scenarios where the Merge Join is the better fit for the data-association requirements. The Lookup, for example, will not duplicate rows when more than one match is made on the reference table; instead, a warning will be returned. Furthermore, the Lookup transformation's reference table works only for database connections, not for flat files or other non-RDBMS sources.

The Merge Join transformation also requires that the sources be sorted by the join columns in the same order, whereas the Lookup does not require this. This may sound expensive from a resource perspective, since a Sort transformation requires the entire data set to be cached; however, the source data may also arrive presorted from the source adapter. In some situations, the join columns used in the Merge Join may have an index applied on the source database, or the source file may already be in the required order. You'll study this in more detail in the "Optimizing Package Processing" section later in this chapter. The purpose of mentioning it now is to illustrate that the Merge Join may not need the entire data set cached for performance. When you're determining whether to use the Merge Join or the Lookup transformation, if the sources can be efficiently presorted without adding overhead, and the volumes are moderate to high on both sides of the join (or the number of columns returned is high), then a Merge Join may be the right solution.

As you can see, the Merge Join transformation functions almost identically to a database join between tables (inner or outer join) where the join columns are compared for equality (as opposed to nonequality comparisons: >, >=, <, <=, !=).

Data Cleansing and Transformation

The second major area of consideration where you can apply the Data Flow is data cleansing. Cleansing data involves managing missing values; correcting out-of-date, incomplete, or mis-keyed data; converting values to standard data types; changing data grain or filtering data subsets; and de-duplicating redundant data. Consistency is the goal of data cleansing whether the data source is a single system or multiple disparate sources.

Many of the Data Flow components provide data-cleansing capabilities or can participate in a data-cleansing process. Some of the more explicit transformations usable for this process include the following:

  • Aggregate

  • Character Map

  • Conditional Split

  • Data Conversion

  • Derived Column

  • Fuzzy Grouping

  • Fuzzy Lookup

  • Pivot

  • Script Component

  • Sort (with de-duplicating capabilities)

  • Unpivot

Each of these transformations, or a combination of them, can handle many data-cleansing scenarios. A few of the transformations provide compelling data-cleansing features that go beyond the capabilities of many relational engines, which makes the Data Flow well suited to this work. For example, the Fuzzy Lookup and Fuzzy Grouping (de-duplication) transformations cleanse dirty data by comparing data similarity within certain defined ranges. Pivot and Unpivot can transform incoming data by pivoting rows to columns or vice versa. Also, the Script transformation offers very powerful data-cleansing capabilities with the full features of VB.NET embedded; it is highlighted in detail in Chapter 7. Since the goal of this chapter is to highlight and discuss the application of Integration Services, this section focuses on a couple of common data-cleansing examples using the Derived Column transformation and the Aggregate transformation. These two transformations are particularly relevant to how data cleansing can be accomplished in the Data Flow in comparison with common query logic.

As Chapter 4 demonstrates, the Derived Column transformation allows you to replace column values coming through the Data Flow. One of the more common data-cleansing scenarios that the Derived Column transformation can handle is replacing blank and NULL values extracted from various sources. Using the expression language, described in detail in Chapter 7, you can check for both conditions in a single expression. Figure 11-9 shows the Derived Column using the "Replace" column option on the AddressLine1 column coming through this transformation.

The following expression code is used to cleanse the [AddressLine1] input column in the example.

 ISNULL([AddressLine1]) || TRIM([AddressLine1]) == "" ? @Unknown_Value : TRIM([AddressLine1]) 

The expression checks for both conditions and also trims the value of the column to remove leading and trailing spaces. If the input column value is NULL or blank, the expression uses a generic variable containing an "unknown" value as a replacement for the column output. Note that the expression uses a few of the different functions available in the language, particularly the conditional expression (<boolean> ? <true> : <false>). Rather than recursively embedding the conditional expression to evaluate the NULL and blank ("") conditions separately, a logical OR (||, double pipes) is used to check both conditions at once.

The second valuable transformation to highlight is the Aggregate. The Aggregate transformation brings GROUP BY logic to the Data Flow, but you can have multiple groups of aggregates with different outputs and different aggregations defined for each group. Many data transformation scenarios require changing the grain or the level of detail from the source to the destination, and in most situations the requirement is to roll up the data to higher levels of detail than what the source system makes available.

In Figure 11-10, the Aggregate transformation takes the input and groups by three source columns and applies aggregations across three other input columns.

Figure 11-10

Although not pictured, the Aggregate transformation can also provide multiple grouping with different outputs per grouping. This means if one grouping should be by CustomerID and another by CustomerID and OrderDate, they both can be accomplished in the same Aggregate component.
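In SQL terms, the rough equivalent of that single Aggregate would be two separate GROUP BY passes over the data; here is a minimal sketch, assuming (hypothetically) that the pipeline rows had been staged to a table named dbo.SalesInput:

 -- Grouping 1: by customer
 SELECT CustomerID, SUM(UnitPrice * OrderQty) AS SalesAmount
 FROM dbo.SalesInput
 GROUP BY CustomerID;

 -- Grouping 2: by customer and order date
 SELECT CustomerID, OrderDate, SUM(UnitPrice * OrderQty) AS SalesAmount
 FROM dbo.SalesInput
 GROUP BY CustomerID, OrderDate;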

Putting It All Together, Seeing the Results

Now that you have considered a reengineered approach for data processing, compare it to the original DTS-based example. The DTS approach involved several steps that staged, cleansed, compared, updated, and loaded data into a destination table.

In reconstructing this approach, you were able to take the logic required and fit the entire process into a single Data Flow. Figure 11-11 shows the new design.

Figure 11-11

The Integration Services design takes advantage of the Data Flow by leveraging some of the built-in transformations just discussed, such as the Lookup, Derived Column, and Merge Join transformations. All of the logic that was accomplished in five DTS steps as described in Figure 11-1 has been retooled into a single Data Flow in Integration Services. Noticeably absent in the design is a staging environment. The following feature highlights of this redesign have been used to accomplish the identical data-processing task:

  • A Merge Join transformation associates the Sales Detail flat file records with the Online Customer flat file records. Since the Sales Detail records were already presorted, a Sort transformation was not needed for the 480,000 records coming from this source adapter. A Sort transformation has been used for the 18,000 Online Customer records, which, given the low record count, performs very well in this scenario.

  • Because the Sales Header table uses an IDENTITY column, SalesOrderID, which is required for the Sales Detail table, the Script transform creates a SalesOrderID value for each unique combination of sales grouping, starting from the last SalesOrderID value in the Header table (a seed query along the lines of the sketch after this list). The SQL Destination for the Sales Header table is set for Identity Insert to allow the created values to be inserted and the same keys to be available for the Sales Detail table without having to perform multiple passes on the data as the DTS solution did.

  • A Multicast transformation allows the data to be separated for the Sales Detail and Sales Header tables.

  • Using an Aggregate transformation, as shown in Figure 11-11, the input data is grouped by the Header criteria and the sales values are aggregated.

  • A couple of Derived Column transformations are used to cleanse the data flowing through the Data Flow to replace NULL values with defaults.

  • Lookup transformations allow the foreign keys for several tables to be acquired with limited single-table SELECT statements on the Lookup reference tables, which has little impact on the destination system.
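For the identity-seeding bullet above, the starting SalesOrderID would typically be captured once before the Data Flow runs, for example by an Execute SQL task that stores the result in a package variable. A minimal sketch, assuming an AdventureWorks-style header table:

 -- Capture the last key so the Script transform can generate new SalesOrderID values
 SELECT ISNULL(MAX(SalesOrderID), 0) AS LastSalesOrderID
 FROM Sales.SalesOrderHeader;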

Comparing the results, the redesigned package in Integration Services was able to complete the same processing logic approximately two and one-half times faster than the original DTS package. Figure 11-12 shows the Gantt chart with both package execution times.

Figure 11-12

Although this is a simple example, it serves to point out the value of Integration Services. Not only does the performance speak for itself, but the simplicity in the design adds to the compelling nature of the tool and the architecture.

Staging Environments

A word must be said about the appropriate use of staging environments. To this point, the emphasis has been on thinking in Data Flow terms by moving core data-processing logic into the Data Flow. In most cases, this will yield high-performance results, especially when the timeliness of moving the data from point A to point B is the highest priority, such as near real-time or tight-processing-window scenarios. Doing this also mitigates some management overhead by limiting interim database usage.

A few situations that merit staging environments are worth mentioning for consideration:

  • Restartability. The built-in checkpoint logic of Integration Services revolves around the Control Flow. What this means is that a failure in the Data Flow will not persist the data state. Rather, when the package is restarted, the Data Flow will restart from the beginning. The implications affect design if the source system is in flux and an error in the Data Flow causes a processing window to be missed. By landing the raw data first, the chance for data errors is minimized, and in the event of a failure during the load process, the package can be restarted from the staged data.

  • Processing Windows and Precedence. Certain requirements may dictate that the various source extraction windows do not line up with each other or with the data load window for the destination. In these scenarios, it would be necessary to stage the data for a period of time until the full data set is available or the destination database load window has been reached.

  • Source Back Pressure. At times, the Data Flow transformations may apply back pressure on the source extractions. This would happen when the flow of data coming in is faster than the performance of the transformations to handle the data processing in the pipeline. The back pressure created would slow down the extraction on the source system, and if the requirement is to extract the data in the fastest time with the least impact, then staging the raw data extract may help eliminate the back pressure.

  • Data Flow Optimization. Staging certain elements, such as business keys, can actually provide valuable data to optimize the primary Data Flow. For example, if the Lookup source query can be filtered based on a set of keys that was pre-staged, this may allow overall gains in processing times by reducing the time it takes to load the Lookup plus the amount of memory needed for the operation. A second example is the use of staging to perform set-based table updates. Updates in a large system are often the source of system bottlenecks, and since Integration Services cannot perform set-based updates in the Data Flow, one consideration is to stage tables that can be used in a later Execute SQL task for a set-based update, which may provide a more efficient process.
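As a sketch of the first optimization in the last bullet (the staging and reference table names here are assumptions), pre-staging the business keys found in the extract lets the Lookup reference query return only rows that can possibly match, shrinking both the load time and the cache size:

 -- Reference query filtered by pre-staged business keys
 SELECT p.ProductNumber, p.ProductID
 FROM Production.Product AS p
 INNER JOIN stg.ExtractProductKeys AS k
     ON k.ProductNumber = p.ProductNumber;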

Staged data can also prove useful in data validation and error handling. Given some of the uses of staging, is there a way to accomplish data staging but still retain the performance gained by leveraging the Data Flow? Yes. One emphasis that has already been suggested is the reduction of synchronous processing in the Control Flow. When you have to introduce a staging environment, the most natural thought is to first pick up the data from the source, land it in the staging environment, and then pick the data back up from staging and apply the transformation logic to it. What about landing the raw data to a staging environment at the same time that the transformations are applied? Figure 11-13 shows a Data Flow designed with a staging table that does not require the data to reside in the table before the transformation logic is applied.

Figure 11-13

The Multicast transformation in this example is taking the raw source data and allowing it to stream down to the core Data Flow, while at the same time the raw source data is being staged to a table. The data within the table is now available to query for data validation and checking purposes; in addition, it provides a snapshot of the source system that can then be used for reprocessing when needed. Although the data is landed to staging, two differences distinguish this example from a model that first stages data and then uses the staged data as a source. First, as has been mentioned, the process is no longer synchronous; data can move from point A to point B in many cases in the time it takes simply to extract the data from A. Second, the staging process requires only a single pass on the staging table (for the writes) rather than the IO overhead of a second pass that reads the data back from the staging table. Overall, this approach may provide the best of both worlds: leveraging the Data Flow while still providing the value of a staging environment.

Optimizing Package Processing

There are a few techniques you can apply when you're streamlining packages for performance. In the last chapter, you went under the covers of the Integration Services engine to understand buffer usage and the different types of Data Flow components. This section builds on that knowledge by applying certain optimization techniques to achieve better throughput.

Optimizing Buffers, Execution Trees, and Engine Threads

As you'll recall from the last chapter, each Execution Tree in a Data Flow uses a different buffer profile. This means that downstream execution trees may require different columns based on what is added or subtracted in the Data Flow. You also saw that the performance of a buffer within a Data Flow is directly related to the row width of the buffer. Narrow buffers can hold more rows, and therefore the throughput will be higher.

Some columns that are used in an Execution Tree may not be needed downstream. For example, if an input column to a Lookup transformation is used only as the key match to the reference table, this column may not be needed after the Lookup and therefore should be removed before the next Execution Tree. Integration Services does a good job of providing warnings when columns exist in an Execution Tree but are not used in any downstream transformation or destination adapter. Figure 11-14 highlights the Progress tab of a package where column usage has not been optimized in the Data Flow. Each warning, highlighted with a yellow exclamation point, indicates a column that is not used by any downstream component and therefore should be removed from the pipeline after its initial use.

Figure 11-14

The warning text describes the optimization technique well:

 [DTS.Pipeline] Warning: The output column "OrderQty" (1315) on output "Aggregate Output 1" (1261) and component "Header Grouping" (1259) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance. 

Any asynchronous component whose input closes out an Execution Tree has the option of removing columns from its output. You would normally do this through the transformation's edit dialog box, but you can also do it in the Advanced Editor if the component provides an advanced properties window. For example, in the Union All transformation, you can highlight a row in the editor and remove it with the Delete key. This ensures that the column is not carried into the next Execution Tree.

A second optimization technique in this area revolves around optimizing processor utilization by making more execution threads available to the Data Flow. As was highlighted in the last chapter, increasing the EngineThreads Data Flow property to a value greater than the number of Execution Trees plus the number of Source components will ensure that Integration Services has enough threads to use. However, there may also be scenarios where an Execution Tree has so many synchronous components that the single Execution Thread allocated to it cannot handle the workload, and the result is a processing bottleneck even though the server may still have unused processors that could perform the work. Even though breaking up the Execution Tree into multiple trees will require more buffer copies, the end result of allowing a second or third Execution Thread to process the transformations may improve overall performance. To accomplish this, one option is to use a Union All transformation with a single input and output at a natural balancing point of the Execution Tree. In Figure 11-15, notice how the entire Data Flow consists of synchronous transformations, which means that the entire process has only one Execution Tree.

Figure 11-15

If you modify the Data Flow with the addition of a Union All transformation, the result would look like Figure 11-16.

Figure 11-16

The Union All was placed before the Conditional Split, so that the first three transformations can be worked on in a different thread than the Conditional Split and the last three lookups in the Data Flow. The end result in both of these scenarios is more efficiency in handling the volumes and complexity within the Data Flow.

Careful Use of Row-Based Transformations

Row-based transformations, as described in Chapter 10, are non-blocking transformations, but they interact with an outside system (for example, a database or file system) on a row-by-row basis. Compared with other non-blocking transformations, they are slower because of this nature. The other type of non-blocking transformation, streaming, can use internal cache or provide calculations using other columns or variables readily available to the Data Flow, which makes them perform very fast. Given the nature of row-based transformations, their use should be deliberate and measured.

Of course, some row-based transformations have critical functionality, so this caution needs to be balanced with data-processing requirements. For example, the Export and Import Column transformation can read and write from files to columns, which is a very valuable tool, but has the obvious overhead of the IO activity with the file system.

Another useful row-based transformation is the OLE DB Command transformation, which can use input column values and execute parameterized queries against a database, row by row. The interaction with the database, although it can be optimized, still requires overhead to process. Figure 11-17 shows a SQL Server Trace run against a database that is receiving updates from an OLE DB Command transformation.

Figure 11-17

This is only a snapshot, but when you consider the duration, reads, and writes of each statement, the aggregated impact of thousands of rows will cause Data Flow latency at the transformation.
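The statement behind a trace like this is typically a parameterized command that the OLE DB Command transformation executes once per pipeline row; a hedged sketch with assumed table and column names:

 -- Executed once for every row flowing through the transformation
 UPDATE Sales.SalesOrderDetail
 SET OrderQty = ?, UnitPrice = ?
 WHERE SalesOrderID = ? AND ProductID = ?;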

For this scenario, one alternative is to leverage set-based processing within the database. To do this, the data needs to be staged during the Data Flow, and a secondary Execute SQL task added to the Control Flow that runs the set-based update statement. The result may actually reduce the overall processing time when compared with the original OLE DB Command approach. This alternative is not meant to diminish the usefulness of the OLE DB Command but rather to provide an example of optimizing the Data Flow for higher-volume scenarios.
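A hedged sketch of what that Execute SQL task might run, assuming the Data Flow first lands the changed rows in a hypothetical stg.SalesOrderDetailUpdates table:

 -- One set-based statement replaces thousands of row-by-row commands
 UPDATE d
 SET d.OrderQty = s.OrderQty,
     d.UnitPrice = s.UnitPrice
 FROM Sales.SalesOrderDetail AS d
 INNER JOIN stg.SalesOrderDetailUpdates AS s
     ON s.SalesOrderID = d.SalesOrderID
    AND s.ProductID = d.ProductID;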

Understand Blocking Transformation Impacts

A blocking transformation requires the complete set of records cached from the input before it can release records downstream. In Chapter 10, a list of about a dozen transformations that meet this criterion was provided. The most common examples are the Sort and Aggregate transformations.

Blocking transformations are intensive because they require caching all the upstream input data, and they may also require more intensive processor usage depending on their functionality. When not enough RAM is available in the system, blocking transformations may also require temporary disk storage. You need to be aware of these limitations when you're working to optimize a Data Flow. The point of mentioning the nature of blocking transformations is not to minimize their usefulness; in some situations they are very useful and perform much better than alternative approaches. Rather, the intention is to use these transformations in the right places and to understand their resource impact.

Since sorting data is a common requirement, one optimization technique is valuable to mention. Source data that can be sorted in the adapter through an ORDER BY statement or presorted in a flat file does not require the use of a Sort transformation. As long as the data is physically sorted in the right order when coming into the Data Flow, the source adapter can be configured to indicate that the data is sorted and which columns are sorted in what order. Figure 11-18 shows the Advanced Editor of a source adapter with the Source Output folder highlighted. The first step is to set the IsSorted property to True, as seen on the right-hand properties screen.

Figure 11-18

The second requirement is to indicate which columns are sorted. To do this, open the Source Output folder and then the Output Columns subfolder. This will open the list of columns that the adapter will send out into the pipeline. To set the sort column order and direction, highlight the first column that is sorted. The example in Figure 11-19 uses the presorted CustomerID column, which is highlighted in the figure.

Figure 11-19

The SortKeyPosition should be set for the columns used in sorting. For the first column that is sorted, set the SortKeyPosition to a 1 or -1. A -1 indicates that the column is sorted in descending order. Continue to the next sorted column, if applicable, and set the value to a 2 or -2, and subsequently continue for all sorted columns.
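When the source is relational, the physical ordering itself usually comes from the extraction query rather than a Sort transformation. Here is a minimal sketch, assuming an AdventureWorks-style customer table, that matches IsSorted set to True with SortKeyPosition set to 1 on CustomerID:

 -- Presort at the source so no Sort transformation is needed in the Data Flow
 SELECT CustomerID, AccountNumber, TerritoryID
 FROM Sales.Customer
 ORDER BY CustomerID;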


