Scaling Out


You will no doubt already be familiar with the term scaling out, and the concept can of course be applied to SSIS systems. Although there are no magic switches here, several features of SSIS lend themselves to scaling out, and you will see how they can be applied. Following the chapter's combined theme, you will also learn how these strategies benefit reliability.

Scale Out Memory Pressures

By design, the pipeline processing takes place almost exclusively in memory. This makes for faster data movement and transformations, and a design goal should always be to make a single pass over your data. In this way, you eliminate the time-consuming staging and the costs of reading and writing the same data several times. The potential disadvantage of this is that for large amounts of data and complicated sets of transformations, you need a large amount of memory, and it needs to be the right type of memory for optimum performance.

The virtual memory space available to a process on 32-bit Windows operating systems is limited to 2 GB by default. Although you can increase this to 3 GB through the /3GB switch in the boot.ini file, it still often falls short of the total memory available today. Because the limit is applied per process, which for your purposes means a single package during execution, partitioning a process across multiple packages ensures that each of the smaller packages runs as its own process and can therefore take advantage of the full 2–3 GB of virtual address space independently. The most common method of chaining packages together to form a consolidated process is the Execute Package task, in which case it is imperative that the child package execute out of process: you must set the task's ExecuteOutOfProcess property to true for this to happen.
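If you are building or configuring packages programmatically rather than in the designer, the same property can be set through the SSIS runtime API. The following C# sketch shows one way to do this; the package path and task name are illustrative assumptions only, and the TaskHost Properties collection is used so that no task-specific assembly reference is required.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class OutOfProcessDemo
{
    static void Main()
    {
        Application app = new Application();

        // Load a parent package that contains an Execute Package task.
        // The path and task name below are placeholders for this sketch.
        Package parent = app.LoadPackage(@"C:\Packages\ParentPackage.dtsx", null);
        TaskHost childTask = (TaskHost)parent.Executables["Execute Child Package"];

        // Run the child package in its own process, so it receives its own
        // 2-3 GB of virtual address space on a 32-bit system.
        childTask.Properties["ExecuteOutOfProcess"].SetValue(childTask, true);

        DTSExecResult result = parent.Execute();
        Console.WriteLine("Parent package result: " + result);
    }
}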

It is worth noting that unlike the SQL Server database engine, SSIS does not support Advanced Windowing Extensions (AWE), so scaling out to multiple packages across processes is the only way to take advantage of larger amounts of memory. If you have a very large memory requirement, then you should consider a 64-bit system for hosting these processes.

For a more detailed explanation of how SSIS uses memory, and of the in-memory buffer structure used to move data through the pipeline, see Chapters 10 and 11.

Scale Out by Staging Data

Staging of data is very much on the decline; after all, why incur the cost of writing to and reading from a staging area when you can perform all the processing in memory with a single pass over the data? With the inclusion of the Dimension and Partition Processing Destinations, you no longer need a physical data source to populate your SQL Server Analysis Services (SSAS) cubes, which is yet another reason for the decline of staging, or even of the traditional data warehouse. Although this is still a contentious subject for many, the issue here is simply this: Should you use staging during the SSIS processing flow? Even though it may not be technically required to achieve the overall goal, there are still some very good reasons why you may want to, from both the scalability and reliability perspectives.

For this discussion, staging could also be described as partitioning. The process could be implemented within a single Data Flow, but for one or more of the reasons described below, it may be subdivided into multiple Data Flows. These smaller units could live within a single package, or they may be distributed across several, as discussed below. Because the staged data will be used only by another Data Flow and does not need to be accessed directly through regular interfaces, the ideal choices for the source and destination are the raw file adapters. This could be described as vertical partitioning, but you could also overlay a level of horizontal partitioning, such as by executing multiple instances of a package in parallel.

Raw file adapters allow you to persist the native buffer structures to disk. The in-memory buffer structure is simply dumped to and from the file, without any of the translation or processing found in all other adapters, making these the fastest adapters for staging data. You can take advantage of this to artificially force a memory checkpoint to be written to disk, allowing you to span multiple Data Flow tasks and packages. Staging environments and raw files are also discussed in Chapter 11, but some specific examples are illustrated here.

The key use for raw files is that by splitting a Data Flow into at least two individual tasks, the primary task can end with a raw file destination and the secondary task can begin with a raw file source. The buffer structure is exactly the same on both sides, so the split can be considered irrelevant from an overall flow perspective, yet the data is preserved perfectly between the two tasks.

Data Flow Restart

As covered previously, the checkpoint feature provides the ability to restart a package from the point of failure, but it does not extend inside a Data Flow. However, if you divide a Data Flow into two or more individual tasks, each linked together by raw files, you immediately gain the ability to restart the combined flow. Through the correct use of native checkpoints at the (Data Flow) task level, this process becomes very simple to manage.

The choice of where to divide a flow is subjective, but two common choices would be immediately after extraction and immediately after transformation, prior to load.

The post-extraction point offers several key benefits. Many source systems are remote, so extraction may take place over slow or unreliable network links and can be the slowest part of the process. By staging immediately after the extraction, you do not have to repeat this slow step in the event of a failure and restart. There may also be an impact on the source system during the extraction, which very often must take place during a fixed time window when utilization is low. In that case, if a failure occurs it may not be acceptable to repeat the extract until the next window, usually the following night.

Staging post-transformation simply ensures that the transformation is not wasted if the destination system is unavailable.

You may wish to include additional staging points mid-transformation. These would usually be located after particularly expensive operations and before those that you suspect are at risk of failing. Although you can plan for problems, and the use of error outputs described above should allow you to handle many situations, you should still expect the unexpected and plan a staging point with this in mind. The goal remains the ability to restart as close to the failure point as possible and to reduce the cost of any reprocessing required.

Figure 9-34 shows an example data load process that you may wish to partition into multiple tasks to take advantage of Data Flow restart.

Figure 9-34

For this scenario, the OLE DB Source connects to a remote SQL Server over a slow network link. Due to the time taken for the extraction and the impact on the source system, it is not acceptable to repeat the extract if the subsequent processing fails for any reason, so you choose to stage the data through a raw file immediately after the source component. The resulting layout for this first Data Flow task is shown in Figure 9-35.

Figure 9-35

The flat file source data is accessed across the LAN, and it needs to be captured before it is overwritten. The sort operation is also particularly expensive due to the volume of data. For this reason, you choose to stage the data after the sort is complete. The resulting Data Flow is shown in Figure 9-36.

Figure 9-36

Finally, you use a third Data Flow task to consume the two staging raw files and complete the process. This is shown in Figure 9-37.

Figure 9-37

Following this example, a single Data Flow has been divided into three separate tasks. For the purposes of restarting a failed process, you would use a single package and implement checkpoints on each of the three Data Flow tasks.
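As a rough guide to the settings involved, the following C# sketch configures such a package through the runtime API: checkpoints are enabled at the package level, and each task is set to fail the package so that the checkpoint file records the restart point. The file paths and package name are assumptions for illustration; the same properties can of course be set directly in the designer.

using Microsoft.SqlServer.Dts.Runtime;

class CheckpointDemo
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\Packages\StagedLoad.dtsx", null);

        // Package-level checkpoint settings: write progress to a checkpoint
        // file, and restart from that file if one exists from a failed run.
        pkg.SaveCheckpoints = true;
        pkg.CheckpointFileName = @"C:\Packages\StagedLoad.chk";
        pkg.CheckpointUsage = DTSCheckpointUsage.IfExists;

        // Each of the three Data Flow tasks must fail the package for the
        // checkpoint file to record it as the restart point.
        foreach (Executable exec in pkg.Executables)
        {
            TaskHost taskHost = exec as TaskHost;
            if (taskHost != null)
            {
                taskHost.FailPackageOnFailure = true;
            }
        }

        app.SaveToXml(@"C:\Packages\StagedLoad.dtsx", pkg, null);
    }
}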

Scale across Machines

In a similar manner to the Data Flow restart just discussed, you can also use raw file adapters to partition the Data Flow. By separating tasks into different packages, you can run the work across machines. This may be advantageous when a specific machine has capabilities or access not shared by the others. Perhaps the machine capable of performing the extract sits in a different network segment from the machine best suited to processing the data, and direct access between the main processing machine and the source is unavailable. The extract could be performed on the first machine, and the main processing machine would then retrieve the raw data and continue the process. Such situations are usually the result of organizational restrictions rather than decisions driven by the design architecture.

The more compelling story for scaling across machines is to use horizontal partitioning. A simple scenario would utilize two packages. The first package extracts data from the source system and, through a Conditional Split, produces two or more exclusive subsets of the data, writing each to an individual raw file. Each raw file therefore contains some of the rows from the extract, as determined by the expressions used in the Conditional Split. The most common horizontal partitioning scheme is time-based, but any method could be used. The goal is to subdivide the total extract into manageable chunks; a sequential row number is ideal for this, whether it is already available in the source, applied during extraction with the T-SQL ROW_NUMBER function, or applied in the pipeline with a Row Number transformation whose output drives the split. Alternatively, the numbering and splitting could both be delivered through a Script Component, as sketched after the note below.

Note

A Row Number transformation is freely available from www.sqlis.com and www.konesans.com.
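To make the splitting logic concrete, the fragment below sketches the numbering-and-routing idea in isolation: rows are given a sequential number and assigned round-robin to one of a fixed number of partitions, each of which would feed its own raw file destination. It is written in C# purely for illustration (a 2005 Script Component itself would be written in Visual Basic .NET) and is not tied to any particular transformation.

using System;

class PartitioningSketch
{
    const int PartitionCount = 2;   // one raw file per partition
    static long rowNumber = 0;

    // Returns the zero-based partition (output) that the next row is sent to.
    static int GetPartition()
    {
        rowNumber++;
        return (int)(rowNumber % PartitionCount);
    }

    static void Main()
    {
        int[] counts = new int[PartitionCount];
        for (int i = 0; i < 10; i++)
        {
            counts[GetPartition()]++;
        }
        Console.WriteLine("Partition 0: " + counts[0] + ", Partition 1: " + counts[1]);
    }
}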

With a data set sorted on the partitioning key, the raw files will tend to be written in sequence, each one completing before rows start arriving for the next. While this may seem uneven and inefficient, it is assumed that the delay between completion of the first and final destinations is inconsequential compared to the savings achieved by the subsequent parallel processing.

Once the partitioned raw files are complete, they are consumed by the second package, which performs the transformation and load aspects of the processing. Each file is processed by an instance of the package running on a separate machine. In this way, you can scale across machines and perform expensive transformations in parallel. For a smaller-scale implementation, where the previously described 32-bit virtual memory constraints apply, you could run the instances in parallel on a single machine, with each package instance executing as a separate process and therefore receiving its own allocation of virtual memory space.
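One straightforward way to achieve this on a single machine is to launch each instance through dtexec.exe, since every dtexec invocation is a separate process. The C# sketch below starts two instances of the second package, passing a different raw file to each; the package path and the User::RawFilePath variable are assumptions for this example and would be replaced by whatever mechanism your package uses to locate its raw file.

using System.Diagnostics;

class ParallelRunner
{
    static void Main()
    {
        string[] rawFiles =
        {
            @"C:\Staging\Extract_Part1.raw",
            @"C:\Staging\Extract_Part2.raw"
        };

        Process[] processes = new Process[rawFiles.Length];

        for (int i = 0; i < rawFiles.Length; i++)
        {
            // Each dtexec invocation is its own process, so each package
            // instance receives its own virtual address space.
            string args = "/FILE \"C:\\Packages\\TransformLoad.dtsx\" " +
                          "/SET \\Package.Variables[User::RawFilePath].Properties[Value];" +
                          rawFiles[i];
            processes[i] = Process.Start("dtexec.exe", args);
        }

        // Wait for all instances to complete.
        foreach (Process p in processes)
        {
            p.WaitForExit();
        }
    }
}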

For destinations that are themselves partitioned, such as a SQL Server data warehouse using table partitions or a partitioned view model, or Analysis Services partitions, it may also make sense to match the partitioning scheme to that of the destination, such that each package addresses a single table or partition.

Figure 9-38 shows a sample package that for the purposes of this example you will partition horizontally.

Figure 9-38

In this scenario, the Fuzzy Lookup is processing names against a very large reference set, and this is taking too long. To introduce some parallel processing, you decide to partition on the first letter of a name field. For this example, it is assumed that a fuzzy match will always begin with the same letter as the input name, although in a real-world scenario this may not hold. You use a Conditional Split transformation to produce two raw files, one partitioned from A to M and the other from N to Z. This primer package is illustrated in Figure 9-39.

Figure 9-39

Ideally you would then have two instances of the second package running in parallel on two separate machines. However, you need to ensure that the lookup reference data is filtered on name to match each raw file. Not all pipeline component properties are exposed as expressions that would let you control them dynamically, so you would need two versions of the package, identical except for a different Reference table name property in the Fuzzy Lookup, as shown in Figure 9-40. In preparation, you would create two views, one covering names A to M and the other names N to Z, and each package version would use the view that matches the raw file it processes.

Figure 9-40

For any design that uses raw files, the additional I/O cost must be evaluated against the processing performance gains, but for large-scale implementations they offer a convenient way of ensuring consistency within the overall flow, with none of the translation penalty associated with other storage formats.


