Throughout the development of the Data Flow Task, there were certain principles that guided its design and implementation. Following are a few of those principles.
Low-Cost Additive Transformations
Before it was officially named the "Data Flow Task," it was called the Pipeline Task, after the way data conceptually flows through transformations that modify it. As data flows through the Data Flow Task, it visits each transformation one row at a time. For certain kinds of transformations, modifying data row by row is significantly more efficient than performing set operations. There is a one-to-one relationship between rows and transformations; in other words, every transformation touches every row exactly once. By combining primitive transformations, you can build complex additive transformations in limitless ways without accumulating extra per-transformation overhead. Set-based operations lose this advantage when pipelined, because there is often no way to eliminate the per-operation cost of revisiting every row.
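As a rough illustration of this idea (a hypothetical sketch in Python, not the actual Data Flow Task API), primitive per-row transformations can be written as generators and chained together, so each row is visited exactly once by each transformation as it flows through the composed pipeline:

```python
# Each function below is a primitive per-row transformation; chaining
# them builds a more complex additive transformation. The names and
# fields here are invented for illustration.

def trim(rows):
    for row in rows:                      # visits every row exactly once
        yield {k: v.strip() for k, v in row.items()}

def upper_name(rows):
    for row in rows:
        row = dict(row)
        row["name"] = row["name"].upper()
        yield row

def add_flag(rows):
    for row in rows:
        row = dict(row)
        row["flagged"] = len(row["name"]) > 3
        yield row

source = [{"name": "  ann "}, {"name": " bo "}]

# Primitives compose into one additive pipeline; no per-transform
# materialization of the whole row set is needed.
pipeline = add_flag(upper_name(trim(source)))
print(list(pipeline))
# → [{'name': 'ANN', 'flagged': False}, {'name': 'BO', 'flagged': False}]
```

A set-based equivalent would materialize and rescan the full row set once per operation, which is exactly the per-transform cost the pipeline model avoids.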
High Performance: No Copying, No Kidding
Copying and moving memory is extremely expensive, and over time the cost adds up substantially. The Data Flow Task is built on the philosophy that, unless absolutely necessary, the memory that holds data should never be moved after it is populated until it is time to output the data to the destination. Transformations should happen in place, data should be physically copied only when absolutely necessary, and columns that are never transformed or referenced should never be processed. This guiding design principle is one of the reasons the Data Flow Task performs so well. So, although the analogy of a pipeline is useful for instructional purposes, it is only partially correct: rather than the data flowing through the pipeline, it is more accurate to say that the pipeline moves over the data.
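The "pipeline moves over the data" idea can be sketched in a few lines of Python, assuming a simplified in-memory buffer model (the buffer layout and transform names here are invented, not the real engine's internals): each transformation walks the same shared buffer and modifies values in place, so the data itself is never copied or moved.

```python
# One shared buffer of rows; transformations mutate it in place rather
# than copying rows from one stage to the next.
buffer = [["  ann ", 10], [" bo ", 20]]

def trim_col0(buf):
    for row in buf:
        row[0] = row[0].strip()           # in-place edit, no row copy

def scale_col1(buf):
    for row in buf:
        row[1] *= 2                       # in-place edit, no row copy

# The pipeline of transforms passes over the stationary data.
for transform in (trim_col0, scale_col1):
    transform(buffer)

print(buffer)  # → [['ann', 20], ['bo', 40]]
```

The same `buffer` object holds the data throughout; only the transformations move.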
Data Sandboxing
One of the Data Flow Task's strengths is the diversity of data formats it supports. To support both traditional ETL and emerging integration scenarios, it is important that the Data Flow Task be able to access data in multiple formats and to bring that diverse data together in one place where it can be transformed in a common way at the same time. Data sandboxing is a term for describing what happens to data that the Data Flow Task consumes from diversely formatted sources. All the data is retrieved from the source and converted to a columnar data format, making it consistent with the data consumed from other formats. For example, data from an Extensible Markup Language (XML) file can be processed uniformly with data consumed simultaneously from a flat file, a database table, and even a binary file. By supporting multiple diversely typed data sources simultaneously, the Data Flow Task achieves true integration "in the pipe."
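A small sketch of the sandboxing idea, using Python's standard library (the field names and sample data are invented for illustration): records arriving from a CSV source and an XML source are both normalized into one common columnar layout, after which downstream transformations no longer care where the data came from.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two diversely formatted sources carrying equivalent records.
csv_text = "id,name\n1,ann\n2,bo\n"
xml_text = "<rows><row id='3' name='cy'/></rows>"

# The common columnar "sandbox" both sources are converted into.
columns = {"id": [], "name": []}

for rec in csv.DictReader(io.StringIO(csv_text)):
    columns["id"].append(int(rec["id"]))
    columns["name"].append(rec["name"])

for el in ET.fromstring(xml_text).iter("row"):
    columns["id"].append(int(el.get("id")))
    columns["name"].append(el.get("name"))

print(columns)  # → {'id': [1, 2, 3], 'name': ['ann', 'bo', 'cy']}
```

Once in columnar form, a single transformation can process all three rows uniformly regardless of origin.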
Extensibility
Every integration problem is unique, and there are times when the components in Integration Services aren't sufficient to build your solution. Therefore, the Data Flow Task is highly extensible in two basic forms: custom components and scripting. If you need a component for accessing data in a format that isn't currently supported, you can write your own source adapter. Likewise, if you have business logic for transforming data embedded in custom code, you can write a custom transformation that references that code to do the transformation in the Data Flow Task. Scripting is another way to extend the Data Flow Task, allowing you to write quick and simple data flow logic without having to concern yourself with a fully compiled and installed component. Extensibility makes the Data Flow Task flexible and future-proof because it can be modified to work with data in unforeseen ways.
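The two extensibility styles can be sketched as follows (a minimal invented interface, not the real Integration Services component model): a custom component implements a base contract as a class, while script-style logic is supplied as a bare function, and the pipeline accepts either.

```python
class Transform:
    """Base contract a hypothetical custom component would implement."""
    def process(self, row):
        raise NotImplementedError

class UpperName(Transform):
    """A custom component: compiled-style, installable, reusable."""
    def process(self, row):
        row["name"] = row["name"].upper()
        return row

def script_transform(row):
    """Script-style logic: quick and simple, no component packaging."""
    row["length"] = len(row["name"])
    return row

def run(rows, components):
    # Apply each component to each row; accept either a Transform
    # component or a plain script function.
    for row in rows:
        for comp in components:
            row = comp.process(row) if isinstance(comp, Transform) else comp(row)
        yield row

out = list(run([{"name": "ann"}], [UpperName(), script_transform]))
print(out)  # → [{'name': 'ANN', 'length': 3}]
```

The pipeline itself is unchanged whichever extension style is used, which is what makes the model future-proof.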