The following is a list of a few key Data Flow Task-specific terms.
Pipeline component Pipeline components are all the objects that you drop into the Data Flow Task. These are the boxes. For example, you might say that a Data Flow Task has four pipeline components: a source adapter, two transforms, and a destination adapter.
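Although you normally create pipeline components by dragging them onto the design surface, the same objects can be created through the Integration Services object model. The following C# sketch adds one component to a new Data Flow Task; it assumes the SQL Server 2008 interface names (the "100" suffix) and the OLE DB source adapter's class ID, both of which vary between versions.

```csharp
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

class AddPipelineComponent
{
    static void Main()
    {
        // Add a Data Flow Task to a new package and get the pipeline it hosts.
        Package package = new Package();
        TaskHost host = (TaskHost)package.Executables.Add("STOCK:PipelineTask");
        MainPipe pipeline = (MainPipe)host.InnerObject;

        // Each call to ComponentMetaDataCollection.New() adds one pipeline
        // component -- one box -- to the Data Flow Task.
        IDTSComponentMetaData100 source = pipeline.ComponentMetaDataCollection.New();
        source.ComponentClassID = "DTSAdapter.OleDbSource"; // assumed class ID; version dependent
        CManagedComponentWrapper designTime = source.Instantiate();
        designTime.ProvideComponentProperties();
    }
}
```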
Data flow buffer A data flow buffer is a bit of memory that holds data as it "moves through" the transformation pipeline. There is also some metadata associated with the buffer that describes it.
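To get a feel for what a buffer looks like from a component's point of view, here is a minimal sketch of the ProcessInput method of a hypothetical managed transform. Component registration and metadata setup are omitted, and the hard-coded column index is an assumption made for brevity; a real component would resolve column positions from the buffer's metadata.

```csharp
using Microsoft.SqlServer.Dts.Pipeline;

public class UppercaseCity : PipelineComponent
{
    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        const int cityColumn = 0; // assumed column position for this sketch

        // The engine hands the component one buffer at a time; the component
        // walks the rows the buffer currently holds and changes them in place.
        while (buffer.NextRow())
        {
            string city = buffer.GetString(cityColumn);
            buffer.SetString(cityColumn, city.ToUpperInvariant());
        }
    }
}
```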
Source adapter A source adapter is a pipeline component that is responsible for retrieving data from a given source and placing it in the data flow to be processed. Source adapters understand the specific data source. For example, there is a Flat File Source Adapter that knows how to read flat files. There is also an OLE DB Source Adapter that knows how to read data from tables in databases and so forth. Source adapters also know how to place data in the data flow buffers. The source adapters convert data from the format in which it resides at the source into the columnar format of the buffer.
Transform Transforms are pipeline components that manipulate the data in the buffers in some way. They can change the data, change its order, route it to downstream transforms, or simply use it as input for generating new data.
Path Paths connect pipeline components. Pipeline components expose what are called inputs and outputs. Essentially, these are ports where transforms are able to consume (input) data or generate (output) data. Paths are the lines that connect the outputs to the inputs.
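In the object model, a path is an object you create from the pipeline's PathCollection and then attach between an upstream output and a downstream input. A minimal sketch, assuming the pipeline and source variables from the earlier example plus a destination component created the same way:

```csharp
// Connect the source adapter's first output to the destination's first input.
// AttachPathAndPropagateNotifications also notifies the downstream component
// about the columns that are now available to it.
IDTSPath100 path = pipeline.PathCollection.New();
path.AttachPathAndPropagateNotifications(
    source.OutputCollection[0],
    destination.InputCollection[0]);
```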
Input Pipeline components provide inputs as ports where a path can be connected and through which they can consume data from other pipeline components.
Output Pipeline components provide outputs that can be connected to an input on another pipeline component. Outputs contain metadata that describes the data columns that the pipeline component generates. Outputs can be synchronous or asynchronous. The simple definition for a synchronous output is an output that produces a row for every row on an input on the same transform. Although this isn't completely accurate, it's close enough for now. Later, Chapter 23, "Data Flow Task Internals and Tuning," revisits outputs and provides the technically correct definition. But for now, just think of a synchronous output as being "in sync" with an input. For example, the Data Conversion transform has one input and one output. Every row that enters the input generates a row on the output. The data is modified in the row as it "passes through" the transform.
Asynchronous output Asynchronous outputs can generate the same number of rows as, fewer rows than, or more rows than enter the transform through one or more inputs. The Aggregate transform, for example, generates only one row on its asynchronous output for some operations, no matter how many rows enter on its input. If the Aggregate transform is calculating the average of a column on its input, for instance, it generates only one row, containing the average, after it has received all the data on its input.
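In a custom managed component, the difference between the two kinds of output comes down to a single property: a synchronous output points its SynchronousInputID at an input's ID, and an asynchronous output leaves it at zero. The following sketch of ProvideComponentProperties for a hypothetical transform illustrates both; it again assumes the SQL Server 2008 interface names.

```csharp
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class SyncAndAsyncOutputs : PipelineComponent
{
    public override void ProvideComponentProperties()
    {
        IDTSInput100 input = ComponentMetaData.InputCollection.New();
        input.Name = "Input";

        // Synchronous output: rows on this output are the same rows that
        // arrive on the input, so it is tied to the input's ID.
        IDTSOutput100 syncOutput = ComponentMetaData.OutputCollection.New();
        syncOutput.Name = "Sync Output";
        syncOutput.SynchronousInputID = input.ID;

        // Asynchronous output: a SynchronousInputID of 0 tells the engine that
        // this output generates new rows in its own buffers, as the Aggregate
        // transform does.
        IDTSOutput100 asyncOutput = ComponentMetaData.OutputCollection.New();
        asyncOutput.Name = "Async Output";
        asyncOutput.SynchronousInputID = 0;
    }
}
```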
Error output You can configure a transform to divert rows with errors to a different output, the "error output." This is useful when you want to better understand what is wrong with the data and possibly even fix the problem in the Data Flow Task and reroute the row back into main processing. A typical case is a Lookup transform that fails to find a match. You can route the error row to a Fuzzy Lookup transform, which has algorithms for matching different but similar values. If the fuzzy lookup succeeds, you can reroute the row with the updated key back into the main flow with the other rows.
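The designer exposes this choice in the Configure Error Output dialog box, but behind it is simply a row disposition setting in the object model. The following sketch redirects error rows for a component's input; it assumes a component metadata variable for a transform that supports an error output, and some components expose the dispositions on individual columns rather than on the input itself.

```csharp
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

static class ErrorOutputSketch
{
    // 'component' is assumed to be the metadata for a transform that has an
    // error output, obtained as in the earlier examples.
    public static void RedirectErrorRows(IDTSComponentMetaData100 component)
    {
        IDTSInput100 input = component.InputCollection[0];

        // Instead of failing the component, send rows that cause errors down
        // the error output so they can be repaired and rerouted.
        input.ErrorRowDisposition = DTSRowDisposition.RD_RedirectRow;
    }
}
```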
IDs Every object in the Data Flow Task has a unique integer ID that aids the user in identifying objects and making the connection between them. The data flow uses IDs whenever it accesses objects, including transforms, adapters, paths, and columns.
LineageIDs LineageIDs are special IDs on a column that identify the ID of the data's originating source column. Using LineageIDs, it is possible to trace the path of the data in a particular column back to its originating source column. Because asynchronous outputs essentially create new data that might bear only a passing resemblance to the data on an input, asynchronous outputs sever LineageID paths. The Data Flow Task always creates a new LineageID for every column on an asynchronous output because there is no logical or physical connection between the asynchronous output and the input.
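The following sketch walks the columns arriving at a component's input and prints each one's LineageID, which is the number you would follow upstream to find the originating source column. It assumes a component metadata variable for a component that already has a path attached to its input.

```csharp
using System;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

static class LineageSketch
{
    public static void PrintLineageIDs(IDTSComponentMetaData100 component)
    {
        IDTSInput100 input = component.InputCollection[0];
        IDTSVirtualInput100 virtualInput = input.GetVirtualInput();

        // Every column flowing into this input carries the LineageID of the
        // upstream output column it came from.
        foreach (IDTSVirtualInputColumn100 column in virtualInput.VirtualInputColumnCollection)
        {
            Console.WriteLine("{0}: LineageID = {1}", column.Name, column.LineageID);
        }
    }
}
```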
Advanced Editor The Advanced Editor is the editor you use for pipeline components that have no dedicated designer, but it can be used to modify or scrutinize most pipeline components. The Advanced Editor comes in really handy when you're trying to understand the Data Flow Task diagnostic output because it provides access to the detailed properties on component inputs and outputs, information that is not available anywhere else in the designer.
Truncation The Data Flow Task classifies all errors as one of two types: general errors or truncations. General errors can result from a host of operations. Truncations happen as a result of a specific class of operations having to do with types and indicate that data might be lost. For example, attempting to convert a Double to an Integer or placing a seven-character string value into a five-character column can result in lost data. The Data Flow Task treats truncations as a special case because, in some situations, the results might be considered desirable or harmless.
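Because truncations might be harmless, components let you decide their fate separately from general errors. The following sketch treats truncations as ignorable while still failing the component on general errors; it assumes a component metadata variable for a Flat File Source whose output columns expose both dispositions.

```csharp
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

static class TruncationSketch
{
    public static void IgnoreTruncations(IDTSComponentMetaData100 flatFileSource)
    {
        IDTSOutput100 output = flatFileSource.OutputCollection[0];

        foreach (IDTSOutputColumn100 column in output.OutputColumnCollection)
        {
            // General errors still fail the component...
            column.ErrorRowDisposition = DTSRowDisposition.RD_FailComponent;

            // ...but truncations are ignored because losing trailing characters
            // is considered harmless for this particular load.
            column.TruncationRowDisposition = DTSRowDisposition.RD_IgnoreFailure;
        }
    }
}
```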
Side effects Side effects in the Data Flow Task result from operations that effect changes to external data. When HasSideEffects is set to TRUE, it is an indication to the Data Flow Engine that the component performs some work that the Data Flow Task cannot control or discover. HasSideEffects prevents the Data Flow Task from trimming the component on which it is set.
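A custom destination typically sets this flag when it establishes its metadata so that the engine does not trim it even when nothing is attached downstream. A minimal sketch, assuming the managed PipelineComponent base class and the SQL Server 2008 interface names; the component and its audit-logging behavior are hypothetical.

```csharp
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class AuditLogDestination : PipelineComponent
{
    public override void ProvideComponentProperties()
    {
        IDTSInput100 input = ComponentMetaData.InputCollection.New();
        input.Name = "Audit Input";

        // This hypothetical destination writes rows to an external audit log.
        // Marking it as having side effects tells the engine not to trim it
        // from the execution plan even though it has no downstream components.
        ComponentMetaData.HasSideEffects = true;
    }
}
```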