Data Manipulation Transforms


These are the transforms most people picture when they think of transformation: components that modify the data itself, changing it in fundamental ways.

Aggregate

The Aggregate transform provides several aggregation operations and a Group By feature. It supports multiple simultaneous aggregations across columns, as well as multiple aggregations on the same column, and it can be configured with one output or several. This transform is useful for Business Intelligence work because it supports the aggregation operations needed for tasks such as populating fact tables, with fine control over the grain of the aggregations. Table 20.22 provides a profile for this component.

Table 20.22. The Aggregate Transform Profile

| Property | Value | Description |
|---|---|---|
| Component Type | Transform | |
| Has Custom Editor | Yes | |
| Internal File I/O | No | |
| Output Types | Asynchronous | |
| Threading | Single | |
| Managed | No | |
| Number Outputs | 1-Many | |
| Number Inputs | 1 | |
| Requires Connection Manager | No | |
| Supports Error Routing | No | |
| Constraints | Asynchronous | Blocks. Also creates new execution trees, which help parallelism. |


Setting Up the Aggregate Transform

To create a simple aggregation, select the column on which you want to operate in the Available Input Columns window. Figure 20.32 shows the aggregate from the Aggregate.dtsx sample package. After you've selected the column, select the operation in the Operation column of the lower grid. Simple aggregations such as sum and average will only output one row. More complex aggregations that use the Group By feature might output multiple rows, based on the data.

Figure 20.32. The Aggregate Transformation Editor Advanced View with two outputs having two aggregation operations


Table 20.23 shows the operations that are available in the Aggregate transform.

Table 20.23. The Aggregation Operations

| Operation | Description |
|---|---|
| Group by | Divides datasets into groups. Columns of any data type can be used for grouping. For more information, see GROUP BY (Transact-SQL). |
| Sum | Sums the values in a column. Only columns with numeric data types can be summed. For more information, see SUM (Transact-SQL). |
| Average | Returns the average of the values in a column. Only columns with numeric data types can be averaged. For more information, see AVG (Transact-SQL). |
| Count | Returns the number of items in a group. For more information, see COUNT (Transact-SQL). |
| Count distinct | Returns the number of unique, nonnull values in a group. For more information, see Eliminating Duplicates with DISTINCT. |
| Minimum | Returns the minimum value in a group. For more information, see MIN (Transact-SQL). In contrast to the Transact-SQL MIN function, this operation can be used only with numeric, date, and time data types. |
| Maximum | Returns the maximum value in a group. For more information, see MAX (Transact-SQL). In contrast to the Transact-SQL MAX function, this operation can be used only with numeric, date, and time data types. |


To create multiple outputs, click the Advanced button and select one of the empty rows in the top grid. Type in a name for the output and then select an input column from the Available Input Columns window. You can create any number of outputs this way. The sample package has an aggregate with three outputs, as shown in Figure 20.32. One output reports how many of each product have been sold by summing the order quantity in the sales order data, grouped by product ID. The second output gives the total number of items sold per sales order ID, and the third gives the total dollar amount sold per day, grouping by the change date.
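The Group By plus Sum combination behaves like a SQL GROUP BY applied inside the data flow. The following is a minimal Python sketch of the first output's logic, summing order quantity per product ID; the row values and column layout are hypothetical, not taken from the sample package:

```python
from collections import defaultdict

# Hypothetical sales order rows: (ProductID, OrderQty, LineTotal)
rows = [
    (707, 2, 40.0),
    (708, 1, 20.0),
    (707, 3, 60.0),
]

# Group By ProductID with a Sum of OrderQty, like the first output
qty_by_product = defaultdict(int)
for product_id, order_qty, _ in rows:
    qty_by_product[product_id] += order_qty

print(dict(qty_by_product))  # {707: 5, 708: 1}
```

A simple aggregation without Group By would instead collapse the whole input into a single row, which is why those configurations emit only one output row.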

Character Map

The Character Map transform performs common character conversions typically found in multilanguage environments. Japanese has essentially two alphabets that augment the kanji character set. One, katakana, is used for things like spelling words and names from other languages that cannot be formed with kanji. The other, hiragana, is used for verb conjugations and for replacing or simplifying kanji readings. Conversion between hiragana and katakana is direct because there is a hiragana character for every katakana character.
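Because the mapping is one-to-one, the hiragana-to-katakana conversion reduces to a fixed Unicode codepoint offset: the main katakana block (U+30A1 through U+30F6) sits exactly 0x60 above the hiragana block (U+3041 through U+3096). A minimal sketch of that direct mapping:

```python
def hiragana_to_katakana(text: str) -> str:
    """Shift each hiragana character up by 0x60 to its katakana form."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

print(hiragana_to_katakana("ひらがな"))  # ヒラガナ
```

The reverse (Katakana operation) is the same shift downward; characters outside the range pass through untouched.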

Simplified Chinese is a reduced character set derived from the traditional Chinese character set. Uppercase and lowercase conversions are useful for English and other alphabetic languages, and the width operations are useful for character encodings found in Japan and other countries whose writing systems distinguish half-width and full-width forms. Table 20.24 provides a profile for this component.

Table 20.24. The Character Map Transform Profile

| Property | Value | Description |
|---|---|---|
| Component Type | Transform | |
| Has Custom Editor | Yes | |
| Internal File I/O | No | |
| Output Types | Synchronous | |
| Threading | Single | |
| Managed | No | |
| Number Outputs | 1 | |
| Number Inputs | 1 | |
| Requires Connection Manager | No | |
| Supports Error Routing | Yes | |


The operations available in the Character Map transform are noted in Table 20.25.

Table 20.25. The Character Map Operations

| Operation | Description |
|---|---|
| Byte reversal | Reverses byte order. |
| Full width | Maps half-width characters to full-width characters. |
| Half width | Maps full-width characters to half-width characters. |
| Hiragana | Maps katakana characters to hiragana characters. |
| Katakana | Maps hiragana characters to katakana characters. |
| Linguistic casing | Applies linguistic casing instead of the system rules. Linguistic casing refers to functionality provided by the Win32 API for Unicode simple case mapping of Turkic and other locales. |
| Lowercase | Converts characters to lowercase. |
| Simplified Chinese | Maps traditional Chinese characters to simplified Chinese characters. |
| Traditional Chinese | Maps simplified Chinese characters to traditional Chinese characters. |
| Uppercase | Converts characters to uppercase. |


Setting Up the Character Map Transform

To set up the Character Map transform, select the column you want to use and then the operation to perform on it. There is also an option to either perform the conversion in place, or to create a new column with the modified value. Figure 20.33 shows the Character Map Transformation Editor from the CharacterMapCopyColumn.dtsx sample package.

Figure 20.33. The Character Map Transformation Editor


Copy Column

The Copy Column transform is one of the simplest around. It simply creates a clone of a column, copying the value of the source column into the cloned column. This is useful when you need to perform destructive operations on a column, but you want to retain the original column value. For example, you can copy the column and then split it into a different path with an Aggregate transform. This is more efficient than, for example, a multicast because only the columns of interest get copied. Table 20.26 provides a profile for this component.

Table 20.26. The Copy Column Transform Profile

| Property | Value | Description |
|---|---|---|
| Component Type | Transform | |
| Has Custom Editor | Yes | |
| Internal File I/O | No | |
| Output Types | Synchronous | |
| Threading | Single | |
| Managed | No | |
| Number Outputs | 1 | |
| Number Inputs | 1 | |
| Requires Connection Manager | No | |
| Supports Error Routing | No | |


Setting Up the Copy Column Transform

To set up the Copy Column transform, simply select the column you want to copy and, if you want, change the name of the output column. Figure 20.34 shows the editor for the Copy Column transform in the CharacterMapCopyColumn.dtsx sample package.

Figure 20.34. The Copy Column Transformation Editor
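The behavior is easy to picture in code: the copy preserves the original value while a later destructive operation touches only the clone. A rough sketch with hypothetical column names:

```python
row = {"Name": "  Gorilla  "}

# Copy Column: clone the value under a new output column name
row["Name_Copy"] = row["Name"]

# A later "destructive" operation, such as trimming, modifies only the copy
row["Name_Copy"] = row["Name_Copy"].strip()

print(row)  # {'Name': '  Gorilla  ', 'Name_Copy': 'Gorilla'}
```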


Data Conversion

The Data Conversion transform performs type casting similar to the CAST function in T-SQL. Because the Data Flow Task is very type specific, it is often necessary to convert a column type from the type provided at the source to the type required at the destination. A good example of this is data retrieved from Excel files. If you use the Import/Export Wizard to export data from an Excel file, the Data Flow Task will likely contain a Data Conversion transform because Excel supports so few data types. Table 20.27 provides the profile for this component.

Table 20.27. The Data Conversion Transform Profile

| Property | Value | Description |
|---|---|---|
| Component Type | Transform | |
| Has Custom Editor | Yes | |
| Internal File I/O | No | |
| Output Types | Synchronous | |
| Threading | Single | |
| Managed | No | |
| Number Outputs | 1 | |
| Number Inputs | 1 | |
| Requires Connection Manager | No | |
| Supports Error Routing | Yes | |
| Constraints | Type conversions expensive | Avoid this component when you can; prefer changing the column type at the source. |


Setting Up the Data Conversion Transform

To set up the Data Conversion transform, select the columns whose types you want to convert and, in the bottom grid, select the type to which each column should be converted. You can also change the name of the output column in the Output Alias column of the grid. If you change the output column name to be the same as the corresponding input column name, it overwrites the column; otherwise, a new column is created with the new name. Figure 20.35 shows the Data Conversion Transformation Editor.

Figure 20.35. The Data Conversion Transformation Editor
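Conceptually, the transform applies a cast per row, and with error routing enabled, rows that fail conversion are diverted to the error output instead of failing the task. A rough sketch of that behavior, not the SSIS API:

```python
good, errors = [], []
for value in ["12", "7", "twelve"]:
    try:
        good.append(int(value))   # the cast, like DT_WSTR -> DT_I4
    except ValueError:
        errors.append(value)      # routed to the error output instead

print(good, errors)  # [12, 7] ['twelve']
```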


OLE DB Command

The OLE DB Command transform is useful when you want to drive behavior with tables. For example, you could execute a stored procedure for every row in the input set. Table 20.28 provides the profile for this component.

Table 20.28. The OLE DB Command Transform Profile

| Property | Value | Description |
|---|---|---|
| Component Type | Transform | |
| Has Custom Editor | No | |
| Internal File I/O | No | |
| Output Types | Synchronous | |
| Threading | Single | |
| Managed | No | |
| Number Outputs | 1 | |
| Number Inputs | 1 | |
| Requires Connection Manager | Yes | |
| Supports Error Routing | Yes | |
| Constraints | Expensive row-based operation | The OLE DB Command transform performs a potentially expensive query for every row. Although useful in some circumstances, it should be avoided where performance is a concern. |


Setting Up the Component

The setup for this component can be a little tricky. It doesn't have its own custom editor, and it requires an input with columns that you'll likely map as parameters to the OLE DB Command (SQL query). You must first create a connection to the database that the transform will work against, and then select that connection manager in the Connection Manager column.

Next, type in the SqlCommand that you want to execute for every row in the SqlCommand property on the Component Properties tab, as shown in Figure 20.36.

Figure 20.36. Setting the SQL query to be executed every row


Notice that the query has two question mark parameter markers. Go to the Column Mappings tab and click the Refresh button. The Input and Destination columns should be updated to have a destination column for each parameter marker. Drag the input columns to the parameter destination columns, as shown in Figure 20.37.

Figure 20.37. Creating the parameter mappings


If there is still an error message in the editor after creating the mappings, click the Refresh button again and it should go away.
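The per-row execution pattern looks like the following sketch, which uses Python's sqlite3 in place of an OLE DB connection; the table and column names are hypothetical. Like OLE DB, sqlite3 uses ? parameter markers, so the mapping of input columns to parameters is directly analogous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# The transform runs the command once per input row,
# mapping input columns to the ? parameter markers in order.
input_rows = [(12.5, 1), (22.5, 2)]
for new_price, product_id in input_rows:
    conn.execute("UPDATE product SET price = ? WHERE id = ?",
                 (new_price, product_id))

print(conn.execute("SELECT id, price FROM product ORDER BY id").fetchall())
# [(1, 12.5), (2, 22.5)]
```

The per-row round trip is exactly what makes the transform expensive on large inputs, as noted in the profile above.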

Sort

The Sort transform provides a way to order rows on one or more columns, and it can also remove duplicate rows. Sort is an important transform because it prepares flows for other transforms, such as Merge and Merge Join. Table 20.29 provides the profile for this component.

Table 20.29. The Sort Transform Profile

| Property | Value | Description |
|---|---|---|
| Component Type | Transform | |
| Has Custom Editor | Yes | |
| Internal File I/O | Possibly | When spooling to disk. |
| Output Types | Asynchronous | |
| Threading | Multiple | |
| Managed | No | |
| Number Outputs | 1 | |
| Number Inputs | 1 | |
| Requires Connection Manager | No | |
| Supports Error Routing | No | |
| Constraints | Memory | Constrained by memory; the sort will spool to disk if it hits memory limits. Sorting at the source server is better when possible. |


Setting Up the Sort Transform

The Sort Transformation Editor provides a list of columns that you can sort on as well as those you want to pass through the transform. Figure 20.38 shows the Sort Transformation Editor in the Merge.dtsx sample package.

Figure 20.38. The Sort Transformation Editor


In Figure 20.38, the rows are sorted by the Genus, Name, and Species columns, all in ascending order. If you want to remove duplicates, select the Remove Rows with Duplicate Sort Values check box at the bottom of the editor. The Sort transform determines duplicates by comparing only the sort columns. Rows that differ only in nonsorted columns are still removed if their sort column values match.
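The duplicate-removal rule can be sketched as follows; the row contents are hypothetical. Note that row "B" is dropped even though its Tag column differs from "A", because only the sort columns are compared:

```python
rows = [
    {"Genus": "Panthera", "Species": "leo", "Tag": "A"},
    {"Genus": "Panthera", "Species": "leo", "Tag": "B"},  # duplicate sort key
    {"Genus": "Panthera", "Species": "tigris", "Tag": "C"},
]

sort_cols = ("Genus", "Species")
seen, deduped = set(), []
for row in sorted(rows, key=lambda r: tuple(r[c] for c in sort_cols)):
    key = tuple(row[c] for c in sort_cols)
    if key not in seen:          # only the sort columns decide duplicates
        seen.add(key)
        deduped.append(row)

print([r["Tag"] for r in deduped])  # ['A', 'C']
```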



Microsoft SQL Server 2005 Integration Services
ISBN: 0672327813
Year: 2006
Pages: 200
Authors: Kirk Haselden