Transformations


Transformations (the term transform will be used throughout this book) are key components of the data flow that transform the data into a desired format as you move from step to step. For example, you may want a sampling of your data to be sorted and aggregated; three transforms can accomplish that task for you. The nicest thing about transforms in SSIS is that the work is all done in-memory, so it no longer requires the elaborate scripting that SQL Server 2000 DTS did. As you add a transform, the data is altered and passed down the path in the data flow. Also, because the work is done in-memory, you no longer have to create staging tables to perform most functions. When dealing with very large data sets, though, you may still choose to create staging tables.

You set up the transform by dragging it onto the data flow tab design area. Then, click the source or transform you'd like to connect it to, and drag the green arrow to the target transform or destination. If you drag the red arrow, then rows that fail to transform will be directed to that target. After you have the transform connected, you can double-click it to configure the transform.

In this chapter you will look at each of the transforms at a basic level. In the next chapter, you'll take some of the common tasks and transformations on a test drive and use them in a few real-world scenarios. There are a few transformations that are more advanced; they are described briefly here but will be covered in much more detail in Chapter 6. Those transformations include the following:

  • Slowly Changing Dimension

  • Pivot and Unpivot

  • Row Count

  • Import and Export Column

  • Term Extraction and Lookup

  • Fuzzy Grouping and Matching

  • Data Mining

  • OLE DB Command

A few of these transformations are not advanced in functionality or complexity; they just don't have a custom UI implemented in SSIS. If a custom UI hasn't been implemented for a transform, it uses the Advanced Editor.

Aggregate

The Aggregate transform allows you to aggregate data from the data flow and apply T-SQL-style aggregations such as GROUP BY, Average, Minimum, Maximum, and Count. For example, in Figure 4-13, you can see that the data is grouped together on the ProductID column and then the other two columns are summed. This produces three new columns that can be consumed down the path, or future actions can be performed on them.
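If it helps to think in T-SQL terms, the aggregation just described is roughly equivalent to a query along these lines (the table and column names here are only illustrative):

 SELECT ProductID, SUM(OrderQty) AS TotalQty, SUM(LineTotal) AS TotalSaleAmt
 FROM Sales.SalesOrderDetail
 GROUP BY ProductID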

The Aggregate transform is configured in the Aggregate Transformation Editor (see Figure 4-16). To configure it, first check the column that you wish to perform the action on. After you check the column, the input column will be filled below in the grid. Optionally, type an alias in the Output Alias column that you wish to give the column when it's outputted to the next transform or destination. For example, if the column now holds the total money per customer, you may change the name of the column that's outputted from InvoiceAmt to TotalCustomerSaleAmt. This will make it easier for you to recognize what the column is along the path of the data. The most important option is the Operation drop-down box. For this option, you can select the following:

  • Group By: Breaks the data set into groups by the column you specify

  • Average: Averages the selected column's numeric data

  • Count: Counts the records in a group

  • Count Distinct: Counts the distinct non-NULL values in a group

  • Minimum: Returns the minimum numeric value in the group

  • Maximum: Returns the maximum numeric value in the group

  • Sum: Returns sum of the selected column's numeric data in the group

Figure 4-16

You can click the Advanced tab to see the options that allow you to configure multiple outputs from the transform. After you click Advanced, you can type a new Aggregation Name to create a new output. You will then be able to check the columns you'd like to aggregate again as if it were a new transform.

In the Advanced tab, the Key Scale option sets an approximate number of keys. The option is set to Unspecified by default and optimizes the transform's cache to the appropriate level. For example, setting it to Low will optimize the transform to write 500,000 keys. Setting it to Medium will optimize it for 5,000,000 keys, and High will optimize the transform for 25,000,000 keys. You can also set the exact number of keys by using the Number of Keys option.

The Count Distinct Scale option will optionally set the amount of distinct values that can be written by the transform. The default value is unspecified, but if you set it to Low, the transform will be optimized to write 500,000 distinct values. Setting the option to Medium will set it to 5,000,000 values, and High will optimize the transform to 25,000,000.

The Auto Extend Factor specifies by what factor the transform's memory can be extended. The default option is 25%, and you can specify other settings to keep your RAM from getting away from you. The last option is the Warn On Division by Zero checkbox, which controls how division-by-zero errors are handled, such as when an average is computed and the divisor is zero. If this option is not checked, the transform will fail instead of issuing a warning.

Audit

The Audit transform allows you to add auditing data to your data flow. In the age of HIPAA and Sarbanes-Oxley (SOX) audits, you often must be able to track who inserted data into a table and when. This transform helps you with that requirement. For example, if you'd like to track which task inserted data into the table, you can add those columns to the data flow path with this transform.

The transform is easy to configure. All other columns are passed through to the path as output, and any auditing item you add will also be added to the path. Simply select the type of data you'd like to audit in the Audit Type column (shown in Figure 4-17), and then name the column that will be output to the flow. The following are some of the options available to you:

  • Execution Instance GUID: The GUID that identifies the execution instance of the package

  • PackageID: The unique ID for the package

  • PackageName: The name of the package

  • VersionID: The version GUID of the package

  • ExecutionStartTime: The time the package began

  • MachineName: The machine that the package ran on

  • UserName: The user that started the package

  • TaskName: The Data Flow task name that holds the audit task

  • TaskID: The unique identifier for the data flow task that holds the audit task

Figure 4-17

Character Map

The Character Map transform (shown in Figure 4-18) performs common character translations in the flow. This simple transform can be configured in a single tab. To do so, check the columns you wish to transform. Then, select whether you want this modified column to be added as a new column or whether you want to update the original column. You can give the column a new name under the Output Alias column. Lastly, select the operation you wish to perform on the inputted column. The available operation types are as follows:

  • Byte Reversal: Reverses the order of the bytes. For example, the data 0x1234 0x9876 becomes 0x3412 0x7698. This uses the same behavior as LCMapString with the LCMAP_BYTEREV option.

  • Full Width: Converts the half-width character type to full width.

  • Half Width: Converts the full-width character type to half width.

  • Hiragana: Converts the Katakana style of Japanese characters to Hiragana.

  • Katakana: Converts the Hiragana style of Japanese characters to Katakana.

  • Linguistic Casing: Applies the regional linguistic rules for casing.

  • Lowercase: Changes all letters in the input to lowercase.

  • Traditional Chinese: Converts the simplified Chinese characters to traditional Chinese.

  • Simplified Chinese: Converts the traditional Chinese characters to simplified Chinese.

  • Uppercase: Changes all letters in the input to uppercase.

Figure 4-18

In the screenshot in Figure 4-18, you can see that two columns are being transformed. Both columns will be converted to uppercase. For the TaskName input, a new column is added and the original is kept. The UserName column is replaced in-line.
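In T-SQL terms, the same result could be produced with the UPPER function; a rough equivalent of that configuration might look like this (the table and column names are illustrative):

 SELECT TaskName, UPPER(TaskName) AS UpperCaseTaskName, UPPER(UserName) AS UserName
 FROM dbo.AuditOutput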

Conditional Split

The Conditional Split transform is a fantastic way to add complex logic to your data flow. The transform allows you to route rows to various outputs, or paths, based on conditions. For example, you could configure the transform to send all products with a sales quantity greater than 500 down one path and products with fewer sales down another. This exact situation is shown in Figure 4-19. You can drag and drop the columns or code snippets from the trees above the condition grid. After you complete the condition, you will need to name it something logical rather than the default name of Case 1; you'll use this case name later. You can also configure the default output name, which catches any rows that do not meet any of the cases.
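Conceptually, the routing in Figure 4-19 behaves like the following T-SQL CASE expression, with each label corresponding to one output of the transform (a sketch only; the table and column names are illustrative):

 SELECT *, CASE WHEN Quantity > 500 THEN 'GoodSales' ELSE 'Default' END AS OutputName
 FROM dbo.ProductSales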

Figure 4-19

You can also conditionally read string data by using expressions like the following example, which reads the first letter of the City column:

 SUBSTRING(City,1,1) == "F" 

Once you connect the transform to the next transform in the path or destination, you'll see a pop-up dialog box that lets you select which case you wish to flow down this path, as shown in Figure 4-20. In this figure, you can see two cases. The GoodSales can go down one path, and the Default down another. After you complete the configuration of the first case, you can create a path for each case in the conditional split.

Figure 4-20

If you have two cases, and a default case to catch any other conditions, your path may look like Figure 4-21.

Figure 4-21

A much more detailed example will be given in Chapter 5.

Copy Column

The Copy Column transform is a very simple transformation that copies the output of a column to a clone of itself. This is useful if you wish to create a copy of a column before you perform some elaborate transformations. You could then keep the original value as your control subject and the copy as the modified column. To configure this transform, go to the Copy Column Transformation Editor and check the column you'd like to clone. Then assign a name to the new column.

Note

Many transforms will allow you to transform the data from a column to a new column inherently.

Data Conversion

The Data Conversion transform performs a similar function to the CONVERT or CAST functions in T-SQL. The transform is configured in the Data Conversion Transformation Editor (Figure 4-22), where you would check each column that you wished to convert and then assign what you wish to convert it to under the Data Type column. The Output Alias is the column name you want to assign to the column after it is transformed. If you don't assign it a new name, it will show as Data Conversion: ColumnName later in the data flow.
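As a rough T-SQL analogue, the transform performs the same work as CAST or CONVERT in a query such as the following (the names are illustrative):

 SELECT CAST(CustomerID AS varchar(10)) AS CustomerIDString,
        CONVERT(decimal(10,2), LineTotal) AS LineTotalDecimal
 FROM dbo.SourceTable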

Figure 4-22

Data Mining Query

The Data Mining Query transformation typically is used to fill in gaps in your data or predict a new column for your data flow. This transformation runs a data-mining query and adds the output to the data flow. It also can optionally add columns, such as the probability of a certain condition being true. A few great scenarios for this transformation would be the following:

  • You could take columns such as number of children, household income, and marital status to predict a new column that states whether the person owns a house or not.

  • You could predict what customers would want to buy based on their shopping cart items.

  • You could fill in the holes in your data where customers didn't complete all the fields in a questionnaire.

The possibilities are endless with this. This topic will be covered extensively throughout the book, and you'll learn how to configure it in Chapter 6.

Derived Column

The Derived Column transform creates a new column that is derived from the output of another column. You may wish to use this transformation, for example, to multiply the quantity of orders by the cost of the order to derive the total cost of the order, as shown in Figure 4-23. You can also use it to find out the current date or to fill in the blanks in the data by using the ISNULL function. This is one of the top five transforms that you'll find yourself using to alleviate the need for T-SQL scripting in the package.
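A rough T-SQL equivalent of the two examples just mentioned would be the following (the table and column names are illustrative); inside the transform itself you would write SSIS expressions rather than T-SQL:

 SELECT Quantity * Cost AS TotalCost,
        ISNULL(Comment, '') AS Comment
 FROM dbo.Orders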

Figure 4-23

To configure this transform, drag the column or variable into the Expression column as shown in Figure 4-23. Then add any functions to it. A list of functions can be found in the top-right corner of the Derived Column Transformation Editor. You must then specify, in the Derived Column drop-down box, whether you want the output to replace an existing column in the data flow or to create a new column. If you create a new column, specify the name in the Derived Column Name column.

Export Column

The Export Column transformation exports an image or a file from the data flow. Unlike the other transformations, the Export Column transform doesn't need a destination to create the file. To configure it, go to the Export Column Transformation Editor, which is shown in Figure 4-24. Select the column that contains the file from the Extract Column drop-down box. Select the column that contains the path and file name to send the files to in the File Path Column drop-down box.

Figure 4-24

The other options specify whether an existing file will be overwritten or appended to. The Allow Append checkbox specifies whether the output will be appended to the existing file, if one exists. If you check Force Truncate, the existing file will be overwritten if it exists. The Write BOM option specifies whether a byte-order mark is written to the file if it is a DT_NTEXT data type.

If you do not check the Append or Truncate options and the file exists, the package will fail if the error is not handled. The following error is a subset of the complete error you'd receive:

 Error: 0xC02090A6 at Data Flow Task, Export Column [61]: Opening the file "wheel_small.gif" for writing failed. The file exists and cannot be overwritten. If the AllowAppend property is FALSE and the ForceTruncate property is set to FALSE, the existence of the file will cause this failure. 

Fuzzy Grouping and Lookup

The Fuzzy Grouping transformation is a very handy transform in your Toolbox that helps you find patterns in your data that could represent duplicated data. For example, it can match a row like "Main St." with "Main Street" to point out a duplicate and condense it down to one record. The Fuzzy Lookup transformation looks at your data input and attempts to clean dirty data. It's very similar to a T-SQL join, but it operates in a fuzzy fashion. This type of transformation typically follows a Lookup transform. The Lookup transformation would attempt to find an exact match, and then you would attempt to look up the records that can't be found with the Fuzzy Lookup transform. Both of these transforms are rather large and will be discussed in much greater detail in Chapter 6.

Import Column

The Import Column transformation is the opposite of the Export Column transform. It can take image or text files stored in a directory and import them as columns into the data flow. You'll learn more about this transformation in Chapter 6.

Lookup

The Lookup transformation performs the same function as the Lookup tab of the Data Pump task in SQL Server 2000. For example, if you had a column called ZipCode in the data flow and wanted to derive two columns called State and City, you could use this transformation to perform that type of lookup if you had a mapping table somewhere to reference. As in SQL Server 2000, this should be used sparingly and is no substitute for a good join in a SQL query, as it will add some latency while each row is looked up.
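In T-SQL terms, the ZipCode example behaves like a join against the reference table (a sketch only; the table names are illustrative):

 SELECT s.*, z.City, z.State
 FROM dbo.SourceData AS s
 INNER JOIN dbo.ZipCodeReference AS z ON z.ZipCode = s.ZipCode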

To configure the Lookup transform, open the editor (Figure 4-25) and first select the Connection Manager that contains the lookup table. You can then select the table or type the query that will represent the lookup information. Instead of typing the query, you can also click Build Query to build a query in an easy interface.

Figure 4-25

Merge

The Merge transformation can merge data from two paths into a single output. The transform is useful when you wish to break out your data flow into a path that handles certain errors and then merge it back into the main data flow downstream after the errors have been handled. It's also useful if you wish to merge data from two data sources.

The transform is similar to the Union All transformation, which you'll learn about in a moment, but the Merge transform has some restrictions that may cause you to lean toward using Union All:

  • The data must be sorted before the Merge transform. You can do this by using the Sort transform prior to the merge or by specifying an ORDER BY clause in the source connection.

  • The metadata must be the same between both paths. For example, the CustomerID column can't be a numeric column in one path and a character column in another path.

  • If you have more than two paths, you should choose the Union All transformation.

To configure the transform, ensure that the data is sorted exactly the same on both paths and drag the path onto the transform. You'll be asked if the path you'd like to merge is Merge Input 1 or 2. If this is the first path you're connecting to the transform, select Merge Input 1. Next, connect the second path into the transform. The transformation will automatically configure itself, as you can see in Figure 4-26. Essentially, it will map each of the columns to the column from the other path, and you have the choice to ignore a certain column's data.
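If you choose to sort at the source rather than with a Sort transform, the source query might look like the following (the names are illustrative). Keep in mind that SSIS does not detect the sort from the query alone; you must also mark the source's output as sorted (the IsSorted and SortKeyPosition properties in the Advanced Editor):

 SELECT CustomerID, FirstName, LastName
 FROM dbo.Customer
 ORDER BY CustomerID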

Figure 4-26

Merge Join

One of the overriding themes of SSIS is that you shouldn't have to write any code to create your transformation. One case to prove this is the Merge Join transformation. This transformation will merge the output of two inputs and perform an INNER or OUTER join on the data. An example of where this would be useful is if you have an HR system in one data stream that has an EmployeeID in it and you have a payroll system in another data stream with the payments data. You could merge the two data inputs together and output the employee's name from the HR system and paycheck data from the payroll system into a single path. You can see a variation of this in Figure 4-27, where the employee's name is merged from one system and hire date from another system.
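The result is conceptually the same as a T-SQL join between the two systems, something along these lines (the table and column names are illustrative):

 SELECT hr.EmployeeName, hr.HireDate, pay.PaycheckAmount
 FROM HR.Employee AS hr
 LEFT OUTER JOIN Payroll.Payment AS pay ON pay.EmployeeID = hr.EmployeeID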

Figure 4-27

Note

If both inputs are in the same database, it would be faster to perform a join at the OLE DB Source level instead of using a transformation. This transformation is useful when you have two different data sources you wish to merge or when you don't want to write your own join code.

To configure the Merge Join transformation, connect your two inputs into the Merge Join transform and then select what represents the right and left join as you connect each input. Open the Merge Join Transformation Editor and verify the linkage between the two tables. You can see an example of this in Figure 4-28, where the ContactID is the linkage. You can right-click on the arrow to delete a linkage or drag a column from the left input onto the right input to create a new linkage if one is missing. Lastly, check each of the columns you want to be passed as output to the path and select the type of join you wish to make (LEFT, INNER, or FULL).

Figure 4-28

Multicast

The Multicast transformation, as the name implies, can send a single data input to multiple output paths easily. You may want to use this transformation to send a path to multiple destinations sliced in different ways, as shown in Figure 4-28. To configure the transform, simply connect the transform to your input, and then drag the output path from the Multicast transform onto your next destination or transform. After you connect the Multicast transform to your first destination or transform, you can keep connecting it to other transforms or destinations. There is nothing to configure in the Multicast Transformation Editor other than the names of the outputs.

Note

The Multicast transformation is similar to the Conditional Split transformation in that both send data to multiple outputs. The Multicast sends all the rows down every path, whereas the Conditional Split sends only the portion of the data that meets each condition down a given path.

OLE DB Command

The OLE DB Command transformation executes an OLE DB command against each row in your data flow. It's quite useful for cleaning up child tables before you perform your insert. This transform uses the Advanced Editor and is covered in Chapter 6.

Percentage and Row Sampling

The Percentage and Row Sampling transformations give you the ability to take the data from the source and randomly select a subset of it. The transformation produces two outputs: one containing the rows that were randomly selected and one containing the rows that were not. You can use this to send a subset of data to a development or test server. The most useful application of this transform is to train a data-mining model: you can use one output to train the model and the other to validate it.

To configure the transformation, select the percentage or number of rows you wish to be sampled (as shown in Figure 4-29). As you can imagine, the Percentage Sampling transformation allows you to select a percentage of rows, and the Row Sampling transformation allows you to specify how many rows you wish to be output randomly. Next, you can optionally name each of the outputs from the transformation. The last option is to specify the seed that randomizes the data. If you specify a seed and run the transformation multiple times, the same rows will be output to the destination. If you leave this option unchecked, which is the default, a new seed is used each time the package runs and you will see different random data each time.
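For comparison only, T-SQL offers a loose analogue in the TABLESAMPLE clause, where REPEATABLE plays a role similar to the transform's seed. Note that TABLESAMPLE samples data pages rather than individual rows, so the parallel is approximate, and the table name here is illustrative:

 SELECT * FROM Sales.Customer TABLESAMPLE (10 PERCENT) REPEATABLE (5000)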

Figure 4-29

Pivot and Unpivot

The Pivot transformation denormalizes a normalized data set to make it more viewable for reporting. This output is similar to how OLAP displays data or how the matrix report looks in Reporting Services. You will now look at a quick example of how a pivot table would use data. Say, for example, you have a list of employees and the number of items they sold by date. This view, as shown in the following table, may not be as useful as a pivoted view:

Employee    Date        Items Sold
Mary        9/1/2005    180
Mary        9/2/2005    2
Mary        9/3/2005    140
John        9/1/2005    88
John        9/3/2005    60
John        9/4/2005    12

Once you pivot the data by date as shown in the following table, you can quickly see the number of items that each employee sold on each date. The dates on which an employee sold nothing simply show a NULL.

            9/1/2005    9/2/2005    9/3/2005    9/4/2005
Mary        180         2           140         NULL
John        88          NULL        60          12

The Pivot transformation performs this function for you. The Unpivot transformation performs exactly the opposite. You'll learn much more about how to configure these two transformations in Chapter 6.
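For comparison, the same reshaping can be written in T-SQL with the PIVOT operator introduced in SQL Server 2005. The following is only a sketch, assuming the data lives in a hypothetical dbo.DailySales table with Employee, SaleDate, and ItemsSold columns, and that SaleDate is stored as a character string matching the column headers:

 SELECT Employee, [9/1/2005], [9/2/2005], [9/3/2005], [9/4/2005]
 FROM (SELECT Employee, SaleDate, ItemsSold FROM dbo.DailySales) AS src
 PIVOT (SUM(ItemsSold) FOR SaleDate IN ([9/1/2005], [9/2/2005], [9/3/2005], [9/4/2005])) AS pvt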

Row Count

The Row Count transformation simply counts the rows that flow through the transform and outputs the number to a variable. The variable is in the scope of the Data Flow task. Typically, you would see this used to feed an e-mail that is sent to a user about the number of records that were loaded. The configuration of this transform will be discussed in Chapter 6.

Script Component

The Script Component transform allows you to write custom scripts as transforms, sources, or destinations. Some of the things you can do with this transform include the following:

  • Create a custom transform that would use a .NET assembly to validate credit card numbers or mailing addresses.

  • Validate data and skip records that don't seem reasonable. For example, you can use it in a human resource recruitment system to pull out candidates that don't match the salary requirement at a job code level.

  • Write a custom component to integrate with a third-party vendor.

Scripts used as sources can support multiple outputs, and you have the option of precompiling the scripts for runtime efficiency. You'll learn much more about the Script Component transform in Chapter 7.

Slowly Changing Dimension

The Slowly Changing Dimension transform coordinates the updating and inserting of a dimension in a data warehouse. This transformation is coordinated through the Slowly Changing Dimension Wizard. The wizard will produce all the transforms necessary to update or add to your dimension. This was once a very arduous process for a DTS developer and now can be done in minutes. This transform will be explained in much more detail in Chapter 6.

Sort

The Sort transformation allows you to sort data based on any column in the path. This will probably be one of the top five transformations you use on a regular basis because some other transforms require sorted data. To configure the transform, open the Sort Transformation Editor once it's connected to the path and check the column that you wish to sort by. Then, uncheck any column you don't want passed through to the path from the Pass Through column. By default, every column will be passed through the pipeline. You can see this in Figure 4-30, where the user is sorting by ProductID and passing all other columns in the path as output.

Figure 4-30

In the bottom grid, you can specify the alias that you wish to output and whether you're going to sort in ascending or descending order. The Sort Order column shows which column will be sorted on first, second, third, and so on. You can optionally check the Remove Rows with Duplicate Sort Values option to remove any rows that have duplicate values in the sort columns.
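In T-SQL terms, the configuration in Figure 4-30 amounts to little more than an ORDER BY clause (the table and column names are illustrative):

 SELECT ProductID, Name, ListPrice
 FROM Production.Product
 ORDER BY ProductID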

Term Extraction and Lookup

The Term Extraction transformation pulls keywords out of a data set. For example, you can use the Term Extraction transform to look at a series of articles and pull out the keywords from each article. Another great use for it would be to analyze e-mail messages from a company support mailbox to find common issues and terms. The transform works only with English-language text.

Common pronouns and articles are not extracted from the input. For example, "bicycle" would be extracted but not the phrase "the bicycle." Two new output columns are produced: term and score. The term column contains the term that was extracted from the input, and the score indicates how often it appears in the input. All other input columns are dropped, so if you want to keep your original data, you may have to use the Multicast transformation and send one path to the Term Extraction and the other wherever you wish to output the data. You can also specify data that you wish to exclude from the transformation's inspection.

The Term Lookup transformation pulls out rows that meet keywords that you predefine. For example, if you have a support e-mail system that logs messages into a database, you could have this transformation comb through the e-mail messages and automatically pull out any message that talks about problems with your product. It uses a Connection Manager to point to a table that contains the terms that you wish to use as your reference.

You'll learn much more about this in Chapter 6.

Union All

The Union All transform works much the same way as the Merge transform: it takes the outputs from multiple sources or transforms and combines them into a single result set, but it does not require the inputs to be sorted. For example, in Figure 4-31, the user combines the data from two XML sources on the Internet into a single output using the Union All transform and then sends the single result set into the Term Extraction transform.

Figure 4-31

To configure the transform, connect the first source or transformation to the Union All transform and then continue to connect the other sources or transforms until they are all attached. You can optionally open the Union All Editor to make sure the columns map correctly, but SSIS will take care of that for you automatically. The transform also fixes minor metadata issues. For example, if you have one input that is a 20-character string and another that is 50 characters, the output from the Union All transform will use the longer 50-character length.
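The behavior mirrors a T-SQL UNION ALL, for example (the table names are illustrative):

 SELECT Title, Description FROM dbo.FeedA
 UNION ALL
 SELECT Title, Description FROM dbo.FeedB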


