Chapter 10: Handling Heterogeneous and Unusual Data


Because of its price point (it ships out of the box with SQL Server 2005), SSIS is used not only to push data into SQL Server, but also to move data to and from heterogeneous sources. On the other side of the coin, you will often receive data that is nowhere close to perfect and that you must transform with SSIS. This chapter walks you through some of the scenarios we witnessed in the field that kept us up late at night. These examples should cover most scenarios you'll see in your company and will hopefully spare you the sleepless nights we had while creating these solutions!

Unusual Data Flow Scenarios

It never fails. Just when you think you've seen the worst possible extract, someone tops it. We've been consulting for years, and we hold competitions to see who has received the worst extract. The examples you're about to see show how you can clean up these horrible extracts. Oftentimes, you may have purchased a third-party product where you're not allowed direct access to the data. Instead, you must generate reports or access the data through web services and make that your data source. Painful as it may seem, there is a way to handle almost any strange data feed in SSIS. In most cases, you won't even have to break out the Script Component. This chapter has many examples that explain how to handle this bad data, but the examples won't walk you step-by-step through the basics, such as how to create a proper connection manager.

Creating Rows from Columns

As you know, mainframe screens rarely conform to any normalized form. For example, a screen may show Bill To Name, Ship To Customer, and Dedicated To Name fields. Typically, the data source would store these three fields as three columns in a file (such as a VSAM file). So, when you receive an extract from the mainframe, you may have three columns, as shown in Figure 10-1.

Figure 10-1: Extract from the mainframe

Your goal is to load this file into a Customer table in SQL Server. You want one row for each customer name in each of the three columns, for a total of 13 rows in the Customer table, as shown in the CustomerName and OrderID columns in Figure 10-2.

Figure 10-2: CustomerName and OrderID columns

The Unpivot transform and the Script Component source are both mechanisms that can accomplish this task. This example demonstrates how to use the Unpivot transform to create rows in the data flow from columns. Although it is possible to use a Script source to accomplish the task, it is much more complex, and it should be used only when your business rules are more complex than the stock Unpivot transform can handle.

The first step is to create a new package, PivotExample.dtsx. Drag a new Data Flow Task onto the control flow and go into the Data Flow tab to configure the task. For this example, create a Flat File connection manager that points to PivotExample.csv. The file can be downloaded from the Wrox web site (www.wrox.com); the Connection Manager properties should resemble Figure 10-3. Name the connection manager Pivot Source. The first row should be a header row. The file is comma-delimited, so you will want to specify the delimiter on the Columns page.

Figure 10-3: The Flat File Connection Manager Editor

After creating the connection manager, add a new Flat File Source and rename it to Mainframe Data. Point the connection to the Pivot Source connection manager. Ensure that all the columns are selected in the Columns page on the source and then click OK to return to the data flow.

The next step is the most important one. You must now unpivot the data, turning each column into a row in the data flow. You can do this by dragging an Unpivot transform onto the data flow and connecting it to the source. In this example, you want to unpivot the BillToName, ShipToName, and DedicatedToName columns, while the OrderID column will simply be passed through for each row. To do this, select each column you want to unpivot, as shown in Figure 10-4, and select Pass Through for the OrderID column.

As you select each column that you want to unpivot, the column is added to the grid below, as shown in Figure 10-4. You'll then need to type CustomerName for the Destination Column property for each row in the grid. This will write the data from each of the three columns into a single column called CustomerName. Optionally, you can also type Original Column for the Pivot Key Column Name property. By doing this, each row written by the transform will have an additional column called Original Column that indicates which source column the data came from.
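
If it helps to think of the operation in set-based terms, the T-SQL UNPIVOT operator performs the same reshaping. The following is only an illustration: it assumes a hypothetical staging table named dbo.PivotStaging that mirrors the flat file in Figure 10-1, whereas the Unpivot transform does this work directly in the data flow without any staging table.

    -- Hypothetical staging table mirroring Figure 10-1; UNPIVOT returns one row
    -- per populated name column, just like the Unpivot transform does.
    SELECT u.OrderID,
           u.CustomerName,
           u.OriginalColumn
    FROM   dbo.PivotStaging
    UNPIVOT (CustomerName FOR OriginalColumn
             IN (BillToName, ShipToName, DedicatedToName)) AS u;

Just like the Unpivot transform, T-SQL UNPIVOT silently drops NULL values, but empty strings still come through and must be filtered, which is exactly what the Conditional Split described below takes care of.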

Figure 10-4: Unpivot Transformation Editor

The Unpivot transform will take care of columns that have NULL values. For example, if the DedicatedToName column for OrderID 1 has a NULL value, as shown in Figure 10-1, that column will not be written as a row. However, you may also want to handle empty string values, which will create blank rows in the data flow. To throw those records out, you can use a Conditional Split transform, as shown in Figure 10-5. In this transform, use the following expression to create one condition for the good data that you want to keep, which brings in only rows that contain actual data:

 ISNULL(CustomerName) == FALSE && TRIM(CustomerName) != ""

Figure 10-5: The Conditional Split Transformation Editor

The default output handles the empty strings and NULL customers; in this example, that Conditional Split output is called NULL Customer. After this, you're ready to send the data to the destination of your choice. The simplest option is to send the data to a new SQL Server table in the AdventureWorks database.

The final package should look something like Figure 10-6. The Valid Customer output goes to the Customer table, and the NULL Customer output simply gets thrown out. You're now finished and can execute the package.

Figure 10-6: The final package

Multiple Record Types in a Single File

Oftentimes, mainframe developers will send you a single file that contains information destined for multiple tables, as shown in Figure 10-7. In this figure, you can see two record types: record type 1 is the order header, and record type 2 represents the individual line items for the order. The first column in this .csv file (CH10_OrderRecordExtract.csv, which you can download from the Wrox web site) contains the record type.

Figure 10-7: File containing information about multiple tables

This section discusses this common scenario for mainframe extracts and explains how to parse the file into multiple tables. The example has been simplified dramatically to fit into a chapter, so many performance considerations are ignored in order to focus on the functionality needed to master this problem. As you can see in Figure 10-7 (the file used in this example), there are two orders, indicated by the first column being set to 1. Under those two orders, there are many line items.

There are several ways to solve this parsing issue. In this case, you will make one pass at the file to load it into multiple raw files. A second pass will then load those raw files into SQL Server tables. Prior to starting this example, run Chapter10_OrderTables.sql, which will create the necessary tables in SQL Server. You could indeed load the data directly into the target tables in one pass, but there are certain advantages to a two-step process, depending on the business and technical requirements.
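
If you just want to follow along without downloading the script, the following sketch approximates the two tables it creates. The definitions are inferred from the column names and data types used later in this example; the actual Chapter10_OrderTables.sql may differ in nullability, keys, and exact types.

    -- Approximate target tables in AdventureWorks (inferred; see Chapter10_OrderTables.sql).
    CREATE TABLE dbo.[Order]
    (
        OrderNumber         INT         NOT NULL,
        InternetOrderFG     BIT         NULL,
        SalesOrderNumber    VARCHAR(15) NULL,
        PurchaseOrderNumber VARCHAR(15) NULL
    );

    CREATE TABLE dbo.OrderDetail
    (
        OrderLineNumber   INT IDENTITY(1,1) NOT NULL, -- ignored on the Mappings page later
        OrderNumber       INT         NOT NULL,
        CarrierTrackingID VARCHAR(20) NULL,
        OrderQuantity     INT         NULL,
        ProductID         INT         NULL,
        UnitPrice         MONEY       NULL,
        LineItemPrice     MONEY       NULL
    );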

The advantage of not making this a two-step process is that you would have to load the data into the server's memory buffers only a single time; making two passes at the data will ultimately slow down your processing. The advantages of making two passes are restartability and availability of the data. If loading the data into the production tables were to fail, you would still have the data massaged and sitting in easy-to-access intermediate files. If the first pass took three hours and the second pass failed, all you would have to do is execute the second pass again, or place checkpoint files on the process to control that execution logic.

Another advantage is availability. In some cases, you will not be able to load the data into the production tables during operational hours. If you had a two-step process, you could stage the data into raw files, and perform all the transformations during operational hours. Then, during your load window, you could take the final step.

In some cases, you will have to stage the data in a table or in a raw or text file. Those scenarios usually involve very complex fixed-width or ragged-right data where the number and length of the columns vary. The main problem is the length of each column: if it varies and there are multiple record types in the file, you must stage each record type into its own file, and then use a Derived Column transform and the SUBSTRING() function to break up the data.
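
As a rough sketch of that approach, assume the staged file exposes each whole line as a single column named Line; the column positions below are purely hypothetical and would come from your actual record layout. Each entry is a derived column name followed by its Derived Column expression:

    RecordType  : (DT_I4)SUBSTRING(Line, 1, 1)
    OrderNumber : (DT_I4)SUBSTRING(Line, 2, 10)
    UnitPrice   : (DT_CY)SUBSTRING(Line, 12, 10)

Remember that SUBSTRING() in the SSIS expression language is 1-based, unlike the zero-based indexing you may be used to in .NET code.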

This example assumes that you must break the data into two stages; however, as you can see, this depends on your business requirements. Start by creating a new package called RawFile.dtsx and add a new Data Flow Task onto the control flow. Name the Data Flow Task Break Up File.

Next, create a Flat File connection manager that points to CH10_OrderRecordExtract.csv. This is the comma-delimited file that contains multiple record types and has no header row. As you can see in the Columns page (shown in Figure 10-8), Columns 5 and 6 have no data for record type 1. The first column (Column 0) contains the record type.

Figure 10-8: Columns page of Connection Manager Editor

Although most of the data is not shared between the various record types, there is a key that pairs the records together. In this case, it’s the Order Number column, which is in Column 1. Before exiting the Connection Manager screen, let’s update the common columns to the proper column names and data types.

On the Advanced page, change the Name property of the first column to RecordType and that of the second column to OrderNumber. Both columns should be four-byte signed integers (DT_I4), as shown in Figure 10-9. The last two columns (Columns 5 and 6) should be named UnitPrice and LineItemPrice, respectively, and should both be set to the currency data type (DT_CY). Leave the other column names and data types alone.

Figure 10-9: Column 4 properties in Connection Manager Editor

Inside the Data Flow Task you created earlier, add a Flat File source to the data flow and point it to the Flat File connection manager you just created. Name the source Initial Extract.

Create a new Conditional Split transform and name it Split By Record Type. You will create three outputs, including the default output. The first condition is called Orders and will contain all the records that have a RecordType of 1; the expression RecordType == 1 will suffice to grab only the orders. To get the Order Details case, you can use the expression RecordType == 2. The default output should be called Unhandled and will catch any potential NULL record types or bad data that doesn't meet your requirements. Figure 10-10 shows the final configuration of the transform.
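
To recap, the Split By Record Type transform ends up with the following outputs; the two named conditions use the expressions shown, and the default output needs no expression:

    Orders        : RecordType == 1
    Order Details : RecordType == 2
    Unhandled     : (default output, no expression)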

Figure 10-10: Final configuration of the transform

Next, create a Data Conversion transform and name it Data Conversion – Order. Drag the Order output from the Conditional Split onto the newly created transform. Inside the Data Conversion transform, select Columns 2, 3, and 4 from the Available Input Columns. Change the Output Alias to InternetOrderFG for Column 2, SalesOrderNumber for Column 3, and PurchaseOrderNumber for Column 4. The data type for InternetOrderFG should be set to Boolean (DT_BOOL), and the other two columns should be set to a 15-character string (DT_STR), as shown in Figure 10-11.

Figure 10-11: Data Conversion – Order in the Data Conversion Transformation Editor

Drag over a second Data Conversion transform and name it Data Conversion – Order Details. Connect it to the Order Details output from the Conditional Split transform. Inside the Data Conversion transform, select Columns 2, 3, and 4, as shown in Figure 10-12. Set the Output Alias to CarrierTrackingID for Column 2, OrderQuantity for Column 3, and ProductID for Column 4. The data type should be a 20-character string (DT_STR) for CarrierTrackingID and a four-byte signed integer (DT_I4) for the other two columns. It's important to place the Data Conversion transform here, rather than immediately after the source, because the data types differ for each record type.

Figure 10-12: Data Conversion – Order Details in the Data Conversion Transformation Editor

With the columns now converted, you're ready to write the data to your raw files. Drag over the first Raw File destination and connect it to the Data Conversion – Order transform. Name the Raw File destination Order Extract and double-click it to configure it.

In the Component Properties tab of the Raw File destination, ensure that the Access Mode property is set to File Name. Type C:\ExpertSSIS\OrderData.raw for the FileName property and change the WriteOption property to Create Always. These options will create the .raw file each time the package runs, and you will create a process later to remove the file upon successful completion of the package. In the Input Columns tab, check the OrderNumber, InternetOrderFG, SalesOrderNumber, and PurchaseOrderNumber columns, as shown in Figure 10-13.

Figure 10-13: Configuring the Order Extract destination

Drag over another Raw File destination and name it Order Details Extract. Connect it to the Data Conversion – Order Details transform. In the destination, configure it just like you did the last Raw File destination, except have the destination create the C:\ExpertSSIS\OrderDetailsData.raw file. This time, in the Input Columns tab, select the OrderNumber, UnitPrice, LineItemPrice, CarrierTrackingID, OrderQuantity, and ProductID columns, as shown in Figure 10-14.

Figure 10-14: Configuring the Order Details Extract destination

Using the Raw File

Now that you've transformed the data and written two raw files in the first pass, you're ready to consume those raw files. The first process could have been accomplished outside a maintenance window, and the second process could now be untethered from this package and executed independently during maintenance.

First, create another Data Flow Task in the same package and drag a Success (green) precedence constraint from the first Break Up File Data Flow Task to it. Name the newly created Data Flow Task Orders and double-click it to configure it. Inside the Data Flow tab, create a new Raw File source and name it Order Extract. Inside the source, set the AccessMode option to File Name and type C:\ExpertSSIS\OrderData.raw for the FileName property.

The last step in this data flow is to drag over an OLE DB destination and name it Order Table. On this destination's Connection Manager page, select the Order table in the AdventureWorks database from the table drop-down box. (This is one of the tables created by Chapter10_OrderTables.sql from the Wrox web site.) Of course, you will also have to create a connection manager to the AdventureWorks database. On the Mappings page, the columns will not line up exactly, so you may have to connect them manually. Your final configuration should look like Figure 10-15.

Figure 10-15: Final configuration in the DB Destination Editor

In the control flow, create a last Data Flow Task and name the task OrderDetails. Connect the Orders task to the OrderDetails task with a Success precedence constraint. In the OrderDetails Data Flow Task, add a Raw File source. This time, the Raw File source should have the FileName property set to C:\ExpertSSIS\OrderDetailsData.raw in the Component Properties page.

Connect the Raw File source to a new OLE DB destination named OrderDetail Table. This time, point the destination to the OrderDetail table in the AdventureWorks connection manager. The Mappings page should align perfectly, and only the OrderLineNumber column should be ignored, as shown in Figure 10-16, since it’s an auto-incrementing identity column.

Figure 10-16: Alignment in the Mappings page

You're now ready to execute the package. The Break Up File Data Flow Task should look like Figure 10-17. As you can see, there are three order records and 29 line items for those orders. In the Control Flow tab, you should have three tasks in total, but you may choose to add a File System Task to remove the raw files after they've been consumed. Doing so means the files use space on the drive only during the execution of the package. This is not mandatory, though, if you have ample space, because you specified that the raw files should always be re-created. One additional thing to consider: if you later select data out of the Order table, you will need to enclose the table name in square brackets or double quotation marks, because Order is a reserved word in T-SQL (see the sample query following Figure 10-17).

Figure 10-17: The Break Up File Data Flow Task
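
For example, a quick check of the loaded orders needs the brackets shown here (assuming the table was created in the dbo schema):

    SELECT OrderNumber, SalesOrderNumber, PurchaseOrderNumber
    FROM   dbo.[Order];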


