Data Profiling


Data profiling is becoming an increasingly important facet of a DBA's or warehouse architect's job. Data profiling can shave weeks off a months-long project by speeding up the requirements process and letting the business unit know what types of issues they're going to experience before the project starts. It is also extremely important if you want to validate or cleanse data as it moves through the data flow.

Data profiling is the DBA's way of catching inconsistent user-entered data that slipped through validation in the source application before it was saved to the database. DBAs can use SSIS to assist in data profiling, checking that data (such as credit card numbers, telephone numbers, e-mail addresses, residential addresses, and so on) conforms to the format in which it is intended to be stored. You especially see this type of data cleansing with companies that perform mailings: mailing is much cheaper if you're able to presort and standardize the addresses.

Data profiling with SSIS provides the flexibility to direct bad data to text logs or other database tables, and to provide custom information or metadata, so that there is enough information for corrective action to be taken on such data.

The following demo profiles the EmailAddress and Phone columns from the AdventureWorks Person.Contact table. The data flow will separate good and bad records by placing them into two different flat files.

To get started, create a new package and add a Data Flow Task. Then, add a DataReader Source. An ADO.NET connection manager is also needed to connect to the AdventureWorks database; wire it up to the DataReader Source, just as you did in previous examples. In the Component Properties tab of the DataReader Source, set the SqlCommand property to the following query:

  SELECT * FROM PERSON.CONTACT 
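SELECT * is fine for a demonstration, but pulling every column widens the data flow buffers unnecessarily. A narrower version of the query, using only the columns this example actually profiles (column names as they appear in the AdventureWorks Person.Contact table), might look like this:

```sql
-- Only the columns the profiling data flow needs
SELECT ContactID, EmailAddress, Phone
FROM Person.Contact
```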

A Script Component can now be added that will perform the profiling on the data (once again in a synchronous manner). Double-click on the component and check the ContactID, EmailAddress, and Phone columns as Input Columns, as shown in Figure 2-18. Ensure that ReadOnly is set for the UsageType property of all the columns.

Figure 2-18: Input Columns

The additional columns that will be set as output can now be added manually to the Output 0 node, making sure that their data types and lengths are appropriate. Figure 2-19 shows both ValidPhone and ValidEmail as Booleans. The values for ValidPhone and ValidEmail will be set by script within the component, using regular expressions (RegEx) to validate the format of the data.

Figure 2-19: Output fields

If the data in a record is valid, the transform will set the ValidEmail or ValidPhone column to true; if it is not valid, the column will be set to false. You could then apply additional business logic in a Conditional Split or Derived Column transform to separate the good and bad rows. Next, click the Design Script button on the Script page to begin writing the code. Listing 2-4 shows the logic. Note that the Regex class is part of the System.Text.RegularExpressions namespace, so be sure the namespace is imported first.

Listing 2-4: Getting Input Values

  Imports System.Data
  Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
  Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
  Imports System.Text.RegularExpressions

  Public Class ScriptMain
      Inherits UserComponent

      Dim rgxEmail, rgxPhone As Regex

      Public Overrides Sub PreExecute()
          MyBase.PreExecute()
          rgxEmail = New Regex("\S+@\S+\.\S+", RegexOptions.Compiled)
          rgxPhone = New Regex("\(?\s*\d{3}\s*[\)\.\-]?\s*\d{3}\s*[\-\.]?\s*\d{4}", _
              RegexOptions.Compiled)
      End Sub

      Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
          Row.ValidEmail = rgxEmail.IsMatch(Row.EmailAddress)
          Row.ValidPhone = rgxPhone.IsMatch(Row.Phone)
      End Sub

  End Class

Listing 2-4 performs the profiling checks discussed: phone numbers and e-mail addresses are matched against regular expressions that validate the data's format. There is no single regular expression that fits all cases, even for phone numbers and e-mail addresses; values that are valid for one application may be invalid for another. These expressions were selected just for this demonstration. Note how the regular expressions are instantiated in the PreExecute() phase of the component, so that they are initialized only once. The RegexOptions.Compiled flag is also passed to the regular expression constructor to tell it to compile the expression, which provides some extra performance when the objects are used repeatedly.
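To see how these two patterns behave on sample values, the same checks can be sketched outside SSIS. The following Python snippet is only an illustration (the sample rows are made up, not taken from AdventureWorks); it uses an unanchored search, which mirrors how .NET's Regex.IsMatch scans the whole string for a match:

```python
import re

# The same patterns used in Listing 2-4
rgx_email = re.compile(r"\S+@\S+\.\S+")
rgx_phone = re.compile(r"\(?\s*\d{3}\s*[\)\.\-]?\s*\d{3}\s*[\-\.]?\s*\d{4}")

# Hypothetical sample rows: (EmailAddress, Phone)
rows = [
    ("jon24@adventure-works.com", "398-555-0132"),
    ("not-an-email", "55-2555-0100"),
]

for email, phone in rows:
    # re.search scans anywhere in the string, like Regex.IsMatch in .NET
    valid_email = rgx_email.search(email) is not None
    valid_phone = rgx_phone.search(phone) is not None
    print(email, phone, valid_email, valid_phone)
# The first row passes both checks; the second fails both.
```

Note that the phone pattern accepts several separator styles (parentheses, dots, dashes, spaces) but still requires a 3-3-4 digit grouping, which is why a value like 55-2555-0100 fails.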

A Conditional Split component is then used to separate the valid and invalid data coming out of the Script Component. In the expression language, the logical operator || represents an OR condition and && represents an AND condition.

Figure 2-20 shows that there are two conditions, created by dragging a column from the Columns tree and adding it to a condition. The functions in the top-right window can be used when creating the condition's expression logic. The first condition, called Invalid Data, checks the output columns ValidEmail and ValidPhone, which are passed from the Script Component, and fires if either flag is false. The Valid Data condition checks that both flags are true. In actuality, you could have entered just the first Invalid Data condition and routed the valid rows through the Default output name; in essence, this would have created an if/else condition. For this example, both conditions were left in to demonstrate how you can check for multiple conditions when your logic is more complex.
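Assuming the flag columns are named as above, the two condition expressions in the SSIS expression language would look something like the following (a sketch; the names must match your actual Script Component output columns):

```
Invalid Data : !ValidEmail || !ValidPhone
Valid Data   : ValidEmail && ValidPhone
```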

Figure 2-20: Creation of two conditions

As shown in Figure 2-21, by setting up this logic, the component can now be wired to handle both conditions. The conditions are wired up to Flat File Destination components. The columns used as outputs are then added to each of the flat files.

Figure 2-21: Component wired to handle both conditions

In Figure 2-22, you can see the BadData.txt file that was created. The component has been configured to add the column headers at the top. The first two fields are the flags, and the record shows that the phone number does not match the format required by the regular expression used to check it: the record has a phone number of 55-2555-0100, which is obviously invalid.

Figure 2-22: BadData.txt file

This type of example is very important when you’re dealing with postal data. Companies can quantify the cost of each letter going out to a customer. If you can give your marketing department better quality data or more reliable data, you can save your company a lot of money. Also, it’s useful for detecting fraud or data that you may want to have manually inspected.



Expert SQL Server 2005 Integration Services (Programmer to Programmer)
ISBN: 0470134119
