Case Study Process Package


The Load package put the data into the database; the Process package performs the magic. All this payment data from different sources, with varying degrees of quality, needs to be matched by invoice or customer attributes against your Invoice and Customer dimension tables. Having it combined in one place allows this package to apply the payment-matching logic to all payments at once. Hopefully, every time the package runs, it is money in the bank for Company ABC.

The strategy for this package is to mimic the logic provided in the business specifications in Figure 19-4. You will queue up all the payment transactions that are unmatched at a point in time, and then run that stream of payments through a gauntlet of matching options until you reach your confidence threshold for a match. This design makes it easy to add further matching scenarios in the future, while letting you use the advanced fuzzy-matching logic available in Integration Services today.

You'll be breaking the construction of the package into these sections: Package Setup, High-Confidence Data Flow, and Medium-Confidence Data Flow.

Package Setup

This portion of the Case Study will create the Control Flow steps that you need to systematically review pending and unmatched payment transactions. You will set up the variables that you need to store unmatched payment counts at each stage of the matching process. You will create placeholder Data Flow tasks that will perform the matching, and then you'll send out an e-mail to report on the statistics for the matching operations.

  1. Create a new SSIS project. Name the project CaseStudy_Process. When the project is created, go to the Solution Explorer and click the Package.dtsx file. In the Properties window, find the Name property and change the name from Package.dtsx to CaseStudy_Process.dtsx. Answer Yes to change the name of the package object as well.

  2. Add an OLE DB Connection to the Connection Managers that connects to the CaseStudy database. Name the connection CaseStudy.OLEDB.

  3. Add an SMTP Connection to the Connection Managers that connects to a viable SMTP mail server. Name the connection Mail Server.

  4. From the SSIS menu, select Variables to open the Variables window, and add the variables shown in the following table.

    | Variable Name         | Scope             | Data Type | Value |
    |-----------------------|-------------------|-----------|-------|
    | HIGHCONFMATCHCNTSTART | CaseStudy_Process | Int32     | 0     |
    | HIGHCONFMATCHCNTEND   | CaseStudy_Process | Int32     | 0     |
    | MEDCONFMATCHCNTEND    | CaseStudy_Process | Int32     | 0     |

  5. Add an Execute SQL task to the Control Flow. This task queries the database for the pending payments and records how many remain before the High Confidence Data Flow task starts. Name the task Execute SQL Get High Conf Stats. Set the connection to CaseStudy.OLEDB. In the Result Set tab, map the single returned column (ToCnt), the count of pending transactions at this point, to the variable HIGHCONFMATCHCNTSTART. Use the following SQL in the SQLStatement property:

     SELECT COUNT(*) AS ToCnt
     FROM BankBatchDetail d
     INNER JOIN BankBatch h ON h.BankBatchID = d.BankBatchID
     WHERE MatchedDate IS NULL
       AND RawInvoiceNbr IS NOT NULL
       AND RawInvoiceNbr <> ''

  6. Add a Data Flow task to the Control Flow. Name the task High Confidence Data Flow Process Task Start. Connect the tasks. You'll cover this task in the "High-Confidence Data Flow" section.

  7. Add another Execute SQL task by copying the first SQL task. Name the task High Confidence Data Flow Process Task End. Connect the tasks. Change the variable mappings in the result column to HIGHCONFMATCHCNTEND.

  8. Add another Data Flow task to the Control Flow. Name the task Medium Confidence Data Flow Process Task. Connect the task to the Execute SQL task. You'll cover this task in the "Medium-Confidence Data Flow" section.

  9. Add another Execute SQL task by copying the first SQL task. Name the task Medium Confidence Data Flow Process Task End. Connect the tasks. Change the variable mappings in the result column to MEDCONFMATCHCNTEND.

  10. Add an Execute SQL task from the Toolbox. Name the task Execute SQL to Balance by Batch. Set the OLE DB connection. Set the SQLStatement property to EXEC usp_BankBatch_Balance. This procedure updates and balances batch-level totals based on the payments that have been processed. (A rough sketch of what such a procedure might look like appears after Figure 19-33.)

  11. Add a Send Mail task. Set it up to connect to the Mail Server SMTP connection. Fill in the To, From, and Subject properties. (If you don't have access to an SMTP connection, disable this task for testing.) Create an expression for the MessageSource property similar to the following code:

     " COMPANY ABC, Automated Payment Matching Results: Starting with " + (DT_WSTR, 25) @[User::HIGHCONFMATCHCNTSTART]  + " payments,  We received and successfully processed " +  (DT_WSTR, 25) (@[User::HIGHCONFMATCHCNTSTART]- @[User::HIGHCONFMATCHCNTEND]) + " payments automatically with a High-Level of confidence.  We processed " + (DT_WSTR, 25) (@[User::HIGHCONFMATCHCNTEND]- @[User::MEDCONFMATCHCNTEND]) + " payments with a Medium-Level of confidence.  This is an automated message." 

The completed Control Flow should look similar to Figure 19-33.

Figure 19-33
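
The stored procedure usp_BankBatch_Balance referenced in step 10 ships with the Case Study database, so its exact body isn't shown here. As a rough, hedged sketch of the kind of work it performs, it might roll matched payment amounts up to the batch header like this (the MatchedTotal, BatchTotal, and BalancedDate columns are assumed names):

     CREATE PROCEDURE usp_BankBatch_Balance
     AS
     BEGIN
         SET NOCOUNT ON;

         -- Roll the matched payment amounts up to the batch header and
         -- stamp the batch as balanced when the totals agree.
         -- MatchedTotal, BatchTotal, and BalancedDate are assumed columns.
         UPDATE h
         SET MatchedTotal = ISNULL(d.MatchedTotal, 0),
             BalancedDate = CASE WHEN ISNULL(d.MatchedTotal, 0) = h.BatchTotal
                                 THEN GETDATE()
                                 ELSE NULL
                            END
         FROM BankBatch h
         LEFT JOIN (SELECT BankBatchID, SUM(PaymentAmount) AS MatchedTotal
                    FROM BankBatchDetail
                    WHERE MatchedDate IS NOT NULL
                    GROUP BY BankBatchID) d
             ON d.BankBatchID = h.BankBatchID;
     END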

High-Confidence Data Flow

Your first level of matching should use the data attributes most likely to produce the highest-quality lookup against the target invoice table. That attribute is the invoice number: a manufactured identification string generated by Company ABC for each bill it creates. If you get a match by invoice number, you can be highly confident that the payment should be applied against that invoice.

  1. First you need to create a stream of items to process in your Data Flow. You'll do this by querying all pending payments that have at least some sort of data in the RawInvoiceNbr field. If there is no data in this field, the items can't be matched through an automated process until a subject-matter expert can look the item up or identify it in another way. Add an OLE DB Source to the Data Flow and set the properties shown in the following table. Notice that the [RawInvoiceNbr] field is converted to uppercase before it is delivered into your data stream, to improve the lookup matches.

    | Property       | Value                 |
    |----------------|-----------------------|
    | Connection     | CaseStudy.OLEDB       |
    | DataAccessMode | SQL command           |
    | SQLCommandText | (the following query) |

     SELECT h.ImportType, BankBatchDtlID,
            UPPER(RawInvoiceNbr) AS RawInvoiceNbr, PaymentAmount
     FROM BankBatchDetail d
     INNER JOIN BankBatch h ON h.BankBatchID = d.BankBatchID
     WHERE MatchedDate IS NULL
       AND RawInvoiceNbr IS NOT NULL
       AND RawInvoiceNbr <> ''

  2. Add a Sort transform to the output of the OLE DB Source and sort the data by the [BankBatchDtlID] field in ascending order. Even though you could order the incoming data by adding an ORDER BY clause to the SQLCommandText property in the OLE DB Source, you still need this Sort transform so the stream is marked as sorted for the later Merge operation.

  3. Add the first Lookup transform, which is going to be a lookup by Invoice. You are going to add many of these, so you'll go over this first one in detail. For each item in the stream, you want to set up an attempt to match the information in the [RawInvoiceNbr] field from the different payment data sources to your real invoice number in the invoice dimension table. In other lookups, you may attempt name or e-mail lookups. The invoice number is considered your highest-confidence match because it is a unique number generated by the billing system. If you find a match to the invoice number, you have identified the payment. Set up the following properties on the component.

    | Property   | Value                 |
    |------------|-----------------------|
    | Connection | CaseStudy.OLEDB       |
    | SQL Query  | (the following query) |

     SELECT InvoiceID, UPPER(LTRIM(RTRIM(InvoiceNbr))) AS InvoiceNbr, CustomerID
     FROM Invoice
     ORDER BY InvoiceNbr

  4. In the Columns tab, connect the Input Column [RawInvoiceNbr] to the Lookup Column [InvoiceNbr]. If the lookup finds a match, pull back the InvoiceID and CustomerID from the lookup data by adding these columns as lookup columns in the Lookup column grid. (A set-based sketch of this exact match appears just after this list of steps.)

    The default behavior of the Lookup transform is to fail on a no-match condition. You don't want that, because you don't expect 100 percent matches from each transform. What you want is to separate the matches from the non-matches so that only unmatched items continue through further lookups. To do that, you will use this built-in capability to "know" whether a match has been made, and instead of failing the component or package, you will divert the stream to another lookup. In the Lookup transform, use the Configure Error Output button to set the action of a failed lookup to Redirect Row, as in Figure 19-34.

  5. Because the invoice number can be keyed incorrectly at the bank or truncated, it may be off by only a few digits, or by an "O" instead of a "0." With only exact (inner-join) matching you would miss these, but there may still be a good chance of a match if you use the Fuzzy Lookup. This package is going to use a lot of Fuzzy Lookup tasks, and they all need to be set up the same way, so you'll do this one in detail and then just refer back to it later.

    1. Add a Fuzzy Lookup task to the Data Flow to the right of the Lookup task. Connect the Error Output of the previous Invoice Lookup transform to the Fuzzy Lookup. Set up the OLE DB connection to CaseStudy.OLEDB.

    2. Select the option to Generate a New Index with the reference table set to [Invoice]. (Later it will be more efficient to change these settings to store and then use a cached reference table.)

    3. In the Columns tab, map the [RawInvoiceNbr] field to the [InvoiceNbr] field.

    4. Deselect the extra error columns (ErrorCode and ErrorColumn) so they are not passed through from the input. These columns were added to the stream when it was diverted through the error output; you aren't interested in them, because a no-match is not considered an error for this transform.

    5. Right-click the line between the two columns. Click Edit Relationship on the pop-up menu. Check all the comparison flags starting with Ignore.

    6. Select the InvoiceID and CustomerID fields to return as the lookup values if a match can be made with the fuzzy logic.

    7. In the Advanced tab, set the Similarity Threshold to 0.78 for the invoice fuzzy match.

  6. Since the output of the Fuzzy Lookup contains a number indicating how similar the match is, you can use it to separate the stream into high- and low-similarity matches. Low-similarity matches will continue through further matching attempts; high-similarity matches will be remerged with the other matches. Add a Conditional Split task to separate the output into two streams based on the [_Similarity] field, which represents a mathematical measurement of "sameness" between the [RawInvoiceNbr] provided by Company ABC's customers and the InvoiceNbr you have on file. The splits should always be set up like Figure 19-35 (a sketch of the split conditions follows that figure).

  7. You want to merge the high-similarity matches from the Fuzzy Lookup with the output of the previous inner-join Lookup transform, but to do that, the Fuzzy Lookup output must be sorted. This step will also be repeated many times. Add a Sort transform and sort by the [BankBatchDtlID] field in ascending order. The Sort transforms do two things: they sort the data, and they also let you remove the redundant columns added by the fuzzy matching by deselecting them for pass-through. Remove references to these fields (_Similarity, _Confidence, ErrorCode, and ErrorColumn) when passing data through the Sorts.

  8. Add a Merge Component to the Data Flow. Connect the output of the Invoice Lookup to the High Similarity output of the Fuzzy Lookup (via the Sort transform). In the Merge Editor you can see the fields from both sides of the merge. Sometimes a field will come over with its value set to <IGNORE>. Make sure you match these fields up, or some of the data will be dropped from your stream. The completed merge will look like Figure 19-36.

  9. At this point, the only items in the Merge are matched by invoice, and you should have foreign keys for both the customer and the invoice. These keys can now be applied by executing the stored procedure usp_BankBatchDetail_Match for each item in your merged stream. Add an OLE DB Command transform to the Data Flow and set the OLE DB connection. Set the SqlCommand property to usp_BankBatchDetail_Match ?, ?, ?. Click Refresh to retrieve the parameters, and map the InvoiceID, CustomerID, and BankBatchDtlID fields to them. The stored procedure runs for each row in your stream and automatically updates the foreign keys. If a row has both invoice and customer keys, the stored procedure also marks that transaction as complete. (A hedged sketch of such a procedure appears at the end of this section.)
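
For intuition, the exact invoice match performed by the Lookup in steps 3 and 4 is equivalent to the following set-based T-SQL. This is an illustration of the matching logic only, not a part of the package:

     -- Set-based equivalent of the exact invoice Lookup (illustration only)
     SELECT d.BankBatchDtlID, i.InvoiceID, i.CustomerID
     FROM BankBatchDetail d
     INNER JOIN Invoice i
         ON UPPER(LTRIM(RTRIM(i.InvoiceNbr))) = UPPER(d.RawInvoiceNbr)
     WHERE d.MatchedDate IS NULL
       AND d.RawInvoiceNbr IS NOT NULL
       AND d.RawInvoiceNbr <> ''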

Figure 19-34

Figure 19-35
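
The split conditions themselves aren't spelled out beyond Figure 19-35, but as a hedged sketch, the Conditional Split for the invoice fuzzy match might be configured as follows. The HIGHSIMILARITY and LOWSIMILARITY output names match the later steps; the 0.78 cutoff mirrors the Fuzzy Lookup threshold from step 5 and is an assumption:

     Output name:    HIGHSIMILARITY
     Condition:      [_Similarity] >= 0.78
     Default output: LOWSIMILARITY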

Figure 19-36

This completes the High-Confidence Data Flow. At this point, your Data Flow should look like Figure 19-37. When this Data Flow returns control to the Control Flow, the next Execute SQL task recalculates the number of remaining pending transactions. The next step is the Medium-Confidence Data Flow.

Figure 19-37
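
The stored procedure usp_BankBatchDetail_Match ships with the Case Study database, so the body below is only a hedged sketch of the behavior described in step 9: update the resolved foreign keys for a payment row and, when both keys are present, stamp the row as matched. The InvoiceID and CustomerID columns on BankBatchDetail are assumed names:

     CREATE PROCEDURE usp_BankBatchDetail_Match
         @InvoiceID      int,
         @CustomerID     int,
         @BankBatchDtlID int
     AS
     BEGIN
         SET NOCOUNT ON;

         -- Apply whichever foreign keys were resolved for this payment row.
         -- InvoiceID and CustomerID on BankBatchDetail are assumed columns.
         UPDATE BankBatchDetail
         SET InvoiceID  = COALESCE(@InvoiceID, InvoiceID),
             CustomerID = COALESCE(@CustomerID, CustomerID),
             -- Mark the row complete only when both keys are known.
             MatchedDate = CASE WHEN COALESCE(@InvoiceID, InvoiceID) IS NOT NULL
                                 AND COALESCE(@CustomerID, CustomerID) IS NOT NULL
                                THEN GETDATE()
                                ELSE MatchedDate
                           END
         WHERE BankBatchDtlID = @BankBatchDtlID;
     END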

Medium-Confidence Data Flow

The Medium-Confidence Data Flow is made up of matches on customer information. Since names and e-mail addresses are more likely to be merely similar, this level of matching sits lower on the confidence continuum than an invoice number. And even after you have identified the customer, you still need to identify the invoice. To find it, you'll attempt to match the closest unpaid invoice by amount for that customer. All of these tasks, until you get to the end, are similar to the High-Confidence Data Flow; the only difference is that the lookups use the Customer table instead of the Invoice table. For this reason, you'll just follow the basic steps. Refer to Figure 19-39 to see the final result.

  1. Add an OLE DB Source set up exactly the same way as for the High-Confidence Data Flow.

  2. Add a Lookup to the Data Flow connecting to the OLEDB source. Name it Email Lookup. Look for exact matches between RawInvoiceNbr and the field [Email] in the customer table. Set the error handling to Redirect when encountering a Lookup error. Use this SQL Query:

     SELECT CustomerID, UPPER(RTRIM(Email)) AS Email
     FROM Customer
     WHERE Email IS NOT NULL AND Email <> ''

  3. Add another Lookup by Customer Name beside the Email Lookup. Feed it the error output of the Email Lookup. Look for exact matches between RawInvoiceNbr and the field [Name] in the customer table. Set the error handling to Redirect when encountering a Lookup error. Use this SQL Query:

     SELECT CustomerID, UPPER(RTRIM([Name])) AS [Name]
     FROM Customer
     WHERE [Name] IS NOT NULL AND [Name] <> ''

  4. Add Sort Components to the outputs of both lookups. Place them directly under each lookup. Sort by BankBatchDtlID ascending. In the sort by name matches, don't forget to deselect the error columns.

  5. Add a Merge Component to merge the two outputs of the Sorts for matches by Email and Name.

  6. Add a Lookup using the CustomerLookup table next to the Name Lookup. Feed it the error output of the Customer Name Lookup. Look for exact matches between the input field [RawInvoiceNbr] and the lookup field [RawDataToMatch]. This lookup requires an additional join on the [ImportType] field in both the input and the lookup data. Set the error handling to Redirect Row. Use the table name [CustomerLookup] as the source, and look up and return the CustomerID.

  7. Add a Sort to the output of the CustomerLookup Lookup. Sort by BankBatchDtlID in ascending order and deselect the extra columns.

  8. Add a Fuzzy Lookup task to the Data Flow. Connect it to the error output of the CustomerLookup Lookup. Connect to the customer table, and match by RawInvoiceNbr to Email Address. Select the CustomerID for the lookup. Set the Similarity to .70. Remove the columns for pass-through that start with lookup.

  9. Add the Conditional Split component to the output of the Fuzzy Lookup to separate the matches by similarity values above and below .70.

  10. Moving to the left, add a new Merge task to merge the results of the e-mail and name merge with the customer lookup matched sort results. Combine the matched results of the two sorted outputs.

  11. Add a Sort to the High Similarity output of the Email Fuzzy Lookup. Deselect the columns that were added by the Fuzzy Lookup (those starting with "_"). Sort by BankBatchDtlID in ascending order.

  12. Add a new Merge task to combine the Email Fuzzy Lookup Sort to the Email, Name, and CustomerLookup merged results.

  13. Add a Fuzzy Lookup task to the Data Flow beside the Conditional Split from the last Email Fuzzy Lookup, at the same level and to its right. Name it Fuzzy Name Lookup. Connect the Low Similarity output of that Conditional Split to the new Fuzzy Name Lookup. Use the [Customer] table to look for matches between [RawInvoiceNbr] and [Name]. Uncheck the pass-through checkbox for the input column [CustomerID] that is being fed by the Low Similarity stream. Retrieve a new lookup of CustomerID. In the Advanced tab, set the Similarity Threshold to .65.

  14. Add another conditional split below the Fuzzy Name Lookup and split the output into High and Low Similarity, again using the .70 number.

  15. Add a sort to sort the HIGHSIMILARITY output from the Conditional Split you just created. Remove the extra columns.

  16. Add the last Merge task to merge the Sort from the high-similarity fuzzy name match with all the other matches that have been merged so far. At this point, you have captured in the output of this Merge task all the transactions that you were not able to identify by invoice number but that you have been able to identify by customer attributes of e-mail or name. These are all of your medium-confidence matches. Knowing the customer might be good, but finding the payment invoice would be even better.

  17. Add another Lookup task to the Data Flow below the last Merge task. Name it Lookup Invoice By Customer. Connect the output of the Merge task to it. Open the editor. Put the following basic SQL query in as the reference table:

     "SELECT INVOICEID, CUSTOMERID, TotalBilledAmt FROM INVOICE" 

    In the Columns tab, link the CustomerID you have discovered to the CustomerID in the invoice lookup table, and connect the PaymentAmount field to the TotalBilledAmt field. Go to the Advanced tab and check Enable Memory Restriction, and then check Modify the SQL Statement. Update the caching SQL statement to the following:

     SELECT * FROM (SELECT INVOICEID, CUSTOMERID, TotalBilledAmt FROM INVOICE) AS refTable
     WHERE [refTable].[CUSTOMERID] = ?
       AND (ABS([refTable].[TotalBilledAmt] - ?) < ([refTable].[TotalBilledAmt] * .05))

  18. Click the Parameters button. A box for parameters appears, as in Figure 19-38. Select the PaymentAmount field to substitute for Parameter1; the first parameter corresponds to the CustomerID join you set up on the Columns tab. This query looks for invoices belonging to the matched customer with a payment amount within 5 percent of the billed amount. (A small worked example of this tolerance appears after this list of steps.)

    To have the result return an invoice number, click back on the Columns tab and select the InvoiceID field in the grid. At this stage, a failed lookup is acceptable: set the error-handling behavior to Ignore Failure and send the data through whether it matches or not. If you have only the customer ID, that's fine; if you have the invoice ID too, that's even better. Either way, the results continue downstream.

  19. Add an OLE DB Command transform at the bottom of the Data Flow. Connect it to the output of the last invoice lookup by amount. Set the connection to CaseStudy.OLEDB. Set the SqlCommand property to usp_BankBatchDetail_Match ?, ?, ?. Click Refresh to retrieve the parameters. As in the High-Confidence Data Flow, map the InvoiceID, CustomerID, and BankBatchDtlID fields to the parameters. The stored procedure runs for each row in the stream, updates the foreign keys, and marks the transaction as complete when both invoice and customer keys are present.
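
To see how the 5 percent tolerance in the caching statement behaves, here is a small standalone T-SQL check with made-up numbers (illustration only; written without DECLARE initializers so it runs on SQL Server 2005):

     -- A $19,500.00 payment against a $20,000.00 invoice:
     -- ABS(20000 - 19500) = 500, and 20000 * .05 = 1000, so this pair matches.
     DECLARE @TotalBilledAmt money, @PaymentAmount money;
     SET @TotalBilledAmt = 20000.00;
     SET @PaymentAmount  = 19500.00;

     SELECT CASE WHEN ABS(@TotalBilledAmt - @PaymentAmount)
                      < @TotalBilledAmt * .05
                 THEN 'match' ELSE 'no match'
            END AS AmountCheck;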

Figure 19-38

This completes the task of building the Medium-Confidence Data Flow and the CaseStudy_Process package. The Data Flow should look similar to Figure 19-39.

Figure 19-39

Interpreting the Results

Before you started this exercise of creating the CaseStudy_Process SSIS package, you had loaded a set of 16 payment transactions for matching into the BankBatchDetail table. By running a series of SQL statements comparing the RawInvoiceNbr with invoices and customers, you could retrieve a maximum of only 7 matches. That translates to a 44 percent match rate for payments sent to the payment processors without any further human interaction. This package, with its heavy use of Fuzzy Lookup transforms, increases the identification hit rate to 13 out of 16, an 81 percent match rate. The results break out as shown in the following table.

| Stage in Process              | # of New Matches | Match Percent |
|-------------------------------|------------------|---------------|
| High-Confidence Invoice Match | 2                | 12%           |
| Med-Confidence Invoice Match  | 9                | 56%           |
| Med-Confidence Customer Match | 2                | 12%           |

As you may recall, the business expectation was to improve matching so that all but 20 to 40 percent of the payments coming into Company ABC are matched automatically. With the test data you are right at, or just under, the best end of that range, and this is just a beginning. Remember that the unidentified items will be worked by SMEs, who will store the results of their customer matching in the CustomerLookup table. Incidentally, you already use this table, even though it is currently empty, in the CustomerLookup Lookup transform in the Medium-Confidence Data Flow. As SME-provided information is stored, the Data Flow will become "smarter" at matching incoming payments by referring to this source as well.

Now look at the three items that weren't matched by your first run.

| Item Matching Information | Payment Amount |
|---------------------------|----------------|
| Intl Sports Association   | $34,953.00     |
| JBROWN@CAPITALCYCLES.COM  | $21,484.00     |
| 181INTERNA                | $18,444.00     |

The first item looks like a customer name, and if you search the customer table, you'll find a similarly named customer, International Sport Assoc. Since it is highly likely that future payments will be remitted in the same manner, you'll store the match between the customer's actual record and the string Intl Sports Association in the CustomerLookup table (a sketch follows).
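
A hedged sketch of that SME entry, using the CustomerLookup columns referenced by the Medium-Confidence Data Flow (ImportType, RawDataToMatch, CustomerID). The 'LOCKBOX' import type and the uppercase raw string are assumptions:

     -- Record the SME's manual match so future runs resolve it automatically.
     DECLARE @CustomerID int;

     SELECT @CustomerID = CustomerID
     FROM Customer
     WHERE [Name] = 'International Sport Assoc';

     INSERT INTO CustomerLookup (ImportType, RawDataToMatch, CustomerID)
     VALUES ('LOCKBOX', 'INTL SPORTS ASSOCIATION', @CustomerID); -- ImportType value assumed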

The second item looks like a customer e-mail address. If you can find the customer to whom this e-mail address belongs, you can add that address directly to the customer table to enable a future match. There is one customer, named Capital Road Cycles, that has several invoices at or around $20,000. You'll update that customer's record with this e-mail address (a sketch follows).
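
A hedged sketch of that update (locating the row by the customer name is an assumption about how you'd find it):

     -- Attach the remitted e-mail address to the customer so the exact
     -- Email Lookup catches this payer on the next run.
     UPDATE Customer
     SET Email = 'JBROWN@CAPITALCYCLES.COM'
     WHERE [Name] = 'Capital Road Cycles';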

If you query the invoice table for invoice numbers like 181INTERNA (a sample query follows this paragraph), you find several, but they are all for an amount of $34,953.00. This payment is for $18,444.00. Because the payment differs significantly from the billed amount, someone is going to have to look at it; under the current business rules, this transaction would be processed manually regardless of anything you could have done. And because the candidate match is an invoice number, it also gives you nothing useful for the CustomerLookup table.
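
That check might look like the following; the truncated value suggests a prefix search:

     -- Find invoices whose number begins with the truncated value from the payment.
     SELECT InvoiceID, InvoiceNbr, CustomerID, TotalBilledAmt
     FROM Invoice
     WHERE UPPER(InvoiceNbr) LIKE '181INTERNA%';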

If you were to now delete all the rows from the BankBatch and BankBatchDetail tables and rerun both the CaseStudy_Load and CaseStudy_Process packages, the payment matching improves to 15 out of 16 matches, a whopping 94 percent. Company ABC will be highly pleased with the capabilities you have delivered with this SSIS solution.


