7.7 Detecting Suspicious Government Financial Transactions: A Case Study

The following data mining investigation describes work done under contract with the federal Defense Finance and Accounting Service (DFAS) by prime contractor EDS Corporation and subcontractors from Federal Data Corporation, Elder Research, and Abbott Consulting.

As with most data mining projects, the analysts proceeded through several steps, starting with an understanding of the crime, followed by the establishment of goals. Next came the assembly of the data, an assessment of the challenges, the selection of a modeling strategy and algorithms, and the creation and testing of the models with the validation data set, culminating in some final observations and recommendations by the team of data miners.

7.7.1 Step One: Identify Investigative Objective

As with most data mining projects, the team began by understanding and clearly defining the scope of the criminal behavior and the data mining problem. The client was well aware of a historical problem with intentional fraud in its vendor payment systems. Compounding the problem, however, as with most matters of criminal behavior, is that there are only a limited number of known cases. Furthermore, the transaction data for those known fraud cases is incomplete. Finally, there is the problem that fraud is often hidden in large sets of legitimate transactions.

7.7.2 Step Two: Establish Investigation Goals

The primary goal was to identify suspicious payments while maintaining a low false-alarm rate, a cost concern often associated with fraud detection models; only a limited number of examiners are available to investigate the payments that are flagged. A secondary goal was to build a data mining process that could be generalized and reproduced for other applications and business questions within the agency. Lastly, knowledge transfer was desired, so that the government could eventually perform the data mining process internally with its existing staff.

7.7.3 Step Three: Conduct Knowledge Discovery Process

This is the step in which the data miners spend time understanding the business processes involved. It is also the stage at which the data itself is understood and prepared. As part of this preparation, initial exploratory analyses are performed in order to refine the data preparation process for the creation of new models.

7.7.4 Step Four: Assess Investigation Challenges

In order to develop an effective data mining strategy, an assessment of the modeling challenges was performed at this juncture. First, the data set is a very large payment database with incomplete information in the vendor payment data file, a common problem when mining production data. In addition, payments are unlabeled and cannot be verified. Furthermore, there is a very small number of known fraudulent payments, with instances of multiple payments from the same case; these conditions can lead to over-fitting or over-searching during modeling. To address these challenges, a three-stage process of training, testing, and validation data sets would be used, along with a cross-validation methodology in which several modelers would create multiple models using a variety of algorithms and data mining strategies.
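
A minimal sketch of such a three-stage split in Python, assuming the payment data sits in a pandas DataFrame named payments with a label column (both names are hypothetical stand-ins for the vendor payment file):

    # Three-stage split: training, testing, and validation subsets.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    def three_way_split(payments: pd.DataFrame, seed: int = 0):
        # Carve off a validation set first, stratified on the label so
        # the scarce fraud cases appear in every partition.
        dev, validation = train_test_split(
            payments, test_size=0.2, stratify=payments["label"],
            random_state=seed)
        # Split the remainder into training and testing subsets.
        train, test = train_test_split(
            dev, test_size=0.25, stratify=dev["label"], random_state=seed)
        return train, test, validation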

7.7.5 Step Five: Set Strategy for Investigation

The main strategy was to create multiple structured and random samples for training and testing of fraud detection models. This called for 11 structured samples ("splits") of the known fraud data, each with training, testing, and validation subsets, yielding 33 overlapping samples of fraud cases. In addition, 11 corresponding random splits, each with training, testing, and validation subsets, yielded 33 non-overlapping samples of non-fraud data (see Figure 7.4). The strategy employed a small enough set of non-fraud data to make it unlikely that unlabeled "non-fraud" transactions were really misclassified fraud; this also served as a data-balancing step. Next, multiple algorithms were used to construct the models: decision trees, rule induction tools, neural networks, and a priori rules. The strategy also called for multiple modelers, each assigned to work on different splits of the data. The objective was to generate hundreds of models and to keep the 11 best to create a model ensemble.

Figure 7.4: Eleven sets of training, testing, validation data (33 sets in all).
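
As a sketch of this sampling scheme, assume fraud and nonfraud are pandas DataFrames holding the known fraud cases and the unlabeled remainder; the sampling details below are illustrative assumptions, not the project's exact procedure:

    import numpy as np
    import pandas as pd

    def make_splits(fraud, nonfraud, n_splits=11, seed=0):
        rng = np.random.default_rng(seed)
        # Shuffle the non-fraud rows once and deal them out without
        # replacement, so the 11 non-fraud samples never overlap.
        order = rng.permutation(len(nonfraud))
        splits = []
        for chunk in np.array_split(order, n_splits):
            # Fraud cases are drawn with replacement, so the same known
            # case can appear in several splits (overlapping samples).
            fraud_part = fraud.sample(n=len(fraud), replace=True,
                                      random_state=int(rng.integers(1 << 30)))
            splits.append(pd.concat([fraud_part, nonfraud.iloc[chunk]]))
        return splits

Keeping each non-fraud sample small relative to the full payment file is what provides the class balancing described above.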

In addition to being split, the data was also rotated to ensure the validity of the models (see Figure 7.5).

Figure 7.5: The data was rotated in the training, testing, and validation phases.
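
One way to express the rotation as a sketch; the team's exact scheme is not specified, so the role assignment here is illustrative:

    def rotate(splits):
        # Each split takes a turn as validation data; its neighbor is
        # used for testing and the remaining splits for training.
        n = len(splits)
        for i in range(n):
            validation = splits[i]
            test = splits[(i + 1) % n]
            train = [splits[j] for j in range(n)
                     if j not in (i, (i + 1) % n)]
            yield train, test, validation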

7.7.6 Step Six: Evaluate Investigation Algorithms

Algorithms tend to have different error rates, depending on the data sets to which they are applied (see Figure 7.6). Empirical studies, such as StatLog, have demonstrated that the structure of the data influences the classification accuracy of algorithms such as neural networks, regression, and decision trees.

Figure 7.6: Five algorithms on six data sets yielded different results.
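
This kind of comparison is straightforward to reproduce. The sketch below uses scikit-learn with a synthetic, imbalanced data set standing in for the payment data; the algorithms are of the kinds named in this case study:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    # Imbalanced synthetic data: roughly 5% of cases in the rare class.
    X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
        "neural network": MLPClassifier(max_iter=1000, random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")

Running the same loop over several data sets reproduces the StatLog observation: no single algorithm dominates on all of them.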

7.7.7 Step Seven: Investigation Ensembles Are Selected

The team agreed to use ensembles of models for several reasons: single-model synthesis can be difficult; algorithms search for the best model, but not exhaustively, whether by decision trees, polynomial and neural networks, or logistic regression; and iterative algorithms, such as neural networks, converge to local minima. Ensembles smooth out jagged decision boundaries and provide a means of eliminating the errors of individual classifiers (see Figure 7.7).

Figure 7.7: Model ensembles make decisions by committee of algorithms.
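
A minimal sketch of such a committee using scikit-learn's VotingClassifier; X_train, y_train, and X_validation are assumed to have been built from the splits described in Step Five:

    from sklearn.ensemble import VotingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    committee = VotingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("logit", LogisticRegression(max_iter=1000)),
                    ("net", MLPClassifier(max_iter=1000, random_state=0))],
        voting="soft")  # average the members' predicted probabilities
    committee.fit(X_train, y_train)
    # Probability of the fraud class, used as a suspicion score.
    suspicion = committee.predict_proba(X_validation)[:, 1]

Averaging probabilities ("soft" voting) is what smooths the jagged boundaries of the individual members.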

7.7.8 Step Eight: Data Is Prepared

During the data preparation, some input and output data specifications are controlled via automated scripting. At the same time, some fields are removed prior to modeling, and some new features (ratios) are created (see Figure 7.8).

Figure 7.8: Data is prepared for mining.
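
A sketch of this kind of preparation in pandas; the column names (voucher_id, clerk_id, vendor, amount) are hypothetical stand-ins for fields in the payment file:

    import pandas as pd

    def prepare(payments: pd.DataFrame) -> pd.DataFrame:
        # Remove identifier fields that carry no predictive signal.
        prepared = payments.drop(columns=["voucher_id", "clerk_id"])
        # A ratio feature: each payment relative to its vendor's average.
        vendor_mean = prepared.groupby("vendor")["amount"].transform("mean")
        prepared["amount_to_vendor_mean"] = prepared["amount"] / vendor_mean
        return prepared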

7.7.9 Step Nine: Models Are Created and Tested

Models are created on the basis of multiple criteria, such as fraud sensitivity and false-alarm rate; they are assessed on the basis of their overall performance by weight, rank, and algorithm diversity (see Figure 7.9).

Figure 7.9: Model creation stream in Clementine.
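
As a sketch of ranking candidate models on these criteria, with illustrative model names and scores (and an assumed false-alarm weight reflecting the limited pool of examiners):

    # Each entry: (fraud sensitivity, false-alarm rate); values invented.
    results = {
        "tree_split03":  (0.91, 0.040),
        "net_split07":   (0.88, 0.015),
        "rules_split01": (0.95, 0.090),
    }
    # Penalize false alarms more heavily; the 2.0 weight is an assumption.
    ranked = sorted(results,
                    key=lambda m: results[m][0] - 2.0 * results[m][1],
                    reverse=True)
    keep = ranked[:11]  # retain the best models for the ensemble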

The models are combined based on their scores in order to increase the robustness of their predictions (see Figure 7.10).

Figure 7.10: Results of final models.
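
A minimal sketch of score-based combination, assuming each selected model produces a per-payment suspicion score:

    import numpy as np

    def combine_scores(score_matrix: np.ndarray) -> np.ndarray:
        # One row per payment, one column per model; the mean of the
        # members' scores serves as the combined suspicion score.
        return score_matrix.mean(axis=1)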

7.7.10 Step Ten: Models Are Tested on Validation Data

The team found that the results were heavily dependent on which transactions were used for training, testing, and validation. Overall, however, the ensemble was the best classifier: 97% of known fraud cases were accurately detected in the validation data set sample (see Figure 7.11). The end result was that the models selected 1,217 suspicious payments for further investigation by the agency. Those inquiries are in progress.

Figure 7.11: Overall model score on validation data.
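
The headline figures can be computed directly from the combined scores. A sketch, assuming y_true holds the known labels of the validation sample and suspicion the committee scores from the earlier sketches; the 0.5 threshold is an assumption:

    flagged = suspicion >= 0.5
    detection_rate = flagged[y_true == 1].mean()  # share of known fraud caught
    n_referred = int(flagged.sum())               # payments referred to examiners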

7.7.11 Conclusions of Investigation

In terms of performance, the team found that model ensembles mitigate risk compared to single-model solutions: the ensemble, although not necessarily the best model, was always among the best and rarely among the worst. They also found that model ensembles help identify key variables, that is, attributes that are important across several models.

7.7.12 Future Directions

It was recommended that the data mining process be extended to other operating locations by reusing these models. It was also suggested that predictions be prioritized on the basis of the dollar amounts of the suspicious payments and that examiner feedback be incorporated in order to develop a methodology integrating the entire process, from data collection through examinations. Lastly, the team recommended that suspicious-payment rules be deployed to prevent future fraudulent acts. The agency plans to convert the Clementine stream to C code.

As this case study demonstrates, it is very important to be able to develop and test models using different algorithms in order to arrive at an optimal classification solution.

This case study was also presented at The Twelfth International Conference on Tools with Artificial Intelligence, Vancouver, British Columbia, November 13–15, 2000, by H. Vafaie, D.W. Abbott, M. Hutchins, and I.P. Matkovsky, under the title "Combining Multiple Models Across Algorithms and Samples for Improved Results." The author would like to thank and acknowledge Dean Abbott of Abbott-Consulting.com and Bill Haffrey of SPSS for sharing the material.



