8.19 Machine-Learning and Fraud



Data mining tools based on machine-learning algorithms can extract rules directly from the data and build graphical decision trees. A decision tree visually segments the data so that situations with a high probability of fraud stand out. As we have seen, the core technology of these decision tree tools is usually a machine-learning algorithm, which automates the search for the important features and value ranges hidden in a database. For example, tools such as ANGOSS and SPSS can be used to discover situations in which fraudulent transactions are likely to increase, enabling e-retailers to take preventive steps to reduce their losses. ANGOSS uses CHAID and CART concurrently, while SPSS uses the machine-learning algorithm C5.0.
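To make the idea of automatically finding a "range" concrete, the following is a minimal sketch of how a CART-style tree might choose a single price threshold by minimizing weighted Gini impurity. It is an illustration only, not the procedure used by ANGOSS or SPSS; the function names and the toy data are hypothetical, and CHAID and C5.0 use different splitting criteria.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (1 = fraud)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_price_split(prices: List[float],
                     labels: List[int]) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted Gini) for the best split 'price < threshold'.

    The threshold is None if no split improves on the unsplit data.
    """
    pairs = sorted(zip(prices, labels))
    n = len(pairs)
    best_threshold: Optional[float] = None
    best_score = gini([y for _, y in pairs])  # impurity with no split at all
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # cannot split between two identical prices
        threshold = (low + high) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy sample: product prices and fraud labels (1 = fraud).
prices = [20.0, 35.0, 50.0, 60.0, 90.0, 120.0, 150.0, 300.0]
labels = [0, 0, 0, 0, 1, 1, 0, 1]

threshold, score = best_price_split(prices, labels)
print(threshold, score)
```

A real tool repeats this search over every candidate field and then recurses into each resulting segment, which is how nested ranges on price, rent, or age can emerge.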

8.19.1 Decision Trees

The decision tree shown in Figure 8.8 identifies the conditions under which the probability of fraud increases. The top node represents the total number of on-line sales for a single day (2,009), of which 13.2% were fraudulent. However, when the product price was between $70.00 and $188.00 (the fourth node from the left), the rate of fraud more than doubled to 28.1%; when the product price was over $254.00 (the node on the far right), it rose further to 32.2%.

Figure 8.8: Decision trees can uncover hidden ranges where fraud is higher than average.
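The comparison between a segment and the overall rate can be expressed as a simple ratio, often called lift. The short sketch below uses only the figures quoted in the text (2,009 sales, 13.2% baseline, 28.1% and 32.2% in the two price segments); the variable and function names are illustrative.

```python
# Figures quoted in the text for the root node of the tree.
total_sales = 2009
baseline_rate = 0.132

# Fraud rates quoted for the two high-risk price segments.
segments = {
    "price $70-$188": 0.281,
    "price over $254": 0.322,
}


def lift(segment_rate: float, base_rate: float) -> float:
    """How many times higher the segment's fraud rate is than the baseline."""
    return segment_rate / base_rate


# Approximate number of fraudulent sales implied by the baseline rate.
approx_fraud_count = round(total_sales * baseline_rate)

lifts = {name: lift(rate, baseline_rate) for name, rate in segments.items()}
for name, value in lifts.items():
    print(f"{name}: {value:.2f}x the baseline rate")
```

A lift above 2 is what the text means by the rate having "more than doubled" in the $70.00 to $188.00 range.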

Figure 8.9 examines the rate of fraud by type of product sold at an electronic consumer product site, where the average rate is 13.2%. As you can see, the rate rises to 22.4% for higher-priced items such as camcorders and PDAs, and climbs to 34.9% for digital cameras and PCs.

Figure 8.9: As fraud statistics show, computer equipment is high on criminals' lists.

Figure 8.10 shows a relationship between fraud and a specific demographic attribute, in this case the median rent of shoppers. This information is matched against their physical addresses, captured at the shopping cart on the site.

Figure 8.10: Fraud is highest in households where the median rent is $425-$548.

Interestingly, at this juncture, several clues to the profile of these on-line fraud perpetrators begin to emerge. They tend to be young, as the neural network model indicated (ages 25–44), and, as the decision tree tool discovered, they tend to live in apartments with rents in the range of $425 to $548. Through this type of data mining analysis, the user can interactively view the demographic features and dynamics associated with fraudulent transactions, enabling an e-business to begin to assemble a fraud profile.

8.19.2 Conditions for Fraud

Machine-learning-based data mining tools can also generate conditional and association rules directly from their segmentation analysis of a data set. Using a database with a sample of both legal and fraudulent transactions, rules can be extracted in order to anticipate and discover the conditions that are associated with fraudulent transactions.
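One way to read "extracting a rule" from a labeled sample is to score a candidate condition by how many records it covers and what share of those records are fraudulent. The sketch below does this for a single hand-written condition; the records, field names, and thresholds are hypothetical, and a real tool would search over many candidate conditions and also test their statistical significance.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def rule_stats(records: List[Record],
               condition: Callable[[Record], bool]) -> Tuple[int, float]:
    """Return (records matched, share of matched records labeled fraud)."""
    matched = [r for r in records if condition(r)]
    if not matched:
        return 0, 0.0
    fraud = sum(1 for r in matched if r["label"] == "fraud")
    return len(matched), fraud / len(matched)


# Hypothetical sample containing both legal and fraudulent transactions.
records = [
    {"price": 120.0, "median_rent": 500.0, "label": "fraud"},
    {"price": 95.0, "median_rent": 480.0, "label": "fraud"},
    {"price": 150.0, "median_rent": 530.0, "label": "legal"},
    {"price": 30.0, "median_rent": 510.0, "label": "legal"},
    {"price": 110.0, "median_rent": 900.0, "label": "legal"},
    {"price": 80.0, "median_rent": 440.0, "label": "fraud"},
]


def candidate(r: Record) -> bool:
    """Hypothetical condition: mid-priced item and rent in a narrow band."""
    return 70.0 <= r["price"] < 188.0 and 425.0 <= r["median_rent"] <= 548.0


count, confidence = rule_stats(records, candidate)
print(f"matched {count} records, fraud share {confidence:.2f}")
```

On this toy sample the condition covers four records, three of them fraudulent, so it would be reported with a probability of 0.75.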

These rules can be deployed to raise real-time alerts when a transaction and the associated demographics of the on-line shopper indicate a high probability of fraud. From these analyses, rules can be generated in many programming language formats, such as C or Java, enabling a Web site to incorporate them in its e-commerce application server software; for example:

      IF     PERCENT AGE_25-44 is    62.00
      AND    MEDIAN RENT is          527.00   74.00
      THEN   TRANSACTION is          Fraud
             Rule's probability:     0.727
             The rule exists in:     320 records
             Significance level:     .001 error probability

These rules represent conditions under which fraud is likely to occur; they can be integrated into an e-commerce system or outsourced to a Web service provider. The following is a sample rule written in SQL; other output formats include Visual Basic, C, and Java.

      -- SQL Predictive Model
      -- A separate nested case statement has been
      -- generated for each predicted outcome.
      -- Block # 1: Calculates the probability
      --            that 'Transaction' equals 'Fraud'

      (CASE
      WHEN (pprice >= 16 and pprice < 24) THEN
           0.102564102564
      WHEN (pprice >= 24 and pprice < 44) THEN
           0
      WHEN (pprice >= 44 and pprice < 70) THEN
           0.129032258065
      WHEN (pprice >= 70 and pprice < 188) THEN
           (CASE
           WHEN (MED_RENT >= 0 and MED_RENT < 362) THEN
                0.274509803922
           WHEN (MED_RENT >= 362 and MED_RENT < 921) THEN
                0.139784946237
           WHEN (MED_RENT >= 921 and MED_RENT < 1001) THEN
                0.777777777778
           Else
                0.280701754386
           END)
      WHEN (pprice >= 188 and pprice < 254) THEN
           0.0500758725341
      WHEN (pprice >= 254 and pprice <= 2321) THEN
           (CASE
           WHEN (P_AGE5_17 >= 0 and P_AGE5_17 < 6) THEN
                0.0857142857143
           WHEN (P_AGE5_17 >= 6 and P_AGE5_17 < 12) THEN
                0.575342465753
           WHEN (P_AGE5_17 >= 12 and P_AGE5_17 < 14) THEN
                0.181818181818
           WHEN (P_AGE5_17 >= 14 and P_AGE5_17 <= 33) THEN
                0.314189189189
           Else
                0.322440087146
           END)
      Else
           0.131906421105
      END)
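For readers who want to see how such a generated rule might sit inside application code, here is a sketch that ports the SQL CASE expression above to a Python function and applies an alert threshold. The probabilities and ranges are copied from the SQL; the function names, the treatment of missing values as `None` (mirroring how a SQL NULL falls through to the ELSE branch), and the 0.5 alert threshold are assumptions made for illustration.

```python
from typing import Optional

# Outer ELSE branch of the SQL: price missing or outside every listed range.
DEFAULT_SCORE = 0.131906421105


def _between(value: Optional[float], low: float, high: float,
             inclusive_high: bool = False) -> bool:
    """Range test; a missing value never matches, like a SQL NULL."""
    if value is None:
        return False
    return low <= value <= high if inclusive_high else low <= value < high


def fraud_probability(pprice: Optional[float],
                      med_rent: Optional[float] = None,
                      p_age5_17: Optional[float] = None) -> float:
    """Python port of the nested CASE expression shown above."""
    if _between(pprice, 16, 24):
        return 0.102564102564
    if _between(pprice, 24, 44):
        return 0.0
    if _between(pprice, 44, 70):
        return 0.129032258065
    if _between(pprice, 70, 188):
        if _between(med_rent, 0, 362):
            return 0.274509803922
        if _between(med_rent, 362, 921):
            return 0.139784946237
        if _between(med_rent, 921, 1001):
            return 0.777777777778
        return 0.280701754386
    if _between(pprice, 188, 254):
        return 0.0500758725341
    if _between(pprice, 254, 2321, inclusive_high=True):
        if _between(p_age5_17, 0, 6):
            return 0.0857142857143
        if _between(p_age5_17, 6, 12):
            return 0.575342465753
        if _between(p_age5_17, 12, 14):
            return 0.181818181818
        if _between(p_age5_17, 14, 33, inclusive_high=True):
            return 0.314189189189
        return 0.322440087146
    return DEFAULT_SCORE


def should_alert(score: float, threshold: float = 0.5) -> bool:
    """Flag a transaction for review; the 0.5 threshold is hypothetical."""
    return score >= threshold


# Example: a $100 item shipped to an area with a median rent of $950.
score = fraud_probability(100.0, med_rent=950.0)
print(score, should_alert(score))
```

In practice the threshold would be tuned against the cost of reviewing a legitimate order versus the cost of shipping a fraudulent one.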




Investigative Data Mining for Security and Criminal Detection
ISBN: 0750676132
Year: 2005
Pages: 232
Authors: Jesus Mena
