Defining Data Mining | Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications

Data mining capability is not something you can buy off the shelf. Data mining requires building a BI decision-support application, specifically a data mining application, using a data mining tool. The data mining application can then use a sophisticated blend of classical and advanced components like artificial intelligence, pattern recognition, databases, traditional statistics, and graphics to present hidden relationships and patterns found in the organization's data pool.

Data mining is the analysis of data with the intent to discover gems of hidden information in the vast quantity of data that has been captured in the normal course of running the business. Data mining is different from conventional statistical analysis, as indicated in Table 13.1. They both have strengths and weaknesses.

Table 13.1. Statistical Analysis versus Data Mining

Statistical Analysis	Data Mining
Statisticians usually start with a hypothesis.	Data mining does not require a hypothesis.
Statisticians have to develop their own equations to match their hypothesis.	Data mining algorithms can automatically develop the equations.
Statistical analysis uses only numerical data.	Data mining can use different types of data (e.g., text, voice), not just numerical data.
Statisticians can find and filter dirty data during their analysis.	Data mining depends on clean, well-documented data.
Statisticians interpret their own results and convey these results to the business managers and business executives.	Data mining results are not easy to interpret. A statistician must still be involved in analyzing the data mining results and conveying the findings to the business managers and business executives.

Tables 13.2 and 13.3 use specific examples (insurance fraud and market segmentation, respectively) to illustrate the differences between traditional analysis techniques and discovery-driven data mining.

Table 13.2. Example of Insurance Fraud Analysis

Traditional Analysis Technique	Discovery-Driven Data Mining
An analyst notices a pattern of behavior that might indicate insurance fraud. Based on this hypothesis, the analyst creates a set of queries to determine whether this observed behavior actually constitutes fraud. If the results are not conclusive, the analyst starts over with a modified or new hypothesis and more queries. Not only is this process time-consuming, but it also depends on the analyst's subjective interpretation of the results. More importantly, this process will not find any patterns of fraud that the analyst does not already suspect.	An analyst sets up the data mining application, then "trains" it to find all unusual patterns, trends, or deviations from the norm that might constitute insurance fraud. The data mining results unearth various situations that the analyst can investigate further. For the follow-up investigation, the analyst can then use verification-driven queries. Together, these efforts can help the analyst build a model predicting which customers or potential customers might commit fraud.
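The idea of flagging "deviations from the norm" without a prior hypothesis can be illustrated with a small sketch. It is only a stand-in for what a data mining tool does: the claim amounts are invented, the function name `flag_deviations` is hypothetical, and the modified z-score rule (median and median absolute deviation, with the conventional 3.5 cutoff) is one simple choice among many, not the method of any particular tool.

```python
from statistics import median

def flag_deviations(amounts, threshold=3.5):
    """Return the positions of values that deviate strongly from the norm.

    Uses the modified z-score, which is based on the median and the
    median absolute deviation (MAD) so that the outliers themselves
    do not distort the definition of "normal".
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]

# Invented claim amounts (assumption for illustration only).
claims = [1200, 1350, 1100, 1280, 1420, 1190, 9800, 1310]
suspicious = flag_deviations(claims)
print(suspicious)  # positions the analyst might investigate further
```

A flagged claim is only a candidate. As the table notes, the analyst would follow up with verification-driven queries before treating it as fraud.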

Table 13.3. Example of Market Segmentation Analysis

Traditional Analysis Technique	Discovery-Driven Data Mining
An analyst wants to study the buying behaviors of known classes of customers (e.g., retired school teachers, young urban professionals) in order to design targeted marketing programs. First, the analyst uses known characteristics about those classes of customers and tries to sort them into groups. Second, he or she studies the buying behaviors common to each group. The analyst repeats this process until he or she is satisfied with the final customer groupings.	The data mining tool "studies" the database using the clustering technique to identify all groups of customers with distinct buying patterns. After the data is mined and the groupings are presented, the analyst can use various query, reporting, and multidimensional analysis tools to analyze the results.
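As a rough illustration of the clustering technique, the following sketch groups customers with a basic k-means loop. Everything in it is an assumption made for the example: the customer data (store visits per month, average purchase amount) is invented, the number of groups `k` is supplied by hand, and the evenly spaced seeding is a simplification that replaces the randomized seeding real tools typically use.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Group points into k clusters; return (labels, centroids)."""
    # Simplistic deterministic seeding: k points spread evenly over the input.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the grouping has converged
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Invented customers: (visits per month, average purchase amount).
customers = [
    (2, 15.0), (3, 18.0), (2, 20.0),      # infrequent, small purchases
    (12, 22.0), (14, 19.0), (13, 25.0),   # frequent, small purchases
    (3, 140.0), (2, 155.0), (4, 150.0),   # infrequent, large purchases
]
labels, centroids = kmeans(customers, k=3)
print(labels)
```

No customer class is defined in advance here. The groups emerge from the buying data, and the analyst then interprets them with query and reporting tools.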

The Importance of Data Mining

Discovery-driven data mining finds answers to questions that decision-makers do not know to ask. Because of this powerful capability, data mining is an important component of business intelligence. One may even say that data mining, also called knowledge discovery, is a breakthrough in providing business intelligence to strategic decision-makers. At first glance, this claim may seem excessive. After all, many current decision-support applications provide business intelligence and insights.

Executive information systems (EISs) enable senior managers to monitor, examine, and change many aspects of their business operations.
Query and reporting tools give business analysts the ability to investigate company performance and customer behavior.
Statistical tools enable statisticians to perform sophisticated studies of the behavior of a business.
New multidimensional online analytical processing (OLAP) tools deliver the ability to perform "what if" analysis and to look at a large number of interdependent factors involved in a business problem.

Many of these tools work with BI applications and can sift through vast amounts of data. Given this abundance of tools, what is so different about discovery-driven data mining? The big difference is that traditional analysis techniques, even sophisticated ones, rely on the analyst to know what to look for in the data. The analyst creates and runs queries based on some hypotheses and guesses about possible relationships, trends, and correlations thought to be present in the data. Similarly, the executive relies on the business views built into the EIS tool, which can examine only the factors the tool is programmed to review. As problems become more complex and involve more variables to analyze, these traditional analysis techniques can fall short. In contrast, discovery-driven data mining supports very subtle and complex investigations.

Data Sources for Data Mining

BI target databases are popular sources for data mining applications. They contain a wealth of internal data that was gathered and consolidated across business boundaries, validated, and cleansed in the extract/transform/load (ETL) process. BI target databases may also contain valuable external data, such as regulations, demographics, or geographic information. Combining external data with internal organizational data offers a splendid foundation for data mining.

The drawback of multidimensional BI target databases is that, because the data has been summarized, hidden data patterns, data relationships, and data associations are often no longer discernible in that data pool. For example, the data mining tool may not be able to perform the common data mining task of market basket analysis (also called associations discovery, described in the next section) based on summarized sales data because the details of each individual sale may have been lost in the summarization. Therefore, operational files and databases are also popular sources for data mining applications, especially because they contain transaction-level detailed data with a myriad of hidden data patterns, data relationships, and data associations.
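The loss of detail through summarization can be shown with a small sketch. The baskets below are invented, and the helper names `support` and `confidence` are hypothetical. The measures themselves are the standard ones used in associations discovery.

```python
from collections import Counter
from itertools import combinations

# Transaction-level detail: one set of items per sale (invented data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"milk"},
]

# Summarized view: in how many sales each item appears.
item_totals = Counter(item for basket in transactions for item in basket)

# Detailed view: how often each pair of items appears in the same sale.
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)

def support(pair):
    """Fraction of all sales that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of sales containing the antecedent that also contain the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_totals[antecedent]

print(item_totals["bread"], item_totals["butter"], item_totals["milk"])
print(pair_counts[("bread", "butter")], pair_counts[("bread", "milk")])
```

In the summarized totals, bread, butter, and milk look identical. Only the transaction-level data reveals that bread and butter are bought together far more often than bread and milk, which is exactly the association a summary cannot reconstruct.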

Exercise caution with operational systems extracts because the data could contain many duplicates, inconsistencies, and errors and could skew the data mining results.
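A short sketch shows how duplicates alone can skew a result. The extract rows, the `txn_id` key, and the helper `drop_duplicates` are all hypothetical. Real cleansing would also have to deal with inconsistencies and errors, which this example does not attempt.

```python
from statistics import mean

# Invented operational extract in which one transaction was loaded three times.
extract = [
    {"txn_id": 101, "amount": 40.0},
    {"txn_id": 102, "amount": 60.0},
    {"txn_id": 103, "amount": 500.0},
    {"txn_id": 103, "amount": 500.0},
    {"txn_id": 103, "amount": 500.0},
]

def drop_duplicates(rows, key="txn_id"):
    """Keep only the first row seen for each key value."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

raw_mean = mean(r["amount"] for r in extract)
clean_mean = mean(r["amount"] for r in drop_duplicates(extract))
print(raw_mean, clean_mean)
```

The duplicated transaction pulls the average sale amount well above its true value, and any pattern mined from the raw extract would inherit that distortion.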

Data mining tools could theoretically access the operational databases and BI target databases directly without building data mining databases first, as long as the database structures are supported by the tool (e.g., relational like Oracle, hierarchical like IMS, or even a flat file like VSAM). However, this is not an advisable practice for several reasons.

The data pool needs to be able to change for different data mining runs, such as dropping a sales region or restricting a product type for a specific mining purpose. Changing the data content of operational or BI target databases is not possible.
The performance of operational as well as BI target databases would be impacted by the data mining operations. That is unacceptable for operational databases and not desirable for BI target databases.
A data mining operation may need detailed historical data. Operational databases do not store historical data, and BI target databases often do not have the desired level of detail. Archival tapes may have to be restored and merged to extract the desired data.

Therefore, organizations often extract data for data mining as needed from their BI target databases and from their operational files and databases into special-purpose data mining databases (Figure 13.1).
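The extraction step can be sketched with SQLite from the Python standard library. The setup is an assumption made for illustration: an in-memory database stands in for the BI target database, a second attached in-memory database stands in for the data mining database, and the `sales` table, its columns, and the function `build_mining_table` are invented.

```python
import sqlite3

# "main" stands in for the BI target database, "mining" for the
# special-purpose data mining database (both in memory for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mining")

conn.execute(
    "CREATE TABLE main.sales "
    "(txn_id INTEGER, region TEXT, product_type TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO main.sales VALUES (?, ?, ?, ?)",
    [
        (1, "East", "grocery", 25.0),
        (2, "West", "grocery", 30.0),
        (3, "East", "apparel", 80.0),
        (4, "North", "grocery", 12.5),
    ],
)
conn.commit()

def build_mining_table(conn, excluded_region, product_type):
    """Rebuild the mining data pool for one run without touching the source."""
    conn.execute("DROP TABLE IF EXISTS mining.sales")
    conn.execute(
        "CREATE TABLE mining.sales "
        "(txn_id INTEGER, region TEXT, product_type TEXT, amount REAL)"
    )
    conn.execute(
        "INSERT INTO mining.sales "
        "SELECT * FROM main.sales WHERE region <> ? AND product_type = ?",
        (excluded_region, product_type),
    )
    conn.commit()

# One mining run: drop the West region and restrict to grocery products.
build_mining_table(conn, "West", "grocery")
mined_rows = conn.execute(
    "SELECT txn_id FROM mining.sales ORDER BY txn_id"
).fetchall()
source_count = conn.execute("SELECT COUNT(*) FROM main.sales").fetchone()[0]
print(mined_rows, source_count)
```

Because each run rebuilds its own copy, the data pool can change from one mining purpose to the next while the source rows stay intact and the source database carries no mining workload beyond the extract itself.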

Figure 13.1. Data Sources for Data Mining Applications
