Microsoft Data Mining Algorithms

Data mining algorithms are the logic used to create mining models. Several standard algorithms in the data mining community have been carefully tested and honed over time. For example, one of the algorithms used to build decision trees applies a Bayesian method to score the splits that form the branches of the tree. The roots of this method (so to speak) trace back to its namesake, Thomas Bayes, who first established a mathematical basis for probability inference in the 1700s.

The Data Mining group at Microsoft has been working diligently to expand the number of algorithms offered in SQL Server 2005 and to improve their accuracy. SQL Server Data Mining includes seven algorithms that cover a large percentage of the common data mining application areas. The seven core algorithms are:

  • Decision Trees (and Linear Regression)

  • Naïve Bayes

  • Clustering

  • Sequence Clustering

  • Time Series

  • Association

  • Neural Network (and Logistic Regression)

The two regression algorithms work by setting parameters on their parent algorithms to generate the regression results. Some of these higher-level algorithms also include parameters the data miner can use to choose from several underlying algorithms to generate the model. If you plan to do serious data mining, you need to know what these algorithms are and how they work so you can apply them to the appropriate problems and get the best performance. We briefly describe each of these algorithms in the following list. The Books Online topic "Data Mining Algorithms" is a good starting point for additional information about how each of these algorithms works.
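To make the mechanics concrete, here is a minimal DMX sketch showing how the algorithm, and thus a regression variant, is selected in the USING clause when a mining model is defined. The model and column names are hypothetical, and the training step (INSERT INTO) is omitted:

    -- Hypothetical DMX sketch; the USING clause selects the algorithm.
    -- Microsoft_Linear_Regression is the Decision Trees algorithm
    -- constrained to fit a single regression formula.
    CREATE MINING MODEL IncomeEstimator
    (
        CustomerID   LONG KEY,
        Age          LONG CONTINUOUS,
        YearlyIncome DOUBLE CONTINUOUS PREDICT
    )
    USING Microsoft_Linear_Regression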

Reference 

For more detailed information, see the book Data Mining with SQL Server 2005 (Wiley, 2005) by ZhaoHui Tang and Jamie MacLennan, key members of the Microsoft SQL Server 2005 Data Mining team.

Decision Trees

The Microsoft Decision Trees algorithm supports both classification and estimation. It works well for predictive modeling for both discrete and continuous attributes.

The process of building a decision tree starts with the dependent variable to be predicted and runs through the independent variables to see which one most effectively divides the population. The goal is to identify the variable that splits the cases into groups where the predicted variable (or class) either predominates or is faintly represented. The best starting variable for the tree is the one that creates groups that are the most different from each other, that is, the most diverse.

For example, if you're creating a decision tree to identify couples who are likely to form a successful marriage, you'd need a training set of input attributes for both members of the couple and an assessment of whether or not the partnership is successful. The input attributes might include the age of each individual, religion, political views, gender, relationship role (husband or wife), and so on. The predictable attribute might be MarriageOutcome with the values of Success or Fail. The first split in the decision tree, the one that creates the biggest split, might be based on a variable called PoliticalViews: a discrete variable with the values of Similar and Different. Figure 10.2 shows that this initial split results in one group (PoliticalViews=Similar) that has a much higher percentage of successful marriages than the other (PoliticalViews=Different). The next split might be different for each of the two branches. In Figure 10.2, the top branch splits based on the height difference between the two (calculated as height of husband minus height of wife). The lower branch also splits on height difference, but uses different cutoff points. It seems that couples can better tolerate a height difference if their political views are similar.

Figure 10.2: A simple decision tree to predict relationship success

The bottom branch of this tree indicates that the chances of having a successful marriage are more likely when the couple shares similar political views (.72). Following down that same branch, the chances get better when the husband is at least as tall as, but not more than 10 inches taller than, the wife (.81). The group with the lowest probability of success (.08) is characterized by different political views and a height difference where the husband is either less than 2 inches taller or 8 inches or more taller than the wife. Once you build a decision tree like this, you could use it to predict the success of a given relationship by entering the appropriate attributes for both partners. At that point, you'd be well on your way to building a matching engine for a dating web site.
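To make this concrete, the following DMX sketch shows how such a model might be defined and then queried with a singleton prediction. The CREATE MINING MODEL and PREDICTION JOIN constructs are standard DMX; the model name, column names, and input values are hypothetical, and the training step is omitted:

    -- Hypothetical model definition for the marriage example.
    CREATE MINING MODEL MarriagePredictor
    (
        CoupleID         LONG KEY,
        PoliticalViews   TEXT DISCRETE,        -- 'Similar' or 'Different'
        HeightDifference LONG CONTINUOUS,      -- husband height minus wife height, in inches
        MarriageOutcome  TEXT DISCRETE PREDICT -- 'Success' or 'Fail'
    )
    USING Microsoft_Decision_Trees

    -- After training, score a new couple with a singleton prediction query.
    SELECT Predict(MarriageOutcome),
           PredictProbability(MarriageOutcome)
    FROM MarriagePredictor
    NATURAL PREDICTION JOIN
    (SELECT 'Similar' AS PoliticalViews, 4 AS HeightDifference) AS NewCouple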

Naïve Bayes

The Microsoft Naïve Bayes algorithm is a good starting point for many data mining projects. It is a simplified version of Decision Trees, and can be used for classification and prediction for discrete attributes. Fortunately, if you select the Naïve Bayes algorithm in the data mining wizard, it will automatically discretize your data by converting any continuous variables in the input case set to discrete values.

The Naïve Bayes algorithm is fairly simple, based on the relative probabilities of the different values of each attribute, given the value of the predictable attribute. For example, if you had a case set of individuals with their occupations and income ranges, you could build a Naïve Bayes model to predict Income Range given an Occupation. The decision tree in Figure 10.2 could have been generated by the Naïve Bayes algorithm because all the variables are discrete (although the Tree Viewer is not available for Naïve Bayes models). The required probability calculations are almost all done as part of the process of building the mining model cube, so the results are returned quickly.
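As a hedged sketch of the occupation/income example, the DMX below defines a Naïve Bayes model with hypothetical model and column names. The DISCRETIZED content type illustrates how a continuous input would be bucketed into ranges, since the algorithm handles only discrete attributes:

    -- Hypothetical Naive Bayes model; non-key inputs must be discrete.
    CREATE MINING MODEL IncomePredictor
    (
        CustomerID  LONG KEY,
        Occupation  TEXT DISCRETE,
        Age         LONG DISCRETIZED(EQUAL_AREAS, 5),  -- continuous input bucketed into 5 ranges
        IncomeRange TEXT DISCRETE PREDICT
    )
    USING Microsoft_Naive_Bayes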

Clustering

The clustering algorithm is designed to meet the clustering or segmentation business need described earlier. Clustering is generally considered a density estimation problem with the assumption that there are multiple populations in a set, each with its own density distribution. (It's sentences like the previous one that serve to remind us that statisticians speak in a different language.) It's easier to understand clustering visually: A simple spreadsheet graph can be used as an eyeball clustering tool, especially when there are only two variables. It's easy to see where the dense clusters are located. For example, a graph showing per-capita income versus per-capita national debt for each country in the world would quickly reveal several obvious clusters of countries. We explore the idea of graphically identifying clusters further in the next section. The challenge is finding these clusters when there are more than two variables, or when the variables are discrete and non-numeric rather than continuous.

Note 

The Microsoft Clustering algorithm uses what's known as an Expectation-Maximization (EM) approach to identifying clusters. An alternative, distance-based clustering mechanism called K-means is available by setting parameters on the model.
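The choice is made through the CLUSTERING_METHOD algorithm parameter, as in this sketch. The model and column names are hypothetical (they echo the country example above); the parameter values are the documented settings:

    -- Hypothetical segmentation model using the K-means variant.
    -- CLUSTERING_METHOD: 1 = scalable EM (default), 2 = non-scalable EM,
    --                    3 = scalable K-means, 4 = non-scalable K-means.
    CREATE MINING MODEL CountryClusters
    (
        CountryID       LONG KEY,
        PerCapitaIncome DOUBLE CONTINUOUS,
        PerCapitaDebt   DOUBLE CONTINUOUS
    )
    USING Microsoft_Clustering (CLUSTERING_METHOD = 3, CLUSTER_COUNT = 5)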

Sequence Clustering

Sequence clustering adds another level of flexibility to the clustering problem by including an ordering attribute. The algorithm can identify common sequences and use those sequences to predict the next step in a new sequence. Using the web site example, sequence clustering can identify common click-paths and predict the next page (or pages) someone will visit, given the pages they have already visited.
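In DMX, the ordering attribute appears as a nested table with a KEY SEQUENCE column. Here is a sketch of a click-path model with hypothetical names (training omitted):

    -- Hypothetical click-path model; the nested table holds the ordered page visits.
    CREATE MINING MODEL ClickPaths
    (
        SessionID TEXT KEY,
        PageVisits TABLE
        (
            VisitOrder LONG KEY SEQUENCE,  -- the ordering attribute
            PageName   TEXT DISCRETE PREDICT
        )
    )
    USING Microsoft_Sequence_Clustering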

Time Series

The Microsoft Time Series algorithm can be used to predict continuous variables, like Sales, over time. The algorithm includes time-variant factors like seasonality and can predict one or more variables from the case set. It also has the ability to generate predictions using cross-variable correlations. For example, product returns in the current period may be a function of product sales in the prior period. (It's just a guess, but we'd bet that high sales in the week leading up to December 25 may lead to high returns in the week following.)
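In DMX, the time column is flagged KEY TIME, and forecasts are retrieved with the PredictTimeSeries function. A sketch with hypothetical names, training omitted:

    -- Hypothetical forecasting model; Returns can draw on its correlation with Sales.
    CREATE MINING MODEL SalesForecast
    (
        ReportingMonth DATE KEY TIME,             -- the time dimension
        Sales          DOUBLE CONTINUOUS PREDICT,
        Returns        DOUBLE CONTINUOUS PREDICT
    )
    USING Microsoft_Time_Series

    -- Forecast the next three periods for each predictable variable.
    SELECT PredictTimeSeries(Sales, 3),
           PredictTimeSeries(Returns, 3)
    FROM SalesForecast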

Association

The Microsoft Association algorithm is designed to meet the business tasks described as association, affinity grouping, or market basket analysis. Association works well with the concept of nested case sets, where the higher level is the overall transaction and the lower level is the individual items involved in the transaction. The algorithm looks for items that tend to occur together in the same transaction. The number of times a combination occurs is called its support. The SUPPORT parameter allows the data miner to set a minimum number of occurrences before a given combination is considered significant. The Association algorithm goes beyond item pairs by creating rules that can involve several items. In English, the rule sounds like "When Item A and Item B exist in the item set, then the probability that Item C is also in the item set is X." The rule is displayed in this form: A, B -> C (X). In the same way the data miner can specify a minimum support level, it is also possible to specify a minimum probability a rule must reach before it is considered.
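The nested case set and both thresholds look like this in a hedged DMX sketch. The model and column names are hypothetical; MINIMUM_SUPPORT and MINIMUM_PROBABILITY are the actual DMX parameter names behind the support and probability settings described above:

    -- Hypothetical market basket model: the outer key is the transaction;
    -- the nested table holds the individual items in that transaction.
    CREATE MINING MODEL MarketBasket
    (
        OrderNumber TEXT KEY,
        Items TABLE PREDICT
        (
            ProductName TEXT KEY
        )
    )
    USING Microsoft_Association_Rules
        (MINIMUM_SUPPORT = 10,       -- a combination must occur at least 10 times
         MINIMUM_PROBABILITY = 0.4)  -- a rule needs at least a 40% probability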

Neural Network

Neural network algorithms mimic our understanding of the way neurons work in the brain. The attributes of a case are the inputs to a set of interconnected nodes, each of which generates an output. The output can feed another layer of nodes (known as a hidden layer) and eventually feeds out to a result. The goal of the neural network algorithm is to minimize the error of the result compared with the known value in the training set. Through some fancy footwork known as back propagation, the errors are fed back into the network, modifying the weights of the inputs. Then the algorithm makes additional passes through the training set, feeding back the results, until it converges on a solution. All this back and forth means the neural network algorithm is generally the slowest of the seven from a performance standpoint. The algorithm can be used for classification or prediction on both continuous and discrete variables.
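As a final hedged sketch (hypothetical names again), the hidden layer described above is sized by the documented HIDDEN_NODE_RATIO parameter; setting it to zero removes the hidden layer, which is how the Logistic Regression variant mentioned earlier is produced:

    -- Hypothetical churn model using the neural network algorithm.
    CREATE MINING MODEL ChurnPredictor
    (
        CustomerID   LONG KEY,
        Age          LONG CONTINUOUS,
        MonthlySpend DOUBLE CONTINUOUS,
        ContractType TEXT DISCRETE,
        Churned      TEXT DISCRETE PREDICT
    )
    USING Microsoft_Neural_Network (HIDDEN_NODE_RATIO = 4)  -- 4 is the default ratio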

Using these seven algorithms, separately or in combination, you can create solutions to most common data mining problems.


