Data Mining Algorithms in Analysis Services 2005 | Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer)

Analysis Services 2005 provides nine data mining algorithms that you can use to solve various business problems. These algorithms can be broadly classified into five categories based on the nature of the business problem they address:

Classification
Regression
Segmentation
Sequence analysis
Association

Classification data mining algorithms help solve business problems such as identifying the type of membership (Platinum, Gold, Silver, Bronze) a new customer should receive, or deciding whether a requested loan can be approved for a customer based on his or her attributes. Classification algorithms predict one or more discrete variables based on the attributes of the input data. Discrete variables are variables that contain a limited set of values. Some examples of discrete variables are gender, number of children in a household, and number of cars owned by a customer.

Regression algorithms are similar to classification algorithms; instead of predicting discrete attributes, however, they predict one or more continuous variables. Continuous variables are variables that can take any value within a numeric range. Examples of continuous variables are yearly income, age of a person, and commute distance to work. The algorithms belonging to the regression category should be provided with at least one input attribute that is of type continuous. For example, assume you want to predict the sale price of your house, a continuous value, and determine the profit you would make by selling the house. The price of the house would depend on several factors, such as area in square feet (another continuous value), and zip code and house type (single family, condo, or town home), which are discrete variables. Hence regression algorithms are primarily suited for business problems where you have at least one continuous attribute as input and one or more continuous attributes as predictable attributes.

Segmentation algorithms are probably the most widely used algorithms. Segmentation is the process of creating segments or groups of items based on the input attributes. Customer segmentation is one of the most common business applications, where stores and companies segment their customers based on various input attributes. One of the most common uses of segmentation is to target mailing campaigns at those customers who are likely to make purchases. This avoids the cost of mailing to every customer, thereby increasing the profit for the company.

Sequence analysis algorithms analyze and group input data based on a certain sequence of operations. For example, if you want to analyze the navigation patterns of Internet users (the sequence and order of pages visited by a user) and group the users based on their navigation, sequence analysis algorithms would be used. Based on the sequence of pages visited you can identify the interests of people and provide appropriate information to the users as a service, or show advertisements relevant to the users' preferences to increase sales of specific products. For example, if you navigate through pages of baby products on http://www.amazon.com, subsequent visits to Amazon pages might result in baby product related advertisements. Similarly, sequence analysis is also used in genomic science to group genes with similar sequences.

Association data mining algorithms help you to identify associations in the data set. Typically these algorithms are used for performing market-basket analysis, where associations between various products purchased together are analyzed. The associations identified help in cross-selling products together to boost sales. One famous data mining example highlights such an association: customers buying diapers also bought beer, and the purchases occurred on Thursday or Friday. The usual explanation is that diapers often need replenishing, and women ask their husbands or significant others to buy them. Men often buy beer for the weekend, and hence these purchases were made together. Based on this association, supermarkets can stock diapers and beer adjacently, which helps boost the sales of beer.

Below are brief descriptions of the nine data mining algorithms supported in Analysis Services 2005. Each description gives you an overview of the algorithm and the scenarios where the algorithm can be used. We recommend that you refer to the SQL Server Analysis Services documentation for details such as algorithm properties, their values, and the content types supported by each algorithm for input and predictable columns. Following these descriptions you will learn two data mining algorithms in detail by creating mining models using the data mining wizards.

Microsoft Decision Trees

Microsoft Decision Trees is a classification algorithm that is used for predictive modeling and analysis. A classification algorithm is an algorithm that selects the best possible outcome for an input case from a set of possible outcomes. A data set called the training data, which contains several attributes, is provided as input to the algorithm. The usage of each attribute, as either input or predictable, is also provided to the algorithm. The classification algorithm analyzes the attributes of the input data and arrives at a distribution, which includes the combinations of input attributes and their values that result in each value of the predictable column. Microsoft Decision Trees is helpful in predicting both discrete and continuous attributes. If the data type of the predictable attribute is continuous, the algorithm is called Microsoft Regression Trees and there are additional properties to control the behavior of the regression analysis.

Naïve Bayes

Naïve Bayes is another classification algorithm available in Analysis Services 2005 that is used for predictive analysis. The Naïve Bayes algorithm calculates the value of the predictable attribute based on the probabilities of the input attributes in the training data set. Naïve Bayes helps you to predict the outcome of the predictable attribute quickly because it assumes the input attributes are independent of one another. Compared to the other data mining algorithms in Analysis Services 2005, Naïve Bayes is computationally less intense for model creation.
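The independence assumption can be illustrated with a few lines of Python. This is only a minimal sketch of the general Naïve Bayes technique, not the Analysis Services implementation; the function names, the attribute names (Gender, OwnsHome, Membership), and the five training cases are hypothetical, and no smoothing is applied for values that never occur in training.

```python
from collections import Counter

def train_naive_bayes(cases, target):
    """Count how often each input value co-occurs with each target value."""
    class_counts = Counter(case[target] for case in cases)
    # value_counts[(attribute, value, class)] -> number of training cases
    value_counts = Counter()
    for case in cases:
        for attr, value in case.items():
            if attr != target:
                value_counts[(attr, value, case[target])] += 1
    return class_counts, value_counts

def predict(model, inputs):
    """Return a normalized probability for each value of the target."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior P(class)
        for attr, value in inputs.items():
            # P(value | class), multiplied in as if attributes were independent
            score *= value_counts[(attr, value, cls)] / n
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()} if norm else scores

# Hypothetical training data, invented for illustration only.
cases = [
    {"Gender": "M", "OwnsHome": "Yes", "Membership": "Gold"},
    {"Gender": "F", "OwnsHome": "Yes", "Membership": "Gold"},
    {"Gender": "M", "OwnsHome": "No", "Membership": "Silver"},
    {"Gender": "F", "OwnsHome": "No", "Membership": "Silver"},
    {"Gender": "M", "OwnsHome": "Yes", "Membership": "Silver"},
]
model = train_naive_bayes(cases, "Membership")
result = predict(model, {"Gender": "M", "OwnsHome": "Yes"})
print(result)
```

Training is a single counting pass over the data, which is one way to see why model creation is cheap compared with algorithms that search for splits or iterate to convergence.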

Microsoft Clustering

The Microsoft Clustering algorithm is a segmentation algorithm that helps in grouping the sample data set into segments based on the characteristics of the data. The clustering algorithm helps in identifying relationships existing within a specific data set. A typical example would be grouping store customers based on the characteristics of their sales patterns. Based on this information you can classify the importance of certain customers to your bottom line. The Microsoft Clustering algorithm is unique because it is a scalable algorithm that is not constrained by the size of the data set. Unlike the Decision Trees or Naïve Bayes algorithms, the Microsoft Clustering algorithm does not require you to specify a predictable attribute for building the model.
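To make the idea of grouping without a predictable attribute concrete, here is a small Python sketch of generic k-means clustering on a single numeric attribute. It is an assumption-laden illustration, not a description of how Microsoft Clustering works internally; the function name, the yearly spend figures, and the starting centers are all hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numeric values around the nearest center, then recompute centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical yearly spend per customer, invented for illustration only.
yearly_spend = [120, 150, 130, 900, 950, 1000]
centers, groups = k_means_1d(yearly_spend, centers=[100, 500])
print(centers, groups)
```

Note that no column is marked as the one to predict: the segments emerge from the input values alone.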

Sequence Clustering

As the name indicates, the Sequence Clustering algorithm helps in grouping sequences in the sample data. Similar to the clustering algorithm, the sequence clustering algorithm groups the cases in the data set, but it does so based on sequences instead of the attributes of the customers. An example of where Sequence Clustering would be used is to group customers based on the navigation paths they have followed through a Web site. Based on the sequence, the customer can be prompted to go to a Web page that would be of interest.

Association Rules

The Microsoft Association algorithm typically helps identify associations or relationships between products that are purchased. If you have shopped at http://www.Amazon.com you have likely noticed messages such as "people who have purchased item one have also purchased item two." Identifying the association between products purchased is called market-basket analysis. The algorithm helps in analyzing the products in a customer's shopping basket, and predicts other products the customer is likely to buy. That prediction is based on how often other customers purchased similar products together. This algorithm is often used for cross-selling through product placement in the store.
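Purchase co-occurrence can be sketched in Python by counting how often pairs of items share a basket. The following is a simplified illustration of market-basket analysis restricted to pairs, not the Microsoft Association algorithm itself; the function name, the min_support threshold, and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=2):
    """Return confidence of 'a -> b' for item pairs seen in enough baskets."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support:
            # Confidence: share of baskets containing the first item
            # that also contain the second item.
            rules[(a, b)] = n / item_counts[a]
            rules[(b, a)] = n / item_counts[b]
    return rules

# Hypothetical baskets, invented for illustration only.
baskets = [
    ["diapers", "beer", "chips"],
    ["diapers", "beer"],
    ["diapers", "milk"],
    ["beer", "chips"],
]
rules = pair_rules(baskets)
for (a, b), confidence in sorted(rules.items()):
    print(f"{a} -> {b}: {confidence:.2f}")
```

A rule with high confidence, such as chips -> beer in this toy data, is the kind of association that could inform product placement.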

Neural Networks (SSAS)

The Microsoft Neural Networks algorithm is a classification algorithm similar to Microsoft Decision Trees and calculates probabilities for each value of the predictable attribute, but it does so by creating internal classification and regression models that are iteratively improved based on the actual value. The algorithm has three layers (the input layer, an optional hidden layer, and an output layer) that are used to improve the prediction results. The actual value of a training case is compared to the predicted value, and the error is fed back within the algorithm to improve the prediction results. Similar to the decision trees algorithm, the Neural Networks algorithm is used for predicting discrete and continuous attributes. One of the main advantages of neural networks over the decision trees algorithm is that neural networks can handle complex training data, as well as large amounts of it, much more efficiently.

Time Series

The Microsoft Time Series algorithm is used in predictive analysis but is different from other predictive algorithms in Analysis Services 2005 because during prediction it does not take input columns to predict the predictable column value. Rather, it identifies trends in the input data and helps predict future values. A typical application of a time series algorithm is to predict the sales of a specific product based on the sales trend of the product in the past, along with the sales trend of a related product. Another example would be to predict stock prices of a company based on the stock price of another company. The Time Series algorithm is used for predicting continuous attributes.

Microsoft Linear Regression

Microsoft Linear Regression is a special case of the Microsoft Decision Trees algorithm where an algorithm property is set so that the algorithm never creates a split, thereby ending up with a linear regression. The algorithm property MINIMUM_LEAF_CASES for Microsoft Decision Trees is set to a value greater than the number of input cases used to train the model. The linear regression algorithm will typically be used when you want to find the relationship between two continuous columns. The algorithm finds the equation of the line that best fits the data representing the relationship between the input columns. The Microsoft Linear Regression algorithm only supports input columns that have certain content types. The main content type typically used will be continuous. The algorithm does not support the content types discrete or discretized. For more details on the content types supported by the algorithm, please refer to the product documentation.
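Finding the equation of a best-fit line between two continuous columns can be sketched with ordinary least squares in Python. This is a textbook calculation shown under stated assumptions, not the code path Analysis Services uses; the function name and the house data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house area in square feet and sale price.
area = [1000, 1500, 2000, 2500]
price = [200000, 275000, 350000, 425000]
slope, intercept = fit_line(area, price)
print(f"price = {slope:.1f} * area + {intercept:.1f}")
```

Both columns here are continuous, matching the typical use described above.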

Microsoft Logistic Regression

Microsoft Logistic Regression is a variation of the Microsoft Neural Networks algorithm where the hidden layer is not present. The simplest form of logistic regression is to predict a column that has two states. The input columns can contain many states and can be of many content types (discrete, continuous, discretized, and so on). You can certainly model such a predictable column using linear regression, but the linear regression might not restrict the values to the minimum and maximum values of the column. Logistic regression, however, is able to restrict the output values for the predictable column to the minimum and maximum values with the help of an S-shaped curve instead of the straight line that would have been created by a linear regression. In addition, logistic regression is able to predict columns of content type discrete or discretized and is able to take input columns of content type discrete or discretized.
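The difference between the straight line and the S-shaped curve can be seen in a short Python sketch. It assumes a two-state predictable column coded as 0 and 1; the function names, slope, and intercept are hypothetical values chosen for illustration rather than coefficients learned by Analysis Services.

```python
import math

def logistic(z):
    """S-shaped curve that maps any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)

def linear_prediction(x, slope, intercept):
    """Straight-line prediction; not restricted to the 0..1 range."""
    return slope * x + intercept

def logistic_prediction(x, slope, intercept):
    """Same linear score passed through the S-shaped curve."""
    return logistic(slope * x + intercept)

# Hypothetical coefficients, invented for illustration only.
slope, intercept = 0.1, -2.0
for x in (0, 20, 100):
    print(x, linear_prediction(x, slope, intercept),
          logistic_prediction(x, slope, intercept))
```

For large inputs the linear prediction runs past 1, while the logistic prediction stays between the two states of the column.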