Data Mining | Practical Business Intelligence with SQL Server 2005

As you have seen, Analysis Services enables you to build powerful BI solutions that enable users to really understand the business. However, many business problems rely on the ability to spot patterns and trends across data sets that are far too large or complex for human analysts. Data mining can be used to explore your data and find these patterns, allowing you to begin to ask why things happen and to predict what may happen in the future.

Data Mining Architecture

Data mining in SQL Server 2005 is integrated into the Analysis Services engine, as shown in Figure 2-16. The information derived from data mining can be made available as part of Analysis Services cubes and Reporting Services reports so that users can apply the groupings and predictions from data mining to the existing data.

Figure 2-16. Data mining architecture

Preparing the Data

Data mining and data warehouses go well together because using clean, validated information is vital to the successful use of data mining. Just like most Analysis Services features, you can use data mining against almost any source of information, including transaction systems, but as usual we recommend starting from a data warehouse as the best source of data.

One of the great features of data mining in SQL Server 2005 is that you can either use relational tables as the source data for the models or you can build models on top of multidimensional cubes. The advantage of using cubes as the source is that the models can use any complex calculations that you may have added and can also sometimes benefit from the query performance increase for aggregated information.

Building Data Mining Models

You can use the BI Development Studio to design and validate data mining models by adding them to an Analysis Services project, as shown in Figure 2-17. After deciding whether to use a data source or the current cube as the source, you can select an algorithm and then supply the relevant information to set up the model. A data mining algorithm is the mechanism that actually creates the mining models by looking for specific patterns in a data set. The model created can be validated and tested within the same development environment and then deployed to an Analysis Services server.

Figure 2-17. Building data mining models

Using the Information

After your model has been deployed on a server, you can use it to make predictions using a query language called DMX (Data Mining Extensions), which is somewhat similar to SQL. The process for querying a mining model is similar to querying other Analysis Services objects and uses the OLE DB Provider for Analysis Services. For example, when customers on your e-commerce site adds a DVD to their online shopping basket, you could submit a DMX query to use a mining model to predict which other titles a specific customer might like to purchase.

If you want to display the data mining models in your application, two different viewer libraries are available depending on whether you are building Web-based or Windows Forms applications. The Web-based controls are included with SQL Server 2005 as a sample project that you will need to compile using Visual Studio. These controls can then be added to Web pages to show the results of some of the algorithms.

For Windows client applications, the data mining viewer controls shown in Figure 2-18 are shipped as part of the Feature Pack for SQL Server 2005, which is a collection of add-ons and components that were only completed after the release of SQL Server 2005.

Figure 2-18. Data mining viewer controls

Data Mining Features

SQL Server 2005 ships with many Microsoft data mining algorithms in the box and also supports "plug-in" algorithms from third-party vendors. The algorithm that you decide to use will depend on the task you are trying to accomplish. For example, some algorithms such as Microsoft Decision Trees are good for predicting the expected value of a numeric attribute such as sales revenue, or predicting the category that a particular record may fall into. Other algorithms are used to group records together into similar categories, such as using the Microsoft Clustering Algorithm to segment customers by demographic information.

Using Data Mining in ETL

One of the most innovative aspects of data mining in SQL Server 2005 is the incorporation of data mining features into Integration Services. Data quality is one of the main concerns in BI projects, and Integration Services includes a Data Mining Query transformation that you can use to redirect or modify a row depending on the results of a data mining prediction. For example, data that is outside the normal range could be flagged for review or could be categorized using the mining model, as shown in Figure 2-19.

Figure 2-19. Data mining integration in Integration Services

Integration Services also has some transformations that take advantage of data mining, such as Fuzzy Lookup, which can be used to return "close" rather than just exact matches from a reference table, and Fuzzy Grouping to help identify possible duplicate rows.

Managing and Securing Data Mining Models

Because SQL Server's data mining features are provided by Analysis Services, the security and management for data mining models work the same way as for cubes and other Analysis Services objects. You can add users to roles, which control their ability to access data mining models, and you can use SQL Server Management Studio to reprocess the models to load new data.