Understanding Data Mining

Microsoft SQL Server 2000 Analysis Services is a suite of decision support engines and tools.

Analysis Services is accessed through a graphical user interface tool implemented as a snap-in within Microsoft Management Console (MMC). This tool lets an administrator easily manage various decision-support tasks. As of this writing, Analysis Services provides access to two forms of decision support: data mining and OLAP. Although both data mining and OLAP are crucial elements of any decision-support effort, they perform very different duties, as we'll soon see.


Note

Analysis Services is the name for the decision support tools included in SQL Server 2000, which include both OLAP and data mining functionality. In SQL Server 7.0, Analysis Services was called OLAP Services and, as the name implies, it included only the OLAP functionality.

Analysis Services' architecture can be divided into the client portion, which provides interfaces to front-end applications, and the server portion, which houses the engines that provide the functionality and power behind these services. The client and server portions have separate components that are accessed in different ways. For example, the client services provide the functionality needed to create local components that are subsets of the server components, such as local cubes built from server-side online analytical processing (OLAP) cubes. The server components provide the core data and services needed to support Analysis server tasks.



Note

OLAP and data mining can easily use data contained in formats other than SQL Server 2000, such as text, Microsoft Excel spreadsheets, or any other OLE DB source.


Introduction to OLAP

Online analytical processing (OLAP) is a technology that allows users to analyze a large database to gain insight into the information it contains. The database for an OLAP system is structured for efficient storage of static data. The storage structure is designed to efficiently retrieve data, particularly aggregated data such as summed totals and counts. Because the OLAP storage unit is multidimensional, it's called a cube as opposed to a table. What makes OLAP unique is its ability to store the aggregated data hierarchically, allowing the user to drill down and roll up aggregates by dimensional traits. Dimensions give contextual information to the numerical figures, or measures, that you're examining. For example, if you're looking at sales figures, you would generally be interested in regional, quarterly, and product sales; OLAP calls each of these a dimension. If you wanted to look at one sales region that had a particularly good quarter, OLAP's structure allows you to expand the quarterly view to see each month or day of the quarter. At the same time, you can also drill down into the data on the region with the highest sales to find the cities responsible for the increase in sales. As Figure 2-1 shows, OLAP is a great tool for understanding how measures relate to dimensions. Because the measures are precalculated, OLAP makes navigation through the data almost instantaneous.


Figure 2-1. Contents of an OLAP cube.

OLAP vs. Data Mining

This chapter introduces the concepts and workings of OLAP. This knowledge is important for various reasons, not the least of which is that OLAP is one of the data sources available to you in a data mining scenario. OLAP is also a storage option for data mining models. The data mining process is often confused with OLAP cube navigation, so it's crucial that you understand the capabilities of each process. This book doesn't provide you with enough knowledge to do any significant work with OLAP, but there are some good books dedicated to the subject that are well worth reading, such as the following:

  • Peterson, Tim, et al. Microsoft OLAP Unleashed. Indianapolis: SAMS, 1999.
  • Thomsen, Erik, George Spofford, and Dick Chase. Microsoft OLAP Solutions. New York: Wiley & Sons, 1998.
  • Kimball, Ralph, et al. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. New York: Wiley & Sons, 1998.

There are two main points I'd like to make about relational data and OLAP. The first is that OLAP tends to remove the lower levels of granularity in the data. If you were using OLAP to aggregate sales, the individual line items would be disconnected from the OLAP aggregates. The individual line items get lost in the aggregates because all of the relevant records are merged into one: the numerical measures become summed values, and the unique identifiers are discarded. Analysis Services does allow you to drill through to the underlying source data, but only if you choose the option that allows drillthrough when you build the cube. The second point to note about OLAP is that it tends to require a more rigid definition of the data structures than a relational database does. Typically, OLAP's information comes from a data warehouse that's been structured specifically for query and analysis. This structure requires tighter design work up front, but the extra work pays off in the end with well-defined data relationships, hierarchies, and definitions.

From a reporting perspective, OLAP exposes its data in the same way regardless of the internal storage options chosen when building the cube. From a structural perspective, OLAP provides these various storage options to optimize data retrieval. These structural differences are completely hidden from the reporting functions. The three storage options available in OLAP are multidimensional online analytical processing (MOLAP), relational online analytical processing (ROLAP), and hybrid online analytical processing (HOLAP).

MOLAP

Multidimensional online analytical processing (MOLAP) is used to build multidimensional cubes from data stored in a data warehouse. Analysis Services uses the MOLAP cube to respond to a predefined range of inquiries, as shown in Figure 2-2. The MOLAP method is often chosen if the initial data sets are so large that the processing of the cube from the original warehouse data requires a batch process. The data is aggregated and processed using a set of predefined calculations. The resulting cube can be queried in much the same way as any other database is queried.

The main reason for using this method is that the MOLAP storage mechanisms are especially effective in retrieving data quickly. Unlike a relational database, which has to build a result set on the fly to answer queries, MOLAP simply identifies the location, or cell, where the precalculated elements reside and returns the data to the requester.


Figure 2-2. MOLAP structure.

The downside to this approach is that the cube is disconnected from the warehouse, which makes the process of updating the cube complex. Batch routines that compare cube data to warehouse data and update the cube data as necessary have to be put in place.

Mistakes found in the original data, once caught and corrected, also have to be reflected in the aggregated data in the cube. Cube corrections can be made in a number of ways; sometimes the best method is to create reverse records in the warehouse. For example, if you need to correct the dollar amount on a set of invoices, you can cancel the incorrect invoices by creating a set of matching invoices with negative dollar amounts, as sketched after this paragraph. Other times, the best, but more time-consuming, solution is to rebuild the entire cube from the corrected warehouse data. MOLAP can store only basic data types, such as text and integers. This creates an important limitation if you need to keep other types of information in the underlying data.
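To make the reverse-record approach concrete, here is a minimal T-SQL sketch. It assumes a hypothetical warehouse fact table named Invoices with an Amount column; the table, column names, and invoice numbers are illustrative only.

    -- Step 1: cancel the incorrect invoices by inserting matching rows
    -- with negated dollar amounts so the aggregates net out to zero.
    INSERT INTO Invoices (InvoiceID, CustomerID, InvoiceDate, Amount)
    SELECT InvoiceID, CustomerID, InvoiceDate, -Amount
    FROM Invoices
    WHERE InvoiceID IN (1001, 1002)   -- the invoices known to be wrong

    -- Step 2: insert the corrected invoices as ordinary new rows, then
    -- reprocess or incrementally update the cube so its aggregates pick
    -- up the net effect of the reversal and the corrected figures.

The advantage of this approach is that the warehouse keeps an audit trail of the original mistake and its correction, rather than silently overwriting history.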

ROLAP

Most of the disadvantages associated with MOLAP are addressed with relational online analytical processing (ROLAP). ROLAP also retains the advantage of being able to query preaggregated data. ROLAP's storage mechanism uses the original RDBMS, such as SQL Server 2000, to store the aggregates in tabular form, which can then be used by the OLAP engine, as shown in Figure 2-3. Because of this storage mechanism, the RDBMS engine can be used to manage the generated aggregates. The engine can make corrections and inserts on the individual line items and on the aggregates themselves. Reports can, if they need to, query the aggregate tables directly without having to go through the OLAP querying mechanisms, such as OLE DB for OLAP, PivotTable Service, or multidimensional expressions (MDX).


Figure 2-3. ROLAP structure.

This storage mechanism does have a downside. The tabular structure of ROLAP, as opposed to the multidimensional structure of a MOLAP cube, is not as efficient for the OLAP engine to query. This inefficiency leads to a performance strain on the system, especially if the data sets are larger than average for a MOLAP storage mechanism. To correct this performance deficiency, you'll need to delve into the myriad complexities of aggregate ROLAP tables and attempt to index them properly. This leads to further limitations because the nature of this type of index tuning task invariably favors some types of queries to the detriment of others, thus creating a need for more specialization of the ROLAP structures than would otherwise be necessary. The facility that the MOLAP cubes provide for ad hoc querying almost seems like a luxury in ROLAP.
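As a minimal illustration of that tuning work, the following statement assumes a hypothetical ROLAP aggregate table named agg_Sales_By_Region_Quarter with Region and SalesQuarter grouping columns; all of the names are illustrative, not objects Analysis Services necessarily generates.

    -- A composite index on the grouping columns speeds up slices by
    -- region and quarter, at the cost of other access paths.
    CREATE INDEX IX_agg_Sales_Region_Quarter
        ON agg_Sales_By_Region_Quarter (Region, SalesQuarter)

Each index like this favors one family of queries, which is exactly the specialization trade-off described above.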

HOLAP

Sometimes it happens that 80 percent of the queries concern high-level aggregates, and the remaining 20 percent need access to the low-level line items in the source data. Hybrid online analytical processing (HOLAP) was designed to address this need. Essentially, HOLAP combines the advantages of MOLAP with those of ROLAP by storing the high-level aggregates in a MOLAP cube and keeping the low-level aggregates and line items in the relational database tables, as shown in Figure 2-4. The OLAP interface remains unchanged, but the engine is able to selectively choose data from different storage mechanisms based on the given query. Because information stored in the relational portion contains fewer, if any, levels of aggregation, some of the disadvantages of ROLAP can be avoided with HOLAP. Because HOLAP makes the tables far less complex to manage on the relational database side, the data is more easily optimized through indexing. The MOLAP portion of the cube provides fast retrieval for the aggregated portion of the queries without having to expose a developer or administrator to the details of the summarized structures.


Figure 2-4. HOLAP structure.

Server Architecture

Analysis Services' architecture on the server side provides core facilities to create and manipulate OLAP cubes and data mining models. Connections to the original data sources as well as local security access are also managed by the server. The server manages the mechanics of data storage as the cubes and models are being processed. It also maintains the metadata repositories stored and used by Microsoft SQL Server 2000 Meta Data Services.

The user interface provided by Analysis Manager contains Data Transformation Services (DTS), which is a SQL Server 2000 service that cleans and transfers data as it moves between data sources. Analysis Manager is the main user interface for Analysis Services, but there are other programming interfaces that enable custom front-end applications to interact with the COM object model that controls the interface to the server. The Microsoft SQL Server 2000 Analysis Services server architecture is shown in Figure 2-5.

Data Mining Services Within Analysis Services

Although Analysis Server has many functions that are used by OLAP and data mining, as shown in Figure 2-5, some of the server features are specifically designed for data mining. For example, data mining uses a storage option that places data mining models in multidimensional structures, in relational database tables, or in Predictive Model Markup Language (PMML), which is a standardized Extensible Markup Language (XML) format specifically designed for data mining models.


Figure 2-5. Analysis Services server architecture.

Note

Analysis Services uses security roles to limit access to various objects, including cubes and data mining models. These roles function almost identically to SQL Server roles.


Client Architecture

The client side is primarily designed to provide a bridge, or interface, between the server functions and the custom applications. (See Figure 2-6.) PivotTable Service manages the crux of this interaction and also provides interfaces to OLE DB for Data Mining, which in turn either provides connectivity directly to Microsoft Visual C++ applications or to ActiveX Data Objects (ADO).


Figure 2-6. Analysis Services client architecture.

PivotTable Service

PivotTable Service is a built-in Analysis Server client that acts as an OLE DB provider for multidimensional data and data mining operations. It's positioned between the server engine and other client applications. As such, it provides OLE DB access to applications that need multidimensional data and data mining services from a client application. The data returned is generally in the form of tabular data, which can then be manipulated by traditional means using ADO record sets, Excel spreadsheets, or FoxPro tables. For both Data Mining Services and OLAP, PivotTable Service provides an extended SQL syntax that includes a Data Definition Language (DDL), allowing clients to create and update local cubes or data mining models on the fly. These local structures can then be taken offline and used for queries and temporary updates in the pursuit of answers to "What if?" scenarios. PivotTable Service uses this specialized syntax to provide sophisticated query features to the client applications. The data can then be displayed, analyzed with data mining algorithms, or even changed.
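For example, the following sketch uses the CREATE CUBE form of the PivotTable Service DDL to define a small local cube. The cube, dimension, level, and measure names are hypothetical; a matching INSERT INTO statement would then populate the local cube from an OLE DB source.

    CREATE CUBE LocalSales (
        DIMENSION Region,
            LEVEL [All Regions] TYPE ALL,
            LEVEL Country,
            LEVEL State,
        DIMENSION [Time] TYPE TIME,
            LEVEL [All Time] TYPE ALL,
            LEVEL [Year] TYPE YEAR,
            LEVEL [Quarter] TYPE QUARTER,
        MEASURE [Sales Amount]
            FUNCTION SUM
    )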

Just about any client capable of communicating through the OLE DB interface can access PivotTable Service. Clients using C++ can use special PivotTable Service interfaces with COM and OLE DB to create highly specialized custom front-end applications.

The role of PivotTable Service is to provide access to Analysis Services' functions through DDL syntax. Since this syntax is primarily designed for queries, the administrative functionality in the language is somewhat limited. Data mining models and cubes can indeed be created, but maintaining these structures is extremely code-intensive and rather impractical.

OLE DB

In order to provide a standard way to access disparate forms of data, whether they come from RDBMSs or Microsoft Outlook folders, and convert them to tabular result sets, Microsoft created OLE DB. OLE DB is designed to provide a common interface to these different data sources in the form of providers. For example, there are OLE DB providers for SQL Server, OLE DB providers for Oracle, and even OLE DB providers for comma-delimited text. Providers make it possible for an application to access these different data sources using the same calling methods, while letting the OLE DB provider handle the logistics of converting that data source into a usable table format. Analysis Services uses OLE DB for Data Mining and OLE DB for OLAP.

OLE DB for OLAP

The OLE DB for OLAP provider, shown in Figure 2-7, contains features that allow access to special OLAP structures. OLAP is unique, in part, because of its ability to store data in multidimensional structures. As was mentioned earlier, this multidimensional structure doesn't easily translate to simple tabular formats. To address this problem, Microsoft developed ActiveX Data Objects (Multidimensional) (ADO MD), an extension of the familiar ADO COM interface. ADO MD provides the same connection and querying facilities as ADO, but it also returns a record set made up of cells rather than rows and columns; a set of coordinates is needed to retrieve the cells.


Figure 2-7. OLE DB for OLAP access.

OLE DB for Data Mining

OLE DB for Data Mining exposes the data mining model object in the form of a table. It also provides a series of statements that allow an application to manipulate this object. This OLE DB provider supplies virtual table objects that can be created with a form of the SQL CREATE syntax; they can be populated with INSERT and queried with SELECT.

The data in the data mining model is structured as it would be in a database that contains a group of related tables. An auto dealer database might contain data about car sales made by salespersons along with information about the year, make, and model of the cars. This information might have a Salesperson table as a parent table for the Sales table, and the Sales table might have foreign key relationships to the actual car data. In a relational structure, a join between the Salesperson table and the Sales table will most likely cause the salesperson information to be repeated as many times as there are sales for each salesperson. This is because for every record in the Salesperson table there might be more than one sales item record. Since the data mining model doesn't actually store data in relational tables, it simulates that functionality by creating nested tables, as shown in Figure 2-8. In the example shown, if the salesperson is the focus of the data mining prediction, each salesperson's record represents a case. Once the records from the sales are nested within the salesperson's record, it becomes a case set.


Figure 2-8. OLE DB nested table.
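Using the CREATE syntax mentioned above, a model for the auto dealer example might be defined as in the following sketch. The object names, column types, predictable column, and the choice of the Microsoft_Decision_Trees algorithm are illustrative assumptions, not the book's own example.

    -- Each salesperson is a case; the nested Car Sales table holds the
    -- related sales records that belong to that case.
    CREATE MINING MODEL [Salesperson Analysis]
    (
        [Salesperson ID]  LONG  KEY,
        [Region]          TEXT  DISCRETE,
        [Quota Met]       TEXT  DISCRETE  PREDICT,
        [Car Sales]       TABLE
        (
            [Model]       TEXT  KEY
        )
    )
    USING Microsoft_Decision_Trees

An INSERT INTO statement would then populate the model with training cases, and a SELECT would browse its content.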

Inserting cases into the data mining model causes the OLE DB provider to launch a process that first applies the algorithm to the data and then stores the resulting patterns in the model. The stored patterns can be used with PREDICTION JOIN statements to compare a table with variable fields to the patterns in the model. Because the data mining model contains every combination of cases learned from the original data source, along with statistical information about the number of cases used to create each distinct combination, the model and the supplied variables can be used to make predictions.
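A prediction against the hypothetical model sketched above might look like the following; the OPENROWSET connection string, the source table NewSalespeople, and the column names are assumptions for illustration.

    SELECT t.[SalespersonID], [Salesperson Analysis].[Quota Met]
    FROM [Salesperson Analysis]
    PREDICTION JOIN
        OPENROWSET('SQLOLEDB',
            'Data Source=MyServer;Initial Catalog=AutoDealer;Integrated Security=SSPI',
            'SELECT SalespersonID, Region FROM NewSalespeople') AS t
    ON [Salesperson Analysis].[Region] = t.Region

Each row returned pairs an input salesperson with the value of [Quota Met] that the model's stored patterns consider most likely for that case.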

Decision Support Objects (DSO)

Analysis Services exposes much of the engine's functionality through a series of well-defined COM interfaces that allow you to create databases, cubes, and data mining models and to manage server functions. These interfaces also provide access to the properties of almost all the objects available on the server, such as roles, dimensions, and partitions. Collectively, this set of COM interfaces is called Decision Support Objects (DSO). DSO bypasses PivotTable Service entirely because it provides a direct interface to the engine; as a result, DSO is not designed to perform queries the way PivotTable Service does.

DSO is implemented in the form of a hierarchy of objects and collections that map to the functions and components in Analysis Services. This allows applications to easily integrate with the underlying engine without having to understand the complexities inside.

Generally, DSO is used to programmatically perform administrative tasks that would otherwise have to be performed interactively with Analysis Manager. Entire front-end applications similar to Analysis Manager can be written by anyone who knows DSO and can instantiate COM objects using Microsoft Visual Basic or Visual C++. DSO also allows batch routines to perform scheduled administrative tasks such as updating cubes or adding new cases to a data mining model.

Multidimensional Expressions (MDX)

Because of the unique structure of OLAP cubes, a specialized extension of SQL called multidimensional expressions (MDX) exists specifically to query them. Although an MDX query resembles a standard SQL Server query and uses keywords such as SELECT, WHERE, and FROM, the differences become quite apparent when a query of any degree of complexity is attempted. Observe the following query provided in SQL Server Books Online:

    WITH MEMBER All_Countries.USA_and_Japan AS 'SUM({USA, Japan})'
    SELECT CROSSJOIN({Smith, Jones},
        {USA_North.CHILDREN, USA_South, Japan,
            USA_and_Japan}) ON COLUMNS,
        {Qtr1.CHILDREN, Qtr2, Qtr3, Qtr4.CHILDREN} ON ROWS
    FROM SalesCube
    WHERE (Sales, [1997], Products.All)

This kind of syntax is needed because OLAP is less concerned with retrieving data from sets of records than it is with pinpointing the coordinate position where the aggregated data is located, a task comparable to searching for a particular apartment in a building. To find the apartment, you need not only the building address, but also the building number, floor number, and apartment number. With this information you can walk straight up to the door and avoid searching every building on the block to find one apartment. Using MDX, you can formulate requests using specific coordinates that take you directly to your data.
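A minimal query against the same SalesCube, reusing member names from the Books Online example above, shows the coordinate idea: the axis sets and the WHERE clause together pinpoint the cells to return rather than filtering rows.

    SELECT {Qtr3} ON COLUMNS,
           {USA_North.CHILDREN, USA_South} ON ROWS
    FROM SalesCube
    WHERE (Sales, [1997])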

Prediction Joins

Much like MDX, Data Mining Services uses its own extension of the SQL syntax. This functionality is provided to allow front-end applications to make predictions based on data in the data mining models. This syntax is examined in detail in Chapter 12, "Data-Mining Queries."

Summary

Analysis Services, although included with SQL Server 2000, is actually an independent set of services that combines OLAP functionality with data mining facilities. Both have a server architecture that shares features used for decision support services. OLAP and data mining also have unique features, some of which have been discussed in this chapter. The server components perform the fundamental, core engine-level functions of those services while exposing their interfaces through PivotTable Service.

PivotTable Service serves as a client to the server components of Analysis Services and acts as a bridge to applications that need to access objects through an OLE DB provider. It provides language extensions for creating server-side and local, offline data mining and OLAP structures, as well as the syntax for creating queries. MDX is the specialized language extension for OLAP, and PREDICTION JOIN is the SQL extension used to make predictions from data mining models.

As an alternative to PivotTable Service, most Analysis Services administrative tasks can be performed more directly through the use of DSO, a set of COM interfaces that expose their functionality to ActiveX client applications.

OLE DB is the mechanism provided by Microsoft to allow clients or ActiveX wrappers such as ADO to access many different types of data through essentially the same programming interface. OLE DB interfaces with the PivotTable Service to communicate with server-side Analysis Services structures.


