2 Microsoft SQL Server Analysis Services Architecture | Data Mining with Microsoft[r] SQL Server[tm] 2000 Technical Reference

The physical efficiency and usefulness of the data-mining process depends in large part on the underlying data model, the physical structure of a subset of data compiled from a data-mining algorithm, along with the description of the original data set, you use. Depending on the type and quantity of data you have and the results you are looking for, you can create a mining model directly from live OLTP data, which ensures your data is current, or from batches of data, which are extracted into specially enhanced reporting structures and kept for future use. As we'll discover in this chapter, the data model design has a big impact on the responsiveness of the server and the quality of the information mined.

Why Data Mining Needs a Data Warehouse

A data warehouse provides a repository for an organization's data. This repository is located in a central storage facility from which users can retrieve data whenever they need it. Using a single and centralized data warehouse as a data source increases the quality and consistency of an organization's data because the final data validation procedures are almost always handled by one team. Most major institutions and corporations (as well as many small ones) with high volume data processing tasks use a data warehouse model.

Organizations that manage data in warehouses often find that their data is too general and unmanageable to base specific and specialized decisions on. Although data warehouses do a good job of structuring data, you need data mining to extract useful information from the underlying data. Before you can begin the process of data mining, you have to store and structure the data in some type of warehouse. 1

One great virtue of data warehousing is that it can separate the reporting data (the data on which data mining is performed) from the OLTP data (the data structured for efficient transaction processing). It's very important to keep these processes separate, particularly when the data is being mined. Both processes begin with the same data, but in most cases they use different data structures to get different results. The structure affects how easily the data can be extracted and how the queries are formulated. In the right structure, simple queries can answer complex questions. Besides ease and simplicity, another consideration when deciding how data is going to be structured to meet reporting needs is server performance and resources.

Although data warehousing and data mining are closely related and interrelated processes, data mining requires different structures for its data, uses different algorithms, and, as we've discussed, is designed to accomplish different objectives.

Maintaining Data Integrity

The differences in these processes are reflected by their almost opposite objectives. OLTP is used to enter raw data and transactions from sources such as

Mail order sales
Cash registers
Measuring instruments such as those used in nuclear power plants 2
Mainframe file imports
Web server logs

Regardless of the source, what's important is that the process guarantees that the integrity of the data is maintained and that nonsensical data never gets into any of the tables. One way to guarantee this integrity is to use relational database management system (RDBMS) features to enforce rules that block the entry of bad data. Another way to guarantee the integrity of the data is to create data structures that don't require the same data to be entered multiple times. A classic example of a data structure built to support customer invoicing has a Customer table, an Invoice-header table, and a Line-items table. These tables are structured so that a customer name and address is entered only into the Customer table. The Invoice-header table contains all the information needed to identify an invoice such as the invoice number, date, shipping address, and so on, and a reference to the customer that placed the order. The Line-items table contains quantities , discounts , and references to the Invoice-header table. The items purchased are represented with product codes. Each code refers to products in the Product table. The intelligent use of references reduces redundant data entry and makes the model efficient. Figure 3-1 shows how the efficiency of this model also increases its complexity, making it somewhat more difficult for users to understand.

The main goals of an RDBMS engine such as Microsoft SQL Server are efficient management of information and the maintenance of data integrity. Normalization is the process that minimizes redundant information in a database, establishes efficient relationships among the database's tables, and ensures data integrity. One side effect of a normalized database is complex queries. Let's say that we want to use a SQL query to retrieve all the names of customers who purchased tofu in April, 1998, along with the quantity and date of each purchase. The query would look something like this:

 SELECT     cst.CompanyName,     ord.ShippedDate,     prd.ProductName,     COUNT(1) AS qty FROM customers cst      JOIN [orders] ord         ON cst.CustomerId = ord.CustomerId     JOIN [order details] det         ON det.orderid = ord.orderid     JOIN [products] prd          ON det.productid = det.productid WHERE ord.ShippedDate BETWEEN '4/1/1998' AND '5/1/1998' AND prd.productname = 'TOFU' GROUP BY     cst.CompanyName,     ord.ShippedDate,     prd.ProductName

Figure 3-1. Relational database diagram. 3

Note

If you want, you can run this very same query in the sample Northwind database included with SQL Server 2000.

Although this is hardly one of the most complicated queries you can come up with, it does seem like a lot of work for little information. To make this query, I had to join four tables together in order to get three fields. If I wanted to include regional and territorial information in my query, I would have had to join six or seven tables together! Fundamentally, there's nothing wrong with this way of querying data, as a matter of fact, SQL is a highly evolved language that is able to handle far more complex queries than this. To understand the underlying data and simplify the extractions, let's consider other methods of structuring the data.

For argument's sake, let's assume that all we want is to report and extract data from our structure. We don't care about data integrity or efficient use of structures for OLTP. We could then change our structures to look something like the one shown in Figure 3-2.

Figure 3-2. Denormalized data structures.

The process of melding multiple tables into one is known as denormalizing. Although I would not advise applying this process to a OLTP system, it does make the job of simplifying queries easier.

When the database is denormalized, the query shown in the earlier example looks like this. 4

 SELECT     CompanyName,     ShippedDate,     ProductName,     count(1) AS qty FROM ProductPurchases      WHERE ShippedDate BETWEEN '4/1/1998' AND '5/1/1998'     AND productname = 'TOFU' GROUP BY     CompanyName,     ShippedDate,     ProductName

By denormalizing the tables, we eliminated the complexities associated with joins. The downside to this process is that it increases data redundancy, and more data requires more disk space. Filtering a denormalized database is inherently slower than filtering a normalized database. Finally, we also made data entry a more cumbersome process because data elements such as the customer name and invoice number, which are referenced in a normalized structure, have to be entered for every item the customer purchases.

Reporting Against OLTP Data Can Be Hazardous to Your Performance

Because many reports are more informative when they are created from the latest data, it might seem reasonable to sacrifice simplicity for current data. In the interests of getting the most current information, we could make a case for investing in more powerful machines to process our more complex queries. There are, however, other reasons besides the complexity of queries to avoid reporting against live OLTP data. For one thing, in OLTP environments RDBMS engines are already using valuable machine power to maintain security, indexes, data pages, memory, and other critical resources without having to drain more power to answer large queries.

Having to share machine power is a common problem associated with the mixing of reporting with OLTP, but another phenomena called locking contention actually presents a greater threat to performance. A mechanism in the RDBMS engines is specifically designed to lock records before updating them in order to avoid data anomalies created when two or more users attempt to update data records at the same time or when one user reads a record as it is being updated. While OLTP databases insert, delete, and update records, reporting processes read or select data. To ensure that a report does not output incongruous results, the RDBMS engine will refuse to change a record that's being read by a report. To generate an accurate quarterly accounts closing report, the engine has to use static data. If there was no locking mechanism in place, it's possible that some of the totaled items on the first page could change without that change being reflected in totals on the subsequent pages. In this case, a manual tally of the figures would reveal the discrepancy. The locking mechanism makes sure everyone takes their turn to access the data for reading and writing.

Caution

Locking doesn't prevent proper functioning of a database, but it does force transactions to "wait" until the resource is released. A locked database most often acts like a system slowdown and causes some front-end applications to time out. One of the most misunderstood and overlooked locking problems is referred to as the deadlock. A deadlock occurs when two connections attempt to access and lock the same two resources at the same time. The first connection might succeed in locking the first resource but not the second resource, which has been locked by the second connection. The second connection is in the same situation as its rival, except that is has successfully placed a lock on the second resource and is waiting on the first resource. Each connection will hold to their positions like dueling partners , each waiting for the other to "blink. "SQL Server recognizes a deadlock and kills the connection with the least amount of CPU cycles invested, thus turning the user into a deadlock victim. In most cases, the deadlock victim's applications will crash.

Many reporting-tool vendors tout the flexibility of their ad hoc querying interfaces for end users as one of the most powerful features of their products. While these reporting tools that only read data seem harmless, they too can be hazardous. If the data to be reported on is on the OLTP server, great care has to be taken to ensure that runaway ad hoc queries can't be made because you will have little control over what users might try to do on the production servers. Sometimes the RDBMS engine needs to create temporary work spaces for internal sorting and hashing needs. If a user creates and runs a query that turns out to be missing join predicates, the RDBMS engine could attempt to build a result set large enough to prevent core business transactions from accessing needed disk, memory, or processor resources. 5

The process of building new data models is even more intrusive than building reporting functions because unlike most reports, a data model's exhaustive search for patterns requires it to read as much data as possible to create the information that gets stored in the model. The nature of the data-mining process is such that very large quantities of data need to be analyzed at once in order to derive accurate patterns and new trends. Because of the extensive scope of this data analysis, Data Mining Services would most certainly find itself competing directly for resources with line-of-business transactions if Data Mining Services was getting its data from the OLTP server.

Tip

The best way to avoid poor performance and to optimize OLTP and data mining is to create a data warehouse on a separate database from the OLTP database or on a separate server, and use that data exclusively for extraction and querying. In this way, the core line-of-business applications can be isolated from the model-building processes of data mining while still permitting flexibility for users who need to construct reports for their own needs. If the data-mining process slows the server temporarily, it becomes more of an annoyance to any other reporting process instead of causing problems for critical customer service processes.

Data Warehousing Architecture for Data Mining

Data Mining Services does not have access to higher-level data structures and models. It has no semantic structure to check facts. The data-mining model assumes that the data is factual and uses it to provide the resulting predictions . It's crucial that you take the necessary precautions to ensure that the data you start with is accurate. Assuring accurate data may require significant analysis of the attribute values inserted into the data-mining tools. To ensure that the conclusions drawn from past events are indeed valid, the team responsible for the data intelligence tools needs to implement a good process that cleans data as it goes into a data warehouse. The first step in the process of mining data is to find out whether you actually have the right data in your database. Deriving new patterns and rules not only requires specific data to be available, but also requires this data to be in granular form. If granular data is not available, you run the risk of missing certain meaningful information. For example, if the car dealership discussed in the previous chapter wanted to discover information about related sales, including the sale of car accessories, loans, or insurance, it would only get access to part of this information if related sales are not tied to vehicle sales. To apply data mining, you must first understand your data. Ask yourself the following questions:

What data is available?
How granular can we make the information?
How useful is the data? 6
What is the quality of the validation process?

Once the quality of this data has been validated , your next concern is how to move the OLTP data into a warehouse.

Creating the Warehouse from OLTP Data

The process of extracting data from an OLTP system for a data warehouse is complex, time-consuming , error-prone , and probably about as glamorous a task as peeling potatoes. Unfortunately , there is no way around this step in the process. The quality and usability of the resulting predictions makes this step a crucial part of the data-mining process.

Note

Data warehousing uncovers most of the problems that exist within the production systems. Any errors that don't get resolved at this point will surely show up in the form of bad results in the subsequent data-mining process that uses the warehoused data simply because those processes are the most visible components of the system.

Ensuring Data Quality

Most large databases are full of errors, sloppy entries, and missing data. Errors usually go unnoticed because people assume that if large amounts of data have been collected without causing the shutdown of the system, the data must be OK. This assumption is proven false the minute the extracted OLTP data is applied to a well-designed data warehouse. 7

Many organizations, particularly some of the large companies that house enough data to consider data mining, often use many different data-capturing systems that define and describe transactions, entities, and relationships between tables in different ways. Some systems use multiple naming conventions, incompatible or hidden encoding structures, or historical summary information, which is kept for only a limited time. Here's a non-exhaustive list of problems encountered when creating a data warehouse.

‚ Hard-coded names ‚ Programs that rely on the file or table names to access data. These present a problem because the logic applied to retrieve the data is too closely coupled with the logic present in the code.
‚ Inconsistent field lengths ‚ Field lengths may differ for the very same attribute in different tables. For instance, a company name might be allowed 50 characters in one table and only 35 in another. This is a problem only if entries with field lengths of 35 or more are present in one system and end up truncated when merged with the tables that have entries with smaller fields.
‚ Missing values ‚ The tables might contain missing values in certain fields, or numerical fields might have 0 instead of NULL values for unknown amounts. This can cause serious problems when performing summary aggregates and averaging. This is even more detrimental to the data-mining process simply because of the potential a 0 value has for predicting future behavior. 8
‚ Inconsistent values ‚ These occur when two departments enter the same conceptual entity but give it two different names. For example, if different operators enter the values of"Green, " "GRN, " "GR, "and"Forest Green"to describe green automobiles, the data-mining model will create four classifications for the same color car. Inconsistent values are also created when values such as the area code of a phone number or a name without a title is entered.
‚ Inconsistent field types ‚ This problem occurs when values for the same type of item are kept in different data types. For example, it's possible to have zip codes kept in an integer-type field in one table and in a character-type field in another table. This inconsistency makes it very difficult to automatically merge these values into one table. Program code or conversion functions are required to integrate them.
‚ Inconsistent entity handling ‚ When different departments within the same company take charge of their own data, each department might have different levels of tolerance for sloppy data entry. For example, if a table contains only one field for both first and last names of individuals, the names can be entered in multiple ways. If data can't be parsed into first and last names, the records can't be automatically merged with records in other tables with the same names. Inconsistent handling of entities can even make it difficult to separate the names into two or more fields.

Storing redundant data on data islands, or unique data repositories, located throughout the company also leads to problems. Islands of data, illustrated in Figure 3-3, are especially difficult to handle because the context in which the data was entered each time has to be known in order to accurately house the data. 9

Figure 3-3. Data islands managed by individual departments.

Data Warehousing Nightmare

One cable television and Internet service provider I did OLAP consulting work for ran into problems when it tried to build its first data warehouse by combining subscriber information taken from various departments and each department used a different system to record subscriber information. Each department gave subscribers unique customer entities with specific attributes that resulted in duplicate and triplicate entries. Because deriving conclusions about individual customers was crucial to the data-mining process, a huge amount of work had to be done to "clean "this data and combine the records so that each subscriber had only one record that could then be housed in a table within the data warehouse.

When to Discard Data from the Warehouse

Once the data is cleaned and stored, some warehouse records may still be useless. It's difficult to know which records to discard and when. Often, the best solution is to remove some records entirely from the source data before using it for data mining. Remove records that don't represent the reality of the transactions or the current state of the business. These records only cause false predictions about future results and include the following:

‚ Test data 10 ‚ Developers and systems administrators are often surprised by the quantity of test-related data that resides in production environments. Test data is made up of bogus clients and fictitious transactions and therefore is not to be used as source data.
‚ Data that represents exceptional circumstances ‚ For example, if a grocery store had increased sales of batteries and canned foods after a hurricane , these sales should be discarded. The numbers would do nothing to help predict trends, unless, of course, hurricanes were a frequent occurrence in the area.
‚ Artificial data such as dummy refunds made to rectify erroneous sales ‚ For instance, if an accountant discovers that the previous month's sales figures included some double entries from a faulty application, he might enter equal negative amounts or create fake refund entries to rectify the data. The entry of artificial data rectifies reporting totals and keeps accounting applications running smoothly. Unfortunately, any phantom sales and refunds must be removed from the data source, or the predictions will also be illusory.

Optimizing Data for Mining

Optimally (and optimistically speaking), the data sources used to create data-mining models are contained in one simple table of columns and rows. Of course most data sources are quite different, even those that come from a warehouse. Often data is stored in multiple tables (perhaps even in multiple formats), and the tables are related in all sorts of explicit and not-so-explicit ways. For the best predictions and performance, keep your source data organized in as simple a format as possible. 11

Data Structures

When a data-mining model is made, it is the data-mining algorithms that extract the data from a flattened table, which is a single table or view that contains all the necessary rows and columns from the data source. Once this has been managed, the important decision of how to manage the columns and rows has to be made.

Choosing Columns

Columns, or fields, are the units that contain a particular kind of data in a given data type. When used for data mining, the column type is of little concern. It doesn't matter whether the column contains a number, a character, or a data field; what is important, however, is the distribution of the contents of that field.

When identifying the columns you want to use to create your data-mining models, keep in mind what role those fields will play in determining how the data-mining model will be constructed . The following are roles that a given column can have in this process:

Input column
Target column 12
Key column
Value column

Input columns

Input columns are used to create the original patterns. They include the attributes of a given case that describe the characteristics that influence an outcome. The following is a list of attributes included in the case of a car. Each one would be an input column.

Year
Make
Model 13
Body Type
Color
Engine Type
Mileage

It's a given that some cars will sell for a higher price than others based on the combinations of their attributes. Data mining investigates patterns based on attributes as it creates a model.

Target columns

These are the outcomes that you want predicted . A car dealer ?starget columns could include 14

Average age of customers
Average annual income of customers
Predicted sales of those cars priced at 10 percent over blue book value

Results are expressed in these target columns. If you examine the column inputs, you can determine the value of the targets. In this example, after examining the patterns, the car dealer could discover that 25-year-old males with average incomes of $75,000 buy red Corvette sports cars if they have less than 20,000 miles on them.

Key columns

Key columns are not used for deriving data-mining patterns but are used to correlate data in a model to prediction cases in a table. The use of key columns will become clear in ‚ Chapter 12,"Data-Mining Queries," 15 ‚ in which queries are discussed in depth.

Value columns

Value columns contain specific numerical values, such as the price of a particular item associated with each row. In this case, values help predict the future price of the items.

Columns to Avoid

If a table has columns that contain the same values, these values can't be used for data mining. Sometimes values are repeated if a subset of data from the data warehouses is used to create a mining model, and if that subset, by definition, refers to a certain type of case. For example, if a car dealer wants predictions only about Ford vehicles, the Make column in the table should contain only "FORD "as the value.

The value of "0 " would show up in every row if, for example, the dealer had a RUSTPROOFING column to input the cost of rust proofing a new automobile, but as it happened , no one ever had his or her car rust proofed. Obviously this column would be completely useless for making predictions.

It's also a good idea to check for columns with oddball cases, which if included might skew results. For instance, if only four out of 112,000 customers bought leather headlight covers, the LEATHER_HEADLIGHT_COVERS column should be ignored since it's unlikely that such a miniscule deviation from the norm would be of any use for modeling purposes. This is not to say that the causes for this lack of headlight-cover sales shouldn't be examined!Xbut it seems evident that those leather headlight covers should be either marked down or removed from the shelves . Generally speaking, if 97 percent (or more) of a column's data is identical, then chances are slim that you can use that field to discover any meaningful patterns. 16

When columns are unique (instead of too similar) to one another, their value also diminishes. Every database has dissimilar input columns such as:

Social security numbers
Vehicle identification numbers
Primary key values
System-generated passwords
Names
Telephone numbers 17
IP addresses

A pattern cannot be derived from this list of attributes because the values only occur once. There are also no meaningful similarities to be found in the values in a given column. A pattern can exist only when a value repeats over and over under similar circumstances.

There is no question that some of these unique columns with unique values contain valuable information. For example, social security numbers can be parsed to show where and when they were issued; and vehicle identification numbers are encoded with the year, make, and model information. However, to make use of this information, those columns would have to be split into two or more columns. Each new column would have information grouped together and these groups would be used to derive predictions. In this example, the original fields would be ignored or discarded.

Some columns add no predictive value because their value is constant in relation to other values in other columns. For example, if all BMWs come with air conditioning, the relationship between AIR_CONDITIONING = YES and MAKE = BMW has no predictive value. It has no predictive value not only because one condition always exists with the other, but also because the condition exists in the first place because the dealership decided to make AC standard fare in BMWs.

Selecting Rows

Rows represent individual cases that are used to create the model (see Figure 3-4). Check that the data in each row is accurate; valid predictions depend on it. The cases in the rows themselves can be represented by a view made by joining two or more tables together. Regardless of how rows are represented, care must be taken to understand the granularity and accuracy of their content.

Figure 3-4. Data rows and columns in a table.

Calculated and Derived Data

In addition to rows of raw data, you can also add columns and use them to enhance the value of the data in the rows. If the car dealer wants more information, he performs calculations on the row values. Some vehicle-related samples of calculations include:

Mileage / Number of owners
Mileage / Price
Blue book value, Sale price
Number of prospects / Number of sales 19
Sale price, Trade-in value
Sale price, (Transportation cost + Inventory cost)
Sticker price, Sale price

If a data-model builder is sufficiently knowledgeable about the business or subject being studied, there is no limit to the relationships she might look for. The first list of calculations are straightforward; in the interest of discovering new patterns, let's derive more imaginative values such as these:

‚ Sale price / Age of car ‚ To find price patterns.
‚ Age of salesperson, Age of customer 20 ‚ A long shot, but generation gaps shouldn't be discounted.
‚ Miles per gallon / Annual income of customer ‚ Are customers with high annual incomes as concerned with fuel efficiency as those with average annual incomes?

Certain fields contain hidden information that can flag important values. If a retail store combines date values (and these days are related to sales values) with a holiday database, the retailer could discover which items sell best on holidays. Other values related to dates include:

Weekday versus weekend sales
Month of the year
Season of the year 21
End of month versus beginning of month
Tax-refund time

As mentioned earlier, other data items can be parsed to extract more valuable data. These include:

‚ Vehicle identification numbers ‚ Used to derive various fields about the year, make, model, and manufacturer.
‚ Zip codes ‚ Each code is specific to a geographic location. 22
‚ Telephone numbers ‚ Area codes are also specific to a geographic region.
‚ IP Addresses ‚ Networks are identified by portions of the IP address. This address tells the owner of a Web site what provider a visitor to his site uses.

One thing that makes data mining such a labor of skill and imagination is its capacity to make optimal use of derived variables. Valuable information can be extrapolated from seemingly mundane data by using derived data, variables , and simple calculations.

Determining Data Granularity

Ideally, the data used for mining should be of the same granular level as the case level of the predictions you're seeking. Let me explain. If your vehicle sales predictions are based on individual cars, it makes sense to base the data-mining model on individual car sales. On the other hand, if you want to predict which makes and models will sell best in 500 nationwide dealerships, it makes more sense to summarize the vehicle data by make and model characteristics, thereby reducing the level of granularity. If you wanted to predict total yearly sales of cars per state, the granularity of the source data would be more effective if set at the state-level of granularity. 23

Besides reducing the granularity of your data, summarizing allows you to create variables derived vertically instead of horizontally. For example, a column containing the total costs grouped by car make can be materialized through aggregation in a way that isn't otherwise possible.

When creating a flattened view of a data source, it?ssometimes advantageous to combine both detail-level data with summarized data, as shown in Figure 3-5.

Figure 3-5. Join of detail level data and summarized data.

A way to see the advantage of combining detail and summarized data, is to create a data source with customer information and an assortment of invoices for each customer. If you are interested in the type of items the customer purchased per invoice, group the individual line items of the invoice by item type. Figure 3-5 shows a series of invoices from customers of a local computer store. The predictions that result from this data source are based on types of components, not on the individual components purchased by each customer, so the line items are grouped by fixed disks, memory, processors, and so on.

Physical Data Mining Structure

Now that I've described the implementation of the data structures, it's worth looking at a few examples of the most recognized architectural models. Although these architectures are not concerned with the data sources, or even with the models themselves, they do describe how the client applications interact with the data-mining models. An architecture is chosen by the size of the data source that will be mined and by the frequency with which the prediction queries will be issued against the data mining models. 24

Client (Single- Tier ) Architecture

In the scenario illustrated in Figure 3-6, a client machine contains the engine that performs the actual data mining. The data stored in a warehouse is downloaded, stored on the client machine, and all the data preparation and mining is carried out on the client machine.

Figure 3-6. Single-tier architecture.

As client machines become more powerful, so do servers, which means that networks used to download the data source to the client machine could soon be drowning in monster-size data files. Networks should be used for this purpose only when the amount of data to be mined is small and the process is infrequent enough to allow the client machine enough time to process tasks without exhausting its resources and those of the network to which it's connected.

Two-Tier Architecture

Two-tier architecture, as illustrated in Figure 3-7, can mine millions of records on reasonably high-performance servers without a very complex physical architecture. The server houses the data-mining engine as well as the data warehouse and runs all the processes locally. The client machines, through an OLE DB connection, simply call the engine to perform all the necessary data-mining processes and when applicable , receive prediction result sets. 25

Figure 3-7. Two-tier architecture.

Three-Tier Architecture

This architecture generally requires a dedicated high-performance server to run the data-mining engine on the middle tier and an equally high-end server to run the database engine on the back end.

In the example shown in Figure 3-8, the data warehouse resides on the back end while the middle tier is responsible for mining that data. The middle tier loads the data from the back end and mines it, and the results are passed on to the data-mining client machine. To the client machine, the process is identical to the two-tier model.

Figure 3-8. Three-tier architecture. 26

Relational Data Warehouse

The relational data warehouse is, as the name implies, a storage structure that uses the same relational engine as an OLTP database. The warehouse uses the same engine, but stores the data in a separate location from the OLTP data such as another physical server or database. Even though maintaining data integrity and optimal normalization are not top priorities for the relational data warehouse, it does need to use the unique querying powers of the RDBMS to retrieve data according to the specifications of any given SELECT statement.

Advantages of Relational Data Storage

Using relational data storage as the source for data mining is one of the best options available for designing the architecture of a corporate intelligence system. There are some tradeoffs, but consider the advantages described in the following sections.

Flexibility

When both the OLTP data and the data warehouse share the same engine, there's a very good chance that the inter-server or inter-database data transfers are especially favored by the seamless communications structures that usually exist between the RDBMS implementations . When these engines use the same version and are made by the same vendor, as is the case if both database engines are SQL Server 2000, then the data transfers are perfectly seamless.

Note

If you decide to use OLAP, a special communication bridge has to be built between the OLAP engine and the RDBMS to facilitate the movement of data from the relational structures into the cubes. Although there are specific advantages to proceeding in this way, for example, the interfaces to the cube are standardized, special care has to be taken to understand the inner workings of both relational and OLAP structures to ensure that the cube really does reflect the state of the data in the OLTP system. In other words, you should understand how to get information from the cube and from the relational database so that you know which rows in the database are processed and in the cube and which ones still await processing. 27

Relational Databases Use Standards

Database designers and system administrators find relational databases easy to understand because they share standards with OLTP systems. There are plenty of OLAP vendors with unique storage implementations and querying rules, but the science and structural philosophy of relational data is common to most RDBMS engines. The warehouses are better built because the standards are familiar to more people; for example, terms such as primary keys, foreign keys, rules, and special data types are second nature to most engineers working in the field. Because the two structures can be directly compared, the OLTP data is easy to test and validate. Because the metadata is easily accessible, the database permits the use of indexes, which are beneficial and easily applied to the data-mining process.

Easy Access to Lowest Level of Granularity

One of the most important steps in the data-mining process is the comparison of test cases to real cases.

Note

Test cases are cases that were not used to build the original data-mining model, but which are kept in order to measure the effectiveness of the model. These cases are used as inputs as if they were new data that had unknown values that need to be predicted. Since you already know what the real values are, the outputs from the model can be tested for accuracy against them.

If the input cases are of the same granular level as the transactions used to create the data warehouse, it's also crucial that the model contains that same level of granularity. For example, if you wish to create a data-mining function that lets you input an automobile's characteristics and then guess its price and customer profile, you would need a data-mining model that contains individual cases at a granular level including sales and customer information.

OLAP optimizes the use of aggregated information by sacrificing convenient and intuitive access to the detail level of data that is used to populate the cube. Naturally there are ways to work around this problem such as using the drill-through functions in Analysis Services, but these functions often require the use of other special functions that in turn refer back to the underlying data source in a somewhat disjointed manner. Even with this disadvantage , OLAP is still used for many data-mining scenarios. 28

Building Supporting Tables for Data Mining

The main source component of the relational data-mining process is the case table, which contains the raw data the computer must "learn" from. Out of this table, you have to create the following columns:

‚ Case key column ‚ This is the column that uniquely identifies the case you want to model such as the make of car or the name of a bank customer.
‚ Input columns ‚ These are the columns that contain the known attributes that describe the case. For instance, the car, color, and make are known attributes that go into a case record and later serve to facilitate predictions using similar attributes in test sets.
‚ Predictable columns 29 ‚ Predictable columns contain the expected values for predictions, which are based on the attributes in the input columns.

Case tables contain all the columns needed to establish the cases set for a data-mining model. There are times when the cases can be derived directly from a single table in a relational database. This is a simple matter to set up because all the relevant columns are kept in one place. If the source is in multiple tables, joining the tables to make them appear as one will make this task a bit more complex. When you select a data-mining method, whether it's clustering or decision trees, you must choose an algorithm to use with the relational data-mining model, which then automatically attaches you to a specific data-mining algorithm provider. The data-mining algorithm provider presents the choices of data-mining algorithms and defines the physical structure for the data-mining model.

OLAP cubes

OLAP cubes are an important part of data mining not only as a source of data for creating data-mining models, but also as a structure for storing data-mining models themselves.

How Data Mining Uses OLAP Structures

Instead of using a traditional flattened table, OLAP provides a cube's dimensions as the input source. With this cube's dimensions, the data-mining service creates the cases needed for the data-mining model.

The columns in an OLAP data-mining model are constructed from visible levels in the OLAP cube. To create an OLAP data-mining model, the Mining Model wizard requires a cube whose dimensions contain at least one visible level. 30

Once the model is created, it can be stored in a virtual cube, or a dimension of the original source cube. OLAP data-mining models cannot be created directly from relational data sources or virtual cubes that contain a data-mining dimension.

The outputs or prediction results can come in various forms depending on how the results are going to be used. These forms include:

One of the measures of the source cube
Member property of the chosen dimension and level member

In the beginning of this chapter, we looked at some basic differences between data warehousing structures and OLTP data. We denormalized, or flattened, table structures to reduce table joins. If we really want to take this process one step further and eliminate the need to filter data altogether, we can create structures that are not only denormalized, but horizontally and vertically partitioned as shown in Figure 3-9.

These tables are almost completely useless for an OLTP database. Instead of rows containing attributes that can be filtered through a WHERE clause, multiple tables of the same exact structures are created. These contain a reduced set of fields used to answer more specific questions. They also contain only rows that belong to a category, such as a Products category or a Date-of-Purchase category. On the other hand, a query, like the one shown below, is easy to construct not only because the tables contain less rows, but also because the WHERE clause is unnecessary.

 SELECT     CompanyName,     ShippedDate,     ProductName,     count(1) AS qty FROM Tofu_Purchases_April_1998  GROUP BY     CompanyName     ShippedDate   31

Figure 3-9. Cubed structures.

Advantages of OLAP Storage

An OLAP cube is designed to rigidly structure data to make it easy for us to examine aggregates. This aggregation, because it's precalculated, offers a unique opportunity to create derived fields that can exist because there is only aggregated data and no detailed-level data. These fields include:

Sums
Averages
Standard deviations 32
Ranks
Counts
Mathematical functions
Financial functions

Speed

Data in OLAP cubes is preaggregated and stored in cells that require a querying engine to define the cell coordinates. OLAP data is radically different from relational data. OLAP cube data takes longer to prepare and is used only once. However, one of the big advantages of OLAP technology, regardless of how it's ultimately used, is the speed by which an operator can navigate sums and averages across dimensions. If the cases that feed the model are contained in a very large data warehouse and if the data-mining process is able to dispense with the need for detailed-level data, then the models can be built in a fraction of the time required to build a similar model using relational data.

Easy Navigation 33

To fit into a cube, data must be precisely structured to meet the cube's requirements. The cube's structure is exposed to outside reporting tools and OLE DB wrappers such as ADO MD, so you can determine the cube's dimensions. After you know the cube's dimensions you can surf the cube's data to quickly examine aggregates through various dimensional levels. This is possible because dynamic aggregates conform to the framework provided by the relatively static dimensions.

More Info

Because Data Mining doesn't need to navigate data like a human operator does, easy navigation is a relatively unimportant feature. On the other hand, the results of data-mining queries are automatically placed in dimensions that are stored in OLAP structures and those dimensions, when applied to aggregates, let you quickly review some what-if scenarios.

OLAP Storage Requires Highly-Structured Dimensions

In order to make the aggregates consistent across dimensions, the dimensions and measures themselves must be clearly defined. Many dimensions contain information that, when processed, exists in a hierarchical structure. For example, if a Car information dimension is created with Year as the top dimension and Car Make as a low dimension, the OLAP engine will usually show a warning message or not create the dimension if there are more distinct Year values than there are Car Make values because this can be a violation of the intended structure of the cube. OLAP's fact checking ability prevents the storing of faulty or irrelevant aggregates in the cube.

Semantic rigor on the part of OLAP makes for more accurate data-mining models because the algorithms assume that the attributes of those individual cases actually mean something. Because nothing at the engine level or conceptual level of the relational model really prevents inaccurate hierarchies from existing, the model is more prone to error. A false hierarchy, for example, might create models that place too little, or too much, emphasis on certain attributes because they seem to belong to a certain level in a given dimension. This inaccurate level assignment could result in false predictions.

Flat Tables vs. Multidimensional OLAP Tables

While flat table structures are fine for simple data-mining projects, they present several problems for more complex projects. Flat tables are difficult to analyze, they contain lower grained information, and when they are used for complex projects so many rules are needed to maintain the integrity of the data that the model becomes overly complex and unwieldy. On the other hand, an OLAP data-mining system analyzes data across multiple dimensions and takes full advantage of the patterns along these dimensions. The patterns are merged, conclusions are drawn, and predictions are made based on summarized groups of data. In the end, our fictitious data-mining car dealer discovers trends such as"Red trucks are profitable overall, but much of the profit comes from dealerships in the southern United States"and"Blue trucks are profitable too, but mostly in the northeast United States." 34

When you think about it, data analysis, in almost all cases, involves a certain degree of analysis of cumulative figures. Most companies with large data warehouses rarely analyze an individual case. It's the cumulative data such as sales per month or sales by store that most interests the business analyst. Data mining analyzes and makes predictions based on this type of cumulative data.

When OLAP Is Not Appropriate for Data Mining

As wonderful as OLAP is, there are some important limitations to consider when using it as a source for data mining. The most important limitation of OLAP is that it does not provide data granularity. As mentioned earlier, OLAP is great when it comes to aggregation; unfortunately, it completely fails when the data mining model is required to represent individual line items such as Sales. If a marketing department wanted to use data mining to predict the sales potential of a certain make of automobile, regardless of the individual features and condition of the vehicle, it would make sense to use sales summaries of the car make and not the individual sales for each car. An OLAP structure would work perfectly for this job. On the other hand, if your predictions need to be based on individual cases (such as on the customer case, in order to make predictions about individual customer behavior), the OLAP structures would be incapable of providing such input.

Summary

As discussed, the data-mining models are ultimately fed with data from OLTP systems, which are most often kept in relational database structures. Although there's little doubt this OLTP data should be moved to a data warehouse more conducive to the process of data mining, one of the first things to determine is whether this new structure is going to be another relational system or an OLAP system.

Relational systems offer a bit more flexibility and are easier for people to build. It also is the only storage mechanism that seamlessly preserves the original granular levels present in the original OLTP systems.

OLAP offers an enhanced storage mechanism that optimizes data retrieval, enforces rigid semantic structures, and facilitates intuitive navigational features. As a source for building data-mining models, it offers a wider array of storage options for the predictions and the models themselves, including other OLAP structures. Although OLAP doesn't store low-level granular data, it's the ideal source of summarized data about groups. 35

When deciding on a physical architecture on which to implement your data-mining operation, the size of the data models and the frequency of the prediction queries must be taken into account. The three-tiered architectural model is ideal when there is an exceptional amount of data to be mined. Three-tier models require two high-end servers that divide up the task of serving the data from the model. The most common architecture is the two-tier model; one tier is for the clients who query the model and the second tier is for the server that processes the requests . The single-tier model is designed for light-weight data-mining processes that can be executed on client machines without having an adverse affect on network bandwidth or exhausting the relatively weaker resources in the client machines.