As you have noticed from reading this book, Microsoft Analysis Services provides data-mining algorithms that are primarily suited for classifying cases into groups and predicting missing values from discrete sets. With Microsoft Data Mining, it's easy to find whether a given mushroom is edible based on its characteristics or what film genre a given customer is likely to rent based on their past behavior, but it's virtually impossible to predict specific continuous (numerical) data values, such as the price of a car based on its make, model, and condition.
Other data-mining algorithms, such as regression analysis, are well suited to this type of data mining. Unfortunately, Microsoft Data Mining doesn't support regression analysis at this time. However, this shouldn't stop you from using Microsoft SQL Server 2000 to apply such algorithms to make predictions. The rich array of tools and functions available in SQL Server 2000 makes it possible to use other algorithms. In this appendix, I'll show you how the regression analysis algorithm works and one way to apply it directly to OLAP and relational data without using data-mining components from Analysis Services. This serves two purposes: first, it provides you with a solid understanding of the underlying processes of a data-mining algorithm; second, it suggests an alternative way to mine data using your programming knowledge and the SQL Server tools. As soon as Analysis Services allows third-party algorithms, you will be able to use variations of the formulas described in this appendix to add regression analysis to the list of algorithms available in the data-mining tools.
What Is Regression Analysis?
Regression analysis is the application of statistical formulas to two sets of numbers to determine the degree to which one set of numbers affects the second set. The purpose of regression analysis is to improve our ability to predict the next occurrence of the dependent variable based on past experience. Regression analysis is also the mathematical relationship between two variables expressed in an equation; one variable can be predicted using what we know about the other. The variable whose value is to be predicted is called the dependent variable. The variable about which knowledge is available is called the independent variable. The relationships between variables may be linear or curvilinear. By linear, we mean that the functional relationship can be described graphically (on a common X-Y coordinate system) with a straight line and mathematically by the common form y = mx + b. By curvilinear, we mean that the relationship is defined by a curved line, which happens to be more complex to calculate.
There are many situations in which the value of one numerical variable has an effect on the numerical value of another variable. Our example will use regression analysis to find out whether the size of a house has a direct effect on the cost of electricity. If the results show a relationship, regression analysis can also help predict how much the cost increases per square foot.
Predicting Continuous Attributes: An Example
The best way to grasp how regression analysis works is to actually work through an example. Table A-1 lists the square footage and average annual electricity costs of 12 homes:
Table A-1. House Size vs. Observed Electricity Costs

House Size (sq. ft.) | Annual Electricity Costs
---|---
1900 | $600
2000 | $660
2525 | $900
1300 | $480
3500 | $1,680
4100 | $1,710
2870 | $1,200
1700 | $540
2300 | $1,500
2450 | $1,080
2890 | $1,200
1890 | $540
Imagine that you are going to buy a home that you intend to live in for at least five years. Before deciding which home to buy, you need to know how much it will cost per month. Besides the mortgage, the costs include maintenance, taxes, insurance, and utilities. Because the homes are located in a warm climate, you will have the added expense of air conditioning.
Correlation
The idea that there is a correlation between the size of a home and the cost to live in it sounds reasonable. To test the hypothesis, the first thing you need to do is establish that there is a link between home size and electricity costs. This link, or correlation, is expressed as a number between -1 and 1. The closer to 1 the number is, the stronger the positive correlation: the larger the home, the higher the electricity costs. The closer to -1 the number is, the stronger the negative correlation; if there were a negative correlation in this relationship, a larger home would use less electricity. If the correlation number is 0, it is unlikely that there is any link between home size and energy consumption. The number that describes the correlation is known as the correlation coefficient. I'll show you the actual equation for calculating the correlation coefficient in a moment, but first I'll build out the table of sums and means that we'll need to plug into the equation. Table A-2 contains all the numbers needed to calculate both the correlation coefficient and the regression line that I'll discuss later in the section. The column of numbers containing the independent variables is commonly referred to as the X column or the column of x values. The dependent variables are commonly referred to as the y values.
Table A-2. x and y Value Computations for Correlation and Regression Equations

 | X | Y | X² | Y² | XY
---|---|---|---|---|---
1 | 1900 | 600 | 3,610,000 | 360,000 | 1,140,000
2 | 2000 | 660 | 4,000,000 | 435,600 | 1,320,000
3 | 2525 | 900 | 6,375,625 | 810,000 | 2,272,500
4 | 1300 | 480 | 1,690,000 | 230,400 | 624,000
5 | 3500 | 1680 | 12,250,000 | 2,822,400 | 5,880,000
6 | 4100 | 1710 | 16,810,000 | 2,924,100 | 7,011,000
7 | 2870 | 1200 | 8,236,900 | 1,440,000 | 3,444,000
8 | 1700 | 540 | 2,890,000 | 291,600 | 918,000
9 | 2300 | 1500 | 5,290,000 | 2,250,000 | 3,450,000
10 | 2450 | 1080 | 6,002,500 | 1,166,400 | 2,646,000
11 | 2890 | 1200 | 8,352,100 | 1,440,000 | 3,468,000
12 | 1890 | 540 | 3,572,100 | 291,600 | 1,020,600
Sum | Σx = 29,425 | Σy = 12,090 | Σx² = 79,079,225 | Σy² = 14,462,100 | Σxy = 33,194,100
Mean | avg(x) = 2452.08 | avg(y) = 1007.50 | | |
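The aggregates in Table A-2 are simple to derive programmatically. As a quick illustrative sketch (Python is used here only for verification; it is not part of the SQL Server toolset discussed in this appendix), here is how the sums and means could be computed from the Table A-1 data:

```python
# House size (sq. ft.) and annual electricity cost pairs from Table A-1.
homes = [(1900, 600), (2000, 660), (2525, 900), (1300, 480),
         (3500, 1680), (4100, 1710), (2870, 1200), (1700, 540),
         (2300, 1500), (2450, 1080), (2890, 1200), (1890, 540)]

n = len(homes)                          # support (12 cases)
sum_x = sum(x for x, _ in homes)        # Sum of x
sum_y = sum(y for _, y in homes)        # Sum of y
sum_x2 = sum(x * x for x, _ in homes)   # Sum of x squared
sum_y2 = sum(y * y for _, y in homes)   # Sum of y squared
sum_xy = sum(x * y for x, y in homes)   # Sum of the x*y products
avg_x = sum_x / n                       # mean of x, about 2452.08
avg_y = sum_y / n                       # mean of y, exactly 1007.50

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
```

Running it reproduces the sums in the last rows of Table A-2.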
Now that we've calculated the sums and the means of each column, we can move on to the equations. Figure A-1 shows the equation used to calculate the correlation coefficient. To simplify the explanation, I've divided the computation into its three main components: the covariance of x and y divided by the product of the standard deviation of x and the standard deviation of y.
Figure A-1. The correlation coefficient equation: r = Covxy / (Stdx × Stdy).
The covariance is calculated with the formula in Figure A-2.
Figure A-2. The covariance of x and y equation: Covxy = (Σxy / n) − (avg(x) × avg(y)).
By now some of you are probably wishing you had never fallen asleep in high school precalculus. No need to worry; I'll explain this process step by step just to show you how simple it is.
First we get the covariance by substituting the numbers:

Covxy = (33,194,100 / 12) − (2452.08 × 1007.50)

Covxy = 295,701
Then we calculate the standard deviations of the X and Y columns, starting with the X column. (See Figure A-3.)
Figure A-3. The standard deviation of x equation: Stdx = sqrt((Σx² / n) − avg(x)²).
Substituting the numbers for x gives us the following result:

Stdx = sqrt((79,079,225 / 12) − 6,012,712.67)

Stdx = 759.75
Figure A-4. The standard deviation of y equation: Stdy = sqrt((Σy² / n) − avg(y)²).
Substituting the numbers for y gives us the following result:
Stdy = sqrt((14,462,100 / 12) − 1,015,056.25)

Stdy = 436.02
Substituting all the numbers gives us the following result:
r = 295,701 / (759.75 × 436.02)

r = .89, or 89 percent
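The same arithmetic can be checked with a short Python sketch (again, an illustrative verification only, not part of the book's tooling); the sums are taken from Table A-2:

```python
import math

n = 12
sum_x, sum_y = 29425, 12090                     # Sums of x and y from Table A-2
sum_x2, sum_y2 = 79079225, 14462100             # Sums of the squares
sum_xy = 33194100                               # Sum of the x*y products

avg_x, avg_y = sum_x / n, sum_y / n
cov_xy = (sum_xy / n) - (avg_x * avg_y)         # covariance, about 295,701
std_x = math.sqrt((sum_x2 / n) - avg_x ** 2)    # about 759.75
std_y = math.sqrt((sum_y2 / n) - avg_y ** 2)    # about 436 (the text shows 436.02)
r = cov_xy / (std_x * std_y)                    # correlation coefficient, about .89
r_squared = r ** 2                              # coefficient of determination, about .80
```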
This number, because it is close to 1, or 100 percent, shows a very strong positive correlation. In other words, there is a direct link between home size and electricity consumption. The next question is how much electricity a house of a given size will consume per year. To get this information, we first need to calculate the regression line.
How to Interpret the Correlation Coefficient Value
Generally, anything above 50 percent correlation is considered significant. However, that depends heavily on the size of the sample. An 80 percent correlation when there are only five rows of data in the sample is largely insignificant compared to a 54 percent correlation when there are 150 rows of data in the sample. The larger the sample, the lower the correlation can be and still be meaningful; the smaller the sample, the higher the correlation must be before we can conclude that the relationship is significant.
A common misinterpretation of correlation figures is that strong correlations somehow prove that the independent variables (the x values) cause the values of the dependent variables (the y values). To prove that this is not always true, some statisticians have shown strong correlations between the number of Certified Public Accountant (CPA) certifications and the number of prison incarcerations for a given year. There are coincidental factors that contribute to those figures but no causal link between the two. There are various statistical tests that can be done to investigate the significance of the correlation, but these are beyond the scope of this appendix.
To quantify how much of the behavior of one value is explained by the other, the correlation coefficient is squared to come up with the coefficient of determination, which tells us what proportion of the variation in the dependent variable is accounted for by the independent variable; note that it does not by itself establish a causal effect. The squaring of the correlation coefficient causes the number to be smaller, so if your correlation coefficient for a given analysis is .80, then the coefficient of determination will be .64.
The Regression Line
In this case, the independent variable is the square footage of the house. We assume that this number (x value) is not influenced by any other variables. The electricity cost is the dependent variable (y value) because it is the direct result of the independent variable. Before we discuss the statistical formulas, look at the numbers from Table A-2 when transferred to a graph. (See Figure A-5.)
Figure A-5. Scatter chart representation of home size vs. electricity costs.
As you can see from Figure A-5, the algorithm looks for the straight line that best describes the relationship by making the points fall as close to the line as possible. Regression looks for the slope and intercept of the line closest to all the points on the graph. The slope is a number that describes the steepness of the line as it moves up or down along the Y axis of the graph. The intercept is the number that denotes the point where the line crosses the Y axis, that is, the value of y when x = 0. Because of this goal, this line is known as the best fit line or regression line.
This section uses a straight line to find this best fit. However, it's often more accurate to establish relationships between numbers using curved lines, especially when the numbers show trends indicating that they go through predictable or seasonal changes. A curved line could follow this trend far more accurately than a straight line, but for the sake of simplicity, we'll use the straight line. Once you understand linear regression, I urge you to explore the use of nonlinear regression for data sets that do not necessarily follow a linear trend.
If a = the intercept and b = the slope, the basic equation that best describes linear regression is as follows:
y = bx + a
If we can calculate the values for a and b, then for a given house size (x) we should be able to solve the equation and get the expected electric bill (y).
Finding the Slope
The equation for the calculation of the slope is shown in Figure A-6.
Figure A-6. The regression line slope equation: b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²).
By substituting the numbers, we get the following result:
b = [(12 × 33,194,100) − (29,425 × 12,090)] / [(12 × 79,079,225) − 29,425²]

b = 42,580,950 / 83,120,075

b = 0.512
We now have one variable of the equation set. The next step is to find the intercept, which we'll do with the help of the slope.
Finding the Intercept
The formula for finding the intercept is shown in Figure A-7.
Figure A-7. The intercept of the regression line equation: a = avg(y) − (b × avg(x)).
Table A-2 provides the numbers we plug in to the equation:
a = 1007.50 − (0.512 × 2452.08)

a = -248
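The slope and intercept calculations can likewise be verified with a small Python sketch (illustrative only, not part of the SQL Server toolset), using the sums from Table A-2:

```python
n = 12
sum_x, sum_y = 29425, 12090
sum_x2, sum_xy = 79079225, 33194100

# Slope: b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
b = ((n * sum_xy) - (sum_x * sum_y)) / ((n * sum_x2) - sum_x ** 2)

# Intercept: a = avg(y) - b * avg(x). Using the slope rounded to three
# places, as the text does, gives a of about -248.
a = (sum_y / n) - (round(b, 3) * (sum_x / n))
```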
The Regression Coefficient
Now that we have computed the numbers, we know that the regression formula is y = (0.512 × x) − 248. By replacing x with a proposed home size, we can predict the expected annual electric bill. The equation also works in both directions: if we wanted to estimate the size of a house, we could simply supply a proposed electric bill for y and solve for x.
Using Regression Analysis to Make Predictions
Imagine you have seen several homes, and you want to find the annual electric bill for each. By substituting the home size in square feet for x, you'll get the y value, or the expected cost of electricity per year. (See Table A-3.)
Table A-3. Using x to Predict y

Square Footage of Home (X) | Expected Cost of Electricity (Y)
---|---
1150 | $340.80
1875 | $712.00
2247 | $902.46
3324 | $1,453.89
3754 | $1,674.05
4120 | $1,861.44
4621 | $2,117.95
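As a quick illustrative check (again in Python, purely for verification), the predictions in Table A-3 come straight from the regression equation:

```python
def predict_electric_bill(sqft):
    """Annual electricity cost predicted by the regression line y = 0.512x - 248."""
    return (0.512 * sqft) - 248

# Reproduce the rows of Table A-3.
for sqft in (1150, 1875, 2247, 3324, 3754, 4120, 4621):
    print(sqft, round(predict_electric_bill(sqft), 2))
```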
In addition to the dangers associated with making predictions based on data that might have high coincidental correlations, there is also the danger of making predictions that go beyond the range of values present in the sample. In our house size vs. electricity bills example, the house sizes ranged between 1300 and 4100 square feet. It might be tempting to predict values for x and y where x is outside that range, as I did in Table A-3: the first row has a home with 1150 square feet and the last row one with 4621 square feet. Because these two values fall outside the range of house sizes in the sample, it is risky to act on the calculated results; the predictions rest on the assumption that the relationship remains linear outside our experience. Making predictions based on numbers that do not fall within the sample is called extrapolation. This is in contrast to interpolation, a process that generates predictions based on the range of values contained in the model.
Cause and Effect
Regression and correlation analysis cannot determine cause and effect. It's up to the analyst to do a logic check, determine an appropriate hypothesis, and test the hypothesis against the data. For example, a correlation value of .95 relates the number of new CPAs in a city to the number of people arrested for drunk driving in the same city in one year. Clearly there is no cause and effect involved here. A deeper variable, population, is the true independent variable that drives both the number of CPAs and the number of arrests made for drunk driving. As analysts, we must choose data sets that are related and check that they influence one another.
Extrapolation is also riskier when the sample is not representative of the homes being predicted. For example, large houses may show a major increase in electricity use simply because they have proportionately higher ceilings than smaller homes, which causes their air-conditioning bills to be higher than in houses with lower ceilings. Predicted values for large homes will be false if they are based on the proportions measured with the smaller homes.
Analyzing the Accuracy of the Regression Line
The slope and the intercept of the regression line tell us very little about how well the line actually fits the points on the graph. Although the correlation coefficient mentioned previously in the chapter tells us about the strength of the relationship between the numbers, we still need a way to quantify the accuracy of our predictions. Ideally, every prediction of y would come with an error range that tells us how far off the mark we might be. For example, it would be nice to know that a house of a given size is predicted to consume y dollars in electricity, give or take v dollars. To calculate our margin of error, we use the formula in Figure A-8.
Figure A-8. The formula for testing the error range of y for a value of x: error = t × sqrt(Σ(y − y′)² / (n − 2)) × sqrt(1 + 1/n + (x0 − avg(x))² / Σ(x − avg(x))²), where t comes from the t-distribution for n − 2 degrees of freedom.
Use Table A-4 for the following formulas.

Table A-4. x and y Values for Correlation and Regression Equations, with Additional Columns for Variance Analysis
X | Y | X² | Y² | XY | Y′ | (Y − Y′)² | (X − AVG(X))²
---|---|---|---|---|---|---|---
1900 | 600 | 3,610,000 | 360,000 | 1,140,000 | $724.80 | $15,575.04 | 304,796.01
2000 | 660 | 4,000,000 | 435,600 | 1,320,000 | $776.00 | $13,456.01 | 204,379.34
2525 | 900 | 6,375,625 | 810,000 | 2,272,500 | $1,044.80 | $20,967.04 | 5,316.84
1300 | 480 | 1,690,000 | 230,400 | 624,000 | $417.60 | $3,893.76 | 1,327,296.01
3500 | 1680 | 12,250,000 | 2,822,400 | 5,880,000 | $1,544.00 | $18,496.00 | 1,098,129.34
4100 | 1710 | 16,810,000 | 2,924,100 | 7,011,000 | $1,851.20 | $19,937.44 | 2,715,629.34
2870 | 1200 | 8,236,900 | 1,440,000 | 3,444,000 | $1,221.44 | $459.67 | 174,654.34
1700 | 540 | 2,890,000 | 291,600 | 918,000 | $622.40 | $6,789.76 | 565,629.34
2300 | 1500 | 5,290,000 | 2,250,000 | 3,450,000 | $929.60 | $325,356.16 | 23,129.34
2450 | 1080 | 6,002,500 | 1,166,400 | 2,646,000 | $1,006.40 | $5,416.96 | 4.34
2890 | 1200 | 8,352,100 | 1,440,000 | 3,468,000 | $1,231.68 | $1,003.62 | 191,771.01
1890 | 540 | 3,572,100 | 291,600 | 1,020,600 | $719.68 | $32,284.90 | 315,937.67
Σx = 29,425 | Σy = 12,090 | Σx² = 79,079,225 | Σy² = 14,462,100 | Σxy = 33,194,100 | | MSE = $46,363.64 | Sum = 6,926,672.92
Notice that the first thing we must do is get the predicted value for each of the x values that we used to build the model; the Y′ column contains these predicted values. The formula also needs a t-distribution value, which is chosen based on the degrees of freedom (n − 2) and the desired significance level, as shown in Table A-5.
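Putting the pieces of Table A-4 together, the error range for a prediction can be sketched in Python (an illustrative reconstruction of the Figure A-8 formula, not the book's tooling, using the 95 percent t-value of 2.228 for 10 degrees of freedom from Table A-5):

```python
import math

# Table A-1 data and the rounded regression coefficients from the text.
homes = [(1900, 600), (2000, 660), (2525, 900), (1300, 480),
         (3500, 1680), (4100, 1710), (2870, 1200), (1700, 540),
         (2300, 1500), (2450, 1080), (2890, 1200), (1890, 540)]
slope, intercept = 0.512, -248
n = len(homes)
avg_x = sum(x for x, _ in homes) / n

# Residual sum of squares (the (Y - Y')^2 column of Table A-4)
# and the sum of squared deviations of x (the last column).
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in homes)
sxx = sum((x - avg_x) ** 2 for x, _ in homes)

t95 = 2.228                           # t-distribution, 95%, n - 2 = 10 df
std_err = math.sqrt(sse / (n - 2))    # standard error of the estimate

def error_margin(x0):
    """Give-or-take dollars for a prediction at house size x0."""
    return t95 * std_err * math.sqrt(1 + (1 / n) + ((x0 - avg_x) ** 2 / sxx))

x0 = 2247
y0 = slope * x0 + intercept
low = max(0, y0 - error_margin(x0))
high = y0 + error_margin(x0)
```

For a 2247-square-foot house this yields roughly the $401 to $1,402 interval shown later in Table A-6.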
Table A-5. T-Distribution Table

Degrees of Freedom (n − 2) | 90% | 95% | 98% | 99%
---|---|---|---|---
1 | 6.314 | 12.706 | 31.821 | 63.657
2 | 2.92 | 4.303 | 6.965 | 9.925
3 | 2.353 | 3.182 | 4.541 | 5.841
4 | 2.132 | 2.776 | 3.747 | 4.604
5 | 2.015 | 2.571 | 3.365 | 4.032
6 | 1.943 | 2.447 | 3.143 | 3.707
7 | 1.895 | 2.365 | 2.998 | 3.499
8 | 1.86 | 2.306 | 2.896 | 3.355
9 | 1.833 | 2.262 | 2.821 | 3.25
10 | 1.812 | 2.228 | 2.764 | 3.169
11 | 1.796 | 2.201 | 2.718 | 3.106
12 | 1.782 | 2.179 | 2.681 | 3.055
13 | 1.771 | 2.16 | 2.65 | 3.012
14 | 1.761 | 2.145 | 2.624 | 2.977
15 | 1.753 | 2.131 | 2.602 | 2.947
16 | 1.746 | 2.12 | 2.583 | 2.921
17 | 1.74 | 2.11 | 2.567 | 2.898
18 | 1.734 | 2.101 | 2.552 | 2.878
19 | 1.729 | 2.093 | 2.539 | 2.861
20 | 1.725 | 2.086 | 2.528 | 2.845
21 | 1.721 | 2.08 | 2.518 | 2.831
22 | 1.717 | 2.074 | 2.508 | 2.819
23 | 1.714 | 2.069 | 2.5 | 2.807
24 | 1.711 | 2.064 | 2.492 | 2.797
25 | 1.708 | 2.06 | 2.485 | 2.787
26 | 1.706 | 2.056 | 2.479 | 2.779
27 | 1.703 | 2.052 | 2.473 | 2.771
28 | 1.701 | 2.048 | 2.467 | 2.763
29 | 1.699 | 2.045 | 2.462 | 2.756
30 | 1.697 | 2.042 | 2.457 | 2.75
40 | 1.684 | 2.021 | 2.423 | 2.704
60 | 1.671 | 2 | 2.39 | 2.66
120 | 1.658 | 1.98 | 2.358 | 2.617
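Both the T-SQL and the script versions later in this appendix need to look up the t-distribution value for n − 2 degrees of freedom, falling back to a large-sample value when the sample is bigger than the table. That lookup can be sketched in Python (illustrative only), using the 95 percent column of Table A-5:

```python
# 95% significance column of Table A-5, keyed by degrees of freedom.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
       11: 2.201, 12: 2.179, 13: 2.16, 14: 2.145, 15: 2.131,
       16: 2.12, 17: 2.11, 18: 2.101, 19: 2.093, 20: 2.086,
       21: 2.08, 22: 2.074, 23: 2.069, 24: 2.064, 25: 2.06,
       26: 2.056, 27: 2.052, 28: 2.048, 29: 2.045, 30: 2.042,
       40: 2.021, 60: 2.0, 120: 1.98}

def t95(df):
    """Return the tabled t-value for df, or the nearest tabled df below it."""
    candidates = [k for k in T95 if k <= df]
    return T95[max(candidates)] if candidates else T95[1]
```

For our 12-house sample, `t95(10)` returns the 2.228 used in the error-range calculation.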
Using this method, my predictions can now look like those in Table A-6.
Table A-6. Results of Using the x Values to Predict the y Values with Error Values

Square Footage of Home (X) | Expected Cost of Electricity (Y) | Error Interval
---|---|---
1150 | $340.80 | $0 - $893
1875 | $712.00 | $201 - $1,222
2247 | $902.46 | $401 - $1,402
3324 | $1,453.89 | $928 - $1,977
3754 | $1,674.05 | $1,121 - $2,226
4120 | $1,861.44 | $1,276 - $2,445
4621 | $2,117.95 | $1,480 - $2,753
Most statistics books cover other methods for analyzing variance, and these are worth exploring if you're interested in being able to assert accurate levels of confidence in your predictions.
Using OLAP to Create Regression Models
When Microsoft OLAP first entered the market, the statistical functions were among the first features many data miners looked for. Because statistics depends heavily on the compilation of aggregated data, OLAP is a powerful tool for performing regression analysis. OLAP provides MDX functions that perform the same calculations as those we just did.
To calculate the slope and the intercept, we can use the LinRegSlope and LinRegIntercept functions. The Correlation function measures the strength of the relationship between the X and Y arrays just as easily, and the LinRegPoint function predicts a y value for a proposed x value. The following listing shows these functions in action. I created a very simple cube that contains a time dimension and the same home size vs. electricity cost measures as in the previous examples, and then ran this MDX query in the MDX Sample Application provided with Analysis Services with the following result:
with member [measures].[correlation] as
    'correlation({descendants([time],[time].[day])}, [home size], [electric])'
member [measures].[slope] as
    'linregslope({descendants([time],[time].[day])}, [electric], [home size])'
member [measures].[intercept] as
    'linregintercept({descendants([time],[time].[day])}, [electric], [home size])'
member [measures].[predict 3500] as
    'linregpoint(3500, {descendants([time],[time].[day])}, [electric], [home size])'
select {[time]} on columns,
    {[correlation], [slope], [intercept], [predict 3500]} on rows
from [homes]
The MDX regression analysis yields the results shown in Figure A-9.
Figure A-9. MDX regression analysis results.
An added advantage of OLAP is the ease and speed with which queries are issued. OLAP also allows you to navigate across various dimensions, drilling down through levels while the statistical functions recalculate dynamically for the current aggregation.
Applying Regression to a Relational Database
To make regression analysis even more valuable to data miners, we must be able to apply the algorithm to a table. The most important point to keep in mind when choosing a data source or when building a table is that the logically related pairs of data values must remain grouped together in a row. Using our house example, a logical pair would be the square footage of a house and its annual electricity costs. Regression, as you've probably noticed, makes use of the product of the values in the X and Y columns to generate the regression line.
The house size vs. electricity consumption example is straightforward, but many other real-world regression analysis projects rely on data that has the x values in one table and the y values in another. A SELECT statement is used to join them. For example, if you're going to use the daily interest rates as the x values to predict the price of airline tickets in the Y column, you need to make sure that each row contains an interest rate and a ticket price for the same date, week, or other common measurement level the values share so that the paired values are related.
Getting the values you need from tables using SQL syntax is more complex because SQL lacks many of the statistical functions available in OLAP. Also, relational tables are designed to return result sets after only one pass through the tables. To get around the set-oriented nature of RDBMS engines, we can take advantage of the richness of the T-SQL syntax to store all the needed numerical elements in variables and perform the calculations with them, as the code listing below does. The result can then be returned in the form of a table.
If this were a real application, tables could be designed to structure the environment for regression in a more sensible manner. The correlation, the slope, and the intercept would obviously be computed once and stored in a table so that the predictions could simply perform calculations based on precomputed data instead of recalculating the same data every time a prediction is made. The stored procedure is structured in this way to demonstrate the use of T-SQL for regression analysis.
set nocount on
create table tdist95 (df int, dist float)
insert into tdist95 values (1, 12.706)
insert into tdist95 values (2, 4.303)
insert into tdist95 values (3, 3.182)
insert into tdist95 values (4, 2.776)
insert into tdist95 values (5, 2.571)
. . .
-- **************************************************
-- To fill this table, see the t-distribution table
-- in this document. It's assumed that a permanent
-- t-distribution table will be present to be used
-- by all future regression analysis efforts.
-- **************************************************
go
create procedure usp_MakeElectricBillPrediction @homesize float
as
declare @x float, @y float
declare @avgx float, @avgy float
declare @xy float, @x2 float, @y2 float
declare @support int
declare @dist float
declare @correlation float, @determination float
declare @slope float, @intercept float
declare @sx float, @sy float, @sxy float
declare @errx float, @erry float, @residual float
declare @error float
declare @prediction float

select @x = sum([size]),
       @y = sum([electricity]),
       @avgx = avg([size]),
       @avgy = avg([electricity]),
       @xy = sum([size] * [electricity]),
       @x2 = sum(square([size])),
       @y2 = sum(square([electricity])),
       @support = count(1)
from homeelectric

set @slope = ((@support * @xy) - (@x * @y)) / ((@support * @x2) - square(@x))
set @intercept = (@y / @support) - (@slope * (@x / @support))
set @sxy = (@xy / @support) - (@avgx * @avgy)
set @sx = sqrt(abs((@x2 / @support) - square(@avgx)))
set @sy = sqrt(abs((@y2 / @support) - square(@avgy)))
set @correlation = @sxy / (@sx * @sy)
set @determination = square(@correlation)

-- The t-distribution value is looked up based on the
-- degrees of freedom, which is the support - 2.
select @dist = dist from tdist95 where df = (@support - 2)
if @dist is null
begin
    -- This is in case the support is higher than the highest
    -- degree of freedom value, in which case we use the value
    -- for the largest tabled degrees of freedom.
    select @dist = min(dist) from tdist95
end

-- The expected electric bill is calculated here.
set @prediction = @intercept + (@homesize * @slope)

-- The variance of x (errx) and the residual sum of squares
-- are needed for the error range.
select @errx = sum(square([size] - @avgx)),
       @erry = sum(square([electricity] - @avgy)),
       @residual = sum(square([electricity] - (@intercept + (@slope * [size]))))
from homeelectric

-- Calculate the error variation (the give-or-take amount).
set @error = @dist * sqrt((@residual / (@support - 2))
    * (1 + (1.0 / @support) + (square(@homesize - @avgx) / @errx)))

-- Return the results in the form of a table.
select @correlation as correlation,
       @support as support,
       @slope as slope,
       @intercept as intercept,
       @prediction as AnnualElectricBill,
       @error as PlusOrMinus
Regression with SQL Server vs. Regression Using OLAP
OLAP is able to perform regression functions only with measures in the same fact table. The house size vs. electricity consumption example is ideally suited for this, but if you wanted to perform a regression analysis to see whether the price of two-bedroom houses had any relationship to the price of five-bedroom houses, OLAP couldn't use the regression functions to perform those calculations. This is because the regression functions use the measures that belong to a given set, but to do the house analysis, two sets have to be compared. A big advantage of using SQL Server to perform regression analysis directly with user-defined mathematical expressions is that you have the flexibility to perform analysis on data sets in a number of different ways.
Because OLAP tends to include records in a set that may contain 0 values, which you might choose to exclude, the regression analysis can be seriously compromised. For example, if some of the sample houses were uninhabited for some time they would show $0 in electricity consumption. This data would make the correlation as well as the slope of the line inaccurate. As complicated as it would be to eliminate these values from the set when using OLAP, it's easy to exclude records when using SQL Server with a simple WHERE clause.
Using Visual Basic to Perform Regression Analysis
Regression analysis needs both aggregates and individual rows. Although this analysis can be accomplished with SQL Server user-defined functions, temporary tables, and cursors, you might consider creating a COM+ middle-tier component that does the calculations and returns the slope, intercept, and correlation. The intense mathematics are better left to this component so that CPU cycles are not taken from the database engine, which needs the power to return queries. These calculated values can then be stored in a table for use in subsequent predictions. As you can see from the code listing below, using a programming language offers the ultimate in flexibility and power because an application development language is equipped with functions and syntax designed to support complex calculations and data structures. Most languages, such as Microsoft Visual Basic, Microsoft Visual C++, or Perl, are designed to connect easily to data sources such as SQL Server, which in our example contains the data needed to perform regression analysis.
This code listing was written as a batch script; it wasn't structured using functions or any persistent data storage. This form is not recommended for a real job, but in this example, it makes the code easier to follow and the regression analysis easier to understand.
homesize = CDbl(WScript.Arguments(0))

Dim support, x, y, x2, y2, xy
Dim avgx, avgy, errx, erry, residual
Dim intercept, slope, correlation
Dim sx, sy, sxy
Dim prediction, errorval, MSE, tdist
Dim tdist95(30)

' 95% t-distribution values, indexed by degrees of freedom (see Table A-5).
tdist95(1) = 12.706 : tdist95(2) = 4.303 : tdist95(3) = 3.182
tdist95(4) = 2.776 : tdist95(5) = 2.571 : tdist95(6) = 2.447
tdist95(7) = 2.365 : tdist95(8) = 2.306 : tdist95(9) = 2.262
tdist95(10) = 2.228 : tdist95(11) = 2.201 : tdist95(12) = 2.179
tdist95(13) = 2.16 : tdist95(14) = 2.145 : tdist95(15) = 2.131
tdist95(16) = 2.12 : tdist95(17) = 2.11 : tdist95(18) = 2.101
tdist95(19) = 2.093 : tdist95(20) = 2.086 : tdist95(21) = 2.08
tdist95(22) = 2.074 : tdist95(23) = 2.069 : tdist95(24) = 2.064
tdist95(25) = 2.06 : tdist95(26) = 2.056 : tdist95(27) = 2.052
tdist95(28) = 2.048 : tdist95(29) = 2.045 : tdist95(30) = 2.042

Dim oConn, oRs, strSql
Set oConn = CreateObject("ADODB.Connection")
oConn.ConnectionString = "Provider=sqloledb;User ID=dtsuser;Password=;" & _
    "Data Source=dataserver;Initial Catalog=dataminer"
oConn.Open

strSql = "select sum([size]) as X," & _
    " sum([electricity]) as Y," & _
    " avg([size]) as AvgX," & _
    " avg([electricity]) as AvgY," & _
    " sum([size]*[electricity]) as XY," & _
    " sum(square([size])) as X2," & _
    " sum(square([electricity])) as Y2," & _
    " count(1) as support" & _
    " from homeelectric"
Set oRs = oConn.Execute(strSql)

x = oRs("X") : y = oRs("Y")
x2 = oRs("X2") : y2 = oRs("Y2") : xy = oRs("XY")
avgx = oRs("AvgX") : avgy = oRs("AvgY")
support = oRs("support")
oRs.Close

slope = ((support * xy) - (x * y)) / ((support * x2) - (x * x))
intercept = (y / support) - (slope * (x / support))
sxy = (xy / support) - (avgx * avgy)
sx = Sqr(Abs((x2 / support) - (avgx * avgx)))
sy = Sqr(Abs((y2 / support) - (avgy * avgy)))
correlation = sxy / (sx * sy)

' The variance of x (errx) and the residual sum of squares are
' computed in a second pass, using the values derived above.
strSql = "select sum(square([size] - " & CStr(avgx) & ")) as errx," & _
    " sum(square([electricity] - " & CStr(avgy) & ")) as erry," & _
    " sum(square([electricity] - (" & CStr(intercept) & _
    " + (" & CStr(slope) & " * [size])))) as resid" & _
    " from homeelectric"
Set oRs = oConn.Execute(strSql)
errx = oRs("errx") : erry = oRs("erry") : residual = oRs("resid")
oRs.Close
oConn.Close

prediction = intercept + (slope * homesize)

If support - 2 > 30 Then
    ' This is so we don't read past the table; fall back to the
    ' large-sample value.
    tdist = 1.98
Else
    tdist = tdist95(support - 2)
End If

MSE = residual / (support - 2)
errorval = tdist * Sqr(MSE * (1 + (1 / support) + _
    ((homesize - avgx) * (homesize - avgx) / errx)))

WScript.Echo "Slope       : " & CStr(slope)
WScript.Echo "Intercept   : " & CStr(intercept)
WScript.Echo "Correlation : " & CStr(correlation)
WScript.Echo "Prediction"
WScript.Echo "------------------------------------------------"
WScript.Echo "If house is " & CStr(homesize) & " square feet"
WScript.Echo "you can expect to pay $" & CStr(prediction) & _
    " for electricity annually,"
WScript.Echo "give or take $" & CStr(errorval)
Creating the Models
We have just seen how to perform the calculations needed to make numerical predictions using linear regression analysis. However, as you've probably noticed, the Microsoft Analysis Services algorithms create a data-mining model that is used to make those predictions. The linear regression algorithm should also generate a data-mining model that can be used for predictions without having to execute the algorithm every time.
Because linear regression is not one of the available data-mining algorithms within Analysis Services, we have only two useful choices for storing our models: a relational table or a PMML document.

Unlike the other algorithms we've seen, linear regression needs to store very little data about the cases. It only needs the handful of values derived from them: the correlation, the slope, the intercept, the residual, the variance of x (errx), and the support.
Using a Table
Storing these in a table is easy. A table such as the one shown here is adequate:
CREATE TABLE RegressionModel (
    ID int,
    correlation float,
    slope float,
    intercept float,
    residual float,
    -- the error of X (errx) is another name for the variance of X
    errx float,
    support int)
After the regression line calculation is performed, the numbers are inserted into the model, and all subsequent predictions would simply query the values from the table to build the formula that generates the prediction. Here is a sample query:
SELECT intercept + (slope * 1500) as prediction FROM RegressionModel WHERE ID = 10
Using PMML
Storing linear regression mining models in an industry-wide standard structure offers some advantages over storing them in a table. Microsoft has embraced PMML and stores local mining models created with the clustering or decision trees algorithms in a PMML-compliant XML file. By storing the regression model in the same format, there's a good chance you will be able to use today's models within Analysis Services as soon as regression is made available. PMML also provides a way to use other data-mining products that support this algorithm. The downside to this approach, of course, is that until Microsoft supports this model, your applications will have to be designed to read directly from the XML file to make predictions.
For more information about PMML support for the regression model, go to http://www.dmg.org/.
Summary
Data mining is a powerful tool, and as of this writing, a new one as well. The decision trees and clustering algorithms available with SQL Server 2000 are ideal for making predictions of discrete attributes. These same algorithms are incapable of supplying specific numeric predictions. Regression analysis is one of the many algorithms used to make predictions on continuous (numeric) data, but it's not currently available as part of the data-mining suite of tools.
Soon you will be able to add your own algorithms to the data-mining tools or purchase algorithms from third-party vendors. The regression analysis algorithm is even hinted at in the OLE DB for Data Mining specification. I wanted to demonstrate that the tools that come with SQL Server may be sufficient to apply new algorithms designed to your specifications.
SQL Server does come with powerful tools to perform this regression analysis and make numerical predictions, either by using OLAP functions or the SQL Server T-SQL language itself. These tools give you the power to use proven algorithms today and even use the variations that are most appropriate for your data-analysis needs.