As you have noticed from reading this book, Microsoft Analysis Services provides data-mining algorithms that are primarily suited for classifying cases into groups and predicting missing values from discrete sets. With Microsoft Data Mining, it's easy to find whether a given mushroom is edible based on its characteristics or what film genre a given customer is likely to rent based on their past behavior, but it's virtually impossible to predict specific continuous (numerical) data values, such as the price of a car based on its make, model, and condition.
Other data-mining algorithms, such as regression analysis, are well suited to this type of data mining. Unfortunately, Microsoft Data Mining doesn't support regression analysis at this time. However, this shouldn't stop you from using Microsoft SQL Server 2000 to apply such algorithms to make predictions. The rich array of tools and functions available in SQL Server 2000 makes it possible to use other algorithms. In this appendix, I'll show you how the regression analysis algorithm works and one way to apply it directly to OLAP and relational data without using data-mining components from Analysis Services. This serves two purposes: first, it provides you with a solid understanding of the underlying processes of a data-mining algorithm; second, it suggests an alternative way to mine data using your programming knowledge and the SQL Server tools. As soon as Analysis Services allows third-party algorithms, you will be able to use variations of the formulas described in this appendix to add regression analysis to the list of algorithms available in the data-mining tools.
What Is Regression Analysis?
Regression analysis is the application of statistical formulas to two sets of numbers to determine the degree to which one set of numbers affects the second set. The purpose of regression analysis is to improve our ability to predict the next occurrence of the dependent variable based on past experience. Regression analysis is also the mathematical relationship between two variables expressed in an equation; one variable can be predicted using what we know about the other. The variable whose value is to be predicted is called the dependent variable. The variable about which knowledge is available is called the independent variable. The relationships between variables may be linear or curvilinear. By linear, we mean that the functional relationship can be described graphically (on a common X-Y coordinate system) with a straight line and mathematically by the common form y = mx + b. By curvilinear, we mean that the relationship is defined by a curved line, which happens to be more complex to calculate.
There are many situations in which the value of one numerical variable has an effect on the numerical value of another variable. Our example will use regression analysis to find out whether the size of a house has a direct effect on the cost of electricity. If the results show a relationship, regression analysis can also help predict how much the cost increases per square foot.
Predicting Continuous Attributes: An Example
The best way to grasp how regression analysis works is to actually work through an example. Table A-1 lists the square footage and average annual electricity costs of 12 homes:
Table A-1. House Size vs. Observed Electricity Costs

House Size (sq. ft.) | Annual Electricity Costs
---|---
1900 | $600
2000 | $660
2525 | $900
1300 | $480
3500 | $1,680
4100 | $1,710
2870 | $1,200
1700 | $540
2300 | $1,500
2450 | $1,080
2890 | $1,200
1890 | $540
Imagine that you are going to buy a home that you intend to live in for at least five years. Before deciding which home to buy, you need to know how much it will cost per month. Besides the mortgage, the costs include maintenance, taxes, insurance, and utilities. Because the homes are located in a warm climate, you will have the added expense of air conditioning.
Correlation
The idea that there is a correlation between the size of a home and the cost to live in it sounds reasonable. To test the hypothesis, the first thing you need to do is establish that there is a link between home size and electricity costs. This link, or correlation, is expressed as a number between -1 and 1. The closer to 1 the number is, the stronger the positive correlation: the larger the home, the higher the electricity costs. The closer to -1 the number is, the stronger the negative correlation; if there were a negative correlation in this relationship, a larger home would use less electricity. If the correlation number is 0, it is unlikely that there is any link between home size and energy consumption. The number that describes the correlation is known as the correlation coefficient. I'll show you the actual equation for calculating the correlation coefficient in a moment, but first I'll build out the table of sums and means that we'll need to plug into the equation. Table A-2 contains all the numbers needed to calculate both the correlation coefficient and the regression line that I'll discuss later in the section. The column of numbers containing the independent variables is commonly referred to as the X column or the column of x values. The dependent variables are commonly referred to as the y values.
Table A-2. x and y Value Computations for Correlation and Regression Equations

 | X | Y | X² | Y² | XY
---|---|---|---|---|---
1 | 1900 | 600 | 3,610,000 | 360,000 | 1,140,000
2 | 2000 | 660 | 4,000,000 | 435,600 | 1,320,000
3 | 2525 | 900 | 6,375,625 | 810,000 | 2,272,500
4 | 1300 | 480 | 1,690,000 | 230,400 | 624,000
5 | 3500 | 1680 | 12,250,000 | 2,822,400 | 5,880,000
6 | 4100 | 1710 | 16,810,000 | 2,924,100 | 7,011,000
7 | 2870 | 1200 | 8,236,900 | 1,440,000 | 3,444,000
8 | 1700 | 540 | 2,890,000 | 291,600 | 918,000
9 | 2300 | 1500 | 5,290,000 | 2,250,000 | 3,450,000
10 | 2450 | 1080 | 6,002,500 | 1,166,400 | 2,646,000
11 | 2890 | 1200 | 8,352,100 | 1,440,000 | 3,468,000
12 | 1890 | 540 | 3,572,100 | 291,600 | 1,020,600
Sum | Σx = 29,425 | Σy = 12,090 | Σx² = 79,079,225 | Σy² = 14,462,100 | Σxy = 33,194,100
Mean | avg(x) = 2452.08 | avg(y) = 1007.50 | | |
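The aggregates in Table A-2 are simple to derive programmatically. As a quick illustrative sketch (Python is used here only for verification; it is not part of the SQL Server toolset discussed in this appendix), here is how the sums and means could be computed from the Table A-1 data:

```python
# House size (sq. ft.) and annual electricity cost pairs from Table A-1.
homes = [(1900, 600), (2000, 660), (2525, 900), (1300, 480),
         (3500, 1680), (4100, 1710), (2870, 1200), (1700, 540),
         (2300, 1500), (2450, 1080), (2890, 1200), (1890, 540)]

n = len(homes)                          # support (12 cases)
sum_x = sum(x for x, _ in homes)        # Sum of x
sum_y = sum(y for _, y in homes)        # Sum of y
sum_x2 = sum(x * x for x, _ in homes)   # Sum of x squared
sum_y2 = sum(y * y for _, y in homes)   # Sum of y squared
sum_xy = sum(x * y for x, y in homes)   # Sum of the x*y products
avg_x = sum_x / n                       # mean of x, about 2452.08
avg_y = sum_y / n                       # mean of y, exactly 1007.50

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
```

Running it reproduces the sums in the last rows of Table A-2.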
Now that we've calculated the sums and the means of each column, we can move on to the equations. Figure A-1 shows the equation used to calculate the correlation coefficient. To simplify the explanation, I've divided the computation into its three main components: the covariance of x and y divided by the product of the standard deviation of x and the standard deviation of y.
Figure A-1. The correlation coefficient equation: r = Covxy / (Stdx × Stdy).
The covariance is calculated with the formula in Figure A-2.
Figure A-2. The covariance of x and y equation: Covxy = (Σxy / n) − (avg(x) × avg(y)).
By now some of you are probably wishing you had never fallen asleep in high school precalculus. No need to worry; I'll explain this process step by step just to show you how simple it is.
First we get the covariance by substituting the numbers:

Covxy = (33,194,100 / 12) − (2452.08 × 1007.50)

Covxy = 295,701
Then we calculate the standard deviations of the X and Y columns, starting with the X column. (See Figure A-3.)
Figure A-3. The standard deviation of x equation: Stdx = sqrt((Σx² / n) − avg(x)²).
Substituting the numbers for x gives us the following result:

Stdx = sqrt((79,079,225 / 12) − 6,012,712.67)

Stdx = 759.75
Figure A-4. The standard deviation of y equation: Stdy = sqrt((Σy² / n) − avg(y)²).
Substituting the numbers for y gives us the following result:
Stdy = sqrt((14,462,100 / 12) − 1,015,056.25)

Stdy = 436.02
Substituting all the numbers gives us the following result:
r = 295,701 / (759.75 × 436.02)

r = .89, or 89 percent
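The same arithmetic can be checked with a short Python sketch (again, an illustrative verification only, not part of the book's tooling); the sums are taken from Table A-2:

```python
import math

n = 12
sum_x, sum_y = 29425, 12090                     # Sums of x and y from Table A-2
sum_x2, sum_y2 = 79079225, 14462100             # Sums of the squares
sum_xy = 33194100                               # Sum of the x*y products

avg_x, avg_y = sum_x / n, sum_y / n
cov_xy = (sum_xy / n) - (avg_x * avg_y)         # covariance, about 295,701
std_x = math.sqrt((sum_x2 / n) - avg_x ** 2)    # about 759.75
std_y = math.sqrt((sum_y2 / n) - avg_y ** 2)    # about 436 (the text shows 436.02)
r = cov_xy / (std_x * std_y)                    # correlation coefficient, about .89
r_squared = r ** 2                              # coefficient of determination, about .80
```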
This number, because it is close to 1, or 100 percent, shows a very strong positive correlation. In other words, there is a direct link between home size and electricity consumption. The next question is how much electricity a house of a given size will consume per year. To get this information, we first need to calculate the regression line.
How to Interpret the Correlation Coefficient Value
Generally, anything above 50 percent correlation is considered significant. However, that depends heavily on the size of the sample. An 80 percent correlation when there are only five rows of data in the sample is largely insignificant compared to a 54 percent correlation when there are 150 rows of data in the sample. The larger the sample, the lower the correlation can be and still be meaningful; the smaller the sample, the higher the correlation must be before we can conclude that the relationship is significant.
A common misinterpretation of correlation figures is that strong correlations somehow prove that the independent variables (the x values) cause the values of the dependent variables (the y values). To prove that this is not always true, some statisticians have shown strong correlations between the number of Certified Public Accountant (CPA) certifications and the number of prison incarcerations for a given year. There are coincidental factors that contribute to those figures but no causal link between the two. There are various statistical tests that can be done to investigate the significance of the correlation, but these are beyond the scope of this appendix.
To quantify how much of the behavior of one value is explained by the other, the correlation coefficient is squared to come up with the coefficient of determination, which tells us what proportion of the variation in the dependent variable is accounted for by the independent variable; note that it does not by itself establish a causal effect. The squaring of the correlation coefficient causes the number to be smaller, so if your correlation coefficient for a given analysis is .80, then the coefficient of determination will be .64.
The Regression Line
In this case, the independent variable is the square footage of the house. We assume that this number (x value) is not influenced by any other variables. The electricity cost is the dependent variable (y value) because it is the direct result of the independent variable. Before we discuss the statistical formulas, look at the numbers from Table A-2 when transferred to a graph. (See Figure A-5.)
Figure A-5. Scatter chart representation of home size vs. electricity costs.
As you can see from Figure A-5, the algorithm looks for the straight line that best describes the relationship by making the points fall as close to the line as possible. Regression looks for the slope and intercept of the line closest to all the points on the graph. The slope is a number that describes the steepness of the line as it moves up or down along the Y axis of the graph. The intercept is the number that denotes the point where the line crosses the Y axis, that is, the value of y when x = 0. Because of this goal, this line is known as the best fit line or regression line.
This section uses a straight line to find this best fit. However, it's often more accurate to establish relationships between numbers using curved lines, especially when the numbers show trends indicating that they go through predictable or seasonal changes. A curved line could follow this trend far more accurately than a straight line, but for the sake of simplicity, we'll use the straight line. Once you understand linear regression, I urge you to explore the use of nonlinear regression for data sets that do not necessarily follow a linear trend.
If a = the intercept and b = the slope, the basic equation that best describes linear regression is as follows:
y = bx + a
If we can calculate the values for a and b, then for a given house size (x) we should be able to solve the equation and get the expected electric bill (y).
Finding the Slope
The equation for the calculation of the slope is shown in Figure A-6.
Figure A-6. The regression line slope equation: b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²).
By substituting the numbers, we get the following result:
b = [(12 × 33,194,100) − (29,425 × 12,090)] / [(12 × 79,079,225) − 29,425²]

b = 42,580,950 / 83,120,075

b = 0.512
We now have one variable of the equation set. The next step is to find the intercept, which we'll do with the help of the slope.
Finding the Intercept
The formula for finding the intercept is shown in Figure A-7.
Figure A-7. The intercept of the regression line equation: a = avg(y) − (b × avg(x)).
Table A-2 provides the numbers we plug in to the equation:
a = 1007.50 − (0.512 × 2452.08)

a = -248
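The slope and intercept calculations can likewise be verified with a small Python sketch (illustrative only, not part of the SQL Server toolset), using the sums from Table A-2:

```python
n = 12
sum_x, sum_y = 29425, 12090
sum_x2, sum_xy = 79079225, 33194100

# Slope: b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
b = ((n * sum_xy) - (sum_x * sum_y)) / ((n * sum_x2) - sum_x ** 2)

# Intercept: a = avg(y) - b * avg(x). Using the slope rounded to three
# places, as the text does, gives a of about -248.
a = (sum_y / n) - (round(b, 3) * (sum_x / n))
```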
The Regression Coefficient
Now that we have computed the numbers, we know that the regression formula is y = (0.512 × x) − 248. By replacing x with a proposed home size, we can predict the expected annual electric bill. The equation also works in both directions: if we wanted to estimate the size of a house, we could simply supply a proposed electric bill for y and solve for x.
Using Regression Analysis to Make Predictions
Imagine you have seen several homes, and you want to find the annual electric bill for each. By substituting the home size in square feet for x, you'll get the y value, or the expected cost of electricity per year. (See Table A-3.)
Table A-3. Using x to Predict y

Square Footage of Home (X) | Expected Cost of Electricity (Y)
---|---
1150 | $340.80
1875 | $712.00
2247 | $902.46
3324 | $1,453.89
3754 | $1,674.05
4120 | $1,861.44
4621 | $2,117.95
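As a quick illustrative check (again in Python, purely for verification), the predictions in Table A-3 come straight from the regression equation:

```python
def predict_electric_bill(sqft):
    """Annual electricity cost predicted by the regression line y = 0.512x - 248."""
    return (0.512 * sqft) - 248

# Reproduce the rows of Table A-3.
for sqft in (1150, 1875, 2247, 3324, 3754, 4120, 4621):
    print(sqft, round(predict_electric_bill(sqft), 2))
```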
In addition to the dangers associated with making predictions based on data that might have high coincidental correlations, there is also the danger of making predictions that go beyond the range of values present in the sample. In our house size vs. electricity bills example, the house sizes ranged between 1300 and 4100 square feet. It might be tempting to predict values for x and y where x is outside that range, as I did in Table A-3: the first row has a home with 1150 square feet and the last row one with 4621 square feet. Because these two values fall outside the range of house sizes in the sample, it is risky to act on the calculated results; the predictions rest on the assumption that the relationship remains linear outside our experience. Making predictions based on numbers that do not fall within the sample is called extrapolation. This is in contrast to interpolation, a process that generates predictions based on the range of values contained in the model.
Cause and Effect
Regression and correlation analysis cannot determine cause and effect. It's up to the analyst to do a logic check, determine an appropriate hypothesis, and test the hypothesis against the data. For example, a correlation value of .95 relates the number of new CPAs in a city to the number of people arrested for drunk driving in the same city in one year. Clearly there is no cause and effect involved here. A deeper variable, population, is the true independent variable that drives both the number of CPAs and the number of arrests made for drunk driving. As analysts, we must choose data sets that are related and check that they influence one another.
Extrapolation is also riskier when the sample is not representative of the homes being predicted. For example, large houses may show a major increase in electricity use simply because they have proportionately higher ceilings than smaller homes, which causes their air-conditioning bills to be higher than in houses with lower ceilings. Predicted values for large homes will be false if they are based on the proportions measured with the smaller homes.
Analyzing the Accuracy of the Regression Line
The slope and the intercept of the regression line tell us very little about how well the line actually fits the points on the graph. Although the correlation coefficient mentioned previously in the chapter tells us about the strength of the relationship between the numbers, we still need a way to quantify the accuracy of our predictions. Ideally, every prediction of y would come with an error range that tells us how far off the mark we might be. For example, it would be nice to know that a house of a given size is predicted to consume y dollars in electricity, give or take v dollars. To calculate our margin of error, we use the formula in Figure A-8.
Figure A-8. The formula for testing the error range of y for a value of x: error = t × sqrt(Σ(y − y′)² / (n − 2)) × sqrt(1 + 1/n + (x0 − avg(x))² / Σ(x − avg(x))²), where t comes from the t-distribution for n − 2 degrees of freedom.
Use Table A-4 for the following formulas.

Table A-4. x and y Values for Correlation and Regression Equations, with Additional Columns for Variance Analysis
X | Y | X² | Y² | XY | Y′ | (Y − Y′)² | (X − AVG(X))²
---|---|---|---|---|---|---|---
1900 | 600 | 3,610,000 | 360,000 | 1,140,000 | $724.80 | $15,575.04 | 304,796.01
2000 | 660 | 4,000,000 | 435,600 | 1,320,000 | $776.00 | $13,456.01 | 204,379.34
2525 | 900 | 6,375,625 | 810,000 | 2,272,500 | $1,044.80 | $20,967.04 | 5,316.84
1300 | 480 | 1,690,000 | 230,400 | 624,000 | $417.60 | $3,893.76 | 1,327,296.01
3500 | 1680 | 12,250,000 | 2,822,400 | 5,880,000 | $1,544.00 | $18,496.00 | 1,098,129.34
4100 | 1710 | 16,810,000 | 2,924,100 | 7,011,000 | $1,851.20 | $19,937.44 | 2,715,629.34
2870 | 1200 | 8,236,900 | 1,440,000 | 3,444,000 | $1,221.44 | $459.67 | 174,654.34
1700 | 540 | 2,890,000 | 291,600 | 918,000 | $622.40 | $6,789.76 | 565,629.34
2300 | 1500 | 5,290,000 | 2,250,000 | 3,450,000 | $929.60 | $325,356.16 | 23,129.34
2450 | 1080 | 6,002,500 | 1,166,400 | 2,646,000 | $1,006.40 | $5,416.96 | 4.34
2890 | 1200 | 8,352,100 | 1,440,000 | 3,468,000 | $1,231.68 | $1,003.62 | 191,771.01
1890 | 540 | 3,572,100 | 291,600 | 1,020,600 | $719.68 | $32,284.90 | 315,937.67
Σx = 29,425 | Σy = 12,090 | Σx² = 79,079,225 | Σy² = 14,462,100 | Σxy = 33,194,100 | | MSE = $46,363.64 | Sum = 6,926,672.92
Notice that the first thing we must do is get the predicted value for each of the x values that we used to build the model; the Y′ column contains these predicted values. The formula also needs a t-distribution value, which is chosen based on the degrees of freedom (n − 2) and the desired significance level, as shown in Table A-5.
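Putting the pieces of Table A-4 together, the error range for a prediction can be sketched in Python (an illustrative reconstruction of the Figure A-8 formula, not the book's tooling, using the 95 percent t-value of 2.228 for 10 degrees of freedom from Table A-5):

```python
import math

# Table A-1 data and the rounded regression coefficients from the text.
homes = [(1900, 600), (2000, 660), (2525, 900), (1300, 480),
         (3500, 1680), (4100, 1710), (2870, 1200), (1700, 540),
         (2300, 1500), (2450, 1080), (2890, 1200), (1890, 540)]
slope, intercept = 0.512, -248
n = len(homes)
avg_x = sum(x for x, _ in homes) / n

# Residual sum of squares (the (Y - Y')^2 column of Table A-4)
# and the sum of squared deviations of x (the last column).
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in homes)
sxx = sum((x - avg_x) ** 2 for x, _ in homes)

t95 = 2.228                           # t-distribution, 95%, n - 2 = 10 df
std_err = math.sqrt(sse / (n - 2))    # standard error of the estimate

def error_margin(x0):
    """Give-or-take dollars for a prediction at house size x0."""
    return t95 * std_err * math.sqrt(1 + (1 / n) + ((x0 - avg_x) ** 2 / sxx))

x0 = 2247
y0 = slope * x0 + intercept
low = max(0, y0 - error_margin(x0))
high = y0 + error_margin(x0)
```

For a 2247-square-foot house this yields roughly the $401 to $1,402 interval shown later in Table A-6.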
Table A-5. T-Distribution Table

Degrees of Freedom (n − 2) | 90% | 95% | 98% | 99%
---|---|---|---|---
1 | 6.314 | 12.706 | 31.821 | 63.657
2 | 2.92 | 4.303 | 6.965 | 9.925
3 | 2.353 | 3.182 | 4.541 | 5.841
4 | 2.132 | 2.776 | 3.747 | 4.604
5 | 2.015 | 2.571 | 3.365 | 4.032
6 | 1.943 | 2.447 | 3.143 | 3.707
7 | 1.895 | 2.365 | 2.998 | 3.499
8 | 1.86 | 2.306 | 2.896 | 3.355
9 | 1.833 | 2.262 | 2.821 | 3.25
10 | 1.812 | 2.228 | 2.764 | 3.169
11 | 1.796 | 2.201 | 2.718 | 3.106
12 | 1.782 | 2.179 | 2.681 | 3.055
13 | 1.771 | 2.16 | 2.65 | 3.012
14 | 1.761 | 2.145 | 2.624 | 2.977
15 | 1.753 | 2.131 | 2.602 | 2.947
16 | 1.746 | 2.12 | 2.583 | 2.921
17 | 1.74 | 2.11 | 2.567 | 2.898
18 | 1.734 | 2.101 | 2.552 | 2.878
19 | 1.729 | 2.093 | 2.539 | 2.861
20 | 1.725 | 2.086 | 2.528 | 2.845
21 | 1.721 | 2.08 | 2.518 | 2.831
22 | 1.717 | 2.074 | 2.508 | 2.819
23 | 1.714 | 2.069 | 2.5 | 2.807
24 | 1.711 | 2.064 | 2.492 | 2.797
25 | 1.708 | 2.06 | 2.485 | 2.787
26 | 1.706 | 2.056 | 2.479 | 2.779
27 | 1.703 | 2.052 | 2.473 | 2.771
28 | 1.701 | 2.048 | 2.467 | 2.763
29 | 1.699 | 2.045 | 2.462 | 2.756
30 | 1.697 | 2.042 | 2.457 | 2.75
40 | 1.684 | 2.021 | 2.423 | 2.704
60 | 1.671 | 2 | 2.39 | 2.66
120 | 1.658 | 1.98 | 2.358 | 2.617
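Both the T-SQL and the script versions later in this appendix need to look up the t-distribution value for n − 2 degrees of freedom, falling back to a large-sample value when the sample is bigger than the table. That lookup can be sketched in Python (illustrative only), using the 95 percent column of Table A-5:

```python
# 95% significance column of Table A-5, keyed by degrees of freedom.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
       11: 2.201, 12: 2.179, 13: 2.16, 14: 2.145, 15: 2.131,
       16: 2.12, 17: 2.11, 18: 2.101, 19: 2.093, 20: 2.086,
       21: 2.08, 22: 2.074, 23: 2.069, 24: 2.064, 25: 2.06,
       26: 2.056, 27: 2.052, 28: 2.048, 29: 2.045, 30: 2.042,
       40: 2.021, 60: 2.0, 120: 1.98}

def t95(df):
    """Return the tabled t-value for df, or the nearest tabled df below it."""
    candidates = [k for k in T95 if k <= df]
    return T95[max(candidates)] if candidates else T95[1]
```

For our 12-house sample, `t95(10)` returns the 2.228 used in the error-range calculation.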
Using this method, my predictions can now look like those in Table A-6.
Table A-6. Results of Using the x Values to Predict the y Values with Error Values

Square Footage of Home (X) | Expected Cost of Electricity (Y) | Error Interval
---|---|---
1150 | $340.80 | $0 - $893
1875 | $712.00 | $201 - $1,222
2247 | $902.46 | $401 - $1,402
3324 | $1,453.89 | $928 - $1,977
3754 | $1,674.05 | $1,121 - $2,226
4120 | $1,861.44 | $1,276 - $2,445
4621 | $2,117.95 | $1,480 - $2,753
Most statistics books cover other methods for analyzing variance, and these are worth exploring if you're interested in being able to assert accurate levels of confidence in your predictions.
Using OLAP to Create Regression Models
When Microsoft OLAP first entered the market, the statistical functions were among the first features many data miners looked for. Because statistics depends heavily on the compilation of aggregated data, OLAP is a powerful tool for performing regression analysis. OLAP provides MDX functions that perform the same calculations as those we just did.
To calculate the slope and the intercept, we can use the LinRegSlope and LinRegIntercept functions. The Correlation function measures the strength of the relationship between the X and Y arrays just as easily, and the LinRegPoint function predicts a y value for a proposed x value. The following listing shows these functions in action. I created a very simple cube that contains a time dimension and the same home size vs. electricity cost measures as in the previous examples, and then ran this MDX query in the MDX Sample Application provided with Analysis Services with the following result:
with member [measures].[correlation] as
    'correlation({descendants([time],[time].[day])}, [home size], [electric])'
member [measures].[slope] as
    'linregslope({descendants([time],[time].[day])}, [electric], [home size])'
member [measures].[intercept] as
    'linregintercept({descendants([time],[time].[day])}, [electric], [home size])'
member [measures].[predict 3500] as
    'linregpoint(3500, {descendants([time],[time].[day])}, [electric], [home size])'
select {[time]} on columns,
    {[correlation], [slope], [intercept], [predict 3500]} on rows
from [homes]
The MDX regression analysis yields the results shown in Figure A-9.
Figure A-9. MDX regression analysis results.
An added advantage of OLAP is the ease and speed with which queries are issued. OLAP also allows you to navigate across various dimensions, drilling down through levels while the statistical functions recalculate dynamically for the current aggregation.
Applying Regression to a Relational Database
To make regression analysis even more valuable to data miners, we must be able to apply the algorithm to a table. The most important point to keep in mind when choosing a data source or when building a table is that the logically related pairs of data values must remain grouped together in a row. Using our house example, a logical pair would be the square footage of a house and its annual electricity costs. Regression, as you've probably noticed, makes use of the product of the values in the X and Y columns to generate the regression line.
The house size vs. electricity consumption example is straightforward, but many other real-world regression analysis projects rely on data that has the x values in one table and the y values in another. A SELECT statement is used to join them. For example, if you're going to use the daily interest rates as the x values to predict the price of airline tickets in the Y column, you need to make sure that each row contains an interest rate and a ticket price for the same date, week, or other common measurement level the values share so that the paired values are related.
Getting the values you need from tables using SQL syntax is more complex because SQL lacks many of the statistical functions available in OLAP. Also, relational tables are designed to return result sets after only one pass through the tables. To get around the set-oriented nature of RDBMS engines, we can take advantage of the richness of the T-SQL syntax to store all the needed numerical elements in variables and perform the calculations with them, as the code listing below does. The result can then be returned in the form of a table.
If this were a real application, tables could be designed to structure the environment for regression in a more sensible manner. The correlation, the slope, and the intercept would obviously be computed once and stored in a table so that the predictions could simply perform calculations based on precomputed data instead of recalculating the same data every time a prediction is made. The stored procedure is structured in this way to demonstrate the use of T-SQL for regression analysis.
set nocount on
create table tdist95 (df int, dist float)
insert into tdist95 values (1, 12.706)
insert into tdist95 values (2, 4.303)
insert into tdist95 values (3, 3.182)
insert into tdist95 values (4, 2.776)
insert into tdist95 values (5, 2.571)
. . .
-- **************************************************
-- To fill this table, see the t-distribution table
-- in this document. It's assumed that a permanent
-- t-distribution table will be present to be used
-- by all future regression analysis efforts.
-- **************************************************
go
create procedure usp_MakeElectricBillPrediction @homesize float
as
declare @x float, @y float
declare @avgx float, @avgy float
declare @xy float, @x2 float, @y2 float
declare @support int
declare @dist float
declare @correlation float, @determination float
declare @slope float, @intercept float
declare @sx float, @sy float, @sxy float
declare @errx float, @erry float, @residual float
declare @error float
declare @prediction float

select @x = sum([size]),
       @y = sum([electricity]),
       @avgx = avg([size]),
       @avgy = avg([electricity]),
       @xy = sum([size] * [electricity]),
       @x2 = sum(square([size])),
       @y2 = sum(square([electricity])),
       @support = count(1)
from homeelectric

set @slope = ((@support * @xy) - (@x * @y)) / ((@support * @x2) - square(@x))
set @intercept = (@y / @support) - (@slope * (@x / @support))
set @sxy = (@xy / @support) - (@avgx * @avgy)
set @sx = sqrt(abs((@x2 / @support) - square(@avgx)))
set @sy = sqrt(abs((@y2 / @support) - square(@avgy)))
set @correlation = @sxy / (@sx * @sy)
set @determination = square(@correlation)

-- The t-distribution value is looked up based on the
-- degrees of freedom, which is the support - 2.
select @dist = dist from tdist95 where df = (@support - 2)
if @dist is null
begin
    -- This is in case the support is higher than the highest
    -- degree of freedom value, in which case we use the value
    -- for the largest tabled degrees of freedom.
    select @dist = min(dist) from tdist95
end

-- The expected electric bill is calculated here.
set @prediction = @intercept + (@homesize * @slope)

-- The variance of x (errx) and the residual sum of squares
-- are needed for the error range.
select @errx = sum(square([size] - @avgx)),
       @erry = sum(square([electricity] - @avgy)),
       @residual = sum(square([electricity] - (@intercept + (@slope * [size]))))
from homeelectric

-- Calculate the error variation (the give-or-take amount).
set @error = @dist * sqrt((@residual / (@support - 2))
    * (1 + (1.0 / @support) + (square(@homesize - @avgx) / @errx)))

-- Return the results in the form of a table.
select @correlation as correlation,
       @support as support,
       @slope as slope,
       @intercept as intercept,
       @prediction as AnnualElectricBill,
       @error as PlusOrMinus
Regression with SQL Server vs. Regression Using OLAP
OLAP is able to perform regression functions only with measures in the same fact table. The house size vs. electricity consumption example is ideally suited for this, but if you wanted to perform a regression analysis to see whether the price of two-bedroom houses had any relationship to the price of five-bedroom houses, OLAP couldn't use the regression functions to perform those calculations. This is because the regression functions use the measures that belong to a given set, but to do the house analysis, two sets have to be compared. A big advantage of using SQL Server to perform regression analysis directly with user-defined mathematical expressions is that you have the flexibility to perform analysis on data sets in a number of different ways.
Because OLAP tends to include records in a set that may contain 0 values, which you might choose to exclude, the regression analysis can be seriously compromised. For example, if some of the sample houses were uninhabited for some time they would show $0 in electricity consumption. This data would make the correlation as well as the slope of the line inaccurate. As complicated as it would be to eliminate these values from the set when using OLAP, it's easy to exclude records when using SQL Server with a simple WHERE clause.
Using Visual Basic to Perform Regression Analysis
Regression analysis needs both aggregates and individual rows. Although this analysis can be accomplished with SQL Server user-defined functions, temporary tables, and cursors, you might consider creating a COM+ middle-tier component that does the calculations and returns the slope, intercept, and correlation. The intense mathematics are better left to this component so that CPU cycles are not taken from the database engine, which needs the power to return queries. These calculated values can then be stored in a table for use in subsequent predictions. As you can see from the code listing below, using a programming language offers the ultimate in flexibility and power because an application development language is equipped with functions and syntax designed to support complex calculations and data structures. Most languages, such as Microsoft Visual Basic, Microsoft Visual C++, or Perl, are designed to connect easily to data sources such as SQL Server, which in our example contains the data needed to perform regression analysis.
This code listing was written as a batch script; it wasn't structured using functions or any persistent data storage. This form is not recommended for a real job, but in this example, it makes the code easier to follow and the regression analysis easier to understand.
homesize = CDbl(WScript.Arguments(0))

Dim support, x, y, x2, y2, xy
Dim avgx, avgy, errx, erry, residual
Dim intercept, slope, correlation
Dim sx, sy, sxy
Dim prediction, errorval, MSE, tdist
Dim tdist95(30)

' 95% t-distribution values, indexed by degrees of freedom (see Table A-5).
tdist95(1) = 12.706 : tdist95(2) = 4.303 : tdist95(3) = 3.182
tdist95(4) = 2.776 : tdist95(5) = 2.571 : tdist95(6) = 2.447
tdist95(7) = 2.365 : tdist95(8) = 2.306 : tdist95(9) = 2.262
tdist95(10) = 2.228 : tdist95(11) = 2.201 : tdist95(12) = 2.179
tdist95(13) = 2.16 : tdist95(14) = 2.145 : tdist95(15) = 2.131
tdist95(16) = 2.12 : tdist95(17) = 2.11 : tdist95(18) = 2.101
tdist95(19) = 2.093 : tdist95(20) = 2.086 : tdist95(21) = 2.08
tdist95(22) = 2.074 : tdist95(23) = 2.069 : tdist95(24) = 2.064
tdist95(25) = 2.06 : tdist95(26) = 2.056 : tdist95(27) = 2.052
tdist95(28) = 2.048 : tdist95(29) = 2.045 : tdist95(30) = 2.042

Dim oConn, oRs, strSql
Set oConn = CreateObject("ADODB.Connection")
oConn.ConnectionString = "Provider=sqloledb;User ID=dtsuser;Password=;" & _
    "Data Source=dataserver;Initial Catalog=dataminer"
oConn.Open

strSql = "select sum([size]) as X," & _
    " sum([electricity]) as Y," & _
    " avg([size]) as AvgX," & _
    " avg([electricity]) as AvgY," & _
    " sum([size]*[electricity]) as XY," & _
    " sum(square([size])) as X2," & _
    " sum(square([electricity])) as Y2," & _
    " count(1) as support" & _
    " from homeelectric"
Set oRs = oConn.Execute(strSql)

x = oRs("X") : y = oRs("Y")
x2 = oRs("X2") : y2 = oRs("Y2") : xy = oRs("XY")
avgx = oRs("AvgX") : avgy = oRs("AvgY")
support = oRs("support")
oRs.Close

slope = ((support * xy) - (x * y)) / ((support * x2) - (x * x))
intercept = (y / support) - (slope * (x / support))
sxy = (xy / support) - (avgx * avgy)
sx = Sqr(Abs((x2 / support) - (avgx * avgx)))
sy = Sqr(Abs((y2 / support) - (avgy * avgy)))
correlation = sxy / (sx * sy)

' The variance of x (errx) and the residual sum of squares are
' computed in a second pass, using the values derived above.
strSql = "select sum(square([size] - " & CStr(avgx) & ")) as errx," & _
    " sum(square([electricity] - " & CStr(avgy) & ")) as erry," & _
    " sum(square([electricity] - (" & CStr(intercept) & _
    " + (" & CStr(slope) & " * [size])))) as resid" & _
    " from homeelectric"
Set oRs = oConn.Execute(strSql)
errx = oRs("errx") : erry = oRs("erry") : residual = oRs("resid")
oRs.Close
oConn.Close

prediction = intercept + (slope * homesize)

If support - 2 > 30 Then
    ' This is so we don't read past the table; fall back to the
    ' large-sample value.
    tdist = 1.98
Else
    tdist = tdist95(support - 2)
End If

MSE = residual / (support - 2)
errorval = tdist * Sqr(MSE * (1 + (1 / support) + _
    ((homesize - avgx) * (homesize - avgx) / errx)))

WScript.Echo "Slope       : " & CStr(slope)
WScript.Echo "Intercept   : " & CStr(intercept)
WScript.Echo "Correlation : " & CStr(correlation)
WScript.Echo "Prediction"
WScript.Echo "------------------------------------------------"
WScript.Echo "If house is " & CStr(homesize) & " square feet"
WScript.Echo "you can expect to pay $" & CStr(prediction) & _
    " for electricity annually,"
WScript.Echo "give or take $" & CStr(errorval)
Creating the Models
We have just seen how to perform the calculations needed to make numerical predictions using linear regression analysis. However, as you've probably noticed, the Microsoft Analysis Services algorithms create a data-mining model that is used to make those predictions. The linear regression algorithm should also generate a data-mining model that can be used for predictions without having to execute the algorithm every time.
Because linear regression is not one of the available data-mining algorithms within Analysis Services, we have only two useful choices for storing our models: a relational table or a PMML document.

Unlike the other algorithms we've seen, linear regression needs to store very little data about the cases. It only needs the handful of values derived from them: the correlation, the slope, the intercept, the residual, the variance of x (errx), and the support.
Using a Table
Storing these in a table is easy. A table such as the one shown here is adequate:
CREATE TABLE RegressionModel (
    ID int,
    correlation float,
    slope float,
    intercept float,
    residual float,
    -- the error of X (errx) is another name for the variance of X
    errx float,
    support int)
After the regression line calculation is performed, the numbers are inserted into the model, and all subsequent predictions would simply query the values from the table to build the formula that generates the prediction. Here is a sample query:
SELECT intercept + (slope * 1500) as prediction FROM RegressionModel WHERE ID = 10
Using PMML
Storing linear regression mining models in an industry-wide standard structure offers some advantages over storing them in a table. Microsoft has embraced PMML and stores local mining models created with the clustering or decision trees algorithms in a PMML-compliant XML file. By storing the regression model in the same format, there's a good chance you will be able to use today's models within Analysis Services as soon as regression is made available. PMML also provides a way to use other data-mining products that support this algorithm. The downside to this approach, of course, is that until Microsoft supports this model, your applications will have to be designed to read directly from the XML file to make predictions.
For more information about PMML support for the regression model, go to http://www.dmg.org/.
Summary
Data mining is a powerful tool, and as of this writing, a new one as well. The decision trees and clustering algorithms available with SQL Server 2000 are ideal for making predictions of discrete attributes. These same algorithms are incapable of supplying specific numeric predictions. Regression analysis is one of the many algorithms used to make predictions on continuous (numeric) data, but it's not currently available as part of the data-mining suite of tools.
Soon you will be able to add your own algorithms to the data-mining tools or purchase algorithms from third-party vendors. The regression analysis algorithm is even hinted at in the OLE DB for Data Mining specification. I wanted to demonstrate that the tools that come with SQL Server may be sufficient to apply new algorithms designed to your specifications.
SQL Server does come with powerful tools to perform this regression analysis and make numerical predictions, either by using OLAP functions or the SQL Server T-SQL language itself. These tools give you the power to use proven algorithms today and even use the variations that are most appropriate for your data-analysis needs.