6.5 Least-Squares Linear Regression

   

 
Java Number Cruncher: The Java Programmer's Guide to Numerical Computing
By Ronald Mak


Given a set of data points, we might not want an interpolation function that passes through every point. Instead, we may be interested in a trend line that passes closely among the points, especially if there are many of them. We can use the function that describes the trend line to generate approximations of the original data points. These approximations "smooth out" the original data, which is desirable if the original data contain some experimental error.

The trend line that we will compute is the regression line. It is the line that passes most closely among the data points, and once "close" is defined precisely, it is unique for a given set of points (provided the points do not all share the same x value). What remains is to define what we mean by "close."

Figure 6-3 shows a regression line passing among three data points, (x_1, y_1), (x_2, y_2), and (x_3, y_3). The line function is in the power form, f(x) = a_0 + a_1 x. This is also known as the slope-intercept form of the line function, where a_0 is the value at which the line intercepts the y axis, and a_1 is the slope of the line.

Figure 6-3. A linear regression line passing "closely" among three points. Each D represents the error, or vertical difference, between the point and the line.

The figure also shows D_1, D_2, and D_3, which are the differences between the y values of the data points and the y values computed by the line function. For example,

$$
D_1 = y_1 - f(x_1) = y_1 - (a_0 + a_1 x_1)
$$


Therefore, each D represents the vertical difference between the point and the line, not the perpendicular distance. We can also consider each D to be the error between the actual value (the data point) and the predicted value (computed on the regression line).
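The vertical differences described above can be computed directly. The following is a minimal sketch rather than code from the book: the class name `Residuals`, its method names, the sample data, and the candidate coefficients are all assumptions chosen for illustration, and each D is taken as the data value minus the line value.

```java
/** Illustrative sketch: vertical differences between data points and a line. */
class Residuals {

    /** Evaluates the line function f(x) = a0 + a1*x. */
    static double line(double a0, double a1, double x) {
        return a0 + a1 * x;
    }

    /** Returns d[i] = y[i] - f(x[i]) for every data point. */
    static double[] residuals(double[] x, double[] y, double a0, double a1) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double[] d = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            d[i] = y[i] - line(a0, a1, x[i]);  // vertical, not perpendicular
        }
        return d;
    }

    public static void main(String[] args) {
        // Hypothetical data points and a hypothetical candidate line f(x) = 1 + x.
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {2.5, 2.5, 4.5};
        double[] d = residuals(x, y, 1.0, 1.0);
        for (int i = 0; i < d.length; i++) {
            System.out.println("D" + (i + 1) + " = " + d[i]);
        }
    }
}
```

For this sample data the three differences are 0.5, -0.5, and 0.5: the first and third points lie above the candidate line, and the second lies below it.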

Thus, to obtain the regression line function f(x) = a_0 + a_1 x, we need to compute values of its coefficients a_0 and a_1 such that the line is the one closest to the data points. The line that we'll consider the closest is the least-squares regression line, because it minimizes the sum of the squares of all the D values. In other words, we want values for a_0 and a_1 that minimize the value of the error function E(a_0, a_1), defined as

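$$
E(a_0, a_1) = \sum_{i=1}^{n} D_i^2 = \sum_{i=1}^{n} \bigl[ y_i - (a_0 + a_1 x_i) \bigr]^2
$$

where n is the number of data points.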


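Squaring the D values makes every term nonnegative, so positive and negative errors cannot cancel, and it places a heavy penalty on large errors: doubling a D quadruples its contribution to E. The least-squares line therefore tends to avoid large D values.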
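To make the error function concrete, the sketch below evaluates the sum of squared vertical differences for two candidate lines over the same data. It is an illustration under assumptions, not the book's implementation: the class name `SumOfSquaredErrors`, the method `error`, the sample points, and both candidate lines are hypothetical, and neither candidate is claimed to be the minimizing line.

```java
/** Illustrative sketch: the least-squares error function E(a0, a1). */
class SumOfSquaredErrors {

    /** Returns the sum over all points of (y[i] - (a0 + a1*x[i]))^2. */
    static double error(double[] x, double[] y, double a0, double a1) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = y[i] - (a0 + a1 * x[i]);  // vertical difference D_i
            sum += d * d;                        // squared, so never negative
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical data points.
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {2.5, 2.5, 4.5};

        // Two hypothetical candidate lines: f(x) = 1 + x and f(x) = x.
        System.out.println("E(1, 1) = " + error(x, y, 1.0, 1.0));  // prints E(1, 1) = 0.75
        System.out.println("E(0, 1) = " + error(x, y, 0.0, 1.0));  // prints E(0, 1) = 4.75
    }
}
```

Under the least-squares criterion, the first candidate is the closer of the two because its E is smaller. The second candidate also shows the penalty at work: its differences of 1.5 are three times as large as its difference of 0.5, yet each contributes nine times as much to E (2.25 versus 0.25).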


   
ISBN: 0130460419
Year: 2001
Pages: 141
Authors: Ronald Mak
