40. Regression

Overview

Regression is one of the statistical tools used in the Multi-Vari approach and is probably the most powerful. A Regression analysis determines:

  • The statistical significance of a relationship between a Continuous X and a Continuous Y in Y = f(X1,X2,..., Xn).

  • The nature of the relationship itself (i.e., the equation).

There are two basic forms of Regression:

  • Simple Linear Regression, which relates one Continuous Y with one Continuous X

  • Multiple Linear Regression, which relates one Continuous Y with more than one Continuous X

Regression analysis is the statistical analysis technique used to investigate and model the relationship between the variables. For both the Simple and Multiple techniques considered here, the model terms are linear (first order), not quadratic or any other higher power. Given the sheer size of the subject and the application of the tool in Lean Sigma, the focus here is primarily on Simple Linear Regression. Multiple Linear Regression is covered briefly in "Other Options" in this section.[70]

[70] For a deeper theoretical understanding of Regression, see Introduction to Linear Regression Analysis by Douglas Montgomery and Elizabeth Peck.

As with all statistical tests, a sample of reality is required. Generally, 30 or more data points are required, each comprising a value of X and the corresponding value of Y at that point. Regression is a passive analysis tool, so the process is not actively manipulated during the data capture. After the requisite number of data points has been collected, they are entered as two columns into a statistical software package and analyzed.

Analyzing the data graphically using a Fitted Line Plot shows a result similar to the example shown in Figure 7.40.1. Here the X is "Age Of Propellant" in a rocket motor and the Y is "Shear Strength" of the propellant at that age. The data points are plotted on a Scatter Plot and a straight line is fitted through them to give the best statistical fit; this is the Regression Line. There are many ways of doing this mathematically; in Regression the approach is to use Least Squares, which minimizes the sum of the squared distances of all the points from the line.

Figure 7.40.1. Example Fitted Line Plot[71] (output from Minitab v14).


[71] Source: SBTI's Lean Sigma Methodology training material.
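As a minimal illustration of a Least Squares fit, the sketch below uses Python with numpy and scipy rather than Minitab; the data are synthetic, generated around the fitted model shown in Figure 7.40.1, not the actual propellant sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
age = rng.uniform(0, 25, 30)                           # X: Age of Propellant (weeks)
shear = 2628 - 37.15 * age + rng.normal(0, 96, 30)     # Y: Shear Strength (psi) plus noise

fit = stats.linregress(age, shear)                     # straight-line Least Squares fit
print(f"Shear Strength (psi) = {fit.intercept:.0f} {fit.slope:+.2f} * Age (weeks)")
print(f"R-Sq = {fit.rvalue ** 2:.1%}, p-value = {fit.pvalue:.3f}")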

The equation of the straight line (the Regression model) is given above the graph in Figure 7.40.1 and is

Shear Strength (psi) = 2628 - 37.15 x Age of Propellant (weeks)

Thus, in the future, for any Age of Propellant from 0 to 25 weeks,[72] it is possible to predict the physical property Shear Strength for that propellant. Also, if the Shear Strength had to be maintained above a certain level to perform correctly, then it is also possible to calculate a would-be shelf life for the propellant based on the model.

[72] There is no data outside of this timeframe and so no predictions should be made beyond 25 weeks.
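As a sketch of such a prediction using the coefficients from the fitted model (the minimum-strength requirement of 2,000 psi below is purely hypothetical):

intercept, slope = 2627.82, -37.154        # coefficients from the Regression output

def predict_shear(age_weeks):
    """Predict Shear Strength (psi); valid only within the 0-25 week data range."""
    if not 0 <= age_weeks <= 25:
        raise ValueError("no data beyond 0-25 weeks; do not extrapolate")
    return intercept + slope * age_weeks

print(predict_shear(10))                   # ~2256 psi at 10 weeks
min_strength = 2000.0                      # hypothetical performance requirement
shelf_life = (min_strength - intercept) / slope    # solve the model for Age
print(round(shelf_life, 1))                # ~16.9 weeks of would-be shelf life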

In the top right of Figure 7.40.1 are three statistics. These are in fact only three of the many statistics available from the full analysis results, shown in Figure 7.40.2. The analysis shows the same equation (model) representing the relationship between Y and X.

Figure 7.40.2. Analysis results for the Rocket Propellant example.[74]

Regression Analysis

The regression equation is
Shear Strength (psi) = 2628 - 37.2 Age of Propellant (weeks)

Predictor         Coef     StDev         T        P
Constant       2627.82     44.18     59.47    0.000
Age of P       -37.154     2.889    -12.86    0.000

S = 96.11    R-Sq = 90.2%    R-Sq(adj) = 89.6%

Analysis of Variance

Source             DF         SS         MS         F        P
Regression          1    1527483    1527483    165.38    0.000
Residual Error     18     166255       9236
Total              19    1693738


[74] Source: SBTI's Lean Sigma Methodology training material.

For the constant term and for each X in the model there is a p-value indicating whether that term is significantly non-zero. Both here have a p-value of zero, which indicates that there would be an almost zero chance of getting coefficients this large (this far from zero) purely by random chance. Specifically:

  • The p-value for the constant indicates that the Y-intercept is not equal to zero

  • The p-value for the Age of Propellant indicates that the slope of the Regression line is not equal to zero

The statistics on the center row are the same as those listed on the Fitted Line Plot in Figure 7.40.1 (a short sketch after this list reproduces them from the ANOVA table):

  • S is the standard deviation of the variation not explained by the model, known as the Residuals. It is the spread of the data around the Regression line.

  • R-Sq (R²) is the amount of variation in the data that is explained by the model. It is calculated from the ANOVA table at the bottom of Figure 7.40.2 by the equation SS(Regression) / SS(Total). Here 90.2% (calculated as 1527483 ÷ 1693738) of all the variability in the sample data is explained by the model.

  • R-Sq(adj) is an indicator of whether any redundant (non-contributing) terms have been included in the model.[73] If the R-Sq(adj) falls well below the R-Sq value then there are redundant terms. Here the two are close and thus the conclusion should be that all the terms used in the model actually contribute something.

    [73] This is more applicable in Multiple Linear Regression versus Simple Regression.
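A minimal sketch reproducing these three statistics from the sums of squares and degrees of freedom in the ANOVA table of Figure 7.40.2:

# SS and DF values copied from the ANOVA table in Figure 7.40.2
ss_regression, ss_error, ss_total = 1_527_483, 166_255, 1_693_738
df_error, df_total = 18, 19

s = (ss_error / df_error) ** 0.5                              # spread around the Regression line
r_sq = ss_regression / ss_total                               # variation explained by the model
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)  # penalizes redundant terms

print(f"S = {s:.2f}")                   # 96.11
print(f"R-Sq = {r_sq:.1%}")             # 90.2%
print(f"R-Sq(adj) = {r_sq_adj:.1%}")    # 89.6%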

The bottom table is an ANOVA (Analysis Of Variance) table. For more details see "ANOVA" in this chapter. The ANOVA table breaks the variation into two main pieces:

  • The variation explained by the model (known as the Regression)

  • The variation not explained by the model (known as the Residual Error)

The calculation of these is shown graphically in Figure 7.40.3:

  • The Mean of the Y data is calculated and represented by the dashed horizontal straight line in the figure.

  • The Total Variation (Source Total in the ANOVA Table) is calculated by taking the square of the distance for each data point from the mean and then summing all the squares. SS(Total) = 1693738 is calculated this way.

  • The Residual Error is calculated by taking the square of the distance of each data point from the Regression Line and then summing all the squares. It is the bit left over after the line has been fitted. SS(Residual Error) = 166255 was calculated this way.

  • The variation explained by the model, known as the Regression, is calculated by taking, for every data point, the square of the distance between the point predicted by the line and the mean, and then summing all the squares. SS(Regression) = 1527483 was calculated this way.

Figure 7.40.3. Graphical representation of ANOVA calculation.


From the preceding calculations it is possible to calculate a signal-to-noise ratio based on the size of the Regression (the signal) versus the background noise (Residual Error). This is the F-test in the table. Here the value of F is 165.38, which means the signal due to the X is 165.38 times the size of the background noise.

The software then looks up the F value in a statistical table to discover the likelihood of seeing a difference of this magnitude.[75] The likelihood is the p-value, in this case 0.000.

[75] For more detail on exactly how this is calculated see Introduction to Linear Regression Analysis by Douglas Montgomery and Elizabeth Peck.
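The F-ratio and its p-value can be reproduced from the ANOVA table; a minimal sketch, using scipy for the F-distribution lookup:

from scipy import stats

ss_regression, ss_error = 1_527_483, 166_255     # from Figure 7.40.2
df_regression, df_error = 1, 18

ms_regression = ss_regression / df_regression    # the signal
ms_error = ss_error / df_error                   # the background noise
f_ratio = ms_regression / ms_error               # signal-to-noise ratio

# Likelihood of an F this large if Y were truly independent of X
p_value = stats.f.sf(f_ratio, df_regression, df_error)
print(f"F = {f_ratio:.2f}, p = {p_value:.3f}")   # F = 165.38, p = 0.000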

The p-value indicates the likelihood of seeing a relationship this strong in the data sample purely by random chance when in fact there is no relationship at the population level; that is, the apparent relationship arose by coincidence in selecting the sample from the population. As in most statistical tests, the p-value is associated with a pair of hypotheses; for Regression these are:

  • Ho: Y is independent of X

  • Ha: Y is dependent on X

If the p-value is less than 0.05 (as in this example), then the null hypothesis Ho should be rejected and the conclusion is that the Y is dependent on the X. Belts are sometimes misled at this point into assuming that there is a direct causal relationship between the X and the Y. There might be, but a change in X does not necessarily directly cause Y to move. The statistically correct interpretation is that when X moves 1 unit, Y moves by some consistent associated amount.

The analysis is not complete until the model adequacy is validated, which is done by reviewing the quality of the fit and investigating the variation that has not been explained: the Residuals (the bit left over). Residual evaluation gives a warning sign that the generated model might not be adequate or appropriate.

As Figure 7.40.3 shows, the residual is the actual value minus the fitted value; it can be negative or positive depending on whether the data point is above or below the line. For an adequate model, the Residuals should satisfy several criteria:

  • The sum of the Residuals = 0

  • The Residuals have a constant variance

  • The Residuals are normally distributed

  • The Residuals are in control

To validate model adequacy it is useful to examine the residuals graphically. To determine if the Residuals are Normal a few options are available:

  • A Probability Plot can be applied as shown in Figure 7.40.4a. Residuals on the Normal Plot should form a straight line.

    Figure 7.40.4. Graphical Analysis of Residuals (output from Minitab v14).

  • A Histogram can be applied as shown in Figure 7.40.4b. The Histogram should appear to be forming a normal curve. This can be hit-and-miss and should be used in conjunction with the Normal Probability Plot.

  • A Normality Test can be applied on the Residuals to gain a p-value. See "Normality Test" in this chapter. This is by far the best approach.

To determine if the Residuals are in control, an Individuals Chart can be applied to the Residuals, as shown in Graph C of Figure 7.40.4. Residuals that appear out of control should be studied further. Possible out-of-control causes include Measurement System error, incorrect data entry, or an unusual process event. In the case of the latter, the Team should consult any notes taken during data collection to evaluate the impact of the process event.

To determine if the Residuals have constant variance and to show that they are random (just background noise), a Residuals versus Fits Plot can be applied, as shown in Graph D of Figure 7.40.4. The Residuals should be distributed randomly across the Plot; any obvious pattern could indicate model inadequacy, as described in Table 7.40.1.

Table 7.40.1. Interpretation of the Residuals versus Fits Plot

  • Pattern: Residuals are contained in a straight band with no obvious pattern in the graph. Interpretation: The model is adequate.

  • Pattern: Residuals show a funnel pattern; the variance of the errors is not constant and increases as Y increases. Interpretation: The model is inadequate. This might be resolved by transforming the Y.[76]

  • Pattern: Residuals show a parabolic or quadratic pattern. Interpretation: The model is inadequate. This might be resolved with a higher-order model (quadratic, for example).

  • Pattern: Residuals show a bow pattern; the variance of the errors is not constant. Interpretation: The model is inadequate. This might be resolved by transforming the Y.[77]


[76] Beyond the scope of this book.

[77] Beyond the scope of this book.
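A minimal sketch pulling the preceding Residuals checks together, assuming the age and shear arrays from the earlier fit; the Shapiro-Wilk test stands in here for the Normality Test (Minitab typically uses Anderson-Darling):

import numpy as np
from scipy import stats

fit = stats.linregress(age, shear)
fitted = fit.intercept + fit.slope * age
residuals = shear - fitted                           # actual value minus fitted value

print(f"sum of residuals = {residuals.sum():.6f}")   # should be ~0
w_stat, p_norm = stats.shapiro(residuals)            # Normality test on the Residuals
print(f"normality p-value = {p_norm:.3f}")           # > 0.05: no evidence of non-Normality

# Crude constant-variance check: compare residual spread across the fitted range
low = fitted < np.median(fitted)
print(f"spread low/high: {residuals[low].std():.1f} / {residuals[~low].std():.1f}")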

If there are patterns in the Residuals but the R² value is very high, this probably presents no problem; however, if, for example, R² is less than 80%, then there might be an opportunity to create a better model based on the paths recommended in the table.

After the model is deemed to be adequate, the Team should collectively draw practical conclusions from it and present them back to the Process Owner and the Champion.

Roadmap

The roadmap to conducting a Regression analysis is as follows:

Step 1.

Plan the study. Identify the Ys and Xs to be considered. For each Y (preferably both the Xs and Ys), verify the Measurement System using a Gage R&R Study (see "MSA-Continuous" in this chapter). Agree on the data collection approach and assign responsibilities to the Team members (for more details see "KPOVs and Data" in this chapter).

Step 2.

Pilot data collection. Validate the data collection approach as created in Step 1. Modify and retest if necessary.

Step 3.

Collect the data, carefully following the agreed data collection approach. Take copious notes of process conditions and record any unusual process events. Transfer the data promptly into electronic form and make backup copies.

Step 4.

Analyze the data, as per the details in "Overview" in this section:

  • Create the Fitted Line Plot

  • Evaluate the significance of R² and the p-values

  • Check the Residuals to validate model adequacy

Step 5.

Formulate practical conclusions from the analysis, including potential follow-on studies.

Interpreting the Output

Regression in its Simple Linear form is quite straightforward to apply. There are, however, as with all tools, several pitfalls that can cause Belts problems:

  • The purpose of a model is to create a prediction model for the behavior of the response Y based on the predictor X. However, if the X itself cannot be predicted, then the model is useless. An example of this might be a desire to predict the maximum daily load on an electric power generation system from a maximum daily temperature model. The accuracy and usefulness of the Regression model for electric load prediction is conditional on the forecast of the temperature, the accuracy of which is patchy at best.

  • Regression is an interpolation technique, not an extrapolation technique. Predictions from Regression models can be made with confidence only within the confines of the data. If no data has been taken in an operating region, the model is hit-and-miss at best there. To remedy this, data points should be taken over the breadth of the region in which predictions are to be made.

  • Single points can heavily affect Regression models. In Graph A of Figure 7.40.5, the single outlier dramatically reduces the R² value of the model. If the outlier is a bad value, then the model estimates are wrong and the error is inflated. However, if the outlier is a real process value, it should not be removed; it is a useful piece of data for the process. Refer to notes taken during data collection to understand the point and, if possible, try to recreate it.

    Figure 7.40.5. The effect of a single data point on the regression model.

  • In Graph B in Figure 7.40.5, the single outlier increases the R² value and the regression coefficient. In this case, evaluate the model with and without the point to determine its effect (see the sketch after Figure 7.40.6). If the R² value changes greatly during this analysis, then that point is too influential. Conduct other data runs near that point to lower its leverage and confirm its validity.

  • Regression models should represent meaningful relationships. Take, for example, the relationship shown in Figure 7.40.6. Data about a city showed that as the population density of storks increased, so did the town's population. As much as I'd like to believe this relationship, it could equally be the reverse, mundane scenario: as the town's population increases, there are more chimneys (nesting grounds for storks), and thus the stork population can increase accordingly.

Figure 7.40.6. Incorrect causal relationships.[79]


[79] Source: SBTI's Lean Sigma Methodology training material.
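Picking up the with-and-without check suggested for Graph B above, here is a minimal sketch, assuming the age and shear arrays from earlier; the index of the suspect point is hypothetical:

import numpy as np
from scipy import stats

i_suspect = 7                                    # hypothetical index of the outlier
full = stats.linregress(age, shear)
reduced = stats.linregress(np.delete(age, i_suspect), np.delete(shear, i_suspect))

print(f"R-Sq with the point    = {full.rvalue ** 2:.1%}")
print(f"R-Sq without the point = {reduced.rvalue ** 2:.1%}")
# A large change means the point is too influential; run more data near it.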

Other Options

"Overview" and "Roadmap" in this section describe Simple Linear Regression, the investigation of the relationship between one Continuous X and one Continuous Y. Multiple Linear Regression, on the other hand, investigates the relationship between multiple Continuous Xs and one Continuous Y. The principles used are similar; however, the Fitted Line Plot no longer helps in this case. The multiple Xs are added into the Regression analysis and the same pointers are used, namely the R-Sq and p-values. The R-Sq(adj) becomes even more important in Multiple Linear Regression because there are more terms (more Xs) added into the model and certainly not all of them give any contribution.[78]

[78] For much more detail on Multiple Linear Regression, see Introduction to Linear Regression Analysis by Douglas Montgomery and Elizabeth Peck.
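A minimal sketch of Multiple Linear Regression with two hypothetical Xs, fitted by ordinary Least Squares in numpy (a column of ones provides the constant term):

import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(0, 10, 30), rng.uniform(0, 5, 30)   # two Continuous Xs
y = 50 + 3.0 * x1 - 2.0 * x2 + rng.normal(0, 2, 30)      # one Continuous Y

X = np.column_stack([np.ones_like(x1), x1, x2])          # design matrix with constant
coef, ss_error, rank, _ = np.linalg.lstsq(X, y, rcond=None)

n, p = X.shape
r_sq = 1 - ss_error[0] / np.sum((y - y.mean()) ** 2)
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p)            # drops if terms are redundant
print(coef)                                              # ~[50, 3, -2]
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")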

Linear Regression is just that, linear. Sometimes the behavior of the relationship between the X and the Y is non-linear. Most software packages allow the user to select higher-order models, specifically quadratic (including X² terms) and cubic (including X³ terms). Belts tend to get carried away adding in higher-order terms, when really the key tenet here is "the simpler the model, the better." Unless there is compelling reason to add a higher-order term, the linear model is usually preferable. Look to the R-Sq(adj) value as the model advances up an order: if R-Sq(adj) decreases for a higher-order model, then the higher-order terms do not bring any additional value.
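A minimal sketch of that comparison, assuming the age and shear arrays from the earlier example; numpy's polyfit supplies the higher-order term:

import numpy as np

def r_sq_adj(y, fitted, n_terms):
    ss_error = np.sum((y - fitted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    n = len(y)
    return 1 - (ss_error / (n - n_terms - 1)) / (ss_total / (n - 1))

linear = np.polyval(np.polyfit(age, shear, 1), age)      # straight-line model
quad = np.polyval(np.polyfit(age, shear, 2), age)        # adds an X² term

print(f"linear    R-Sq(adj) = {r_sq_adj(shear, linear, 1):.1%}")
print(f"quadratic R-Sq(adj) = {r_sq_adj(shear, quad, 2):.1%}")
# Keep the quadratic term only if R-Sq(adj) improves; otherwise simpler is better.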



