MULTIPLE REGRESSION ANALYSIS


Multiple regression analysis is a statistical technique that can be used to analyze the relationship between a single dependent (criterion) variable and several independent (predictor) variables. The objective of multiple regression analysis is to use the independent variables whose values are known to predict the single dependent value selected by the experimenter. Each independent variable is weighted by the regression analysis procedure to ensure maximal prediction from the set of independent variables. The weights denote the relative contribution of the independent variables to the overall prediction and facilitate interpretation as to the influence of each variable in making the prediction, although correlation among the independent variables complicates the interpretive process. The set of weighted independent variables forms the regression variate, a linear combination of the independent variables that best predicts the dependent variable. The regression variate, also referred to as the regression equation or regression model, is the most widely known example of a variate among the multivariate techniques.
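To make the regression variate concrete, here is a minimal sketch in Python (NumPy) that estimates the weights by ordinary least squares. The data are invented purely for illustration: a satisfaction score predicted from hypothetical price and quality ratings.

import numpy as np

rng = np.random.default_rng(0)
n = 100
price = rng.normal(5, 1, n)            # independent variable X1 (hypothetical)
quality = rng.normal(7, 1.5, n)        # independent variable X2 (hypothetical)
satisfaction = 2 - 0.5 * price + 0.8 * quality + rng.normal(0, 1, n)   # dependent variable Y

# Design matrix: a column of 1s for the intercept plus the predictors.
X = np.column_stack([np.ones(n), price, quality])
b, _, _, _ = np.linalg.lstsq(X, satisfaction, rcond=None)

y_hat = X @ b                          # the regression variate: the weighted linear combination
r2 = 1 - np.sum((satisfaction - y_hat) ** 2) / np.sum((satisfaction - satisfaction.mean()) ** 2)
print("intercept, weight for price, weight for quality:", np.round(b, 3))
print("R-squared:", round(r2, 3))

The estimated weights play the role of the regression coefficients discussed throughout this section.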

Multiple regression analysis is a dependence technique. Because of this, you as an experimenter must be able to classify the variables as dependent and independent. However, because regression is also a statistical tool, it should be used only when the variables are metric. Only under certain circumstances is it possible to include nonmetric data, and when we do, the data must first be appropriately transformed (for example, by coding nonmetric variables as dummy variables).
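As an example of the kind of transformation required for nonmetric data, the sketch below dummy codes a categorical predictor so it can enter the regression; the region labels are invented.

import numpy as np

region = np.array(["north", "south", "south", "west", "north", "west"])

# One 0/1 column per category except the reference level ("north"), which
# is absorbed by the intercept.
dummies = np.column_stack([(region == level).astype(float) for level in ["south", "west"]])
print(dummies)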

REPRESENTING CURVILINEAR EFFECTS WITH POLYNOMIALS

Several types of data transformations are appropriate for linearizing a curvilinear relationship. Direct approaches involve modifying the values through some arithmetic transformation (e.g., taking the square root or logarithm of the variable). However, such transformations have several limitations. First, they are helpful only in a simple curvilinear relationship (a relationship with only one turning or inflection point). Second, they do not provide any statistical means for assessing whether the curvilinear or linear model is more appropriate. Finally, they accommodate only univariate relationships and not the interaction between variables when more than one independent variable is involved. We now discuss a means of creating new variables to explicitly model the curvilinear components of the relationship and address each of the limitations inherent in data transformations.
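For completeness, a minimal sketch of the direct transformations just mentioned (square root and logarithm), applied to invented values; both compress large values and can straighten a simple curvilinear relationship.

import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
print("square root:", np.sqrt(x))
print("logarithm:  ", np.log(x))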

Polynomials are power transformations of an independent variable that add a nonlinear component for each additional power of the independent variable. The power of 1 (X¹) represents the linear component and is the simplest form, representing a line. The power of 2, the variable squared (X²), represents the quadratic component. In graphical terms, X² represents the first inflection point. A cubic component, represented by the variable cubed (X³), adds a second inflection point. With these variables and even higher powers, we can accommodate more complex relationships than are possible with transformations alone. For example, in a simple regression model, a curvilinear model with one turning point can be modeled with the equation

Y = b0 + b1X1 + b2X1²

where b0 = intercept, b1X1 = linear effect of X1, and b2X1² = curvilinear effect of X1.
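A minimal sketch of fitting such a quadratic model, using invented data with a single turning point; the squared term is simply added as another column of the design matrix.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 80)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 1, x.size)   # invented curvilinear data

X_quad = np.column_stack([np.ones_like(x), x, x**2])        # columns for b0, b1*X1, b2*X1^2
b0, b1, b2 = np.linalg.lstsq(X_quad, y, rcond=None)[0]
print("b0 (intercept):", round(b0, 2))
print("b1 (linear):   ", round(b1, 2))
print("b2 (quadratic):", round(b2, 2))                      # negative b2: inverted-U shape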

Although any number of nonlinear components may be added, the cubic term is usually the highest power used. As each new variable is entered into the regression equation, we can also perform a direct statistical test of the nonlinear components, which we cannot do with data transformations. Three (two nonlinear and one linear) relationships are shown in Figure 11.5. For interpretation purposes, the positive quadratic term indicates a U-shaped curve, whereas a negative coefficient indicates an inverted U-shaped (∩-shaped) curve.

Figure 11.5: Representing nonlinear relationships with polynomials.

Multivariate polynomials are created when the regression equation contains two or more independent variables. We follow the same procedure for creating the polynomial terms as before but must also create an additional term, the interaction term (X1X2), which is needed for each variable combination to represent fully the multivariate effects. In graphical terms, a two-variable multivariate polynomial is portrayed by a surface with one peak or valley. For higher-order polynomials, the best form of interpretation is obtained by plotting the surface from the predicted values.
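A minimal sketch of such a model with two predictors, their squared terms, and the interaction term X1X2; the data are invented.

import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1 + x1 + 0.5 * x2 - 0.8 * x1**2 - 0.6 * x2**2 + 0.7 * x1 * x2 + rng.normal(0, 1, n)

# Linear, quadratic, and interaction terms all enter as columns.
X = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
for name, c in zip(["intercept", "X1", "X2", "X1^2", "X2^2", "X1*X2"], coefs):
    print(f"{name:9s} {c: .3f}")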

How many terms should be added? Common practice is to start with the linear component and then sequentially add higher-order polynomials until nonsignificance is achieved. The use of polynomials, however, also has potential problems. First, each additional term requires a degree of freedom, which may be particularly restrictive with small sample sizes. This limitation does not occur with data transformations. Also, the additional terms introduce multicollinearity, which makes statistical significance testing of the individual polynomial coefficients inappropriate. Instead, the experimenter must compare the R² of the equation containing only the linear terms with the R² of the equation that adds the polynomial terms. Testing the statistical significance of the incremental R² is the appropriate way to assess the impact of the polynomials, as sketched below.
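A minimal sketch of that incremental-R² comparison, reusing the invented quadratic data from the earlier example and the standard F-test for nested models.

import numpy as np
from scipy import stats

def r_squared(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 80)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 1, x.size)

X_lin = np.column_stack([np.ones_like(x), x])       # linear term only
X_poly = np.column_stack([X_lin, x**2])             # plus the polynomial term
r2_lin, r2_poly = r_squared(X_lin, y), r_squared(X_poly, y)

k_added = 1                                         # number of polynomial terms added
df_resid = x.size - X_poly.shape[1]                 # residual df of the larger model
F = ((r2_poly - r2_lin) / k_added) / ((1 - r2_poly) / df_resid)
p = stats.f.sf(F, k_added, df_resid)
print(f"R2 linear = {r2_lin:.3f}, R2 polynomial = {r2_poly:.3f}, F = {F:.1f}, p = {p:.4f}")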

STANDARDIZING THE REGRESSION COEFFICIENTS: BETA COEFFICIENTS

If each of our independent variables had been standardized before we estimated the regression equation, we would have found different regression coefficients. The coefficients resulting from standardized data are called beta (β) coefficients. Their advantage is that they eliminate the problem of dealing with different units of measurement, thus reflecting the relative impact on the dependent variable of a change of one standard deviation in each variable. Now that we have a common unit of measurement, we can determine which variable has the most impact.
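A minimal sketch of obtaining beta coefficients: standardize every variable to z-scores and refit, again using the invented price and quality example from above.

import numpy as np

rng = np.random.default_rng(0)
n = 100
price = rng.normal(5, 1, n)
quality = rng.normal(7, 1.5, n)
satisfaction = 2 - 0.5 * price + 0.8 * quality + rng.normal(0, 1, n)

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)

# With standardized data the intercept is zero, so only the predictors enter.
Xz = np.column_stack([zscore(price), zscore(quality)])
betas = np.linalg.lstsq(Xz, zscore(satisfaction), rcond=None)[0]
print("beta (price):  ", round(betas[0], 3))
print("beta (quality):", round(betas[1], 3))   # larger |beta| = greater relative impact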

Three cautions must be observed when using beta coefficients. First, they should be used as a guide to the relative importance of individual independent variables only when collinearity is minimal. Second, the beta values can be interpreted only in the context of the other variables in the equation. For example, a beta value for family size reflects its importance only in relation to family income, not in any absolute sense. If another independent variable were added to the equation, the beta coefficient for family size would probably change, because there would likely be some relationship between family size and the new independent variable. The third caution is that the levels (e.g., families of size five, six, and seven persons) affect the beta value. Had we found families of size eight, nine, and ten, the value of beta would likely change. In summary, beta coefficients should be used only as a guide to the relative importance of the independent variables included in the equation, and only over the range of values for which sample data actually exist.

ASSESSING MULTICOLLINEARITY

A key issue in interpreting the regression variate is the correlation among the independent variables. This is a data problem, not a problem of model specification. The ideal situation for an experimenter would be to have a number of independent variables highly correlated with the dependent variable, but with little correlation among themselves. Yet in most situations, particularly those involving consumer response data, there will be some degree of multicollinearity. On other occasions, such as when using dummy variables to represent nonmetric variables or polynomial terms for nonlinear effects, the researcher deliberately creates situations of high multicollinearity. The researcher's task is to assess the degree of multicollinearity, determine its impact on the results, and apply the necessary remedies if needed. In the following sections we discuss the effects of multicollinearity and then detail some useful diagnostic procedures and possible remedies.

The effects of multicollinearity can be categorized in terms of explanation and estimation. The effects on explanation primarily concern the ability of the regression procedure and the experimenter to represent and understand the effects of each independent variable in the regression variate. As multicollinearity occurs (even at relatively low levels of .30 or so), the process of separating the effects of the individual variables becomes more difficult. First, it limits the size of the coefficient of determination and makes it increasingly difficult to add unique explanatory prediction from additional variables. Second, and just as important, it makes determining the contribution of each independent variable difficult because the effects of the independent variables are "mixed" or confounded. Multicollinearity results in larger portions of shared variance and lower levels of unique variance from which the effects of the individual independent variables can be determined. For example, assume that one independent variable (X1) has a correlation of .60 with the dependent variable, and a second independent variable (X2) has a correlation of .50. Then X1 would explain 36% (obtained by squaring the correlation of .60) of the variance of the dependent variable, and X2 would explain 25% (correlation of .50 squared). If the two independent variables are not correlated with each other at all, there is no "overlap," or sharing, of their predictive power, and the total explanation would be their sum, 36% + 25% = 61%. But as collinearity increases, there is some "sharing" of predictive power, and the collective predictive power of the independent variables decreases.
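This can be verified with the standard formula for R² with two predictors; a minimal sketch, using the correlations from the example (rY1 = .60, rY2 = .50) at several assumed levels of collinearity r12:

# R2 for two predictors: (rY1^2 + rY2^2 - 2*rY1*rY2*r12) / (1 - r12^2)
r_y1, r_y2 = 0.60, 0.50

def total_r2(r12):
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r12) / (1 - r12**2)

for r12 in (0.0, 0.3, 0.5, 0.7):
    print(f"r12 = {r12:.1f}  ->  total R2 = {total_r2(r12):.3f}")
# r12 = 0.0 reproduces .36 + .25 = .61; higher collinearity lowers the total R2.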

Figure 11.6 portrays the proportions of shared and unique variance for our example of two independent variables in varying instances of collinearity. If the collinearity of these variables is zero, then the individual variables predict 36% and 25% of the variance in the dependent variable, for an overall prediction (R²) of 61%. But as multicollinearity increases, the total variance explained decreases. Moreover, the amount of unique variance for the independent variables is reduced to levels that make estimation of their individual effects quite problematic.

Figure 11.6: Proportions of unique and shared variance by levels of multicollinearity.

In addition to the effects on explanation, multicollinearity can have substantive effects on the estimation of the regression coefficients and their statistical significance tests. First, the extreme case of multicollinearity in which two or more variables are perfectly correlated, termed singularity, prevents the estimation of any coefficients. In this instance, the singularity must be removed before the estimation of coefficients can proceed. Even if the multicollinearity is not perfect, high degrees of multicollinearity can result in regression coefficients being incorrectly estimated and even having the wrong signs.
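A minimal sketch of singularity with invented data: when one predictor is an exact linear combination of another, the design matrix is rank deficient and X'X cannot be inverted, so the usual coefficient estimates do not exist.

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2 * x1                                  # perfectly correlated with x1
X = np.column_stack([np.ones_like(x1), x1, x2])

print("columns:", X.shape[1], " rank:", np.linalg.matrix_rank(X))   # rank 2 < 3 columns
try:
    np.linalg.inv(X.T @ X)                   # normal-equations inverse
except np.linalg.LinAlgError as err:
    print("X'X is singular:", err)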

Because of these potential problems, the effects of multicollinearity can be substantial. In any regression analysis, the assessment of multicollinearity should be undertaken in two steps: identification of the extent of collinearity, and assessment of the degree to which the estimated coefficients are affected, so that corrective action can be taken if needed. To identify collinearity, first examine the correlation matrix of the independent variables for high pairwise correlations, and then apply measures that capture both pairwise and multiple-variable collinearity. Two of the most common measures are the tolerance value and its inverse, the variance inflation factor (VIF). These measures tell us the degree to which each independent variable is explained by the other independent variables. (Tolerance is the amount of variability of the selected independent variable not explained by the other independent variables; the VIF is simply 1/tolerance. A common cut-off threshold is a tolerance of .10, although each study should be evaluated on its own merits.)
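A minimal sketch of computing both diagnostics, with invented predictors in which x3 is built to be nearly a linear combination of x1 and x2.

import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = 0.8 * x1 + 0.5 * x2 + rng.normal(0, 0.3, n)      # deliberately collinear
X = np.column_stack([x1, x2, x3])

def r_squared(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(Xc, y, rcond=None)[0]
    resid = y - Xc @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)                   # regress Xj on the remaining predictors
    tolerance = 1 - r_squared(others, X[:, j])
    print(f"X{j+1}: tolerance = {tolerance:.3f}, VIF = {1 / tolerance:.2f}")
# A tolerance below the common .10 cut-off would flag serious multicollinearity.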



