All of the predictive methods implemented in PROC PLS work essentially by finding linear combinations of the predictors (called factors) with which to predict the responses linearly. The methods differ only in how the factors are derived, as explained in the following sections.
Partial least squares (PLS) works by extracting one factor at a time. Let X be the centered and scaled matrix of predictors and let Y be the centered and scaled matrix of response values. The PLS method starts with a linear combination t = Xw of the predictors, where t is called a score vector and w is its associated weight vector. The PLS method predicts both X and Y by regression on t:

   X is predicted as t p',  where p' = (t't)^(-1) t'X
   Y is predicted as t c',  where c' = (t't)^(-1) t'Y
The vectors p and c are called the X-loadings and Y-loadings, respectively.
The specific linear combination t = Xw is the one that has maximum covariance t'u with some response linear combination u = Yq. Another characterization is that the X-weight w and the Y-weight q are proportional to the first left and right singular vectors of the covariance matrix X'Y or, equivalently, to the first eigenvectors of X'YY'X and Y'XX'Y, respectively.
This accounts for how the first PLS factor is extracted. The second factor is extracted in the same way, with X and Y replaced by the X- and Y-residuals from the first factor:

   X_1 = X - t p'
   Y_1 = Y - t c'
These residuals are also called the deflated X and Y blocks. The process of extracting a score vector and deflating the data matrices is repeated for as many extracted factors as are desired.
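The extract-and-deflate cycle described above can be sketched numerically. The following is an illustrative NumPy sketch, not PROC PLS internals; all function and variable names are chosen for exposition:

```python
import numpy as np

def extract_pls_factor(X, Y):
    """Extract one PLS factor from centered and scaled X and Y,
    returning the score t, weight w, loadings p and c, and the
    deflated X and Y blocks for the next factor."""
    # The X-weight w is proportional to the first left singular
    # vector of the covariance matrix X'Y.
    w = np.linalg.svd(X.T @ Y)[0][:, 0]
    t = X @ w                        # score vector
    p = X.T @ t / (t @ t)            # X-loading (regress X on t)
    c = Y.T @ t / (t @ t)            # Y-loading (regress Y on t)
    X_res = X - np.outer(t, p)       # deflated X block
    Y_res = Y - np.outer(t, c)       # deflated Y block
    return t, w, p, c, X_res, Y_res

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
Y = rng.standard_normal((20, 2))
X = (X - X.mean(0)) / X.std(0, ddof=1)   # center and scale
Y = (Y - Y.mean(0)) / Y.std(0, ddof=1)
t, w, p, c, X_res, Y_res = extract_pls_factor(X, Y)
# The deflated blocks are orthogonal to the extracted score.
print(np.allclose(X_res.T @ t, 0), np.allclose(Y_res.T @ t, 0))  # True True
```

Repeating `extract_pls_factor` on the returned residual blocks extracts the second factor, and so on for as many factors as are desired.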
SIMPLS
Note that each extracted PLS factor is defined in terms of different X-variables X_i. This leads to difficulties in comparing different scores, weights, and so forth. The SIMPLS method of de Jong (1993) overcomes these difficulties by computing each score t_i = X r_i in terms of the original (centered and scaled) predictors X. The SIMPLS X-weight vectors r_i are similar to the eigenvectors of SS' = X'YY'X, but they satisfy a different orthogonality condition. The r_1 vector is just the first eigenvector e_1 (so that the first SIMPLS score is the same as the first PLS score), but whereas the second eigenvector maximizes

   e_2' SS' e_2   subject to   e_2' e_1 = 0

the second SIMPLS weight r_2 maximizes

   r_2' SS' r_2   subject to   r_2' X'X r_1 = t_2' t_1 = 0
The SIMPLS scores are identical to the PLS scores when there is a single response but slightly different when there is more than one response; refer to de Jong (1993) for details. The X- and Y-loadings are defined as in PLS, but since the scores are all defined in terms of X, it is easy to compute the overall model coefficients B:

   B = sum_i r_i c_i'
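As a concrete check of these relationships, the sketch below (NumPy, with illustrative names) computes the first SIMPLS weight as the first left singular vector of X'Y and accumulates the rank-one contribution r_1 c_1' to the coefficient matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 3))
Y = rng.standard_normal((15, 2))
X = (X - X.mean(0)) / X.std(0, ddof=1)   # center and scale
Y = (Y - Y.mean(0)) / Y.std(0, ddof=1)

# r1 is the first eigenvector of SS' = X'YY'X, equivalently the first
# left singular vector of S = X'Y, so the first SIMPLS score equals
# the first PLS score.
r1 = np.linalg.svd(X.T @ Y)[0][:, 0]
t1 = X @ r1                              # first score

c1 = Y.T @ t1 / (t1 @ t1)                # Y-loading for the factor
B = np.outer(r1, c1)                     # one-factor coefficient matrix
# Because the score is expressed in the original X, predictions come
# directly from X @ B, with no deflated matrices involved.
print(np.allclose(X @ B, np.outer(t1, c1)))  # True
```

Each further factor would add another rank-one term r_i c_i' to B, again in terms of the original predictors.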
Like the SIMPLS method, principal components regression (PCR) defines all the scores in terms of the original (centered and scaled) predictors X. However, unlike both the PLS and SIMPLS methods, the PCR method chooses the X-weights/X-scores without regard to the response data. The X-scores are chosen to explain as much variation in X as possible; equivalently, the X-weights for the PCR method are the eigenvectors of the predictor covariance matrix X'X. Again, the X- and Y-loadings are defined as in PLS; but, as in SIMPLS, it is easy to compute overall model coefficients for the original (centered and scaled) responses Y in terms of the original predictors X.
As discussed in the preceding sections, partial least squares depends on selecting factors t = X w of the predictors and u = Y q of the responses that have maximum covariance, whereas principal components regression effectively ignores u and selects t to have maximum variance, subject to orthogonality constraints. In contrast, reduced rank regression selects u to account for as much variation in the predicted responses as possible, effectively ignoring the predictors for the purposes of factor extraction. In reduced rank regression, the Y-weights q i are the eigenvectors of the covariance matrix of the responses predicted by ordinary least squares regression; the X-scores are the projections of the Y-scores Y q i onto the X space.
When you develop a predictive model, it is important to consider not only the explanatory power of the model for current responses, but also how well sampled the predictive functions are, since this impacts how well the model can extrapolate to future observations. All of the techniques implemented in the PLS procedure work by extracting successive factors, or linear combinations of the predictors, that optimally address one or both of these two goals: explaining response variation and explaining predictor variation. In particular, principal components regression selects factors that explain as much predictor variation as possible, reduced rank regression selects factors that explain as much response variation as possible, and partial least squares balances the two objectives, seeking factors that explain both response and predictor variation.
To see the relationships between these methods, consider how each one extracts a single factor from the following artificial data set consisting of two predictors and one response:
data data;
   input x1 x2 y;
   datalines;
3.37651  2.30716  0.75615
0.74193  0.88845  1.15285
4.18747  2.17373  1.42392
0.96097  0.57301  0.27433
1.11161  0.75225  0.25410
1.38029  1.31343  0.04728
1.28153  0.13751  1.00341
1.39242  2.03615  0.45518
0.63741  0.06183  0.40699
2.52533  1.23726  0.91080
2.44277  3.61077  0.82590
;
proc pls data=data nfac=1 method=rrr;
   title "Reduced Rank Regression";
   model y = x1 x2;
proc pls data=data nfac=1 method=pcr;
   title "Principal Components Regression";
   model y = x1 x2;
proc pls data=data nfac=1 method=pls;
   title "Partial Least Squares Regression";
   model y = x1 x2;
run;
The amount of model and response variation explained by the first factor for each method is shown in Figure 56.7 through Figure 56.9.
Reduced Rank Regression

                            The PLS Procedure

    Percent Variation Accounted for by Reduced Rank Regression Factors

     Number of
     Extracted        Model Effects          Dependent Variables
       Factors     Current      Total       Current       Total
             1     15.0661    15.0661      100.0000    100.0000

Principal Components Regression

                            The PLS Procedure

         Percent Variation Accounted for by Principal Components

     Number of
     Extracted        Model Effects          Dependent Variables
       Factors     Current      Total       Current       Total
             1     92.9996    92.9996        9.3787      9.3787

Partial Least Squares Regression

                            The PLS Procedure

    Percent Variation Accounted for by Partial Least Squares Factors

     Number of
     Extracted        Model Effects          Dependent Variables
       Factors     Current      Total       Current       Total
             1     88.5357    88.5357       26.5304     26.5304
Notice that, while the first reduced rank regression factor explains all of the response variation, it accounts for only about 15% of the predictor variation. In contrast, the first principal components regression factor accounts for most of the predictor variation (93%) but only 9% of the response variation. The first partial least squares factor accounts for only slightly less predictor variation than principal components but about three times as much response variation.
The ellipse shows the general shape of the 11 observations in the predictor space, with the contours of increasing y overlaid. Also shown are the directions of the first factor for each of the three methods. Notice that, while the predictors vary most in the x1 = x2 direction, the response changes most in the orthogonal x1 = - x2 direction. This explains why the first principal component accounts for little variation in the response and why the first reduced rank regression factor accounts for little variation in the predictors. The direction of the first partial least squares factor represents a compromise between the other two directions.
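The first-factor directions for the three methods can be reproduced numerically. The sketch below (NumPy, illustrative names; not PROC PLS output) extracts one factor per method from the artificial data and reports the percent of predictor and response variation each accounts for. The percentages may not match the figures exactly because of normalization details in how PROC PLS accounts for variation, but the qualitative pattern is the same:

```python
import numpy as np

# The artificial data set from the example above.
data = np.array([
    [3.37651, 2.30716, 0.75615], [0.74193, 0.88845, 1.15285],
    [4.18747, 2.17373, 1.42392], [0.96097, 0.57301, 0.27433],
    [1.11161, 0.75225, 0.25410], [1.38029, 1.31343, 0.04728],
    [1.28153, 0.13751, 1.00341], [1.39242, 2.03615, 0.45518],
    [0.63741, 0.06183, 0.40699], [2.52533, 1.23726, 0.91080],
    [2.44277, 3.61077, 0.82590],
])
X = data[:, :2]
y = data[:, 2:]
X = (X - X.mean(0)) / X.std(0, ddof=1)   # center and scale
y = (y - y.mean(0)) / y.std(0, ddof=1)

def pct_explained(t, M):
    """Percent variation in M accounted for by regressing M on score t."""
    load = M.T @ t / (t @ t)
    return 100 * np.sum(np.outer(t, load) ** 2) / np.sum(M ** 2)

# First-factor score for each method:
t_pcr = X @ np.linalg.eigh(X.T @ X)[1][:, -1]             # max X variance
t_pls = X @ (X.T @ y).ravel()                             # max covariance
t_rrr = X @ np.linalg.lstsq(X, y, rcond=None)[0].ravel()  # OLS prediction

for name, t in [("PCR", t_pcr), ("PLS", t_pls), ("RRR", t_rrr)]:
    print(name, round(pct_explained(t, X), 1), round(pct_explained(t, y), 1))
```

By construction, the PCR factor explains the most predictor variation and the RRR factor the most response variation, with PLS between the two on both counts.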
None of the regression methods implemented in the PLS procedure fit the observed data any better than ordinary least squares (OLS) regression; in fact, all of the methods approach OLS as more factors are extracted. The crucial point is that, when there are many predictors, OLS can over-fit the observed data; biased regression methods with fewer extracted factors can provide better predictability of future observations. However, as the preceding observations imply, the quality of the observed data fit cannot be used to choose the number of factors to extract; the number of extracted factors must be chosen on the basis of how well the model fits observations not involved in the modeling procedure itself.
One method of choosing the number of extracted factors is to fit the model to only part of the available data (the training set) and to measure how well models with different numbers of extracted factors fit the other part of the data (the test set). This is called test set validation. However, it is rare that you have enough data to make both parts large enough for pure test set validation to be useful. Alternatively, you can make several different divisions of the observed data into training set and test set. This is called cross validation, and there are several different types. In one-at-a-time cross validation, the first observation is held out as a single-element test set, with all other observations as the training set; next, the second observation is held out, then the third, and so on. Another method is to hold out successive blocks of observations as test sets, for example, observations 1 through 7, then observations 8 through 14, and so on; this is known as blocked validation. A similar method is split-sample cross validation, in which successive groups of widely separated observations are held out as the test set, for example, observations {1, 11, 21, ...}, then observations {2, 12, 22, ...}, and so on. Finally, test sets can be selected from the observed data randomly; this is known as random sample cross validation.
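The four hold-out patterns just described can be sketched as index generators. This is an illustrative Python sketch (the function name and defaults are assumptions, not PROC PLS internals):

```python
import numpy as np

def cv_test_sets(n, method="one", k=7, seed=None):
    """Index sets held out as test sets under each cross validation scheme.

    method="one"    one-at-a-time: each observation in turn
    method="block"  successive blocks: 0..k-1, then k..2k-1, ...
    method="split"  widely separated: {0, k, 2k, ...}, {1, k+1, ...}, ...
    method="random" k random disjoint subsets
    """
    idx = np.arange(n)
    if method == "one":
        return [np.array([i]) for i in idx]
    if method == "block":
        return [idx[i:i + k] for i in range(0, n, k)]
    if method == "split":
        return [idx[i::k] for i in range(k)]
    if method == "random":
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(idx), k)
    raise ValueError(method)

# e.g. blocked validation of 20 observations in blocks of 7:
for test in cv_test_sets(20, "block", k=7):
    print(test)
```

For each test set, the model would be refit on the remaining observations and its predicted residuals accumulated, as described above.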
Which validation you should use depends on your data. Test set validation is preferred when you have enough data to make a division into a sizable training set and test set that represent the predictive population well. You can specify that the number of extracted factors be selected by test set validation by using the CV=TESTSET(data-set) option, where data-set is the name of the data set containing the test set. If you do not have enough data for test set validation, you can use one of the cross validation techniques. The most common technique is one-at-a-time validation (which you can specify with the CV=ONE option or just the CV option), unless the observed data is serially correlated, in which case either blocked or split-sample validation may be more appropriate (CV=BLOCK or CV=SPLIT); you can specify the number of test sets in blocked or split-sample validation with a number in parentheses after the CV= option. Note that CV=ONE is the most computationally intensive of the cross validation methods, since it requires a recomputation of the PLS model for every input observation. Also, note that using random subset selection with CV=RANDOM may lead two different researchers to produce different PLS models on the same data (unless the same seed is used).
Whichever validation method you use, the number of factors chosen is usually the one that minimizes the predicted residual sum of squares (PRESS); this is the default choice if you specify any of the CV methods with PROC PLS. However, models with fewer factors often have PRESS statistics that are only marginally larger than the absolute minimum. To address this, van der Voet (1994) has proposed a statistical test for comparing the predicted residuals from different models; when you apply van der Voet's test, the number of factors chosen is the fewest with residuals that are insignificantly larger than the residuals of the model with minimum PRESS.
To see how van der Voet's test works, let R_{i,jk} be the jth predicted residual for response k for the model with i extracted factors; the PRESS statistic is

   PRESS_i = sum_{j,k} R_{i,jk}^2

Also, let i_min be the number of factors for which PRESS is minimized. The critical value for van der Voet's test is based on the differences between squared predicted residuals

   D_{i,jk} = R_{i,jk}^2 - R_{i_min,jk}^2
One alternative for the critical value is C_i = sum_{j,k} D_{i,jk}, which is just the difference between the PRESS statistics for i and i_min factors. Alternatively, van der Voet suggests Hotelling's T^2 statistic

   C_i = d_{i,.}' S_i^(-1) d_{i,.}

where d_{i,.} is the sum of the vectors d_{i,j} = {D_{i,j1}, ..., D_{i,jN_y}}' and S_i is the sum of squares and crossproducts matrix

   S_i = sum_j d_{i,j} d_{i,j}'
The significance level for van der Voet's test is obtained by comparing C_i with the distribution of values that result from randomly exchanging R_{i,jk}^2 and R_{i_min,jk}^2. In practice, a Monte Carlo sample of such values is simulated, and the significance level is approximated as the proportion of simulated critical values that are greater than C_i. If you apply van der Voet's test by specifying the CVTEST option, then, by default, the number of extracted factors chosen is the least number with an approximate significance level that is greater than 0.10.
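A sketch of this randomization test, using the simple PRESS-difference form of the critical value, follows. This is an illustrative NumPy sketch under assumed inputs, not PROC PLS's implementation; randomly exchanging the two models' squared residuals for an observation amounts to flipping the sign of that observation's contribution to C_i:

```python
import numpy as np

def van_der_voet_p(res_i, res_min, n_sim=2000, seed=0):
    """Approximate significance level for comparing predicted residuals.

    res_i, res_min: (n_obs, n_responses) predicted residuals for the
    model with i factors and for the PRESS-minimizing model."""
    D = res_i ** 2 - res_min ** 2          # squared-residual differences
    C = D.sum()                            # difference of PRESS statistics
    d = D.sum(axis=1)                      # per-observation contributions
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_sim, d.size))
    C_sim = signs @ d                      # simulated critical values
    return np.mean(C_sim > C)

# Hypothetical residuals: a model whose residuals are only slightly
# larger than the PRESS-minimizing model's should not be rejected.
rng = np.random.default_rng(1)
res_min = rng.standard_normal((30, 1))
res_i = res_min + 0.01 * rng.standard_normal((30, 1))
print(van_der_voet_p(res_i, res_min, seed=2))
```

A model with clearly larger residuals yields a small significance level and is rejected; one with essentially identical residuals is retained, which is how fewer factors can be preferred over the strict PRESS minimum.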
By default, the predictors and the responses are centered and scaled to have mean 0 and standard deviation 1. Centering the predictors and the responses ensures that the criterion for choosing successive factors is based on how much variation they explain, in either the predictors or the responses or both. (See the Regression Methods section on page 3380 for more details on how different methods explain variation.) Without centering, both the mean variable value and the variation around that mean are involved in selecting factors. Scaling serves to place all predictors and responses on an equal footing relative to their variation in the data. For example, if Time and Temp are two of the predictors, then scaling says that a change of std(Time) in Time is roughly equivalent to a change of std(Temp) in Temp.
Usually, both the predictors and responses should be centered and scaled. However, if their values already represent variation around a nominal or target value, then you can use the NOCENTER option in the PROC PLS statement to suppress centering. Likewise, if the predictors or responses are already all on comparable scales, then you can use the NOSCALE option to suppress scaling.
Note that, if the predictors involve crossproduct terms, then, by default, the individual variables are not standardized before the crossproduct is formed; instead, the crossproduct itself is standardized. That is, if the ith values of two predictors are denoted x1_i and x2_i, then the default standardized ith value of the crossproduct is

   (x1_i x2_i - m12) / s12

where m12 and s12 are the sample mean and standard deviation of the products x1_i x2_i. If you want the crossproduct to be based instead on standardized variables

   ((x1_i - m1)/s1) ((x2_i - m2)/s2)

where m_k and s_k are the sample mean and standard deviation of the kth predictor, for k = 1, 2, then you should use the VARSCALE option in the PROC PLS statement. Standardizing the variables separately is usually a good idea, but note that, unless the model also contains all of the terms nested within each crossproduct, the resulting model may not be equivalent to a simple linear model in the same terms. To see this, note that a model involving the crossproduct of two standardized variables

   ((x1_i - m1)/s1) ((x2_i - m2)/s2) = (x1_i x2_i - m2 x1_i - m1 x2_i + m1 m2) / (s1 s2)

involves not only the crossproduct term but also the linear terms for the unstandardized variables.
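The difference between the two codings, and the expansion that introduces the linear terms, can be checked numerically. This is an illustrative NumPy sketch on made-up data, not PROC PLS's internal computation:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(20, 30, 50)

def std(v):
    return (v - v.mean()) / v.std(ddof=1)

# Default: form the raw product first, then center and scale it.
cp_default = std(x1 * x2)

# VARSCALE-style: standardize each variable, then form the product.
cp_varscale = std(x1) * std(x2)

# The two codings generally differ:
print(np.allclose(cp_default, cp_varscale))

# Expanding (x1 - m1)(x2 - m2)/(s1*s2) shows why the VARSCALE coding
# mixes the crossproduct with linear terms in the raw variables:
m1, s1 = x1.mean(), x1.std(ddof=1)
m2, s2 = x2.mean(), x2.std(ddof=1)
expanded = (x1 * x2 - m2 * x1 - m1 * x2 + m1 * m2) / (s1 * s2)
print(np.allclose(cp_varscale, expanded))  # True
```

The second check confirms the algebraic expansion exactly; the first shows that the default coding is a different column than the product of standardized variables.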
When cross validation is performed for the number of factors, there is some disagreement among practitioners as to whether each cross validation training set should be retransformed (that is, re-centered and re-scaled). By default, PROC PLS does so, but you can suppress this behavior by specifying the NOCVSTDIZE option in the PROC PLS statement.
By default, PROC PLS handles missing values very simply. Observations with any missing independent variables (including all class variables) are excluded from the analysis, and no predictions are computed for such observations. Observations with no missing independent variables but any missing dependent variables are also excluded from the analysis, but predictions are computed.
However, the experimental MISSING= option on the PROC PLS statement provides more sophisticated ways of modeling in the presence of missing values. If you specify MISSING=AVG or MISSING=EM, then all observations in the input data set contribute to both the analysis and the OUTPUT OUT= data set. With MISSING=AVG, the fit is computed by filling in missing values with the average of the nonmissing values for the corresponding variable. With MISSING=EM, the procedure first computes the model with MISSING=AVG, then fills in missing values by their predicted values based on that model and computes the model again. Alternatively, you can specify MISSING=EM(MAXITER= n ) with a large value of n in order to perform this imputation/fit loop until convergence.
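The fill-in/refit loop behind MISSING=AVG and MISSING=EM can be sketched as follows. This is an illustrative NumPy sketch: the function names are assumptions, and a rank-one SVD approximation stands in for the PLS fit:

```python
import numpy as np

def impute_em_style(X, fit_predict, max_iter=100, tol=1e-8):
    """EM-style imputation loop in the spirit of MISSING=AVG / MISSING=EM.

    fit_predict(filled) must return fitted values for every cell.
    Missing cells start at their column means (as with MISSING=AVG) and
    are then replaced by model predictions until they stabilize."""
    rows, cols = np.nonzero(np.isnan(X))
    filled = X.copy()
    filled[rows, cols] = np.nanmean(X, axis=0)[cols]   # MISSING=AVG start
    for _ in range(max_iter):
        pred = fit_predict(filled)
        if np.max(np.abs(pred[rows, cols] - filled[rows, cols])) < tol:
            break                                      # converged
        filled[rows, cols] = pred[rows, cols]
    return filled

# Toy stand-in for the PLS fit: a rank-one SVD approximation.
def rank1_fit(M):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

X = np.outer(np.arange(1.0, 6.0), np.array([1.0, 2.0, 3.0]))
X[2, 1] = np.nan            # knock out one cell (its true value is 6.0)
filled = impute_em_style(X, rank1_fit)
# For this rank-one matrix the column mean already equals the
# model-consistent value, so the loop converges immediately.
print(round(filled[2, 1], 2))  # prints: 6.0
```

With MISSING=EM(MAXITER=n), the analogous loop in PROC PLS runs the imputation/fit cycle up to n times or until the filled-in values converge.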
By default, PROC PLS displays just the amount of predictor and response variation accounted for by each factor.
If you perform a cross validation for the number of factors by specifying the CV option on the PROC PLS statement, then the procedure displays a summary of the cross validation for each number of factors, along with information about the optimal number of factors.
If you specify the DETAILS option on the PROC PLS statement, then details of the fitted model are displayed for each successive factor. These details include, for each number of factors:
the predictor loadings
the predictor weights
the response weights
the coded regression coefficients (for METHOD=SIMPLS, PCR, or RRR)
If you specify the CENSCALE option on the PROC PLS statement, then centering and scaling information for each response and predictor is displayed.
If you specify the VARSS option on the PROC PLS statement, the procedure displays, in addition to the average response and predictor sum of squares accounted for by each successive factor, the amount of variation accounted for in each response and predictor.
If you specify the SOLUTION option on the MODEL statement, then PROC PLS displays the coefficients of the final predictive model for the responses. The coefficients for predicting the centered and scaled responses based on the centered and scaled predictors are displayed, as well as the coefficients for predicting the raw responses based on the raw predictors.
PROC PLS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.
| ODS Table Name | Description | Statement | Option |
|---|---|---|---|
| CVResults | Results of cross validation | PROC | CV |
| CenScaleParms | Parameter estimates for centered and scaled data | MODEL | SOLUTION |
| CodedCoef | Coded coefficients | PROC | DETAILS |
| ParameterEstimates | Parameter estimates for raw data | MODEL | SOLUTION |
| PercentVariation | Variation accounted for by each factor | PROC | default |
| ResidualSummary | Residual summary from cross validation | PROC | CV |
| XEffectCenScale | Centering and scaling information for predictor effects | PROC | CENSCALE |
| XLoadings | Loadings for independents | PROC | DETAILS |
| XVariableCenScale | Centering and scaling information for predictor variables | PROC | CENSCALE and VARSCALE |
| XWeights | Weights for independents | PROC | DETAILS |
| YVariableCenScale | Centering and scaling information for responses | PROC | CENSCALE |
| YWeights | Weights for dependents | PROC | DETAILS |