Syntax | SAS/STAT 9.1 User's Guide (Vol. 5)

The following statements are available in PROC PLS. Items within the brackets <> are optional.

PROC PLS < options > ;
   BY variables ;
   CLASS variables < / option > ;
   MODEL dependent-variables = effects < / options > ;
   OUTPUT OUT= SAS-data-set < options > ;

To analyze a data set, you must use the PROC PLS and MODEL statements. You can use the other statements as needed.

PROC PLS Statement

PROC PLS < options > ;

You use the PROC PLS statement to invoke the PLS procedure and, optionally, to indicate the analysis data and method. The following options are available.

CENSCALE

lists the centering and scaling information for each response and predictor.

CV=ONE

CV=SPLIT < ( n ) >

CV=BLOCK < ( n ) >

CV=RANDOM < ( cv-random-opts ) >

CV=TESTSET ( SAS-data-set )

specifies the cross validation method to be used. By default, no cross validation is performed. The method CV=ONE requests one-at-a-time cross validation, CV=SPLIT requests that every nth observation be excluded, CV=BLOCK requests that n blocks of consecutive observations be excluded, CV=RANDOM requests that observations be excluded at random, and CV=TESTSET( SAS-data-set ) specifies a test set of observations to be used for validation (formally, this is called test set validation rather than cross validation). You can, optionally, specify n for CV=SPLIT and CV=BLOCK; the default is n = 7. You can also specify the following optional cv-random-opts in parentheses after the CV=RANDOM option:

NITER= n	specifies the number of random subsets to exclude. The default value is 10.
NTEST= n	specifies the number of observations in each random subset chosen for exclusion. The default value is one-tenth of the total number of observations.
SEED= n	specifies an integer used to start the pseudo-random number generator for selecting the random test set. If you don't specify a seed, or specify a value less than or equal to zero, the seed is by default generated from reading the time of day from the computer's clock.

CVTEST < ( cvtest-options ) >

specifies that van der Voet's (1994) randomization-based model comparison test be performed to test models with different numbers of extracted factors against the model that minimizes the predicted residual sum of squares; see the Cross Validation section on page 3384 for more information. You can also specify the following cvtest-options in parentheses after the CVTEST option:

PVAL= n	specifies the cut-off probability for declaring an insignificant difference. The default value is 0.10.
STAT= test-statistic	specifies the test statistic for the model comparison. You can specify either T2, for Hotelling's T² statistic, or PRESS, for the predicted residual sum of squares. The default value is T2.
NSAMP= n	specifies the number of randomizations to perform. The default value is 1000.
SEED= n	specifies the seed value for randomization generation (the clock time is used by default).
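
As a brief illustration of how the CV= and CVTEST options combine, the following sketch shows three possible invocations. The data sets WORK.SPECTRA and WORK.HOLDOUT, the variables Y1, Y2, and X1-X10, and the seed values are hypothetical; they are assumptions made for this example and do not come from this guide.

```sas
/* Hypothetical data: WORK.SPECTRA with responses y1-y2 and predictors x1-x10 */

/* CV=SPLIT(5): split-sample cross validation with n = 5, followed by    */
/* van der Voet's test using Hotelling's T2 and 1000 randomizations      */
proc pls data=work.spectra cv=split(5)
         cvtest(pval=0.10 stat=t2 nsamp=1000 seed=12345);
   model y1 y2 = x1-x10;
run;

/* CV=RANDOM: 20 random subsets of 5 observations each, with a fixed seed */
proc pls data=work.spectra cv=random(niter=20 ntest=5 seed=12345);
   model y1 y2 = x1-x10;
run;

/* CV=TESTSET: test set validation against a separate (hypothetical) data set */
proc pls data=work.spectra cv=testset(work.holdout);
   model y1 y2 = x1-x10;
run;
```

Specifying a positive SEED= value in both the cv-random-opts and the cvtest-options makes the random selections repeatable from run to run.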

DATA= SAS-data-set

names the SAS data set to be used by PROC PLS. The default is the most recently created data set.

DETAILS

lists the details of the fitted model for each successive factor. The details listed are different for different extraction methods; see the Displayed Output section on page 3387 for more information.

METHOD=PLS < ( PLS-options ) >

METHOD=SIMPLS

METHOD=PCR

METHOD=RRR

specifies the general factor extraction method to be used. The value PLS requests partial least squares, SIMPLS requests the SIMPLS method of de Jong (1993), PCR requests principal components regression, and RRR requests reduced rank regression. The default is METHOD=PLS. You can also specify the following optional PLS-options in parentheses after METHOD=PLS:

ALGORITHM=NIPALS | SVD | EIG | RLGW	names the specific algorithm used to compute extracted PLS factors. NIPALS requests the usual iterative NIPALS algorithm, SVD bases the extraction on the singular value decomposition of X′Y, EIG bases the extraction on the eigenvalue decomposition of Y′XX′Y, and RLGW is an iterative approach that is efficient when there are many predictors (Rännar et al. 1994). ALGORITHM=SVD is the most accurate but least efficient approach; the default is ALGORITHM=NIPALS.
MAXITER= n	specifies the maximum number of iterations for the NIPALS and RLGW algorithms. The default value is 200.
EPSILON= n	specifies the convergence criterion for the NIPALS and RLGW algorithms. The default value is 10⁻¹².
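
The following sketch shows how the METHOD= option and its PLS-options might be written in practice. The data set WORK.SPECTRA, the variable names, and the particular MAXITER= and EPSILON= values are hypothetical choices for illustration only.

```sas
/* Hypothetical data: WORK.SPECTRA with responses y1-y2 and predictors x1-x10 */

/* Default method (PLS) with the SVD-based extraction algorithm */
proc pls data=work.spectra method=pls(algorithm=svd);
   model y1 y2 = x1-x10;
run;

/* NIPALS with a larger iteration limit and a looser convergence criterion */
proc pls data=work.spectra
         method=pls(algorithm=nipals maxiter=500 epsilon=1e-10);
   model y1 y2 = x1-x10;
run;

/* Principal components regression; takes no parenthesized suboptions */
proc pls data=work.spectra method=pcr;
   model y1 y2 = x1-x10;
run;
```

MAXITER= and EPSILON= apply only to the iterative NIPALS and RLGW algorithms, so they are omitted from the ALGORITHM=SVD example.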

MISSING=NONE

MISSING=AVG

MISSING=EM < ( EM-options ) >

specifies how observations with missing values are to be handled in computing the fit. This option is experimental. The default is MISSING=NONE, for which observations with any missing variables (dependent or independent) are excluded from the analysis. MISSING=AVG specifies that the fit be computed by filling in missing values with the average of the nonmissing values for the corresponding variable. If you specify MISSING=EM, then the procedure first computes the model with MISSING=AVG, then fills in missing values by their predicted values based on that model and computes the model again. You can also specify the following optional EM-options in parentheses after MISSING=EM:

MAXITER= n	specifies the maximum number of iterations for the imputation/fit loop. The default value is 1. If you specify a large value of MAXITER=, then the loop will iterate until it converges (as controlled by the EPSILON= option).
EPSILON= n	specifies the convergence criterion for the imputation/fit loop. The default value is 10⁻⁸. This option is only effective if you specify a large value for the MAXITER= option.
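
As a sketch of the EM-options, the following call asks the imputation/fit loop to continue for up to 50 iterations or until the convergence criterion is met. The data set WORK.SPECTRA, the variable names, and the value 50 are hypothetical; what counts as a "large" MAXITER= value depends on your data.

```sas
/* Hypothetical data: WORK.SPECTRA, in which some x or y values are missing */
proc pls data=work.spectra missing=em(maxiter=50 epsilon=1e-8);
   model y1 y2 = x1-x10;
run;
```

With the default MAXITER=1, the same call would perform a single imputation and refit after the initial MISSING=AVG fit.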

NFAC= n

specifies the number of factors to extract. The default is min{15, p, N}, where p is the number of predictors (the number of dependent variables for METHOD=RRR) and N is the number of runs (observations). This is probably more than you need for most applications. Extracting too many factors can lead to an over-fit model, one that matches the training data too well, sacrificing predictive ability. Thus, if you use the default NFAC= specification, you should also either use the CV= option to select the appropriate number of factors for the final model or consider the analysis to be preliminary and examine the results to determine the appropriate number of factors for a subsequent analysis.
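
The two approaches described above can be sketched as follows. The data set WORK.SPECTRA, the variable names, and the final choice of three factors are hypothetical; in a real analysis the number of factors would come from examining the first run's results.

```sas
/* Hypothetical data: WORK.SPECTRA with responses y1-y2 and predictors x1-x10 */

/* Option 1: keep the default NFAC= and use cross validation to choose  */
/* the number of factors for the final model                            */
proc pls data=work.spectra cv=one;
   model y1 y2 = x1-x10;
run;

/* Option 2: treat a default run as preliminary, then refit with the    */
/* number of factors chosen from its results (3 is assumed here)        */
proc pls data=work.spectra;
   model y1 y2 = x1-x10;
run;

proc pls data=work.spectra nfac=3;
   model y1 y2 = x1-x10;
run;
```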

NOCENTER

suppresses centering of the responses and predictors before fitting. This is useful if the analysis variables are already centered and scaled. See the Centering and Scaling section on page 3386 for more information.

NOCVSTDIZE

suppresses re-centering and re-scaling of the responses and predictors before each model is fit in the cross validation. See the Centering and Scaling section on page 3386 for more information.

NOPRINT

suppresses the normal display of results. This is useful when you want only the output statistics saved in a data set. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, Using the Output Delivery System, for more information.

NOSCALE

suppresses scaling of the responses and predictors before fitting. This is useful if the analysis variables are already centered and scaled. See the Centering and Scaling section on page 3386 for more information.

VARSCALE

specifies that continuous model variables should be centered and scaled prior to centering and scaling the model effects in which they are involved. The rescaling specified by the VARSCALE option may be more appropriate if the model involves cross products between model variables; however, the VARSCALE option still may not produce the model you expect. See the Centering and Scaling section on page 3386 for more information.

VARSS

lists, in addition to the average response and predictor sum of squares accounted for by each successive factor, the amount of variation accounted for in each response and predictor.

BY Statement

BY variables ;

You can specify a BY statement with PROC PLS to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. The variables are one or more variables in the input data set.

If you specify more than one BY statement, the procedure uses only the latest BY statement and ignores any previous ones.

If your input data set is not sorted in ascending order, use one of the following alternatives:

Sort the data using the SORT procedure with a similar BY statement.
Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the PLS procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
Create an index on the BY variables using the DATASETS procedure (in base SAS software).

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
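
The first two alternatives might look like the following sketch. The data set WORK.SPECTRA, the BY variable BATCH, and the analysis variables are hypothetical names used only for illustration.

```sas
/* Alternative 1: sort by the BY variable, then analyze each group */
proc sort data=work.spectra out=work.spectra_sorted;
   by batch;
run;

proc pls data=work.spectra_sorted;
   by batch;
   model y1 y2 = x1-x10;
run;

/* Alternative 2: the data are already grouped by BATCH, but the groups */
/* are not in ascending order, so NOTSORTED is specified                */
proc pls data=work.spectra;
   by batch notsorted;
   model y1 y2 = x1-x10;
run;
```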

CLASS Statement

CLASS variables < / option > ;

The CLASS statement names the classification variables to be used in the analysis. If the CLASS statement is used, it must appear before the MODEL statement.

Classification variables can be either character or numeric. By default, class levels are determined from the entire formatted values of the CLASS variables. Note that this represents a slight change from previous releases in the way in which class levels are determined. In releases prior to Version 9, class levels were determined using no more than the first 16 characters of the formatted values. If you wish to revert to this previous behavior, you can use the TRUNCATE option on the CLASS statement. In any case, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

Any variable in the model that is not listed in the CLASS statement is assumed to be continuous. Continuous variables must be numeric.

You can specify the following option in the CLASS statement after a slash (/):

TRUNCATE

specifies that class levels should be determined using only up to the first 16 characters of the formatted values of CLASS variables. When formatted values are longer than 16 characters, you can use this option in order to revert to the levels as determined in releases previous to Version 9.
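
A minimal sketch of the CLASS statement with the TRUNCATE option follows. The data set WORK.SPECTRA and the variables SOLVENT (assumed to be a character classification variable), Y1, and X1-X10 are hypothetical.

```sas
/* CLASS must appear before MODEL. TRUNCATE requests that levels of   */
/* SOLVENT be determined from at most the first 16 characters of its  */
/* formatted values.                                                  */
proc pls data=work.spectra;
   class solvent / truncate;
   model y1 = solvent x1-x10;
run;
```

Omitting "/ truncate" gives the default behavior, in which levels are determined from the entire formatted values.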

MODEL Statement

MODEL response-variables = predictor-effects < / options > ;

The MODEL statement names the responses and the predictors, which determine the Y and X matrices of the model, respectively. Usually you simply list the names of the predictor variables as the model effects, but you can also use the effects notation of PROC GLM to specify polynomial effects and interactions; see the Specification of Effects section on page 1784 in Chapter 32, The GLM Procedure, for further details. The MODEL statement is required. You can specify only one MODEL statement (in contrast to the REG procedure, for example, which allows several MODEL statements in the same PROC REG run).

You can specify the following options in the MODEL statement after a slash (/).

INTERCEPT

By default, the responses and predictors are centered; thus, no intercept is required in the model. You can specify the INTERCEPT option to override the default.

SOLUTION

lists the coefficients of the final predictive model for the responses. The coefficients for predicting the centered and scaled responses based on the centered and scaled predictors are displayed, as well as the coefficients for predicting the raw responses based on the raw predictors.
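
The following sketch shows a MODEL statement that uses the PROC GLM effects notation for a quadratic term and a cross product, together with the SOLUTION option. The data set WORK.SPECTRA and the variables Y1, Y2, X1, and X2 are hypothetical.

```sas
/* Hypothetical data: responses y1-y2, continuous predictors x1 and x2 */
proc pls data=work.spectra;
   /* x1*x1 is a polynomial (quadratic) effect; x1*x2 is an interaction */
   model y1 y2 = x1 x2 x1*x1 x1*x2 / solution;
run;
```

Adding INTERCEPT after the slash would request an intercept term in the same way.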

OUTPUT Statement

OUTPUT OUT= SAS-data-set keyword = names < ...keyword = names > ;

You use the OUTPUT statement to specify a data set to receive quantities that can be computed for every input observation, such as extracted factors and predicted values. The following keywords are available:

PREDICTED	predicted values for responses
YRESIDUAL	residuals for responses
XRESIDUAL	residuals for predictors
XSCORE	extracted factors (X-scores, latent vectors, latent variables, T)
YSCORE	extracted responses (Y-scores, U)
STDY	standardized (centered and scaled) responses
STDX	standardized (centered and scaled) predictors
H	approximate leverage
PRESS	approximate predicted residuals
TSQUARE	scaled sum of squares of score values
STDXSSE	sum of squares of residuals for standardized predictors
STDYSSE	sum of squares of residuals for standardized responses

Suppose that there are N_x predictors and N_y responses and that the model has N_f selected factors.

The keywords XRESIDUAL and STDX define an output variable for each predictor, so N_x names are required after each one.
The keywords PREDICTED, YRESIDUAL, STDY, and PRESS define an output variable for each response, so N_y names are required after each of these keywords.
The keywords XSCORE and YSCORE specify an output variable for each selected model factor. For these keywords, you provide only one base name, and the variables corresponding to each successive factor are named by appending the factor number to the base name. For example, if N_f = 3 then a specification of XSCORE=T would produce the variables T1, T2, and T3.
Finally, the keywords H, TSQUARE, STDXSSE, and STDYSSE each specify a single output variable, so only one name is required after each of these keywords.
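
The naming rules above can be illustrated with the following sketch, which assumes N_x = 3 predictors, N_y = 2 responses, and N_f = 2 factors. The data set names WORK.SPECTRA and WORK.PLS_OUT and all variable names are hypothetical.

```sas
/* Hypothetical data: 2 responses (y1, y2) and 3 predictors (x1, x2, x3) */
proc pls data=work.spectra nfac=2;
   model y1 y2 = x1 x2 x3;
   output out=work.pls_out
          predicted=yhat1 yhat2        /* one name per response (N_y = 2)  */
          yresidual=yres1 yres2        /* one name per response            */
          xresidual=xres1 xres2 xres3  /* one name per predictor (N_x = 3) */
          xscore=t                     /* base name: yields t1 and t2      */
          yscore=u                     /* base name: yields u1 and u2      */
          h=leverage                   /* single output variable           */
          tsquare=tsq;                 /* single output variable           */
run;
```

Because no cross validation is requested here, the model is assumed to retain both requested factors, so the XSCORE= and YSCORE= base names each expand to two variables.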