Syntax | SAS/STAT 9.1 User's Guide (Vol. 6)

PROC TPSPLINE < option > ;
- MODEL dependents = < variables > (variables) < / options > ;
- SCORE data=SAS-data-set out=SAS-data-set ;
- OUTPUT < out=SAS-data-set > keyword < keyword > ;
- BY variables ;
- FREQ variable ;
- ID variables ;

The syntax in PROC TPSPLINE is similar to that of other regression procedures in the SAS System. The PROC TPSPLINE and MODEL statements are required. The SCORE statement can appear multiple times; all other statements appear only once.

The syntax for PROC TPSPLINE is described in the following sections in alphabetical order after the description of the PROC TPSPLINE statement.

PROC TPSPLINE Statement

PROC TPSPLINE < option > ;

The PROC TPSPLINE statement invokes the procedure. You can specify the following option.

DATA= SAS-data-set

specifies the SAS data set to be read by PROC TPSPLINE. The default value is the most recently created data set.
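As a minimal sketch of the two required statements, the following step builds a small simulated data set and fits a spline to it. The data set name work.measure, the variable names, and the simulated surface are illustrative assumptions, not part of this guide.

```sas
/* Hypothetical example data: y depends smoothly on x1 and x2. */
data work.measure;
   do i = -4 to 4;
      do j = -4 to 4;
         x1 = i / 4;
         x2 = j / 4;
         y  = x1**2 + sin(3*x2) + 0.1*rannor(12345);
         output;
      end;
   end;
   drop i j;
run;

/* DATA= names the input data set. If it were omitted, the most
   recently created data set would be read instead. */
proc tpspline data=work.measure;
   model y = (x1 x2);
run;
```

Naming the data set explicitly avoids surprises when an intermediate step has created another data set in between.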

BY Statement

BY variables ;

You can specify a BY statement with PROC TPSPLINE to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.
If your input data set is not sorted in ascending order, use one of the following alternatives:
- Sort the data using the SORT procedure with a similar BY statement.
- Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the TPSPLINE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
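The first alternative, sorting before the analysis, might look like the following sketch. The data set work.sites and the variables site, x, and y are hypothetical; the groups are deliberately created out of order so that the sort is needed.

```sas
/* Hypothetical data with two groups written in descending order. */
data work.sites;
   do site = 'B', 'A';
      do i = 0 to 20;
         x = i / 20;
         y = sin(6*x) + 0.1*rannor(1);
         output;
      end;
   end;
   drop i;
run;

/* Sort with the same BY variable that the analysis will use. */
proc sort data=work.sites;
   by site;
run;

/* One thin-plate smoothing spline fit per value of site. */
proc tpspline data=work.sites;
   by site;
   model y = (x);
run;
```

If the data were left in their original order, specifying `by descending site;` in the TPSPLINE step would be another way to handle this particular arrangement.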

FREQ Statement

FREQ variable ;

If one variable in your input data set represents the frequency of occurrence for other values in the observation, specify the variable's name in a FREQ statement. PROC TPSPLINE treats the data as if each observation appears n times, where n is the value of the FREQ variable for the observation. If the value of the FREQ variable is less than one, the observation is not used in the analysis. Only the integer portion of the value is used.
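The following sketch illustrates the rules above with a hypothetical frequency variable n. The data values are made up for illustration. The row with n = 2.7 is treated as two observations because only the integer portion is used, and the row with n = 0.5 is excluded because its value is less than one.

```sas
/* Hypothetical data: n records how many times each (x, y) pair occurred. */
data work.counts;
   input x y n;
   datalines;
0.0 0.10 1
0.1 0.35 3
0.2 0.58 2.7
0.3 0.80 1
0.4 0.93 0.5
0.5 1.02 4
0.6 0.95 1
0.7 0.78 2
0.8 0.61 1
0.9 0.33 1
1.0 0.12 2
;
run;

proc tpspline data=work.counts;
   freq n;          /* each row counts as int(n) observations */
   model y = (x);
run;
```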

ID Statement

ID variables ;

The variables in the ID statement are copied from the input data set to the OUT= data set. If you omit the ID statement, only the variables used in the MODEL statement and requested statistics are included in the output data set.

MODEL Statement

MODEL dependents = < regression variables > (smoothing variables) < /options > ;

The MODEL statement specifies the dependent variables, the independent regression variables, which are listed with no parentheses, and the independent smoothing variables, which are listed inside parentheses.

The regression variables are optional. At least one smoothing variable is required, and it must be listed after the regression variables. No variables can be listed in both the regression variable list and the smoothing variable list.

If you specify more than one dependent variable, PROC TPSPLINE calculates a thin-plate smoothing spline estimate for each dependent variable, using the regression variables and smoothing variables specified on the right-hand side.

If you specify regression variables, PROC TPSPLINE fits a semiparametric model using the regression variables as the linear part of the model.
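A semiparametric fit can be sketched as follows, with one regression variable listed outside the parentheses and two smoothing variables listed inside them. The data set work.semi, the variable names, and the simulated relationship are assumptions made for illustration.

```sas
/* Hypothetical data: z enters linearly, x1 and x2 enter smoothly. */
data work.semi;
   do z = 0, 1;
      do i = 0 to 10;
         do j = 0 to 10;
            x1 = i / 10;
            x2 = j / 10;
            y  = 2*z + sin(4*x1) + x2**2 + 0.1*rannor(2024);
            output;
         end;
      end;
   end;
   drop i j;
run;

proc tpspline data=work.semi;
   /* z is the linear (regression) part; (x1 x2) is the smooth part. */
   model y = z (x1 x2);
run;
```

Listing two dependent variables on the left-hand side, for example `model y1 y2 = z (x1 x2);`, would request a separate estimate for each of them using the same right-hand side.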

You can specify the following options in the MODEL statement.

ALPHA= number

specifies the significance level α of the confidence limits on the final thin-plate smoothing spline estimate when you request confidence limits to be included in the output data set. Specify number as a value between 0 and 1. The default value is 0.05. See the 'OUTPUT Statement' section for more information on the OUTPUT statement.

DF= number

specifies the degrees of freedom of the thin-plate smoothing spline estimate, defined as

$$\mathrm{df} = \mathrm{tr}(A(\lambda))$$

where A(λ) is the hat matrix. Specify number as a value between zero and the number of unique design points.
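As a brief sketch, the DF= option could be used as follows to request a fit with a chosen number of degrees of freedom instead of the default fit. The data set work.curve and the value 6 are illustrative assumptions; the data contain 51 unique design points, so 6 lies within the permitted range.

```sas
/* Hypothetical data: a noisy curve observed at 51 distinct x values. */
data work.curve;
   do i = 0 to 50;
      x = i / 50;
      y = sin(6*x) + 0.2*rannor(7);
      output;
   end;
   drop i;
run;

proc tpspline data=work.curve;
   model y = (x) / df=6;   /* target degrees of freedom for the fit */
run;
```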

DISTANCE= number

D= number

defines a range such that if two data points (x_i, z_i) and (x_j, z_j) satisfy

$$\lVert x_i - x_j \rVert_\infty \le D/2$$

then these data points are treated as replicates, where x_i are the smoothing variables, z_i are the regression variables, and D is the DISTANCE= value.
You can use the DISTANCE= option to reduce the number of unique design points by treating nearby data as replicates. This can be useful when you have a large data set. The default value is 0.

LAMBDA0= number

specifies the smoothing parameter, λ, to be used in the thin-plate smoothing spline estimate. By default, PROC TPSPLINE uses the λ parameter that minimizes the GCV function for the final fit. The LAMBDA0= value must be positive.

LAMBDA= list-of-values

specifies a set of values for the λ parameter. PROC TPSPLINE returns a GCV value for each λ point that you specify. You can use the LAMBDA= option to study the GCV function curve for a set of values for λ. All values listed in the LAMBDA= option must be positive.

LOGNLAMBDA0= number

LOGNL0= number

specifies the smoothing parameter λ on the log₁₀(nλ) scale. If you specify both the LOGNL0= and LAMBDA0= options, only the value provided by the LOGNL0= option is used. By default, PROC TPSPLINE uses the λ parameter that minimizes the GCV function for the estimate.

LOGNLAMBDA= list-of-values

LOGNL= list-of-values

specifies a set of values for the λ parameter on the log₁₀(nλ) scale. PROC TPSPLINE returns a GCV value for each λ point that you specify. You can use the LOGNLAMBDA= option to study the GCV function curve for a set of λ values. If you specify both the LOGNL= and LAMBDA= options, only the list of values provided by the LOGNL= option is used.
In some cases, the LOGNL= option may be preferred over the LAMBDA= option. Because the LAMBDA= value must be positive, a small change in that value can result in a major change in the GCV value. If you instead specify λ on the log₁₀ scale, the allowable range is enlarged to include negative values. Thus, the GCV function is less sensitive to changes in LOGNLAMBDA.
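The sketch below first converts a single λ value to the log₁₀(nλ) scale and then requests GCV values over a grid on that scale. It takes n to be the number of observations, which is an assumption here because this section does not define n. The data set work.curve, the value of λ, and the grid limits are likewise illustrative.

```sas
/* Hypothetical data: 51 observations of a noisy curve. */
data work.curve;
   do i = 0 to 50;
      x = i / 50;
      y = sin(6*x) + 0.2*rannor(7);
      output;
   end;
   drop i;
run;

/* Convert one lambda value to the log10(n*lambda) scale.
   n = 51 matches the data step above (assumed meaning of n). */
data _null_;
   n = 51;
   lambda = 0.0001;
   lognlambda = log10(n * lambda);
   put lambda= lognlambda=;
run;

/* Evaluate the GCV function on an evenly spaced grid of
   log10(n*lambda) values, including negative ones. */
proc tpspline data=work.curve;
   model y = (x) / lognlambda=(-6 to 0 by 0.5);
run;
```

An evenly spaced grid on this scale covers several orders of magnitude of λ, which is usually the range of interest when inspecting the GCV curve.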

M= number

specifies the order of the derivative in the penalty term. The M= value must be a positive integer. The default value is max(2, INT(d/2) + 1), where d is the number of smoothing variables.
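The default rule can be tabulated with a short DATA step, shown here as a sketch. For d = 1, 2, and 3 the formula gives 2; for d = 4 and 5 it gives 3; for d = 6 it gives 4. The second step overrides the default with M=3 for a hypothetical data set work.surface with two smoothing variables.

```sas
/* Tabulate the default order max(2, int(d/2) + 1) for d = 1..6. */
data _null_;
   do d = 1 to 6;
      m = max(2, int(d/2) + 1);
      put d= m=;
   end;
run;

/* Hypothetical data with d = 2 smoothing variables. */
data work.surface;
   do i = -4 to 4;
      do j = -4 to 4;
         x1 = i / 4;
         x2 = j / 4;
         y  = x1*x2 + cos(2*x1) + 0.1*rannor(99);
         output;
      end;
   end;
   drop i j;
run;

proc tpspline data=work.surface;
   model y = (x1 x2) / m=3;   /* default would be 2 when d = 2 */
run;
```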

SCORE Statement

SCORE DATA= SAS-data-set OUT= SAS-data-set ;

The SCORE statement calculates predicted values for a new data set. If you have multiple data sets to predict, you can specify multiple SCORE statements. You must use a SCORE statement for each data set.

The following keywords must be specified in the SCORE statement.

DATA= SAS-data-set

specifies the input SAS data set containing the smoothing variables x and regression variables z. The predicted response (y) value is computed for each (x, z) pair. The data set must include all independent variables specified in the MODEL statement.

OUT= SAS-data-set

specifies the name of the SAS data set to contain the predictions.
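A sketch of scoring a new data set follows. The training data work.train, the scoring data work.new, and the output name work.newpred are hypothetical. The scoring data set contains the smoothing variable x named in the MODEL statement and no response.

```sas
/* Hypothetical training data. */
data work.train;
   do i = 0 to 50;
      x = i / 50;
      y = sin(6*x) + 0.2*rannor(7);
      output;
   end;
   drop i;
run;

/* Hypothetical new points at which predictions are wanted. */
data work.new;
   do i = 0 to 10;
      x = 0.05 + 0.09*i;
      output;
   end;
   drop i;
run;

proc tpspline data=work.train;
   model y = (x);
   score data=work.new out=work.newpred;
run;

proc print data=work.newpred;
run;
```

To score a second data set in the same run, add another SCORE statement with its own DATA= and OUT= values.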

OUTPUT Statement

OUTPUT OUT= SAS-data-set < keyword keyword > ;

The OUTPUT statement creates a new SAS data set containing diagnostic measures calculated after fitting the model.

You can request a variety of diagnostic measures that are calculated for each observation in the data set. The new data set contains the variables specified in the MODEL statement in addition to the requested variables. If no keyword is present, the data set contains only the predicted values.

Details on the specifications in the OUTPUT statement are as follows.

OUT= SAS-data-set

specifies the name of the new data set to contain the diagnostic measures. This specification is required.

keyword

specifies the statistics to include in the output data set. The names of the new variables that contain the statistics are formed by using a prefix of one or more characters that identify the statistic, followed by an underscore (_), followed by the dependent variable name.
For example, suppose that you have two dependent variables, say y1 and y2, and you specify the keywords PRED, ADIAG, and UCLM. The output SAS data set will contain the following variables:

P_y1 and P_y2
ADIAG_y1 and ADIAG_y2
UCLM_y1 and UCLM_y2

The keywords and the statistics they represent are as follows:

RESID | R	residual values, calculated as ACTUAL - PREDICTED
PRED	predicted values
STD	standard error of the mean predicted value
UCLM	upper limit of the confidence interval for the expected value of the dependent variables. By default, PROC TPSPLINE computes 95% confidence limits.
LCLM	lower limit of the confidence interval for the expected value of the dependent variables. By default, PROC TPSPLINE computes 95% confidence limits.
ADIAG	diagonal element of the hat matrix associated with the observation
COEF	coefficients arranged in the order (θ_0, θ_1, ..., θ_d, δ_1, ..., δ_nUnique), where nUnique is the number of unique data points. This option can be used only when the model has a single dependent variable.
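The naming rule for output variables can be checked with a sketch like the following, which mirrors the two-response example above and also copies an ID variable into the output data set. The data set names work.two and work.diag and the variable obs are hypothetical.

```sas
/* Hypothetical data with two dependent variables and an ID variable. */
data work.two;
   do obs = 1 to 41;
      x  = (obs - 1) / 40;
      y1 = sin(6*x) + 0.1*rannor(11);
      y2 = cos(4*x) + 0.1*rannor(11);
      output;
   end;
run;

proc tpspline data=work.two;
   id obs;                               /* copied to the OUT= data set */
   model y1 y2 = (x);
   output out=work.diag pred adiag uclm; /* one variable per keyword and response */
run;

/* List the variables that were created in the output data set. */
proc contents data=work.diag;
run;
```

According to the naming rule described above, the listing should include P_y1, P_y2, ADIAG_y1, ADIAG_y2, UCLM_y1, and UCLM_y2 alongside the MODEL and ID variables.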