Chapter 74: The TPSPLINE Procedure


Overview

The TPSPLINE procedure uses the penalized least squares method to fit a nonparametric regression model. It computes thin-plate smoothing splines to approximate smooth multivariate functions observed with noise. The TPSPLINE procedure allows great flexibility in the possible form of the regression surface. In particular, PROC TPSPLINE makes no assumptions of a parametric form for the model. The generalized cross validation (GCV) function may be used to select the amount of smoothing.

The TPSPLINE procedure complements the methods provided by the standard SAS regression procedures such as the GLM, REG and NLIN procedures. These procedures can handle most situations in which you specify the regression model and the model is known up to a fixed number of parameters. However, when you have no prior knowledge about the model, or when you know that the data cannot be represented by a model with a fixed number of parameters, you can use the TPSPLINE procedure to model the data.

The TPSPLINE procedure uses the penalized least squares method to fit the data with a flexible model in which the number of effective parameters can be as large as the number of unique design points. Hence, as the sample size increases, the model space increases as well, enabling the thin-plate smoothing spline to fit more complicated situations.

The main features of the TPSPLINE procedure are as follows (a minimal usage sketch follows the list):

  • provides penalized least squares estimates

  • supports the use of multidimensional data

  • supports multiple SCORE statements

  • fits both semiparametric models and nonparametric models

  • provides options for handling large data sets

  • supports multiple dependent variables

  • enables you to choose a particular model by specifying the model degrees of freedom or smoothing parameter
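
For example, the following minimal sketch fits a thin-plate smoothing spline of y on the smoothing variables x1 and x2 and then scores a second data set; the data set names Measure and Grid and the variable names are hypothetical. With no degrees-of-freedom or smoothing-parameter option specified, the smoothing parameter is selected by minimizing the GCV function:

   proc tpspline data=Measure;
      model y = (x1 x2);        /* smoothing variables go in parentheses */
      score data=Grid out=Pred; /* evaluate the fitted surface at new points */
   run;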

The Penalized Least Squares Estimate

Penalized least squares estimates provide a way to balance fitting the data closely against avoiding excessive roughness or rapid variation. A penalized least squares estimate is a surface that minimizes the penalized least squares criterion over the class of all surfaces that satisfy sufficient regularity conditions.

Define x_i as a d-dimensional covariate vector, z_i as a p-dimensional covariate vector, and y_i as the observation associated with (x_i, z_i). Assuming that the relation between z_i and y_i is linear but the relation between x_i and y_i is unknown, you can fit the data using a semiparametric model as follows:

$$ y_i = f(x_i) + z_i\beta + \epsilon_i, \qquad i = 1, 2, \ldots, n $$

where f is an unknown function that is assumed to be reasonably smooth, ε_i, i = 1, ..., n, are independent, zero-mean random errors, and β is a p-dimensional unknown parameter vector.

This model consists of two parts. The z_iβ term is the parametric part of the model, and the z_i are the regression variables. The f(x_i) term is the nonparametric part of the model, and the x_i are the smoothing variables.
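
In the MODEL statement of PROC TPSPLINE, smoothing variables are enclosed in parentheses and regression variables are listed outside them. The following sketch of a semiparametric fit uses hypothetical data set and variable names:

   proc tpspline data=Measure;
      /* z1 enters the model linearly (parametric part), while a
         smooth surface is fit jointly over x1 and x2 (nonparametric part) */
      model y = z1 (x1 x2);
   run;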

The ordinary least squares method estimates f(x_i) and β by minimizing the quantity:

$$ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i) - z_i\beta\bigr)^2 $$

However, the functional space of f(x) is so large that you can always find a function f that interpolates the data points. In order to obtain an estimate that fits the data well and has some degree of smoothness, you can use the penalized least squares method.

The penalized least squares function is defined as

$$ S_\lambda(f) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i) - z_i\beta\bigr)^2 + \lambda J_2(f) $$

where J_2(f) is the penalty on the roughness of f, defined in most cases as the integral of the square of the second derivative of f.

The first term measures the goodness of fit, and the second term measures the smoothness associated with f. The λ term is the smoothing parameter, which governs the tradeoff between smoothness and goodness of fit. When λ is large, rougher fits are penalized more heavily. Conversely, a small value of λ puts more emphasis on the goodness of fit.
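
If you prefer to choose the model yourself rather than rely on automatic selection, the MODEL statement accepts options that fix the fit directly; in the sketch below (hypothetical data set and values), DF= specifies the model degrees of freedom and LOGNLAMBDA= specifies the smoothing parameter on the log10(nλ) scale:

   proc tpspline data=Measure;
      model y = (x1 x2) / df=5;           /* fix the effective degrees of freedom */
   run;

   proc tpspline data=Measure;
      model y = (x1 x2) / lognlambda=-2;  /* fix log10(n*lambda) directly */
   run;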

The estimate f_λ is selected from a reproducing kernel Hilbert space, and it can be represented as a linear combination of a sequence of basis functions. Hence, the final estimate of f can be written as

$$ f_\lambda(x) = \theta_0 + \sum_{j=1}^{d}\theta_j x_j + \sum_{j=1}^{n}\delta_j B_j(x) $$

where B_j is the basis function, which depends on where the data point x_j is located, and θ = (θ_0, θ_1, ..., θ_d) and δ = (δ_1, ..., δ_n) are the coefficients that need to be estimated.

For a fixed λ, the coefficients (θ, δ, β) can be estimated by solving an n × n system.

The smoothing parameter can be chosen by minimizing the generalized cross validation (GCV) function.

If you write

$$ \hat{y} = A(\lambda)\,y $$

then A(λ) is referred to as the hat or smoothing matrix, and the GCV function V(λ) is defined as

$$ V(\lambda) = \frac{\tfrac{1}{n}\,\lVert (I - A(\lambda))\,y \rVert^{2}}{\bigl[\tfrac{1}{n}\,\mathrm{tr}\bigl(I - A(\lambda)\bigr)\bigr]^{2}} $$

PROC TPSPLINE with Large Data Sets

The calculation of the penalized least squares estimate is computationally intensive. The amount of memory and CPU time needed for the analysis depend on the number of unique design points, which corresponds to the number of unknown parameters to be estimated.

You can specify the D=value option in the MODEL statement to reduce the number of unknown parameters. This option groups together design points that lie within the specified range (see the D= option on page 4509).

PROC TPSPLINE selects one design point from each group and treats all observations in the group as replicates of that design point. Calculation of the thin-plate smoothing spline estimates is then based on the reprocessed data. Which design point is chosen from a group depends on the order of the data; therefore, different orders of input data may result in different estimates.

This option, by combining several design points into one, reduces the number of unique design points, thereby approximating the original data. The D= value you specify determines the width of the range used to group the data.
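
For instance, the following sketch (hypothetical data set name and D= value) treats design points that lie within 0.05 of one another as replicates, so the spline is fit to a much smaller set of unique design points:

   proc tpspline data=Large;
      model y = (x1 x2) / d=0.05;  /* group design points within a range of 0.05 */
   run;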



