Details


Nonparametric Regression

Nonparametric regression relaxes the usual assumption of linearity and enables you to explore the data more flexibly, uncovering structure that might otherwise be missed.

However, many forms of nonparametric regression do not perform well when the number of independent variables in the model is large. The sparseness of data in this setting causes the variances of the estimates to be unacceptably large unless the sample size is extremely large. The problem of rapidly increasing variance for increasing dimensionality is sometimes referred to as the curse of dimensionality. Interpretability is another problem with nonparametric regression based on kernel and smoothing spline estimates. The information these estimates contain about the relationship between the dependent and independent variables is often difficult to comprehend.

To overcome these difficulties, Stone (1985) proposed additive models. These models estimate an additive approximation to the multivariate regression function. The benefits of an additive approximation are at least twofold. First, since each of the individual additive terms is estimated using a univariate smoother, the curse of dimensionality is avoided, at the cost of not being able to approximate arbitrary (non-additive) regression surfaces. Second, estimates of the individual terms explain how the dependent variable changes with the corresponding independent variables.

To extend the additive model to a wide range of distribution families, Hastie and Tibshirani (1990) proposed generalized additive models. These models enable the mean of the dependent variable to depend on an additive predictor through a nonlinear link function. The models permit the response probability distribution to be any member of the exponential family of distributions. Many widely used statistical models belong to this general class; they include additive models for Gaussian data, nonparametric logistic models for binary data, and nonparametric log-linear models for Poisson data.

Additive Models and Generalized Additive Models

This section describes the methodology and the fitting procedure behind generalized additive models.

Let Y be a response random variable and X_1, X_2, ..., X_p be a set of predictor variables. A regression procedure can be viewed as a method for estimating the expected value of Y given the values of X_1, X_2, ..., X_p. The standard linear regression model assumes a linear form for the conditional expectation

  E(Y | X_1, ..., X_p) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p

Given a sample, estimates of β_0, β_1, ..., β_p are usually obtained by the least squares method.

The additive model generalizes the linear model by modeling the conditional expectation as

  E(Y | X_1, ..., X_p) = s_0 + s_1(X_1) + s_2(X_2) + ... + s_p(X_p)

where s_i(X_i), i = 1, 2, ..., p, are smooth functions.

In order to be estimable, the smooth functions s_i have to satisfy standardized conditions such as E[s_j(X_j)] = 0. These functions are not given a parametric form but instead are estimated in a nonparametric fashion.

While traditional linear models and additive models can be used in most statistical data analysis, there are types of problems for which they are not appropriate. For example, the normal distribution may not be adequate for modeling discrete responses such as counts or bounded responses such as proportions.

Generalized additive models address these difficulties, extending additive models to many other distributions besides just the normal. Thus, generalized additive models can be applied to a much wider range of data analysis problems.

Similar to generalized linear models, generalized additive models consist of a random component, an additive component, and a link function relating the two components. The response Y, the random component, is assumed to have an exponential family density

  f_Y(y; θ, φ) = exp{ (yθ − b(θ)) / a(φ) + c(y, φ) }

where θ is called the natural parameter and φ is the scale parameter. The mean of the response variable μ is related to the set of covariates X_1, X_2, ..., X_p by a link function g. The quantity

  η = s_0 + s_1(X_1) + ... + s_p(X_p)

defines the additive component, where s_1(·), ..., s_p(·) are smooth functions, and the relationship between μ and η is defined by g(μ) = η. The most commonly used link function is the canonical link, for which η = θ.

Generalized additive models and generalized linear models can be applied in similar situations, but they serve different analytic purposes. Generalized linear models emphasize estimation and inference for the parameters of the model, while generalized additive models focus on exploring the data nonparametrically and on visualizing the relationship between the dependent variable and the independent variables.

Backfitting and Local Scoring Algorithms

Much of the development and notation in this section follows Hastie and Tibshirani (1986). Consider the estimation of the smoothing terms s_0, s_1(·), ..., s_p(·) in the additive model

  η(X) = s_0 + ∑_{j=1}^p s_j(X_j)

where E[s_j(X_j)] = 0 for every j. Since the algorithm for additive models is the basis for fitting generalized additive models, the algorithm for additive models is discussed first.

Many ways are available to approach the formulation and estimation of additive models. The backfitting algorithm is a general algorithm that can fit an additive model using any regression-type fitting mechanisms.

Define the jth set of partial residuals as

  R_j = Y − s_0 − ∑_{k≠j} s_k(X_k)

then E(R_j | X_j) = s_j(X_j). This observation provides a way to estimate each smoothing function s_j(·) given estimates {ŝ_i(·), i ≠ j} for all the others. The resulting iterative procedure is known as the backfitting algorithm (Friedman and Stuetzle 1981). The following formulation is taken from Hastie and Tibshirani (1986).

The Backfitting Algorithm

The unweighted form of the backfitting algorithm is as follows:

  1. Initialization:

    s_0 = E(y); s_1^(0) = s_2^(0) = ... = s_p^(0) = 0; m = 0
  2. Iterate:

    m = m + 1

    for j = 1 to p do:

    s_j^(m) = Smooth_j[ y − s_0 − ∑_{k<j} s_k^(m)(X_k) − ∑_{k>j} s_k^(m−1)(X_k) ]
  3. Until:

    RSS = (1/n) ∑_{i=1}^n ( y_i − s_0 − ∑_{j=1}^p s_j^(m)(x_ij) )² fails to decrease, or satisfies the convergence criterion.

In the preceding notation, s_j^(m)(·) denotes the estimate of s_j(·) at the mth iteration, and Smooth_j[·] denotes the univariate smoother for the jth variable applied to the jth partial residuals. It can be shown that with many smoothers (including linear regression, univariate and bivariate splines, and combinations of these), RSS never increases at any step. This implies that the algorithm always converges (Hastie and Tibshirani 1986). Note, however, that for distributions other than Gaussian, numerical instabilities with weights may cause convergence problems. Even when the algorithm converges, the individual functions need not be unique, since dependence among the covariates can lead to more than one representation for the same fitted surface.

A weighted backfitting algorithm has the same form as the unweighted case, except that the smoothers are weighted. In PROC GAM, weights are used with non-Gaussian data in the local scoring procedure described later in this section.
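To make the backfitting iteration concrete, the following PROC IML program sketches the unweighted algorithm for two predictors. It is an illustration only, not the GAM procedure's internal code: the data set work.sample and its variables are hypothetical, and a centered quadratic least squares fit stands in for the univariate smoother Smooth_j.

  proc iml;
     /* Illustrative unweighted backfitting; WORK.SAMPLE and its
        variables are hypothetical, and a quadratic least squares
        fit stands in for the univariate smoother.                */
     use work.sample;
     read all var {x1 x2} into X;
     read all var {y} into y;
     close work.sample;
     n = nrow(X);  p = ncol(X);

     start smooth(x, r);                    /* centered quadratic fit  */
        B = j(nrow(x), 1, 1) || x || x##2;  /* basis: 1, x, x**2       */
        s = B * solve(B` * B, B` * r);
        return(s - s[:]);                   /* center so E[s_j(X_j)]=0 */
     finish;

     s0 = y[:];                             /* 1. initialization       */
     S  = j(n, p, 0);                       /* columns hold s_j(x_ij)  */
     crit = 1;  m = 0;
     do while (crit > 1e-8 & m < 50);       /* 2. iterate              */
        m = m + 1;
        Sold = S;
        do j = 1 to p;
           r = y - s0 - S[,+] + S[,j];      /* j-th partial residual   */
           S[,j] = smooth(X[,j], r);
        end;
        crit = ssq(S - Sold) / (1 + ssq(Sold));
     end;                                   /* 3. until converged      */
     rss = ssq(y - s0 - S[,+]) / n;
     print m crit rss;
  quit;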

The GAM procedure uses the following condition as the convergence criterion for the backfitting algorithm:

  ∑_{j=1}^p ∑_{i=1}^n ( s_j^(m−1)(x_ij) − s_j^(m)(x_ij) )² / ( 1 + ∑_{j=1}^p ∑_{i=1}^n s_j^(m−1)(x_ij)² ) ≤ ε

where ε = 10^−8 by default; you can change this with the EPSILON= option on the MODEL statement.
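For example, the following statements (with a hypothetical data set and variables) loosen the backfitting criterion to ε = 10^−6:

  proc gam data=work.sample;
     model y = spline(x1) spline(x2) / epsilon=1e-6;
  run;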

The algorithm described so far fits only additive models. The algorithm for generalized additive models is a little more complicated. Generalized additive models extend generalized linear models in the same manner that additive models extend linear regression models, that is, by replacing the linear form α + ∑_j X_j β_j with the additive form α + ∑_j f_j(X_j). Refer to Generalized Linear Models Theory in Chapter 31, The GENMOD Procedure, for more information.

PROC GAM fits generalized additive models using a modified form of adjusted dependent variable regression, as described for generalized linear models in McCullagh and Nelder (1989), with the additive predictor taking the role of the linear predictor. Hastie and Tibshirani (1986) call this the local scoring algorithm. Important components of this algorithm depend on the link function for each distribution, as shown in the following table.

Distribution        Link       Adjusted Dependent (z)        Weights (w)
Normal              identity   y                             1
Bin(n, μ)           logit      η + (y − μ)/[nμ(1 − μ)]       nμ(1 − μ)
Gamma               −1/μ       η + (y − μ)/μ²                μ²
Poisson             log        η + (y − μ)/μ                 μ
Inverse Gaussian    1/μ²       η − 2(y − μ)/μ³               μ³/4
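Each row of this table follows from the usual iteratively reweighted least squares construction described in McCullagh and Nelder (1989): writing V(μ) for the variance function of the response distribution, the adjusted dependent variable is z = η + (y − μ)(dη/dμ) and the weight is w = (dμ/dη)²/V(μ), with all quantities evaluated at the current iterate. For the Poisson row, for example, η = log(μ) gives dη/dμ = 1/μ and V(μ) = μ, so that z = η + (y − μ)/μ and w = μ.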

Once the distribution and hence these quantities are defined, the local scoring algorithm proceeds as follows:

The General Local Scoring Algorithm

  1. Initialization:

    s_0 = g(E(y)); s_1^(0) = s_2^(0) = ... = s_p^(0) = 0; m = 0
  2. Iterate:

    m = m + 1

    Form the predictor η_i, mean μ_i, weights w_i, and adjusted dependent variable z_i based on the previous iteration:

    η_i^(m−1) = s_0 + ∑_{j=1}^p s_j^(m−1)(x_ij)

    μ_i^(m−1) = g^(−1)( η_i^(m−1) )

    z_i = η_i^(m−1) + ( y_i − μ_i^(m−1) ) (∂η/∂μ)_i

    w_i = (∂μ/∂η)_i² / V_i

    where V_i is the variance of Y and the derivatives are evaluated at μ_i^(m−1).

    Fit an additive model to z using the backfitting algorithm with weights w to obtain estimated functions s_j^(m)(·).

  3. Until:

    The convergence criterion is satisfied or the deviance fails to decrease. The deviance is an extension to generalized models of the RSS; refer to Goodness of Fit in Chapter 31, The GENMOD Procedure, for a definition.

The GAM procedure uses the following condition as the convergence criterion for local scoring:

  ∑_{j=1}^p ∑_{i=1}^n w_i ( s_j^(m−1)(x_ij) − s_j^(m)(x_ij) )² / ( 1 + ∑_{j=1}^p ∑_{i=1}^n w_i s_j^(m−1)(x_ij)² ) ≤ ε_s

where ε_s = 10^−8 by default; you can change this with the EPSSCORE= option on the MODEL statement.

The estimating procedure for generalized additive models consists of two loops. Inside each step of the local scoring algorithm (outer loop), a weighted backfitting algorithm (inner loop) is used until convergence or until the RSS fails to decrease. Then, based on the estimates from this weighted backfitting algorithm, a new set of weights is calculated and the next iteration of the scoring algorithm starts. The scoring algorithm stops when the convergence criterion is satisfied or the deviance of the estimates ceases to decrease.
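To illustrate the outer loop, the following PROC IML sketch performs local scoring for binary data with the logit link. Again, this is an illustration only, not the GAM procedure's internal code: the data set work.sample is hypothetical, and a single weighted quadratic least squares fit stands in for the inner weighted backfitting loop.

  proc iml;
     /* Illustrative local scoring for binary y (logit link);
        WORK.SAMPLE is hypothetical, and a weighted quadratic
        fit stands in for weighted backfitting.               */
     use work.sample;
     read all var {x} into x;
     read all var {y} into y;
     close work.sample;
     B = j(nrow(x), 1, 1) || x || x##2;   /* basis for the stand-in fit */
     eta = j(nrow(x), 1, 0);              /* initial additive predictor */
     crit = 1;
     do m = 1 to 25 while (crit > 1e-8);
        mu = 1 / (1 + exp(-eta));         /* inverse logit link          */
        w  = mu # (1 - mu);               /* weights for Bin(1, mu)      */
        z  = eta + (y - mu) / w;          /* adjusted dependent variable */
        /* weighted least squares step (stands in for backfitting) */
        b = solve(B` * (w # B), B` * (w # z));
        etaNew = B * b;
        crit = ssq(etaNew - eta) / (1 + ssq(eta));
        eta = etaNew;
     end;
     mu  = 1 / (1 + exp(-eta));           /* final fitted means          */
     dev = -2 * sum(y # log(mu) + (1 - y) # log(1 - mu));
     print m crit dev;
  quit;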

Smoothers

A smoother is a tool for summarizing the trend of a response measurement Y as a function of one or more predictor measurements X_1, ..., X_p. It produces an estimate of the trend that is less variable than Y itself. An important property of a smoother is its nonparametric nature. It does not assume a rigid form for the dependence of Y on X_1, ..., X_p. This section gives a brief overview of the smoothers that can be used with the GAM procedure.

Cubic Smoothing Spline

A smoothing spline is the solution to the following optimization problem: among all functions η(x) with two continuous derivatives, find one that minimizes the penalized least squares criterion

  ∑_{i=1}^n ( y_i − η(x_i) )² + λ ∫_a^b ( η″(t) )² dt

where λ is a fixed constant and a ≤ x_1 ≤ ... ≤ x_n ≤ b. The first term measures closeness to the data while the second term penalizes curvature in the function. It can be shown that there exists an explicit, unique minimizer, and that minimizer is a natural cubic spline with knots at the unique values of x_i.

The value λ/(1 + λ) is the smoothing parameter. When λ is large, the smoothing parameter is close to 1, producing a smoother curve; small values of λ, corresponding to smoothing parameters near 0, are apt to produce rougher curves, more nearly interpolating the data.

Local Regression

Local regression was proposed by Cleveland, Devlin, and Grosse (1988). The idea of local regression is that at a predictor x, the regression function η(x) can be locally approximated by the value of a function in some specified parametric class. Such a local approximation is obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x. A weighted least squares algorithm is used to fit linear functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. The smoothing parameter for the local regression procedure, which controls the smoothness of the estimated curve, is the fraction of the data in each local neighborhood. Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood. Refer to Chapter 41, The LOESS Procedure, for more details.

Thin-Plate Smoothing Spline

The thin-plate smoothing spline is a multivariate version of the cubic smoothing spline. The theoretical foundations for the thin-plate smoothing spline are described in Duchon (1976, 1977) and Meinguet (1979). Further results and applications are given in Wahba and Wendelberger (1980). Refer to Chapter 74, The TPSPLINE Procedure, for more details.

Selection of Smoothing Parameters

CV and GCV

The smoothers discussed here have a single smoothing parameter. In choosing the smoothing parameter, cross validation can be used. Cross validation works by leaving points (x_i, y_i) out one at a time, estimating the squared residual for the smooth function at x_i based on the remaining n − 1 data points, and choosing the smoother to minimize the sum of these squared residuals. This mimics the use of training and test samples for prediction. The cross validation function is defined as

  CV(λ) = (1/n) ∑_{i=1}^n ( y_i − η̂_λ^(−i)(x_i) )²

where η̂_λ^(−i)(x_i) indicates the fit at x_i, computed by leaving out the ith data point. The quantity n·CV(λ) is sometimes called the prediction sum of squares, or PRESS (Allen 1974).

All of the smoothers fit by the GAM procedure can be formulated as a linear combination of the sample responses

  ŷ = A(λ) y

for some matrix A(λ), which depends on λ. (The matrix A(λ) depends on x and the sample data as well, but this dependence is suppressed in the preceding equation.) Let a_ii be the diagonal elements of A(λ). Then the CV function can be expressed as

  CV(λ) = (1/n) ∑_{i=1}^n ( (y_i − ŷ_i) / (1 − a_ii) )²

In most cases, it is very time consuming to compute the quantities a_ii. To solve this computational problem, Wahba (1990) proposed the generalized cross validation function (GCV), which can be used to solve a wide variety of problems involving selection of a parameter to minimize the prediction risk.

The GCV function is defined as

  GCV(λ) = (1/n) ∑_{i=1}^n ( y_i − ŷ_i )² / ( 1 − tr(A(λ))/n )²

The GCV formula simply replaces each a_ii with tr(A(λ))/n. Therefore, it can be viewed as a weighted version of CV. In most cases of interest, GCV is closely related to CV but much easier to compute. Specify the METHOD=GCV option on the MODEL statement in order to use the GCV function to choose the smoothing parameters.
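For example, the following statements (with a hypothetical data set and variables) choose the smoothing parameters for both spline terms by minimizing the GCV function:

  proc gam data=work.sample;
     model y = spline(x1) spline(x2) / method=gcv;
  run;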

Degrees of Freedom

The estimated GAM model can be expressed as

  η̂ = ŝ_0 + ∑_{i=1}^p A_i y

Because the weights are calculated based on the previous iteration during the local scoring iterations, the matrices A_i may depend on Y for non-Gaussian data. However, for the final iteration, the A_i matrix for the spline smoothers has the same role as the projection matrix in linear regression; therefore, nonparametric degrees of freedom (DF) for the ith spline smoother can be defined as

  DF_i = tr( A_i )

For LOESS smoothers A_i is not symmetric and so is not a projection matrix. In this case PROC GAM uses

  DF_i = tr( A_i^T A_i )

The GAM procedure gives you the option of specifying the degrees of freedom for each individual smoothing component. If you choose a particular value for the degrees of freedom, then during every local scoring iteration the procedure will search for a corresponding smoothing parameter λ that yields the specified value or comes as close as possible. The final estimate for the smoother during this local scoring iteration will be based on this λ. Note that for univariate spline components, an additional degree of freedom is removed by default to account for the linear portion of the model, so the value displayed in the Fit Summary and Analysis of Deviance tables will be one less than the value you specify.
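For example, the following hypothetical statements request four degrees of freedom for the spline in x1 and three for the loess fit in x2; with the default adjustment for the linear portion, the spline term is displayed with three nonparametric degrees of freedom:

  proc gam data=work.sample;
     model y = spline(x1, df=4) loess(x2, df=3);
  run;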

Confidence Intervals for Smoothers

In the GAM procedure, curvewise confidence intervals for smoothing splines and pointwise confidence intervals for loess smoothers are provided in the output data set.

Curvewise Confidence Interval for Smoothing Spline

Viewing the spline model as a Bayesian model, Wahba (1983) proposed Bayesian confidence intervals for smoothing spline estimates as follows:

  η̂_λ(x_i) ± z_{α/2} √( σ̂² a_ii(λ) )

where a_ii(λ) is the ith diagonal element of the A(λ) matrix and z_{α/2} is the α/2 point of the normal distribution. The confidence intervals are interpreted as intervals across the function, as opposed to pointwise intervals.

Suppose that you fit a spline estimate to experimental data that consist of a true function f and a random error term ε_i. In repeated experiments, it is likely that about 100(1 − α)% of the confidence intervals cover the corresponding true values, although some values are covered every time and other values are not covered most of the time. This effect is more pronounced when the true response curve or surface has small regions of particularly rapid change.

Pointwise Confidence Interval for Loess Smoothers

As defined in Cleveland, Devlin, and Grosse (1988), a standardized residual for a loess smoother follows a t distribution with ρ degrees of freedom, where ρ is called the lookup degrees of freedom, defined as

  ρ = δ_1² / δ_2

where δ_1 = tr[ (I − A(λ))^T (I − A(λ)) ] and δ_2 = tr[ ( (I − A(λ))^T (I − A(λ)) )² ]. Therefore an approximate pointwise confidence interval at x_i is

  η̂(x_i) ± t_{α/2; ρ} σ̂(x_i)

where σ̂(x_i) is the estimate of the standard deviation at x_i.

Distribution Family and Canonical Link

In general, there is not just one reasonable link function for a given response variable distribution. For parametric models, the choice of link function can lead to substantively different estimates and tests. However, the inherent flexibility of nonparametric models makes them less likely to be sensitive to the precise choice of link function. Thus, for simplicity and computational efficiency, the GAM procedure uses only the canonical link for each distribution, as discussed below.

The Gaussian Model

With this model, the link function is the identity function, and the generalized additive model is the additive model.

The Binomial Model

A binomial response model assumes that the proportion of successes Y is such that nY has a Bin(n, p(x)) distribution, where Bin(n, p(x)) refers to the binomial distribution with parameters n and p(x). Often the data are binary, in which case n = 1. The canonical link is

  g( p(x) ) = log( p(x) / (1 − p(x)) ) = η(x)

The Poisson Model

The link function for the Poisson model is the log function. Assuming that the mean of the Poisson distribution is μ(x), the dependence of μ(x) on the independent variables x_1, ..., x_k is

  log( μ(x) ) = s_0 + s_1(x_1) + ... + s_k(x_k)

The Gamma Model

Let the mean of the Gamma distribution be μ(x). The canonical link function for the Gamma distribution is −1/μ(x). Therefore, the relationship between μ(x) and the independent variables x_1, ..., x_k is

  −1/μ(x) = s_0 + s_1(x_1) + ... + s_k(x_k)

The Inverse Gaussian Model

Let the mean of the Inverse Gaussian distribution be μ(x). The canonical link function for the Inverse Gaussian distribution is 1/μ(x)². Therefore, the relationship between μ(x) and the independent variables x_1, ..., x_k is

  1/μ(x)² = s_0 + s_1(x_1) + ... + s_k(x_k)

Dispersion Parameter

Continuous distributions in the exponential family (Gaussian, Gamma, and Inverse Gaussian) have a dispersion parameter that can be estimated by the scaled deviance. For these continuous response distributions, PROC GAM incorporates this dispersion parameter estimate into standard errors of the parameter estimates, prediction standard errors of spline components, and chi-square statistics. The discrete distributions used in GAM (Binomial and Poisson) do not have a dispersion parameter. For more details on the distributions, dispersion parameter, and deviance, refer to Generalized Linear Models Theory in Chapter 31, The GENMOD Procedure.
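For example, the following statements sketch a nonparametric logistic model for a binary response. The data set is hypothetical, and DIST=BINARY is assumed here to be the MODEL statement value for binary data; check the MODEL statement syntax for the DIST= values available on your release. The canonical logit link is then used automatically.

  proc gam data=work.sample;
     model y = spline(x1) spline(x2) / dist=binary;
  run;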

Forms of Additive Models

Suppose that y is a continuous variable and x1 and x2 are two explanatory variables of interest. To fit an additive model, you can use a MODEL statement similar to that used in many regression procedures in the SAS System:

  model y = spline(x1) spline(x2);  

This MODEL statement requests that the procedure fit the following model:

  y = β_0 + s_1(x1) + s_2(x2) + ε

where the s_i(·) terms denote nonparametric spline functions of the respective explanatory variables.

The GAM procedure can fit semiparametric models. The following MODEL statement assumes a linear relation with x1 and an unknown functional relation with x2 :

  model y = param(x1) spline(x2);  

If you want to fit a model containing a functional two-way interaction between x1 and x2 , you can use the following MODEL statement:

  model y = spline2(x1,x2);  

In this case, the GAM procedure fits a model equivalent to that of PROC TPSPLINE.
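Putting these forms together, the following hypothetical program simulates data with a linear trend in x1 and a nonlinear trend in x2, and fits the corresponding semiparametric model:

  data work.sample;
     do i = 1 to 200;
        x1 = 2 * ranuni(1);
        x2 = 2 * ranuni(1);
        y  = 2*x1 + sin(3*x2) + 0.3*rannor(1);
        output;
     end;
     drop i;
  run;

  proc gam data=work.sample;
     model y = param(x1) spline(x2);
  run;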

Estimates from PROC GAM

PROC GAM provides the ability to fit both nonparametric and semiparametric models. In order to better understand the underlying trend of any given factor, PROC GAM separates the linear trend from any general nonparametric trend during the fitting as well as in the final report. This makes it easy for you to determine whether the significance of a smoothing variable is associated with a simple linear trend or a more complicated pattern.

For example, suppose you want to fit a semiparametric model of the form

  y = α_0 + α_1 z + s_1(x1) + s_2(x2) + ε

where z enters the model linearly. The GAM estimate for this model is

  ŷ = α̂_0 + α̂_1 z + β̂_1 x1 + β̂_2 x2 + f̂_1(x1) + f̂_2(x2)

where f̂_1 and f̂_2 are linear-adjusted nonparametric estimates of the s_1 and s_2 effects. The p-values for α̂_0, α̂_1, β̂_1, and β̂_2 are reported in the parameter estimates table. β̂_1 and β̂_2 are the estimates labeled Linear(x1) and Linear(x2) in the table. The p-values for f̂_1 and f̂_2 are reported in the analysis of deviance table.

Only f̂_1, f̂_2, and η̂ are output to the output data set, with corresponding variable names P_x1, P_x2, and P_y, respectively. For Gaussian data, the complete marginal prediction for variable x1 is

  β̂_1 x1 + P_x1

For non-Gaussian data, an appropriate transformation is required to get back to the original data scale.
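As an illustration, the following hypothetical steps recover the complete marginal prediction for x1 by adding the linear trend β̂_1·x1 back to the linear-adjusted part P_x1. The OUTPUT statement's P keyword and the ParameterEstimates column names (Parameter, Estimate) are assumptions here; check the procedure's syntax and the actual table structure on your system.

  ods output ParameterEstimates=work.pe;
  proc gam data=work.sample;
     model y = spline(x1) spline(x2);
     output out=work.est p;            /* assumed to write P_x1, P_x2, P_y */
  run;

  data work.marginal;
     if _n_ = 1 then
        set work.pe(where=(Parameter='Linear(x1)')
                    rename=(Estimate=b1));
     set work.est;
     marg_x1 = b1 * x1 + P_x1;         /* beta1*x1 + f1_hat(x1) */
  run;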

ODS Table Names

PROC GAM assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, refer to Chapter 14, Using the Output Delivery System.

Table 30.2: ODS Tables Produced by PROC GAM

ODS Table Name       Description                                     Statement   Option
ANODEV               Analysis of Deviance table for smoothing        PROC        default
                     variables
ClassSummary         Summary of class variables                      PROC        default
ConvergenceStatus    Convergence status of the local scoring         PROC        default
                     algorithm
InputSummary         Input data summary                              PROC        default
IterHistory          Iteration history table                         MODEL       ITPRINT
IterSummary          Iteration summary                               PROC        default
FitSummary           Fit parameters and fit summary                  PROC        default
ParameterEstimates   Parameter estimation for regression variables   PROC        default

By referring to the names of such tables, you can use the ODS OUTPUT statement to place one or more of these tables in output data sets.
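For example, the following statements (with a hypothetical data set and variables) save the analysis of deviance and fit summary tables in data sets named anodev and fitsum:

  ods output ANODEV=anodev FitSummary=fitsum;
  proc gam data=work.sample;
     model y = spline(x1) spline(x2);
  run;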

ODS Graphics (Experimental)

This section describes the use of ODS for creating statistical graphs with the GAM procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request these graphs you must specify the ODS GRAPHICS statement in addition to the options indicated below. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

When ODS Graphics is in effect, the GAM procedure produces plots of the partial predictions for each nonparametric predictor in the model. Use the PLOTS option on the PROC GAM statement to control aspects of these plots.

PLOTS < ( general-plot-options ) > = keywords < ( plot-options ) >

    specifies characteristics of the graphics produced when you use the experimental ODS GRAPHICS statement. You can specify the following general-plot-options in parentheses after the PLOTS option:

CLM

specifies that smoothing component plots should include a 95% confidence band. Note that producing this band can be computationally intensive for large data sets.

COMMONAXES

specifies that smoothing component plots within a single graphics panel should all have the same vertical axis limits. This enables you to visually judge relative effect size.

UNPACK | UNPACKPANELS

specifies that multiple smoothing component plots that are collected into graphics panels should be additionally displayed separately. Use this option if you want to access individual smoothing component plots within the panel.

You can specify the following keywords as arguments to the PLOTS= option.

COMPONENTS < ( number-list | ALL ) >

specifies that only particular smoothing component plots should be produced. Plots for successive smoothing components are named COMPONENT1, COMPONENT2, and so on. For example, specify PLOTS=COMPONENT(1 3) to produce only the first and the third smoothing component plots.

ODS Graph Names

PROC GAM assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 30.3.

Table 30.3: ODS Graphics Produced by PROC GAM

ODS Graph Name              Plot Description                                     PLOTS= Option
Componenti                  Partial prediction curve for smoothing component i   COMPONENTS
SmoothingComponentPaneli    Panel i of multiple partial prediction curves        default

To request these graphs you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

By default, partial prediction plots for each component are displayed in panels of multiple plots named SmoothingComponentPanel1, SmoothingComponentPanel2, and so on. Use the PLOTS(UNPACK) option on the PROC GAM statement to display these plots individually as well. Use the PLOTS(CLM) option to superimpose confidence limits for the partial predictions.
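For example, the following statements (with a hypothetical data set and variables) produce smoothing component plots with 95% confidence bands on a common vertical axis:

  ods graphics on;
  proc gam data=work.sample plots(clm commonaxes);
     model y = spline(x1) spline(x2);
  run;
  ods graphics off;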



