Examples | SAS/STAT 9.1 Users Guide Volume 2 only

Example 30.1. Generalized Additive Model with Binary Data

The following example illustrates the capabilities of the GAM procedure and compares it to the GENMOD procedure.

The data used in this example are based on a study by Bell et al. (1994). Bell and his associates studied the result of multiple-level thoracic and lumbar laminectomy, a corrective spinal surgery commonly performed on children. The data in the study consist of retrospective measurements on 83 patients . The specific outcome of interest is the presence (1) or absence (0) of kyphosis, defined as a forward flexion of the spine of at least 40 degrees from vertical. The available predictor variables are Age in months at time of the operation, the starting of vertebrae levels involved in the operation ( StartVert ), and the number of levels involved ( NumVert ). The goal of this analysis is to identify risk factors for kyphosis. PROC GENMOD can be used to investigate the relationship among kyphosis and the predictors. The following DATA step creates the data kyphosis :

  title 'Comparing PROC GAM with PROC GENMOD';   data kyphosis;   input Age StartVert NumVert Kyphosis @@;   datalines;   71  5  3  0     158 14 3 0    128 5   4 1   2   1  5  0     1   15 4 0    1   16  2 0   61  17 2  0     37  16 3 0    113 16  2 0   59  12 6  1     82  14 5 1    148 16  3 0   18  2  5  0     1   12 4 0    243 8   8 0   168 18 3  0     1   16 3 0    78  15  6 0   175 13 5  0     80  16 5 0    27  9   4 0   22  16 2  0     105 5  6 1    96  12  3 1   131 3  2  0     15  2  7 1    9   13  5 0   12  2 14  1     8   6  3 0    100 14  3 0   4   16 3  0     151 16 2 0    31  16  3 0   125 11 2  0     130 13 5 0    112 16  3 0   140 11 5  0     93  16 3 0    1   9   3 0   52  6  5  1     20  9  6 0    91  12  5 1   73  1  5  1     35  13 3 0    143 3   9 0   61  1  4  0     97  16 3 0    139 10  3 1   136 15 4  0     131 13 5 0    121 3   3 1   177 14 2  0     68  10 5 0    9   17  2 0   139 6  10 1     2   17 2 0    140 15  4 0   72  15 5  0     2   13 3 0    120 8   5 1   51  9  7  0     102 13 3 0    130 1   4 1   114 8  7  1     81  1  4 0    118 16  3 0   118 16 4  0     17  10 4 0    195 17  2 0   159 13 4  0     18  11 4 0    15  16  5 0   158 15 4  0     127 12 4 0    87  16  4 0   206 10 4  0     11  15 3 0    178 15  4 0   157 13 3  1     26  13 7 0    120 13  2 0   42  6  7  1     36  13 4 0   ;   proc genmod;   model Kyphosis = Age StartVert NumVert   / link=logit dist=binomial;   run;

The GENMOD analysis of the independent variable effects is shown in Output 30.1.1. Based on these results, the only significant factor is StartVert with a log odds ratio of ˆ’ . 1972. The variable NumVert has a p -value of 0.0904 with a log odds ratio of 0 . 3031.

Output 30.1.1: GENMOD Analysis: Partial Output

  Comparing PROC GAM with PROC GENMOD   The GENMOD Procedure   PROC GENMOD is modeling the probability that Kyphosis='0'. One way to change   this to model the probability that Kyphosis='1' is to specify the DESCENDING   option in the PROC statement.   Analysis Of Parameter Estimates   Standard       Wald 95%          Chi-   Parameter  DF  Estimate     Error   Confidence Limits   Square  Pr > ChiSq   Intercept   1    1.2497    1.2424   1.1853    3.6848     1.01      0.3145   Age         1   0.0061    0.0055   0.0170    0.0048     1.21      0.2713   StartVert   1    0.1972    0.0657    0.0684    0.3260     9.01      0.0027   NumVert     1   0.3031    0.1790   0.6540    0.0477     2.87      0.0904   Scale       0    1.0000    0.0000    1.0000    1.0000   NOTE: The scale parameter was held fixed.

The GENMOD procedure assumes a strict linear relationship between the response and the predictors. The following SAS statements use PROC GAM to investigate a less restrictive model, with moderately flexible spline terms for each of the predictors:

  title 'Comparing PROC GAM with PROC GENMOD';   proc gam data=kyphosis;   model Kyphosis = spline(Age      ,df=3)   spline(StartVert,df=3)   spline(NumVert  ,df=3) / dist=binomial;   run;

The MODEL statement requests an additive model using a univariate smoothing spline for each term . The option dist=binomial with binary responses specifies a logistic model. Each term is fit using a univariate smoothing spline with three degrees of freedom. Of these three degrees of freedom, one is taken up by the linear portion of the fit and two are left for the nonlinear spline portion. Although this might seem to be an unduly modest amount of flexibility, it is better to be conservative with a data set this small.

Output 30.1.2 and Output 30.1.3 list the output from PROC GAM.

Output 30.1.2: Summary Statistics

  Comparing PROC GAM with PROC GENMOD   The GAM Procedure   Dependent Variable: Kyphosis   Smoothing Model Component(s): spline(Age) spline(StartVert) spline(NumVert)   Summary of Input Data Set   Number of Observations                  83   Number of Missing Observations           0   Distribution                      Binomial   Link Function                        Logit   Iteration Summary and Fit Statistics   Number of local score iterations                          9   Local score convergence criterion              2.6635657E-9   Final Number of Backfitting Iterations                    1   Final Backfitting Criterion                    5.2326588E-9   The Deviance of the Final Estimate             46.610922438

Output 30.1.3: Model Fit Statistics

  Comparing PROC GAM with PROC GENMOD   The GAM Procedure   Dependent Variable: Kyphosis   Smoothing Model Component(s): spline(Age) spline(StartVert) spline(NumVert)   Regression Model Analysis   Parameter Estimates   Parameter       Standard   Parameter               Estimate          Error    t Value    Pr > t   Intercept   2.01533        1.45620   1.38      0.1706   Linear(Age)              0.01213        0.00794       1.53      0.1308   Linear(StartVert)   0.18615        0.07628   2.44      0.0171   Linear(NumVert)          0.38347        0.19102       2.01      0.0484   Smoothing Model Analysis   Fit Summary for Smoothing Components   Num   Smoothing                                      Unique   Component               Parameter              DF             GCV         Obs   Spline(Age)              0.999996        2.000000      328.512831          66   Spline(StartVert)        0.999551        2.000000      317.646685          16   Spline(NumVert)          0.921758        2.000000       20.144056          10   Smoothing Model Analysis   Analysis of Deviance   Sum of   Source                        DF         Squares    Chi   Square    Pr > ChiSq   Spline(Age)              2.00000       10.494369       10.4944        0.0053   Spline(StartVert)        2.00000        5.494968        5.4950        0.0641   Spline(NumVert)          2.00000        2.184518        2.1845        0.3355

The critical part of the GAM results is the Analysis of Deviance table, shown in Output 30.1.3. For each smoothing effect in the model, this table gives a ² -test comparing the deviance between the full model and the model without this variable. In this case the analysis of deviance results indicates that the effect of Age is highly significant, the effect of StartVert is nearly significant, and the effect of NumVert is insignificant at the 5% level. Plots of predictions against predictor can be used to investigate why PROC GAM and PROC GENMOD produce different results. The following statements use the experimental ODS GRAPHICS statement to produce plots of the individual smoothing components. The CLM suboption for the PLOTS option adds a curvewise Bayesian confidence band to each smoothing component, while the COMMONAXES suboption forces all three smoothing component plots to share the same vertical axis limits, allowing a visual judgment of nonparametric effect size .

  ods html;   ods graphics on;   proc gam data=kyphosis plots(clm commonaxes);   model Kyphosis = spline(Age      ,df=3)   spline(StartVert,df=3)   spline(NumVert  ,df=3) / dist=binomial;   run;   ods graphics off;   ods html close;

For general information about ODS graphics, see Chapter 15, Statistical Graphics Using ODS. For specific information about the graphics available in the GAM procedure, see the section ODS Graphics on page 1581. The smoothing component plots are displayed in Output 30.1.4.

Output 30.1.4: Partial Prediction for Each Predictor (Experimental)

The plots show that the partial predictions corresponding to both Age and StartVert have a quadratic pattern, while NumVert has a more complicated but weaker pattern. However, in the plot for NumVert , notice that about half the vertical range of the function is determined by the point at the upper extreme. It would be a good idea, therefore, to rerun the analysis without this point, to explore how much it affects the conclusions. You can do this by simply including a WHERE clause when specifying the data set for the GAM procedure, as in the following code:

  ods html;   ods graphics on;   proc gam data=kyphosis(where=(NumVert^=14)) plots(clm commonaxes);   model Kyphosis = spline(Age      ,df=3)   spline(StartVert,df=3)   spline(NumVert  ,df=3) / dist=binomial;   run;   ods graphics off;   ods html close;

The analysis of deviance table from this reanalysis is shown in Output 30.1.5, and Output 30.1.6 shows the recomputed partial predictor plots.

Output 30.1.5: Analysis After Removing NumVert=14

  Comparing PROC GAM with PROC GENMOD   The GAM Procedure   Dependent Variable: Kyphosis   Smoothing Model Component(s): spline(Age) spline(StartVert) spline(NumVert)   Smoothing Model Analysis   Analysis of Deviance   Sum of   Source                        DF         Squares    Chi-Square    Pr > ChiSq   Spline(Age)              2.00000       10.587556       10.5876        0.0050   Spline(StartVert)        2.00000        5.477094        5.4771        0.0647   Spline(NumVert)          2.00000        3.209089        3.2091        0.2010

Output 30.1.6: Partial Prediction After Removing NumVert=14 (Experimental)

Removing data point NumVert =14 has little effect on either the analysis of deviance results or the estimated curves for StartVert and NumVert . But the removal has a noticeable effect on the variable NumVert , whose curve now also seems quadratic, though it is much less pronounced than for the other two variables.

An important difference between the first analysis of this data with GENMOD and the subsequent analysis with GAM is that GAM indicates that age is a significant predictor of kyphosis. The difference is due to the fact that the GENMOD model only includes a linear effect in Age whereas the GAM model allows a more complex relationship, which the plots indicate is nearly quadratic. Having used the GAM procedure to discover an appropriate form of the dependence of Kyphosis on each of the three independent variables, you can use the GENMOD procedure to fit and assess the corresponding parametric model. The following code fits a GENMOD model with quadratic terms for all three variables, including tests for the joint linear and quadratic effects of each variable. The resulting contrast tests are shown in Output 30.1.7.

Output 30.1.7: Joint Linear and Quadratic Tests

  Comparing PROC GAM with PROC GENMOD   The GENMOD Procedure   PROC GENMOD is modeling the probability that Kyphosis='0'. One way to change   this to model the probability that Kyphosis='1' is to specify the DESCENDING   option in the PROC statement.   Contrast Results   Chi-   Contrast         DF     Square    Pr > ChiSq    Type   Age               2      13.63        0.0011    LR   StartVert         2      15.41        0.0005    LR   NumVert           2       3.56        0.1684    LR

  title 'Comparing PROC GAM with PROC GENMOD';   proc genmod data=kyphosis(where=(NumVert^=14));   model kyphosis = Age       Age      *Age   StartVert StartVert*StartVert   NumVert   NumVert  *NumVert   /link=logit  dist=binomial;   contrast 'Age'       Age       1, Age*Age             1;   contrast 'StartVert' StartVert 1, StartVert*StartVert 1;   contrast 'NumVert'   NumVert   1, NumVert*NumVert     1;   run;

The results for the quadratic GENMOD model are now quite consistent with the GAM results.

From this example, you can see that PROC GAM is very useful in visualizing the data and detecting the nonlinearity among the variables.

Example 30.2. Poisson Regression Analysis of Component Reliability

In this example, the number of maintenance repairs on a complex system are modeled as realizations of Poisson random variables. The system under investigation has a large number of components, which occasionally break down and are replaced or repaired. During a four-year period, the system was observed to be in a state of steady operation, meaning that the rate of operation remained approximately constant. A monthly maintenance record is available for that period, which tracks the number of components removed for maintenance each month. The data are listed in the following statements that create a SAS data set.

  title 'Analysis of Component Reliability';   data equip;   input year month removals @@;   datalines;   1987   1  2 1987   2  4 1987   3  3   1987   4  3 1987   5  3 1987   6  8   1987   7  2 1987   8  6 1987   9  3   1987  10  9 1987  11  4 1987  12  10   1988   1  4 1988   2  6 1988   3  4   1988   4  4 1988   5  3 1988   6  5   1988   7  3 1988   8  4 1988   9  5   1988  10  3 1988  11  6 1988  12  3   1989   1  2 1989   2  6 1989   3  1   1989   4  5 1989   5  5 1989   6  4   1989   7  2 1989   8  2 1989   9  2   1989  10  5 1989  11  1 1989  12  10   1990   1  3 1990   2  8 1990   3  12   1990   4  7 1990   5  3 1990   6  2   1990   7  4 1990   8  3 1990   9  0   1990  10  6 1990  11  6 1990  12  6   ;   run;

For planning purposes, it is of interest to understand the long- and short-term trends in the maintenance needs of the system. Over the long term, it is suspected that the quality of new components and repair work improves over time, so the number of component removals would tend to decrease from year to year. It is not known whether the robustness of the system is affected by seasonal variations in the operating environment, but this possibility is also of interest.

Because the maintenance record is in the form of counts, the number of removals are modeled as realizations of Poisson random variables. Denote by » _ij the unobserved component removal rate for year i and month j . Since the data were recorded at regular intervals (from a system operating at a constant rate), each » _ij is assumed to be a function of year and month only.

A preliminary two-way analysis is performed using PROC GENMOD to make broad inferences on repair trends. A log-link is specified for the model

where µ is a grand mean, is the effect of the i th year, and is the effect of the j th month. A CLASS statement declares the variables year and month as categorical. Type III sum of squares are requested to test whether there is an overall effect of year and/or month.

  title2 'Two-way model';   proc genmod data=equip;   class year month;   model removals=year month   / dist=Poisson link=log type3;   run;

Output 30.2.1 displays the listed Type III statistics for the fitted model. With the test for year effects yielding a p -value of 0 . 4527, there is no evidence of a long-term trend in maintenance rates. Apparently, the quality of new or repaired components did not change between 1987 and 1990. However, the test for monthly trends does yield a small p -value of 0 . 0321, indicating that seasonal trends are just barely significant at the ± = 0 . 05 level.

Output 30.2.1: PROC GENMOD Listing for Type III Analysis

  Analysis of Component Reliability   Two-way model   The GENMOD Procedure   LR Statistics For Type 3 Analysis   Chi-   Source           DF     Square    Pr > ChiSq   year              3       2.63        0.4527   month            11      21.12        0.0321

The Type III tests indicate that the year term may be dropped from the model. The focus of the analysis is now on identifying the form of the underlying seasonal trend, which is a task that PROC GAM is especially suited for. PROC GAM will be used to fit both a reduced categorical model, with year eliminated, and a nonparametric spline model. Although PROC GENMOD also has the capability to fit categorical models, as demonstrated above, PROC GAM will be used to fit both models for a better comparison.

The following PROC GAM statements specify the reduced categorical model. For this part of the analysis, a CLASS statement is again used to specify that month is a categorical variable. In the follow-up, the seasonal effect will be treated as a nonparametric function of month .

  title2 'One-way model';   proc gam data=equip;   class month;   model removals=param(month)   / dist=Poisson;   output out=est predicted;   run;

The following statements generate a plot of the estimated seasonal trend. Note that the predicted values in the output data set correspond to the logarithms of the » _ij , and so the exponential function is applied to put them on the scale of the original data. The plot is displayed in Output 30.2.2.

Output 30.2.2: Predicted Seasonal Trend from a Parametric Model Fit Using a CLASS Statement

  proc sort data=est out=plot;   by month;   run;   data plot;   set plot;   P_removals = exp(P_removals);   run;   legend1 frame cframe=ligr cborder=black label=none   position=center;   axis1   minor=none order=(0 to 15 by 5)   label=(angle=90 rotate=0 "number of removals");   axis2   minor=none label=("month");   symbol1 color=black interpol=none value=dot;   symbol2 color=blue interpol=join value=none line=1;   title;   proc gplot data=plot;   plot removals*month=1 P_removals*month=2   / overlay cframe=ligr legend=legend1 frame   vaxis=axis1 haxis=axis2;   run; quit;

The predicted repair rates shown in Output 30.2.2 form a jagged seasonal pattern. Ignoring the month-to-month fluctuations, which are difficult to explain and may be artifacts of random noise, the general removal rate trend starts by increasing at the beginning of the year; the trend flattens out in February and then decreases through August; it flattens out again in September and begins an increasing trend that continues throughout the rest of the year.

One advantage of nonparametric regression is its ability to highlight general trends in the data, such as those described above, and attribute local fluctuations to unexplained random noise. The nonparametric regression model used by PROC GAM specifies that the underlying removal rates » _j are of the form

where ² ₁ is a linear coefficient and s () is a nonparametric regression function. ² ₁ and s () define the linear and nonparametric parts , respectively, of the seasonal trend.

The following statements request that PROC GAM fit a cubic spline model to the monthly repair data. The output listing is displayed in Output 30.2.3.

Output 30.2.3: PROC GAM Listing for Cubic Spline Regression Using the METHOD=GCV Option

  Analysis of Component Reliability   Spline model   The GAM Procedure   Dependent Variable: removals   Smoothing Model Component(s): spline(month)   Summary of Input Data Set   Number of Observations                 48   Number of Missing Observations          0   Distribution                      Poisson   Link Function                         Log   Iteration Summary and Fit Statistics   Number of local score iterations                          5   Local score convergence criterion              7.241527E-12   Final Number of Backfitting Iterations                    1   Final Backfitting Criterion                    1.710339E-11   The Deviance of the Final Estimate             56.901543546

  title 'Analysis of Component Reliability';   title2 'Spline model';   proc gam data=equip;   model removals=spline(month)   / dist=Poisson method=gcv;   run;

The METHOD=GCV option is used to determine an appropriate level of smoothing. The keywords LCLM and UCLM in the OUTPUT statement request that lower and upper 95% confidence bounds on each s ( Month _j ) be included in the output data set.

Notice in the listing of Output 30.2.4 that the DF value chosen for the nonlinear portion of the spline by minimizing GCV is about 1.88, which is smaller than the default value of 3. This indicates that the spline model of the seasonal trend is relatively simple. As indicated by the Analysis of Deviance table, it is a significant feature of the data: the table lists a p -value of 0 . 0103 for the hypothesis of no seasonal trend. Note also that the Parameter Estimates table lists a p -value of 0 . 2362 for the hypothesis of no linear factor in the seasonal trend.

Output 30.2.4: Model Fit Statistics

  Spline model   The GAM Procedure   Dependent Variable: removals   Smoothing Model Component(s): spline(month)   Regression Model Analysis   Parameter Estimates   Parameter       Standard   Parameter           Estimate          Error    t Value    Pr > t   Intercept            1.34594        0.14509       9.28      <.0001   Linear(month)        0.02274        0.01893       1.20      0.2362   Smoothing Model Analysis   Fit Summary for Smoothing Components   Num   Smoothing                                      Unique   Component           Parameter              DF             GCV         Obs   Spline(month)        0.901512        1.879980        0.115848          12   Smoothing Model Analysis   Analysis of Deviance   Sum of   Source                    DF         Squares    Chi-Square    Pr > ChiSq   Spline(month)        1.87998        8.877764        8.8778        0.0103

The following statements use the experimental ODS GRAPHICS statement to plot the smoothing component for the effect of Month on predicted repair rates.

  ods html;   ods graphics on;   proc gam data=equip;   model removals=spline(month)   / dist=Poisson method=gcv;   run;   ods graphics off;   ods html close;

Output 30.2.5: Predicted Seasonal Trend from a Cubic Spline Model (Experimental)

In Output 30.2.5, it is apparent that the pattern of repair rates follows the general pattern observed in Output 30.2.2. However, the plot of Output 30.2.5, is much cleaner as the month-to-month fluctuations are smoothed out to reveal the broader seasonal trend.

You can use the PLOTS(CLM) option to added a 95% confidence band to the plot of s ( Month _j ), as in the following statements.

  ods html; ods graphics on;   proc gam data=equip plots(clm);   model removals=spline(month)   / dist=Poisson method=gcv;   run;   ods graphics off; ods html close;

The plot is displayed in Output 30.2.6.

Output 30.2.6: Estimated Nonparametric Factor of Seasonal Trend, Along with 95% Confidence Bounds (Experimental)

The small p -value in Output 30.2.1 of 0.0321 for the hypothesis of no seasonal trend indicates that the data exhibit significant seasonal structure. However, Output 30.2.6 is a graphical illustration of a degree of indistinctness in that structure. For instance, the horizontal reference line at zero is entirely within the 95% confidence band; that is, the estimated nonlinear part of the trend is relatively flat. Thus, despite evidence of seasonality based on the parametric model, it is difficult to narrow down its significant effects to a specific part of the year.

Example 30.3. Comparing PROC GAM with PROC LOESS

In an analysis of simulated data from a hypothetical chemistry experiment, additive nonparametric regression performed by PROC GAM is compared to the unrestricted multidimensional procedure of PROC LOESS.

In each repetition of the experiment, a catalyst is added to a chemical solution, thereby inducing synthesis of a new material. The data are measurements of the temperature of the solution, the amount of catalyst added, and the yield of the chemical reaction. The following code reads and plots the raw data.

  data ExperimentA;   format Temperature f4.0 Catalyst f6.3 Yield f8.3;   input Temperature Catalyst Yield @@;   datalines;   80 0.005 6.039  80 0.010 4.719  80 0.015 6.301   80 0.020 4.558  80 0.025 5.917  80 0.030 4.365   80 0.035 6.540  80 0.040 5.063  80 0.045 4.668   80 0.050 7.641  80 0.055 6.736  80 0.060 7.255   80 0.065 5.515  80 0.070 5.260  80 0.075 4.813   80 0.080 4.465  90 0.005 4.540  90 0.010 3.553   90 0.015 5.611  90 0.020 4.586  90 0.025 6.503   90 0.030 4.671  90 0.035 4.919  90 0.040 6.536   90 0.045 4.799  90 0.050 6.002  90 0.055 6.988   90 0.060 6.206  90 0.065 5.193  90 0.070 5.783   90 0.075 6.482  90 0.080 5.222 100 0.005 5.042   100 0.010 5.551 100 0.015 4.804 100 0.020 5.313   100 0.025 4.957 100 0.030 6.177 100 0.035 5.433   100 0.040 6.139 100 0.045 6.217 100 0.050 6.498   100 0.055 7.037 100 0.060 5.589 100 0.065 5.593   100 0.070 7.438 100 0.075 4.794 100 0.080 3.692   110 0.005 6.005 110 0.010 5.493 110 0.015 5.107   110 0.020 5.511 110 0.025 5.692 110 0.030 5.969   110 0.035 6.244 110 0.040 7.364 110 0.045 6.412   110 0.050 6.928 110 0.055 6.814 110 0.060 8.071   110 0.065 6.038 110 0.070 6.295 110 0.075 4.308   110 0.080 7.020 120 0.005 5.409 120 0.010 7.009   120 0.015 6.160 120 0.020 7.408 120 0.025 7.123   120 0.030 7.009 120 0.035 7.708 120 0.040 5.278   120 0.045 8.111 120 0.050 8.547 120 0.055 8.279   120 0.060 8.736 120 0.065 6.988 120 0.070 6.283   120 0.075 7.367 120 0.080 6.579 130 0.005 7.629   130 0.010 7.171 130 0.015 5.997 130 0.020 6.587   130 0.025 7.335 130 0.030 7.209 130 0.035 8.259   130 0.040 6.530 130 0.045 8.400 130 0.050 7.218   130 0.055 9.167 130 0.060 9.082 130 0.065 7.680   130 0.070 7.139 130 0.075 7.275 130 0.080 7.544   140 0.005 4.860 140 0.010 5.932 140 0.015 3.685   140 0.020 5.581 140 0.025 4.935 140 0.030 5.197   140 0.035 5.559 140 0.040 4.836 140 0.045 5.795   140 0.050 5.524 140 0.055 7.736 140 0.060 5.628   140 0.065 6.644 140 0.070 3.785 140 0.075 4.853   140 0.080 6.006   ;   title2 'Raw data';   proc g3d data=ExperimentA;   plot Temperature*Catalyst=Yield   / zmin=2 zmax=11;   run;

The plot is displayed in Output 30.3.1.

Output 30.3.1: Surface Plot of Yield by Temperature and Amount of Catalyst

A surface fitted to the plot of Output 30.3.1 by PROC LOESS will be of a very general (and flexible) type, since the procedure requires only weak assumptions about the structure of the dependencies among the data. PROC GAM, on the other hand, makes stronger structural assumptions by restricting the fitted surface to an additive form. These differences will be demonstrated in this example.

The following code requests that both PROC LOESS and PROC GAM fit surfaces to the data.

  ods output OutputStatistics=PredLOESS;   proc loess data=ExperimentA;   model Yield = Temperature Catalyst   / scale=sd degree=2 select=gcv;   run;   ods output close;   proc gam data=ExperimentA;   model Yield = loess(Temperature) loess(Catalyst)   / method=gcv;   output out=PredGAM;   run;

In both cases the smoothing parameter was chosen as the value that minimizes GCV. This is performed automatically by PROC LOESS and PROC GAM.

The following code generates plots of the predicted yields, which are displayed in Output 30.3.2.

Output 30.3.2: Fitted Regression Surfaces

  title2 'PROC LOESS fitted surface';   proc g3d data=PredLOESS;   format pred f6.3;   plot Temperature*Catalyst=pred   / name='LOESSA' zmin=2 zmax=11;   run;   title2 'PROC GAM additive fitted surface';   proc g3d data=PredGAM;   format P_Yield f6.3;   plot Temperature*Catalyst=P_Yield   / name='GAMA' zmin=2 zmax=11;   run;   goptions display;   proc greplay nofs tc=sashelp.templt template=v2 ;   igout=gseg;   treplay 1:loessa 2:gama;   run; quit;

Though both PROC LOESS and PROC GAM use the statistical technique loess, it is apparent from Output 30.3.2 that the manner in which it is applied is very different. By smoothing out the data in local neighborhoods, PROC LOESS essentially fits a surface to the data in pieces, one neighborhood at a time. The local regions are treated independently, so separate areas of the fitted surface are only weakly re-lated. PROC GAM imposes additive structure, requiring that cross sections of the fitted surface always have the same shape and thereby relating regions that have a common value of the same individual regressor variable. Under that restriction, the loess technique need not be applied to the entire multidimensional scatter plot, but only to one-dimensional cross sections of the data.

The advantage of using additive model fitting is that its statistical power is directed toward univariate smoothing, and so it is able to discern the finer details of any underlying structure in the data. Regression data may be very sparse when viewed in the context of multidimensional space, even when every individual set of regressor values densely covers its range. This is the familiar curse of dimensionality. Sparse data greatly restricts the effectiveness of nonparametric procedures, but additive model fitting, when appropriate, is one way to overcome this limitation.

To examine these properties, plots of cross sections of unrestricted (PROC LOESS) and additive (PROC GAM) fitted surfaces for the variable Catalyst are generated by the following code. The code for creating the cross section plots and overlaying them is somewhat complicated, so a macro %XPlot is employed to make it easy to create this plot for the results of each procedure.

  axis1 minor=none order=(2 to 11 by 2)   label=(angle=90 rotate=0 "Predicted Yield");   axis2 minor=none order=(0.005 to 0.080 by 0.02 5)   label=("Catalyst");   symbol1 color=blue interpol=join value=none   line=1 width=1;   %macro XPLOT(proc=,name=);   proc sort data=Pred&proc;   by Catalyst Temperature;   run;   data PredX&proc;   keep Pred80 Pred90 Pred100 Pred110 Pred120 Pred130   Pred140 Catalyst;   array xPred{8:14} Pred80 Pred90 Pred100 Pred110   Pred120 Pred130 Pred140;   retain Pred80 Pred90 Pred100 Pred110 Pred120   Pred130 Pred140;   set Pred&proc;   %if &proc=LOESS %then %do;   xPred{Temperature/10} = pred;   %end;   %else %if &proc=GAM %then %do;   xPred{Temperature/10} = P_Yield;   %end;   if abs(Temperature-140)<1 then output;   run;   proc gplot data=PredX&proc;   plot Pred140*Catalyst=1 Pred130*Catalyst=1   Pred120*Catalyst=1 Pred110*Catalyst=1   Pred100*Catalyst=1 Pred90*Catalyst=1   Pred80*Catalyst=1   / overlay cframe=ligr name=&name   vaxis=axis1 haxis=axis2;   run; quit;   %mend;   title;   title2 'PROC LOESS fitted surface, cross sections';   %XPLOT(proc=LOESS,name='XLOESSA');   title2 'PROC GAM additive fitted surface, cross sections';   %XPLOT(proc=GAM,name='XGAMA');   goptions display;   proc greplay nofs tc=sashelp.templt template=v2 ;   igout=gseg;   treplay 1:xloessa 2:xgama;   run; quit;

The plots are displayed in Output 30.3.3.

Output 30.3.3: Cross sections of Fitted Regression Surfaces

Notice that the graphs in the top panel (PROC LOESS) of Output 30.3.3 have varying shapes , while every graph in the bottom panel (PROC GAM) is the same curve shifted vertically. This illustrates precisely the kind of structural differences that distinguish additive models. A second important comparison to make in Output 30.3.2 and Output 30.3.3 is the level of detail in the fitted regression surfaces. Cross sections of the PROC LOESS surface are rather flat, but those of the additive surface have a clear shape. In particular, the ridge near Catalyst=0.055 is only vaguely evident in the PROC LOESS surface, but it is plainly revealed by the additive procedure.

For an example of a situation where unrestricted multidimensional fitting is preferred over additive regression, consider the following simulated data from a similar experiment. The following code creates another SAS data set and plot.

  data ExperimentB;   format Temperature f4.0 Catalyst f6.3 Yield f8.3;   input Temperature Catalyst Yield @@;   datalines;   80 0.005  9.115  80 0.010  9.275  80 0.015  9.160   80 0.020  7.065  80 0.025  6.054  80 0.030  4.899   80 0.035  4.504  80 0.040  4.238  80 0.045  3.232   80 0.050  3.135  80 0.055  5.100  80 0.060  4.802   80 0.065  8.218  80 0.070  7.679  80 0.075  9.669   80 0.080  9.071  90 0.005  7.085  90 0.010  6.814   90 0.015  4.009  90 0.020  4.199  90 0.025  3.377   90 0.030  2.141  90 0.035  3.500  90 0.040  5.967   90 0.045  5.268  90 0.050  6.238  90 0.055  7.847   90 0.060  7.992  90 0.065  7.904  90 0.070 10.184   90 0.075  7.914  90 0.080  6.842 100 0.005  4.497   100 0.010  2.565 100 0.015  2.637 100 0.020  2.436   100 0.025  2.525 100 0.030  4.474 100 0.035  6.238   100 0.040  7.029 100 0.045  8.183 100 0.050  8.939   100 0.055  9.283 100 0.060  8.246 100 0.065  6.927   100 0.070  7.062 100 0.075  5.615 100 0.080  4.687   110 0.005  3.706 110 0.010  3.154 110 0.015  3.726   110 0.020  4.634 110 0.025  5.970 110 0.030  8.219   110 0.035  8.590 110 0.040  9.097 110 0.045  7.887   110 0.050  8.480 110 0.055  6.818 110 0.060  7.666   110 0.065  4.375 110 0.070  3.994 110 0.075  3.630   110 0.080  2.685 120 0.005  4.697 120 0.010  4.268   120 0.015  6.507 120 0.020  7.747 120 0.025  9.412   120 0.030  8.761 120 0.035  8.997 120 0.040  7.538   120 0.045  7.003 120 0.050  6.010 120 0.055  3.886   120 0.060  4.897 120 0.065  2.562 120 0.070  2.714   120 0.075  3.141 120 0.080  5.081 130 0.005  8.729   130 0.010  7.460 130 0.015  9.549 130 0.020 10.049   130 0.025  8.131 130 0.030  7.553 130 0.035  6.191   130 0.040  6.272 130 0.045  4.649 130 0.050  3.884   130 0.055  2.522 130 0.060  4.366 130 0.065  3.272   130 0.070  4.906 130 0.075  6.538 130 0.080  7.380   140 0.005  8.991 140 0.010  8.029 140 0.015  8.417   140 0.020  8.049 140 0.025  4.608 140 0.030  5.025   140 0.035  2.795 140 0.040  3.123 140 0.045  3.407   140 0.050  4.183 140 0.055  3.750 140 0.060  6.316   140 0.065  5.799 140 0.070  7.992 140 0.075  7.835   140 0.080  8.985   ;   run;   title2 'Raw data';   proc g3d data=ExperimentB;   plot Temperature*Catalyst=Yield   / zmin=2 zmax=11;   run;

A plot of the raw data is displayed in Output 30.3.4.

Output 30.3.4: Raw Data from Experiment B

Though the surface displayed in Output 30.3.4 is quite jagged, a distinct feature of the plot is a large ridge that runs diagonally across its surface. One would expect that the ridge would appear in the fitted regression surface of an appropriate nonparametric procedure. Nevertheless, between PROC LOESS and PROC GAM, only PROC LOESS is able to capture this significant feature.

The SAS code for fitting the new data is essentially the same as that for the data set from the first experiment. Similar to Output 30.3.2, multivariate and additive fitted surfaces for these data are displayed in Output 30.3.5.

Output 30.3.5: Fitted Regression Surfaces

It is clear from Output 30.3.5 that the results of PROC LOESS and PROC GAM are completely different. While the plot in the top panel resembles the raw data plot in Output 30.3.4, the plot in the bottom panel is essentially featureless.

To understand what is happening, compare the scatter plots of Yield by Catalyst for the two data sets in this example. These are generated by the following code and displayed in Output 30.3.6.

Output 30.3.6: Scatterplots of Yield by Catalyst

  axis1 minor=none order=(2 to 11 by 2)   label=(angle=90 rotate=0 "Predicted Yield");   axis2 minor=none order=(0.005 to 0.080 by 0.025) label=("Catalyst");   symbol2 c=yellow v=dot i=none;   title2 'Experiment A';   proc gplot data=ExperimentA;   plot Yield*Catalyst=2   / cframe=ligr name='ExptA'   vaxis=axis1 haxis=axis2;   run; quit;   title2 'Experiment B';   proc gplot data=ExperimentB;   plot Yield*Catalyst=2   / cframe=ligr name='ExptB'   vaxis=axis1 haxis=axis2;   run; quit;   proc greplay nofs tc=sashelp.templt template=v2;   igout=gseg;   treplay 1:expta 2:exptb;   run; quit;

The top panel of Output 30.3.6 hints at the same kind of structure exhibited in the fitted cross sections of Output 30.3.3. In PROC GAM, the additive model component corresponding to Catalyst is fit to a similar scatter plot, with the partial residuals computed in the backfitting algorithm, so it is able to capture the trend seen here. In contrast, when the second data set is viewed from the perspective of Output 30.3.6, the diagonal ridge apparent in Output 30.3.4 is washed out, and no clear structure shows up in the scatter plot. As a result, the additive model fit produced by PROC GAM is relatively featureless.