The following example analyzes the data set Measure that was introduced in the 'Getting Started' section on page 4499. That analysis determined that the final estimated surface can be represented by a quadratic function in one or both of the independent variables. This example illustrates how you can use PROC TPSPLINE to fit a partial spline model. The data set Measure is fit using the following model:

   y = β₀ + β₁ x1 + β₂ x1² + g(x2)
The model has a parametric component (associated with the x1 variable) and a nonparametric component (associated with the x2 variable). The following statements fit a partial spline model:
data Measure;
   set Measure;
   x1sq = x1*x1;
run;

data pred;
   do x1=-1 to 1 by 0.1;
      do x2=-1 to 1 by 0.1;
         x1sq = x1*x1;
         output;
      end;
   end;
run;

proc tpspline data=measure;
   model y = x1 x1sq (x2);
   score data=pred out=predy;
run;
Output 74.1.1 displays the results from these statements.
As displayed in Output 74.1.1, there are five unique design points for the smoothing variable x2 and two regression variables in the model (x1, x1sq). The dimension of the null space (polynomial space) is 4. The standard deviation of the estimate is much larger than the one based on the model with both x1 and x2 as smoothing variables (0.445954 compared to 0.098421). One of the many possible explanations is that the number of unique design points of the smoothing variable is too small to warrant an accurate estimate for g(x2).
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points             5

           Summary of Final Model
Number of Regression Variables             2
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              4

    Summary Statistics of Final Estimation
log10(n*Lambda)                       2.2374
Smoothing Penalty                   205.3461
Residual SS                           8.5821
Tr(I-A)                              43.1534
Model DF                              6.8466
Standard Deviation                    0.4460
The following statements produce a surface plot for the partial spline model:
title 'Plot of Fitted Surface on a Fine Grid';
proc g3d data=predy;
   plot x2*x1=p_y / grid
        zmin=9 zmax=21 zticknum=4;
run;
The surface displayed in Output 74.1.2 is similar to the one estimated by using the full nonparametric model (displayed in Figure 74.5).
The following example continues the analysis of the data set Measure to illustrate how you can use PROC TPSPLINE to fit a spline model with a higher-order penalty term. Spline models with higher-order penalty terms move low-order polynomial terms into the null space. Hence, there is no penalty for these terms, and they can vary without constraint.
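The size of the null space follows a standard thin-plate spline result: with d smoothing variables and penalty order m, the null space consists of all polynomials of total degree less than m, so its dimension is C(d+m-1, d). The short Python sketch below (illustrative, outside the document's SAS code) reproduces the "Dimension of Polynomial Space" values that PROC TPSPLINE reports in these examples.

```python
from math import comb

def null_space_dim(d, m):
    """Number of polynomials in d variables with total degree < m,
    i.e., the dimension of the thin-plate spline null space."""
    return comb(d + m - 1, d)

# Two smoothing variables, default penalty order M=2 -> 3 terms: 1, x1, x2
print(null_space_dim(2, 2))  # 3
# Two smoothing variables, M=3 -> 6 terms: 1, x1, x2, x1^2, x1*x2, x2^2
print(null_space_dim(2, 3))  # 6
```

With M=3 the six quadratic monomials fall into the null space, which is why they are left unpenalized in this example.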
As shown in the previous analyses, the final model for the data set Measure must include quadratic terms for both x1 and x2. This example fits the following model:

   y = β₀ + β₁ x1 + β₂ x2 + β₃ x1² + β₄ x1 x2 + β₅ x2² + g(x1, x2)
The model includes quadratic terms for both variables, although it differs from the usual linear model. The nonparametric term g(x1, x2) explains the variation of the data unaccounted for by a simple quadratic surface.
To modify the order of the derivative in the penalty term, specify the M= option. The following statements specify the option M=3 in order to include the quadratic terms in the null space:
data measure;
   set measure;
   x1sq = x1*x1;
   x2sq = x2*x2;
   x1x2 = x1*x2;
run;

proc tpspline data=measure;
   model y = (x1 x2) / m=3;
   score data=pred out=predy;
run;
The output resulting from these statements is displayed in Output 74.2.1.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points            25

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         3
Dimension of Polynomial Space              6

    Summary Statistics of Final Estimation
log10(n*Lambda)                       3.7831
Smoothing Penalty                  2092.4495
Residual SS                           0.2731
Tr(I-A)                              29.1716
Model DF                             20.8284
Standard Deviation                    0.0968
The model contains six terms in the null space. Compare Output 74.2.1 with Figure 74.2: the LOGNLAMBDA value and the smoothing penalty differ significantly. Note that, in general, these terms are not directly comparable for different models. The final estimate based on this model is close to the estimate based on the model using the default, M=2.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points            25

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              3
In the following statements, the REG procedure fits a quadratic surface model to the data set Measure .
proc reg data=measure;
   model y = x1 x1sq x2 x2sq x1x2;
run;
The results are displayed in Output 74.2.2.
The REG Procedure
Model: MODEL1
Dependent Variable: y

                   Analysis of Variance
                           Sum of        Mean
Source            DF      Squares      Square   F Value   Pr > F
Model              5    443.20502    88.64100    436.33   <.0001
Error             44      8.93874     0.20315
Corrected Total   49    452.14376

Root MSE          0.45073    R-Square    0.9802
Dependent Mean   15.08548    Adj R-Sq    0.9780
Coeff Var         2.98781

                   Parameter Estimates
                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|
Intercept    1    14.90834     0.12519    119.09     <.0001
x1           1     0.01292     0.09015      0.14     0.8867
x1sq         1     4.85194     0.15237     31.84     <.0001
x2           1     0.02618     0.09015      0.29     0.7729
x2sq         1     5.20624     0.15237     34.17     <.0001
x1x2         1     0.04814     0.12748      0.38     0.7076
The REG procedure produces slightly different results. To fit a similar model with PROC TPSPLINE, you can use a MODEL statement that specifies the degrees of freedom with the DF= option. You can also use a large value for the LOGNLAMBDA0= option to force a parametric model fit.
Because there is one degree of freedom for each of the terms Intercept, x1, x2, x1sq, x2sq, and x1x2, the DF=6 option is used.
proc tpspline data=measure;
   model y = (x1 x2) / m=3 df=6 lognlambda=(-4 to 1 by 0.2);
   score data=pred out=predy;
run;
The results are displayed in Output 74.2.3. PROC TPSPLINE displays the list of GCV values for comparison.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points            25

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         3
Dimension of Polynomial Space              6

              GCV Function
   log10(n*Lambda)            GCV
         -4.000000       0.016330
         -3.800000       0.016051*
         -3.600000       0.016363
         -3.400000       0.017770
         -3.200000       0.021071
         -3.000000       0.027496
         -2.800000       0.038707
         -2.600000       0.056292
         -2.400000       0.080613
         -2.200000       0.109714
         -2.000000       0.139642
         -1.800000       0.166338
         -1.600000       0.187437
         -1.400000       0.202625
         -1.200000       0.212871
         -1.000000       0.219512
         -0.800000       0.223727
         -0.600000       0.226377
         -0.400000       0.228041
         -0.200000       0.229085
          0              0.229740
          0.200000       0.230153
          0.400000       0.230413
          0.600000       0.230576
          0.800000       0.230680
          1.000000       0.230745

Note: * indicates minimum GCV value.
The TPSPLINE Procedure
Dependent Variable: y

    Summary Statistics of Final Estimation
log10(n*Lambda)                       2.3830
Smoothing Penalty                     0.0000
Residual SS                           8.9384
Tr(I-A)                              43.9997
Model DF                              6.0003
Standard Deviation                    0.4507
The final estimate is based on 6.000330 degrees of freedom because there are already 6 degrees of freedom in the null space and the search range for lambda is not large enough (in this case, setting DF=6 is equivalent to setting λ = ∞).
The RSS and standard deviation (Output 74.2.3) are close to the error sum of squares and the root MSE, respectively, from the linear regression model (Output 74.2.2).
For this model, the optimal LOGNLAMBDA is around -3.8, which produces a standard deviation estimate of 0.096765 (see Output 74.2.1) and a GCV value of 0.016051, while the model specifying DF=6 results in a LOGNLAMBDA larger than 1 and a GCV value larger than 0.23074. The nonparametric model, based on the GCV, should provide better prediction, but the linear regression model can be more easily interpreted.
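The GCV values in these tables are consistent with the textbook generalized cross validation formula, GCV(λ) = (RSS/n) / (Tr(I-A)/n)². A quick check (Python, using the numbers reported in Output 74.2.3; the formula here is the standard definition and is assumed, not quoted from this document):

```python
n = 50                  # non-missing observations
rss = 8.9384            # Residual SS from the DF=6 fit
tr_i_minus_a = 43.9997  # Tr(I-A) from the same fit

gcv = (rss / n) / (tr_i_minus_a / n) ** 2
print(round(gcv, 3))  # 0.231, matching the tail of the GCV table
```

The result agrees with the GCV values near LOGNLAMBDA=1 (about 0.2307), as expected for the near-parametric DF=6 fit.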
The following data represent the deposition of sulfate (SO4) at 179 sites in the 48 contiguous states of the United States in 1990. Each observation records the latitude and longitude of the site as well as the SO4 deposition at the site, measured in grams per square meter (g/m²).
You can use PROC TPSPLINE to fit a surface that reflects the general trend and that reveals underlying features of the data.
data so4;
   input latitude longitude so4 @@;
   datalines;
32.45833  87.24222 1.403   34.28778  85.96889 2.103
33.07139 109.86472 0.299   36.07167 112.15500 0.304
31.95056 112.80000 0.263   33.60500  92.09722 1.950
34.17944  93.09861 2.168   36.08389  92.58694 1.578

   ... 162 additional observations ...

45.82278  91.87444 0.984   41.34028 106.19083 0.335
42.73389 108.85000 0.236   42.49472 108.82917 0.313
42.92889 109.78667 0.182   43.22278 109.99111 0.161
43.87333 104.19222 0.306   44.91722 110.42028 0.210
45.07611  72.67556 2.646
;

data pred;
   do latitude = 25 to 47 by 1;
      do longitude = 68 to 124 by 1;
         output;
      end;
   end;
run;
The preceding statements create the SAS data set so4 and the data set pred in order to make predictions on a regular grid. The following statements fit a surface for SO4 deposition. The ODS OUTPUT statement creates a data set called gcv to contain the GCV values for LOGNLAMBDA in the range from -6 to 1.
proc tpspline data=so4;
   ods output GCVFunction=gcv;
   model so4 = (latitude longitude) / lognlambda=(-6 to 1 by 0.1);
   score data=pred out=prediction1;
run;
Partial output from these statements is displayed in Output 74.3.1.
The TPSPLINE Procedure
Dependent Variable: so4

         Summary of Input Data Set
Number of Non-Missing Observations       179
Number of Missing Observations             0
Unique Smoothing Design Points           179

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              3

    Summary Statistics of Final Estimation
log10(n*Lambda)                       0.2770
Smoothing Penalty                     2.4588
Residual SS                          12.4450
Tr(I-A)                             140.2750
Model DF                             38.7250
Standard Deviation                    0.2979
The following statements produce Output 74.3.2:
symbol1 interpol=join value=none;
title "GCV Function";

proc gplot data=gcv;
   plot gcv*lognlambda / frame cframe=ligr vaxis=axis1 haxis=axis2;
run;
Output 74.3.2 displays the plot of the GCV function versus n*lambda on a log10 scale. The GCV function has two minima. PROC TPSPLINE locates the minimum at 0.277005. The figure also displays a local minimum located around -2.56. Note that the TPSPLINE procedure might not always find the global minimum, although it did in this case.
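The practical safeguard, as in this example, is to scan a grid of LOGNLAMBDA values: a grid scan returns the best point on the grid even when the criterion has several local minima, whereas a purely local search could stall in the wrong basin. A toy Python sketch with a double-well curve (the function and grid are illustrative inventions, not the actual GCV function):

```python
# Toy GCV-like curve with local minima near x = -2.5 and x = 0.3;
# the 0.1*x tilt makes the left minimum the global one.
def toy_gcv(x):
    return (x + 2.5) ** 2 * (x - 0.3) ** 2 + 0.1 * x

# Grid scan over -6.0 to 1.0 by 0.1, like the LOGNLAMBDA list above.
grid = [i * 0.1 for i in range(-60, 11)]
best = min(grid, key=toy_gcv)
print(round(best, 1))  # -2.5
```

A local search started near 0.3 would report the shallower minimum instead.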
The following analysis specifies the option LOGNLAMBDA0=-2.56. The output is displayed in Output 74.3.3.
proc tpspline data=so4;
   model so4 = (latitude longitude) / lognlambda0=-2.56;
   score data=pred out=prediction2;
run;
The TPSPLINE Procedure
Dependent Variable: so4

         Summary of Input Data Set
Number of Non-Missing Observations       179
Number of Missing Observations             0
Unique Smoothing Design Points           179

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              3

    Summary Statistics of Final Estimation
log10(n*Lambda)                      -2.5600
Smoothing Penalty                   177.2144
Residual SS                           0.0438
Tr(I-A)                               7.2086
Model DF                            171.7914
Standard Deviation                    0.0779
The smoothing penalty is much larger in Output 74.3.3 than in Output 74.3.1. The estimate in Output 74.3.1 uses a larger lambda value and, therefore, the surface is smoother than the estimate using LOGNLAMBDA=-2.56 (Output 74.3.3).
The estimate based on LOGNLAMBDA=-2.56 has a larger value for the degrees of freedom and a much smaller standard deviation.
However, a smaller standard deviation in nonparametric regression does not necessarily mean that the estimate is good: a small λ value always produces an estimate closer to the data and, therefore, a smaller standard deviation.
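This behavior can be demonstrated with any penalized linear smoother: shrinking the penalty always drives the residual sum of squares down, whether or not the fit generalizes. A Python sketch using ridge-penalized polynomial regression as a stand-in smoother (the data, degree, and penalty values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(40)

# Polynomial design matrix; the ridge penalty lam plays the role of
# the smoothing parameter lambda.
X = np.vander(x, 10, increasing=True)

def ridge_rss(lam):
    """Residual sum of squares of the ridge fit with penalty lam."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    r = y - X @ beta
    return float(r @ r)

print(ridge_rss(1e-6) < ridge_rss(1e2))  # True: smaller penalty, smaller RSS
```

The small-penalty fit hugs the noise, just as the LOGNLAMBDA=-2.56 surface does; a small RSS alone says nothing about prediction quality.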
The following statements produce two contour plots of the estimates using the GCONTOUR procedure. In the final step, the plots are placed into a single graphic with the GREPLAY procedure.
title "TPSPLINE fit with lognlambda=0.277";
proc gcontour data=prediction1 gout=grafcat;
   plot latitude*longitude = P_so4 /
        name="tpscon1" legend=legend1
        vaxis=axis1 haxis=axis2
        cframe=ligr hreverse;
run;

title "TPSPLINE fit with lognlambda=-2.56";
proc gcontour data=prediction2 gout=grafcat;
   plot latitude*longitude = P_so4 /
        name="tpscon2" legend=legend1
        vaxis=axis1 haxis=axis2
        cframe=ligr hreverse;
run;

title;
proc greplay igout=grafcat tc=sashelp.templt template=v2 nofs;
   treplay 1:tpscon1 2:tpscon2;
quit;
Compare the two estimates by examining the contour plots of both estimates (Output 74.3.4).
As the contour plots show, the estimate with LOGNLAMBDA=0.277 may represent the underlying trend, while the estimate with LOGNLAMBDA=-2.56 is very rough and may be modeling the noise component.
The following example illustrates how you can use the D= option to decrease the computation time needed by the TPSPLINE procedure. Note that, while the D= option can be helpful in decreasing computation time for large data sets, it may produce unexpected results when used with small data sets.
The following statements generate the data set large :
data large;
   do x=-5 to 5 by 0.02;
      y = 5*sin(3*x) + 1*rannor(57391);
      output;
   end;
run;
The data set large contains 501 observations with one independent variable x and one dependent variable y . The following statements invoke PROC TPSPLINE to produce a thin-plate smoothing spline estimate and the associated 99% confidence interval. The output statistics are saved in the data set fit1 .
proc tpspline data=large;
   model y = (x) / lambda=(-5 to -1 by 0.2) alpha=0.01;
   output out=fit1 pred lclm uclm;
run;
The results from this MODEL statement are displayed in Output 74.4.1.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations       501
Number of Missing Observations             0
Unique Smoothing Design Points           501

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

              GCV Function
   log10(n*Lambda)            GCV
         -5.000000       1.258653
         -4.800000       1.228743
         -4.600000       1.205835
         -4.400000       1.188371
         -4.200000       1.174644
         -4.000000       1.163102
         -3.800000       1.152627
         -3.600000       1.142590
         -3.400000       1.132700
         -3.200000       1.122789
         -3.000000       1.112755
         -2.800000       1.102642
         -2.600000       1.092769
         -2.400000       1.083779
         -2.200000       1.076636
         -2.000000       1.072763*
         -1.800000       1.074636
         -1.600000       1.087152
         -1.400000       1.120339
         -1.200000       1.194023
         -1.000000       1.344213

Note: * indicates minimum GCV value.
The TPSPLINE Procedure
Dependent Variable: y

    Summary Statistics of Final Estimation
log10(n*Lambda)                      -1.9483
Smoothing Penalty                  9953.7063
Residual SS                         475.0984
Tr(I-A)                             471.0861
Model DF                             29.9139
Standard Deviation                    1.0042
The following statements specify an identical model, but with the additional specification of the D= option. The estimates are obtained by treating nearby points as replicates.
proc tpspline data=large;
   model y = (x) / lambda=(-5 to -1 by 0.2) d=0.05 alpha=0.01;
   output out=fit2 pred lclm uclm;
run;
The output is displayed in Output 74.4.2.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations       501
Number of Missing Observations             0
Unique Smoothing Design Points           251

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

              GCV Function
   log10(n*Lambda)            GCV
         -5.000000       1.306536
         -4.800000       1.261692
         -4.600000       1.226881
         -4.400000       1.200060
         -4.200000       1.179284
         -4.000000       1.162776
         -3.800000       1.149072
         -3.600000       1.137120
         -3.400000       1.126220
         -3.200000       1.115884
         -3.000000       1.105766
         -2.800000       1.095730
         -2.600000       1.085972
         -2.400000       1.077066
         -2.200000       1.069954
         -2.000000       1.066076*
         -1.800000       1.067929
         -1.600000       1.080419
         -1.400000       1.113564
         -1.200000       1.187172
         -1.000000       1.337252

Note: * indicates minimum GCV value.
The TPSPLINE Procedure
Dependent Variable: y

    Summary Statistics of Final Estimation
log10(n*Lambda)                      -1.9477
Smoothing Penalty                  9943.5615
Residual SS                         472.1424
Tr(I-A)                             471.0901
Model DF                             29.9099
Standard Deviation                    1.0011
The difference between the two estimates is minimal. However, the CPU time for the second MODEL statement is only about 1/8 of the CPU time used in the first model fit.
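The saving comes from collapsing design points that lie within the D= distance into single unique points, which shrinks the matrix that must be decomposed. The Python sketch below illustrates one plausible greedy grouping rule (an assumption for illustration only; SAS's exact replicate-detection rule may differ, and it reports 251 unique points for this data, while this simple rule yields a different count):

```python
# x runs from -5 to 5 in steps of 0.02, as in the data set "large".
x = [-5 + 0.02 * i for i in range(501)]

d = 0.05
seeds = []
for xi in x:
    # Greedy rule (assumed): start a new group whenever the point is
    # farther than d from the current group's seed.
    if not seeds or abs(xi - seeds[-1]) > d + 1e-9:
        seeds.append(xi)

print(len(seeds))  # 167 groups for this spacing and distance
```

Either way, the cost of the decomposition is driven by the number of unique points rather than the raw sample size, which is where the roughly eightfold CPU saving comes from.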
The following statements produce a plot for comparison of the two estimates:
data fit2;
   set fit2;
   P1_y = P_y;
   LCLM1_y = LCLM_y;
   UCLM1_y = UCLM_y;
   drop P_y LCLM_y UCLM_y;
run;

proc sort data=fit1;
   by x y;
run;

proc sort data=fit2;
   by x y;
run;

data comp;
   merge fit1 fit2;
   by x y;
   label p1_y   = "Yhat1"
         p_y    = "Yhat0"
         lclm_y = "Lower CL"
         uclm_y = "Upper CL";
run;

symbol1 i=join v=none;
symbol2 i=join v=none;
symbol3 i=join v=none color=cyan;
symbol4 i=join v=none color=cyan;

title 'Comparison of Two Estimates';
title2 'with and without the D= Option';

proc gplot data=comp;
   plot P_y*x=1 P1_y*x=2 LCLM_y*x=4 UCLM_y*x=4
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          frame cframe=ligr;
run;
The estimates from fit1 and fit2 are displayed in Output 74.4.3 with the 99% confidence interval from the fit1 output data set.
The following example illustrates how you can construct a bootstrap confidence interval by using the multiple responses feature in PROC TPSPLINE.
Numerous epidemiological observations have indicated that exposure to solar radiation is an important factor in the etiology of melanoma. The following data present age-adjusted melanoma incidences for 37 years from the Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980). The data are analyzed by Ramsay and Silverman (1997).
data melanoma;
   input year incidences @@;
   datalines;
1936 0.9  1937 0.8  1938 0.8  1939 1.3  1940 1.4
1941 1.2  1942 1.7  1943 1.8  1944 1.6  1945 1.5
1946 1.5  1947 2.0  1948 2.5  1949 2.7  1950 2.9
1951 2.5  1952 3.1  1953 2.4  1954 2.2  1955 2.9
1956 2.5  1957 2.6  1958 3.2  1959 3.8  1960 4.2
1961 3.9  1962 3.7  1963 3.3  1964 3.7  1965 3.9
1966 4.1  1967 3.8  1968 4.7  1969 4.4  1970 4.8
1971 4.8  1972 4.8
;
run;
The variable incidences records the number of melanoma cases per 100,000 people for the years 1936 to 1972. The following model fits the data and requests a 90% Bayesian confidence interval along with the estimate.
proc tpspline data=melanoma;
   model incidences = (year) / alpha=0.1;
   output out=result pred uclm lclm;
run;
The output is displayed in Output 74.5.1.
The TPSPLINE Procedure
Dependent Variable: incidences

         Summary of Input Data Set
Number of Non-Missing Observations        37
Number of Missing Observations             0
Unique Smoothing Design Points            37

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

    Summary Statistics of Final Estimation
log10(n*Lambda)                       0.0607
Smoothing Penalty                     0.5171
Residual SS                           1.2243
Tr(I-A)                              22.5852
Model DF                             14.4148
Standard Deviation                    0.2328
The following statements produce a plot of the estimated curve:
symbol1 h=1pct;
symbol2 i=join v=none;
symbol3 i=join v=none;
symbol4 i=join v=none c=cyan;
legend1 frame cframe=ligr cborder=black label=none position=center;
axis1 label=(angle=90 rotate=0) minor=none;
axis2 minor=none;

title1 'Age-adjusted Melanoma Incidences for 37 years';

proc gplot data=result;
   plot incidences*year=1
        p_incidences*year=2
        lclm_incidences*year=3
        uclm_incidences*year=4
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          frame cframe=ligr;
run;
The estimated curve is displayed with 90% confidence interval bands in Output 74.5.2. The number of melanoma incidences exhibits a periodic pattern and increases over the years. The periodic pattern is related to sunspot activity and the accompanying fluctuations in solar radiation.
Wang and Wahba (1995) compared several bootstrap confidence intervals to Bayesian confidence intervals for smoothing splines. Both bootstrap and Bayesian confidence intervals are across-the-curve intervals, not pointwise intervals. They concluded that bootstrap confidence intervals work as well as Bayesian intervals with respect to average coverage probability, and that bootstrap intervals appear to be better for small sample sizes. Based on their simulations, the 'percentile-t interval' performs better than the other types of bootstrap intervals.
Suppose that f̂ and σ̂ are the estimates of f and σ from the data. Assume that f̂ is the 'true' f, and generate the bootstrap sample as follows:

   y_i* = f̂(x_i) + ε_i*,   i = 1, ..., n

where ε_i* ~ N(0, σ̂²). Denote f̂*(x_i) as the random variable of the bootstrap estimate at x_i. Repeat this process K times, so that at each point x_i you have K bootstrap estimates f̂*(x_i), or K realizations of f̂*(x_i). For each fixed x_i, consider the following statistic D_i, which is similar to a Student's t statistic:

   D_i = (f̂_i*(x_i) - f̂(x_i)) / σ̂_i*

where σ̂_i* is the estimate of σ̂ based on the ith bootstrap sample.

Suppose χ_{α/2} and χ_{1-α/2} are the lower and upper α/2 points, respectively, of the empirical distribution of D_i. The (1 - α)100% bootstrap confidence interval is defined as

   ( f̂(x_i) - χ_{1-α/2} σ̂,   f̂(x_i) - χ_{α/2} σ̂ )
Bootstrap confidence intervals are easy to interpret and can be used with any distribution. However, because they require K model fits, their construction is computationally intensive.
The multiple dependent variables feature in PROC TPSPLINE enables you to fit multiple models with the same independent variables. The procedure calculates the matrix decomposition part of the calculations only once, regardless of the number of dependent variables in the model. These calculations are responsible for most of the computing time used by the TPSPLINE procedure. This feature is particularly useful when you need to generate a bootstrap confidence interval.
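The saving has a simple linear-algebra analogue: the smoother matrix depends only on the independent variables, so solving A C = Y for a multi-column Y requires one decomposition of A rather than one per column. A Python sketch with NumPy (the matrices are random illustrative stand-ins, not TPSPLINE's actual system):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 50               # n design points, k dependent variables
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)  # symmetric positive definite system matrix
Y = rng.standard_normal((n, k))

# One solve handles all k right-hand sides; the factorization of A
# is computed once, not k times.
C_all = np.linalg.solve(A, Y)

# Same answer as solving column by column, but far cheaper for large n.
C_cols = np.column_stack([np.linalg.solve(A, Y[:, j]) for j in range(k)])
print(np.allclose(C_all, C_cols))  # True
```

This is why fitting 1,070 bootstrap responses in one PROC TPSPLINE call is far cheaper than 1,070 separate fits.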
To construct a bootstrap confidence interval, perform the following tasks:

1. Fit the data by using PROC TPSPLINE and obtain the estimates f̂(x_i) and σ̂.
2. Generate K bootstrap samples based on f̂(x_i) and σ̂.
3. Fit the K bootstrap samples with the TPSPLINE procedure to obtain estimates f̂_i*(x_i) and σ̂_i*.
4. Compute D_i and the values χ_{α/2} and χ_{1-α/2}.
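Outside SAS, these steps can be sketched in Python, using a polynomial fit as a hypothetical stand-in for the spline smoother (the smoother, random seed, and sample sizes below are illustrative assumptions, not the procedure's actual computation):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 37)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(37)

def smooth(x, y, deg=5):
    """Stand-in smoother: polynomial fit returning fitted values and
    the residual standard deviation."""
    fhat = np.polyval(np.polyfit(x, y, deg), x)
    resid = y - fhat
    sigma = np.sqrt(resid @ resid / (len(y) - (deg + 1)))
    return fhat, sigma

fhat, sigma = smooth(x, y)                    # step 1: initial fit

K, alpha = 200, 0.1
D = np.empty((K, len(x)))
for k in range(K):
    ystar = fhat + sigma * rng.standard_normal(len(x))  # step 2: resample
    fstar, sstar = smooth(x, ystar)                     # step 3: refit
    D[k] = (fstar - fhat) / sstar                       # step 4: D statistic

chi_lo = np.quantile(D, alpha / 2, axis=0)
chi_hi = np.quantile(D, 1 - alpha / 2, axis=0)
lower = fhat - chi_hi * sigma   # (1-alpha)100% percentile-t interval
upper = fhat - chi_lo * sigma
print(np.all(lower < upper))  # True
```

The SAS implementation that follows performs the same four steps, with PROC TPSPLINE as the smoother and the bad GCV fits screened out.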
The following statements illustrate this process:
proc tpspline data=melanoma;
   model incidences = (year) / alpha=0.05;
   output out=result pred uclm lclm;
run;
The output from the initial PROC TPSPLINE analysis is displayed in Output 74.5.3. The data set result contains the predicted values and confidence limits from the analysis.
The TPSPLINE Procedure
Dependent Variable: incidences

         Summary of Input Data Set
Number of Non-Missing Observations        37
Number of Missing Observations             0
Unique Smoothing Design Points            37

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

    Summary Statistics of Final Estimation
log10(n*Lambda)                       0.0607
Smoothing Penalty                     0.5171
Residual SS                           1.2243
Tr(I-A)                              22.5852
Model DF                             14.4148
Standard Deviation                    0.2328
The following statements illustrate how you can obtain a bootstrap confidence interval for the Melanoma data. The DATA step creates the data set bootstrap with information from the preceding PROC TPSPLINE execution; as displayed in Output 74.5.3, σ̂ = 0.232823. The values of f̂(x_i) are stored in the data set result in the variable P_incidences.
data bootstrap;
   set result;
   array y{1070} y1-y1070;
   do i=1 to 1070;
      y{i} = p_incidences + 0.232823*rannor(123456789);
   end;
   keep y1-y1070 p_incidences year;
run;

ods listing close;

proc tpspline data=bootstrap;
   ods output FitStatistics=FitResult;
   id p_incidences;
   model y1-y1070 = (year);
   output out=result2;
run;

ods listing;
The DATA step generates 1,070 bootstrap samples based on the previous estimate from PROC TPSPLINE. For this data set, some of the bootstrap samples result in λ values (selected by the GCV function) that cause problematic behavior. Thus, an additional 70 bootstrap samples are generated.
The ODS listing destination is closed before PROC TPSPLINE is invoked. The MODEL statement uses all of the variables y1-y1070 as dependent variables, and the models for all bootstrap samples are fit simultaneously. The output data set result2 contains the variables year, y1-y1070, p_y1-p_y1070, and P_incidences.
The ODS OUTPUT statement writes the FitStatistics table to the data set FitResult. The data set FitResult contains two variables, Parameter and Value, and is used in subsequent calculations of D_i.
In the data set FitResult, there are 63 estimates with a standard deviation of zero, suggesting that those estimates fit the data perfectly; they result from λ values that are approximately equal to zero. For small sample sizes, there is a positive probability that the λ chosen by the GCV function will be zero (refer to Wang and Wahba 1995).
In the following steps, these cases are removed from the bootstrap samples as 'bad' samples: they represent failure of the GCV function.
The following SAS statements manipulate the data set FitResult, retaining the standard deviations for all bootstrap samples and merging FitResult with the data set result2, which contains the estimates for the bootstrap samples. In the final data set boot, the D_i statistics are calculated.
data FitResult;
   set FitResult;
   if Parameter="Standard Deviation";
   keep Value;
run;

proc transpose data=FitResult out=sd prefix=sd;
run;

data result2;
   if _N_ = 1 then set sd;
   set result2;
run;

data boot;
   set result2;
   array y{1070} p_y1-p_y1070;
   array sd{1070} sd1-sd1070;
   do i=1 to 1070;
      if sd{i} > 0 then do;
         d = (y{i} - P_incidences)/sd{i};
         obs = _N_;
         output;
      end;
   end;
   keep d obs P_incidences year;
run;
The following SAS statements retain the first 1,000 bootstrap samples and calculate the values χ_{α/2} and χ_{1-α/2} with α = 0.1.
proc sort data=boot;
   by obs;
run;

data boot;
   set boot;
   by obs;
   retain n;
   if first.obs then n=1;
   else n=n+1;
   if n > 1000 then delete;
run;

proc sort data=boot;
   by obs d;
run;

data chi1 chi2;
   set boot;
   if (_N_ = (obs-1)*1000+50)  then output chi1;
   if (_N_ = (obs-1)*1000+950) then output chi2;
run;

proc sort data=result;
   by year;
run;

proc sort data=chi1;
   by year;
run;

proc sort data=chi2;
   by year;
run;

data result;
   merge result chi1(rename=(d=chi05)) chi2(rename=(d=chi95));
   keep year incidences P_incidences lower upper
        LCLM_incidences UCLM_incidences;
   lower = -chi95*0.232823 + P_incidences;
   upper = -chi05*0.232823 + P_incidences;
   label lower="Lower 90% CL (Bootstrap)"
         upper="Upper 90% CL (Bootstrap)"
         lclm_incidences="Lower 90% CL (Bayesian)"
         uclm_incidences="Upper 90% CL (Bayesian)";
run;
The data set result contains the variables year , incidences , the TPSPLINE estimate P_incidences , and the 90% Bayesian and 90% bootstrap confidence intervals.
The following statements produce Output 74.5.4:
symbol1 v=dot h=1pct;
symbol2 i=join v=none l=1;
symbol3 i=join v=none l=33;
symbol4 i=join v=none l=33;
symbol5 i=join v=none l=43 c=green;
symbol6 i=join v=none l=43 c=green;

title1 'Age-adjusted Melanoma Incidences for 37 years';

proc gplot data=result;
   plot incidences*year=1
        p_incidences*year=2
        lclm_incidences*year=3
        uclm_incidences*year=3
        lower*year=4
        upper*year=4
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          frame cframe=ligr;
run;
Output 74.5.4 displays the plot of the variable incidences , the predicted values, and the Bayesian and bootstrap confidence intervals.
The plot shows that the bootstrap confidence interval is similar to the Bayesian confidence interval. However, the Bayesian confidence interval is symmetric around the estimates, while the bootstrap confidence interval is not.