The following example analyzes the data set Measure that was introduced in the 'Getting Started' section on page 4499. That analysis determined that the final estimated surface can be represented by a quadratic function in one or both of the independent variables. This example illustrates how you can use PROC TPSPLINE to fit a partial spline model. The data set Measure is fit using the following model:

   y = β₀ + β₁ x1 + β₂ x1² + g(x2)
The model has a parametric component (associated with the x1 variable) and a nonparametric component (associated with the x2 variable). The following statements fit a partial spline model:
data Measure;
   set Measure;
   x1sq = x1*x1;
run;

data pred;
   do x1=-1 to 1 by 0.1;
      do x2=-1 to 1 by 0.1;
         x1sq = x1*x1;
         output;
      end;
   end;
run;

proc tpspline data=measure;
   model y = x1 x1sq (x2);
   score data=pred out=predy;
run;
Output 74.1.1 displays the results from these statements.
As displayed in Output 74.1.1, there are five unique design points for the smoothing variable x2 and two regression variables in the model (x1, x1sq). The dimension of the null space (polynomial space) is 4. The standard deviation of the estimate is much larger than the one based on the model with both x1 and x2 as smoothing variables (0.445954 compared to 0.098421). One of the many possible explanations is that the number of unique design points of the smoothing variable is too small to warrant an accurate estimate for g(x2).
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points             5

           Summary of Final Model
Number of Regression Variables             2
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              4

    Summary Statistics of Final Estimation
log10(n*Lambda)                       2.2374
Smoothing Penalty                   205.3461
Residual SS                           8.5821
Tr(I-A)                              43.1534
Model DF                              6.8466
Standard Deviation                    0.4460
The following statements produce a surface plot for the partial spline model:
title 'Plot of Fitted Surface on a Fine Grid';
proc g3d data=predy;
   plot x2*x1=p_y / grid
        zmin=9 zmax=21 zticknum=4;
run;
The surface displayed in Output 74.1.2 is similar to the one estimated by using the full nonparametric model (displayed in Figure 74.5).
The following example continues the analysis of the data set Measure to illustrate how you can use PROC TPSPLINE to fit a spline model with a higher-order penalty term. Spline models with higher-order penalty terms move low-order polynomial terms into the null space. Hence, there is no penalty for these terms, and they can vary without constraint.
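The size of the null space follows a standard thin-plate spline result: with d smoothing variables and penalty order m, the null space consists of all polynomials of total degree less than m, so its dimension is C(d+m-1, d). The short Python sketch below (illustrative, outside the document's SAS code) reproduces the "Dimension of Polynomial Space" values that PROC TPSPLINE reports in these examples.

```python
from math import comb

def null_space_dim(d, m):
    """Number of polynomials in d variables with total degree < m,
    i.e., the dimension of the thin-plate spline null space."""
    return comb(d + m - 1, d)

# Two smoothing variables, default penalty order M=2 -> 3 terms: 1, x1, x2
print(null_space_dim(2, 2))  # 3
# Two smoothing variables, M=3 -> 6 terms: 1, x1, x2, x1^2, x1*x2, x2^2
print(null_space_dim(2, 3))  # 6
```

With M=3 the six quadratic monomials fall into the null space, which is why they are left unpenalized in this example.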
As shown in the previous analyses, the final model for the data set Measure must include quadratic terms for both x1 and x2. This example fits the following model:

   y = β₀ + β₁ x1 + β₂ x2 + β₃ x1² + β₄ x1 x2 + β₅ x2² + g(x1, x2)
The model includes quadratic terms for both variables, although it differs from the usual linear model. The nonparametric term g(x1, x2) explains the variation of the data unaccounted for by a simple quadratic surface.
To modify the order of the derivative in the penalty term, specify the M= option. The following statements specify the option M=3 in order to include the quadratic terms in the null space:
data measure;
   set measure;
   x1sq = x1*x1;
   x2sq = x2*x2;
   x1x2 = x1*x2;
run;

proc tpspline data=measure;
   model y = (x1 x2) / m=3;
   score data=pred out=predy;
run;
The output resulting from these statements is displayed in Output 74.2.1.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points            25

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         3
Dimension of Polynomial Space              6

    Summary Statistics of Final Estimation
log10(n*Lambda)                       3.7831
Smoothing Penalty                  2092.4495
Residual SS                           0.2731
Tr(I-A)                              29.1716
Model DF                             20.8284
Standard Deviation                    0.0968
The model contains six terms in the null space. Compare Output 74.2.1 with Figure 74.2: the LOGNLAMBDA value and the smoothing penalty differ significantly. Note that, in general, these terms are not directly comparable for different models. The final estimate based on this model is close to the estimate based on the model using the default, M=2.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points            25

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              3
In the following statements, the REG procedure fits a quadratic surface model to the data set Measure .
proc reg data=measure;
   model y = x1 x1sq x2 x2sq x1x2;
run;
The results are displayed in Output 74.2.2.
The REG Procedure
Model: MODEL1
Dependent Variable: y

                   Analysis of Variance
                           Sum of        Mean
Source            DF      Squares      Square   F Value   Pr > F
Model              5    443.20502    88.64100    436.33   <.0001
Error             44      8.93874     0.20315
Corrected Total   49    452.14376

Root MSE          0.45073    R-Square    0.9802
Dependent Mean   15.08548    Adj R-Sq    0.9780
Coeff Var         2.98781

                   Parameter Estimates
                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|
Intercept    1    14.90834     0.12519    119.09     <.0001
x1           1     0.01292     0.09015      0.14     0.8867
x1sq         1     4.85194     0.15237     31.84     <.0001
x2           1     0.02618     0.09015      0.29     0.7729
x2sq         1     5.20624     0.15237     34.17     <.0001
x1x2         1     0.04814     0.12748      0.38     0.7076
The REG procedure produces slightly different results. To fit a similar model with PROC TPSPLINE, you can use a MODEL statement that specifies the degrees of freedom with the DF= option. You can also use a large value for the LOGNLAMBDA0= option to force a parametric model fit.
Because there is one degree of freedom for each of the terms Intercept, x1, x2, x1sq, x2sq, and x1x2, the DF=6 option is used.
proc tpspline data=measure;
   model y = (x1 x2) / m=3 df=6 lognlambda=(-4 to 1 by 0.2);
   score data=pred out=predy;
run;
The results are displayed in Output 74.2.3. PROC TPSPLINE displays the list of GCV values for comparison.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations        50
Number of Missing Observations             0
Unique Smoothing Design Points            25

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         3
Dimension of Polynomial Space              6

              GCV Function
   log10(n*Lambda)            GCV
         -4.000000       0.016330
         -3.800000       0.016051*
         -3.600000       0.016363
         -3.400000       0.017770
         -3.200000       0.021071
         -3.000000       0.027496
         -2.800000       0.038707
         -2.600000       0.056292
         -2.400000       0.080613
         -2.200000       0.109714
         -2.000000       0.139642
         -1.800000       0.166338
         -1.600000       0.187437
         -1.400000       0.202625
         -1.200000       0.212871
         -1.000000       0.219512
         -0.800000       0.223727
         -0.600000       0.226377
         -0.400000       0.228041
         -0.200000       0.229085
          0              0.229740
          0.200000       0.230153
          0.400000       0.230413
          0.600000       0.230576
          0.800000       0.230680
          1.000000       0.230745

Note: * indicates minimum GCV value.
The TPSPLINE Procedure
Dependent Variable: y

    Summary Statistics of Final Estimation
log10(n*Lambda)                       2.3830
Smoothing Penalty                     0.0000
Residual SS                           8.9384
Tr(I-A)                              43.9997
Model DF                              6.0003
Standard Deviation                    0.4507
The final estimate is based on 6.000330 degrees of freedom because there are already 6 degrees of freedom in the null space and the search range for lambda is not large enough (in this case, setting DF=6 is equivalent to setting λ = ∞).
The RSS and standard deviation (Output 74.2.3) are close to the error sum of squares and the root MSE, respectively, from the linear regression model (Output 74.2.2).
For this model, the optimal LOGNLAMBDA is around -3.8, which produces a standard deviation estimate of 0.096765 (see Output 74.2.1) and a GCV value of 0.016051, while the model specifying DF=6 results in a LOGNLAMBDA larger than 1 and a GCV value larger than 0.23074. The nonparametric model, based on the GCV, should provide better prediction, but the linear regression model can be more easily interpreted.
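The GCV values in these tables are consistent with the textbook generalized cross validation formula, GCV(λ) = (RSS/n) / (Tr(I-A)/n)². A quick check (Python, using the numbers reported in Output 74.2.3; the formula here is the standard definition and is assumed, not quoted from this document):

```python
n = 50                  # non-missing observations
rss = 8.9384            # Residual SS from the DF=6 fit
tr_i_minus_a = 43.9997  # Tr(I-A) from the same fit

gcv = (rss / n) / (tr_i_minus_a / n) ** 2
print(round(gcv, 3))  # 0.231, matching the tail of the GCV table
```

The result agrees with the GCV values near LOGNLAMBDA=1 (about 0.2307), as expected for the near-parametric DF=6 fit.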
The following data represent the deposition of sulfate (SO4) at 179 sites in the 48 contiguous states of the United States in 1990. Each observation records the latitude and longitude of the site as well as the SO4 deposition at the site, measured in grams per square meter (g/m²).
You can use PROC TPSPLINE to fit a surface that reflects the general trend and that reveals underlying features of the data.
data so4;
   input latitude longitude so4 @@;
   datalines;
32.45833  87.24222 1.403   34.28778  85.96889 2.103
33.07139 109.86472 0.299   36.07167 112.15500 0.304
31.95056 112.80000 0.263   33.60500  92.09722 1.950
34.17944  93.09861 2.168   36.08389  92.58694 1.578

   ... 162 additional observations ...

45.82278  91.87444 0.984   41.34028 106.19083 0.335
42.73389 108.85000 0.236   42.49472 108.82917 0.313
42.92889 109.78667 0.182   43.22278 109.99111 0.161
43.87333 104.19222 0.306   44.91722 110.42028 0.210
45.07611  72.67556 2.646
;

data pred;
   do latitude = 25 to 47 by 1;
      do longitude = 68 to 124 by 1;
         output;
      end;
   end;
run;
The preceding statements create the SAS data set so4 and the data set pred in order to make predictions on a regular grid. The following statements fit a surface for SO4 deposition. The ODS OUTPUT statement creates a data set called gcv to contain the GCV values for LOGNLAMBDA in the range from -6 to 1.
proc tpspline data=so4;
   ods output GCVFunction=gcv;
   model so4 = (latitude longitude) / lognlambda=(-6 to 1 by 0.1);
   score data=pred out=prediction1;
run;
Partial output from these statements is displayed in Output 74.3.1.
The TPSPLINE Procedure
Dependent Variable: so4

         Summary of Input Data Set
Number of Non-Missing Observations       179
Number of Missing Observations             0
Unique Smoothing Design Points           179

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              3

    Summary Statistics of Final Estimation
log10(n*Lambda)                       0.2770
Smoothing Penalty                     2.4588
Residual SS                          12.4450
Tr(I-A)                             140.2750
Model DF                             38.7250
Standard Deviation                    0.2979
The following statements produce Output 74.3.2:
symbol1 interpol=join value=none;
title "GCV Function";

proc gplot data=gcv;
   plot gcv*lognlambda / frame cframe=ligr vaxis=axis1 haxis=axis2;
run;
Output 74.3.2 displays the plot of the GCV function versus n*lambda on a log10 scale. The GCV function has two minima. PROC TPSPLINE locates the minimum at 0.277005. The figure also displays a local minimum located around -2.56. Note that the TPSPLINE procedure might not always find the global minimum, although it did in this case.
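The practical safeguard, as in this example, is to scan a grid of LOGNLAMBDA values: a grid scan returns the best point on the grid even when the criterion has several local minima, whereas a purely local search could stall in the wrong basin. A toy Python sketch with a double-well curve (the function and grid are illustrative inventions, not the actual GCV function):

```python
# Toy GCV-like curve with local minima near x = -2.5 and x = 0.3;
# the 0.1*x tilt makes the left minimum the global one.
def toy_gcv(x):
    return (x + 2.5) ** 2 * (x - 0.3) ** 2 + 0.1 * x

# Grid scan over -6.0 to 1.0 by 0.1, like the LOGNLAMBDA list above.
grid = [i * 0.1 for i in range(-60, 11)]
best = min(grid, key=toy_gcv)
print(round(best, 1))  # -2.5
```

A local search started near 0.3 would report the shallower minimum instead.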
The following analysis specifies the option LOGNLAMBDA0=-2.56. The output is displayed in Output 74.3.3.
proc tpspline data=so4;
   model so4 = (latitude longitude) / lognlambda0=-2.56;
   score data=pred out=prediction2;
run;
The TPSPLINE Procedure
Dependent Variable: so4

         Summary of Input Data Set
Number of Non-Missing Observations       179
Number of Missing Observations             0
Unique Smoothing Design Points           179

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              2
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              3

    Summary Statistics of Final Estimation
log10(n*Lambda)                      -2.5600
Smoothing Penalty                   177.2144
Residual SS                           0.0438
Tr(I-A)                               7.2086
Model DF                            171.7914
Standard Deviation                    0.0779
The smoothing penalty is much larger in Output 74.3.3 than in Output 74.3.1. The estimate in Output 74.3.1 uses a larger lambda value and, therefore, the surface is smoother than the estimate using LOGNLAMBDA=-2.56 (Output 74.3.3).
The estimate based on LOGNLAMBDA=-2.56 has a larger value for the degrees of freedom and a much smaller standard deviation.
However, a smaller standard deviation in nonparametric regression does not necessarily mean that the estimate is good: a small λ value always produces an estimate closer to the data and, therefore, a smaller standard deviation.
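This behavior can be demonstrated with any penalized linear smoother: shrinking the penalty always drives the residual sum of squares down, whether or not the fit generalizes. A Python sketch using ridge-penalized polynomial regression as a stand-in smoother (the data, degree, and penalty values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(40)

# Polynomial design matrix; the ridge penalty lam plays the role of
# the smoothing parameter lambda.
X = np.vander(x, 10, increasing=True)

def ridge_rss(lam):
    """Residual sum of squares of the ridge fit with penalty lam."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    r = y - X @ beta
    return float(r @ r)

print(ridge_rss(1e-6) < ridge_rss(1e2))  # True: smaller penalty, smaller RSS
```

The small-penalty fit hugs the noise, just as the LOGNLAMBDA=-2.56 surface does; a small RSS alone says nothing about prediction quality.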
The following statements produce two contour plots of the estimates using the GCONTOUR procedure. In the final step, the plots are placed into a single graphic with the GREPLAY procedure.
title "TPSPLINE fit with lognlambda=0.277";
proc gcontour data=prediction1 gout=grafcat;
   plot latitude*longitude = P_so4 /
        name="tpscon1" legend=legend1
        vaxis=axis1 haxis=axis2
        cframe=ligr hreverse;
run;

title "TPSPLINE fit with lognlambda=-2.56";
proc gcontour data=prediction2 gout=grafcat;
   plot latitude*longitude = P_so4 /
        name="tpscon2" legend=legend1
        vaxis=axis1 haxis=axis2
        cframe=ligr hreverse;
run;

title;
proc greplay igout=grafcat tc=sashelp.templt template=v2 nofs;
   treplay 1:tpscon1 2:tpscon2;
quit;
Compare the two estimates by examining the contour plots of both estimates (Output 74.3.4).
As the contour plots show, the estimate with LOGNLAMBDA=0.277 may represent the underlying trend, while the estimate with LOGNLAMBDA=-2.56 is very rough and may be modeling the noise component.
The following example illustrates how you can use the D= option to decrease the computation time needed by the TPSPLINE procedure. Note that, while the D= option can be helpful in decreasing computation time for large data sets, it may produce unexpected results when used with small data sets.
The following statements generate the data set large :
data large;
   do x=-5 to 5 by 0.02;
      y = 5*sin(3*x) + 1*rannor(57391);
      output;
   end;
run;
The data set large contains 501 observations with one independent variable x and one dependent variable y . The following statements invoke PROC TPSPLINE to produce a thin-plate smoothing spline estimate and the associated 99% confidence interval. The output statistics are saved in the data set fit1 .
proc tpspline data=large;
   model y = (x) / lambda=(-5 to -1 by 0.2) alpha=0.01;
   output out=fit1 pred lclm uclm;
run;
The results from this MODEL statement are displayed in Output 74.4.1.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations       501
Number of Missing Observations             0
Unique Smoothing Design Points           501

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

              GCV Function
   log10(n*Lambda)            GCV
         -5.000000       1.258653
         -4.800000       1.228743
         -4.600000       1.205835
         -4.400000       1.188371
         -4.200000       1.174644
         -4.000000       1.163102
         -3.800000       1.152627
         -3.600000       1.142590
         -3.400000       1.132700
         -3.200000       1.122789
         -3.000000       1.112755
         -2.800000       1.102642
         -2.600000       1.092769
         -2.400000       1.083779
         -2.200000       1.076636
         -2.000000       1.072763*
         -1.800000       1.074636
         -1.600000       1.087152
         -1.400000       1.120339
         -1.200000       1.194023
         -1.000000       1.344213

Note: * indicates minimum GCV value.
The TPSPLINE Procedure
Dependent Variable: y

    Summary Statistics of Final Estimation
log10(n*Lambda)                      -1.9483
Smoothing Penalty                  9953.7063
Residual SS                         475.0984
Tr(I-A)                             471.0861
Model DF                             29.9139
Standard Deviation                    1.0042
The following statements specify an identical model, but with the additional specification of the D= option. The estimates are obtained by treating nearby points as replicates.
proc tpspline data=large;
   model y = (x) / lambda=(-5 to -1 by 0.2) d=0.05 alpha=0.01;
   output out=fit2 pred lclm uclm;
run;
The output is displayed in Output 74.4.2.
The TPSPLINE Procedure
Dependent Variable: y

         Summary of Input Data Set
Number of Non-Missing Observations       501
Number of Missing Observations             0
Unique Smoothing Design Points           251

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

              GCV Function
   log10(n*Lambda)            GCV
         -5.000000       1.306536
         -4.800000       1.261692
         -4.600000       1.226881
         -4.400000       1.200060
         -4.200000       1.179284
         -4.000000       1.162776
         -3.800000       1.149072
         -3.600000       1.137120
         -3.400000       1.126220
         -3.200000       1.115884
         -3.000000       1.105766
         -2.800000       1.095730
         -2.600000       1.085972
         -2.400000       1.077066
         -2.200000       1.069954
         -2.000000       1.066076*
         -1.800000       1.067929
         -1.600000       1.080419
         -1.400000       1.113564
         -1.200000       1.187172
         -1.000000       1.337252

Note: * indicates minimum GCV value.
The TPSPLINE Procedure
Dependent Variable: y

    Summary Statistics of Final Estimation
log10(n*Lambda)                      -1.9477
Smoothing Penalty                  9943.5615
Residual SS                         472.1424
Tr(I-A)                             471.0901
Model DF                             29.9099
Standard Deviation                    1.0011
The difference between the two estimates is minimal. However, the CPU time for the second MODEL statement is only about 1/8 of the CPU time used in the first model fit.
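The saving comes from collapsing design points that lie within the D= distance into single unique points, which shrinks the matrix that must be decomposed. The Python sketch below illustrates one plausible greedy grouping rule (an assumption for illustration only; SAS's exact replicate-detection rule may differ, and it reports 251 unique points for this data, while this simple rule yields a different count):

```python
# x runs from -5 to 5 in steps of 0.02, as in the data set "large".
x = [-5 + 0.02 * i for i in range(501)]

d = 0.05
seeds = []
for xi in x:
    # Greedy rule (assumed): start a new group whenever the point is
    # farther than d from the current group's seed.
    if not seeds or abs(xi - seeds[-1]) > d + 1e-9:
        seeds.append(xi)

print(len(seeds))  # 167 groups for this spacing and distance
```

Either way, the cost of the decomposition is driven by the number of unique points rather than the raw sample size, which is where the roughly eightfold CPU saving comes from.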
The following statements produce a plot for comparison of the two estimates:
data fit2;
   set fit2;
   P1_y = P_y;
   LCLM1_y = LCLM_y;
   UCLM1_y = UCLM_y;
   drop P_y LCLM_y UCLM_y;
run;

proc sort data=fit1;
   by x y;
run;

proc sort data=fit2;
   by x y;
run;

data comp;
   merge fit1 fit2;
   by x y;
   label p1_y   = "Yhat1"
         p_y    = "Yhat0"
         lclm_y = "Lower CL"
         uclm_y = "Upper CL";
run;

symbol1 i=join v=none;
symbol2 i=join v=none;
symbol3 i=join v=none color=cyan;
symbol4 i=join v=none color=cyan;

title 'Comparison of Two Estimates';
title2 'with and without the D= Option';

proc gplot data=comp;
   plot P_y*x=1 P1_y*x=2 LCLM_y*x=4 UCLM_y*x=4
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          frame cframe=ligr;
run;
The estimates from fit1 and fit2 are displayed in Output 74.4.3 with the 99% confidence interval from the fit1 output data set.
The following example illustrates how you can construct a bootstrap confidence interval by using the multiple responses feature in PROC TPSPLINE.
Numerous epidemiological observations have indicated that exposure to solar radiation is an important factor in the etiology of melanoma. The following data present age-adjusted melanoma incidences for 37 years from the Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980). The data are analyzed by Ramsay and Silverman (1997).
data melanoma;
   input year incidences @@;
   datalines;
1936 0.9  1937 0.8  1938 0.8  1939 1.3  1940 1.4
1941 1.2  1942 1.7  1943 1.8  1944 1.6  1945 1.5
1946 1.5  1947 2.0  1948 2.5  1949 2.7  1950 2.9
1951 2.5  1952 3.1  1953 2.4  1954 2.2  1955 2.9
1956 2.5  1957 2.6  1958 3.2  1959 3.8  1960 4.2
1961 3.9  1962 3.7  1963 3.3  1964 3.7  1965 3.9
1966 4.1  1967 3.8  1968 4.7  1969 4.4  1970 4.8
1971 4.8  1972 4.8
;
run;
The variable incidences records the number of melanoma cases per 100,000 people for the years 1936 to 1972. The following model fits the data and requests a 90% Bayesian confidence interval along with the estimate.
proc tpspline data=melanoma;
   model incidences = (year) / alpha=0.1;
   output out=result pred uclm lclm;
run;
The output is displayed in Output 74.5.1.
The TPSPLINE Procedure
Dependent Variable: incidences

         Summary of Input Data Set
Number of Non-Missing Observations        37
Number of Missing Observations             0
Unique Smoothing Design Points            37

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

    Summary Statistics of Final Estimation
log10(n*Lambda)                       0.0607
Smoothing Penalty                     0.5171
Residual SS                           1.2243
Tr(I-A)                              22.5852
Model DF                             14.4148
Standard Deviation                    0.2328
The following statements produce a plot of the estimated curve:
symbol1 h=1pct;
symbol2 i=join v=none;
symbol3 i=join v=none;
symbol4 i=join v=none c=cyan;
legend1 frame cframe=ligr cborder=black label=none position=center;
axis1 label=(angle=90 rotate=0) minor=none;
axis2 minor=none;

title1 'Age-adjusted Melanoma Incidences for 37 years';

proc gplot data=result;
   plot incidences*year=1
        p_incidences*year=2
        lclm_incidences*year=3
        uclm_incidences*year=4
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          frame cframe=ligr;
run;
The estimated curve is displayed with 90% confidence interval bands in Output 74.5.2. The number of melanoma incidences exhibits a periodic pattern and increases over the years. The periodic pattern is related to sunspot activity and the accompanying fluctuations in solar radiation.
Wang and Wahba (1995) compared several bootstrap confidence intervals to Bayesian confidence intervals for smoothing splines. Both bootstrap and Bayesian confidence intervals are across-the-curve intervals, not pointwise intervals. They concluded that bootstrap confidence intervals work as well as Bayesian intervals with respect to average coverage probability, and that bootstrap intervals appear to be better for small sample sizes. Based on their simulations, the 'percentile-t interval' performs better than the other types of bootstrap intervals.
Suppose that f̂ and σ̂ are the estimates of f and σ from the data. Assume that f̂ is the 'true' f, and generate the bootstrap sample as follows:

   y_i* = f̂(x_i) + ε_i*,   i = 1, ..., n

where ε_i* ~ N(0, σ̂²). Denote f̂*(x_i) as the random variable of the bootstrap estimate at x_i. Repeat this process K times, so that at each point x_i you have K bootstrap estimates f̂*(x_i), or K realizations of f̂*(x_i). For each fixed x_i, consider the following statistic D_i, which is similar to a Student's t statistic:

   D_i = (f̂_i*(x_i) - f̂(x_i)) / σ̂_i*

where σ̂_i* is the estimate of σ̂ based on the ith bootstrap sample.

Suppose χ_{α/2} and χ_{1-α/2} are the lower and upper α/2 points, respectively, of the empirical distribution of D_i. The (1 - α)100% bootstrap confidence interval is defined as

   ( f̂(x_i) - χ_{1-α/2} σ̂,   f̂(x_i) - χ_{α/2} σ̂ )
Bootstrap confidence intervals are easy to interpret and can be used with any distribution. However, because they require K model fits, their construction is computationally intensive.
The multiple dependent variables feature in PROC TPSPLINE enables you to fit multiple models with the same independent variables. The procedure calculates the matrix decomposition part of the calculations only once, regardless of the number of dependent variables in the model. These calculations are responsible for most of the computing time used by the TPSPLINE procedure. This feature is particularly useful when you need to generate a bootstrap confidence interval.
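The saving has a simple linear-algebra analogue: the smoother matrix depends only on the independent variables, so solving A C = Y for a multi-column Y requires one decomposition of A rather than one per column. A Python sketch with NumPy (the matrices are random illustrative stand-ins, not TPSPLINE's actual system):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 50               # n design points, k dependent variables
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)  # symmetric positive definite system matrix
Y = rng.standard_normal((n, k))

# One solve handles all k right-hand sides; the factorization of A
# is computed once, not k times.
C_all = np.linalg.solve(A, Y)

# Same answer as solving column by column, but far cheaper for large n.
C_cols = np.column_stack([np.linalg.solve(A, Y[:, j]) for j in range(k)])
print(np.allclose(C_all, C_cols))  # True
```

This is why fitting 1,070 bootstrap responses in one PROC TPSPLINE call is far cheaper than 1,070 separate fits.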
To construct a bootstrap confidence interval, perform the following tasks:

1. Fit the data by using PROC TPSPLINE and obtain the estimates f̂(x_i) and σ̂.
2. Generate K bootstrap samples based on f̂(x_i) and σ̂.
3. Fit the K bootstrap samples with the TPSPLINE procedure to obtain estimates f̂_i*(x_i) and σ̂_i*.
4. Compute D_i and the values χ_{α/2} and χ_{1-α/2}.
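Outside SAS, these steps can be sketched in Python, using a polynomial fit as a hypothetical stand-in for the spline smoother (the smoother, random seed, and sample sizes below are illustrative assumptions, not the procedure's actual computation):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 37)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(37)

def smooth(x, y, deg=5):
    """Stand-in smoother: polynomial fit returning fitted values and
    the residual standard deviation."""
    fhat = np.polyval(np.polyfit(x, y, deg), x)
    resid = y - fhat
    sigma = np.sqrt(resid @ resid / (len(y) - (deg + 1)))
    return fhat, sigma

fhat, sigma = smooth(x, y)                    # step 1: initial fit

K, alpha = 200, 0.1
D = np.empty((K, len(x)))
for k in range(K):
    ystar = fhat + sigma * rng.standard_normal(len(x))  # step 2: resample
    fstar, sstar = smooth(x, ystar)                     # step 3: refit
    D[k] = (fstar - fhat) / sstar                       # step 4: D statistic

chi_lo = np.quantile(D, alpha / 2, axis=0)
chi_hi = np.quantile(D, 1 - alpha / 2, axis=0)
lower = fhat - chi_hi * sigma   # (1-alpha)100% percentile-t interval
upper = fhat - chi_lo * sigma
print(np.all(lower < upper))  # True
```

The SAS implementation that follows performs the same four steps, with PROC TPSPLINE as the smoother and the bad GCV fits screened out.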
The following statements illustrate this process:
proc tpspline data=melanoma;
   model incidences = (year) / alpha=0.05;
   output out=result pred uclm lclm;
run;
The output from the initial PROC TPSPLINE analysis is displayed in Output 74.5.3. The data set result contains the predicted values and confidence limits from the analysis.
The TPSPLINE Procedure
Dependent Variable: incidences

         Summary of Input Data Set
Number of Non-Missing Observations        37
Number of Missing Observations             0
Unique Smoothing Design Points            37

           Summary of Final Model
Number of Regression Variables             0
Number of Smoothing Variables              1
Order of Derivative in the Penalty         2
Dimension of Polynomial Space              2

    Summary Statistics of Final Estimation
log10(n*Lambda)                       0.0607
Smoothing Penalty                     0.5171
Residual SS                           1.2243
Tr(I-A)                              22.5852
Model DF                             14.4148
Standard Deviation                    0.2328
The following statements illustrate how you can obtain a bootstrap confidence interval for the Melanoma data. The DATA step creates the data set bootstrap with information from the preceding PROC TPSPLINE execution; as displayed in Output 74.5.3, σ̂ = 0.232823. The values of f̂(x_i) are stored in the data set result in the variable P_incidences.
data bootstrap;
   set result;
   array y{1070} y1-y1070;
   do i=1 to 1070;
      y{i} = p_incidences + 0.232823*rannor(123456789);
   end;
   keep y1-y1070 p_incidences year;
run;

ods listing close;

proc tpspline data=bootstrap;
   ods output FitStatistics=FitResult;
   id p_incidences;
   model y1-y1070 = (year);
   output out=result2;
run;

ods listing;
The DATA step generates 1,070 bootstrap samples based on the previous estimate from PROC TPSPLINE. For this data set, some of the bootstrap samples result in λ values (selected by the GCV function) that cause problematic behavior. Thus, an additional 70 bootstrap samples are generated.
The ODS listing destination is closed before PROC TPSPLINE is invoked. The MODEL statement uses all of the variables y1-y1070 as dependent variables, and the models for all bootstrap samples are fit simultaneously. The output data set result2 contains the variables year, y1-y1070, p_y1-p_y1070, and P_incidences.
The ODS OUTPUT statement writes the FitStatistics table to the data set FitResult. The data set FitResult contains two variables, Parameter and Value, and is used in subsequent calculations of D_i.
In the data set FitResult, there are 63 estimates with a standard deviation of zero, suggesting that those estimates fit the data perfectly; they result from λ values that are approximately equal to zero. For small sample sizes, there is a positive probability that the λ chosen by the GCV function will be zero (refer to Wang and Wahba 1995).
In the following steps, these cases are removed from the bootstrap samples as 'bad' samples: they represent failure of the GCV function.
The following SAS statements manipulate the data set FitResult, retaining the standard deviations for all bootstrap samples and merging FitResult with the data set result2, which contains the estimates for the bootstrap samples. In the final data set boot, the D_i statistics are calculated.
data FitResult;
   set FitResult;
   if Parameter="Standard Deviation";
   keep Value;
run;

proc transpose data=FitResult out=sd prefix=sd;
run;

data result2;
   if _N_ = 1 then set sd;
   set result2;
run;

data boot;
   set result2;
   array y{1070} p_y1-p_y1070;
   array sd{1070} sd1-sd1070;
   do i=1 to 1070;
      if sd{i} > 0 then do;
         d = (y{i} - P_incidences)/sd{i};
         obs = _N_;
         output;
      end;
   end;
   keep d obs P_incidences year;
run;
The following SAS statements retain the first 1,000 bootstrap samples and calculate the values χ_{α/2} and χ_{1-α/2} with α = 0.1.
proc sort data=boot;
   by obs;
run;

data boot;
   set boot;
   by obs;
   retain n;
   if first.obs then n=1;
   else n=n+1;
   if n > 1000 then delete;
run;

proc sort data=boot;
   by obs d;
run;

data chi1 chi2;
   set boot;
   if (_N_ = (obs-1)*1000+50)  then output chi1;
   if (_N_ = (obs-1)*1000+950) then output chi2;
run;

proc sort data=result;
   by year;
run;

proc sort data=chi1;
   by year;
run;

proc sort data=chi2;
   by year;
run;

data result;
   merge result chi1(rename=(d=chi05)) chi2(rename=(d=chi95));
   keep year incidences P_incidences lower upper
        LCLM_incidences UCLM_incidences;
   lower = -chi95*0.232823 + P_incidences;
   upper = -chi05*0.232823 + P_incidences;
   label lower="Lower 90% CL (Bootstrap)"
         upper="Upper 90% CL (Bootstrap)"
         lclm_incidences="Lower 90% CL (Bayesian)"
         uclm_incidences="Upper 90% CL (Bayesian)";
run;
The data set result contains the variables year , incidences , the TPSPLINE estimate P_incidences , and the 90% Bayesian and 90% bootstrap confidence intervals.
The following statements produce Output 74.5.4:
symbol1 v=dot h=1pct;
symbol2 i=join v=none l=1;
symbol3 i=join v=none l=33;
symbol4 i=join v=none l=33;
symbol5 i=join v=none l=43 c=green;
symbol6 i=join v=none l=43 c=green;

title1 'Age-adjusted Melanoma Incidences for 37 years';

proc gplot data=result;
   plot incidences*year=1
        p_incidences*year=2
        lclm_incidences*year=3
        uclm_incidences*year=3
        lower*year=4
        upper*year=4
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          frame cframe=ligr;
run;
Output 74.5.4 displays the plot of the variable incidences , the predicted values, and the Bayesian and bootstrap confidence intervals.
The plot shows that the bootstrap confidence interval is similar to the Bayesian confidence interval. However, the Bayesian confidence interval is symmetric around the estimates, while the bootstrap confidence interval is not.