The following data from the Connecticut Tumor Registry present age-adjusted numbers of melanoma incidences per 100,000 people for the 37 years from 1936 to 1972 (Houghton, Flannery, and Viola 1980).
   data Melanoma;
      input Year Incidences @@;
      format Year d4.0;
      format DepVar d4.1;
      datalines;
   1936 0.9  1937 0.8  1938 0.8  1939 1.3  1940 1.4  1941 1.2
   1942 1.7  1943 1.8  1944 1.6  1945 1.5  1946 1.5  1947 2.0
   1948 2.5  1949 2.7  1950 2.9  1951 2.5  1952 3.1  1953 2.4
   1954 2.2  1955 2.9  1956 2.5  1957 2.6  1958 3.2  1959 3.8
   1960 4.2  1961 3.9  1962 3.7  1963 3.3  1964 3.7  1965 3.9
   1966 4.1  1967 3.8  1968 4.7  1969 4.4  1970 4.8  1971 4.8
   1972 4.8
   ;
The following PROC GPLOT statements produce the simple scatter plot of these data displayed in Figure 41.1.
   symbol1 color=black value=dot;
   proc gplot data=Melanoma;
      title1 'Scatter Plot of Melanoma Data';
      plot Incidences*Year;
   run;
Suppose that you want to smooth the response variable Incidences as a function of the variable Year. The following PROC LOESS statements request this analysis:
   proc loess data=Melanoma;
      model Incidences=Year / details(OutputStatistics);
   run;
You use the PROC LOESS statement to invoke the procedure and specify the data set. The MODEL statement names the dependent and independent variables. You use options in the MODEL statement to specify fitting parameters and control the displayed output. For example, the MODEL statement option DETAILS(OutputStatistics) requests that the Output Statistics table be included in the displayed output. By default, this table is not displayed.
The results are displayed in Figure 41.2 and Figure 41.3.
Loess Fit of Melanoma Data

The LOESS Procedure

Independent Variable Scaling
Scaling applied: None

   Statistic            Year
   Minimum Value        1936
   Maximum Value        1972

Dependent Variable: Incidences

Optimal Smoothing Criterion

                   Smoothing
        AICC       Parameter
    -1.17277         0.25676

Selected Smoothing Parameter: 0.257
Dependent Variable: Incidences

Fit Summary

   Fit Method                       kd Tree
   Blending                         Linear
   Number of Observations           37
   Number of Fitting Points         37
   kd Tree Bucket Size              1
   Degree of Local Polynomials      1
   Smoothing Parameter              0.25676
   Points in Local Neighborhood     9
   Residual Sum of Squares          2.03105
   Trace[L]                         8.62243
   GCV                              0.00252
   AICC                             -1.17277
Loess Fit of Melanoma Data

The LOESS Procedure
Selected Smoothing Parameter: 0.257
Dependent Variable: Incidences

Output Statistics

                                Predicted
   Obs    Year    Incidences    Incidences
     1    1936       0.9         0.76235
     2    1937       0.8         0.88992
     3    1938       0.8         1.01764
     4    1939       1.3         1.14303
     5    1940       1.4         1.28654
     6    1941       1.2         1.44528
     7    1942       1.7         1.53482
     8    1943       1.8         1.57895
     9    1944       1.6         1.62058
    10    1945       1.5         1.68627
    11    1946       1.5         1.82449
    12    1947       2.0         2.04976
    13    1948       2.5         2.30981
    14    1949       2.7         2.53653
    15    1950       2.9         2.68921
    16    1951       2.5         2.70779
    17    1952       3.1         2.64837
    18    1953       2.4         2.61468
    19    1954       2.2         2.58792
    20    1955       2.9         2.57877
    21    1956       2.5         2.71078
    22    1957       2.6         2.96981
    23    1958       3.2         3.26005
    24    1959       3.8         3.54143
    25    1960       4.2         3.73482
    26    1961       3.9         3.78186
    27    1962       3.7         3.74362
    28    1963       3.3         3.70904
    29    1964       3.7         3.72917
    30    1965       3.9         3.82382
    31    1966       4.1         4.00515
    32    1967       3.8         4.18573
    33    1968       4.7         4.35152
    34    1969       4.4         4.50284
    35    1970       4.8         4.64413
    36    1971       4.8         4.78291
    37    1972       4.8         4.91602
Usually, such displayed results are of limited use. Most frequently the results are needed in an output data set so that they can be displayed graphically and analyzed further. For example, to place the Output Statistics table shown in Figure 41.3 in an output data set, you use the ODS OUTPUT statement as follows:
   proc loess data=Melanoma;
      model Incidences=Year;
      ods output OutputStatistics=Results;
   run;
The statement
ods output OutputStatistics=Results;
requests that the Output Statistics table that appears in Figure 41.3 be placed in a SAS data set named Results. Note also that the DETAILS(OutputStatistics) option that caused this table to be included in the displayed output need not be specified.
The PRINT procedure displays the first five observations of this data set:
   title1 'First 5 Observations of the Results Data Set';
   proc print data=Results(obs=5);
      id obs;
   run;
First 5 Observations of the Results Data Set

                   Dep
   Obs    Year     Var       Pred
     1    1936     0.9    0.76235
     2    1937     0.8    0.88992
     3    1938     0.8    1.01764
     4    1939     1.3    1.14303
     5    1940     1.4    1.28654
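The ODS OUTPUT statement above relies on knowing the ODS table name OutputStatistics. As a minimal sketch, assuming you want to discover the names of the tables that a PROC LOESS run produces, you can bracket the step with ODS TRACE statements; the names of the output objects are then written to the SAS log.

```sas
/* Sketch: list the ODS table names created by a PROC LOESS step. */
/* The trace records (name, label, template, path) go to the log. */
ods trace on;
proc loess data=Melanoma;
   model Incidences=Year / details(OutputStatistics);
run;
ods trace off;
```

Any table name reported in the log can be used on the left-hand side of an ODS OUTPUT assignment in the same way as OutputStatistics.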
You can now produce a scatter plot including the fitted loess curve as follows:
   symbol1 color=black value=dot;
   symbol2 color=black interpol=join value=none;

   /* macro variable reused in subsequent examples */
   %let opts=vaxis=axis1 hm=3 vm=3 overlay;

   axis1 label=(angle=90 rotate=0);
   proc gplot data=Results;
      title1 'Melanoma Data with Default LOESS Fit';
      plot DepVar*Year Pred*Year / &opts;
   run;
The smoothing parameter value used in the loess fit shown in Figure 41.5 was obtained with the default method of selecting this value. This method minimizes a bias-corrected AIC criterion (Hurvich, Simonoff, and Tsai 1998), which balances the residual sum of squares against the smoothness of the fit.
You can find the selected smoothing parameter value in the Optimal Smoothing Criterion table shown in Figure 41.2. Note that with this smoothing parameter value, the loess fit captures the increasing trend in the data as well as the periodic pattern in the data, which is related to an 11-year sunspot activity cycle.
You can obtain a summary of all the models that PROC LOESS evaluated in choosing this smoothing parameter value in the Model Summary table. You request this optionally displayed table by adding ModelSummary to the DETAILS option in the MODEL statement as follows:
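To make the trade-off in this criterion concrete, the following DATA step is a rough sketch that recomputes GCV and AICC from the residual sum of squares and Trace[L] reported in the Fit Summary table. The formulas used here, AICC = log(sigma2) + 1 + 2(Trace[L] + 1)/(n - Trace[L] - 2) and GCV = n*sigma2/(n - Trace[L])**2 with sigma2 = RSS/n, are an assumption based on the usual definitions of these statistics for loess fits; the variable names n, rss, trace, and sigma2 are illustrative only.

```sas
/* Sketch: recompute the selection criteria from Fit Summary values. */
/* Formulas are assumed definitions, not taken from this section.    */
data _null_;
   n      = 37;        /* Number of Observations          */
   rss    = 2.03105;   /* Residual Sum of Squares         */
   trace  = 8.62243;   /* Trace[L], trace of smoother L   */
   sigma2 = rss / n;   /* residual variance estimate      */

   /* log(sigma2) rewards small residuals; the last term  */
   /* penalizes fits with many effective parameters       */
   aicc = log(sigma2) + 1 + 2*(trace + 1) / (n - trace - 2);
   gcv  = n*sigma2 / (n - trace)**2;

   put aicc= 8.5 gcv= 8.5;   /* written to the SAS log */
run;
```

You can compare the values written to the log with the GCV and AICC entries of the Fit Summary table. A smaller smoothing parameter reduces the residual sum of squares but increases Trace[L], so the penalty term grows as the fit becomes less smooth.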
   proc loess data=Melanoma;
      model Incidences=Year / details(OutputStatistics ModelSummary);
      ods output OutputStatistics=Results;
   run;
Note that this example shows that you can request more than one optional table by using the DETAILS option. The requested Model Summary table is shown in Figure 41.6.
Loess Fit of Melanoma Data

The LOESS Procedure
Dependent Variable: Incidences

Model Summary

   Smoothing     Local
   Parameter    Points    Residual SS        GCV        AICC
     0.41892        15        3.42229    0.00339    -0.96252
     0.68919        25        4.05838    0.00359    -0.93459
     0.31081        11        2.51054    0.00279    -1.12034
     0.20270         7        1.58513    0.00239    -1.12221
     0.17568         6        1.56896    0.00241    -1.09706
     0.28378        10        2.50487    0.00282    -1.10402
     0.20270         7        1.58513    0.00239    -1.12221
     0.25676         9        2.03105    0.00252    -1.17277
     0.22973         8        2.02965    0.00256    -1.15145
     0.25676         9        2.03105    0.00252    -1.17277
Rather than use an automatic method for selecting the smoothing parameter value, you may want to examine loess fits for a range of values. You do this by using the SMOOTH= option in the MODEL statement as follows:
   proc loess data=Melanoma;
      model Incidences=Year / smooth=0.1 0.2 0.3 0.4 residual;
      ods output OutputStatistics=Results;
   run;
The RESIDUAL option causes the residuals to be added to the Output Statistics table. PROC PRINT displays the first five observations of this data set:
   proc print data=Results(obs=5);
      id obs;
   run;
First 5 Observations of the Results Data Set

          Smoothing             Dep
   Obs    Parameter    Year     Var       Pred    Residual
     1          0.1    1936     0.9    0.90000           0
     2          0.1    1937     0.8    0.80000           0
     3          0.1    1938     0.8    0.80000           0
     4          0.1    1939     1.3    1.30000           0
     5          0.1    1940     1.4    1.40000           0

Note that the fits for all the smoothing parameters are placed in a single data set and that ODS has added a SmoothingParameter variable to this data set that you can use to distinguish each fit.
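As one illustration of using the SmoothingParameter variable, the following sketch summarizes the residuals separately for each fit in the Results data set. It assumes that the RESIDUAL option was specified as above, so that the Residual variable is present; the choice of PROC MEANS and of the N and USS (uncorrected sum of squares) statistics is illustrative and not part of the original example.

```sas
/* Sketch: residual sum of squares for each smoothing parameter.  */
/* CLASS groups the rows by the variable that ODS added, so the   */
/* data set does not need to be sorted first.                     */
proc means data=Results n uss;
   class SmoothingParameter;
   var Residual;
run;
```

The residual sum of squares alone typically decreases as the smoothing parameter decreases, so it is not by itself a criterion for choosing among the fits.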
The following statements display the loess fits obtained in a 2 by 2 plot grid:
   goptions nodisplay;
   proc gplot data=Results;
      by SmoothingParameter;
      plot DepVar*Year=1 Pred*Year / &opts name='fit';
   run; quit;

   goptions display;
   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:fit 2:fit2 3:fit1 4:fit3;
   run; quit;
   /* smooth the residuals from each fit against Year */
   proc loess data=Results;
      by SmoothingParameter;
      ods output OutputStatistics=residout;
      model Residual=Year / smooth=0.3;
   run;

   axis1 label = (angle=90 rotate=0)
         order = (-0.8 to 0.8 by 0.4);
   goptions nodisplay;
   proc gplot data=residout;
      by SmoothingParameter;
      format DepVar 3.1;
      plot DepVar*Year Pred*Year / &opts vref=0 lv=2 vm=1
                                   name='resids';
   run; quit;

   goptions display;
   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:resids 2:resids2 3:resids1 4:resids3;
   run; quit;
Having now decided on a loess fit, you may want to obtain confidence limits for your model predictions. This is done by adding the CLM option in the MODEL statement. By default, 95% limits are produced, but this can be changed by using the ALPHA= option in the MODEL statement. The following statements add 90% confidence limits to the Results data set and display the results graphically:
   proc loess data=Melanoma;
      model Incidences=Year / smooth=0.2 residual clm alpha=0.1;
      ods output OutputStatistics=Results;
   run;

   symbol3 color=green interpol=join value=none;
   symbol4 color=green interpol=join value=none;
   axis1 label = (angle=90 rotate=0)
         order = (0 to 6);
   title1 'Age-adjusted Melanoma Incidences for 37 Years';
   proc gplot data=Results;
      plot DepVar*Year Pred*Year LowerCL*Year UpperCL*Year / &opts;
   run;