Getting Started


Scatter Plot Smoothing

The following data from the Connecticut Tumor Registry presents age-adjusted numbers of melanoma incidences per 100,000 people for 37 years from 1936 to 1972 (Houghton, Flannery, and Viola 1980).

   data Melanoma;
      input Year Incidences @@;
      format Year d4.0;
      format DepVar d4.1;
      datalines;
   1936  0.9   1937  0.8   1938  0.8   1939  1.3
   1940  1.4   1941  1.2   1942  1.7   1943  1.8
   1944  1.6   1945  1.5   1946  1.5   1947  2.0
   1948  2.5   1949  2.7   1950  2.9   1951  2.5
   1952  3.1   1953  2.4   1954  2.2   1955  2.9
   1956  2.5   1957  2.6   1958  3.2   1959  3.8
   1960  4.2   1961  3.9   1962  3.7   1963  3.3
   1964  3.7   1965  3.9   1966  4.1   1967  3.8
   1968  4.7   1969  4.4   1970  4.8   1971  4.8
   1972  4.8
   ;

The following PROC GPLOT statements produce the simple scatter plot of these data displayed in Figure 41.1.

   symbol1 color=black value=dot;
   proc gplot data=Melanoma;
      title1 'Scatter Plot of Melanoma Data';
      plot Incidences*Year;
   run;
Figure 41.1: Scatter Plot of Incidences versus Year for the Melanoma Data

Suppose that you want to smooth the response variable Incidences as a function of the variable Year. The following PROC LOESS statements request this analysis:

   proc loess data=Melanoma;
      model Incidences=Year / details(OutputStatistics);
   run;

You use the PROC LOESS statement to invoke the procedure and specify the data set. The MODEL statement names the dependent and independent variables. You use options in the MODEL statement to specify fitting parameters and control the displayed output. For example, the MODEL statement option DETAILS(OutputStatistics) requests that the Output Statistics table be included in the displayed output. By default, this table is not displayed.

The results are displayed in Figure 41.2 and Figure 41.3.

                  Loess Fit of Melanoma Data

                     The LOESS Procedure

                 Independent Variable Scaling

                    Scaling applied: None

                  Statistic              Year

                  Minimum Value          1936
                  Maximum Value          1972

                  Loess Fit of Melanoma Data

                     The LOESS Procedure
                Dependent Variable: Incidences

                 Optimal Smoothing Criterion

                                   Smoothing
                          AICC     Parameter

                      -1.17277       0.25676

                  Loess Fit of Melanoma Data

                     The LOESS Procedure
             Selected Smoothing Parameter: 0.257
                Dependent Variable: Incidences

                         Fit Summary

           Fit Method                       kd Tree
           Blending                          Linear
           Number of Observations                37
           Number of Fitting Points              37
           kd Tree Bucket Size                    1
           Degree of Local Polynomials            1
           Smoothing Parameter              0.25676
           Points in Local Neighborhood           9
           Residual Sum of Squares          2.03105
           Trace[L]                         8.62243
           GCV                              0.00252
           AICC                            -1.17277

Figure 41.2: Output from PROC LOESS
                  Loess Fit of Melanoma Data

                     The LOESS Procedure
             Selected Smoothing Parameter: 0.257
                Dependent Variable: Incidences

                      Output Statistics

                                            Predicted
          Obs    Year    Incidences        Incidences

            1    1936           0.9           0.76235
            2    1937           0.8           0.88992
            3    1938           0.8           1.01764
            4    1939           1.3           1.14303
            5    1940           1.4           1.28654
            6    1941           1.2           1.44528
            7    1942           1.7           1.53482
            8    1943           1.8           1.57895
            9    1944           1.6           1.62058
           10    1945           1.5           1.68627
           11    1946           1.5           1.82449
           12    1947           2.0           2.04976
           13    1948           2.5           2.30981
           14    1949           2.7           2.53653
           15    1950           2.9           2.68921
           16    1951           2.5           2.70779
           17    1952           3.1           2.64837
           18    1953           2.4           2.61468
           19    1954           2.2           2.58792
           20    1955           2.9           2.57877
           21    1956           2.5           2.71078
           22    1957           2.6           2.96981
           23    1958           3.2           3.26005
           24    1959           3.8           3.54143
           25    1960           4.2           3.73482
           26    1961           3.9           3.78186
           27    1962           3.7           3.74362
           28    1963           3.3           3.70904
           29    1964           3.7           3.72917
           30    1965           3.9           3.82382
           31    1966           4.1           4.00515
           32    1967           3.8           4.18573
           33    1968           4.7           4.35152
           34    1969           4.4           4.50284
           35    1970           4.8           4.64413
           36    1971           4.8           4.78291
           37    1972           4.8           4.91602

Figure 41.3: Output from PROC LOESS continued

Displayed results such as these are usually of limited use. More often, you need the results in an output data set so that they can be displayed graphically and analyzed further. For example, to place the Output Statistics table shown in Figure 41.3 in an output data set, you use the ODS OUTPUT statement as follows:

   proc loess data=Melanoma;
      model Incidences=Year;
      ods output OutputStatistics=Results;
   run;

The statement

  ods output OutputStatistics=Results;  

requests that the Output Statistics table that appears in Figure 41.3 be placed in a SAS data set named Results. Note also that the DETAILS(OutputStatistics) option, which caused this table to be included in the displayed output, need not be specified.

The PRINT procedure displays the first five observations of this data set:

   title1 'First 5 Observations of the Results Data Set';
   proc print data=Results(obs=5);
      id obs;
   run;
          First 5 Observations of the Results Data Set

                              Dep
          Obs     Year        Var           Pred

            1     1936        0.9        0.76235
            2     1937        0.8        0.88992
            3     1938        0.8        1.01764
            4     1939        1.3        1.14303
            5     1940        1.4        1.28654

Figure 41.4: PROC PRINT Output of the Results Data Set
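The variable names DepVar and Pred in the Results data set are assigned by ODS, not taken from the MODEL statement. The following is a minimal sketch of how you might discover ODS table names and the variables in the resulting data set; it uses the standard ODS TRACE statement and PROC CONTENTS and assumes that the Melanoma and Results data sets from the preceding steps exist.

```sas
/* Write the name of every ODS table that the procedure creates to the SAS log */
ods trace on;
proc loess data=Melanoma;
   model Incidences=Year / details(OutputStatistics);
run;
ods trace off;

/* List the variables in the Results data set in creation order */
proc contents data=Results varnum;
run;
```

The table names reported in the log are the names that you can use on the left side of the equal sign in an ODS OUTPUT statement.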

You can now produce a scatter plot including the fitted loess curve as follows:

   symbol1 color=black value=dot;
   symbol2 color=black interpol=join value=none;

   /* macro variable used in subsequent examples */
   %let opts=vaxis=axis1 hm=3 vm=3 overlay;

   axis1 label=(angle=90 rotate=0);
   proc gplot data=Results;
      title1 'Melanoma Data with Default LOESS Fit';
      plot DepVar*Year Pred*Year / &opts;
   run;

The smoothing parameter value used in the loess fit shown in Figure 41.5 was obtained with the default selection method. This method minimizes a bias-corrected AIC criterion, AICC (Hurvich, Simonoff, and Tsai 1998), which balances the residual sum of squares against the smoothness of the fit.

Figure 41.5: Default Loess Fit for Melanoma Data
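As an alternative sketch of the same overlay, the following statements use PROC SGPLOT. This assumes a SAS release in which PROC SGPLOT is available (it is not part of SAS 9.1) and that the Results data set contains the variables Year, DepVar, and Pred as shown in Figure 41.4. The axis label is an illustrative choice.

```sas
/* Sketch only: assumes PROC SGPLOT is available in your SAS release */
title1 'Melanoma Data with Default LOESS Fit';
proc sgplot data=Results;
   scatter x=Year y=DepVar;          /* observed incidences */
   series  x=Year y=Pred;            /* fitted loess curve  */
   yaxis label='Incidences';
run;
```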

You can find the selected smoothing parameter value in the Optimal Smoothing Criterion table shown in Figure 41.2. With this smoothing parameter value, the loess fit captures both the increasing trend in the data and the periodic pattern, which is related to an 11-year sunspot activity cycle.
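As a rough consistency check on the selection criterion, you can recompute the criterion values from the Fit Summary in Figure 41.2. The formulas below are an assumption (the form of the bias-corrected AIC commonly attributed to Hurvich, Simonoff, and Tsai 1998, and a generalized cross validation statistic), not something stated in the text above. Here n is the number of observations, RSS is the residual sum of squares, and L is the smoothing matrix whose trace is reported as Trace[L].

```latex
\begin{align*}
\mathrm{AICC} &= \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 1
                 + \frac{2\,(\mathrm{Trace}(L) + 1)}{n - \mathrm{Trace}(L) - 2} \\
              &= \log\!\left(\frac{2.03105}{37}\right) + 1
                 + \frac{2\,(8.62243 + 1)}{37 - 8.62243 - 2} \\
              &\approx -2.90236 + 1 + 0.72959 \approx -1.17277 \\[6pt]
\mathrm{GCV}  &= \frac{\mathrm{RSS}}{\left(n - \mathrm{Trace}(L)\right)^{2}}
               = \frac{2.03105}{(28.37757)^{2}} \approx 0.00252
\end{align*}
```

Under these assumed formulas, both results agree with the reported GCV and with the magnitude of the reported AICC. Note that the computed AICC is negative, so the selected fit is the one with the most negative criterion value.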

The Model Summary table summarizes all the models that PROC LOESS evaluated in choosing this smoothing parameter value. This table is not displayed by default; you request it by adding ModelSummary to the DETAILS option in the MODEL statement as follows:

   proc loess data=Melanoma;
      model Incidences=Year / details(OutputStatistics ModelSummary);
      ods output OutputStatistics=Results;
   run;

This example also shows that you can request more than one optional table by using the DETAILS option. The requested Model Summary table is shown in Figure 41.6.

                  Loess Fit of Melanoma Data

                     The LOESS Procedure
                Dependent Variable: Incidences

                        Model Summary

     Smoothing     Local
     Parameter    Points    Residual SS         GCV        AICC

       0.41892        15        3.42229     0.00339    -0.96252
       0.68919        25        4.05838     0.00359    -0.93459
       0.31081        11        2.51054     0.00279    -1.12034
       0.20270         7        1.58513     0.00239    -1.12221
       0.17568         6        1.56896     0.00241    -1.09706
       0.28378        10        2.50487     0.00282    -1.10402
       0.20270         7        1.58513     0.00239    -1.12221
       0.25676         9        2.03105     0.00252    -1.17277
       0.22973         8        2.02965     0.00256    -1.15145
       0.25676         9        2.03105     0.00252    -1.17277

Figure 41.6: Model Summary Table

Rather than use an automatic method for selecting the smoothing parameter value, you may want to examine loess fits for a range of values. You do this by using the SMOOTH= option in the MODEL statement as follows:

   proc loess data=Melanoma;
      model Incidences=Year / smooth=0.1 0.2 0.3 0.4 residual;
      ods output OutputStatistics=Results;
   run;

The RESIDUAL option causes the residuals to be added to the Output Statistics table. PROC PRINT displays the first five observations of this data set:

   proc print data=Results(obs=5);
      id obs;
   run;
          First 5 Observations of the Results Data Set

               Smoothing              Dep
        Obs    Parameter    Year      Var         Pred    Residual

          1       0.1       1936      0.9      0.90000           0
          2       0.1       1937      0.8      0.80000           0
          3       0.1       1938      0.8      0.80000           0
          4       0.1       1939      1.3      1.30000           0
          5       0.1       1940      1.4      1.40000           0

Figure 41.7: PROC PRINT Output of the Results Data Set

Note that the fits for all the smoothing parameters are placed in a single data set and that ODS has added a SmoothingParameter variable to this data set, which you can use to distinguish the fits.
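Because all four fits share one data set, you can summarize them by group. The following sketch, which assumes the Results data set created by the preceding PROC LOESS step (with the SmoothingParameter and Residual variables shown in Figure 41.7), uses PROC MEANS to compute the uncorrected sum of squares of the residuals for each smoothing parameter value. That quantity is the residual sum of squares of each fit.

```sas
/* Residual sum of squares for each smoothing parameter value */
proc means data=Results n uss;
   class SmoothingParameter;   /* one summary row per fit         */
   var Residual;               /* USS = sum of squared residuals  */
run;
```

A smaller residual sum of squares alone does not indicate a better fit: the value for smoothing parameter 0.1 is essentially zero because that fit interpolates the data.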

The following statements display the loess fits obtained in a 2 by 2 plot grid:

   goptions nodisplay;
   proc gplot data=Results;
      by SmoothingParameter;
      plot DepVar*Year=1 Pred*Year / &opts name='fit';
   run; quit;

   goptions display;
   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:fit 2:fit2 3:fit1 4:fit3;
   run; quit;

If you examine the plots in Figure 41.8, you see that a good fit is obtained with smoothing parameter value 0.2. You can gain further insight into how to choose the smoothing parameter value by examining scatter plots of the fit residuals versus the year. To aid the interpretation of these scatter plots, you can again use PROC LOESS to smooth the response Residual as a function of Year.

Figure 41.8: Loess Fits with a Range of Smoothing Parameters
   proc loess data=Results;
      by SmoothingParameter;
      ods output OutputStatistics=residout;
      model Residual=Year / smooth=0.3;
   run;

   axis1 label = (angle=90 rotate=0)
         order = (-0.8 to 0.8 by 0.4);
   goptions nodisplay;
   proc gplot data=residout;
      by SmoothingParameter;
      format DepVar 3.1;
      plot DepVar*Year Pred*Year / &opts vref=0 lv=2 vm=1
                                   name='resids';
   run; quit;

   goptions display;
   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:resids 2:resids2 3:resids1 4:resids3;
   run; quit;

The scatter plots in Figure 41.9 confirm that the choice 0.2 is reasonable. With smoothing parameter value 0.1, there is gross overfitting in the sense that the original data are exactly interpolated. The loess fits on the Residual versus Year scatter plots for smoothing parameter values 0.3 and 0.4 reveal a periodic trend in the residuals that is much weaker when the value is 0.2. This suggests that when the smoothing parameter value is above 0.3, an overly smooth fit is obtained that misses essential features in the original data.

Figure 41.9: Scatter Plots of Residuals versus Year

Having now decided on a loess fit, you may want to obtain confidence limits for your model predictions. You do this by adding the CLM option in the MODEL statement. By default, 95% limits are produced, but you can change this by using the ALPHA= option in the MODEL statement. The following statements add 90% confidence limits to the Results data set and display the results graphically:

   proc loess data=Melanoma;
      model Incidences=Year / smooth=0.2 residual clm
                              alpha=0.1;
      ods output OutputStatistics=Results;
   run;

   symbol3 color=green interpol=join value=none;
   symbol4 color=green interpol=join value=none;
   axis1 label = (angle=90 rotate=0)
         order = (0 to 6);
   title1 'Age-adjusted Melanoma Incidences for 37 Years';
   proc gplot data=Results;
      plot DepVar*Year Pred*Year LowerCL*Year UpperCL*Year
           / &opts;
   run;
Figure 41.10: Loess fit of Melanoma Data with 90% Confidence Bands
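The confidence limits can also be drawn as a shaded band rather than as two separate curves. The following is a sketch under the assumption that PROC SGPLOT is available in your SAS release (it is not part of SAS 9.1) and that the Results data set contains the LowerCL and UpperCL variables added by the CLM option. The transparency value, legend label, and axis label are illustrative choices.

```sas
/* Sketch only: assumes PROC SGPLOT is available in your SAS release */
title1 'Age-adjusted Melanoma Incidences for 37 Years';
proc sgplot data=Results;
   band    x=Year lower=LowerCL upper=UpperCL
           / transparency=0.5 legendlabel='90% confidence limits';
   scatter x=Year y=DepVar;          /* observed incidences */
   series  x=Year y=Pred;            /* loess fit, smooth=0.2 */
   yaxis label='Incidences' min=0 max=6;
run;
```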



SAS/STAT 9.1 User's Guide, Volume 3 (2004)
