Investigators studied the exhaust emissions of a one cylinder engine (Brinkman 1981). The SAS data set Gas contains the results data. The dependent variable, NOx , measures the concentration, in micrograms per joule, of nitric oxide and nitrogen dioxide normalized by the amount of work of the engine. The independent variable, E , is a measure of the richness of the air and fuel mixture.
data Gas; input NOx E; format NOx f3.1; format E f3.1; datalines; 4.818 0.831 2.849 1.045 3.275 1.021 4.691 0.97 4.255 0.825 5.064 0.891 2.118 0.71 4.602 0.801 2.286 1.074 0.97 1.148 3.965 1 5.344 0.928 3.834 0.767 1.99 0.701 5.199 0.807 5.283 0.902 3.752 0.997 0.537 1.224 1.64 1.089 5.055 0.973 4.937 0.98 1.561 0.665 ;
The following PROC GPLOT statements produce the simple scatter plot of these data, displayed in Output 41.1.1.
symbol1 color=black value=dot ; proc gplot data=Gas; plot NOx*E; run;
The following statements fit two loess models for these data. Because this is a small data set, it is reasonable to do direct fitting at every data point. As there is substantial curvature in the data, quadratic local polynomials are used. An ODS OUTPUT statement creates two output data sets containing the Output Statistics and Fit Summary tables.
proc loess data=Gas; ods output OutputStatistics = GasFit FitSummary=Summary; model NOx = E / degree=2 direct smooth = 0.6 1.0 alpha=.01 all details; run;
The Fit Summary table for smoothing parameter value 0.6, shown in Output 41.1.2, records the fitting parameters specified and some overall fit statistics.
The LOESS Procedure Smoothing Parameter: 0.6 Dependent Variable: NOx Fit Summary Fit Method Direct Number of Observations 22 Degree of Local Polynomials 2 Smoothing Parameter 0.60000 Points in Local Neighborhood 13 Residual Sum of Squares 1.71852 Trace[L] 6.42184 GCV 0.00708 AICC 0.45637 AICC1 9.39715 Delta1 15.12582 Delta2 14.73089 Equivalent Number of Parameters 5.96950 Lookup Degrees of Freedom 15.53133 Residual Standard Error 0.33707
The matrix L referenced in the Fit Summary table is the smoothing matrix. This matrix satisfies
where y is the vector of observed values and · is the corresponding vector of predicted values of the dependent variable. The quantities
in the Fit Summary table are used in doing statistical inference.
The equivalent number of parameters and residual standard error in the Fit Summary table are defined by
The Output Statistics table for smoothing parameter value 0.6 is shown in Output 41.1.3. Note that, as the ALL option in the MODEL statement is specified, this table includes all the relevant optional columns . Furthermore, because the ALPHA=0.01 option is specified in the MODEL statement, the confidence limits in this table are 99% limits.
The LOESS Procedure Smoothing Parameter: 0.6 Dependent Variable: NOx Output Statistics Estimated Predicted Prediction Obs E NOx NOx Std Deviation Residual t Value 1 0.8 4.8 4.87377 0.15528 0.05577 0.36 2 1.0 2.8 2.81984 0.15380 0.02916 0.19 3 1.0 3.3 3.48153 0.15187 0.20653 1.36 4 1.0 4.7 4.73249 0.13923 0.04149 0.30 5 0.8 4.3 4.82305 0.15278 0.56805 3.72 6 0.9 5.1 5.18561 0.19337 -0.12161 0.63 7 0.7 2.1 2.51120 0.15528 0.39320 2.53 8 0.8 4.6 4.48267 0.15285 0.11933 0.78 9 1.1 2.3 2.12619 0.16683 0.15981 0.96 10 1.1 1.0 0.97120 0.18134 0.00120 0.01 11 1.0 4.0 4.09987 0.13477 0.13487 1.00 12 0.9 5.3 5.31258 0.17283 0.03142 0.18 13 0.8 3.8 3.84572 0.14929 0.01172 0.08 14 0.7 2.0 2.26578 0.16712 0.27578 1.65 15 0.8 5.2 4.58394 0.15363 0.61506 4.00 16 0.9 5.3 5.24741 0.19319 0.03559 0.18 17 1.0 3.8 4.16979 0.13478 0.41779 3.10 18 1.2 0.5 0.53059 0.32170 0.00641 0.02 19 1.1 1.6 1.83157 0.17127 0.19157 1.12 20 1.0 5.1 4.66733 0.13735 0.38767 2.82 21 1.0 4.9 4.52385 0.13556 0.41315 3.05 22 0.7 1.6 1.19888 0.26774 0.36212 1.35 Output Statistics Obs 99% Confidence Limits 1 4.41841 5.32912 2 2.36883 3.27085 3 3.03617 3.92689 4 4.32419 5.14079 5 4.37503 5.27107 6 4.61855 5.75266 7 2.05585 2.96655 8 4.03444 4.93089 9 1.63697 2.61541 10 0.43942 1.50298 11 3.70467 4.49507 12 4.80576 5.81940 13 3.40794 4.28350 14 1.77571 2.75584 15 4.13342 5.03445 16 4.68089 5.81393 17 3.77457 4.56502 18 0.41278 1.47397 19 1.32933 2.33380 20 4.26456 5.07010 21 4.12632 4.92139 22 0.41375 1.98401
Plots of the data points and fitted models with 99% confidence bands are shown in Output 41.1.4.
proc sort data=GasFit; by SmoothingParameter E; run; symbol1 color=black value=dot ; symbol2 color=black interpol=spline value=none; symbol3 color=green interpol=spline value=none; symbol4 color=green interpol=spline value=none; %let opts=vaxis=axis1 hm=3 vm=3 overlay; goptions nodisplay hsize=3.75; axis1 label=(angle=90 rotate=0); proc gplot data=GasFit; by SmoothingParameter; plot (DepVar Pred LowerCL UpperCL)*E/ &opts name='fitGas'; run; quit; goptions display hsize=0 hpos=0; proc greplay nofs tc=sashelp.templt template=h2; igout gseg; treplay 1:fitGas 2:fitGas1; run; quit;
It is evident from the preceding figure that the better fit is obtained with smoothing parameter value 0.6. Scatter plots of the fit residuals confirm this observation. Note also that PROC LOESS is again used to produce the Residual variable on these plots.
proc loess data=GasFit; by SmoothingParameter; ods output OutputStatistics=residout; model Residual=E; run; axis1 label = (angle=90 rotate=0) order = ( 1 to 1 by 0.5); goptions nodisplay hsize=3.75; proc gplot data=residout; by SmoothingParameter; format DepVar 3.1; plot DepVar*E Pred*E/ &opts vref=0 lv=2 vm=1 name='resGas'; run; quit; goptions display hsize=0 hpos=0; proc greplay nofs tc=sashelp.templt template=h2; igout gseg; treplay 1:resGas 2:resGas1; run; quit;
The residual plots show that with smoothing parameter value 1, the loess model exhibits a lack of fit. Analysis of variance can be used to compare the model with smoothing parameter value 1, which serves as the null model, to the model with smoothing parameter value 0.6.
The statistic
has a distribution that is well approximated by an F distribution with
numerator degrees of freedom and denominator degrees of freedom (Cleveland and Grosse 1991). Here quantities with superscript n refer to the null model, rss is the residual sum of squares, and 1 , 2 ,and are as previously defined.
The Fit Summary tables contain the information needed to carry out such an analysis. These tables have been captured in the output data set named Summary using an ODS OUTPUT statement. The following statements extract the relevant information from this data set and carry out the analysis of variance:
data h0 h1; set Summary(keep=SmoothingParameter Label1 nValue1 where=(Label1 in ('Residual Sum of Squares','Delta1', 'Delta2','Lookup Degrees of Freedom'))); if SmoothingParameter = 1 then output h0; else output h1; run; proc transpose data=h0(drop=SmoothingParameter Label1) out=h0; data h0(drop=_NAME_); set h0; rename Col1 = RSSNull Col2 = delta1Null Col3 = delta2Null; proc transpose data=h1(drop=SmoothingParameter Label1) out=h1; data h1(drop=_NAME_); set h1; rename Col1 = RSS Col2 = delta1 Col3 = delta2 Col4 = rho; data ftest; merge h0 h1; nu = (delta1Null - delta1)**2 / (delta2Null - delta2); Numerator = (RSSNull - RSS)/(delta1Null - delta1); Denominator = RSS/delta1; FValue = Numerator / Denominator; PValue = 1 - ProbF(FValue, nu, rho); label nu = 'Num DF' rho = 'Den DF' FValue = 'F Value' PValue = 'Pr > F'; proc print data=ftest label; var nu rho Numerator Denominator FValue PValue; format nu rho FValue 7.2 PValue 6.4; run;
The results are shown in Output 41.1.6.
Obs Num DF Den DF Numerator Denominator F Value Pr > F 1 2.67 15.53 1.05946 0.11362 9.32 0.0012
The highly significant p -value confirms that the loess model with smoothing parameter value 0 . 6 provides a better fit than the model with smoothing parameter value 1.
The following data set contains measurements in grams per square meter of sulfate (SO 4 ) deposits during 1990 at 179 sites throughout the 48 states.
data SO4; input Latitude Longitude SO4 @@; datalines; 32.45833 87.24222 1.403 34.28778 85.96889 2.103 33.07139 109.86472 0.299 36.07167 112.15500 0.304 31.95056 112.80000 0.263 33.60500 92.09722 1.950 . . more data lines . 42.92889 109.78667 0.182 43.22278 109.99111 0.161 43.87333 104.19222 0.306 44.91722 110.42028 0.210 45.07611 72.67556 2.646 ;
The following statements produce the two scatter plots of the SO4 data shown in Output 41.2.1 and Output 41.2.2:
symbol1 color=black value=dot ; proc gplot data=SO4; plot Latitude*Longitude/hreverse; run; proc g3d data=SO4; format SO4 f4.1; scatter Longitude*Latitude=SO4 / shape='balloon' size=0.35 rotate=80 tilt=60; run;
From these scatter plots, it is clear that the largest concentrations are in the northeastern United States. These plots also indicate that a nonparametric surface, such as a loess fit, is appropriate for these data.
The sulfate measurements are irregularly spaced . The following statements create a SAS data set containing a regular grid of points that will be used in the SCORE statement:
data PredPoints; do Latitude = 26 to 46 by 1; do Longitude = 79 to 123 by 1; output; end; end;
The following statements fit loess models for two values of the smoothing parameter and save the results in output data sets:
proc loess data=SO4; ods Output ScoreResults=ScoreOut OutputStatistics=StatOut; model SO4=Latitude Longitude/smooth=0.15 0.4 residual; score data=PredPoints; run;
Notice that even though there are two predictors in the model, the SCALE= option is not appropriate because the predictors (Latitude and Longitude) are identically scaled.
proc loess data=StatOut; by SmoothingParameter; ods output OutputStatistics=ResidLatOut; model residual=Latitude; run; proc loess data=StatOut; by SmoothingParameter; ods output OutputStatistics=ResidLongOut; model residual=Longitude; run; proc sort data=ResidLatOut; by SmoothingParameter Latitude; run; proc sort data=ResidLongOut; by SmoothingParameter Longitude; run; goptions nodisplay; symbol1 color=black value=dot ; symbol2 color=black interpol=join value=none; %let opts = vaxis=axis1 overlay vref=0 lv=2; axis1 label = (angle=90 rotate=0); proc gplot data=ResidLatOut; by smoothingParameter; format DepVar 3.1; plot (DepVar Pred) * Latitude / &opts name='lat'; run; proc gplot data=ResidLongOut; by smoothingParameter; format DepVar 3.1; plot (DepVar Pred) * Longitude / &opts name='long'; run; goptions display; proc greplay nofs tc=sashelp.templt template=l2r2; igout gseg; treplay 1:long 2:long1 3:lat 4:lat1; run; quit ;
The ScoreOut data set contains the model predictions at the grid definedinthe PredPoints data set. The following statements request a fitted surface and a contour plot of this surface with a smoothing parameter value 0.15:
proc g3d data=ScoreOut(where= (smoothingParameter=0.15)); format Latitude f4.0; format Longitude f4.0; format p_SO4 f4.1; plot Longitude*Latitude=p_SO4/tilt=60 rotate=80; run; proc gcontour data=ScoreOut(where= (smoothingParameter=0.15)); format latitude f4.0; format longitude f4.0; format p_SO4 f4.1; plot Latitude*Longitude = p_SO4/hreverse; run;
The following data set records the results of an experiment to determine how the yield of a chemical reaction varies with temperature and amount of a catalyst used.
data Experiment; input Temperature Catalyst Yield; datalines; 80 0.000 6.842 80 0.002 7.944 . . more data lines . 140 0.078 4.012 140 0.080 5.212 ;
Researchers know that about 10% of the yield measurements are corrupted due to an intermittent equipment problem. This can be seen in the surface plot of raw data shown in Output 41.3.1.
proc g3d data=Experiment; plot Temperature*Catalyst=Yield/zmin=0 zmax=25 zticknum=6; run;
A robust fitting method is needed to estimate the underlying surface in the presence of data outliers. The following statements invoke PROC LOESS with iterative reweighting to fit a surface to these data:
proc loess data=Experiment; ods output OutputStatistics=Results; model Yield = Temperature Catalyst / scale=sd(0.1) iterations=3; run;
The ITERATIONS=3 option in the MODEL statement requests two iteratively reweighted iterations. Note the use of the SCALE=SD(0.1) option in the MODEL statement. This specifies that the independent variables in the model are to be divided by their respective 10% trimmed standard deviations before the fitted model is computed. This is appropriate as the independent variables Temperature and Catalyst are not similarly scaled. The Scale Details table produced by PROC LOESS is shown in Output 41.3.2.
The LOESS Procedure Independent Variable Scaling Scaling applied: 10% trimmed standard deviation Statistic Temperature Catalyst Minimum Value 80 0.000 Maximum Value 140 0.080 Trimmed Mean 110 0.040 Trimmed Standard Deviation 14 0.019
The following statements use the G3D procedure to plot the fittedsurfaceshownin Output 41.3.3.
proc g3d data=Results; format Temperature f4.0; format Catalyst f6.3; format pred f5.2; plot Temperature*Catalyst=pred/zmin=0 zmax=10 zticknum=3; run;
The following data set contains measurements of monthly averaged atmospheric pressure differences between Easter Island and Darwin, Australia, for a period of 168 months (National Institute of Standards and Technology 1998):
data ENSO; input Pressure @@; Month=_N_; format Pressure 4.1; format Month 3.0; datalines; 12.9 11.3 10.6 11.2 10.9 7.5 7.7 11.7 12.9 14.3 10.9 13.7 17.1 14.0 15.3 8.5 5.7 5.5 7.6 8.6 7.3 7.6 12.7 11.0 12.7 12.9 13.0 10.9 10.4 10.2 8.0 10.9 13.6 10.5 9.2 12.4 12.7 13.3 10.1 7.8 4.8 3.0 2.5 6.3 9.7 11.6 8.6 12.4 10.5 13.3 10.4 8.1 3.7 10.7 5.1 10.4 10.9 11.7 11.4 13.7 14.1 14.0 12.5 6.3 9.6 11.7 5.0 10.8 12.7 10.8 11.8 12.6 15.7 12.6 14.8 7.8 7.1 11.2 8.1 6.4 5.2 12.0 10.2 12.7 10.2 14.7 12.2 7.1 5.7 6.7 3.9 8.5 8.3 10.8 16.7 12.6 12.5 12.5 9.8 7.2 4.1 10.6 10.1 10.1 11.9 13.6 16.3 17.6 15.5 16.0 15.2 11.2 14.3 14.5 8.5 12.0 12.7 11.3 14.5 15.1 10.4 11.5 13.4 7.5 0.6 0.3 5.5 5.0 4.6 8.2 9.9 9.2 12.5 10.9 9.9 8.9 7.6 9.5 8.4 10.7 13.6 13.7 13.7 16.5 16.8 17.1 15.4 9.5 6.1 10.1 9.3 5.3 11.2 16.6 15.6 12.0 11.5 8.6 13.8 8.7 8.6 8.6 8.7 12.8 13.2 14.0 13.4 14.8 ;
The following PROC GPLOT statements produce the simple scatter plot of these data, displayed in Output 41.4.1:
symbol1 color=black value=dot ; proc gplot data=ENSO; plot Pressure*Month / hminor = 0 vminor = 0 vaxis = axis1 frame; axis1 label = (r=0 a=90) order=(0 to 20 by 4);; run;
You can compute a loess fit and plot the results for these data using the following statements:
ods output OutputStatistics=ENSOstats; proc loess data=ENSO; model Pressure=Month ; run; symbol1 color=black value=dot h=2.5 pct; symbol2 color=black interpol=join value=none width=2; proc gplot data=ENSOstats; plot (depvar pred)*Month / overlay hminor = 0 vminor = 0 vaxis = axis1 frame; axis1 label = (r=0 a=90) order=(0 to 20 by 4); run; quit;
The Smoothing Criterion and Fit Summary tables are shown in Output 41.4.2 and the fit is plotted in Output 41.4.3.
The LOESS Procedure Dependent Variable: Pressure Optimal Smoothing Criterion Smoothing AICC Parameter 3.41105 0.22321 The LOESS Procedure Selected Smoothing Parameter: 0.223 Dependent Variable: Pressure Fit Summary Fit Method kd Tree Blending Linear Number of Observations 168 Number of Fitting Points 33 kd Tree Bucket Size 7 Degree of Local Polynomials 1 Smoothing Parameter 0.22321 Points in Local Neighborhood 37 Residual Sum of Squares 1654.27725 Trace[L] 8.74180 GCV 0.06522 AICC 3.41105
The smoothing parameter value used for the loess fits hown in Output 41.4.3 was chosen using the default method of PROC LOESS, namely a golden section minimization of the AICC criterion over the interval (0 , 1]. The fit seems to be oversmoothed. What accounts for this poor fit?
One possibility is that the golden section search has found a local rather than a global minimum of the AICC criterion. You can test this by redoing the fit requesting a global minimum. It is also helpful to plot the AICC criterion as a function of the smoothing parameter value used. You do this with the following statements:
ods output ModelSummary=ENSOsummary; proc loess data=ENSO; model Pressure=Month/select=AICC(global); run; proc sort data=ENSOsummary; by smooth; run; symbol1 color=black interpol=join value=none width=2; proc gplot data=ENSOsummary; format AICC f4.1; format smooth f4.1; plot AICC*Smooth / hminor = 0 vminor = 0 vaxis = axis1 frame; axis1 label = (r=0 a=90); run; quit;
The results are shown in Output 41.4.4.
The explanation for the oversmoothed fit in Output 41.4.3 is now apparent. The golden section search algorithm found the local minimum that occurs near the value 0.22 of the smoothing parameter rather than the global minimum that occurs near 0.06. Note that if you restrict the range of smoothing parameter values examined to lie below 0.2, then the golden section search finds the global minimum as the following statements demonstrate :
ods output OutputStatistics=ENSOstats; proc loess data=ENSO; model Pressure=Month/select=AICC(range(0.03,0.2)); run; symbol1 color=black value=dot h=2.5 pct; symbol2 color=black interpol=join value=none width=2; proc gplot data=ENSOstats; plot (depvar pred)*Month / overlay hminor = 0 vminor = 0 vaxis = axis1 frame; axis1 label = (r=0 a=90) order=(0 to 20 by 4); run; quit;
The fit obtained is shown in Output 41.4.5.
The loess fit shown in Output 41.4.5 clearly shows an annual cycle in the data. An interesting question is whether there is some phenomenon captured in the data that would explain the presence of the local minimum near 0.22 in the AICC curve. Note that there is some evidence of a cycle of about 42 months in the oversmoothed fitin Output 41.4.3. You can see this cycle because the strong annual cycle in Output 41.4.5 has been smoothed out. The physical phenomenon that accounts for the existence of this cycle has been identified as the periodic warming of the Pacific Ocean known as El Ni ±o.
This example highlights the use of ODS for creating statistical graphs with the LOESS procedure. The ENSO example is revisited to show how these graphics can be used to enrich the analysis and simplify the process for obtaining functionally equivalent plots to those previously presented with this example.
To request these plots you to need to specify the experimental ODS GRAPHICS statement. For general information about ODS graphics see Chapter 15, Statistical Graphics Using ODS. The following statements produce the default plots:
ods html; ods graphics on; proc loess data=ENSO; model Pressure=Month/select=AICC(range(0.03,0.4) global) clm alpha=0.01; run; ods graphics off; ods html close;
The fit plot shown in Output 41.5.2 is produced for models with a single regressor. Note that Output 41.5.2 includes the 99% confidence band that was requested in the MODEL statement using the CLM and ALPHA= options.
Diagnostic plots are produced when the RESIDUAL option is included in the model statement. The following statements produce these diagnostic plots:
ods html; ods graphics on; proc loess data=ENSO; model Pressure=Month/smooth = 0.0565 residual; run; ods graphics off; ods html close;
The residuals in Output 41.5.3 do not exhibit any obvious patterns, suggesting that the signal in the data has been successfully modeled .
Additional information about the LOESS fit can be seen in the plots in the fitdiag-nostics panel that is also produced whenever the RESIDUAL option is specified in the MODEL statement. Output 41.5.4 shows this panel for the optimal smoothing parameter.
These diagnostic plots all suggest that the fit obtained is appropriate for these data:
The plot of residuals versus predicted value shows no obvious pattern.
The residual histogram with overlayed normal density supports the assumption of gaussian errors.
The normal quantile plot is consistent with the assumption of gaussian errors.
The Residual-Fit (or RF) plot consisting of side-by-side quantile plots of the centered fit and the residuals shows that the spread in the residuals is no greater than the spread in the centered fit. For inappropriate models, the spread of the residuals in such a plot is often greater than the spread of the centered fit.
The plot of the dependent variable versus the predicted values is centered around a 45 degree line and shows no obvious outliers.
If you want to obtain the plots in the Diagnostics Panel as individual plots, you can do so by specifying the PLOTS(UNPACKPANELS) option in the PROC LOESS statement. The following statements provide an example:
ods html; ods graphics on; proc loess data=ENSO plots(unpackpanels); model Pressure=Month/smooth = 0.0565 residual; run; ods graphics off; ods html close;
The residual histogram is shown in Output 41.5.5.