The iris data published by Fisher (1936) are widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica. The iris data are used in Example 25.1 through Example 25.3.
Example 25.4 and Example 25.5 use remote-sensing data on crops. In this data set, the observations are grouped into five crops: clover, corn, cotton, soybeans, and sugar beets. Four measures called X1 through X4 make up the descriptive variables.
Example 25.1. Univariate Density Estimates and Posterior Probabilities
In this example, several discriminant analyses are run with a single quantitative variable, petal width, so that density estimates and posterior probabilities can be plotted easily. The example produces Output 25.1.1 through Output 25.1.5. The GCHART procedure is used to display the sample distribution of petal width in the three species. Note the overlap between species I. versicolor and I. virginica that the bar chart shows. These statements produce Output 25.1.1:
proc format;
   value specname
      1='Setosa '
      2='Versicolor'
      3='Virginica ';
run;

data iris;
   title 'Discriminant Analysis of Fisher (1936) Iris Data';
   input SepalLength SepalWidth PetalLength PetalWidth Species @@;
   format Species specname.;
   label SepalLength='Sepal Length in mm.'
         SepalWidth ='Sepal Width in mm.'
         PetalLength='Petal Length in mm.'
         PetalWidth ='Petal Width in mm.';
   symbol = put(Species, specname10.);
   datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;

pattern1 c=red /*v=l1 */;
pattern2 c=yellow /*v=empty*/;
pattern3 c=blue /*v=r1 */;
axis1 label=(angle=90);
axis2 value=(height=.6);
legend1 frame label=none;

proc gchart data=iris;
   vbar PetalWidth / subgroup=Species midpoints=0 to 25
                     raxis=axis1 maxis=axis2 legend=legend1 cframe=ligr;
run;
Output 25.1.1: Sample Distribution of Petal Width in Three Species

Output 25.1.2: Normal Density Estimates with Equal Variance

Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Linear Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)}^{-1} (X - \bar{X}_{(X)j})$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.9610      0.0390
   9    Versicolor    Virginica  *     0.0000       0.0952      0.9048
  57    Virginica     Versicolor *     0.0000       0.9940      0.0060
  78    Virginica     Versicolor *     0.0000       0.8009      0.1991
  91    Virginica     Versicolor *     0.0000       0.9610      0.0390
 148    Versicolor    Virginica  *     0.0000       0.3828      0.6172

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)}^{-1} (X - \bar{X}_{(X)j})$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            4          46        50
                  0.00         8.00       92.00    100.00
Total               50           52          48       150
                 33.33        34.67       32.00    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0800    0.0400
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Linear Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_j)' \mathrm{COV}^{-1} (X - \bar{X}_j)$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total               26           18          27        71
                 36.62        25.35       38.03    100.00
Priors         0.33333      0.33333     0.33333
Output 25.1.3: Normal Density Estimates with Unequal Variance

Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Quadratic Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)j}^{-1} (X - \bar{X}_{(X)j}) + \ln |\mathrm{COV}_{(X)j}|$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.8740      0.1260
   9    Versicolor    Virginica  *     0.0000       0.0686      0.9314
  42    Setosa        Versicolor *     0.4923       0.5073      0.0004
  57    Virginica     Versicolor *     0.0000       0.9602      0.0398
  78    Virginica     Versicolor *     0.0000       0.6558      0.3442
  91    Virginica     Versicolor *     0.0000       0.8740      0.1260
 148    Versicolor    Virginica  *     0.0000       0.2871      0.7129

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)j}^{-1} (X - \bar{X}_{(X)j}) + \ln |\mathrm{COV}_{(X)j}|$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              49            1           0        50
                 98.00         2.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            4          46        50
                  0.00         8.00       92.00    100.00
Total               49           53          48       150
                 32.67        35.33       32.00    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0200       0.0400      0.0800    0.0467
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_j)' \mathrm{COV}_j^{-1} (X - \bar{X}_j) + \ln |\mathrm{COV}_j|$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total               23           20          28        71
                 32.39        28.17       39.44    100.00
Priors         0.33333      0.33333     0.33333
Output 25.1.4: Kernel Density Estimates with Equal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.8827      0.1173
   9    Versicolor    Virginica  *     0.0000       0.0438      0.9562
  57    Virginica     Versicolor *     0.0000       0.9472      0.0528
  78    Virginica     Versicolor *     0.0000       0.8061      0.1939
  91    Virginica     Versicolor *     0.0000       0.8827      0.1173
 148    Versicolor    Virginica  *     0.0000       0.2586      0.7414

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            4          46        50
                  0.00         8.00       92.00    100.00
Total               50           52          48       150
                 33.33        34.67       32.00    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0800    0.0400
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total               26           18          27        71
                 36.62        25.35       38.03    100.00
Priors         0.33333      0.33333     0.33333
Output 25.1.5: Kernel Density Estimates with Unequal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}_j^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.8805      0.1195
   9    Versicolor    Virginica  *     0.0000       0.0466      0.9534
  57    Virginica     Versicolor *     0.0000       0.9394      0.0606
  78    Virginica     Versicolor *     0.0000       0.7193      0.2807
  91    Virginica     Versicolor *     0.0000       0.8805      0.1195
 148    Versicolor    Virginica  *     0.0000       0.2275      0.7725

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}_j^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            4          46        50
                  0.00         8.00       92.00    100.00
Total               50           52          48       150
                 33.33        34.67       32.00    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0800    0.0400
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}_j^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total               25           18          28        71
                 35.21        25.35       39.44    100.00
Priors         0.33333      0.33333     0.33333
In order to plot the density estimates and posterior probabilities, a data set called plotdata is created containing equally spaced values from -5 to 30, covering the range of petal width with a little to spare on each end. The plotdata data set is used with the TESTDATA= option in PROC DISCRIM.
data plotdata;
   do PetalWidth=-5 to 30 by .5;
      output;
   end;
run;
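As a quick sanity check (a sketch that is not part of the original example), the following steps compare the observed range of PetalWidth with the grid and count the grid points. A grid from -5 to 30 in steps of 0.5 has (30 - (-5))/0.5 + 1 = 71 points, which agrees with the total of 71 test observations reported in the Test Data classification summaries. The variable name ngrid is introduced here for illustration only.

```sas
/* Observed range of petal width in the calibration data */
proc means data=iris min max;
   var PetalWidth;
run;

/* Number of grid points in plotdata, written to the log */
data _null_;
   if 0 then set plotdata nobs=ngrid;   /* never executed; only supplies the row count */
   put 'Grid points in plotdata: ' ngrid;
   stop;
run;
```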
The same plots are produced after each discriminant analysis, so a macro can be used to reduce the amount of typing required. The macro PLOT uses two data sets. The data set plotd, containing density estimates, is created by the TESTOUTD= option in PROC DISCRIM. The data set plotp, containing posterior probabilities, is created by the TESTOUT= option. For each data set, the macro PLOT removes uninteresting values (near zero) and does an overlay plot showing all three species on a single plot. The following statements create the macro PLOT:
%macro plot;
   data plotd;
      set plotd;
      if setosa<.002 then setosa=.;
      if versicolor<.002 then versicolor=.;
      if virginica <.002 then virginica=.;
      label PetalWidth='Petal Width in mm.';
   run;

   symbol1 i=join v=none c=red l=1 /*l=21*/;
   symbol2 i=join v=none c=yellow l=1 /*l= 1*/;
   symbol3 i=join v=none c=blue l=1 /*l= 2*/;
   legend1 label=none frame;
   axis1 label=(angle=90 'Density') order=(0 to .6 by .1);

   proc gplot data=plotd;
      plot setosa*PetalWidth versicolor*PetalWidth virginica*PetalWidth
           / overlay vaxis=axis1 legend=legend1 frame cframe=ligr;
      title3 'Plot of Estimated Densities';
   run;

   data plotp;
      set plotp;
      if setosa<.01 then setosa=.;
      if versicolor<.01 then versicolor=.;
      if virginica<.01 then virginica=.;
      label PetalWidth='Petal Width in mm.';
   run;

   axis1 label=(angle=90 'Posterior Probability') order=(0 to 1 by .2);

   proc gplot data=plotp;
      plot setosa*PetalWidth versicolor*PetalWidth virginica*PetalWidth
           / overlay vaxis=axis1 legend=legend1 frame cframe=ligr;
      title3 'Plot of Posterior Probabilities';
   run;
%mend;
The first analysis uses normal-theory methods (METHOD=NORMAL) assuming equal variances (POOL=YES) in the three classes. The NOCLASSIFY option suppresses the resubstitution classification results of the input data set observations. The CROSSLISTERR option lists the observations that are misclassified under cross-validation and displays cross-validation error-rate estimates. The following statements produce Output 25.1.2:
proc discrim data=iris method=normal pool=yes
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Normal Density Estimates with Equal Variance';
run;
%plot
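With a single variable, the matrix expressions shown in the listing reduce to scalars. As a sketch, writing \bar{x}_j for the mean of class j and s^2 for the pooled within-class variance (symbols introduced here for illustration), the generalized squared distance and the posterior probability under METHOD=NORMAL and POOL=YES with equal priors become:

```latex
D_j^2(x) = \frac{(x - \bar{x}_j)^2}{s^2},
\qquad
\Pr(j \mid x) =
  \frac{\exp\!\left(-\tfrac{1}{2} D_j^2(x)\right)}
       {\sum_k \exp\!\left(-\tfrac{1}{2} D_k^2(x)\right)}
```

Because s^2 is common to all classes, the class whose mean is nearest to x receives the highest posterior probability.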
The next analysis uses normal-theory methods assuming unequal variances (POOL=NO) in the three classes. The following statements produce Output 25.1.3:
proc discrim data=iris method=normal pool=no
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Normal Density Estimates with Unequal Variance';
run;
%plot
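For one variable, the quadratic distance in the listing can be written with a separate variance for each class. In this sketch, \bar{x}_j and s_j^2 denote the mean and variance of class j (notation introduced here, not taken from the original text):

```latex
D_j^2(x) = \frac{(x - \bar{x}_j)^2}{s_j^2} + \ln s_j^2
```

A class with a small variance is penalized heavily for even a modest departure from its mean. This is consistent with the cross-validation listing, where observation 42, the setosa specimen with the largest petal width (6 mm), is narrowly assigned to versicolor (0.5073 against 0.4923).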
Two more analyses are run with nonparametric methods (METHOD=NPAR), specifically kernel density estimates with normal kernels (KERNEL=NORMAL). The first of these uses equal bandwidths (smoothing parameters) (POOL=YES) in each class. The use of equal bandwidths does not constrain the density estimates to be of equal variance. The value of the radius parameter that, assuming normality, minimizes an approximate mean integrated square error is 0.48 (see the Nonparametric Methods section on page 1158). Choosing r = 0.4 gives a more detailed look at the irregularities in the data. The following statements produce Output 25.1.4:
proc discrim data=iris method=npar kernel=normal r=.4 pool=yes
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Equal Bandwidth';
run;
%plot
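The radius of 0.48 quoted above can be reproduced by hand. The sketch below assumes that the Nonparametric Methods section gives the approximately optimal radius for a normal kernel as r = (A/n)**(1/(p+4)) with A = 4/(2p+1), where p is the number of variables and n is the class sample size; check that section before relying on the formula. The variable names are illustrative.

```sas
data _null_;
   p = 1;                          /* number of variables (PetalWidth only) */
   n = 50;                         /* observations in each species */
   A = 4 / (2*p + 1);              /* assumed constant for a normal kernel */
   r = (A / n) ** (1 / (p + 4));   /* assumed optimal-radius formula */
   put 'Approximate optimal radius: ' r 6.2;
run;
```

Under these assumptions the step writes a value of about 0.48 to the log, so r=.4 is a slightly smaller radius that smooths less.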
Another nonparametric analysis is run with unequal bandwidths (POOL=NO). These statements produce Output 25.1.5:
proc discrim data=iris method=npar kernel=normal r=.4 pool=no
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Unequal Bandwidth';
run;
%plot
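For one variable, the kernel estimate displayed in the listings can be sketched as follows, up to a normalizing constant that the listing does not show. Here y_{ji} is the i-th training value in class j, n_j is the class size, r is the radius, and s^2 is the pooled variance; this notation is introduced for illustration.

```latex
F(x \mid j) = \frac{1}{n_j} \sum_{i=1}^{n_j}
  \exp\!\left( -\frac{(x - y_{ji})^2}{2\, r^2 s^2} \right)
```

With POOL=NO, s^2 is replaced by the class variance s_j^2, so each class gets its own bandwidth. In either case every training value contributes its own bump, so the estimate follows the shape of each class's data; this is why a shared bandwidth does not force the estimated densities to have equal variance.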
Example 25.2. Bivariate Density Estimates and Posterior Probabilities
In this example, four more discriminant analyses of iris data are run with two quantitative variables: petal width and petal length. The example produces Output 25.2.1 through Output 25.2.5. A scatter plot shows the joint sample distribution. See Appendix B, Using the %PLOTIT Macro, for more information on the %PLOTIT macro.
Output 25.2.2: Normal Density Estimates with Equal Variance

Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         2    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Linear Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)}^{-1} (X - \bar{X}_{(X)j})$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.8453      0.1547
   9    Versicolor    Virginica  *     0.0000       0.2130      0.7870
  25    Virginica     Versicolor *     0.0000       0.8322      0.1678
  57    Virginica     Versicolor *     0.0000       0.8057      0.1943
  91    Virginica     Versicolor *     0.0000       0.8903      0.1097
 148    Versicolor    Virginica  *     0.0000       0.3118      0.6882

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)}^{-1} (X - \bar{X}_{(X)j})$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            4          46        50
                  0.00         8.00       92.00    100.00
Total               50           52          48       150
                 33.33        34.67       32.00    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0800    0.0400
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Linear Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_j)' \mathrm{COV}^{-1} (X - \bar{X}_j)$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total            14507        16888       12858     44253
                 32.78        38.16       29.06    100.00
Priors         0.33333      0.33333     0.33333
Output 25.2.3: Normal Density Estimates with Unequal Variance

Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         2    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Quadratic Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)j}^{-1} (X - \bar{X}_{(X)j}) + \ln |\mathrm{COV}_{(X)j}|$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.7288      0.2712
   9    Versicolor    Virginica  *     0.0000       0.0903      0.9097
  25    Virginica     Versicolor *     0.0000       0.5196      0.4804
  91    Virginica     Versicolor *     0.0000       0.8335      0.1665
 148    Versicolor    Virginica  *     0.0000       0.4675      0.5325

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_{(X)j})' \mathrm{COV}_{(X)j}^{-1} (X - \bar{X}_{(X)j}) + \ln |\mathrm{COV}_{(X)j}|$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            3          47        50
                  0.00         6.00       94.00    100.00
Total               50           51          49       150
                 33.33        34.00       32.67    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0600    0.0333
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

   $$D_j^2(X) = (X - \bar{X}_j)' \mathrm{COV}_j^{-1} (X - \bar{X}_j) + \ln |\mathrm{COV}_j|$$

Posterior Probability of Membership in Each Species

   $$\Pr(j \mid X) = \exp(-.5 D_j^2(X)) / \sum_k \exp(-.5 D_k^2(X))$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total             5461         5354       33438     44253
                 12.34        12.10       75.56    100.00
Priors         0.33333      0.33333     0.33333
Output 25.2.4: Kernel Density Estimates with Equal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         2    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.7474      0.2526
   9    Versicolor    Virginica  *     0.0000       0.0800      0.9200
  25    Virginica     Versicolor *     0.0000       0.5863      0.4137
  91    Virginica     Versicolor *     0.0000       0.8358      0.1642
 148    Versicolor    Virginica  *     0.0000       0.4123      0.5877

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            3          47        50
                  0.00         6.00       94.00    100.00
Total               50           51          49       150
                 33.33        34.00       32.67    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0600    0.0333
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total            12631         9941       21681     44253
                 28.54        22.46       48.99    100.00
Priors         0.33333      0.33333     0.33333
Output 25.2.5: Kernel Density Estimates with Unequal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         2    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

              Variable                                         Prior
Species       Name          Frequency     Weight   Proportion  Probability
Setosa        Setosa               50    50.0000     0.333333     0.333333
Versicolor    Versicolor           50    50.0000     0.333333     0.333333
Virginica     Virginica            50    50.0000     0.333333     0.333333

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}_j^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Posterior Probability of Membership in Species

        From          Classified
 Obs    Species       into Species     Setosa   Versicolor   Virginica
   5    Virginica     Versicolor *     0.0000       0.7826      0.2174
   9    Versicolor    Virginica  *     0.0000       0.0506      0.9494
  91    Virginica     Versicolor *     0.0000       0.8802      0.1198
 148    Versicolor    Virginica  *     0.0000       0.3726      0.6274

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}_j^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

From
Species         Setosa   Versicolor   Virginica     Total
Setosa              50            0           0        50
                100.00         0.00        0.00    100.00
Versicolor           0           48           2        50
                  0.00        96.00        4.00    100.00
Virginica            0            2          48        50
                  0.00         4.00       96.00    100.00
Total               50           50          50       150
                 33.33        33.33       33.33    100.00
Priors         0.33333      0.33333     0.33333

Error Count Estimates for Species

                Setosa   Versicolor   Virginica     Total
Rate            0.0000       0.0400      0.0400    0.0267
Priors          0.3333       0.3333      0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Squared Distance Function

   $$D^2(X,Y) = (X - Y)' \mathrm{COV}_j^{-1} (X - Y)$$

Posterior Probability of Membership in Each Species

   $$F(X \mid j) = n_j^{-1} \sum_i \exp(-.5 D^2(X, Y_{ji}) / R^2)$$
   $$\Pr(j \mid X) = \mathrm{PRIOR}_j F(X \mid j) / \sum_k \mathrm{PRIOR}_k F(X \mid k)$$

Number of Observations and Percent Classified into Species

                Setosa   Versicolor   Virginica     Total
Total             5447         5984       32822     44253
                 12.31        13.52       74.17    100.00
Priors         0.33333      0.33333     0.33333
Another data set is created for plotting, containing a grid of points suitable for contour plots. The large number of points in the grid makes the following analyses very time-consuming. If you attempt to duplicate these examples, begin with a small number of points in the grid.
data plotdata;
   do PetalLength=-2 to 72 by 0.25;
      h + 1; * Number of horizontal cells;
      do PetalWidth=-5 to 32 by 0.25;
         n + 1; * Total number of cells;
         output;
      end;
   end;
   * Make variables to contain H and V grid sizes;
   call symput('hnobs', compress(put(h , best12.)));
   call symput('vnobs', compress(put(n / h, best12.)));
   drop n h;
run;
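Following the advice above, a coarser grid can be used for a first trial run. The sketch below is a variant of the grid step in which a hypothetical macro variable, step, controls the spacing. The bounds -2 to 72 and -5 to 32 are assumed here because, at a spacing of 0.25, they give 297 x 149 = 44253 grid points, which is the test-data total reported in Output 25.2.2 through Output 25.2.5.

```sas
%let step = 2;   /* hypothetical grid spacing in mm; the full example uses 0.25 */

data plotdata;
   do PetalLength = -2 to 72 by &step;
      h + 1;                         /* number of horizontal cells */
      do PetalWidth = -5 to 32 by &step;
         n + 1;                      /* total number of cells */
         output;
      end;
   end;
   /* store the horizontal and vertical grid sizes in macro variables */
   call symput('hnobs', compress(put(h, best12.)));
   call symput('vnobs', compress(put(n / h, best12.)));
   drop n h;
run;

%put hnobs=&hnobs vnobs=&vnobs;
```

With step set to 2 the grid has 38 x 19 = 722 points, which should make each PROC DISCRIM run far quicker; once the plots look reasonable, the spacing can be reduced toward 0.25.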
A macro CONTOUR is defined to make contour plots of density estimates and posterior probabilities. Classification results are also plotted on the same grid.
   %macro contour;

      data contour(keep=PetalWidth PetalLength symbol density);
         set plotd(in=d) iris;
         if d then density = max(setosa,versicolor,virginica);
      run;

      title3 'Plot of Estimated Densities';

      %plotit(data=contour, plotvars=PetalWidth PetalLength,
              labelvar=_blank_, symvar=symbol, typevar=symbol,
              symlen=4, exttypes=symbol contour, ls=100,
              paint=density white black, rgbtypes=contour,
              hnobs=&hnobs, vnobs=&vnobs, excolors=white,
              rgbround=-16 1 1 1, extend=close, options=noclip,
              types =Setosa Versicolor Virginica '',
              symtype=symbol symbol symbol contour,
              symsize=0.6 0.6 0.6 1,
              symfont=swiss swiss swiss solid)

      data posterior(keep=PetalWidth PetalLength symbol prob _into_);
         set plotp(in=d) iris;
         if d then prob = max(setosa,versicolor,virginica);
      run;

      title3 'Plot of Posterior Probabilities '
             '(Black to White is Low to High Probability)';

      %plotit(data=posterior, plotvars=PetalWidth PetalLength,
              labelvar=_blank_, symvar=symbol, typevar=symbol,
              symlen=4, exttypes=symbol contour, ls=100,
              paint=prob black white 0.3 0.999, rgbtypes=contour,
              hnobs=&hnobs, vnobs=&vnobs, excolors=white,
              rgbround=-16 1 1 1, extend=close, options=noclip,
              types =Setosa Versicolor Virginica '',
              symtype=symbol symbol symbol contour,
              symsize=0.6 0.6 0.6 1,
              symfont=swiss swiss swiss solid)

      title3 'Plot of Classification Results';

      %plotit(data=posterior, plotvars=PetalWidth PetalLength,
              labelvar=_blank_, symvar=symbol, typevar=symbol,
              symlen=4, exttypes=symbol contour, ls=100,
              paint=_into_ CXCCCCCC CXDDDDDD white,
              rgbtypes=contour, hnobs=&hnobs, vnobs=&vnobs,
              excolors=white,
              extend=close, options=noclip,
              types =Setosa Versicolor Virginica '',
              symtype=symbol symbol symbol contour,
              symsize=0.6 0.6 0.6 1,
              symfont=swiss swiss swiss solid)

   %mend;
A normal-theory analysis (METHOD=NORMAL) assuming equal covariance matrices (POOL=YES) illustrates the linearity of the classification boundaries. These statements produce Output 25.2.2:
   proc discrim data=iris method=normal pool=yes
                testdata=plotdata testout=plotp testoutd=plotd
                short noclassify crosslisterr;
      class Species;
      var Petal:;
      title2 'Using Normal Density Estimates with Equal Variance';
   run;

   %contour
A normal-theory analysis assuming unequal covariance matrices (POOL=NO) illustrates quadratic classification boundaries. These statements produce Output 25.2.3:
   proc discrim data=iris method=normal pool=no
                testdata=plotdata testout=plotp testoutd=plotd
                short noclassify crosslisterr;
      class Species;
      var Petal:;
      title2 'Using Normal Density Estimates with Unequal Variance';
   run;

   %contour
A nonparametric analysis (METHOD=NPAR) follows, using normal kernels (KERNEL=NORMAL) and equal bandwidths (POOL=YES) in each class. The value of the radius parameter r that, assuming normality, minimizes an approximate mean integrated square error is 0.50 (see the Nonparametric Methods section on page 1158). These statements produce Output 25.2.4:
   proc discrim data=iris method=npar kernel=normal r=.5 pool=yes
                testdata=plotdata testout=plotp testoutd=plotd
                short noclassify crosslisterr;
      class Species;
      var Petal:;
      title2 'Using Kernel Density Estimates with Equal Bandwidth';
   run;

   %contour
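The radius of 0.50 used in this analysis can be checked with a short DATA step. This is only a sketch: it assumes that the optimal radius for a normal kernel takes the form r = (4 / ((2p + 1) n))**(1 / (p + 4)), where p is the number of variables and n is the number of training observations in a class. That formula is not shown in this example and should be confirmed against the Nonparametric Methods section before it is relied on.

```sas
/* Sketch: approximate optimal radius for a normal kernel.        */
/* The formula is an assumption taken to match the Nonparametric  */
/* Methods section; verify it there before using it elsewhere.    */
data _null_;
   p = 2;      /* number of variables: PetalLength and PetalWidth */
   n = 50;     /* observations in each species                    */
   r = (4 / ((2*p + 1) * n)) ** (1 / (p + 4));
   put 'Approximate optimal radius: ' r 6.3;
run;
```

Under that assumption the computed radius is close to 0.50, which agrees with the R=.5 value specified in the PROC DISCRIM statement.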
Another nonparametric analysis is run with unequal bandwidths (POOL=NO). These statements produce Output 25.2.5:
   proc discrim data=iris method=npar kernel=normal r=.5 pool=no
                testdata=plotdata testout=plotp testoutd=plotd
                short noclassify crosslisterr;
      class Species;
      var Petal:;
      title2 'Using Kernel Density Estimates with Unequal Bandwidth';
   run;

   %contour
Example 25.3. Normal-Theory Discriminant Analysis of Iris Data
In this example, PROC DISCRIM uses normal-theory methods to classify the iris data used in Example 25.1. The POOL=TEST option tests the homogeneity of the within-group covariance matrices (Output 25.3.3). Since the resulting test statistic is significant at the 0.10 level, the within-group covariance matrices are used to derive the quadratic discriminant criterion. The WCOV and PCOV options display the within-group covariance matrices and the pooled covariance matrix (Output 25.3.2). The DISTANCE option displays squared distances between classes (Output 25.3.4). The ANOVA and MANOVA options test the hypothesis that the class means are equal, using univariate statistics and multivariate statistics; all statistics are significant at the 0.0001 level (Output 25.3.5). The LISTERR option lists the misclassified observations under resubstitution (Output 25.3.6). The CROSSLISTERR option lists the observations that are misclassified under cross validation and displays cross validation error-rate estimates (Output 25.3.7). The resubstitution error count estimate, 0.02, is not larger than the cross validation error count estimate, 0.0267, as would be expected because the resubstitution estimate is optimistically biased. The OUTSTAT= option generates a TYPE=MIXED (because POOL=TEST) output data set containing various statistics such as means, covariances, and coefficients of the discriminant function (Output 25.3.8).
The following statements produce Output 25.3.1 through Output 25.3.8:
   proc discrim data=iris outstat=irisstat
                wcov pcov method=normal pool=test
                distance anova manova listerr crosslisterr;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
      title2 'Using Quadratic Discriminant Function';
   run;

   proc print data=irisstat;
      title2 'Output Discriminant Statistics';
   run;
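The two total error count estimates quoted in this example can be reproduced from the per-species rates listed in Output 25.3.6 and Output 25.3.7. The worked arithmetic below assumes that the total is the prior-weighted sum of the class rates, which is consistent with the printed values; here every prior equals 1/3.

```latex
\begin{align*}
\text{Resubstitution:}   \quad & \tfrac{1}{3}(0.0000) + \tfrac{1}{3}(0.0400) + \tfrac{1}{3}(0.0200) = 0.0200 \\
\text{Cross validation:} \quad & \tfrac{1}{3}(0.0000) + \tfrac{1}{3}(0.0600) + \tfrac{1}{3}(0.0200) \approx 0.0267
\end{align*}
```

The only difference between the two lines is the Versicolor rate, which rises from 0.04 to 0.06 because one additional Versicolor observation is misclassified under cross validation.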
Output 25.3.1: Quadratic Discriminant Analysis of Iris Data

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure

Observations    150     DF Total               149
Variables         4     DF Within Classes      147
Classes           3     DF Between Classes       2

Class Level Information

              Variable                                          Prior
Species       Name          Frequency     Weight  Proportion  Probability
Setosa        Setosa               50    50.0000    0.333333     0.333333
Versicolor    Versicolor           50    50.0000    0.333333     0.333333
Virginica     Virginica            50    50.0000    0.333333     0.333333
Output 25.3.2: Covariance Matrices

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure
Within-Class Covariance Matrices

Species = Setosa, DF = 49

Variable      Label                  SepalLength    SepalWidth   PetalLength    PetalWidth
SepalLength   Sepal Length in mm.    12.42489796    9.92163265    1.63551020    1.03306122
SepalWidth    Sepal Width in mm.      9.92163265   14.36897959    1.16979592    0.92979592
PetalLength   Petal Length in mm.     1.63551020    1.16979592    3.01591837    0.60693878
PetalWidth    Petal Width in mm.      1.03306122    0.92979592    0.60693878    1.11061224

Species = Versicolor, DF = 49

Variable      Label                  SepalLength    SepalWidth   PetalLength    PetalWidth
SepalLength   Sepal Length in mm.    26.64326531    8.51836735   18.28979592    5.57795918
SepalWidth    Sepal Width in mm.      8.51836735    9.84693878    8.26530612    4.12040816
PetalLength   Petal Length in mm.    18.28979592    8.26530612   22.08163265    7.31020408
PetalWidth    Petal Width in mm.      5.57795918    4.12040816    7.31020408    3.91061224

Species = Virginica, DF = 49

Variable      Label                  SepalLength    SepalWidth   PetalLength    PetalWidth
SepalLength   Sepal Length in mm.    40.43428571    9.37632653   30.32897959    4.90938776
SepalWidth    Sepal Width in mm.      9.37632653   10.40040816    7.13795918    4.76285714
PetalLength   Petal Length in mm.    30.32897959    7.13795918   30.45877551    4.88244898
PetalWidth    Petal Width in mm.      4.90938776    4.76285714    4.88244898    7.54326531

Pooled Within-Class Covariance Matrix, DF = 147

Variable      Label                  SepalLength    SepalWidth   PetalLength    PetalWidth
SepalLength   Sepal Length in mm.    26.50081633    9.27210884   16.75142857    3.84013605
SepalWidth    Sepal Width in mm.      9.27210884   11.53877551    5.52435374    3.27102041
PetalLength   Petal Length in mm.    16.75142857    5.52435374   18.51877551    4.26653061
PetalWidth    Petal Width in mm.      3.84013605    3.27102041    4.26653061    4.18816327

Within Covariance Matrix Information

              Covariance     Natural Log of the Determinant
Species       Matrix Rank    of the Covariance Matrix
Setosa                  4                           5.35332
Versicolor              4                           7.54636
Virginica               4                           9.49362
Pooled                  4                           8.46214
Output 25.3.3: Homogeneity Test

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices

Notation:
   K    = Number of Groups
   P    = Number of Variables
   N    = Total Number of Observations - Number of Groups
   N(i) = Number of Observations in the ith Group - 1

\[ V = \frac{\prod_i |\text{Within SS Matrix}(i)|^{N(i)/2}}{|\text{Pooled SS Matrix}|^{N/2}} \]

\[ \mathrm{RHO} = 1.0 - \left[ \sum_i \frac{1}{N(i)} - \frac{1}{N} \right] \frac{2P^2 + 3P - 1}{6(P+1)(K-1)} \]

\[ \mathrm{DF} = .5\,(K-1)\,P\,(P+1) \]

Under the null hypothesis:

\[ -2 \, \mathrm{RHO} \, \ln \left[ \frac{N^{PN/2} \; V}{\prod_i N(i)^{PN(i)/2}} \right] \]

is distributed approximately as Chi-Square(DF).

Chi-Square      DF    Pr > ChiSq
140.943050      20        <.0001

Since the Chi-Square value is significant at the 0.1 level, the within covariance matrices will be used in the discriminant function.

Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.
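The chi-square statistic in Output 25.3.3 can be approximated from the log determinants printed in Output 25.3.2. The DATA step below is a sketch under one stated assumption: each Within SS Matrix(i) equals N(i) times the corresponding covariance matrix, and the Pooled SS Matrix equals N times the pooled covariance matrix. With that assumption the powers of N and N(i) cancel, and the statistic reduces to RHO times (N ln|pooled COV| minus the sum of N(i) ln|COV(i)|). The variable names are illustrative.

```sas
/* Sketch: homogeneity test recomputed from printed log determinants */
data _null_;
   K  = 3;       /* number of groups                          */
   P  = 4;       /* number of variables                       */
   Ni = 49;      /* N(i): 50 observations per species minus 1 */
   N  = 147;     /* N: 150 observations minus 3 groups        */

   /* Natural logs of the covariance determinants (Output 25.3.2) */
   ln_setosa     = 5.35332;
   ln_versicolor = 7.54636;
   ln_virginica  = 9.49362;
   ln_pooled     = 8.46214;

   /* All groups have the same N(i), so the sum of 1/N(i) is K/Ni */
   rho = 1 - (K/Ni - 1/N) * (2*P**2 + 3*P - 1) / (6*(P + 1)*(K - 1));
   df  = 0.5 * (K - 1) * P * (P + 1);

   /* Simplified form of -2 RHO ln[ ... ] under the stated assumption */
   chisq   = rho * (N*ln_pooled
                    - Ni*(ln_setosa + ln_versicolor + ln_virginica));
   p_value = 1 - probchi(chisq, df);

   put rho= 8.6 df= chisq= 10.4 p_value= pvalue6.4;
run;
```

The computed degrees of freedom are 20, and the statistic is approximately 140.94, which matches the printed value of 140.943050 up to the rounding of the log determinants.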
Output 25.3.4: Squared Distances

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure
Pairwise Squared Distances Between Groups

\[ D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}_j^{-1} \, (\bar{X}_i - \bar{X}_j) \]

Squared Distance to Species

From Species       Setosa    Versicolor     Virginica
Setosa                  0     103.19382     168.76759
Versicolor      323.06203             0      13.83875
Virginica       706.08494      17.86670             0

Pairwise Generalized Squared Distances Between Groups

\[ D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}_j^{-1} \, (\bar{X}_i - \bar{X}_j) + \ln |\mathrm{COV}_j| \]

Generalized Squared Distance to Species

From Species       Setosa    Versicolor     Virginica
Setosa            5.35332     110.74017     178.26121
Versicolor      328.41535       7.54636      23.33238
Virginica       711.43826      25.41306       9.49362
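The two tables in Output 25.3.4 differ only by the log determinant of the covariance matrix of the class that the distance is measured to. As a worked check, using the log determinants from Output 25.3.2 (5.35332 for Setosa and 7.54636 for Versicolor):

```latex
\begin{align*}
D^2_{\text{gen}}(\text{Setosa} \mid \text{Versicolor})
   &= 103.19382 + \ln|\mathrm{COV}_{\text{Versicolor}}|
    = 103.19382 + 7.54636 \approx 110.7402 \\
D^2_{\text{gen}}(\text{Versicolor} \mid \text{Setosa})
   &= 323.06203 + \ln|\mathrm{COV}_{\text{Setosa}}|
    = 323.06203 + 5.35332 = 328.41535
\end{align*}
```

Both results agree with the printed generalized distances up to rounding. The diagonal entries of the generalized table are the log determinants themselves, because the squared distance from a class to itself is zero.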
Output 25.3.5: Tests of Equal Class Means

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure
Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147

                                       Total      Pooled     Between
                                    Standard    Standard    Standard                R-Square
Variable      Label                Deviation   Deviation   Deviation   R-Square   / (1-RSq)   F Value   Pr > F
SepalLength   Sepal Length in mm.     8.2807      5.1479      7.9506     0.6187      1.6226    119.26   <.0001
SepalWidth    Sepal Width in mm.      4.3587      3.3969      3.3682     0.4008      0.6688     49.16   <.0001
PetalLength   Petal Length in mm.    17.6530      4.3033     20.9070     0.9414     16.0566   1180.16   <.0001
PetalWidth    Petal Width in mm.      7.6224      2.0465      8.9673     0.9289     13.0613    960.01   <.0001

Average R-Square
Unweighted              0.7224358
Weighted by Variance    0.8689444

Multivariate Statistics and F Approximations
S=2  M=0.5  N=71

Statistic                       Value    F Value   Num DF   Den DF   Pr > F
Wilks' Lambda              0.02343863     199.15        8      288   <.0001
Pillai's Trace             1.19189883      53.47        8      290   <.0001
Hotelling-Lawley Trace    32.47732024     582.20        8    203.4   <.0001
Roy's Greatest Root       32.19192920    1166.96        4      145   <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
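The univariate F values in Output 25.3.5 follow from the R-Square / (1-RSq) column and the degrees of freedom given in the table heading (Num DF=2, Den DF=147). The worked arithmetic below uses only printed values:

```latex
\begin{align*}
F &= \frac{R^2}{1 - R^2} \cdot \frac{\text{Den DF}}{\text{Num DF}} \\
\text{SepalLength:} \quad F &= 1.6226 \times \tfrac{147}{2} \approx 119.26 \\
\text{PetalLength:} \quad F &= 16.0566 \times \tfrac{147}{2} \approx 1180.16
\end{align*}
```

The same multiplier, 147/2 = 73.5, reproduces the F values for SepalWidth and PetalWidth, which shows that the two petal measurements separate the species far more strongly than the sepal measurements do.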
Output 25.3.6: Misclassified Observations: Resubstitution

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Resubstitution Results using Quadratic Discriminant Function

Generalized Squared Distance Function

\[ D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}_j^{-1} \, (X - \bar{X}_j) + \ln |\mathrm{COV}_j| \]

Posterior Probability of Membership in Each Species

\[ \Pr(j|X) = \exp\left( -.5 \, D_j^2(X) \right) / \sum_k \exp\left( -.5 \, D_k^2(X) \right) \]

Posterior Probability of Membership in Species

       From          Classified
Obs    Species       into Species      Setosa   Versicolor   Virginica
  5    Virginica     Versicolor *      0.0000       0.6050      0.3950
  9    Versicolor    Virginica  *      0.0000       0.3359      0.6641
 12    Versicolor    Virginica  *      0.0000       0.1543      0.8457

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Resubstitution Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species

From Species      Setosa   Versicolor    Virginica      Total
Setosa                50            0            0         50
                  100.00         0.00         0.00     100.00
Versicolor             0           48            2         50
                    0.00        96.00         4.00     100.00
Virginica              0            1           49         50
                    0.00         2.00        98.00     100.00
Total                 50           49           51        150
                   33.33        32.67        34.00     100.00
Priors           0.33333      0.33333      0.33333

Error Count Estimates for Species

                  Setosa   Versicolor    Virginica      Total
Rate              0.0000       0.0400       0.0200     0.0200
Priors            0.3333       0.3333       0.3333
Output 25.3.7: Misclassified Observations: Cross validation

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross validation Results using Quadratic Discriminant Function

Generalized Squared Distance Function

\[ D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)j}^{-1} \, (X - \bar{X}_{(X)j}) + \ln |\mathrm{COV}_{(X)j}| \]

Posterior Probability of Membership in Each Species

\[ \Pr(j|X) = \exp\left( -.5 \, D_j^2(X) \right) / \sum_k \exp\left( -.5 \, D_k^2(X) \right) \]

Posterior Probability of Membership in Species

       From          Classified
Obs    Species       into Species      Setosa   Versicolor   Virginica
  5    Virginica     Versicolor *      0.0000       0.6632      0.3368
  8    Versicolor    Virginica  *      0.0000       0.3134      0.6866
  9    Versicolor    Virginica  *      0.0000       0.1616      0.8384
 12    Versicolor    Virginica  *      0.0000       0.0713      0.9287

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross validation Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species

From Species      Setosa   Versicolor    Virginica      Total
Setosa                50            0            0         50
                  100.00         0.00         0.00     100.00
Versicolor             0           47            3         50
                    0.00        94.00         6.00     100.00
Virginica              0            1           49         50
                    0.00         2.00        98.00     100.00
Total                 50           48           52        150
                   33.33        32.00        34.67     100.00
Priors           0.33333      0.33333      0.33333

Error Count Estimates for Species

                  Setosa   Versicolor    Virginica      Total
Rate              0.0000       0.0600       0.0200     0.0267
Priors            0.3333       0.3333       0.3333
Output 25.3.8: Output Statistics from Iris Data

Discriminant Analysis of Fisher (1936) Iris Data
Output Discriminant Statistics

                                               Sepal       Sepal       Petal       Petal
Obs   Species      _TYPE_     _NAME_          Length       Width      Length       Width
  1   .            N                          150.00      150.00      150.00      150.00
  2   Setosa       N                           50.00       50.00       50.00       50.00
  3   Versicolor   N                           50.00       50.00       50.00       50.00
  4   Virginica    N                           50.00       50.00       50.00       50.00
  5   .            MEAN                        58.43       30.57       37.58       11.99
  6   Setosa       MEAN                        50.06       34.28       14.62        2.46
  7   Versicolor   MEAN                        59.36       27.70       42.60       13.26
  8   Virginica    MEAN                        65.88       29.74       55.52       20.26
  9   Setosa       PRIOR                        0.33        0.33        0.33        0.33
 10   Versicolor   PRIOR                        0.33        0.33        0.33        0.33
 11   Virginica    PRIOR                        0.33        0.33        0.33        0.33
 12   Setosa       CSSCP      SepalLength     608.82      486.16       80.14       50.62
 13   Setosa       CSSCP      SepalWidth      486.16      704.08       57.32       45.56
 14   Setosa       CSSCP      PetalLength      80.14       57.32      147.78       29.74
 15   Setosa       CSSCP      PetalWidth       50.62       45.56       29.74       54.42
 16   Versicolor   CSSCP      SepalLength    1305.52      417.40      896.20      273.32
 17   Versicolor   CSSCP      SepalWidth      417.40      482.50      405.00      201.90
 18   Versicolor   CSSCP      PetalLength     896.20      405.00     1082.00      358.20
 19   Versicolor   CSSCP      PetalWidth      273.32      201.90      358.20      191.62
 20   Virginica    CSSCP      SepalLength    1981.28      459.44     1486.12      240.56
 21   Virginica    CSSCP      SepalWidth      459.44      509.62      349.76      233.38
 22   Virginica    CSSCP      PetalLength    1486.12      349.76     1492.48      239.24
 23   Virginica    CSSCP      PetalWidth      240.56      233.38      239.24      369.62
 24   .            PSSCP      SepalLength    3895.62     1363.00     2462.46      564.50
 25   .            PSSCP      SepalWidth     1363.00     1696.20      812.08      480.84
 26   .            PSSCP      PetalLength    2462.46      812.08     2722.26      627.18
 27   .            PSSCP      PetalWidth      564.50      480.84      627.18      615.66
 28   .            BSSCP      SepalLength    6321.21    -1995.27    16524.84     7127.93
 29   .            BSSCP      SepalWidth    -1995.27     1134.49    -5723.96    -2293.27
 30   .            BSSCP      PetalLength   16524.84    -5723.96    43710.28    18677.40
 31   .            BSSCP      PetalWidth     7127.93    -2293.27    18677.40     8041.33
 32   .            CSSCP      SepalLength   10216.83     -632.27    18987.30     7692.43
 33   .            CSSCP      SepalWidth     -632.27     2830.69    -4911.88    -1812.43
 34   .            CSSCP      PetalLength   18987.30    -4911.88    46432.54    19304.58
 35   .            CSSCP      PetalWidth     7692.43    -1812.43    19304.58     8656.99
 36   .            RSQUARED                     0.62        0.40        0.94        0.93
 37   Setosa       COV        SepalLength      12.42        9.92        1.64        1.03
 38   Setosa       COV        SepalWidth        9.92       14.37        1.17        0.93
 39   Setosa       COV        PetalLength       1.64        1.17        3.02        0.61
 40   Setosa       COV        PetalWidth        1.03        0.93        0.61        1.11
 41   Versicolor   COV        SepalLength      26.64        8.52       18.29        5.58
 42   Versicolor   COV        SepalWidth        8.52        9.85        8.27        4.12
 43   Versicolor   COV        PetalLength      18.29        8.27       22.08        7.31
 44   Versicolor   COV        PetalWidth        5.58        4.12        7.31        3.91
 45   Virginica    COV        SepalLength      40.43        9.38       30.33        4.91
 46   Virginica    COV        SepalWidth        9.38       10.40        7.14        4.76
 47   Virginica    COV        PetalLength      30.33        7.14       30.46        4.88
 48   Virginica    COV        PetalWidth        4.91        4.76        4.88        7.54
 49   .            PCOV       SepalLength      26.50        9.27       16.75        3.84
 50   .            PCOV       SepalWidth        9.27       11.54        5.52        3.27
 51   .            PCOV       PetalLength      16.75        5.52       18.52        4.27
 52   .            PCOV       PetalWidth        3.84        3.27        4.27        4.19
 53   .            BCOV       SepalLength      63.21      -19.95      165.25       71.28
 54   .            BCOV       SepalWidth      -19.95       11.34      -57.24      -22.93
 55   .            BCOV       PetalLength     165.25      -57.24      437.10      186.77
 56   .            BCOV       PetalWidth       71.28      -22.93      186.77       80.41
 57   .            COV        SepalLength      68.57       -4.24      127.43       51.63
 58   .            COV        SepalWidth       -4.24       19.00      -32.97      -12.16
 59   .            COV        PetalLength     127.43      -32.97      311.63      129.56
 60   .            COV        PetalWidth       51.63      -12.16      129.56       58.10
 61   Setosa       STD                          3.52        3.79        1.74        1.05
 62   Versicolor   STD                          5.16        3.14        4.70        1.98
 63   Virginica    STD                          6.36        3.22        5.52        2.75
 64   .            PSTD                         5.15        3.40        4.30        2.05
 65   .            BSTD                         7.95        3.37       20.91        8.97
 66   .            STD                          8.28        4.36       17.65        7.62
 67   Setosa       CORR       SepalLength       1.00        0.74        0.27        0.28
 68   Setosa       CORR       SepalWidth        0.74        1.00        0.18        0.23
 69   Setosa       CORR       PetalLength       0.27        0.18        1.00        0.33
 70   Setosa       CORR       PetalWidth        0.28        0.23        0.33        1.00
 71   Versicolor   CORR       SepalLength      1.000       0.526       0.754       0.546
 72   Versicolor   CORR       SepalWidth       0.526       1.000       0.561       0.664
 73   Versicolor   CORR       PetalLength      0.754       0.561       1.000       0.787
 74   Versicolor   CORR       PetalWidth       0.546       0.664       0.787       1.000
 75   Virginica    CORR       SepalLength      1.000       0.457       0.864       0.281
 76   Virginica    CORR       SepalWidth       0.457       1.000       0.401       0.538
 77   Virginica    CORR       PetalLength      0.864       0.401       1.000       0.322
 78   Virginica    CORR       PetalWidth       0.281       0.538       0.322       1.000
 79   .            PCORR      SepalLength      1.000       0.530       0.756       0.365
 80   .            PCORR      SepalWidth       0.530       1.000       0.378       0.471
 81   .            PCORR      PetalLength      0.756       0.378       1.000       0.484
 82   .            PCORR      PetalWidth       0.365       0.471       0.484       1.000
 83   .            BCORR      SepalLength      1.000      -0.745       0.994       1.000
 84   .            BCORR      SepalWidth      -0.745       1.000      -0.813      -0.759
 85   .            BCORR      PetalLength      0.994      -0.813       1.000       0.996
 86   .            BCORR      PetalWidth       1.000      -0.759       0.996       1.000
 87   .            CORR       SepalLength      1.000      -0.118       0.872       0.818
 88   .            CORR       SepalWidth      -0.118       1.000      -0.428      -0.366
 89   .            CORR       PetalLength      0.872      -0.428       1.000       0.963
 90   .            CORR       PetalWidth       0.818      -0.366       0.963       1.000
 91   Setosa       STDMEAN                    -1.011       0.850      -1.301      -1.251
 92   Versicolor   STDMEAN                     0.112      -0.659       0.284       0.166
 93   Virginica    STDMEAN                     0.899      -0.191       1.016       1.085
 94   Setosa       PSTDMEAN                   -1.627       1.091      -5.335      -4.658
 95   Versicolor   PSTDMEAN                    0.180      -0.846       1.167       0.619
 96   Virginica    PSTDMEAN                    1.447      -0.245       4.169       4.039
 97   .            LNDETERM                    8.462       8.462       8.462       8.462
 98   Setosa       LNDETERM                    5.353       5.353       5.353       5.353
 99   Versicolor   LNDETERM                    7.546       7.546       7.546       7.546
100   Virginica    LNDETERM                    9.494       9.494       9.494       9.494
101   Setosa       QUAD       SepalLength     -0.095       0.062       0.023       0.024
102   Setosa       QUAD       SepalWidth       0.062      -0.078      -0.006       0.011
103   Setosa       QUAD       PetalLength      0.023      -0.006      -0.194       0.090
104   Setosa       QUAD       PetalWidth       0.024       0.011       0.090      -0.530
105   Setosa       QUAD       _LINEAR_         4.455      -0.762       3.356      -3.126
106   Setosa       QUAD       _CONST_       -121.826    -121.826    -121.826    -121.826
107   Versicolor   QUAD       SepalLength     -0.048       0.018       0.043      -0.032
108   Versicolor   QUAD       SepalWidth       0.018      -0.099      -0.011       0.097
109   Versicolor   QUAD       PetalLength      0.043      -0.011      -0.099       0.135
110   Versicolor   QUAD       PetalWidth      -0.032       0.097       0.135      -0.436
111   Versicolor   QUAD       _LINEAR_         1.801       1.596       0.327      -1.471
112   Versicolor   QUAD       _CONST_        -76.549     -76.549     -76.549     -76.549
113   Virginica    QUAD       SepalLength     -0.053       0.017       0.050      -0.009
114   Virginica    QUAD       SepalWidth       0.017      -0.079      -0.006       0.042
115   Virginica    QUAD       PetalLength      0.050      -0.006      -0.067       0.014
116   Virginica    QUAD       PetalWidth      -0.009       0.042       0.014      -0.097
117   Virginica    QUAD       _LINEAR_         0.737       1.325       0.623       0.966
118   Virginica    QUAD       _CONST_        -75.821     -75.821     -75.821     -75.821
Example 25.5. Quadratic Discriminant Analysis of Remote-Sensing Data on Crops
In this example, PROC DISCRIM uses normal-theory methods (METHOD=NORMAL) assuming unequal variances (POOL=NO) for the remote-sensing data of Example 25.4. The PRIORS statement, PRIORS PROP, sets the prior probabilities proportional to the sample sizes. The CROSSVALIDATE option displays cross validation error-rate estimates. Note that the total error count estimate by cross validation (0.5556) is much larger than the total error count estimate by resubstitution (0.1111). The following statements produce Output 25.5.1:
   proc discrim data=crops method=normal pool=no crossvalidate;
      class Crop;
      priors prop;
      id xvalues;
      var x1-x4;
      title2 'Using Quadratic Discriminant Function';
   run;
Output 25.5.1: Quadratic Discriminant Function on Crop Data

Discriminant Analysis of Remote Sensing Data on Five Crops
Using Quadratic Discriminant Function

The DISCRIM Procedure

Observations     36     DF Total                35
Variables         4     DF Within Classes       31
Classes           5     DF Between Classes       4

Class Level Information

              Variable                                          Prior
Crop          Name          Frequency     Weight  Proportion  Probability
Clover        Clover               11    11.0000    0.305556     0.305556
Corn          Corn                  7     7.0000    0.194444     0.194444
Cotton        Cotton                6     6.0000    0.166667     0.166667
Soybeans      Soybeans              6     6.0000    0.166667     0.166667
Sugarbeets    Sugarbeets            6     6.0000    0.166667     0.166667

Within Covariance Matrix Information

              Covariance     Natural Log of the Determinant
Crop          Matrix Rank    of the Covariance Matrix
Clover                  4                          23.64618
Corn                    4                          11.13472
Cotton                  4                          13.23569
Soybeans                4                          12.45263
Sugarbeets              4                          17.76293

Pairwise Generalized Squared Distances Between Groups

\[ D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}_j^{-1} \, (\bar{X}_i - \bar{X}_j) + \ln |\mathrm{COV}_j| - 2 \ln \mathrm{PRIOR}_j \]

Generalized Squared Distance to Crop

From Crop        Clover         Corn       Cotton     Soybeans   Sugarbeets
Clover         26.01743         1320    104.18297    194.10546     31.40816
Corn           27.73809     14.40994    150.50763     38.36252     25.55421
Cotton         26.38544    588.86232     16.81921     52.03266     37.15560
Soybeans       27.07134     46.42131     41.01631     16.03615     23.15920
Sugarbeets     26.80188    332.11563     43.98280    107.95676     21.34645
Discriminant Analysis of Remote Sensing Data on Five Crops
Using Quadratic Discriminant Function

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.CROPS
Resubstitution Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

\[ D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}_j^{-1} \, (X - \bar{X}_j) + \ln |\mathrm{COV}_j| - 2 \ln \mathrm{PRIOR}_j \]

Posterior Probability of Membership in Each Crop

\[ \Pr(j|X) = \exp\left( -.5 \, D_j^2(X) \right) / \sum_k \exp\left( -.5 \, D_k^2(X) \right) \]

Number of Observations and Percent Classified into Crop

From Crop        Clover       Corn     Cotton   Soybeans   Sugarbeets      Total
Clover                9          0          0          0            2         11
                  81.82       0.00       0.00       0.00        18.18     100.00
Corn                  0          7          0          0            0          7
                   0.00     100.00       0.00       0.00         0.00     100.00
Cotton                0          0          6          0            0          6
                   0.00       0.00     100.00       0.00         0.00     100.00
Soybeans              0          0          0          6            0          6
                   0.00       0.00       0.00     100.00         0.00     100.00
Sugarbeets            0          0          1          1            4          6
                   0.00       0.00      16.67      16.67        66.67     100.00
Total                 9          7          7          7            6         36
                  25.00      19.44      19.44      19.44        16.67     100.00
Priors          0.30556    0.19444    0.16667    0.16667      0.16667

Error Count Estimates for Crop

                 Clover       Corn     Cotton   Soybeans   Sugarbeets      Total
Rate             0.1818     0.0000     0.0000     0.0000       0.3333     0.1111
Priors           0.3056     0.1944     0.1667     0.1667       0.1667

Classification Summary for Calibration Data: WORK.CROPS
Cross validation Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

\[ D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)j}^{-1} \, (X - \bar{X}_{(X)j}) + \ln |\mathrm{COV}_{(X)j}| - 2 \ln \mathrm{PRIOR}_j \]

Posterior Probability of Membership in Each Crop

\[ \Pr(j|X) = \exp\left( -.5 \, D_j^2(X) \right) / \sum_k \exp\left( -.5 \, D_k^2(X) \right) \]

Number of Observations and Percent Classified into Crop

From Crop        Clover       Corn     Cotton   Soybeans   Sugarbeets      Total
Clover                9          0          0          0            2         11
                  81.82       0.00       0.00       0.00        18.18     100.00
Corn                  3          2          0          0            2          7
                  42.86      28.57       0.00       0.00        28.57     100.00
Cotton                3          0          2          0            1          6
                  50.00       0.00      33.33       0.00        16.67     100.00
Soybeans              3          0          0          2            1          6
                  50.00       0.00       0.00      33.33        16.67     100.00
Sugarbeets            3          0          1          1            1          6
                  50.00       0.00      16.67      16.67        16.67     100.00
Total                21          2          3          3            7         36
                  58.33       5.56       8.33       8.33        19.44     100.00
Priors          0.30556    0.19444    0.16667    0.16667      0.16667

Error Count Estimates for Crop

                 Clover       Corn     Cotton   Soybeans   Sugarbeets      Total
Rate             0.1818     0.7143     0.6667     0.6667       0.8333     0.5556
Priors           0.3056     0.1944     0.1667     0.1667       0.1667
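Because PRIORS PROP makes the priors unequal, the total error count estimates in Output 25.5.1 are not simple averages of the class rates. The DATA step below is a sketch that recomputes both totals, assuming each total is the prior-weighted sum of the per-crop rates (an assumption that is consistent with the printed values). The array and variable names are illustrative.

```sas
/* Sketch: total error count estimates as prior-weighted sums */
data _null_;
   /* Order: Clover, Corn, Cotton, Soybeans, Sugarbeets */
   array prior{5} _temporary_ (0.305556 0.194444 0.166667 0.166667 0.166667);
   array resub{5} _temporary_ (0.1818 0.0000 0.0000 0.0000 0.3333);
   array cross{5} _temporary_ (0.1818 0.7143 0.6667 0.6667 0.8333);

   total_resub = 0;
   total_cross = 0;
   do j = 1 to 5;
      total_resub = total_resub + prior{j} * resub{j};
      total_cross = total_cross + prior{j} * cross{j};
   end;

   put total_resub= 6.4 total_cross= 6.4;
run;
```

The results are approximately 0.1111 for resubstitution and 0.5556 for cross validation, matching the totals printed in the output. The gap reflects how few observations each crop contributes: with only six or seven observations per class and four variables, leaving one observation out changes the estimated class covariance matrix substantially.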