The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica.
This example is a canonical discriminant analysis that creates an output data set containing scores on the canonical variables and plots the canonical variables. The following statements produce Output 21.1.1 through Output 21.1.7:
   /* Map the numeric species codes to readable names */
   proc format;
      value specname
         1='Setosa '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      title 'Fisher (1936) Iris Data';
      /* The trailing @@ reads several observations from each data line */
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      /* Character copy of the species name, used later as the plotting symbol */
      symbol = put(Species, specname10.);
      datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3 63 28 51 15 3
46 34 14 03 1 69 31 51 23 3 62 22 45 15 2 59 32 48 18 2 46 36 10 02 1
61 30 46 14 2 60 27 51 16 2 65 30 52 20 3 56 25 39 11 2 65 30 55 18 3
58 27 51 19 3 68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3 49 25 45 17 3
55 35 13 02 1 67 30 52 23 3 70 32 47 14 2 64 32 45 15 2 61 28 40 13 2
48 31 16 02 1 59 30 51 18 3 55 24 38 11 2 63 25 50 19 3 64 32 53 23 3
52 34 14 02 1 49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1 77 28 67 20 3
63 27 49 18 3 47 32 16 02 1 55 26 44 12 2 50 23 33 10 2 72 32 60 18 3
48 30 14 03 1 51 38 16 02 1 61 30 49 18 3 48 34 19 02 1 50 30 16 02 1
50 32 12 02 1 61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1 51 35 14 02 1
56 30 45 15 2 58 27 41 10 2 50 34 16 04 1 46 32 14 02 1 60 29 45 15 2
57 26 35 10 2 57 44 15 04 1 50 36 14 02 1 77 30 61 23 3 63 34 56 24 3
58 27 51 19 3 57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3 49 24 33 10 2
56 27 42 13 2 57 30 42 12 2 55 42 14 02 1 49 31 15 02 1 77 26 69 23 3
60 22 50 15 3 54 39 17 04 1 66 29 46 13 2 52 27 39 14 2 60 34 45 16 2
50 34 15 02 1 44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2 74 28 61 19 3
59 30 42 15 2 51 34 15 02 1 50 35 13 03 1 56 28 49 20 3 60 22 40 10 2
73 29 63 18 3 67 25 58 18 3 49 31 15 01 1 67 31 47 15 2 63 23 44 13 2
54 37 15 02 1 56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3 54 39 13 04 1
51 35 14 03 1 72 36 61 25 3 65 32 51 20 3 61 29 47 14 2 56 29 36 13 2
69 31 49 15 2 64 27 53 19 3 68 30 55 21 3 55 25 40 13 2 48 34 16 02 1
48 30 14 01 1 45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1 51 37 15 04 1
52 35 15 02 1 58 28 51 24 3 67 30 50 17 2 63 33 60 25 3 53 37 15 02 1
   ;

   /* Canonical discriminant analysis; OUT= saves the canonical variable scores */
   proc candisc data=iris out=outcan distance anova;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;
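Before turning to the displayed output, it can help to look at the scores that the OUT= option saves. The following step is a minimal sketch, not part of the original example; it assumes the PROC CANDISC step above has run and that the output data set contains the canonical variables under their default names, Can1 and Can2.

```sas
/* Sketch: list the first five observations of the OUT= data set. */
/* Assumes outcan was created by the PROC CANDISC step above.     */
proc print data=outcan(obs=5);
   var Species SepalLength SepalWidth PetalLength PetalWidth Can1 Can2;
run;
```

Each row should show the original measurements followed by that specimen's scores on the two canonical variables.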
Fisher (1936) Iris Data

The CANDISC Procedure

Observations    150    DF Total               149
Variables         4    DF Within Classes      147
Classes           3    DF Between Classes       2

Class Level Information

              Variable
Species       Name          Frequency      Weight    Proportion
Setosa        Setosa               50     50.0000      0.333333
Versicolor    Versicolor           50     50.0000      0.333333
Virginica     Virginica            50     50.0000      0.333333
Fisher (1936) Iris Data

The CANDISC Procedure

Pairwise Squared Distances Between Groups

$D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$

Squared Distance to Species

From Species       Setosa    Versicolor     Virginica
Setosa                  0      89.86419     179.38471
Versicolor       89.86419             0      17.20107
Virginica       179.38471      17.20107             0
Fisher (1936) Iris Data

The CANDISC Procedure

F Statistics, NDF=4, DDF=144 for Squared Distance to Species

From Species       Setosa    Versicolor     Virginica
Setosa                  0     550.18889          1098
Versicolor      550.18889             0     105.31265
Virginica            1098     105.31265             0

Prob > Mahalanobis Distance for Squared Distance to Species

From Species       Setosa    Versicolor     Virginica
Setosa             1.0000        <.0001        <.0001
Versicolor         <.0001        1.0000        <.0001
Virginica          <.0001        <.0001        1.0000
Fisher (1936) Iris Data

The CANDISC Procedure

Univariate Test Statistics

F Statistics, Num DF=2, Den DF=147

                                        Total       Pooled      Between
                                     Standard     Standard     Standard                R-Square
Variable       Label                Deviation    Deviation    Deviation    R-Square   / (1-RSq)    F Value    Pr > F
SepalLength    Sepal Length in mm.     8.2807       5.1479       7.9506      0.6187      1.6226     119.26    <.0001
SepalWidth     Sepal Width in mm.      4.3587       3.3969       3.3682      0.4008      0.6688      49.16    <.0001
PetalLength    Petal Length in mm.    17.6530       4.3033      20.9070      0.9414     16.0566    1180.16    <.0001
PetalWidth     Petal Width in mm.      7.6224       2.0465       8.9673      0.9289     13.0613     960.01    <.0001

Average R-Square

Unweighted              0.7224358
Weighted by Variance    0.8689444

Multivariate Statistics and F Approximations

S=2  M=0.5  N=71

Statistic                       Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda              0.02343863     199.15         8       288    <.0001
Pillai's Trace             1.19189883      53.47         8       290    <.0001
Hotelling-Lawley Trace    32.47732024     582.20         8     203.4    <.0001
Roy's Greatest Root       32.19192920    1166.96         4       145    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
Fisher (1936) Iris Data

The CANDISC Procedure

                       Adjusted    Approximate        Squared
       Canonical      Canonical       Standard      Canonical
     Correlation    Correlation          Error    Correlation
1       0.984821       0.984508       0.002468       0.969872
2       0.471197       0.461445       0.063734       0.222027

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

     Eigenvalue    Difference    Proportion    Cumulative
1       32.1919       31.9065        0.9912        0.9912
2        0.2854                      0.0088        1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero

     Likelihood    Approximate
          Ratio        F Value    Num DF    Den DF    Pr > F
1    0.02343863         199.15         8       288    <.0001
2    0.77797337          13.79         3       145    <.0001
Fisher (1936) Iris Data

The CANDISC Procedure

Total Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.791888     0.217593
SepalWidth     Sepal Width in mm.     -0.530759     0.757989
PetalLength    Petal Length in mm.     0.984951     0.046037
PetalWidth     Petal Width in mm.      0.972812     0.222902

Between Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.991468     0.130348
SepalWidth     Sepal Width in mm.     -0.825658     0.564171
PetalLength    Petal Length in mm.     0.999750     0.022358
PetalWidth     Petal Width in mm.      0.994044     0.108977

Pooled Within Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.222596     0.310812
SepalWidth     Sepal Width in mm.     -0.119012     0.863681
PetalLength    Petal Length in mm.     0.706065     0.167701
PetalWidth     Petal Width in mm.      0.633178     0.737242
Fisher (1936) Iris Data

The CANDISC Procedure

Total-Sample Standardized Canonical Coefficients

Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     -0.686779533     0.019958173
SepalWidth     Sepal Width in mm.      -0.668825075     0.943441829
PetalLength    Petal Length in mm.      3.885795047    -1.645118866
PetalWidth     Petal Width in mm.       2.142238715     2.164135931

Pooled Within-Class Standardized Canonical Coefficients

Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     -.4269548486    0.0124075316
SepalWidth     Sepal Width in mm.      -.5212416758    0.7352613085
PetalLength    Petal Length in mm.     0.9472572487    -.4010378190
PetalWidth     Petal Width in mm.      0.5751607719    0.5810398645
PROC CANDISC first displays information about the observations and the classes in the data set in Output 21.1.1.
The DISTANCE option in the PROC CANDISC statement displays squared Mahalanobis distances between class means. Results from the DISTANCE option are shown in Output 21.1.2 and Output 21.1.3.
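As a check on these results, each F statistic can be recovered from the corresponding squared distance. The conversion below is the usual two-group form based on a covariance matrix pooled over all classes; it does not appear in the output, so read it as a worked sketch. The symbols $n_i$ and $n_j$ (class sizes), $N$ (total observations), $g$ (number of classes), and $p$ (number of variables) are introduced here for illustration.

```latex
\begin{aligned}
F &= \frac{n_i n_j}{n_i + n_j}\cdot\frac{N - g - p + 1}{p\,(N - g)}\; D^2(i|j)
   = \frac{50 \cdot 50}{100}\cdot\frac{144}{4 \cdot 147}\; D^2(i|j)
   \approx 6.12245\; D^2(i|j) \\[4pt]
\text{Setosa vs. Versicolor:}\quad F &\approx 6.12245 \times 89.86419 \approx 550.19 \\
\text{Versicolor vs. Virginica:}\quad F &\approx 6.12245 \times 17.20107 \approx 105.31
\end{aligned}
```

The degrees of freedom in this form, $p = 4$ and $N - g - p + 1 = 144$, agree with the NDF and DDF printed in the output heading.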
The ANOVA option specifies testing of the hypothesis that the class means are equal using univariate statistics. The resulting R² values (see Output 21.1.4) range from 0.4008 for SepalWidth to 0.9414 for PetalLength, and each variable is significant at the 0.0001 level. The multivariate test for differences between the classes (which is displayed by default) is also significant at the 0.0001 level; you would expect this from the highly significant univariate test results.
The R² between Can1 and the class variable, 0.969872, is much larger than the corresponding R² for Can2, 0.222027. This is displayed in Output 21.1.5.
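The univariate F values follow directly from the R-Square / (1-RSq) column and the stated degrees of freedom (Num DF=2, Den DF=147). The relation below is the standard one-way ANOVA identity, written out here as a worked check rather than taken from the output.

```latex
\begin{aligned}
F &= \frac{R^2}{1 - R^2}\cdot\frac{\text{Den DF}}{\text{Num DF}}
   = \frac{R^2}{1 - R^2}\cdot\frac{147}{2} \\[4pt]
\text{SepalWidth:}\quad  F &\approx 0.6688 \times 73.5 \approx 49.16 \\
\text{PetalLength:}\quad F &\approx 16.0566 \times 73.5 \approx 1180.16
\end{aligned}
```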
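The squared canonical correlations also tie together the eigenvalues and the multivariate statistics shown in the output. The worked check below uses the relation printed in the output heading, Eigenvalue = CanRsq/(1-CanRsq), together with the standard definitions of Wilks' lambda and Pillai's trace; the symbols $\rho_k^2$ (squared canonical correlation) and $\lambda_k$ (eigenvalue) are notation introduced here. Small discrepancies in the last digits come from using the rounded printed values.

```latex
\begin{aligned}
\lambda_k &= \frac{\rho_k^2}{1 - \rho_k^2}:
  \qquad \lambda_1 = \frac{0.969872}{0.030128} \approx 32.19,
  \qquad \lambda_2 = \frac{0.222027}{0.777973} \approx 0.2854 \\[4pt]
\text{Wilks' } \Lambda &= \prod_k \left(1 - \rho_k^2\right)
  = 0.030128 \times 0.777973 \approx 0.02344 \\[4pt]
\text{Pillai's trace} &= \sum_k \rho_k^2 = 0.969872 + 0.222027 \approx 1.1919
\end{aligned}
```

The first eigenvalue accounts for about 32.19 / (32.19 + 0.29), or roughly 99%, of the total, which matches the Proportion column.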
The raw canonical coefficients (shown in Output 21.1.7) for the first canonical variable, Can1, show that the classes differ most widely on the linear combination of the centered variables −0.0829378 × SepalLength − 0.153447 × SepalWidth + 0.220121 × PetalLength + 0.281046 × PetalWidth.
Fisher (1936) Iris Data

The CANDISC Procedure

Raw Canonical Coefficients

Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     -.0829377642    0.0024102149
SepalWidth     Sepal Width in mm.      -.1534473068    0.2164521235
PetalLength    Petal Length in mm.     0.2201211656    -.0931921210
PetalWidth     Petal Width in mm.      0.2810460309    0.2839187853

Class Means on Canonical Variables

Species               Can1            Can2
Setosa        -7.607599927     0.215133017
Versicolor     1.825049490    -0.727899622
Virginica      5.782550437     0.512766605
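The class means on the canonical variables can also be reproduced from the saved scores. The following step is a sketch that is not part of the original example; it assumes the OUT=outcan data set from the PROC CANDISC step and the default canonical variable names Can1 and Can2.

```sas
/* Sketch: average the canonical scores within each species.  */
/* Assumes outcan was created by the earlier PROC CANDISC run. */
proc means data=outcan mean maxdec=4;
   class Species;
   var Can1 Can2;
run;
```

Because the three classes are the same size and the canonical variables are centered, the three class means on each canonical variable should sum to approximately zero.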
The plot of canonical variables in Output 21.1.8 shows that, of the two canonical variables, Can1 has the greater discriminatory power. The following invocation of the %PLOTIT macro creates this plot:
%plotit(data=outcan, plotvars=Can2 Can1, labelvar=_blank_, symvar=symbol, typevar=symbol, symsize=1, symlen=4, exttypes=symbol, ls=100, tsize=1.5, extend=close);
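If the %PLOTIT macro is not available, a comparable scatter plot can be drawn with PROC SGPLOT. This is an alternative sketch rather than the method used in this example; it assumes a SAS release that includes the SG procedures (SAS 9.2 or later) and the outcan data set created above. The plot will not match the macro's output in symbols or layout.

```sas
/* Sketch: plot Can2 against Can1, one marker style per species. */
/* This replaces the %PLOTIT call; it does not replicate it.     */
proc sgplot data=outcan;
   scatter x=Can1 y=Can2 / group=Species;
run;
```

As in the macro call, Can1 is on the horizontal axis, so the separation of the three species along that axis is easy to see.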