The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.
In this example, PROC ACECLUS transforms the data and PROC FASTCLUS performs the clustering. Compare the results with the example in Chapter 28, "The FASTCLUS Procedure": the crosstabulation produced by the FREQ procedure shows fewer misclassifications when PROC ACECLUS is used. The following statements produce Output 16.1.1 through Output 16.1.5.
   /* Map the numeric species codes to readable labels. */
   proc format;
      value specname
         1='Setosa '
         2='Versicolor'
         3='Virginica ';
   run;

   /* Read five values per specimen; @@ holds the line so that */
   /* several specimens can be read from each data line.       */
   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3 63 28 51 15 3
   46 34 14 03 1 69 31 51 23 3 62 22 45 15 2 59 32 48 18 2 46 36 10 02 1
   61 30 46 14 2 60 27 51 16 2 65 30 52 20 3 56 25 39 11 2 65 30 55 18 3
   58 27 51 19 3 68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3 49 25 45 17 3
   55 35 13 02 1 67 30 52 23 3 70 32 47 14 2 64 32 45 15 2 61 28 40 13 2
   48 31 16 02 1 59 30 51 18 3 55 24 38 11 2 63 25 50 19 3 64 32 53 23 3
   52 34 14 02 1 49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1 77 28 67 20 3
   63 27 49 18 3 47 32 16 02 1 55 26 44 12 2 50 23 33 10 2 72 32 60 18 3
   48 30 14 03 1 51 38 16 02 1 61 30 49 18 3 48 34 19 02 1 50 30 16 02 1
   50 32 12 02 1 61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1 51 35 14 02 1
   56 30 45 15 2 58 27 41 10 2 50 34 16 04 1 46 32 14 02 1 60 29 45 15 2
   57 26 35 10 2 57 44 15 04 1 50 36 14 02 1 77 30 61 23 3 63 34 56 24 3
   58 27 51 19 3 57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3 49 24 33 10 2
   56 27 42 13 2 57 30 42 12 2 55 42 14 02 1 49 31 15 02 1 77 26 69 23 3
   60 22 50 15 3 54 39 17 04 1 66 29 46 13 2 52 27 39 14 2 60 34 45 16 2
   50 34 15 02 1 44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2 74 28 61 19 3
   59 30 42 15 2 51 34 15 02 1 50 35 13 03 1 56 28 49 20 3 60 22 40 10 2
   73 29 63 18 3 67 25 58 18 3 49 31 15 01 1 67 31 47 15 2 63 23 44 13 2
   54 37 15 02 1 56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3 54 39 13 04 1
   51 35 14 03 1 72 36 61 25 3 65 32 51 20 3 61 29 47 14 2 56 29 36 13 2
   69 31 49 15 2 64 27 53 19 3 68 30 55 21 3 55 25 40 13 2 48 34 16 02 1
   48 30 14 01 1 45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1 51 37 15 04 1
   52 35 15 02 1 58 28 51 24 3 67 30 50 17 2 63 33 60 25 3 53 37 15 02 1
   ;

   /* Estimate the within-cluster covariance matrix from the closest  */
   /* 2% of observation pairs (P=.02). OUT=ace holds the canonical    */
   /* variables Can1-Can4; OUTSTAT=score holds the statistics.        */
   proc aceclus data=iris out=ace p=.02 outstat=score;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;

   /* Plot the first two canonical variables, identified by species. */
   legend1 frame cframe=white cborder=black
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;
   proc gplot data=ace;
      plot can2*can1=Species /
           frame cframe=white legend=legend1 vaxis=axis1 haxis=axis2;
      format Species specname. ;
   run;
   quit;

   /* Cluster on all canonical variables (names beginning with "can"). */
   proc fastclus data=ace maxc=3 maxiter=10 conv=0 out=clus;
      var can:;
   run;

   /* Crosstabulate cluster membership against the known species. */
   proc freq;
      tables cluster*Species;
   run;
Fisher (1936) Iris Data

The ACECLUS Procedure

Approximate Covariance Estimation for Cluster Analysis

   Observations    150       Proportion     0.0200
   Variables         4       Converge      0.00100

Means and Standard Deviations

                               Standard
   Variable          Mean     Deviation    Label
   SepalLength    58.4333        8.2807    Sepal Length in mm.
   SepalWidth     30.5733        4.3587    Sepal Width in mm.
   PetalLength    37.5800       17.6530    Petal Length in mm.
   PetalWidth     11.9933        7.6224    Petal Width in mm.

COV: Total Sample Covariances

                  SepalLength     SepalWidth    PetalLength     PetalWidth
   SepalLength     68.5693512     -4.2434004    127.4315436     51.6270694
   SepalWidth      -4.2434004     18.9979418    -32.9656376    -12.1639374
   PetalLength    127.4315436    -32.9656376    311.6277852    129.5609396
   PetalWidth      51.6270694    -12.1639374    129.5609396     58.1006264

Initial Within-Cluster Covariance Estimate = Full Covariance Matrix

Threshold = 0.334211

Iteration History

                                         Pairs
                   RMS     Distance     Within    Convergence
   Iteration    Distance     Cutoff     Cutoff        Measure
   ------------------------------------------------------------
           1       2.828      0.945      408.0       0.465775
           2      11.905      3.979      559.0       0.013487
           3      13.152      4.396      940.0       0.029499
           4      13.439      4.491     1506.0       0.046846
           5      13.271      4.435     2036.0       0.046859
           6      12.591      4.208     2285.0       0.025027
           7      12.199      4.077     2366.0       0.009559
           8      12.121      4.051     2402.0       0.003895
           9      12.064      4.032     2417.0       0.002051
          10      12.047      4.026     2429.0       0.000971

Algorithm converged.
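The Distance Cutoff column in the iteration history is consistent with the Threshold value multiplied by the RMS Distance for the same iteration. The following Python sketch is an independent arithmetic check, not part of the SAS program; the names THRESHOLD, history, and cutoff are illustrative, and the numbers are copied from the listing.

```python
# Independent check of the ACECLUS iteration history (not SAS code).
# Assumption: distance cutoff = threshold * RMS distance between pairs.
THRESHOLD = 0.334211

# (iteration, RMS distance, printed distance cutoff), copied from the listing
history = [
    (1, 2.828, 0.945),
    (2, 11.905, 3.979),
    (3, 13.152, 4.396),
    (4, 13.439, 4.491),
    (5, 13.271, 4.435),
    (6, 12.591, 4.208),
    (7, 12.199, 4.077),
    (8, 12.121, 4.051),
    (9, 12.064, 4.032),
    (10, 12.047, 4.026),
]


def cutoff(rms_distance, threshold=THRESHOLD):
    """Return the distance cutoff implied by a threshold and an RMS distance."""
    return threshold * rms_distance


for iteration, rms, printed in history:
    print(f"{iteration:2d}  computed={cutoff(rms):.3f}  printed={printed:.3f}")
```

Because the RMS distances are printed to three decimals, the computed and printed cutoffs can differ in the last digit.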
ACE: Approximate Covariance Estimate Within Clusters

                  SepalLength     SepalWidth    PetalLength     PetalWidth
   SepalLength    11.73342939     5.47550432     4.95389049     2.02902429
   SepalWidth      5.47550432     6.91992590     2.42177851     1.74125154
   PetalLength     4.95389049     2.42177851     6.53746398     2.35302594
   PetalWidth      2.02902429     1.74125154     2.35302594     2.05166735

Eigenvalues of Inv(ACE)*(COV-ACE)

        Eigenvalue    Difference    Proportion    Cumulative
   1       63.7716       61.1593        0.9367        0.9367
   2        2.6123        1.5561        0.0384        0.9751
   3        1.0562        0.4167        0.0155        0.9906
   4        0.6395                     0.00939        1.0000

Eigenvectors (Raw Canonical Coefficients)

                                            Can1        Can2        Can3        Can4
   SepalLength   Sepal Length in mm.    -.012009    -.098074    -.059852    0.402352
   SepalWidth    Sepal Width in mm.     -.211068    -.000072    0.402391    -.225993
   PetalLength   Petal Length in mm.    0.324705    -.328583    0.110383    -.321069
   PetalWidth    Petal Width in mm.     0.266239    0.870434    -.085215    0.320286

Standardized Canonical Coefficients

                                            Can1        Can2        Can3        Can4
   SepalLength   Sepal Length in mm.    -0.09944    -0.81211    -0.49562     3.33174
   SepalWidth    Sepal Width in mm.     -0.91998    -0.00031     1.75389    -0.98503
   PetalLength   Petal Length in mm.     5.73200    -5.80047     1.94859    -5.66782
   PetalWidth    Petal Width in mm.      2.02937     6.63478    -0.64954     2.44134
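Two relationships in this output can be checked by hand. Each Proportion is the eigenvalue divided by the sum of all four eigenvalues, and each standardized canonical coefficient is consistent with the raw coefficient multiplied by the total-sample standard deviation of its variable. The Python sketch below is an independent check rather than SAS code; the variable names are illustrative, and it uses one coefficient per variable that is printed with a leading zero in the listing.

```python
# Independent check of the ACECLUS eigenvalue table and coefficient scaling.
eigenvalues = [63.7716, 2.6123, 1.0562, 0.6395]

# Proportion = eigenvalue / sum of eigenvalues; Cumulative = running sum.
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"{i}  proportion={p:.4f}  cumulative={c:.4f}")

# Total-sample standard deviations from "Means and Standard Deviations".
std_dev = {
    "SepalLength": 8.2807,
    "SepalWidth": 4.3587,
    "PetalLength": 17.6530,
    "PetalWidth": 7.6224,
}

# (variable, canonical variable, raw coefficient, printed standardized value)
checks = [
    ("SepalLength", "Can4", 0.402352, 3.33174),
    ("SepalWidth", "Can3", 0.402391, 1.75389),
    ("PetalLength", "Can1", 0.324705, 5.73200),
    ("PetalWidth", "Can1", 0.266239, 2.02937),
]

# Assumption: standardized coefficient = raw coefficient * standard deviation.
for variable, can, raw, printed in checks:
    computed = raw * std_dev[variable]
    print(f"{variable:12s} {can}  computed={computed:.5f}  printed={printed:.5f}")
```

The first canonical variable accounts for about 94% of the eigenvalue total, which is why Can1 dominates the separation of the clusters.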
The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=10  Converge=0

Cluster Summary

                                       Maximum Distance
                            RMS Std           from Seed      Radius    Nearest
   Cluster    Frequency   Deviation      to Observation    Exceeded    Cluster
   -----------------------------------------------------------------------------
         1           50      1.1016              5.2768                      3
         2           50      1.8880              6.8298                      3
         3           50      1.4138              5.3152                      2

Cluster Summary

               Distance Between
   Cluster    Cluster Centroids
   -----------------------------
         1              13.2845
         2               5.8580
         3               5.8580

Statistics for Variables

   Variable     Total STD    Within STD     R-Square    RSQ/(1-RSQ)
   ------------------------------------------------------------------
   Can1           8.04808       1.48537     0.966394      28.756658
   Can2           1.90061       1.85646     0.058725       0.062389
   Can3           1.43395       1.32518     0.157417       0.186826
   Can4           1.28044       1.27550     0.021025       0.021477
   OVER-ALL       4.24499       1.50298     0.876324       7.085666

Pseudo F Statistic = 520.80

Approximate Expected Over-All R-Squared = 0.80391

Cubic Clustering Criterion = 5.179

Cluster Means

   Cluster            Can1            Can2            Can3            Can4
   -------------------------------------------------------------------------------
         1    -10.67516964      0.06706906      0.27068819      0.11164209
         2      8.12988211      0.52566663      0.51836499      0.14915404
         3      2.54528754     -0.59273569     -0.78905317     -0.26079612

Cluster Standard Deviations

   Cluster            Can1            Can2            Can3            Can4
   -------------------------------------------------------------------------------
         1     0.953761025     0.931943571     1.398456061     1.058217627
         2     1.799159552     2.743869556     1.270344142     1.370523175
         3     1.572366584     1.393565864     1.303411851     1.372050319
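The Statistics for Variables table can be reproduced from the two standard deviation columns. The Python sketch below is an independent check, not SAS code. It assumes the usual definitions, which match the printed values: R-square is 1 minus the ratio of within-cluster variance to total variance with the degrees-of-freedom factor (n-k)/(n-1), and the pseudo F statistic is [R-square/(k-1)] / [(1-R-square)/(n-k)], with n=150 observations and k=3 clusters. The function and variable names are illustrative.

```python
# Independent check of the FASTCLUS "Statistics for Variables" table.
N_OBS = 150      # observations
N_CLUSTERS = 3   # clusters (MAXC=3)

# variable: (total STD, within STD, printed R-square, printed RSQ/(1-RSQ))
stats = {
    "Can1": (8.04808, 1.48537, 0.966394, 28.756658),
    "Can2": (1.90061, 1.85646, 0.058725, 0.062389),
    "Can3": (1.43395, 1.32518, 0.157417, 0.186826),
    "Can4": (1.28044, 1.27550, 0.021025, 0.021477),
    "OVER-ALL": (4.24499, 1.50298, 0.876324, 7.085666),
}


def r_square(total_std, within_std, n=N_OBS, k=N_CLUSTERS):
    """R-square from total and pooled within-cluster standard deviations."""
    return 1.0 - (n - k) / (n - 1) * (within_std / total_std) ** 2


def pseudo_f(rsq, n=N_OBS, k=N_CLUSTERS):
    """Pseudo F statistic computed from an overall R-square."""
    return (rsq / (k - 1)) / ((1.0 - rsq) / (n - k))


for name, (total_std, within_std, printed_rsq, printed_ratio) in stats.items():
    rsq = r_square(total_std, within_std)
    print(f"{name:9s} R-square={rsq:.6f}  RSQ/(1-RSQ)={rsq / (1.0 - rsq):.4f}")

overall_rsq = stats["OVER-ALL"][2]
print(f"Pseudo F = {pseudo_f(overall_rsq):.2f}")
```

Can1 has an R-square near 0.97 while the other canonical variables are all below 0.16, so nearly all of the separation between clusters lies along the first canonical variable.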
The FREQ Procedure

Table of CLUSTER by Species

   CLUSTER(Cluster)     Species

   Frequency|
   Percent  |
   Row Pct  |
   Col Pct  |Setosa    |Versicolor|Virginica |  Total
   ---------+----------+----------+----------+
          1 |       50 |        0 |        0 |     50
            |    33.33 |     0.00 |     0.00 |  33.33
            |   100.00 |     0.00 |     0.00 |
            |   100.00 |     0.00 |     0.00 |
   ---------+----------+----------+----------+
          2 |        0 |        2 |       48 |     50
            |     0.00 |     1.33 |    32.00 |  33.33
            |     0.00 |     4.00 |    96.00 |
            |     0.00 |     4.00 |    96.00 |
   ---------+----------+----------+----------+
          3 |        0 |       48 |        2 |     50
            |     0.00 |    32.00 |     1.33 |  33.33
            |     0.00 |    96.00 |     4.00 |
            |     0.00 |    96.00 |     4.00 |
   ---------+----------+----------+----------+
   Total            50         50         50      150
                 33.33      33.33      33.33   100.00
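The number of misclassifications can be read from this crosstabulation by treating the most frequent species in each cluster as that cluster's label and counting the remaining specimens. The Python sketch below does that arithmetic; it is an illustration outside the SAS session, the names crosstab and misclassified are hypothetical, and the majority-label rule is a simplifying assumption that works here because each cluster is dominated by a different species.

```python
# Count misclassifications from the CLUSTER by Species frequencies.
species = ["Setosa", "Versicolor", "Virginica"]

# Rows are clusters; columns follow the order of `species`.
crosstab = {
    1: [50, 0, 0],
    2: [0, 2, 48],
    3: [0, 48, 2],
}


def misclassified(table):
    """Count observations that fall outside each cluster's majority species."""
    return sum(sum(row) - max(row) for row in table.values())


# Label each cluster with its most frequent species.
labels = {
    cluster: species[row.index(max(row))] for cluster, row in crosstab.items()
}

total = sum(sum(row) for row in crosstab.values())
errors = misclassified(crosstab)
print(labels)
print(f"{errors} of {total} specimens misclassified ({100 * errors / total:.2f}%)")
```

With these counts, cluster 1 contains only Setosa, and the errors are the two Versicolor specimens in cluster 2 and the two Virginica specimens in cluster 3.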