Example


Example 16.1. Transformation and Cluster Analysis of Fisher Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

In this example, PROC ACECLUS transforms the data and PROC FASTCLUS performs the clustering. Compare the results with the example in Chapter 28, "The FASTCLUS Procedure." The crosstabulation from the FREQ procedure shows fewer misclassifications when PROC ACECLUS is used: here, 4 of the 150 specimens fall in a cluster dominated by another species. The following statements produce Output 16.1.1 through Output 16.1.5.

   /* Display format for the numeric species codes */
   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   /* Read four measurements and a species code per specimen; */
   /* the trailing @@ allows several specimens on each data line */
   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

   /* Transform the data; P= is the proportion of pairs used to */
   /* estimate the within-cluster covariance matrix */
   proc aceclus data=iris out=ace p=.02 outstat=score;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;

   legend1 frame cframe=white cborder=black
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;

   /* Plot the first two canonical variables, identified by species */
   proc gplot data=ace;
      plot can2*can1=Species /
           frame cframe=white legend=legend1 vaxis=axis1 haxis=axis2;
      format Species specname. ;
   run;
   quit;

   /* Cluster on all canonical variables (names beginning with "can") */
   proc fastclus data=ace maxc=3 maxiter=10 conv=0 out=clus;
      var can:;
   run;

   /* No DATA= option: uses the most recently created data set, clus */
   proc freq;
      tables cluster*Species;
   run;

Output 16.1.1: Using PROC ACECLUS to Transform Fisher's Iris Data
                     Fisher (1936) Iris Data

                      The ACECLUS Procedure

       Approximate Covariance Estimation for Cluster Analysis

          Observations    150      Proportion     0.0200
          Variables         4      Converge      0.00100

                  Means and Standard Deviations

                                Standard
   Variable          Mean      Deviation    Label
   SepalLength    58.4333         8.2807    Sepal Length in mm.
   SepalWidth     30.5733         4.3587    Sepal Width in mm.
   PetalLength    37.5800        17.6530    Petal Length in mm.
   PetalWidth     11.9933         7.6224    Petal Width in mm.

                  COV: Total Sample Covariances

                  SepalLength     SepalWidth    PetalLength     PetalWidth
   SepalLength     68.5693512     -4.2434004    127.4315436     51.6270694
   SepalWidth      -4.2434004     18.9979418    -32.9656376    -12.1639374
   PetalLength    127.4315436    -32.9656376    311.6277852    129.5609396
   PetalWidth      51.6270694    -12.1639374    129.5609396     58.1006264

   Initial Within-Cluster Covariance Estimate = Full Covariance Matrix

   Threshold =    0.334211

                        Iteration History

                                             Pairs
                     RMS     Distance       Within    Convergence
   Iteration    Distance       Cutoff       Cutoff        Measure
   ------------------------------------------------------------
           1       2.828        0.945        408.0       0.465775
           2      11.905        3.979        559.0       0.013487
           3      13.152        4.396        940.0       0.029499
           4      13.439        4.491       1506.0       0.046846
           5      13.271        4.435       2036.0       0.046859
           6      12.591        4.208       2285.0       0.025027
           7      12.199        4.077       2366.0       0.009559
           8      12.121        4.051       2402.0       0.003895
           9      12.064        4.032       2417.0       0.002051
          10      12.047        4.026       2429.0       0.000971

   Algorithm converged.
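The listing in Output 16.1.1 can be cross-checked with a little arithmetic. The reported standard deviations are the square roots of the diagonal of the COV matrix, and in each iteration the distance cutoff appears to equal the RMS distance multiplied by the threshold (0.334211). The following Python sketch is not part of the SAS example; the variable names are illustrative, and all numbers are copied from the listing.

```python
import math

# Values copied from Output 16.1.1; names are illustrative only.
threshold = 0.334211

# (RMS distance, distance cutoff) for iterations 1 through 10
iteration_history = [
    (2.828, 0.945), (11.905, 3.979), (13.152, 4.396), (13.439, 4.491),
    (13.271, 4.435), (12.591, 4.208), (12.199, 4.077), (12.121, 4.051),
    (12.064, 4.032), (12.047, 4.026),
]

# Ratio of cutoff to RMS distance should stay close to the threshold.
cutoff_ratios = [cutoff / rms for rms, cutoff in iteration_history]
max_ratio_error = max(abs(r - threshold) for r in cutoff_ratios)

# Diagonal of the COV matrix and the reported standard deviations
variances = {
    "SepalLength": 68.5693512, "SepalWidth": 18.9979418,
    "PetalLength": 311.6277852, "PetalWidth": 58.1006264,
}
reported_std = {
    "SepalLength": 8.2807, "SepalWidth": 4.3587,
    "PetalLength": 17.6530, "PetalWidth": 7.6224,
}
std_from_cov = {name: math.sqrt(v) for name, v in variances.items()}
max_std_error = max(abs(std_from_cov[n] - reported_std[n]) for n in variances)

print(f"largest |cutoff/RMS - threshold|: {max_ratio_error:.6f}")
print(f"largest |sqrt(variance) - std|:   {max_std_error:.6f}")
```

Both discrepancies are at the level of the rounding in the printed listing.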
 
Output 16.1.2: Eigenvalues, Raw Canonical Coefficients, and Standardized Canonical Coefficients
           ACE: Approximate Covariance Estimate Within Clusters

                  SepalLength     SepalWidth    PetalLength     PetalWidth
   SepalLength    11.73342939     5.47550432     4.95389049     2.02902429
   SepalWidth      5.47550432     6.91992590     2.42177851     1.74125154
   PetalLength     4.95389049     2.42177851     6.53746398     2.35302594
   PetalWidth      2.02902429     1.74125154     2.35302594     2.05166735

                 Eigenvalues of Inv(ACE)*(COV-ACE)

        Eigenvalue    Difference    Proportion    Cumulative
   1       63.7716       61.1593        0.9367        0.9367
   2        2.6123        1.5561        0.0384        0.9751
   3        1.0562        0.4167        0.0155        0.9906
   4        0.6395                     0.00939        1.0000

               Eigenvectors (Raw Canonical Coefficients)

                                           Can1        Can2        Can3        Can4
   SepalLength   Sepal Length in mm.   -.012009    -.098074    -.059852    0.402352
   SepalWidth    Sepal Width in mm.    -.211068    -.000072    0.402391    -.225993
   PetalLength   Petal Length in mm.   0.324705    -.328583    0.110383    -.321069
   PetalWidth    Petal Width in mm.    0.266239    0.870434    -.085215    0.320286

                  Standardized Canonical Coefficients

                                           Can1        Can2        Can3        Can4
   SepalLength   Sepal Length in mm.   -0.09944    -0.81211    -0.49562     3.33174
   SepalWidth    Sepal Width in mm.    -0.91998    -0.00031     1.75389    -0.98503
   PetalLength   Petal Length in mm.    5.73200    -5.80047     1.94859    -5.66782
   PetalWidth    Petal Width in mm.     2.02937     6.63478    -0.64954     2.44134
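Two relationships in Output 16.1.2 can be verified from the printed numbers. Each Proportion is the eigenvalue divided by the sum of the four eigenvalues, and each standardized canonical coefficient appears to be the raw coefficient multiplied by the total-sample standard deviation of its variable from Output 16.1.1. The Python sketch below is an illustration only, with made-up variable names; it uses just a few coefficients whose sign is unambiguous in the listing.

```python
# Values copied from Output 16.1.1 and Output 16.1.2; names are illustrative.
eigenvalues = [63.7716, 2.6123, 1.0562, 0.6395]

# Proportion = eigenvalue / sum of eigenvalues; Cumulative = running total.
eigen_total = sum(eigenvalues)
proportions = [ev / eigen_total for ev in eigenvalues]
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

std_dev = {"SepalLength": 8.2807, "PetalLength": 17.6530, "PetalWidth": 7.6224}

# (variable, canonical variable) -> (raw coefficient, reported standardized)
coefficients = {
    ("PetalLength", "Can1"): (0.324705, 5.73200),
    ("PetalWidth", "Can1"): (0.266239, 2.02937),
    ("SepalLength", "Can4"): (0.402352, 3.33174),
    ("PetalWidth", "Can4"): (0.320286, 2.44134),
}

# Assumed relationship: standardized = raw * total-sample standard deviation.
standardized = {
    key: raw * std_dev[key[0]] for key, (raw, _) in coefficients.items()
}
max_coef_error = max(
    abs(standardized[key] - reported)
    for key, (_, reported) in coefficients.items()
)

print([round(p, 4) for p in proportions])
print([round(c, 4) for c in cumulative])
print(f"largest |raw * std - standardized|: {max_coef_error:.6f}")
```

The first canonical variable accounts for about 94% of the eigenvalue total, which is why the clusters separate mainly along Can1.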
 
Output 16.1.3: Plot of Transformed Iris Data: PROC GPLOT
   [Scatter plot of Can2 against Can1, with points identified by Species]
 
Output 16.1.4: Clustering of Transformed Iris Data: Partial Output from PROC FASTCLUS
                          The FASTCLUS Procedure

        Replace=FULL  Radius=0  Maxclusters=3 Maxiter=10 Converge=0

                             Cluster Summary

                                        Maximum Distance
                           RMS Std             from Seed      Radius    Nearest
   Cluster   Frequency   Deviation        to Observation    Exceeded    Cluster
   -----------------------------------------------------------------------------
         1          50      1.1016                5.2768                      3
         2          50      1.8880                6.8298                      3
         3          50      1.4138                5.3152                      2

                             Cluster Summary

                Distance Between
   Cluster     Cluster Centroids
   -----------------------------
         1               13.2845
         2                5.8580
         3                5.8580

                        Statistics for Variables

   Variable     Total STD    Within STD      R-Square     RSQ/(1-RSQ)
   ------------------------------------------------------------------
   Can1           8.04808       1.48537      0.966394       28.756658
   Can2           1.90061       1.85646      0.058725        0.062389
   Can3           1.43395       1.32518      0.157417        0.186826
   Can4           1.28044       1.27550      0.021025        0.021477
   OVER-ALL       4.24499       1.50298      0.876324        7.085666

   Pseudo F Statistic =   520.80

   Approximate Expected Over-All R-Squared =   0.80391

   Cubic Clustering Criterion =    5.179

                               Cluster Means

   Cluster            Can1            Can2            Can3            Can4
   -----------------------------------------------------------------------
         1    -10.67516964      0.06706906      0.27068819      0.11164209
         2      8.12988211      0.52566663      0.51836499      0.14915404
         3      2.54528754     -0.59273569     -0.78905317     -0.26079612

                        Cluster Standard Deviations

   Cluster            Can1            Can2            Can3            Can4
   -----------------------------------------------------------------------
         1     0.953761025     0.931943571     1.398456061     1.058217627
         2     1.799159552     2.743869556     1.270344142     1.370523175
         3     1.572366584     1.393565864     1.303411851     1.372050319
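The R-Square column and the pseudo F statistic in Output 16.1.4 can be reproduced from the Total STD and Within STD columns. The sketch below assumes that the total standard deviation uses n - 1 degrees of freedom and the pooled within-cluster standard deviation uses n - c, where n = 150 observations and c = 3 clusters; under that assumption the printed values are recovered to rounding accuracy. It is written in Python for illustration and is not part of the SAS example; the function and variable names are hypothetical.

```python
# Values copied from Output 16.1.4; names are illustrative only.
n_obs, n_clusters = 150, 3

total_std = {"Can1": 8.04808, "Can2": 1.90061, "Can3": 1.43395,
             "Can4": 1.28044, "OVER-ALL": 4.24499}
within_std = {"Can1": 1.48537, "Can2": 1.85646, "Can3": 1.32518,
              "Can4": 1.27550, "OVER-ALL": 1.50298}


def r_square(total, within):
    """R-square = 1 - (within-cluster SS) / (total SS).

    Assumes n - c degrees of freedom for the within STD
    and n - 1 for the total STD.
    """
    within_ss = (n_obs - n_clusters) * within ** 2
    total_ss = (n_obs - 1) * total ** 2
    return 1.0 - within_ss / total_ss


r2 = {name: r_square(total_std[name], within_std[name]) for name in total_std}
rsq_ratio = {name: value / (1.0 - value) for name, value in r2.items()}

# Pseudo F = [R2 / (c - 1)] / [(1 - R2) / (n - c)], using the over-all R-square.
pseudo_f = rsq_ratio["OVER-ALL"] * (n_obs - n_clusters) / (n_clusters - 1)

for name in total_std:
    print(f"{name:9s} R-Square={r2[name]:.6f}  RSQ/(1-RSQ)={rsq_ratio[name]:.4f}")
print(f"Pseudo F = {pseudo_f:.2f}")
```

Nearly all of the separation comes from Can1, whose R-square is about 0.97, while Can2 through Can4 contribute little.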
 
Output 16.1.5: Crosstabulation of Cluster by Species for Fisher's Iris Data: PROC FREQ
                  The FREQ Procedure

              Table of CLUSTER by Species

   CLUSTER(Cluster)     Species

   Frequency|
   Percent  |
   Row Pct  |
   Col Pct  |Setosa  |Versicol|Virginic|  Total
            |        |or      |a       |
   ---------+--------+--------+--------+
          1 |     50 |      0 |      0 |     50
            |  33.33 |   0.00 |   0.00 |  33.33
            | 100.00 |   0.00 |   0.00 |
            | 100.00 |   0.00 |   0.00 |
   ---------+--------+--------+--------+
          2 |      0 |      2 |     48 |     50
            |   0.00 |   1.33 |  32.00 |  33.33
            |   0.00 |   4.00 |  96.00 |
            |   0.00 |   4.00 |  96.00 |
   ---------+--------+--------+--------+
          3 |      0 |     48 |      2 |     50
            |   0.00 |  32.00 |   1.33 |  33.33
            |   0.00 |  96.00 |   4.00 |
            |   0.00 |  96.00 |   4.00 |
   ---------+--------+--------+--------+
   Total          50       50       50      150
               33.33    33.33    33.33   100.00
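The misclassification count can be read directly from the crosstabulation in Output 16.1.5. As a minimal sketch, the Python code below labels each cluster with its most frequent species and counts the specimens that belong to any other species. The majority-label rule and the variable names are assumptions made for this illustration; the frequencies are copied from the table.

```python
# Frequencies copied from Output 16.1.5: cluster -> species -> count.
crosstab = {
    1: {"Setosa": 50, "Versicolor": 0, "Virginica": 0},
    2: {"Setosa": 0, "Versicolor": 2, "Virginica": 48},
    3: {"Setosa": 0, "Versicolor": 48, "Virginica": 2},
}

# Label each cluster with its most frequent species (illustrative rule).
majority = {
    cluster: max(counts, key=counts.get) for cluster, counts in crosstab.items()
}

# Specimens outside the majority species of their cluster are misclassified.
misclassified = sum(
    sum(counts.values()) - counts[majority[cluster]]
    for cluster, counts in crosstab.items()
)
n_total = sum(sum(counts.values()) for counts in crosstab.values())
error_rate = misclassified / n_total

print(majority)
print(f"{misclassified} of {n_total} misclassified ({error_rate:.2%})")
```

With these frequencies, clusters 1, 2, and 3 correspond to Setosa, Virginica, and Versicolor, and 4 of the 150 specimens are misclassified.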
 



SAS/STAT 9.1 User's Guide (Vol. 1)
SAS/STAT 9.1 User's Guide, Volumes 1-7
ISBN: 1590472438
Year: 2004
Pages: 156
