Example


Example 21.1. Analysis of Iris Data Using PROC CANDISC

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica.

This example performs a canonical discriminant analysis, creates an output data set containing scores on the canonical variables, and plots the canonical variables. The following statements produce Output 21.1.1 through Output 21.1.7:

  proc format;
     value specname
        1='Setosa    '
        2='Versicolor'
        3='Virginica ';
  run;

  data iris;
     title 'Fisher (1936) Iris Data';
     input SepalLength SepalWidth PetalLength PetalWidth
           Species @@;
     format Species specname.;
     label SepalLength='Sepal Length in mm.'
           SepalWidth ='Sepal Width in mm.'
           PetalLength='Petal Length in mm.'
           PetalWidth ='Petal Width in mm.';
     symbol = put(Species, specname10.);
     datalines;
  50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
  63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
  59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
  65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
  68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
  77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
  49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
  64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
  55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
  49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
  67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
  77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
  50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
  61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
  61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
  51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
  51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
  46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
  50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
  57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
  71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
  49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
  49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
  66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
  44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
  47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
  74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
  56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
  49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
  56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
  51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
  54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
  61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
  68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
  45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
  55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
  51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
  63 33 60 25 3 53 37 15 02 1
  ;

  proc candisc data=iris out=outcan distance anova;
     class Species;
     var SepalLength SepalWidth PetalLength PetalWidth;
  run;

Output 21.1.1: Iris Data: Summary Information

  Fisher (1936) Iris Data
  The CANDISC Procedure

  Observations     150          DF Total               149
  Variables          4          DF Within Classes      147
  Classes            3          DF Between Classes       2

  Class Level Information

                Variable
  Species       Name          Frequency       Weight    Proportion
  Setosa        Setosa               50      50.0000      0.333333
  Versicolor    Versicolor           50      50.0000      0.333333
  Virginica     Virginica            50      50.0000      0.333333
 
Output 21.1.2: Iris Data: Squared Mahalanobis Distances

  Fisher (1936) Iris Data
  The CANDISC Procedure

  Pairwise Squared Distances Between Groups

  D^2(i|j) = (\bar{X}_i - \bar{X}_j)' COV^{-1} (\bar{X}_i - \bar{X}_j)

  Squared Distance to Species

  From
  Species           Setosa    Versicolor     Virginica
  Setosa                 0      89.86419     179.38471
  Versicolor      89.86419             0      17.20107
  Virginica      179.38471      17.20107             0
 
Output 21.1.3: Iris Data: Squared Mahalanobis Distance Statistics

  Fisher (1936) Iris Data
  The CANDISC Procedure

  F Statistics, NDF=4, DDF=144 for Squared Distance to Species

  From
  Species           Setosa    Versicolor     Virginica
  Setosa                 0     550.18889          1098
  Versicolor     550.18889             0     105.31265
  Virginica           1098     105.31265             0

  Prob > Mahalanobis Distance for Squared Distance to Species

  From
  Species           Setosa    Versicolor     Virginica
  Setosa            1.0000        <.0001        <.0001
  Versicolor        <.0001        1.0000        <.0001
  Virginica         <.0001        <.0001        1.0000
 
Output 21.1.4: Iris Data: Univariate and Multivariate Statistics

  Fisher (1936) Iris Data
  The CANDISC Procedure

  Univariate Test Statistics

  F Statistics,    Num DF=2,   Den DF=147

                                        Total     Pooled    Between
                                     Standard   Standard   Standard             R-Square
  Variable     Label                Deviation  Deviation  Deviation  R-Square  / (1-RSq)  F Value  Pr > F
  SepalLength  Sepal Length in mm.     8.2807     5.1479     7.9506    0.6187     1.6226   119.26  <.0001
  SepalWidth   Sepal Width in mm.      4.3587     3.3969     3.3682    0.4008     0.6688    49.16  <.0001
  PetalLength  Petal Length in mm.    17.6530     4.3033    20.9070    0.9414    16.0566  1180.16  <.0001
  PetalWidth   Petal Width in mm.      7.6224     2.0465     8.9673    0.9289    13.0613   960.01  <.0001

  Average R-Square

  Unweighted              0.7224358
  Weighted by Variance    0.8689444

  Multivariate Statistics and F Approximations

  S=2    M=0.5    N=71

  Statistic                        Value    F Value    Num DF    Den DF    Pr > F
  Wilks' Lambda               0.02343863     199.15         8       288    <.0001
  Pillai's Trace              1.19189883      53.47         8       290    <.0001
  Hotelling-Lawley Trace     32.47732024     582.20         8     203.4    <.0001
  Roy's Greatest Root        32.19192920    1166.96         4       145    <.0001

  NOTE: F Statistic for Roy's Greatest Root is an upper bound.
  NOTE: F Statistic for Wilks' Lambda is exact.
 
Output 21.1.5: Iris Data: Canonical Correlations and Eigenvalues

  Fisher (1936) Iris Data
  The CANDISC Procedure

                        Adjusted    Approximate        Squared
         Canonical     Canonical       Standard      Canonical
       Correlation   Correlation          Error    Correlation
  1       0.984821      0.984508       0.002468       0.969872
  2       0.471197      0.461445       0.063734       0.222027

  Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

       Eigenvalue  Difference  Proportion  Cumulative
  1       32.1919     31.9065      0.9912      0.9912
  2        0.2854                  0.0088      1.0000

  Test of H0: The canonical correlations in the current row and all that follow are zero

       Likelihood  Approximate
            Ratio      F Value  Num DF  Den DF  Pr > F
  1    0.02343863       199.15       8     288  <.0001
  2    0.77797337        13.79       3     145  <.0001
 
Output 21.1.6: Iris Data: Correlations Between Canonical and Original Variables

  Fisher (1936) Iris Data
  The CANDISC Procedure

  Total Canonical Structure

  Variable       Label                        Can1          Can2
  SepalLength    Sepal Length in mm.      0.791888      0.217593
  SepalWidth     Sepal Width in mm.      -0.530759      0.757989
  PetalLength    Petal Length in mm.      0.984951      0.046037
  PetalWidth     Petal Width in mm.       0.972812      0.222902

  Between Canonical Structure

  Variable       Label                        Can1          Can2
  SepalLength    Sepal Length in mm.      0.991468      0.130348
  SepalWidth     Sepal Width in mm.      -0.825658      0.564171
  PetalLength    Petal Length in mm.      0.999750      0.022358
  PetalWidth     Petal Width in mm.       0.994044      0.108977

  Pooled Within Canonical Structure

  Variable       Label                        Can1          Can2
  SepalLength    Sepal Length in mm.      0.222596      0.310812
  SepalWidth     Sepal Width in mm.      -0.119012      0.863681
  PetalLength    Petal Length in mm.      0.706065      0.167701
  PetalWidth     Petal Width in mm.       0.633178      0.737242
 
Output 21.1.7: Iris Data: Canonical Coefficients

  Fisher (1936) Iris Data
  The CANDISC Procedure

  Total-Sample Standardized Canonical Coefficients

  Variable       Label                          Can1            Can2
  SepalLength    Sepal Length in mm.    -0.686779533     0.019958173
  SepalWidth     Sepal Width in mm.     -0.668825075     0.943441829
  PetalLength    Petal Length in mm.     3.885795047    -1.645118866
  PetalWidth     Petal Width in mm.      2.142238715     2.164135931

  Pooled Within-Class Standardized Canonical Coefficients

  Variable       Label                          Can1            Can2
  SepalLength    Sepal Length in mm.    -.4269548486    0.0124075316
  SepalWidth     Sepal Width in mm.     -.5212416758    0.7352613085
  PetalLength    Petal Length in mm.    0.9472572487    -.4010378190
  PetalWidth     Petal Width in mm.     0.5751607719    0.5810398645
 

PROC CANDISC first displays information about the observations and the classes in the data set in Output 21.1.1.

The DISTANCE option in the PROC CANDISC statement displays squared Mahalanobis distances between class means. Results from the DISTANCE option are shown in Output 21.1.2 and Output 21.1.3.
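The F statistics in Output 21.1.3 can be recovered from the squared distances in Output 21.1.2. The following Python sketch is an illustration, not SAS code. It assumes the usual two-group conversion F = (DDF / (p (N - k))) (n_i n_j / (n_i + n_j)) D^2, where N = 150 observations, k = 3 classes, p = 4 variables, and DDF = N - k - p + 1. The function name distance_f_statistic is hypothetical.

```python
def distance_f_statistic(d2, n_i, n_j, n_total, n_classes, n_vars):
    """Return (F, NDF, DDF) for a squared Mahalanobis distance between two class means.

    Assumes the distance is based on the pooled within-class covariance
    matrix, which has n_total - n_classes degrees of freedom.
    """
    ndf = n_vars
    ddf = n_total - n_classes - n_vars + 1
    sample_size_factor = n_i * n_j / (n_i + n_j)
    f_value = ddf / (n_vars * (n_total - n_classes)) * sample_size_factor * d2
    return f_value, ndf, ddf


# Squared distances copied from Output 21.1.2.
squared_distances = {
    ("Setosa", "Versicolor"): 89.86419,
    ("Setosa", "Virginica"): 179.38471,
    ("Versicolor", "Virginica"): 17.20107,
}

f_statistics = {}
for pair, d2 in squared_distances.items():
    # Each species contributes 50 of the 150 observations.
    f_value, ndf, ddf = distance_f_statistic(d2, 50, 50, 150, 3, 4)
    f_statistics[pair] = f_value
    print(f"{pair[0]} vs {pair[1]}: F = {f_value:.2f} (NDF={ndf}, DDF={ddf})")
```

Under these assumptions the computed values agree with the displayed F statistics to within the rounding of the printed distances, and the degrees of freedom match NDF=4 and DDF=144.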

The ANOVA option specifies testing of the hypothesis that the class means are equal using univariate statistics. The resulting R² values (see Output 21.1.4) range from 0.4008 for SepalWidth to 0.9414 for PetalLength, and each variable is significant at the 0.0001 level. The multivariate test for differences between the classes (which is displayed by default) is also significant at the 0.0001 level; you would expect this from the highly significant univariate test results.
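The univariate F values in Output 21.1.4 follow from the column labeled R-Square / (1-RSq). The following Python sketch is an illustration, not SAS code. It assumes the standard one-way ANOVA relation F = (R² / (1 - R²)) (Den DF / Num DF), with Num DF = 2 and Den DF = 147 as printed in the output. The names used are hypothetical.

```python
NUM_DF = 2    # number of classes minus 1
DEN_DF = 147  # number of observations minus number of classes

# (R-Square, R-Square / (1-RSq)) pairs copied from Output 21.1.4.
univariate = {
    "SepalLength": (0.6187, 1.6226),
    "SepalWidth": (0.4008, 0.6688),
    "PetalLength": (0.9414, 16.0566),
    "PetalWidth": (0.9289, 13.0613),
}


def f_from_ratio(ratio, num_df=NUM_DF, den_df=DEN_DF):
    """Convert R-Square / (1 - R-Square) into the univariate F statistic."""
    return ratio * den_df / num_df


f_values = {name: f_from_ratio(ratio) for name, (_, ratio) in univariate.items()}

for name, (r_square, ratio) in univariate.items():
    # The ratio recomputed from the rounded R-Square differs slightly from the
    # printed ratio because R-Square is shown to only four decimal places.
    recomputed = r_square / (1 - r_square)
    print(f"{name}: ratio from R-Square = {recomputed:.4f}, "
          f"printed ratio = {ratio:.4f}, F = {f_values[name]:.2f}")
```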

As Output 21.1.5 shows, the R² between Can1 and the class variable, 0.969872, is much larger than the corresponding R² for Can2, 0.222027.
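The eigenvalues in Output 21.1.5 and the multivariate statistics in Output 21.1.4 can all be derived from the two squared canonical correlations. The following Python sketch is an illustration, not SAS code. It uses the relation eigenvalue = CanRsq/(1-CanRsq) printed in the output, together with the standard definitions of the four multivariate statistics in terms of the squared canonical correlations and eigenvalues. The variable names are hypothetical.

```python
import math

# Squared canonical correlations copied from Output 21.1.5.
can_rsq = [0.969872, 0.222027]

# Eigenvalues of Inv(E)*H, using the relation printed in the output.
eigenvalues = [r / (1 - r) for r in can_rsq]
proportions = [e / sum(eigenvalues) for e in eigenvalues]

# Multivariate statistics expressed through the same quantities.
wilks_lambda = math.prod(1 - r for r in can_rsq)
pillai_trace = sum(can_rsq)
hotelling_lawley_trace = sum(eigenvalues)
roys_greatest_root = max(eigenvalues)

print(f"Eigenvalues:            {eigenvalues[0]:.4f}  {eigenvalues[1]:.4f}")
print(f"Proportions:            {proportions[0]:.4f}  {proportions[1]:.4f}")
print(f"Wilks' Lambda:          {wilks_lambda:.6f}")
print(f"Pillai's Trace:         {pillai_trace:.6f}")
print(f"Hotelling-Lawley Trace: {hotelling_lawley_trace:.4f}")
print(f"Roy's Greatest Root:    {roys_greatest_root:.4f}")
```

Because the inputs are rounded to six decimal places, the results agree with the displayed values only to about three or four decimal places. The first eigenvalue accounts for roughly 99% of the total, which is another way of seeing that Can1 carries nearly all of the between-class separation.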

The raw canonical coefficients (shown in Output 21.1.7) for the first canonical variable, Can1, show that the classes differ most widely on the linear combination of the centered variables −0.0829378 × SepalLength − 0.153447 × SepalWidth + 0.220121 × PetalLength + 0.281046 × PetalWidth.

  Fisher (1936) Iris Data
  The CANDISC Procedure

  Raw Canonical Coefficients

  Variable       Label                          Can1            Can2
  SepalLength    Sepal Length in mm.    -.0829377642    0.0024102149
  SepalWidth     Sepal Width in mm.     -.1534473068    0.2164521235
  PetalLength    Petal Length in mm.    0.2201211656    -.0931921210
  PetalWidth     Petal Width in mm.     0.2810460309    0.2839187853

  Class Means on Canonical Variables

  Species                 Can1              Can2
  Setosa          -7.607599927       0.215133017
  Versicolor       1.825049490      -0.727899622
  Virginica        5.782550437       0.512766605
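The raw and standardized coefficients in Output 21.1.7 differ only by a scale factor per variable. The following Python sketch is an illustration, not SAS code. It assumes that each standardized coefficient equals the raw coefficient multiplied by the corresponding standard deviation from Output 21.1.4 (total-sample or pooled within-class), and checks this for Can1 with the signs given in the linear combination quoted in the text. The variable names are hypothetical.

```python
variables = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

# Raw Can1 coefficients from Output 21.1.7.
raw_can1 = [-0.0829377642, -0.1534473068, 0.2201211656, 0.2810460309]

# Standard deviations copied from Output 21.1.4.
total_sd = [8.2807, 4.3587, 17.6530, 7.6224]
pooled_sd = [5.1479, 3.3969, 4.3033, 2.0465]

# Rescale each raw coefficient by the standard deviation of its variable.
total_standardized = [c * s for c, s in zip(raw_can1, total_sd)]
pooled_standardized = [c * s for c, s in zip(raw_can1, pooled_sd)]

for name, total, pooled in zip(variables, total_standardized, pooled_standardized):
    print(f"{name:<12} total-sample = {total:9.5f}   pooled within-class = {pooled:9.5f}")
```

The products agree with the displayed standardized coefficients to about four decimal places, which is the precision of the printed standard deviations.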

The plot of canonical variables in Output 21.1.8 shows that, of the two canonical variables, Can1 has the greater discriminatory power. The following invocation of the %PLOTIT macro creates this plot:

  %plotit(data=outcan, plotvars=Can2 Can1,
          labelvar=_blank_, symvar=symbol, typevar=symbol,
          symsize=1, symlen=4, exttypes=symbol, ls=100,
          tsize=1.5, extend=close);

Output 21.1.8: Iris Data: Plot of First Two Canonical Variables
 



SAS/STAT 9.1 User's Guide (Vol. 1)
SAS/STAT 9.1 User's Guide, Volumes 1-7
ISBN: 1590472438
Year: 2004
Pages: 156
