The data in this example are measurements on 159 fish caught in Finland's Lake Laengelmavesi. The species, weight, three different length measurements, height, and width of each fish are tallied. The complete data set is displayed in Chapter 67, 'The STEPDISC Procedure.' The STEPDISC procedure identified all the variables as significant indicators of the differences among the seven fish species.
proc format;
   value specfmt
      1='Bream'
      2='Roach'
      3='Whitefish'
      4='Parkki'
      5='Perch'
      6='Pike'
      7='Smelt';

data fish (drop=HtPct WidthPct);
   title 'Fish Measurement Data';
   input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;
   Height=HtPct*Length3/100;
   Width=WidthPct*Length3/100;
   format Species specfmt.;
   symbol = put(Species, specfmt2.);
   datalines;
1 242.0 23.2 25.4 30.0 38.4 13.4
1 290.0 24.0 26.3 31.2 40.0 13.8
1 340.0 23.9 26.5 31.1 39.8 15.1
1 363.0 26.3 29.0 33.5 38.0 13.3
...[155 more records]
;
The following program uses PROC CANDISC to find the three canonical variables that best separate the species of fish in the fish data and creates the output data set outcan. The NCAN= option requests that only the first three canonical variables be displayed. The %PLOTIT macro is invoked to create a plot of the first two canonical variables. See Appendix B, 'Using the %PLOTIT Macro,' for more information on the %PLOTIT macro.
proc candisc data=fish ncan=3 out=outcan;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

%plotit(data=outcan, plotvars=Can2 Can1,
        labelvar=_blank_, symvar=symbol, typevar=symbol,
        symsize=1, symlen=4, tsize=1.5, exttypes=symbol, ls=100,
        plotopts=vaxis=-5 to 15 by 5, vtoh=, extend=close);
PROC CANDISC begins by displaying summary information about the variables in the analysis. This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement), and the number of classes in the classification variable (specified with the CLASS statement). The frequency of each class is also displayed.
Fish Measurement Data

The CANDISC Procedure

Observations    158    DF Total              157
Variables         6    DF Within Classes     151
Classes           7    DF Between Classes      6

Class Level Information

             Variable
Species      Name         Frequency     Weight    Proportion
Bream        Bream               34    34.0000      0.215190
Parkki       Parkki              11    11.0000      0.069620
Perch        Perch               56    56.0000      0.354430
Pike         Pike                17    17.0000      0.107595
Roach        Roach               20    20.0000      0.126582
Smelt        Smelt               14    14.0000      0.088608
Whitefish    Whitefish            6     6.0000      0.037975
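The class frequencies and proportions in this table can be cross-checked outside PROC CANDISC. The following is a minimal sketch, assuming the fish data set created earlier. It also assumes that the gap between the 159 fish measured and the 158 observations analyzed comes from a record with a missing value in one of the analysis variables. The WHERE statement is an illustration and is not part of the original example.

```sas
/* Tabulate species counts for records with no missing analysis variables. */
/* ORDER=FORMATTED lists the species alphabetically by formatted value.    */
proc freq data=fish order=formatted;
   where nmiss(Weight, Length1, Length2, Length3, Height, Width) = 0;
   tables Species;
run;
```

The Percent column of the PROC FREQ output, divided by 100, corresponds to the Proportion column shown by PROC CANDISC.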
PROC CANDISC performs a multivariate one-way analysis of variance (one-way MANOVA) and provides four multivariate tests of the hypothesis that the class mean vectors are equal. These tests, shown in Figure 21.2, indicate that not all of the mean vectors are equal (p < .0001).
Fish Measurement Data

The CANDISC Procedure

Multivariate Statistics and F Approximations

S=6    M=-0.5    N=72

Statistic                       Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda              0.00036325      90.71        36    643.89    <.0001
Pillai's Trace             3.10465132      26.99        36       906    <.0001
Hotelling-Lawley Trace    52.05799676     209.24        36    413.64    <.0001
Roy's Greatest Root       39.13499776     984.90         6       151    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
The first canonical correlation is the greatest possible multiple correlation with the classes that can be achieved using a linear combination of the quantitative variables. The first canonical correlation, displayed in Figure 21.3, is 0.987463.
Fish Measurement Data

The CANDISC Procedure

                      Adjusted       Approximate    Squared
       Canonical      Canonical      Standard       Canonical
       Correlation    Correlation    Error          Correlation
1      0.987463       0.986671       0.001989       0.975084
2      0.952349       0.950095       0.007425       0.906969
3      0.838637       0.832518       0.023678       0.703313
4      0.633094       0.623649       0.047821       0.400809
5      0.344157       0.334170       0.070356       0.118444
6      0.005701       .              0.079806       0.000033
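The squared canonical correlations in this table also determine the four multivariate statistics reported earlier. The following DATA step is a sketch of that relationship, not output from the original example. It uses the standard identities (Wilks' Lambda as the product of 1 - r**2, Pillai's Trace as the sum of r**2, Hotelling-Lawley Trace as the sum of r**2/(1 - r**2), and Roy's Greatest Root as the largest of those ratios). The variable names are illustrative, and the inputs are the six-decimal values copied from the table, so the results agree with the printed statistics only up to rounding.

```sas
/* Rebuild the multivariate statistics from the squared canonical */
/* correlations (values copied from the table above).             */
data _null_;
   array rsq[6] _temporary_
      (0.975084 0.906969 0.703313 0.400809 0.118444 0.000033);
   wilks     = 1;
   pillai    = 0;
   hotelling = 0;
   do i = 1 to dim(rsq);
      wilks     = wilks * (1 - rsq[i]);               /* product of (1 - r**2)      */
      pillai    = pillai + rsq[i];                    /* sum of r**2                */
      hotelling = hotelling + rsq[i] / (1 - rsq[i]);  /* sum of r**2 / (1 - r**2)   */
   end;
   roy = rsq[1] / (1 - rsq[1]);                       /* ratio for the largest r**2 */
   put wilks= pillai= hotelling= roy=;                /* write results to the log   */
run;
```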
PROC CANDISC also displays a likelihood ratio test of the hypothesis that the current canonical correlation and all smaller ones are zero. The first line is equivalent to the Wilks' Lambda multivariate test.
Test of H0: The canonical correlations in the current row and all that follow are zero

       Likelihood     Approximate
       Ratio          F Value        Num DF    Den DF    Pr > F
1      0.00036325       90.71            36    643.89    <.0001
2      0.01457896       46.46            25    547.58    <.0001
3      0.15671134       23.61            16    452.79    <.0001
4      0.52820347       12.09             9    362.78    <.0001
5      0.88152702        4.88             4       300    0.0008
6      0.99996749        0.00             1       151    0.9442
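Each likelihood ratio in this table is the product of 1 - r**2 over the canonical correlation in the current row and all that follow, which is why the first row matches Wilks' Lambda. The sketch below illustrates this with the rounded squared canonical correlations from the earlier table. The data set name lrtest and the variable names are hypothetical, and agreement with the printed column is only up to rounding.

```sas
/* For each row, multiply (1 - r**2) over that row and all later rows. */
data lrtest;
   array rsq[6] _temporary_
      (0.975084 0.906969 0.703313 0.400809 0.118444 0.000033);
   do row = 1 to dim(rsq);
      lr = 1;
      do j = row to dim(rsq);
         lr = lr * (1 - rsq[j]);
      end;
      output;               /* one observation per row of the test table */
   end;
   keep row lr;
run;

proc print data=lrtest noobs;
   format lr 10.8;
run;
```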
The first canonical variable, Can1, shows that the linear combination of the centered variables

   Can1 = −0.0006 × Weight − 0.33 × Length1 − 2.49 × Length2 + 2.60 × Length3 + 1.12 × Height − 1.45 × Width

separates the species most effectively (see Figure 21.5).
Fish Measurement Data

The CANDISC Procedure

Raw Canonical Coefficients

Variable            Can1           Can2           Can3
Weight      -0.000648508    0.005231659    0.005596192
Length1     -0.329435762    0.626598051    2.934324102
Length2     -2.486133674    0.690253987    4.045038893
Length3      2.595648437    1.803175454    1.139264914
Height       1.121983854    0.714749340    0.283202557
Width       -1.446386704    0.907025481    0.741486686
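To see how the raw coefficients produce scores, Can1 can be recomputed by hand. The following is a minimal sketch under stated assumptions: the variables are centered at their overall means, only records with no missing analysis variables are used, and the coefficient signs follow the Can1 equation given in the text. The data set names centered and can1_check and the variable Can1_hand are hypothetical.

```sas
/* Center each analysis variable at its overall mean (METHOD=MEAN        */
/* subtracts the mean and leaves the scale unchanged), using only        */
/* records with no missing analysis variables (an assumption).           */
proc stdize data=fish method=mean out=centered;
   where nmiss(Weight, Length1, Length2, Length3, Height, Width) = 0;
   var Weight Length1 Length2 Length3 Height Width;
run;

/* Apply the raw Can1 coefficients, signed as in the equation above. */
data can1_check;
   set centered;
   Can1_hand = - 0.000648508*Weight
               - 0.329435762*Length1
               - 2.486133674*Length2
               + 2.595648437*Length3
               + 1.121983854*Height
               - 1.446386704*Width;
run;

proc print data=can1_check(obs=5);
   var Species Can1_hand;
run;
```

If the assumptions hold, Can1_hand should closely track the Can1 variable that PROC CANDISC writes to the outcan data set.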
PROC CANDISC computes the means of the canonical variables for each class. The first canonical variable is the linear combination of the variables Weight, Length1, Length2, Length3, Height, and Width that provides the greatest difference (in terms of a univariate F-test) between the class means. The second canonical variable provides the greatest difference between class means while being uncorrelated with the first canonical variable.
Fish Measurement Data

The CANDISC Procedure

Class Means on Canonical Variables

Species             Can1           Can2           Can3
Bream        10.94142464     0.52078394     0.23496708
Parkki        2.58903743     2.54722416     0.49326158
Perch         4.47181389     1.70822715     1.29281314
Pike          4.89689441     8.22140791     0.16469132
Roach         0.35837149     0.08733611     1.10056438
Smelt         4.09136653     2.35805841     4.03836098
Whitefish     0.39541755     0.42071778     1.06459242
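The class means can also be recomputed from the scores in the output data set. This sketch assumes the outcan data set created by the OUT= option earlier, which holds the canonical variable scores alongside the original variables. It is a cross-check and not part of the original example.

```sas
/* Average the canonical variable scores within each species. */
proc means data=outcan mean maxdec=4;
   class Species;
   var Can1 Can2 Can3;
run;
```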
A plot of the first two canonical variables (Figure 21.7) shows that Can1 discriminates between three groups: 1) bream; 2) whitefish, roach, and parkki; and 3) smelt, pike, and perch. Can2 best discriminates between pike and the other species.
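Where the %PLOTIT macro is not available, a similar scatter plot can be sketched with PROC SGPLOT. This is a substitute technique and not the macro used in the original example. It assumes the outcan data set from the PROC CANDISC step and reuses the vertical axis range given in the %PLOTIT call.

```sas
/* Plot Can2 against Can1, with one marker style per species. */
proc sgplot data=outcan;
   scatter x=Can1 y=Can2 / group=Species;
   yaxis values=(-5 to 15 by 5);
run;
```

Grouping by Species produces a legend in place of the two-letter plotting symbols that %PLOTIT draws from the symbol variable.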