The data in this example are measurements taken on 159 fish caught in Finland's Lake Laengelmavesi. The species, weight, three different length measurements, height, and width of each fish are recorded. The full data set is displayed in Chapter 67, "The STEPDISC Procedure." The STEPDISC procedure identifies all six measurement variables as significant indicators of the differences among the seven fish species. The goal now is to find a discriminant function based on these six variables that best classifies the fish into species.
First, assume that the data are normally distributed within each group with equal covariances across groups. The following program uses PROC DISCRIM to analyze the Fish data and create Figure 25.1 through Figure 25.5.
   proc format;
      value specfmt
         1='Bream'
         2='Roach'
         3='Whitefish'
         4='Parkki'
         5='Perch'
         6='Pike'
         7='Smelt';

   data fish (drop=HtPct WidthPct);
      title 'Fish Measurement Data';
      input Species Weight Length1 Length2 Length3 HtPct
            WidthPct @@;
      Height=HtPct*Length3/100;
      Width=WidthPct*Length3/100;
      format Species specfmt.;
      symbol = put(Species, specfmt.);
      datalines;
   1 242.0 23.2 25.4 30.0 38.4 13.4
   1 290.0 24.0 26.3 31.2 40.0 13.8
   1 340.0 23.9 26.5 31.1 39.8 15.1
   1 363.0 26.3 29.0 33.5 38.0 13.3
   ...[155 more records]
   ;

   proc discrim data=fish;
      class Species;
   run;
                         Fish Measurement Data

                         The DISCRIM Procedure

      Observations    158          DF Total               157
      Variables         6          DF Within Classes      151
      Classes           7          DF Between Classes       6

                        Class Level Information

             Variable                                           Prior
Species      Name         Frequency     Weight   Proportion   Probability
Bream        Bream               34    34.0000     0.215190      0.142857
Parkki       Parkki              11    11.0000     0.069620      0.142857
Perch        Perch               56    56.0000     0.354430      0.142857
Pike         Pike                17    17.0000     0.107595      0.142857
Roach        Roach               20    20.0000     0.126582      0.142857
Smelt        Smelt               14    14.0000     0.088608      0.142857
Whitefish    Whitefish            6     6.0000     0.037975      0.142857
                         Fish Measurement Data

                         The DISCRIM Procedure

                 Pooled Covariance Matrix Information

              Covariance      Natural Log of the Determinant
              Matrix Rank     of the Covariance Matrix
                        6                            4.17613
                         Fish Measurement Data

                         The DISCRIM Procedure

$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$

                 Generalized Squared Distance to Species

From
Species         Bream      Parkki       Perch        Pike       Roach       Smelt   Whitefish
Bream               0    83.32523   243.66688   310.52333   133.06721   252.75503   132.05820
Parkki       83.32523           0    57.09760   174.20918    27.00096    60.52076    26.54855
Perch       243.66688    57.09760           0   101.06791    29.21632    29.26806    20.43791
Pike        310.52333   174.20918   101.06791           0    92.40876   127.82177    99.90673
Roach       133.06721    27.00096    29.21632    92.40876           0    33.84280     6.31997
Smelt       252.75503    60.52076    29.26806   127.82177    33.84280           0    46.37326
Whitefish   132.05820    26.54855    20.43791    99.90673     6.31997    46.37326           0
                         Fish Measurement Data

                         The DISCRIM Procedure

                      Linear Discriminant Function

$\text{Constant} = -0.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j \qquad \text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$

                 Linear Discriminant Function for Species

Variable        Bream      Parkki       Perch        Pike       Roach       Smelt   Whitefish
Constant   -185.91682   -64.92517   -48.68009  -148.06402   -62.65963   -19.70401   -67.44603
Weight        0.10912     0.09031     0.09418     0.13805     0.09901     0.05778     0.09948
Length1      23.02273    13.64180    19.45368    20.92442    14.63635     4.09257    22.57117
Length2      26.70692     5.38195    17.33061     6.19887     7.47195     3.63996     3.83450
Length3      50.55780    20.89531     5.25993    22.94989    25.00702    10.60171    21.12638
Height       13.91638     8.44567     1.42833     8.99687     0.26083     1.84569     0.64957
Width        23.71895    13.38592     1.32749     9.13410     3.74542     3.43630     2.52442
                         Fish Measurement Data

                         The DISCRIM Procedure

          Classification Summary for Calibration Data: WORK.FISH
        Resubstitution Summary using Linear Discriminant Function

$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j)$

            Posterior Probability of Membership in Each Species

$\Pr(j \mid X) = \exp\!\left(-0.5 \, D_j^2(X)\right) \Big/ \sum_k \exp\!\left(-0.5 \, D_k^2(X)\right)$

        Number of Observations and Percent Classified into Species

From
Species       Bream   Parkki    Perch     Pike    Roach    Smelt  Whitefish    Total
Bream            34        0        0        0        0        0          0       34
             100.00     0.00     0.00     0.00     0.00     0.00       0.00   100.00
Parkki            0       11        0        0        0        0          0       11
               0.00   100.00     0.00     0.00     0.00     0.00       0.00   100.00
Perch             0        0       53        0        0        3          0       56
               0.00     0.00    94.64     0.00     0.00     5.36       0.00   100.00
Pike              0        0        0       17        0        0          0       17
               0.00     0.00     0.00   100.00     0.00     0.00       0.00   100.00
Roach             0        0        0        0       20        0          0       20
               0.00     0.00     0.00     0.00   100.00     0.00       0.00   100.00
Smelt             0        0        0        0        0       14          0       14
               0.00     0.00     0.00     0.00     0.00   100.00       0.00   100.00
Whitefish         0        0        0        0        0        0          6        6
               0.00     0.00     0.00     0.00     0.00     0.00     100.00   100.00
Total            34       11       53       17       20       17          6      158
              21.52     6.96    33.54    10.76    12.66    10.76       3.80   100.00
Priors      0.14286  0.14286  0.14286  0.14286  0.14286  0.14286    0.14286

                    Error Count Estimates for Species

              Bream   Parkki    Perch     Pike    Roach    Smelt  Whitefish    Total
Rate         0.0000   0.0000   0.0536   0.0000   0.0000   0.0000     0.0000   0.0077
Priors       0.1429   0.1429   0.1429   0.1429   0.1429   0.1429     0.1429
The DISCRIM procedure begins by displaying summary information about the variables in the analysis (Figure 25.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement; because this program omits the VAR statement, all numeric variables not listed in other statements are used), and the number of classes in the classification variable (specified with the CLASS statement). The frequency of each class, its weight, its proportion of the total sample, and its prior probability are also displayed. Equal priors are assigned by default, so each of the seven species receives a prior probability of 1/7, or 0.142857.
The natural log of the determinant of the pooled covariance matrix is displayed next (Figure 25.2). The generalized squared distances between the classes are shown in Figure 25.3.
The coefficients of the linear discriminant function are displayed (in Figure 25.4) with the default options METHOD=NORMAL and POOL=YES.
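The constant and coefficient vector in the linear discriminant function follow from expanding the squared distance of an observation to each class mean. The derivation below is a sketch under the assumptions already stated (normality within groups, a pooled covariance matrix COV, and equal priors); the symbols $x$, $c_j$, $b_j$, and $L_j$ are introduced here only for illustration and do not appear in the procedure output.

```latex
\begin{align*}
-\tfrac{1}{2} D_j^2(x)
  &= -\tfrac{1}{2} (x - \bar{X}_j)' \, \mathrm{COV}^{-1} (x - \bar{X}_j) \\
  &= -\tfrac{1}{2} x' \, \mathrm{COV}^{-1} x
     \;+\; \underbrace{\bar{X}_j' \, \mathrm{COV}^{-1}}_{b_j'} x
     \;\underbrace{-\,\tfrac{1}{2} \bar{X}_j' \, \mathrm{COV}^{-1} \bar{X}_j}_{c_j}
\end{align*}
% The first term does not depend on j, so it cancels when classes are compared.
% With equal priors, the class with the largest posterior probability is the
% class with the largest linear score:
\[
L_j(x) = c_j + b_j' x,
\qquad
\hat{\jmath}(x) = \arg\max_j \, L_j(x).
\]
```

Because the quadratic term is common to every class, only the linear score needs to be evaluated: multiply each measurement by the coefficient in a species column, add the constant, and assign the fish to the species with the largest total.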
A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 25.5, you see that only three of the observations are misclassified: three Perch are classified as Smelt. The error-count estimates give the proportion of misclassified observations in each group. The total error rate is the average of the group rates weighted by the prior probabilities; here only Perch has a nonzero rate, so the total is 0.0536 × 0.1429 ≈ 0.0077. Since you are classifying the same data that are used to derive the discriminant function, these error-count estimates are optimistically biased. One way to reduce the bias of the error-count estimates is to split the Fish data into two sets, use one set to derive the discriminant function, and use the other to run validation tests; Example 25.4 on page 1231 shows how to analyze a test data set. Another method of reducing bias is to classify each observation using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
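Both bias-reduction approaches can be sketched with the same Fish data. The first step below adds the CROSSVALIDATE option to the original PROC DISCRIM statement. The second step is a hypothetical holdout split: the data set names train and test, the seed 12345, and the 70% calibration fraction are assumptions chosen only for illustration, and with just six Whitefish a random split can leave very few fish of that species in either set.

```sas
/* Leave-one-out cross validation on the calibration data.          */
/* Assumes the data set WORK.FISH created by the program above.     */
proc discrim data=fish crossvalidate;
   class Species;
run;

/* Hypothetical holdout split: about 70% of the fish derive the     */
/* discriminant function and the remaining fish test it.            */
/* The names train/test, the seed, and the fraction are assumed.    */
data train test;
   set fish;
   if ranuni(12345) < 0.7 then output train;
   else output test;
run;

/* TESTDATA= names the data set to classify with the function       */
/* derived from DATA=.                                              */
proc discrim data=train testdata=test;
   class Species;
run;
```

The cross validation step reports an additional classification summary in which each fish is classified by a function computed without that fish, so its error-count estimates are less optimistic than the resubstitution estimates.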