Getting Started | SAS/STAT 9.1 User's Guide Volume 2 only

The data in this example are measurements taken on 159 fish caught in Finland's Lake Laengelmavesi. The species, weight, three different length measurements, height, and width of each fish are recorded. The full data set is displayed in Chapter 67, "The STEPDISC Procedure." The STEPDISC procedure identifies all six measurement variables as significant indicators of the differences among the seven fish species. The goal now is to find a discriminant function based on these six variables that best classifies the fish into species.

First, assume that the data are normally distributed within each group with equal covariances across groups. The following program uses PROC DISCRIM to analyze the Fish data and create Figure 25.1 through Figure 25.5.

  proc format;
     value specfmt
        1='Bream'
        2='Roach'
        3='Whitefish'
        4='Parkki'
        5='Perch'
        6='Pike'
        7='Smelt';

  data fish (drop=HtPct WidthPct);
     title 'Fish Measurement Data';
     input Species Weight Length1 Length2 Length3 HtPct
           WidthPct @@;
     Height=HtPct*Length3/100;
     Width=WidthPct*Length3/100;
     format Species specfmt.;
     symbol = put(Species, specfmt.);
     datalines;
  1  242.0 23.2 25.4 30.0 38.4 13.4
  1  290.0 24.0 26.3 31.2 40.0 13.8
  1  340.0 23.9 26.5 31.1 39.8 15.1
  1  363.0 26.3 29.0 33.5 38.0 13.3
   ...[155 more records]
  ;

  proc discrim data=fish;
     class Species;
  run;

  Fish Measurement Data

  The DISCRIM Procedure

  Observations     158          DF Total               157
  Variables          6          DF Within Classes      151
  Classes            7          DF Between Classes       6

  Class Level Information

               Variable                                                   Prior
  Species      Name         Frequency       Weight    Proportion    Probability

  Bream        Bream               34      34.0000      0.215190       0.142857
  Parkki       Parkki              11      11.0000      0.069620       0.142857
  Perch        Perch               56      56.0000      0.354430       0.142857
  Pike         Pike                17      17.0000      0.107595       0.142857
  Roach        Roach               20      20.0000      0.126582       0.142857
  Smelt        Smelt               14      14.0000      0.088608       0.142857
  Whitefish    Whitefish            6       6.0000      0.037975       0.142857

Figure 25.1: Summary Information

  Fish Measurement Data

  The DISCRIM Procedure

  Pooled Covariance Matrix Information

                 Natural Log of the
  Covariance     Determinant of the
  Matrix Rank     Covariance Matrix

            6               4.17613

Figure 25.2: Pooled Covariance Matrix Information

  Fish Measurement Data

  The DISCRIM Procedure

  $$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$$

  Generalized Squared Distance to Species

  From
  Species         Bream      Parkki       Perch        Pike       Roach       Smelt  Whitefish

  Bream               0    83.32523   243.66688   310.52333   133.06721   252.75503  132.05820
  Parkki       83.32523           0    57.09760   174.20918    27.00096    60.52076   26.54855
  Perch       243.66688    57.09760           0   101.06791    29.21632    29.26806   20.43791
  Pike        310.52333   174.20918   101.06791           0    92.40876   127.82177   99.90673
  Roach       133.06721    27.00096    29.21632    92.40876           0    33.84280    6.31997
  Smelt       252.75503    60.52076    29.26806   127.82177    33.84280           0   46.37326
  Whitefish   132.05820    26.54855    20.43791    99.90673     6.31997    46.37326          0

Figure 25.3: Squared Distances

  Fish Measurement Data

  The DISCRIM Procedure

  Linear Discriminant Function

  $$\text{Constant} = -.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j \qquad \text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$$

  Linear Discriminant Function for Species

  Variable       Bream      Parkki       Perch        Pike       Roach       Smelt  Whitefish

  Constant  -185.91682   -64.92517   -48.68009  -148.06402   -62.65963   -19.70401  -67.44603
  Weight      -0.10912    -0.09031    -0.09418    -0.13805    -0.09901    -0.05778   -0.09948
  Length1    -23.02273   -13.64180   -19.45368   -20.92442   -14.63635    -4.09257  -22.57117
  Length2    -26.70692    -5.38195    17.33061     6.19887    -7.47195    -3.63996    3.83450
  Length3     50.55780    20.89531     5.25993    22.94989    25.00702    10.60171   21.12638
  Height      13.91638     8.44567    -1.42833    -8.99687    -0.26083    -1.84569    0.64957
  Width      -23.71895   -13.38592     1.32749    -9.13410    -3.74542    -3.43630   -2.52442

Figure 25.4: Linear Discriminant Function

  Fish Measurement Data

  The DISCRIM Procedure

  Classification Summary for Calibration Data: WORK.FISH
  Resubstitution Summary using Linear Discriminant Function

  $$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j)$$

  Posterior Probability of Membership in Each Species

  $$\Pr(j \mid X) = \exp\bigl(-.5 \, D_j^2(X)\bigr) \Big/ \sum_k \exp\bigl(-.5 \, D_k^2(X)\bigr)$$

  Number of Observations and Percent Classified into Species

  From
  Species       Bream    Parkki     Perch      Pike     Roach     Smelt   Whitefish     Total

  Bream            34         0         0         0         0         0           0        34
               100.00      0.00      0.00      0.00      0.00      0.00        0.00    100.00

  Parkki            0        11         0         0         0         0           0        11
                 0.00    100.00      0.00      0.00      0.00      0.00        0.00    100.00

  Perch             0         0        53         0         0         3           0        56
                 0.00      0.00     94.64      0.00      0.00      5.36        0.00    100.00

  Pike              0         0         0        17         0         0           0        17
                 0.00      0.00      0.00    100.00      0.00      0.00        0.00    100.00

  Roach             0         0         0         0        20         0           0        20
                 0.00      0.00      0.00      0.00    100.00      0.00        0.00    100.00

  Smelt             0         0         0         0         0        14           0        14
                 0.00      0.00      0.00      0.00      0.00    100.00        0.00    100.00

  Whitefish         0         0         0         0         0         0           6         6
                 0.00      0.00      0.00      0.00      0.00      0.00      100.00    100.00

  Total            34        11        53        17        20        17           6       158
                21.52      6.96     33.54     10.76     12.66     10.76        3.80    100.00

  Priors      0.14286   0.14286   0.14286   0.14286   0.14286   0.14286     0.14286

  Error Count Estimates for Species

                   Bream    Parkki     Perch      Pike     Roach     Smelt  Whitefish     Total

  Rate            0.0000    0.0000    0.0536    0.0000    0.0000    0.0000     0.0000    0.0077
  Priors          0.1429    0.1429    0.1429    0.1429    0.1429    0.1429     0.1429

Figure 25.5: Resubstitution Misclassification Summary

The DISCRIM procedure begins by displaying summary information about the variables in the analysis (Figure 25.1). This information includes the number of observations, the number of quantitative variables in the analysis, and the number of classes in the classification variable (specified with the CLASS statement). The quantitative variables are normally specified with the VAR statement; because this program omits the VAR statement, all numeric variables not listed in other statements are used. The frequency of each class, its weight, its proportion of the total sample, and its prior probability are also displayed. Equal priors are assigned by default.
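The defaults described above can also be written out explicitly. The following is a minimal sketch, assuming the fish data set created earlier in this section; it should request the same analysis as the original program, with the six measurement variables named in a VAR statement and equal priors stated in a PRIORS statement.

```sas
/* Sketch: same analysis with the defaults spelled out.         */
/* Assumes the WORK.FISH data set built by the DATA step above. */
proc discrim data=fish;
   class Species;                                   /* classification variable  */
   var Weight Length1 Length2 Length3 Height Width; /* six quantitative inputs  */
   priors equal;                                    /* 1/7 for each species     */
run;
```

Replacing `priors equal;` with `priors proportional;` would instead set each prior to the class proportion shown in Figure 25.1.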

The natural log of the determinant of the pooled covariance matrix is displayed next (Figure 25.2). The generalized squared distances between the class means are shown in Figure 25.3.
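As a reminder of what is being pooled, the standard pooled within-class covariance estimate can be written as follows. The symbols $n$, $K$, $n_k$, and $S_k$ (total sample size, number of classes, size of class $k$, and covariance matrix of class $k$) are notation assumed here for illustration; they do not appear in the output.

$$
\mathrm{COV} = \frac{1}{n - K} \sum_{k=1}^{K} (n_k - 1)\, S_k ,
\qquad n - K = 158 - 7 = 151 .
$$

The divisor 151 matches the "DF Within Classes" value in Figure 25.1. The same matrix appears in the distance formula of Figure 25.3, which is why that table is symmetric with zeros on the diagonal.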

The coefficients of the linear discriminant function are displayed (in Figure 25.4) with the default options METHOD=NORMAL and POOL=YES.
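The link between these coefficients and the squared distances can be sketched by expanding the distance from an observation $X$ to the mean of class $j$. The score name $L_j(X)$ is notation assumed here for illustration.

$$
\begin{aligned}
L_j(X) &= -.5\, \bar{X}_j'\, \mathrm{COV}^{-1}\, \bar{X}_j + \bigl(\mathrm{COV}^{-1} \bar{X}_j\bigr)' X , \\
-.5\, D_j^2(X) &= -.5\, (X - \bar{X}_j)'\, \mathrm{COV}^{-1}\, (X - \bar{X}_j)
 = L_j(X) - .5\, X'\, \mathrm{COV}^{-1}\, X .
\end{aligned}
$$

The last term does not depend on $j$, so with equal priors the class with the largest linear score $L_j(X)$ is also the class with the smallest squared distance and the largest posterior probability. To score a fish by hand, add the Constant for a species to the sum of each coefficient in that species' column multiplied by the corresponding measurement, then pick the species with the largest result.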

A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 25.5, you see that only three of the observations are misclassified: three perch are classified as smelt. The error-count estimates give the proportion of misclassified observations in each group; the Total rate, 0.0077, is the average of the group rates weighted by the prior probabilities (0.0536/7 with equal priors). Since you are classifying the same data that are used to derive the discriminant function, these error-count estimates are biased toward zero (optimistic). One way to reduce the bias of the error-count estimates is to split the Fish data into two sets, use one set to derive the discriminant function, and use the other to run validation tests; Example 25.4 on page 1231 shows how to analyze a test data set. Another method of reducing bias is to classify each observation using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
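Both approaches to reducing the bias can be sketched as follows. This is a minimal illustration, not the program from Example 25.4: the data set names fish_train and fish_test are hypothetical and stand for two subsets of the fish data that you would create beforehand.

```sas
/* Sketch 1: cross validation. Each observation is classified   */
/* by a discriminant function computed from the other           */
/* observations.                                                */
proc discrim data=fish crossvalidate;
   class Species;
run;

/* Sketch 2: holdout validation. fish_train and fish_test are   */
/* hypothetical data sets obtained by splitting WORK.FISH.      */
proc discrim data=fish_train testdata=fish_test;
   class Species;
   testclass Species;   /* true species in the test data, used  */
                        /* to count misclassifications          */
run;
```

Cross validation uses all of the observations for both fitting and assessment, which is convenient when some classes are small (only 6 whitefish here); a holdout split leaves fewer observations per class for deriving the function.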