Examples | SAS/STAT 9.1 User's Guide Volume 2 only

Example 23.1. Cluster Analysis of Flying Mileages between Ten American Cities

This first example clusters ten American cities based on the flying mileages between them. Six clustering methods are shown with corresponding tree diagrams produced by the TREE procedure. The EML method cannot be used because it requires coordinate data. The other omitted methods produce the same clusters, although not the same distances between clusters, as one of the illustrated methods: complete linkage and the flexible-beta method yield the same clusters as Ward's method, McQuitty's similarity analysis produces the same clusters as average linkage, and the median method corresponds to the centroid method.

All of the methods suggest a division of the cities into two clusters along the east-west dimension. There is disagreement, however, about which cluster Denver should belong to. Some of the methods indicate a possible third cluster containing Denver and Houston. The following statements produce Output 23.1.1:

   title 'Cluster Analysis of Flying Mileages Between 10 American Cities';

   data mileages(type=distance);
      input (atlanta chicago denver houston losangeles
             miami newyork sanfran seattle washdc) (5.)
             @55 city $15.;
      datalines;
       0                                                 ATLANTA
     587    0                                            CHICAGO
    1212  920    0                                       DENVER
     701  940  879    0                                  HOUSTON
    1936 1745  831 1374    0                             LOS ANGELES
     604 1188 1726  968 2339    0                        MIAMI
     748  713 1631 1420 2451 1092    0                   NEW YORK
    2139 1858  949 1645  347 2594 2571    0              SAN FRANCISCO
    2182 1737 1021 1891  959 2734 2408  678    0         SEATTLE
     543  597 1494 1220 2300  923  205 2442 2329    0    WASHINGTON D.C.
   ;

   /*---------------------- Average linkage --------------------*/
   proc cluster data=mileages method=average pseudo;
      id city;
   run;

   proc tree horizontal spaces=2;
      id city;
   run;

   /*---------------------- Centroid method --------------------*/
   proc cluster data=mileages method=centroid pseudo;
      id city;
   run;

   proc tree horizontal spaces=2;
      id city;
   run;

   /*-------- Density linkage with 3rd-nearest-neighbor --------*/
   proc cluster data=mileages method=density k=3;
      id city;
   run;

   proc tree horizontal spaces=2;
      id city;
   run;

   /*--------------------- Single linkage ----------------------*/
   proc cluster data=mileages method=single;
      id city;
   run;

   proc tree horizontal spaces=2;
      id city;
   run;

   /*--- Two-stage density linkage with 3rd-nearest-neighbor ---*/
   proc cluster data=mileages method=twostage k=3;
      id city;
   run;

   proc tree horizontal spaces=2;
      id city;
   run;

   /* Ward's minimum variance with pseudo F and t^2 statistics */
   proc cluster data=mileages method=ward pseudo;
      id city;
   run;

   proc tree horizontal spaces=2;
      id city;
   run;

Output 23.1.1: Statistics and Tree Diagrams for Six Different Clustering Methods

   Cluster Analysis of Flying Mileages Between 10 American Cities

   The CLUSTER Procedure
   Average Linkage Cluster Analysis

   Root-Mean-Square Distance Between Observations = 1580.242

   Cluster History

   NCL  ---------Clusters Joined----------  FREQ    PSF   PST2  Norm RMS Dist  Tie
     9  NEW YORK         WASHINGTON D.C.       2   66.7      .         0.1297
     8  LOS ANGELES      SAN FRANCISCO         2   39.2      .         0.2196
     7  ATLANTA          CHICAGO               2   21.7      .         0.3715
     6  CL7              CL9                   4   14.5    3.4         0.4149
     5  CL8              SEATTLE               3   12.4    7.3         0.5255
     4  DENVER           HOUSTON               2   13.9      .         0.5562
     3  CL6              MIAMI                 5   15.5    3.8         0.6185
     2  CL3              CL4                   7   16.0    5.3         0.8005
     1  CL2              CL5                  10      .   16.0         1.2967

   Cluster Analysis of Flying Mileages Between 10 American Cities

   The CLUSTER Procedure
   Centroid Hierarchical Cluster Analysis

   Root-Mean-Square Distance Between Observations = 1580.242

   Cluster History

   NCL  ---------Clusters Joined----------  FREQ    PSF   PST2  Norm Cent Dist  Tie
     9  NEW YORK         WASHINGTON D.C.       2   66.7      .          0.1297
     8  LOS ANGELES      SAN FRANCISCO         2   39.2      .          0.2196
     7  ATLANTA          CHICAGO               2   21.7      .          0.3715
     6  CL7              CL9                   4   14.5    3.4          0.3652
     5  CL8              SEATTLE               3   12.4    7.3          0.5139
     4  DENVER           CL5                   4   12.4    2.1          0.5337
     3  CL6              MIAMI                 5   14.2    3.8          0.5743
     2  CL3              HOUSTON               6   22.1    2.6          0.6091
     1  CL2              CL4                  10      .   22.1           1.173

   Cluster Analysis of Flying Mileages Between 10 American Cities

   The CLUSTER Procedure
   Density Linkage Cluster Analysis

   K = 3

   Cluster History
                                                  Normalized     Maximum Density
                                                      Fusion     in Each Cluster
   NCL  ---------Clusters Joined----------  FREQ     Density       Lesser   Greater  Tie
     9  ATLANTA          WASHINGTON D.C.       2      96.106      92.5043     100.0
     8  CL9              CHICAGO               3      95.263      90.9548     100.0
     7  CL8              NEW YORK              4      86.465      76.1571     100.0
     6  CL7              HOUSTON               5      74.079      61.7747     100.0  T
     5  CL6              MIAMI                 6      74.079      58.8299     100.0
     4  LOS ANGELES      SAN FRANCISCO         2      71.968      65.3430   80.0885
     3  CL4              SEATTLE               3      66.341      56.6215   80.0885
     2  CL3              DENVER                4      63.509      61.7747   80.0885
     1  CL5              CL2                  10      61.775  *   80.0885     100.0

   * indicates fusion of two modal or multimodal clusters

   2 modal clusters have been formed.

   Cluster Analysis of Flying Mileages Between 10 American Cities

   The CLUSTER Procedure
   Single Linkage Cluster Analysis

   Mean Distance Between Observations = 1417.133

   Cluster History

   NCL  ---------Clusters Joined----------  FREQ  Norm Min Dist  Tie
     9  NEW YORK         WASHINGTON D.C.       2         0.1447
     8  LOS ANGELES      SAN FRANCISCO         2         0.2449
     7  ATLANTA          CL9                   3         0.3832
     6  CL7              CHICAGO               4         0.4142
     5  CL6              MIAMI                 5         0.4262
     4  CL8              SEATTLE               3         0.4784
     3  CL5              HOUSTON               6         0.4947
     2  DENVER           CL4                   4         0.5864
     1  CL3              CL2                  10         0.6203

   Cluster Analysis of Flying Mileages Between 10 American Cities

   The CLUSTER Procedure
   Two-Stage Density Linkage Clustering

   K = 3

   Cluster History
                                                  Normalized     Maximum Density
                                                      Fusion     in Each Cluster
   NCL  ---------Clusters Joined----------  FREQ     Density       Lesser   Greater  Tie
     9  ATLANTA          WASHINGTON D.C.       2      96.106      92.5043     100.0
     8  CL9              CHICAGO               3      95.263      90.9548     100.0
     7  CL8              NEW YORK              4      86.465      76.1571     100.0
     6  CL7              HOUSTON               5      74.079      61.7747     100.0  T
     5  CL6              MIAMI                 6      74.079      58.8299     100.0
     4  LOS ANGELES      SAN FRANCISCO         2      71.968      65.3430   80.0885
     3  CL4              SEATTLE               3      66.341      56.6215   80.0885
     2  CL3              DENVER                4      63.509      61.7747   80.0885
     1  CL5              CL2                  10      61.775      80.0885     100.0

   2 modal clusters have been formed.

   Cluster Analysis of Flying Mileages Between 10 American Cities

   The CLUSTER Procedure
   Ward's Minimum Variance Cluster Analysis

   Root-Mean-Square Distance Between Observations = 1580.242

   Cluster History

   NCL  ---------Clusters Joined----------  FREQ   SPRSQ   RSQ    PSF   PST2  Tie
     9  NEW YORK         WASHINGTON D.C.       2  0.0019  .998   66.7      .
     8  LOS ANGELES      SAN FRANCISCO         2  0.0054  .993   39.2      .
     7  ATLANTA          CHICAGO               2  0.0153  .977   21.7      .
     6  CL7              CL9                   4  0.0296  .948   14.5    3.4
     5  DENVER           HOUSTON               2  0.0344  .913   13.2      .
     4  CL8              SEATTLE               3  0.0391  .874   13.9    7.3
     3  CL6              MIAMI                 5  0.0586  .816   15.5    3.8
     2  CL3              CL5                   7  0.1488  .667   16.0    5.3
     1  CL2              CL4                  10  0.6669  .000      .   16.0
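The normalized distances in these cluster histories can be checked by hand. The following Python sketch is an illustration only, not part of the SAS example: it recomputes the mean and root-mean-square distance from the mileage table and runs a naive single linkage pass. The helper names (`dist`, `single_linkage`) are hypothetical, and the sketch assumes that the single linkage listing divides each merge distance by the mean distance, which is what the printed figures suggest.

```python
# Lower triangle of the mileage table from the DATA step (row i holds the
# distances from city i to cities 0..i-1, in the order of the INPUT list).
CITIES = ["ATLANTA", "CHICAGO", "DENVER", "HOUSTON", "LOS ANGELES",
          "MIAMI", "NEW YORK", "SAN FRANCISCO", "SEATTLE", "WASHINGTON D.C."]
LOWER = [
    [],
    [587],
    [1212, 920],
    [701, 940, 879],
    [1936, 1745, 831, 1374],
    [604, 1188, 1726, 968, 2339],
    [748, 713, 1631, 1420, 2451, 1092],
    [2139, 1858, 949, 1645, 347, 2594, 2571],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329],
]


def dist(i, j):
    """Distance between cities i and j, looked up in the lower triangle."""
    return LOWER[max(i, j)][min(i, j)]


def single_linkage(n):
    """Naive single linkage: repeatedly join the two closest clusters.

    Returns the raw (unnormalized) distance at which each join happens.
    """
    clusters = [[i] for i in range(n)]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
        heights.append(d)
    return heights


pairs = [d for row in LOWER for d in row]           # all 45 city pairs
mean_dist = sum(pairs) / len(pairs)
rms_dist = (sum(d * d for d in pairs) / len(pairs)) ** 0.5
heights = single_linkage(len(CITIES))
normalized = [round(h / mean_dist, 4) for h in heights]

print(f"mean distance = {mean_dist:.3f}, RMS distance = {rms_dist:.3f}")
print(normalized)
```

Under these assumptions the two constants agree with the 1417.133 and 1580.242 printed in the listings, and the normalized merge distances agree with the Norm Min Dist column of the single linkage history.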

Example 23.2. Crude Birth and Death Rates

The following example uses the SAS data set Poverty created in the Getting Started section beginning on page 958. The data, from Rouncefield (1995), are birth rates, death rates, and infant death rates for 97 countries. Six cluster analyses are performed with eight methods. Scatter plots showing cluster membership at selected levels are produced instead of tree diagrams.

Each cluster analysis is performed by a macro called ANALYZE. The macro takes two arguments. The first, &METHOD, specifies the value of the METHOD= option to be used in the PROC CLUSTER statement. The second, &NCL, must be specified as a list of integers, separated by blanks, indicating the number of clusters desired in each scatter plot. For example, the first invocation of ANALYZE specifies the AVERAGE method and requests plots of 3 and 8 clusters. When two-stage density linkage is used, the K= and R= options are specified as part of the first argument.

The ANALYZE macro first invokes the CLUSTER procedure with METHOD=&METHOD, where &METHOD represents the value of the first argument to ANALYZE. This part of the macro produces the PROC CLUSTER output shown.

The %DO loop processes &NCL, the list of numbers of clusters to plot. The macro variable &K is a counter that indexes the numbers within &NCL. The %SCAN function picks out the &Kth number in &NCL, which is then assigned to the macro variable &N. When &K exceeds the number of numbers in &NCL, %SCAN returns a null string. Thus, the %DO loop executes while &N is not equal to a null string. In the %WHILE condition, a null string is indicated by the absence of any nonblank characters between the comparison operator (NE) and the right parenthesis that terminates the condition.

Within the %DO loop, the TREE procedure creates an output data set containing &N clusters. The GPLOT procedure then produces a scatter plot in which each observation is identified by the number of the cluster to which it belongs. The TITLE2 statement uses double quotes so that &N and &METHOD can be used within the title. At the end of the loop, &K is incremented by 1, and the next number is extracted from &NCL by %SCAN.
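The control flow of the loop can be mirrored outside the macro language. The Python sketch below is a rough analogue, not SAS code: `scan` is a hypothetical stand-in for %SCAN that treats only blanks as delimiters, and `cluster_counts` collects the values that &N would take on, in the order the macro would plot them.

```python
def scan(text, k):
    """Return the k-th blank-delimited word of text (1-based), or '' if none.

    A simplified stand-in for %SCAN: only blanks act as delimiters here.
    """
    words = text.split()
    return words[k - 1] if 1 <= k <= len(words) else ""


def cluster_counts(ncl):
    """Walk the list the way the %DO %WHILE loop does and collect each &N."""
    counts = []
    k = 1                          # %let k=1;
    n = scan(ncl, k)               # %let n=%scan(&ncl,&k);
    while n != "":                 # %do %while(&n NE);
        counts.append(int(n))      # the macro runs PROC TREE and PROC GPLOT here
        k += 1                     # %let k=%eval(&k+1);
        n = scan(ncl, k)           # %let n=%scan(&ncl,&k);
    return counts


print(cluster_counts("3 8"))       # the list passed by %analyze(average,3 8)
```

The loop stops as soon as the lookup runs past the end of the list, which is why no explicit count of the list items is needed.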

For this example, plots are obtained only for average linkage. To generate plots for other methods, follow the example shown in the first macro call. The following statements produce Output 23.2.1 through Output 23.2.7.

   title 'Cluster Analysis of Birth and Death Rates';

   %macro analyze(method,ncl);
      proc cluster data=poverty outtree=tree method=&method p=15 ccc pseudo;
         var birth death;
         title2;
      run;

      %let k=1;
      %let n=%scan(&ncl,&k);
      %do %while(&n NE);
         proc tree data=tree noprint out=out ncl=&n;
            copy birth death;
         run;

         legend1 frame cframe=ligr cborder=black
                 position=center value=(justify=center);
         axis1 label=(angle=90 rotate=0) minor=none;
         axis2 minor=none;
         proc gplot;
            plot death*birth=cluster /
                 frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
            title2 "Plot of &n Clusters from METHOD=&METHOD";
         run;

         %let k=%eval(&k+1);
         %let n=%scan(&ncl,&k);
      %end;
   %mend;

   %analyze(average,3 8)
   %analyze(complete,3)
   %analyze(single,7 10)
   %analyze(two k=10,3)
   %analyze(two k=18,2)

Output 23.2.1: Clusters for Birth and Death Rates, METHOD=AVERAGE

   Cluster Analysis of Birth and Death Rates

   The CLUSTER Procedure
   Average Linkage Cluster Analysis

   Eigenvalues of the Covariance Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    189.106588    173.101020        0.9220        0.9220
   2     16.005568                      0.0780        1.0000

   Root-Mean-Square Total-Sample Standard Deviation = 10.127
   Root-Mean-Square Distance Between Observations   = 20.25399

   Cluster History

   NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm RMS Dist  Tie
    15  CL27      CL20          18  0.0035  .980  .975   2.61    292   18.6         0.2325
    14  CL23      CL17          28  0.0034  .977  .972   1.97    271   17.7         0.2358
    13  CL18      CL54           8  0.0015  .975  .969   2.35    279    7.1         0.2432
    12  CL21      CL26           8  0.0015  .974  .966   2.85    290    6.1         0.2493
    11  CL19      CL24          12  0.0033  .971  .962   2.78    285   14.8         0.2767
    10  CL22      CL16          12  0.0036  .967  .957   2.84    284   17.4         0.2858
     9  CL15      CL28          22  0.0061  .961  .951   2.45    271   17.5         0.3353
     8  OB23      OB61           2  0.0014  .960  .943   3.59    302      .         0.3703
     7  CL25      CL11          17  0.0098  .950  .933   3.01    284   23.3         0.4033
     6  CL7       CL12          25  0.0122  .938  .920   2.63    273   14.8         0.4132
     5  CL10      CL14          40  0.0303  .907  .902   0.59    225   82.7         0.4584
     4  CL13      CL6           33  0.0244  .883  .875   0.77    234   22.2         0.5194
     3  CL9       CL8           24  0.0182  .865  .827   2.13    300   27.7          0.735
     2  CL5       CL3           64  0.1836  .681  .697   -.55    203    148         0.8402
     1  CL2       CL4           97  0.6810  .000  .000   0.00      .    203         1.3348

Output 23.2.2: Plot of Three Clusters, METHOD=AVERAGE

Output 23.2.3: Plot of Eight Clusters, METHOD=AVERAGE

Output 23.2.4: Clusters for Birth and Death Rates, METHOD=COMPLETE

   Cluster Analysis of Birth and Death Rates

   The CLUSTER Procedure
   Complete Linkage Cluster Analysis

   Eigenvalues of the Covariance Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    189.106588    173.101020        0.9220        0.9220
   2     16.005568                      0.0780        1.0000

   Root-Mean-Square Total-Sample Standard Deviation = 10.127
   Mean Distance Between Observations               = 17.13099

   Cluster History

   NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Max Dist  Tie
    15  CL22      CL33           8  0.0015  .983  .975   3.80    329    6.1         0.4092
    14  CL56      CL18           8  0.0014  .981  .972   3.97    331    6.6         0.4255
    13  CL30      CL44           8  0.0019  .979  .969   4.04    330   19.0         0.4332
    12  OB23      OB61           2  0.0014  .978  .966   4.45    340      .         0.4378
    11  CL19      CL24          24  0.0034  .974  .962   4.17    327   24.1         0.4962
    10  CL17      CL28          12  0.0033  .971  .957   4.18    325   14.8         0.5204
     9  CL20      CL13          16  0.0067  .964  .951   3.38    297   25.2         0.5236
     8  CL11      CL21          32  0.0054  .959  .943   3.44    297   19.7         0.6001
     7  CL26      CL15          13  0.0096  .949  .933   2.93    282   28.9         0.7233
     6  CL14      CL10          20  0.0128  .937  .920   2.46    269   27.7         0.8033
     5  CL9       CL16          30  0.0237  .913  .902   1.29    241   47.1         0.8993
     4  CL6       CL7           33  0.0240  .889  .875   1.38    248   21.7         1.2165
     3  CL5       CL12          32  0.0178  .871  .827   2.56    317   13.6         1.2326
     2  CL3       CL8           64  0.1900  .681  .697   -.55    203    167         1.5412
     1  CL2       CL4           97  0.6810  .000  .000   0.00      .    203         2.5233

Output 23.2.5: Clusters for Birth and Death Rates, METHOD=SINGLE

   Cluster Analysis of Birth and Death Rates

   The CLUSTER Procedure
   Single Linkage Cluster Analysis

   Eigenvalues of the Covariance Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    189.106588    173.101020        0.9220        0.9220
   2     16.005568                      0.0780        1.0000

   Root-Mean-Square Total-Sample Standard Deviation = 10.127
   Mean Distance Between Observations               = 17.13099

   Cluster History

   NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Min Dist  Tie
    15  CL37      CL19           8  0.0014  .968  .975    2.3    178    6.6         0.1331
    14  CL20      CL23          15  0.0059  .962  .972    3.1    162   18.7         0.1412
    13  CL14      CL16          19  0.0054  .957  .969    3.4    155    8.8         0.1442
    12  CL26      OB58          31  0.0014  .955  .966    2.7    165    4.0         0.1486
    11  OB86      CL18           4  0.0003  .955  .962    1.6    183    3.8         0.1495
    10  CL13      CL11          23  0.0088  .946  .957    2.3    170   11.3         0.1518
     9  CL15      CL10          31  0.0210  .925  .951    4.4    136   21.8         0.1593  T
     8  CL22      CL17          30  0.0235  .902  .943    5.8    117   45.7         0.1593
     7  CL8       OB75          31  0.0052  .897  .933    4.7    130    4.0         0.1628
     6  CL7       CL12          62  0.2023  .694  .920     15   41.3    223         0.1725
     5  CL6       CL9           93  0.6681  .026  .902     26    0.6    199         0.1756
     4  CL5       OB48          94  0.0056  .021  .875     24    0.7    0.5         0.1811  T
     3  CL4       OB67          95  0.0083  .012  .827     15    0.6    0.8         0.1811
     2  OB23      OB61           2  0.0014  .011  .697     13    1.0      .         0.4378
     1  CL3       CL2           97  0.0109  .000  .000   0.00      .    1.0         0.5815

Output 23.2.6: Clusters for Birth and Death Rates, METHOD=TWOSTAGE, K=10

   Cluster Analysis of Birth and Death Rates

   The CLUSTER Procedure
   Two-Stage Density Linkage Clustering

   Eigenvalues of the Covariance Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    189.106588    173.101020        0.9220        0.9220
   2     16.005568                      0.0780        1.0000

   K = 10
   Root-Mean-Square Total-Sample Standard Deviation = 10.127

   Cluster History
                                                                             Normalized     Maximum Density
                                                                                 Fusion     in Each Cluster
   NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2     Density    Lesser   Greater  Tie
    15  CL16      OB94          22  0.0015  .921  .975     11   68.4    1.4      9.2234    6.7927   15.3069
    14  CL19      OB49          28  0.0021  .919  .972     11   72.4    1.8      8.7369    5.9334   33.4385
    13  CL15      OB52          23  0.0024  .917  .969     10   76.9    2.3      8.5847    5.9651   15.3069
    12  CL13      OB96          24  0.0018  .915  .966    9.3   83.0    1.6      7.9252    5.4724   15.3069
    11  CL12      OB93          25  0.0025  .912  .962    8.5   89.5    2.2      7.8913    5.4401   15.3069
    10  CL11      OB78          26  0.0031  .909  .957    7.7   96.9    2.5       7.787    5.4082   15.3069
     9  CL10      OB76          27  0.0026  .907  .951    6.7    107    2.1      7.7133    5.4401   15.3069
     8  CL9       OB77          28  0.0023  .904  .943    5.5    120    1.7      7.4256    4.9017   15.3069
     7  CL8       OB43          29  0.0022  .902  .933    4.1    138    1.6       6.927    4.4764   15.3069
     6  CL7       OB87          30  0.0043  .898  .920    2.7    160    3.1       4.932    2.9977   15.3069
     5  CL6       OB82          31  0.0055  .892  .902    1.1    191    3.7      3.7331    2.1560   15.3069
     4  CL22      OB61          37  0.0079  .884  .875   0.93    237   10.6      3.1713    1.6308     100.0
     3  CL14      OB23          29  0.0126  .872  .827   2.60    320   10.4      2.0654    1.0744   33.4385
     2  CL4       CL3           66  0.2129  .659  .697    1.3    183    172      12.409   33.4385     100.0
     1  CL2       CL5           97  0.6588  .000  .000   0.00      .    183      10.071   15.3069     100.0

   3 modal clusters have been formed.

Output 23.2.7: Clusters for Birth and Death Rates, METHOD=TWOSTAGE, K=18

   Cluster Analysis of Birth and Death Rates

   The CLUSTER Procedure
   Two-Stage Density Linkage Clustering

   Eigenvalues of the Covariance Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    189.106588    173.101020        0.9220        0.9220
   2     16.005568                      0.0780        1.0000

   K = 18
   Root-Mean-Square Total-Sample Standard Deviation = 10.127

   Cluster History
                                                                             Normalized     Maximum Density
                                                                                 Fusion     in Each Cluster
   NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2     Density    Lesser   Greater  Tie
    15  CL16      OB72          46  0.0107  .799  .975     21   23.3    3.0      10.118    7.7445   23.4457
    14  CL15      OB94          47  0.0098  .789  .972     21   23.9    2.7       9.676    7.1257   23.4457
    13  CL14      OB51          48  0.0037  .786  .969     20   25.6    1.0       9.409    6.8398   23.4457  T
    12  CL13      OB96          49  0.0099  .776  .966     19   26.7    2.6       9.409    6.8398   23.4457
    11  CL12      OB76          50  0.0114  .764  .962     19   27.9    2.9      8.8136    6.3138   23.4457
    10  CL11      OB77          51  0.0021  .762  .957     18   31.0    0.5      8.6593    6.0751   23.4457
     9  CL10      OB78          52  0.0103  .752  .951     17   33.3    2.5      8.6007    6.0976   23.4457
     8  CL9       OB43          53  0.0034  .748  .943     16   37.8    0.8      8.4964    5.9160   23.4457
     7  CL8       OB93          54  0.0109  .737  .933     15   42.1    2.6       8.367    5.7913   23.4457
     6  CL7       OB88          55  0.0110  .726  .920     13   48.3    2.6       7.916    5.3679   23.4457
     5  CL6       OB87          56  0.0120  .714  .902     12   57.5    2.7      6.6917    4.3415   23.4457
     4  CL20      OB61          39  0.0077  .707  .875    9.8   74.7    8.3      6.2578    3.2882     100.0
     3  CL5       OB82          57  0.0138  .693  .827    5.0    106    3.0      5.3605    3.2834   23.4457
     2  CL3       OB23          58  0.0117  .681  .697    .54    203    2.5      3.2687    1.7568   23.4457
     1  CL2       CL4           97  0.6812  .000  .000   0.00      .    203      13.764   23.4457     100.0

   2 modal clusters have been formed.

For average linkage, the CCC has peaks at 3, 8, 10, and 12 clusters, but the 3-cluster peak is lower than the 8-cluster peak. The pseudo F statistic has peaks at 3, 8, and 12 clusters. The pseudo t² statistic drops sharply at 3 clusters, continues to fall at 4 clusters, and has a particularly low value at 12 clusters. However, there are not enough data to seriously consider as many as 12 clusters. Scatter plots are given for 3 and 8 clusters. The results are shown in Output 23.2.1 through Output 23.2.3. In Output 23.2.3, the eighth cluster consists of the two outlying observations, Mexico and Korea.
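The peaks quoted for average linkage can be read off the CCC and PSF columns of Output 23.2.1 mechanically. The Python sketch below is only an illustration: it treats a "peak" as a level whose value exceeds the values at both neighbouring numbers of clusters, which is an assumed working definition, and `local_peaks` is a hypothetical helper name.

```python
# Number of clusters -> (CCC, pseudo F), copied from Output 23.2.1 for NCL 2..15.
# NCL=1 is left out because its pseudo F value is missing in the listing.
HISTORY = {
    15: (2.61, 292), 14: (1.97, 271), 13: (2.35, 279), 12: (2.85, 290),
    11: (2.78, 285), 10: (2.84, 284), 9: (2.45, 271), 8: (3.59, 302),
    7: (3.01, 284), 6: (2.63, 273), 5: (0.59, 225), 4: (0.77, 234),
    3: (2.13, 300), 2: (-0.55, 203),
}


def local_peaks(values):
    """Return the levels whose value exceeds both neighbouring levels."""
    levels = sorted(values)
    return [n for prev, n, nxt in zip(levels, levels[1:], levels[2:])
            if values[n] > values[prev] and values[n] > values[nxt]]


ccc = {ncl: row[0] for ncl, row in HISTORY.items()}
psf = {ncl: row[1] for ncl, row in HISTORY.items()}

ccc_peaks = local_peaks(ccc)
psf_peaks = local_peaks(psf)
print("CCC peaks:", ccc_peaks)
print("PSF peaks:", psf_peaks)
```

With this definition the CCC peaks come out at 3, 8, 10, and 12 clusters and the pseudo F peaks at 3, 8, and 12, and the CCC value at 3 clusters (2.13) is lower than the value at 8 clusters (3.59).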

Complete linkage shows CCC peaks at 3, 8, and 12 clusters. The pseudo F statistic peaks at 3 and 12 clusters. The pseudo t² statistic indicates 3 clusters.

The scatter plot for 3 clusters is shown. The results are shown in Output 23.2.4.

The CCC and pseudo F statistics are not appropriate for use with single linkage because of the method's tendency to chop off tails of distributions. The pseudo t² statistic can be used by looking for large values and taking the number of clusters to be one greater than the level at which the large pseudo t² value is displayed. For these data, there are large values at levels 6 and 9, suggesting 7 or 10 clusters.

The scatter plots for 7 and 10 clusters are shown. The results are shown in Output 23.2.5.

For kth-nearest-neighbor density linkage, the number of modes as a function of k is as follows (not all of these analyses are shown):

k	modes
3	13
4	6
5-7	4
8-15	3
16-21	2
22+	1

Thus, there is strong evidence of 3 modes and an indication of the possibility of 2 modes. Uniform-kernel density linkage gives similar results. For K=10 (10th-nearest-neighbor density linkage), the scatter plot for 3 clusters is shown, and for K=18, the scatter plot for 2 clusters is shown. The results are shown in Output 23.2.6 and Output 23.2.7.
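One informal way to read the table of modes is to ask how many consecutive values of k give the same number of modes. This is an interpretation offered here for illustration, not a rule stated by the procedure. The short Python sketch below tabulates the width of each range, leaving out the open-ended row for k of 22 and above.

```python
# (first k, last k, number of modes) from the table; the "22+" row is omitted
# because its range has no upper end.
RANGES = [(3, 3, 13), (4, 4, 6), (5, 7, 4), (8, 15, 3), (16, 21, 2)]

# How many consecutive values of k produce each mode count.
span = {modes: last - first + 1 for first, last, modes in RANGES}

# Mode counts ordered from the widest range of k to the narrowest.
ranked = sorted(span, key=span.get, reverse=True)
print(span)
print(ranked[:2])
```

Read this way, 3 modes persist over the widest range of k (8 through 15) and 2 modes over the next widest (16 through 21).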

In summary, most of the clustering methods indicate 3 or 8 clusters. Most methods agree at the 3-cluster level, but at the other levels, there is considerable disagreement about the composition of the clusters. The presence of numerous ties also complicates the analysis; see Example 23.4 on page 1027.

Example 23.3. Cluster Analysis of Fisher Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

This example analyzes the iris data by Ward's method and two-stage density linkage and then illustrates how the FASTCLUS procedure can be used in combination with PROC CLUSTER to analyze large data sets.

   title 'Cluster Analysis of Fisher (1936) Iris Data';

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

The following macro, SHOW, is used in the subsequent analyses to display cluster results. It invokes the FREQ procedure to crosstabulate clusters and species. The CANDISC procedure computes canonical variables for discriminating among the clusters, and the first two canonical variables are plotted to show cluster membership. See Chapter 21, "The CANDISC Procedure," for a canonical discriminant analysis of the iris species.

   %macro show;
      proc freq;
         tables cluster*species;
      run;

      proc candisc noprint out=can;
         class cluster;
         var petal: sepal:;
      run;

      legend1 frame cframe=ligr cborder=black
              position=center value=(justify=center);
      axis1 label=(angle=90 rotate=0) minor=none;
      axis2 minor=none;
      proc gplot;
         plot can2*can1=cluster /
              frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
      run;
   %mend;

The first analysis clusters the iris data by Ward's method and plots the CCC and pseudo F and t² statistics. The CCC has a local peak at 3 clusters but a higher peak at 5 clusters. The pseudo F statistic indicates 3 clusters, while the pseudo t² statistic suggests 3 or 6 clusters. For large numbers of clusters, Version 6 of the SAS System produces somewhat different results than previous versions of PROC CLUSTER. This is due to changes in the treatment of ties. Results are identical for 5 or fewer clusters.

The TREE procedure creates an output data set containing the 3-cluster partition for use by the SHOW macro. The FREQ procedure reveals 16 misclassifications. The results are shown in Output 23.3.1.

Output 23.3.1: Cluster Analysis of Fisher Iris Data ”CLUSTER with METHOD=WARD

  Cluster Analysis of Fisher (1936) Iris Data
  By Ward's Method

  The CLUSTER Procedure
  Ward's Minimum Variance Cluster Analysis

  Eigenvalues of the Covariance Matrix

            Eigenvalue    Difference    Proportion    Cumulative
     1      422.824171    398.557096        0.9246        0.9246
     2       24.267075     16.446125        0.0531        0.9777
     3        7.820950      5.437441        0.0171        0.9948
     4        2.383509                      0.0052        1.0000

  Root-Mean-Square Total-Sample Standard Deviation = 10.69224
  Root-Mean-Square Distance Between Observations   = 30.24221

  Cluster History

  NCL  --Clusters Joined---   FREQ     SPRSQ     RSQ    ERSQ     CCC     PSF    PST2  Tie
   15  CL24        CL28         15    0.0016    .971    .958    5.93     324     9.8
   14  CL21        CL53          7    0.0019    .969    .955    5.85     329     5.1
   13  CL18        CL48         15    0.0023    .967    .953    5.69     334     8.9
   12  CL16        CL23         24    0.0023    .965    .950    4.63     342     9.6
   11  CL14        CL43         12    0.0025    .962    .946    4.67     353     5.8
   10  CL26        CL20         22    0.0027    .959    .942    4.81     368    12.9
    9  CL27        CL17         31    0.0031    .956    .936    5.02     387    17.8
    8  CL35        CL15         23    0.0031    .953    .930    5.44     414    13.8
    7  CL10        CL47         26    0.0058    .947    .921    5.43     430    19.1
    6  CL8         CL13         38    0.0060    .941    .911    5.81     463    16.3
    5  CL9         CL19         50    0.0105    .931    .895    5.82     488    43.2
    4  CL12        CL11         36    0.0172    .914    .872    3.99     515    41.0
    3  CL6         CL7          64    0.0301    .884    .827    4.33     558    57.2
    2  CL4         CL3         100    0.1110    .773    .697    3.83     503     116
    1  CL5         CL2         150    0.7726    .000    .000    0.00       .     503

title2 'By Ward''s Method';
proc cluster data=iris method=ward print=15 ccc pseudo;
   var petal: sepal:;
   copy species;
run;

legend1 frame cframe=ligr cborder=black
        position=center value=(justify=center);
axis1 label=(angle=90 rotate=0) minor=none order=(0 to 600 by 100);
axis2 minor=none order=(1 to 30 by 1);
axis3 label=(angle=90 rotate=0) minor=none order=(0 to 7 by 1);
proc gplot;
   plot _ccc_*_ncl_ /
        frame cframe=ligr legend=legend1 vaxis=axis3 haxis=axis2;
   plot _psf_*_ncl_ _pst2_*_ncl_ / overlay
        frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
run;

proc tree noprint ncl=3 out=out;
   copy petal: sepal: species;
run;
%show;

  Cluster Analysis of Fisher (1936) Iris Data

  The FREQ Procedure

  Table of CLUSTER by Species

  CLUSTER     Species

  Frequency|
  Percent  |
  Row Pct  |
  Col Pct  |Setosa  |Versicol|Virginic|  Total
           |        |or      |a       |
  ---------+--------+--------+--------+
          1|       0|      49|      15|     64
           |    0.00|   32.67|   10.00|  42.67
           |    0.00|   76.56|   23.44|
           |    0.00|   98.00|   30.00|
  ---------+--------+--------+--------+
          2|       0|       1|      35|     36
           |    0.00|    0.67|   23.33|  24.00
           |    0.00|    2.78|   97.22|
           |    0.00|    2.00|   70.00|
  ---------+--------+--------+--------+
          3|      50|       0|       0|     50
           |   33.33|    0.00|    0.00|  33.33
           |  100.00|    0.00|    0.00|
           |  100.00|    0.00|    0.00|
  ---------+--------+--------+--------+
  Total          50       50       50     150
              33.33    33.33    33.33  100.00

The second analysis uses two-stage density linkage. The raw data suggest 2 or 6 modes instead of 3:

k	modes
3	12
4-6	6
7	4
8	3
9-50	2
51+	1

However, the ACECLUS procedure can be used to reveal 3 modes. This analysis uses K=8 to produce 3 clusters for comparison with other analyses. There are only 6 misclassifications. The results are shown in Output 23.3.2.

title2 'By Two-Stage Density Linkage';
proc cluster data=iris method=twostage k=8 print=15 ccc pseudo;
   var petal: sepal:;
   copy species;
run;

proc tree noprint ncl=3 out=out;
   copy petal: sepal: species;
run;
%show;

Output 23.3.2: Cluster Analysis of Fisher Iris Data: CLUSTER with METHOD=TWOSTAGE

  Cluster Analysis of Fisher (1936) Iris Data
  By Two-Stage Density Linkage

  The CLUSTER Procedure
  Two-Stage Density Linkage Clustering

  Eigenvalues of the Covariance Matrix

            Eigenvalue    Difference    Proportion    Cumulative
     1      422.824171    398.557096        0.9246        0.9246
     2       24.267075     16.446125        0.0531        0.9777
     3        7.820950      5.437441        0.0171        0.9948
     4        2.383509                      0.0052        1.0000

  K = 8
  Root-Mean-Square Total-Sample Standard Deviation = 10.69224

  Cluster History

                                                                      Normalized    Maximum Density
                                                                          Fusion    in Each Cluster
  NCL  --Clusters Joined--  FREQ    SPRSQ    RSQ   ERSQ    CCC    PSF   PST2     Density     Lesser    Greater  Tie
   15  CL17       OB127       44   0.0025   .916   .958     11    105    3.4      0.3903     0.2066     3.5156
   14  CL16       OB137       50   0.0023   .913   .955     11    110    5.6      0.3637     0.1837      100.0
   13  CL15       OB74        45   0.0029   .910   .953     10    116    3.7      0.3553     0.2130     3.5156
   12  CL28       OB49        46   0.0036   .907   .950    8.0    122    5.2      0.3223     0.1736     8.3678    T
   11  CL12       OB85        47   0.0036   .903   .946    7.6    130    4.8      0.3223     0.1736     8.3678
   10  CL11       OB98        48   0.0033   .900   .942    7.1    140    4.1      0.2879     0.1479     8.3678
    9  CL13       OB24        46   0.0037   .896   .936    6.5    152    4.4      0.2802     0.2005     3.5156
    8  CL10       OB25        49   0.0019   .894   .930    5.5    171    2.2      0.2699     0.1372     8.3678
    7  CL8        OB121       50   0.0035   .891   .921    4.5    194    4.0      0.2586     0.1372     8.3678
    6  CL9        OB45        47   0.0042   .886   .911    3.3    225    4.6      0.1412     0.0832     3.5156
    5  CL6        OB39        48   0.0049   .882   .895    1.7    270    5.0       0.107     0.0605     3.5156
    4  CL5        OB21        49   0.0049   .877   .872   0.35    346    4.7      0.0969     0.0541     3.5156
    3  CL4        OB90        50   0.0047   .872   .827   3.28    500    4.1      0.0715     0.0370     3.5156
    2  CL3        CL7        100   0.0993   .773   .697   3.83    503   91.9      2.6277     3.5156     8.3678

  3 modal clusters have been formed.

  Cluster Analysis of Fisher (1936) Iris Data

  The FREQ Procedure

  Table of CLUSTER by Species

  CLUSTER     Species

  Frequency|
  Percent  |
  Row Pct  |
  Col Pct  |Setosa  |Versicol|Virginic|  Total
           |        |or      |a       |
  ---------+--------+--------+--------+
          1|      50|       0|       0|     50
           |   33.33|    0.00|    0.00|  33.33
           |  100.00|    0.00|    0.00|
           |  100.00|    0.00|    0.00|
  ---------+--------+--------+--------+
          2|       0|      47|       3|     50
           |    0.00|   31.33|    2.00|  33.33
           |    0.00|   94.00|    6.00|
           |    0.00|   94.00|    6.00|
  ---------+--------+--------+--------+
          3|       0|       3|      47|     50
           |    0.00|    2.00|   31.33|  33.33
           |    0.00|    6.00|   94.00|
           |    0.00|    6.00|   94.00|
  ---------+--------+--------+--------+
  Total          50       50       50     150
              33.33    33.33    33.33  100.00
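
The misclassification counts quoted in this example can be recovered from a crosstabulation of CLUSTER by Species by labeling each cluster with its most frequent species and counting the observations that are left over. The following statements are a minimal sketch of that calculation and are not part of the original example. They assume that the OUT data set created by PROC TREE contains the variables CLUSTER and Species, the data set name COUNTS is hypothetical, and the majority-label rule is itself an assumption that breaks down if two clusters share the same dominant species.

```sas
/* Sketch only: count observations that do not belong to the      */
/* most frequent species in their cluster.                        */
/* Assumes OUT (from PROC TREE) contains CLUSTER and SPECIES.     */
proc freq data=out noprint;
   tables cluster*species / out=counts;   /* COUNTS is a hypothetical name */
run;

proc sql;
   /* cluster size minus largest cell, summed over clusters */
   select sum(n) - sum(maxn) as misclassified
      from (select cluster,
                   sum(count) as n,
                   max(count) as maxn
               from counts
               group by cluster);
quit;
```

For the two-stage density linkage table, the same arithmetic done by hand is (50 - 50) + (50 - 47) + (50 - 47) = 6.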

The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time varies as the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can, therefore, be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis producing a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.

FASTCLUS automatically creates variables _FREQ_ and _RMSSTD_ in the MEAN= output data set. These variables are then automatically used by PROC CLUSTER in the computation of various statistics.

The iris data are used to illustrate the process of clustering clusters. In the preliminary analysis, PROC FASTCLUS produces ten clusters, which are then crosstabulated with species. The data set containing the preliminary clusters is sorted in preparation for later merges. The results are shown in Output 23.3.3.

title2 'Preliminary Analysis by FASTCLUS';
proc fastclus data=iris summary maxc=10 maxiter=99 converge=0
              mean=mean out=prelim cluster=preclus;
   var petal: sepal:;
run;

proc freq;
   tables preclus*species;
run;

proc sort data=prelim;
   by preclus;
run;

Output 23.3.3: Preliminary Analysis of Fisher Iris Data

  Cluster Analysis of Fisher (1936) Iris Data
  Preliminary Analysis by FASTCLUS

  The FASTCLUS Procedure
  Replace=FULL  Radius=0  Maxclusters=10  Maxiter=99  Converge=0

  Cluster Summary

                                    Maximum Distance
                          RMS Std          from Seed     Radius    Nearest     Distance Between
  Cluster   Frequency   Deviation     to Observation   Exceeded    Cluster    Cluster Centroids
  ----------------------------------------------------------------------------------------------
        1           9      2.7067             8.2027                     5               8.7362
        2          19      2.2001             7.7340                     4               6.2243
        3          18      2.1496             6.2173                     8               7.5049
        4           4      2.5249             5.3268                     2               6.2243
        5           3      2.7234             5.8214                     1               8.7362
        6           7      2.2939             5.1508                     2               9.3318
        7          17      2.0274             6.9576                    10               7.9503
        8          18      2.2628             7.1135                     3               7.5049
        9          22      2.2666             7.5029                     8               9.0090
       10          33      2.0594            10.0033                     7               7.9503

  Pseudo F Statistic =   370.58
  Observed Over-All R-Squared = 0.95971
  Approximate Expected Over-All R-Squared =   0.82928
  Cubic Clustering Criterion =   27.077

  WARNING: The two values above are invalid for correlated variables.

  Cluster Analysis of Fisher (1936) Iris Data
  Preliminary Analysis by FASTCLUS

  The FREQ Procedure

  Table of preclus by Species

  preclus(Cluster)     Species

  Frequency|
  Percent  |
  Row Pct  |
  Col Pct  |Setosa  |Versicol|Virginic|  Total
           |        |or      |a       |
  ---------+--------+--------+--------+
          1|       0|       0|       9|      9
           |    0.00|    0.00|    6.00|   6.00
           |    0.00|    0.00|  100.00|
           |    0.00|    0.00|   18.00|
  ---------+--------+--------+--------+
          2|       0|      19|       0|     19
           |    0.00|   12.67|    0.00|  12.67
           |    0.00|  100.00|    0.00|
           |    0.00|   38.00|    0.00|
  ---------+--------+--------+--------+
          3|       0|      18|       0|     18
           |    0.00|   12.00|    0.00|  12.00
           |    0.00|  100.00|    0.00|
           |    0.00|   36.00|    0.00|
  ---------+--------+--------+--------+
          4|       0|       3|       1|      4
           |    0.00|    2.00|    0.67|   2.67
           |    0.00|   75.00|   25.00|
           |    0.00|    6.00|    2.00|
  ---------+--------+--------+--------+
          5|       0|       0|       3|      3
           |    0.00|    0.00|    2.00|   2.00
           |    0.00|    0.00|  100.00|
           |    0.00|    0.00|    6.00|
  ---------+--------+--------+--------+
          6|       0|       7|       0|      7
           |    0.00|    4.67|    0.00|   4.67
           |    0.00|  100.00|    0.00|
           |    0.00|   14.00|    0.00|
  ---------+--------+--------+--------+
          7|      17|       0|       0|     17
           |   11.33|    0.00|    0.00|  11.33
           |  100.00|    0.00|    0.00|
           |   34.00|    0.00|    0.00|
  ---------+--------+--------+--------+
          8|       0|       3|      15|     18
           |    0.00|    2.00|   10.00|  12.00
           |    0.00|   16.67|   83.33|
           |    0.00|    6.00|   30.00|
  ---------+--------+--------+--------+
          9|       0|       0|      22|     22
           |    0.00|    0.00|   14.67|  14.67
           |    0.00|    0.00|  100.00|
           |    0.00|    0.00|   44.00|
  ---------+--------+--------+--------+
         10|      33|       0|       0|     33
           |   22.00|    0.00|    0.00|  22.00
           |  100.00|    0.00|    0.00|
           |   66.00|    0.00|    0.00|
  ---------+--------+--------+--------+
  Total          50       50       50     150
              33.33    33.33    33.33  100.00
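
Before clustering the preliminary clusters, it can be helpful to look at the cluster-level statistics that PROC FASTCLUS wrote to the MEAN= data set, since PROC CLUSTER reads _FREQ_ and _RMSSTD_ from it. The following statements are a minimal sketch and are not part of the original example. They assume the MEAN= data set was named MEAN and the cluster variable PRECLUS, as in the PROC FASTCLUS step of this example.

```sas
/* Sketch only: list the ten preliminary clusters with the        */
/* frequency and RMS standard deviation that PROC CLUSTER uses.   */
/* Assumes MEAN=MEAN and CLUSTER=PRECLUS from the FASTCLUS step.  */
proc print data=mean;
   var preclus _freq_ _rmsstd_ petal: sepal:;
run;
```

The _FREQ_ values should agree with the Frequency column of the Cluster Summary table.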

The following macro, CLUS, clusters the preliminary clusters. Its single argument supplies the METHOD= specification to be used by PROC CLUSTER. The TREE procedure creates an output data set containing the 3-cluster partition, which is sorted and merged with the OUT= data set from PROC FASTCLUS to determine to which cluster each of the original 150 observations belongs. The SHOW macro is then used to display the results. In this example, the CLUS macro is invoked using Ward's method, which produces 16 misclassifications, and Wong's hybrid method, which produces 22 misclassifications. The results are shown in Output 23.3.4 and Output 23.3.5.

%macro clus(method);
   proc cluster data=mean method=&method ccc pseudo;
      var petal: sepal:;
      copy preclus;
   run;
   proc tree noprint ncl=3 out=out;
      copy petal: sepal: preclus;
   run;
   proc sort data=out;
      by preclus;
   run;
   data clus;
      merge out prelim;
      by preclus;
   run;
   %show;
%mend;

title2 'Clustering Clusters by Ward''s Method';
%clus(ward);

title2 'Clustering Clusters by Wong''s Hybrid Method';
%clus(twostage hybrid);

  Cluster Analysis of Fisher (1936) Iris Data

  The FREQ Procedure

  Table of CLUSTER by Species

  CLUSTER     Species

  Frequency|
  Percent  |
  Row Pct  |
  Col Pct  |Setosa  |Versicol|Virginic|  Total
           |        |or      |a       |
  ---------+--------+--------+--------+
          1|       0|      50|      16|     66
           |    0.00|   33.33|   10.67|  44.00
           |    0.00|   75.76|   24.24|
           |    0.00|  100.00|   32.00|
  ---------+--------+--------+--------+
          2|       0|       0|      34|     34
           |    0.00|    0.00|   22.67|  22.67
           |    0.00|    0.00|  100.00|
           |    0.00|    0.00|   68.00|
  ---------+--------+--------+--------+
          3|      50|       0|       0|     50
           |   33.33|    0.00|    0.00|  33.33
           |  100.00|    0.00|    0.00|
           |  100.00|    0.00|    0.00|
  ---------+--------+--------+--------+
  Total          50       50       50     150
              33.33    33.33    33.33  100.00

Output 23.3.5: Clustering Clusters: PROC CLUSTER with Wong's Hybrid Method

  Cluster Analysis of Fisher (1936) Iris Data
  Clustering Clusters by Wong's Hybrid Method

  The CLUSTER Procedure
  Two-Stage Density Linkage Clustering

  Eigenvalues of the Covariance Matrix

            Eigenvalue    Difference    Proportion    Cumulative
     1      416.976349    398.666421        0.9501        0.9501
     2       18.309928     14.952922        0.0417        0.9918
     3        3.357006      3.126943        0.0076        0.9995
     4        0.230063                      0.0005        1.0000

  Root-Mean-Square Total-Sample Standard Deviation = 10.69224

  Cluster History

                                                                      Normalized    Maximum Density
                                                                          Fusion    in Each Cluster
  NCL  --Clusters Joined--  FREQ    SPRSQ    RSQ   ERSQ    CCC    PSF   PST2     Density     Lesser    Greater  Tie
    9  OB10       OB7         50   0.0104   .949   .932   3.81    330   42.2       40.24    58.2179      100.0
    8  OB3        OB8         36   0.0074   .942   .926   3.22    329   26.0      27.981    39.4511    48.4350
    7  OB2        OB4         23   0.0019   .940   .918   4.24    373    6.3      23.775     8.9675    46.3026
    6  CL8        OB9         58   0.0194   .921   .907   2.13    334   46.3      20.724    46.8846    48.4350
    5  CL7        OB6         30   0.0069   .914   .892   3.09    383   19.5      13.303    17.6360    46.3026
    4  CL6        OB1         67   0.0292   .884   .870   1.21    372   41.0      8.4137    10.8758    48.4350
    3  CL4        OB5         70   0.0138   .871   .824   3.33    494   12.3      5.1855     6.2890    48.4350
    2  CL3        CL5        100   0.0979   .773   .695   3.94    503   89.5      19.513    46.3026    48.4350
    1  CL2        CL9        150   0.7726   .000   .000   0.00      .    503      1.3337    48.4350      100.0

  3 modal clusters have been formed.

  Cluster Analysis of Fisher (1936) Iris Data

  The FREQ Procedure

  Table of CLUSTER by Species

  CLUSTER     Species

  Frequency|
  Percent  |
  Row Pct  |
  Col Pct  |Setosa  |Versicol|Virginic|  Total
           |        |or      |a       |
  ---------+--------+--------+--------+
          1|      50|       0|       0|     50
           |   33.33|    0.00|    0.00|  33.33
           |  100.00|    0.00|    0.00|
           |  100.00|    0.00|    0.00|
  ---------+--------+--------+--------+
          2|       0|      21|      49|     70
           |    0.00|   14.00|   32.67|  46.67
           |    0.00|   30.00|   70.00|
           |    0.00|   42.00|   98.00|
  ---------+--------+--------+--------+
          3|       0|      29|       1|     30
           |    0.00|   19.33|    0.67|  20.00
           |    0.00|   96.67|    3.33|
           |    0.00|   58.00|    2.00|
  ---------+--------+--------+--------+
  Total          50       50       50     150
              33.33    33.33    33.33  100.00

Example 23.4. Evaluating the Effects of Ties

If, at some level of the cluster history, there is a tie for minimum distance between clusters, then one or more levels of the sample cluster tree are not uniquely determined. This example shows how the degree of indeterminacy can be assessed.

Mammals have four kinds of teeth: incisors, canines, premolars, and molars. The following data set gives the number of teeth of each kind on one side of the top and bottom jaws for 32 mammals.

Since all eight variables are measured in the same units, it is not strictly necessary to rescale the data. However, the canines have much less variance than the other kinds of teeth and, therefore, have little effect on the analysis if the variables are not standardized. An average linkage cluster analysis is run with and without standardization to allow comparison of the results. The results are shown in Output 23.4.1 and Output 23.4.2.

title 'Hierarchical Cluster Analysis of Mammals'' Teeth Data';
title2 'Evaluating the Effects of Ties';
data teeth;
   input mammal $ 1-16
         @21 (v1-v8) (1.);
   label v1='Top incisors'
         v2='Bottom incisors'
         v3='Top canines'
         v4='Bottom canines'
         v5='Top premolars'
         v6='Bottom premolars'
         v7='Top molars'
         v8='Bottom molars';
   datalines;
BROWN BAT           23113333
MOLE                32103333
SILVER HAIR BAT     23112333
PIGMY BAT           23112233
HOUSE BAT           23111233
RED BAT             13112233
PIKA                21002233
RABBIT              21003233
BEAVER              11002133
GROUNDHOG           11002133
GRAY SQUIRREL       11001133
HOUSE MOUSE         11000033
PORCUPINE           11001133
WOLF                33114423
BEAR                33114423
RACCOON             33114432
MARTEN              33114412
WEASEL              33113312
WOLVERINE           33114412
BADGER              33113312
RIVER OTTER         33114312
SEA OTTER           32113312
JAGUAR              33113211
COUGAR              33113211
FUR SEAL            32114411
SEA LION            32114411
GREY SEAL           32113322
ELEPHANT SEAL       21114411
REINDEER            04103333
ELK                 04103333
DEER                04003333
MOOSE               04003333
;

proc cluster data=teeth method=average nonorm
             outtree=_null_;
   var v1-v8;
   id mammal;
   title3 'Raw Data';
run;

proc cluster data=teeth std method=average nonorm
             outtree=_null_;
   var v1-v8;
   id mammal;
   title3 'Standardized Data';
run;

Output 23.4.1: Average Linkage Analysis of Mammals' Teeth Data: Raw Data

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  Evaluating the Effects of Ties
  Raw Data

  The CLUSTER Procedure
  Average Linkage Cluster Analysis

  Eigenvalues of the Covariance Matrix

            Eigenvalue    Difference    Proportion    Cumulative
     1      3.76799365    2.33557185        0.5840        0.5840
     2      1.43242180    0.91781899        0.2220        0.8061
     3      0.51460281    0.08414950        0.0798        0.8858
     4      0.43045331    0.30021485        0.0667        0.9525
     5      0.13023846    0.03814626        0.0202        0.9727
     6      0.09209220    0.04216914        0.0143        0.9870
     7      0.04992305    0.01603541        0.0077        0.9947
     8      0.03388764                      0.0053        1.0000

  Root-Mean-Square Total-Sample Standard Deviation = 0.898027

  Cluster History

  NCL  ----------Clusters Joined-----------     FREQ  RMS Dist  Tie
   31  BEAVER              GROUNDHOG               2         0    T
   30  GRAY SQUIRREL       PORCUPINE               2         0    T
   29  WOLF                BEAR                    2         0    T
   28  MARTEN              WOLVERINE               2         0    T
   27  WEASEL              BADGER                  2         0    T
   26  JAGUAR              COUGAR                  2         0    T
   25  FUR SEAL            SEA LION                2         0    T
   24  REINDEER            ELK                     2         0    T
   23  DEER                MOOSE                   2         0
   22  BROWN BAT           SILVER HAIR BAT         2         1    T
   21  PIGMY BAT           HOUSE BAT               2         1    T
   20  PIKA                RABBIT                  2         1    T
   19  CL31                CL30                    4         1    T
   18  CL28                RIVER OTTER             3         1    T
   17  CL27                SEA OTTER               3         1    T
   16  CL24                CL23                    4         1
   15  CL21                RED BAT                 3    1.2247
   14  CL17                GREY SEAL               4     1.291
   13  CL29                RACCOON                 3    1.4142    T
   12  CL25                ELEPHANT SEAL           3    1.4142
   11  CL18                CL14                    7    1.5546
   10  CL22                CL15                    5    1.5811
    9  CL20                CL19                    6    1.8708    T
    8  CL11                CL26                    9    1.9272
    7  CL8                 CL12                   12    2.2278
    6  MOLE                CL13                    4    2.2361
    5  CL9                 HOUSE MOUSE             7    2.4833
    4  CL6                 CL7                    16    2.5658
    3  CL10                CL16                    9    2.8107
    2  CL3                 CL5                    16    3.7054
    1  CL2                 CL4                    32    4.2939

Output 23.4.2: Average Linkage Analysis of Mammals' Teeth Data: Standardized Data

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  Evaluating the Effects of Ties
  Standardized Data

  The CLUSTER Procedure
  Average Linkage Cluster Analysis

  Eigenvalues of the Correlation Matrix

            Eigenvalue    Difference    Proportion    Cumulative
     1      4.74153902    3.27458808        0.5927        0.5927
     2      1.46695094    0.70824118        0.1834        0.7761
     3      0.75870977    0.25146252        0.0948        0.8709
     4      0.50724724    0.30264737        0.0634        0.9343
     5      0.20459987    0.05925818        0.0256        0.9599
     6      0.14534169    0.03450100        0.0182        0.9780
     7      0.11084070    0.04606994        0.0139        0.9919
     8      0.06477076                      0.0081        1.0000

  The data have been standardized to mean 0 and variance 1
  Root-Mean-Square Total-Sample Standard Deviation =        1

  Cluster History

  NCL  ----------Clusters Joined-----------     FREQ  RMS Dist  Tie
   31  BEAVER              GROUNDHOG               2         0    T
   30  GRAY SQUIRREL       PORCUPINE               2         0    T
   29  WOLF                BEAR                    2         0    T
   28  MARTEN              WOLVERINE               2         0    T
   27  WEASEL              BADGER                  2         0    T
   26  JAGUAR              COUGAR                  2         0    T
   25  FUR SEAL            SEA LION                2         0    T
   24  REINDEER            ELK                     2         0    T
   23  DEER                MOOSE                   2         0
   22  PIGMY BAT           RED BAT                 2    0.9157
   21  CL28                RIVER OTTER             3    0.9169
   20  CL31                CL30                    4    0.9428    T
   19  BROWN BAT           SILVER HAIR BAT         2    0.9428    T
   18  PIKA                RABBIT                  2    0.9428
   17  CL27                SEA OTTER               3    0.9847
   16  CL22                HOUSE BAT               3    1.1437
   15  CL21                CL17                    6    1.3314
   14  CL25                ELEPHANT SEAL           3    1.3447
   13  CL19                CL16                    5    1.4688
   12  CL15                GREY SEAL               7    1.6314
   11  CL29                RACCOON                 3     1.692
   10  CL18                CL20                    6    1.7357
    9  CL12                CL26                    9    2.0285
    8  CL24                CL23                    4    2.1891
    7  CL9                 CL14                   12    2.2674
    6  CL10                HOUSE MOUSE             7     2.317
    5  CL11                CL7                    15    2.6484
    4  CL13                MOLE                    6    2.8624
    3  CL4                 CL8                    10    3.5194
    2  CL3                 CL6                    17    4.1265
    1  CL2                 CL5                    32    4.7753

There are ties at 16 levels for the raw data but at only 10 levels for the standardized data. There are more ties for the raw data because the increments between successive values are the same for all of the raw variables but different for the standardized variables.
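
One way to see why the raw variables invite ties is to tabulate the distinct distances among the 496 pairs of mammals: because every raw variable is a small integer count, only a limited set of distance values can occur. The following statements are a minimal sketch and are not part of the original example. The data set name PAIRS and the variable names MAMMAL1, MAMMAL2, and DIST are hypothetical, and the tabulation describes ties among observation-level distances only, not the T flags that PROC CLUSTER reports for later levels of the cluster history.

```sas
/* Sketch only: tabulate the distinct Euclidean distances among   */
/* all pairs of observations in the raw TEETH data.               */
/* PAIRS, MAMMAL1, MAMMAL2, and DIST are hypothetical names.      */
proc sql;
   create table pairs as
      select a.mammal as mammal1,
             b.mammal as mammal2,
             sqrt( (a.v1-b.v1)**2 + (a.v2-b.v2)**2
                 + (a.v3-b.v3)**2 + (a.v4-b.v4)**2
                 + (a.v5-b.v5)**2 + (a.v6-b.v6)**2
                 + (a.v7-b.v7)**2 + (a.v8-b.v8)**2 ) as dist
         from teeth as a, teeth as b
         where a.mammal < b.mammal;   /* keep each unordered pair once */
quit;

proc freq data=pairs;
   tables dist;   /* few distinct values means many potential ties */
run;
```

Running the same tabulation on standardized variables (for example, after PROC STANDARD with MEAN=0 and STD=1) would be expected to show more distinct values, which is consistent with the smaller number of ties reported for the standardized analysis.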

One way to assess the importance of the ties in the analysis is to repeat the analysis on several random permutations of the observations and then to see to what extent the results are consistent at the interesting levels of the cluster history. Three macros are presented to facilitate this process.

/* --------------------------------------------------------- */
/*                                                           */
/* The macro CLUSPERM randomly permutes observations and     */
/* does a cluster analysis for each permutation.             */
/* The arguments are as follows:                             */
/*                                                           */
/*    data    data set name                                  */
/*    var     list of variables to cluster                   */
/*    id      id variable for proc cluster                   */
/*    method  clustering method (and possibly other options) */
/*    nperm   number of random permutations.                 */
/*                                                           */
/* --------------------------------------------------------- */
%macro CLUSPERM(data,var,id,method,nperm);

   /* ------CREATE TEMPORARY DATA SET WITH RANDOM NUMBERS------ */
   data _temp_;
      set &data;
      array _random_ _ran_1-_ran_&nperm;
      do over _random_;
         _random_=ranuni(835297461);
      end;
   run;

   /* ------PERMUTE AND CLUSTER THE DATA----------------------- */
   %do n=1 %to &nperm;
      proc sort data=_temp_(keep=_ran_&n &var &id) out=_perm_;
         by _ran_&n;
      run;
      proc cluster method=&method noprint outtree=_tree_&n;
         var &var;
         id &id;
      run;
   %end;
%mend;

/* --------------------------------------------------------- */
/*                                                           */
/* The macro PLOTPERM plots various cluster statistics       */
/* against the number of clusters for each permutation.      */
/* The arguments are as follows:                             */
/*                                                           */
/*    stat    names of variables from tree data set          */
/*    nclus   maximum number of clusters to be plotted       */
/*    nperm   number of random permutations.                 */
/*                                                           */
/* --------------------------------------------------------- */
%macro PLOTPERM(stat,nclus,nperm);

   /* --CONCATENATE TREE DATA SETS FOR NCLUS OR FEWER CLUSTERS- */
   data _plot_;
      set %do n=1 %to &nperm; _tree_&n(in=_in_&n) %end; ;
      if _ncl_<=&nclus;
      %do n=1 %to &nperm;
         if _in_&n then _perm_=&n;
      %end;
      label _perm_='permutation number';
      keep _ncl_ &stat _perm_;
   run;

   /* ---PLOT THE REQUESTED STATISTICS BY NUMBER OF CLUSTERS--- */
   proc plot;
      plot (&stat)*_ncl_=_perm_ /vpos=26;
      title2 'Symbol is value of _PERM_';
   run;
%mend;

/* --------------------------------------------------------- */
/*                                                           */
/* The macro TREEPERM generates cluster-membership variables */
/* for a specified number of clusters for each permutation.  */
/* PROC PRINT lists the objects in each cluster-combination, */
/* and PROC TABULATE gives the frequencies and means. The    */
/* arguments are as follows:                                 */
/*                                                           */
/*    var     list of variables to cluster                   */
/*            (no "-" or ":" allowed)                        */
/*    id      id variable for proc cluster                   */
/*    meanfmt format for printing means in PROC TABULATE     */
/*    nclus   number of clusters desired                     */
/*    nperm   number of random permutations.                 */
/*                                                           */
/* --------------------------------------------------------- */
%macro TREEPERM(var,id,meanfmt,nclus,nperm);

   /* ------CREATE DATA SETS GIVING CLUSTER MEMBERSHIP--------- */
   %do n=1 %to &nperm;
      proc tree data=_tree_&n noprint n=&nclus
                out=_out_&n(drop=clusname
                            rename=(cluster=_clus_&n));
         copy &var;
         id &id;
      run;
      proc sort;
         by &id &var;
      run;
   %end;

   /* ------MERGE THE CLUSTER VARIABLES------------------------ */
   data _merge_;
      merge
         %do n=1 %to &nperm;
            _out_&n
         %end; ;
      by &id &var;
      length all_clus $ %eval(3*&nperm);
      %do n=1 %to &nperm;
         substr(all_clus, %eval(1+(&n-1)*3), 3) =
            put(_clus_&n, 3.);
      %end;
   run;

   /* ------PRINT AND TABULATE CLUSTER COMBINATIONS------------ */
   proc sort;
      by _clus_:;
   run;
   proc print;
      var &var;
      id &id;
      by all_clus notsorted;
   run;
   proc tabulate order=data formchar='           ';
      class all_clus;
      var &var;
      table all_clus, n='FREQ'*f=5. mean*f=&meanfmt*(&var) /
            rts=%eval(&nperm*3+1);
   run;
%mend;

To use these macros, it is convenient first to define a macro variable, VLIST, listing the teeth variables, since the forms V1-V8 or V: cannot be used with the TABULATE procedure in the TREEPERM macro:

/* -TABULATE does not accept hyphens or colons in VAR lists- */
%let vlist=v1 v2 v3 v4 v5 v6 v7 v8;

The CLUSPERM macro is then called to analyze ten random permutations. The PLOTPERM macro plots the pseudo F and t² statistics and the cubic clustering criterion. Since the data are discrete, the pseudo F statistic and the cubic clustering criterion can be expected to increase as the number of clusters increases, so local maxima or large jumps in these statistics are more relevant than the global maximum in determining the number of clusters. For the raw data, only the pseudo t² statistic indicates the possible presence of clusters, with the 4-cluster level being suggested. Hence, the TREEPERM macro is used to analyze the results at the 4-cluster level:

title3 'Raw Data';

/* ------CLUSTER RAW DATA WITH AVERAGE LINKAGE-------------- */
%clusperm(teeth, &vlist, mammal, average, 10);

/* -----PLOT STATISTICS FOR THE LAST 20 LEVELS-------------- */
%plotperm(_psf_ _pst2_ _ccc_, 20, 10);

/* ------ANALYZE THE 4-CLUSTER LEVEL------------------------ */
%treeperm(&vlist, mammal, 9.1, 4, 10);

The results are shown in Output 23.4.3.

Output 23.4.3: Analysis of Ten Random Permutations of Raw Mammals' Teeth Data: Indeterminacy at the 4-Cluster Level

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  Symbol is value of _PERM_

  Plot of _PSF_*_NCL_.  Symbol is value of _perm_.

  [Line-printer plot not reproduced. Vertical axis: Pseudo F Statistic, tick marks from 20 to 100. Horizontal axis: Number of Clusters, 1 to 20.]

  NOTE: 10 obs had missing values.  151 obs hidden.

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  Symbol is value of _PERM_

  Plot of _PST2_*_NCL_.  Symbol is value of _perm_.

  [Line-printer plot not reproduced. Vertical axis: Pseudo T-Squared Statistic, tick marks from 0 to 30. Horizontal axis: Number of Clusters, 1 to 20.]

  NOTE: 69 obs had missing values.  104 obs hidden.

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  [Plot of _CCC_*_NCL_: cubic clustering criterion (vertical axis) against number of clusters (horizontal axis, 1 to 20). Symbol is value of _PERM_. NOTE: 140 obs had missing values. 50 obs hidden.]

  -------------------------------------- all_clus='  1  3  1  1  1  3  3  3  2  3' ---------------------------------------

  mammal      v1    v2    v3    v4    v5    v6    v7    v8
  DEER         0     4     0     0     3     3     3     3
  ELK          0     4     1     0     3     3     3     3
  MOOSE        0     4     0     0     3     3     3     3
  REINDEER     0     4     1     0     3     3     3     3

  -------------------------------------- all_clus='  2  2  2  2  2  2  1  2  1  1' ---------------------------------------

  mammal           v1    v2    v3    v4    v5    v6    v7    v8
  BADGER            3     3     1     1     3     3     1     2
  BEAR              3     3     1     1     4     4     2     3
  COUGAR            3     3     1     1     3     2     1     1
  ELEPHANT SEAL     2     1     1     1     4     4     1     1
  FUR SEAL          3     2     1     1     4     4     1     1
  GREY SEAL         3     2     1     1     3     3     2     2
  JAGUAR            3     3     1     1     3     2     1     1
  MARTEN            3     3     1     1     4     4     1     2
  RACCOON           3     3     1     1     4     4     3     2
  RIVER OTTER       3     3     1     1     4     3     1     2
  SEA LION          3     2     1     1     4     4     1     1
  SEA OTTER         3     2     1     1     3     3     1     2
  WEASEL            3     3     1     1     3     3     1     2
  WOLF              3     3     1     1     4     4     2     3
  WOLVERINE         3     3     1     1     4     4     1     2

  -------------------------------------- all_clus='  2  4  2  2  4  2  1  2  1  1' ---------------------------------------

  mammal    v1    v2    v3    v4    v5    v6    v7    v8
  MOLE      3     2     1     0     3     3     3     3

  -------------------------------------- all_clus='  3  1  3  3  3  1  2  1  3  2' ---------------------------------------

  mammal           v1    v2    v3    v4    v5    v6    v7    v8
  BEAVER            1     1     0     0     2     1     3     3
  GRAY SQUIRREL     1     1     0     0     1     1     3     3
  GROUNDHOG         1     1     0     0     2     1     3     3
  HOUSE MOUSE       1     1     0     0     0     0     3     3
  PORCUPINE         1     1     0     0     1     1     3     3

  -------------------------------------- all_clus='  3  4  3  3  4  1  2  1  3  2' ---------------------------------------

  mammal    v1    v2    v3    v4    v5    v6    v7    v8
  PIKA       2     1     0     0     2     2     3     3
  RABBIT     2     1     0     0     3     2     3     3

  -------------------------------------- all_clus='  4  4  4  4  4  4  4  4  4  4' ---------------------------------------

  mammal             v1    v2    v3    v4    v5    v6    v7    v8
  BROWN BAT           2     3     1     1     3     3     3     3
  HOUSE BAT           2     3     1     1     1     2     3     3
  PIGMY BAT           2     3     1     1     2     2     3     3
  RED BAT             1     3     1     1     2     2     3     3
  SILVER HAIR BAT     2     3     1     1     2     3     3     3

                                                                    Mean
                                            Top    Bottom       Top    Bottom       Top    Bottom       Top    Bottom
  all_clus                       FREQ  incisors  incisors   canines   canines premolars premolars    molars    molars
  1  3  1  1  1  3  3  3  2  3      4       0.0       4.0       0.5       0.0       3.0       3.0       3.0       3.0
  2  2  2  2  2  2  1  2  1  1     15       2.9       2.6       1.0       1.0       3.6       3.4       1.3       1.8
  2  4  2  2  4  2  1  2  1  1      1       3.0       2.0       1.0       0.0       3.0       3.0       3.0       3.0
  3  1  3  3  3  1  2  1  3  2      5       1.0       1.0       0.0       0.0       1.2       0.8       3.0       3.0
  3  4  3  3  4  1  2  1  3  2      2       2.0       1.0       0.0       0.0       2.5       2.0       3.0       3.0
  4  4  4  4  4  4  4  4  4  4      5       1.8       3.0       1.0       1.0       2.0       2.4       3.0       3.0

From the TABULATE and PRINT output, you can see that two types of clustering are obtained. In one case, the mole is grouped with the carnivores, while the pika and rabbit are grouped with the rodents. In the other case, both the mole and the lagomorphs are grouped with the bats.

Next, the analysis is repeated with the standardized data. The pseudo F and t² statistics indicate 3 or 4 clusters, while the cubic clustering criterion shows a sharp rise up to 4 clusters and then levels off up to 6 clusters. So the TREEPERM macro is used again at the 4-cluster level. In this case, there is no indeterminacy, as the same four clusters are obtained with every permutation, although in different orders. It must be emphasized, however, that lack of indeterminacy in no way indicates validity. The results are shown in Output 23.4.4.
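The remark that the same clusters are obtained "although in different orders" reflects the fact that cluster numbers are arbitrary labels. The following is a minimal Python sketch (not SAS, and not part of the original example) of one way to test whether two labelings describe the same partition. The function name `same_partition` and the toy label lists are hypothetical.

```python
def same_partition(labels_a, labels_b):
    """Return True if two label sequences group the items identically,
    ignoring how the clusters happen to be numbered."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b = {}
    b_to_a = {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must correspond to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Toy data: five items clustered in three runs.
run1 = [1, 1, 2, 3, 3]
run2 = [3, 3, 1, 2, 2]   # same groups as run1, numbered differently
run3 = [1, 2, 2, 3, 3]   # second item has moved to another group

print(same_partition(run1, run2))
print(same_partition(run1, run3))
```

Under this check, `run1` and `run2` count as the same clustering, while `run3` differs because one item changes groups.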

Output 23.4.4: Analysis of Ten Random Permutations of Standardized Mammals' Teeth Data: No Indeterminacy at the 4-Cluster Level

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  [Plot of _PSF_*_NCL_: pseudo F statistic (vertical axis) against number of clusters (horizontal axis, 1 to 20). Symbol is value of _PERM_. NOTE: 10 obs had missing values. 171 obs hidden.]

  title3 'Standardized Data';

  /*------CLUSTER STANDARDIZED DATA WITH AVERAGE LINKAGE------*/
  %clusperm(teeth, &vlist, mammal, average std, 10);

  /*------PLOT STATISTICS FOR THE LAST 20 LEVELS--------------*/
  %plotperm(_psf_ _pst2_ _ccc_, 20, 10);

  /*------ANALYZE THE 4-CLUSTER LEVEL-------------------------*/
  %treeperm(&vlist, mammal, 9.1, 4, 10);

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  [Plot of _PST2_*_NCL_: pseudo t-squared statistic (vertical axis) against number of clusters (horizontal axis, 1 to 20). Symbol is value of _PERM_. NOTE: 70 obs had missing values. 117 obs hidden.]

  Hierarchical Cluster Analysis of Mammals' Teeth Data
  [Plot of _CCC_*_NCL_: cubic clustering criterion (vertical axis) against number of clusters (horizontal axis, 1 to 20). Symbol is value of _PERM_. NOTE: 140 obs had missing values. 54 obs hidden.]

  -------------------------------------- all_clus='  1  3  1  1  1  3  3  3  2  3' ---------------------------------------

  mammal      v1    v2    v3    v4    v5    v6    v7    v8
  DEER         0     4     0     0     3     3     3     3
  ELK          0     4     1     0     3     3     3     3
  MOOSE        0     4     0     0     3     3     3     3
  REINDEER     0     4     1     0     3     3     3     3

  -------------------------------------- all_clus='  2  2  2  2  2  2  1  2  1  1' ---------------------------------------

  mammal           v1    v2    v3    v4    v5    v6    v7    v8
  BADGER            3     3     1     1     3     3     1     2
  BEAR              3     3     1     1     4     4     2     3
  COUGAR            3     3     1     1     3     2     1     1
  ELEPHANT SEAL     2     1     1     1     4     4     1     1
  FUR SEAL          3     2     1     1     4     4     1     1
  GREY SEAL         3     2     1     1     3     3     2     2
  JAGUAR            3     3     1     1     3     2     1     1
  MARTEN            3     3     1     1     4     4     1     2
  RACCOON           3     3     1     1     4     4     3     2
  RIVER OTTER       3     3     1     1     4     3     1     2
  SEA LION          3     2     1     1     4     4     1     1
  SEA OTTER         3     2     1     1     3     3     1     2
  WEASEL            3     3     1     1     3     3     1     2
  WOLF              3     3     1     1     4     4     2     3
  WOLVERINE         3     3     1     1     4     4     1     2

  -------------------------------------- all_clus='  3  1  3  3  3  1  2  1  3  2' ---------------------------------------

  mammal           v1    v2    v3    v4    v5    v6    v7    v8
  BEAVER            1     1     0     0     2     1     3     3
  GRAY SQUIRREL     1     1     0     0     1     1     3     3
  GROUNDHOG         1     1     0     0     2     1     3     3
  HOUSE MOUSE       1     1     0     0     0     0     3     3
  PIKA              2     1     0     0     2     2     3     3
  PORCUPINE         1     1     0     0     1     1     3     3
  RABBIT            2     1     0     0     3     2     3     3

  -------------------------------------- all_clus='  4  4  4  4  4  4  4  4  4  4' ---------------------------------------

  mammal             v1    v2    v3    v4    v5    v6    v7    v8
  BROWN BAT           2     3     1     1     3     3     3     3
  HOUSE BAT           2     3     1     1     1     2     3     3
  MOLE                3     2     1     0     3     3     3     3
  PIGMY BAT           2     3     1     1     2     2     3     3
  RED BAT             1     3     1     1     2     2     3     3
  SILVER HAIR BAT     2     3     1     1     2     3     3     3

                                                                    Mean
                                            Top    Bottom       Top    Bottom       Top    Bottom       Top    Bottom
  all_clus                       FREQ  incisors  incisors   canines   canines premolars premolars    molars    molars
  1  3  1  1  1  3  3  3  2  3      4       0.0       4.0       0.5       0.0       3.0       3.0       3.0       3.0
  2  2  2  2  2  2  1  2  1  1     15       2.9       2.6       1.0       1.0       3.6       3.4       1.3       1.8
  3  1  3  3  3  1  2  1  3  2      7       1.3       1.0       0.0       0.0       1.6       1.1       3.0       3.0
  4  4  4  4  4  4  4  4  4  4      6       2.0       2.8       1.0       0.8       2.2       2.5       3.0       3.0

Example 23.5. Computing a Distance Matrix

An example of the use of distance and similarity measures in cluster analysis is given in Example 26.1 in the PROC DISTANCE chapter.

Example 23.6. Size, Shape, and Correlation

The following example shows the analysis of a data set in which size information is detrimental to the classification. Imagine that an archaeologist of the future is excavating a 20th century grocery store. The archaeologist has discovered a large number of boxes of various sizes, shapes, and colors and wants to do a preliminary classification based on simple external measurements: height, width, depth, weight, and the predominant color of the box. It is known that a given product may have been sold in packages of different sizes, so the archaeologist wants to remove the effect of size from the classification. It is not known whether color is relevant to the use of the products, so the analysis should be done both with and without color information.

Unknown to the archaeologist, the boxes actually fall into six general categories according to the use of the product: breakfast cereals, crackers, laundry detergents, Little Debbie snacks, tea, and toothpaste. These categories are shown in the analysis so that you can evaluate the effectiveness of the classification.

Since there is no reason for the archaeologist to assume that the true categories have equal sample sizes or variances, the centroid method is used to avoid undue bias. Each analysis is done with Euclidean distances after suitable transformations of the data. Color is coded as five dummy variables with values of 0 or 1. The DATA step is as follows:

  options ls=120;
  title 'Cluster Analysis of Grocery Boxes';

  data grocery2;
     length name $35    /* name of product */
            class $16   /* category of product */
            unit $1     /* unit of measurement for weights:
                              g=gram
                              o=ounce
                              l=lb
                           all weights are converted to grams */
            color $1    /* predominant color of box */
            height 8    /* height of box in cm. */
            width 8     /* width of box in cm. */
            depth 8     /* depth of box (front to back) in cm. */
            weight 8    /* weight of box in grams */
            c_white c_yellow c_red c_green c_blue 4;
                        /* dummy variables */
     retain class;
     drop unit;

     /*--- read name with possible embedded blanks ---*/
     input name & @;

     /*--- if name starts with "---",              ---*/
     /*--- it's really a category value            ---*/
     if substr(name,1,3) = '---' then do;
        class = substr(name,4,index(substr(name,4),'-')-1);
        delete;
        return;
     end;

     /*--- read the rest of the variables ---*/
     input height width depth weight unit color;

     /*--- convert weights to grams ---*/
     select (unit);
        when ('l') weight = weight * 454;
        when ('o') weight = weight * 28.3;
        when ('g') ;
        otherwise put 'Invalid unit ' unit;
     end;

     /*--- use 0/1 coding for dummy variables for colors ---*/
     c_white  = (color = 'w');
     c_yellow = (color = 'y');
     c_red    = (color = 'r');
     c_green  = (color = 'g');
     c_blue   = (color = 'b');

     datalines;
  ---Breakfast cereals---
  Cheerios                            32.5 22.4  8.4  567 g y
  Cheerios                            30.3 20.4  7.2  425 g y
  Cheerios                            27.5 19    6.2  283 g y
  Cheerios                            24.1 17.2  5.3  198 g y
  Special K                           30.1 20.5  8.5   18 o w
  Special K                           29.6 19.2  6.7   12 o w
  Special K                           23.4 16.6  5.7    7 o w
  Corn Flakes                         33.7 25.4  8     24 o w
  Corn Flakes                         30.2 20.6  8.4   18 o w
  Corn Flakes                         30   19.1  6.6   12 o w
  Grape Nuts                          21.7 16.3  4.9  680 g w
  Shredded Wheat                      19.7 19.9  7.5  283 g y
  Shredded Wheat, Spoon Size          26.6 19.6  5.6  510 g r
  All-Bran                            21.1 14.3  5.2 13.8 o y
  Froot Loops                         30.2 20.8  8.5 19.7 o r
  Froot Loops                         25   17.7  6.4   11 o r
  ---Crackers---
  Wheatsworth                         11.1 25.2  5.5  326 g w
  Ritz                                23.1 16    5.3  340 g r
  Ritz                                23.1 20.7  5.2  454 g r
  Premium Saltines                    11   25   10.7  454 g w
  Waverly Wafers                      14.4 22.5  6.2  454 g g
  ---Detergent---
  Arm & Hammer Detergent              38.8 30   16.9   25 l y
  Arm & Hammer Detergent              39.5 25.8 11   14.2 l y
  Arm & Hammer Detergent              33.7 22.8  7      7 l y
  Arm & Hammer Detergent              27.8 19.4  6.3    4 l y
  Tide                                39.4 24.8 11.3  9.2 l r
  Tide                                32.5 23.2  7.3  4.5 l r
  Tide                                26.5 19.9  6.3   42 o r
  Tide                                19.3 14.6  4.7   17 o r
  ---Little Debbie---
  Figaroos                            13.5 18.6  3.7   12 o y
  Swiss Cake Rolls                    10.1 21.8  5.8   13 o w
  Fudge Brownies                      11   30.8  2.5   12 o w
  Marshmallow Supremes                 9.4 32    7     10 o w
  Apple Delights                      11.2 30.1  4.9   15 o w
  Snack Cakes                         13.4 32    3.4   13 o b
  Nutty Bar                           13.2 18.5  4.2   12 o y
  Lemon Stix                          13.2 18.5  4.2    9 o w
  Fudge Rounds                         8.1 28.3  5.4  9.5 o w
  ---Tea---
  Celestial Saesonings Mint Magic      7.8 13.8  6.3   49 g b
  Celestial Saesonings Cranberry Cove  7.8 13.8  6.3   46 g r
  Celestial Saesonings Sleepy Time     7.8 13.8  6.3   37 g g
  Celestial Saesonings Lemon Zinger    7.8 13.8  6.3   56 g y
  Bigelow Lemon Lift                   7.7 13.4  6.9   40 g y
  Bigelow Plantation Mint              7.7 13.4  6.9   35 g g
  Bigelow Earl Grey                    7.7 13.4  6.9   35 g b
  Luzianne                             8.9 22.8  6.4    6 o r
  Luzianne                            18.4 20.2  6.9    8 o r
  Luzianne Decaffeinated               8.9 22.8  6.4 5.25 o g
  Lipton Tea Bags                     17.1 20    6.7    8 o r
  Lipton Tea Bags                     11.5 14.4  6.6 3.75 o r
  Lipton Tea Bags                      6.7 10    5.7 1.25 o r
  Lipton Family Size Tea Bags         13.7 24    9     12 o r
  Lipton Family Size Tea Bags          8.7 20.8  8.2    6 o r
  Lipton Family Size Tea Bags          8.9 11.1  8.2    3 o r
  Lipton Loose Tea                    12.7 10.9  5.4    8 o r
  ---Paste, Tooth---
  Colgate                              4.4 22    3.5    7 o r
  Colgate                              3.6 15.6  3.3    3 o r
  Colgate                              4.2 18.3  3.5    5 o r
  Crest                                4.3 21.7  3.7  6.4 o w
  Crest                                4.3 17.4  3.6  4.6 o w
  Crest                                3.5 15.2  3.2  2.7 o w
  Crest                                3.0 10.9  2.8  .85 o w
  Arm & Hammer                         4.4 17    3.7    5 o w
  ;

  data grocery;
     length name ;
     set grocery2;
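As a cross-check on the two transformations in the DATA step (weight conversion and 0/1 color coding), here is a small Python sketch, not SAS and not part of the original example. The conversion factors and color codes are taken from the DATA step; the names `to_grams` and `color_dummies` are hypothetical, and unlike the DATA step, which only writes a message to the log, this sketch raises an error for an unknown unit.

```python
# Conversion factors to grams, as used in the SELECT block of the DATA step.
GRAMS_PER_UNIT = {"l": 454.0, "o": 28.3, "g": 1.0}

# Order of the dummy variables: c_white, c_yellow, c_red, c_green, c_blue.
COLOR_CODES = ["w", "y", "r", "g", "b"]


def to_grams(weight, unit):
    """Convert a weight given in pounds ('l'), ounces ('o'), or grams ('g')."""
    if unit not in GRAMS_PER_UNIT:
        raise ValueError(f"Invalid unit {unit!r}")
    return weight * GRAMS_PER_UNIT[unit]


def color_dummies(color):
    """Return the five 0/1 indicators for a one-letter color code."""
    return [int(color == code) for code in COLOR_CODES]


# Two records from the data lines: Special K (18 oz, white) and Tide (9.2 lb, red).
print(round(to_grams(18, "o"), 1), color_dummies("w"))
print(round(to_grams(9.2, "l"), 1), color_dummies("r"))
```

Each box gets exactly one indicator set to 1, so the five dummy variables always sum to 1 for a valid color code.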

The FORMAT procedure is used to define two formats that make the output easier to read. The STARS. format is used for graphical crosstabulations in the TABULATE procedure. The $COLOR. format displays the names of the colors instead of just the first letter.

  /*------ formats and macros for displaying ------ */
  /*------ cluster results                   ------ */
  proc format;
     value stars
        0='               '
        1='              #'
        2='             ##'
        3='            ###'
        4='           ####'
        5='          #####'
        6='         ######'
        7='        #######'
        8='       ########'
        9='      #########'
        10='     ##########'
        11='    ###########'
        12='   ############'
        13='  #############'
        14=' ##############'
        15-high='>##############';
  run;

  proc format;
     value $color
        'w'='White'
        'y'='Yellow'
        'r'='Red'
        'g'='Green'
        'b'='Blue';
  run;
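The STARS. format maps a count to a right-aligned bar of `#` characters in a 15-character field, with counts of 15 or more shown as `>` followed by 14 marks. A minimal Python sketch of the same mapping follows (not SAS; the function name `stars` is hypothetical, and nonnegative integer counts are assumed).

```python
def stars(count, width=15):
    """Render a nonnegative integer count as a right-aligned bar of '#'."""
    if count >= width:
        # Counts of 15 or more are flagged with '>' and capped at 14 marks.
        return ">" + "#" * (width - 1)
    return ("#" * count).rjust(width)


# Show a few counts between vertical bars so the padding is visible.
for n in (0, 3, 14, 20):
    print(f"{n:>2} |{stars(n)}|")
```

Right alignment means the bars in a crosstabulation column all end at the same position, so their lengths can be compared by eye.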

Since a full display of the results of each cluster analysis would be very long, a macro is used with five macro variables to select parts of the output. The macro variables are set to select only the PROC CLUSTER output and the crosstabulation of clusters and true categories for the first two analyses. The example could be run with different settings of the macro variables to show the full output or other selected parts.

  %let cluster=1;   /* 1=show CLUSTER output, 0=don't */
  %let tree=0;      /* 1=print TREE diagram, 0=don't */
  %let list=0;      /* 1=list clusters, 0=don't */
  %let crosstab=1;  /* 1=crosstabulate clusters and classes,
                       0=don't                              */
  %let crosscol=0;  /* 1=crosstabulate clusters and colors,
                       0=don't                              */

  /*--- define macro with options for TREE ---*/
  %macro treeopt;
     %if &tree %then h page=1;
     %else noprint;
  %mend;

  /*--- define macro with options for CLUSTER ---*/
  %macro clusopt;
     %if &cluster %then pseudo ccc p=20;
     %else noprint;
  %mend;

  /*------ macro for showing cluster results ------*/
  %macro show(n); /* n=number of clusters
                     to show results for */

     proc tree data=tree %treeopt n=&n out=out;
        id name;
        copy class height width depth weight color;
     run;

     %if &list %then %do;
        proc sort;
           by cluster;
        run;

        proc print;
           var class name height width depth weight color;
           by cluster clusname;
        run;
     %end;

     %if &crosstab %then %do;
        proc tabulate noseps /* formchar='           ' */;
           class class cluster;
           table cluster, class*n='   '*f=stars./rts=10 misstext=' ';
        run;
     %end;

     %if &crosscol %then %do;
        proc tabulate noseps /* formchar='           ' */;
           class color cluster;
           table cluster, color*n='   '*f=stars./rts=10 misstext=' ';
           format color $color.;
        run;
     %end;

  %mend;

The first analysis uses the variables height, width, depth, and weight in standardized form to show the effect of including size information. The CCC, pseudo F, and pseudo t² statistics indicate 10 clusters. Most of the clusters do not correspond closely to the true categories, and four of the clusters have only one or two observations.

  /**********************************************************/
  /*                                                        */
  /*       Analysis 1: standardized box measurements        */
  /*                                                        */
  /**********************************************************/

  title2 'Analysis 1: Standardized data';

  proc cluster data=grocery m=cen std %clusopt outtree=tree;
     var height width depth weight;
     id name;
     copy class color;
  run;

  %show(10);

Output 23.6.1: Analysis of Standardized Data

  Cluster Analysis of Grocery Boxes
  Analysis 1: Standardized data

  The CLUSTER Procedure
  Centroid Hierarchical Cluster Analysis

  Eigenvalues of the Correlation Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    2.44512438    1.64456210        0.6113        0.6113
  2    0.80056228    0.33149770        0.2001        0.8114
  3    0.46906458    0.18381582        0.1173        0.9287
  4    0.28524876                      0.0713        1.0000

  The data have been standardized to mean 0 and variance 1
  Root-Mean-Square Total-Sample Standard Deviation =        1
  Root-Mean-Square Distance Between Observations   = 2.828427

  Cluster History
                                                                                              Norm    T
                                                                                              Cent    i
  NCL    --Clusters Joined---      FREQ     SPRSQ     RSQ    ERSQ     CCC     PSF   PST2      Dist    e
   20    CL22        OB54            11    0.0028    .974    .        .      85.4    4.5    0.3073
   19    CL36        OB8              5    0.0026    .972    .        .      83.7   15.3    0.3146
   18    CL24        CL41            12    0.0080    .964    .        .      70.2   10.0    0.3316
   17    CL18        CL30            18    0.0144    .949    .        .      53.8   12.7    0.3343
   16    OB33        CL29             3    0.0024    .947    .        .      55.8    4.7    0.3363
   15    CL50        CL33             7    0.0055    .941    .        .      55.0   24.4     0.346
   14    CL46        CL15            10    0.0069    .934    .        .      53.7    8.1    0.3192
   13    CL27        OB53             6    0.0035    .931    .        .      56.1    6.3     0.362
   12    CL31        CL16             5    0.0075    .923    .861    8.03    55.8    6.6    0.4416
   11    CL19        CL23             7    0.0102    .913    .848    7.59    54.6   12.7    0.4713
   10    OB23        OB26             2    0.0037    .909    .835    8.36    59.1     .     0.4781
    9    CL11        CL17            25    0.0393    .870    .819    4.72    45.2   19.3    0.4918
    8    CL13        CL14            16    0.0329    .837    .801    2.95    40.4   23.7    0.5215
    7    CL8         CL20            27    0.0629    .774    .779    -.31    32.0   25.9    0.5467
    6    CL7         OB62            28    0.0112    .763    .752    0.61    36.7    2.4    0.6003
    5    CL9         CL6             53    0.1879    .575    .718    -5.9    19.6   43.4    0.6641
    4    CL5         CL21            55    0.0345    .541    .672    -5.2    23.2    4.5     0.745
    3    CL4         CL12            60    0.1137    .427    .602    -5.3    22.4   14.5    0.8769
    2    CL3         CL10            62    0.1511    .276    .471    -4.3    23.2   15.8    1.5559
    1    CL2         OB22            63    0.2759    .000    .000    0.00      .    23.2     2.948

  --------------------------------------------------------------------------------------------------------
                                                       class
           -----------------------------------------------------------------------------------------------
              Breakfast
               cereal         Crackers        Detergent     Little Debbie   Paste, Tooth         Tea
  --------+---------------+---------------+---------------+---------------+---------------+---------------
  CLUSTER
  1                                                                                            ###########
  2                                     ##                               #                             ###
  3                  #####                              ##
  4                                                                    ###         #######
  5            ###########              ##             ###                                              ##
  6                                                                  #####
  7                                      #                                                               #
  8                                                     ##
  9                                                                                      #
  10                                                     #
  --------------------------------------------------------------------------------------------------------
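The PSF column of the cluster history in Output 23.6.1 can be cross-checked against the RSQ column with the usual expression of the pseudo F statistic in terms of R², the number of clusters c, and the number of observations n: PSF = (R²/(c-1)) / ((1-R²)/(n-c)). The Python sketch below (not SAS, and not part of the original example; `pseudo_f` is a hypothetical helper name) applies it to three rows of the history with n = 63. Because RSQ is printed to only three decimals, the recomputed values are expected to agree with the printed PSF only approximately.

```python
def pseudo_f(r_squared, n_clusters, n_obs):
    """Pseudo F = (R^2 / (c - 1)) / ((1 - R^2) / (n - c))."""
    between = r_squared / (n_clusters - 1)
    within = (1.0 - r_squared) / (n_obs - n_clusters)
    return between / within


N_OBS = 63  # number of boxes in the grocery data

# (NCL, RSQ, printed PSF) taken from the cluster history of Analysis 1.
rows = [(5, 0.575, 19.6), (3, 0.427, 22.4), (2, 0.276, 23.2)]

for ncl, rsq, printed in rows:
    recomputed = pseudo_f(rsq, ncl, N_OBS)
    print(f"NCL={ncl}  recomputed={recomputed:.2f}  printed={printed}")
```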

The second analysis uses logarithms of height, width, depth, and the cube root of weight; the cube root is used for consistency with the linear measures. The rows are then centered to remove size information. Finally, the columns are standardized to have a standard deviation of 1. There is no compelling a priori reason to standardize the columns, but if they are not standardized, height dominates the analysis because of its large variance. The STANDARD procedure is used instead of the STD option in PROC CLUSTER so that a subsequent analysis can separately standardize the dummy variables for color.

  /**********************************************************/
  /*                                                        */
  /*    Analysis 2: standardized row-centered logarithms    */
  /*                                                        */
  /**********************************************************/

  title2 'Row-centered logarithms';

  data shape;
     set grocery;
     array x height width depth weight;
     array l l_height l_width l_depth l_weight;
                            /* logarithms */
     weight=weight**(1/3);  /* take cube root to conform with
                               the other linear measurements */
     do over l;             /* take logarithms */
        l=log(x);
     end;
     mean=mean(of l(*));    /* find row mean of logarithms */
     do over l;
        l=l-mean;           /* center row */
     end;
  run;

  title2 'Analysis 2: Standardized row-centered logarithms';

  proc standard data=shape out=shapstan m=0 s=1;
     var l_height l_width l_depth l_weight;
  run;

  proc cluster data=shapstan m=cen %clusopt outtree=tree;
     var l_height l_width l_depth l_weight;
     id name;
     copy class height width depth weight color;
  run;

  %show(8);
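To see why row-centering the logarithms removes size, consider two boxes of the same shape, one with every linear dimension doubled and therefore eight times the weight. The Python sketch below (not SAS; the function name `row_centered_logs` and the two boxes are made up for illustration) mirrors the cube root, logarithm, and row-centering steps of the DATA step above.

```python
import math


def row_centered_logs(height, width, depth, weight):
    """Log of each linear measure (weight via its cube root), minus the row mean."""
    values = [height, width, depth, weight ** (1.0 / 3.0)]
    logs = [math.log(v) for v in values]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]


# Hypothetical boxes: same shape, the second twice as large in every dimension.
small = row_centered_logs(20.0, 15.0, 5.0, 300.0)
large = row_centered_logs(40.0, 30.0, 10.0, 2400.0)

print([round(v, 4) for v in small])
print([round(v, 4) for v in large])
```

Doubling the size adds log 2 to every logarithm and to the row mean, so the centered values coincide. The centered values in each row also sum to zero, which is consistent with the zero fourth eigenvalue reported in Output 23.6.2.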

The results of the second analysis are shown for eight clusters. Clusters 1 through 4 correspond fairly well to tea, toothpaste, breakfast cereals, and detergents. Crackers and Little Debbie products are scattered among several clusters.

Output 23.6.2: Analysis of Standardized Row-Centered Logarithms

  Cluster Analysis of Grocery Boxes
  Analysis 2: Standardized row-centered logarithms

  The CLUSTER Procedure
  Centroid Hierarchical Cluster Analysis

  Eigenvalues of the Covariance Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    1.94931049    0.34845395        0.4873        0.4873
  2    1.60085654    1.15102358        0.4002        0.8875
  3    0.44983296    0.44983296        0.1125        1.0000
  4     .00000000                      0.0000        1.0000

  Root-Mean-Square Total-Sample Standard Deviation =        1
  Root-Mean-Square Distance Between Observations   = 2.828427

  Cluster History
                                                                                              Norm    T
                                                                                              Cent    i
  NCL    --Clusters Joined---      FREQ     SPRSQ     RSQ    ERSQ     CCC     PSF   PST2      Dist    e
   20    CL29        OB14             4    0.0017    .977    .        .      94.7    2.9    0.2658
   19    CL26        CL27             8    0.0045    .972    .        .      85.4    8.4    0.3047
   18    OB38        OB62             2    0.0016    .971    .        .      87.2     .     0.3193
   17    OB32        OB35             2    0.0018    .969    .        .      89.1     .     0.3331
   16    OB22        OB55             2    0.0019    .967    .        .      91.3     .     0.3434
   15    CL23        CL18             5    0.0050    .962    .        .      86.5    4.8    0.3587
   14    CL37        CL21             5    0.0051    .957    .        .      83.5   10.4    0.3613
   13    CL30        CL24             9    0.0068    .950    .        .      79.2   12.9    0.3682
   12    CL32        CL20            16    0.0142    .936    .892    5.75    67.6   29.3    0.3826
   11    CL22        OB34             4    0.0037    .932    .881    6.31    71.4    3.2    0.3901
   10    CL11        CL31             7    0.0090    .923    .869    6.17    70.8    6.3    0.4032
    9    CL33        CL13            11    0.0092    .914    .853    6.25    71.7    7.6    0.4181
    8    CL19        CL16            10    0.0131    .901    .835    6.12    71.4   10.9     0.503
    7    CL14        CL9             16    0.0297    .871    .813    4.63    63.1   15.6    0.5173
    6    CL10        CL15            12    0.0329    .838    .785    3.69    59.1   13.6    0.5916
    5    CL6         CL28            19    0.0557    .783    .748    2.01    52.2   15.8    0.6252
    4    CL12        CL8             26    0.0885    .694    .697    -.16    44.6   48.8    0.6679
    3    CL5         CL17            21    0.0459    .648    .617    1.21    55.3    7.4    0.8863
    2    CL4         CL7             42    0.2841    .364    .384    -.56    34.9   60.3    0.9429
    1    CL2         CL3             63    0.3640    .000    .000    0.00      .    34.9    0.8978

  --------------------------------------------------------------------------------------------------------
                                                       class
           -----------------------------------------------------------------------------------------------
              Breakfast
               cereal         Crackers        Detergent     Little Debbie   Paste, Tooth         Tea
  --------+---------------+---------------+---------------+---------------+---------------+---------------
  CLUSTER
  1                                      #                                                      ##########
  2                                                                                #######
  3         ##############              ##
  4                      #                        ########                                               #
  5                                                                     ##               #              ##
  6                      #                                                                            ####
  7                                     ##                           #####
  8                                                                     ##
  --------------------------------------------------------------------------------------------------------

The third analysis is similar to the second analysis except that the rows are standardized rather than just centered. There is a clear indication of seven clusters from the CCC, pseudo F, and pseudo t² statistics. The clusters are listed as well as crosstabulated with the true categories and colors.

  /**********************************************************/
  /*                                                        */
  /*   Analysis 3: standardized row-standardized logarithms */
  /*                                                        */
  /**********************************************************/
  %let list=1;
  %let crosscol=1;
  title2 'Row-standardized logarithms';
  data std;
     set grocery;
     array x height width depth weight;
     array l l_height l_width l_depth l_weight;
                           /* logarithms */
     weight=weight**(1/3); /* take cube root to conform with
                              the other linear measurements */
     do over l;
        l=log(x);          /* take logarithms */
     end;
     mean=mean(of l(*));   /* find row mean of logarithms */
     std=std(of l(*));     /* find row standard deviation */
     do over l;
        l=(l-mean)/std;    /* standardize row */
     end;
  run;
  title2 'Analysis 3: Standardized row-standardized logarithms';
  proc standard data=std out=stdstan m=0 s=1;
     var l_height l_width l_depth l_weight;
  run;
  proc cluster data=stdstan m=cen %clusopt outtree=tree;
     var l_height l_width l_depth l_weight;
     id name;
     copy class height width depth weight color;
  run;
  %show(7);
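The row standardization performed in the DATA step above can be traced by hand for a single box. The following is a minimal Python sketch of the same arithmetic, not SAS code. The function name `row_standardize_logs` is hypothetical, and the measurements are copied from observation 19 (Colgate) in the cluster listing of Output 23.6.3, on the assumption that the weight printed there is already the cube root. The sketch uses the n-1 standard deviation, which is what the SAS STD function computes; the column standardization done afterward by PROC STANDARD is not shown.

```python
import math
import statistics


def row_standardize_logs(height, width, depth, weight_cuberoot):
    """Log-transform one box's measurements, then center and scale the row."""
    logs = [math.log(v) for v in (height, width, depth, weight_cuberoot)]
    row_mean = statistics.mean(logs)
    row_std = statistics.stdev(logs)  # n-1 denominator, like the SAS STD function
    return [(x - row_mean) / row_std for x in logs]


# Observation 19 (Colgate) from the listing; weight assumed already cube-rooted.
z = row_standardize_logs(3.6, 15.6, 3.3, 4.39510)
print([round(v, 3) for v in z])
```

After this step every row has mean 0 and standard deviation 1, so the four transformed variables always sum to zero within a row. That linear dependency is consistent with the zero fourth eigenvalue reported in Output 23.6.3.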

The output from the third analysis shows that cluster 1 contains 9 of the 17 teas. Cluster 2 contains all of the detergents plus Grape Nuts, a very heavy cereal. Cluster 3 includes all of the toothpastes and one Little Debbie product that is of very similar shape, although roughly twice as large. Cluster 4 has most of the cereals, Ritz crackers (which come in a box very similar to most of the cereal boxes), and Lipton Loose Tea (all the other teas in the sample come in tea bags). Clusters 5 and 6 each contain several Luzianne and Lipton teas and one or two miscellaneous items. Cluster 7 includes most of the Little Debbie products and two types of crackers. Thus, the crackers are not identified and the teas are broken up into three clusters, but the other categories correspond to single clusters. This analysis classifies toothpaste and Little Debbie products slightly better than the second analysis.

Output 23.6.3: Analysis of Standardized Row-Standardized Logarithms

  Cluster Analysis of Grocery Boxes
  Analysis 3: Standardized row-standardized logarithms

  The CLUSTER Procedure
  Centroid Hierarchical Cluster Analysis

  Eigenvalues of the Covariance Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    2.42684848    0.94583675        0.6067        0.6067
  2    1.48101173    1.38887193        0.3703        0.9770
  3    0.09213980    0.09213980        0.0230        1.0000
  4    -.00000000                     -0.0000        1.0000

  Root-Mean-Square Total-Sample Standard Deviation =        1
  Root-Mean-Square Distance Between Observations   = 2.828427

  Cluster History

  NCL    --Clusters Joined---    FREQ     SPRSQ     RSQ    ERSQ      CCC     PSF    PST2    Norm Cent Dist    Tie
   20    CL35        CL33           8    0.0024    .990       .        .     229    32.0            0.1923
   19    CL22        OB19           5    0.0010    .989       .        .     224     2.9            0.2014
   18    CL44        CL27           6    0.0018    .987       .        .     206    20.5            0.2073
   17    CL18        CL26           9    0.0025    .985       .        .     187     6.4            0.1956
   16    OB38        OB62           2    0.0009    .984       .        .     192       .              0.24
   15    CL24        CL23           5    0.0029    .981       .        .     177     7.8            0.2753
   14    CL25        OB21           4    0.0021    .979       .        .     175     7.7            0.2917
   13    CL30        CL19          17    0.0101    .969       .        .     130    41.0            0.2974
   12    CL16        CL31           9    0.0049    .964    .932     5.49     124    20.5            0.3121
   11    CL21        OB52           4    0.0029    .961    .924     5.81     129     8.2            0.3445
   10    CL41        CL11           6    0.0045    .957    .915     5.94     130     5.0             0.323
    9    CL29        OB50           4    0.0031    .953    .904     6.52     138    20.3            0.3603
    8    CL14        CL15           9    0.0101    .943    .890     6.08     131    10.7            0.3761
    7    CL20        OB54           9    0.0047    .939    .872     6.89     143    11.7            0.4063
    6    CL13        CL9           21    0.0272    .911    .848     5.23     117    30.0            0.5101
    5    CL6         CL17          30    0.0746    .837    .814     1.30    74.3    42.2             0.606
    4    CL10        CL7           15    0.0440    .793    .764     1.40    75.3    36.4            0.6152
    3    CL8         CL12          18    0.0642    .729    .681     2.02    80.6    44.0            0.6648
    2    CL3         CL4           33    0.2580    .471    .470     0.01    54.2    54.4            0.9887
    1    CL5         CL2           63    0.4707    .000    .000     0.00       .    54.2            0.9636
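The PSF column in the cluster history can be checked against the RSQ column. The sketch below, in Python rather than SAS, assumes the usual definition of the pseudo F statistic, (R²/(c-1)) / ((1-R²)/(n-c)) for n observations and c clusters; the function name `pseudo_f` is hypothetical. Because the printed RSQ values are rounded to three decimals, the result only approximately reproduces the printed PSF.

```python
def pseudo_f(r_squared, n_obs, n_clusters):
    """Pseudo F statistic from R-squared, sample size, and cluster count."""
    between = r_squared / (n_clusters - 1)
    within = (1.0 - r_squared) / (n_obs - n_clusters)
    return between / within


# Values read from the cluster history: 63 boxes,
# RSQ=.939 at NCL=7 and RSQ=.471 at NCL=2.
for n_clusters, r_squared in [(7, 0.939), (2, 0.471)]:
    print(n_clusters, round(pseudo_f(r_squared, 63, n_clusters), 1))
```

Both results land close to the printed PSF values of 143 and 54.2 for seven and two clusters.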

  ---------------------------------- CLUSTER=1 CLUSNAME=CL7 ----------------------------------

  Obs    class               name                height    width    depth     weight    color
    1    Tea                 Bigelow Plantati       7.7     13.4      6.9    3.27107      g
    2    Tea                 Bigelow Earl Gre       7.7     13.4      6.9    3.27107      b
    3    Tea                 Celestial Saeson       7.8     13.8      6.3    3.65931      b
    4    Tea                 Celestial Saeson       7.8     13.8      6.3    3.58305      r
    5    Tea                 Bigelow Lemon Li       7.7     13.4      6.9    3.41995      y
    6    Tea                 Celestial Saeson       7.8     13.8      6.3    3.82586      y
    7    Tea                 Celestial Saeson       7.8     13.8      6.3    3.33222      g
    8    Tea                 Lipton Tea Bags        6.7     10.0      5.7    3.28271      r
    9    Tea                 Lipton Family Si       8.9     11.1      8.2    4.39510      r

  ---------------------------------- CLUSTER=2 CLUSNAME=CL17 ---------------------------------

  Obs    class               name                height    width    depth     weight    color
   10    Detergent           Tide                  26.5     19.9      6.3    10.5928      r
   11    Detergent           Tide                  19.3     14.6      4.7     7.8357      r
   12    Detergent           Tide                  32.5     23.2      7.3    12.6889      r
   13    Breakfast cereal    Grape Nuts            21.7     16.3      4.9     8.7937      w
   14    Detergent           Arm & Hammer Det      33.7     22.8      7.0    14.7023      y
   15    Detergent           Arm & Hammer Det      27.8     19.4      6.3    12.2003      y
   16    Detergent           Arm & Hammer Det      38.8     30.0     16.9    22.4732      y
   17    Detergent           Tide                  39.4     24.8     11.3    16.1045      r
   18    Detergent           Arm & Hammer Det      39.5     25.8     11.0    18.6115      y

  ---------------------------------- CLUSTER=3 CLUSNAME=CL12 ---------------------------------

  Obs    class               name                height    width    depth     weight    color
   19    Paste, Tooth        Colgate                3.6     15.6      3.3    4.39510      r
   20    Paste, Tooth        Crest                  3.5     15.2      3.2    4.24343      w
   21    Paste, Tooth        Crest                  4.3     17.4      3.6    5.06813      w
   22    Paste, Tooth        Arm & Hammer           4.4     17.0      3.7    5.21097      w
   23    Paste, Tooth        Colgate                4.2     18.3      3.5    5.21097      r
   24    Paste, Tooth        Crest                  4.3     21.7      3.7    5.65790      w
   25    Paste, Tooth        Colgate                4.4     22.0      3.5    5.82946      r
   26    Little Debbie       Fudge Rounds           8.1     28.3      5.4    6.45411      w
   27    Paste, Tooth        Crest                  3.0     10.9      2.8    2.88670      w

  ---------------------------------- CLUSTER=4 CLUSNAME=CL13 ---------------------------------

  Obs    class               name                height    width    depth     weight    color
   28    Breakfast cereal    Cheerios              27.5     19.0      6.2    6.56541      y
   29    Breakfast cereal    Froot Loops           25.0     17.7      6.4    6.77735      r
   30    Breakfast cereal    Special K             30.1     20.5      8.5    7.98644      w
   31    Breakfast cereal    Corn Flakes           30.2     20.6      8.4    7.98644      w
   32    Breakfast cereal    Special K             29.6     19.2      6.7    6.97679      w
   33    Breakfast cereal    Corn Flakes           30.0     19.1      6.6    6.97679      w
   34    Breakfast cereal    Froot Loops           30.2     20.8      8.5    8.23034      r
   35    Breakfast cereal    Cheerios              30.3     20.4      7.2    7.51847      y
   36    Breakfast cereal    Cheerios              24.1     17.2      5.3    5.82848      y
   37    Breakfast cereal    Corn Flakes           33.7     25.4      8.0    8.79021      w
   38    Breakfast cereal    Special K             23.4     16.6      5.7    5.82946      w
   39    Breakfast cereal    Cheerios              32.5     22.4      8.4    8.27677      y
   40    Breakfast cereal    Shredded Wheat,       26.6     19.6      5.6    7.98957      r
   41    Crackers            Ritz                  23.1     16.0      5.3    6.97953      r
   42    Breakfast cereal    All-Bran              21.1     14.3      5.2    7.30951      y
   43    Tea                 Lipton Loose Tea      12.7     10.9      5.4    6.09479      r
   44    Crackers            Ritz                  23.1     20.7      5.2    7.68573      r

  ---------------------------------- CLUSTER=5 CLUSNAME=CL10 ---------------------------------

  Obs    class               name                height    width    depth     weight    color
   45    Tea                 Luzianne               8.9     22.8      6.4    5.53748      r
   46    Tea                 Luzianne Decaffe       8.9     22.8      6.4    5.29641      g
   47    Crackers            Premium Saltines      11.0     25.0     10.7    7.68573      w
   48    Tea                 Lipton Family Si       8.7     20.8      8.2    5.53748      r
   49    Little Debbie       Marshmallow Supr       9.4     32.0      7.0    6.56541      w
   50    Tea                 Lipton Family Si      13.7     24.0      9.0    6.97679      r

  ---------------------------------- CLUSTER=6 CLUSNAME=CL9 ----------------------------------

  Obs    class               name                height    width    depth     weight    color
   51    Tea                 Luzianne              18.4     20.2      6.9    6.09479      r
   52    Tea                 Lipton Tea Bags       17.1     20.0      6.7    6.09479      r
   53    Breakfast cereal    Shredded Wheat        19.7     19.9      7.5    6.56541      y
   54    Tea                 Lipton Tea Bags       11.5     14.4      6.6    4.73448      r

  ---------------------------------- CLUSTER=7 CLUSNAME=CL8 ----------------------------------

  Obs    class               name                height    width    depth     weight    color
   55    Crackers            Wheatsworth           11.1     25.2      5.5    6.88239      w
   56    Little Debbie       Swiss Cake Rolls      10.1     21.8      5.8    7.16545      w
   57    Little Debbie       Figaroos              13.5     18.6      3.7    6.97679      y
   58    Little Debbie       Nutty Bar             13.2     18.5      4.2    6.97679      y
   59    Little Debbie       Apple Delights        11.2     30.1      4.9    7.51552      w
   60    Little Debbie       Lemon Stix            13.2     18.5      4.2    6.33884      w
   61    Little Debbie       Fudge Brownies        11.0     30.8      2.5    6.97679      w
   62    Little Debbie       Snack Cakes           13.4     32.0      3.4    7.16545      b
   63    Crackers            Waverly Wafers        14.4     22.5      6.2    7.68573      g

  --------------------------------------------------------------------------------------------------
                                                     class
           -----------------------------------------------------------------------------------------
             Breakfast
               cereal        Crackers       Detergent    Little Debbie  Paste, Tooth       Tea
  --------+--------------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                                                                                        #########
  2                     #                      ########
  3                                                                  #       ########
  4        ##############             ##                                                           #
  5                                    #                             #                          ####
  6                     #                                                                        ###
  7                                   ##                       #######
  --------------------------------------------------------------------------------------------------

  -----------------------------------------------------------------------------------
                                             color
           --------------------------------------------------------------------------
                Blue           Green           Red           White         Yellow
  --------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                    ##             ##            ###                            ##
  2                                                ####              #           ####
  3                                                 ###         ######
  4                                              ######         ######          #####
  5                                    #            ###             ##
  6                                                 ###                             #
  7                     #              #                         #####             ##
  -----------------------------------------------------------------------------------

The last several analyses include color. Obviously, the dummy variables must not be included in calculations to standardize the rows. If the five dummy variables are simply standardized to variance 1.0 and included with the other variables, color dominates the analysis. The dummy variables should be scaled to a smaller variance, which must be determined by trial and error. Four analyses are done using PROC STANDARD to scale the dummy variables to a standard deviation of 0.2, 0.3, 0.4, or 0.8. The cluster listings are suppressed.
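To see why the scale of the dummy variables matters, the following Python sketch (not SAS) rescales 0/1 color dummies to mean 0 and a chosen standard deviation, as PROC STANDARD does with the M=0 and S= options, and reports how much a color mismatch adds to the squared Euclidean distance between two items. The color list is made up for illustration and is not the grocery data; `scale_column` and `color_sq_distance` are hypothetical helper names.

```python
import statistics


def scale_column(values, target_std):
    """Center a column at 0 and rescale it to the requested standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # n-1 denominator
    return [(v - mean) / std * target_std for v in values]


# Hypothetical colors for ten items (not the grocery data).
colors = ["r", "w", "y", "r", "g", "w", "b", "y", "r", "w"]
levels = sorted(set(colors))


def color_sq_distance(i, j, target_std):
    """Squared distance between items i and j using only the scaled color dummies."""
    total = 0.0
    for level in levels:
        dummy = [1.0 if c == level else 0.0 for c in colors]
        scaled = scale_column(dummy, target_std)
        total += (scaled[i] - scaled[j]) ** 2
    return total


# Items 0 and 1 differ in color (red versus white).
for s in (0.2, 0.3, 0.4, 0.8):
    print(s, round(color_sq_distance(0, 1, s), 3))
```

The color contribution grows with the square of S, so going from S=0.2 to S=0.8 multiplies it by 16, while two items of the same color get no contribution at any value of S.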

Since dummy variables drastically violate the normality assumption on which the CCC depends, the CCC tends to indicate an excessively large number of clusters.

  /************************************************************/
  /*                                                          */
  /* Analyses 4-7: standardized row-standardized logs & color */
  /*                                                          */
  /************************************************************/
  %let list=0;
  %let crosscol=1;

  title2 'Analysis 4: Standardized row-standardized logarithms and color (s=.2)';
  proc standard data=stdstan out=stdstan m=0 s=.2;
     var c_:;
  run;
  proc cluster data=stdstan m=cen %clusopt outtree=tree;
     var l_height l_width l_depth l_weight c_:;
     id name;
     copy class height width depth weight color;
  run;
  %show(7);

  title2 'Analysis 5: Standardized row-standardized logarithms and color (s=.3)';
  proc standard data=stdstan out=stdstan m=0 s=.3;
     var c_:;
  run;
  proc cluster data=stdstan m=cen %clusopt outtree=tree;
     var l_height l_width l_depth l_weight c_:;
     id name;
     copy class height width depth weight color;
  run;
  %show(6);

  title2 'Analysis 6: Standardized row-standardized logarithms and color (s=.4)';
  proc standard data=stdstan out=stdstan m=0 s=.4;
     var c_:;
  run;
  proc cluster data=stdstan m=cen %clusopt outtree=tree;
     var l_height l_width l_depth l_weight c_:;
     id name;
     copy class height width depth weight color;
  run;
  %show(3);

  title2 'Analysis 7: Standardized row-standardized logarithms and color (s=.8)';
  proc standard data=stdstan out=stdstan m=0 s=.8;
     var c_:;
  run;
  proc cluster data=stdstan m=cen %clusopt outtree=tree;
     var l_height l_width l_depth l_weight c_:;
     id name;
     copy class height width depth weight color;
  run;
  %show(10);

Using PROC STANDARD on the dummy variables with S=0.2 causes four of the Little Debbie products to join the toothpastes. Using S=0.3 causes one of the tea clusters to merge with the breakfast cereals while three cereals defect to the detergents. Using S=0.4 produces three clusters consisting of (1) cereals and detergents, (2) Little Debbie products and toothpaste, and (3) teas, with crackers divided among all three clusters and a few other misclassifications. With S=0.8, ten clusters are indicated, each entirely monochrome. So, S=0.2 or S=0.3 degrades the classification, S=0.4 yields a good but perhaps excessively coarse classification, and higher values of the S= option produce clusters that are determined mainly by color.

Output 23.6.4: Analysis of Standardized Row-Standardized Logarithms and Color

  Cluster Analysis of Grocery Boxes
  Analysis 4: Standardized row-standardized logarithms and color (s=.2)

  The CLUSTER Procedure
  Centroid Hierarchical Cluster Analysis

  Eigenvalues of the Covariance Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    2.43584975    0.94791932        0.5800        0.5800
  2    1.48793042    1.39363531        0.3543        0.9342
  3    0.09429511    0.03686218        0.0225        0.9567
  4    0.05743293    0.01036136        0.0137        0.9704
  5    0.04707157    0.00489503        0.0112        0.9816
  6    0.04217654    0.00693298        0.0100        0.9916
  7    0.03524355    0.03524355        0.0084        1.0000
  8    0.00000000    0.00000000        0.0000        1.0000
  9     .00000000                      0.0000        1.0000

  Root-Mean-Square Total-Sample Standard Deviation =  0.68313
  Root-Mean-Square Distance Between Observations   = 2.898275

  Cluster History

  NCL    --Clusters Joined---    FREQ     SPRSQ     RSQ    ERSQ      CCC     PSF    PST2    Norm Cent Dist    Tie
   20    CL46        OB37           3    0.0016    .968       .        .    67.5    11.9            0.2706
   19    OB46        OB52           2    0.0014    .966       .        .    69.7       .            0.2995
   18    CL25        CL37           6    0.0041    .962       .        .    67.1     5.0            0.3081
   17    CL33        CL35          16    0.0099    .952       .        .    57.2    16.7            0.3196
   16    CL19        OB48           3    0.0024    .950       .        .    59.2     1.7            0.3357
   15    CL30        CL16           5    0.0042    .946       .        .    59.5     2.7            0.3299
   14    CL27        CL18           8    0.0057    .940       .        .    58.9     4.2            0.3429
   13    CL20        OB32           4    0.0031    .937       .        .    61.7     3.6            0.3564
   12    CL24        OB50           4    0.0031    .934    .905     3.23    65.2     4.7             0.359
   11    CL39        CL28           6    0.0068    .927    .896     3.17    65.9    12.1            0.3743
   10    CL13        OB35           5    0.0036    .923    .886     3.62    70.8     2.3            0.3755
    9    CL11        CL32          13    0.0176    .906    .874     2.70    64.8    16.0            0.4107
    8    CL14        OB54           9    0.0052    .900    .859     3.29    71.0     2.6            0.4265
    7    OB21        CL10           6    0.0052    .895    .841     4.09    79.8     2.4            0.4378
    6    CL17        CL12          20    0.0248    .870    .817     3.52    76.6    19.7            0.4898
    5    CL15        CL8           14    0.0326    .838    .783     3.08    75.0    14.0            0.5607
    4    CL6         CL21          30    0.0743    .764    .734     1.35    63.5    35.6            0.5877
    3    CL9         CL7           19    0.0579    .706    .653     2.17    72.0    22.8            0.6611
    2    CL4         CL3           49    0.3632    .343    .450     -2.6    31.8    73.0            0.9838
    1    CL2         CL5           63    0.3426    .000    .000     0.00       .    31.8            0.9876

  --------------------------------------------------------------------------------------------------
                                                     class
           -----------------------------------------------------------------------------------------
             Breakfast
               cereal        Crackers       Detergent    Little Debbie  Paste, Tooth        Tea
  --------+--------------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                    ##                      ########
  2                                    #                          ####       ########
  3         #############             ##                                                           #
  4                     #                                                                        ###
  5                                    #                         #####
  6                                                                                        #########
  7                                    #                                                        ####
  --------------------------------------------------------------------------------------------------

  -----------------------------------------------------------------------------------
                                             color
           --------------------------------------------------------------------------
                Blue           Green           Red           White         Yellow
  --------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                                                ####              #          #####
  2                                                 ###     ##########
  3                                              ######         ######           ####
  4                                                 ###                             #
  5                     #              #                            ##             ##
  6                    ##             ##            ###                            ##
  7                                    #            ###              #
  -----------------------------------------------------------------------------------

  Cluster Analysis of Grocery Boxes
  Analysis 5: Standardized row-standardized logarithms and color (s=.3)

  The CLUSTER Procedure
  Centroid Hierarchical Cluster Analysis

  Eigenvalues of the Covariance Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    2.44752302    0.95026671        0.5500        0.5500
  2    1.49725632    1.36701945        0.3365        0.8865
  3    0.13023687    0.02135049        0.0293        0.9157
  4    0.10888637    0.00867367        0.0245        0.9402
  5    0.10021271    0.00628821        0.0225        0.9627
  6    0.09392449    0.02196469        0.0211        0.9838
  7    0.07195981    0.07195981        0.0162        1.0000
  8    0.00000000    0.00000000        0.0000        1.0000
  9     .00000000                      0.0000        1.0000

  Root-Mean-Square Total-Sample Standard Deviation = 0.703167
  Root-Mean-Square Distance Between Observations   = 2.983287

  Cluster History

  NCL    --Clusters Joined---    FREQ     SPRSQ     RSQ    ERSQ      CCC     PSF    PST2    Norm Cent Dist    Tie
   20    CL24        CL28           4    0.0038    .953       .        .    45.7     2.7            0.3448
   19    OB11        CL23           6    0.0033    .950       .        .    46.0     3.5            0.3477
   18    CL46        OB37           3    0.0027    .947       .        .    47.1    21.9            0.3558
   17    CL21        OB50           4    0.0031    .944       .        .    48.2     2.5            0.3577
   16    CL39        CL33           6    0.0064    .937       .        .    46.9    12.1            0.3637
   15    CL19        CL29          14    0.0152    .922       .        .    40.6    12.4            0.3707
   14    CL18        OB32           4    0.0035    .919       .        .    42.5     2.5            0.3813
   13    CL16        CL25          13    0.0175    .901       .        .    38.0    13.7            0.4103
   12    CL22        OB54           5    0.0049    .896    .875     1.76    40.0     3.2            0.4353
   11    CL12        CL37           7    0.0089    .887    .865     1.71    40.9     4.6            0.4397
   10    CL20        OB48           5    0.0056    .882    .854     2.02    43.9     2.5            0.4669
    9    CL26        CL17          16    0.0222    .859    .841     1.20    41.3    16.6             0.479
    8    CL32        CL11           9    0.0125    .847    .826     1.31    43.5     4.5            0.4988
    7    CL14        OB35           5    0.0070    .840    .806     1.95    49.0     3.3             0.519
    6    OB21        CL7            6    0.0077    .832    .782     2.79    56.6     2.3            0.5366
    5    CL9         CL15          30    0.0716    .761    .749     0.54    46.1    28.3            0.5452
    4    CL10        CL8           14    0.0318    .729    .700     1.21    52.9     8.6            0.5542
    3    CL5         CL6           36    0.0685    .660    .622     1.50    58.3    14.2            0.6516
    2    CL13        CL4           27    0.2008    .460    .427     0.90    51.9    46.6            0.9611
    1    CL3         CL2           63    0.4595    .000    .000     0.00       .    51.9            0.9609

  --------------------------------------------------------------------------------------------------
                                                     class
           -----------------------------------------------------------------------------------------
             Breakfast
               cereal        Crackers       Detergent    Little Debbie  Paste, Tooth       Tea
  --------+--------------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                   ###             ##       ########                                            #
  2                                    #                          ####       ########
  3         #############                                                                        ###
  4                                    #                         #####
  5                                                                                        #########
  6                                    #                                                        ####
  --------------------------------------------------------------------------------------------------

  -----------------------------------------------------------------------------------
                                             color
           --------------------------------------------------------------------------
                Blue           Green           Red           White         Yellow
  --------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                                            ########              #          #####
  2                                                 ###     ##########
  3                                               #####         ######          #####
  4                     #              #                            ##             ##
  5                    ##             ##            ###                            ##
  6                                    #            ###              #
  -----------------------------------------------------------------------------------

  Cluster Analysis of Grocery Boxes
  Analysis 6: Standardized row-standardized logarithms and color (s=.4)

  The CLUSTER Procedure
  Centroid Hierarchical Cluster Analysis

  Eigenvalues of the Covariance Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    2.46469435    0.95296119        0.5135        0.5135
  2    1.51173316    1.28149311        0.3149        0.8284
  3    0.23024005    0.04306536        0.0480        0.8764
  4    0.18717469    0.01766446        0.0390        0.9154
  5    0.16951023    0.01827481        0.0353        0.9507
  6    0.15123542    0.06582379        0.0315        0.9822
  7    0.08541162    0.08541162        0.0178        1.0000
  8    0.00000000    0.00000000        0.0000        1.0000
  9     .00000000                      0.0000        1.0000

  Root-Mean-Square Total-Sample Standard Deviation = 0.730297
  Root-Mean-Square Distance Between Observations   = 3.098387

  Cluster History

  NCL    --Clusters Joined---    FREQ     SPRSQ     RSQ    ERSQ      CCC     PSF    PST2    Norm Cent Dist    Tie
   20    CL29        CL44          10    0.0074    .955       .        .    47.7     8.2            0.3789
   19    CL38        OB54           3    0.0031    .952       .        .    48.1     9.3            0.3792
   18    CL25        CL41          11    0.0155    .936       .        .    38.8    36.7            0.4192
   17    CL23        CL43          10    0.0120    .924       .        .    35.0    11.6            0.4208
   16    OB11        CL26           6    0.0050    .919       .        .    35.6     5.8            0.4321
   15    CL19        CL31           5    0.0074    .912       .        .    35.4     5.3            0.4362
   14    OB20        CL27           4    0.0046    .907       .        .    36.8     2.9            0.4374
   13    CL18        CL20          21    0.0352    .872       .        .    28.4    19.7            0.4562
   12    CL13        CL16          27    0.0372    .835    .839     -.37    23.4    12.0            0.4968
   11    CL21        CL17          15    0.0289    .806    .828     -1.5    21.6    13.6            0.5183
   10    CL14        CL15           9    0.0200    .786    .815     -1.8    21.6     7.2            0.5281
    9    OB21        OB48           2    0.0047    .781    .801     -1.2    24.1       .            0.5425
    8    CL10        CL24          12    0.0243    .757    .785     -1.3    24.5     5.8            0.5783
    7    CL12        CL46          29    0.0224    .735    .765     -1.3    25.8     5.3            0.6105
    6    CL8         CL37          14    0.0220    .712    .740     -1.1    28.3     4.0            0.6313
    5    CL6         CL32          16    0.0251    .687    .707     -.78    31.9     3.9            0.6664
    4    CL11        CL9           17    0.0287    .659    .660     -.04    38.0     7.0            0.7098
    3    CL4         OB35          18    0.0180    .641    .584     2.21    53.5     3.2            0.7678
    2    CL3         CL5           34    0.2175    .423    .400     0.67    44.8    31.4            0.8923
    1    CL7         CL2           63    0.4232    .000    .000     0.00       .    44.8            0.9156

  --------------------------------------------------------------------------------------------------
                                                     class
           -----------------------------------------------------------------------------------------
             Breakfast
               cereal        Crackers       Detergent    Little Debbie  Paste, Tooth        Tea
  --------+--------------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1       >##############             ##       ########             ##                             #
  2                                   ##                       #######       ########              #
  3                                    #                                             >##############
  --------------------------------------------------------------------------------------------------

  -----------------------------------------------------------------------------------
                                             color
           --------------------------------------------------------------------------
                Blue           Green           Red           White         Yellow
  --------+--------------+--------------+--------------+--------------+--------------
  CLUSTER
  1                                          ##########        #######   ############
  2                     #             ##            ###   ############
  3                    ##             ##      #########              #             ##
  -----------------------------------------------------------------------------------

Cluster Analysis of Grocery Boxes
Analysis 7: Standardized row-standardized logarithms and color (s=.8)

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    2.61400794    0.93268930        0.3631        0.3631
2    1.68131864    0.77645948        0.2335        0.5966
3    0.90485916    0.22547234        0.1257        0.7222
4    0.67938683    0.00292216        0.0944        0.8166
5    0.67646466    0.12119211        0.0940        0.9106
6    0.55527255    0.46658428        0.0771        0.9877
7    0.08868827    0.08868827        0.0123        1.0000
8    0.00000000    0.00000000        0.0000        1.0000
9     .00000000                      0.0000        1.0000

Root-Mean-Square Total-Sample Standard Deviation = 0.894427
Root-Mean-Square Distance Between Observations   = 3.794733

Cluster History
                                                                                            Norm    T
                                                                                            Cent    i
NCL    --Clusters Joined---      FREQ     SPRSQ     RSQ    ERSQ     CCC     PSF   PST2      Dist    e
 20    CL29        CL44            10    0.0049    .970    .        .      72.7    8.2    0.3094
 19    CL38        OB54             3    0.0021    .968    .        .      73.3    9.3    0.3096
 18    CL21        CL23            12    0.0153    .952    .        .      53.0   15.0    0.4029
 17    OB21        OB48             2    0.0032    .949    .        .      53.8     .      0.443
 16    CL27        CL24             6    0.0095    .940    .        .      48.9   10.4     0.444
 15    CL19        CL16             9    0.0136    .926    .        .      43.0    6.1    0.4587
 14    CL41        OB11             7    0.0058    .920    .        .      43.6   51.2    0.4591
 13    CL26        CL46             7    0.0105    .910    .        .      42.1   22.0    0.4769
 12    CL25        CL13            12    0.0205    .889    .743    16.5    37.3   13.8     0.467
 11    CL18        OB20            13    0.0093    .880    .726    16.7    38.2    4.0    0.5586
 10    CL17        CL37             4    0.0134    .867    .706    16.5    38.3    7.9    0.6454
  9    CL14        CL20            17    0.0567    .810    .684    11.0    28.8   52.6    0.6534
  8    CL12        CL9             29    0.0828    .727    .659    5.03    20.9   20.7     0.604
  7    CL11        CL43            16    0.0359    .691    .631    4.25    20.9   14.4    0.6758
  6    CL15        CL31            11    0.0263    .665    .598    4.24    22.6    8.0    0.7065
  5    CL7         CL6             27    0.1430    .522    .557    -1.7    15.8   28.2    0.8247
  4    CL8         CL5             56    0.2692    .253    .507    -9.1     6.6   31.5    0.7726
  3    OB35        CL32             3    0.0216    .231    .435    -6.6     9.0   46.0    1.0027
  2    CL4         CL10            60    0.1228    .108    .289    -5.6     7.4    9.5    1.0096
  1    CL2         CL3             63    0.1083    .000    .000    0.00      .     7.4    1.0839

----------------------------------------------------------------------------------------------------------
|        |                                             class                                             |
|        |-----------------------------------------------------------------------------------------------|
|        |   Breakfast   |               |               |               |               |               |
|        |    cereal     |   Crackers    |   Detergent   | Little Debbie | Paste, Tooth  |      Tea      |
|--------+---------------+---------------+---------------+---------------+---------------+---------------|
|CLUSTER |               |               |               |               |               |               |
|1       |            ###|             ##|           ####|               |               |              #|
|2       |               |             ##|               |         ######|          #####|               |
|3       |        #######|               |               |               |               |               |
|4       |         ######|               |           ####|             ##|               |               |
|5       |               |               |               |               |            ###|               |
|6       |               |               |               |               |               |      #########|
|7       |               |              #|               |               |               |            ###|
|8       |               |               |               |               |               |             ##|
|9       |               |               |               |               |               |             ##|
|10      |               |               |               |              #|               |               |
----------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------
|        |                                     color                                     |
|        |-------------------------------------------------------------------------------|
|        |     Blue      |     Green     |      Red      |     White     |    Yellow     |
|--------+---------------+---------------+---------------+---------------+---------------|
|CLUSTER |               |               |               |               |               |
|1       |               |               |     ##########|               |               |
|2       |               |               |               |  #############|               |
|3       |               |               |               |        #######|               |
|4       |               |               |               |               |   ############|
|5       |               |               |            ###|               |               |
|6       |               |               |      #########|               |               |
|7       |               |           ####|               |               |               |
|8       |             ##|               |               |               |               |
|9       |               |               |               |               |             ##|
|10      |              #|               |               |               |               |
------------------------------------------------------------------------------------------