This first example clusters ten American cities based on the flying mileages between them. Six clustering methods are shown with corresponding tree diagrams produced by the TREE procedure. The EML method cannot be used because it requires coordinate data. The other omitted methods produce the same clusters, although not the same distances between clusters, as one of the illustrated methods: complete linkage and the flexible-beta method yield the same clusters as Ward's method, McQuitty's similarity analysis produces the same clusters as average linkage, and the median method corresponds to the centroid method.
All of the methods suggest a division of the cities into two clusters along the east-west dimension. There is disagreement, however, about which cluster Denver should belong to. Some of the methods indicate a possible third cluster containing Denver and Houston. The following statements produce Output 23.1.1:
    title 'Cluster Analysis of Flying Mileages Between 10 American Cities';

    /* Lower-triangular distance matrix: ten 5-column fields, city name at column 55 */
    data mileages(type=distance);
       input (atlanta chicago denver houston losangeles
              miami newyork sanfran seattle washdc) (5.)
              @55 city $15.;
       datalines;
        0                                                 ATLANTA
      587    0                                            CHICAGO
     1212  920    0                                       DENVER
      701  940  879    0                                  HOUSTON
     1936 1745  831 1374    0                             LOS ANGELES
      604 1188 1726  968 2339    0                        MIAMI
      748  713 1631 1420 2451 1092    0                   NEW YORK
     2139 1858  949 1645  347 2594 2571    0              SAN FRANCISCO
     2182 1737 1021 1891  959 2734 2408  678    0         SEATTLE
      543  597 1494 1220 2300  923  205 2442 2329    0    WASHINGTON D.C.
    ;

    /*---------------------- Average linkage --------------------*/
    proc cluster data=mileages method=average pseudo;
       id city;
    run;

    proc tree horizontal spaces=2;
       id city;
    run;

    /*---------------------- Centroid method --------------------*/
    proc cluster data=mileages method=centroid pseudo;
       id city;
    run;

    proc tree horizontal spaces=2;
       id city;
    run;

    /*-------- Density linkage with 3rd-nearest-neighbor --------*/
    proc cluster data=mileages method=density k=3;
       id city;
    run;

    proc tree horizontal spaces=2;
       id city;
    run;

    /*--------------------- Single linkage ----------------------*/
    proc cluster data=mileages method=single;
       id city;
    run;

    proc tree horizontal spaces=2;
       id city;
    run;

    /*--- Two-stage density linkage with 3rd-nearest-neighbor ---*/
    proc cluster data=mileages method=twostage k=3;
       id city;
    run;

    proc tree horizontal spaces=2;
       id city;
    run;

    /* Ward's minimum variance with pseudo F and t^2 statistics */
    proc cluster data=mileages method=ward pseudo;
       id city;
    run;

    proc tree horizontal spaces=2;
       id city;
    run;

Cluster Analysis of Flying Mileages Between 10 American Cities

The CLUSTER Procedure
Average Linkage Cluster Analysis

Root-Mean-Square Distance Between Observations = 1580.242

Cluster History

NCL | Clusters Joined | | FREQ | PSF | PST2 | Norm RMS Dist | Tie |
---|---|---|---|---|---|---|---|
9 | NEW YORK | WASHINGTON D.C. | 2 | 66.7 | . | 0.1297 | |
8 | LOS ANGELES | SAN FRANCISCO | 2 | 39.2 | . | 0.2196 | |
7 | ATLANTA | CHICAGO | 2 | 21.7 | . | 0.3715 | |
6 | CL7 | CL9 | 4 | 14.5 | 3.4 | 0.4149 | |
5 | CL8 | SEATTLE | 3 | 12.4 | 7.3 | 0.5255 | |
4 | DENVER | HOUSTON | 2 | 13.9 | . | 0.5562 | |
3 | CL6 | MIAMI | 5 | 15.5 | 3.8 | 0.6185 | |
2 | CL3 | CL4 | 7 | 16.0 | 5.3 | 0.8005 | |
1 | CL2 | CL5 | 10 | . | 16.0 | 1.2967 | |
Centroid Hierarchical Cluster Analysis

Root-Mean-Square Distance Between Observations = 1580.242

Cluster History

NCL | Clusters Joined | | FREQ | PSF | PST2 | Norm Cent Dist | Tie |
---|---|---|---|---|---|---|---|
9 | NEW YORK | WASHINGTON D.C. | 2 | 66.7 | . | 0.1297 | |
8 | LOS ANGELES | SAN FRANCISCO | 2 | 39.2 | . | 0.2196 | |
7 | ATLANTA | CHICAGO | 2 | 21.7 | . | 0.3715 | |
6 | CL7 | CL9 | 4 | 14.5 | 3.4 | 0.3652 | |
5 | CL8 | SEATTLE | 3 | 12.4 | 7.3 | 0.5139 | |
4 | DENVER | CL5 | 4 | 12.4 | 2.1 | 0.5337 | |
3 | CL6 | MIAMI | 5 | 14.2 | 3.8 | 0.5743 | |
2 | CL3 | HOUSTON | 6 | 22.1 | 2.6 | 0.6091 | |
1 | CL2 | CL4 | 10 | . | 22.1 | 1.173 | |

Density Linkage Cluster Analysis

K = 3

Cluster History

NCL | Clusters Joined | | FREQ | Normalized Fusion Density | Lesser Maximum Density | Greater Maximum Density | Tie |
---|---|---|---|---|---|---|---|
9 | ATLANTA | WASHINGTON D.C. | 2 | 96.106 | 92.5043 | 100.0 | |
8 | CL9 | CHICAGO | 3 | 95.263 | 90.9548 | 100.0 | |
7 | CL8 | NEW YORK | 4 | 86.465 | 76.1571 | 100.0 | |
6 | CL7 | HOUSTON | 5 | 74.079 | 61.7747 | 100.0 | T |
5 | CL6 | MIAMI | 6 | 74.079 | 58.8299 | 100.0 | |
4 | LOS ANGELES | SAN FRANCISCO | 2 | 71.968 | 65.3430 | 80.0885 | |
3 | CL4 | SEATTLE | 3 | 66.341 | 56.6215 | 80.0885 | |
2 | CL3 | DENVER | 4 | 63.509 | 61.7747 | 80.0885 | |
1 | CL5 | CL2 | 10 | 61.775 * | 80.0885 | 100.0 | |

\* indicates fusion of two modal or multimodal clusters.

2 modal clusters have been formed.

Single Linkage Cluster Analysis

Mean Distance Between Observations = 1417.133

Cluster History

NCL | Clusters Joined | | FREQ | Norm Min Dist | Tie |
---|---|---|---|---|---|
9 | NEW YORK | WASHINGTON D.C. | 2 | 0.1447 | |
8 | LOS ANGELES | SAN FRANCISCO | 2 | 0.2449 | |
7 | ATLANTA | CL9 | 3 | 0.3832 | |
6 | CL7 | CHICAGO | 4 | 0.4142 | |
5 | CL6 | MIAMI | 5 | 0.4262 | |
4 | CL8 | SEATTLE | 3 | 0.4784 | |
3 | CL5 | HOUSTON | 6 | 0.4947 | |
2 | DENVER | CL4 | 4 | 0.5864 | |
1 | CL3 | CL2 | 10 | 0.6203 | |

Two-Stage Density Linkage Clustering

K = 3

Cluster History

NCL | Clusters Joined | | FREQ | Normalized Fusion Density | Lesser Maximum Density | Greater Maximum Density | Tie |
---|---|---|---|---|---|---|---|
9 | ATLANTA | WASHINGTON D.C. | 2 | 96.106 | 92.5043 | 100.0 | |
8 | CL9 | CHICAGO | 3 | 95.263 | 90.9548 | 100.0 | |
7 | CL8 | NEW YORK | 4 | 86.465 | 76.1571 | 100.0 | |
6 | CL7 | HOUSTON | 5 | 74.079 | 61.7747 | 100.0 | T |
5 | CL6 | MIAMI | 6 | 74.079 | 58.8299 | 100.0 | |
4 | LOS ANGELES | SAN FRANCISCO | 2 | 71.968 | 65.3430 | 80.0885 | |
3 | CL4 | SEATTLE | 3 | 66.341 | 56.6215 | 80.0885 | |
2 | CL3 | DENVER | 4 | 63.509 | 61.7747 | 80.0885 | |
1 | CL5 | CL2 | 10 | 61.775 | 80.0885 | 100.0 | |

2 modal clusters have been formed.

Ward's Minimum Variance Cluster Analysis

Root-Mean-Square Distance Between Observations = 1580.242

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | PSF | PST2 | Tie |
---|---|---|---|---|---|---|---|---|
9 | NEW YORK | WASHINGTON D.C. | 2 | 0.0019 | .998 | 66.7 | . | |
8 | LOS ANGELES | SAN FRANCISCO | 2 | 0.0054 | .993 | 39.2 | . | |
7 | ATLANTA | CHICAGO | 2 | 0.0153 | .977 | 21.7 | . | |
6 | CL7 | CL9 | 4 | 0.0296 | .948 | 14.5 | 3.4 | |
5 | DENVER | HOUSTON | 2 | 0.0344 | .913 | 13.2 | . | |
4 | CL8 | SEATTLE | 3 | 0.0391 | .874 | 13.9 | 7.3 | |
3 | CL6 | MIAMI | 5 | 0.0586 | .816 | 15.5 | 3.8 | |
2 | CL3 | CL5 | 7 | 0.1488 | .667 | 16.0 | 5.3 | |
1 | CL2 | CL4 | 10 | 0.6669 | .000 | . | 16.0 | |
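The normalized distances in the single-linkage history can be checked by hand. The following Python sketch is for illustration only and is not part of the SAS example. It runs a naive single-linkage pass over the mileage table and divides each fusion distance by the mean of the 45 pairwise mileages. The names `LOWER`, `dist`, and `single_linkage` are assumptions made for this sketch, and the algorithm and tie handling inside PROC CLUSTER may differ.

```python
from itertools import combinations

CITIES = ["ATLANTA", "CHICAGO", "DENVER", "HOUSTON", "LOS ANGELES",
          "MIAMI", "NEW YORK", "SAN FRANCISCO", "SEATTLE", "WASHINGTON D.C."]

# Lower triangle of the mileage table, one row per city (diagonal omitted).
LOWER = [
    [],
    [587],
    [1212, 920],
    [701, 940, 879],
    [1936, 1745, 831, 1374],
    [604, 1188, 1726, 968, 2339],
    [748, 713, 1631, 1420, 2451, 1092],
    [2139, 1858, 949, 1645, 347, 2594, 2571],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329],
]

def dist(i, j):
    """Mileage between two different cities i and j (symmetric lookup)."""
    return LOWER[max(i, j)][min(i, j)]

def single_linkage():
    """Return the fusion distances of a naive single-linkage clustering."""
    clusters = [{i} for i in range(len(CITIES))]
    fusions = []
    while len(clusters) > 1:
        # Closest pair of clusters: minimum mileage over all cross pairs.
        d, a, b = min(
            (min(dist(i, j) for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        fusions.append(d)
        clusters[a] |= clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return fusions

n = len(CITIES)
mean_dist = sum(sum(row) for row in LOWER) / (n * (n - 1) / 2)
normalized = [round(d / mean_dist, 4) for d in single_linkage()]
print(round(mean_dist, 3))
print(normalized)
```

Under these assumptions the printed mean and the nine normalized values can be compared with the Mean Distance Between Observations and the Norm Min Dist column of the single-linkage listing.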
The following example uses the SAS data set Poverty created in the Getting Started section beginning on page 958. The data, from Rouncefield (1995), are birth rates, death rates, and infant death rates for 97 countries. Six cluster analyses are performed with eight methods. Scatter plots showing cluster membership at selected levels are produced instead of tree diagrams.
Each cluster analysis is performed by a macro called ANALYZE. The macro takes two arguments. The first, &METHOD, specifies the value of the METHOD= option to be used in the PROC CLUSTER statement. The second, &NCL, must be specified as a list of integers, separated by blanks, indicating the number of clusters desired in each scatter plot. For example, the first invocation of ANALYZE specifies the AVERAGE method and requests plots of 3 and 8 clusters. When two-stage density linkage is used, the K= and R= options are specified as part of the first argument.
The ANALYZE macro first invokes the CLUSTER procedure with METHOD=&METHOD, where &METHOD represents the value of the first argument to ANALYZE. This part of the macro produces the PROC CLUSTER output shown.
The %DO loop processes &NCL, the list of numbers of clusters to plot. The macro variable &K is a counter that indexes the numbers within &NCL. The %SCAN function picks out the &Kth number in &NCL, which is then assigned to the macro variable &N. When &K exceeds the number of numbers in &NCL, %SCAN returns a null string. Thus, the %DO loop executes while &N is not equal to a null string. In the %WHILE condition, a null string is indicated by the absence of any nonblank characters between the comparison operator (NE) and the right parenthesis that terminates the condition.
Within the %DO loop, the TREE procedure creates an output data set containing &N clusters. The GPLOT procedure then produces a scatter plot in which each observation is identified by the number of the cluster to which it belongs. The TITLE2 statement uses double quotes so that &N and &METHOD can be used within the title. At the end of the loop, &K is incremented by 1, and the next number is extracted from &NCL by %SCAN.
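The control flow of the loop described above can be mimicked outside the macro language. The following Python sketch is an illustration only: `scan` and `cluster_counts` are hypothetical helpers, and `scan` handles only blank-delimited words, whereas %SCAN recognizes additional default delimiters.

```python
def scan(text, k):
    """Return the k-th blank-delimited word of text (1-based), or '' if none.

    Loosely mirrors how %SCAN is used in the ANALYZE macro.
    """
    words = text.split()
    return words[k - 1] if 1 <= k <= len(words) else ""

def cluster_counts(ncl):
    """Collect the numbers of clusters the loop would plot."""
    requested = []
    k = 1
    n = scan(ncl, k)
    while n != "":            # corresponds to %DO %WHILE(&n NE);
        requested.append(int(n))
        k += 1                # corresponds to %LET k=%EVAL(&k+1);
        n = scan(ncl, k)      # corresponds to %LET n=%SCAN(&ncl,&k);
    return requested

print(cluster_counts("3 8"))
```

The loop ends as soon as `scan` returns an empty string, which is the role played by the null-string comparison in the %WHILE condition.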
For this example, plots are obtained only for average linkage. To generate plots for other methods, follow the example shown in the first macro call. The following statements produce Output 23.2.1 through Output 23.2.7.
    title 'Cluster Analysis of Birth and Death Rates';

    %macro analyze(method,ncl);
       /* Hierarchical clustering; OUTTREE= saves the tree for PROC TREE */
       proc cluster data=poverty outtree=tree method=&method p=15 ccc pseudo;
          var birth death;
          title2;
       run;

       /* Loop over the blank-separated list of cluster counts in &ncl */
       %let k=1;
       %let n=%scan(&ncl,&k);
       %do %while(&n NE);
          /* Cut the tree at &n clusters */
          proc tree data=tree noprint out=out ncl=&n;
             copy birth death;
          run;

          legend1 frame cframe=ligr cborder=black
                  position=center value=(justify=center);
          axis1 label=(angle=90 rotate=0) minor=none;
          axis2 minor=none;
          /* Scatter plot with points identified by cluster number */
          proc gplot;
             plot death*birth=cluster /
                  frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
             title2 "Plot of &n Clusters from METHOD=&METHOD";
          run;

          %let k=%eval(&k+1);
          %let n=%scan(&ncl,&k);
       %end;
    %mend;

    %analyze(average,3 8)
    %analyze(complete,3)
    %analyze(single,7 10)
    %analyze(two k=10,3)
    %analyze(two k=18,2)
Cluster Analysis of Birth and Death Rates

The CLUSTER Procedure
Average Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 189.106588 | 173.101020 | 0.9220 | 0.9220 |
2 | 16.005568 | | 0.0780 | 1.0000 |

Root-Mean-Square Total-Sample Standard Deviation = 10.127
Root-Mean-Square Distance Between Observations = 20.25399

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Norm RMS Dist | Tie |
---|---|---|---|---|---|---|---|---|---|---|---|
15 | CL27 | CL20 | 18 | 0.0035 | .980 | .975 | 2.61 | 292 | 18.6 | 0.2325 | |
14 | CL23 | CL17 | 28 | 0.0034 | .977 | .972 | 1.97 | 271 | 17.7 | 0.2358 | |
13 | CL18 | CL54 | 8 | 0.0015 | .975 | .969 | 2.35 | 279 | 7.1 | 0.2432 | |
12 | CL21 | CL26 | 8 | 0.0015 | .974 | .966 | 2.85 | 290 | 6.1 | 0.2493 | |
11 | CL19 | CL24 | 12 | 0.0033 | .971 | .962 | 2.78 | 285 | 14.8 | 0.2767 | |
10 | CL22 | CL16 | 12 | 0.0036 | .967 | .957 | 2.84 | 284 | 17.4 | 0.2858 | |
9 | CL15 | CL28 | 22 | 0.0061 | .961 | .951 | 2.45 | 271 | 17.5 | 0.3353 | |
8 | OB23 | OB61 | 2 | 0.0014 | .960 | .943 | 3.59 | 302 | . | 0.3703 | |
7 | CL25 | CL11 | 17 | 0.0098 | .950 | .933 | 3.01 | 284 | 23.3 | 0.4033 | |
6 | CL7 | CL12 | 25 | 0.0122 | .938 | .920 | 2.63 | 273 | 14.8 | 0.4132 | |
5 | CL10 | CL14 | 40 | 0.0303 | .907 | .902 | 0.59 | 225 | 82.7 | 0.4584 | |
4 | CL13 | CL6 | 33 | 0.0244 | .883 | .875 | 0.77 | 234 | 22.2 | 0.5194 | |
3 | CL9 | CL8 | 24 | 0.0182 | .865 | .827 | 2.13 | 300 | 27.7 | 0.735 | |
2 | CL5 | CL3 | 64 | 0.1836 | .681 | .697 | -.55 | 203 | 148 | 0.8402 | |
1 | CL2 | CL4 | 97 | 0.6810 | .000 | .000 | 0.00 | . | 203 | 1.3348 | |

Complete Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 189.106588 | 173.101020 | 0.9220 | 0.9220 |
2 | 16.005568 | | 0.0780 | 1.0000 |

Root-Mean-Square Total-Sample Standard Deviation = 10.127
Mean Distance Between Observations = 17.13099

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Norm Max Dist | Tie |
---|---|---|---|---|---|---|---|---|---|---|---|
15 | CL22 | CL33 | 8 | 0.0015 | .983 | .975 | 3.80 | 329 | 6.1 | 0.4092 | |
14 | CL56 | CL18 | 8 | 0.0014 | .981 | .972 | 3.97 | 331 | 6.6 | 0.4255 | |
13 | CL30 | CL44 | 8 | 0.0019 | .979 | .969 | 4.04 | 330 | 19.0 | 0.4332 | |
12 | OB23 | OB61 | 2 | 0.0014 | .978 | .966 | 4.45 | 340 | . | 0.4378 | |
11 | CL19 | CL24 | 24 | 0.0034 | .974 | .962 | 4.17 | 327 | 24.1 | 0.4962 | |
10 | CL17 | CL28 | 12 | 0.0033 | .971 | .957 | 4.18 | 325 | 14.8 | 0.5204 | |
9 | CL20 | CL13 | 16 | 0.0067 | .964 | .951 | 3.38 | 297 | 25.2 | 0.5236 | |
8 | CL11 | CL21 | 32 | 0.0054 | .959 | .943 | 3.44 | 297 | 19.7 | 0.6001 | |
7 | CL26 | CL15 | 13 | 0.0096 | .949 | .933 | 2.93 | 282 | 28.9 | 0.7233 | |
6 | CL14 | CL10 | 20 | 0.0128 | .937 | .920 | 2.46 | 269 | 27.7 | 0.8033 | |
5 | CL9 | CL16 | 30 | 0.0237 | .913 | .902 | 1.29 | 241 | 47.1 | 0.8993 | |
4 | CL6 | CL7 | 33 | 0.0240 | .889 | .875 | 1.38 | 248 | 21.7 | 1.2165 | |
3 | CL5 | CL12 | 32 | 0.0178 | .871 | .827 | 2.56 | 317 | 13.6 | 1.2326 | |
2 | CL3 | CL8 | 64 | 0.1900 | .681 | .697 | -.55 | 203 | 167 | 1.5412 | |
1 | CL2 | CL4 | 97 | 0.6810 | .000 | .000 | 0.00 | . | 203 | 2.5233 | |

Single Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 189.106588 | 173.101020 | 0.9220 | 0.9220 |
2 | 16.005568 | | 0.0780 | 1.0000 |

Root-Mean-Square Total-Sample Standard Deviation = 10.127
Mean Distance Between Observations = 17.13099

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Norm Min Dist | Tie |
---|---|---|---|---|---|---|---|---|---|---|---|
15 | CL37 | CL19 | 8 | 0.0014 | .968 | .975 | 2.3 | 178 | 6.6 | 0.1331 | |
14 | CL20 | CL23 | 15 | 0.0059 | .962 | .972 | 3.1 | 162 | 18.7 | 0.1412 | |
13 | CL14 | CL16 | 19 | 0.0054 | .957 | .969 | 3.4 | 155 | 8.8 | 0.1442 | |
12 | CL26 | OB58 | 31 | 0.0014 | .955 | .966 | 2.7 | 165 | 4.0 | 0.1486 | |
11 | OB86 | CL18 | 4 | 0.0003 | .955 | .962 | 1.6 | 183 | 3.8 | 0.1495 | |
10 | CL13 | CL11 | 23 | 0.0088 | .946 | .957 | 2.3 | 170 | 11.3 | 0.1518 | |
9 | CL15 | CL10 | 31 | 0.0210 | .925 | .951 | 4.4 | 136 | 21.8 | 0.1593 | T |
8 | CL22 | CL17 | 30 | 0.0235 | .902 | .943 | 5.8 | 117 | 45.7 | 0.1593 | |
7 | CL8 | OB75 | 31 | 0.0052 | .897 | .933 | 4.7 | 130 | 4.0 | 0.1628 | |
6 | CL7 | CL12 | 62 | 0.2023 | .694 | .920 | 15 | 41.3 | 223 | 0.1725 | |
5 | CL6 | CL9 | 93 | 0.6681 | .026 | .902 | 26 | 0.6 | 199 | 0.1756 | |
4 | CL5 | OB48 | 94 | 0.0056 | .021 | .875 | 24 | 0.7 | 0.5 | 0.1811 | T |
3 | CL4 | OB67 | 95 | 0.0083 | .012 | .827 | 15 | 0.6 | 0.8 | 0.1811 | |
2 | OB23 | OB61 | 2 | 0.0014 | .011 | .697 | 13 | 1.0 | . | 0.4378 | |
1 | CL3 | CL2 | 97 | 0.0109 | .000 | .000 | 0.00 | . | 1.0 | 0.5815 | |

Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 189.106588 | 173.101020 | 0.9220 | 0.9220 |
2 | 16.005568 | | 0.0780 | 1.0000 |

K = 10
Root-Mean-Square Total-Sample Standard Deviation = 10.127

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Normalized Fusion Density | Lesser Maximum Density | Greater Maximum Density | Tie |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | CL16 | OB94 | 22 | 0.0015 | .921 | .975 | 11 | 68.4 | 1.4 | 9.2234 | 6.7927 | 15.3069 | |
14 | CL19 | OB49 | 28 | 0.0021 | .919 | .972 | 11 | 72.4 | 1.8 | 8.7369 | 5.9334 | 33.4385 | |
13 | CL15 | OB52 | 23 | 0.0024 | .917 | .969 | 10 | 76.9 | 2.3 | 8.5847 | 5.9651 | 15.3069 | |
12 | CL13 | OB96 | 24 | 0.0018 | .915 | .966 | 9.3 | 83.0 | 1.6 | 7.9252 | 5.4724 | 15.3069 | |
11 | CL12 | OB93 | 25 | 0.0025 | .912 | .962 | 8.5 | 89.5 | 2.2 | 7.8913 | 5.4401 | 15.3069 | |
10 | CL11 | OB78 | 26 | 0.0031 | .909 | .957 | 7.7 | 96.9 | 2.5 | 7.787 | 5.4082 | 15.3069 | |
9 | CL10 | OB76 | 27 | 0.0026 | .907 | .951 | 6.7 | 107 | 2.1 | 7.7133 | 5.4401 | 15.3069 | |
8 | CL9 | OB77 | 28 | 0.0023 | .904 | .943 | 5.5 | 120 | 1.7 | 7.4256 | 4.9017 | 15.3069 | |
7 | CL8 | OB43 | 29 | 0.0022 | .902 | .933 | 4.1 | 138 | 1.6 | 6.927 | 4.4764 | 15.3069 | |
6 | CL7 | OB87 | 30 | 0.0043 | .898 | .920 | 2.7 | 160 | 3.1 | 4.932 | 2.9977 | 15.3069 | |
5 | CL6 | OB82 | 31 | 0.0055 | .892 | .902 | 1.1 | 191 | 3.7 | 3.7331 | 2.1560 | 15.3069 | |
4 | CL22 | OB61 | 37 | 0.0079 | .884 | .875 | 0.93 | 237 | 10.6 | 3.1713 | 1.6308 | 100.0 | |
3 | CL14 | OB23 | 29 | 0.0126 | .872 | .827 | 2.60 | 320 | 10.4 | 2.0654 | 1.0744 | 33.4385 | |
2 | CL4 | CL3 | 66 | 0.2129 | .659 | .697 | 1.3 | 183 | 172 | 12.409 | 33.4385 | 100.0 | |
1 | CL2 | CL5 | 97 | 0.6588 | .000 | .000 | 0.00 | . | 183 | 10.071 | 15.3069 | 100.0 | |

3 modal clusters have been formed.

Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 189.106588 | 173.101020 | 0.9220 | 0.9220 |
2 | 16.005568 | | 0.0780 | 1.0000 |

K = 18
Root-Mean-Square Total-Sample Standard Deviation = 10.127

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Normalized Fusion Density | Lesser Maximum Density | Greater Maximum Density | Tie |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | CL16 | OB72 | 46 | 0.0107 | .799 | .975 | 21 | 23.3 | 3.0 | 10.118 | 7.7445 | 23.4457 | |
14 | CL15 | OB94 | 47 | 0.0098 | .789 | .972 | 21 | 23.9 | 2.7 | 9.676 | 7.1257 | 23.4457 | |
13 | CL14 | OB51 | 48 | 0.0037 | .786 | .969 | 20 | 25.6 | 1.0 | 9.409 | 6.8398 | 23.4457 | T |
12 | CL13 | OB96 | 49 | 0.0099 | .776 | .966 | 19 | 26.7 | 2.6 | 9.409 | 6.8398 | 23.4457 | |
11 | CL12 | OB76 | 50 | 0.0114 | .764 | .962 | 19 | 27.9 | 2.9 | 8.8136 | 6.3138 | 23.4457 | |
10 | CL11 | OB77 | 51 | 0.0021 | .762 | .957 | 18 | 31.0 | 0.5 | 8.6593 | 6.0751 | 23.4457 | |
9 | CL10 | OB78 | 52 | 0.0103 | .752 | .951 | 17 | 33.3 | 2.5 | 8.6007 | 6.0976 | 23.4457 | |
8 | CL9 | OB43 | 53 | 0.0034 | .748 | .943 | 16 | 37.8 | 0.8 | 8.4964 | 5.9160 | 23.4457 | |
7 | CL8 | OB93 | 54 | 0.0109 | .737 | .933 | 15 | 42.1 | 2.6 | 8.367 | 5.7913 | 23.4457 | |
6 | CL7 | OB88 | 55 | 0.0110 | .726 | .920 | 13 | 48.3 | 2.6 | 7.916 | 5.3679 | 23.4457 | |
5 | CL6 | OB87 | 56 | 0.0120 | .714 | .902 | 12 | 57.5 | 2.7 | 6.6917 | 4.3415 | 23.4457 | |
4 | CL20 | OB61 | 39 | 0.0077 | .707 | .875 | 9.8 | 74.7 | 8.3 | 6.2578 | 3.2882 | 100.0 | |
3 | CL5 | OB82 | 57 | 0.0138 | .693 | .827 | 5.0 | 106 | 3.0 | 5.3605 | 3.2834 | 23.4457 | |
2 | CL3 | OB23 | 58 | 0.0117 | .681 | .697 | .54 | 203 | 2.5 | 3.2687 | 1.7568 | 23.4457 | |
1 | CL2 | CL4 | 97 | 0.6812 | .000 | .000 | 0.00 | . | 203 | 13.764 | 23.4457 | 100.0 | |

2 modal clusters have been formed.
For average linkage, the CCC has peaks at 3, 8, 10, and 12 clusters, but the 3-cluster peak is lower than the 8-cluster peak. The pseudo F statistic has peaks at 3, 8, and 12 clusters. The pseudo t² statistic drops sharply at 3 clusters, continues to fall at 4 clusters, and has a particularly low value at 12 clusters. However, there are not enough data to seriously consider as many as 12 clusters. Scatter plots are given for 3 and 8 clusters. The results are shown in Output 23.2.1 through Output 23.2.2. In Output 23.2.2, the eighth cluster consists of the two outlying observations, Mexico and Korea.
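One way to read "peaks" off a cluster history is to look for levels whose value exceeds both neighboring levels. The Python sketch below applies that rule to the CCC and pseudo F columns copied from the average-linkage cluster history for 2 to 15 clusters. Treating a peak as a strict local maximum is an assumption of this sketch, and `local_peaks` is a hypothetical helper, not a SAS feature.

```python
def local_peaks(values_by_ncl):
    """Return cluster counts whose value exceeds both neighbouring levels."""
    levels = sorted(values_by_ncl)
    peaks = []
    for lo, mid, hi in zip(levels, levels[1:], levels[2:]):
        v = values_by_ncl[mid]
        if v > values_by_ncl[lo] and v > values_by_ncl[hi]:
            peaks.append(mid)
    return peaks

# CCC and pseudo F by number of clusters (the 1-cluster row is left out
# because its pseudo F value is missing).
ccc = {15: 2.61, 14: 1.97, 13: 2.35, 12: 2.85, 11: 2.78, 10: 2.84, 9: 2.45,
       8: 3.59, 7: 3.01, 6: 2.63, 5: 0.59, 4: 0.77, 3: 2.13, 2: -0.55}
psf = {15: 292, 14: 271, 13: 279, 12: 290, 11: 285, 10: 284, 9: 271,
       8: 302, 7: 284, 6: 273, 5: 225, 4: 234, 3: 300, 2: 203}

print(local_peaks(ccc))
print(local_peaks(psf))
```

A rule this simple ignores how high a peak is, so it is only a starting point for the kind of judgment described in the text.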
Complete linkage shows CCC peaks at 3, 8, and 12 clusters. The pseudo F statistic peaks at 3 and 12 clusters. The pseudo t² statistic indicates 3 clusters.
The scatter plot for 3 clusters is shown in Output 23.2.4.
The CCC and pseudo F statistics are not appropriate for use with single linkage because of the method's tendency to chop off tails of distributions. The pseudo t² statistic can be used by looking for large values and taking the number of clusters to be one greater than the level at which the large pseudo t² value is displayed. For these data, there are large values at levels 6 and 9, suggesting 7 or 10 clusters.
The scatter plots for 7 and 10 clusters are shown in Output 23.2.5.
For kth-nearest-neighbor density linkage, the number of modes as a function of k is as follows (not all of these analyses are shown):
k | modes |
---|---|
3 | 13 |
4 | 6 |
5-7 | 4 |
8-15 | 3 |
16-21 | 2 |
22+ | 1 |
Thus, there is strong evidence of 3 modes and an indication of the possibility of 2 modes. Uniform-kernel density linkage gives similar results. For K=10 (10th-nearest-neighbor density linkage), the scatter plot for 3 clusters is shown; and for K=18, the scatter plot for 2 clusters is shown. The results are shown in Output 23.2.6.
In summary, most of the clustering methods indicate 3 or 8 clusters. Most methods agree at the 3-cluster level, but at the other levels, there is considerable disagreement about the composition of the clusters. The presence of numerous ties also complicates the analysis; see Example 23.4 on page 1027.
The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.
This example analyzes the iris data by Ward's method and two-stage density linkage and then illustrates how the FASTCLUS procedure can be used in combination with PROC CLUSTER to analyze large data sets.
    title 'Cluster Analysis of Fisher (1936) Iris Data';

    proc format;
       value specname
          1='Setosa    '
          2='Versicolor'
          3='Virginica ';
    run;

    data iris;
       input SepalLength SepalWidth PetalLength PetalWidth Species @@;
       format Species specname.;
       label SepalLength='Sepal Length in mm.'
             SepalWidth ='Sepal Width in mm.'
             PetalLength='Petal Length in mm.'
             PetalWidth ='Petal Width in mm.';
       symbol = put(species, specname10.);
       datalines;
    50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
    63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
    59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
    65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
    68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
    77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
    49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
    64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
    55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
    49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
    67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
    77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
    50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
    61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
    61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
    51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
    51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
    46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
    50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
    57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
    71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
    49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
    49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
    66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
    44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
    47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
    74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
    56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
    49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
    56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
    51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
    54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
    61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
    68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
    45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
    55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
    51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
    63 33 60 25 3 53 37 15 02 1
    ;
The following macro, SHOW, is used in the subsequent analyses to display cluster results. It invokes the FREQ procedure to crosstabulate clusters and species. The CANDISC procedure computes canonical variables for discriminating among the clusters, and the first two canonical variables are plotted to show cluster membership. See Chapter 21, The CANDISC Procedure, for a canonical discriminant analysis of the iris species.
    %macro show;
       /* Crosstabulate clusters against species */
       proc freq;
          tables cluster*species;
       run;

       /* Canonical variables that discriminate among the clusters */
       proc candisc noprint out=can;
          class cluster;
          var petal: sepal:;
       run;

       legend1 frame cframe=ligr cborder=black
               position=center value=(justify=center);
       axis1 label=(angle=90 rotate=0) minor=none;
       axis2 minor=none;
       /* Plot the first two canonical variables by cluster */
       proc gplot;
          plot can2*can1=cluster /
               frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
       run;
    %mend;
The first analysis clusters the iris data by Ward's method and plots the CCC and pseudo F and t² statistics. The CCC has a local peak at 3 clusters but a higher peak at 5 clusters. The pseudo F statistic indicates 3 clusters, while the pseudo t² statistic suggests 3 or 6 clusters. For large numbers of clusters, Version 6 of the SAS System produces somewhat different results than previous versions of PROC CLUSTER. This is due to changes in the treatment of ties. Results are identical for 5 or fewer clusters.
The TREE procedure creates an output data set containing the 3-cluster partition for use by the SHOW macro. The FREQ procedure reveals 16 misclassifications. The results are shown in Output 23.3.1.
    title2 'By Ward''s Method';
    proc cluster data=iris method=ward print=15 ccc pseudo;
       var petal: sepal:;
       copy species;
    run;

    legend1 frame cframe=ligr cborder=black
            position=center value=(justify=center);
    axis1 label=(angle=90 rotate=0) minor=none order=(0 to 600 by 100);
    axis2 minor=none order=(1 to 30 by 1);
    axis3 label=(angle=90 rotate=0) minor=none order=(0 to 7 by 1);
    /* Plot CCC, pseudo F, and pseudo t^2 against the number of clusters */
    proc gplot;
       plot _ccc_*_ncl_ /
            frame cframe=ligr legend=legend1 vaxis=axis3 haxis=axis2;
       plot _psf_*_ncl_ _pst2_*_ncl_ / overlay
            frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
    run;

    /* Extract the 3-cluster partition for the SHOW macro */
    proc tree noprint ncl=3 out=out;
       copy petal: sepal: species;
    run;

    %show;

Cluster Analysis of Fisher (1936) Iris Data
By Ward's Method

The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 422.824171 | 398.557096 | 0.9246 | 0.9246 |
2 | 24.267075 | 16.446125 | 0.0531 | 0.9777 |
3 | 7.820950 | 5.437441 | 0.0171 | 0.9948 |
4 | 2.383509 | | 0.0052 | 1.0000 |

Root-Mean-Square Total-Sample Standard Deviation = 10.69224
Root-Mean-Square Distance Between Observations = 30.24221

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Tie |
---|---|---|---|---|---|---|---|---|---|---|
15 | CL24 | CL28 | 15 | 0.0016 | .971 | .958 | 5.93 | 324 | 9.8 | |
14 | CL21 | CL53 | 7 | 0.0019 | .969 | .955 | 5.85 | 329 | 5.1 | |
13 | CL18 | CL48 | 15 | 0.0023 | .967 | .953 | 5.69 | 334 | 8.9 | |
12 | CL16 | CL23 | 24 | 0.0023 | .965 | .950 | 4.63 | 342 | 9.6 | |
11 | CL14 | CL43 | 12 | 0.0025 | .962 | .946 | 4.67 | 353 | 5.8 | |
10 | CL26 | CL20 | 22 | 0.0027 | .959 | .942 | 4.81 | 368 | 12.9 | |
9 | CL27 | CL17 | 31 | 0.0031 | .956 | .936 | 5.02 | 387 | 17.8 | |
8 | CL35 | CL15 | 23 | 0.0031 | .953 | .930 | 5.44 | 414 | 13.8 | |
7 | CL10 | CL47 | 26 | 0.0058 | .947 | .921 | 5.43 | 430 | 19.1 | |
6 | CL8 | CL13 | 38 | 0.0060 | .941 | .911 | 5.81 | 463 | 16.3 | |
5 | CL9 | CL19 | 50 | 0.0105 | .931 | .895 | 5.82 | 488 | 43.2 | |
4 | CL12 | CL11 | 36 | 0.0172 | .914 | .872 | 3.99 | 515 | 41.0 | |
3 | CL6 | CL7 | 64 | 0.0301 | .884 | .827 | 4.33 | 558 | 57.2 | |
2 | CL4 | CL3 | 100 | 0.1110 | .773 | .697 | 3.83 | 503 | 116 | |
1 | CL5 | CL2 | 150 | 0.7726 | .000 | .000 | 0.00 | . | 503 | |
Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure
Table of CLUSTER by Species

CLUSTER | Statistic | Setosa | Versicolor | Virginica | Total |
---|---|---|---|---|---|
1 | Frequency | 0 | 49 | 15 | 64 |
1 | Percent | 0.00 | 32.67 | 10.00 | 42.67 |
1 | Row Pct | 0.00 | 76.56 | 23.44 | |
1 | Col Pct | 0.00 | 98.00 | 30.00 | |
2 | Frequency | 0 | 1 | 35 | 36 |
2 | Percent | 0.00 | 0.67 | 23.33 | 24.00 |
2 | Row Pct | 0.00 | 2.78 | 97.22 | |
2 | Col Pct | 0.00 | 2.00 | 70.00 | |
3 | Frequency | 50 | 0 | 0 | 50 |
3 | Percent | 33.33 | 0.00 | 0.00 | 33.33 |
3 | Row Pct | 100.00 | 0.00 | 0.00 | |
3 | Col Pct | 100.00 | 0.00 | 0.00 | |
Total | Frequency | 50 | 50 | 50 | 150 |
Total | Percent | 33.33 | 33.33 | 33.33 | 100.00 |
The second analysis uses two-stage density linkage. The raw data suggest 2 or 6 modes instead of 3:
k | modes |
---|---|
3 | 12 |
4-6 | 6 |
7 | 4 |
8 | 3 |
9-50 | 2 |
51+ | 1 |
However, the ACECLUS procedure can be used to reveal 3 modes. This analysis uses K=8 to produce 3 clusters for comparison with other analyses. There are only 6 misclassifications. The results are shown in Output 23.3.2.
    title2 'By Two-Stage Density Linkage';
    proc cluster data=iris method=twostage k=8 print=15 ccc pseudo;
       var petal: sepal:;
       copy species;
    run;

    proc tree noprint ncl=3 out=out;
       copy petal: sepal: species;
    run;

    %show;

Cluster Analysis of Fisher (1936) Iris Data
By Two-Stage Density Linkage

The CLUSTER Procedure
Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix

  | Eigenvalue | Difference | Proportion | Cumulative |
---|---|---|---|---|
1 | 422.824171 | 398.557096 | 0.9246 | 0.9246 |
2 | 24.267075 | 16.446125 | 0.0531 | 0.9777 |
3 | 7.820950 | 5.437441 | 0.0171 | 0.9948 |
4 | 2.383509 | | 0.0052 | 1.0000 |

K = 8
Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Cluster History

NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | Normalized Fusion Density | Lesser Maximum Density | Greater Maximum Density | Tie |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | CL17 | OB127 | 44 | 0.0025 | .916 | .958 | 11 | 105 | 3.4 | 0.3903 | 0.2066 | 3.5156 | |
14 | CL16 | OB137 | 50 | 0.0023 | .913 | .955 | 11 | 110 | 5.6 | 0.3637 | 0.1837 | 100.0 | |
13 | CL15 | OB74 | 45 | 0.0029 | .910 | .953 | 10 | 116 | 3.7 | 0.3553 | 0.2130 | 3.5156 | |
12 | CL28 | OB49 | 46 | 0.0036 | .907 | .950 | 8.0 | 122 | 5.2 | 0.3223 | 0.1736 | 8.3678 | T |
11 | CL12 | OB85 | 47 | 0.0036 | .903 | .946 | 7.6 | 130 | 4.8 | 0.3223 | 0.1736 | 8.3678 | |
10 | CL11 | OB98 | 48 | 0.0033 | .900 | .942 | 7.1 | 140 | 4.1 | 0.2879 | 0.1479 | 8.3678 | |
9 | CL13 | OB24 | 46 | 0.0037 | .896 | .936 | 6.5 | 152 | 4.4 | 0.2802 | 0.2005 | 3.5156 | |
8 | CL10 | OB25 | 49 | 0.0019 | .894 | .930 | 5.5 | 171 | 2.2 | 0.2699 | 0.1372 | 8.3678 | |
7 | CL8 | OB121 | 50 | 0.0035 | .891 | .921 | 4.5 | 194 | 4.0 | 0.2586 | 0.1372 | 8.3678 | |
6 | CL9 | OB45 | 47 | 0.0042 | .886 | .911 | 3.3 | 225 | 4.6 | 0.1412 | 0.0832 | 3.5156 | |
5 | CL6 | OB39 | 48 | 0.0049 | .882 | .895 | 1.7 | 270 | 5.0 | 0.107 | 0.0605 | 3.5156 | |
4 | CL5 | OB21 | 49 | 0.0049 | .877 | .872 | 0.35 | 346 | 4.7 | 0.0969 | 0.0541 | 3.5156 | |
3 | CL4 | OB90 | 50 | 0.0047 | .872 | .827 | 3.28 | 500 | 4.1 | 0.0715 | 0.0370 | 3.5156 | |
2 | CL3 | CL7 | 100 | 0.0993 | .773 | .697 | 3.83 | 503 | 91.9 | 2.6277 | 3.5156 | 8.3678 | |

3 modal clusters have been formed.

The FREQ Procedure
Table of CLUSTER by Species

CLUSTER | Statistic | Setosa | Versicolor | Virginica | Total |
---|---|---|---|---|---|
1 | Frequency | 50 | 0 | 0 | 50 |
1 | Percent | 33.33 | 0.00 | 0.00 | 33.33 |
1 | Row Pct | 100.00 | 0.00 | 0.00 | |
1 | Col Pct | 100.00 | 0.00 | 0.00 | |
2 | Frequency | 0 | 47 | 3 | 50 |
2 | Percent | 0.00 | 31.33 | 2.00 | 33.33 |
2 | Row Pct | 0.00 | 94.00 | 6.00 | |
2 | Col Pct | 0.00 | 94.00 | 6.00 | |
3 | Frequency | 0 | 3 | 47 | 50 |
3 | Percent | 0.00 | 2.00 | 31.33 | 33.33 |
3 | Row Pct | 0.00 | 6.00 | 94.00 | |
3 | Col Pct | 0.00 | 6.00 | 94.00 | |
Total | Frequency | 50 | 50 | 50 | 150 |
Total | Percent | 33.33 | 33.33 | 33.33 | 100.00 |
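The misclassification counts quoted for Ward's method and for two-stage density linkage can be recovered from the two CLUSTER by Species crosstabulations. The Python sketch below assumes that each cluster is labeled with its most frequent species and that every other observation in the cluster counts as misclassified; that counting rule and the helper name `misclassified` are assumptions of the sketch.

```python
def misclassified(table):
    """Count observations outside the majority species of their cluster."""
    return sum(sum(row) - max(row) for row in table)

# Rows are clusters 1 to 3; columns are Setosa, Versicolor, Virginica.
ward = [[0, 49, 15], [0, 1, 35], [50, 0, 0]]
twostage = [[50, 0, 0], [0, 47, 3], [0, 3, 47]]

print(misclassified(ward), misclassified(twostage))
```

With the frequencies above, the rule gives 16 for Ward's method and 6 for two-stage density linkage, matching the counts stated in the text.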
The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time varies as the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can, therefore, be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis producing a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.
FASTCLUS automatically creates variables _FREQ_ and _RMSSTD_ in the MEAN= output data set. These variables are then automatically used by PROC CLUSTER in the computation of various statistics.
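The reason the frequencies matter can be seen from Ward's criterion, in which the distance between clusters K and L is the squared distance between their means divided by (1/N_K + 1/N_L). The Python sketch below evaluates that expression for two hypothetical preliminary clusters. The function names and the toy values are assumptions for illustration and are not taken from the iris output; how PROC CLUSTER uses _RMSSTD_ is not modeled here.

```python
def ward_distance(mean_k, mean_l, n_k, n_l):
    """Ward's criterion: ||mean_k - mean_l||^2 / (1/n_k + 1/n_l)."""
    sq = sum((a - b) ** 2 for a, b in zip(mean_k, mean_l))
    return sq / (1.0 / n_k + 1.0 / n_l)

def merged_mean(mean_k, mean_l, n_k, n_l):
    """Frequency-weighted mean of the merged cluster."""
    n = n_k + n_l
    return [(n_k * a + n_l * b) / n for a, b in zip(mean_k, mean_l)]

# Hypothetical cluster means two units apart.
print(ward_distance([0.0, 0.0], [2.0, 0.0], 1, 1))   # two single observations
print(ward_distance([0.0, 0.0], [2.0, 0.0], 9, 1))   # a 9-point cluster and one observation
print(merged_mean([0.0, 0.0], [2.0, 0.0], 9, 1))
```

The same pair of means is farther apart in Ward's sense when one of them represents nine observations, which is why treating preliminary cluster means as single observations would distort the hierarchy.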
The iris data are used to illustrate the process of clustering clusters. In the preliminary analysis, PROC FASTCLUS produces ten clusters, which are then crosstabulated with species. The data set containing the preliminary clusters is sorted in preparation for later merges. The results are shown in Output 23.3.3.
    title2 'Preliminary Analysis by FASTCLUS';
    proc fastclus data=iris summary maxc=10 maxiter=99 converge=0
                  mean=mean out=prelim cluster=preclus;
       var petal: sepal:;
    run;

    proc freq;
       tables preclus*species;
    run;

    proc sort data=prelim;
       by preclus;
    run;

Cluster Analysis of Fisher (1936) Iris Data
Preliminary Analysis by FASTCLUS

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=10 Maxiter=99 Converge=0

Cluster Summary

Cluster | Frequency | RMS Std Deviation | Maximum Distance from Seed to Observation | Radius Exceeded | Nearest Cluster | Distance Between Cluster Centroids |
---|---|---|---|---|---|---|
1 | 9 | 2.7067 | 8.2027 | | 5 | 8.7362 |
2 | 19 | 2.2001 | 7.7340 | | 4 | 6.2243 |
3 | 18 | 2.1496 | 6.2173 | | 8 | 7.5049 |
4 | 4 | 2.5249 | 5.3268 | | 2 | 6.2243 |
5 | 3 | 2.7234 | 5.8214 | | 1 | 8.7362 |
6 | 7 | 2.2939 | 5.1508 | | 2 | 9.3318 |
7 | 17 | 2.0274 | 6.9576 | | 10 | 7.9503 |
8 | 18 | 2.2628 | 7.1135 | | 3 | 7.5049 |
9 | 22 | 2.2666 | 7.5029 | | 8 | 9.0090 |
10 | 33 | 2.0594 | 10.0033 | | 7 | 7.9503 |

Pseudo F Statistic = 370.58
Observed Over-All R-Squared = 0.95971
Approximate Expected Over-All R-Squared = 0.82928
Cubic Clustering Criterion = 27.077
WARNING: The two values above are invalid for correlated variables.

The FREQ Procedure
Table of preclus by Species

preclus (Cluster) | Statistic | Setosa | Versicolor | Virginica | Total |
---|---|---|---|---|---|
1 | Frequency | 0 | 0 | 9 | 9 |
1 | Percent | 0.00 | 0.00 | 6.00 | 6.00 |
1 | Row Pct | 0.00 | 0.00 | 100.00 | |
1 | Col Pct | 0.00 | 0.00 | 18.00 | |
2 | Frequency | 0 | 19 | 0 | 19 |
2 | Percent | 0.00 | 12.67 | 0.00 | 12.67 |
2 | Row Pct | 0.00 | 100.00 | 0.00 | |
2 | Col Pct | 0.00 | 38.00 | 0.00 | |
3 | Frequency | 0 | 18 | 0 | 18 |
3 | Percent | 0.00 | 12.00 | 0.00 | 12.00 |
3 | Row Pct | 0.00 | 100.00 | 0.00 | |
3 | Col Pct | 0.00 | 36.00 | 0.00 | |
4 | Frequency | 0 | 3 | 1 | 4 |
4 | Percent | 0.00 | 2.00 | 0.67 | 2.67 |
4 | Row Pct | 0.00 | 75.00 | 25.00 | |
4 | Col Pct | 0.00 | 6.00 | 2.00 | |
5 | Frequency | 0 | 0 | 3 | 3 |
5 | Percent | 0.00 | 0.00 | 2.00 | 2.00 |
5 | Row Pct | 0.00 | 0.00 | 100.00 | |
5 | Col Pct | 0.00 | 0.00 | 6.00 | |
6 | Frequency | 0 | 7 | 0 | 7 |
6 | Percent | 0.00 | 4.67 | 0.00 | 4.67 |
6 | Row Pct | 0.00 | 100.00 | 0.00 | |
6 | Col Pct | 0.00 | 14.00 | 0.00 | |
7 | Frequency | 17 | 0 | 0 | 17 |
7 | Percent | 11.33 | 0.00 | 0.00 | 11.33 |
7 | Row Pct | 100.00 | 0.00 | 0.00 | |
7 | Col Pct | 34.00 | 0.00 | 0.00 | |
8 | Frequency | 0 | 3 | 15 | 18 |
8 | Percent | 0.00 | 2.00 | 10.00 | 12.00 |
8 | Row Pct | 0.00 | 16.67 | 83.33 | |
8 | Col Pct | 0.00 | 6.00 | 30.00 | |
9 | Frequency | 0 | 0 | 22 | 22 |
9 | Percent | 0.00 | 0.00 | 14.67 | 14.67 |
9 | Row Pct | 0.00 | 0.00 | 100.00 | |
9 | Col Pct | 0.00 | 0.00 | 44.00 | |
10 | Frequency | 33 | 0 | 0 | 33 |
10 | Percent | 22.00 | 0.00 | 0.00 | 22.00 |
10 | Row Pct | 100.00 | 0.00 | 0.00 | |
10 | Col Pct | 66.00 | 0.00 | 0.00 | |
Total | Frequency | 50 | 50 | 50 | 150 |
Total | Percent | 33.33 | 33.33 | 33.33 | 100.00 |
The following macro, CLUS, clusters the preliminary clusters. There is one argument to choose the METHOD= specification to be used by PROC CLUSTER. The TREE procedure creates an output data set containing the 3-cluster partition, which is sorted and merged with the OUT= data set from PROC FASTCLUS to determine to which cluster each of the original 150 observations belongs. The SHOW macro is then used to display the results. In this example, the CLUS macro is invoked using Ward's method, which produces 16 misclassifications, and Wong's hybrid method, which produces 22 misclassifications. The results are shown in Output 23.3.4 and Output 23.3.5.
%macro clus(method);
   /* cluster the preliminary cluster means */
   proc cluster data=mean method=&method ccc pseudo;
      var petal: sepal:;
      copy preclus;
   run;

   /* extract the 3-cluster partition */
   proc tree noprint ncl=3 out=out;
      copy petal: sepal: preclus;
   run;

   proc sort data=out;
      by preclus;
   run;

   /* attach the final cluster to each original observation */
   data clus;
      merge out prelim;
      by preclus;
   run;

   %show;
%mend;

title2 'Clustering Clusters by Ward''s Method';
%clus(ward);

title2 'Clustering Clusters by Wong''s Hybrid Method';
%clus(twostage hybrid);
Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER     Species

Frequency|
Percent  |
Row Pct  |
Col Pct  |Setosa  |Versicol|Virginic|  Total
         |        |or      |a       |
---------+--------+--------+--------+
        1|      0 |     50 |     16 |     66
         |   0.00 |  33.33 |  10.67 |  44.00
         |   0.00 |  75.76 |  24.24 |
         |   0.00 | 100.00 |  32.00 |
---------+--------+--------+--------+
        2|      0 |      0 |     34 |     34
         |   0.00 |   0.00 |  22.67 |  22.67
         |   0.00 |   0.00 | 100.00 |
         |   0.00 |   0.00 |  68.00 |
---------+--------+--------+--------+
        3|     50 |      0 |      0 |     50
         |  33.33 |   0.00 |   0.00 |  33.33
         | 100.00 |   0.00 |   0.00 |
         | 100.00 |   0.00 |   0.00 |
---------+--------+--------+--------+
Total          50       50       50      150
            33.33    33.33    33.33   100.00
Cluster Analysis of Fisher (1936) Iris Data Clustering Clusters by Wong's Hybrid Method The CLUSTER Procedure Two-Stage Density Linkage Clustering Eigenvalues of the Covariance Matrix Eigenvalue Difference Proportion Cumulative 1 416.976349 398.666421 0.9501 0.9501 2 18.309928 14.952922 0.0417 0.9918 3 3.357006 3.126943 0.0076 0.9995 4 0.230063 0.0005 1.0000 Root-Mean-Square Total-Sample Standard Deviation = 10.69224 Cluster History Normalized Maximum Density T Fusion in Each Cluster i NCL --Clusters Joined-- FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Density Lesser Greater e 9 OB10 OB7 50 0.0104 .949 .932 3.81 330 42.2 40.24 58.2179 100.0 8 OB3 OB8 36 0.0074 .942 .926 3.22 329 26.0 27.981 39.4511 48.4350 7 OB2 OB4 23 0.0019 .940 .918 4.24 373 6.3 23.775 8.9675 46.3026 6 CL8 OB9 58 0.0194 .921 .907 2.13 334 46.3 20.724 46.8846 48.4350 5 CL7 OB6 30 0.0069 .914 .892 3.09 383 19.5 13.303 17.6360 46.3026 4 CL6 OB1 67 0.0292 .884 .870 1.21 372 41.0 8.4137 10.8758 48.4350 3 CL4 OB5 70 0.0138 .871 .824 3.33 494 12.3 5.1855 6.2890 48.4350 2 CL3 CL5 100 0.0979 .773 .695 3.94 503 89.5 19.513 46.3026 48.4350 1 CL2 CL9 150 0.7726 .000 .000 0.00 . 503 1.3337 48.4350 100.0 3 modal clusters have been formed.
Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER     Species

Frequency|
Percent  |
Row Pct  |
Col Pct  |Setosa  |Versicol|Virginic|  Total
         |        |or      |a       |
---------+--------+--------+--------+
        1|     50 |      0 |      0 |     50
         |  33.33 |   0.00 |   0.00 |  33.33
         | 100.00 |   0.00 |   0.00 |
         | 100.00 |   0.00 |   0.00 |
---------+--------+--------+--------+
        2|      0 |     21 |     49 |     70
         |   0.00 |  14.00 |  32.67 |  46.67
         |   0.00 |  30.00 |  70.00 |
         |   0.00 |  42.00 |  98.00 |
---------+--------+--------+--------+
        3|      0 |     29 |      1 |     30
         |   0.00 |  19.33 |   0.67 |  20.00
         |   0.00 |  96.67 |   3.33 |
         |   0.00 |  58.00 |   2.00 |
---------+--------+--------+--------+
Total          50       50       50      150
            33.33    33.33    33.33   100.00
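The misclassification counts quoted for the two methods can be recovered from the CLUSTER by Species frequencies. The following Python sketch assumes that each cluster is labeled with its most frequent species and that every other observation in the cluster counts as misclassified; that labeling rule and the function name `misclassified` are assumptions made for illustration, and the frequencies are copied from the two crosstabulations above.

```python
def misclassified(crosstab):
    """Count observations outside the majority species of their cluster."""
    total = sum(sum(row) for row in crosstab)
    matched = sum(max(row) for row in crosstab)
    return total - matched


# Rows are clusters 1-3; columns are Setosa, Versicolor, Virginica.
ward = [[0, 50, 16], [0, 0, 34], [50, 0, 0]]
wong_hybrid = [[50, 0, 0], [0, 21, 49], [0, 29, 1]]

print(misclassified(ward), misclassified(wong_hybrid))
```

Under this rule the sketch gives 16 for Ward's method and 22 for Wong's hybrid method, matching the counts stated in the text.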
If, at some level of the cluster history, there is a tie for minimum distance between clusters, then one or more levels of the sample cluster tree are not uniquely determined. This example shows how the degree of indeterminacy can be assessed.
Mammals have four kinds of teeth: incisors, canines, premolars, and molars. The following data set gives the number of teeth of each kind on one side of the top and bottom jaws for 32 mammals.
Since all eight variables are measured in the same units, it is not strictly necessary to rescale the data. However, the canines have much less variance than the other kinds of teeth and, therefore, have little effect on the analysis if the variables are not standardized. An average linkage cluster analysis is run with and without standardization to allow comparison of the results. The results are shown in Output 23.4.1 and Output 23.4.2.
title 'Hierarchical Cluster Analysis of Mammals'' Teeth Data';
title2 'Evaluating the Effects of Ties';

data teeth;
   /* name in columns 1-16; eight one-digit counts start in column 21 */
   input mammal $ 1-16 @21 (v1-v8) (1.);
   label v1='Top incisors'
         v2='Bottom incisors'
         v3='Top canines'
         v4='Bottom canines'
         v5='Top premolars'
         v6='Bottom premolars'
         v7='Top molars'
         v8='Bottom molars';
   datalines;
BROWN BAT           23113333
MOLE                32103333
SILVER HAIR BAT     23112333
PIGMY BAT           23112233
HOUSE BAT           23111233
RED BAT             13112233
PIKA                21002233
RABBIT              21003233
BEAVER              11002133
GROUNDHOG           11002133
GRAY SQUIRREL       11001133
HOUSE MOUSE         11000033
PORCUPINE           11001133
WOLF                33114423
BEAR                33114423
RACCOON             33114432
MARTEN              33114412
WEASEL              33113312
WOLVERINE           33114412
BADGER              33113312
RIVER OTTER         33114312
SEA OTTER           32113312
JAGUAR              33113211
COUGAR              33113211
FUR SEAL            32114411
SEA LION            32114411
GREY SEAL           32113322
ELEPHANT SEAL       21114411
REINDEER            04103333
ELK                 04103333
DEER                04003333
MOOSE               04003333
;

/* average linkage on the raw counts */
proc cluster data=teeth method=average nonorm outtree=_null_;
   var v1-v8;
   id mammal;
   title3 'Raw Data';
run;

/* average linkage on standardized counts */
proc cluster data=teeth std method=average nonorm outtree=_null_;
   var v1-v8;
   id mammal;
   title3 'Standardized Data';
run;
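The statement that the canines have much less variance than the other kinds of teeth can be checked directly from the data. The following Python sketch (an illustration outside the SAS example) copies the 32 eight-digit dental records from the DATA step above, computes the sample variance of each variable, and derives the root-mean-square standard deviation, which should be comparable to the value PROC CLUSTER prints for the raw data.

```python
import math
import statistics

# Eight one-digit tooth counts (v1-v8) per mammal, in DATA step order.
TEETH = [
    "23113333", "32103333", "23112333", "23112233", "23111233", "13112233",
    "21002233", "21003233", "11002133", "11002133", "11001133", "11000033",
    "11001133", "33114423", "33114423", "33114432", "33114412", "33113312",
    "33114412", "33113312", "33114312", "32113312", "33113211", "33113211",
    "32114411", "32114411", "32113322", "21114411", "04103333", "04103333",
    "04003333", "04003333",
]

# Transpose the records into one list of counts per variable.
columns = [[int(record[k]) for record in TEETH] for k in range(8)]

# Sample variance (divisor n - 1) of each variable.
variances = [statistics.variance(column) for column in columns]

# Root-mean-square of the eight standard deviations.
rms_std = math.sqrt(sum(variances) / len(variances))

for k, value in enumerate(variances, start=1):
    print(f"v{k}: {value:.4f}")
print(f"RMS total-sample standard deviation: {rms_std:.6f}")
```

The two smallest variances belong to v3 and v4, the canines, which is why those variables carry little weight unless the data are standardized.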
Hierarchical Cluster Analysis of Mammals' Teeth Data Evaluating the Effects of Ties Raw Data The CLUSTER Procedure Average Linkage Cluster Analysis Eigenvalues of the Covariance Matrix Eigenvalue Difference Proportion Cumulative 1 3.76799365 2.33557185 0.5840 0.5840 2 1.43242180 0.91781899 0.2220 0.8061 3 0.51460281 0.08414950 0.0798 0.8858 4 0.43045331 0.30021485 0.0667 0.9525 5 0.13023846 0.03814626 0.0202 0.9727 6 0.09209220 0.04216914 0.0143 0.9870 7 0.04992305 0.01603541 0.0077 0.9947 8 0.03388764 0.0053 1.0000 Root-Mean-Square Total-Sample Standard Deviation = 0.898027 Cluster History T RMS i NCL ----------Clusters Joined----------- FREQ Dist e 31 BEAVER GROUNDHOG 2 0 T 30 GRAY SQUIRREL PORCUPINE 2 0 T 29 WOLF BEAR 2 0 T 28 MARTEN WOLVERINE 2 0 T 27 WEASEL BADGER 2 0 T 26 JAGUAR COUGAR 2 0 T 25 FUR SEAL SEA LION 2 0 T 24 REINDEER ELK 2 0 T 23 DEER MOOSE 2 0 22 BROWN BAT SILVER HAIR BAT 2 1 T 21 PIGMY BAT HOUSE BAT 2 1 T 20 PIKA RABBIT 2 1 T 19 CL31 CL30 4 1 T 18 CL28 RIVER OTTER 3 1 T 17 CL27 SEA OTTER 3 1 T 16 CL24 CL23 4 1 15 CL21 RED BAT 3 1.2247 14 CL17 GREY SEAL 4 1.291 13 CL29 RACCOON 3 1.4142 T 12 CL25 ELEPHANT SEAL 3 1.4142 11 CL18 CL14 7 1.5546 10 CL22 CL15 5 1.5811 9 CL20 CL19 6 1.8708 T 8 CL11 CL26 9 1.9272 7 CL8 CL12 12 2.2278 6 MOLE CL13 4 2.2361 5 CL9 HOUSE MOUSE 7 2.4833 4 CL6 CL7 16 2.5658 3 CL10 CL16 9 2.8107 2 CL3 CL5 16 3.7054 1 CL2 CL4 32 4.2939
Hierarchical Cluster Analysis of Mammals' Teeth Data Evaluating the Effects of Ties Standardized Data The CLUSTER Procedure Average Linkage Cluster Analysis Eigenvalues of the Correlation Matrix Eigenvalue Difference Proportion Cumulative 1 4.74153902 3.27458808 0.5927 0.5927 2 1.46695094 0.70824118 0.1834 0.7761 3 0.75870977 0.25146252 0.0948 0.8709 4 0.50724724 0.30264737 0.0634 0.9343 5 0.20459987 0.05925818 0.0256 0.9599 6 0.14534169 0.03450100 0.0182 0.9780 7 0.11084070 0.04606994 0.0139 0.9919 8 0.06477076 0.0081 1.0000 The data have been standardized to mean 0 and variance 1 Root-Mean-Square Total-Sample Standard Deviation = 1 Cluster History T RMS i NCL ----------Clusters Joined----------- FREQ Dist e 31 BEAVER GROUNDHOG 2 0 T 30 GRAY SQUIRREL PORCUPINE 2 0 T 29 WOLF BEAR 2 0 T 28 MARTEN WOLVERINE 2 0 T 27 WEASEL BADGER 2 0 T 26 JAGUAR COUGAR 2 0 T 25 FUR SEAL SEA LION 2 0 T 24 REINDEER ELK 2 0 T 23 DEER MOOSE 2 0 22 PIGMY BAT RED BAT 2 0.9157 21 CL28 RIVER OTTER 3 0.9169 20 CL31 CL30 4 0.9428 T 19 BROWN BAT SILVER HAIR BAT 2 0.9428 T 18 PIKA RABBIT 2 0.9428 17 CL27 SEA OTTER 3 0.9847 16 CL22 HOUSE BAT 3 1.1437 15 CL21 CL17 6 1.3314 14 CL25 ELEPHANT SEAL 3 1.3447 13 CL19 CL16 5 1.4688 12 CL15 GREY SEAL 7 1.6314 11 CL29 RACCOON 3 1.692 10 CL18 CL20 6 1.7357 9 CL12 CL26 9 2.0285 8 CL24 CL23 4 2.1891 7 CL9 CL14 12 2.2674 6 CL10 HOUSE MOUSE 7 2.317 5 CL11 CL7 15 2.6484 4 CL13 MOLE 6 2.8624 3 CL4 CL8 10 3.5194 2 CL3 CL6 17 4.1265 1 CL2 CL5 32 4.7753
There are ties at 16 levels for the raw data but at only 10 levels for the standardized data. There are more ties for the raw data because the increments between successive values are the same for all of the raw variables but different for the standardized variables.
One way to assess the importance of the ties in the analysis is to repeat the analysis on several random permutations of the observations and then to see to what extent the results are consistent at the interesting levels of the cluster history. Three macros are presented to facilitate this process.
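The idea of permuting the observations can be seen on a toy problem. The following Python sketch is not PROC CLUSTER's algorithm: it is a naive average-linkage routine, written for illustration, that merges the first minimal pair it finds, so the outcome of a tie depends on the order of the observations. The function names are hypothetical, and all orderings are enumerated here only because the example has three points; the macros in this example use a limited number of random permutations instead.

```python
import itertools


def average_linkage(points, n_clusters):
    """Merge by smallest average pairwise distance; first minimal pair wins ties."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    while len(clusters) > n_clusters:
        best = None
        for i, j in itertools.combinations(range(len(clusters)), 2):
            total = sum(dist(a, b) for a in clusters[i] for b in clusters[j])
            avg = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def partitions_over_orders(points, n_clusters):
    """Collect the distinct partitions obtained over every ordering of the points."""
    found = set()
    for order in itertools.permutations(range(len(points))):
        permuted = [points[k] for k in order]
        clusters = average_linkage(permuted, n_clusters)
        # Map positions in the permuted list back to the original point ids.
        found.add(frozenset(frozenset(order[p] for p in c) for c in clusters))
    return found


tied = partitions_over_orders([(0,), (1,), (2,)], 2)    # two pairs at distance 1
untied = partitions_over_orders([(0,), (1,), (3,)], 2)  # unique closest pair
print(len(tied), len(untied))
```

With equally spaced points the 2-cluster partition depends on the ordering, whereas with a unique closest pair every ordering gives the same partition. Comparing results across orderings is therefore one way to see whether ties matter at a given level.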
/* --------------------------------------------------------- */
/*                                                           */
/* The macro CLUSPERM randomly permutes observations and     */
/* does a cluster analysis for each permutation.             */
/* The arguments are as follows:                             */
/*                                                           */
/*    data    data set name                                  */
/*    var     list of variables to cluster                   */
/*    id      id variable for proc cluster                   */
/*    method  clustering method (and possibly other options) */
/*    nperm   number of random permutations.                 */
/*                                                           */
/* --------------------------------------------------------- */
%macro CLUSPERM(data,var,id,method,nperm);

   /* ------CREATE TEMPORARY DATA SET WITH RANDOM NUMBERS------ */
   data _temp_;
      set &data;
      array _random_ _ran_1-_ran_&nperm;
      do over _random_;
         _random_=ranuni(835297461);
      end;
   run;

   /* ------PERMUTE AND CLUSTER THE DATA----------------------- */
   %do n=1 %to &nperm;
      proc sort data=_temp_(keep=_ran_&n &var &id) out=_perm_;
         by _ran_&n;
      run;

      /* no DATA= option: the most recently created data set, _perm_, is used */
      proc cluster method=&method noprint outtree=_tree_&n;
         var &var;
         id &id;
      run;
   %end;
%mend;

/* --------------------------------------------------------- */
/*                                                           */
/* The macro PLOTPERM plots various cluster statistics       */
/* against the number of clusters for each permutation.      */
/* The arguments are as follows:                             */
/*                                                           */
/*    stat    names of variables from tree data set          */
/*    nclus   maximum number of clusters to be plotted       */
/*    nperm   number of random permutations.                 */
/*                                                           */
/* --------------------------------------------------------- */
%macro PLOTPERM(stat,nclus,nperm);

   /* ---CONCATENATE TREE DATA SETS FOR NCLUS OR FEWER CLUSTERS--- */
   data _plot_;
      set %do n=1 %to &nperm; _tree_&n(in=_in_&n) %end; ;
      if _ncl_<=&nclus;
      %do n=1 %to &nperm;
         if _in_&n then _perm_=&n;
      %end;
      label _perm_='permutation number';
      keep _ncl_ &stat _perm_;
   run;

   /* ---PLOT THE REQUESTED STATISTICS BY NUMBER OF CLUSTERS--- */
   proc plot;
      plot (&stat)*_ncl_=_perm_ /vpos=26;
      title2 'Symbol is value of _PERM_';
   run;
%mend;

/* --------------------------------------------------------- */
/*                                                           */
/* The macro TREEPERM generates cluster-membership variables */
/* for a specified number of clusters for each permutation.  */
/* PROC PRINT lists the objects in each cluster-combination, */
/* and PROC TABULATE gives the frequencies and means. The    */
/* arguments are as follows:                                 */
/*                                                           */
/*    var      list of variables to cluster                  */
/*             (no "-" or ":" allowed)                       */
/*    id       id variable for proc cluster                  */
/*    meanfmt  format for printing means in PROC TABULATE    */
/*    nclus    number of clusters desired                    */
/*    nperm    number of random permutations.                */
/*                                                           */
/* --------------------------------------------------------- */
%macro TREEPERM(var,id,meanfmt,nclus,nperm);

   /* ------CREATE DATA SETS GIVING CLUSTER MEMBERSHIP--------- */
   %do n=1 %to &nperm;
      proc tree data=_tree_&n noprint n=&nclus
                out=_out_&n(drop=clusname rename=(cluster=_clus_&n));
         copy &var;
         id &id;
      run;

      proc sort;
         by &id &var;
      run;
   %end;

   /* ------MERGE THE CLUSTER VARIABLES------------------------ */
   data _merge_;
      merge %do n=1 %to &nperm; _out_&n %end; ;
      by &id &var;
      /* all_clus holds one 3-character cluster number per permutation */
      length all_clus $ %eval(3*&nperm);
      %do n=1 %to &nperm;
         substr(all_clus, %eval(1+(&n-1)*3), 3) = put(_clus_&n, 3.);
      %end;
   run;

   /* ------PRINT AND TABULATE CLUSTER COMBINATIONS------------ */
   proc sort;
      by _clus_:;
   run;

   proc print;
      var &var;
      id &id;
      by all_clus notsorted;
   run;

   proc tabulate order=data formchar=' ';
      class all_clus;
      var &var;
      table all_clus, n='FREQ'*f=5. mean*f=&meanfmt*(&var) /
            rts=%eval(&nperm*3+1);
   run;
%mend;
Before using these macros, it is convenient to define a macro variable, VLIST, that lists the teeth variables, since the forms V1-V8 or V: cannot be used with the TABULATE procedure in the TREEPERM macro:
/* -TABULATE does not accept hyphens or colons in VAR lists- */
%let vlist=v1 v2 v3 v4 v5 v6 v7 v8;
The CLUSPERM macro is then called to analyze ten random permutations. The PLOTPERM macro plots the pseudo $F$ and $t^2$ statistics and the cubic clustering criterion. Since the data are discrete, the pseudo $F$ statistic and the cubic clustering criterion can be expected to increase as the number of clusters increases, so local maxima or large jumps in these statistics are more relevant than the global maximum in determining the number of clusters. For the raw data, only the pseudo $t^2$ statistic indicates the possible presence of clusters, with the 4-cluster level being suggested. Hence, the TREEPERM macro is used to analyze the results at the 4-cluster level:
title3 'Raw Data';

/* ------CLUSTER RAW DATA WITH AVERAGE LINKAGE-------------- */
%clusperm(teeth, &vlist, mammal, average, 10);

/* -----PLOT STATISTICS FOR THE LAST 20 LEVELS-------------- */
%plotperm(_psf_ _pst2_ _ccc_, 20, 10);

/* ------ANALYZE THE 4-CLUSTER LEVEL------------------------ */
%treeperm(&vlist, mammal, 9.1, 4, 10);
The results are shown in Output 23.4.3.
Hierarchical Cluster Analysis of Mammals' Teeth Data
Symbol is value of _PERM_

[Plot of _PSF_*_NCL_: pseudo F statistic (vertical axis, 20 to 100) against number of clusters (horizontal axis, 1 to 20); plotting symbol is the value of _perm_. NOTE: 10 obs had missing values. 151 obs hidden.]

[Plot of _PST2_*_NCL_: pseudo T-squared statistic (vertical axis, 0 to 30) against number of clusters (horizontal axis, 1 to 20); plotting symbol is the value of _perm_. NOTE: 69 obs had missing values. 104 obs hidden.]

[Plot of _CCC_*_NCL_: cubic clustering criterion (vertical axis, 0 to 4) against number of clusters (horizontal axis, 1 to 20); plotting symbol is the value of _perm_. NOTE: 140 obs had missing values. 50 obs hidden.]
-------------------------------------- all_clus=' 1 3 1 1 1 3 3 3 2 3' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 DEER 0 4 0 0 3 3 3 3 ELK 0 4 1 0 3 3 3 3 MOOSE 0 4 0 0 3 3 3 3 REINDEER 0 4 1 0 3 3 3 3 -------------------------------------- all_clus=' 2 2 2 2 2 2 1 2 1 1' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 BADGER 3 3 1 1 3 3 1 2 BEAR 3 3 1 1 4 4 2 3 COUGAR 3 3 1 1 3 2 1 1 ELEPHANT SEAL 2 1 1 1 4 4 1 1 FUR SEAL 3 2 1 1 4 4 1 1 GREY SEAL 3 2 1 1 3 3 2 2 JAGUAR 3 3 1 1 3 2 1 1 MARTEN 3 3 1 1 4 4 1 2 RACCOON 3 3 1 1 4 4 3 2 RIVER OTTER 3 3 1 1 4 3 1 2 SEA LION 3 2 1 1 4 4 1 1 SEA OTTER 3 2 1 1 3 3 1 2 WEASEL 3 3 1 1 3 3 1 2 WOLF 3 3 1 1 4 4 2 3 WOLVERINE 3 3 1 1 4 4 1 2 -------------------------------------- all_clus=' 2 4 2 2 4 2 1 2 1 1' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 MOLE 3 2 1 0 3 3 3 3 -------------------------------------- all_clus=' 3 1 3 3 3 1 2 1 3 2' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 BEAVER 1 1 0 0 2 1 3 3 GRAY SQUIRREL 1 1 0 0 1 1 3 3 GROUNDHOG 1 1 0 0 2 1 3 3 HOUSE MOUSE 1 1 0 0 0 0 3 3 PORCUPINE 1 1 0 0 1 1 3 3 -------------------------------------- all_clus=' 3 4 3 3 4 1 2 1 3 2' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 PIKA 2 1 0 0 2 2 3 3 RABBIT 2 1 0 0 3 2 3 3 -------------------------------------- all_clus=' 4 4 4 4 4 4 4 4 4 4' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 BROWN BAT 2 3 1 1 3 3 3 3 HOUSE BAT 2 3 1 1 1 2 3 3 PIGMY BAT 2 3 1 1 2 2 3 3 RED BAT 1 3 1 1 2 2 3 3 SILVER HAIR BAT 2 3 1 1 2 3 3 3
                                                                 Mean
                                       Top    Bottom      Top   Bottom       Top    Bottom     Top  Bottom
all_clus                        FREQ incisors incisors canines  canines premolars premolars  molars  molars
-----------------------------------------------------------------------------------------------------------
  1  3  1  1  1  3  3  3  2  3     4      0.0      4.0     0.5      0.0       3.0       3.0     3.0     3.0
  2  2  2  2  2  2  1  2  1  1    15      2.9      2.6     1.0      1.0       3.6       3.4     1.3     1.8
  2  4  2  2  4  2  1  2  1  1     1      3.0      2.0     1.0      0.0       3.0       3.0     3.0     3.0
  3  1  3  3  3  1  2  1  3  2     5      1.0      1.0     0.0      0.0       1.2       0.8     3.0     3.0
  3  4  3  3  4  1  2  1  3  2     2      2.0      1.0     0.0      0.0       2.5       2.0     3.0     3.0
  4  4  4  4  4  4  4  4  4  4     5      1.8      3.0     1.0      1.0       2.0       2.4     3.0     3.0
From the TABULATE and PRINT output, you can see that two types of clustering are obtained. In one case, the mole is grouped with the carnivores, while the pika and rabbit are grouped with the rodents. In the other case, both the mole and the lagomorphs are grouped with the bats.
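Because cluster numbers are arbitrary within each permutation, the ALL_CLUS values have to be compared as partitions rather than as labels. The following Python sketch (an illustration, not part of the SAS macros) takes the six ALL_CLUS combinations from the raw-data output, rebuilds the partition implied by each of the ten permutations, and counts how many distinct partitions occur. The group names are informal labels introduced here for the six combinations, and the function names are hypothetical.

```python
from collections import Counter

# Cluster number assigned in each of the 10 permutations, per ALL_CLUS group.
GROUPS = {
    "deer family":  [1, 3, 1, 1, 1, 3, 3, 3, 2, 3],
    "carnivores":   [2, 2, 2, 2, 2, 2, 1, 2, 1, 1],
    "mole":         [2, 4, 2, 2, 4, 2, 1, 2, 1, 1],
    "rodents":      [3, 1, 3, 3, 3, 1, 2, 1, 3, 2],
    "pika, rabbit": [3, 4, 3, 3, 4, 1, 2, 1, 3, 2],
    "bats":         [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
}


def all_clus_key(labels):
    """Build the key the way TREEPERM does: one 3-character field per permutation."""
    return "".join(f"{label:3d}" for label in labels)


def partition_for(perm_index):
    """Return the grouping of the six combinations implied by one permutation."""
    by_label = {}
    for group, labels in GROUPS.items():
        by_label.setdefault(labels[perm_index], set()).add(group)
    return frozenset(frozenset(members) for members in by_label.values())


counts = Counter(partition_for(k) for k in range(10))
for partition, n_perms in counts.items():
    print(n_perms, sorted(sorted(block) for block in partition))
```

Only two distinct partitions appear across the ten permutations: one in which the mole joins the carnivores and the pika and rabbit join the rodents, and one in which the mole, pika, and rabbit all join the bats.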
Next, the analysis is repeated with the standardized data. The pseudo $F$ and $t^2$ statistics indicate 3 or 4 clusters, while the cubic clustering criterion shows a sharp rise up to 4 clusters and then levels off up to 6 clusters. So the TREEPERM macro is used again at the 4-cluster level. In this case, there is no indeterminacy, as the same four clusters are obtained with every permutation, although in different orders. It must be emphasized, however, that lack of indeterminacy in no way indicates validity. The results are shown in Output 23.4.4.
Hierarchical Cluster Analysis of Mammals' Teeth Data
Symbol is value of _PERM_

[Plot of _PSF_*_NCL_: pseudo F statistic (vertical axis, 20 to 100) against number of clusters (horizontal axis, 1 to 20); plotting symbol is the value of _perm_. NOTE: 10 obs had missing values. 171 obs hidden.]
title3 'Standardized Data';

/*------CLUSTER STANDARDIZED DATA WITH AVERAGE LINKAGE------*/
%clusperm(teeth, &vlist, mammal, average std, 10);

/*------PLOT STATISTICS FOR THE LAST 20 LEVELS--------------*/
%plotperm(_psf_ _pst2_ _ccc_, 20, 10);

/*------ANALYZE THE 4-CLUSTER LEVEL-------------------------*/
%treeperm(&vlist, mammal, 9.1, 4, 10);
Hierarchical Cluster Analysis of Mammals' Teeth Data
Symbol is value of _PERM_

[Plot of _PST2_*_NCL_: pseudo T-squared statistic (vertical axis, 0 to 30) against number of clusters (horizontal axis, 1 to 20); plotting symbol is the value of _perm_. NOTE: 70 obs had missing values. 117 obs hidden.]

[Plot of _CCC_*_NCL_: cubic clustering criterion (vertical axis, 0 to 4) against number of clusters (horizontal axis, 1 to 20); plotting symbol is the value of _perm_. NOTE: 140 obs had missing values. 54 obs hidden.]
-------------------------------------- all_clus=' 1 3 1 1 1 3 3 3 2 3' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 DEER 0 4 0 0 3 3 3 3 ELK 0 4 1 0 3 3 3 3 MOOSE 0 4 0 0 3 3 3 3 REINDEER 0 4 1 0 3 3 3 3 -------------------------------------- all_clus=' 2 2 2 2 2 2 1 2 1 1' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 BADGER 3 3 1 1 3 3 1 2 BEAR 3 3 1 1 4 4 2 3 COUGAR 3 3 1 1 3 2 1 1 ELEPHANT SEAL 2 1 1 1 4 4 1 1 FUR SEAL 3 2 1 1 4 4 1 1 GREY SEAL 3 2 1 1 3 3 2 2 JAGUAR 3 3 1 1 3 2 1 1 MARTEN 3 3 1 1 4 4 1 2 RACCOON 3 3 1 1 4 4 3 2 RIVER OTTER 3 3 1 1 4 3 1 2 SEA LION 3 2 1 1 4 4 1 1 SEA OTTER 3 2 1 1 3 3 1 2 WEASEL 3 3 1 1 3 3 1 2 WOLF 3 3 1 1 4 4 2 3 WOLVERINE 3 3 1 1 4 4 1 2 -------------------------------------- all_clus=' 3 1 3 3 3 1 2 1 3 2' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 BEAVER 1 1 0 0 2 1 3 3 GRAY SQUIRREL 1 1 0 0 1 1 3 3 GROUNDHOG 1 1 0 0 2 1 3 3 HOUSE MOUSE 1 1 0 0 0 0 3 3 PIKA 2 1 0 0 2 2 3 3 PORCUPINE 1 1 0 0 1 1 3 3 RABBIT 2 1 0 0 3 2 3 3 -------------------------------------- all_clus=' 4 4 4 4 4 4 4 4 4 4' --------------------------------------- mammal v1 v2 v3 v4 v5 v6 v7 v8 BROWN BAT 2 3 1 1 3 3 3 3 HOUSE BAT 2 3 1 1 1 2 3 3 MOLE 3 2 1 0 3 3 3 3 PIGMY BAT 2 3 1 1 2 2 3 3 RED BAT 1 3 1 1 2 2 3 3 SILVER HAIR BAT 2 3 1 1 2 3 3 3 Mean Top Bottom Top Bottom Top Bottom Top Bottom FREQ incisors incisors canines canines premolars premolars molars molars all_clus 1 3 1 1 1 3 3 3 2 3 4 0.0 4.0 0.5 0.0 3.0 3.0 3.0 3.0 2 2 2 2 2 2 1 2 1 1 15 2.9 2.6 1.0 1.0 3.6 3.4 1.3 1.8 3 1 3 3 3 1 2 1 3 2 7 1.3 1.0 0.0 0.0 1.6 1.1 3.0 3.0 4 4 4 4 4 4 4 4 4 4 6 2.0 2.8 1.0 0.8 2.2 2.5 3.0 3.0
An example of the use of distance and similarity measures in cluster analysis is given in Example 26.1 in the PROC DISTANCE chapter.
The following example shows the analysis of a data set in which size information is detrimental to the classification. Imagine that an archaeologist of the future is excavating a 20th century grocery store. The archaeologist has discovered a large number of boxes of various sizes, shapes, and colors and wants to do a preliminary classification based on simple external measurements: height, width, depth, weight, and the predominant color of the box. It is known that a given product may have been sold in packages of different sizes, so the archaeologist wants to remove the effect of size from the classification. It is not known whether color is relevant to the use of the products, so the analysis should be done both with and without color information.
Unknown to the archaeologist, the boxes actually fall into six general categories according to the use of the product: breakfast cereals, crackers, laundry detergents, Little Debbie snacks, tea, and toothpaste. These categories are shown in the analysis so that you can evaluate the effectiveness of the classification.
Since there is no reason for the archaeologist to assume that the true categories have equal sample sizes or variances, the centroid method is used to avoid undue bias. Each analysis is done with Euclidean distances after suitable transformations of the data. Color is coded as five dummy variables with values of 0 or 1. The DATA step is as follows:
options ls=120;
title 'Cluster Analysis of Grocery Boxes';

data grocery2;
   /* The character lengths below are assumptions: the widths are   */
   /* missing here, so the smallest widths consistent with the data */
   /* and with the printed output are used.                         */
   length name $35      /* name of product */
          class $16     /* category of product */
          unit $1       /* unit of measurement for weights:
                              g=gram
                              o=ounce
                              l=lb
                           all weights are converted to grams */
          color $1      /* predominant color of box */
          height 8      /* height of box in cm. */
          width 8       /* width of box in cm. */
          depth 8       /* depth of box (front to back) in cm. */
          weight 8      /* weight of box in grams */
          c_white c_yellow c_red c_green c_blue 4; /* dummy variables */
   retain class;
   drop unit;

   /*--- read name with possible embedded blanks ---*/
   /*--- (name ends at two consecutive blanks)   ---*/
   input name & @;

   /*--- if name starts with "---",        ---*/
   /*--- it's really a category value      ---*/
   if substr(name,1,3) = '---' then do;
      class = substr(name,4,index(substr(name,4),'-')-1);
      delete;
      return;
   end;

   /*--- read the rest of the variables ---*/
   input height width depth weight unit color;

   /*--- convert weights to grams ---*/
   select (unit);
      when ('l') weight = weight * 454;
      when ('o') weight = weight * 28.3;
      when ('g') ;
      otherwise put 'Invalid unit ' unit;
   end;

   /*--- use 0/1 coding for dummy variables for colors ---*/
   c_white  = (color = 'w');
   c_yellow = (color = 'y');
   c_red    = (color = 'r');
   c_green  = (color = 'g');
   c_blue   = (color = 'b');
   datalines;
---Breakfast cereals---
Cheerios  32.5 22.4 8.4 567 g y
Cheerios  30.3 20.4 7.2 425 g y
Cheerios  27.5 19 6.2 283 g y
Cheerios  24.1 17.2 5.3 198 g y
Special K  30.1 20.5 8.5 18 o w
Special K  29.6 19.2 6.7 12 o w
Special K  23.4 16.6 5.7 7 o w
Corn Flakes  33.7 25.4 8 24 o w
Corn Flakes  30.2 20.6 8.4 18 o w
Corn Flakes  30 19.1 6.6 12 o w
Grape Nuts  21.7 16.3 4.9 680 g w
Shredded Wheat  19.7 19.9 7.5 283 g y
Shredded Wheat, Spoon Size  26.6 19.6 5.6 510 g r
All-Bran  21.1 14.3 5.2 13.8 o y
Froot Loops  30.2 20.8 8.5 19.7 o r
Froot Loops  25 17.7 6.4 11 o r
---Crackers---
Wheatsworth  11.1 25.2 5.5 326 g w
Ritz  23.1 16 5.3 340 g r
Ritz  23.1 20.7 5.2 454 g r
Premium Saltines  11 25 10.7 454 g w
Waverly Wafers  14.4 22.5 6.2 454 g g
---Detergent---
Arm & Hammer Detergent  38.8 30 16.9 25 l y
Arm & Hammer Detergent  39.5 25.8 11 14.2 l y
Arm & Hammer Detergent  33.7 22.8 7 7 l y
Arm & Hammer Detergent  27.8 19.4 6.3 4 l y
Tide  39.4 24.8 11.3 9.2 l r
Tide  32.5 23.2 7.3 4.5 l r
Tide  26.5 19.9 6.3 42 o r
Tide  19.3 14.6 4.7 17 o r
---Little Debbie---
Figaroos  13.5 18.6 3.7 12 o y
Swiss Cake Rolls  10.1 21.8 5.8 13 o w
Fudge Brownies  11 30.8 2.5 12 o w
Marshmallow Supremes  9.4 32 7 10 o w
Apple Delights  11.2 30.1 4.9 15 o w
Snack Cakes  13.4 32 3.4 13 o b
Nutty Bar  13.2 18.5 4.2 12 o y
Lemon Stix  13.2 18.5 4.2 9 o w
Fudge Rounds  8.1 28.3 5.4 9.5 o w
---Tea---
Celestial Saesonings Mint Magic  7.8 13.8 6.3 49 g b
Celestial Saesonings Cranberry Cove  7.8 13.8 6.3 46 g r
Celestial Saesonings Sleepy Time  7.8 13.8 6.3 37 g g
Celestial Saesonings Lemon Zinger  7.8 13.8 6.3 56 g y
Bigelow Lemon Lift  7.7 13.4 6.9 40 g y
Bigelow Plantation Mint  7.7 13.4 6.9 35 g g
Bigelow Earl Grey  7.7 13.4 6.9 35 g b
Luzianne  8.9 22.8 6.4 6 o r
Luzianne  18.4 20.2 6.9 8 o r
Luzianne Decaffeinated  8.9 22.8 6.4 5.25 o g
Lipton Tea Bags  17.1 20 6.7 8 o r
Lipton Tea Bags  11.5 14.4 6.6 3.75 o r
Lipton Tea Bags  6.7 10 5.7 1.25 o r
Lipton Family Size Tea Bags  13.7 24 9 12 o r
Lipton Family Size Tea Bags  8.7 20.8 8.2 6 o r
Lipton Family Size Tea Bags  8.9 11.1 8.2 3 o r
Lipton Loose Tea  12.7 10.9 5.4 8 o r
---Paste, Tooth---
Colgate  4.4 22 3.5 7 o r
Colgate  3.6 15.6 3.3 3 o r
Colgate  4.2 18.3 3.5 5 o r
Crest  4.3 21.7 3.7 6.4 o w
Crest  4.3 17.4 3.6 4.6 o w
Crest  3.5 15.2 3.2 2.7 o w
Crest  3.0 10.9 2.8 .85 o w
Arm & Hammer  4.4 17 3.7 5 o w
;

data grocery;
   length name ;
   set grocery2;
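The two recoding steps in the DATA step, converting every weight to grams and expanding the color code into five 0/1 dummy variables, can be sketched outside SAS as follows. This Python fragment is an illustration only: the conversion factors and color codes are those used in the DATA step, the function names are hypothetical, and, unlike the DATA step (which writes a message to the log for an invalid unit), the sketch raises an error.

```python
# Conversion factors used by the DATA step: pounds and ounces to grams.
GRAMS_PER_UNIT = {"g": 1.0, "o": 28.3, "l": 454.0}

# One dummy variable per color code.
COLOR_DUMMIES = {
    "w": "c_white",
    "y": "c_yellow",
    "r": "c_red",
    "g": "c_green",
    "b": "c_blue",
}


def to_grams(weight, unit):
    """Convert a weight in grams, ounces, or pounds to grams."""
    if unit not in GRAMS_PER_UNIT:
        raise ValueError(f"Invalid unit {unit!r}")
    return weight * GRAMS_PER_UNIT[unit]


def color_dummies(color):
    """Return the five 0/1 dummy variables for a one-letter color code."""
    return {name: int(code == color) for code, name in COLOR_DUMMIES.items()}


# The largest detergent box: 25 lb, yellow.
print(to_grams(25, "l"), color_dummies("y"))
```

Each box gets exactly one dummy variable equal to 1, so the five dummies always sum to 1.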
The FORMAT procedure is used to define two formats that make the output easier to read. The STARS. format is used for graphical crosstabulations in the TABULATE procedure. The $COLOR. format displays the names of the colors instead of just the first letter.
/*------ formats and macros for displaying ------ */
/*------ cluster results                   ------ */
proc format;
   value stars
      0=' '
      1=' #'
      2=' ##'
      3=' ###'
      4=' ####'
      5=' #####'
      6=' ######'
      7=' #######'
      8=' ########'
      9=' #########'
      10=' ##########'
      11=' ###########'
      12=' ############'
      13=' #############'
      14=' ##############'
      15-high='>##############';
run;

proc format;
   value $color
      'w'='White'
      'y'='Yellow'
      'r'='Red'
      'g'='Green'
      'b'='Blue';
run;
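The STARS. format turns a cell count into a short bar, which is what makes the crosstabulations readable at a glance. The following Python sketch mirrors the mapping as it appears in the VALUE statement above, for non-negative integer counts; the function name `stars` is hypothetical and the sketch is for illustration only.

```python
def stars(count):
    """Render a non-negative integer count as a bar of '#' characters."""
    if count == 0:
        return " "
    if count >= 15:
        # Counts of 15 or more are truncated and flagged with '>'.
        return ">" + "#" * 14
    return " " + "#" * count


for n in (0, 3, 14, 20):
    print(repr(stars(n)))
```

Counts of 15 or more all render as the same truncated bar, so the display shows the shape of a distribution across cells rather than exact large frequencies.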
Since a full display of the results of each cluster analysis would be very long, a macro is used with five macro variables to select parts of the output. The macro variables are set to select only the PROC CLUSTER output and the crosstabulation of clusters and true categories for the first two analyses. The example could be run with different settings of the macro variables to show the full output or other selected parts.
%let cluster=1;    /* 1=show CLUSTER output, 0=don't */
%let tree=0;       /* 1=print TREE diagram, 0=don't */
%let list=0;       /* 1=list clusters, 0=don't */
%let crosstab=1;   /* 1=crosstabulate clusters and classes, 0=don't */
%let crosscol=0;   /* 1=crosstabulate clusters and colors, 0=don't */

/*--- define macro with options for TREE ---*/
%macro treeopt;
   %if &tree %then h page=1;
   %else noprint;
%mend;

/*--- define macro with options for CLUSTER ---*/
%macro clusopt;
   %if &cluster %then pseudo ccc p=20;
   %else noprint;
%mend;

/*------ macro for showing cluster results ------*/
%macro show(n); /* n=number of clusters to show results for */

   proc tree data=tree %treeopt n=&n out=out;
      id name;
      copy class height width depth weight color;
   run;

   %if &list %then %do;
      proc sort;
         by cluster;
      run;

      proc print;
         var class name height width depth weight color;
         by cluster clusname;
      run;
   %end;

   %if &crosstab %then %do;
      proc tabulate noseps /* formchar=' ' */;
         class class cluster;
         table cluster, class*n=' '*f=stars./rts=10 misstext=' ';
      run;
   %end;

   %if &crosscol %then %do;
      proc tabulate noseps /* formchar=' ' */;
         class color cluster;
         table cluster, color*n=' '*f=stars./rts=10 misstext=' ';
         format color $color.;
      run;
   %end;
%mend;
The first analysis uses the variables height, width, depth, and weight in standardized form to show the effect of including size information. The CCC, pseudo $F$, and pseudo $t^2$ statistics indicate 10 clusters. Most of the clusters do not correspond closely to the true categories, and four of the clusters have only one or two observations.
/**********************************************************/
/*                                                        */
/*      Analysis 1: standardized box measurements         */
/*                                                        */
/**********************************************************/
title2 'Analysis 1: Standardized data';

proc cluster data=grocery m=cen std %clusopt outtree=tree;
   var height width depth weight;
   id name;
   copy class color;
run;

%show(10);
Cluster Analysis of Grocery Boxes Analysis 1: Standardized data The CLUSTER Procedure Centroid Hierarchical Cluster Analysis Eigenvalues of the Correlation Matrix Eigenvalue Difference Proportion Cumulative 1 2.44512438 1.64456210 0.6113 0.6113 2 0.80056228 0.33149770 0.2001 0.8114 3 0.46906458 0.18381582 0.1173 0.9287 4 0.28524876 0.0713 1.0000 The data have been standardized to mean 0 and variance 1 Root-Mean-Square Total-Sample Standard Deviation = 1 Root-Mean-Square Distance Between Observations = 2.828427
Cluster Analysis of Grocery Boxes Analysis 1: Standardized data The CLUSTER Procedure Centroid Hierarchical Cluster Analysis The data have been standardized to mean 0 and variance 1 Root-Mean-Square Total-Sample Standard Deviation = 1 Root-Mean-Square Distance Between Observations = 2.828427 Cluster History Norm T Cent i NCL --Clusters Joined--- FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Dist e 20 CL22 OB54 11 0.0028 .974 . . 85.4 4.5 0.3073 19 CL36 OB8 5 0.0026 .972 . . 83.7 15.3 0.3146 18 CL24 CL41 12 0.0080 .964 . . 70.2 10.0 0.3316 17 CL18 CL30 18 0.0144 .949 . . 53.8 12.7 0.3343 16 OB33 CL29 3 0.0024 .947 . . 55.8 4.7 0.3363 15 CL50 CL33 7 0.0055 .941 . . 55.0 24.4 0.346 14 CL46 CL15 10 0.0069 .934 . . 53.7 8.1 0.3192 13 CL27 OB53 6 0.0035 .931 . . 56.1 6.3 0.362 12 CL31 CL16 5 0.0075 .923 .861 8.03 55.8 6.6 0.4416 11 CL19 CL23 7 0.0102 .913 .848 7.59 54.6 12.7 0.4713 10 OB23 OB26 2 0.0037 .909 .835 8.36 59.1 . 0.4781 9 CL11 CL17 25 0.0393 .870 .819 4.72 45.2 19.3 0.4918 8 CL13 CL14 16 0.0329 .837 .801 2.95 40.4 23.7 0.5215 7 CL8 CL20 27 0.0629 .774 .779 -.31 32.0 25.9 0.5467 6 CL7 OB62 28 0.0112 .763 .752 0.61 36.7 2.4 0.6003 5 CL9 CL6 53 0.1879 .575 .718 -5.9 19.6 43.4 0.6641 4 CL5 CL21 55 0.0345 .541 .672 -5.2 23.2 4.5 0.745 3 CL4 CL12 60 0.1137 .427 .602 -5.3 22.4 14.5 0.8769 2 CL3 CL10 62 0.1511 .276 .471 -4.3 23.2 15.8 1.5559 1 CL2 OB22 63 0.2759 .000 .000 0.00 . 23.2 2.948
----------------------------------------------------------------------------------------------------------- class ----------------------------------------------------------------------------------------------- Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea --------+---------------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ########### 2 ## # ### 3 ##### ## 4 ### ####### 5 ########### ## ### ## 6 ##### 7 # # 8 ## 9 # 10 # -----------------------------------------------------------------------------------------------------------
The second analysis uses logarithms of height, width, depth, and the cube root of weight; the cube root is used for consistency with the linear measures. The rows are then centered to remove size information. Finally, the columns are standardized to have a standard deviation of 1. There is no compelling a priori reason to standardize the columns, but if they are not standardized, height dominates the analysis because of its large variance. The STANDARD procedure is used instead of the STD option in PROC CLUSTER so that a subsequent analysis can separately standardize the dummy variables for color.
/**********************************************************/
/*                                                        */
/*   Analysis 2: standardized row-centered logarithms     */
/*                                                        */
/**********************************************************/
title2 'Row-centered logarithms';

data shape;
   set grocery;
   array x height width depth weight;
   array l l_height l_width l_depth l_weight; /* logarithms */
   weight=weight**(1/3);  /* take cube root to conform with
                             the other linear measurements */
   do over l;             /* take logarithms */
      l=log(x);
   end;
   mean=mean(of l(*));    /* find row mean of logarithms */
   do over l;
      l=l-mean;           /* center row */
   end;
run;

title2 'Analysis 2: Standardized row-centered logarithms';

proc standard data=shape out=shapstan m=0 s=1;
   var l_height l_width l_depth l_weight;
run;

proc cluster data=shapstan m=cen %clusopt outtree=tree;
   var l_height l_width l_depth l_weight;
   id name;
   copy class height width depth weight color;
run;

%show(8);
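Why row-centering the logarithms removes size information can be seen with a small numerical check. The following Python sketch (an illustration outside the SAS example; the function name `shape_features` is hypothetical) applies the same transformation as the SHAPE data step to one of the Cheerios boxes and to an imagined box with every linear dimension doubled and, correspondingly, eight times the weight. The doubled box is an assumption made purely for the demonstration and is not in the data.

```python
import math


def shape_features(height, width, depth, weight):
    """Logs of the three dimensions and of the cube root of weight, row-centered."""
    logs = [
        math.log(height),
        math.log(width),
        math.log(depth),
        math.log(weight ** (1.0 / 3.0)),
    ]
    row_mean = sum(logs) / len(logs)
    return [value - row_mean for value in logs]


# A box from the data, and a hypothetical box twice as large in every dimension.
small = shape_features(24.1, 17.2, 5.3, 198.0)
large = shape_features(2 * 24.1, 2 * 17.2, 2 * 5.3, 8 * 198.0)

print([round(v, 4) for v in small])
print([round(v, 4) for v in large])
```

Scaling the box adds the same constant to all four logarithms, and subtracting the row mean removes it, so both boxes get the same shape features. The four centered values also sum to zero in every row, which is consistent with the zero fourth eigenvalue reported for this analysis.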
The results of the second analysis are shown for eight clusters. Clusters 1 through 4 correspond fairly well to tea, toothpaste, breakfast cereals, and detergents. Crackers and Little Debbie products are scattered among several clusters.
Cluster Analysis of Grocery Boxes
Analysis 2: Standardized row-centered logarithms

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue  Difference  Proportion  Cumulative
  1  1.94931049  0.34845395      0.4873      0.4873
  2  1.60085654  1.15102358      0.4002      0.8875
  3  0.44983296  0.44983296      0.1125      1.0000
  4   .00000000                  0.0000      1.0000

Root-Mean-Square Total-Sample Standard Deviation = 1
Root-Mean-Square Distance Between Observations = 2.828427

Cluster History

NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Cent Dist  Tie
 20  CL29      OB14           4  0.0017  .977     .      .   94.7    2.9  0.2658
 19  CL26      CL27           8  0.0045  .972     .      .   85.4    8.4  0.3047
 18  OB38      OB62           2  0.0016  .971     .      .   87.2      .  0.3193
 17  OB32      OB35           2  0.0018  .969     .      .   89.1      .  0.3331
 16  OB22      OB55           2  0.0019  .967     .      .   91.3      .  0.3434
 15  CL23      CL18           5  0.0050  .962     .      .   86.5    4.8  0.3587
 14  CL37      CL21           5  0.0051  .957     .      .   83.5   10.4  0.3613
 13  CL30      CL24           9  0.0068  .950     .      .   79.2   12.9  0.3682
 12  CL32      CL20          16  0.0142  .936  .892   5.75   67.6   29.3  0.3826
 11  CL22      OB34           4  0.0037  .932  .881   6.31   71.4    3.2  0.3901
 10  CL11      CL31           7  0.0090  .923  .869   6.17   70.8    6.3  0.4032
  9  CL33      CL13          11  0.0092  .914  .853   6.25   71.7    7.6  0.4181
  8  CL19      CL16          10  0.0131  .901  .835   6.12   71.4   10.9  0.503
  7  CL14      CL9           16  0.0297  .871  .813   4.63   63.1   15.6  0.5173
  6  CL10      CL15          12  0.0329  .838  .785   3.69   59.1   13.6  0.5916
  5  CL6       CL28          19  0.0557  .783  .748   2.01   52.2   15.8  0.6252
  4  CL12      CL8           26  0.0885  .694  .697    .16   44.6   48.8  0.6679
  3  CL5       CL17          21  0.0459  .648  .617   1.21   55.3    7.4  0.8863
  2  CL4       CL7           42  0.2841  .364  .384    .56   34.9   60.3  0.9429
  1  CL2       CL3           63  0.3640  .000  .000   0.00      .   34.9  0.8978
----------------------------------------------------------------------------------------------------------- class ----------------------------------------------------------------------------------------------- Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea --------+---------------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 # ########## 2 ####### 3 ############## ## 4 # ######## # 5 ## # ## 6 # #### 7 ## ##### 8 ## -----------------------------------------------------------------------------------------------------------
The third analysis is similar to the second analysis except that the rows are standardized rather than just centered. There is a clear indication of seven clusters from the CCC, pseudo F, and pseudo t² statistics. The clusters are listed as well as crosstabulated with the true categories and colors.
/**********************************************************/
/*                                                        */
/*  Analysis 3: standardized row-standardized logarithms  */
/*                                                        */
/**********************************************************/

%let list=1;
%let crosscol=1;

title2 'Row-standardized logarithms';
data std;
   set grocery;
   array x height width depth weight;
   array l l_height l_width l_depth l_weight;   /* logarithms */
   /* take cube root to conform with the other linear measurements */
   weight=weight**(1/3);
   do over l;
      l=log(x);              /* take logarithms */
   end;
   mean=mean(of l(*));       /* find row mean of logarithms */
   std=std(of l(*));         /* find row standard deviation */
   do over l;
      l=(l-mean)/std;        /* standardize row */
   end;
run;

title2 'Analysis 3: Standardized row-standardized logarithms';
proc standard data=std out=stdstan m=0 s=1;
   var l_height l_width l_depth l_weight;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
   var l_height l_width l_depth l_weight;
   id name;
   copy class height width depth weight color;
run;

%show(7);
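The row standardization in the DATA step above can likewise be sketched in Python as a cross-check outside SAS. The sketch is illustrative only: the function name `row_standardized_logs` and both boxes are hypothetical, and `statistics.stdev` is used because, like the SAS STD function, it divides by n-1. Compared with centering alone, dividing by the row standard deviation also removes the overall spread of the log measurements, so a box whose log dimensions are a positive multiple of another's receives the same standardized row.

```python
import math
import statistics


def row_standardized_logs(height, width, depth, weight):
    """Mirror the DATA step: logs of the measures, then (l - row mean) / row std."""
    logs = [math.log(v) for v in (height, width, depth, weight ** (1 / 3))]
    row_mean = statistics.mean(logs)
    row_std = statistics.stdev(logs)  # sample standard deviation (n-1 divisor)
    return [(value - row_mean) / row_std for value in logs]


# Hypothetical box (not taken from the grocery data).
box_a = row_standardized_logs(20.0, 10.0, 5.0, 1000.0)

# Hypothetical box whose linear measures are the squares of box_a's, so every
# logarithm is doubled (cube root of the weight is 100 = 10**2).
box_b = row_standardized_logs(400.0, 100.0, 25.0, 1_000_000.0)

for a, b in zip(box_a, box_b):
    print(f"{a:9.5f} {b:9.5f}")
```

Each standardized row has mean 0 and standard deviation 1, and the two hypothetical boxes coincide, which illustrates the extra information that row standardization discards relative to row centering.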
The output from the third analysis shows that cluster 1 contains 9 of the 17 teas. Cluster 2 contains all of the detergents plus Grape Nuts, a very heavy cereal. Cluster 3 includes all of the toothpastes and one Little Debbie product that is of very similar shape, although roughly twice as large. Cluster 4 has most of the cereals, Ritz crackers (which come in a box very similar to most of the cereal boxes), and Lipton Loose Tea (all the other teas in the sample come in tea bags). Clusters 5 and 6 each contain several Luzianne and Lipton teas and one or two miscellaneous items. Cluster 7 includes most of the Little Debbie products and two types of crackers. Thus, the crackers are not identified and the teas are broken up into three clusters, but the other categories correspond to single clusters. This analysis classifies toothpaste and Little Debbie products slightly better than the second analysis.
Cluster Analysis of Grocery Boxes
Analysis 3: Standardized row-standardized logarithms

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue  Difference  Proportion  Cumulative
  1  2.42684848  0.94583675      0.6067      0.6067
  2  1.48101173  1.38887193      0.3703      0.9770
  3  0.09213980  0.09213980      0.0230      1.0000
  4  -.00000000                 -0.0000      1.0000

Root-Mean-Square Total-Sample Standard Deviation = 1
Root-Mean-Square Distance Between Observations = 2.828427

Cluster History

NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Cent Dist  Tie
 20  CL35      CL33           8  0.0024  .990     .      .    229   32.0  0.1923
 19  CL22      OB19           5  0.0010  .989     .      .    224    2.9  0.2014
 18  CL44      CL27           6  0.0018  .987     .      .    206   20.5  0.2073
 17  CL18      CL26           9  0.0025  .985     .      .    187    6.4  0.1956
 16  OB38      OB62           2  0.0009  .984     .      .    192      .  0.24
 15  CL24      CL23           5  0.0029  .981     .      .    177    7.8  0.2753
 14  CL25      OB21           4  0.0021  .979     .      .    175    7.7  0.2917
 13  CL30      CL19          17  0.0101  .969     .      .    130   41.0  0.2974
 12  CL16      CL31           9  0.0049  .964  .932   5.49    124   20.5  0.3121
 11  CL21      OB52           4  0.0029  .961  .924   5.81    129    8.2  0.3445
 10  CL41      CL11           6  0.0045  .957  .915   5.94    130    5.0  0.323
  9  CL29      OB50           4  0.0031  .953  .904   6.52    138   20.3  0.3603
  8  CL14      CL15           9  0.0101  .943  .890   6.08    131   10.7  0.3761
  7  CL20      OB54           9  0.0047  .939  .872   6.89    143   11.7  0.4063
  6  CL13      CL9           21  0.0272  .911  .848   5.23    117   30.0  0.5101
  5  CL6       CL17          30  0.0746  .837  .814   1.30   74.3   42.2  0.606
  4  CL10      CL7           15  0.0440  .793  .764   1.40   75.3   36.4  0.6152
  3  CL8       CL12          18  0.0642  .729  .681   2.02   80.6   44.0  0.6648
  2  CL3       CL4           33  0.2580  .471  .470   0.01   54.2   54.4  0.9887
  1  CL5       CL2           63  0.4707  .000  .000   0.00      .   54.2  0.9636
---------------------------- CLUSTER=1 CLUSNAME=CL7 ----------------------------

Obs  class             name              height  width  depth   weight  color
  1  Tea               Bigelow Plantati     7.7   13.4    6.9  3.27107  g
  2  Tea               Bigelow Earl Gre     7.7   13.4    6.9  3.27107  b
  3  Tea               Celestial Saeson     7.8   13.8    6.3  3.65931  b
  4  Tea               Celestial Saeson     7.8   13.8    6.3  3.58305  r
  5  Tea               Bigelow Lemon Li     7.7   13.4    6.9  3.41995  y
  6  Tea               Celestial Saeson     7.8   13.8    6.3  3.82586  y
  7  Tea               Celestial Saeson     7.8   13.8    6.3  3.33222  g
  8  Tea               Lipton Tea Bags      6.7   10.0    5.7  3.28271  r
  9  Tea               Lipton Family Si     8.9   11.1    8.2  4.39510  r

---------------------------- CLUSTER=2 CLUSNAME=CL17 ---------------------------

Obs  class             name              height  width  depth   weight  color
 10  Detergent         Tide                26.5   19.9    6.3  10.5928  r
 11  Detergent         Tide                19.3   14.6    4.7   7.8357  r
 12  Detergent         Tide                32.5   23.2    7.3  12.6889  r
 13  Breakfast cereal  Grape Nuts          21.7   16.3    4.9   8.7937  w
 14  Detergent         Arm & Hammer Det    33.7   22.8    7.0  14.7023  y
 15  Detergent         Arm & Hammer Det    27.8   19.4    6.3  12.2003  y
 16  Detergent         Arm & Hammer Det    38.8   30.0   16.9  22.4732  y
 17  Detergent         Tide                39.4   24.8   11.3  16.1045  r
 18  Detergent         Arm & Hammer Det    39.5   25.8   11.0  18.6115  y

---------------------------- CLUSTER=3 CLUSNAME=CL12 ---------------------------

Obs  class             name              height  width  depth   weight  color
 19  Paste, Tooth      Colgate              3.6   15.6    3.3  4.39510  r
 20  Paste, Tooth      Crest                3.5   15.2    3.2  4.24343  w
 21  Paste, Tooth      Crest                4.3   17.4    3.6  5.06813  w
 22  Paste, Tooth      Arm & Hammer         4.4   17.0    3.7  5.21097  w
 23  Paste, Tooth      Colgate              4.2   18.3    3.5  5.21097  r
 24  Paste, Tooth      Crest                4.3   21.7    3.7  5.65790  w
 25  Paste, Tooth      Colgate              4.4   22.0    3.5  5.82946  r
 26  Little Debbie     Fudge Rounds         8.1   28.3    5.4  6.45411  w
 27  Paste, Tooth      Crest                3.0   10.9    2.8  2.88670  w

---------------------------- CLUSTER=4 CLUSNAME=CL13 ---------------------------

Obs  class             name              height  width  depth   weight  color
 28  Breakfast cereal  Cheerios            27.5   19.0    6.2  6.56541  y
 29  Breakfast cereal  Froot Loops         25.0   17.7    6.4  6.77735  r
 30  Breakfast cereal  Special K           30.1   20.5    8.5  7.98644  w
 31  Breakfast cereal  Corn Flakes         30.2   20.6    8.4  7.98644  w
 32  Breakfast cereal  Special K           29.6   19.2    6.7  6.97679  w
 33  Breakfast cereal  Corn Flakes         30.0   19.1    6.6  6.97679  w
 34  Breakfast cereal  Froot Loops         30.2   20.8    8.5  8.23034  r
 35  Breakfast cereal  Cheerios            30.3   20.4    7.2  7.51847  y
 36  Breakfast cereal  Cheerios            24.1   17.2    5.3  5.82848  y
 37  Breakfast cereal  Corn Flakes         33.7   25.4    8.0  8.79021  w
 38  Breakfast cereal  Special K           23.4   16.6    5.7  5.82946  w
 39  Breakfast cereal  Cheerios            32.5   22.4    8.4  8.27677  y
 40  Breakfast cereal  Shredded Wheat,     26.6   19.6    5.6  7.98957  r
 41  Crackers          Ritz                23.1   16.0    5.3  6.97953  r
 42  Breakfast cereal  All-Bran            21.1   14.3    5.2  7.30951  y
 43  Tea               Lipton Loose Tea    12.7   10.9    5.4  6.09479  r
 44  Crackers          Ritz                23.1   20.7    5.2  7.68573  r

---------------------------- CLUSTER=5 CLUSNAME=CL10 ---------------------------

Obs  class             name              height  width  depth   weight  color
 45  Tea               Luzianne             8.9   22.8    6.4  5.53748  r
 46  Tea               Luzianne Decaffe     8.9   22.8    6.4  5.29641  g
 47  Crackers          Premium Saltines    11.0   25.0   10.7  7.68573  w
 48  Tea               Lipton Family Si     8.7   20.8    8.2  5.53748  r
 49  Little Debbie     Marshmallow Supr     9.4   32.0    7.0  6.56541  w
 50  Tea               Lipton Family Si    13.7   24.0    9.0  6.97679  r

---------------------------- CLUSTER=6 CLUSNAME=CL9 ----------------------------

Obs  class             name              height  width  depth   weight  color
 51  Tea               Luzianne            18.4   20.2    6.9  6.09479  r
 52  Tea               Lipton Tea Bags     17.1   20.0    6.7  6.09479  r
 53  Breakfast cereal  Shredded Wheat      19.7   19.9    7.5  6.56541  y
 54  Tea               Lipton Tea Bags     11.5   14.4    6.6  4.73448  r

---------------------------- CLUSTER=7 CLUSNAME=CL8 ----------------------------

Obs  class             name              height  width  depth   weight  color
 55  Crackers          Wheatsworth         11.1   25.2    5.5  6.88239  w
 56  Little Debbie     Swiss Cake Rolls    10.1   21.8    5.8  7.16545  w
 57  Little Debbie     Figaroos            13.5   18.6    3.7  6.97679  y
 58  Little Debbie     Nutty Bar           13.2   18.5    4.2  6.97679  y
 59  Little Debbie     Apple Delights      11.2   30.1    4.9  7.51552  w
 60  Little Debbie     Lemon Stix          13.2   18.5    4.2  6.33884  w
 61  Little Debbie     Fudge Brownies      11.0   30.8    2.5  6.97679  w
 62  Little Debbie     Snack Cakes         13.4   32.0    3.4  7.16545  b
 63  Crackers          Waverly Wafers      14.4   22.5    6.2  7.68573  g
                                         class
CLUSTER | Breakfast cereal | Crackers | Detergent | Little Debbie | Paste, Tooth | Tea
--------+------------------+----------+-----------+---------------+--------------+-----------
1       |                  |          |           |               |              | #########
2       | #                |          | ########  |               |              |
3       |                  |          |           | #             | ########     |
4       | ##############   | ##       |           |               |              | #
5       |                  | #        |           | #             |              | ####
6       | #                |          |           |               |              | ###
7       |                  | ##       |           | #######       |              |
--------+------------------+----------+-----------+---------------+--------------+-----------
                        color
CLUSTER | Blue | Green | Red    | White  | Yellow
--------+------+-------+--------+--------+--------
1       | ##   | ##    | ###    |        | ##
2       |      |       | ####   | #      | ####
3       |      |       | ###    | ###### |
4       |      |       | ###### | ###### | #####
5       |      | #     | ###    | ##     |
6       |      |       | ###    |        | #
7       | #    | #     |        | #####  | ##
--------+------+-------+--------+--------+--------
The last several analyses include color. Obviously, the dummy variables must not be included in calculations to standardize the rows. If the five dummy variables are simply standardized to variance 1.0 and included with the other variables, color dominates the analysis. The dummy variables should be scaled to a smaller variance, which must be determined by trial and error. Four analyses are done using PROC STANDARD to scale the dummy variables to a standard deviation of 0.2, 0.3, 0.4, or 0.8. The cluster listings are suppressed.
Since dummy variables drastically violate the normality assumption on which the CCC depends, the CCC tends to indicate an excessively large number of clusters.
/************************************************************/
/*                                                          */
/* Analyses 4-7: standardized row-standardized logs & color */
/*                                                          */
/************************************************************/

%let list=0;
%let crosscol=1;

title2 'Analysis 4: Standardized row-standardized logarithms and color (s=.2)';
proc standard data=stdstan out=stdstan m=0 s=.2;
   var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
   var l_height l_width l_depth l_weight c_:;
   id name;
   copy class height width depth weight color;
run;

%show(7);

title2 'Analysis 5: Standardized row-standardized logarithms and color (s=.3)';
proc standard data=stdstan out=stdstan m=0 s=.3;
   var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
   var l_height l_width l_depth l_weight c_:;
   id name;
   copy class height width depth weight color;
run;

%show(6);

title2 'Analysis 6: Standardized row-standardized logarithms and color (s=.4)';
proc standard data=stdstan out=stdstan m=0 s=.4;
   var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
   var l_height l_width l_depth l_weight c_:;
   id name;
   copy class height width depth weight color;
run;

%show(3);

title2 'Analysis 7: Standardized row-standardized logarithms and color (s=.8)';
proc standard data=stdstan out=stdstan m=0 s=.8;
   var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
   var l_height l_width l_depth l_weight c_:;
   id name;
   copy class height width depth weight color;
run;

%show(10);
Using PROC STANDARD on the dummy variables with S=0.2 causes four of the Little Debbie products to join the toothpastes. Using S=0.3 causes one of the tea clusters to merge with the breakfast cereals while three cereals defect to the detergents. Using S=0.4 produces three clusters consisting of (1) cereals and detergents, (2) Little Debbie products and toothpaste, and (3) teas, with crackers divided among all three clusters and a few other misclassifications. With S=0.8, ten clusters are indicated, each entirely monochrome. So, S=0.2 or S=0.3 degrades the classification, S=0.4 yields a good but perhaps excessively coarse classification, and higher values of the S= option produce clusters that are determined mainly by color.
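How strongly each S= value weights color can be cross-checked against the root-mean-square statistics printed in the output for Analyses 4 through 7. The Python sketch below is illustrative only and sits outside the SAS example; the function name `rms_statistics` is hypothetical. It assumes that the four shape variables each have variance 1 after PROC STANDARD, that the five dummy variables each have variance S squared, and it uses the identity that the mean squared distance between distinct observations equals twice the total variance.

```python
import math


def rms_statistics(n_shape_vars, n_dummy_vars, dummy_std):
    """Return (RMS total-sample std, RMS distance between observations).

    Assumes shape variables have variance 1 and each dummy variable has
    variance dummy_std ** 2.
    """
    total_variance = n_shape_vars * 1.0 + n_dummy_vars * dummy_std ** 2
    rms_std = math.sqrt(total_variance / (n_shape_vars + n_dummy_vars))
    rms_distance = math.sqrt(2.0 * total_variance)
    return rms_std, rms_distance


def color_share(n_shape_vars, n_dummy_vars, dummy_std):
    """Fraction of the total variance contributed by the dummy variables."""
    dummy_variance = n_dummy_vars * dummy_std ** 2
    return dummy_variance / (n_shape_vars + dummy_variance)


for s in (0.2, 0.3, 0.4, 0.8):
    rms_std, rms_distance = rms_statistics(4, 5, s)
    share = color_share(4, 5, s)
    print(f"S={s}: RMS std={rms_std:.6f}  RMS distance={rms_distance:.6f}  "
          f"color share={share:.1%}")
```

Under these assumptions the computed values agree with the printed statistics (0.68313 and 2.898275 for S=0.2, up to 0.894427 and 3.794733 for S=0.8), and the share of total variance attributable to color rises from about 5% at S=0.2 to about 44% at S=0.8, which is consistent with color dominating the clusters at the larger settings.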
Cluster Analysis of Grocery Boxes
Analysis 4: Standardized row-standardized logarithms and color (s=.2)

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue  Difference  Proportion  Cumulative
  1  2.43584975  0.94791932      0.5800      0.5800
  2  1.48793042  1.39363531      0.3543      0.9342
  3  0.09429511  0.03686218      0.0225      0.9567
  4  0.05743293  0.01036136      0.0137      0.9704
  5  0.04707157  0.00489503      0.0112      0.9816
  6  0.04217654  0.00693298      0.0100      0.9916
  7  0.03524355  0.03524355      0.0084      1.0000
  8  0.00000000  0.00000000      0.0000      1.0000
  9   .00000000                  0.0000      1.0000

Root-Mean-Square Total-Sample Standard Deviation = 0.68313
Root-Mean-Square Distance Between Observations = 2.898275

Cluster History

NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Cent Dist  Tie
 20  CL46      OB37           3  0.0016  .968     .      .   67.5   11.9  0.2706
 19  OB46      OB52           2  0.0014  .966     .      .   69.7      .  0.2995
 18  CL25      CL37           6  0.0041  .962     .      .   67.1    5.0  0.3081
 17  CL33      CL35          16  0.0099  .952     .      .   57.2   16.7  0.3196
 16  CL19      OB48           3  0.0024  .950     .      .   59.2    1.7  0.3357
 15  CL30      CL16           5  0.0042  .946     .      .   59.5    2.7  0.3299
 14  CL27      CL18           8  0.0057  .940     .      .   58.9    4.2  0.3429
 13  CL20      OB32           4  0.0031  .937     .      .   61.7    3.6  0.3564
 12  CL24      OB50           4  0.0031  .934  .905   3.23   65.2    4.7  0.359
 11  CL39      CL28           6  0.0068  .927  .896   3.17   65.9   12.1  0.3743
 10  CL13      OB35           5  0.0036  .923  .886   3.62   70.8    2.3  0.3755
  9  CL11      CL32          13  0.0176  .906  .874   2.70   64.8   16.0  0.4107
  8  CL14      OB54           9  0.0052  .900  .859   3.29   71.0    2.6  0.4265
  7  OB21      CL10           6  0.0052  .895  .841   4.09   79.8    2.4  0.4378
  6  CL17      CL12          20  0.0248  .870  .817   3.52   76.6   19.7  0.4898
  5  CL15      CL8           14  0.0326  .838  .783   3.08   75.0   14.0  0.5607
  4  CL6       CL21          30  0.0743  .764  .734   1.35   63.5   35.6  0.5877
  3  CL9       CL7           19  0.0579  .706  .653   2.17   72.0   22.8  0.6611
  2  CL4       CL3           49  0.3632  .343  .450   -2.6   31.8   73.0  0.9838
  1  CL2       CL5           63  0.3426  .000  .000   0.00      .   31.8  0.9876
---------------------------------------------------------------------------------------------------------- class ----------------------------------------------------------------------------------------------- Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea --------+---------------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ## ######## 2 # #### ######## 3 ############# ## # 4 # ### 5 # ##### 6 ######### 7 # #### ----------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------ color ------------------------------------------------------------------------------- Blue Green Red White Yellow --------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 #### # ##### 2 ### ########## 3 ###### ###### #### 4 ### # 5 # # ## ## 6 ## ## ### ## 7 # ### # ------------------------------------------------------------------------------------------
Cluster Analysis of Grocery Boxes
Analysis 5: Standardized row-standardized logarithms and color (s=.3)

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue  Difference  Proportion  Cumulative
  1  2.44752302  0.95026671      0.5500      0.5500
  2  1.49725632  1.36701945      0.3365      0.8865
  3  0.13023687  0.02135049      0.0293      0.9157
  4  0.10888637  0.00867367      0.0245      0.9402
  5  0.10021271  0.00628821      0.0225      0.9627
  6  0.09392449  0.02196469      0.0211      0.9838
  7  0.07195981  0.07195981      0.0162      1.0000
  8  0.00000000  0.00000000      0.0000      1.0000
  9   .00000000                  0.0000      1.0000

Root-Mean-Square Total-Sample Standard Deviation = 0.703167
Root-Mean-Square Distance Between Observations = 2.983287

Cluster History

NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Cent Dist  Tie
 20  CL24      CL28           4  0.0038  .953     .      .   45.7    2.7  0.3448
 19  OB11      CL23           6  0.0033  .950     .      .   46.0    3.5  0.3477
 18  CL46      OB37           3  0.0027  .947     .      .   47.1   21.9  0.3558
 17  CL21      OB50           4  0.0031  .944     .      .   48.2    2.5  0.3577
 16  CL39      CL33           6  0.0064  .937     .      .   46.9   12.1  0.3637
 15  CL19      CL29          14  0.0152  .922     .      .   40.6   12.4  0.3707
 14  CL18      OB32           4  0.0035  .919     .      .   42.5    2.5  0.3813
 13  CL16      CL25          13  0.0175  .901     .      .   38.0   13.7  0.4103
 12  CL22      OB54           5  0.0049  .896  .875   1.76   40.0    3.2  0.4353
 11  CL12      CL37           7  0.0089  .887  .865   1.71   40.9    4.6  0.4397
 10  CL20      OB48           5  0.0056  .882  .854   2.02   43.9    2.5  0.4669
  9  CL26      CL17          16  0.0222  .859  .841   1.20   41.3   16.6  0.479
  8  CL32      CL11           9  0.0125  .847  .826   1.31   43.5    4.5  0.4988
  7  CL14      OB35           5  0.0070  .840  .806   1.95   49.0    3.3  0.519
  6  OB21      CL7            6  0.0077  .832  .782   2.79   56.6    2.3  0.5366
  5  CL9       CL15          30  0.0716  .761  .749   0.54   46.1   28.3  0.5452
  4  CL10      CL8           14  0.0318  .729  .700   1.21   52.9    8.6  0.5542
  3  CL5       CL6           36  0.0685  .660  .622   1.50   58.3   14.2  0.6516
  2  CL13      CL4           27  0.2008  .460  .427   0.90   51.9   46.6  0.9611
  1  CL3       CL2           63  0.4595  .000  .000   0.00      .   51.9  0.9609
---------------------------------------------------------------------------------------------------------- class ----------------------------------------------------------------------------------------------- Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea --------+---------------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ### ## ######## # 2 # #### ######## 3 ############# ### 4 # ##### 5 ######### 6 # #### ----------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------ color ------------------------------------------------------------------------------- Blue Green Red White Yellow --------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ######## # ##### 2 ### ########## 3 ##### ###### ##### 4 # # ## ## 5 ## ## ### ## 6 # ### # ------------------------------------------------------------------------------------------
Cluster Analysis of Grocery Boxes
Analysis 6: Standardized row-standardized logarithms and color (s=.4)

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue  Difference  Proportion  Cumulative
  1  2.46469435  0.95296119      0.5135      0.5135
  2  1.51173316  1.28149311      0.3149      0.8284
  3  0.23024005  0.04306536      0.0480      0.8764
  4  0.18717469  0.01766446      0.0390      0.9154
  5  0.16951023  0.01827481      0.0353      0.9507
  6  0.15123542  0.06582379      0.0315      0.9822
  7  0.08541162  0.08541162      0.0178      1.0000
  8  0.00000000  0.00000000      0.0000      1.0000
  9   .00000000                  0.0000      1.0000

Root-Mean-Square Total-Sample Standard Deviation = 0.730297
Root-Mean-Square Distance Between Observations = 3.098387

Cluster History

NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Cent Dist  Tie
 20  CL29      CL44          10  0.0074  .955     .      .   47.7    8.2  0.3789
 19  CL38      OB54           3  0.0031  .952     .      .   48.1    9.3  0.3792
 18  CL25      CL41          11  0.0155  .936     .      .   38.8   36.7  0.4192
 17  CL23      CL43          10  0.0120  .924     .      .   35.0   11.6  0.4208
 16  OB11      CL26           6  0.0050  .919     .      .   35.6    5.8  0.4321
 15  CL19      CL31           5  0.0074  .912     .      .   35.4    5.3  0.4362
 14  OB20      CL27           4  0.0046  .907     .      .   36.8    2.9  0.4374
 13  CL18      CL20          21  0.0352  .872     .      .   28.4   19.7  0.4562
 12  CL13      CL16          27  0.0372  .835  .839   -.37   23.4   12.0  0.4968
 11  CL21      CL17          15  0.0289  .806  .828   -1.5   21.6   13.6  0.5183
 10  CL14      CL15           9  0.0200  .786  .815   -1.8   21.6    7.2  0.5281
  9  OB21      OB48           2  0.0047  .781  .801   -1.2   24.1      .  0.5425
  8  CL10      CL24          12  0.0243  .757  .785   -1.3   24.5    5.8  0.5783
  7  CL12      CL46          29  0.0224  .735  .765   -1.3   25.8    5.3  0.6105
  6  CL8       CL37          14  0.0220  .712  .740   -1.1   28.3    4.0  0.6313
  5  CL6       CL32          16  0.0251  .687  .707   -.78   31.9    3.9  0.6664
  4  CL11      CL9           17  0.0287  .659  .660   -.04   38.0    7.0  0.7098
  3  CL4       OB35          18  0.0180  .641  .584   2.21   53.5    3.2  0.7678
  2  CL3       CL5           34  0.2175  .423  .400   0.67   44.8   31.4  0.8923
  1  CL7       CL2           63  0.4232  .000  .000   0.00      .   44.8  0.9156
---------------------------------------------------------------------------------------------------------- class ----------------------------------------------------------------------------------------------- Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea --------+---------------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 >############## ## ######## ## # 2 ## ####### ######## # 3 # >############## ----------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------ color ------------------------------------------------------------------------------- Blue Green Red White Yellow --------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ########## ####### ############ 2 # ## ### ############ 3 ## ## ######### # ## ------------------------------------------------------------------------------------------
Cluster Analysis of Grocery Boxes
Analysis 7: Standardized row-standardized logarithms and color (s=.8)

The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue  Difference  Proportion  Cumulative
  1  2.61400794  0.93268930      0.3631      0.3631
  2  1.68131864  0.77645948      0.2335      0.5966
  3  0.90485916  0.22547234      0.1257      0.7222
  4  0.67938683  0.00292216      0.0944      0.8166
  5  0.67646466  0.12119211      0.0940      0.9106
  6  0.55527255  0.46658428      0.0771      0.9877
  7  0.08868827  0.08868827      0.0123      1.0000
  8  0.00000000  0.00000000      0.0000      1.0000
  9   .00000000                  0.0000      1.0000

Root-Mean-Square Total-Sample Standard Deviation = 0.894427
Root-Mean-Square Distance Between Observations = 3.794733

Cluster History

NCL  --Clusters Joined---  FREQ   SPRSQ   RSQ  ERSQ    CCC    PSF   PST2  Norm Cent Dist  Tie
 20  CL29      CL44          10  0.0049  .970     .      .   72.7    8.2  0.3094
 19  CL38      OB54           3  0.0021  .968     .      .   73.3    9.3  0.3096
 18  CL21      CL23          12  0.0153  .952     .      .   53.0   15.0  0.4029
 17  OB21      OB48           2  0.0032  .949     .      .   53.8      .  0.443
 16  CL27      CL24           6  0.0095  .940     .      .   48.9   10.4  0.444
 15  CL19      CL16           9  0.0136  .926     .      .   43.0    6.1  0.4587
 14  CL41      OB11           7  0.0058  .920     .      .   43.6   51.2  0.4591
 13  CL26      CL46           7  0.0105  .910     .      .   42.1   22.0  0.4769
 12  CL25      CL13          12  0.0205  .889  .743   16.5   37.3   13.8  0.467
 11  CL18      OB20          13  0.0093  .880  .726   16.7   38.2    4.0  0.5586
 10  CL17      CL37           4  0.0134  .867  .706   16.5   38.3    7.9  0.6454
  9  CL14      CL20          17  0.0567  .810  .684   11.0   28.8   52.6  0.6534
  8  CL12      CL9           29  0.0828  .727  .659   5.03   20.9   20.7  0.604
  7  CL11      CL43          16  0.0359  .691  .631   4.25   20.9   14.4  0.6758
  6  CL15      CL31          11  0.0263  .665  .598   4.24   22.6    8.0  0.7065
  5  CL7       CL6           27  0.1430  .522  .557   -1.7   15.8   28.2  0.8247
  4  CL8       CL5           56  0.2692  .253  .507   -9.1    6.6   31.5  0.7726
  3  OB35      CL32           3  0.0216  .231  .435   -6.6    9.0   46.0  1.0027
  2  CL4       CL10          60  0.1228  .108  .289   -5.6    7.4    9.5  1.0096
  1  CL2       CL3           63  0.1083  .000  .000   0.00      .    7.4  1.0839
---------------------------------------------------------------------------------------------------------- class ----------------------------------------------------------------------------------------------- Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea --------+---------------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ### ## #### # 2 ## ###### ##### 3 ####### 4 ###### #### ## 5 ### 6 ######### 7 # ### 8 ## 9 ## 10 # ----------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------ color ------------------------------------------------------------------------------- Blue Green Red White Yellow --------+---------------+---------------+---------------+---------------+--------------- CLUSTER 1 ########## 2 ############# 3 ####### 4 ############ 5 ### 6 ######### 7 #### 8 ## 9 ## 10 # ------------------------------------------------------------------------------------------