A wide variety of distance and similarity measures are used in cluster analysis (Anderberg 1973, Sneath and Sokal 1973). If your data are in coordinate form and you want to use a non-Euclidean distance for clustering, you can compute a distance matrix using the DISTANCE procedure.
Similarity measures must be converted to dissimilarities before being used in PROC CLUSTER. Such conversion can be done in a variety of ways, such as taking reciprocals or subtracting from a large value. The choice of conversion method depends on the application and the similarity measure. Where applicable, PROC DISTANCE provides a corresponding dissimilarity measure for each similarity measure.
In the following example, the observations are states. Binary-valued variables correspond to various grounds for divorce and indicate whether each ground applies in each of the states of the USA. A value of 1 indicates that the ground for divorce applies, and a value of 0 indicates that it does not. The 0-0 matches are treated as totally irrelevant; therefore, each variable has an asymmetric nominal level of measurement. The absence value is 0.
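The two conversions mentioned above can be sketched outside SAS as follows. This is only an illustration in Python, not part of the SAS example: the function names subtract_from_max and reciprocal are hypothetical, and the similarity values are made up and assumed to lie on a 0-to-1 scale.

```python
def subtract_from_max(similarity, max_value=1.0):
    """Dissimilarity as (largest possible similarity) - similarity.

    max_value is an assumption about the scale of the similarity measure.
    """
    return max_value - similarity


def reciprocal(similarity):
    """Dissimilarity as 1 / similarity; undefined when the similarity is 0."""
    if similarity == 0:
        raise ValueError("reciprocal conversion is undefined for a similarity of 0")
    return 1.0 / similarity


# Hypothetical similarity values, for illustration only.
for s in (0.25, 0.5, 1.0):
    print(s, subtract_from_max(s), reciprocal(s))
```

Note that the two conversions behave differently at the extremes: subtraction maps a perfect similarity to a distance of 0, whereas the reciprocal maps it to 1 and cannot handle a similarity of 0. This is one reason the choice depends on the measure and the application.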
The DISTANCE procedure is used to compute the Jaccard coefficient (Anderberg 1973, pp. 89, 115, and 117) between each pair of states. The Jaccard coefficient is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. Since dissimilarity measures are required by the CLUSTER procedure, the DJACCARD coefficient is selected. Output 26.1.1 displays the distance matrix between the first ten states.
Grounds for Divorce
First 10 states

state            Alabama      Alaska     Arizona    Arkansas  California    Colorado Connecticut    Delaware     Florida     Georgia
Alabama          0.00000           .           .           .           .           .           .           .           .           .
Alaska           0.22222     0.00000           .           .           .           .           .           .           .           .
Arizona          0.88889     0.85714     0.00000           .           .           .           .           .           .           .
Arkansas         0.11111     0.33333     1.00000     0.00000           .           .           .           .           .           .
California       0.77778     0.71429     0.50000     0.88889     0.00000           .           .           .           .           .
Colorado         0.88889     0.85714     0.00000     1.00000     0.50000     0.00000           .           .           .           .
Connecticut      0.11111     0.33333     0.87500     0.22222     0.75000     0.87500     0.00000           .           .           .
Delaware         0.77778     0.87500     0.50000     0.88889     0.66667     0.50000     0.75000     0.00000           .           .
Florida          0.77778     0.71429     0.50000     0.88889     0.00000     0.50000     0.75000     0.66667     0.00000           .
Georgia          0.22222     0.00000     0.85714     0.33333     0.71429     0.85714     0.33333     0.87500     0.71429           0
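As a cross-check on the definition of the Jaccard coefficient, a few entries of the distance matrix can be recomputed by hand. The following is a minimal sketch in Python rather than SAS. It assumes that the DJACCARD dissimilarity is 1 minus the Jaccard coefficient; the function name jaccard_dissimilarity is hypothetical, and the data strings are copied from the divorce data set.

```python
def jaccard_dissimilarity(a, b):
    """1 - (variables coded 1 for both rows) / (variables coded 1 for either row)."""
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    if either == 0:
        # Only 0-0 matches: the coefficient is undefined.
        raise ValueError("no variable is coded 1 in either row")
    return 1.0 - both / either


# Rows copied from the divorce data (incompat through separate).
grounds = {
    "Alabama": "111111111",
    "Alaska": "111011110",
    "Arizona": "100000000",
    "Colorado": "100000000",
}

print(round(jaccard_dissimilarity(grounds["Alabama"], grounds["Alaska"]), 5))    # 0.22222
print(round(jaccard_dissimilarity(grounds["Alabama"], grounds["Arizona"]), 5))   # 0.88889
print(round(jaccard_dissimilarity(grounds["Arizona"], grounds["Colorado"]), 5))  # 0.0
```

Alabama and Alaska share 7 grounds out of the 9 that apply in at least one of them, so the dissimilarity is 1 - 7/9 = 0.22222, which agrees with the printed matrix. Arizona and Colorado have identical rows, which is why their distance is 0.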
The CENTROID method is used to perform the cluster analysis, and the resulting tree diagram from PROC CLUSTER is saved into the tree output data set. Output 26.1.2 displays the cluster history.
The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis

Root-Mean-Square Distance Between Observations = 0.694873

Cluster History
                                                                   Norm  T
                                                                   Cent  i
 NCL  ---------Clusters Joined----------  FREQ     PSF    PST2     Dist  e
  49  Arizona         Colorado               2       .       .        0  T
  48  California      Florida                2       .       .        0  T
  47  Alaska          Georgia                2       .       .        0  T
  46  Delaware        Hawaii                 2       .       .        0  T
  45  Connecticut     Idaho                  2       .       .        0  T
  44  CL49            Iowa                   3       .       .        0  T
  43  CL47            Kansas                 3       .       .        0  T
  42  CL44            Kentucky               4       .       .        0  T
  41  CL42            Michigan               5       .       .        0  T
  40  CL41            Minnesota              6       .       .        0  T
  39  CL43            Mississippi            4       .       .        0  T
  38  CL40            Missouri               7       .       .        0  T
  37  CL38            Montana                8       .       .        0  T
  36  CL37            Nebraska               9       .       .        0  T
  35  North Dakota    Oklahoma               2       .       .        0  T
  34  CL36            Oregon                10       .       .        0  T
  33  Massachusetts   Rhode Island           2       .       .        0  T
  32  New Hampshire   Tennessee              2       .       .        0  T
  31  CL46            Washington             3       .       .        0  T
  30  CL31            Wisconsin              4       .       .        0  T
  29  Nevada          Wyoming                2       .       .        0
  28  Alabama         Arkansas               2    1561       .   0.1599  T
  27  CL33            CL32                   4     479       .   0.1799  T
  26  CL39            CL35                   6     265       .   0.1799  T
  25  CL45            West Virginia          3     231       .   0.1799
  24  Maryland        Pennsylvania           2     199       .   0.2399
  23  CL28            Utah                   3     167     3.2   0.2468
  22  CL27            Ohio                   5     136     5.4   0.2698
  21  CL26            Maine                  7     111     8.9   0.2998
  20  CL23            CL21                  10    75.2     8.7   0.3004
  19  CL25            New Jersey             4    71.8     6.5   0.3053  T
  18  CL19            Texas                  5    69.1     2.5   0.3077
  17  CL20            CL22                  15    48.7     9.9   0.3219
  16  New York        Virginia               2    50.1       .   0.3598
  15  CL18            Vermont                6    49.4     2.9   0.3797
  14  CL17            Illinois              16    47.0     3.2   0.4425
  13  CL14            CL15                  22    29.2    15.3   0.4722
  12  CL48            CL29                   4    29.5       .   0.4797  T
  11  CL13            CL24                  24    27.6     4.5   0.5042
  10  CL11            South Dakota          25    28.4     2.4   0.5449
   9  Louisiana       CL16                   3    30.3     3.5   0.5844
   8  CL34            CL30                  14    23.3       .   0.7196
   7  CL8             CL12                  18    19.3    15.0   0.7175
   6  CL10            South Carolina        26    21.4     4.2   0.7384
   5  CL6             New Mexico            27    24.0     4.7   0.8303
   4  CL5             Indiana               28    28.9     4.1   0.8343
   3  CL4             CL9                   31    31.7    10.9   0.8472
   2  CL3             North Carolina        32    55.1     4.1   1.0017
   1  CL2             CL7                   50       .    55.1   1.0663
The TREE procedure generates nine clusters in the output data set out. After being sorted by state, the out data set is merged with the input data set divorce. After being sorted by cluster, the merged data set is printed to display the cluster membership, as shown in Output 26.1.3.
options ls=120 ps=60;

data divorce;
   title 'Grounds for Divorce';
   /* state is read as a 15-column character field, followed by nine 1-digit flags */
   input state $15.
         (incompat cruelty desertn non_supp alcohol
          felony impotenc insanity separate) (1.) @@;
   /* two states per data line: skip 4 columns after the first, release the line after the second */
   if mod(_n_,2) then input +4 @@; else input;
   datalines;
Alabama        111111111    Alaska         111011110
Arizona        100000000    Arkansas       011111111
California     100000010    Colorado       100000000
Connecticut    111111011    Delaware       100000001
Florida        100000010    Georgia        111011110
Hawaii         100000001    Idaho          111111011
Illinois       011011100    Indiana        100001110
Iowa           100000000    Kansas         111011110
Kentucky       100000000    Louisiana      000001001
Maine          111110110    Maryland       011001111
Massachusetts  111111101    Michigan       100000000
Minnesota      100000000    Mississippi    111011110
Missouri       100000000    Montana        100000000
Nebraska       100000000    Nevada         100000011
New Hampshire  111111100    New Jersey     011011011
New Mexico     111000000    New York       011001001
North Carolina 000000111    North Dakota   111111110
Ohio           111011101    Oklahoma       111111110
Oregon         100000000    Pennsylvania   011001110
Rhode Island   111111101    South Carolina 011010001
South Dakota   011111000    Tennessee      111111100
Texas          111001011    Utah           011111110
Vermont        011101011    Virginia       010001001
Washington     100000001    West Virginia  111011011
Wisconsin      100000001    Wyoming        100000011
;

/* Jaccard dissimilarity between each pair of states; 0 is the absence value */
proc distance data=divorce method=djaccard absent=0 out=distjacc;
   var anominal(incompat--separate);
   id state;
run;

proc print data=distjacc(obs=10);
   id state;
   var alabama--georgia;
   title2 'First 10 states';
run;
title2;

proc cluster data=distjacc method=centroid pseudo outtree=tree;
   id state;
   var alabama--wyoming;
run;

/* cut the tree at nine clusters */
proc tree data=tree noprint n=9 out=out;
   id state;
run;

proc sort;
   by state;
run;

data clus;
   merge divorce out;
   by state;
run;

proc sort;
   by cluster;
run;

proc print;
   id state;
   var incompat--separate;
   by cluster;
run;
----------------------------------------- CLUSTER=1 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
Arizona                 1        0        0        0        0        0        0        0        0
Colorado                1        0        0        0        0        0        0        0        0
Iowa                    1        0        0        0        0        0        0        0        0
Kentucky                1        0        0        0        0        0        0        0        0
Michigan                1        0        0        0        0        0        0        0        0
Minnesota               1        0        0        0        0        0        0        0        0
Missouri                1        0        0        0        0        0        0        0        0
Montana                 1        0        0        0        0        0        0        0        0
Nebraska                1        0        0        0        0        0        0        0        0
Oregon                  1        0        0        0        0        0        0        0        0

----------------------------------------- CLUSTER=2 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
California              1        0        0        0        0        0        0        1        0
Florida                 1        0        0        0        0        0        0        1        0
Nevada                  1        0        0        0        0        0        0        1        1
Wyoming                 1        0        0        0        0        0        0        1        1

----------------------------------------- CLUSTER=3 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
Alabama                 1        1        1        1        1        1        1        1        1
Alaska                  1        1        1        0        1        1        1        1        0
Arkansas                0        1        1        1        1        1        1        1        1
Connecticut             1        1        1        1        1        1        0        1        1
Georgia                 1        1        1        0        1        1        1        1        0
Idaho                   1        1        1        1        1        1        0        1        1
Illinois                0        1        1        0        1        1        1        0        0
Kansas                  1        1        1        0        1        1        1        1        0
Maine                   1        1        1        1        1        0        1        1        0
Maryland                0        1        1        0        0        1        1        1        1
Massachusetts           1        1        1        1        1        1        1        0        1
Mississippi             1        1        1        0        1        1        1        1        0
New Hampshire           1        1        1        1        1        1        1        0        0
New Jersey              0        1        1        0        1        1        0        1        1
North Dakota            1        1        1        1        1        1        1        1        0
Ohio                    1        1        1        0        1        1        1        0        1
Oklahoma                1        1        1        1        1        1        1        1        0
Pennsylvania            0        1        1        0        0        1        1        1        0
Rhode Island            1        1        1        1        1        1        1        0        1
South Dakota            0        1        1        1        1        1        0        0        0
Tennessee               1        1        1        1        1        1        1        0        0
Texas                   1        1        1        0        0        1        0        1        1
Utah                    0        1        1        1        1        1        1        1        0
Vermont                 0        1        1        1        0        1        0        1        1
West Virginia           1        1        1        0        1        1        0        1        1

----------------------------------------- CLUSTER=4 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
Delaware                1        0        0        0        0        0        0        0        1
Hawaii                  1        0        0        0        0        0        0        0        1
Washington              1        0        0        0        0        0        0        0        1
Wisconsin               1        0        0        0        0        0        0        0        1

----------------------------------------- CLUSTER=5 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
Louisiana               0        0        0        0        0        1        0        0        1
New York                0        1        1        0        0        1        0        0        1
Virginia                0        1        0        0        0        1        0        0        1

----------------------------------------- CLUSTER=6 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
South Carolina          0        1        1        0        1        0        0        0        1

----------------------------------------- CLUSTER=7 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
New Mexico              1        1        1        0        0        0        0        0        0

----------------------------------------- CLUSTER=8 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
Indiana                 1        0        0        0        0        1        1        1        0

----------------------------------------- CLUSTER=9 -----------------------------------------

state            incompat  cruelty  desertn non_supp  alcohol   felony impotenc insanity separate
North Carolina          0        0        0        0        0        0        1        1        1
The following data set contains the average dividend yields for 15 utility stocks in the U.S. The observations are names of the companies, and the variables correspond to the annual dividend yields for the period of 1986-1990. The objective is to group similar stocks into clusters.
Before the cluster analysis is performed, the correlation similarity is chosen to measure the closeness between each pair of observations. Since the CLUSTER procedure requires distance-type measures, METHOD=DCORR is specified in the PROC DISTANCE statement to transform the correlation measures into distance measures. Notice that in Output 26.2.1, all the values in the distance matrix are between 0 and 2.
Stock Dividends
Distance Matrix for 15 Utility Stocks

                                                                   Orange___                 Kansas_
                             Cincinnati_      Texas_    Detroit_   Rockland_   Kentucky_    Power___      Union_   Dominion_
compname                             G_E   Utilities      Edison    Utilitie   Utilities       Light    Electric   Resources
Cincinnati G&E                   0.00000           .           .           .           .           .           .           .
Texas Utilities                  0.82056     0.00000           .           .           .           .           .           .
Detroit Edison                   0.40511     0.65453     0.00000           .           .           .           .           .
Orange & Rockland Utilitie       1.35380     0.88583     1.27306     0.00000           .           .           .           .
Kentucky Utilities               1.35581     0.92539     1.29382     0.12268     0.00000           .           .           .
Kansas Power & Light             1.34227     0.94371     1.31696     0.19905     0.12874     0.00000           .           .
Union Electric                   0.98516     0.29043     0.89048     0.68798     0.71824     0.72082     0.00000           .
Dominion Resources               1.32945     0.96853     1.29016     0.33290     0.21510     0.24189     0.76587     0.00000
Allegheny Power                  1.30492     0.81666     1.24565     0.17844     0.15759     0.17029     0.58452     0.27819
Minnesota Power & Light          1.24069     0.74082     1.20432     0.32581     0.30462     0.27231     0.48372     0.35733
Iowa-Ill Gas & Electric          1.04924     0.43100     0.97616     0.61166     0.61760     0.61736     0.16923     0.63545
Pennsylvania Power & Light       0.74931     0.37821     0.44256     1.03566     1.08878     1.12876     0.63285     1.14354
Oklahoma Gas & Electric          1.00604     0.30141     0.86200     0.68021     0.70259     0.73158     0.17122     0.72977
Wisconsin Energy                 1.17988     0.54830     1.03081     0.45013     0.47184     0.53381     0.37405     0.51969
Green Mountain Power             1.30397     0.88063     1.27176     0.26948     0.17909     0.15377     0.64869     0.17360

                                              Minnesota_     Iowa_Ill_                   Oklahoma_                      Green_
                                Allegheny_      Power___        Gas___ Pennsylvania_        Gas___    Wisconsin_     Mountain_
compname                             Power         Light      Electric Power___Light      Electric        Energy         Power
Cincinnati G&E                           .             .             .             .             .             .             .
Texas Utilities                          .             .             .             .             .             .             .
Detroit Edison                           .             .             .             .             .             .             .
Orange & Rockland Utilitie               .             .             .             .             .             .             .
Kentucky Utilities                       .             .             .             .             .             .             .
Kansas Power & Light                     .             .             .             .             .             .             .
Union Electric                           .             .             .             .             .             .             .
Dominion Resources                       .             .             .             .             .             .             .
Allegheny Power                    0.00000             .             .             .             .             .             .
Minnesota Power & Light            0.15615       0.00000             .             .             .             .             .
Iowa-Ill Gas & Electric            0.47900       0.36368       0.00000             .             .             .             .
Pennsylvania Power & Light         1.02358       0.99384       0.75596       0.00000             .             .             .
Oklahoma Gas & Electric            0.58391       0.50744       0.19673       0.60216       0.00000             .             .
Wisconsin Energy                   0.37522       0.36319       0.30259       0.76085       0.28070       0.00000             .
Green Mountain Power               0.13958       0.19370       0.52083       1.09269       0.64175       0.44814             0
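The entries of this distance matrix can be spot-checked outside SAS. The sketch below is in Python and rests on an assumption that is not stated in the text: that METHOD=DCORR transforms the Pearson correlation r between two rows into sqrt(1 - r). Under that assumption the first entries of Output 26.2.1 are reproduced, and the largest possible value would be sqrt(2), about 1.414, which is consistent with every printed value being below 2. The function names pearson and corr_distance are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corr_distance(x, y):
    """Assumed DCORR form: sqrt(1 - r); max() guards against rounding below 0."""
    return math.sqrt(max(0.0, 1.0 - pearson(x, y)))


# Dividend yields for 1986-1990, copied from the stock data set.
cincinnati = [8.4, 8.2, 8.4, 8.1, 8.0]
texas = [7.9, 8.9, 10.4, 8.9, 8.3]
detroit = [9.7, 10.7, 11.4, 7.8, 6.5]

print(round(corr_distance(cincinnati, texas), 5))    # matrix entry: 0.82056
print(round(corr_distance(cincinnati, detroit), 5))  # matrix entry: 0.40511
```

Because the distance depends only on the correlation, two stocks whose yields move together over the five years are close even if their yield levels differ.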
The do_cluster macro performs cluster analysis and presents the results in graphs. The CLUSTER procedure performs hierarchical clustering by using agglomerative methods based on the distance data created by the preceding PROC DISTANCE step. The resulting tree diagrams can be saved into an output data set and later plotted by the TREE procedure. Since the CCC statistic is not suitable for distance-type data, only the pseudo statistics are requested to identify the number of clusters.
Two clustering methods are invoked in the do_cluster macro: Ward's method and the average linkage method. Since the pseudo t-squared statistics from both Ward's method and the average linkage method contain many missing values, only the graphs of the pseudo F statistic versus the number of clusters are plotted.
Both Output 26.2.2 and Output 26.2.3 suggest a possible solution of four clusters, and both clustering methods agree on the resulting clusters, as shown in Output 26.2.4 and Output 26.2.5. The four clusters are:
Cincinnati G&E and Detroit Edison
Texas Utilities and Pennsylvania Power & Light
Union Electric, Iowa-Ill Gas & Electric, Oklahoma Gas & Electric, and Wisconsin Energy
Orange & Rockland Utilities, Kentucky Utilities, Kansas Power & Light, Allegheny Power, Green Mountain Power, Dominion Resources, and Minnesota Power & Light
data stock;
   title 'Stock Dividends';
   /* the & modifier lets compname contain single blanks; two blanks end the value */
   input compname &$26. div_1986 div_1987 div_1988 div_1989 div_1990;
   datalines;
Cincinnati G&E                 8.4   8.2   8.4   8.1   8.0
Texas Utilities                7.9   8.9  10.4   8.9   8.3
Detroit Edison                 9.7  10.7  11.4   7.8   6.5
Orange & Rockland Utilities    6.5   7.2   7.3   7.7   7.9
Kentucky Utilities             6.5   6.9   7.0   7.2   7.5
Kansas Power & Light           5.9   6.4   6.9   7.4   8.0
Union Electric                 7.1   7.5   8.4   7.8   7.7
Dominion Resources             6.7   6.9   7.0   7.0   7.4
Allegheny Power                6.7   7.3   7.8   7.9   8.3
Minnesota Power & Light        5.6   6.1   7.2   7.0   7.5
Iowa-Ill Gas & Electric        7.1   7.5   8.5   7.8   8.0
Pennsylvania Power & Light     7.2   7.6   7.7   7.4   7.1
Oklahoma Gas & Electric        6.1   6.7   7.4   6.7   6.8
Wisconsin Energy               5.1   5.7   6.0   5.7   5.9
Green Mountain Power           7.1   7.4   7.8   7.8   8.3
;

/* correlation-based distance between each pair of companies */
proc distance data=stock method=dcorr out=distdcorr;
   var interval(div_1986 div_1987 div_1988 div_1989 div_1990);
   id compname;
run;

proc print data=distdcorr;
   id compname;
   title2 'Distance Matrix for 15 Utility Stocks';
run;
title2;

%macro do_cluster(clusmtd);
   goptions vsize=5in hsize=5in htitle=2pct htext=1.5pct;
   %let clusmtd = %upcase(&clusmtd);

   proc cluster data=distdcorr method=&clusmtd outtree=Tree pseudo;
      id compname;
   run;

   /* plot the pseudo F statistic versus the number of clusters */
   legend1 frame cframe=white cborder=black position=center
           value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none order=(0 to 15);

   proc gplot;
      title2 "Cluster Method= &clusmtd";
      plot _psf_*_ncl_='F' /
           frame cframe=white legend=legend1 vaxis=axis1 haxis=axis2;
   run;

   proc tree data=Tree horizontal;
      title2 "Cluster Method= &clusmtd";
      id compname;
   run;
%mend;

%do_cluster(ward);
%do_cluster(average);