Examples


Example 26.1. Divorce Grounds: the Jaccard Coefficient

A wide variety of distance and similarity measures are used in cluster analysis (Anderberg 1973, Sneath and Sokal 1973). If your data are in coordinate form and you want to use a non-Euclidean distance for clustering, you can compute a distance matrix using the DISTANCE procedure.

Similarity measures must be converted to dissimilarities before being used in PROC CLUSTER. Such conversion can be done in a variety of ways, such as taking reciprocals or subtracting from a large value. The choice of conversion method depends on the application and the similarity measure. Where applicable, PROC DISTANCE provides a corresponding dissimilarity measure for each similarity measure.
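
The two conversions mentioned above can be sketched in a few lines of Python. This is only an illustration outside SAS: the function names and the `upper` parameter are hypothetical, and neither function is claimed to be what PROC DISTANCE does internally.

```python
def dissimilarity_by_subtraction(similarity, upper=1.0):
    """Return upper - similarity; `upper` should be the largest attainable similarity."""
    if similarity > upper:
        raise ValueError("similarity exceeds the chosen upper bound")
    return upper - similarity


def dissimilarity_by_reciprocal(similarity):
    """Return 1 / similarity; only meaningful for strictly positive similarities."""
    if similarity <= 0:
        raise ValueError("reciprocal conversion needs a positive similarity")
    return 1.0 / similarity


# Both conversions reverse the ordering: larger similarity, smaller dissimilarity.
for s in (1.0, 0.75, 0.5, 0.25):
    print(s, dissimilarity_by_subtraction(s), dissimilarity_by_reciprocal(s))
```

Note that the two conversions produce different scales: subtraction maps the maximum similarity to 0, whereas the reciprocal of a similarity of 1 is 1, so identical objects do not receive a zero dissimilarity. That difference is one reason the choice depends on the application.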

In the following example, the observations are states of the USA. Binary-valued variables correspond to various grounds for divorce and indicate whether each ground applies in a given state. A value of 1 indicates that the ground for divorce applies, and a value of 0 indicates that it does not. The 0-0 matches are treated as totally irrelevant; therefore, each variable has an asymmetric nominal level of measurement. The absence value is 0.

The DISTANCE procedure is used to compute the Jaccard coefficient (Anderberg 1973, pp. 89, 115, and 117) between each pair of states. The Jaccard coefficient is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. Since dissimilarity measures are required by the CLUSTER procedure, the DJACCARD coefficient is selected. Output 26.1.1 displays the distance matrix between the first ten states.
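
The definition above can be checked by hand with a small Python sketch. It assumes that the dissimilarity is one minus the Jaccard coefficient; the function name and the handling of two all-zero records are illustrative choices, not a description of PROC DISTANCE internals. The bit strings are copied from the data lines of this example.

```python
def jaccard_dissimilarity(a, b):
    """1 - Jaccard coefficient for two equal-length strings of '0'/'1'.

    Positions where both records are '0' (0-0 matches) are ignored.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    if either == 0:
        # Assumption for this sketch: the measure is undefined without any 1s.
        raise ValueError("no variable is coded 1 in either record")
    return 1.0 - both / either


grounds = {
    "Alabama": "111111111",
    "Alaska": "111011110",
    "Arizona": "100000000",
    "Arkansas": "011111111",
}

for first, second in [("Alabama", "Alaska"), ("Alabama", "Arizona"),
                      ("Alaska", "Arizona"), ("Arizona", "Arkansas")]:
    d = jaccard_dissimilarity(grounds[first], grounds[second])
    print(f"{first:8s} {second:8s} {d:.5f}")
```

For Alabama and Alaska, seven grounds are coded 1 in both states and nine in at least one, so the dissimilarity is 1 - 7/9 = 2/9, which agrees with the entry 0.22222 in Output 26.1.1.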

Output 26.1.1: Distance Matrix Based on the Jaccard Coefficient
                                              Grounds for Divorce
                                                First 10 states

  state         Alabama   Alaska  Arizona  Arkansas  California  Colorado  Connecticut  Delaware  Florida  Georgia

  Alabama       0.00000        .        .         .           .         .            .         .        .        .
  Alaska        0.22222  0.00000        .         .           .         .            .         .        .        .
  Arizona       0.88889  0.85714  0.00000         .           .         .            .         .        .        .
  Arkansas      0.11111  0.33333  1.00000   0.00000           .         .            .         .        .        .
  California    0.77778  0.71429  0.50000   0.88889     0.00000         .            .         .        .        .
  Colorado      0.88889  0.85714  0.00000   1.00000     0.50000   0.00000            .         .        .        .
  Connecticut   0.11111  0.33333  0.87500   0.22222     0.75000   0.87500      0.00000         .        .        .
  Delaware      0.77778  0.87500  0.50000   0.88889     0.66667   0.50000      0.75000   0.00000        .        .
  Florida       0.77778  0.71429  0.50000   0.88889     0.00000   0.50000      0.75000   0.66667  0.00000        .
  Georgia       0.22222  0.00000  0.85714   0.33333     0.71429   0.85714      0.33333   0.87500  0.71429        0
 

The CENTROID method is used to perform the cluster analysis, and the resulting tree diagram from PROC CLUSTER is saved into the tree output data set. Output 26.1.2 displays the cluster history.

Output 26.1.2: Clustering History
                        The CLUSTER Procedure
                Centroid Hierarchical Cluster Analysis

  Root-Mean-Square Distance Between Observations = 0.694873

                           Cluster History
                                                             Norm    T
                                                             Cent    i
  NCL  -------Clusters Joined--------  FREQ    PSF   PST2    Dist    e

   49  Arizona        Colorado            2      .      .       0    T
   48  California     Florida             2      .      .       0    T
   47  Alaska         Georgia             2      .      .       0    T
   46  Delaware       Hawaii              2      .      .       0    T
   45  Connecticut    Idaho               2      .      .       0    T
   44  CL49           Iowa                3      .      .       0    T
   43  CL47           Kansas              3      .      .       0    T
   42  CL44           Kentucky            4      .      .       0    T
   41  CL42           Michigan            5      .      .       0    T
   40  CL41           Minnesota           6      .      .       0    T
   39  CL43           Mississippi         4      .      .       0    T
   38  CL40           Missouri            7      .      .       0    T
   37  CL38           Montana             8      .      .       0    T
   36  CL37           Nebraska            9      .      .       0    T
   35  North Dakota   Oklahoma            2      .      .       0    T
   34  CL36           Oregon             10      .      .       0    T
   33  Massachusetts  Rhode Island        2      .      .       0    T
   32  New Hampshire  Tennessee           2      .      .       0    T
   31  CL46           Washington          3      .      .       0    T
   30  CL31           Wisconsin           4      .      .       0    T
   29  Nevada         Wyoming             2      .      .       0
   28  Alabama        Arkansas            2   1561      .  0.1599    T
   27  CL33           CL32                4    479      .  0.1799    T
   26  CL39           CL35                6    265      .  0.1799    T
   25  CL45           West Virginia       3    231      .  0.1799
   24  Maryland       Pennsylvania        2    199      .  0.2399
   23  CL28           Utah                3    167    3.2  0.2468
   22  CL27           Ohio                5    136    5.4  0.2698
   21  CL26           Maine               7    111    8.9  0.2998
   20  CL23           CL21               10   75.2    8.7  0.3004
   19  CL25           New Jersey          4   71.8    6.5  0.3053    T
   18  CL19           Texas               5   69.1    2.5  0.3077
   17  CL20           CL22               15   48.7    9.9  0.3219
   16  New York       Virginia            2   50.1      .  0.3598
   15  CL18           Vermont             6   49.4    2.9  0.3797
   14  CL17           Illinois           16   47.0    3.2  0.4425
   13  CL14           CL15               22   29.2   15.3  0.4722
   12  CL48           CL29                4   29.5      .  0.4797    T
   11  CL13           CL24               24   27.6    4.5  0.5042
   10  CL11           South Dakota       25   28.4    2.4  0.5449
    9  Louisiana      CL16                3   30.3    3.5  0.5844
    8  CL34           CL30               14   23.3      .  0.7196
    7  CL8            CL12               18   19.3   15.0  0.7175
    6  CL10           South Carolina     26   21.4    4.2  0.7384
    5  CL6            New Mexico         27   24.0    4.7  0.8303
    4  CL5            Indiana            28   28.9    4.1  0.8343
    3  CL4            CL9                31   31.7   10.9  0.8472
    2  CL3            North Carolina     32   55.1    4.1  1.0017
    1  CL2            CL7                50      .   55.1  1.0663
 

The TREE procedure generates nine clusters in the output data set out. After being sorted by state, the out data set is merged with the input data set divorce. The merged data set is then sorted by cluster and printed to display the cluster membership, as shown in Output 26.1.3.

options ls=120 ps=60;

data divorce;
   title 'Grounds for Divorce';
   /* each state name occupies a fixed 15-character field in the data lines */
   input state $15.
         (incompat cruelty desertn non_supp alcohol
          felony impotenc insanity separate) (1.) @@;
   /* two states per line: skip 4 columns after the first, release the line after the second */
   if mod(_n_,2) then input +4 @@; else input;
   datalines;
Alabama        111111111    Alaska         111011110
Arizona        100000000    Arkansas       011111111
California     100000010    Colorado       100000000
Connecticut    111111011    Delaware       100000001
Florida        100000010    Georgia        111011110
Hawaii         100000001    Idaho          111111011
Illinois       011011100    Indiana        100001110
Iowa           100000000    Kansas         111011110
Kentucky       100000000    Louisiana      000001001
Maine          111110110    Maryland       011001111
Massachusetts  111111101    Michigan       100000000
Minnesota      100000000    Mississippi    111011110
Missouri       100000000    Montana        100000000
Nebraska       100000000    Nevada         100000011
New Hampshire  111111100    New Jersey     011011011
New Mexico     111000000    New York       011001001
North Carolina 000000111    North Dakota   111111110
Ohio           111011101    Oklahoma       111111110
Oregon         100000000    Pennsylvania   011001110
Rhode Island   111111101    South Carolina 011010001
South Dakota   011111000    Tennessee      111111100
Texas          111001011    Utah           011111110
Vermont        011101011    Virginia       010001001
Washington     100000001    West Virginia  111011011
Wisconsin      100000001    Wyoming        100000011
;

/* Jaccard dissimilarity between states; 0 marks absence */
proc distance data=divorce method=djaccard absent=0 out=distjacc;
   var anominal(incompat--separate);
   id state;
run;

proc print data=distjacc(obs=10);
   id state; var alabama--georgia;
   title2 'First 10 states';
run;
title2;

proc cluster data=distjacc method=centroid
             pseudo outtree=tree;
   id state;
   var alabama--wyoming;
run;

/* cut the tree at nine clusters and save the assignments */
proc tree data=tree noprint n=9 out=out;
   id state;
run;

proc sort;
   by state;
run;

data clus;
   merge divorce out;
   by state;
run;

proc sort;
   by cluster;
run;

proc print;
   id state;
   var incompat--separate;
   by cluster;
run;

Output 26.1.3: Cluster Membership
  ----------------------------------------------- CLUSTER=1 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  Arizona                  1         0         0         0         0         0         0         0         0
  Colorado                 1         0         0         0         0         0         0         0         0
  Iowa                     1         0         0         0         0         0         0         0         0
  Kentucky                 1         0         0         0         0         0         0         0         0
  Michigan                 1         0         0         0         0         0         0         0         0
  Minnesota                1         0         0         0         0         0         0         0         0
  Missouri                 1         0         0         0         0         0         0         0         0
  Montana                  1         0         0         0         0         0         0         0         0
  Nebraska                 1         0         0         0         0         0         0         0         0
  Oregon                   1         0         0         0         0         0         0         0         0

  ----------------------------------------------- CLUSTER=2 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  California               1         0         0         0         0         0         0         1         0
  Florida                  1         0         0         0         0         0         0         1         0
  Nevada                   1         0         0         0         0         0         0         1         1
  Wyoming                  1         0         0         0         0         0         0         1         1

  ----------------------------------------------- CLUSTER=3 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  Alabama                  1         1         1         1         1         1         1         1         1
  Alaska                   1         1         1         0         1         1         1         1         0
  Arkansas                 0         1         1         1         1         1         1         1         1
  Connecticut              1         1         1         1         1         1         0         1         1
  Georgia                  1         1         1         0         1         1         1         1         0
  Idaho                    1         1         1         1         1         1         0         1         1
  Illinois                 0         1         1         0         1         1         1         0         0
  Kansas                   1         1         1         0         1         1         1         1         0
  Maine                    1         1         1         1         1         0         1         1         0
  Maryland                 0         1         1         0         0         1         1         1         1
  Massachusetts            1         1         1         1         1         1         1         0         1
  Mississippi              1         1         1         0         1         1         1         1         0
  New Hampshire            1         1         1         1         1         1         1         0         0
  New Jersey               0         1         1         0         1         1         0         1         1
  North Dakota             1         1         1         1         1         1         1         1         0
  Ohio                     1         1         1         0         1         1         1         0         1
  Oklahoma                 1         1         1         1         1         1         1         1         0
  Pennsylvania             0         1         1         0         0         1         1         1         0
  Rhode Island             1         1         1         1         1         1         1         0         1
  South Dakota             0         1         1         1         1         1         0         0         0
  Tennessee                1         1         1         1         1         1         1         0         0
  Texas                    1         1         1         0         0         1         0         1         1
  Utah                     0         1         1         1         1         1         1         1         0
  Vermont                  0         1         1         1         0         1         0         1         1
  West Virginia            1         1         1         0         1         1         0         1         1

  ----------------------------------------------- CLUSTER=4 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  Delaware                 1         0         0         0         0         0         0         0         1
  Hawaii                   1         0         0         0         0         0         0         0         1
  Washington               1         0         0         0         0         0         0         0         1
  Wisconsin                1         0         0         0         0         0         0         0         1

  ----------------------------------------------- CLUSTER=5 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  Louisiana                0         0         0         0         0         1         0         0         1
  New York                 0         1         1         0         0         1         0         0         1
  Virginia                 0         1         0         0         0         1         0         0         1

  ----------------------------------------------- CLUSTER=6 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  South Carolina           0         1         1         0         1         0         0         0         1

  ----------------------------------------------- CLUSTER=7 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  New Mexico               1         1         1         0         0         0         0         0         0

  ----------------------------------------------- CLUSTER=8 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  Indiana                  1         0         0         0         0         1         1         1         0

  ----------------------------------------------- CLUSTER=9 ------------------------------------------------

  state             incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate

  North Carolina           0         0         0         0         0         0         1         1         1
 

Example 26.2. Financial Data: Stock Dividends

The following data set contains the average dividend yields for 15 utility stocks in the U.S. The observations are names of the companies, and the variables correspond to the annual dividend yields for the period of 1986-1990. The objective is to group similar stocks into clusters.

Before the cluster analysis is performed, the correlation similarity is chosen to measure the closeness between each pair of observations. Because the CLUSTER procedure requires distance-type measures, METHOD=DCORR is specified in the PROC DISTANCE statement to transform the correlations into distances. Notice that in Output 26.2.1, all the values in the distance matrix are between 0 and 2.
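
As a hand check of the transformation, the following Python sketch assumes that DCORR is the Pearson correlation r between two observations transformed as sqrt(1 - r). That formula is an assumption of this sketch rather than something stated in the text above, although it does reproduce the printed entries for the first three companies to five decimals. The function names are illustrative; the yields are copied from the data lines of this example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dcorr(x, y):
    """Assumed DCORR transform: sqrt(1 - r)."""
    # max() guards against a tiny negative value from rounding when r is close to 1
    return math.sqrt(max(0.0, 1.0 - pearson(x, y)))


yields = {
    "Cincinnati G&E": [8.4, 8.2, 8.4, 8.1, 8.0],
    "Texas Utilities": [7.9, 8.9, 10.4, 8.9, 8.3],
    "Detroit Edison": [9.7, 10.7, 11.4, 7.8, 6.5],
}

names = list(yields)
for i, first in enumerate(names):
    for second in names[:i]:
        # compare with the corresponding entries of Output 26.2.1
        print(f"{first:16s} {second:16s} {dcorr(yields[first], yields[second]):.5f}")
```

Under this assumption the distance is 0 for perfectly correlated observations and reaches sqrt(2), about 1.414, for perfectly negatively correlated ones, which is consistent with the largest entry in the matrix (1.35581).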

Output 26.2.1: Distance Matrix Based on the DCORR Coefficient
                                                   Stock Dividends
                                        Distance Matrix for 15 Utility Stocks

                                                                     Orange___                 Kansas_
                                 Cincinnati_      Texas_    Detroit_   Rockland_   Kentucky_    Power___      Union_   Dominion_
  compname                               G_E   Utilities      Edison    Utilitie   Utilities       Light    Electric   Resources

  Cincinnati G&E                     0.00000           .           .           .           .           .           .           .
  Texas Utilities                    0.82056     0.00000           .           .           .           .           .           .
  Detroit Edison                     0.40511     0.65453     0.00000           .           .           .           .           .
  Orange & Rockland Utilitie         1.35380     0.88583     1.27306     0.00000           .           .           .           .
  Kentucky Utilities                 1.35581     0.92539     1.29382     0.12268     0.00000           .           .           .
  Kansas Power & Light               1.34227     0.94371     1.31696     0.19905     0.12874     0.00000           .           .
  Union Electric                     0.98516     0.29043     0.89048     0.68798     0.71824     0.72082     0.00000           .
  Dominion Resources                 1.32945     0.96853     1.29016     0.33290     0.21510     0.24189     0.76587     0.00000
  Allegheny Power                    1.30492     0.81666     1.24565     0.17844     0.15759     0.17029     0.58452     0.27819
  Minnesota Power & Light            1.24069     0.74082     1.20432     0.32581     0.30462     0.27231     0.48372     0.35733
  Iowa-Ill Gas & Electric            1.04924     0.43100     0.97616     0.61166     0.61760     0.61736     0.16923     0.63545
  Pennsylvania Power & Light         0.74931     0.37821     0.44256     1.03566     1.08878     1.12876     0.63285     1.14354
  Oklahoma Gas & Electric            1.00604     0.30141     0.86200     0.68021     0.70259     0.73158     0.17122     0.72977
  Wisconsin Energy                   1.17988     0.54830     1.03081     0.45013     0.47184     0.53381     0.37405     0.51969
  Green Mountain Power               1.30397     0.88063     1.27176     0.26948     0.17909     0.15377     0.64869     0.17360

                                                Minnesota_     Iowa_Ill_                   Oklahoma_                      Green_
                                  Allegheny_      Power___        Gas___ Pennsylvania_        Gas___    Wisconsin_     Mountain_
  compname                             Power         Light      Electric Power___Light      Electric        Energy         Power

  Cincinnati G&E                           .             .             .             .             .             .             .
  Texas Utilities                          .             .             .             .             .             .             .
  Detroit Edison                           .             .             .             .             .             .             .
  Orange & Rockland Utilitie               .             .             .             .             .             .             .
  Kentucky Utilities                       .             .             .             .             .             .             .
  Kansas Power & Light                     .             .             .             .             .             .             .
  Union Electric                           .             .             .             .             .             .             .
  Dominion Resources                       .             .             .             .             .             .             .
  Allegheny Power                    0.00000             .             .             .             .             .             .
  Minnesota Power & Light            0.15615       0.00000             .             .             .             .             .
  Iowa-Ill Gas & Electric            0.47900       0.36368       0.00000             .             .             .             .
  Pennsylvania Power & Light         1.02358       0.99384       0.75596       0.00000             .             .             .
  Oklahoma Gas & Electric            0.58391       0.50744       0.19673       0.60216       0.00000             .             .
  Wisconsin Energy                   0.37522       0.36319       0.30259       0.76085       0.28070       0.00000             .
  Green Mountain Power               0.13958       0.19370       0.52083       1.09269       0.64175       0.44814             0
 

The macro do_cluster performs the cluster analysis and presents the results in graphs. The CLUSTER procedure performs hierarchical clustering using agglomerative methods based on the distance data created by the preceding PROC DISTANCE step. The resulting tree diagrams can be saved into an output data set and later plotted by the TREE procedure. Because the CCC statistic is not suitable for distance-type data, only the pseudo statistics (the PSEUDO option) are requested to help identify the number of clusters.

Two clustering methods are invoked through the do_cluster macro: Ward's method and the average linkage method. Because the pseudo t-squared statistics from both methods contain many missing values, only the pseudo F statistic is plotted against the number of clusters.

Both Output 26.2.2 and Output 26.2.3 suggest four as a possible number of clusters, and both clustering methods agree on the resulting clusters, as shown in Output 26.2.4 and Output 26.2.5. The four clusters are:

  • Cincinnati G&E and Detroit Edison

  • Texas Utilities and Pennsylvania Power & Light

  • Union Electric, Iowa-Ill Gas & Electric, Oklahoma Gas & Electric, and Wisconsin Energy

  • Orange & Rockland Utilities, Kentucky Utilities, Kansas Power & Light, Allegheny Power, Green Mountain Power, Dominion Resources, and Minnesota Power & Light

data stock;
   title 'Stock Dividends';
   /* & lets a name contain single blanks; two or more blanks end the field */
   input compname &$26. div_1986 div_1987 div_1988
         div_1989 div_1990;
   datalines;
Cincinnati G&E               8.4    8.2    8.4    8.1    8.0
Texas Utilities              7.9    8.9   10.4    8.9    8.3
Detroit Edison               9.7   10.7   11.4    7.8    6.5
Orange & Rockland Utilities  6.5    7.2    7.3    7.7    7.9
Kentucky Utilities           6.5    6.9    7.0    7.2    7.5
Kansas Power & Light         5.9    6.4    6.9    7.4    8.0
Union Electric               7.1    7.5    8.4    7.8    7.7
Dominion Resources           6.7    6.9    7.0    7.0    7.4
Allegheny Power              6.7    7.3    7.8    7.9    8.3
Minnesota Power & Light      5.6    6.1    7.2    7.0    7.5
Iowa-Ill Gas & Electric      7.1    7.5    8.5    7.8    8.0
Pennsylvania Power & Light   7.2    7.6    7.7    7.4    7.1
Oklahoma Gas & Electric      6.1    6.7    7.4    6.7    6.8
Wisconsin Energy             5.1    5.7    6.0    5.7    5.9
Green Mountain Power         7.1    7.4    7.8    7.8    8.3
;

proc distance data=stock method=dcorr out=distdcorr;
   var interval(div_1986 div_1987 div_1988 div_1989 div_1990);
   id compname;
run;

proc print data=distdcorr;
   id compname;
   title2 'Distance Matrix for 15 Utility Stocks';
run;
title2;

%macro do_cluster(clusmtd);
   goptions vsize=5in hsize=5in htitle=2pct htext=1.5pct;
   %let clusmtd = %upcase(&clusmtd);

   proc cluster data=distdcorr method=&clusmtd outtree=Tree pseudo;
      id compname;
   run;

   /* plot pseudo F statistic vs number of clusters */
   legend1 frame cframe=white cborder=black position=center
           value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none order=(0 to 15);

   proc gplot;
      title2 "Cluster Method= &clusmtd";
      plot _psf_*_ncl_='F' /
           frame cframe=white legend=legend1 vaxis=axis1 haxis=axis2;
   run;

   proc tree data=Tree horizontal;
      title2 "Cluster Method= &clusmtd";
      id compname;
   run;
%mend;

%do_cluster(ward);
%do_cluster(average);

Output 26.2.2: Pseudo F versus Number of Clusters when METHOD=WARD (plot not reproduced)

Output 26.2.3: Pseudo F versus Number of Clusters when METHOD=AVERAGE (plot not reproduced)

Output 26.2.4: Tree Diagram of Clusters versus Semi-Partial R-Square Values when METHOD=WARD (plot not reproduced)

Output 26.2.5: Tree Diagram of Clusters versus Average Distance Between Clusters when METHOD=AVERAGE (plot not reproduced)
 



SAS/STAT 9.1 User's Guide, Volume 2 (2004)
