Example


Example 78.1. Correlations among Physical Variables

The following data are correlations among eight physical variables as given by Harman (1976). The first PROC VARCLUS run clusters on the basis of principal components, and the second run clusters on the basis of centroid components. The third analysis is hierarchical, and the TREE procedure is used to display a tree diagram. The results of the analyses follow.

  data phys8(type=corr);
     title 'Eight Physical Measurements on 305 School Girls';
     title2 'Harman: Modern Factor Analysis, 3rd Ed, p22';
     label height='Height'
           arm_span='Arm Span'
           forearm='Length of Forearm'
           low_leg='Length of Lower Leg'
           weight='Weight'
           bit_diam='Bitrochanteric Diameter'
           girth='Chest Girth'
           width='Chest Width';
     input _name_ $ 1-8
           (height arm_span forearm low_leg weight bit_diam
            girth width)(7.);
     _type_='corr';
     datalines;
  height  1.0    .846   .805   .859   .473   .398   .301   .382
  arm_span.846   1.0    .881   .826   .376   .326   .277   .415
  forearm .805   .881   1.0    .801   .380   .319   .237   .345
  low_leg .859   .826   .801   1.0    .436   .329   .327   .365
  weight  .473   .376   .380   .436   1.0    .762   .730   .629
  bit_diam.398   .326   .319   .329   .762   1.0    .583   .577
  girth   .301   .277   .237   .327   .730   .583   1.0    .539
  width   .382   .415   .345   .365   .629   .577   .539   1.0
  ;

  proc varclus data=phys8;
  run;

The PROC VARCLUS statement invokes the procedure. By default, PROC VARCLUS clusters on the basis of principal components.

As displayed in Output 78.1.1, the cluster component (by default, the first principal component) explains 58.41% of the total variation in the 8 variables.

Output 78.1.1: Principal Cluster Components: Cluster Summary
  Eight Physical Measurements on 305 School Girls
  Harman: Modern Factor Analysis, 3rd Ed, p22

  Oblique Principal Component Cluster Analysis

  Cluster Summary for 1 Cluster

                          Cluster    Variation    Proportion        Second
  Cluster    Members    Variation    Explained     Explained    Eigenvalue
  ------------------------------------------------------------------------
        1          8            8      4.67288        0.5841        1.7710

  Total variation explained = 4.67288 Proportion = 0.5841

  Cluster 1 will be split.

  Cluster Summary for 2 Clusters

                          Cluster    Variation    Proportion        Second
  Cluster    Members    Variation    Explained     Explained    Eigenvalue
  ------------------------------------------------------------------------
        1          4            4     3.509218        0.8773        0.2361
        2          4            4     2.917284        0.7293        0.4764

  Total variation explained = 6.426502 Proportion = 0.8033

  2 Clusters                 R-squared with
                           ------------------
                            Own        Next     1-R**2   Variable
  Cluster      Variable    Cluster    Closest   Ratio    Label
  -----------------------------------------------------------------------------
  Cluster 1    height      0.8777     0.2088    0.1545   Height
               arm_span    0.9002     0.1658    0.1196   Arm Span
               forearm     0.8661     0.1413    0.1560   Length of Forearm
               low_leg     0.8652     0.1829    0.1650   Length of Lower Leg
  -----------------------------------------------------------------------------
  Cluster 2    weight      0.8477     0.1974    0.1898   Weight
               bit_diam    0.7386     0.1341    0.3019   Bitrochanteric Diameter
               girth       0.6981     0.0929    0.3328   Chest Girth
               width       0.6329     0.1619    0.4380   Chest Width

  No cluster meets the criterion for splitting.
 

The cluster is split because the second eigenvalue is greater than 1 (the default value of the MAXEIGEN option).
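The splitting rule can also be stated explicitly rather than left to the default. The following is a minimal sketch, assuming the phys8 data set created above; because MAXEIGEN=1 is the default threshold used in the preceding run, this call would be expected to reproduce the same analysis.

```sas
/* Sketch only: state the default stopping rule explicitly.        */
/* A cluster is split while its second eigenvalue exceeds 1.       */
proc varclus data=phys8 maxeigen=1;
run;
```

With a larger threshold (for example, MAXEIGEN=2, a value chosen here purely for illustration), the single cluster would presumably not be split, since its second eigenvalue is 1.7710.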

The two resulting cluster components explain 80.33% of the variation in the original variables. The cluster summary table shows that the variables height, arm_span, forearm, and low_leg have been assigned to the first cluster, and that the variables weight, bit_diam, girth, and width have been assigned to the second cluster.

The standardized scoring coefficients in Output 78.1.2 show that each cluster component has similar scores for each of its associated variables. This suggests that the principal cluster component solution should be similar to the centroid cluster component solution, which follows in the next PROC VARCLUS run.

Output 78.1.2: Standard Scoring Coefficients and Cluster Structure Table
  Oblique Principal Component Cluster Analysis

  Standardized Scoring Coefficients

  Cluster                                           1             2
  ------------------------------------------------------------------
  height        Height                       0.266977      0.000000
  arm_span      Arm Span                     0.270377      0.000000
  forearm       Length of Forearm            0.265194      0.000000
  low_leg       Length of Lower Leg          0.265057      0.000000
  weight        Weight                       0.000000      0.315597
  bit_diam      Bitrochanteric Diameter      0.000000      0.294591
  girth         Chest Girth                  0.000000      0.286407
  width         Chest Width                  0.000000      0.272710

  Cluster Structure

  Cluster                                           1             2
  ------------------------------------------------------------------
  height        Height                       0.936881      0.456908
  arm_span      Arm Span                     0.948813      0.407210
  forearm       Length of Forearm            0.930624      0.375865
  low_leg       Length of Lower Leg          0.930142      0.427715
  weight        Weight                       0.444281      0.920686
  bit_diam      Bitrochanteric Diameter      0.366201      0.859404
  girth         Chest Girth                  0.304779      0.835529
  width         Chest Width                  0.402430      0.795572
 

The cluster structure table displays high correlations between the variables and their own cluster component. The correlations between the variables and the opposite cluster component are all moderate.
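As an arithmetic check (not part of the original output), the squared entries of the cluster structure table reproduce the R-squared columns of Output 78.1.1, and the 1-R**2 ratio is consistent with dividing one minus the own-cluster R-squared by one minus the next-closest R-squared. For height, writing s_1 and s_2 (notation introduced here for illustration) for its correlations with the two cluster components:

```latex
\begin{align*}
R^2_{\text{own}}  &= s_1^2 = 0.936881^2 \approx 0.877746, \\
R^2_{\text{next}} &= s_2^2 = 0.456908^2 \approx 0.208765, \\
\frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{next}}}
  &\approx \frac{0.122254}{0.791235} \approx 0.1545.
\end{align*}
```

These agree with the reported values 0.8777, 0.2088, and 0.1545 for height.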

Output 78.1.3: Inter-Cluster Correlations
  Oblique Principal Component Cluster Analysis

  Inter-Cluster Correlations

  Cluster             1             2
  1             1.00000       0.44513
  2             0.44513       1.00000
 

The inter-cluster correlation table shows that the two cluster components are moderately correlated, with a correlation of 0.44513.
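The reported correlation can be checked by hand from Output 78.1.2. Because the first cluster component is a weighted sum of the standardized variables in cluster 1 and both components have unit variance, its correlation with the second component is the correspondingly weighted sum of those variables' correlations with the second component. In the notation below (introduced here for illustration), w_{1i} are the cluster 1 scoring coefficients and s_{i2} are the entries in the second column of the cluster structure table:

```latex
\begin{align*}
r_{12} &= \sum_{i \,\in\, \text{cluster } 1} w_{1i}\, s_{i2} \\
       &= (0.266977)(0.456908) + (0.270377)(0.407210) \\
       &\quad + (0.265194)(0.375865) + (0.265057)(0.427715) \\
       &\approx 0.44513.
\end{align*}
```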

In the following statements, the CENTROID option in the PROC VARCLUS statement specifies that cluster centroids be used as the basis for clustering.

  proc varclus data=phys8 centroid;   run;  

The first cluster component, which, in the centroid method, is an unweighted sum of the standardized variables, explains 57.89% of the variation in the data. This value is near the maximum possible variance explained, 58.41%, which is attained by the first principal component (Output 78.1.1).
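The 57.89% figure is consistent with a simple calculation on the input correlation matrix R. The 28 distinct off-diagonal correlations sum to 14.524, so all 64 entries of R sum to 37.048. Dividing by the number of variables, p = 8, gives the variation explained, and dividing once more by p gives the proportion. This is offered as an arithmetic check under that reading, not as the procedure's documented formula:

```latex
\begin{align*}
\mathbf{1}^{\top} R\, \mathbf{1} &= 8 + 2(14.524) = 37.048, \\
\frac{\mathbf{1}^{\top} R\, \mathbf{1}}{p} &= \frac{37.048}{8} = 4.631, \\
\frac{\mathbf{1}^{\top} R\, \mathbf{1}}{p^{2}} &= \frac{37.048}{64} \approx 0.5789.
\end{align*}
```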

Output 78.1.4: Centroid Cluster Components: Cluster Summary
  Oblique Centroid Component Cluster Analysis

  Cluster Summary for 1 Cluster

                          Cluster    Variation    Proportion
  Cluster    Members    Variation    Explained     Explained
  ----------------------------------------------------------
        1          8            8        4.631        0.5789

  Total variation explained = 4.631 Proportion = 0.5789

  Cluster Summary for 2 Clusters

                          Cluster    Variation    Proportion
  Cluster    Members    Variation    Explained     Explained
  ----------------------------------------------------------
        1          4            4        3.509        0.8773
        2          4            4         2.91        0.7275

  Total variation explained = 6.419 Proportion = 0.8024

  2 Clusters                 R-squared with
                           ------------------
                            Own        Next     1-R**2   Variable
  Cluster      Variable    Cluster    Closest   Ratio    Label
  -----------------------------------------------------------------------------
  Cluster 1    height      0.8778     0.2075    0.1543   Height
               arm_span    0.8994     0.1669    0.1208   Arm Span
               forearm     0.8663     0.1410    0.1557   Length of Forearm
               low_leg     0.8658     0.1824    0.1641   Length of Lower Leg
  -----------------------------------------------------------------------------
  Cluster 2    weight      0.8368     0.1975    0.2033   Weight
               bit_diam    0.7335     0.1341    0.3078   Bitrochanteric Diameter
               girth       0.6988     0.0929    0.3321   Chest Girth
               width       0.6473     0.1618    0.4207   Chest Width
 

The centroid clustering algorithm splits the variables into the same two clusters created in the principal component method. Recall that this outcome was suggested by the similar standardized scoring coefficients in the principal cluster component solution.

The default behavior in the centroid method is to split any cluster with less than 75% of the total cluster variance explained by the centroid component. In the next step, the second cluster, with a component that explains only 72.75% of the total variation of the cluster, is split.

In the R-squared table for two clusters, the width variable has a weaker relation to its cluster than any other variable; in the three-cluster solution, this variable is in a cluster of its own.

Each cluster component (Output 78.1.5) is an unweighted average of the cluster's standardized variables. Thus, the coefficients for each of the cluster's associated variables are identical in the centroid cluster component solution.
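The common coefficient in each column of Output 78.1.5 is consistent with scaling the sum of a cluster's standardized variables to unit variance, that is, dividing by the square root of the sum of all entries in that cluster's block of the correlation matrix. The block sums below are computed from the input correlations; the notation R_k for the block belonging to cluster k is introduced here for illustration:

```latex
\begin{align*}
\mathbf{1}^{\top} R_1 \mathbf{1}
  &= 4 + 2(0.846 + 0.805 + 0.859 + 0.881 + 0.826 + 0.801) = 14.036,
  & \frac{1}{\sqrt{14.036}} &\approx 0.266918, \\
\mathbf{1}^{\top} R_2 \mathbf{1}
  &= 4 + 2(0.762 + 0.730 + 0.629 + 0.583 + 0.577 + 0.539) = 11.640,
  & \frac{1}{\sqrt{11.640}} &\approx 0.293105.
\end{align*}
```

Dividing the same block sums by the cluster size gives 14.036/4 = 3.509 and 11.640/4 = 2.910, which match the variation explained by each cluster in Output 78.1.4.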

Output 78.1.5: Standardized Scoring Coefficients
  Oblique Centroid Component Cluster Analysis

  Standardized Scoring Coefficients

  Cluster                                           1             2
  ------------------------------------------------------------------
  height        Height                       0.266918      0.000000
  arm_span      Arm Span                     0.266918      0.000000
  forearm       Length of Forearm            0.266918      0.000000
  low_leg       Length of Lower Leg          0.266918      0.000000
  weight        Weight                       0.000000      0.293105
  bit_diam      Bitrochanteric Diameter      0.000000      0.293105
  girth         Chest Girth                  0.000000      0.293105
  width         Chest Width                  0.000000      0.293105
 

The centroid method stops at the three-cluster solution. As displayed in Output 78.1.6 and Output 78.1.7, the three centroid components account for 86.15% of the variability in the eight variables, and every cluster component accounts for at least 79.44% of the total variation in its own cluster. Additionally, the smallest squared correlation between a variable and its own cluster component is 0.7482.

Output 78.1.6: Cluster Summary for Three Clusters
  Oblique Centroid Component Cluster Analysis

  Cluster Summary for 3 Clusters

                          Cluster    Variation    Proportion
  Cluster    Members    Variation    Explained     Explained
  ----------------------------------------------------------
        1          4            4        3.509        0.8773
        2          3            3     2.383333        0.7944
        3          1            1            1        1.0000

  Total variation explained = 6.892333 Proportion = 0.8615

  3 Clusters                 R-squared with
                           ------------------
                            Own        Next     1-R**2   Variable
  Cluster      Variable    Cluster    Closest   Ratio    Label
  -----------------------------------------------------------------------------
  Cluster 1    height      0.8778     0.1921    0.1513   Height
               arm_span    0.8994     0.1722    0.1215   Arm Span
               forearm     0.8663     0.1225    0.1524   Length of Forearm
               low_leg     0.8658     0.1668    0.1611   Length of Lower Leg
  -----------------------------------------------------------------------------
  Cluster 2    weight      0.8685     0.3956    0.2175   Weight
               bit_diam    0.7691     0.3329    0.3461   Bitrochanteric Diameter
               girth       0.7482     0.2905    0.3548   Chest Girth
  -----------------------------------------------------------------------------
  Cluster 3    width       1.0000     0.4259    0.0000   Chest Width

Output 78.1.7: Cluster Quality Table
  Oblique Centroid Component Cluster Analysis

                  Total      Proportion         Minimum      Minimum        Maximum
    Number    Variation    of Variation      Proportion    R-squared   1-R**2 Ratio
        of    Explained       Explained       Explained        for a          for a
  Clusters  by Clusters     by Clusters    by a Cluster     Variable       Variable
  ---------------------------------------------------------------------------------
         1     4.631000          0.5789          0.5789       0.4306
         2     6.419000          0.8024          0.7275       0.6473         0.4207
         3     6.892333          0.8615          0.7944       0.7482         0.3548
 

Note that if the PROPORTION= option were set to a value between 0.5789 (the proportion of variance explained in the one-cluster solution) and 0.7275 (the minimum proportion of variance explained in the two-cluster solution), PROC VARCLUS would stop at the two-cluster solution, and the centroid solution would find the same clusters as the principal component solution.
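A minimal sketch of such a run follows, assuming the phys8 data set created above. The value 0.70 is an illustrative choice within the stated interval, not a value taken from the original example.

```sas
/* Sketch only: 0.70 lies between 0.5789 and 0.7275, so the        */
/* centroid analysis would be expected to stop at two clusters.    */
proc varclus data=phys8 centroid proportion=0.70;
run;
```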

In the following statements, the MAXC= option computes all clustering solutions, from one to eight clusters. The SUMMARY option suppresses all output except the final cluster quality table, and the OUTTREE= option saves the results of the analysis to an output data set and forces the clusters to be hierarchical. The TREE procedure is invoked to produce a graphical display of the clusters.

  proc varclus data=phys8 maxc=8 summary outtree=tree;
  run;

  goptions ftext=swiss;
  axis2 label=(justify=left);
  axis1 order=(0.5 to 1.0 by 0.1);
  proc tree horizontal vaxis=axis2 haxis=axis1 lines=(width=2);
     height _propor_;
     id _label_;
  run;

Output 78.1.8: Hierarchical Clusters and the SUMMARY Option
  Oblique Principal Component Cluster Analysis

                  Total    Proportion       Minimum       Maximum    Minimum       Maximum
    Number    Variation  of Variation    Proportion        Second  R-squared  1-R**2 Ratio
        of    Explained     Explained     Explained    Eigenvalue      for a         for a
  Clusters  by Clusters   by Clusters  by a Cluster  in a Cluster   Variable      Variable
  ----------------------------------------------------------------------------------------
         1     4.672880        0.5841        0.5841      1.770983     0.3810
         2     6.426502        0.8033        0.7293      0.476418     0.6329        0.4380
         3     6.895347        0.8619        0.7954      0.418369     0.7421        0.3634
         4     7.271218        0.9089        0.8773      0.238000     0.8652        0.2548
         5     7.509218        0.9387        0.8773      0.236135     0.8652        0.1665
         6     7.740000        0.9675        0.9295      0.141000     0.9295        0.2560
         7     7.881000        0.9851        0.9405      0.119000     0.9405        0.2093
         8     8.000000        1.0000        1.0000      0.000000     1.0000        0.0000
 

The principal component method first separates the variables into the same two clusters that were created in the first PROC VARCLUS run. Note that, in creating the third cluster, the principal component method singles out the variable width. This is the same variable that is put into its own cluster in the preceding centroid method example.

The tree diagram in Output 78.1.9 displays the cluster hierarchy. It is clear from the diagram that there are two, or possibly three, clusters present. However, the MAXC=8 option forces PROC VARCLUS to split the clusters until each variable is in its own cluster.
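The hierarchy can also be inspected numerically. As a minimal sketch, assuming the OUTTREE= data set named tree from the run above, the following step lists its contents; the variables _propor_ and _label_ used in the PROC TREE step appear among its columns.

```sas
/* Sketch only: list the OUTTREE= data set that PROC TREE reads. */
proc print data=tree;
run;
```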

Output 78.1.9: TREE Diagram from PROC TREE
 



SAS/STAT 9.1 User's Guide, Volumes 1-7
ISBN: 1590472438
Year: 2004