Getting Started


This example demonstrates how you can use the VARCLUS procedure to create hierarchical, unidimensional clusters of variables.

The following data, from Hand et al. (1994), represent amounts of protein consumed from nine food groups for each of 25 European countries. The nine food groups are red meat (RedMeat), white meat (WhiteMeat), eggs (Eggs), milk (Milk), fish (Fish), cereal (Cereal), starch (Starch), nuts (Nuts), and fruits and vegetables (FruitVeg).

Suppose you want to simplify interpretation of the data by reducing the number of variables to a smaller set of variable cluster components. You can use the VARCLUS procedure for this type of variable reduction.

The following DATA step creates the SAS data set Protein:

   data Protein;
      /* Country is read with formatted input from the first 18 columns */
      /* of each data line, so the data lines must keep their alignment */
      input Country $18. RedMeat WhiteMeat Eggs Milk
            Fish Cereal Starch Nuts FruitVeg;
      datalines;
   Albania        10.1  1.4  0.5   8.9  0.2  42.3  0.6  5.5  1.7
   Austria         8.9 14.0  4.3  19.9  2.1  28.0  3.6  1.3  4.3
   Belgium        13.5  9.3  4.1  17.5  4.5  26.6  5.7  2.1  4.0
   Bulgaria        7.8  6.0  1.6   8.3  1.2  56.7  1.1  3.7  4.2
   Czechoslovakia  9.7 11.4  2.8  12.5  2.0  34.3  5.0  1.1  4.0
   Denmark        10.6 10.8  3.7  25.0  9.9  21.9  4.8  0.7  2.4
   E Germany       8.4 11.6  3.7  11.1  5.4  24.6  6.5  0.8  3.6
   Finland         9.5  4.9  2.7  33.7  5.8  26.3  5.1  1.0  1.4
   France         18.0  9.9  3.3  19.5  5.7  28.1  4.8  2.4  6.5
   Greece         10.2  3.0  2.8  17.6  5.9  41.7  2.2  7.8  6.5
   Hungary         5.3 12.4  2.9   9.7  0.3  40.1  4.0  5.4  4.2
   Ireland        13.9 10.0  4.7  25.8  2.2  24.0  6.2  1.6  2.9
   Italy           9.0  5.1  2.9  13.7  3.4  36.8  2.1  4.3  6.7
   Netherlands     9.5 13.6  3.6  23.4  2.5  22.4  4.2  1.8  3.7
   Norway          9.4  4.7  2.7  23.3  9.7  23.0  4.6  1.6  2.7
   Poland          6.9 10.2  2.7  19.3  3.0  36.1  5.9  2.0  6.6
   Portugal        6.2  3.7  1.1   4.9 14.2  27.0  5.9  4.7  7.9
   Romania         6.2  6.3  1.5  11.1  1.0  49.6  3.1  5.3  2.8
   Spain           7.1  3.4  3.1   8.6  7.0  29.2  5.7  5.9  7.2
   Sweden          9.9  7.8  3.5   4.7  7.5  19.5  3.7  1.4  2.0
   Switzerland    13.1 10.1  3.1  23.8  2.3  25.6  2.8  2.4  4.9
   UK             17.4  5.7  4.7  20.6  4.3  24.3  4.7  3.4  3.3
   USSR            9.3  4.6  2.1  16.6  3.0  43.6  6.4  3.4  2.9
   W Germany      11.4 12.5  4.1  18.8  3.4  18.6  5.2  1.5  3.8
   Yugoslavia      4.4  5.0  1.2   9.5  0.6  55.9  3.0  5.7  3.2
   ;

The data set Protein contains the character variable Country and the nine numeric variables representing the food groups. The $18. in the INPUT statement specifies that the variable Country is a character variable with a length of 18.
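Because Country is read with a fixed width, it can be worth checking that names containing blanks were read whole. The following statements are a minimal sketch of such a check, assuming that the DATA step above has already run. The choice of seven observations is arbitrary and serves only to include the first name that contains a blank.

```sas
/* Sketch: print the first seven observations to check that Country   */
/* holds complete names, including one with an embedded blank.        */
proc print data=Protein(obs=7);
run;
```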

The following statements create the variable clusters.

   proc varclus data=Protein outtree=tree centroid maxclusters=4;
      var RedMeat--FruitVeg;
   run;

The DATA= option specifies the SAS data set Protein as input. The OUTTREE= option creates the output SAS data set Tree to contain the tree structure information. When you specify this option, you are implicitly requiring the clusters to be hierarchical rather than disjoint.

The CENTROID option specifies the centroid method of clustering. This means that the calculated cluster components are the unweighted averages of the standardized variables. The MAXCLUSTERS=4 option specifies that no more than four clusters be computed.
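To make the phrase "unweighted averages of the standardized variables" concrete, the following statements sketch how a component of this kind could be computed by hand. This is an illustration, not the procedure's own scoring method. It assumes that the Protein data set exists and that one cluster consists of RedMeat, WhiteMeat, Eggs, and Milk, as the final four-cluster solution reports. The names ProteinStd, Components, and Clus1 are hypothetical, and the result may differ from scores produced by PROC VARCLUS by a scaling constant.

```sas
/* Sketch only: ProteinStd, Components, and Clus1 are hypothetical names. */

/* Standardize each food-group variable to mean 0 and standard deviation 1 */
proc standard data=Protein mean=0 std=1 out=ProteinStd;
   var RedMeat--FruitVeg;
run;

data Components;
   set ProteinStd;
   /* Unweighted average of the standardized variables in the assumed cluster */
   Clus1 = mean(RedMeat, WhiteMeat, Eggs, Milk);
   keep Country Clus1;
run;
```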

The VAR statement lists the numeric variables (RedMeat--FruitVeg) to be used in the analysis.
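The double-dash list selects variables by their position in the data set, so it depends on the order in which the variables were created. A single hyphen, by contrast, denotes a numbered range list such as X1-X5. The following statements are a sketch, assuming the Protein data set from the preceding DATA step, that shows one way to inspect the variable order and then writes the VAR statement as an explicit list.

```sas
/* Sketch: list the variables of Protein in creation order */
proc contents data=Protein varnum;
run;

/* The same analysis with the variables named explicitly,          */
/* assuming the creation order given by the INPUT statement        */
proc varclus data=Protein outtree=tree centroid maxclusters=4;
   var RedMeat WhiteMeat Eggs Milk Fish Cereal Starch Nuts FruitVeg;
run;
```

The explicit list is longer but does not change meaning if variables are later added to or reordered in the data set.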

The results of this analysis are displayed in the following figures.

Although PROC VARCLUS displays output for each step in the clustering process, the following figures display only the final analysis for four clusters. Figure 78.1 displays the final cluster summary.

start figure
  Oblique Centroid Component Cluster Analysis

  Cluster Summary for 4 Clusters

                          Cluster    Variation    Proportion
  Cluster    Members    Variation    Explained     Explained
  ----------------------------------------------------------
        1          4            4     2.173024        0.5433
        2          2            2     1.650997        0.8255
        3          2            2     1.403853        0.7019
        4          1            1            1        1.0000

  Total variation explained = 6.227874 Proportion = 0.6920
end figure

Figure 78.1: Final Cluster Summary from the VARCLUS Procedure

For each cluster, Figure 78.1 displays the number of variables in the cluster, the cluster variation, the total explained variation, and the proportion of the total variance explained by the variables in the cluster. The variance explained by the variables in a cluster is similar to the variance explained by a factor in common factor analysis, but it includes contributions only from the variables in the cluster rather than from all variables.

The line labeled 'Total variation explained' in Figure 78.1 gives the sum of the explained variation over all clusters. The final 'Proportion' represents the total explained variation divided by the sum of cluster variation. This value, 0.6920, indicates that about 69% of the total variation in the data can be accounted for by the four cluster components.
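As a quick arithmetic check of the final proportion, the following DATA step is a minimal sketch that recomputes it from the values printed in Figure 78.1. The variable names are illustrative only, and the step assumes that the sum of cluster variation is the sum of the Cluster Variation column (4 + 2 + 2 + 1 = 9).

```sas
/* Sketch: recompute the final proportion from the values in Figure 78.1 */
data _null_;
   TotalExplained = 6.227874;        /* total variation explained          */
   TotalVariation = 4 + 2 + 2 + 1;   /* cluster variation summed over the  */
                                     /* four clusters                      */
   Proportion = TotalExplained / TotalVariation;
   put Proportion= 6.4;              /* writes the result to the SAS log   */
run;
```

The value written to the log should be about 0.692, in agreement with the figure.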

Figure 78.2 shows how the variables are clustered. The first cluster represents animal protein (RedMeat, WhiteMeat, Eggs, and Milk), the second cluster contains the variables Cereal and Nuts, the third cluster is composed of the variables Fish and Starch, and the last cluster contains the single variable representing fruits and vegetables (FruitVeg).

start figure
  Oblique Centroid Component Cluster Analysis

  4 Clusters                   R-squared with
                             ------------------
                                 Own       Next    1-R**2
  Cluster       Variable     Cluster    Closest     Ratio
  -------------------------------------------------------
  Cluster 1     RedMeat       0.4375     0.1518    0.6631
                WhiteMeat     0.6302     0.3331    0.5545
                Eggs          0.7024     0.4902    0.5837
                Milk          0.4288     0.2721    0.7847
  -------------------------------------------------------
  Cluster 2     Cereal        0.8255     0.3983    0.2900
                Nuts          0.8255     0.5901    0.4257
  -------------------------------------------------------
  Cluster 3     Fish          0.7019     0.1365    0.3452
                Starch        0.7019     0.3075    0.4304
  -------------------------------------------------------
  Cluster 4     FruitVeg      1.0000     0.0578    0.0000
end figure

Figure 78.2: R-square Values from the VARCLUS Procedure

Figure 78.2 also displays the R² value of each variable with its own cluster and the R² value with its nearest cluster. The R² value for a variable with the nearest cluster should be low if the clusters are well separated. The last column displays, for each variable, the ratio of 1 − R² with its own cluster to 1 − R² with its nearest cluster. Small values of this ratio indicate good clustering.
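The ratio in the last column can be verified by hand. The following DATA step is a minimal sketch that recomputes it for the variable Cereal from the two R² values printed in Figure 78.2. The variable names OwnRsq, NextRsq, and Ratio are illustrative only.

```sas
/* Sketch: recompute the 1-R**2 ratio for Cereal from Figure 78.2 */
data _null_;
   OwnRsq  = 0.8255;   /* R-squared with own cluster     */
   NextRsq = 0.3983;   /* R-squared with nearest cluster */
   Ratio = (1 - OwnRsq) / (1 - NextRsq);
   put Ratio= 6.4;     /* writes the result to the SAS log */
run;
```

The result should be about 0.29, matching the printed ratio. Because the figure shows rounded R² values, ratios recomputed this way for other variables can differ from the printed ones in the last decimal place.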

Figure 78.3 displays the cluster structure and the intercluster correlations. The structure table displays the correlation of each variable with each cluster component. The table of intercorrelations contains the correlations between the cluster components.

start figure
  Oblique Centroid Component Cluster Analysis

  Cluster Structure

  Cluster               1             2             3             4
  ------------------------------------------------------------------
  RedMeat         0.66145      -0.38959       0.06450      -0.34109
  WhiteMeat       0.79385      -0.57715       0.04760      -0.06132
  Eggs            0.83811      -0.70012       0.30902      -0.04552
  Milk            0.65483      -0.52163       0.16805      -0.26096
  Fish           -0.08108      -0.36947       0.83781       0.26614
  Cereal         -0.58070       0.90857      -0.63111       0.04655
  Starch          0.41593      -0.55448       0.83781       0.08441
  Nuts           -0.76817       0.90857      -0.37089       0.37497
  FruitVeg       -0.24045       0.23197       0.20920       1.00000

  Inter-Cluster Correlations

  Cluster               1             2             3             4
  1               1.00000      -0.74230       0.19984      -0.24045
  2              -0.74230       1.00000      -0.55141       0.23197
  3               0.19984      -0.55141       1.00000       0.20920
  4              -0.24045       0.23197       0.20920       1.00000
end figure

Figure 78.3: Cluster Correlations and Intercorrelations

PROC VARCLUS next displays the summary table of statistics for the cluster history (Figure 78.4). The first three columns give the number of clusters, the total variation explained by clusters, and the proportion of variation explained by clusters.

start figure
  Oblique Centroid Component Cluster Analysis

                  Total     Proportion        Minimum     Minimum        Maximum
    Number    Variation   of Variation     Proportion   R-squared   1-R**2 Ratio
        of    Explained      Explained      Explained       for a          for a
  Clusters  by Clusters    by Clusters   by a Cluster    Variable       Variable
  --------------------------------------------------------------------------------
         1     0.732343         0.0814         0.0814      0.0875
         2     3.960717         0.4401         0.3743      0.1007         1.0213
         3     5.291887         0.5880         0.5433      0.3928         0.7978
         4     6.227874         0.6920         0.5433      0.4288         0.7847
end figure

Figure 78.4: Final Cluster Summary Table from the VARCLUS Procedure

As displayed in Figure 78.4, when the number of allowable clusters is two, the total variation explained is 3.9607, and the cumulative proportion of variation explained by two clusters is 0.4401. When the number of clusters increases to three, the proportion of explained variance increases to 0.5880. When four clusters are computed, the proportion of explained variation is 0.6920.

Figure 78.4 also displays the minimum proportion of variance explained by a cluster, the minimum R² for a variable, and the maximum (1 − R²) ratio for a variable. The last quantity is the ratio of the value 1 − R² for a variable's own cluster to the value 1 − R² for its nearest cluster.

The following statements produce a tree diagram of the cluster structure created by PROC VARCLUS. The AXIS1 statement suppresses the label for the vertical axis, which would otherwise be 'Name of Variable or Cluster'.

   axis1 label=none;
   proc tree data=tree horizontal vaxis=axis1;
      height _propor_;
   run;

Next, the TREE procedure is invoked using the SAS data set Tree, created by the OUTTREE= option in the preceding PROC VARCLUS statement. The HORIZONTAL option orients the tree diagram horizontally. The VAXIS= option associates the vertical axis with the AXIS1 statement. The HEIGHT statement specifies the use of the variable _PROPOR_ (the proportion of variance explained) as the height variable.
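Before drawing the diagram, it can help to look at the data that PROC TREE reads. The following statements are a minimal sketch, assuming that the preceding PROC VARCLUS step has created the data set Tree. They list the whole data set, including the _PROPOR_ values that are used as heights.

```sas
/* Sketch: list the OUTTREE= data set that PROC TREE takes as input */
proc print data=tree;
run;
```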

Figure 78.5 shows how the clusters are created. The ordered variable names are displayed on the vertical axis. The horizontal axis displays the proportion of variance explained at each clustering level.

Figure 78.5: Horizontal Tree Diagram from PROC TREE

As you look from left to right in the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, and every vertical line connects leaves and branches into progressively larger clusters.

For example, when the variables are formed into three clusters, one cluster contains the variables RedMeat, WhiteMeat, Eggs, and Milk; the second cluster contains the variables Fish and Starch; the third cluster contains the variables Cereal, Nuts, and FruitVeg. The proportion of variance explained at that level is 0.5880 (from Figure 78.4). At the next stage of clustering, the third cluster is split as the variable FruitVeg forms the fourth cluster; the proportion of variance explained is 0.6920.




SAS.STAT 9.1 Users Guide (Vol. 7)
SAS/STAT 9.1 Users Guide, Volumes 1-7
ISBN: 1590472438
Year: 2004
Pages: 132
