This example demonstrates how you can use the VARCLUS procedure to create hierarchical, unidimensional clusters of variables.
The following data, from Hand et al. (1994), represent the amounts of protein consumed from nine food groups in each of 25 European countries. The nine food groups are red meat (RedMeat), white meat (WhiteMeat), eggs (Eggs), milk (Milk), fish (Fish), cereal (Cereal), starch (Starch), nuts (Nuts), and fruits and vegetables (FruitVeg).
Suppose you want to simplify interpretation of the data by reducing the number of variables to a smaller set of variable cluster components. You can use the VARCLUS procedure for this type of variable reduction.
The following DATA step creates the SAS data set Protein:
data Protein;
   input Country $18. RedMeat WhiteMeat Eggs Milk
                      Fish Cereal Starch Nuts FruitVeg;
   datalines;
Albania             10.1   1.4   0.5   8.9   0.2  42.3   0.6   5.5   1.7
Austria              8.9  14.0   4.3  19.9   2.1  28.0   3.6   1.3   4.3
Belgium             13.5   9.3   4.1  17.5   4.5  26.6   5.7   2.1   4.0
Bulgaria             7.8   6.0   1.6   8.3   1.2  56.7   1.1   3.7   4.2
Czechoslovakia       9.7  11.4   2.8  12.5   2.0  34.3   5.0   1.1   4.0
Denmark             10.6  10.8   3.7  25.0   9.9  21.9   4.8   0.7   2.4
E Germany            8.4  11.6   3.7  11.1   5.4  24.6   6.5   0.8   3.6
Finland              9.5   4.9   2.7  33.7   5.8  26.3   5.1   1.0   1.4
France              18.0   9.9   3.3  19.5   5.7  28.1   4.8   2.4   6.5
Greece              10.2   3.0   2.8  17.6   5.9  41.7   2.2   7.8   6.5
Hungary              5.3  12.4   2.9   9.7   0.3  40.1   4.0   5.4   4.2
Ireland             13.9  10.0   4.7  25.8   2.2  24.0   6.2   1.6   2.9
Italy                9.0   5.1   2.9  13.7   3.4  36.8   2.1   4.3   6.7
Netherlands          9.5  13.6   3.6  23.4   2.5  22.4   4.2   1.8   3.7
Norway               9.4   4.7   2.7  23.3   9.7  23.0   4.6   1.6   2.7
Poland               6.9  10.2   2.7  19.3   3.0  36.1   5.9   2.0   6.6
Portugal             6.2   3.7   1.1   4.9  14.2  27.0   5.9   4.7   7.9
Romania              6.2   6.3   1.5  11.1   1.0  49.6   3.1   5.3   2.8
Spain                7.1   3.4   3.1   8.6   7.0  29.2   5.7   5.9   7.2
Sweden               9.9   7.8   3.5   4.7   7.5  19.5   3.7   1.4   2.0
Switzerland         13.1  10.1   3.1  23.8   2.3  25.6   2.8   2.4   4.9
UK                  17.4   5.7   4.7  20.6   4.3  24.3   4.7   3.4   3.3
USSR                 9.3   4.6   2.1  16.6   3.0  43.6   6.4   3.4   2.9
W Germany           11.4  12.5   4.1  18.8   3.4  18.6   5.2   1.5   3.8
Yugoslavia           4.4   5.0   1.2   9.5   0.6  55.9   3.0   5.7   3.2
;
The data set Protein contains the character variable Country and the nine numeric variables representing the food groups. The $18. in the INPUT statement specifies that the variable Country is a character variable with a length of 18.
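Before clustering, it can help to confirm that the DATA step read the data as intended. The following is a minimal sketch, not part of the original example, that uses PROC CONTENTS and PROC PRINT to check the variable attributes and the first few observations. The OBS=10 limit is an arbitrary choice that happens to include E Germany, a name with an embedded blank.

```sas
/* List variable names, types, and lengths;                */
/* Country should be a character variable with length 18.  */
proc contents data=Protein;
run;

/* Display the first ten observations to verify that names */
/* containing a blank (for example, E Germany) were read   */
/* into Country as a single value.                         */
proc print data=Protein(obs=10);
run;
```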
The following statements create the variable clusters.
proc varclus data=Protein outtree=tree centroid maxclusters=4;
   var RedMeat--FruitVeg;
run;
The DATA= option specifies the SAS data set Protein as input. The OUTTREE= option creates the output SAS data set Tree to contain the tree structure information. When you specify this option, you are implicitly requiring the clusters to be hierarchical rather than disjoint.
The CENTROID option specifies the centroid method of clustering. This means that the calculated cluster components are the unweighted averages of the standardized variables. The MAXCLUSTERS=4 option specifies that no more than four clusters be computed.
The VAR statement lists the numeric variables (RedMeat--FruitVeg) to be used in the analysis.
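The double-dash list RedMeat--FruitVeg selects variables by their position in the data set, so it depends on the order in which the INPUT statement created them. As a sketch, assuming the Protein data set was created by the DATA step shown earlier, the same analysis can be requested with an explicit variable list:

```sas
/* Same analysis as above, but each analysis variable is   */
/* named explicitly instead of relying on variable order.  */
proc varclus data=Protein outtree=tree centroid maxclusters=4;
   var RedMeat WhiteMeat Eggs Milk Fish Cereal Starch Nuts FruitVeg;
run;
```

The explicit list is longer, but it keeps working if variables are later added to or reordered in the data set.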
The results of this analysis are displayed in the following figures.
Although PROC VARCLUS displays output for each step in the clustering process, the following figures display only the final analysis for four clusters. Figure 78.1 displays the final cluster summary.
Oblique Centroid Component Cluster Analysis

Cluster Summary for 4 Clusters

                        Cluster    Variation    Proportion
Cluster    Members    Variation    Explained     Explained
----------------------------------------------------------
      1          4            4     2.173024        0.5433
      2          2            2     1.650997        0.8255
      3          2            2     1.403853        0.7019
      4          1            1            1        1.0000

Total variation explained = 6.227874 Proportion = 0.6920
For each cluster, Figure 78.1 displays the number of variables in the cluster, the cluster variation, the variation explained by the cluster component, and the proportion of the cluster variation that is explained. The variance explained by the variables in a cluster is similar to the variance explained by a factor in common factor analysis, but it includes contributions only from the variables in the cluster rather than from all variables.
The line labeled 'Total variation explained' in Figure 78.1 gives the sum of the explained variation over all clusters. The final 'Proportion' represents the total explained variation divided by the sum of cluster variation. This value, 0.6920, indicates that about 69% of the total variation in the data can be accounted for by the four cluster components.
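The proportions in the cluster summary can be checked by hand. As a worked example that uses only the values shown in Figure 78.1, the proportion for the first cluster and the overall proportion are:

```latex
\begin{aligned}
\text{Cluster 1:}\quad & \frac{2.173024}{4} \approx 0.5433 \\[4pt]
\text{Overall:}\quad & \frac{2.173024 + 1.650997 + 1.403853 + 1}{4 + 2 + 2 + 1}
  = \frac{6.227874}{9} \approx 0.6920
\end{aligned}
```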
Figure 78.2 shows how the variables are clustered. The first cluster represents animal protein (RedMeat, WhiteMeat, Eggs, and Milk), the second cluster contains the variables Cereal and Nuts, the third cluster is composed of the variables Fish and Starch, and the last cluster contains the single variable representing fruits and vegetables (FruitVeg).
Oblique Centroid Component Cluster Analysis

4 Clusters                    R-squared with
                           -------------------
                               Own        Next     1-R**2
Cluster      Variable      Cluster     Closest      Ratio
----------------------------------------------------------
Cluster 1    RedMeat        0.4375      0.1518     0.6631
             WhiteMeat      0.6302      0.3331     0.5545
             Eggs           0.7024      0.4902     0.5837
             Milk           0.4288      0.2721     0.7847
----------------------------------------------------------
Cluster 2    Cereal         0.8255      0.3983     0.2900
             Nuts           0.8255      0.5901     0.4257
----------------------------------------------------------
Cluster 3    Fish           0.7019      0.1365     0.3452
             Starch         0.7019      0.3075     0.4304
----------------------------------------------------------
Cluster 4    FruitVeg       1.0000      0.0578     0.0000
Figure 78.2 also displays the $R^2$ value of each variable with its own cluster and the $R^2$ value with its nearest cluster. The $R^2$ value for a variable with the nearest cluster should be low if the clusters are well separated. The last column displays the ratio $(1 - R^2_{\text{own}}) / (1 - R^2_{\text{nearest}})$ for each variable. Small values of this ratio indicate good clustering.
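As a worked example of the ratio in the last column, take the row for Cereal in Figure 78.2, where the $R^2$ with its own cluster is 0.8255 and the $R^2$ with the next closest cluster is 0.3983. The subscripts "own" and "nearest" are labels chosen here for illustration.

```latex
\frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{nearest}}}
  = \frac{1 - 0.8255}{1 - 0.3983}
  = \frac{0.1745}{0.6017}
  \approx 0.2900
```

Because the displayed $R^2$ values are rounded to four decimal places, recomputing the ratio for other rows can differ from the table in the last digit.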
Figure 78.3 displays the cluster structure and the intercluster correlations. The structure table displays the correlation of each variable with each cluster component. The table of intercorrelations contains the correlations between the cluster components.
Oblique Centroid Component Cluster Analysis

Cluster Structure

Cluster              1           2           3           4
----------------------------------------------------------
RedMeat        0.66145     0.38959     0.06450     0.34109
WhiteMeat      0.79385     0.57715     0.04760     0.06132
Eggs           0.83811     0.70012     0.30902     0.04552
Milk           0.65483     0.52163     0.16805     0.26096
Fish           0.08108     0.36947     0.83781     0.26614
Cereal         0.58070     0.90857     0.63111     0.04655
Starch         0.41593     0.55448     0.83781     0.08441
Nuts           0.76817     0.90857     0.37089     0.37497
FruitVeg       0.24045     0.23197     0.20920     1.00000

Inter-Cluster Correlations

Cluster              1           2           3           4
1              1.00000     0.74230     0.19984     0.24045
2              0.74230     1.00000     0.55141     0.23197
3              0.19984     0.55141     1.00000     0.20920
4              0.24045     0.23197     0.20920     1.00000
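Because the CENTROID option defines each cluster component as the unweighted average of the standardized variables in the cluster, the component scores can be approximated outside PROC VARCLUS. The following is a rough sketch rather than the procedure's own scoring method. The data set names ProteinStd and Components and the variable names Cluster1 through Cluster4 are made up for illustration, and the cluster memberships are taken from Figure 78.2.

```sas
/* Standardize the nine food-group variables to mean 0 and */
/* standard deviation 1.                                   */
proc standard data=Protein mean=0 std=1 out=ProteinStd;
   var RedMeat--FruitVeg;
run;

/* Average the standardized variables within each cluster  */
/* (memberships as reported in Figure 78.2).               */
data Components;
   set ProteinStd;
   Cluster1 = mean(RedMeat, WhiteMeat, Eggs, Milk);
   Cluster2 = mean(Cereal, Nuts);
   Cluster3 = mean(Fish, Starch);
   Cluster4 = FruitVeg;
run;

/* Correlate every variable and every component (rows)     */
/* with the four hand-built components (columns).          */
proc corr data=Components;
   var Cluster1-Cluster4;
   with RedMeat--FruitVeg Cluster1-Cluster4;
run;
```

The resulting correlation table can then be compared with the cluster structure and intercluster correlations reported by PROC VARCLUS.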
PROC VARCLUS next displays the summary table of statistics for the cluster history (Figure 78.4). The first three columns give the number of clusters, the total variation explained by clusters, and the proportion of variation explained by clusters.
Oblique Centroid Component Cluster Analysis

                Total     Proportion       Minimum     Minimum        Maximum
Number      Variation   of Variation    Proportion   R-squared   1-R**2 Ratio
of          Explained      Explained     Explained       for a          for a
Clusters  by Clusters    by Clusters  by a Cluster    Variable       Variable
------------------------------------------------------------------------------
       1     0.732343         0.0814        0.0814      0.0875
       2     3.960717         0.4401        0.3743      0.1007         1.0213
       3     5.291887         0.5880        0.5433      0.3928         0.7978
       4     6.227874         0.6920        0.5433      0.4288         0.7847
As displayed in Figure 78.4, when the number of allowable clusters is two, the total variation explained is 3.9607, and the cumulative proportion of variation explained by two clusters is 0.4401. When the number of clusters increases to three, the proportion of explained variation increases to 0.5880. When four clusters are computed, the proportion of explained variation is 0.6920.
Figure 78.4 also displays the minimum proportion of variance explained by a cluster, the minimum $R^2$ for a variable, and the maximum $1 - R^2$ ratio for a variable. The last quantity is the ratio of the value $1 - R^2$ for a variable's own cluster to the value $1 - R^2$ for its nearest cluster.
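If only the cluster history table is of interest, the displayed output can be reduced. The following is a minimal sketch, assuming the SUMMARY option of the PROC VARCLUS statement behaves as documented (suppressing all default displayed output except the final summary table). The OUTTREE= option is kept so that the clusters remain hierarchical, as in the analysis above.

```sas
/* Rerun the analysis, displaying only the final summary   */
/* table of the cluster history.                           */
proc varclus data=Protein outtree=tree centroid maxclusters=4 summary;
   var RedMeat--FruitVeg;
run;
```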
The following statements produce a tree diagram of the cluster structure created by PROC VARCLUS. The AXIS1 statement suppresses the label for the vertical axis, which would otherwise be 'Name of Variable or Cluster'.
axis1 label=none;
proc tree data=tree horizontal vaxis=axis1;
   height _propor_;
run;
Next, the TREE procedure is invoked using the SAS data set Tree, created by the OUTTREE= option in the preceding PROC VARCLUS statement. The HORIZONTAL option orients the tree diagram horizontally. The VAXIS= option associates the vertical axis with the AXIS1 statement. The HEIGHT statement specifies the use of the variable _PROPOR_ (the proportion of variance explained) as the height variable.
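To see the values that PROC TREE uses as heights, the tree data set itself can be listed. This is a small sketch that is not part of the original example; it prints every variable in the data set, including _PROPOR_.

```sas
/* List the tree data set created by OUTTREE=, including   */
/* the _PROPOR_ values used by the HEIGHT statement.       */
proc print data=tree;
run;
```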
Figure 78.5 shows how the clusters are created. The ordered variable names are displayed on the vertical axis. The horizontal axis displays the proportion of variance explained at each clustering level.
As you look from left to right in the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, and every vertical line connects leaves and branches into progressively larger clusters.
For example, when the variables are formed into three clusters, one cluster contains the variables RedMeat, WhiteMeat, Eggs, and Milk; the second cluster contains the variables Fish and Starch; the third cluster contains the variables Cereal, Nuts, and FruitVeg. The proportion of variance explained at that level is 0.5880 (from Figure 78.4). At the next stage of clustering, the third cluster is split as the variable FruitVeg forms the fourth cluster; the proportion of variance explained is 0.6920.