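The following example shows how to use the DISTANCE procedure to compute a distance matrix that is then used as input to a clustering procedure.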
The following data, originated by A. Weber and cited in Hand et al. (1994, p. 297), measure the amount of protein consumed for nine food groups in 25 European countries.
The following DATA step creates the SAS data set Protein:
   title 'Protein Consumption in Europe';

   data Protein;
      length Country $ 14;
      /* The & modifier lets Country contain single embedded blanks  */
      /* (for example, E Germany); two or more blanks end the value. */
      input Country $ & RedMeat WhiteMeat Eggs Milk
            Fish Cereal Starch Nuts FruitVeg;
      datalines;
   Albania         10.1   1.4  0.5   8.9   0.2  42.3  0.6  5.5  1.7
   Austria          8.9  14.0  4.3  19.9   2.1  28.0  3.6  1.3  4.3
   Belgium         13.5   9.3  4.1  17.5   4.5  26.6  5.7  2.1  4.0
   Bulgaria         7.8   6.0  1.6   8.3   1.2  56.7  1.1  3.7  4.2
   Czechoslovakia   9.7  11.4  2.8  12.5   2.0  34.3  5.0  1.1  4.0
   Denmark         10.6  10.8  3.7  25.0   9.9  21.9  4.8  0.7  2.4
   E Germany        8.4  11.6  3.7  11.1   5.4  24.6  6.5  0.8  3.6
   Finland          9.5   4.9  2.7  33.7   5.8  26.3  5.1  1.0  1.4
   France          18.0   9.9  3.3  19.5   5.7  28.1  4.8  2.4  6.5
   Greece          10.2   3.0  2.8  17.6   5.9  41.7  2.2  7.8  6.5
   Hungary          5.3  12.4  2.9   9.7   0.3  40.1  4.0  5.4  4.2
   Ireland         13.9  10.0  4.7  25.8   2.2  24.0  6.2  1.6  2.9
   Italy            9.0   5.1  2.9  13.7   3.4  36.8  2.1  4.3  6.7
   Netherlands      9.5  13.6  3.6  23.4   2.5  22.4  4.2  1.8  3.7
   Norway           9.4   4.7  2.7  23.3   9.7  23.0  4.6  1.6  2.7
   Poland           6.9  10.2  2.7  19.3   3.0  36.1  5.9  2.0  6.6
   Portugal         6.2   3.7  1.1   4.9  14.2  27.0  5.9  4.7  7.9
   Romania          6.2   6.3  1.5  11.1   1.0  49.6  3.1  5.3  2.8
   Spain            7.1   3.4  3.1   8.6   7.0  29.2  5.7  5.9  7.2
   Sweden           9.9   7.8  3.5   4.7   7.5  19.5  3.7  1.4  2.0
   Switzerland     13.1  10.1  3.1  23.8   2.3  25.6  2.8  2.4  4.9
   UK              17.4   5.7  4.7  20.6   4.3  24.3  4.7  3.4  3.3
   USSR             9.3   4.6  2.1  16.6   3.0  43.6  6.4  3.4  2.9
   W Germany       11.4  12.5  4.1  18.8   3.4  18.6  5.2  1.5  3.8
   Yugoslavia       4.4   5.0  1.2   9.5   0.6  55.9  3.0  5.7  3.2
   ;
The data set Protein contains the character variable Country and the nine numeric variables that represent the food groups.
The following statements create the distance matrix and display part of it.
   proc distance data=Protein out=Dist method=Euclid;
      var interval(RedMeat--FruitVeg / std=Std);
      id Country;
   run;

   options ls=120;
   proc print data=Dist(Obs=10);
      title2 'First 10 observations in the output data set from PROC DISTANCE';
   run;
An output SAS data set called Dist, which contains the distance matrix, is created through the OUT= option. The METHOD=Euclid option requests that Euclidean distances be computed.
The VAR statement lists the variables (RedMeat--FruitVeg) along with their measurement level to be used in the analysis. An interval level of measurement is assigned to those variables. Because variables with large variances tend to have more effect on the proximity measure than those with small variances, each variable is standardized by the STD method to have a mean of 0 and a standard deviation of 1. This is done by adding a slash (/) followed by the STD=Std option at the end of the variable list.
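As a sketch of what the standardization and the Euclidean distance amount to, the computation can be written as follows. The symbols are notation introduced here for illustration, not taken from the SAS documentation: x_{ij} is the value of food group j for country i, and \bar{x}_j and s_j are the mean and standard deviation of food group j over the 25 countries.

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
d(a, b) = \sqrt{\sum_{j=1}^{9} \left( z_{aj} - z_{bj} \right)^{2}}
```

Because every z_j has a standard deviation of 1, each of the nine food groups contributes on a comparable scale to d(a, b), whatever its original variance.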
The ID statement specifies that the variable Country should be copied to the OUT= data set and used to generate names for the distance variables.
There are 25 observations in the input data set; therefore, the output data set Dist contains a 25-by-25 lower triangular matrix.
The PROC PRINT statement displays the first 10 observations in the output data set Dist as shown in Figure 26.1.
Protein Consumption in Europe
First 10 observations in the output data set from PROC DISTANCE

OBS  Country         Albania  Austria  Belgium  Bulgaria  Czechoslovakia  Denmark  E_Germany  Finland   France   Greece  Hungary
  1  Albania         0.00000        .        .         .               .        .          .        .        .        .        .
  2  Austria         6.12388  0.00000        .         .               .        .          .        .        .        .        .
  3  Belgium         5.94109  2.44987  0.00000         .               .        .          .        .        .        .        .
  4  Bulgaria        2.76446  4.88331  5.22711   0.00000               .        .          .        .        .        .        .
  5  Czechoslovakia  5.13959  2.11498  2.21330   3.94761         0.00000        .          .        .        .        .        .
  6  Denmark         6.61002  3.01392  2.52541   6.00803         3.34049  0.00000          .        .        .        .        .
  7  E Germany       6.39178  2.56341  2.10211   5.40824         1.87962  2.72112    0.00000        .        .        .        .
  8  Finland         5.81458  4.04271  3.45779   5.74882         3.91378  2.61570    3.99426  0.00000        .        .        .
  9  France          6.29601  3.58891  2.19329   5.54675         3.36011  3.65772    3.78184  4.56796  0.00000        .        .
 10  Greece          4.24495  5.16330  4.69515   3.74849         4.86684  5.59084    5.61496  5.47453  4.54456        0        .

OBS  Ireland  Italy  Netherlands  Norway  Poland  Portugal  Romania  Spain  Sweden  Switzerland  UK  USSR  W_Germany  Yugoslavia
  1        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  2        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  3        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  4        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  5        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  6        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  7        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  8        .      .            .       .       .         .        .      .       .            .   .     .          .           .
  9        .      .            .       .       .         .        .      .       .            .   .     .          .           .
 10        .      .            .       .       .         .        .      .       .            .   .     .          .           .
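The printed output stores only the lower triangle, so every entry above the diagonal is missing. If a full symmetric matrix is more convenient for later processing, PROC DISTANCE accepts a SHAPE= option. The following is a minimal sketch under that assumption; the data set name DistSquare is hypothetical and is not part of the original example.

```sas
/* Sketch: same standardized Euclidean distances as above, but     */
/* stored as a full symmetric matrix instead of a lower triangle.  */
/* DistSquare is a hypothetical output data set name.              */
proc distance data=Protein out=DistSquare method=Euclid shape=square;
   var interval(RedMeat--FruitVeg / std=Std);
   id Country;
run;
```

The lower triangular form is sufficient for PROC CLUSTER, so the square shape is only a convenience for other uses.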
The following statements produce the tree diagram in Figure 26.2:
   proc cluster data=Dist method=Ward outtree=Tree noprint;
      id Country;
   run;

   axis1 order=(0 to 1 by 0.1);
   proc tree data=Tree haxis=axis1 horizontal;
      height _rsq_;
      id Country;
   run;
Figure 26.2: Tree Diagram of Clusters versus R-Square Values
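The CLUSTER procedure performs a Ward's minimum-variance cluster analysis on the distance matrix in Dist. The NOPRINT option suppresses the printed output, and the OUTTREE= option saves the cluster history in the data set Tree. The TREE procedure then draws the tree diagram from Tree, with the HEIGHT statement specifying the variable _RSQ_ (R square) as the height variable.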
After inspecting the tree diagram in Figure 26.2, you can see that when the countries are grouped into six clusters, the proportion of variance accounted for by these clusters is slightly less than 70% (69.3%).
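To see which countries fall into each of the six clusters, one option is to cut the tree at that level with PROC TREE and list the members. The following is a minimal sketch that assumes the Tree data set created by the PROC CLUSTER step; the data set name Clus6 is hypothetical.

```sas
/* Sketch: cut the tree at six clusters and save the assignment.   */
/* Clus6 is a hypothetical output data set name.                   */
proc tree data=Tree noprint nclusters=6 out=Clus6;
   id Country;
run;

/* List the countries grouped by their assigned cluster number. */
proc sort data=Clus6;
   by Cluster;
run;

proc print data=Clus6 noobs;
   by Cluster;
   var Country;
run;
```

The OUT= data set carries the ID variable together with the variable CLUSTER, which holds the cluster number assigned to each country.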