Details


Clustering Methods

The following notation is used, with lowercase symbols generally pertaining to observations and uppercase symbols pertaining to clusters:

$n$ : number of observations

$v$ : number of variables if data are coordinates

$G$ : number of clusters at any given level of the hierarchy

$x_i$ or $\mathbf{x}_i$ : $i$th observation (row vector if coordinate data)

$C_K$ : $K$th cluster, a subset of $\{1, 2, \ldots, n\}$

$N_K$ : number of observations in $C_K$

$\bar{x}$ : sample mean vector

$\bar{x}_K$ : mean vector for cluster $C_K$

$\|x\|$ : Euclidean length of the vector $x$, that is, the square root of the sum of the squares of the elements of $x$

$T = \sum_{i=1}^{n} \|x_i - \bar{x}\|^2$ : total sum of squares

$W_K = \sum_{i \in C_K} \|x_i - \bar{x}_K\|^2$ : within-cluster sum of squares for $C_K$

$P_G = \sum_J W_J$, where the summation is over the $G$ clusters at the $G$th level of the hierarchy

$B_{KL} = W_M - W_K - W_L$ if $C_M = C_K \cup C_L$

$d(x, y)$ : any distance or dissimilarity measure between observations or vectors $x$ and $y$

$D_{KL}$ : any distance or dissimilarity measure between clusters $C_K$ and $C_L$

The distance between two clusters can be defined either directly or combinatorially (Lance and Williams 1967), that is, by an equation for updating a distance matrix when two clusters are joined. In all of the following combinatorial formulas, it is assumed that clusters C K and C L are merged to form C M , and the formula gives the distance between the new cluster C M and any other cluster C J .
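For reference, each of the combinatorial formulas below is a special case of the general Lance and Williams (1967) recurrence, which is not written out elsewhere in this section:

$$D_{JM} = \alpha_K D_{JK} + \alpha_L D_{JL} + \beta D_{KL} + \gamma \left| D_{JK} - D_{JL} \right|$$

where the coefficients $\alpha_K$, $\alpha_L$, $\beta$, and $\gamma$ depend on the method; for example, complete linkage uses $\alpha_K = \alpha_L = \gamma = \frac{1}{2}$ and $\beta = 0$, which reduces to $\max(D_{JK}, D_{JL})$.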

For an introduction to most of the methods used in the CLUSTER procedure, refer to Massart and Kaufman (1983).

Average Linkage

The following method is obtained by specifying METHOD=AVERAGE. The distance between two clusters is defined by

$$D_{KL} = \frac{1}{N_K N_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)$$

If $d(x, y) = \|x - y\|^2$, then

$$D_{KL} = \|\bar{x}_K - \bar{x}_L\|^2 + \frac{W_K}{N_K} + \frac{W_L}{N_L}$$

The combinatorial formula is

$$D_{JM} = \frac{N_K D_{JK} + N_L D_{JL}}{N_M}$$

In average linkage the distance between two clusters is the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances, and it is slightly biased toward producing clusters with the same variance.

Average linkage was originated by Sokal and Michener (1958).

Centroid Method

The following method is obtained by specifying METHOD=CENTROID. The distance between two clusters is defined by

$$D_{KL} = d(\bar{x}_K, \bar{x}_L)$$

If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is

$$D_{JM} = \frac{N_K D_{JK} + N_L D_{JL}}{N_M} - \frac{N_K N_L D_{KL}}{N_M^2}$$

In the centroid method, the distance between two clusters is defined as the (squared) Euclidean distance between their centroids or means. The centroid method is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward's method or average linkage (Milligan 1980).

The centroid method was originated by Sokal and Michener (1958).

Complete Linkage

The following method is obtained by specifying METHOD=COMPLETE. The distance between two clusters is defined by

$$D_{KL} = \max_{i \in C_K} \max_{j \in C_L} d(x_i, x_j)$$

The combinatorial formula is

$$D_{JM} = \max(D_{JK}, D_{JL})$$

In complete linkage, the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. Complete linkage is strongly biased toward producing clusters with roughly equal diameters, and it can be severely distorted by moderate outliers (Milligan 1980).

Complete linkage was originated by Sorensen (1948).

Density Linkage

The phrase density linkage is used here to refer to a class of clustering methods using nonparametric probability density estimates (for example, Hartigan 1975, pp. 205–212; Wong 1982; Wong and Lane 1983). Density linkage consists of two steps:

  1. A new dissimilarity measure, $d^*$, based on density estimates and adjacencies is computed. If $x_i$ and $x_j$ are adjacent (the definition of adjacency depends on the method of density estimation), then $d^*(x_i, x_j)$ is the reciprocal of an estimate of the density midway between $x_i$ and $x_j$; otherwise, $d^*(x_i, x_j)$ is infinite.

  2. A single linkage cluster analysis is performed using $d^*$.

The CLUSTER procedure supports three types of density linkage: the $k$th-nearest-neighbor method, the uniform-kernel method, and Wong's hybrid method. These are obtained by using METHOD=DENSITY and the K=, R=, and HYBRID options, respectively.

kth-Nearest-Neighbor Method

The $k$th-nearest-neighbor method (Wong and Lane 1983) uses $k$th-nearest-neighbor density estimates. Let $r_k(x)$ be the distance from point $x$ to the $k$th-nearest observation, where $k$ is the value specified for the K= option. Consider a closed sphere centered at $x$ with radius $r_k(x)$. The estimated density at $x$, $f(x)$, is the proportion of observations within the sphere divided by the volume of the sphere. The new dissimilarity measure is computed as

$$d^*(x_i, x_j) = \begin{cases} \dfrac{1}{2}\left( \dfrac{1}{f(x_i)} + \dfrac{1}{f(x_j)} \right) & \text{if } d(x_i, x_j) \le \max\bigl( r_k(x_i), r_k(x_j) \bigr) \\[1ex] \infty & \text{otherwise} \end{cases}$$

Wong and Lane (1983) show that $k$th-nearest-neighbor density linkage is strongly set consistent for high-density (density-contour) clusters if $k$ is chosen such that $k/n \to 0$ and $k/\ln(n) \to \infty$ as $n \to \infty$. Wong and Schaack (1982) discuss methods for estimating the number of population clusters using $k$th-nearest-neighbor clustering.
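In coordinate form, the sphere volume has a familiar closed expression, so the density estimate described above can be written out explicitly (a standard identity implied by, but not spelled out in, the description):

$$f(x) = \frac{k/n}{\dfrac{\pi^{v/2}}{\Gamma\left(\frac{v}{2}+1\right)} \, r_k(x)^v}$$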

Uniform-Kernel Method

The uniform-kernel method uses uniform-kernel density estimates. Let $r$ be the value specified for the R= option. Consider a closed sphere centered at point $x$ with radius $r$. The estimated density at $x$, $f(x)$, is the proportion of observations within the sphere divided by the volume of the sphere. The new dissimilarity measure is computed as

$$d^*(x_i, x_j) = \begin{cases} \dfrac{1}{2}\left( \dfrac{1}{f(x_i)} + \dfrac{1}{f(x_j)} \right) & \text{if } d(x_i, x_j) \le r \\[1ex] \infty & \text{otherwise} \end{cases}$$

Wong's Hybrid Method

Wong's (1982) hybrid clustering method uses density estimates based on a preliminary cluster analysis by the $k$-means method. The preliminary clustering can be done by the FASTCLUS procedure, using the MEAN= option to create a data set containing cluster means, frequencies, and root-mean-square standard deviations. This data set is used as input to the CLUSTER procedure, and the HYBRID option is specified with METHOD=DENSITY to request the hybrid analysis. The hybrid method is appropriate for very large data sets but should not be used with small data sets, say fewer than 100 observations in the original data. In the following description, the term preliminary cluster refers to an observation in the DATA= data set.
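A minimal sketch of this two-step workflow follows; the data set and variable names are hypothetical, and MAXCLUSTERS=25 reflects Wong's $n^{0.3}$ suggestion for a data set of roughly 50,000 observations:

   proc fastclus data=raw maxclusters=25 mean=prelim noprint;
      var x1-x5;                 /* preliminary k-means clustering        */
   run;

   proc cluster data=prelim method=density hybrid outtree=tree;
      var x1-x5;                 /* hybrid density linkage on the means   */
   run;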

For preliminary cluster $C_K$, $N_K$ and $W_K$ are obtained from the input data set, as are the cluster means or the distances between the cluster means. Preliminary clusters $C_K$ and $C_L$ are considered adjacent if the midpoint between $\bar{x}_K$ and $\bar{x}_L$ is closer to either $\bar{x}_K$ or $\bar{x}_L$ than to any other preliminary cluster mean or, equivalently, if $d^2(\bar{x}_K, \bar{x}_L) < d^2(\bar{x}_K, \bar{x}_M) + d^2(\bar{x}_L, \bar{x}_M)$ for all other preliminary clusters $C_M$, $M \ne K$ or $L$. The new dissimilarity measure is computed as

$$d^*(C_K, C_L) = \begin{cases} \dfrac{\left( W_K + W_L + \frac{N_K N_L}{N_K + N_L} \, d^2(\bar{x}_K, \bar{x}_L) \right)^{v/2}}{N_K + N_L} & \text{if } C_K \text{ and } C_L \text{ are adjacent} \\[1ex] \infty & \text{otherwise} \end{cases}$$

Using the K= and R= Options

The values of the K= and R= options are called smoothing parameters. Small values of K= or R= produce jagged density estimates and, as a consequence, many modes. Large values of K= or R= produce smoother density estimates and fewer modes. In the hybrid method, the smoothing parameter is the number of clusters in the preliminary cluster analysis. The number of modes in the final analysis tends to increase as the number of clusters in the preliminary analysis increases. Wong (1982) suggests using $n^{0.3}$ preliminary clusters, where $n$ is the number of observations in the original data set. There is no general rule of thumb for selecting K= values. For all types of density linkage, you should repeat the analysis with several different values of the smoothing parameter (Wong and Schaack 1982).

There is no simple answer to the question of which smoothing parameter to use (Silverman 1986, pp. 43–61, 84–88, and 98–99). It is usually necessary to try several different smoothing parameters. A reasonable first guess for the R= option in many coordinate data sets is given by

$$\left[ \frac{2^{v+2}(v+2)\,\Gamma\left(\frac{v}{2}+1\right)}{n v^2} \right]^{\frac{1}{v+4}} \sqrt{\sum_{l=1}^{v} s_l^2}$$

where $s_l$ is the standard deviation of the $l$th variable. The estimate for R= can be computed in a DATA step by using the GAMMA function for $\Gamma$. This formula is derived under the assumption that the data are sampled from a multivariate normal distribution and tends, therefore, to be too large (oversmooth) if the true distribution is multimodal. Robust estimates of the standard deviations may be preferable if there are outliers. If the data are distances, the factor $\sqrt{\sum_l s_l^2}$ can be replaced by an average (mean, trimmed mean, median, root mean square, and so on) distance divided by $\sqrt{2}$. To prevent outliers from appearing as separate clusters, you can also specify K=2, or more generally K=$m$ with $m \ge 2$, which in most cases forces clusters to have at least $m$ members.
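For example, the following DATA step evaluates the formula (a sketch; the values of n, v, and the sum of variances are assumptions you supply, with the sum equal to v for standardized variables):

   data _null_;
      n  = 100;       /* number of observations                      */
      v  = 3;         /* number of variables                         */
      ss = v;         /* sum of the s(l)**2; equals v if the         */
                      /* variables are standardized                  */
      r  = (2**(v+2) * (v+2) * gamma(v/2 + 1) / (n * v**2))**(1/(v+4))
           * sqrt(ss);
      put 'First guess for R= ' r 5.2;  /* prints 1.41, matching Table 23.1 */
   run;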

If the variables all have unit variance (for example, if the STANDARD option is used), Table 23.1 can be used to obtain an initial guess for the R= option:

Table 23.1: Reasonable First Guess for the R= Option for Standardized Data

Number of        Number of Variables
Observations       1     2     3     4     5     6     7     8     9    10
---------------------------------------------------------------------------
      20        1.01  1.36  1.77  2.23  2.73  3.25  3.81  4.38  4.98  5.60
      35        0.91  1.24  1.64  2.08  2.56  3.08  3.62  4.18  4.77  5.38
      50        0.84  1.17  1.56  1.99  2.46  2.97  3.50  4.06  4.64  5.24
      75        0.78  1.09  1.47  1.89  2.35  2.85  3.38  3.93  4.50  5.09
     100        0.73  1.04  1.41  1.82  2.28  2.77  3.29  3.83  4.40  4.99
     150        0.68  0.97  1.33  1.73  2.18  2.66  3.17  3.71  4.27  4.85
     200        0.64  0.93  1.28  1.67  2.11  2.58  3.09  3.62  4.17  4.75
     350        0.57  0.85  1.18  1.56  1.98  2.44  2.93  3.45  4.00  4.56
     500        0.53  0.80  1.12  1.49  1.91  2.36  2.84  3.35  3.89  4.45
     750        0.49  0.74  1.06  1.42  1.82  2.26  2.74  3.24  3.77  4.32
    1000        0.46  0.71  1.01  1.37  1.77  2.20  2.67  3.16  3.69  4.23
    1500        0.43  0.66  0.96  1.30  1.69  2.11  2.57  3.06  3.57  4.11
    2000        0.40  0.63  0.92  1.25  1.63  2.05  2.50  2.99  3.49  4.03

Since infinite $d^*$ values occur in density linkage, the final number of clusters can exceed one when there are wide gaps between the clusters or when the smoothing parameter results in little smoothing.

Density linkage applies no constraints to the shapes of the clusters and, unlike most other hierarchical clustering methods, is capable of recovering clusters with elongated or irregular shapes. Because density linkage uses less prior knowledge about the shape of the clusters than do methods restricted to compact clusters, it is less effective at recovering compact clusters from small samples than are methods that always recover compact clusters, regardless of the data.

EML

The following method is obtained by specifying METHOD=EML. The distance between two clusters is given by

$$D_{KL} = n v \ln\left( 1 + \frac{B_{KL}}{P_G} \right) - 2 \left[ N_M \ln(N_M) - N_K \ln(N_K) - N_L \ln(N_L) \right]$$

The EML method joins clusters to maximize the likelihood at each level of the hierarchy under the following assumptions.

  • multivariate normal mixture

  • equal spherical covariance matrices

  • unequal sampling probabilities

The EML method is similar to Ward's minimum-variance method but removes the bias toward equal-sized clusters. Practical experience has indicated that EML is somewhat biased toward unequal-sized clusters. You can specify the PENALTY= option to adjust the degree of bias. If you specify PENALTY=$p$, the formula is modified to

$$D_{KL} = n v \ln\left( 1 + \frac{B_{KL}}{P_G} \right) - p \left[ N_M \ln(N_M) - N_K \ln(N_K) - N_L \ln(N_L) \right]$$

The EML method was derived by W.S. Sarle of SAS Institute Inc. from the maximum-likelihood formula obtained by Symons (1981, p. 37, equation 8) for disjoint clustering. There are currently no other published references on the EML method.

Flexible-Beta Method

The following method is obtained by specifying METHOD=FLEXIBLE. The combinatorial formula is

$$D_{JM} = (D_{JK} + D_{JL}) \, \frac{1 - b}{2} + D_{KL} \, b$$

where $b$ is the value of the BETA= option, or $-0.25$ by default.

The flexible-beta method was developed by Lance and Williams (1967). See also Milligan (1987).

McQuitty's Similarity Analysis

The following method is obtained by specifying METHOD=MCQUITTY. The combinatorial formula is

$$D_{JM} = \frac{D_{JK} + D_{JL}}{2}$$

The method was independently developed by Sokal and Michener (1958) and McQuitty (1966).

Median Method

The following method is obtained by specifying METHOD=MEDIAN. If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is

$$D_{JM} = \frac{D_{JK} + D_{JL}}{2} - \frac{D_{KL}}{4}$$

The median method was developed by Gower (1967).

Single Linkage

The following method is obtained by specifying METHOD=SINGLE. The distance between two clusters is defined by

$$D_{KL} = \min_{i \in C_K} \min_{j \in C_L} d(x_i, x_j)$$

The combinatorial formula is

$$D_{JM} = \min(D_{JK}, D_{JL})$$

In single linkage, the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. Single linkage has many desirable theoretical properties (Jardine and Sibson 1971; Fisher and Van Ness 1971; Hartigan 1981) but has fared poorly in Monte Carlo studies (for example, Milligan 1980). By imposing no constraints on the shape of clusters, single linkage sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters. Note also that single linkage tends to chop off the tails of distributions before separating the main clusters (Hartigan 1981). The notorious chaining tendency of single linkage can be alleviated by specifying the TRIM= option (Wishart 1969, pp. 296–298).

Density linkage and two-stage density linkage retain most of the virtues of single linkage while performing better with compact clusters and possessing better asymptotic properties (Wong and Lane 1983).

Single linkage was originated by Florek et al. (1951a, 1951b) and later reinvented by McQuitty (1957) and Sneath (1957).

Two-Stage Density Linkage

If you specify METHOD=DENSITY, the modal clusters often merge before all the points in the tails have clustered. The option METHOD=TWOSTAGE is a modification of density linkage that ensures that all points are assigned to modal clusters before the modal clusters are allowed to join. The CLUSTER procedure supports the same three varieties of two-stage density linkage as ordinary density linkage: $k$th-nearest neighbor, uniform kernel, and hybrid.

In the first stage, disjoint modal clusters are formed. The algorithm is the same as the single linkage algorithm ordinarily used with density linkage, with one exception: two clusters are joined only if at least one of the two clusters has fewer members than the number specified by the MODE= option. At the end of the first stage, each point belongs to one modal cluster.

In the second stage, the modal clusters are hierarchically joined by single linkage. The final number of clusters can exceed one when there are wide gaps between the clusters or when the smoothing parameter is small.

Each stage forms a tree that can be plotted by the TREE procedure. By default, the TREE procedure plots the tree from the first stage. To obtain the tree for the second stage, use the option HEIGHT=MODE in the PROC TREE statement. You can also produce a single tree diagram containing both stages, with the number of clusters as the height axis, by using the option HEIGHT=N in the PROC TREE statement. To produce an output data set from PROC TREE containing the modal clusters, use _HEIGHT_ for the HEIGHT variable (the default) and specify LEVEL=0.
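For example (a sketch, assuming PROC CLUSTER created OUTTREE=tree and that the input data had an ID variable named id):

   proc tree data=tree height=mode;       /* tree from the second stage   */
   run;

   proc tree data=tree level=0 out=modal; /* default HEIGHT=_HEIGHT_;     */
      id id;                              /* output the modal clusters    */
   run;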

Two-stage density linkage was developed by W.S. Sarle of SAS Institute Inc. There are currently no other published references on two-stage density linkage.

Ward's Minimum-Variance Method

The following method is obtained by specifying METHOD=WARD. The distance between two clusters is defined by

$$D_{KL} = B_{KL} = \frac{\|\bar{x}_K - \bar{x}_L\|^2}{\frac{1}{N_K} + \frac{1}{N_L}}$$

If $d(x, y) = \frac{1}{2}\|x - y\|^2$, then the combinatorial formula is

$$D_{JM} = \frac{(N_J + N_K)\,D_{JK} + (N_J + N_L)\,D_{JL} - N_J\,D_{KL}}{N_J + N_M}$$

In Ward's minimum-variance method, the distance between two clusters is the ANOVA sum of squares between the two clusters summed over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous generation. The sums of squares are easier to interpret when they are divided by the total sum of squares to give proportions of variance (squared semipartial correlations).

Ward's method joins clusters to maximize the likelihood at each level of the hierarchy under the following assumptions:

  • multivariate normal mixture

  • equal spherical covariance matrices

  • equal sampling probabilities

Ward's method tends to join clusters with a small number of observations, and it is strongly biased toward producing clusters with roughly the same number of observations. It is also very sensitive to outliers (Milligan 1980).

Ward (1963) describes a class of hierarchical clustering methods including the minimum variance method.

Miscellaneous Formulas

The root-mean-square standard deviation of a cluster $C_K$ is

$$\text{RMSSTD} = \sqrt{\frac{W_K}{v\,(N_K - 1)}}$$

The $R^2$ statistic for a given level of the hierarchy is

$$R^2 = 1 - \frac{P_G}{T}$$

The squared semipartial correlation for joining clusters $C_K$ and $C_L$ is

$$\text{semipartial } R^2 = \frac{B_{KL}}{T}$$

The bimodality coefficient is

$$b = \frac{m_3^2 + 1}{m_4 + \frac{3(n-1)^2}{(n-2)(n-3)}}$$

where $m_3$ is skewness and $m_4$ is kurtosis. Values of $b$ greater than 0.555 (the value for a uniform population) may indicate bimodal or multimodal marginal distributions. (With $m_4$ measured as excess kurtosis, the convention that yields 0.555, a large sample from a uniform distribution has $m_3 = 0$ and $m_4 = -1.2$, giving $b \approx 1/(3 - 1.2) = 0.555$.) The maximum possible value of 1.0 is attained by a population with only two distinct values, that is, the Bernoulli distribution. Very heavy-tailed distributions have small values of $b$ regardless of the number of modes.

Formulas for the cubic clustering criterion and approximate expected $R^2$ are given in Sarle (1983).

The pseudo F statistic for a given level is

$$F = \frac{\dfrac{T - P_G}{G - 1}}{\dfrac{P_G}{n - G}}$$

The pseudo $t^2$ statistic for joining $C_K$ and $C_L$ is

$$t^2 = \frac{B_{KL}}{\dfrac{W_K + W_L}{N_K + N_L - 2}}$$

The pseudo $F$ and $t^2$ statistics may be useful indicators of the number of clusters, but they are not distributed as $F$ and $t^2$ random variables. If the data are independently sampled from a multivariate normal distribution with a scalar covariance matrix and if the clustering method allocates observations to clusters randomly (which no clustering method actually does), then the pseudo $F$ statistic is distributed as an $F$ random variable with $v(G - 1)$ and $v(n - G)$ degrees of freedom. Under the same assumptions, the pseudo $t^2$ statistic is distributed as an $F$ random variable with $v$ and $v(N_K + N_L - 2)$ degrees of freedom. The pseudo $t^2$ statistic differs computationally from Hotelling's $T^2$ in that the latter uses a general symmetric covariance matrix instead of a scalar covariance matrix. The pseudo $F$ statistic was suggested by Calinski and Harabasz (1974). The pseudo $t^2$ statistic is related to the $J_e(2)/J_e(1)$ statistic of Duda and Hart (1973) by

$$\frac{J_e(2)}{J_e(1)} = \frac{1}{1 + \dfrac{t^2}{N_K + N_L - 2}}$$

See Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the performance of these statistics in estimating the number of population clusters. Conservative tests for the number of clusters using the pseudo $F$ and $t^2$ statistics can be obtained by the Bonferroni approach (Hawkins, Muller, and ten Krooden 1982, pp. 337–340).

Ultrametrics

A dissimilarity measure $d(x, y)$ is called an ultrametric if it satisfies the following conditions:

  • $d(x, x) = 0$ for all $x$

  • $d(x, y) \ge 0$ for all $x$, $y$

  • $d(x, y) = d(y, x)$ for all $x$, $y$

  • $d(x, y) \le \max\bigl( d(x, z), d(y, z) \bigr)$ for all $x$, $y$, and $z$

Any hierarchical clustering method induces a dissimilarity measure on the observations, say $h(x_i, x_j)$. Let $C_M$ be the cluster with the fewest members that contains both $x_i$ and $x_j$. Assume $C_M$ was formed by joining $C_K$ and $C_L$. Then define $h(x_i, x_j) = D_{KL}$.

If the fusion of $C_K$ and $C_L$ reduces the number of clusters from $g$ to $g - 1$, then define $D_{(g)} = D_{KL}$. Johnson (1967) shows that if

$$0 \le D_{(n)} \le D_{(n-1)} \le \cdots \le D_{(2)}$$

then $h(\cdot, \cdot)$ is an ultrametric. A method that always satisfies this condition is said to be a monotonic or ultrametric clustering method. All methods implemented in PROC CLUSTER except CENTROID, EML, and MEDIAN are ultrametric (Milligan 1979; Batagelj 1981).

Algorithms

Anderberg (1973) describes three algorithms for implementing agglomerative hierarchical clustering: stored data, stored distance, and sorted distance. The algorithms used by PROC CLUSTER for each method are indicated in Table 23.2. For METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD, either the stored data or the stored distance algorithm can be used. For these methods, if the data are distances or if you specify the NOSQUARE option, the stored distance algorithm is used; otherwise, the stored data algorithm is used.

Table 23.2: Three Algorithms for Implementing Agglomerative Hierarchical Clustering

                             Algorithm
Method        Stored Data   Stored Distance   Sorted Distance
--------------------------------------------------------------
AVERAGE            x               x
CENTROID           x               x
COMPLETE                           x
DENSITY                                              x
EML                x
FLEXIBLE                           x
MCQUITTY                           x
MEDIAN                             x
SINGLE                             x
TWOSTAGE                                             x
WARD               x               x

Computational Resources

The CLUSTER procedure stores the data (including the COPY and ID variables) in memory or, if necessary, on disk. If eigenvalues are computed, the covariance matrix is stored in memory. If the stored distance or sorted distance algorithm is used, the distances are stored in memory or, if necessary, on disk.

With coordinate data, the increase in CPU time is roughly proportional to the number of variables. The VAR statement should list the variables in order of decreasing variance for greatest efficiency.

For both coordinate and distance data, the dominant factor determining CPU time is the number of observations. For density methods with coordinate data, the asymptotic time requirements are somewhere between $n \ln(n)$ and $n^2$, depending on how the smoothing parameter increases. For other methods except EML, time is roughly proportional to $n^2$. For the EML method, time is roughly proportional to $n^3$.

PROC CLUSTER runs much faster if the data can be stored in memory and, if the stored distance algorithm is used, the distance matrix can be stored in memory as well. To estimate the bytes of memory needed for the data, use the following equation and round up to the nearest multiple of d .

   n ( vd
       + 8d + i
       + i                                    if density estimation or the sorted
                                              distance algorithm is used
       + 3d                                   if the stored data algorithm is used
       + 3d                                   if density estimation is used
       + max(8, length of ID variable)        if an ID variable is used
       + length of ID variable                if an ID variable is used
       + sum of lengths of COPY variables )   if COPY variables are used

where

$n$ is the number of observations

$v$ is the number of variables

$d$ is the size of a C variable of type double (for most computers, $d = 8$)

$i$ is the size of a C variable of type int (for most computers, $i = 4$)

The number of bytes needed for the distance matrix is $dn(n + 1)/2$.
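As a quick check of these requirements, a DATA step like the following evaluates the distance-matrix expression (the value of n is an assumption):

   data _null_;
      n = 5000;
      d = 8;                         /* size of a double, in bytes   */
      bytes = d * n * (n + 1) / 2;   /* about 100 million bytes here */
      put 'Distance matrix: ' bytes comma15. ' bytes';
   run;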

Missing Values

If the data are coordinates, observations with missing values are excluded from the analysis. If the data are distances, missing values are not allowed in the lower triangle of the distance matrix. The upper triangle is ignored. For more on TYPE=DISTANCE data sets, see Appendix A, "Special SAS Data Sets."

Ties

At each level of the clustering algorithm, PROC CLUSTER must identify the pair of clusters with the minimum distance. Sometimes, usually when the data are discrete, there may be two or more pairs with the same minimum distance. In such cases the tie must be broken in some arbitrary way. If there are ties, then the results of the cluster analysis depend on the order of the observations in the data set. The presence of ties is reported in the SAS log and in the column of the cluster history labeled Tie unless the NOTIE option is specified.

PROC CLUSTER breaks ties as follows. Each cluster is identified by the smallest observation number among its members. For each pair of clusters, there is a smaller identification number and a larger identification number. If two or more pairs of clusters are tied for minimum distance between clusters, the pair that has the minimum larger identification number is merged. If there is a tie for minimum larger identification number, the pair that has the minimum smaller identification number is merged. This method for breaking ties is different from that used in Version 5. The change in the algorithm may produce changes in the resulting clusters.

A tie means that the level in the cluster history at which the tie occurred and possibly some of the subsequent levels are not uniquely determined. Ties that occur early in the cluster history usually have little effect on the later stages. Ties that occur in the middle part of the cluster history are cause for further investigation. Ties late in the cluster history indicate important indeterminacies.

The importance of ties can be assessed by repeating the cluster analysis for several different random permutations of the observations. The discrepancies at a given level can be examined by crosstabulating the clusters obtained at that level for all of the permutations. See Example 23.4 on page 1027 for details.
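One way to set up such a permutation (a sketch; the data set names, variable list, and seed are hypothetical):

   data perm;
      set original;
      _rnd_ = ranuni(12345);        /* random sort key */
   run;

   proc sort data=perm;
      by _rnd_;
   run;

   proc cluster data=perm method=average outtree=tree_perm noprint;
      var x1-x5;
   run;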

Size, Shape, and Correlation

In some biological applications, the organisms that are being clustered may be at different stages of growth. Unless it is the growth process itself that is being studied, differences in size among such organisms are not of interest. Therefore, distances among organisms should be computed in such a way as to control for differences in size while retaining information about differences in shape.

If coordinate data are measured on an interval scale, you can control for size by subtracting a measure of the overall size of each observation from each datum. For example, if no other direct measure of size is available, you could subtract the mean of each row of the data matrix, producing a row-centered coordinate matrix. An easy way to subtract the mean of each row is to use PROC STANDARD on the transposed coordinate matrix:

   proc transpose data=coordinate-data;
   proc standard m=0;
   proc transpose out=row-centered-coordinate-data;

Another way to remove size effects from interval-scale coordinate data is to do a principal component analysis and discard the first component (Blackith and Reyment 1971).

If the data are measured on a ratio scale, you can control for size by dividing each datum by a measure of overall size; in this case, the geometric mean is a more natural measure of size than the arithmetic mean. However, it is often more meaningful to analyze the logarithms of ratio-scaled data, in which case you can subtract the arithmetic mean after taking logarithms. You must also consider the dimensions of measurement. For example, if you have measures of both length and weight, you may need to cube the measures of length or take the cube root of the weights. Various other complications may also arise in real applications, such as different growth rates for different parts of the body (Sneath and Sokal 1973).

Issues of size and shape are pertinent to many areas besides biology (for example, Hamer and Cunningham 1981). Suppose you have data consisting of subjective ratings made by several different raters. Some raters may tend to give higher overall ratings than other raters. Some raters may also tend to spread out their ratings over more of the scale than do other raters. If it is impossible for you to adjust directly for rater differences, then distances should be computed in such a way as to control for both differences in size and variability. For example, if the data are considered to be measured on an interval scale, you can subtract the mean of each observation and divide by the standard deviation, producing a row-standardized coordinate matrix. With some clustering methods, analyzing squared Euclidean distances from a row-standardized coordinate matrix is equivalent to analyzing the matrix of correlations among rows, since squared Euclidean distance is an affine transformation of the correlation (Hartigan 1975, p. 64).

If you do an analysis of row-centered or row-standardized data, you need to consider whether the columns (variables) should be standardized before centering or standardizing the rows, after centering or standardizing the rows, or both before and after. If you standardize the columns after standardizing the rows, then strictly speaking you are not analyzing shape because the profiles are distorted by standardizing the columns; however, this type of double standardization may be necessary in practice to get reasonable results. It is not clear whether iterating the standardization of rows and columns may be of any benefit.

The choice of distance or correlation measure should depend on the meaning of the data and the purpose of the analysis. Simulation studies that compare distance and correlation measures are useless unless the data are generated to mimic data from your field of application; conclusions drawn from artificial data cannot be generalized because it is possible to generate data such that distances that include size effects work better or such that correlations work better.

You can standardize the rows of a data set by using a DATA step or by using the TRANSPOSE and STANDARD procedures. You can also use PROC TRANSPOSE and then have PROC CORR create a TYPE=CORR data set containing a correlation matrix. If you want to analyze a TYPE=CORR data set with PROC CLUSTER, you must use a DATA step to perform the following steps, as sketched in the DATA step after the list:

  1. Set the data set TYPE= to DISTANCE.

  2. Convert the correlations to dissimilarities by computing $1 - r$, $\sqrt{1 - r}$, $1 - r^2$, or some other decreasing function.

  3. Delete observations for which the variable _TYPE_ does not have the value CORR.
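A sketch of these three steps in a single DATA step (the data set names are hypothetical; the TYPE=CORR data set is assumed to come from PROC CORR as described above):

   data dist(type=distance);
      set corrs;                          /* step 1: TYPE= set to DISTANCE */
      if _type_ ne 'CORR' then delete;    /* step 3: keep only CORR rows   */
      array r {*} _numeric_;
      do i = 1 to dim(r);
         r{i} = 1 - r{i};                 /* step 2: dissimilarity = 1 - r */
      end;
      drop i;
   run;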

See Example 23.6 on page 1044 for an analysis of a data set in which size information is detrimental to the classification.

Output Data Set

The OUTTREE= data set contains one observation for each observation in the input data set, plus one observation for each cluster of two or more observations (that is, one observation for each node of the cluster tree). The total number of output observations is usually $2n - 1$, where $n$ is the number of input observations. The density methods may produce fewer output observations when the number of clusters cannot be reduced to one.

The label of the OUTTREE= data set identifies the type of cluster analysis performed and is automatically displayed when the TREE procedure is invoked.

The variables in the OUTTREE= data set are as follows:

If the input data set contains coordinates, the following variables appear in the output data set:

  • the variables containing the coordinates used in the cluster analysis. For output observations that correspond to input observations, the values of the coordinates are the same in both data sets except for some slight numeric error possibly introduced by standardizing and unstandardizing if the STANDARD option is used. For output observations that correspond to clusters of more than one input observation, the values of the coordinates are the cluster means.

  • _ERSQ_ , the approximate expected value of R 2 under the uniform null hypothesis

  • _RATIO_ , equal to $\dfrac{1 - \text{_ERSQ_}}{1 - \text{_RSQ_}}$

  • _LOGR_ , natural logarithm of _RATIO_

  • _CCC_ , the cubic clustering criterion

The variables _ERSQ_ , _RATIO_ , _LOGR_ , and _CCC_ have missing values when the number of clusters is greater than one-fifth the number of observations.

If the input data set contains coordinates and METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD, then the following variables appear in the output data set.

  • _DIST_ , the Euclidean distance between the means of the last clusters joined

  • _AVLINK_ , the average distance between the last clusters joined

If the input data set contains coordinates or METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD, then the following variables appear in the output data set:

  • _RMSSTD_ , the root-mean-square standard deviation of the current cluster

  • _SPRSQ_ , the semipartial squared multiple correlation or the decrease in the proportion of variance accounted for due to joining two clusters to form the current cluster

  • _RSQ_ , the squared multiple correlation

  • _PSF_ , the pseudo F statistic

  • _PST2_ , the pseudo $t^2$ statistic

If METHOD=EML, then the following variable appears in the output data set:

  • _LNLR_ , the log-likelihood ratio

If METHOD=TWOSTAGE or METHOD=DENSITY, the following variable appears in the output data set:

  • _MODE_ , pertaining to the modal clusters. With METHOD=DENSITY, the _MODE_ variable indicates the number of modal clusters contained by the current cluster. With METHOD=TWOSTAGE, the _MODE_ variable gives the maximum density in each modal cluster and the fusion density, d *, for clusters containing two or more modal clusters; for clusters containing no modal clusters, _MODE_ is missing.

If nonparametric density estimates are requested (when METHOD=DENSITY or METHOD=TWOSTAGE and the HYBRID option is not used; or when the TRIM= option is used), the output data set contains

  • _DENS_ , the maximum density in the current cluster

Displayed Output

If you specify the SIMPLE option and the data are coordinates, PROC CLUSTER produces simple descriptive statistics for each variable:

  • the Mean

  • the standard deviation, Std Dev

  • the Skewness

  • the Kurtosis

  • a coefficient of Bimodality

If the data are coordinates and you do not specify the NOEIGEN option, PROC CLUSTER displays

  • the Eigenvalues of the Correlation or Covariance Matrix

  • the Difference between successive eigenvalues

  • the Proportion of variance explained by each eigenvalue

  • the Cumulative proportion of variance explained

If the data are coordinates, PROC CLUSTER displays the Root-Mean-Square Total-Sample Standard Deviation of the variables.

If the distances are normalized, PROC CLUSTER displays one of the following, depending on whether squared or unsquared distances are used:

  • the Root-Mean-Square Distance Between Observations

  • the Mean Distance Between Observations

For the generations in the clustering process specified by the PRINT= option, PROC CLUSTER displays

  • the Number of Clusters or NCL

  • the names of the Clusters Joined. The observations are identified by the formatted value of the ID variable, if any; otherwise, the observations are identified by OBn, where n is the observation number. The CLUSTER procedure displays the entire value of the ID variable in the cluster history instead of truncating at 16 characters. Long ID values may be flowed onto several lines. Clusters of two or more observations are identified as CLn, where n is the number of clusters existing after the cluster in question is formed.

  • the number of observations in the new cluster, Frequency of New Cluster or FREQ

If you specify the RMSSTD option and if the data are coordinates or if you specify METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD, then PROC CLUSTER displays the root-mean-square standard deviation of the new cluster, RMS Std of New Cluster or RMS Std.

PROC CLUSTER displays the following items if you specify METHOD=WARD. It also displays them if you specify the RSQUARE option and either the data are coordinates or you specify METHOD=AVERAGE or METHOD=CENTROID:

  • the decrease in the proportion of variance accounted for resulting from joining the two clusters, Semipartial R-Squared or SPRSQ. This equals the between-cluster sum of squares divided by the corrected total sum of squares.

  • the squared multiple correlation, R-Squared or RSQ. $R^2$ is the proportion of variance accounted for by the clusters.

If you specify the CCC option and the data are coordinates, PROC CLUSTER displays

  • Approximate Expected R-Squared or ERSQ, the approximate expected value of $R^2$ under the uniform null hypothesis

  • the Cubic Clustering Criterion or CCC. The cubic clustering criterion and approximate expected $R^2$ are given missing values when the number of clusters is greater than one-fifth the number of observations.

If you specify the PSEUDO option and if the data are coordinates or METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD, then PROC CLUSTER displays

  • Pseudo F or PSF, the pseudo F statistic measuring the separation among all the clusters at the current level

  • Pseudo $t^2$ or PST2, the pseudo $t^2$ statistic measuring the separation between the two clusters most recently joined

If you specify the NOSQUARE option and METHOD=AVERAGE, PROC CLUSTER displays the (Normalized) Average Distance or (Norm) Aver Dist, the average distance between pairs of objects in the two clusters joined with one object from each cluster.

If you do not specify the NOSQUARE option and METHOD=AVERAGE, PROC CLUSTER displays the (Normalized) RMS Distance or (Norm) RMS Dist, the root-mean-square distance between pairs of objects in the two clusters joined with one object from each cluster.

If METHOD=CENTROID, PROC CLUSTER displays the (Normalized) Centroid Distance or (Norm) Cent Dist, the distance between the two cluster centroids.

If METHOD=COMPLETE, PROC CLUSTER displays the (Normalized) Maximum Distance or (Norm) Max Dist, the maximum distance between the two clusters.

If METHOD=DENSITY or METHOD=TWOSTAGE, PROC CLUSTER displays

  • Normalized Fusion Density or Normalized Fusion Dens, the value of $d^*$ as defined in the section "Clustering Methods" on page 975

  • the Normalized Maximum Density in Each Cluster joined, including the Lesser or Min, and the Greater or Max, of the two maximum density values

If METHOD=EML, PROC CLUSTER displays

  • Log Likelihood Ratio or LNLR

  • Log Likelihood or LNLIKE

If METHOD=FLEXIBLE, PROC CLUSTER displays the (Normalized) Flexible Distance or (Norm) Flex Dist, the distance between the two clusters based on the Lance-Williams flexible formula.

If METHOD=MEDIAN, PROC CLUSTER displays the (Normalized) Median Distance or (Norm) Med Dist, the distance between the two clusters based on the median method.

If METHOD=MCQUITTY, PROC CLUSTER displays the (Normalized) McQuitty's Similarity or (Norm) MCQ, the distance between the two clusters based on McQuitty's similarity method.

If METHOD=SINGLE, PROC CLUSTER displays the (Normalized) Minimum Distance or (Norm) Min Dist, the minimum distance between the two clusters.

If you specify the NONORM option and METHOD=WARD, PROC CLUSTER displays the Between-Cluster Sum of Squares or BSS, the ANOVA sum of squares between the two clusters joined.

If you specify neither the NOTIE option nor METHOD=TWOSTAGE or METHOD=DENSITY, PROC CLUSTER displays Tie, where a T in the column indicates a tie for minimum distance and a blank indicates the absence of a tie.

After the cluster history, if METHOD=TWOSTAGE or METHOD=DENSITY, PROC CLUSTER displays the number of modal clusters.

ODS Table Names

PROC CLUSTER assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 23.3: ODS Tables Produced in PROC CLUSTER

ODS Table Name     Description                                                         Statement   Option
----------------------------------------------------------------------------------------------------------
ClusterHistory     Obs or clusters joined, frequencies and other cluster statistics    PROC        default
SimpleStatistics   Simple statistics, before or after trimming                         PROC        SIMPLE
EigenvalueTable    Eigenvalues of the CORR or COV matrix                               PROC        default
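For example, the following statements capture the cluster history in a data set (a sketch; the data set and variable names are hypothetical):

   ods output ClusterHistory=hist;

   proc cluster data=mydata method=ward;
      var x1-x5;
   run;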



