Details


Density Estimation

Refer to Silverman (1986) or Scott (1992) for an introduction to nonparametric density estimation.

PROC MODECLUS uses (hyper)spherical uniform kernels of fixed or variable radius. The density estimate at a point is computed by dividing the number of observations within a sphere centered at the point by the product of the sample size and the volume of the sphere. The size of the sphere is determined by the smoothing parameters that you specify.

For fixed-radius kernels, specify the radius as a Euclidean distance with either the DR= or R= option. For variable-radius kernels, specify the number of neighbors desired within the sphere with either the DK= or K= option; the radius is then the smallest radius that contains at least the specified number of observations including the observation at which the density is being estimated. If you specify both the DR= or R= option and the DK= or K= option, the radius used is the maximum of the two indicated radii; this is useful for dealing with outliers.
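The radius rule just described (a fixed radius, a smallest radius containing k observations, or the maximum of the two when both are given) can be sketched as follows. This is an illustrative Python function with a hypothetical name, not part of SAS:

```python
import math

def neighborhood_radius(points, i, k=None, r=None):
    """Radius of the neighborhood of points[i]: the smallest radius
    containing at least k observations (counting point i itself),
    combined with a fixed radius r by taking the maximum when both
    are specified. Illustrative sketch only; names are not SAS's."""
    dists = sorted(math.dist(points[i], p) for p in points)
    # dists[0] == 0.0 is the distance from point i to itself
    radius = 0.0
    if k is not None:
        radius = dists[k - 1]      # k-th smallest distance, self included
    if r is not None:
        radius = max(radius, r)
    return radius
```

Taking the maximum of the two radii guarantees that an isolated observation still acquires at least k neighbors, which is why this combination is useful for outliers.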

It is convenient to refer to the sphere of support of the kernel at observation x_i as the neighborhood of x_i. The observations within the neighborhood of x_i are the neighbors of x_i. In some contexts, x_i is considered a neighbor of itself, but in other contexts it is not. The following notation is used in this chapter.

x_i       the ith observation

d(x, y)   the distance between points x and y

n         the total number of observations in the sample

n_i       the number of observations within the neighborhood of x_i, including x_i itself

n_i^-     the number of observations within the neighborhood of x_i, not including x_i itself

N_i       the set of indices of neighbors of x_i, including i

N_i^-     the set of indices of neighbors of x_i, not including i

v_i       the volume of the neighborhood of x_i

f̂_i       the estimated density at x_i

f̂_-i      the cross-validated density estimate at x_i

C_k       the set of indices of observations assigned to cluster k

v         the number of variables or the dimensionality

s_l       the standard deviation of the lth variable

The estimated density at x_i is

    f̂_i = n_i / (n v_i)

that is, the number of neighbors of x_i divided by the product of the sample size and the volume of the neighborhood at x_i.
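The estimate amounts to counting neighbors and dividing by n times the volume of a v-dimensional sphere. A minimal Python sketch for coordinate data and a fixed radius (illustrative only, not SAS's implementation):

```python
import math

def sphere_volume(r, v):
    # Volume of a v-dimensional hypersphere of radius r.
    return math.pi ** (v / 2) * r ** v / math.gamma(v / 2 + 1)

def uniform_kernel_density(points, r):
    """Fixed-radius uniform-kernel estimates f_i = n_i / (n * v_i),
    where n_i counts the observations within radius r of points[i],
    including the point itself. Sketch of the idea only."""
    n = len(points)
    vol = sphere_volume(r, len(points[0]))
    dens = []
    for x in points:
        n_i = sum(1 for y in points if math.dist(x, y) <= r)
        dens.append(n_i / (n * vol))
    return dens
```

In one dimension the "sphere" is an interval of length 2r, so sphere_volume(r, 1) is 2r.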

The density estimates provided by uniform kernels are not quite as good as those provided by some other types of kernels, but they are quite satisfactory for clustering. The significance tests for the number of clusters require the use of fixed-size uniform kernels.

There is no simple answer to the question of which smoothing parameter to use (Silverman 1986, pp. 43–61, 84–88, 98–99). It is usually necessary to try several different smoothing parameters. A reasonable first guess for the K= option is in the range of 0.1 to 1 times n^{4/(v+4)}, smaller values being suitable for higher dimensionalities. A reasonable first guess for the R= option in many coordinate data sets is given by

    r = [ 2^{v+2} (v+2) Γ(v/2 + 1) / (n v^2) ]^{1/(v+4)} sqrt( s_1^2 + ··· + s_v^2 )

which can be computed in a DATA step using the GAMMA function for Γ(·). The MODECLUS procedure also provides this first guess as a default smoothing parameter if none of the options (DR=, CR=, R=, DK=, CK=, and K=) is specified. This formula is derived under the assumption that the data are sampled from a multivariate normal distribution and therefore tends to be too large (to oversmooth) if the true distribution is multimodal. Robust estimates of the standard deviations may be preferable if there are outliers. If the data are distances, the factor sqrt(s_1^2 + ··· + s_v^2) can be replaced by an average root-mean-square Euclidean distance divided by sqrt(2). To prevent outliers from appearing as separate clusters, you can also specify K=2 or CK=2 or, more generally, K=m or CK=m for m ≥ 2, which in most cases forces clusters to have at least m members.
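The default formula can be evaluated directly; with all standard deviations equal to 1 the following Python sketch reproduces the entries of Table 47.2. (The function name is hypothetical; in SAS you would use a DATA step with the GAMMA function.)

```python
import math

def first_guess_radius(n, v, stds):
    """Reasonable first guess for the R= smoothing parameter,
    derived under a multivariate-normal assumption; stds holds the
    standard deviation of each of the v variables."""
    factor = (2 ** (v + 2) * (v + 2) * math.gamma(v / 2 + 1)
              / (n * v ** 2)) ** (1 / (v + 4))
    return factor * math.sqrt(sum(s * s for s in stds))
```

For example, with n = 20 observations on one standardized variable the formula gives about 1.01, matching the first cell of Table 47.2.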

If the variables all have unit variance (for example, if you specify the STD option), you can use Table 47.2 to obtain an initial guess for the R= option.

Table 47.2: Reasonable First Guess for R= for Standardized Data

                         Number of Variables
Number of Obs    1     2     3     4     5     6     7     8     9    10
     20       1.01  1.36  1.77  2.23  2.73  3.25  3.81  4.38  4.98  5.60
     35       0.91  1.24  1.64  2.08  2.56  3.08  3.62  4.18  4.77  5.38
     50       0.84  1.17  1.56  1.99  2.46  2.97  3.50  4.06  4.64  5.24
     75       0.78  1.09  1.47  1.89  2.35  2.85  3.38  3.93  4.50  5.09
    100       0.73  1.04  1.41  1.82  2.28  2.77  3.29  3.83  4.40  4.99
    150       0.68  0.97  1.33  1.73  2.18  2.66  3.17  3.71  4.27  4.85
    200       0.64  0.93  1.28  1.67  2.11  2.58  3.09  3.62  4.17  4.75
    350       0.57  0.85  1.18  1.56  1.98  2.44  2.93  3.45  4.00  4.56
    500       0.53  0.80  1.12  1.49  1.91  2.36  2.84  3.35  3.89  4.45
    750       0.49  0.74  1.06  1.42  1.82  2.26  2.74  3.24  3.77  4.32
   1000       0.46  0.71  1.01  1.37  1.77  2.20  2.67  3.16  3.69  4.23
   1500       0.43  0.66  0.96  1.30  1.69  2.11  2.57  3.06  3.57  4.11
   2000       0.40  0.63  0.92  1.25  1.63  2.05  2.50  2.99  3.49  4.03

One data-based method for choosing the smoothing parameter is likelihood cross validation (Silverman 1986, pp. 52–55). The cross-validated density estimate at an observation is obtained by omitting the observation from the computations.

The (log) likelihood cross-validation criterion is then computed as

    Σ_{i=1}^{n} log f̂_-i

where f̂_-i is the cross-validated density estimate at x_i.

The suggested smoothing parameter is the one that maximizes this criterion. With fixed-radius kernels, likelihood cross validation oversmooths long-tailed distributions; for purposes of clustering, it tends to undersmooth short-tailed distributions. With k-nearest-neighbor density estimation, likelihood cross validation is useless because it almost always indicates k = 2.
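As a sketch of how this criterion could be evaluated over a grid of candidate radii with fixed-radius uniform kernels, consider the following Python code. The leave-one-out estimate used here (neighbors excluding the point, divided by (n − 1) times the sphere volume) is an assumption for illustration, not SAS's exact formula:

```python
import math

def cv_criterion(points, r):
    """Likelihood cross-validation criterion for a fixed radius r:
    the sum of log leave-one-out density estimates. An observation
    with no neighbors within r drives the criterion to -infinity,
    one reason this method oversmooths long-tailed distributions."""
    n = len(points)
    v = len(points[0])
    vol = math.pi ** (v / 2) * r ** v / math.gamma(v / 2 + 1)
    total = 0.0
    for i, x in enumerate(points):
        n_minus = sum(1 for j, y in enumerate(points)
                      if j != i and math.dist(x, y) <= r)
        if n_minus == 0:
            return float("-inf")
        total += math.log(n_minus / ((n - 1) * vol))
    return total

def best_radius(points, radii):
    # Pick the candidate radius that maximizes the criterion.
    return max(radii, key=lambda r: cv_criterion(points, r))
```

On two tight, well-separated groups the criterion favors a small radius, which illustrates the tendency toward undersmoothing noted above.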

Cascaded density estimates are obtained by computing initial kernel density estimates and then, at each observation, taking the arithmetic mean, harmonic mean, or sum of the initial density estimates of the observations within the neighborhood. The cascaded density estimates can, in turn, be cascaded, and so on. Let f̂_i^(k) be the density estimate at x_i cascaded k times. For all types of cascading, f̂_i^(0) = f̂_i. If the cascading is done by arithmetic means, then, for k ≥ 0,

    f̂_i^(k+1) = ( Σ_{j ∈ N_i} f̂_j^(k) ) / n_i

For harmonic means,

    f̂_i^(k+1) = ( ( Σ_{j ∈ N_i} 1 / f̂_j^(k) ) / n_i )^{-1}

and for sums,

    f̂_i^(k+1) = Σ_{j ∈ N_i} f̂_j^(k)

To avoid cluttering formulas, the symbol f̂_i is used from now on to denote the density estimate at x_i whether cascaded or not, since the clustering methods and significance tests do not depend on the degree of cascading.

Cascading increases the smoothness of the estimates with less computation than would be required by increasing the smoothing parameters to yield a comparable degree of smoothness. For population densities with bounded support and discontinuities at the boundaries, cascading improves estimates near the boundaries. Cascaded estimates, especially using sums, may be more sensitive to the local covariance structure of the distribution than are the uncascaded kernel estimates. Cascading seems to be useful for detecting very nonspherical clusters. Cascading was suggested by Tukey and Tukey (1981, p. 237). Additional research into the properties of cascaded density estimates is needed.
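The three cascading rules are easy to state in code. The following Python sketch (illustrative only; this is not SAS's implementation) applies one cascading step to a vector of density estimates, given precomputed neighbor lists; each observation is taken here to be a neighbor of itself, which is one of the two conventions the text allows:

```python
def cascade_once(density, neighbors, how="arithmetic"):
    """One cascading step: replace each density estimate by the
    arithmetic mean, harmonic mean, or sum of the estimates in its
    neighborhood. `neighbors[i]` lists the indices of the neighbors
    of observation i (including i itself in this sketch)."""
    out = []
    for i, nbrs in enumerate(neighbors):
        vals = [density[j] for j in nbrs]
        if how == "arithmetic":
            out.append(sum(vals) / len(vals))
        elif how == "harmonic":
            out.append(len(vals) / sum(1.0 / f for f in vals))
        else:  # "sum"
            out.append(sum(vals))
    return out
```

Calling the function repeatedly cascades the estimates further, mirroring the recursion f̂_i^(k) → f̂_i^(k+1) above.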

Clustering Methods

The number of clusters is a function of the smoothing parameters. The number of clusters tends to decrease as the smoothing parameters increase, but the relationship is not strictly monotonic. Generally, you should specify several different values of the smoothing parameters to see how the number of clusters varies.

The clustering methods used by PROC MODECLUS use spherical clustering neighborhoods of fixed or variable radius that are similar to the spherical kernels used for density estimation. For fixed-radius neighborhoods, specify the radius as a Euclidean distance with either the CR= or R= option. For variable-radius neighborhoods, specify the number of neighbors desired within the sphere with either the CK= or K= option; the radius is then the smallest radius that contains at least the specified number of observations including the observation for which the neighborhood is being determined. However, in the following descriptions of clustering methods, an observation is not considered to be one of its own neighbors. If you specify both the CR= or R= option and the CK= or K= option, the radius used is the maximum of the two indicated radii; this is useful for dealing with outliers. In this section, the symbols N_i, N_i^-, n_i, and n_i^- refer to clustering neighborhoods, not density estimation neighborhoods.

METHOD=0

Begin with each observation in a separate cluster. For each observation and each of its neighbors, join the cluster to which the observation belongs with the cluster to which the neighbor belongs. This method does not use density estimates. With a fixed clustering radius, the clusters are those obtained by cutting the single linkage tree at the specified radius (see Chapter 23, "The CLUSTER Procedure").

METHOD=1

Begin with each observation in a separate cluster. For each observation, find the nearest neighbor with a greater estimated density. If such a neighbor exists, join the cluster to which the observation belongs with the cluster to which the specified neighbor belongs.

Next, consider each observation with density estimates equal to that of one or more neighbors but not less than the estimate at any neighbor. Join the cluster containing the observation with (1) each cluster containing a neighbor of the observation such that the maximum density estimate in the cluster equals the density estimate at the observation and (2) the cluster containing the nearest neighbor of the observation such that the maximum density estimate in the cluster exceeds the density estimate at the observation.

This method is similar to the classification or assignment stage of algorithms described by Gitman (1973) and Huizinga (1978).
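The first step of METHOD=1 (link each observation to its nearest neighbor of strictly greater estimated density, then merge the linked clusters) can be sketched in Python as follows. The tie-breaking rules for density plateaus described above are omitted for brevity, and the function names are hypothetical:

```python
import math

def valley_seek(points, density, r):
    """Sketch of the core of METHOD=1: each observation is linked to
    its nearest neighbor within radius r (self excluded) that has a
    strictly greater estimated density; linked observations end up
    in the same cluster. Returns a cluster label per observation."""
    n = len(points)
    parent = list(range(n))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        best, best_d = None, None
        for j in range(n):
            d = math.dist(points[i], points[j])
            if j != i and d <= r and density[j] > density[i]:
                if best is None or d < best_d:
                    best, best_d = j, d
        if best is not None:
            parent[find(i)] = find(best)

    roots = [find(i) for i in range(n)]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]
```

Each local density maximum that is not linked to anything denser becomes the root of its own cluster, so the number of clusters equals the number of such modes at the given radius.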

METHOD=2

Begin with each observation in a separate cluster. For each observation, find the neighbor with the greatest estimated density exceeding the estimated density of the observation. If such a neighbor exists, join the cluster to which the observation belongs with the cluster to which the specified neighbor belongs.

Observations with density estimates equal to that of one or more neighbors but not less than the estimate at any neighbor are treated the same way as they are in METHOD=1.

This method is similar to the first stage of an algorithm proposed by Mizoguchi and Shimura (1980).

METHOD=3

Begin with each observation in a separate cluster. For each observation, find the neighbor with greater estimated density such that the slope of the line connecting the point on the estimated density surface at the observation with the point on the estimated density surface at the neighbor is a maximum. That is, for observation x_i, find a neighbor x_j such that (f̂_j − f̂_i) / d(x_j, x_i) is a maximum. If this slope is positive, join the cluster to which observation x_i belongs with the cluster to which the specified neighbor x_j belongs. This method was invented by Koontz, Narendra, and Fukunaga (1976).

Observations with density estimates equal to that of one or more neighbors but not less than the estimate at any neighbor are treated the same way as they are in METHOD=1. The algorithm suggested for this situation by Koontz, Narendra, and Fukunaga (1976) may fail for flat areas in the estimated density that contain four or more observations.

METHOD=4

This method is equivalent to the first stage of two-stage density linkage (see Chapter 23, "The CLUSTER Procedure") without the use of the MODE= option.

METHOD=5

This method is equivalent to the first stage of two-stage density linkage (see Chapter 23, "The CLUSTER Procedure") with the use of the MODE= option.

METHOD=6

Begin with all observations unassigned.

Step 1: Form a list of seeds, each seed being a single observation such that the estimated density of the observation is not less than the estimated density of any of its neighbors. If you specify the MAXCLUSTERS=n option, retain only the n seeds with the greatest estimated densities.

Step 2: Consider each seed in decreasing order of estimated density.

  1. If the current seed has already been assigned, proceed to the next seed. Otherwise, form a new cluster consisting of the current seed.

  2. Add to the cluster any unassigned seed that is a neighbor of a member of the cluster or that shares a neighbor with a member of the cluster; repeat until no unassigned seed satisfies these conditions.

  3. Add to the cluster all neighbors of seeds that belong to the cluster.

  4. Consider each unassigned observation. Compute the ratio of the sum of the (p − 1)th powers of the estimated densities of the neighbors that belong to the current cluster to the sum of the (p − 1)th powers of the estimated densities of all of its neighbors, where p is specified by the POWER= option and is 2 by default. Let x_i be the current observation, and let k be the index of the current cluster. Then this ratio is

    r_ik = ( Σ_{j ∈ N_i ∩ C_k} f̂_j^{p−1} ) / ( Σ_{j ∈ N_i} f̂_j^{p−1} )

  (The sum of the (p − 1)th powers of the estimated densities of the neighbors of an observation is an estimate of the integral of the pth power of the density over the neighborhood.) If r_ik exceeds the maximum of 0.5 and the value of the THRESHOLD= option, add the observation x_i to the current cluster k. Repeat until no more observations can be added to the current cluster.

Step 3: (This step is performed only if the value of the THRESHOLD= option is less than 0.5.) Form a list of unassigned observations in decreasing order of estimated density. Repeat the following actions until the list is empty.

  1. Remove the first observation from the list, say, observation x_i.

  2. For each cluster k, compute r_ik.

  3. If the maximum over clusters of r_ik exceeds the value of the THRESHOLD= option, assign observation x_i to the corresponding cluster and insert all observations of which the current observation is a neighbor into the list, keeping the list in decreasing order of estimated density.

METHOD=6 is related to a method invented by Koontz and Fukunaga (1972a) and discussed by Koontz and Fukunaga (1972b).
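The ratio r_ik used in steps 2 and 3 of METHOD=6 is simple to compute once density estimates and neighbor lists are available. A minimal Python sketch with hypothetical names (not SAS code):

```python
def membership_ratio(density, neighbors_of_i, cluster_members, p=2):
    """The ratio r_ik used by METHOD=6: the sum of the (p-1)th powers
    of the estimated densities of those neighbors of x_i that belong
    to cluster k, divided by the same sum over all neighbors of x_i.
    POWER= defaults to 2, so by default this is a ratio of plain sums."""
    num = sum(density[j] ** (p - 1) for j in neighbors_of_i
              if j in cluster_members)
    den = sum(density[j] ** (p - 1) for j in neighbors_of_i)
    return num / den
```

With the default p = 2, an observation whose high-density neighbors lie mostly inside cluster k gets a ratio near 1 and is absorbed; raising p weights dense neighbors more heavily.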

Significance Tests

Significance tests require that a fixed-radius kernel be specified for density estimation via the DR= or R= option. You can also specify the DK= or K= option, but only the fixed radius is used for the significance tests.

The purpose of the significance tests is as follows: given a simple random sample of objects from a population, obtain an estimate of the number of clusters in the population such that the probability in repeated sampling that the estimate exceeds the true number of clusters is not much greater than α, 1% ≤ α ≤ 10%. In other words, a sequence of null hypotheses of the form

  • H0(i): the number of population clusters is i or less

where i = 1, 2, ..., n, is tested against alternatives such as

  • H1(i): the number of population clusters exceeds i

with a maximum experimentwise error rate of approximately α. The tests protect you from overestimating the number of population clusters. It is impossible to protect against underestimating the number of population clusters without introducing much stronger assumptions than are used here, since the number of population clusters could conceivably exceed the sample size.

The method for conducting significance tests is as follows:

  1. Estimate densities using fixed-radius uniform kernels.

  2. Obtain preliminary clusters by a valley-seeking method. Other clustering methods could be used but would yield less power.

  3. Compute an approximate p -value for each cluster by comparing the estimated maximum density in the cluster with the estimated maximum density on the cluster boundary.

  4. Repeatedly join the least significant cluster with a neighboring cluster until all remaining clusters are significant.

  5. Estimate the number of population clusters as the number of significant sample clusters.

  6. The preceding steps can be repeated for any number of different radii, and the estimate of the number of population clusters can be taken to be the maximum number of significant sample clusters for any radius.

This method has the following useful features:

  • No distributional assumptions are required.

  • The choice of smoothing parameter is not critical since you can try any number of different values.

  • The data can be coordinates or distances.

  • Time and space requirements for the significance tests are no worse than those for obtaining the clusters.

  • The power is high enough to be useful for practical purposes.

The method for computing the p -values is based on a series of plausible approximations. There are as yet no rigorous proofs that the method is infallible. Neither are there any asymptotic results. However, simulations for sample sizes ranging from 20 to 2000 indicate that the p -values are almost always conservative. The only case discovered so far in which the p -values are liberal is a uniform distribution in one dimension for which the simulated error rates exceed the nominal significance level only slightly for a limited range of sample sizes.

To make inferences regarding population clusters, it is first necessary to define what is meant by a cluster. For clustering methods using nonparametric density estimation, a cluster is usually loosely defined as a region surrounding a local maximum of the probability density function or a maximal connected set of local maxima. This definition may not be satisfactory for very rough densities with many local maxima. It is not applicable at all to discrete distributions for which the density does not exist. As another example in which this definition is not intuitively reasonable, consider a uniform distribution in two dimensions with support in the shape of a figure eight (including the interior). This density might be considered to contain two clusters even though it does not have two distinct modes.

These difficulties can be avoided by defining clusters in terms of the local maxima of a smoothed probability density or mass function. For example, define the neighborhood distribution function (NDF) with radius r at a point x as the probability that a randomly selected point will lie within a radius r of x , that is, the probability integral over a hypersphere of radius r centered at x :

    NDF(x) = P( d(x, X) ≤ r )

where X is the random variable being sampled, r is a user-specified radius, and d(x, y) is the distance between points x and y.

The NDF exists for all probability distributions. You can select the radius according to the degree of resolution required. The minimum-variance unbiased estimate of the NDF at a point x is proportional to the uniform-kernel density estimate with corresponding support.
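The minimum-variance unbiased estimate mentioned above is simply the fraction of the sample lying within radius r of x, as this small Python sketch illustrates (names are illustrative only):

```python
import math

def ndf_estimate(sample, x, r):
    """Uniform-kernel estimate of the neighborhood distribution
    function at x: the fraction of sampled points within radius r
    of x. This is proportional to the uniform-kernel density
    estimate with the same support."""
    inside = sum(1 for y in sample if math.dist(x, y) <= r)
    return inside / len(sample)
```

Dividing this fraction by the volume of the radius-r sphere recovers the uniform-kernel density estimate, which is why the two are proportional.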

You can define a modal region as a maximal connected set of local maxima of the NDF. A cluster is a connected set containing exactly one modal region. This definition seems to give intuitively reasonable results in most cases. An exception is a uniform density on the perimeter of a square. The NDF has eight local maxima along the perimeter, two near each corner, but running PROC MODECLUS with the R= option would yield four clusters since the two local maxima at each corner are separated by a distance equal to the radius. While this density does indeed have four distinctive features (the corners), it is not obvious that each corner should be considered a cluster.

The number of population clusters depends on the radius of the NDF. The significance tests in PROC MODECLUS protect against overestimating the number of clusters at any specified radius. It is often useful to look at the clustering results across a range of radii. A plot of the number of sample clusters as a function of the radius is a useful descriptive display, especially for high-dimensional data (Wong and Schaack 1982).

If a population has two clusters, it must have two modal regions. If there are two modal regions , there must be a valley between them. It seems intuitively desirable that the boundary between the two clusters should follow the bottom of this valley. All the clustering methods in PROC MODECLUS are designed to locate the estimated cluster boundaries in this way, although methods 1 and 6 seem to be much more successful at this than the others. Regardless of the precise location of the cluster boundary, it is clear that the maximum of the NDF along the boundary between two clusters must be strictly less than the value of the NDF in either modal region; otherwise, there would be only a single modal region; according to Hartigan and Hartigan (1985), there must be a dip between the two modes. PROC MODECLUS assesses the significance of a sample cluster by comparing the NDF in the modal region with the maximum of the NDF along the cluster boundary. If the NDF has second-order derivatives in the region of interest and if the boundary between the two clusters is indeed at the bottom of the valley, then the maximum value of the NDF along the boundary occurs at a saddle point. Hence, this test is called a saddle test . This term is intended to describe any test for clusters that compares modal densities with saddle densities, not just the test currently implemented in the MODECLUS procedure.

The obvious estimate of the maximum NDF in a sample cluster is the maximum estimated NDF at an observation in the cluster. Let m ( k ) be the index of the observation for which the maximum is attained in cluster k .

Estimating the maximum NDF on the cluster boundary is more complicated. One approach is to take the maximum NDF estimate at an observation in the cluster that has a neighbor belonging to another cluster. This method yields excessively large estimates when the neighborhood is large. Another approach is to try to choose an object closer to the boundary by taking the observation with the maximum sum of estimated densities of neighbors belonging to a different cluster. After some experimentation, it is found that a combination of these two methods works well. Let B k be the set of indices of observations in cluster k that have neighbors belonging to a different cluster, and compute

    (display equation omitted in the source)

Let s ( k ) be the index of the observation for which the maximum is attained.

Using the notation #(S) for the cardinality of set S, let

    (display equations omitted in the source)

Let R(u) be a random variable distributed as the range of a random sample of u observations from a standard normal distribution. Then the approximate p-value p_k for cluster k is

    (display equation omitted in the source)

If points m(k) and s(k) were fixed a priori, z_k would be the usual approximately normal test statistic for comparing two binomial random variables. In fact, m(k) and s(k) are selected in such a way that c_m(k) tends to be large and c_s(k) tends to be small. For this reason, and because there may be a large number of clusters, each with its own z_k to be tested, each z_k is referred to the distribution of R(u) instead of a standard normal distribution. If the tests are conducted for only one radius and if u is chosen equal to n, then the p-values are very conservative because (1) you are not making all possible pairwise comparisons of observations in the sample and (2) the estimates being compared are positively correlated if the neighborhoods overlap. In the formula for u, the summation overcorrects somewhat for the conservativeness due to the correlated estimates. A factor in the formula is empirically estimated from simulation results to adjust for the use of more than one radius.

If the JOIN option is specified, the least significant cluster (the cluster with the smallest z k ) is either dissolved or joined with a neighboring cluster. If no members of the cluster have neighbors belonging to a different cluster, all members of the cluster are unassigned. Otherwise, the cluster is joined to the neighboring cluster such that the sum of density estimates of neighbors of the estimated saddle point belonging to it is a maximum. Joining clusters increases the power of the saddle test. For example, consider a population with two well-separated clusters. Suppose that, for a certain radius, each population cluster is divided into two sample clusters. None of the four sample clusters is likely to be significant, but after the two sample clusters corresponding to each population cluster are joined, the remaining two clusters may be highly significant.

The saddle test implemented in PROC MODECLUS has been evaluated by simulation from known distributions. Some results are given in the following three tables. In Table 47.3, samples of 20 to 2000 observations are generated from a one-dimensional uniform distribution. For sample sizes of 1000 or less, 2000 samples are generated and analyzed by PROC MODECLUS. For a sample size of 2000, only 1000 samples are generated. The analysis is done with at least 20 different values of the R= option spread across the range of radii most likely to yield significant results. The six central columns of the table give the observed error rates at the nominal error rates (α) at the head of each column. The standard errors of the observed error rates are given at the bottom of the table. The observed error rates are conservative for α ≤ 5%, but they increase with α and become slightly liberal for sample sizes in the middle of the range tested.

Table 47.3: Observed Error Rates (%) for Uniform Distribution

                    Nominal Type 1 Error Rate
Sample Size       1      2      5     10     15     20    Number of Simulations
      20       0.00   0.00   0.00   0.60  11.65  27.05    2000
      50       0.35   0.70   4.50  10.95  20.55  29.80    2000
     100       0.35   0.85   3.90  11.05  18.95  28.05    2000
     200       0.30   1.35   4.00  10.50  18.60  27.05    2000
     500       0.45   1.05   4.35   9.80  16.55  23.55    2000
    1000       0.70   1.30   4.65   9.55  15.45  19.95    2000
    2000       0.40   1.10   3.00   7.40  11.50  16.70    1000
Standard Error 0.22   0.31   0.49   0.67   0.80   0.89    2000
               0.31   0.44   0.69   0.95   1.13   1.26    1000

All unimodal distributions other than the uniform that have been tested, including normal, Cauchy, and exponential distributions and uniform mixtures, have produced much more conservative results. Table 47.4 displays results from a unimodal mixture of two normal distributions with equal variances and equal sampling probabilities and with means separated by two standard deviations. Any greater separation would produce a bimodal distribution. The observed error rates are quite conservative.

Table 47.4: Observed Error Rates (%) for Normal Mixture with 2σ Separation

                    Nominal Type 1 Error Rate
Sample Size       1      2      5     10     15     20    Number of Simulations
     100        0.0    0.0    0.0    1.0    2.0    4.0    200
     200        0.0    0.0    0.0    2.0    3.0    3.0    200
     500        0.0    0.0    0.5    0.5    0.5    0.5    200

All distributions in two or more dimensions that have been tested yield extremely conservative results. For example, a uniform distribution on a circle yields observed error rates that are never more than one-tenth of the nominal error rates for sample sizes up to 1000. This conservatism is due to the fact that, as the dimensionality increases, more and more of the probability lies in the tails of the distribution (Silverman 1986, p. 92), and the saddle test used by PROC MODECLUS is more conservative for distributions with pronounced tails. This applies even to a uniform distribution on a hypersphere because, although the density has no tails, the NDF does.

Since the formulas for the significance tests do not involve the dimensionality, no problems are created when the data are linearly dependent. Simulations of data in nonlinear subspaces (the circumference of a circle or surface of a sphere) have also yielded conservative results.

Table 47.5 displays results in terms of power for identifying two clusters in samples from a bimodal mixture of two normal distributions with equal variances and equal sampling probabilities separated by four standard deviations. In this simulation, PROC MODECLUS never indicated more than two significant clusters.

Table 47.5: Power (%) for Normal Mixture with 4σ Separation

                    Nominal Type 1 Error Rate
Sample Size       1      2      5     10     15     20    Number of Simulations
      20        0.0    0.0    0.0    2.0   37.5   68.5    200
      35        0.0   13.5   38.5   48.5   64.0   75.5    200
      50       17.5   26.0   51.5   67.0   78.5   84.0    200
      75       25.5   36.0   58.5   77.5   85.5   89.5    200
     100       40.0   54.5   72.5   84.5   91.5   92.5    200
     150       70.5   80.0   92.0   97.0  100.0  100.0    200
     200       89.0   96.0   99.5  100.0  100.0  100.0    200

The saddle test is not as efficient as excess-mass tests for multimodality (Müller and Sawitzki 1991; Polonik 1993). However, there is not yet a general approximation for the distribution of excess-mass statistics to circumvent the need for simulations to do significance tests. Refer to Minnotte (1992) for a review of tests for multimodality.

Computational Resources

The MODECLUS procedure stores coordinate data in memory if there is enough space. For distance data, only one observation at a time is in memory.

PROC MODECLUS constructs lists of the neighbors of each observation. The total space required is 12 ∑ n_i bytes, where n_i is based on the largest neighborhood required by any analysis. The lists are stored in a SAS utility data set unless you specify the CORE option. You may get an error message from the SAS System or from the operating system if there is not enough disk space for the utility data set. Clustering method 6 requires a second list that is always stored in memory.

For coordinate data, the time required to construct the neighbor lists is roughly proportional to v (log n)(∑ n_i) log(∑ n_i / n). For distance data, the time is roughly proportional to n^2 log(∑ n_i / n).

The time required for density estimation is proportional to ∑ n_i and is usually small compared to the time required for constructing the neighbor lists.

Clustering methods 0 through 3 are quite efficient, requiring time proportional to ∑ n_i. Methods 4 and 5 are slower, requiring time roughly proportional to (∑ n_i) log(∑ n_i). Method 6 can also be slow, but the time requirements depend very much on the data and the particular options specified. Methods 4, 5, and 6 also require more memory than the other methods.

The time required for significance tests is roughly proportional to g ∑ n_i, where g is the number of clusters.

PROC MODECLUS can process data sets of several thousand observations if you specify reasonable smoothing parameters. Very small smoothing values produce many clusters, whereas very large values produce many neighbors; either case can require excessive time or space.

Missing Values

If the data are coordinates, observations with missing values are excluded from the analysis.

If the data are distances, missing values are treated as infinite. The distances are not required to be symmetric, and there is no check for symmetry; the neighbors of each observation are determined solely from the distances in that observation. This treatment of missing values is quite different from that of the CLUSTER procedure, which ignores the upper triangle of the distance matrix.

Output Data Sets

The OUT= data set contains one complete copy of the input data set for each cluster solution. There are additional variables identifying each solution and giving information about individual observations. Solutions with only one remaining cluster when JOIN=p is specified are omitted from the OUT= data set (see the description of the JOIN= option on page 2866). The OUT= data set can be extremely large, so it may be advisable to specify the DROP= data set option to exclude unnecessary variables.

The OUTCLUS= or OUTC= data set contains one observation for each cluster in each cluster solution. The variables identify the solution and provide statistics describing the cluster.

The OUTSUM= or OUTS= data set contains one observation for each cluster solution. The variables identify the solution and provide information about the solution as a whole.

The following variables can appear in all of the output data sets:

  • _K_ , which is the value of the K= option for the current solution. This variable appears only if you specify the K= option.

  • _DK_ , which is the value of the DK= option for the current solution. This variable appears only if you specify the DK= option.

  • _CK_ , which is the value of the CK= option for the current solution. This variable appears only if you specify the CK= option.

  • _R_ , which is the value of the R= option for the current solution. This variable appears only if you specify the R= option.

  • _DR_ , which is the value of the DR= option for the current solution. This variable appears only if you specify the DR= option.

  • _CR_ , which is the value of the CR= option for the current solution. This variable appears only if you specify the CR= option.

  • _CASCAD_ , which is the number of times the density estimates have been cascaded for the current solution. This variable appears only if you specify the CASCADE= option.

  • _METHOD_ , which is the value of the METHOD= option for the current solution. This variable appears only if you specify the METHOD= option.

  • _NJOIN_ , which is the number of clusters that are joined or dissolved in the current solution. This variable appears only if you specify the JOIN option.

  • _LOCAL_ , which is the local dimensionality estimate of the observation. This variable appears only if you specify the LOCAL option.

The OUT= data set contains the following variables:

  • the variables from the input data set

  • _OBS_ , which is the observation number from the input data set. This variable appears only if you omit the ID statement.

  • DENSITY , which is the estimated density at the observation. This variable can be renamed by the DENSITY= option.

  • CLUSTER , which is the number of the cluster to which the observation is assigned. This variable can be renamed by the CLUSTER= option.

The OUTC= data set contains the following variables:

  • the BY variables, if any

  • _NCLUS_ , which is the number of clusters in the solution

  • CLUSTER , which is the number of the current cluster

  • _FREQ_ , which is the number of observations in the cluster

  • _MODE_ , which is the maximum estimated density in the cluster

  • _BFREQ_ , which is the number of observations in the cluster with neighbors belonging to a different cluster

  • _SADDLE_ , which is the estimated saddle density for the cluster

  • _MC_ , which is the number of observations within the fixed-radius density-estimation neighborhood of the modal observation. This variable appears only if you specify the TEST or JOIN option.

  • _SC_ , which is the number of observations within the fixed-radius density-estimation neighborhood of the saddle observation. This variable appears only if you specify the TEST or JOIN option.

  • _OC_ , which is the number of observations within the overlap of the two previous neighborhoods. This variable appears only if you specify the TEST or JOIN option.

  • _Z_ , which is the approximate z statistic for the cluster. This variable appears only if you specify the TEST or JOIN option.

  • _P_ , which is the approximate p-value for the cluster. This variable appears only if you specify the TEST or JOIN option.

The OUTS= data set contains the following variables:

  • the BY variables, if any

  • _NCLUS_ , which is the number of clusters in the solution

  • _UNCL_ , which is the number of unclassified observations

  • _CROSS_ , which is the likelihood cross-validation criterion if you specify the CROSS or CROSSLIST option

Displayed Output

If you specify the SIMPLE option and the data are coordinates, PROC MODECLUS displays the following simple descriptive statistics for each variable:

  • the MEAN

  • the standard deviation, STD DEV

  • the SKEWNESS

  • the KURTOSIS

  • a coefficient of BIMODALITY (see Chapter 23, "The CLUSTER Procedure")

If you specify the NEIGHBOR option, PROC MODECLUS displays a list of the neighbors of each observation. The table contains

  • the observation number or ID value of the observation

  • the observation number or ID value of each of its neighbors

  • the distance to each neighbor

If you specify the CROSSLIST option, PROC MODECLUS produces a table of information regarding cross validation of the density estimates. Each table has a row for each observation. For each observation, the following are displayed:

  • the observation number or ID value of the observation

  • the radius of the neighborhood

  • the number of neighbors

  • the estimated log density

  • the estimated cross-validated log density
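The cross-validated log density is based on a leave-one-out estimate: the observation is excluded from its own neighborhood (the n_i^- count defined at the start of this section) and the sample size is reduced to n-1. A Python sketch under that assumption (names are illustrative, not part of the procedure):

```python
import math

def cv_log_density(data, i, radius):
    """Leave-one-out (cross-validated) log density at observation i:
    observation i is excluded from its own neighborhood and the sample
    size is reduced by one, mirroring the usual cross-validation of
    kernel density estimates."""
    n = len(data)
    dim = len(data[i])
    volume = math.pi ** (dim / 2) * radius ** dim / math.gamma(dim / 2 + 1)
    count = sum(
        1 for j, x in enumerate(data)
        if j != i and math.dist(x, data[i]) <= radius
    )
    return math.log(count / ((n - 1) * volume))
```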

If you specify the LOCAL option, PROC MODECLUS produces a table of information regarding estimates of local dimensionality. Each table has a row for each observation. For each observation, the following are displayed:

  • the observation number or ID value of the observation

  • the radius of the neighborhood

  • the estimated local dimensionality

If you specify the LIST option, PROC MODECLUS produces a table listing the observations within each cluster. The table can include

  • the cluster number

  • the observation number or ID value of the observation

  • the estimated density

  • the sum of the density estimates of observations within the neighborhood that belong to the same cluster

  • the sum of the density estimates of observations within the neighborhood that belong to a different cluster

  • the sum of the density estimates of all the observations within the neighborhood

  • the ratio of the sum of the density estimates for the same cluster to the sum of all the density estimates in the neighborhood
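The ratio in the last item can be sketched as follows (the data representation is illustrative): for each observation, sum the density estimates of its same-cluster neighbors and divide by the sum over all of its neighbors.

```python
def same_cluster_ratio(density, cluster, neighbor_sets, i):
    """Ratio of the sum of density estimates of observation i's
    same-cluster neighbors to the sum over all of its neighbors.
    `neighbor_sets[i]` holds the indices of i's neighbors."""
    same = sum(density[j] for j in neighbor_sets[i]
               if cluster[j] == cluster[i])
    total = sum(density[j] for j in neighbor_sets[i])
    return same / total
```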

If you specify the LIST option and there are unassigned objects, PROC MODECLUS produces a table listing those observations. The table includes

  • the observation number or ID value of the observation

  • the estimated density

  • the ratio of the sum of the density estimates for the same cluster to the sum of the density estimates in the neighborhood for all other clusters

If you specify the BOUNDARY option, PROC MODECLUS produces a table listing the observations in each cluster that have a neighbor belonging to a different cluster. The table includes

  • the observation number or ID value of the observation

  • the estimated density

  • the cluster number

  • the ratio of the sum of the density estimates for the same cluster to the sum of the density estimates in the neighborhood for all other clusters

If you do not specify the SHORT option, PROC MODECLUS produces a table of cluster statistics including

  • the cluster number

  • the cluster frequency (the number of observations in the cluster)

  • the maximum estimated density within the cluster

  • the number of observations in the cluster having a neighbor that belongs to a different cluster

  • the estimated saddle density of the cluster

If you specify the TEST or JOIN option, the table of cluster statistics includes the following items pertaining to the saddle test:

  • the number of observations within the fixed-radius density-estimation neighborhood of the modal observation

  • the number of observations within the fixed-radius density-estimation neighborhood of the saddle observation

  • the number of observations within the overlap of the two preceding neighborhoods

  • the z statistic for comparing the preceding counts

  • the approximate p-value

If you do not specify the NOSUMMARY option, PROC MODECLUS produces a table summarizing each cluster solution containing the following items:

  • the smoothing parameters and cascade value

  • the number of clusters

  • the frequency of unclassified objects

  • the likelihood cross-validation criterion if you specify the CROSS or CROSSLIST option

If you specify the JOIN option, the summary table also includes

  • the number of clusters joined

  • the maximum p-value of any cluster in the solution

If you specify the TRACE option, PROC MODECLUS produces a table for each cluster solution that lists each observation along with its cluster membership as it is reassigned from the Old cluster to the New cluster. This reassignment is described in Step 1 through Step 3 of the section METHOD=6 on page 2875. Each table has a row for each observation, and an observation can be flagged with one of the following values:

M

means the observation is a mode

S

means the observation is a seed

N

means the observation is a neighbor of a mode or seed, for which the ratio is not computed

ODS Table Names

PROC MODECLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 47.6: ODS Tables Produced in PROC MODECLUS

ODS Table Name     Description                                        Statement   Option
BoundaryFreq       Boundary objects information                       PROC        BOUNDARY (or ALL)
ClusterList        Cluster listing: cluster id, frequency, density    PROC        LIST (or ALL)
ClusterStats       Cluster statistics                                 PROC        default
                   Cluster statistics, significance test statistics   PROC        TEST or JOIN (or ALL)
ClusterSummary     Cluster summary                                    PROC        default
                   Cluster summary, cross-validation criterion        PROC        CROSS or CROSSLIST (or ALL)
                   Cluster summary, clusters joined information       PROC        JOIN (or ALL)
CrossList          Cross-validated log density                        PROC        CROSSLIST
ListLocal          Local dimensionality estimates                     PROC        LOCAL
Neighbor           Nearest neighbor list                              PROC        NEIGHBOR (or ALL)
SimpleStatistics   Simple statistics                                  PROC        SIMPLE (or ALL)
Trace              Trace of clustering algorithm (METHOD=6 only)      PROC        TRACE (or ALL) with METHOD=6
UnassignObjects    Information on unassigned objects                  PROC        LIST (or ALL)




SAS/STAT 9.1 User's Guide (Vol. 4), 2004