Refer to Silverman (1986) or Scott (1992) for an introduction to nonparametric density estimation.
PROC MODECLUS uses (hyper)spherical uniform kernels of fixed or variable radius. The density estimate at a point is computed by dividing the number of observations within a sphere centered at the point by the product of the sample size and the volume of the sphere. The size of the sphere is determined by the smoothing parameters that you are required to specify.
For fixed-radius kernels, specify the radius as a Euclidean distance with either the DR= or R= option. For variable-radius kernels, specify the number of neighbors desired within the sphere with either the DK= or K= option; the radius is then the smallest radius that contains at least the specified number of observations including the observation at which the density is being estimated. If you specify both the DR= or R= option and the DK= or K= option, the radius used is the maximum of the two indicated radii; this is useful for dealing with outliers.
It is convenient to refer to the sphere of support of the kernel at observation $x_i$ as the neighborhood of $x_i$. The observations within the neighborhood of $x_i$ are the neighbors of $x_i$. In some contexts, $x_i$ is considered a neighbor of itself, but in other contexts it is not. The following notation is used in this chapter.

| $x_i$ | the $i$th observation |
| $d(x, y)$ | the distance between points $x$ and $y$ |
| $n$ | the total number of observations in the sample |
| $n_i$ | the number of observations within the neighborhood of $x_i$, including $x_i$ itself |
| $n_i^-$ | the number of observations within the neighborhood of $x_i$, not including $x_i$ itself |
| $N_i$ | the set of indices of neighbors of $x_i$, including $i$ |
| $N_i^-$ | the set of indices of neighbors of $x_i$, not including $i$ |
| $v_i$ | the volume of the neighborhood of $x_i$ |
| $\hat{f}_i$ | the estimated density at $x_i$ |
| $\hat{f}_{-i}$ | the cross-validated density estimate at $x_i$ |
| $C_k$ | the set of indices of observations assigned to cluster $k$ |
| $v$ | the number of variables, or the dimensionality |
| $s_l$ | the standard deviation of the $l$th variable |
The estimated density at $x_i$ is

$$\hat{f}_i = \frac{n_i}{n \, v_i}$$

that is, the number of neighbors of $x_i$ divided by the product of the sample size and the volume of the neighborhood at $x_i$.
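As a concrete illustration, the estimate above can be sketched in Python. This is a hypothetical helper, not part of PROC MODECLUS; the hypersphere volume is computed with the standard formula $\pi^{v/2} r^v / \Gamma(v/2+1)$.

```python
import math

# Fixed-radius uniform-kernel estimate: f_i = n_i / (n * v_i), where n_i
# counts the observations within radius r of point i (including i itself)
# and v_i is the volume of the v-dimensional sphere of radius r.
def uniform_kernel_density(points, r):
    n = len(points)
    v = len(points[0])  # dimensionality
    vol = math.pi ** (v / 2) * r ** v / math.gamma(v / 2 + 1)
    dens = []
    for x in points:
        n_i = sum(1 for y in points if math.dist(x, y) <= r)  # includes x
        dens.append(n_i / (n * vol))
    return dens
```

For example, with the 1-D sample {0, 0.1, 5} and r = 1, the first two observations each have two neighbors and the third has one, so the estimates are 1/3, 1/3, and 1/6.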
The density estimates provided by uniform kernels are not quite as good as those provided by some other types of kernels, but they are quite satisfactory for clustering. The significance tests for the number of clusters require the use of fixed-size uniform kernels.
There is no simple answer to the question of which smoothing parameter to use (Silverman 1986, pp. 43–61, 84–88, 98–99). It is usually necessary to try several different smoothing parameters. A reasonable first guess for the K= option is in the range of 0.1 to 1 times $n^{4/(v+4)}$, smaller values being suitable for higher dimensionalities. A reasonable first guess for the R= option in many coordinate data sets is given by
$$r = \left[ \frac{2^{v+2}\,(v+2)\,\Gamma\!\left(\frac{v}{2}+1\right)}{n v^2} \right]^{\frac{1}{v+4}} \sqrt{\sum_{l=1}^{v} s_l^2}$$

which can be computed in a DATA step using the GAMMA function for $\Gamma(\cdot)$. The MODECLUS procedure also provides this first guess as a default smoothing parameter if none of the options (DR=, CR=, R=, DK=, CK=, and K=) is specified. This formula is derived under the assumption that the data are sampled from a multivariate normal distribution and therefore tends to be too large (to oversmooth) if the true distribution is multimodal. Robust estimates of the standard deviations may be preferable if there are outliers. If the data are distances, the factor $\sqrt{\sum s_l^2}$ can be replaced by an average root-mean-square Euclidean distance divided by $\sqrt{2}$. To prevent outliers from appearing as separate clusters, you can also specify K=2 or CK=2 or, more generally, K=m or CK=m, $m \geq 2$, which in most cases forces clusters to have at least $m$ members.
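The first guess can be evaluated directly; the hypothetical Python helper below mirrors the DATA step computation with the GAMMA function, assuming the formula above.

```python
import math

# Normal-reference first guess for R=:
#   r = [2^(v+2) (v+2) Gamma(v/2 + 1) / (n v^2)]^(1/(v+4)) * sqrt(sum s_l^2)
# where n is the sample size and stds holds the v standard deviations.
def default_radius(n, stds):
    v = len(stds)
    base = 2 ** (v + 2) * (v + 2) * math.gamma(v / 2 + 1) / (n * v ** 2)
    return base ** (1 / (v + 4)) * math.sqrt(sum(s * s for s in stds))
```

For standardized data (unit standard deviations), `default_radius(20, [1.0])` is about 1.01 and `default_radius(2000, [1.0] * 10)` is about 4.03, matching the corresponding entries of Table 47.2.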
If the variables all have unit variance (for example, if you specify the STD option), you can use Table 47.2 to obtain an initial guess for the R= option.
| Number of Obs | v=1 | v=2 | v=3 | v=4 | v=5 | v=6 | v=7 | v=8 | v=9 | v=10 |
|---|---|---|---|---|---|---|---|---|---|---|
20 | 1.01 | 1.36 | 1.77 | 2.23 | 2.73 | 3.25 | 3.81 | 4.38 | 4.98 | 5.60 |
35 | 0.91 | 1.24 | 1.64 | 2.08 | 2.56 | 3.08 | 3.62 | 4.18 | 4.77 | 5.38 |
50 | 0.84 | 1.17 | 1.56 | 1.99 | 2.46 | 2.97 | 3.50 | 4.06 | 4.64 | 5.24 |
75 | 0.78 | 1.09 | 1.47 | 1.89 | 2.35 | 2.85 | 3.38 | 3.93 | 4.50 | 5.09 |
100 | 0.73 | 1.04 | 1.41 | 1.82 | 2.28 | 2.77 | 3.29 | 3.83 | 4.40 | 4.99 |
150 | 0.68 | 0.97 | 1.33 | 1.73 | 2.18 | 2.66 | 3.17 | 3.71 | 4.27 | 4.85 |
200 | 0.64 | 0.93 | 1.28 | 1.67 | 2.11 | 2.58 | 3.09 | 3.62 | 4.17 | 4.75 |
350 | 0.57 | 0.85 | 1.18 | 1.56 | 1.98 | 2.44 | 2.93 | 3.45 | 4.00 | 4.56 |
500 | 0.53 | 0.80 | 1.12 | 1.49 | 1.91 | 2.36 | 2.84 | 3.35 | 3.89 | 4.45 |
750 | 0.49 | 0.74 | 1.06 | 1.42 | 1.82 | 2.26 | 2.74 | 3.24 | 3.77 | 4.32 |
1000 | 0.46 | 0.71 | 1.01 | 1.37 | 1.77 | 2.20 | 2.67 | 3.16 | 3.69 | 4.23 |
1500 | 0.43 | 0.66 | 0.96 | 1.30 | 1.69 | 2.11 | 2.57 | 3.06 | 3.57 | 4.11 |
2000 | 0.40 | 0.63 | 0.92 | 1.25 | 1.63 | 2.05 | 2.50 | 2.99 | 3.49 | 4.03 |
One data-based method for choosing the smoothing parameter is likelihood cross validation (Silverman 1986, pp. 52–55). The cross-validated density estimate at an observation is obtained by omitting the observation from the computations. The (log) likelihood cross-validation criterion is then computed as

$$\frac{1}{n} \sum_{i=1}^{n} \log \hat{f}_{-i}$$
The suggested smoothing parameter is the one that maximizes this criterion. With fixed-radius kernels, likelihood cross validation oversmooths long-tailed distributions; for purposes of clustering, it tends to undersmooth short-tailed distributions. With k-nearest-neighbor density estimation, likelihood cross validation is useless because it almost always indicates k = 2.
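For fixed-radius uniform kernels, the criterion can be sketched as follows. This is a hypothetical helper: the use of n − 1 in the leave-one-out denominator and the convention of returning negative infinity when an observation has no other neighbors are assumptions of this sketch.

```python
import math

# Log-likelihood cross-validation score for a fixed radius r: sum the log of
# each observation's leave-one-out uniform-kernel density estimate.
def loglik_cv(points, r):
    n = len(points)
    v = len(points[0])
    vol = math.pi ** (v / 2) * r ** v / math.gamma(v / 2 + 1)
    total = 0.0
    for i, x in enumerate(points):
        # neighbors of x among the other n - 1 observations
        n_minus = sum(1 for j, y in enumerate(points)
                      if j != i and math.dist(x, y) <= r)
        if n_minus == 0:
            return float("-inf")  # leave-one-out estimate is 0; log undefined
        total += math.log(n_minus / ((n - 1) * vol))
    return total
```

Evaluating `loglik_cv` over a grid of radii and taking the maximizer gives the suggested smoothing parameter.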
Cascaded density estimates are obtained by computing initial kernel density estimates and then, at each observation, taking the arithmetic mean, harmonic mean, or sum of the initial density estimates of the observations within the neighborhood. The cascaded density estimates can, in turn, be cascaded, and so on. Let $\hat{f}_i^{(k)}$ be the density estimate at $x_i$ cascaded $k$ times. For all types of cascading, $\hat{f}_i^{(0)} = \hat{f}_i$. If the cascading is done by arithmetic means, then, for $k \geq 0$,

$$\hat{f}_i^{(k+1)} = \frac{1}{n_i} \sum_{j \in N_i} \hat{f}_j^{(k)}$$

For harmonic means,

$$\hat{f}_i^{(k+1)} = \left( \frac{1}{n_i} \sum_{j \in N_i} \left( \hat{f}_j^{(k)} \right)^{-1} \right)^{-1}$$

and for sums,

$$\hat{f}_i^{(k+1)} = \sum_{j \in N_i} \hat{f}_j^{(k)}$$
To avoid cluttering formulas, the symbol $\hat{f}_i$ is used from now on to denote the density estimate at $x_i$, whether cascaded or not, since the clustering methods and significance tests do not depend on the degree of cascading.
Cascading increases the smoothness of the estimates with less computation than would be required by increasing the smoothing parameters to yield a comparable degree of smoothness. For population densities with bounded support and discontinuities at the boundaries, cascading improves estimates near the boundaries. Cascaded estimates, especially using sums, may be more sensitive to the local covariance structure of the distribution than are the uncascaded kernel estimates. Cascading seems to be useful for detecting very nonspherical clusters. Cascading was suggested by Tukey and Tukey (1981, p. 237). Additional research into the properties of cascaded density estimates is needed.
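One cascading pass can be sketched as below. This is a hypothetical helper; whether the neighborhood includes the observation itself is an assumption of this sketch.

```python
# Cascade density estimates: each pass replaces every estimate by the
# arithmetic mean, harmonic mean, or sum of the current estimates over
# that observation's neighborhood (given as lists of indices).
def cascade(dens, neighbors, times=1, how="mean"):
    for _ in range(times):
        new = []
        for nb in neighbors:
            vals = [dens[j] for j in nb]  # estimates over N_i
            if how == "mean":
                new.append(sum(vals) / len(vals))
            elif how == "harmonic":
                new.append(len(vals) / sum(1.0 / f for f in vals))
            else:  # "sum"
                new.append(sum(vals))
        dens = new
    return dens
```

Passing `times=k` cascades the estimates k times, matching the recursive definitions above.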
The number of clusters is a function of the smoothing parameters. The number of clusters tends to decrease as the smoothing parameters increase, but the relationship is not strictly monotonic. Generally, you should specify several different values of the smoothing parameters to see how the number of clusters varies.
The clustering methods used by PROC MODECLUS use spherical clustering neighborhoods of fixed or variable radius that are similar to the spherical kernels used for density estimation. For fixed-radius neighborhoods, specify the radius as a Euclidean distance with either the CR= or R= option. For variable-radius neighborhoods, specify the number of neighbors desired within the sphere with either the CK= or K= option; the radius is then the smallest radius that contains at least the specified number of observations including the observation for which the neighborhood is being determined. However, in the following descriptions of clustering methods, an observation is not considered to be one of its own neighbors. If you specify both the CR= or R= option and the CK= or K= option, the radius used is the maximum of the two indicated radii; this is useful for dealing with outliers. In this section, the symbols $N_i$, $N_i^-$, $n_i$, and $n_i^-$ refer to clustering neighborhoods, not density-estimation neighborhoods.
METHOD=0
Begin with each observation in a separate cluster. For each observation and each of its neighbors, join the cluster to which the observation belongs with the cluster to which the neighbor belongs. This method does not use density estimates. With a fixed clustering radius, the clusters are those obtained by cutting the single linkage tree at the specified radius (see Chapter 23, The CLUSTER Procedure).
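This joining scheme amounts to computing connected components of the neighbor graph, which can be sketched with a union-find structure (a hypothetical helper, not the procedure's actual implementation):

```python
# Merge each observation's cluster with the cluster of every one of its
# neighbors (no densities involved); return a cluster label per observation.
def single_linkage_components(neighbors):
    parent = list(range(len(neighbors)))

    def find(i):  # find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, nb in enumerate(neighbors):
        for j in nb:
            parent[find(i)] = find(j)  # union the two clusters
    return [find(i) for i in range(len(neighbors))]
```

Observations connected by any chain of neighbor relations end up with the same label.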
METHOD=1
Begin with each observation in a separate cluster. For each observation, find the nearest neighbor with a greater estimated density. If such a neighbor exists, join the cluster to which the observation belongs with the cluster to which that neighbor belongs.
Next, consider each observation whose density estimate equals that of one or more neighbors but is not less than the estimate at any neighbor. Join the cluster containing the observation with (1) each cluster containing a neighbor of the observation such that the maximum density estimate in the cluster equals the density estimate at the observation and (2) the cluster containing the nearest neighbor of the observation such that the maximum density estimate in the cluster exceeds the density estimate at the observation.
This method is similar to the classification or assignment stage of algorithms described by Gitman (1973) and Huizinga (1978).
METHOD=2
Begin with each observation in a separate cluster. For each observation, find the neighbor with the greatest estimated density exceeding the estimated density of the observation. If such a neighbor exists, join the cluster to which the observation belongs with the cluster to which the specified neighbor belongs.
Observations with density estimates equal to that of one or more neighbors but not less than the estimate at any neighbor are treated the same way as they are in METHOD=1.
This method is similar to the first stage of an algorithm proposed by Mizoguchi and Shimura (1980).
METHOD=3
Begin with each observation in a separate cluster. For each observation, find the neighbor with greater estimated density such that the slope of the line connecting the point on the estimated density surface at the observation with the point on the estimated density surface at the neighbor is a maximum. That is, for observation $x_i$, find a neighbor $x_j$ such that $(\hat{f}_j - \hat{f}_i) / d(x_j, x_i)$ is a maximum. If this slope is positive, join the cluster to which observation $x_i$ belongs with the cluster to which the specified neighbor $x_j$ belongs. This method was invented by Koontz, Narendra, and Fukunaga (1976).
Observations with density estimates equal to that of one or more neighbors but not less than the estimate at any neighbor are treated the same way as they are in METHOD=1. The algorithm suggested for this situation by Koontz, Narendra, and Fukunaga (1976) may fail for flat areas in the estimated density that contain four or more observations.
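A minimal sketch of this slope-seeking rule follows. It is a hypothetical helper: plateaus of equal density, which the procedure handles as in METHOD=1, are ignored here.

```python
# Each observation joins the neighbor of greater density that maximizes the
# slope (f_j - f_i) / d(x_j, x_i); clusters are tracked with union-find.
def slope_seeking_clusters(points, dens, neighbors, dist):
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, nb in enumerate(neighbors):
        best, best_slope = None, 0.0
        for j in nb:
            slope = (dens[j] - dens[i]) / dist(points[i], points[j])
            if slope > best_slope:  # only strictly positive slopes qualify
                best, best_slope = j, slope
        if best is not None:
            parent[find(i)] = find(best)
    return [find(i) for i in range(len(points))]
```

On a 1-D sample with two density peaks, observations on each side of the valley chain up toward their own peak, yielding two clusters.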
METHOD=4
This method is equivalent to the first stage of two-stage density linkage (see Chapter 23, The CLUSTER Procedure) without the use of the MODE= option.
METHOD=5
This method is equivalent to the first stage of two-stage density linkage (see Chapter 23, The CLUSTER Procedure) with the use of the MODE= option.
METHOD=6
Begin with all observations unassigned.
Step 1: Form a list of seeds, each seed being a single observation such that the estimated density of the observation is not less than the estimated density of any of its neighbors. If you specify the MAXCLUSTERS=n option, retain only the n seeds with the greatest estimated densities.
Step 2: Consider each seed in decreasing order of estimated density.
If the current seed has already been assigned, proceed to the next seed. Otherwise, form a new cluster consisting of the current seed.
Add to the cluster any unassigned seed that is a neighbor of a member of the cluster or that shares a neighbor with a member of the cluster; repeat until no unassigned seed satisfies these conditions.
Add to the cluster all neighbors of seeds that belong to the cluster.
Consider each unassigned observation. Compute the ratio of the sum of the $(p-1)$th powers of the estimated density of the neighbors that belong to the current cluster to the sum of the $(p-1)$th powers of the estimated density of all of its neighbors, where $p$ is specified by the POWER= option and is 2 by default. Let $x_i$ be the current observation, and let $k$ be the index of the current cluster. Then this ratio is

$$r_{ik} = \frac{\sum_{j \in N_i^- \cap C_k} \hat{f}_j^{\,p-1}}{\sum_{j \in N_i^-} \hat{f}_j^{\,p-1}}$$
(The sum of the $(p-1)$th powers of the estimated density of the neighbors of an observation is an estimate of the integral of the $p$th power of the density over the neighborhood.) If $r_{ik}$ exceeds the maximum of 0.5 and the value of the THRESHOLD= option, add the observation $x_i$ to the current cluster $k$. Repeat until no more observations can be added to the current cluster.
Step 3: (This step is performed only if the value of the THRESHOLD= option is less than 0.5.) Form a list of unassigned observations in decreasing order of estimated density. Repeat the following actions until the list is empty.
Remove the first observation from the list, say, observation $x_i$.
For each cluster $k$, compute $r_{ik}$.
If the maximum over clusters of $r_{ik}$ exceeds the value of the THRESHOLD= option, assign observation $x_i$ to the corresponding cluster, and insert all observations of which the current observation is a neighbor into the list, keeping the list in decreasing order of estimated density.
METHOD=6 is related to a method invented by Koontz and Fukunaga (1972a) and discussed by Koontz and Fukunaga (1972b).
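The ratio $r_{ik}$ used in Steps 2 and 3 above can be sketched as follows (a hypothetical helper; as noted earlier, an observation is not counted among its own neighbors in the clustering methods):

```python
# r_ik: sum of (p-1)th powers of the estimated densities of x_i's neighbors
# that belong to cluster k, divided by the same sum over all of x_i's
# neighbors.  neighbors[i] lists the indices in N_i^- and cluster_of[j]
# gives j's current cluster (any sentinel value for unassigned).
def r_ik(i, k, dens, neighbors, cluster_of, p=2):
    num = sum(dens[j] ** (p - 1) for j in neighbors[i] if cluster_of[j] == k)
    den = sum(dens[j] ** (p - 1) for j in neighbors[i])
    return num / den if den > 0 else 0.0
```

With the default p = 2, the ratio simply weights each neighbor by its estimated density, so an observation is attracted toward the cluster holding its densest neighbors.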
Significance tests require that a fixed-radius kernel be specified for density estimation via the DR= or R= option. You can also specify the DK= or K= option, but only the fixed radius is used for the significance tests.
The purpose of the significance tests is as follows: given a simple random sample of objects from a population, obtain an estimate of the number of clusters in the population such that the probability in repeated sampling that the estimate exceeds the true number of clusters is not much greater than $\alpha$, $1\% \leq \alpha \leq 10\%$. In other words, a sequence of null hypotheses of the form

$H_0^{(i)}$: The number of population clusters is $i$ or less

where $i = 1, 2, \ldots, n$, is tested against alternatives such as

$H_1^{(i)}$: The number of population clusters exceeds $i$
with a maximum experimentwise error rate of approximately $\alpha$. The tests protect you from overestimating the number of population clusters. It is impossible to protect against underestimating the number of population clusters without introducing much stronger assumptions than are used here, since the number of population clusters could conceivably exceed the sample size.
The method for conducting significance tests is as follows:
Estimate densities using fixed-radius uniform kernels.
Obtain preliminary clusters by a valley-seeking method. Other clustering methods could be used but would yield less power.
Compute an approximate p -value for each cluster by comparing the estimated maximum density in the cluster with the estimated maximum density on the cluster boundary.
Repeatedly join the least significant cluster with a neighboring cluster until all remaining clusters are significant.
Estimate the number of population clusters as the number of significant sample clusters.
The preceding steps can be repeated for any number of different radii, and the estimate of the number of population clusters can be taken to be the maximum number of significant sample clusters for any radius.
This method has the following useful features:
No distributional assumptions are required.
The choice of smoothing parameter is not critical since you can try any number of different values.
The data can be coordinates or distances.
Time and space requirements for the significance tests are no worse than those for obtaining the clusters.
The power is high enough to be useful for practical purposes.
The method for computing the p -values is based on a series of plausible approximations. There are as yet no rigorous proofs that the method is infallible. Neither are there any asymptotic results. However, simulations for sample sizes ranging from 20 to 2000 indicate that the p -values are almost always conservative. The only case discovered so far in which the p -values are liberal is a uniform distribution in one dimension for which the simulated error rates exceed the nominal significance level only slightly for a limited range of sample sizes.
To make inferences regarding population clusters, it is first necessary to define what is meant by a cluster. For clustering methods using nonparametric density estimation, a cluster is usually loosely defined as a region surrounding a local maximum of the probability density function or a maximal connected set of local maxima. This definition may not be satisfactory for very rough densities with many local maxima. It is not applicable at all to discrete distributions for which the density does not exist. As another example in which this definition is not intuitively reasonable, consider a uniform distribution in two dimensions with support in the shape of a figure eight (including the interior). This density might be considered to contain two clusters even though it does not have two distinct modes.
These difficulties can be avoided by defining clusters in terms of the local maxima of a smoothed probability density or mass function. For example, define the neighborhood distribution function (NDF) with radius $r$ at a point $x$ as the probability that a randomly selected point will lie within a radius $r$ of $x$, that is, the probability integral over a hypersphere of radius $r$ centered at $x$:

$$F_r(x) = P\left( d(x, X) \leq r \right)$$

where $X$ is the random variable being sampled, $r$ is a user-specified radius, and $d(x, y)$ is the distance between points $x$ and $y$.
The NDF exists for all probability distributions. You can select the radius according to the degree of resolution required. The minimum-variance unbiased estimate of the NDF at a point x is proportional to the uniform-kernel density estimate with corresponding support.
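The natural sample estimate of the NDF is the fraction of sampled points within radius $r$ of $x$, as in the following sketch (a hypothetical helper):

```python
# Estimate F_r(x) by the fraction of the n sampled points lying within
# radius r of x; this is proportional to the uniform-kernel density estimate
# with the same support.
def ndf_estimate(x, sample, r, dist):
    return sum(1 for y in sample if dist(x, y) <= r) / len(sample)
```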
You can define a modal region as a maximal connected set of local maxima of the NDF. A cluster is a connected set containing exactly one modal region. This definition seems to give intuitively reasonable results in most cases. An exception is a uniform density on the perimeter of a square. The NDF has four local maxima. There are eight local maxima along the perimeter, but running PROC MODECLUS with the R= option would yield four clusters since the two local maxima at each corner are separated by a distance equal to the radius. While this density does indeed have four distinctive features (the corners), it is not obvious that each corner should be considered a cluster.
The number of population clusters depends on the radius of the NDF. The significance tests in PROC MODECLUS protect against overestimating the number of clusters at any specified radius. It is often useful to look at the clustering results across a range of radii. A plot of the number of sample clusters as a function of the radius is a useful descriptive display, especially for high-dimensional data (Wong and Schaack 1982).
If a population has two clusters, it must have two modal regions. If there are two modal regions, there must be a valley between them. It seems intuitively desirable that the boundary between the two clusters follow the bottom of this valley. All the clustering methods in PROC MODECLUS are designed to locate the estimated cluster boundaries in this way, although methods 1 and 6 seem to be much more successful at this than the others. Regardless of the precise location of the cluster boundary, it is clear that the maximum of the NDF along the boundary between two clusters must be strictly less than the value of the NDF in either modal region; otherwise, there would be only a single modal region. In the terminology of Hartigan and Hartigan (1985), there must be a "dip" between the two modes. PROC MODECLUS assesses the significance of a sample cluster by comparing the NDF in the modal region with the maximum of the NDF along the cluster boundary. If the NDF has second-order derivatives in the region of interest, and if the boundary between the two clusters is indeed at the bottom of the valley, then the maximum value of the NDF along the boundary occurs at a saddle point. Hence, this test is called a saddle test. This term is intended to describe any test for clusters that compares modal densities with saddle densities, not just the test currently implemented in the MODECLUS procedure.
The obvious estimate of the maximum NDF in a sample cluster is the maximum estimated NDF at an observation in the cluster. Let $m(k)$ be the index of the observation at which the maximum is attained in cluster $k$.
Estimating the maximum NDF on the cluster boundary is more complicated. One approach is to take the maximum NDF estimate at an observation in the cluster that has a neighbor belonging to another cluster. This method yields excessively large estimates when the neighborhood is large. Another approach is to try to choose an observation closer to the boundary by taking the observation with the maximum sum of estimated densities of neighbors belonging to a different cluster. After some experimentation, it was found that a combination of these two methods works well. Let $B_k$ be the set of indices of observations in cluster $k$ that have neighbors belonging to a different cluster, and compute, for each observation in $B_k$, a criterion combining these two quantities. Let $s(k)$ be the index of the observation at which the maximum of this criterion is attained.
Let $c_i$ denote the number of observations within the fixed-radius density-estimation neighborhood of $x_i$. Using the notation $\#(S)$ for the cardinality of set $S$, a standardized statistic $z_k$ is computed from $c_{m(k)}$ and $c_{s(k)}$, together with a quantity $u$ that plays the role of an effective number of comparisons.

Let $R(u)$ be a random variable distributed as the range of a random sample of $u$ observations from a standard normal distribution. Then the approximate $p$-value $p_k$ for cluster $k$ is

$$p_k = P\left( R(u) > z_k \right)$$

If the points $m(k)$ and $s(k)$ were fixed a priori, $z_k$ would be the usual approximately normal test statistic for comparing two binomial random variables. In fact, $m(k)$ and $s(k)$ are selected in such a way that $c_{m(k)}$ tends to be large and $c_{s(k)}$ tends to be small. For this reason, and because there may be a large number of clusters, each with its own $z_k$ to be tested, each $z_k$ is referred to the distribution of $R(u)$ instead of a standard normal distribution. If the tests are conducted for only one radius and if $u$ is chosen equal to $n$, then the $p$-values are very conservative because (1) you are not making all possible pairwise comparisons of observations in the sample and (2) $c_{m(k)}$ and $c_{s(k)}$ are positively correlated if the neighborhoods overlap. In the formula for $u$, the summation overcorrects somewhat for the conservativeness due to correlated counts, and a multiplicative factor, estimated empirically from simulation results, adjusts for the use of more than one radius.
If the JOIN option is specified, the least significant cluster (the cluster with the smallest $z_k$) is either dissolved or joined with a neighboring cluster. If no members of the cluster have neighbors belonging to a different cluster, all members of the cluster are unassigned. Otherwise, the cluster is joined to the neighboring cluster for which the sum of density estimates of neighbors of the estimated saddle point belonging to that cluster is a maximum. Joining clusters increases the power of the saddle test. For example, consider a population with two well-separated clusters. Suppose that, for a certain radius, each population cluster is divided into two sample clusters. None of the four sample clusters is likely to be significant, but after the two sample clusters corresponding to each population cluster are joined, the remaining two clusters may be highly significant.
The saddle test implemented in PROC MODECLUS has been evaluated by simulation from known distributions. Some results are given in the following three tables. In Table 47.3, samples of 20 to 2000 observations are generated from a one-dimensional uniform distribution. For sample sizes of 1000 or less, 2000 samples are generated and analyzed by PROC MODECLUS. For a sample size of 2000, only 1000 samples are generated. The analysis is done with at least 20 different values of the R= option spread across the range of radii most likely to yield significant results. The six central columns of the table give the observed error rates at the nominal error rates ($\alpha$) at the head of each column. The standard errors of the observed error rates are given at the bottom of the table. The observed error rates are conservative for $\alpha \leq 5\%$, but they increase with $\alpha$ and become slightly liberal for sample sizes in the middle of the range tested.
| Sample Size | 1% | 2% | 5% | 10% | 15% | 20% | Number of Simulations |
|---|---|---|---|---|---|---|---|
20 | 0.00 | 0.00 | 0.00 | 0.60 | 11.65 | 27.05 | 2000 |
50 | 0.35 | 0.70 | 4.50 | 10.95 | 20.55 | 29.80 | 2000 |
100 | 0.35 | 0.85 | 3.90 | 11.05 | 18.95 | 28.05 | 2000 |
200 | 0.30 | 1.35 | 4.00 | 10.50 | 18.60 | 27.05 | 2000 |
500 | 0.45 | 1.05 | 4.35 | 9.80 | 16.55 | 23.55 | 2000 |
1000 | 0.70 | 1.30 | 4.65 | 9.55 | 15.45 | 19.95 | 2000 |
2000 | 0.40 | 1.10 | 3.00 | 7.40 | 11.50 | 16.70 | 1000 |
| Standard Error | 0.22 | 0.31 | 0.49 | 0.67 | 0.80 | 0.89 | 2000 |
| Standard Error | 0.31 | 0.44 | 0.69 | 0.95 | 1.13 | 1.26 | 1000 |
All unimodal distributions other than the uniform that have been tested, including normal, Cauchy, and exponential distributions and uniform mixtures, have produced much more conservative results. Table 47.4 displays results from a unimodal mixture of two normal distributions with equal variances and equal sampling probabilities and with means separated by two standard deviations. Any greater separation would produce a bimodal distribution. The observed error rates are quite conservative.
| Sample Size | 1% | 2% | 5% | 10% | 15% | 20% | Number of Simulations |
|---|---|---|---|---|---|---|---|
100 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 | 200 |
200 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | 3.0 | 200 |
500 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 200 |
All distributions in two or more dimensions that have been tested yield extremely conservative results. For example, a uniform distribution on a circle yields observed error rates that are never more than one-tenth of the nominal error rates for sample sizes up to 1000. This conservatism is due to the fact that, as the dimensionality increases, more and more of the probability lies in the tails of the distribution (Silverman 1986, p. 92), and the saddle test used by PROC MODECLUS is more conservative for distributions with pronounced tails. This applies even to a uniform distribution on a hypersphere because, although the density has no tails, the NDF does.
Since the formulas for the significance tests do not involve the dimensionality, no problems are created when the data are linearly dependent. Simulations of data in nonlinear subspaces (the circumference of a circle or surface of a sphere) have also yielded conservative results.
| Sample Size | 1% | 2% | 5% | 10% | 15% | 20% | Number of Simulations |
|---|---|---|---|---|---|---|---|
20 | 0.0 | 0.0 | 0.0 | 2.0 | 37.5 | 68.5 | 200 |
35 | 0.0 | 13.5 | 38.5 | 48.5 | 64.0 | 75.5 | 200 |
50 | 17.5 | 26.0 | 51.5 | 67.0 | 78.5 | 84.0 | 200 |
75 | 25.5 | 36.0 | 58.5 | 77.5 | 85.5 | 89.5 | 200 |
100 | 40.0 | 54.5 | 72.5 | 84.5 | 91.5 | 92.5 | 200 |
150 | 70.5 | 80.0 | 92.0 | 97.0 | 100.0 | 100.0 | 200 |
200 | 89.0 | 96.0 | 99.5 | 100.0 | 100.0 | 100.0 | 200 |
The saddle test is not as efficient as excess-mass tests for multimodality (Müller and Sawitzki 1991; Polonik 1993). However, there is not yet a general approximation for the distribution of excess-mass statistics to circumvent the need for simulations to do significance tests. Refer to Minnotte (1992) for a review of tests for multimodality.
The MODECLUS procedure stores coordinate data in memory if there is enough space. For distance data, only one observation at a time is in memory.
PROC MODECLUS constructs lists of the neighbors of each observation. The total space required is $12 \sum n_i$ bytes, where $n_i$ is based on the largest neighborhood required by any analysis. The lists are stored in a SAS utility data set unless you specify the CORE option. You may get an error message from the SAS System or from the operating system if there is not enough disk space for the utility data set. Clustering method 6 requires a second list that is always stored in memory.
For coordinate data, the time required to construct the neighbor lists is roughly proportional to $v (\log n)(\sum n_i) \log(\sum n_i / n)$. For distance data, the time is roughly proportional to $n^2 \log(\sum n_i / n)$.
The time required for density estimation is proportional to $\sum n_i$ and is usually small compared to the time required for constructing the neighbor lists.
Clustering methods 0 through 3 are quite efficient, requiring time proportional to $\sum n_i$. Methods 4 and 5 are slower, requiring time roughly proportional to $(\sum n_i) \log(\sum n_i)$. Method 6 can also be slow, but the time requirements depend very much on the data and the particular options specified. Methods 4, 5, and 6 also require more memory than the other methods.
The time required for significance tests is roughly proportional to $g \sum n_i$, where $g$ is the number of clusters.
PROC MODECLUS can process data sets of several thousand observations if you specify reasonable smoothing parameters. Very small smoothing values produce many clusters, whereas very large values produce many neighbors; either case can require excessive time or space.
If the data are coordinates, observations with missing values are excluded from the analysis.
If the data are distances, missing values are treated as infinite. The distances are not required to be symmetric, and there is no check for symmetry; the neighbors of each observation are determined solely from the distances in that observation's row. This treatment of missing values is quite different from that of the CLUSTER procedure, which ignores the upper triangle of the distance matrix.
The OUT= data set contains one complete copy of the input data set for each cluster solution. There are additional variables identifying each solution and giving information about individual observations. Solutions having only one remaining cluster when JOIN=p is specified are omitted from the OUT= data set (see the description of the JOIN= option on page 2866). The OUT= data set can be extremely large, so it may be advisable to specify the DROP= data set option to exclude unnecessary variables.
The OUTCLUS= or OUTC= data set contains one observation for each cluster in each cluster solution. The variables identify the solution and provide statistics describing the cluster.
The OUTSUM= or OUTS= data set contains one observation for each cluster solution. The variables identify the solution and provide information about the solution as a whole.
The following variables can appear in all of the output data sets:
- _K_, which is the value of the K= option for the current solution. This variable appears only if you specify the K= option.
- _DK_, which is the value of the DK= option for the current solution. This variable appears only if you specify the DK= option.
- _CK_, which is the value of the CK= option for the current solution. This variable appears only if you specify the CK= option.
- _R_, which is the value of the R= option for the current solution. This variable appears only if you specify the R= option.
- _DR_, which is the value of the DR= option for the current solution. This variable appears only if you specify the DR= option.
- _CR_, which is the value of the CR= option for the current solution. This variable appears only if you specify the CR= option.
- _CASCAD_, which is the number of times the density estimates have been cascaded for the current solution. This variable appears only if you specify the CASCADE= option.
- _METHOD_, which is the value of the METHOD= option for the current solution. This variable appears only if you specify the METHOD= option.
- _NJOIN_, which is the number of clusters that are joined or dissolved in the current solution. This variable appears only if you specify the JOIN option.
- _LOCAL_, which is the local dimensionality estimate of the observation. This variable appears only if you specify the LOCAL option.
The OUT= data set contains the following variables:
the variables from the input data set
_OBS_ , which is the observation number from the input data set. This variable appears only if you omit the ID statement.
DENSITY , which is the estimated density at the observation. This variable can be renamed by the DENSITY= option.
CLUSTER , which is the number of the cluster to which the observation is assigned. This variable can be renamed by the CLUSTER= option.
The OUTC= data set contains the following variables:
the BY variables, if any
_NCLUS_ , which is the number of clusters in the solution
CLUSTER , which is the number of the current cluster
_FREQ_ , which is the number of observations in the cluster
_MODE_ , which is the maximum estimated density in the cluster
_BFREQ_ , which is the number of observations in the cluster with neighbors belonging to a different cluster
_SADDLE_ , which is the estimated saddle density for the cluster
_MC_ , which is the number of observations within the fixed-radius density-estimation neighborhood of the modal observation. This variable appears only if you specify the TEST or JOIN option.
_SC_ , which is the number of observations within the fixed-radius density-estimation neighborhood of the saddle observation. This variable appears only if you specify the TEST or JOIN option.
_OC_ , which is the number of observations within the overlap of the two previous neighborhoods. This variable appears only if you specify the TEST or JOIN option.
_Z_ , which is the approximate z statistic for the cluster. This variable appears only if you specify the TEST or JOIN option.
_P_ , which is the approximate p -value for the cluster. This variable appears only if you specify the TEST or JOIN option.
The OUTS= data set contains the following variables:
the BY variables, if any
_NCLUS_ , which is the number of clusters in the solution
_UNCL_ , which is the number of unclassified observations
_CROSS_ , which is the likelihood cross-validation criterion if you specify the CROSS or CROSSLIST option
If you specify the SIMPLE option and the data are coordinates, PROC MODECLUS displays the following simple descriptive statistics for each variable:
the MEAN
the standard deviation, STD DEV
the SKEWNESS
the KURTOSIS
a coefficient of BIMODALITY (see Chapter 23, The CLUSTER Procedure)
If you specify the NEIGHBOR option, PROC MODECLUS displays a list of the neighbors of each observation. The table contains
the observation number or ID value of the observation
the observation number or ID value of each of its neighbors
the distance to each neighbor
If you specify the CROSSLIST option, PROC MODECLUS produces a table of information regarding cross validation of the density estimates. Each table has a row for each observation. For each observation, the following are displayed:
the observation number or ID value of the observation
the radius of the neighborhood
the number of neighbors
the estimated log density
the estimated cross-validated log density
If you specify the LOCAL option, PROC MODECLUS produces a table of information regarding estimates of local dimensionality. Each table has a row for each observation. For each observation, the following are displayed:
the observation number or ID value of the observation
the radius of the neighborhood
the estimated local dimensionality
If you specify the LIST option, PROC MODECLUS produces a table listing the observations within each cluster. The table can include
the cluster number
the observation number or ID value of the observation
the estimated density
the sum of the density estimates of observations within the neighborhood that belong to the same cluster
the sum of the density estimates of observations within the neighborhood that belong to a different cluster
the sum of the density estimates of all the observations within the neighborhood
the ratio of the sum of the density estimates for the same cluster to the sum of all the density estimates in the neighborhood
If you specify the LIST option and there are unassigned objects, PROC MODECLUS produces a table listing those observations. The table includes
the observation number or ID value of the observation
the estimated density
the ratio of the sum of the density estimates for the same cluster to the sum of the density estimates in the neighborhood for all other clusters
If you specify the BOUNDARY option, PROC MODECLUS produces a table listing the observations in each cluster that have a neighbor belonging to a different cluster. The table includes
the observation number or ID value of the observation
the estimated density
the cluster number
the ratio of the sum of the density estimates for the same cluster to the sum of the density estimates in the neighborhood for all other clusters
If you do not specify the SHORT option, PROC MODECLUS produces a table of cluster statistics including
the cluster number
the cluster frequency (the number of observations in the cluster)
the maximum estimated density within the cluster
the number of observations in the cluster having a neighbor that belongs to a different cluster
the estimated saddle density of the cluster
If you specify the TEST or JOIN option, the table of cluster statistics includes the following items pertaining to the saddle test:
the number of observations within the fixed-radius density-estimation neighborhood of the modal observation
the number of observations within the fixed-radius density-estimation neighborhood of the saddle observation
the number of observations within the overlap of the two preceding neighborhoods
the z statistic for comparing the preceding counts
the approximate p -value
If you do not specify the NOSUMMARY option, PROC MODECLUS produces a table summarizing each cluster solution containing the following items:
the smoothing parameters and cascade value
the number of clusters
the frequency of unclassified objects
the likelihood cross-validation criterion if you specify the CROSS or CROSSLIST option
If you specify the JOIN option, the summary table also includes
the number of clusters joined
the maximum p -value of any cluster in the solution
If you specify the TRACE option, PROC MODECLUS produces a table for each cluster solution that lists each observation along with its cluster membership as it is reassigned from the Old cluster to the New cluster. This reassignment is described in Step 1 through Step 3 of the section METHOD=6 on page 2875. Each table has a row for each observation. For each observation, the following are displayed:
the observation number or ID value of the observation
the estimated density
the Old cluster membership. 0 represents an unassigned observation and -1 represents a seed.
the New cluster membership
Ratio, which is documented in the section METHOD=6 on page 2875. The following character values can also be displayed:
M | means the observation is a mode |
S | means the observation is a seed |
N | means the observation is a neighbor of a mode or seed, for which the ratio is not computed |
PROC MODECLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.
ODS Table Name | Description | Statement | Option |
---|---|---|---|
BoundaryFreq | Boundary objects information | PROC | BOUNDARY (or ALL) |
ClusterList | Cluster listing: cluster ID, frequency, density, etc. | PROC | LIST (or ALL) |
ClusterStats | Cluster statistics | PROC | default |
ClusterStats | Cluster statistics, significance test statistics | PROC | TEST or JOIN (or ALL) |
ClusterSummary | Cluster summary | PROC | default |
ClusterSummary | Cluster summary, cross-validation criterion | PROC | CROSS or CROSSLIST (or ALL) |
ClusterSummary | Cluster summary, clusters joined information | PROC | JOIN (or ALL) |
CrossList | Cross-validated log density | PROC | CROSSLIST |
ListLocal | Local dimensionality estimates | PROC | LOCAL |
Neighbor | Nearest neighbor list | PROC | NEIGHBOR (or ALL) |
SimpleStatistics | Simple statistics | PROC | SIMPLE (or ALL) |
Trace | Trace of clustering algorithm (METHOD=6 only) | PROC | TRACE (or ALL) with METHOD=6 |
UnassignObjects | Information on unassigned objects | PROC | LIST (or ALL) |
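Any of these tables can be captured as a data set with the ODS OUTPUT statement. For example (hypothetical input and output data set names; the NEIGHBOR option is required for the Neighbor table to be produced):

```sas
ods output ClusterStats=stats Neighbor=nbrs;
proc modeclus data=mydata method=1 r=2 neighbor;
   var x1 x2;
run;
```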