The following statements are available in the FASTCLUS procedure:
PROC FASTCLUS MAXCLUSTERS= n | RADIUS= t < options > ;
VAR variables ;
ID variable ;
FREQ variable ;
WEIGHT variable ;
BY variables ;
Usually you need only the VAR statement in addition to the PROC FASTCLUS statement. The BY, FREQ, ID, VAR, and WEIGHT statements are described in alphabetical order after the PROC FASTCLUS statement.
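As a minimal sketch of this common case, the following step uses only the PROC FASTCLUS and VAR statements. The data set name work.points, the variables x and y, and the data values are hypothetical and were made up for illustration.

```sas
/* Hypothetical data: two loose groups of points in the plane */
data work.points;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
;
run;

/* Request at most two clusters and cluster on x and y only */
proc fastclus data=work.points maxclusters=2;
   var x y;
run;
```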
PROC FASTCLUS MAXCLUSTERS= n | RADIUS= t < options > ;
You must specify either the MAXCLUSTERS= or the RADIUS= argument in the PROC FASTCLUS statement.
MAXCLUSTERS= n
MAXC= n
specifies the maximum number of clusters allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
RADIUS= t
R= t
establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0. If you specify the REPLACE=RANDOM option, the RADIUS= option is ignored.
You can specify the following options in the PROC FASTCLUS statement. Table 28.1 summarizes the options.
Task | Options |
---|---|
Specify data set details | CLUSTER= DATA= INSTAT= MEAN= OUT= OUTITER OUTSEED= OUTSTAT= SEED= |
Specify distance dimension | BINS= HC= HP= IRLS LEAST= |
Select initial cluster seeds | RANDOM= REPLACE= |
Compute final cluster seeds | CONVERGE= DELETE= DRIFT MAXCLUSTERS= MAXITER= RADIUS= STRICT |
Work with missing values | IMPUTE NOMISS |
Specify variance divisor | VARDEF= |
Control output | DISTANCE LIST NOPRINT SHORT SUMMARY |
The following list provides details on these options. The list is in alphabetical order.
BINS= n
specifies the number of bins used in the bin-sort algorithm for computing medians for LEAST=1. By default, PROC FASTCLUS uses from 10 to 100 bins, depending on the amount of memory available. Larger values use more memory and make each iteration somewhat slower, but they can reduce the number of iterations. Smaller values have the opposite effect. The minimum value of n is 5.
CLUSTER= name
specifies a name for the variable in the OUTSEED= and OUT= data sets that indicates cluster membership. The default name for this variable is CLUSTER.
CONVERGE= c
CONV= c
specifies the convergence criterion. Any nonnegative value is allowed. The default value is 0.0001 for all values of p if LEAST= p is explicitly specified; otherwise, the default value is 0.02. Iterations stop when the maximum relative change in the cluster seeds is less than or equal to the convergence criterion and additional conditions on the homotopy parameter, if any, are satisfied (see the HP= option). The relative change in a cluster seed is the distance between the old seed and the new seed divided by a scaling factor. If you do not specify the LEAST= option, the scaling factor is the minimum distance between the initial seeds. If you specify the LEAST= option, the scaling factor is an L1 scale estimate and is recomputed on each iteration. Specify the CONVERGE= option only if you specify a MAXITER= value greater than 1.
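The interaction between CONVERGE= and MAXITER= can be sketched as follows. The data set work.points and the values 20 and 0.001 are illustrative assumptions, not recommendations.

```sas
/* Hypothetical data: two loose groups of points in the plane */
data work.points;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
;
run;

/* Allow up to 20 iterations; stop earlier once the maximum   */
/* relative change in the cluster seeds is at most 0.001.     */
/* LEAST= is not specified, so the scaling factor is the      */
/* minimum distance between the initial seeds.                */
proc fastclus data=work.points maxclusters=2 maxiter=20 converge=0.001;
   var x y;
run;
```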
DATA= SAS-data-set
specifies the input data set containing observations to be clustered. If you omit the DATA= option, the most recently created SAS data set is used. The data must be coordinates, not distances, similarities, or correlations.
DELETE= n
deletes cluster seeds to which n or fewer observations are assigned. Deletion occurs after processing for the DRIFT option is completed and after each iteration specified by the MAXITER= option. Cluster seeds are not deleted after the final assignment of observations to clusters, so in rare cases a final cluster may not have more than n members. The DELETE= option is ineffective if you specify MAXITER=0 and do not specify the DRIFT option. By default, no cluster seeds are deleted.
DISTANCE
DIST
computes distances between the cluster means.
DRIFT
executes the second of the four steps described in the section Background on page 1380. After initial seed selection, each observation is assigned to the cluster with the nearest seed. After an observation is processed, the seed of the cluster to which it is assigned is recalculated as the mean of the observations currently assigned to the cluster. Thus, the cluster seeds drift about rather than remaining fixed for the duration of the pass.
HC= c
HP= p1 < p2 >
pertains to the homotopy parameter for LEAST= p, where 1 < p < 2. You should specify these options only if you encounter convergence problems using the default values.
For 1 < p < 2, PROC FASTCLUS tries to optimize a perturbed variant of the Lp clustering criterion (Gonin and Money 1989, pp. 5–6). When the homotopy parameter is 0, the optimization criterion is equivalent to the clustering criterion. For a large homotopy parameter, the optimization criterion approaches the least-squares criterion and is, therefore, easy to optimize. Beginning with a large homotopy parameter, PROC FASTCLUS gradually decreases it by a factor in the range [0.01,0.5] over the course of the iterations. When both the homotopy parameter and the convergence measure are sufficiently small, the optimization process is declared to have converged.
If the initial homotopy parameter is too large or if it is decreased too slowly, the optimization may require many iterations. If the initial homotopy parameter is too small or if it is decreased too quickly, convergence to a local optimum is likely.
HC= c | specifies the criterion for updating the homotopy parameter. The homotopy parameter is updated when the maximum relative change in the cluster seeds is less than or equal to c. The default is the minimum of 0.01 and 100 times the value of the CONVERGE= option. |
HP= p1 | specifies p1 as the initial value of the homotopy parameter. The default is 0.05 if the modified Ekblom-Newton method is used; otherwise, it is 0.25. |
HP= p1 p2 | also specifies p2 as the minimum value for the homotopy parameter, which must be reached for convergence. The default is the minimum of p1 and 0.01 times the value of the CONVERGE= option. |
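A rough sketch of overriding the homotopy defaults follows. It assumes that the two HP= values are written as a space-separated pair, as the syntax line above suggests. The data set work.points and the values for HP=, HC=, and MAXITER= are arbitrary illustrations, and the defaults are normally preferable.

```sas
/* Hypothetical data: two loose groups of points in the plane */
data work.points;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
;
run;

/* LEAST=1.5 lies in the range 1 < p < 2, where the homotopy  */
/* parameter applies. HP= gives the initial value (0.1) and   */
/* the minimum value that must be reached (0.0001); HC= gives */
/* the seed-change threshold (0.005) for updating it.         */
proc fastclus data=work.points maxclusters=2 least=1.5
              hp=0.1 0.0001 hc=0.005 maxiter=50;
   var x y;
run;
```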
IMPUTE
requests imputation of missing values after the final assignment of observations to clusters. If an observation has a missing value for a variable used in the cluster analysis, the missing value is replaced by the corresponding value in the cluster seed to which the observation is assigned. If the observation is not assigned to a cluster, missing values are not replaced. If you specify the IMPUTE option, the imputed values are not used in computing cluster statistics.
If you also request an OUT= data set, it contains the imputed values.
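The following sketch shows one way to combine IMPUTE with an OUT= data set. The data set names work.gaps and work.filled and the data values are hypothetical. Missing values are entered as periods.

```sas
/* Hypothetical data with two missing values (coded as .) */
data work.gaps;
   input x y;
   datalines;
1.0 1.1
1.2 .
0.8 1.0
5.0 5.2
. 4.8
4.9 5.0
;
run;

/* After the final assignment, each missing value is replaced */
/* by the corresponding value of the assigned cluster seed;   */
/* the OUT= data set contains the imputed values.             */
proc fastclus data=work.gaps maxclusters=2 maxiter=10 impute out=work.filled;
   var x y;
run;

/* Inspect the imputed values together with CLUSTER and DISTANCE */
proc print data=work.filled;
run;
```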
INSTAT= SAS-data-set
reads a SAS data set previously created by the FASTCLUS procedure using the OUTSTAT= option. If you specify the INSTAT= option, no clustering iterations are performed and no output is displayed. Only cluster assignment and imputation are performed as an OUT= data set is created.
IRLS
causes PROC FASTCLUS to use an iteratively reweighted least-squares method instead of the modified Ekblom-Newton method. If you specify the IRLS option, you must also specify LEAST= p, where 1 < p < 2. Use the IRLS option only if you encounter convergence problems with the default method.
LEAST= p | MAX
L= p | MAX
causes PROC FASTCLUS to optimize an Lp criterion, where 1 ≤ p ≤ ∞ (Spath 1985, pp. 62–63). Infinity is indicated by LEAST=MAX. The value of this clustering criterion is displayed in the iteration history.
If you do not specify the LEAST= option, PROC FASTCLUS uses the least-squares (L2) criterion. However, the default number of iterations is only 1 if you omit the LEAST= option, so the optimization of the criterion is generally not completed. If you specify the LEAST= option, the maximum number of iterations is increased to allow the optimization process a chance to converge. See the MAXITER= option on page 1393.
Specifying the LEAST= option also changes the default convergence criterion from 0.02 to 0.0001. See the CONVERGE= option on page 1390.
When LEAST=2, PROC FASTCLUS tries to minimize the root mean square difference between the data and the corresponding cluster means.
When LEAST=1, PROC FASTCLUS tries to minimize the mean absolute difference between the data and the corresponding cluster medians.
When LEAST=MAX, PROC FASTCLUS tries to minimize the maximum absolute difference between the data and the corresponding cluster midranges.
For general values of p, PROC FASTCLUS tries to minimize the pth root of the mean of the pth powers of the absolute differences between the data and the corresponding cluster seeds.
The divisor in the clustering criterion is either the number of nonmissing data used in the analysis or, if there is a WEIGHT statement, the sum of the weights corresponding to all the nonmissing data used in the analysis (that is, an observation with n nonmissing data contributes n times the observation weight to the divisor). The divisor is not adjusted for degrees of freedom.
The method for updating cluster seeds during iteration depends on the LEAST= option, as follows (Gonin and Money 1989).
LEAST= p | Algorithm for Computing Cluster Seeds |
---|---|
p = 1 | bin sort for median |
1 < p < 2 | modified Merle-Spath if you specify IRLS, otherwise modified Ekblom-Newton |
p = 2 | arithmetic mean |
2 < p < ∞ | Newton |
p = ∞ | midrange |
During the final pass, a modified Merle-Spath step is taken to compute the cluster centers for 1 ≤ p < 2 or 2 < p < ∞.
If you specify the LEAST= p option with a value other than 2, PROC FASTCLUS computes pooled scale estimates analogous to the root mean square standard deviation but based on pth-power deviations instead of squared deviations.
LEAST= p | Scale Estimate |
---|---|
p = 1 | mean absolute deviation |
1 < p < ∞ | root mean pth-power absolute deviation |
p = ∞ | maximum absolute deviation |
The divisors for computing the mean absolute deviation or the root mean pth-power absolute deviation are adjusted for degrees of freedom just like the divisors for computing standard deviations. This adjustment can be suppressed by the VARDEF= option.
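As an illustrative sketch, the following step requests the LEAST=1 criterion, which uses cluster medians and so may be less sensitive to an extreme value than the default least-squares criterion. The data set work.skewed and its values, including the deliberately extreme last point, are hypothetical.

```sas
/* Hypothetical data: two groups plus one extreme point */
data work.skewed;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
9.5 5.1
;
run;

/* LEAST=1 minimizes the mean absolute difference between the */
/* data and the cluster medians. Because LEAST= is specified, */
/* the default MAXITER= becomes 20 and the default CONVERGE=  */
/* becomes 0.0001.                                            */
proc fastclus data=work.skewed maxclusters=2 least=1;
   var x y;
run;
```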
LIST
lists all observations, giving the value of the ID variable (if any), the number of the cluster to which the observation is assigned, and the distance between the observation and the final cluster seed.
MAXITER= n
specifies the maximum number of iterations for recomputing cluster seeds. When the value of the MAXITER= option is greater than 0, PROC FASTCLUS executes the third of the four steps described in the Background section on page 1380. In each iteration, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters.
The default value of the MAXITER= option depends on the LEAST= p option.
LEAST= p | MAXITER= |
---|---|
not specified | 1 |
p = 1 | 20 |
1 < p < 1.5 | 50 |
1.5 ≤ p < 2 | 20 |
p = 2 | 10 |
2 < p ≤ ∞ | 20 |
MEAN= SAS-data-set
creates an output data set to contain the cluster means and other statistics for each cluster. If you want to create a permanent SAS data set, you must specify a two-level name. Refer to SAS Data Files in SAS Language Reference: Concepts for more information on permanent data sets.
NOMISS
excludes observations with missing values from the analysis. However, if you also specify the IMPUTE option, observations with missing values are included in the final cluster assignments.
NOPRINT
suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, Using the Output Delivery System.
OUT= SAS-data-set
creates an output data set to contain all the original data, plus the new variables CLUSTER and DISTANCE. Refer to SAS Data Files in SAS Language Reference: Concepts for more information on permanent data sets.
OUTITER
outputs information from the iteration history to the OUTSEED= data set, including the cluster seeds at each iteration.
OUTSEED= SAS-data-set
OUTS= SAS-data-set
is another name for the MEAN= data set, provided because the data set may contain location estimates other than means. The MEAN= option is still accepted.
OUTSTAT= SAS-data-set
creates an output data set to contain various statistics, especially those not included in the OUTSEED= data set. Unlike the OUTSEED= data set, the OUTSTAT= data set is not suitable for use as a SEED= data set in a subsequent PROC FASTCLUS step.
RANDOM= n
specifies a positive integer as a starting value for the pseudo-random number generator for use with REPLACE=RANDOM. If you do not specify the RANDOM= option, the time of day is used to initialize the pseudo-random number sequence.
REPLACE=FULL | PART | NONE | RANDOM
specifies how seed replacement is performed.
FULL | requests default seed replacement as described in the section Background on page 1380. |
PART | requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds. |
NONE | suppresses seed replacement. |
RANDOM | selects a simple pseudo-random sample of complete observations as initial cluster seeds. |
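The following sketch selects initial seeds at random with a fixed starting value so that the seed selection can be repeated. The data set work.points and the value 12345 are arbitrary choices for illustration.

```sas
/* Hypothetical data: two loose groups of points in the plane */
data work.points;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
;
run;

/* REPLACE=RANDOM draws a pseudo-random sample of complete    */
/* observations as initial seeds; RANDOM= fixes the starting  */
/* value of the generator instead of using the time of day.   */
proc fastclus data=work.points maxclusters=2 maxiter=10
              replace=random random=12345;
   var x y;
run;
```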
SEED= SAS-data-set
specifies an input data set from which initial cluster seeds are to be selected. If you do not specify the SEED= option, initial seeds are selected from the DATA= data set. The SEED= data set must contain the same variables that are used in the data analysis.
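One possible two-step workflow is sketched below: a first run saves its cluster seeds with OUTSEED=, and a second run reads them back with SEED=. The data set names work.points, work.seeds, and work.assigned are hypothetical.

```sas
/* Hypothetical data: two loose groups of points in the plane */
data work.points;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
;
run;

/* First run: save the cluster seeds; suppress displayed output */
proc fastclus data=work.points maxclusters=2 outseed=work.seeds noprint;
   var x y;
run;

/* Second run: start from the saved seeds, which contain the  */
/* same analysis variables x and y, and iterate further.      */
proc fastclus data=work.points seed=work.seeds maxclusters=2
              maxiter=10 out=work.assigned;
   var x y;
run;
```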
SHORT
suppresses the display of the initial cluster seeds, cluster means, and standard deviations.
STRICT
STRICT= s
prevents an observation from being assigned to a cluster if its distance to the nearest cluster seed exceeds the value of the STRICT= option. If you specify the STRICT option without a numeric value, you must also specify the RADIUS= option, and its value is used instead. In the OUT= data set, observations that are not assigned due to the STRICT= option are given a negative cluster number, the absolute value of which indicates the cluster with the nearest seed.
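A sketch of using STRICT= and then listing the unassigned observations follows. The threshold 1.0 and the data set names work.spread and work.assigned are assumptions chosen for illustration; which observations, if any, end up unassigned depends on the seeds that are selected.

```sas
/* Hypothetical data: two groups plus one distant point */
data work.spread;
   input x y;
   datalines;
1.0 1.1
1.2 0.9
0.8 1.0
5.0 5.2
5.1 4.8
4.9 5.0
3.0 9.0
;
run;

/* Do not assign an observation whose distance to the nearest */
/* cluster seed exceeds 1.0.                                  */
proc fastclus data=work.spread maxclusters=2 maxiter=10
              strict=1.0 out=work.assigned;
   var x y;
run;

/* Unassigned observations carry a negative cluster number    */
/* whose absolute value identifies the nearest seed.          */
proc print data=work.assigned;
   where cluster < 0;
run;
```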
SUMMARY
suppresses the display of the initial cluster seeds, statistics for variables, cluster means, and standard deviations.
VARDEF=DF | N | WDF | WEIGHT | WGT
specifies the divisor to be used in the calculation of variances and covariances. The default value is VARDEF=DF. The possible values of the VARDEF= option and associated divisors are as follows.
Value | Description | Divisor |
---|---|---|
DF | error degrees of freedom | $n - c$ |
N | number of observations | $n$ |
WDF | sum of weights DF | $\left(\sum_i w_i\right) - c$ |
WEIGHT or WGT | sum of weights | $\sum_i w_i$ |
In the preceding definitions, c represents the number of clusters.
BY variables ;
You can specify a BY statement with PROC FASTCLUS to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.
If your input data set is not sorted in ascending order, use one of the following alternatives:
Sort the data using the SORT procedure with a similar BY statement.
Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the FASTCLUS procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
Create an index on the BY variables using the DATASETS procedure.
If you specify the SEED= option and the SEED= data set does not contain any of the BY variables, then the entire SEED= data set is used to obtain initial cluster seeds for each BY group in the DATA= data set.
If the SEED= data set contains some but not all of the BY variables, or if some BY variables do not have the same type or length in the SEED= data set as in the DATA= data set, then PROC FASTCLUS displays an error message and stops.
If all the BY variables appear in the SEED= data set with the same type and length as in the DATA= data set, then each BY group in the SEED= data set is used to obtain initial cluster seeds for the corresponding BY group in the DATA= data set. All BY groups in the DATA= data set must also appear in the SEED= data set. The BY groups in the SEED= data set must be in the same order as in the DATA= data set. If you specify the NOTSORTED option in the BY statement, there must be exactly the same BY groups in the same order in both data sets. If you do not specify NOTSORTED, some BY groups can appear in the SEED= data set but not in the DATA= data set; such BY groups are not used in the analysis.
For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
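The first alternative above, sorting with a matching BY statement, can be sketched as follows. The data set work.grouped, the BY variable region, and the data values are hypothetical.

```sas
/* Hypothetical data: observations from two regions, unsorted */
data work.grouped;
   input region $ x y;
   datalines;
west 1.0 1.1
east 5.0 5.2
west 1.2 0.9
east 5.1 4.8
west 4.0 4.1
east 0.9 1.0
west 4.2 3.9
east 1.1 1.2
;
run;

/* Sort by the BY variable so the groups are in ascending order */
proc sort data=work.grouped;
   by region;
run;

/* A separate cluster analysis is run for each value of region */
proc fastclus data=work.grouped maxclusters=2 out=work.by_clusters;
   var x y;
   by region;
run;
```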
FREQ variable ;
If a variable in the data set represents the frequency of occurrence for the other values in the observation, include the variable's name in a FREQ statement. The procedure then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation.
If the value of the FREQ variable is missing or 0, the observation is not used in the analysis. The exact values of the FREQ variable are used in computations; frequency values are not truncated to integers. The total number of observations is considered to be equal to the sum of the FREQ variable when the procedure determines degrees of freedom for significance probabilities.
The WEIGHT and FREQ statements have a similar effect, except in determining the number of observations for significance tests.
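As a brief sketch, the following step treats each row as representing several identical observations. The data set work.counted and the variable count are hypothetical names.

```sas
/* Hypothetical data: count is the number of times each (x, y) occurs */
data work.counted;
   input x y count;
   datalines;
1.0 1.1 3
1.2 0.9 1
0.8 1.0 2
5.0 5.2 4
5.1 4.8 1
4.9 5.0 2
;
run;

/* Each row is treated as if it appeared count times */
proc fastclus data=work.counted maxclusters=2;
   var x y;
   freq count;
run;
```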
ID variable ;
The ID variable, which can be character or numeric, identifies observations on the output when you specify the LIST option.
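The following sketch pairs the ID statement with the LIST option so that each listed observation is labeled. The data set work.labeled and the variable name are hypothetical.

```sas
/* Hypothetical data: a character label plus two coordinates */
data work.labeled;
   input name $ x y;
   datalines;
a 1.0 1.1
b 1.2 0.9
c 0.8 1.0
d 5.0 5.2
e 5.1 4.8
f 4.9 5.0
;
run;

/* LIST prints every observation with its ID value, cluster   */
/* number, and distance to the final cluster seed.            */
proc fastclus data=work.labeled maxclusters=2 list;
   var x y;
   id name;
run;
```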
VAR variables ;
The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used.
WEIGHT variable ;
The values of the WEIGHT variable are used to compute weighted cluster means. The WEIGHT and FREQ statements have a similar effect, except the WEIGHT statement does not alter the degrees of freedom or the number of observations. The WEIGHT variable can take nonintegral values. An observation is used in the analysis only if the value of the WEIGHT variable is greater than zero.
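A minimal sketch of the WEIGHT statement follows. The data set work.weighted and the variable w are hypothetical; the last row has a weight of zero and is therefore not used in the analysis.

```sas
/* Hypothetical data: w is a nonintegral weight per observation */
data work.weighted;
   input x y w;
   datalines;
1.0 1.1 0.5
1.2 0.9 1.5
0.8 1.0 1.0
5.0 5.2 2.0
5.1 4.8 0.7
4.9 5.0 1.3
9.0 9.0 0
;
run;

/* Cluster means are computed as weighted means of x and y; */
/* only observations with w greater than zero are used.     */
proc fastclus data=work.weighted maxclusters=2;
   var x y;
   weight w;
run;
```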