Details


Updates in the FASTCLUS Procedure

Some FASTCLUS procedure options and statements have changed from previous versions. The differences are as follows:

  • Values of the FREQ variable are no longer truncated to integers. Noninteger values of the FREQ variable therefore produce results that differ from those of previous releases.

  • The IMPUTE option produces different cluster standard deviations and related statistics. When you specify the IMPUTE option, imputed values are no longer used in computing cluster statistics. This change causes the cluster standard deviations, and other statistics computed from them, to differ from those in previous releases.

  • The INSTAT= option reads a SAS data set previously created by the FASTCLUS procedure using the OUTSTAT= option. If you specify the INSTAT= option, no clustering iterations are performed and no output is produced. Only cluster assignment and imputation are performed as an OUT= data set is created.

  • The OUTSTAT= data set contains additional information used for imputation. _TYPE_=SEED corresponds to values that are cluster seeds. Observations previously designated _TYPE_=SCALE are now _TYPE_=DISPERSION.
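The INSTAT= behavior described above can be sketched as a two-step scoring workflow. This is a minimal illustration, not a definitive recipe: the data set names train, newobs, fcstat, and scored and the variables v1 and v2 are hypothetical, and you should confirm against your release whether any further options are required in the scoring step.

```sas
/* Step 1: cluster the training data and save the statistics.   */
/* Data set and variable names are assumptions for illustration. */
proc fastclus data=train maxclusters=3 outstat=fcstat;
   var v1 v2;
run;

/* Step 2: assign new observations to the saved clusters.       */
/* With INSTAT=, no clustering iterations are performed; only   */
/* the OUT= data set is created.                                 */
proc fastclus data=newobs instat=fcstat out=scored;
   var v1 v2;
run;
```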

Missing Values

Observations with all missing values are excluded from the analysis. If you specify the NOMISS option, observations with any missing values are excluded. Observations with missing values cannot be cluster seeds.

The distance between an observation with missing values and a cluster seed is obtained by computing the squared distance based on the nonmissing values, multiplying by the ratio of the number of variables, n, to the number of variables having nonmissing values, m, and taking the square root:

$$\sqrt{\frac{n}{m} \sum_{i} (x_i - s_i)^2}$$

where

  • n = number of variables

  • m = number of variables with nonmissing values

  • x_i = value of the ith variable for the observation

  • s_i = value of the ith variable for the seed

If you specify the LEAST=p option with a power p other than 2 (the default), the distance is computed using:

$$\left( \frac{n}{m} \sum_{i} \lvert x_i - s_i \rvert^{p} \right)^{1/p}$$

The summation is taken over variables with nonmissing values.
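As a worked illustration of the least-squares case, the following DATA step sketches the adjusted distance for one observation with a missing value. The observation values, the seed values, and all variable names are made up for the example.

```sas
data _null_;
   array x{4} x1-x4 (1 . 3 5);   /* hypothetical observation; x2 is missing */
   array s{4} s1-s4 (0 2 1 2);   /* hypothetical cluster seed               */
   n  = dim(x);                  /* number of variables                     */
   m  = 0;                       /* number of nonmissing variables          */
   ss = 0;                       /* squared distance over nonmissing values */
   do i = 1 to n;
      if not missing(x{i}) then do;
         m  = m + 1;
         ss = ss + (x{i} - s{i})**2;
      end;
   end;
   /* Scale by n/m, then take the square root. */
   if m > 0 then dist = sqrt((n / m) * ss);
   put n= m= ss= dist=;
run;
```

Here three of the four variables are nonmissing, so the squared distance 1 + 4 + 9 = 14 is multiplied by 4/3 before the square root is taken, giving a distance of roughly 4.32.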

The IMPUTE option fills in missing values in the OUT= output data set.

Output Data Sets

OUT= Data Set

The OUT= data set contains

  • the original variables

  • a new variable taking values from 1 to the value specified in the MAXCLUSTERS= option, indicating the cluster to which each observation has been assigned. You can specify the variable name with the CLUSTER= option; the default name is CLUSTER.

  • a new variable, DISTANCE, giving the distance from the observation to its cluster seed

If you specify the IMPUTE option, the OUT= data set also contains a new variable, _IMPUTE_, giving the number of imputed values in each observation.
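A short sketch of how the OUT= variables might be inspected when IMPUTE is in effect follows. The data set names stan and clust and the variables v1-v10 are assumptions carried over for illustration only.

```sas
/* Cluster with imputation; OUT= then carries CLUSTER, DISTANCE, and _IMPUTE_. */
proc fastclus data=stan maxclusters=3 impute out=clust;
   var v1-v10;
run;

/* Cross-tabulate cluster membership by the number of imputed values. */
proc freq data=clust;
   tables cluster*_impute_ / missing;
run;
```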

OUTSEED= Data Set

The OUTSEED= data set contains one observation for each cluster. The variables are as follows:

  • the BY variables, if any

  • a new variable giving the cluster number. You can specify the variable name with the CLUSTER= option. The default name is CLUSTER.

  • either the FREQ variable or a new variable called _FREQ_ giving the number of observations in the cluster

  • the WEIGHT variable, if any

  • a new variable, _RMSSTD_, giving the root mean square standard deviation for the cluster. See Chapter 23, The CLUSTER Procedure, for details.

  • a new variable, _RADIUS_, giving the maximum distance between any observation in the cluster and the cluster seed

  • a new variable, _GAP_, containing the distance between the current cluster mean and the nearest other cluster mean. The value is the centroid distance given in the output.

  • a new variable, _NEAR_, specifying the cluster number of the nearest cluster

  • the VAR variables giving the cluster means

If you specify the LEAST=p option with a value other than 2, the _RMSSTD_ variable is replaced by the _SCALE_ variable, which contains the pooled scale estimate analogous to the root mean square standard deviation but based on pth-power deviations instead of squared deviations:

  • LEAST=1: mean absolute deviation

  • LEAST=p: root mean pth-power absolute deviation

  • LEAST=MAX: maximum absolute deviation

If you specify the OUTITER option, there is one set of observations in the OUTSEED= data set for each pass through the data set (that is, one set for initial seeds, one for each iteration, and one for the final clusters). Also, several additional variables appear:

_ITER_

is the iteration number. For the initial seeds, the value is 0. For the final cluster means or centers, the _ITER_ variable is one greater than the last iteration reported in the iteration history.

_CRIT_

is the clustering criterion as described under the LEAST= option.

_CHANGE_

is the maximum over clusters of the relative change in the cluster seed from the previous iteration. The relative change in a cluster seed is the distance between the old seed and the new seed divided by a scaling factor. If you do not specify the LEAST= option, the scaling factor is the minimum distance between the initial seeds. If you specify the LEAST= option, the scaling factor is an L1 scale estimate and is recomputed on each iteration.

_HOMPAR_

is the value of the homotopy parameter. This variable appears only for LEAST= p with 1 < p < 2.

_BINSIZ_

is the maximum bin size used for estimating medians. This variable appears only for LEAST=1.

If you specify the OUTITER option, the variables _SCALE_ or _RMSSTD_, _RADIUS_, _NEAR_, and _GAP_ have missing values except for the last pass.

You can use the OUTSEED= data set as a SEED= input data set for a subsequent analysis.
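The following sketch shows one way to look at the per-pass seeds that OUTITER adds to the OUTSEED= data set. The data set names stan and seedhist and the variables v1 and v2 are hypothetical.

```sas
/* Save the seeds from every pass, not only the final clusters. */
proc fastclus data=stan maxclusters=3 maxiter=10
              outseed=seedhist outiter;
   var v1 v2;
run;

/* List the seeds pass by pass. NOTSORTED avoids assuming that the */
/* data set is ordered by _ITER_.                                   */
proc print data=seedhist;
   by _iter_ notsorted;
   id cluster;
run;
```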

OUTSTAT= Data Set

The variables in the OUTSTAT= data set are as follows:

  • BY variables, if any

  • a new character variable, _TYPE_, specifying the type of statistic given by other variables (see Table 28.2 and Table 28.3)

  • a new numeric variable giving the cluster number. You can specify the variable name with the CLUSTER= option. The default name is CLUSTER.

  • a new numeric variable, OVER_ALL, containing statistics that apply over all of the VAR variables

  • the VAR variables giving statistics for particular variables

The values of _TYPE_ for all LEAST= options are given in the following table.

Table 28.2: _TYPE_ Values for all LEAST= Options

| _TYPE_ | Contents of VAR variables | Contents of OVER_ALL |
|---|---|---|
| INITIAL | Initial seeds | Missing |
| CRITERION | Missing | Optimization criterion; see the LEAST= option; this value is displayed just before the Cluster Summary table |
| CENTER | Cluster centers; see the LEAST= option | Missing |
| SEED | Cluster seeds: additional information used for imputation | |
| DISPERSION | Dispersion estimates for each cluster; see the LEAST= option; these values are displayed in a separate row with title depending on the LEAST= option | Dispersion estimates pooled over variables; see the LEAST= option; these values are displayed in the Cluster Summary table with label depending on the LEAST= option |
| FREQ | Frequency of each cluster omitting observations with missing values for the VAR variable; these values are not displayed | Frequency of each cluster based on all observations with any nonmissing value; these values are displayed in the Cluster Summary table |
| WEIGHT | Sum of weights for each cluster omitting observations with missing values for the VAR variable; these values are not displayed | Sum of weights for each cluster based on all observations with any nonmissing value; these values are displayed in the Cluster Summary table |

Observations with _TYPE_=WEIGHT are included only if you specify the WEIGHT statement.

The _TYPE_ values included only for least-squares clustering are given in the following table. Least-squares clustering is obtained by omitting the LEAST= option or by specifying LEAST=2.

Table 28.3: _TYPE_ Values for Least-Squares Clustering

| _TYPE_ | Contents of VAR variables | Contents of OVER_ALL |
|---|---|---|
| MEAN | Mean for the total sample; this is not displayed | Missing |
| STD | Standard deviation for the total sample; this is labeled Total STD in the output | Standard deviation pooled over all the VAR variables; this is labeled Total STD in the output |
| WITHIN_STD | Pooled within-cluster standard deviation | Within-cluster standard deviation pooled over clusters and all the VAR variables |
| RSQ | R² for predicting the variable from the clusters; this is labeled R-Squared in the output | R² pooled over all the VAR variables; this is labeled R-Squared in the output |
| RSQ_RATIO | R²/(1 − R²); this is labeled RSQ/(1-RSQ) in the output | R²/(1 − R²); this is labeled RSQ/(1-RSQ) in the output |
| PSEUDO_F | Missing | Pseudo F statistic |
| ESRQ | Missing | Approximate expected value of R² under the null hypothesis of a single uniform cluster |
| CCC | Missing | The cubic clustering criterion |
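To see how the _TYPE_ values select rows of the OUTSTAT= data set, a sketch such as the following can help. The data set names stan and fcstat and the variables v1 and v2 are assumptions for illustration.

```sas
/* Save the clustering statistics. */
proc fastclus data=stan maxclusters=3 outstat=fcstat;
   var v1 v2;
run;

/* Keep only the per-cluster centers, dispersion estimates, and frequencies. */
proc print data=fcstat;
   where _type_ in ('CENTER', 'DISPERSION', 'FREQ');
   var _type_ cluster over_all v1 v2;
run;
```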

Computational Resources

Let

  • n = number of observations

  • v = number of variables

  • c = number of clusters

  • p = number of passes over the data set

Memory

The memory required is approximately 4(19v + 12cv + 10c + 2 max(c + 1, v)) bytes.

If you request the DISTANCE option, an additional 4c(c + 1) bytes of space is needed.
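As a quick check of the arithmetic, the estimate can be evaluated in a DATA step. The values v = 10 and c = 3 are hypothetical.

```sas
data _null_;
   v = 10;   /* hypothetical number of variables */
   c = 3;    /* hypothetical number of clusters  */
   /* Approximate base memory requirement, in bytes. */
   mem   = 4 * (19*v + 12*c*v + 10*c + 2*max(c + 1, v));
   /* Additional bytes needed for the DISTANCE option. */
   extra = 4 * c * (c + 1);
   put mem= extra=;
run;
```

With these values the estimate is 4(190 + 360 + 30 + 20) = 2400 bytes, plus 48 bytes for the DISTANCE option.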

Time

The overall time required by PROC FASTCLUS is roughly proportional to nvcp if c is small with respect to n.

Initial seed selection requires one pass over the data set. If the observations are in random order, the time required is roughly proportional to

$$nvc + vc^2$$

unless you specify REPLACE=NONE. In that case, a complete pass may not be necessary, and the time is roughly proportional to mvc, where c ≤ m ≤ n.

The DRIFT option, each iteration, and the final assignment of cluster seeds each require one pass, with time for each pass roughly proportional to nvc .

For greatest efficiency, you should list the variables in the VAR statement in order of decreasing variance.

Using PROC FASTCLUS

Before using PROC FASTCLUS, decide whether your variables should be standardized in some way, since variables with large variances tend to have more effect on the resulting clusters than those with small variances. If all variables are measured in the same units, standardization may not be necessary. Otherwise, some form of standardization is strongly recommended. The STANDARD procedure can standardize all variables to mean zero and variance one. The FACTOR or PRINCOMP procedures can compute standardized principal component scores. The ACECLUS procedure can transform the variables according to an estimated within-cluster covariance matrix.

Nonlinear transformations of the variables may change the number of population clusters and should, therefore, be approached with caution. For most applications, the variables should be transformed so that equal differences are of equal practical importance. An interval scale of measurement is required. Ordinal or ranked data are generally not appropriate.

PROC FASTCLUS produces relatively little output. In most cases you should create an output data set and use other procedures such as PRINT, PLOT, CHART, MEANS, DISCRIM, or CANDISC to study the clusters. It is usually desirable to try several values of the MAXCLUSTERS= option. Macros are useful for running PROC FASTCLUS repeatedly with other procedures.
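One way to try several values of MAXCLUSTERS= is a small macro loop. This is only a sketch: the macro name tryk, its parameters, and the data set and variable names are all hypothetical.

```sas
%macro tryk(data=stan, vars=v1 v2, from=2, to=4);
   %local k;
   %do k = &from %to &to;
      /* One run per candidate number of clusters; results go to clust2, clust3, ... */
      proc fastclus data=&data out=clust&k maxclusters=&k summary;
         var &vars;
      run;
   %end;
%mend tryk;

%tryk(data=stan, vars=v1 v2, from=2, to=4)
```

The SUMMARY option keeps the displayed output short, which makes it easier to compare runs side by side.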

A simple application of PROC FASTCLUS with two variables to examine the 2- and 3-cluster solutions may proceed as follows:

  proc standard mean=0 std=1 out=stan;
     var v1 v2;
  run;
  proc fastclus data=stan out=clust maxclusters=2;
     var v1 v2;
  run;
  proc plot;
     plot v2*v1=cluster;
  run;
  proc fastclus data=stan out=clust maxclusters=3;
     var v1 v2;
  run;
  proc plot;
     plot v2*v1=cluster;
  run;

If you have more than two variables, you can use the CANDISC procedure to compute canonical variables for plotting the clusters, for example,

  proc standard mean=0 std=1 out=stan;
     var v1-v10;
  run;
  proc fastclus data=stan out=clust maxclusters=3;
     var v1-v10;
  run;
  proc candisc out=can;
     var v1-v10;
     class cluster;
  run;
  proc plot;
     plot can2*can1=cluster;
  run;

If the data set is not too large, it may also be helpful to use

  proc sort;
     by cluster distance;
  run;
  proc print;
     by cluster;
  run;

to list the clusters. By examining the values of DISTANCE, you can determine if any observations are unusually far from their cluster seeds.

It is often advisable, especially if the data set is large or contains outliers, to make a preliminary PROC FASTCLUS run with a large number of clusters, perhaps 20 to 100. Use MAXITER=0 and OUTSEED=SAS-data-set. You can save time on subsequent runs by selecting cluster seeds from this output data set using the SEED= option.

You should check the preliminary clusters for outliers, which often appear as clusters with only one member. Use a DATA step to delete outliers from the data set created by the OUTSEED= option before using it as a SEED= data set in later runs. If there are severe outliers, the subsequent PROC FASTCLUS runs should specify the STRICT option to prevent the outliers from distorting the clusters.

You can use the OUTSEED= data set with the PLOT procedure to plot _GAP_ by _FREQ_. An overlay of _RADIUS_ by _FREQ_ provides a baseline against which to compare the values of _GAP_. Outliers appear in the upper left area of the plot, with large values of _GAP_ and small _FREQ_ values. Good clusters appear in the upper right area, with large values of both _GAP_ and _FREQ_. Good potential cluster seeds appear in the lower right, as well as in the upper right, since large _FREQ_ values indicate high-density regions. Small _FREQ_ values in the left part of the plot indicate poor cluster seeds because the points are in low-density regions. It often helps to remove all clusters with small frequencies even though the clusters may not be remote enough to be considered outliers. Removing points in low-density regions improves cluster separation and provides visually sharper cluster outlines in scatter plots.
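The preliminary-run workflow described above might be sketched as follows. The data set names (stan, prelim, seeds, clust), the variables v1 and v2, the choice of 50 preliminary clusters, the frequency cutoff, and the STRICT= value are all assumptions chosen for illustration; suitable values depend on the data.

```sas
/* Preliminary run: many clusters, no iterations, seeds saved. */
proc fastclus data=stan maxclusters=50 maxiter=0 outseed=prelim;
   var v1 v2;
run;

/* Plot _GAP_ against _FREQ_, with _RADIUS_ overlaid as a baseline. */
proc plot data=prelim;
   plot _gap_*_freq_='G' _radius_*_freq_='R' / overlay;
run;

/* Drop single-member clusters, which are likely outliers. */
data seeds;
   set prelim;
   if _freq_ > 1;
run;

/* Final run: start from the cleaned seeds; STRICT= keeps remote */
/* observations from distorting the clusters.                     */
proc fastclus data=stan seed=seeds maxclusters=3 strict=2 out=clust;
   var v1 v2;
run;
```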

Displayed Output

Unless the SHORT or SUMMARY option is specified, PROC FASTCLUS displays

  • Initial Seeds, cluster seeds selected after one pass through the data

  • Change in Cluster Seeds for each iteration, if you specify MAXITER=n > 1

If you specify the LEAST= p option, with (1 < p < 2), and you omit the IRLS option, an additional column is displayed in the Iteration History table. The column contains a character to identify the method used in each iteration. PROC FASTCLUS chooses the most efficient method to cluster the data at each iterative step, given the condition of the data. Thus, the method chosen is data dependent. The possible values are described as follows:

| Value | Method |
|---|---|
| N | Newton's Method |
| I or L | iteratively weighted least squares (IRLS) |
| 1 | IRLS step, halved once |
| 2 | IRLS step, halved twice |
| 3 | IRLS step, halved three times |
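A run that would display this extra column could look like the sketch below. The data set name, the variables, and the values 1.5 and 20 are hypothetical; the IRLS option is deliberately omitted so that the procedure chooses the method at each step.

```sas
/* LEAST=p with 1 < p < 2 and no IRLS option: the Iteration History */
/* table gains a column identifying the method used in each step.   */
proc fastclus data=stan maxclusters=3 least=1.5 maxiter=20;
   var v1 v2;
run;
```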

PROC FASTCLUS displays a Cluster Summary, giving the following for each cluster:

  • Cluster number

  • Frequency, the number of observations in the cluster

  • Weight, the sum of the weights of the observations in the cluster, if you specify the WEIGHT statement

  • RMS Std Deviation, the root mean square across variables of the cluster standard deviations, which is equal to the root mean square distance between observations in the cluster

  • Maximum Distance from Seed to Observation, the maximum distance from the cluster seed to any observation in the cluster

  • Nearest Cluster, the number of the cluster with mean closest to the mean of the current cluster

  • Centroid Distance, the distance between the centroids (means) of the current cluster and the nearest other cluster

A table of statistics for each variable is displayed unless you specify the SUMMARY option. The table contains

  • Total STD, the total standard deviation

  • Within STD, the pooled within-cluster standard deviation

  • R-Squared, the R² for predicting the variable from the cluster

  • RSQ/(1 - RSQ), the ratio of between-cluster variance to within-cluster variance (R²/(1 − R²))

  • OVER-ALL, all of the previous quantities pooled across variables

PROC FASTCLUS also displays

  • Pseudo F Statistic,

    $$\frac{R^2 / (c - 1)}{(1 - R^2) / (n - c)}$$

    where R² is the observed overall R², c is the number of clusters, and n is the number of observations. The pseudo F statistic was suggested by Calinski and Harabasz (1974). Refer to Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the use of the pseudo F statistic in estimating the number of clusters. See Example 23.2 in Chapter 23, The CLUSTER Procedure, for a comparison of pseudo F statistics.

  • Observed Overall R-Squared, if you specify the SUMMARY option

  • Approximate Expected Overall R-Squared, the approximate expected value of the overall R² under the uniform null hypothesis assuming that the variables are uncorrelated. The value is missing if the number of clusters is greater than one-fifth the number of observations.

  • Cubic Clustering Criterion, computed under the assumption that the variables are uncorrelated. The value is missing if the number of clusters is greater than one-fifth the number of observations.

    If you are interested in the approximate expected R² or the cubic clustering criterion but your variables are correlated, you should cluster principal component scores from the PRINCOMP procedure. Both of these statistics are described by Sarle (1983). The performance of the cubic clustering criterion in estimating the number of clusters is examined by Milligan and Cooper (1985) and Cooper and Milligan (1988).

  • Distances Between Cluster Means, if you specify the DISTANCE option
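For the pseudo F statistic listed above, a small worked calculation may be useful. The values of the overall R², the number of clusters, and the number of observations are hypothetical.

```sas
data _null_;
   rsq = 0.8;   /* hypothetical observed overall R-squared */
   c   = 3;     /* hypothetical number of clusters         */
   n   = 100;   /* hypothetical number of observations     */
   /* Ratio of between-cluster to within-cluster variation, */
   /* each divided by its degrees of freedom.               */
   pseudo_f = (rsq / (c - 1)) / ((1 - rsq) / (n - c));
   put pseudo_f=;
run;
```

With these values the statistic is (0.8/2)/(0.2/97), or about 194.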

Unless you specify the SHORT or SUMMARY option, PROC FASTCLUS displays

  • Cluster Means for each variable

  • Cluster Standard Deviations for each variable

ODS Table Names

PROC FASTCLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 28.4: ODS Tables Produced in PROC FASTCLUS

| ODS Table Name | Description | Statement | Option |
|---|---|---|---|
| ApproxExpOverAllRSq | Approximate expected over-all R-squared, single number | PROC | default |
| CCC | CCC, Cubic Clustering Criterion, single number | PROC | default |
| ClusterList | Cluster listing, obs, id, and distances | PROC | LIST |
| ClusterSum | Cluster summary, cluster number, distances | PROC | PRINTALL |
| ClusterCenters | Cluster centers | PROC | default |
| ClusterDispersion | Cluster dispersion | PROC | default |
| ConvergenceStatus | Convergence status | PROC | PRINTALL |
| Criterion | Criterion based on final seeds, single number | PROC | default |
| DistBetweenClust | Distance between clusters | PROC | default |
| InitialSeeds | Initial seeds | PROC | default |
| IterHistory | Iteration history, various statistics for each iter | PROC | PRINTALL |
| MinDist | Minimum distance between initial seeds, single number | PROC | PRINTALL |
| NumberOfBins | Number of bins | PROC | default |
| ObsOverAllRSquare | Observed over-all R-squared, single number | PROC | SUMMARY |
| PrelScaleEst | Preliminary L(1) scale estimate, single number | PROC | PRINTALL |
| PseudoFStat | Pseudo F statistic, single number | PROC | default |
| SimpleStatistics | Simple statistics for input variables | PROC | default |
| VariableStat | Statistics for variables within clusters | PROC | default |
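As a sketch of how these table names can be used, the following captures two of the listed tables as data sets with the ODS OUTPUT statement. The output data set names varstat and pseudof, the input data set stan, and the variables v1 and v2 are hypothetical.

```sas
/* Request that two of the displayed tables also be written to data sets. */
ods output VariableStat=varstat PseudoFStat=pseudof;

proc fastclus data=stan maxclusters=3;
   var v1 v2;
run;

/* Inspect the captured per-variable statistics. */
proc print data=varstat;
run;
```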




SAS/STAT 9.1 User's Guide, Volume 2
ISBN: B003ZVJDOK
EAN: N/A
Year: 2004
Pages: 92
