Observations containing missing values are omitted from the analysis.
Default options for PROC VARCLUS often provide satisfactory results. If you want to change the final number of clusters, use the MAXCLUSTERS=, MAXEIGEN=, or PROPORTION= options. The MAXEIGEN= and PROPORTION= options usually produce similar results but occasionally cause different clusters to be selected for splitting. The MAXEIGEN= option tends to choose clusters with a large number of variables, while the PROPORTION= option is more likely to select a cluster with a small number of variables.
PROC VARCLUS usually requires more computer time than principal factor analysis, but it can be faster than some of the iterative factoring methods. If you have more than 30 variables, you may want to reduce execution time by one or more of the following methods:
Specify the MINCLUSTERS= and MAXCLUSTERS= options if you know how many clusters you want.
Specify the HIERARCHY option.
Specify the SEED statement if you have some prior knowledge of what clusters to expect.
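The three time-saving methods above can be sketched as follows. The data set name Scores, the variables x1-x40, the cluster count, and the seed variables are hypothetical placeholders chosen for illustration; substitute your own.

```sas
/* Hypothetical input: data set Scores with variables x1-x40 */

/* Known number of clusters: fix both the minimum and the maximum */
proc varclus data=Scores minclusters=4 maxclusters=4;
   var x1-x40;
run;

/* Keep the clusters hierarchical across successive splits */
proc varclus data=Scores maxclusters=4 hierarchy;
   var x1-x40;
run;

/* Prior knowledge: name one seed variable per expected cluster */
proc varclus data=Scores;
   var x1-x40;
   seed x1 x11 x21 x31;
run;
```

These steps are alternatives and can also be combined; which one saves the most time depends on the data.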
If computer time is not a limiting factor, you may want to try one of the following methods to obtain a better solution:
If the clustering algorithm has not converged, specify larger values for MAXITER= and MAXSEARCH=.
Try several factoring and rotation methods with PROC FACTOR to use as input to PROC VARCLUS.
Run PROC VARCLUS several times, specifying INITIAL=RANDOM.
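As a minimal sketch of the first and third suggestions, the following steps raise the iteration limits and then rerun the analysis from a random start. The data set Scores, the variables x1-x40, the limit of 100, and the RANDOM= starting values are hypothetical choices, not recommendations from this chapter.

```sas
/* Hypothetical input: data set Scores with variables x1-x40 */

/* Allow more iterations in both phases if convergence failed */
proc varclus data=Scores maxclusters=4 maxiter=100 maxsearch=100;
   var x1-x40;
run;

/* Random initialization; RANDOM= makes a given start reproducible */
proc varclus data=Scores minclusters=4 maxclusters=4
             initial=random random=12345;
   var x1-x40;
run;

/* Same analysis from a different random start, for comparison */
proc varclus data=Scores minclusters=4 maxclusters=4
             initial=random random=67890;
   var x1-x40;
run;
```

Comparing the proportion of variation explained across the random starts is one way to choose among the resulting solutions.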
The OUTSTAT= data set is TYPE=CORR, and it can be used as input to the SCORE procedure or a subsequent run of PROC VARCLUS. The variables it contains are
BY variables
_NCL_, a numeric variable giving the number of clusters
_TYPE_, a character variable indicating the type of statistic the observation contains
_NAME_, a character variable containing a variable name or a cluster name, which is of the form CLUSn, where n is the number of the cluster
the variables that are clustered
The values of the _TYPE_ variable are listed in the following table.
_TYPE_ | Contents |
---|---|
MEAN | means |
STD | standard deviations |
USTD | uncorrected standard deviations, produced when the NOINT option is specified |
N | number of observations |
CORR | correlations |
UCORR | uncorrected correlation matrix, produced when the NOINT option is specified |
MEMBERS | number of members in each cluster |
VAREXP | variance explained by each cluster |
PROPOR | proportion of variance explained by each cluster |
GROUP | number of the cluster to which each variable belongs |
RSQUARED | squared multiple correlation of each variable with its cluster component |
SCORE | standardized scoring coefficients |
USCORE | scoring coefficients to be applied without subtracting the mean from the raw variables, produced when the NOINT option is specified |
STRUCTUR | cluster structure |
CCORR | correlations between cluster components |
The observations with _TYPE_='MEAN', 'STD', 'N', and 'CORR' have missing values for the _NCL_ variable. All other values of the _TYPE_ variable are repeated for each cluster solution, with different solutions distinguished by the value of the _NCL_ variable. If you want to specify the OUTSTAT= data set with the SCORE procedure, you can use a DATA step to select observations with the _NCL_ variable missing or equal to the desired number of clusters.
   data Coef2;
      set Coef;
      if _ncl_ = . or _ncl_ = 3;
      drop _ncl_;
   run;

   proc score data=NewScore score=Coef2;
   run;
PROC SCORE standardizes the new data by subtracting the original variable means that are stored in the _TYPE_='MEAN' observations, and dividing by the original variable standard deviations from the _TYPE_='STD' observations. Then PROC SCORE multiplies the standardized variables by the coefficients from the _TYPE_='SCORE' observations to get the cluster scores.
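The scoring computation just described can be written out as follows. The symbols are introduced here for illustration only and do not appear in the procedure's output: $x_j$ is the raw value of variable $j$ in a new observation, $\bar{x}_j$ and $s_j$ are the stored mean and standard deviation of variable $j$, $b_{jk}$ is the standardized scoring coefficient of variable $j$ for cluster $k$, and $v$ is the number of variables.

```latex
z_j = \frac{x_j - \bar{x}_j}{s_j},
\qquad
\text{score}_k = \sum_{j=1}^{v} b_{jk}\, z_j
```

Because the scoring coefficients are zero for variables outside a cluster, only the variables that belong to cluster $k$ contribute to its score.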
The OUTTREE= data set contains one observation for each variable clustered plus one observation for each cluster of two or more variables, that is, one observation for each node of the cluster tree. The total number of output observations is between $n$ and $2n - 1$, where $n$ is the number of variables clustered.
The variables in the OUTTREE= data set are
BY variables, if any
_NAME_, a character variable giving the name of the node. If the node is a cluster, the name is CLUSn, where n is the number of the cluster. If the node is a single variable, the variable name is used.
_PARENT_, a character variable giving the value of _NAME_ of the parent of the node. If the node is the root of the tree, _PARENT_ is blank.
_LABEL_, a character variable giving the label of the node. If the node is a cluster, the label is CLUSn, where n is the number of the cluster. If the node is a single variable, the variable label is used.
_NCL_, the number of clusters.
_VAREXP_, the total variance explained by the clusters at the current level of the tree.
_PROPOR_, the total proportion of variance explained by the clusters at the current level of the tree.
_MINPRO_, the minimum proportion of variance explained by a cluster component.
_MAXEIG_, the maximum second eigenvalue of a cluster.
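One common use of the OUTTREE= data set is to draw the cluster tree with the TREE procedure. The following is a minimal sketch; the data set names Scores and Tree and the variables x1-x40 are hypothetical, and _PROPOR_ is only one of the variables listed above that could serve as the height axis.

```sas
/* Hypothetical input: data set Scores with variables x1-x40 */
proc varclus data=Scores hierarchy outtree=Tree;
   var x1-x40;
run;

/* Draw the tree, using total proportion of variance explained as height */
proc tree data=Tree horizontal;
   height _propor_;
run;
```

Replacing _PROPOR_ with _MAXEIG_ or _VAREXP_ in the HEIGHT statement plots the same tree against a different criterion.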
Let
n = number of observations
v = number of variables
c = number of clusters
It is assumed that, at each stage of clustering, the clusters all contain the same number of variables.
The time required for PROC VARCLUS to analyze a given data set varies greatly depending on the number of clusters requested, the number of iterations in both the alternating least-squares and search phases, and whether centroid or principal components are used.
The time required to compute the correlation matrix is roughly proportional to $nv^2$.
Default cluster initialization requires time roughly proportional to $v^3$. Any other method of initialization requires time roughly proportional to $cv^2$.
In the alternating least-squares phase, each iteration requires time roughly proportional to $cv^2$ if centroid components are used or
if principal components are used.
In the search phase, each iteration requires time roughly proportional to $v^3/c$ if centroid components are used or $v^4/c^2$ if principal components are used. The HIERARCHY option speeds up each iteration after the first split by as much as $c/2$.
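To give a feel for the search-phase estimates, the following worked example uses hypothetical values $v = 40$ variables and $c = 4$ clusters. The results are relative units of work, not seconds.

```latex
\text{centroid components:}\quad \frac{v^3}{c} = \frac{40^3}{4} = \frac{64\,000}{4} = 16\,000
\qquad
\text{principal components:}\quad \frac{v^4}{c^2} = \frac{40^4}{4^2} = \frac{2\,560\,000}{16} = 160\,000
```

The ratio of the two estimates is $v/c = 10$, which under the equal-size assumption above is the number of variables per cluster. By these estimates, a search iteration with principal components costs about that many times more than one with centroid components.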
The amount of memory, in bytes, needed by PROC VARCLUS is approximately
Because PROC VARCLUS is a type of oblique component analysis, its output is similar to the output from the FACTOR procedure for oblique rotations. The scoring coefficients have the same meaning in both PROC VARCLUS and PROC FACTOR; they are coefficients applied to the standardized variables to compute component scores. The cluster structure is analogous to the factor structure containing the correlations between each variable and each cluster component. A cluster pattern is not displayed because it would be the same as the cluster structure, except that zeros would appear in the same places in which zeros appear in the scoring coefficients. The intercluster correlations are analogous to interfactor correlations; they are the correlations among cluster components.
PROC VARCLUS also displays a cluster summary and a cluster listing. The cluster summary gives the number of variables in each cluster and the variation explained by the cluster component. The latter is similar to the variation explained by a factor but includes contributions from only the variables in that cluster rather than from all variables, as in PROC FACTOR. The proportion of variance explained is obtained by dividing the variance explained by the total variance of variables in the cluster. If the cluster contains two or more variables and the CENTROID option is not used, the second largest eigenvalue of the cluster is also displayed.
The cluster listing gives the variables in each cluster. Two squared correlations are calculated for each cluster. The column labeled 'Own Cluster' gives the squared correlation of the variable with its own cluster component. This value should be higher than the squared correlation with any other cluster unless an iteration limit has been exceeded or the CENTROID option has been used. The larger the squared correlation is, the better. The column labeled 'Next Closest' contains the next highest squared correlation of the variable with a cluster component. This value is low if the clusters are well separated. The column headed '1-R**2 Ratio' gives the ratio of one minus the 'Own Cluster' $R^2$ to one minus the 'Next Closest' $R^2$. A small '1-R**2 Ratio' indicates a good clustering.
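Written as a formula, with subscripts introduced here for illustration, the ratio for a single variable is shown below, followed by a worked case using hypothetical values $R^2_{\text{own}} = 0.81$ and $R^2_{\text{next}} = 0.24$.

```latex
\text{ratio} = \frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{next}}},
\qquad
\frac{1 - 0.81}{1 - 0.24} = \frac{0.19}{0.76} = 0.25
```

The ratio shrinks toward zero as the variable correlates more strongly with its own cluster component, and it grows as the next closest cluster becomes a better fit.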
The following items are displayed for each cluster solution unless the NOPRINT or SUMMARY option is specified. The CLUSTER SUMMARY table includes
the Cluster number
Members, the number of members in the cluster
Cluster Variation of the variables in the cluster
Variation Explained by the cluster component. This statistic is based only on the variables in the cluster rather than on all variables.
Proportion Explained, the result of dividing the variation explained by the cluster variation
Second Eigenvalue, the second largest eigenvalue of the cluster. This is displayed if the cluster contains more than one variable and the CENTROID option is not specified
PROC VARCLUS also displays
Total variation explained, the sum across clusters of the variation explained by each cluster
Proportion, the total explained variation divided by the total variation of all the variables
The cluster listing includes
Variable, the variables in each cluster
R-squared with Own Cluster, the squared correlation of the variable with its own cluster component; and R-squared with Next Closest, the next highest squared correlation of the variable with a cluster component. Own Cluster values should be higher than the $R^2$ with any other cluster unless an iteration limit is exceeded or you specify the CENTROID option. Next Closest should be a low value if the clusters are well separated.
1-R**2 Ratio, the ratio of one minus the value in the Own Cluster column to one minus the value in the Next Closest column. The occurrence of low ratios indicates well-separated clusters.
If the SHORT option is not specified, PROC VARCLUS also displays
Standardized Scoring Coefficients, standardized regression coefficients for predicting cluster components from variables
Cluster Structure, the correlations between each variable and each cluster component
Inter-Cluster Correlations, the correlations between the cluster components
If the analysis includes partitions for two or more numbers of clusters, a final summary table is displayed. Each row of the table corresponds to one partition. The columns include
Number of Clusters
Total Variation Explained by Clusters
Proportion of Variation Explained by Clusters
Minimum Proportion (of variation) Explained by a Cluster
Maximum Second Eigenvalue in a Cluster
Minimum R-squared for a Variable
Maximum 1-R**2 Ratio for a Variable
PROC VARCLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'
ODS Table Name | Description | Option |
---|---|---|
ClusterQuality | Cluster quality | default |
ClusterStructure | Cluster structure | default |
ClusterSummary | Cluster summary | default |
ConvergenceStatus | Convergence status | default |
Corr | Correlations | CORR |
DataOptSummary | Data and options summary table | default |
InterClusterCorr | Inter-cluster correlations | default |
IterHistory | Iteration history | TRACE |
RSquare | Cluster Rsq | default |
SimpleStatistics | Simple statistics | SIMPLE |
StdScoreCoef | Standardized scoring coefficients | default |
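As a minimal sketch of how these table names are used, the following step saves two of the tables as data sets with the ODS OUTPUT statement. The input data set Scores, the variables x1-x40, and the output data set names ClusSumm and ClusRsq are hypothetical.

```sas
/* Hypothetical input: data set Scores with variables x1-x40 */

/* Capture the cluster summary and R-square tables as data sets */
ods output ClusterSummary=ClusSumm RSquare=ClusRsq;

proc varclus data=Scores maxclusters=4;
   var x1-x40;
run;

/* Inspect the captured R-square table */
proc print data=ClusRsq;
run;
```

When the analysis produces several cluster solutions, each captured data set holds the rows for every solution, so a later step may need to subset it.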