Details


Missing Values

Observations containing missing values are omitted from the analysis.

Using PROC VARCLUS

Default options for PROC VARCLUS often provide satisfactory results. If you want to change the final number of clusters, use the MAXCLUSTERS=, MAXEIGEN=, or PROPORTION= options. The MAXEIGEN= and PROPORTION= options usually produce similar results but occasionally cause different clusters to be selected for splitting. The MAXEIGEN= option tends to choose clusters with a large number of variables, while the PROPORTION= option is more likely to select a cluster with a small number of variables.
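As a rough sketch of how these three stopping criteria might be specified, the following steps run the procedure three ways. The data set name Survey, the variables q1-q20, and the threshold values are hypothetical and chosen only for illustration.

```sas
/* Hypothetical data set Survey with numeric variables q1-q20 */

/* Stop splitting once 4 clusters have been formed */
proc varclus data=Survey maxclusters=4;
   var q1-q20;
run;

/* Keep splitting while any cluster has a second eigenvalue above 0.7 */
proc varclus data=Survey maxeigen=0.7;
   var q1-q20;
run;

/* Keep splitting until every cluster component explains
   at least 75 percent of the variation in its cluster */
proc varclus data=Survey proportion=0.75;
   var q1-q20;
run;
```

Comparing the final number of clusters across the three runs is one way to see whether the criteria agree for a given data set.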

Execution time

PROC VARCLUS usually requires more computer time than principal factor analysis, but it can be faster than some of the iterative factoring methods. If you have more than 30 variables, you may want to reduce execution time by one or more of the following methods:

  • Specify the MINCLUSTERS= and MAXCLUSTERS= options if you know how many clusters you want.

  • Specify the HIERARCHY option.

  • Specify the SEED statement if you have some prior knowledge of what clusters to expect.
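The time-saving options above might be combined as in the following sketch. The data set Survey, the variables q1-q40, and the choice of seed variables are assumptions made for illustration only.

```sas
/* Hypothetical data set Survey with numeric variables q1-q40 */

/* Restrict the analysis to solutions with 5 to 8 clusters and
   keep the clusters hierarchical from one solution to the next */
proc varclus data=Survey minclusters=5 maxclusters=8 hierarchy;
   var q1-q40;
run;

/* Seed three initial clusters with variables that are expected
   to fall into different clusters */
proc varclus data=Survey maxclusters=3;
   var q1-q40;
   seed q1 q15 q30;
run;
```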

If computer time is not a limiting factor, you may want to try one of the following methods to obtain a better solution:

  • If the clustering algorithm has not converged, specify larger values for MAXITER= and MAXSEARCH=.

  • Try several factoring and rotation methods with PROC FACTOR to use as input to PROC VARCLUS.

  • Run PROC VARCLUS several times, specifying INITIAL=RANDOM.
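A minimal sketch of the first and third suggestions follows. The data set Survey, the variables q1-q20, the iteration limits, and the RANDOM= seed values are all illustrative assumptions.

```sas
/* Hypothetical data set Survey with numeric variables q1-q20 */

/* Allow more iterations in both phases (values are illustrative) */
proc varclus data=Survey maxclusters=4 maxiter=50 maxsearch=50;
   var q1-q20;
run;

/* Repeat the 4-cluster analysis from different random starting
   assignments, then compare the variation explained by each run */
proc varclus data=Survey minclusters=4 maxclusters=4
             initial=random random=101 short;
   var q1-q20;
run;

proc varclus data=Survey minclusters=4 maxclusters=4
             initial=random random=202 short;
   var q1-q20;
run;
```

Fixing the RANDOM= value makes each random start reproducible, which helps when the best of several runs needs to be rerun later.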

Output Data Sets

OUTSTAT= Data Set

The OUTSTAT= data set is TYPE=CORR, and it can be used as input to the SCORE procedure or a subsequent run of PROC VARCLUS. The variables it contains are

  • BY variables

  • _NCL_, a numeric variable giving the number of clusters

  • _TYPE_, a character variable indicating the type of statistic the observation contains

  • _NAME_, a character variable containing a variable name or a cluster name, which is of the form CLUSn, where n is the number of the cluster

  • the variables that are clustered

The values of the _TYPE_ variable are listed in the following table.

Table 78.2: _TYPE_ Value and Statistic

_TYPE_      Contents
----------  ------------------------------------------------------------
MEAN        means
STD         standard deviations
USTD        uncorrected standard deviations, produced when the NOINT option is specified
N           number of observations
CORR        correlations
UCORR       uncorrected correlation matrix, produced when the NOINT option is specified
MEMBERS     number of members in each cluster
VAREXP      variance explained by each cluster
PROPOR      proportion of variance explained by each cluster
GROUP       number of the cluster to which each variable belongs
RSQUARED    squared multiple correlation of each variable with its cluster component
SCORE       standardized scoring coefficients
USCORE      scoring coefficients to be applied without subtracting the mean from the raw variables, produced when the NOINT option is specified
STRUCTUR    cluster structure
CCORR       correlations between cluster components

The observations with _TYPE_ ='MEAN', 'STD', 'N', and 'CORR' have missing values for the _NCL_ variable. All other values of the _TYPE_ variable are repeated for each cluster solution, with different solutions distinguished by the value of the _NCL_ variable. If you want to specify the OUTSTAT= data set with the SCORE procedure, you can use a DATA step to select observations with the _NCL_ variable missing or equal to the desired number of clusters.

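  /* Keep the solution-independent statistics (_NCL_ missing)
     and the statistics for the three-cluster solution */
  data Coef2;
     set Coef;
     if _ncl_ = . or _ncl_ = 3;
     drop _ncl_;
  run;

  /* Score the new data with the selected coefficients */
  proc score data=NewScore score=Coef2;
  run;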

PROC SCORE standardizes the new data by subtracting the original variable means that are stored in the _TYPE_ ='MEAN' observations, and dividing by the original variable standard deviations from the _TYPE_ ='STD' observations. Then PROC SCORE multiplies the standardized variables by the coefficients from the _TYPE_ ='SCORE' observations to get the cluster scores.
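For context, a data set such as Coef could be produced by an earlier PROC VARCLUS step along the following lines. The training data set Train and the variables x1-x10 are hypothetical.

```sas
/* Hypothetical training data Train with numeric variables x1-x10 */
proc varclus data=Train maxclusters=3 outstat=Coef noprint;
   var x1-x10;
run;

/* Inspect the scoring coefficients and cluster assignments
   stored for each cluster solution */
proc print data=Coef;
   where _type_ in ('SCORE', 'GROUP');
run;
```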

OUTTREE= Data Set

The OUTTREE= data set contains one observation for each variable clustered plus one observation for each cluster of two or more variables, that is, one observation for each node of the cluster tree. The total number of output observations is between n and 2n − 1, where n is the number of variables clustered.

The variables in the OUTTREE= data set are

  • BY variables, if any

  • _NAME_, a character variable giving the name of the node. If the node is a cluster, the name is CLUSn, where n is the number of the cluster. If the node is a single variable, the variable name is used.

  • _PARENT_, a character variable giving the value of _NAME_ of the parent of the node. If the node is the root of the tree, _PARENT_ is blank.

  • _LABEL_, a character variable giving the label of the node. If the node is a cluster, the label is CLUSn, where n is the number of the cluster. If the node is a single variable, the variable label is used.

  • _NCL_, the number of clusters.

  • _VAREXP_, the total variance explained by the clusters at the current level of the tree.

  • _PROPOR_, the total proportion of variance explained by the clusters at the current level of the tree.

  • _MINPRO_, the minimum proportion of variance explained by a cluster component.

  • _MAXEIG_, the maximum second eigenvalue of a cluster.
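One common use of the OUTTREE= data set is to draw the cluster tree with the TREE procedure. The following is a sketch under assumed names: the data set Survey and the variables q1-q20 are hypothetical.

```sas
/* Hypothetical data set Survey with numeric variables q1-q20 */
proc varclus data=Survey maxclusters=6 hierarchy outtree=Tree noprint;
   var q1-q20;
run;

/* Draw the cluster tree, using the proportion of variance
   explained as the height axis */
proc tree data=Tree horizontal;
   height _propor_;
run;
```

Other variables in the data set, such as _MAXEIG_ or _VAREXP_, could be named in the HEIGHT statement instead.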

Computational Resources

Let

  • n = number of observations

  • v = number of variables

  • c = number of clusters

It is assumed that, at each stage of clustering, the clusters all contain the same number of variables.

Time

The time required for PROC VARCLUS to analyze a given data set varies greatly depending on the number of clusters requested, the number of iterations in both the alternating least-squares and search phases, and whether centroid or principal components are used.

The time required to compute the correlation matrix is roughly proportional to nv².

Default cluster initialization requires time roughly proportional to v³. Any other method of initialization requires time roughly proportional to cv².

In the alternating least-squares phase, each iteration requires time roughly proportional to cv² if centroid components are used or [formula missing] if principal components are used.

In the search phase, each iteration requires time roughly proportional to v³/c if centroid components are used or v⁴/c² if principal components are used. The HIERARCHY option speeds up each iteration after the first split by as much as c/2.

Memory

The amount of memory, in bytes, needed by PROC VARCLUS is approximately [formula missing].

Interpreting VARCLUS Procedure Output

Because PROC VARCLUS is a type of oblique component analysis, its output is similar to the output from the FACTOR procedure for oblique rotations. The scoring coefficients have the same meaning in both PROC VARCLUS and PROC FACTOR; they are coefficients applied to the standardized variables to compute component scores. The cluster structure is analogous to the factor structure containing the correlations between each variable and each cluster component. A cluster pattern is not displayed because it would be the same as the cluster structure, except that zeros would appear in the same places in which zeros appear in the scoring coefficients. The intercluster correlations are analogous to interfactor correlations; they are the correlations among cluster components.

PROC VARCLUS also displays a cluster summary and a cluster listing. The cluster summary gives the number of variables in each cluster and the variation explained by the cluster component. The latter is similar to the variation explained by a factor but includes contributions from only the variables in that cluster rather than from all variables, as in PROC FACTOR. The proportion of variance explained is obtained by dividing the variance explained by the total variance of variables in the cluster. If the cluster contains two or more variables and the CENTROID option is not used, the second largest eigenvalue of the cluster is also displayed.

The cluster listing gives the variables in each cluster. Two squared correlations are calculated for each variable. The column labeled 'Own Cluster' gives the squared correlation of the variable with its own cluster component. This value should be higher than the squared correlation with any other cluster unless an iteration limit has been exceeded or the CENTROID option has been used. The larger the squared correlation is, the better. The column labeled 'Next Closest' contains the next highest squared correlation of the variable with a cluster component. This value is low if the clusters are well separated. The column headed '1-R**2 Ratio' gives the ratio of one minus the 'Own Cluster' R² to one minus the 'Next Closest' R². A small '1-R**2 Ratio' indicates a good clustering.
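Written as a formula, with illustrative values that are not taken from any real analysis, the ratio described above is:

```latex
\[
  \text{1-R**2 Ratio}
  = \frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{next}}}
\]
% Illustrative values only:
% a well-placed variable, R^2_own = 0.80 and R^2_next = 0.20
\[
  \frac{1 - 0.80}{1 - 0.20} = \frac{0.20}{0.80} = 0.25
\]
% a poorly separated variable, R^2_own = 0.55 and R^2_next = 0.50
\[
  \frac{1 - 0.55}{1 - 0.50} = \frac{0.45}{0.50} = 0.90
\]
```

The first variable is strongly tied to its own cluster and weakly tied to the next closest one, so its ratio is small. The second is almost as close to a neighboring cluster as to its own, so its ratio approaches 1.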

Displayed Output

The following items are displayed for each cluster solution unless the NOPRINT or SUMMARY option is specified. The CLUSTER SUMMARY table includes

  • the Cluster number

  • Members, the number of members in the cluster

  • Cluster Variation of the variables in the cluster

  • Variation Explained by the cluster component. This statistic is based only on the variables in the cluster rather than on all variables.

  • Proportion Explained, the result of dividing the variation explained by the cluster variation

  • Second Eigenvalue, the second largest eigenvalue of the cluster. This is displayed if the cluster contains more than one variable and the CENTROID option is not specified.

PROC VARCLUS also displays

  • Total variation explained, the sum across clusters of the variation explained by each cluster

  • Proportion, the total explained variation divided by the total variation of all the variables

The cluster listing includes

  • Variable, the variables in each cluster

  • R-squared with Own Cluster, the squared correlation of the variable with its own cluster component; and R-squared with Next Closest, the next highest squared correlation of the variable with a cluster component. Own Cluster values should be higher than the R² with any other cluster unless an iteration limit is exceeded or you specify the CENTROID option. Next Closest should be a low value if the clusters are well separated.

  • 1-R**2 Ratio, the ratio of one minus the value in the Own Cluster column to one minus the value in the Next Closest column. The occurrence of low ratios indicates well-separated clusters.

If the SHORT option is not specified, PROC VARCLUS also displays

  • Standardized Scoring Coefficients, standardized regression coefficients for predicting cluster components from variables

  • Cluster Structure, the correlations between each variable and each cluster component

  • Inter-Cluster Correlations, the correlations between the cluster components

If the analysis includes partitions for two or more numbers of clusters, a final summary table is displayed. Each row of the table corresponds to one partition. The columns include

  • Number of Clusters

  • Total Variation Explained by Clusters

  • Proportion of Variation Explained by Clusters

  • Minimum Proportion (of variation) Explained by a Cluster

  • Maximum Second Eigenvalue in a Cluster

  • Minimum R-squared for a Variable

  • Maximum 1-R**2 Ratio for a Variable

ODS Table Names

PROC VARCLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 78.3: ODS Tables Produced in PROC VARCLUS

ODS Table Name      Description                          Option
------------------  -----------------------------------  -------
ClusterQuality      Cluster quality                      default
ClusterStructure    Cluster structure                    default
ClusterSummary      Cluster summary                      default
ConvergenceStatus   Convergence status                   default
Corr                Correlations                         CORR
DataOptSummary      Data and options summary table       default
InterClusterCorr    Inter-cluster correlations           default
IterHistory         Iteration history                    TRACE
RSquare             Cluster Rsq                          default
SimpleStatistics    Simple statistics                    SIMPLE
StdScoreCoef        Standardized scoring coefficients    default
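As a brief sketch of how the table names above might be used, the following statements restrict the displayed output to two tables and save them as data sets. The data set Survey, the variables q1-q20, and the output data set names Summary and Rsq are hypothetical.

```sas
/* Hypothetical data set Survey with numeric variables q1-q20 */

/* Display only two tables and save them as data sets
   (the names Summary and Rsq are arbitrary) */
ods select ClusterSummary RSquare;
ods output ClusterSummary=Summary RSquare=Rsq;

proc varclus data=Survey maxclusters=4;
   var q1-q20;
run;
```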




SAS/STAT 9.1 Users Guide, Volumes 1-7
ISBN: 1590472438
Year: 2004
Pages: 132
