Syntax | SAS/STAT 9.1 User's Guide, Volumes 1-7

The following statements are available in the ACECLUS procedure.

PROC ACECLUS PROPORTION=p | THRESHOLD=t < options > ;
- BY variables ;
- FREQ variable ;
- VAR variables ;
- WEIGHT variable ;

Usually you need only the VAR statement in addition to the required PROC ACECLUS statement. The optional BY, FREQ, VAR, and WEIGHT statements are described in alphabetical order after the PROC ACECLUS statement.
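As a minimal sketch of that typical usage, the following step supplies only the PROC ACECLUS statement and a VAR statement. The data set name work.measurements, the variables x1-x3, and the value 0.03 are hypothetical placeholders, not values taken from this chapter.

```sas
/* Hypothetical input: work.measurements with numeric variables x1, x2, x3 */
proc aceclus data=work.measurements proportion=0.03;
   var x1 x2 x3;   /* variables to analyze */
run;
```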

PROC ACECLUS Statement

PROC ACECLUS PROPORTION=p | THRESHOLD=t < options > ;

The PROC ACECLUS statement starts the ACECLUS procedure. The options available with the PROC ACECLUS statement are summarized in Table 16.2 and discussed in the following sections. If you specify the METHOD=COUNT option, you must specify either the PROPORTION= or the MPAIRS= option. Otherwise, you must specify either the PROPORTION= or the THRESHOLD= option.

Table 16.2: Summary of PROC ACECLUS Statement Options
Task	Options	Description
Specify clustering options
	METHOD=	specify the clustering method
	MPAIRS=	specify number of pairs for estimating within-cluster covariance (when you specify the option METHOD=COUNT)
	PROPORTION=	specify proportion of pairs for estimating within-cluster covariance
	THRESHOLD=	specify the threshold for including pairs in the estimation of the within-cluster covariance
Specify input and output data sets
	DATA=	specify input data set name
	OUT=	specify output data set name
	OUTSTAT=	specify output data set name containing various statistics
Specify iteration options
	ABSOLUTE	use absolute instead of relative threshold
	CONVERGE=	specify convergence criterion
	INITIAL=	specify initial estimate of within-cluster covariance matrix
	MAXITER=	specify maximum number of iterations
	METRIC=	specify metric in which computations are performed
	SINGULAR=	specify singularity criterion
Specify canonical analysis options
	N=	specify number of canonical variables
	PREFIX=	specify prefix for naming canonical variables
Control displayed output
	NOPRINT	suppress the display of the output
	PP	produce PP-plot of distances between pairs from last iteration
	QQ	produce QQ-plot of power transformation of distances between pairs from last iteration
	SHORT	omit all output except for iteration history and eigenvalue table

The following list provides details on the options. The list is in alphabetical order.

ABSOLUTE

causes the THRESHOLD= value or the threshold computed from the PROPORTION= option to be treated absolutely rather than relative to the root mean square distance between observations. Use the ABSOLUTE option only when you are confident that the initial estimate of the within-cluster covariance matrix is close to the final estimate, such as when the INITIAL= option specifies a data set created by a previous execution of PROC ACECLUS using the OUTSTAT= option.
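The two-run workflow described above might look like the following sketch. The first step saves its statistics with the OUTSTAT= option, and the second step reads them back through INITIAL=INPUT= and adds ABSOLUTE. All data set and variable names here are hypothetical.

```sas
/* First run: save the estimates in a TYPE=ACE data set */
proc aceclus data=work.measurements proportion=0.03
             outstat=work.ace_stats noprint;
   var x1 x2 x3;
run;

/* Second run: start from the saved estimate and treat the threshold absolutely */
proc aceclus data=work.measurements proportion=0.03
             initial=input=work.ace_stats absolute;
   var x1 x2 x3;
run;
```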

CONVERGE= c

specifies the convergence criterion. By default, CONVERGE= 0.001. Iteration stops when the convergence measure falls below the value specified by the CONVERGE= option or when the iteration limit as specified by the MAXITER= option is exceeded, whichever happens first.

DATA= SAS-data-set

specifies the SAS data set to be analyzed. By default, PROC ACECLUS uses the most recently created SAS data set.

INITIAL= name

specifies the matrix for the initial estimate of the within-cluster covariance matrix. Valid values for name are as follows:

DIAGONAL | D	uses the diagonal matrix of sample variances as the initial estimate of the within-cluster covariance matrix.
FULL | F	uses the total-sample covariance matrix as the initial estimate of the within-cluster covariance matrix.
IDENTITY | I	uses the identity matrix as the initial estimate of the within-cluster covariance matrix.
INPUT=SAS-data-set	specifies a SAS data set from which to obtain the initial estimate of the within-cluster covariance matrix. The data set can be TYPE=CORR, COV, UCORR, UCOV, SSCP, or ACE, or it can be an ordinary SAS data set. (See Appendix 1, 'Special SAS Data Sets,' for descriptions of CORR, COV, UCORR, UCOV, and SSCP data sets. See the section 'Output Data Sets' on page 409 for a description of ACE data sets.)

If you do not specify the INITIAL= option, the default is the matrix specified by the METRIC= option. If neither the INITIAL= nor the METRIC= option is specified, INITIAL=FULL is used if there are enough observations to obtain a nonsingular total-sample covariance matrix; otherwise, INITIAL=DIAGONAL is used.

MAXITER= n

specifies the maximum number of iterations. By default, MAXITER=10.
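If the iteration history shows that the procedure stopped at the iteration limit before converging, you might tighten the criterion and raise the limit together, as in this sketch. The values 0.0001 and 50 and all names are illustrative assumptions, not recommendations.

```sas
/* Hypothetical settings: stricter convergence and a higher iteration cap */
proc aceclus data=work.measurements proportion=0.03
             converge=0.0001 maxiter=50;
   var x1 x2 x3;
run;
```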

METHOD=COUNT | C

METHOD=THRESHOLD | T

specifies the clustering method. The METHOD=THRESHOLD option requests a method (also the default) that uses all pairs closer than a given cutoff value to form the estimate at each iteration. The METHOD=COUNT option requests a method that uses a number of pairs, m, with the smallest distances to form the estimate at each iteration.

METRIC= name

specifies the metric in which the computations are performed, implies the default value for the INITIAL= option, and specifies the matrix Z used in the formula for the convergence measure e_i and for checking singularity of the A matrix. Valid values for name are as follows:

DIAGONAL | D	uses the diagonal matrix of sample variances diag(S) and sets Z = diag(S)^{-1/2}, where the superscript -1/2 indicates an inverse factor.
FULL | F	uses the total-sample covariance matrix S and sets Z = S^{-1/2}.
IDENTITY | I	uses the identity matrix I and sets Z = I.

If you do not specify the METRIC= option, METRIC=FULL is used if there are enough observations to obtain a nonsingular total-sample covariance matrix; otherwise, METRIC=DIAGONAL is used.

The option METRIC= is rather technical. It affects the computations in a variety of ways, but for well-conditioned data the effects are subtle. For most data sets, the METRIC= option is not needed.

MPAIRS= m

specifies the number of pairs to be included in the estimation of the within-cluster covariance matrix when METHOD=COUNT is requested. The value of m must be greater than 0 but less than or equal to (totfq × (totfq - 1)) / 2, where totfq is the sum of nonmissing frequencies specified in the FREQ statement. If there is no FREQ statement, totfq equals the number of total nonmissing observations.
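To illustrate the upper bound, the following sketch assumes a hypothetical data set with 100 nonmissing observations and no FREQ statement, so totfq = 100 and at most 100 × 99 / 2 = 4950 pairs are available. The value MPAIRS=500 and all names are placeholders.

```sas
/* Compute the largest valid MPAIRS= value for an assumed totfq of 100 */
data _null_;
   totfq = 100;
   max_pairs = totfq * (totfq - 1) / 2;
   put max_pairs=;   /* writes max_pairs=4950 to the log */
run;

/* Use the 500 closest pairs at each iteration (500 <= 4950) */
proc aceclus data=work.measurements method=count mpairs=500;
   var x1 x2 x3;
run;
```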

N= n

specifies the number of canonical variables to be computed. The default is the number of variables analyzed. N=0 suppresses the canonical analysis.

NOPRINT

suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, 'Using the Output Delivery System.'

OUT= SAS-data-set

creates an output SAS data set that contains all the original data as well as the canonical variables having an estimated within-cluster covariance matrix equal to the identity matrix. If you want to create a permanent SAS data set, you must specify a two-level name. See Chapter 16, 'SAS Data Files' in SAS Language Reference: Concepts for information on permanent SAS data sets.

OUTSTAT= SAS-data-set

specifies a TYPE=ACE output SAS data set that contains means, standard deviations, number of observations, covariances, estimated within-cluster covariances, eigenvalues, and canonical coefficients. If you want to create a permanent SAS data set, you must specify a two-level name. See Chapter 16, 'SAS Data Files' in SAS Language Reference: Concepts for information on permanent SAS data sets.

PROPORTION= p

PERCENT= p

P= p

specifies the percentage of pairs to be included in the estimation of the within-cluster covariance matrix. The value of p must be greater than 0. If p is greater than or equal to 1, it is interpreted as a percentage and divided by 100; PROPORTION=0.02 and PROPORTION=2 are equivalent. When you specify METHOD=THRESHOLD, a threshold value is computed from the PROPORTION= option under the assumption that the observations are sampled from a multivariate normal distribution.

When you specify METHOD=COUNT, the number of pairs, m, is computed from PROPORTION=p as

m = floor( (p / 2) × totfq × (totfq - 1) )

where totfq is the number of total nonmissing observations.

PP

produces a PP probability plot of distances between pairs of observations computed in the last iteration.

PREFIX= name

specifies a prefix for naming the canonical variables. By default the names are CAN1, CAN2, ..., CANn. If you specify PREFIX=ABC, the variables are named ABC1, ABC2, ABC3, and so on. The number of characters in the prefix plus the number of digits required to designate the variables should not exceed the name length defined by the VALIDVARNAME= system option. For more information on the VALIDVARNAME= system option, refer to SAS Language Reference: Dictionary.
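The following sketch combines PREFIX= with the N= and OUT= options. Under the naming rule above, the output data set would contain two canonical variables named ACE1 and ACE2 alongside the original data. The prefix, data set names, and variables are hypothetical.

```sas
/* Keep two canonical variables and name them ACE1 and ACE2 */
proc aceclus data=work.measurements out=work.ace_scores
             proportion=0.03 n=2 prefix=ACE;
   var x1 x2 x3;
run;
```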

QQ

produces a QQ probability plot of a power transformation of the distances between pairs of observations computed in the last iteration. Caution: The QQ plot may require an enormous amount of computer time.

SHORT

omits all items from the standard output except for the iteration history and the eigenvalue table.

SINGULAR= g

SING= g

specifies a singularity criterion 0 < g < 1 for the total-sample covariance matrix S and the approximate within-cluster covariance estimate A. The default is SINGULAR=1E-4.

THRESHOLD= t

T= t

specifies the threshold for including pairs of observations in the estimation of the within-cluster covariance matrix. A pair of observations is included if the Euclidean distance between them is less than or equal to t times the root mean square distance computed over all pairs of observations.

BY Statement

BY variables ;

You can specify a BY statement with PROC ACECLUS to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

If your input data set is not sorted in ascending order, use one of the following alternatives:

- Sort the data using the SORT procedure with a similar BY statement.
- Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the ACECLUS procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index on the BY variables using the DATASETS procedure.
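The first alternative can be sketched as follows, assuming a hypothetical grouping variable named site. The data are sorted into a new data set, and the same BY statement is then repeated in PROC ACECLUS.

```sas
/* Sort by the grouping variable (site is a hypothetical name) */
proc sort data=work.measurements out=work.sorted;
   by site;
run;

/* One separate analysis per value of site */
proc aceclus data=work.sorted proportion=0.03;
   by site;
   var x1 x2 x3;
run;
```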

If you specify the INITIAL=INPUT= option and the INITIAL=INPUT= data set does not contain any of the BY variables, the entire INITIAL=INPUT= data set provides the initial value for the matrix A for each BY group in the DATA= data set.

If the INITIAL=INPUT= data set contains some but not all of the BY variables, or if some BY variables do not have the same type or length in the INITIAL=INPUT= data set as in the DATA= data set, then PROC ACECLUS displays an error message and stops.

If all the BY variables appear in the INITIAL=INPUT= data set with the same type and length as in the DATA= data set, then each BY group in the INITIAL=INPUT= data set provides the initial value for A for the corresponding BY group in the DATA= data set. All BY groups in the DATA= data set must also appear in the INITIAL=INPUT= data set. The BY groups in the INITIAL=INPUT= data set must be in the same order as in the DATA= data set. If you specify NOTSORTED in the BY statement, identical BY groups must occur in the same order in both data sets. If you do not specify NOTSORTED, some BY groups can appear in the INITIAL=INPUT= data set, but not in the DATA= data set; such BY groups are not used in the analysis.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.

FREQ Statement

FREQ variable ;

If a variable in your data set represents the frequency of occurrence for the observation, include the name of that variable in the FREQ statement. The procedure then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation. If a value of the FREQ variable is not integral, it is truncated to the largest integer not exceeding the given value. Observations with FREQ values less than one are not included in the analysis. The total number of observations is considered equal to the sum of the FREQ variable.
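A minimal sketch of the FREQ statement follows. It assumes a hypothetical data set work.grouped in which each row is a distinct point and the variable count records how many times that point was observed.

```sas
/* count is a hypothetical frequency variable */
proc aceclus data=work.grouped proportion=0.03;
   var x1 x2;
   freq count;   /* count=3.9 is treated as 3; a row with count=0.5 is excluded */
run;
```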

VAR Statement

VAR variables ;

The VAR statement specifies the numeric variables to be analyzed. If the VAR statement is omitted, all numeric variables not specified in other statements are analyzed.

WEIGHT Statement

WEIGHT variable ;

If you want to specify relative weights for each observation in the input data set, place the weights in a variable in the data set and specify that variable name in a WEIGHT statement. This is often done when the variance associated with each observation is different and the values of the weight variable are proportional to the reciprocals of the variances. The values of the WEIGHT variable can be non-integral and are not truncated. An observation is used in the analysis only if the value of the WEIGHT variable is greater than zero.

The WEIGHT and FREQ statements have a similar effect, except in calculating the divisor of the A matrix.
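The reciprocal-variance weighting described above might be set up as in this sketch. The variable obs_var, assumed to hold a known variance for each observation, and the other names are hypothetical.

```sas
/* Build a weight proportional to the reciprocal of each observation's variance */
data work.weighted;
   set work.measurements;
   if obs_var > 0 then w = 1 / obs_var;   /* w stays missing otherwise */
run;

proc aceclus data=work.weighted proportion=0.03;
   var x1 x2 x3;
   weight w;   /* only rows with w greater than zero are used */
run;
```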