Syntax | SAS/STAT 9.1 User's Guide Volume 2 only

The following statements are available in the DISTANCE procedure.

PROC DISTANCE < options > ;
- BY variables ;
- COPY variables ;
- FREQ variable ;
- ID variable ;
- VAR level(variables < / opt-list > ) ;
- WEIGHT variable ;

Both the PROC DISTANCE statement and the VAR statement are required.

PROC DISTANCE Statement

PROC DISTANCE < options >

You can specify the following options in the PROC DISTANCE statement.

Table 26.1: Summary of PROC DISTANCE Statement Options
Task/Statement	Options	Description
standardizing variables	ADD=	specifies the constant to add to each value after standardizing and multiplying by the value specified in the MULT= option
	FUZZ=	specifies the relative fuzz factor for writing the output
	INITIAL=	specifies the method for computing initial estimates for the A-estimates
	MULT=	specifies the constant to multiply each value by after standardizing
	NORM	normalizes the scale estimator to be consistent for the standard deviation of a normal distribution
	SNORM	normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution
	VARDEF=	specifies the variances divisor
generating distance matrix	ABSENT=	specifies the value to be used as an absence value for all the asymmetric nominal variables
	METHOD=	specifies the method for computing proximity measures
	PREFIX=	specifies a prefix for naming the distance variables in the OUT= data set
	RANKSCORE=	specifies the method of assigning scores to ordinal variables
	SHAPE=	specifies the shape of the proximity matrix to be stored in the OUT= data set
	UNDEF=	specifies the numeric constant used to replace undefined distances
	VARDEF=	specifies the variances divisor
missing values	NOMISS	omits observations with missing values from computation of the location and scale measures; generates missing distances for observations with missing values
	REPLACE	replaces missing data by zero in the standardized data
	REPONLY	replaces missing data by the location measure (does not standardize the data)
specifying data set details	DATA=	specifies the input data set
	OUT=	specifies the output data set
	OUTSDZ=	specifies the output data set for standardized scores

These options and their abbreviations are described, in alphabetical order, in the remainder of this section.

ABSENT= num or qs

specifies the value to be used as an absence value in an irrelevant absent-absent match for all of the asymmetric nominal variables. If you want to specify a different absence value for a particular variable, use the ABSENT= option in the VAR statement. See the ABSENT= option in the VAR statement later in this chapter for details.

An absence value for a variable can be either a numeric value or a quoted string consisting of combinations of characters. For instance, ., -999, and "NA" are legal values for the ABSENT= option.

The default absence value for a character variable is "NONE" (notice that a blank value is considered a missing value), and the default absence value for a numeric variable is 0.

ADD= c

specifies a constant, c , to add to each value after standardizing and multiplying by the value you specify in the MULT= option. The default value is 0.

DATA= SAS-data-set

specifies the input data set containing observations from which the proximity is computed. If you omit the DATA= option, the most recently created SAS data set is used.

FUZZ= c

specifies the relative fuzz factor for computing the standardized scores. The default value is 1E-14. For the OUTSDZ= data set, the score is computed as follows:
- if |standardized score| < (scale measure) × c, then standardized score = 0
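The fuzz rule can be illustrated with a short Python sketch. It reads the rule as a comparison of the score's magnitude against the scale measure times c; the use of absolute values is an assumption, and the function name `apply_fuzz` is hypothetical.

```python
def apply_fuzz(scores, scale, fuzz=1e-14):
    """Zero out standardized scores that are negligible relative to the scale.

    Assumption: the comparison is |score| < |scale| * fuzz.
    """
    threshold = abs(scale) * fuzz
    return [0.0 if abs(s) < threshold else s for s in scores]


# With scale 1.0 and the default fuzz, 2e-15 is treated as zero.
cleaned = apply_fuzz([2.0e-15, -0.5, 1.25], scale=1.0)
```

The point of the rule is to keep rounding noise from appearing as tiny nonzero scores in the output.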

INITIAL= method

specifies the method for computing initial estimates for the A-estimates (ABW, AWAVE, and AHUBER). The following methods are not allowed for the INITIAL= option: ABW, AHUBER, AWAVE, IN.

The default value is INITIAL=MAD.

METHOD= method

specifies the method for computing proximity measures.

For use in PROC CLUSTER, distance or dissimilarity measures such as METHOD= EUCLID or METHOD= DGOWER should be chosen.

The following six tables outline the proximity measures available for the METHOD= option. These tables are classified by levels of measurement accepted by each method. There are three to four columns in each table: the proximity measures (Method) column, the upper and lower bounds (Range) column(s), and the types of proximity (Type) column.

The Type column has two possible values: sim if a method generates similarity measures, or dis if a method generates distance or dissimilarity measures.

For formulas and descriptions of these methods, see the Details section on page 1270.

Table 26.2 lists the GOWER and the DGOWER methods. These two methods accept all measurement levels including ratio, interval, ordinal, nominal, and asymmetric nominal. METHOD= GOWER or METHOD= DGOWER always implies standardization. Assuming all the numeric (ordinal, interval, and ratio) variables are standardized by their corresponding default methods, the possible range values for both methods in the second column of this table are on or between 0 and 1. To find out the default methods of standardization for METHOD= GOWER or METHOD= DGOWER, see the STD= option for the VAR statement later in this section. Entries in this table are:

GOWER	Gower's similarity
DGOWER	1 minus GOWER

Table 26.2: Methods Accepting all Measurement Levels
Method	Range	Type
GOWER	0 to 1	sim
DGOWER	0 to 1	dis

Table 26.3 lists methods accepting ratio, interval, and ordinal variables. Entries in this table are:

EUCLID	Euclidean distance
SQEUCLID	Squared Euclidean distance
SIZE	Size distance
SHAPE	Shape distance
COV	Covariance
CORR	Correlation
DCORR	Correlation transformed to Euclidean distance
SQCORR	Squared correlation
DSQCORR	One minus squared correlation
L( p )	Minkowski (L_p) distance, where p is a positive numeric value
CITYBLOCK	L_1, city-block, or Manhattan distance
CHEBYCHEV	L_∞ (Chebychev) distance
POWER( p, r )	Generalized Euclidean distance, where p is a positive numeric value and r is a non-negative numeric value. The distance between two observations is the r th root of the sum of the absolute differences, each raised to the p th power, between the values for the observations.

Table 26.3: Methods Accepting Ratio, Interval, and Ordinal Variables
Method	Range	Type
EUCLID		dis
SQEUCLID		dis
SIZE		dis
SHAPE		dis
COV		sim
CORR	−1 to 1	sim
DCORR	0 to 2	dis
SQCORR	0 to 1	sim
DSQCORR	0 to 1	dis
L( p )		dis
CITYBLOCK		dis
CHEBYCHEV		dis
POWER( p, r )		dis
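The description of POWER( p, r ) translates directly into code. The following Python sketch is unweighted and ignores missing values, both of which are simplifications; the function names are hypothetical, and r must be strictly positive here to avoid division by zero. Several other entries in the table are special cases of the same expression.

```python
def power_distance(x, y, p, r):
    """r-th root of the sum of |x_i - y_i| ** p (unweighted sketch, r > 0)."""
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / r)


def minkowski(x, y, p):
    """L(p): Minkowski distance, i.e. POWER(p, p)."""
    return power_distance(x, y, p, p)


def chebychev(x, y):
    """L-infinity distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
euclid = minkowski(x, y, 2)             # Euclidean distance
sqeuclid = power_distance(x, y, 2, 1)   # squared Euclidean distance
cityblock = minkowski(x, y, 1)          # L_1 distance
```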

Table 26.4 lists methods accepting ratio variables. Notice that in the second column of this table, all of the possible range values are non-negative, because ratio variables are assumed to be positive. Entries in this table are:

SIMRATIO	Similarity ratio (if variables are binary, this is the Jaccard coefficient)
DISRATIO	One minus similarity ratio
NONMETRIC	Lance and Williams nonmetric coefficient
CANBERRA	Canberra metric distance coefficient
COSINE	Cosine
DOT	Dot (inner) product
OVERLAP	Overlap similarity
DOVERLAP	Overlap dissimilarity
CHISQ	Chi-squared
CHI	Square root of chi-squared
PHISQ	Phi-squared
PHI	Square root of phi-squared

Table 26.4: Methods Accepting Ratio Variables
Method	Range	Type
SIMRATIO	0 to 1	sim
DISRATIO	0 to 1	dis
NONMETRIC	0 to 1	dis
CANBERRA	0 to 1	dis
COSINE	0 to 1	sim
DOT		sim
OVERLAP		sim
DOVERLAP		dis
CHISQ		dis
CHI		dis
PHISQ		dis
PHI		dis
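The remark that SIMRATIO reduces to the Jaccard coefficient for binary variables can be checked numerically. The sketch below assumes the usual similarity-ratio formula, sum(xy) / (sum(x²) + sum(y²) − sum(xy)); that formula is not stated in this section, so treat it as an assumption and consult the Details section for the procedure's own definition. Variable weights and missing values are ignored.

```python
def simratio(x, y):
    """Similarity ratio (assumed formula, unweighted)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return xy / (xx + yy - xy)


def disratio(x, y):
    """One minus the similarity ratio."""
    return 1.0 - simratio(x, y)


def jaccard_binary(x, y):
    """Jaccard coefficient for 0/1 vectors: |intersection| / |union|."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either


bx = [1, 1, 0, 1]
by = [1, 0, 0, 1]
```

For these binary vectors both functions give 2/3. The ratio is undefined when both vectors are entirely zero.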

Table 26.5 lists methods accepting nominal variables. Entries in this table are:

HAMMING	Hamming distance
MATCH	Simple matching coefficient
DMATCH	Simple matching coefficient transformed to Euclidean distance
DSQMATCH	Simple matching coefficient transformed to squared Euclidean distance
HAMANN	Hamann coefficient
RT	Rogers and Tanimoto
SS1	Sokal and Sneath 1
SS3	Sokal and Sneath 3

Table 26.5: Methods Accepting Nominal Variables
Method	Range	Type
HAMMING	0 to v [1]	dis
MATCH	0 to 1	sim
DMATCH	0 to 1	dis
DSQMATCH	0 to 1	dis
HAMANN	−1 to 1	sim
RT	0 to 1	sim
SS1	0 to 1	sim
SS3	0 to 1	sim
[1] v is the number of variables, or dimensionality.
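To make the ranges in Table 26.5 concrete, the following Python sketch computes four of the measures for a pair of observations. It assumes the common definitions (HAMMING as the count of mismatching variables, MATCH as the fraction of matching variables, DSQMATCH as 1 − MATCH, DMATCH as its square root); those definitions are assumptions here and not quoted from this section. Weights and missing values are ignored.

```python
import math


def matching_measures(x, y):
    """Simple matching measures for two observations on nominal variables."""
    v = len(x)  # number of variables (dimensionality)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    match = (v - mismatches) / v
    return {
        "HAMMING": mismatches,            # ranges from 0 to v
        "MATCH": match,                   # ranges from 0 to 1
        "DSQMATCH": 1.0 - match,
        "DMATCH": math.sqrt(1.0 - match),
    }


obs1 = ["red", "small", "round", "yes"]
obs2 = ["red", "large", "round", "no"]
measures = matching_measures(obs1, obs2)
```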

Table 26.6 lists methods that accept asymmetric nominal variables. Use the ABSENT= option to specify the value to be considered absent. Entries in this table are:

DICE	Dice coefficient or Czekanowski/Sorensen similarity coefficient
RR	Russell and Rao
BLWNM	Binary Lance and Williams nonmetric, or Bray-Curtis coefficient
K1	Kulczynski 1

Table 26.6: Methods Accepting Asymmetric Nominal Variables
Method	Range	Type
DICE	0 to 1	sim
RR	0 to 1	sim
BLWNM	0 to 1	dis
K1		sim
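The role of the absence value is easiest to see in code: a pair of values that are both absent is skipped rather than counted as a match. The Python sketch below uses the textbook forms of the Dice, Russell and Rao, and binary Lance and Williams coefficients, which are assumptions here; the helper names are hypothetical, and weights and missing values are ignored.

```python
def present_counts(x, y, absent=0):
    """Count present matches (pm) and mismatches (px), skipping absent-absent pairs."""
    pm = px = 0
    for a, b in zip(x, y):
        if a == absent and b == absent:
            continue  # irrelevant absent-absent match
        if a == b:
            pm += 1
        else:
            px += 1
    return pm, px


def dice(x, y, absent=0):
    pm, px = present_counts(x, y, absent)
    return 2 * pm / (2 * pm + px)


def russell_rao(x, y, absent=0):
    pm, _ = present_counts(x, y, absent)
    return pm / len(x)  # absent-absent pairs still count in the denominator


def blwnm(x, y, absent=0):
    pm, px = present_counts(x, y, absent)
    return px / (2 * pm + px)


ax = [1, 1, 0, 0, 1]
ay = [1, 0, 0, 1, 1]
```

When every pair is absent-absent, the Dice and Lance and Williams ratios are undefined; that is the kind of case the UNDEF= option addresses.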

Table 26.7 lists methods accepting asymmetric nominal and ratio variables. Use the ABSENT= option to create a value to be considered absent. There are four instead of three columns in this table. The second column contains possible range values if only one level of measurement (either ratio or asymmetric nominal but not both) is specified; the third column contains possible range values if both levels are specified.

The JACCARD method is equivalent to the SIMRATIO method if there is no asymmetric nominal variable; if both ratio and asymmetric nominal variables are present, the coefficient is computed as the sum of the coefficient from the ratio variables and the coefficient from the asymmetric nominal variables. See Proximity Measures of the Details section on page 1270 for formulas and descriptions of the JACCARD method. Entries in this table are:

JACCARD	Jaccard similarity coefficient
DJACCARD	Jaccard dissimilarity coefficient

Table 26.7: Methods Accepting Asymmetric Nominal and Ratio Variables
Method	Range (one level)	Range (two levels)	Type
JACCARD	0 to 1	0 to 2	sim
DJACCARD	0 to 1	0 to 2	dis

MULT= c

specifies a constant, c , by which to multiply each value after standardizing. The default value is 1.
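The combined effect of MULT= and ADD= can be sketched as a single expression, result = add + mult × (value − location) / scale. This form follows from the option descriptions (multiply after standardizing, then add); the function name is hypothetical, and the RANGE method (location = minimum, scale = range) is used only as an example.

```python
def standardize(values, location, scale, mult=1.0, add=0.0):
    """Return add + mult * (x - location) / scale for each value."""
    return [add + mult * (x - location) / scale for x in values]


data = [2.0, 4.0, 6.0, 10.0]
# RANGE method: location is the minimum, scale is the range.
location, scale = min(data), max(data) - min(data)
rescaled = standardize(data, location, scale, mult=100.0, add=1.0)
```

With the defaults (MULT=1, ADD=0) the same call returns the plain standardized values 0, 0.25, 0.5, and 1.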

NOMISS

While standardizing variables, omit observations with missing values from computation of the location and scale measures. While computing distances, generate undefined (missing) distances for observations with missing values. Use the UNDEF= option to specify the undefined values.

If a distance matrix is created to be used as an input to PROC CLUSTER, the NOMISS option should not be used because the CLUSTER procedure will not accept distance matrices with missing values.

NORM

normalizes the scale estimator to be consistent for the standard deviation of a normal distribution when you specify the option STD= AGK, STD= IQR, STD= MAD, or STD= SPACING in the VAR statement.

PREFIX= name

specifies a prefix for naming the distance variables in the OUT= data set. By default, the names are Dist1, Dist2, ..., Distn. If you specify PREFIX=ABC, the variables are named ABC1, ABC2, ..., ABCn. If the ID statement is also specified, the variables are named by appending the value of the ID variable to the prefix.

OUT= SAS-data-set

specifies the name of the SAS data set created by PROC DISTANCE. The output data set contains the BY variables, the ID variable, computed distance variables, the COPY variables, the FREQ variable, and the WEIGHT variables.

If you omit the OUT= option, PROC DISTANCE creates an output data set named according to the DATAn convention.

OUTSDZ= SAS-data-set

specifies the name of the SAS data set containing the standardized scores. The output data set contains a copy of the DATA= data set, except that the analyzed variables have been standardized. Analyzed variables are those listed in the VAR statement.

RANKSCORE= MIDRANK | INDEX

specifies the method of assigning scores to ordinal variables. The available methods are listed as follows:

MIDRANK	assigns consecutive integers to each category with consideration of the frequency value. This is the default method.
INDEX	assigns consecutive integers to each category regardless of frequencies.

The following example explains how each method assigns the rank scores. Suppose the data contain an ordinal variable ABC with values A, B, C. There are two ways to assign numbers. One is to use midranks, which depend on the frequencies of each category. Another is to assign consecutive integers to each category, regardless of frequencies.

Table 26.8: Example of Assigning Rank Scores
ABC	MIDRANK	INDEX
A	1.5	1
A	1.5	1
B	4	2
B	4	2
B	4	2
C	6	3
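The scores in Table 26.8 can be reproduced with a few lines of Python. This sketch assumes ascending category order and a frequency of 1 for every observation; the function name `rank_scores` is hypothetical.

```python
from collections import Counter


def rank_scores(values, method="MIDRANK"):
    """Assign MIDRANK or INDEX scores to the categories of an ordinal variable."""
    counts = Counter(values)
    scores = {}
    seen = 0  # observations in all lower categories
    for index, category in enumerate(sorted(counts), start=1):
        n = counts[category]
        if method == "MIDRANK":
            # Average of the ranks seen+1, ..., seen+n.
            scores[category] = seen + (n + 1) / 2
        else:  # INDEX
            scores[category] = index
        seen += n
    return [scores[v] for v in values]


abc = ["A", "A", "B", "B", "B", "C"]
midrank = rank_scores(abc, "MIDRANK")
index = rank_scores(abc, "INDEX")
```

The two A observations occupy ranks 1 and 2, so their midrank is 1.5; the three B observations occupy ranks 3 to 5, giving 4.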

REPLACE

replaces missing data by zero in the standardized data (which corresponds to the location measure before standardizing). To replace missing data by something else, use the MISSING= option in the VAR statement. The REPLACE option implies standardization.

You cannot specify both the REPLACE and the REPONLY options.

REPONLY

replaces missing data by the location measure specified by the MISSING= option or the STD= option (if the MISSING= option is not specified), but does not standardize the data. If the MISSING= option is not specified and METHOD= GOWER is specified, missing values are replaced by the location measure from the RANGE method (the minimum value), no matter what the value of the STD= option is.

When standardization is mandatory, PROC DISTANCE suppresses the REPONLY option.

You cannot specify both the REPLACE and the REPONLY options.

SHAPE= TRIANGLE | TRI

SHAPE= SQUARE | SQU | SQR

specifies the shape of the proximity matrix to be stored in the OUT= data set. SHAPE= TRIANGLE requests the matrix to be stored as a lower triangular matrix; SHAPE= SQUARE requests the matrix to be stored as a square matrix. Use SHAPE= SQUARE if the output data set is to be used as input to the MODECLUS procedure. The default is TRIANGLE.
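The difference between the two shapes can be pictured with a small Python sketch. Here `None` stands in for the cells above the diagonal that a lower triangular layout leaves unfilled; how PROC DISTANCE itself represents those cells is not described in this section, so that choice is an assumption, as is the function name.

```python
def proximity_rows(points, dist, shape="TRIANGLE"):
    """Build one row per observation, in lower-triangular or square layout."""
    n = len(points)
    rows = []
    for i in range(n):
        last = i if shape == "TRIANGLE" else n - 1
        rows.append([
            dist(points[i], points[j]) if j <= last else None
            for j in range(n)
        ])
    return rows


points = [0, 3, 7]


def abs_diff(a, b):
    return abs(a - b)


triangle = proximity_rows(points, abs_diff, "TRIANGLE")
square = proximity_rows(points, abs_diff, "SQUARE")
```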

SNORM

normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution when the STD= SPACING option is specified.

UNDEF= n

specifies the numeric constant used to replace undefined distances, for example, when an observation has all missing values, or if a divisor is zero.

VARDEF= DF | N | WDF | WEIGHT | WGT

specifies the divisor to be used in the calculation of distance, dissimilarity, or similarity measures, and for standardizing variables whenever a variance or covariance is computed. By default, VARDEF=DF. The values and associated divisors are as follows:

Value	Divisor	Formula
DF	degrees of freedom	n − 1
N	number of observations	n
WDF	sum of weights minus 1	(Σ_i w_i) − 1
WEIGHT | WGT	sum of weights	Σ_i w_i
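The divisor table maps directly onto a small lookup function. The Python sketch below uses a hypothetical function name and assumes unit weights when no weights are supplied.

```python
def vardef_divisor(vardef, n, weights=None):
    """Return the divisor d selected by a VARDEF= value."""
    key = vardef.upper()
    # Assumption: without a WEIGHT variable, every observation has weight 1.
    weight_sum = sum(weights) if weights is not None else float(n)
    if key == "DF":
        return n - 1
    if key == "N":
        return n
    if key == "WDF":
        return weight_sum - 1
    if key in ("WEIGHT", "WGT"):
        return weight_sum
    raise ValueError(f"unknown VARDEF= value: {vardef}")


w = [0.5, 1.5, 2.0]
```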

VAR Statement

VAR | VARIABLES level (variables < / opt-list >)
  < level (variables < / opt-list >)
    .
    .
    .
    level (variables < / opt-list >) > ;

where the syntax for the opt-list is:

< ABSENT = value >
< MISSING = miss-method | value >
< ORDER = order-option >
< STD = std-method >
< WEIGHTS = weight-list >

The VAR statement lists variables from which distances are to be computed. The VAR statement is required. The variables can be numeric or character depending on their measurement levels. A variable may not appear more than once in either the same list or in a different list.

level is required. It declares the levels of measurement for those variables specified within the parentheses. Available values for level are:

ANOMINAL	variables are asymmetric nominal and can be either numeric or character.
NOMINAL	variables are symmetric nominal and can be either numeric or character.
ORDINAL	variables are ordinal and can be either numeric or character. Values of ordinal variables will be replaced by their corresponding rank scores. If standardization is required, the standardized rank scores are output to the data set specified in the OUTSDZ= option.
	See the RANKSCORE= option in the PROC DISTANCE statement for methods available for assigning rank scores to ordinal variables. After being replaced by scores, ordinal variables are considered interval.
INTERVAL	variables are interval, and only numeric variables are allowed.
RATIO	variables are ratio, and only numeric variables are allowed. Ratio variables should always contain positive measurements.

Each variable list can be followed by an option list. Use / after the list of variables to start the option list. An option list contains options that are applied to the variables. The following options are available in the option list.

ABSENT=	to specify the value to be used as an absence value in an irrelevant absent-absent match for asymmetric nominal variables.
MISSING=	to specify the method (or value) with which to replace missing data
ORDER=	to select the order for assigning scores to ordinal variables.
STD=	to select the standardization method
WEIGHTS=	to assign weights to the variables in the list

If an option is missing from the current attribute list, PROC DISTANCE provides default values for all the variables in the current list.

For example, in the following VAR statement:

  var ratio(x1-x4 / std=mad weights=.5 .5 .1 .5 missing=-99)
      interval(x5 / std=range)
      ordinal(x6 / order=desc);

the first option list defines x1-x4 as ratio variables to be standardized by the MAD method. Also, any missing values in x1-x4 should be replaced by -99. x1 is given a weight of 0.5, x2 is given a weight of 0.5, x3 is given a weight of 0.1, and x4 is given a weight of 0.5.

The second option list defines x5 as an interval variable to be standardized by the RANGE method. If the REPLACE option is specified in the PROC DISTANCE statement, missing values in x5 are replaced by the location estimate from the RANGE method. By default, x5 is given a weight of 1.

The last option list defines x6 as an ordinal variable. The scores are assigned from highest to lowest by its unformatted values. Although the STD= option is not specified, x6 will be standardized by the default method (STD) because there is more than one level of measurement (ratio, interval, and ordinal) in the VAR statement. Again, if the REPLACE option is specified, missing values in x6 are replaced by the location estimate from the STD method. Finally, by default, x6 is given a weight of 1.

More details for the options are explained as follows.

STD= std-method

specifies the standardization method. Valid values for std-method are: MEAN, MEDIAN, SUM, EUCLEN, USTD, STD, RANGE, MIDRANGE, MAXABS, IQR, MAD, ABW, AHUBER, AWAVE, AGK, SPACING, and L. Table 26.9 lists available methods of standardization as well as their corresponding location and scale measures.

Table 26.9: Available Standardization Methods
Method	Scale	Location
MEAN	1	mean
MEDIAN	1	median
SUM	sum	0
EUCLEN	Euclidean length	0
USTD	standard deviation about origin	0
STD	standard deviation	mean
RANGE	range	minimum
MIDRANGE	range/2	midrange
MAXABS	maximum absolute value	0
IQR	interquartile range	median
MAD	median abs. dev. from median	median
ABW( c )	biweight A-estimate	biweight 1-step M-estimate
AHUBER( c )	Huber A-estimate	Huber 1-step M-estimate
AWAVE( c )	Wave A-estimate	Wave 1-step M-estimate
AGK( p )	AGK estimate (ACECLUS)	mean
SPACING( p )	minimum spacing	mid minimum-spacing
L( p )	L_p	L_p

These standardization methods are further documented in the section on the METHOD= option in the PROC STDIZE statement of the STDIZE procedure (also see the "Standardization Methods" section on page 4136 in Chapter 66, "The STDIZE Procedure").

Standardization is not required if there is only one level of measurement, or if only asymmetric nominal and nominal levels are specified; otherwise, standardization is mandatory. When standardization is mandatory, a default method will be provided when the STD= option is not given. The default method is STD for standardizing interval variables and MAXABS for standardizing ratio variables unless METHOD= GOWER or METHOD= DGOWER is specified. If METHOD= GOWER is specified, interval variables are standardized by the RANGE method, and whatever is specified in the STD= option is ignored; if METHOD= DGOWER is specified, the RANGE method is the default standardization method for interval variables. The MAXABS method is the default standardization method for ratio variables for both the GOWER and the DGOWER methods.

Notice that a ratio variable should always be positive.
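The rules for when standardization is mandatory, and which default applies, can be summarized in a short Python sketch. The function name is hypothetical. Two points are assumptions drawn from other parts of this section: ordinal variables are treated like interval variables once scored, and METHOD= GOWER or DGOWER always implies standardization.

```python
def default_std_methods(levels, method):
    """Map each numeric level in the VAR statement to its default STD= method.

    Returns an empty dict when standardization is not mandatory.
    """
    levels = {lv.upper() for lv in levels}
    gower = method.upper() in ("GOWER", "DGOWER")
    nominal_only = levels <= {"ANOMINAL", "NOMINAL"}
    mandatory = gower or not (len(levels) == 1 or nominal_only)
    if not mandatory:
        return {}
    defaults = {}
    for lv in levels:
        if lv in ("INTERVAL", "ORDINAL"):
            defaults[lv] = "RANGE" if gower else "STD"
        elif lv == "RATIO":
            defaults[lv] = "MAXABS"
    return defaults


mixed = default_std_methods({"ratio", "interval", "ordinal"}, "EUCLID")
```

The sketch does not model the finer distinction that GOWER ignores a user-specified STD= for interval variables while DGOWER only supplies RANGE as a default.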

Table 26.10 lists standardization methods and the levels of measurement that can be accepted by each method. For example, the SUM method can be used to standardize ratio variables but not interval or ordinal variables. Also, the AGK and SPACING methods should not be used to standardize ordinal variables. If you apply AGK and SPACING to ranks, the results are degenerate because all the spacings of a given order are equal.

Table 26.10: Legitimate Levels of Measurements for Each Method
Standardization Method	Legitimate Levels of Measurement
MEAN	ratio, interval, ordinal
MEDIAN	ratio, interval, ordinal
SUM	ratio
EUCLEN	ratio
USTD	ratio
STD	ratio, interval, ordinal
RANGE	ratio, interval, ordinal
MIDRANGE	ratio, interval, ordinal
MAXABS	ratio
IQR	ratio, interval, ordinal
MAD	ratio, interval, ordinal
ABW( c )	ratio, interval, ordinal
AHUBER( c )	ratio, interval, ordinal
AWAVE( c )	ratio, interval, ordinal
AGK( p )	ratio, interval
SPACING( p )	ratio, interval
L( p )	ratio, interval, ordinal

ABSENT= num or qs

specifies the value to be used as an absence value in an irrelevant absent-absent match for asymmetric nominal variables. The absence value specified here overrides the absence value specified through the ABSENT= option in the PROC DISTANCE statement for those variables in the current variable list.

An absence value for a variable can be either a numeric value or a quoted string consisting of combinations of characters. For instance, ., -999, and "NA" are legal values for the ABSENT= option.

The default for an absence value for a character variable is "NONE" (notice that a blank value is considered a missing value), and the default for an absence value for a numeric variable is 0.

MISSING= miss-method or value

specifies the method or a numeric value for replacing missing values. If you omit the MISSING= option, the REPLACE option replaces missing values with the location measure given by the STD= option. Specify the MISSING= option when you want to replace missing values with a different value. You can specify any method that is valid in the STD= option. The corresponding location measure is used to replace missing values.

If a numeric value is given, the value replaces missing values after standardizing the data. However, when standardization is not mandatory, you can specify the REPONLY option with the MISSING= option to suppress standardization for cases in which you want only to replace missing values.

ORDER= ASCENDING | ASC

ORDER= DESCENDING | DESC

ORDER= ASCFORMATTED | ASCFMT

ORDER= DESFORMATTED | DESFMT

ORDER= DSORDER | DATA

specifies the order for assigning scores to ordinal variables. The value for the ORDER= option can be one of the following:

ASCENDING	scores are assigned in lowest-to-highest order of unformatted values.
DESCENDING	scores are assigned in highest-to-lowest order of unformatted values.
ASCFORMATTED	scores are assigned in ascending order by their formatted values. This option can be applied to character variables only, since unformatted values are always used for numeric variables.
DESFORMATTED	scores are assigned in descending order by their formatted values. This option can be applied to character variables only, since unformatted values are always used for numeric variables.
DSORDER	scores are assigned according to the order of their appearance in the input data set.

The default value is ASCENDING.

WEIGHTS= weight-list

specifies a list of values for weighting individual variables while computing the proximity. Values in this list can be separated by blanks or commas. You can include one or more items of the form start TO stop BY increment. This list should contain at least one weight. The maximum number of weights you can list is equal to the number of variables. If the number of weights is less than the number of variables, the last value in the weight-list is used for the rest of the variables; conversely, if the number of weights is greater than the number of variables, the trailing weights will be discarded.

The default value is 1.
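The padding and truncation rules for the weight-list can be expressed in a few lines of Python. The function name is hypothetical; passing `None` models an omitted WEIGHTS= option.

```python
def expand_weights(weight_list, n_vars):
    """Match a WEIGHTS= list to the number of variables in the list."""
    if weight_list is None:
        return [1.0] * n_vars  # default weight of 1 for every variable
    if not weight_list:
        raise ValueError("the weight-list must contain at least one weight")
    weights = list(weight_list[:n_vars])  # trailing weights are discarded
    # Reuse the last listed weight for any remaining variables.
    weights += [weight_list[-1]] * (n_vars - len(weights))
    return weights
```

For example, `expand_weights([2, 3], 4)` gives `[2, 3, 3, 3]`, and `expand_weights([1, 2, 3], 2)` gives `[1, 2]`.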

ID Statement

ID variable

The ID statement specifies a single variable to be copied to the OUT= data set and used to generate names for the distance variables. The ID variable must be character.

Typically, each ID value occurs only once in the input data set or, if you use a BY statement, only once within a BY group.

If you specify both the ID and the BY statements, the ID variable must have the same values in the same order in each BY group.

COPY Statement

COPY variables

The COPY statement specifies a list of additional variables to be copied to the OUT= data set.

BY Statement

BY variables

You can specify a BY statement to obtain separate distance matrices for observations in groups defined by the BY variables.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts.

FREQ Statement

FREQ | FREQUENCY variable

The frequency variable is used for either standardizing variables or assigning rank scores to the ordinal variables. It has no direct effect on computing the distances.

For standardizing variables and assigning rank scores, PROC DISTANCE treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation. Non-integral values of the FREQ variable are truncated to the largest integer less than the FREQ value. If the FREQ variable has a value that is less than 1 or is missing, the observation is not used in the analysis.
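The treatment of FREQ values described above can be sketched in Python as follows. The function names are hypothetical, and `None` or NaN stands in for a missing FREQ value.

```python
import math


def effective_frequency(freq):
    """Return the integer replication count, or 0 if the observation is excluded."""
    if freq is None or math.isnan(freq) or freq < 1:
        return 0
    return math.floor(freq)  # non-integral values are truncated


def replicate(values, freqs):
    """Expand the data as if each observation appeared n times."""
    expanded = []
    for value, freq in zip(values, freqs):
        expanded.extend([value] * effective_frequency(freq))
    return expanded


expanded = replicate(["A", "B", "C", "D"], [2, 3.9, 0.5, None])
```

Expanding the data this way and then computing location, scale, or rank scores gives the behavior the paragraph describes; the distances themselves are unaffected.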

WEIGHT Statement

WGT | WEIGHT variable

The WEIGHT statement specifies a numeric variable in the input data set with values that are used to weight each observation. This weight variable is used for standardizing variables rather than computing the distances. Only one variable can be specified.

The WEIGHT variable values can be non-integers. An observation is used in the analysis only if the value of the WEIGHT variable is greater than zero. The WEIGHT variable applies to variables that are standardized by the following options: STD=MEAN, STD=SUM, STD=EUCLEN, STD=USTD, STD=STD, STD=AGK, or STD=L. PROC DISTANCE uses the value of the WEIGHT variable, $w_i$, as follows.

The sample mean and (uncorrected) sample variances are computed as

$$\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$$

$$us_w^2 = \frac{\sum_i w_i x_i^2}{d}$$

$$s_w^2 = \frac{\sum_i w_i (x_i - \bar{x}_w)^2}{d}$$

where $w_i$ is the weight value of the i th observation, $x_i$ is the value of the i th observation, and d is the divisor controlled by the VARDEF= option (see the VARDEF= option in the PROC DISTANCE statement for details).

PROC DISTANCE uses the value of the WEIGHT variable to calculate the following statistics:

MEAN	the weighted mean, $\bar{x}_w$
SUM	the weighted sum, $\sum_i w_i x_i$
USTD	the weighted uncorrected standard deviation, $\sqrt{us_w^2}$
STD	the weighted standard deviation, $\sqrt{s_w^2}$
EUCLEN	the weighted Euclidean length, computed as the square root of the weighted uncorrected sum of squares: $\sqrt{\sum_i w_i x_i^2}$
AGK	the AGK estimate. This estimate is documented further in the ACECLUS procedure as the METHOD=COUNT option. See the discussion of the WEIGHT statement in Chapter 16, The ACECLUS Procedure, for information on how the WEIGHT variable is applied to the AGK estimate.
L	the L _p estimate. This estimate is documented further in the FASTCLUS procedure as the LEAST= option. See the discussion of the WEIGHT statement in Chapter 28, The FASTCLUS Procedure, for information on how the WEIGHT variable is used to compute weighted cluster means. Note that the number of clusters is always 1.
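The first five weighted statistics can be sketched in Python under the usual weighted-moment definitions, which are an assumption here: the mean divides by the sum of weights, and both variances divide by the VARDEF= divisor d. The function name is hypothetical, and the AGK and L estimates are not covered.

```python
import math


def weighted_stats(x, w, d):
    """Weighted MEAN, SUM, USTD, STD, and EUCLEN for one variable.

    Observations with non-positive weights are excluded; d is the divisor
    chosen through VARDEF=.
    """
    pairs = [(xi, wi) for xi, wi in zip(x, w) if wi > 0]
    w_sum = sum(wi for _, wi in pairs)
    wx_sum = sum(wi * xi for xi, wi in pairs)
    uss = sum(wi * xi * xi for xi, wi in pairs)  # uncorrected sum of squares
    mean = wx_sum / w_sum
    css = sum(wi * (xi - mean) ** 2 for xi, wi in pairs)
    return {
        "MEAN": mean,
        "SUM": wx_sum,
        "USTD": math.sqrt(uss / d),
        "STD": math.sqrt(css / d),
        "EUCLEN": math.sqrt(uss),
    }


# Three observations, VARDEF=DF, so d = n - 1 = 2.
stats = weighted_stats([1.0, 2.0, 3.0], [1.0, 1.0, 2.0], d=2)
```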