Chapter 26: The DISTANCE Procedure | SAS/STAT 9.1 User's Guide Volume 2 only

Overview

The DISTANCE procedure computes various measures of distance, dissimilarity, or similarity between the observations (rows) of a SAS data set. These proximity measures are stored as a lower triangular matrix or a square matrix in an output data set (depending on the SHAPE= option) that can then be used as input to the CLUSTER, MDS, and MODECLUS procedures. The input data set may contain numeric or character variables, or both, depending on which proximity measure is used.

The number of rows and columns in the output matrix equals the number of observations in the input data set. If there are BY groups, an output matrix is computed for each BY group with the size determined by the maximum number of observations in any BY group.
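The two output shapes can be illustrated outside SAS. The following Python sketch is not PROC DISTANCE itself; it assumes Euclidean distance as the proximity measure, and the function names and the "lower"/"square" labels are hypothetical, chosen only for this illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows, shape="lower"):
    """Return proximities as a lower triangle (with diagonal) or a full square.

    'lower' and 'square' are illustrative labels, not SAS option values.
    """
    n = len(rows)
    if shape == "square":
        return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    # Row i keeps only columns 0..i, so the symmetric upper half is not stored.
    return [[euclidean(rows[i], rows[j]) for j in range(i + 1)] for i in range(n)]

observations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
lower = distance_matrix(observations, shape="lower")
square = distance_matrix(observations, shape="square")
```

With three observations, both forms describe a 3-by-3 matrix; the lower triangular form simply omits the entries above the diagonal, which are redundant for a symmetric measure.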

PROC DISTANCE also provides various non-parametric and parametric methods for standardizing variables. Different variables can be standardized with different methods.

Distance matrices are used frequently in data mining, genomics, marketing, financial analysis, management science, education, chemistry, psychology, biology, and various other fields.

Levels of Measurement

Measurement of some attribute of a set of objects is the process of assigning numbers or other symbols to the objects in such a way that properties of the numbers or symbols reflect properties of the attribute being measured. There are different levels of measurement that involve different properties (relations and operations) of the numbers or symbols. Associated with each level of measurement is a set of transformations of the measurements that preserve the relevant properties; these transformations are called permissible transformations. A particular way of assigning numbers or symbols to measure something is called a scale of measurement.

The most commonly discussed levels of measurement are:

Nominal	Two objects are assigned the same symbol if they have the same value of the attribute. Permissible transformations are any one-to-one or many-to-one transformation, although a many-to-one transformation loses information.
Ordinal	Objects are assigned numbers such that the order of the numbers reflects an order relation defined on the attribute. Two objects x and y with attribute values a(x) and a(y) are assigned numbers m(x) and m(y) such that if m(x) > m(y), then a(x) > a(y). Permissible transformations are any monotone increasing transformation, although a transformation that is not strictly increasing loses information.
Interval	Objects are assigned numbers such that differences between the numbers reflect differences of the attribute. If m(x) - m(y) > m(u) - m(v), then a(x) - a(y) > a(u) - a(v). Permissible transformations are any affine transformation t(m) = c * m + d, where c and d are constants; another way of saying this is that the origin and unit of measurement are arbitrary.
Log-interval	Objects are assigned numbers such that ratios between the numbers reflect ratios of the attribute. If m(x) / m(y) > m(u) / m(v), then a(x) / a(y) > a(u) / a(v). Permissible transformations are any power transformation t(m) = c * m^d, where c and d are constants.
Ratio	Objects are assigned numbers such that differences and ratios between the numbers reflect differences and ratios of the attribute. Permissible transformations are any linear (similarity) transformation t(m) = c * m, where c is a constant; another way of saying this is that the unit of measurement is arbitrary.
Absolute	Objects are assigned numbers such that all properties of the numbers reflect analogous properties of the attribute. The only permissible transformation is the identity transformation.
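The idea of a permissible transformation can be checked numerically. The Python sketch below is only an illustration: the helper names and sample values are hypothetical, and the affine map assumes a positive constant c. It shows that a monotone increasing but non-affine transformation keeps the order of the measurements (enough for the ordinal level) yet can change the order of their differences (not enough for the interval level).

```python
def same_order(values, t):
    """True if t preserves the ordering of every pair of measurements."""
    return all((a > b) == (t(a) > t(b)) for a in values for b in values)

def same_difference_order(values, t):
    """True if t preserves the ordering of every pair of differences."""
    pairs = [(a, b) for a in values for b in values]
    return all(
        ((a - b) > (c - d)) == ((t(a) - t(b)) > (t(c) - t(d)))
        for a, b in pairs
        for c, d in pairs
    )

measurements = [0.0, 3.0, 4.0, 5.0]

def affine(m):
    # t(m) = c * m + d with c = 2 (assumed positive) and d = 10
    return 2.0 * m + 10.0

def cube(m):
    # Strictly increasing, but not affine
    return m ** 3

affine_keeps_order = same_order(measurements, affine)
affine_keeps_differences = same_difference_order(measurements, affine)
cube_keeps_order = same_order(measurements, cube)
cube_keeps_differences = same_difference_order(measurements, cube)
```

Here 3 - 0 exceeds 5 - 4 on the original scale, but after cubing the comparison becomes 27 versus 61, so the order of the differences is reversed.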

Proximity measures provided in the DISTANCE procedure accept four levels of measurement: nominal, ordinal, interval, and ratio. Ordinal variables are transformed to interval variables before processing. This is done by replacing the data with their rank scores, and by assuming that the classes of an ordinal variable are spaced equally along the interval scale. See the RANKSCORE= option in the PROC DISTANCE statement for choices on assigning scores to ordinal variables. There are also other approaches to transforming an ordinal variable to an interval variable; refer to Anderberg (1973) for alternatives.
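Replacing ordinal data with rank scores can be sketched as follows. This Python example is an illustration under stated assumptions: it shows two plausible scoring schemes (midranks that average the positions of tied values, and consecutive integers per distinct category), and the function names are hypothetical. How these relate to the specific choices offered by RANKSCORE= should be confirmed in that option's description.

```python
def midrank_scores(values):
    """Score each value by the average 1-based sorted position of its ties."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    midrank = {v: sum(p) / len(p) for v, p in positions.items()}
    return [midrank[v] for v in values]

def index_scores(values):
    """Score each value by the 1-based index of its category, ignoring frequency."""
    level = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [level[v] for v in values]

# Hypothetical ordinal codes: only their order is meaningful.
grades = [40, 10, 20, 40, 10]
mid = midrank_scores(grades)
idx = index_scores(grades)
```

Either way, the resulting scores are then treated as interval data, which is where the equal-spacing assumption enters.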

Symmetric versus Asymmetric Nominal Variables

A binary variable has two possible outcomes: 1 (positive/present) or 0 (negative/absent). If there is no preference for which outcome should be coded as 0 and which as 1, the binary variable is called symmetric. For example, the binary variable "is evergreen?" for a plant has the possible states "loses leaves in winter" and "does not lose leaves in winter." Both are equally valuable and carry the same weight when a proximity measure is computed. Commonly used measures that accept symmetric binary variables include the Simple Matching, Hamann, Roger and Tanimoto, Sokal and Sneath 1, and Sokal and Sneath 3 coefficients.

If the outcomes of a binary variable are not equally important, the binary variable is called asymmetric. An example of such a variable is the presence or absence of a relatively rare attribute, such as "is color blind" for a human being. While you can say that two people who are color blind have something in common, you cannot say that people who are not color blind have something in common. The most important outcome is usually coded as 1 (present) and the other is coded as 0 (absent). The agreement of two 1s (a present-present match or a positive match) is more significant than the agreement of two 0s (an absent-absent match or a negative match). Usually, the negative match is treated as irrelevant. Commonly used measures that accept asymmetric binary variables include the Jaccard, Dice, Russell and Rao, Binary Lance and Williams nonmetric, and Kulcynski coefficients.
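The difference between the two treatments shows up clearly when the Simple Matching coefficient (symmetric) is compared with the Jaccard coefficient (asymmetric). The Python sketch below uses the standard textbook definitions of both coefficients; the function names and the sample data are hypothetical, and this is not the procedure's own code.

```python
def match_counts(x, y):
    """Count positive matches, negative matches, and mismatches for 0/1 vectors."""
    positive = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    negative = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    mismatch = len(x) - positive - negative
    return positive, negative, mismatch

def simple_matching(x, y):
    """Symmetric: positive and negative matches count equally."""
    positive, negative, mismatch = match_counts(x, y)
    return (positive + negative) / (positive + negative + mismatch)

def jaccard(x, y):
    """Asymmetric: negative (absent-absent) matches are ignored."""
    positive, _, mismatch = match_counts(x, y)
    if positive + mismatch == 0:
        raise ValueError("undefined when both vectors are entirely absent")
    return positive / (positive + mismatch)

# Six rare attributes; the two subjects share one and differ on one.
subject_a = [1, 0, 0, 0, 1, 0]
subject_b = [1, 0, 0, 0, 0, 0]
sm = simple_matching(subject_a, subject_b)
jc = jaccard(subject_a, subject_b)
```

The four absent-absent matches raise the Simple Matching similarity to 5/6, while the Jaccard similarity is 1/2 because only the attributes present in at least one subject are considered.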

When nominal variables are employed, the comparison of one data unit with another can only be in terms of whether the data units score the same or different on the variables. If a variable is defined as an asymmetric nominal variable and two data units score the same but fall into the absent category, the absent-absent match is excluded from the computation of the proximity measure.

Standardization

Since variables with large variances tend to have more effect on the proximity measure than those with small variances, it is recommended to standardize the variables before the computation of the proximity measure. The DISTANCE procedure provides a convenient way to standardize each variable with its own method before the proximity measures are computed. The standardization can also be performed by the STDIZE procedure with the limitation that all variables must be standardized with the same method.
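The effect of unequal variances can be seen in a small numeric example. The following Python sketch assumes Euclidean distance and standardization by the sample mean and standard deviation; the data, units, and function names are hypothetical and serve only to illustrate the point.

```python
import math
import statistics

def euclidean(a, b):
    """Euclidean distance between two equal-length observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize_columns(rows):
    """Center each variable at its mean and scale by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

# Hypothetical observations: (income in dollars, age in years)
rows = [(30000.0, 25.0), (30500.0, 60.0), (45000.0, 26.0)]

# Rows 0 and 1 differ mainly in age; rows 0 and 2 differ mainly in income.
raw_01 = euclidean(rows[0], rows[1])
raw_02 = euclidean(rows[0], rows[2])

scaled = standardize_columns(rows)
std_01 = euclidean(scaled[0], scaled[1])
std_02 = euclidean(scaled[0], scaled[2])
```

On the raw data the income difference swamps the age difference, so the second pair looks far more distant than the first. After standardization the two distances are nearly equal, because each variable contributes on a comparable scale.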

Mandatory Standardization

Variable standardization is not required if there is only one level of measurement, or if only asymmetric nominal and nominal levels are specified; otherwise, standardization is mandatory.

When standardization is mandatory and no standardization method is specified, a default method of standardization will be used. This default method is determined by the measurement level. In general, the default method is STD for interval variables and is MAXABS for ratio variables except when METHOD=GOWER or METHOD=DGOWER is specified. See the STD= option in the VAR statement for the default methods for GOWER and DGOWER as well as methods available for standardizing variables.
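The general defaults can be sketched as a lookup from measurement level to method. This Python example rests on assumptions that should be checked against the STD= option: that STD subtracts the mean and divides by the sample standard deviation (n - 1 divisor), and that MAXABS leaves the origin at zero and divides by the maximum absolute value. The function and dictionary names are hypothetical, and the GOWER and DGOWER exceptions are not modeled.

```python
import statistics

def std_method(values):
    """Assumed STD: location = mean, scale = sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def maxabs_method(values):
    """Assumed MAXABS: location = 0, scale = maximum absolute value."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]

# General defaults described in the text (GOWER/DGOWER exceptions omitted).
DEFAULT_BY_LEVEL = {"interval": std_method, "ratio": maxabs_method}

def standardize(values, level):
    """Apply the default method for the given measurement level."""
    return DEFAULT_BY_LEVEL[level](values)

interval_result = standardize([1.0, 2.0, 3.0], "interval")
ratio_result = standardize([2.0, -4.0, 1.0], "ratio")
```

Note that the assumed MAXABS rescaling keeps zero fixed, which is consistent with the ratio level, where only the unit of measurement is arbitrary.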

When standardization is mandatory, PROC DISTANCE suppresses the REPONLY option, if it is specified.