The DISTANCE procedure computes various measures of distance, dissimilarity, or similarity between the observations (rows) of a SAS data set. These proximity measures are stored as a lower
The number of rows and
PROC DISTANCE also provides various non-parametric and parametric methods for standardizing variables. Different variables can be standardized with different
Distance matrices are used frequently in data mining, genomics, marketing, financial analysis, management science, education,
Measurement of some attribute of a set of objects is the process of assigning numbers or other symbols to the objects in such a way that properties of the
The most commonly discussed levels of measurement are:
| Nominal | Two objects are assigned the same symbol if they have the same value of the attribute. Permissible transformations are any one-to-one or many-to-one transformation, although a |
| Ordinal | Objects are assigned numbers such that the order of the numbers reflects an order relation defined on the attribute. Two objects x and y with attribute values a(x) and a(y) are assigned numbers m(x) and m(y) such that if m(x) > m(y), then a(x) > a(y). Permissible transformations are any monotone increasing transformation, although a transformation that is not |
| Interval | Objects are assigned numbers such that differences between the numbers reflect differences of the attribute. If m(x) - m(y) > m(u) - m(v), then a(x) - a(y) > a(u) - a(v). Permissible transformations are any affine transformation t(m) = c * m + d, where c and d are constants; another way of saying this is that the origin and unit of measurement are arbitrary. |
| Log-interval | Objects are assigned numbers such that ratios between the numbers reflect ratios of the attribute. If m(x) / m(y) > m(u) / m(v), then a(x) / a(y) > a(u) / a(v). Permissible transformations are any power transformation t(m) = c * m^d, where c and d are constants. |
| Ratio | Objects are assigned numbers such that differences and ratios between the numbers reflect differences and ratios of the attribute. Permissible transformations are any linear (similarity) transformation t(m) = c * m, where c is a constant; another way of saying this is that the unit of measurement is arbitrary. |
| Absolute | Objects are assigned numbers such that all properties of the numbers reflect analogous properties of the attribute. The only permissible transformation is the identity transformation. |
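As a rough illustration of what "permissible" means for an interval scale, the following Python sketch (not SAS code) applies an affine transformation with a positive multiplier to four made-up measurements. The helper name `affine`, the constants, and the data values are assumptions chosen for the example. It checks that the ordering of differences survives the transformation while the ordering of ratios does not.

```python
def affine(m, c=1.8, d=32.0):
    """Affine transformation t(m) = c * m + d, assuming c > 0."""
    return c * m + d

# Hypothetical interval-scale measurements (for example, temperatures).
x, y, u, v = 2.0, 1.0, 30.0, 20.0

# Ordering of differences before and after the transformation.
diff_before = (x - y) > (u - v)
diff_after = (affine(x) - affine(y)) > (affine(u) - affine(v))

# Ordering of ratios before and after the transformation.
ratio_before = (x / y) > (u / v)
ratio_after = (affine(x) / affine(y)) > (affine(u) / affine(v))

print("differences ordering preserved:", diff_before == diff_after)
print("ratios ordering preserved:", ratio_before == ratio_after)
```

With these values the comparison of differences gives the same answer before and after the shift, but the comparison of ratios flips, which is why ratios carry no meaning on an interval scale.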
Proximity measures provided in the DISTANCE procedure accept four levels of measurement: nominal, ordinal, interval, and ratio. Ordinal variables are transformed to interval variables before processing. This is done by replacing the data with their rank scores, and by
A binary variable contains two possible
If the outcomes of a binary variable are not equally important, the binary variable is called asymmetric. An example of such a variable is the presence or absence of a relatively rare attribute, such as being color blind for a human being. While you can say that two people who are color blind have something in common, you cannot say that people who are not
When nominal variables are employed, the comparison of one data unit with another can only be in terms of whether the data units score the same or different on the variables. If a variable is defined as an asymmetric nominal variable and two data units score the same but fall into the absent category, the absent-absent match is excluded from the computation of the proximity measure.
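The effect of excluding absent-absent matches can be sketched in a few lines of Python. This is a generic matching proportion written for illustration, not the formula of any particular PROC DISTANCE method; the function name `matching_similarity`, its `absent` parameter, and the data are hypothetical.

```python
def matching_similarity(a, b, absent=None):
    """Proportion of variables on which two data units score the same.

    If `absent` is given, variables on which both units fall into the
    absent category are dropped before counting (asymmetric treatment).
    """
    pairs = list(zip(a, b))
    if absent is not None:
        pairs = [(p, q) for p, q in pairs if not (p == absent and q == absent)]
    if not pairs:
        return float("nan")  # nothing left to compare
    matches = sum(1 for p, q in pairs if p == q)
    return matches / len(pairs)

# Two hypothetical data units scored on five binary variables (1 = present).
unit1 = [1, 0, 0, 0, 1]
unit2 = [1, 0, 0, 1, 0]

symmetric = matching_similarity(unit1, unit2)             # all matches count
asymmetric = matching_similarity(unit1, unit2, absent=0)  # 0-0 matches dropped
print(symmetric, asymmetric)
```

Treating the variables as symmetric counts three matches out of five. Treating them as asymmetric drops the two absent-absent pairs, leaving one match out of three, so the two units look less alike.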
Since variables with large variances tend to have more effect on the proximity measure than those with small variances, it is recommended to standardize the variables before the computation of the proximity measure. The DISTANCE procedure provides a
Variable standardization is not required if there is only one level of measurement, or if only asymmetric nominal and nominal levels are specified;
When standardization is mandatory and no standardization method is specified, a default method is used. This default method is determined by the measurement level. In general, the default method is STD for interval variables and MAXABS for ratio variables, except when METHOD=GOWER or METHOD=DGOWER is specified. See the STD= option in the VAR statement for the default methods for GOWER and DGOWER, as well as for the methods available for standardizing variables.
When standardization is mandatory, PROC DISTANCE suppresses the REPONLY option, if it is specified.
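To see why variables with large variances dominate an unstandardized distance, consider the following Python sketch (not SAS code). It assumes that STD means subtracting the mean and dividing by the sample standard deviation, and that MAXABS means dividing by the maximum absolute value. Both definitions should be confirmed against the STD= option documentation. The function names and the data are made up for the example.

```python
import math
import statistics

def standardize_std(values):
    """Assumed STD method: subtract the mean, divide by the sample std dev."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def standardize_maxabs(values):
    """Assumed MAXABS method: divide by the maximum absolute value."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]

def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical data: income has a much larger variance than age.
income = [30000.0, 52000.0, 31000.0]
age = [25.0, 27.0, 60.0]

raw = list(zip(income, age))
std = list(zip(standardize_std(income), standardize_std(age)))

# Distances from unit 0 to units 1 and 2, before and after standardization.
raw_01, raw_02 = euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2])
std_01, std_02 = euclidean(std[0], std[1]), euclidean(std[0], std[2])

print("raw ratio:", raw_01 / raw_02)
print("standardized ratio:", std_01 / std_02)
print("MAXABS example:", standardize_maxabs([2.0, -4.0, 1.0]))
```

On the raw data, unit 1 is about 22 times farther from unit 0 than unit 2 is, because the income gap swamps the 35-year age gap. After standardization the two distances are nearly equal, so both variables contribute to the comparison.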