Details


Missing Values

Observations with missing values for variables in the analysis are excluded from the development of the classification criterion. When the values of the classification variable are missing, the observation is excluded from the development of the classification criterion, but if no other variables in the analysis have missing values for that observation, the observation is classified and displayed with the classification results.

Background

The following notation is used to describe the classification methods:

x          a p-dimensional vector containing the quantitative variables of an observation
S_p        the pooled covariance matrix
t          a subscript to distinguish the groups
n_t        the number of training set observations in group t
m_t        the p-dimensional vector containing variable means in group t
S_t        the covariance matrix within group t
|S_t|      the determinant of S_t
q_t        the prior probability of membership in group t
p(t|x)     the posterior probability of an observation x belonging to group t
f_t        the probability density function for group t
f_t(x)     the group-specific density estimate at x from group t
f(x)       Σ_t q_t f_t(x), the estimated unconditional density at x
e_t        the classification error rate for group t

Bayes Theorem

Assuming that the prior probabilities of group membership are known and that the group-specific densities at x can be estimated, PROC DISCRIM computes p(t|x), the probability of x belonging to group t, by applying Bayes' theorem:

$p(t \mid x) = \frac{q_t f_t(x)}{\sum_u q_u f_u(x)} = \frac{q_t f_t(x)}{f(x)}$

PROC DISCRIM partitions a p-dimensional vector space into regions R_t, where the region R_t is the subspace containing all p-dimensional vectors y such that p(t|y) is the largest among all groups. An observation is classified as coming from group t if it lies in region R_t.

Parametric Methods

Assuming that each group has a multivariate normal distribution, PROC DISCRIM develops a discriminant function or classification criterion using a measure of generalized squared distance. The classification criterion is based on either the individual within-group covariance matrices or the pooled covariance matrix; it also takes into account the prior probabilities of the classes. Each observation is placed in the class from which it has the smallest generalized squared distance. PROC DISCRIM also computes the posterior probability of an observation belonging to each class.

The squared Mahalanobis distance from x to group t is

$d_t^2(x) = (x - m_t)' V_t^{-1} (x - m_t)$

where V t = S t if the within-group covariance matrices are used, or V t = S p if the pooled covariance matrix is used.

The group-specific density estimate at x from group t is then given by

$f_t(x) = (2\pi)^{-p/2}\, |V_t|^{-1/2} \exp\left(-\tfrac{1}{2}\, d_t^2(x)\right)$

Using Bayes' theorem, the posterior probability of x belonging to group t is

$p(t \mid x) = \frac{q_t f_t(x)}{\sum_u q_u f_u(x)}$

where the summation is over all groups.

The generalized squared distance from x to group t is defined as

$D_t^2(x) = d_t^2(x) + g_1(t) + g_2(t)$

where

$g_1(t) = \begin{cases} \ln |S_t| & \text{if the within-group covariance matrices are used} \\ 0 & \text{if the pooled covariance matrix is used} \end{cases}$

and

$g_2(t) = \begin{cases} -2 \ln(q_t) & \text{if the prior probabilities are not all equal} \\ 0 & \text{if the prior probabilities are all equal} \end{cases}$

The posterior probability of x belonging to group t is then equal to

$p(t \mid x) = \frac{\exp\left(-0.5\, D_t^2(x)\right)}{\sum_u \exp\left(-0.5\, D_u^2(x)\right)}$

The discriminant scores are $-0.5\, D_u^2(x)$. An observation is classified into group u if setting t = u produces the largest value of p(t|x) or, equivalently, the smallest value of $D_t^2(x)$. If this largest posterior probability is less than the threshold specified, x is classified into group OTHER.
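For illustration, a minimal sketch of requesting this parametric criterion follows; the data set Train, the class variable group, and the variables x1 and x2 are hypothetical. POOL=NO bases the criterion on the within-group covariance matrices (the quadratic rule), PRIORS PROPORTIONAL sets the q_t proportional to the group sizes, and THRESHOLD= supplies the posterior probability cutoff below which observations are classified into group OTHER.

   proc discrim data=train method=normal pool=no threshold=0.2;
      class group;
      priors proportional;
      var x1 x2;
   run;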

Nonparametric Methods

Nonparametric discriminant methods are based on nonparametric estimates of group-specific probability densities. Either a kernel method or the k-nearest-neighbor method can be used to generate a nonparametric density estimate in each group and to produce a classification criterion. The kernel method uses uniform, normal, Epanechnikov, biweight, or triweight kernels in the density estimation.

Either Mahalanobis distance or Euclidean distance can be used to determine proximity. When the k -nearest-neighbor method is used, the Mahalanobis distances are based on the pooled covariance matrix. When a kernel method is used, the Mahalanobis distances are based on either the individual within-group covariance matrices or the pooled covariance matrix. Either the full covariance matrix or the diagonal matrix of variances can be used to calculate the Mahalanobis distances.
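For example, the following sketch (with hypothetical data set and variable names) requests kernel density estimation with a normal kernel, radius r = 0.5, and full within-group covariance matrices in the distance metric:

   proc discrim data=train method=npar kernel=normal r=0.5 metric=full pool=no;
      class group;
      var x1 x2;
   run;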

The squared distance between two observation vectors, x and y, in group t is given by

$d_t^2(x, y) = (x - y)' V_t^{-1} (x - y)$

where V t has one of the following forms:

  • $V_t = S_p$, the pooled covariance matrix

  • $V_t = \mathrm{diag}(S_p)$, the diagonal matrix of the pooled covariance matrix

  • $V_t = S_t$, the covariance matrix within group t

  • $V_t = \mathrm{diag}(S_t)$, the diagonal matrix of the covariance matrix within group t

  • $V_t = I$, the identity matrix

The classification of an observation vector x is based on the estimated group-specific densities from the training set. From these estimated densities, the posterior probabilities of group membership at x are evaluated. An observation x is classified into group u if setting t = u produces the largest value of p(t|x). If there is a tie for the largest probability or if this largest probability is less than the threshold specified, x is classified into group OTHER.

The kernel method uses a fixed radius, r, and a specified kernel, K_t, to estimate the group t density at each observation vector x. Let z be a p-dimensional vector. Then the volume of a p-dimensional unit sphere bounded by $z'z = 1$ is

$v_0 = \frac{\pi^{p/2}}{\Gamma\left(\frac{p}{2} + 1\right)}$

where $\Gamma(\cdot)$ represents the gamma function (refer to SAS Language Reference: Dictionary).

Thus, in group t, the volume of a p-dimensional ellipsoid bounded by $z' V_t^{-1} z = r^2$ is

$v_r(t) = r^p\, |V_t|^{1/2}\, v_0$

The kernel method uses one of the following densities as the kernel density in group t .

Uniform Kernel

$K_t(z) = \begin{cases} \dfrac{1}{v_r(t)} & \text{if } z' V_t^{-1} z \le r^2 \\ 0 & \text{elsewhere} \end{cases}$

Normal Kernel (with mean zero, variance $r^2 V_t$)

$K_t(z) = \frac{1}{c_0(t)} \exp\left(-\frac{1}{2r^2}\, z' V_t^{-1} z\right)$

where $c_0(t) = (2\pi)^{p/2}\, r^p\, |V_t|^{1/2}$

Epanechnikov Kernel

$K_t(z) = \begin{cases} c_1(t)\left(1 - \dfrac{z' V_t^{-1} z}{r^2}\right) & \text{if } z' V_t^{-1} z \le r^2 \\ 0 & \text{elsewhere} \end{cases}$

where $c_1(t) = \left(1 + \dfrac{p}{2}\right) \big/ v_r(t)$

Biweight Kernel

$K_t(z) = \begin{cases} c_2(t)\left(1 - \dfrac{z' V_t^{-1} z}{r^2}\right)^2 & \text{if } z' V_t^{-1} z \le r^2 \\ 0 & \text{elsewhere} \end{cases}$

where $c_2(t) = \left(1 + \dfrac{p}{4}\right) c_1(t)$

Triweight Kernel

$K_t(z) = \begin{cases} c_3(t)\left(1 - \dfrac{z' V_t^{-1} z}{r^2}\right)^3 & \text{if } z' V_t^{-1} z \le r^2 \\ 0 & \text{elsewhere} \end{cases}$

where $c_3(t) = \left(1 + \dfrac{p}{6}\right) c_2(t)$

The group t density at x is estimated by

$f_t(x) = \frac{1}{n_t} \sum_y K_t(x - y)$

where the summation is over all observations y in group t, and K_t is the specified kernel function. The posterior probability of membership in group t is then given by

$p(t \mid x) = \frac{q_t f_t(x)}{f(x)}$

where $f(x) = \sum_u q_u f_u(x)$ is the estimated unconditional density. If f(x) is zero, the observation x is classified into group OTHER.

The uniform-kernel method treats K_t(z) as a multivariate uniform function with density uniformly distributed over the ellipsoid $z' V_t^{-1} z \le r^2$. Let k_t be the number of training set observations y from group t within the closed ellipsoid centered at x specified by $(x - y)' V_t^{-1} (x - y) \le r^2$. Then the group t density at x is estimated by

$f_t(x) = \frac{k_t}{n_t\, v_r(t)}$

When the identity matrix or the pooled within-group covariance matrix is used in calculating the squared distance, v_r(t) is a constant, independent of group membership. The posterior probability of x belonging to group t is then given by

$p(t \mid x) = \frac{q_t k_t / n_t}{\sum_u q_u k_u / n_u}$

If the closed ellipsoid centered at x does not include any training set observations, f(x) is zero and x is classified into group OTHER. When the prior probabilities are equal, p(t|x) is proportional to k_t/n_t, and x is classified into the group that has the highest proportion of observations in the closed ellipsoid. When the prior probabilities are proportional to the group sizes, $p(t \mid x) = k_t / \sum_u k_u$, and x is classified into the group that has the largest number of observations in the closed ellipsoid.

The nearest-neighbor method fixes the number, k, of training set points for each observation x. The method finds the radius r_k(x) that is the distance from x to the kth nearest training set point in the metric $(x - y)' S_p^{-1} (x - y)$. Consider a closed ellipsoid centered at x bounded by $(z - x)' S_p^{-1} (z - x) = r_k^2(x)$; the nearest-neighbor method is equivalent to the uniform-kernel method with a location-dependent radius r_k(x). Note that, with ties, more than k training set points may be in the ellipsoid.

Using the k-nearest-neighbor rule, the k (or more with ties) smallest distances are saved. Of these k distances, let k_t represent the number of distances that are associated with group t. Then, as in the uniform-kernel method, the estimated group t density at x is

$f_t(x) = \frac{k_t}{n_t\, v_k(x)}$

where v_k(x) is the volume of the ellipsoid bounded by $(z - x)' S_p^{-1} (z - x) = r_k^2(x)$. Since the pooled within-group covariance matrix is used to calculate the distances used in the nearest-neighbor method, the volume v_k(x) is a constant independent of group membership. When k = 1 is used in the nearest-neighbor rule, x is classified into the group associated with the y point that yields the smallest squared distance $(x - y)' S_p^{-1} (x - y)$. Prior probabilities affect nearest-neighbor results in the same way that they affect uniform-kernel results.
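A minimal nearest-neighbor sketch, again with hypothetical data set and variable names, fixes k through the K= option; the distances are then based on the pooled covariance matrix as described above:

   proc discrim data=train method=npar k=5;
      class group;
      var x1 x2;
   run;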

With a specified squared distance formula (METRIC=, POOL=), the values of r and k determine the degree of irregularity in the estimate of the density function, and they are called smoothing parameters. Small values of r or k produce jagged density estimates, and large values of r or k produce smoother density estimates. Various methods for choosing the smoothing parameters have been suggested, and there is as yet no simple solution to this problem.

For a fixed kernel shape, one way to choose the smoothing parameter r is to plot estimated densities with different values of r and to choose the estimate that is most in accordance with the prior information about the density. For many applications, this approach is satisfactory.

Another way of selecting the smoothing parameter r is to choose a value that optimizes a given criterion. Different groups may have different sets of optimal values. Assume that the unknown density has bounded and continuous second derivatives and that the kernel is a symmetric probability density function. One criterion is to minimize an approximate mean integrated square error of the estimated density (Rosenblatt 1956). The resulting optimal value of r depends on the density function and the kernel. A reasonable choice for the smoothing parameter r is to optimize the criterion with the assumption that group t has a normal distribution with covariance matrix V_t. Then, in group t, the resulting optimal value for r is given by

$r = \left(\frac{A(K_t)}{n_t}\right)^{1/(p+4)}$

where the optimal constant A ( K t ) depends on the kernel K t (Epanechnikov 1969). For some useful kernels, the constants A ( K t ) are given by

$A(K_t) = 2^{p+2}(p + 2)\,\Gamma\!\left(\frac{p}{2} + 1\right)$

with a uniform kernel

$A(K_t) = \frac{4}{p + 2}$

with a normal kernel

$A(K_t) = 2^{p+3}(p + 4)\,\Gamma\!\left(\frac{p}{2} + 1\right)$

with an Epanechnikov kernel
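As a small worked example, the normal-kernel value of r can be evaluated in a DATA step; the values of p and n_t below are hypothetical:

   data _null_;
      p  = 3;                           /* number of variables (assumed)        */
      nt = 100;                         /* training observations in group t     */
      A  = 4 / (p + 2);                 /* A(Kt) for the normal kernel          */
      r  = (A / nt) ** (1 / (p + 4));   /* optimal radius r = (A/nt)^(1/(p+4))  */
      put 'Optimal normal-kernel radius: ' r;
   run;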

These selections of A(K_t) are derived under the assumption that the data in each group are from a multivariate normal distribution with covariance matrix V_t. However, when Euclidean distances are used in calculating the squared distance ($V_t = I$), the smoothing constant should be multiplied by s, where s is an estimate of the standard deviations for all variables. A reasonable choice for s is

$s = \left(\frac{1}{p} \sum_j s_{jj}\right)^{1/2}$

where the $s_{jj}$ are group t marginal variances.

The DISCRIM procedure uses only a single smoothing parameter for all groups. However, with the selection of the matrix to be used in the distance formula (using the METRIC= or POOL= option), individual groups and variables can have different scalings. When V t , the matrix used in calculating the squared distances, is an identity matrix, the kernel estimate on each data point is scaled equally for all variables in all groups. When V t is the diagonal matrix of a covariance matrix, each variable in group t is scaled separately by its variance in the kernel estimation, where the variance can be the pooled variance ( V t = S p ) or an individual within-group variance ( V t = S t ). When V t is a full covariance matrix, the variables in group t are scaled simultaneously by V t in the kernel estimation.

In nearest-neighbor methods, the choice of k is usually relatively uncritical (Hand 1982). A practical approach is to try several different values of the smoothing parameters within the context of the particular application and to choose the one that gives the best cross validated estimate of the error rate.

Classification Error-Rate Estimates

A classification criterion can be evaluated by its performance in the classification of future observations. PROC DISCRIM uses two types of error-rate estimates to evaluate the derived classification criterion based on parameters estimated by the training sample:

  • error-count estimates

  • posterior probability error-rate estimates.

The error-count estimate is calculated by applying the classification criterion derived from the training sample to a test set and then counting the number of misclassified observations. The group-specific error-count estimate is the proportion of misclassified observations in the group. When the test set is independent of the training sample, the estimate is unbiased. However, it can have a large variance, especially if the test set is small.

When the input data set is an ordinary SAS data set and no independent test sets are available, the same data set can be used both to define and to evaluate the classification criterion. The resulting error-count estimate has an optimistic bias and is called an apparent error rate. To reduce the bias, you can split the data into two sets, one set for deriving the discriminant function and the other set for estimating the error rate. Such a split-sample method has the unfortunate effect of reducing the effective sample size.

Another way to reduce bias is cross validation (Lachenbruch and Mickey 1968). Cross validation treats n − 1 out of n training observations as a training set. It determines the discriminant functions based on these n − 1 observations and then applies them to classify the one observation left out. This is done for each of the n training observations. The misclassification rate for each group is the proportion of sample observations in that group that are misclassified. This method achieves a nearly unbiased estimate but with a relatively large variance.
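Cross validation is requested with the CROSSVALIDATE option; CROSSLISTERR additionally lists the misclassified observations. The data set and variable names in this sketch are hypothetical.

   proc discrim data=train method=normal crossvalidate crosslisterr;
      class group;
      var x1 x2;
   run;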

To reduce the variance in an error-count estimate, smoothed error-rate estimates are suggested (Glick 1978). Instead of summing terms that are either zero or one as in the error-count estimator, the smoothed estimator uses a continuum of values between zero and one in the terms that are summed. The resulting estimator has a smaller variance than the error-count estimate. The posterior probability error-rate estimates provided by the POSTERR option in the PROC DISCRIM statement (see the following section, Posterior Probability Error-Rate Estimates ) are smoothed error-rate estimates. The posterior probability estimates for each group are based on the posterior probabilities of the observations classified into that same group. The posterior probability estimates provide good estimates of the error rate when the posterior probabilities are accurate. When a parametric classification criterion (linear or quadratic discriminant function) is derived from a nonnormal population, the resulting posterior probability error-rate estimators may not be appropriate.

The overall error rate is estimated through a weighted average of the individual group-specific error-rate estimates, where the prior probabilities are used as the weights.

To reduce both the bias and the variance of the estimator, Hora and Wilcox (1982) compute the posterior probability estimates based on cross validation. The resulting estimates are intended to have both low variance from using the posterior probability estimate and low bias from cross validation. They use Monte Carlo studies on two-group multivariate normal distributions to compare the cross validation posterior probability estimates with three other estimators: the apparent error rate, cross validation estimator, and posterior probability estimator. They conclude that the cross validation posterior probability estimator has a lower mean squared error in their simulations.

Quasi-Inverse

Consider the plot shown in Figure 25.6 with two variables, X1 and X2, and two classes, A and B. The within-class covariance matrix is diagonal, with a positive value for X1 but zero for X2. Using a Moore-Penrose pseudo-inverse would effectively ignore X2 in doing the classification, and the two classes would have a zero generalized distance and could not be discriminated at all. The quasi-inverse used by PROC DISCRIM replaces the zero variance for X2 by a small positive number to remove the singularity. This allows X2 to be used in the discrimination and results correctly in a large generalized distance between the two classes and a zero error rate. It also allows new observations, such as the one indicated by N, to be classified in a reasonable way. PROC CANDISC also uses a quasi-inverse when the total-sample covariance matrix is considered to be singular and Mahalanobis distances are requested. This problem with singular within-class covariance matrices is discussed in Ripley (1996, p. 38). The use of the quasi-inverse is an innovation introduced by SAS Institute Inc.

Figure 25.6: Plot of Data with Singular Within-Class Covariance Matrix

Let S be a singular covariance matrix. The matrix S can be either a within-group covariance matrix, a pooled covariance matrix, or a total-sample covariance matrix. Let v be the number of variables in the VAR statement, and let the nullity n be the number of variables among them with (partial) R² exceeding 1 − p, where p is the singularity criterion (the value of the SINGULAR= option). If the determinant of S (Testing of Homogeneity of Within Covariance Matrices) or the inverse of S (Squared Distances and Generalized Squared Distances) is required, a quasi-determinant or quasi-inverse is used instead. PROC DISCRIM scales each variable to unit total-sample variance before calculating this quasi-inverse. The calculation is based on the spectral decomposition $S = \Gamma \Lambda \Gamma'$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_j$, j = 1, ..., v, with $\lambda_i \ge \lambda_j$ when i < j, and $\Gamma$ is a matrix with the corresponding orthonormal eigenvectors of S as columns. When the nullity n is less than v, set $\tilde{\lambda}_j = \lambda_j$ for j = 1, ..., v − n, and $\tilde{\lambda}_j = p\bar{\lambda}$ for j = v − n + 1, ..., v, where

$\bar{\lambda} = \frac{1}{v - n} \sum_{k=1}^{v-n} \lambda_k$

When the nullity n is equal to v, set $\tilde{\lambda}_j = p$ for j = 1, ..., v. A quasi-determinant is then defined as the product of the $\tilde{\lambda}_j$, j = 1, ..., v. Similarly, a quasi-inverse is then defined as $S^* = \Gamma \Lambda^* \Gamma'$, where $\Lambda^*$ is a diagonal matrix of the values $1/\tilde{\lambda}_j$, j = 1, ..., v.

Posterior Probability Error-Rate Estimates

The posterior probability error-rate estimates (Fukunaga and Kessell 1973; Glick 1978; Hora and Wilcox 1982) for each group are based on the posterior probabilities of the observations classified into that same group.

A sample of observations with classification results can be used to estimate the posterior error rates. The following notation is used to describe the sample.

S          the set of observations in the (training) sample
n          the number of observations in S
n_t        the number of observations in S in group t
R_t        the set of observations such that the posterior probability belonging to group t is the largest
R_ut       the set of observations from group u such that the posterior probability belonging to group t is the largest

The classification error rate for group t is defined as

$e_t = \Pr(x \notin R_t \mid \text{group } t) = 1 - \int_{R_t} f_t(x)\, dx$

The posterior probability of x for group t can be written as

$p(t \mid x) = \frac{q_t f_t(x)}{f(x)}$

where $f(x) = \sum_u q_u f_u(x)$ is the unconditional density of x.

Thus, if you replace $f_t(x)$ with $p(t \mid x)\, f(x) / q_t$, the error rate is

$e_t = 1 - \frac{1}{q_t} \int_{R_t} p(t \mid x)\, f(x)\, dx$

An estimator of e_t, unstratified over the groups from which the observations come, is then given by

$\hat{e}_t = 1 - \frac{1}{n q_t} \sum_{x \in R_t} p(t \mid x)$

where p(t|x) is estimated from the classification criterion, and the summation is over all sample observations of S classified into group t. The true group membership of each observation is not required in the estimation. The term nq_t is the number of observations that are expected to be classified into group t, given the priors. If more observations than expected are classified into group t, then $\hat{e}_t$ can be negative.

Further, if you replace f(x) with $\sum_u q_u f_u(x)$, the error rate can be written as

$e_t = 1 - \frac{1}{q_t} \sum_u q_u \int_{R_t} p(t \mid x)\, f_u(x)\, dx$

and an estimator stratified over the group from which the observations come is given by

$\hat{e}_t = 1 - \frac{1}{q_t} \sum_u q_u \left( \frac{1}{n_u} \sum_{x \in R_{ut}} p(t \mid x) \right)$

The inner summation is over all sample observations of S coming from group u and classified into group t , and n u is the number of observations originally from group u . The stratified estimate uses only the observations with known group membership. When the prior probabilities of the group membership are proportional to the group sizes, the stratified estimate is the same as the unstratified estimator.

The estimated group-specific error rates can be less than zero, usually due to a large discrepancy between prior probabilities of group membership and group sizes. To have a reliable estimate for group-specific error rate estimates, you should use group sizes that are at least approximately proportional to the prior probabilities of group membership.

A total error rate is defined as a weighted average of the individual group error rates,

$e = \sum_t q_t e_t$

and can be estimated from

$\hat{e} = \sum_t q_t \hat{e}_t$

where the group-specific estimates $\hat{e}_t$ are either the unstratified or the stratified estimates given above.

The total unstratified error-rate estimate can also be written as

$\hat{e} = 1 - \frac{1}{n} \sum_t \sum_{x \in R_t} p(t \mid x)$

which is one minus the average value of the maximum posterior probabilities for each observation in the sample. The prior probabilities of group membership do not appear explicitly in this overall estimate.
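The smoothed estimates described in this section are requested with the POSTERR option; a minimal sketch, with hypothetical data set and variable names, is

   proc discrim data=train method=normal posterr;
      class group;
      priors proportional;
      var x1 x2;
   run;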

Saving and Using Calibration Information

When you specify METHOD=NORMAL to derive a linear or quadratic discriminant function, you can save the calibration information developed by the DISCRIM procedure in a SAS data set by using the OUTSTAT= option in the procedure. PROC DISCRIM then creates a specially structured SAS data set of TYPE=LINEAR, TYPE=QUAD, or TYPE=MIXED that contains the calibration information. For more information on these data sets, see Appendix A, Special SAS Data Sets. Calibration information cannot be saved when METHOD=NPAR, but you can classify a TESTDATA= data set in the same step. For an example of this, see Example 25.1 on page 1180.

To use this calibration information to classify observations in another data set, specify both of the following:

  • the name of the calibration data set after the DATA= option in the PROC DISCRIM statement

  • the name of the data set to be classified after the TESTDATA= option in the PROC DISCRIM statement.

Here is an example:

   data original;
      input position x1 x2;
      datalines;
   ...[data lines]
   ;

   proc discrim outstat=info;
      class position;
   run;

   data check;
      input position x1 x2;
      datalines;
   ...[second set of data lines]
   ;

   proc discrim data=info testdata=check testlist;
      class position;
   run;

The first DATA step creates the SAS data set Original , which the DISCRIM procedure uses to develop a classification criterion. Specifying OUTSTAT=INFO in the PROC DISCRIM statement causes the DISCRIM procedure to store the calibration information in a new data set called Info . The next DATA step creates the data set Check . The second PROC DISCRIM statement specifies DATA=INFO and TESTDATA=CHECK so that the classification criterion developed earlier is applied to the Check data set. Note that if the CLASS variable is not present in the TESTDATA= data set, the output will not include misclassification statistics.

Input Data Sets

DATA= Data Set

When you specify METHOD=NPAR, an ordinary SAS data set is required as the input DATA= data set. When you specify METHOD=NORMAL, the DATA= data set can be an ordinary SAS data set or one of several specially structured data sets created by SAS/STAT procedures. These specially structured data sets include

  • TYPE=CORR data sets created by PROC CORR using a BY statement

  • TYPE=COV data sets created by PROC PRINCOMP using both the COV option and a BY statement

  • TYPE=CSSCP data sets created by PROC CORR using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option

  • TYPE=SSCP data sets created by PROC REG using both the OUTSSCP= option and a BY statement

  • TYPE=LINEAR, TYPE=QUAD, and TYPE=MIXED data sets produced by previous runs of PROC DISCRIM that used both METHOD=NORMAL and OUTSTAT= options

When the input data set is TYPE=CORR, TYPE=COV, TYPE=CSSCP, or TYPE=SSCP, the BY variable in these data sets becomes the CLASS variable in the DISCRIM procedure.

When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, PROC DISCRIM reads the number of observations for each class from the observations with _TYPE_ = N and reads the variable means in each class from the observations with _TYPE_ = MEAN . PROC DISCRIM then reads the within-class correlations from the observations with _TYPE_ = CORR and reads the standard deviations from the observations with _TYPE_ = STD (data set TYPE=CORR), the within-class covariances from the observations with _TYPE_ = COV (data set TYPE=COV), or the within-class corrected sums of squares and cross products from the observations with _TYPE_ = CSSCP (data set TYPE=CSSCP).
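For example, the following sketch (hypothetical data set and variable names) builds a TYPE=CORR data set with PROC CORR and a BY statement and then uses it as the DATA= data set; the BY variable group becomes the CLASS variable:

   proc sort data=train;
      by group;
   run;

   proc corr data=train outp=wstats noprint;
      by group;
      var x1 x2;
   run;

   proc discrim data=wstats method=normal;
      class group;
      var x1 x2;
   run;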

When you specify POOL=YES and the data set does not include any observations with _TYPE_ = CSSCP (data set TYPE=CSSCP), _TYPE_ = COV (data set TYPE=COV), or _TYPE_ = CORR (data set TYPE=CORR) for each class, PROC DISCRIM reads the pooled within-class information from the data set. In this case, PROC DISCRIM reads the pooled within-class covariances from the observations with _TYPE_ = PCOV (data set TYPE=COV) or reads the pooled within-class correlations from the observations with _TYPE_ = PCORR and the pooled within-class standard deviations from the observations with _TYPE_ = PSTD (data set TYPE=CORR) or the pooled within-class corrected SSCP matrix from the observations with _TYPE_ = PSSCP (data set TYPE=CSSCP).

When the input data set is TYPE=SSCP, the DISCRIM procedure reads the number of observations for each class from the observations with _TYPE_ = N , the sum of weights of observations for each class from the variable INTERCEP in observations with _TYPE_ = SSCP and _NAME_ = INTERCEPT , the variable sums from the analysis variables in observations with _TYPE_ = SSCP and _NAME_ = INTERCEPT , and the uncorrected sums of squares and crossproducts from the analysis variables in observations with _TYPE_ = SSCP and _NAME_ = variable names .

When the input data set is TYPE=LINEAR, TYPE=QUAD, or TYPE=MIXED, PROC DISCRIM reads the prior probabilities for each class from the observations with variable _TYPE_ = PRIOR .

When the input data set is TYPE=LINEAR, PROC DISCRIM reads the coefficients of the linear discriminant functions from the observations with variable _TYPE_ = LINEAR (see page 1173).

When the input data set is TYPE=QUAD, PROC DISCRIM reads the coefficients of the quadratic discriminant functions from the observations with variable _TYPE_ = QUAD (see page 1173).

When the input data set is TYPE=MIXED, PROC DISCRIM reads the coefficients of the linear discriminant functions from the observations with variable _TYPE_ = LINEAR . If there are no observations with _TYPE_ = LINEAR , PROC DISCRIM then reads the coefficients of the quadratic discriminant functions from the observations with variable _TYPE_ = QUAD (see page 1173).

TESTDATA= Data Set

The TESTDATA= data set is an ordinary SAS data set with observations that are to be classified. The quantitative variable names in this data set must match those in the DATA= data set. The TESTCLASS statement can be used to specify the variable containing group membership information of the TESTDATA= data set observations. When the TESTCLASS statement is missing and the TESTDATA= data set contains the variable given in the CLASS statement, this variable is used as the TESTCLASS variable. The TESTCLASS variable should have the same type (character or numeric) and length as the variable given in the CLASS statement. PROC DISCRIM considers an observation misclassified when the value of the TESTCLASS variable does not match the group into which the TESTDATA= observation is classified.
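Continuing the earlier calibration example, the following sketch scores a hypothetical data set NewObs whose group membership is recorded in a hypothetical variable TrueGrp; TESTOUT= saves the classification results, and TESTLISTERR lists the misclassified observations:

   proc discrim data=info testdata=newobs testout=scored testlisterr;
      class position;
      testclass truegrp;
   run;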

Output Data Sets

When an output data set includes variables containing the posterior probabilities of group membership (OUT=, OUTCROSS=, or TESTOUT= data sets) or group-specific density estimates (OUTD= or TESTOUTD= data sets), the names of these variables are constructed from the formatted values of the class levels converted to valid SAS variable names.

OUT= Data Set

The OUT= data set contains all the variables in the DATA= data set, plus new variables containing the posterior probabilities and the resubstitution classification results. The names of the new variables containing the posterior probabilities are constructed from the formatted values of the class levels converted to SAS names. A new variable, _INTO_ , with the same attributes as the CLASS variable, specifies the class to which each observation is assigned. If an observation is classified into group OTHER, the variable _INTO_ has a missing value. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. The NCAN= option determines the number of canonical variables. The names of the canonical variables are constructed as described in the CANPREFIX= option. The canonical variables have means equal to zero and pooled within-class variances equal to one.

An OUT= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
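For instance, the following sketch (hypothetical data set and variable names) writes the posterior probabilities, _INTO_, and two canonical variable scores to the OUT= data set:

   proc discrim data=train out=outpost canonical ncan=2;
      class group;
      var x1 x2 x3;
   run;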

OUTD= Data Set

The OUTD= data set contains all the variables in the DATA= data set, plus new variables containing the group-specific density estimates. The names of the new variables containing the density estimates are constructed from the formatted values of the class levels.

An OUTD= data set cannot be created if the DATA= data set is not an ordinary SAS data set.

OUTCROSS= Data Set

The OUTCROSS= data set contains all the variables in the DATA= data set, plus new variables containing the posterior probabilities and the classification results of cross validation. The names of the new variables containing the posterior probabilities are constructed from the formatted values of the class levels. A new variable, _INTO_ , with the same attributes as the CLASS variable, specifies the class to which each observation is assigned. When an observation is classified into group OTHER, the variable _INTO_ has a missing value. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. The NCAN= option determines the number of new variables. The names of the new variables are constructed as described in the CANPREFIX= option. The new variables have mean zero and pooled within-class variance equal to one.

An OUTCROSS= data set cannot be created if the DATA= data set is not an ordinary SAS data set.

TESTOUT= Data Set

The TESTOUT= data set contains all the variables in the TESTDATA= data set, plus new variables containing the posterior probabilities and the classification results. The names of the new variables containing the posterior probabilities are formed from the formatted values of the class levels. A new variable, _INTO_ , with the same attributes as the CLASS variable, gives the class to which each observation is assigned. If an observation is classified into group OTHER, the variable _INTO_ has a missing value. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. The NCAN= option determines the number of new variables. The names of the new variables are formed as described in the CANPREFIX= option.

TESTOUTD= Data Set

The TESTOUTD= data set contains all the variables in the TESTDATA= data set, plus new variables containing the group-specific density estimates. The names of the new variables containing the density estimates are formed from the formatted values of the class levels.

OUTSTAT= Data Set

The OUTSTAT= data set is similar to the TYPE=CORR data set produced by the CORR procedure. The data set contains various statistics such as means, standard deviations, and correlations. For an example of an OUTSTAT= data set, see Example 25.3 on page 1222. When you specify the CANONICAL option, canonical correlations, canonical structures, canonical coefficients, and means of canonical variables for each class are included in the data set.

If you specify METHOD=NORMAL, the output data set also includes coefficients of the discriminant functions, and the data set is TYPE=LINEAR (POOL=YES), TYPE=QUAD (POOL=NO), or TYPE=MIXED (POOL=TEST). If you specify METHOD=NPAR, this output data set is TYPE=CORR.

The OUTSTAT= data set contains the following variables:

  • the BY variables, if any

  • the CLASS variable

  • _TYPE_ , a character variable of length 8 that identifies the type of statistic

  • _NAME_ , a character variable of length 32 that identifies the row of the matrix, the name of the canonical variable, or the type of the discriminant function coefficients

  • the quantitative variables, that is, those in the VAR statement, or, if there is no VAR statement, all numeric variables not listed in any other statement

The observations, as identified by the variable _TYPE_ , have the following _TYPE_ values:

_TYPE_      Contents
N           number of observations, both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
SUMWGT      sum of weights, both for the total sample (CLASS variable missing) and within each class (CLASS variable present), if a WEIGHT statement is specified
MEAN        means, both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PRIOR       prior probability for each class
STDMEAN     total-standardized class means
PSTDMEAN    pooled within-class standardized class means
STD         standard deviations, both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSTD        pooled within-class standard deviations
BSTD        between-class standard deviations
RSQUARED    univariate R² values
LNDETERM    the natural log of the determinant or of the quasi-determinant of the within-class covariance matrix, either pooled (CLASS variable missing) or not pooled (CLASS variable present)

The following kinds of observations are identified by the combination of the variables _TYPE_ and _NAME_ . When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the row of the matrix.

_TYPE_      Contents
CSSCP       corrected SSCP matrix, both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSSCP       pooled within-class corrected SSCP matrix
BSSCP       between-class SSCP matrix
COV         covariance matrix, both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCOV        pooled within-class covariance matrix
BCOV        between-class covariance matrix
CORR        correlation matrix, both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCORR       pooled within-class correlation matrix
BCORR       between-class correlation matrix

When you request canonical discriminant analysis, the _TYPE_ variable can have one of the following values. The _NAME_ variable identifies a canonical variable.

_TYPE_      Contents
CANCORR     canonical correlations
STRUCTUR    canonical structure
BSTRUCT     between canonical structure
PSTRUCT     pooled within-class canonical structure
SCORE       standardized canonical coefficients
RAWSCORE    raw canonical coefficients
CANMEAN     means of the canonical variables for each class

When you specify METHOD=NORMAL, the _TYPE_ variable can have one of the following values. The _NAME_ variable identifies different types of coefficients in the discriminant function.

_TYPE_      Contents
LINEAR      coefficients of the linear discriminant functions
QUAD        coefficients of the quadratic discriminant functions

The values of the _NAME_ variable are as follows:

_NAME_            Contents
variable names    quadratic coefficients of the quadratic discriminant functions (a symmetric matrix for each class)
_LINEAR_          linear coefficients of the discriminant functions
_CONST_           constant coefficients of the discriminant functions

Computational Resources

In the following discussion, let

n = number of observations in the training data set

v = number of variables

c = number of class levels

k = number of canonical variables

l = length of the CLASS variable

Memory Requirements

The amount of temporary storage required depends on the discriminant method used and the options specified. The least amount of temporary storage in bytes needed to process the data is approximately

$c(32v + 3l + 128) + 8v^2 + 104v + 4l$

A parametric method (METHOD=NORMAL) requires an additional temporary memory of 12v² + 100v bytes. When you specify the CROSSVALIDATE option, this temporary storage must be increased by 4v² + 44v bytes. When a nonparametric method (METHOD=NPAR) is used, an additional temporary storage of 10v² + 94v bytes is needed if you specify METRIC=FULL to evaluate the distances.

With the MANOVA option, the temporary storage must be increased by 8v² + 96v bytes. The CANONICAL option requires a temporary storage of 2v² + 94v + 8k(v + c) bytes. The POSTERR option requires a temporary storage of 8c² + 64c + 96 bytes. Additional temporary storage is also required for the classification summary and for each output data set.

For example, in the following statements,

   proc discrim manova;
      class gp;
      var x1 x2 x3;
   run;

if the CLASS variable gp has a length of eight and the input data set contains two class levels, the procedure requires a temporary storage of 1992 bytes. This includes 1104 bytes for data processing, 480 bytes for using a parametric method, and 408 bytes for specifying the MANOVA option.

Time Requirements

The following factors determine the time requirements of discriminant analysis.

  • The time needed for reading the data and computing covariance matrices is proportional to nv². PROC DISCRIM must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from n to n ln(c).

  • The time for inverting a covariance matrix is proportional to v³.

  • With a parametric method, the time required to classify each observation is proportional to cv for a linear discriminant function and is proportional to cv² for a quadratic discriminant function. When you specify the CROSSVALIDATE option, the discriminant function is updated for each observation in the classification, and a substantial amount of time is required.

  • With a nonparametric method, the data are stored in a tree structure (Friedman, Bentley, and Finkel 1977). The time required to organize the observations into the tree structure is proportional to nv ln(n). The time for performing each tree search is proportional to ln(n). When you specify KERNEL=NORMAL, all observations in the training sample contribute to the density estimation, so more computer time is needed.

  • The time required for the canonical discriminant analysis is proportional to v³.

Each of the preceding factors has a different machine-dependent constant of proportionality.

Displayed Output

The displayed output from PROC DISCRIM includes the following:

  • Class Level Information, including the values of the classification variable, Variable Name constructed from each class value, the Frequency and Weight of each value, its Proportion in the total sample, and the Prior Probability for each class level.

Optional output includes the following:

  • Within-Class SSCP Matrices for each group

  • Pooled Within-Class SSCP Matrix

  • Between-Class SSCP Matrix

  • Total-Sample SSCP Matrix

  • Within-Class Covariance Matrices, S_t, for each group

  • Pooled Within-Class Covariance Matrix, S_p

  • Between-Class Covariance Matrix, equal to the between-class SSCP matrix divided by n(c − 1)/c, where n is the number of observations and c is the number of classes

  • Total-Sample Covariance Matrix

  • Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the within-class population correlation coefficients are zero

  • Pooled Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the partial population correlation coefficients are zero

  • Between-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the between-class population correlation coefficients are zero

  • Total-Sample Correlation Coefficients and Pr > |r| to test the hypothesis that the total population correlation coefficients are zero

  • Simple descriptive Statistics including N (the number of observations), Sum, Mean, Variance, and Standard Deviation both for the total sample and within each class

  • Total-Sample Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the total sample standard deviation

  • Pooled Within-Class Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the pooled within-class standard deviation

  • Pairwise Squared Distances Between Groups

  • Univariate Test Statistics, including Total-Sample Standard Deviations, Pooled Within-Class Standard Deviations, Between-Class Standard Deviations, R², R²/(1 − R²), F, and Pr > F (univariate F values and probability levels for one-way analyses of variance)

  • Multivariate Statistics and F Approximations, including Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, and Roy's Greatest Root with F approximations, degrees of freedom (Num DF and Den DF), and probability values (Pr > F). Each of these four multivariate statistics tests the hypothesis that the class means are equal in the population. See Chapter 2, Introduction to Regression Procedures, for more information.

If you specify METHOD=NORMAL, the following three statistics are displayed:

  • Covariance Matrix Information, including Covariance Matrix Rank and Natural Log of Determinant of the Covariance Matrix for each group (POOL=TEST, POOL=NO) and for the pooled within-group (POOL=TEST, POOL=YES)

  • Optionally, Test of Homogeneity of Within Covariance Matrices (the results of a chi-square test of homogeneity of the within-group covariance matrices) (Morrison 1976; Kendall, Stuart, and Ord 1983; Anderson 1984)

  • Pairwise Generalized Squared Distances Between Groups

If the CANONICAL option is specified, the displayed output contains these statistics:

  • Canonical Correlations

  • Adjusted Canonical Correlations (Lawley 1959). These are asymptotically less biased than the raw correlations and can be negative. The adjusted canonical correlations may not be computable and are displayed as missing values if two canonical correlations are nearly equal or if some are close to zero. A missing value is also displayed if an adjusted canonical correlation is larger than a previous adjusted canonical correlation.

  • Approximate Standard Error of the canonical correlations

  • Squared Canonical Correlations

  • Eigenvalues of $E^{-1}H$. Each eigenvalue is equal to $\rho^2/(1 - \rho^2)$, where $\rho^2$ is the corresponding squared canonical correlation, and can be interpreted as the ratio of between-class variation to within-class variation for the corresponding canonical variable. The table includes Eigenvalues, Differences between successive eigenvalues, the Proportion of the sum of the eigenvalues, and the Cumulative proportion.

  • Likelihood Ratio for the hypothesis that the current canonical correlation and all smaller ones are zero in the population. The likelihood ratio for all canonical correlations equals Wilks' lambda.

  • Approximate F statistic based on Rao's approximation to the distribution of the likelihood ratio (Rao 1973, p. 556; Kshirsagar 1972, p. 326)

  • Num DF (numerator degrees of freedom), Den DF (denominator degrees of freedom), and Pr > F , the probability level associated with the F statistic

The following statistic concerns the classification criterion:

  • the Linear Discriminant Function, but only if you specify METHOD=NORMAL and the pooled covariance matrix is used to calculate the (generalized) squared distances

When the input DATA= data set is an ordinary SAS data set, the displayed output includes the following:

  • Optionally, the Resubstitution Results including Obs, the observation number (if an ID statement is included, the values of the ID variable are displayed instead of the observation number), the actual group for the observation, the group into which the developed criterion would classify it, and the Posterior Probability of its Membership in each group

  • Resubstitution Summary, a summary of the performance of the classification criterion based on resubstitution classification results

  • Error Count Estimate of the resubstitution classification results

  • Optionally, Posterior Probability Error Rate Estimates of the resubstitution classification results

If you specify the CROSSVALIDATE option, the displayed output contains these statistics:

  • Optionally, the Cross-validation Results including Obs, the observation number (if an ID statement is included, the values of the ID variable are displayed instead of the observation number), the actual group for the observation, the group into which the developed criterion would classify it, and the Posterior Probability of its Membership in each group

  • Cross-validation Summary, a summary of the performance of the classification criterion based on cross validation classification results

  • Error Count Estimate of the cross validation classification results

  • Optionally, Posterior Probability Error Rate Estimates of the cross validation classification results

If you specify the TESTDATA= option, the displayed output contains these statistics:

  • Optionally, the Classification Results including Obs, the observation number (if a TESTID statement is included, the values of the ID variable are displayed instead of the observation number), the actual group for the observation (if a TESTCLASS statement is included), the group into which the developed criterion would classify it, and the Posterior Probability of its Membership in each group

  • Classification Summary, a summary of the performance of the classification criterion

  • Error Count Estimate of the test data classification results

  • Optionally, Posterior Probability Error Rate Estimates of the test data classification results

ODS Table Names

PROC DISCRIM assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table; a usage sketch follows the table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 25.1: ODS Tables Produced by PROC DISCRIM

ODS Table Name        Description                                                           PROC DISCRIM Option
ANOVA                 Univariate statistics                                                 ANOVA
AvePostCrossVal       Average posterior probabilities, cross validation                     POSTERR & CROSSVALIDATE
AvePostResub          Average posterior probabilities, resubstitution                       POSTERR
AvePostTestClass      Average posterior probabilities, test classification                  POSTERR & TEST=
AveRSquare            Average R-Square                                                      ANOVA
BCorr                 Between-class correlations                                            BCORR
BCov                  Between-class covariances                                             BCOV
BSSCP                 Between-class SSCP matrix                                             BSSCP
BStruc                Between canonical structure                                           CANONICAL
CanCorr               Canonical correlations                                                CANONICAL
CanonicalMeans        Class means on canonical variables                                    CANONICAL
ChiSq                 Chi-square information                                                POOL=TEST
ClassifiedCrossVal    Number of observations and percent classified, cross validation      CROSSVALIDATE
ClassifiedResub       Number of observations and percent classified, resubstitution        default
ClassifiedTestClass   Number of observations and percent classified, test classification   TEST=
Counts                Number of observations, variables, classes, df                        default
CovDF                 DF for covariance matrices, not displayed                             any *COV option
Dist                  Squared distances                                                     MAHALANOBIS
DistFValues           F values based on squared distances                                   MAHALANOBIS
DistGeneralized       Generalized squared distances                                         default
DistProb              Probabilities for F values from squared distances                     MAHALANOBIS
ErrorCrossVal         Error count estimates, cross validation                               CROSSVALIDATE
ErrorResub            Error count estimates, resubstitution                                 default
ErrorTestClass        Error count estimates, test classification                            TEST=
Levels                Class level information                                               default
LinearDiscFunc        Linear discriminant function                                          POOL=YES
LogDet                Log determinant of the covariance matrix                              default
MultStat              MANOVA                                                                MANOVA
PCoef                 Pooled standard canonical coefficients                                CANONICAL
PCorr                 Pooled within-class correlations                                      PCORR
PCov                  Pooled within-class covariances                                       PCOV
PSSCP                 Pooled within-class SSCP matrix                                       PSSCP
PStdMeans             Pooled standardized class means                                       STDMEAN
PStruc                Pooled within canonical structure                                     CANONICAL
PostCrossVal          Posterior probabilities, cross validation                             CROSSLIST or CROSSLISTERR
PostErrCrossVal       Posterior error estimates, cross validation                           POSTERR & CROSSVALIDATE
PostErrResub          Posterior error estimates, resubstitution                             POSTERR
PostErrTestClass      Posterior error estimates, test classification                        POSTERR & TEST=
PostResub             Posterior probabilities, resubstitution                               LIST or LISTERR
PostTestClass         Posterior probabilities, test classification                          TESTLIST or TESTLISTERR
RCoef                 Raw canonical coefficients                                            CANONICAL
SimpleStatistics      Simple statistics                                                     SIMPLE
TCoef                 Total-sample standard canonical coefficients                          CANONICAL
TCorr                 Total-sample correlations                                             TCORR
TCov                  Total-sample covariances                                              TCOV
TSSCP                 Total-sample SSCP matrix                                              TSSCP
TStdMeans             Total standardized class means                                        STDMEAN
TStruc                Total canonical structure                                             CANONICAL
WCorr                 Within-class correlations                                             WCORR
WCov                  Within-class covariances                                              WCOV
WSSCP                 Within-class SSCP matrices                                            WSSCP
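As a usage sketch (with hypothetical data set and variable names), the following statements capture the cross validation error-count table in a data set named errcv:

   ods output ErrorCrossVal=errcv;
   proc discrim data=train method=normal crossvalidate;
      class group;
      var x1 x2;
   run;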



