Details | SAS/STAT 9.1 User's Guide, Volumes 1-7

Missing Values

If an observation has a missing value for any of the quantitative variables, it is omitted from the analysis. If an observation has a missing CLASS value but is otherwise complete, it is not used in computing the canonical correlations and coefficients; however, canonical variable scores are computed for that observation for the OUT= data set.
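The treatment of a missing CLASS value can be checked with a small experiment. The following is a minimal sketch, assuming the Sashelp.Iris sample data set is available in your installation (it ships with later SAS releases and may not be present in every one); the data set names IrisMiss and IrisScores are hypothetical.

```sas
/* Copy the sample data and blank out the CLASS value of one observation. */
data IrisMiss;
   set sashelp.iris;
   if _n_ = 1 then Species = ' ';
run;

/* Fit the analysis; the observation with the blank Species is not used */
/* to estimate the coefficients.                                        */
proc candisc data=IrisMiss out=IrisScores noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* The first observation should still carry canonical scores. */
/* Can1 and Can2 are the default names formed by PREFIX=.      */
proc print data=IrisScores(obs=3);
   var Species Can1 Can2;
run;
```

If one of the four quantitative variables were set to missing instead, the observation would be omitted from the analysis.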

Computational Details

General Formulas

Canonical discriminant analysis is equivalent to canonical correlation analysis between the quantitative variables and a set of dummy variables coded from the class variable. In the following notation, the dummy variables are denoted by y and the quantitative variables by x. The total sample covariance matrix for the x and y variables is

$$ S = \begin{bmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{bmatrix} $$

When c is the number of groups, $n_t$ is the number of observations in group t, and $S_t$ is the sample covariance matrix for the x variables in group t, the within-class pooled covariance matrix for the x variables is

$$ S_p = \frac{1}{\sum_{t=1}^{c} n_t - c} \sum_{t=1}^{c} (n_t - 1) S_t $$

The canonical correlations, $\rho_i$, are the square roots of the eigenvalues, $\lambda_i$, of the following matrix. The corresponding eigenvectors are $v_i$.

$$ S_p^{-1/2} S_{xy} S_{yy}^{-1} S_{yx} S_p^{-1/2} $$

Let V be the matrix with the eigenvectors $v_i$ that correspond to nonzero eigenvalues as columns. The raw canonical coefficients are calculated as follows:

$$ R = S_p^{-1/2} V $$

The pooled within-class standardized canonical coefficients are

$$ P = \operatorname{diag}(S_p)^{1/2} R $$

and the total sample standardized canonical coefficients are

$$ T = \operatorname{diag}(S_{xx})^{1/2} R $$

Let $X_c$ be the matrix with the centered x variables as columns. The canonical scores can be calculated by any of the following:

$$ X_c R $$

$$ X_c \operatorname{diag}(S_p)^{-1/2} P $$

$$ X_c \operatorname{diag}(S_{xx})^{-1/2} T $$

For the multivariate tests based on $E^{-1}H$,

$$ E = (n - 1)\left(S_{yy} - S_{yx} S_{xx}^{-1} S_{xy}\right) $$

$$ H = (n - 1)\, S_{yx} S_{xx}^{-1} S_{xy} $$

where n is the total number of observations.

Input Data Set

The input DATA= data set can be an ordinary SAS data set or one of several specially structured data sets created by statistical procedures available with SAS/STAT software. For more information on special types of data sets, see Appendix A, "Special SAS Data Sets." The BY variable in these data sets becomes the CLASS variable in PROC CANDISC. These specially structured data sets include

TYPE=CORR data sets created by PROC CORR using a BY statement
TYPE=COV data sets created by PROC PRINCOMP using both the COV option and a BY statement
TYPE=CSSCP data sets created by PROC CORR using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option
TYPE=SSCP data sets created by PROC REG using both the OUTSSCP= option and a BY statement.
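As an illustration of the first case in the list, the following sketch builds a TYPE=CORR data set with PROC CORR and a BY statement and then passes it to PROC CANDISC. It assumes the Sashelp.Iris sample data set is available; the names IrisSorted and IrisCorr are hypothetical.

```sas
/* BY-group processing requires the data to be sorted by the BY variable. */
proc sort data=sashelp.iris out=IrisSorted;
   by Species;
run;

/* OUTP= writes a TYPE=CORR data set with N, MEAN, STD, and CORR */
/* observations for each BY group.                               */
proc corr data=IrisSorted outp=IrisCorr noprint;
   by Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Species, the BY variable above, is named as the CLASS variable here. */
proc candisc data=IrisCorr;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Because the input is not an ordinary SAS data set, this run cannot create an OUT= data set of canonical scores.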

When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, PROC CANDISC reads the number of observations for each class from the observations with _TYPE_='N' and the variable means in each class from the observations with _TYPE_='MEAN'. The CANDISC procedure then reads the within-class correlations from the observations with _TYPE_='CORR', the standard deviations from the observations with _TYPE_='STD' (data set TYPE=CORR), the within-class covariances from the observations with _TYPE_='COV' (data set TYPE=COV), or the within-class corrected sums of squares and crossproducts from the observations with _TYPE_='CSSCP' (data set TYPE=CSSCP).

When the data set does not include any observations with _TYPE_='CORR' (data set TYPE=CORR), _TYPE_='COV' (data set TYPE=COV), or _TYPE_='CSSCP' (data set TYPE=CSSCP) for each class, PROC CANDISC reads the pooled within-class information from the data set. In this case, PROC CANDISC reads the pooled within-class correlations from the observations with _TYPE_='PCORR', the pooled within-class standard deviations from the observations with _TYPE_='PSTD' (data set TYPE=CORR), the pooled within-class covariances from the observations with _TYPE_='PCOV' (data set TYPE=COV), or the pooled within-class corrected SSCP matrix from the observations with _TYPE_='PSSCP' (data set TYPE=CSSCP).

When the input data set is TYPE=SSCP, PROC CANDISC reads the number of observations for each class from the observations with _TYPE_='N', the sum of weights of observations from the variable INTERCEPT in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', the variable sums from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', and the uncorrected sums of squares and crossproducts from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_='variablenames'.

Output Data Sets

OUT= Data Set

The OUT= data set contains all the variables in the original data set plus new variables containing the canonical variable scores. You determine the number of new variables using the NCAN= option. The names of the new variables are formed as described in the PREFIX= option. The new variables have means equal to zero and pooled within-class variances equal to one. An OUT= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
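The following sketch shows one way to request an OUT= data set and to check the stated property that the new variables have zero means. It assumes the Sashelp.Iris sample data set is available; the data set name IrisCan and the prefix Disc are hypothetical choices.

```sas
/* Keep two canonical variables and name them Disc1 and Disc2. */
proc candisc data=sashelp.iris out=IrisCan ncan=2 prefix=Disc noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* The overall means of Disc1 and Disc2 should be zero up to rounding error. */
proc means data=IrisCan mean std;
   var Disc1 Disc2;
run;
```

The standard deviations printed by PROC MEANS here are total-sample values, so they are not expected to equal one; it is the pooled within-class variance that is scaled to one.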

OUTSTAT= Data Set

The OUTSTAT= data set is similar to the TYPE=CORR data set produced by the CORR procedure but contains many results in addition to those produced by the CORR procedure.

The OUTSTAT= data set is TYPE=CORR, and it contains the following variables:

the BY variables, if any
the CLASS variable
_TYPE_, a character variable of length 8 that identifies the type of statistic
_NAME_, a character variable of length 32 that identifies the row of the matrix or the name of the canonical variable
the quantitative variables (those in the VAR statement, or if there is no VAR statement, all numeric variables not listed in any other statement)

The observations, as identified by the variable _TYPE_, have the following _TYPE_ values:

_TYPE_	Contents
N	number of observations for both the total sample (CLASS variable missing) and within each class (CLASS variable present)
SUMWGT	sum of weights for both the total sample (CLASS variable missing) and within each class (CLASS variable present) if a WEIGHT statement is specified
MEAN	means for both the total sample (CLASS variable missing) and within each class (CLASS variable present)
STDMEAN	total-standardized class means
PSTDMEAN	pooled within-class standardized class means
STD	standard deviations for both the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSTD	pooled within-class standard deviations
BSTD	between-class standard deviations
RSQUARED	univariate R² values

The following kinds of observations are identified by the combination of the variables _TYPE_ and _NAME_. When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the row of the matrix.

_TYPE_	Contents
CSSCP	corrected SSCP matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSSCP	pooled within-class corrected SSCP matrix
BSSCP	between-class SSCP matrix
COV	covariance matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCOV	pooled within-class covariance matrix
BCOV	between-class covariance matrix
CORR	correlation matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCORR	pooled within-class correlation matrix
BCORR	between-class correlation matrix

When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the canonical variable:

_TYPE_	Contents
CANCORR	canonical correlations
STRUCTUR	canonical structure
BSTRUCT	between canonical structure
PSTRUCT	pooled within-class canonical structure
SCORE	total sample standardized canonical coefficients
PSCORE	pooled within-class standardized canonical coefficients
RAWSCORE	raw canonical coefficients
CANMEAN	means of the canonical variables for each class

You can use this data set with PROC SCORE to get scores on the canonical variables for new data using one of the following forms.

   * The CLASS variable C is numeric;
   proc score data=NewData score=Coef(where=(c = .)) out=Scores;
   run;

   * The CLASS variable C is character;
   proc score data=NewData score=Coef(where=(c = ' ')) out=Scores;
   run;

The WHERE clause is used to exclude the within-class means and standard deviations. PROC SCORE standardizes the new data by subtracting the original variable means that are stored in the _TYPE_='MEAN' observations and dividing by the original variable standard deviations from the _TYPE_='STD' observations. Then PROC SCORE multiplies the standardized variables by the coefficients from the _TYPE_='SCORE' observations to get the canonical scores.
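For context, the scoring step can be sketched end to end, including the PROC CANDISC run that creates the coefficient data set. This sketch assumes the Sashelp.Iris sample data set is available, and NewData is a stand-in built from a few of its rows rather than genuinely new observations.

```sas
/* Save the statistics and canonical coefficients in an OUTSTAT= data set. */
proc candisc data=sashelp.iris outstat=Coef noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Hypothetical new data: the same quantitative variables, no CLASS variable. */
data NewData;
   set sashelp.iris(obs=5 keep=SepalLength SepalWidth PetalLength PetalWidth);
run;

/* Species is character, so the total-sample rows have a blank CLASS value. */
proc score data=NewData score=Coef(where=(Species = ' ')) out=Scores;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc print data=Scores;
run;
```

The Scores data set should contain the original variables plus Can1 and Can2, named after the _NAME_ values of the _TYPE_='SCORE' observations.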

Computational Resources

In the following discussion, let

n = number of observations
c = number of class levels
v = number of variables in the VAR list
l = length of the CLASS variable

Memory Requirements

The amount of memory in bytes for temporary storage needed to process the data is

With the ANOVA option, the temporary storage must be increased by 16v bytes. The DISTANCE option requires an additional temporary storage of 4v² + 4v bytes.
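As a quick arithmetic illustration of the two increments, the following DATA step evaluates them for a hypothetical analysis with v = 10 variables. The variable names are made up for the example.

```sas
data _null_;
   v = 10;                            /* hypothetical number of VAR variables       */
   anova_extra    = 16*v;             /* additional bytes with the ANOVA option     */
   distance_extra = 4*v**2 + 4*v;     /* additional bytes with the DISTANCE option  */
   put anova_extra= distance_extra=;  /* writes both values to the log              */
run;
```

For v = 10 this gives 160 additional bytes for ANOVA and 440 additional bytes for DISTANCE.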

Time Requirements

The following factors determine the time requirements of the CANDISC procedure.

The time needed for reading the data and computing covariance matrices is proportional to nv². PROC CANDISC must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from n to n log(c).
The time for inverting a covariance matrix is proportional to v³.
The time required for the canonical discriminant analysis is proportional to v³.

Each of the preceding factors has a different constant of proportionality.

Displayed Output

The output produced by PROC CANDISC includes

Class Level Information, including the values of the classification variable, the Frequency and Weight of each value, and its Proportion in the total sample.

Optional output includes

Within-Class SSCP Matrices for each group
Pooled Within-Class SSCP Matrix
Between-Class SSCP Matrix
Total-Sample SSCP Matrix
Within-Class Covariance Matrices for each group
Pooled Within-Class Covariance Matrix
Between-Class Covariance Matrix, equal to the between-class SSCP matrix divided by n(c − 1)/c, where n is the number of observations and c is the number of classes
Total-Sample Covariance Matrix
Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the within-class population correlation coefficients are zero
Pooled Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the partial population correlation coefficients are zero
Between-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the between-class population correlation coefficients are zero
Total-Sample Correlation Coefficients and Pr > |r| to test the hypothesis that the total population correlation coefficients are zero
Simple Statistics including N (the number of observations), Sum, Mean, Variance, and Standard Deviation both for the total sample and within each class
Total-Sample Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the total sample standard deviation
Pooled Within-Class Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the pooled within-class standard deviation
Pairwise Squared Distances Between Groups
Univariate Test Statistics, including Total-Sample Standard Deviations, Pooled Within-Class Standard Deviations, Between-Class Standard Deviations, R², R²/(1 − R²), F, and Pr > F (univariate F values and probability levels for one-way analyses of variance)

By default, PROC CANDISC displays these statistics:

Multivariate Statistics and F Approximations including Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, and Roy's Greatest Root with F approximations, degrees of freedom (Num DF and Den DF), and probability values (Pr > F). Each of these four multivariate statistics tests the hypothesis that the class means are equal in the population. See the "Multivariate Tests" section in Chapter 2, "Introduction to Regression Procedures," for more information.
Canonical Correlations
Adjusted Canonical Correlations (Lawley 1959). These are asymptotically less biased than the raw correlations and can be negative. The adjusted canonical correlations may not be computable and are displayed as missing values if two canonical correlations are nearly equal or if some are close to zero. A missing value is also displayed if an adjusted canonical correlation is larger than a previous adjusted canonical correlation.
Approx Standard Error, approximate standard error of the canonical correlations
Squared Canonical Correlations
Eigenvalues of E^{-1}H. Each eigenvalue is equal to ρ²/(1 − ρ²), where ρ² is the corresponding squared canonical correlation, and can be interpreted as the ratio of between-class variation to pooled within-class variation for the corresponding canonical variable. The table includes Eigenvalues, Differences between successive eigenvalues, the Proportion of the sum of the eigenvalues, and the Cumulative proportion.
Likelihood Ratio for the hypothesis that the current canonical correlation and all smaller ones are zero in the population. The likelihood ratio for the hypothesis that all canonical correlations equal zero is Wilks' lambda.
Approx F statistic based on Rao's approximation to the distribution of the likelihood ratio (Rao 1973, p. 556; Kshirsagar 1972, p. 326)
Num DF (numerator degrees of freedom), Den DF (denominator degrees of freedom), and Pr > F, the probability level associated with the F statistic
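The relation between an eigenvalue in this table and its squared canonical correlation can be illustrated with a short calculation. The value of rho below is a hypothetical canonical correlation chosen only for the example.

```sas
data _null_;
   rho = 0.9;                     /* hypothetical canonical correlation            */
   rsq = rho**2;                  /* squared canonical correlation                 */
   eigenvalue = rsq / (1 - rsq);  /* corresponding eigenvalue of inverse(E) * H    */
   put rsq= eigenvalue=;          /* writes both values to the log                 */
run;
```

A canonical correlation of 0.9 corresponds to an eigenvalue of roughly 4.26, meaning the between-class variation on that canonical variable is a little more than four times the pooled within-class variation.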

The following statistics can be suppressed with the SHORT option:

Total Canonical Structure, giving total-sample correlations between the canonical variables and the original variables
Between Canonical Structure, giving between-class correlations between the canonical variables and the original variables
Pooled Within Canonical Structure, giving pooled within-class correlations between the canonical variables and the original variables
Total-Sample Standardized Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the total-sample standardized variables
Pooled Within-Class Standardized Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the pooled within-class standardized variables
Raw Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the centered variables
Class Means on Canonical Variables

ODS Table Names

PROC CANDISC assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 21.2: ODS Tables Produced in PROC CANDISC
ODS Table Name	Description	PROC CANDISC Option
ANOVA	Univariate statistics	ANOVA
AveRSquare	Average R-square	ANOVA
BCorr	Between-class correlations	BCORR
BCov	Between-class covariances	BCOV
BSSCP	Between-class SSCP matrix	BSSCP
BStruc	Between canonical structure	default
CanCorr	Canonical correlations	default
CanonicalMeans	Class means on canonical variables	default
Counts	Number of observations, variables, classes, df	default
CovDF	DF for covariance matrices, not printed	any *COV option
Dist	Squared distances	MAHALANOBIS
DistFValues	F statistics based on squared distances	MAHALANOBIS
DistProb	Probabilities for F statistics from squared distances	MAHALANOBIS
Levels	Class level information	default
MultStat	MANOVA	default
PCoef	Pooled standard canonical coefficients	default
PCorr	Pooled within-class correlations	PCORR
PCov	Pooled within-class covariances	PCOV
PSSCP	Pooled within-class SSCP matrix	PSSCP
PStdMeans	Pooled standardized class means	STDMEAN
PStruc	Pooled within canonical structure	default
RCoef	Raw canonical coefficients	default
SimpleStatistics	Simple statistics	SIMPLE
TCoef	Total-sample standard canonical coefficients	default
TCorr	Total-sample correlations	TCORR
TCov	Total-sample covariances	TCOV
TSSCP	Total-sample SSCP matrix	TSSCP
TStdMeans	Total standardized class means	STDMEAN
TStruc	Total canonical structure	default
WCorr	Within-class correlations	WCORR
WCov	Within-class covariances	WCOV
WSSCP	Within-class SSCP matrices	WSSCP
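The table names can be used in ODS SELECT and ODS OUTPUT statements placed before the procedure step. The following sketch assumes the Sashelp.Iris sample data set is available; CanCorrTable is a hypothetical data set name.

```sas
/* Display only the canonical correlation and MANOVA tables. */
ods select CanCorr MultStat;

/* Also capture the canonical correlation table as a data set. */
ods output CanCorr=CanCorrTable;

proc candisc data=sashelp.iris;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc print data=CanCorrTable;
run;
```

Tables that are tied to an option in the third column, such as ANOVA or PCorr, are created only when that option is specified in the PROC CANDISC statement.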