If an observation has a missing value for any of the quantitative variables, it is omitted from the analysis. If an observation has a missing CLASS value but is otherwise complete, it is not used in computing the canonical correlations and coefficients; however, canonical variable scores are computed for that observation for the OUT= data set.
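The two missing-value rules above can be sketched as a small decision function. This is a minimal illustration rather than the procedure's implementation: the function names are hypothetical, and treating `None`, NaN, and blank strings as "missing" is an assumption made to mimic SAS numeric and character missing values.

```python
import math


def is_missing(value):
    """Treat None, NaN, and blank strings as missing (an assumption of this sketch)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def observation_role(class_value, x_values):
    """Return how an observation would be treated under the rules above."""
    # Any missing quantitative variable removes the observation entirely.
    if any(is_missing(x) for x in x_values):
        return "omitted"
    # A complete observation with a missing CLASS value is scored but not analyzed.
    if is_missing(class_value):
        return "scored only"
    return "used in analysis"
```

For example, `observation_role("", [1.0, 2.0])` falls into the "scored only" branch because only the CLASS value is missing.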
Canonical discriminant analysis is equivalent to canonical correlation analysis between the quantitative variables and a set of dummy variables coded from the class variable. In the following notation, the dummy variables are denoted by $y$ and the quantitative variables by $x$. The total sample covariance matrix for the $x$ and $y$ variables is

$$S = \begin{bmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{bmatrix}$$

When $c$ is the number of groups, $n_t$ is the number of observations in group $t$, and $S_t$ is the sample covariance matrix for the $x$ variables in group $t$, the within-class pooled covariance matrix for the $x$ variables is

$$S_p = \frac{1}{\sum_{t=1}^{c} n_t - c} \sum_{t=1}^{c} (n_t - 1)\, S_t$$

The canonical correlations, $\rho_i$, are the square roots of the eigenvalues, $\lambda_i$, of the following matrix. The corresponding eigenvectors are $v_i$.

$$S_p^{-1/2}\, S_{xy}\, S_{yy}^{-1}\, S_{yx}\, S_p^{-1/2}$$

Let $V$ be the matrix with the eigenvectors $v_i$ that correspond to nonzero eigenvalues as columns. The raw canonical coefficients are calculated as follows:

$$R = S_p^{-1/2}\, V$$

The pooled within-class standardized canonical coefficients are

$$P = \operatorname{diag}(S_p)^{1/2}\, R$$

The total sample standardized canonical coefficients are

$$T = \operatorname{diag}(S_{xx})^{1/2}\, R$$

Let $X_c$ be the matrix with the centered $x$ variables as columns. The canonical scores can be calculated by any of the following:

$$X_c\, R$$

$$X_c\, \operatorname{diag}(S_p)^{-1/2}\, P$$

$$X_c\, \operatorname{diag}(S_{xx})^{-1/2}\, T$$

For the multivariate tests based on $E^{-1}H$,

$$E = (n - 1)\left(S_{yy} - S_{yx}\, S_{xx}^{-1}\, S_{xy}\right)$$

$$H = (n - 1)\, S_{yx}\, S_{xx}^{-1}\, S_{xy}$$
where n is the total number of observations.
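The pooled within-class covariance matrix is the one piece of this computation that is easy to reproduce by hand. The sketch below uses the standard definition: each group's covariance matrix is weighted by its degrees of freedom (group size minus one), and the sum is divided by the total number of observations minus the number of groups. The function names and the list-of-rows data layout are assumptions made for illustration.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of equal-length rows."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def pooled_covariance(groups):
    """Pool the per-group covariance matrices, weighting each by n_t - 1."""
    c = len(groups)
    total_n = sum(len(group) for group in groups)
    p = len(groups[0][0])
    pooled = [[0.0] * p for _ in range(p)]
    for group in groups:
        s_t = covariance_matrix(group)
        weight = len(group) - 1
        for i in range(p):
            for j in range(p):
                pooled[i][j] += weight * s_t[i][j]
    degrees_of_freedom = total_n - c
    return [[value / degrees_of_freedom for value in row] for row in pooled]
```

Each group needs at least two observations in this sketch, because a single observation has no sample covariance.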
The input DATA= data set can be an ordinary SAS data set or one of several specially structured data sets created by statistical procedures available with SAS/STAT software. For more information on special types of data sets, see Appendix A, Special SAS Data Sets. The BY variable in these data sets becomes the CLASS variable in PROC CANDISC. These specially structured data sets include
TYPE=CORR data sets created by PROC CORR using a BY statement
TYPE=COV data sets created by PROC PRINCOMP using both the COV option and a BY statement
TYPE=CSSCP data sets created by PROC CORR using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option
TYPE=SSCP data sets created by PROC REG using both the OUTSSCP= option and a BY statement.
When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, PROC CANDISC reads the number of observations for each class from the observations with _TYPE_= N and the variable means in each class from the observations with _TYPE_= MEAN . The CANDISC procedure then reads the within-class correlations from the observations with _TYPE_= CORR , the standard deviations from the observations with _TYPE_= STD (data set TYPE=CORR), the within-class covariances from the observations with _TYPE_= COV (data set TYPE=COV), or the within-class corrected sums of squares and crossproducts from the observations with _TYPE_= CSSCP (data set TYPE=CSSCP).
When the data set does not include any observations with _TYPE_= CORR (data set TYPE=CORR), _TYPE_= COV (data set TYPE=COV), or _TYPE_= CSSCP (data set TYPE=CSSCP) for each class, PROC CANDISC reads the pooled within-class information from the data set. In this case, PROC CANDISC reads the pooled within-class correlations from the observations with _TYPE_= PCORR , the pooled within-class standard deviations from the observations with _TYPE_= PSTD (data set TYPE=CORR), the pooled within-class covariances from the observations with _TYPE_= PCOV (data set TYPE=COV), or the pooled within-class corrected SSCP matrix from the observations with _TYPE_= PSSCP (data set TYPE=CSSCP).
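A TYPE=CORR data set stores correlations and standard deviations rather than covariances, so the two have to be combined before a covariance matrix is available. The following sketch shows the standard identity (covariance equals correlation times the two standard deviations). It illustrates the arithmetic only and is not a description of the procedure's internals; the function name is hypothetical.

```python
def covariance_from_correlation(corr, stds):
    """Rebuild a covariance matrix from a correlation matrix and standard deviations."""
    p = len(stds)
    return [[corr[i][j] * stds[i] * stds[j] for j in range(p)] for i in range(p)]
```

The same identity applies whether the inputs are within-class values (CORR and STD observations) or pooled values (PCORR and PSTD observations).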
When the input data set is TYPE=SSCP, PROC CANDISC reads the number of observations for each class from the observations with _TYPE_= N , the sum of weights of observations from the variable INTERCEPT in observations with _TYPE_= SSCP and _NAME_= INTERCEPT , the variable sums from the variable= variablenames in observations with _TYPE_= SSCP and _NAME_= INTERCEPT , and the uncorrected sums of squares and crossproducts from the variable= variablenames in observations with _TYPE_= SSCP and _NAME_= variablenames .
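The quantities read from a TYPE=SSCP data set are uncorrected, so means and corrected sums of squares and crossproducts have to be derived from them. A minimal sketch of that standard conversion follows. The function and argument names are hypothetical, and the three inputs stand for the sum of weights, the variable sums, and the uncorrected SSCP matrix described above.

```python
def corrected_from_sscp(sum_weights, sums, sscp):
    """Derive means and the corrected SSCP matrix from uncorrected SSCP quantities."""
    p = len(sums)
    means = [total / sum_weights for total in sums]
    # Subtract the contribution of the means from each uncorrected crossproduct.
    csscp = [
        [sscp[i][j] - sums[i] * sums[j] / sum_weights for j in range(p)]
        for i in range(p)
    ]
    return means, csscp
```

Dividing the corrected SSCP matrix by the appropriate degrees of freedom then gives a covariance matrix.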
The OUT= data set contains all the variables in the original data set plus new variables containing the canonical variable scores. You determine the number of new variables using the NCAN= option. The names of the new variables are formed as described in the PREFIX= option. The new variables have means equal to zero and pooled within-class variances equal to one. An OUT= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
The OUTSTAT= data set is similar to the TYPE=CORR data set produced by the CORR procedure but contains many results in addition to those produced by the CORR procedure.
The OUTSTAT= data set is TYPE=CORR, and it contains the following variables:
the BY variables, if any
the CLASS variable
_TYPE_ , a character variable of length 8 that identifies the type of statistic
_NAME_ , a character variable of length 32 that identifies the row of the matrix or the name of the canonical variable
the quantitative variables (those in the VAR statement, or if there is no VAR statement, all numeric variables not listed in any other statement)
The observations, as identified by the variable _TYPE_ , have the following _TYPE_ values:
_TYPE_ | Contents |
---|---|
N | number of observations for both the total sample (CLASS variable missing) and within each class (CLASS variable present) |
SUMWGT | sum of weights for both the total sample (CLASS variable missing) and within each class (CLASS variable present) if a WEIGHT statement is specified |
MEAN | means for both the total sample (CLASS variable missing) and within each class (CLASS variable present) |
STDMEAN | total-standardized class means |
PSTDMEAN | pooled within-class standardized class means |
STD | standard deviations for both the total sample (CLASS variable missing) and within each class (CLASS variable present) |
PSTD | pooled within-class standard deviations |
BSTD | between-class standard deviations |
RSQUARED | univariate $R^2$ values |
The following kinds of observations are identified by the combination of the variables _TYPE_ and _NAME_ . When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the row of the matrix.
_TYPE_ | Contents |
---|---|
CSSCP | corrected SSCP matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present) |
PSSCP | pooled within-class corrected SSCP matrix |
BSSCP | between-class SSCP matrix |
COV | covariance matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present) |
PCOV | pooled within-class covariance matrix |
BCOV | between-class covariance matrix |
CORR | correlation matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present) |
PCORR | pooled within-class correlation matrix |
BCORR | between-class correlation matrix |
When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the canonical variable:
_TYPE_ | Contents |
---|---|
CANCORR | canonical correlations |
STRUCTUR | canonical structure |
BSTRUCT | between canonical structure |
PSTRUCT | pooled within-class canonical structure |
SCORE | total sample standardized canonical coefficients |
PSCORE | pooled within-class standardized canonical coefficients |
RAWSCORE | raw canonical coefficients |
CANMEAN | means of the canonical variables for each class |
You can use this data set with PROC SCORE to get scores on the canonical variables for new data using one of the following forms.
```sas
* The CLASS variable C is numeric;
proc score data=NewData score=Coef(where=(c = . )) out=Scores;
run;

* The CLASS variable C is character;
proc score data=NewData score=Coef(where=(c = ' ')) out=Scores;
run;
```
The WHERE clause is used to exclude the within-class means and standard deviations. PROC SCORE standardizes the new data by subtracting the original variable means that are stored in the _TYPE_ = MEAN observations, and dividing by the original variable standard deviations from the _TYPE_ = STD observations. Then PROC SCORE multiplies the standardized variables by the coefficients from the _TYPE_ = SCORE observations to get the canonical scores.
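The scoring arithmetic described above can be sketched outside SAS as follows. This is an illustration under stated assumptions, not PROC SCORE itself: the OUTSTAT= observations are modeled as Python dictionaries, a missing CLASS value is represented by `None` or a blank string, and the function and argument names are hypothetical.

```python
def score_new_observation(outstat_rows, class_var, variables, x):
    """Compute canonical scores for one new observation from OUTSTAT-like rows."""
    # Mimic the WHERE clause: keep only total-sample rows (CLASS variable missing).
    total_rows = [row for row in outstat_rows if row[class_var] in (None, "")]
    means = next(row for row in total_rows if row["_TYPE_"] == "MEAN")
    stds = next(row for row in total_rows if row["_TYPE_"] == "STD")
    scores = {}
    for row in total_rows:
        if row["_TYPE_"] == "SCORE":
            # Standardize each variable, then apply the total-sample coefficients.
            scores[row["_NAME_"]] = sum(
                (x[name] - means[name]) / stds[name] * row[name] for name in variables
            )
    return scores
```

Without the class filter, the first MEAN or STD row found could be a within-class row, which is the problem the WHERE clause avoids.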
In the following discussion, let
n = number of observations
c = number of class levels
v = number of variables in the VAR list
l = length of the CLASS variable
The amount of memory in bytes for temporary storage needed to process the data is
With the ANOVA option, the temporary storage must be increased by $16v$ bytes. The DISTANCE option requires an additional temporary storage of $4v^2 + 4v$ bytes.
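The two option-dependent increments can be added up with a few lines of arithmetic. The sketch below covers only the extra storage for the ANOVA and DISTANCE options as stated above, not the base requirement; the function name and its keyword arguments are hypothetical.

```python
def extra_temporary_storage(v, anova=False, distance=False):
    """Extra temporary storage in bytes for the ANOVA and DISTANCE options."""
    extra = 0
    if anova:
        extra += 16 * v
    if distance:
        extra += 4 * v * v + 4 * v
    return extra
```

With four variables and both options requested, the increments are 64 and 80 bytes.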
The following factors determine the time requirements of the CANDISC procedure.
The time needed for reading the data and computing covariance matrices is proportional to $nv^2$. PROC CANDISC must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from $n$ to $n \log(c)$.
The time for inverting a covariance matrix is proportional to $v^3$.
The time required for the canonical discriminant analysis is proportional to $v^3$.
Each of the preceding factors has a different constant of proportionality.
The output produced by PROC CANDISC includes
Class Level Information, including the values of the classification variable, the Frequency and Weight of each value, and its Proportion in the total sample.
Optional output includes
Within-Class SSCP Matrices for each group
Pooled Within-Class SSCP Matrix
Between-Class SSCP Matrix
Total-Sample SSCP Matrix
Within-Class Covariance Matrices for each group
Pooled Within-Class Covariance Matrix
Between-Class Covariance Matrix, equal to the between-class SSCP matrix divided by $n(c - 1)/c$, where $n$ is the number of observations and $c$ is the number of classes
Total-Sample Covariance Matrix
Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the within-class population correlation coefficients are zero
Pooled Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the partial population correlation coefficients are zero
Between-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the between-class population correlation coefficients are zero
Total-Sample Correlation Coefficients and Pr > |r| to test the hypothesis that the total population correlation coefficients are zero
Simple Statistics including N (the number of observations), Sum, Mean, Variance, and Standard Deviation both for the total sample and within each class
Total-Sample Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the total sample standard deviation
Pooled Within-Class Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the pooled within-class standard deviation
Pairwise Squared Distances Between Groups
Univariate Test Statistics, including Total-Sample Standard Deviations, Pooled Within-Class Standard Deviations, Between-Class Standard Deviations, $R^2$, $R^2/(1 - R^2)$, $F$, and Pr > F (univariate $F$ values and probability levels for one-way analyses of variance)
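Two of the optional outputs listed above are defined by simple arithmetic: the between-class covariance matrix and the standardized class means. The sketch below works through both as stated in the list. The function names are hypothetical, and the matrices are plain nested lists.

```python
def between_class_covariance(bsscp, n, c):
    """Divide the between-class SSCP matrix by n(c - 1)/c."""
    divisor = n * (c - 1) / c
    return [[value / divisor for value in row] for row in bsscp]


def standardized_class_mean(class_mean, grand_mean, standard_deviation):
    """Subtract the grand mean from a class mean and divide by a standard deviation.

    Pass the total-sample standard deviation for the total-standardized mean,
    or the pooled within-class standard deviation for the pooled version.
    """
    return (class_mean - grand_mean) / standard_deviation
```

The only difference between the two kinds of standardized class means is which standard deviation is supplied.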
By default, PROC CANDISC displays these statistics:
Multivariate Statistics and F Approximations including Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, and Roy's Greatest Root with F approximations, degrees of freedom (Num DF and Den DF), and probability values (Pr > F). Each of these four multivariate statistics tests the hypothesis that the class means are equal in the population. See the Multivariate Tests section in Chapter 2, Introduction to Regression Procedures, for more information.
Canonical Correlations
Adjusted Canonical Correlations (Lawley 1959). These are asymptotically less biased than the raw correlations and can be negative. The adjusted canonical correlations may not be computable and are displayed as missing values if two canonical correlations are nearly equal or if some are close to zero. A missing value is also displayed if an adjusted canonical correlation is larger than a previous adjusted canonical correlation.
Approx Standard Error, approximate standard error of the canonical correlations
Squared Canonical Correlations
Eigenvalues of $E^{-1}H$. Each eigenvalue is equal to $\rho^2/(1 - \rho^2)$, where $\rho^2$ is the corresponding squared canonical correlation and can be interpreted as the ratio of between-class variation to pooled within-class variation for the corresponding canonical variable. The table includes Eigenvalues, Differences between successive eigenvalues, the Proportion of the sum of the eigenvalues, and the Cumulative proportion.
Likelihood Ratio for the hypothesis that the current canonical correlation and all smaller ones are zero in the population. The likelihood ratio for the hypothesis that all canonical correlations equal zero is Wilks' lambda.
Approx F statistic based on Rao's approximation to the distribution of the likelihood ratio (Rao 1973, p. 556; Kshirsagar 1972, p. 326)
Num DF (numerator degrees of freedom), Den DF (denominator degrees of freedom), and Pr > F, the probability level associated with the F statistic
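Several columns of the default output follow directly from the canonical correlations. The sketch below derives the eigenvalue columns from the relation given above, in which each eigenvalue is the squared canonical correlation divided by one minus that square. It also computes the likelihood ratios using the standard identity that the ratio for a given canonical correlation and all smaller ones is the product of one minus each of their squares. That identity is not spelled out in the text, so treat it as an assumption of this example. The function names are hypothetical, and the input is assumed to be sorted from largest to smallest.

```python
def eigenvalue_rows(canonical_correlations):
    """Eigenvalue, difference, proportion, and cumulative proportion per canonical variable."""
    eigenvalues = [r * r / (1.0 - r * r) for r in canonical_correlations]
    total = sum(eigenvalues)
    rows = []
    cumulative = 0.0
    for index, eigenvalue in enumerate(eigenvalues):
        # The difference is undefined for the last eigenvalue.
        is_last = index + 1 == len(eigenvalues)
        difference = None if is_last else eigenvalue - eigenvalues[index + 1]
        proportion = eigenvalue / total
        cumulative += proportion
        rows.append(
            {
                "eigenvalue": eigenvalue,
                "difference": difference,
                "proportion": proportion,
                "cumulative": cumulative,
            }
        )
    return rows


def likelihood_ratios(canonical_correlations):
    """Likelihood ratio for each canonical correlation and all smaller ones."""
    ratios = []
    product = 1.0
    for r in reversed(canonical_correlations):
        product *= 1.0 - r * r
        ratios.append(product)
    ratios.reverse()
    return ratios
```

The first element returned by `likelihood_ratios` covers all canonical correlations and therefore corresponds to Wilks' lambda.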
The following statistics can be suppressed with the SHORT option:
Total Canonical Structure, giving total-sample correlations between the canonical variables and the original variables
Between Canonical Structure, giving between-class correlations between the canonical variables and the original variables
Pooled Within Canonical Structure, giving pooled within-class correlations between the canonical variables and the original variables
Total-Sample Standardized Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the total-sample standardized variables
Pooled Within-Class Standardized Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the pooled within-class standardized variables
Raw Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the centered variables
Class Means on Canonical Variables
PROC CANDISC assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.
ODS Table Name | Description | PROC CANDISC Option |
---|---|---|
ANOVA | Univariate statistics | ANOVA |
AveRSquare | Average R-square | ANOVA |
BCorr | Between-class correlations | BCORR |
BCov | Between-class covariances | BCOV |
BSSCP | Between-class SSCP matrix | BSSCP |
BStruc | Between canonical structure | default |
CanCorr | Canonical correlations | default |
CanonicalMeans | Class means on canonical variables | default |
Counts | Number of observations, variables, classes, df | default |
CovDF | DF for covariance matrices, not printed | any *COV option |
Dist | Squared distances | MAHALANOBIS |
DistFValues | F statistics based on squared distances | MAHALANOBIS |
DistProb | Probabilities for F statistics from squared distances | MAHALANOBIS |
Levels | Class level information | default |
MultStat | MANOVA | default |
PCoef | Pooled standard canonical coefficients | default |
PCorr | Pooled within-class correlations | PCORR |
PCov | Pooled within-class covariances | PCOV |
PSSCP | Pooled within-class SSCP matrix | PSSCP |
PStdMeans | Pooled standardized class means | STDMEAN |
PStruc | Pooled within canonical structure | default |
RCoef | Raw canonical coefficients | default |
SimpleStatistics | Simple statistics | SIMPLE |
TCoef | Total-sample standard canonical coefficients | default |
TCorr | Total-sample correlations | TCORR |
TCov | Total-sample covariances | TCOV |
TSSCP | Total-sample SSCP matrix | TSSCP |
TStdMeans | Total standardized class means | STDMEAN |
TStruc | Total canonical structure | default |
WCorr | Within-class correlations | WCORR |
WCov | Within-class covariances | WCOV |
WSSCP | Within-class SSCP matrices | WSSCP |