Details


Missing Values

If an observation has a missing value for any of the quantitative variables, it is omitted from the analysis. If an observation has a missing CLASS value but is otherwise complete, it is not used in computing the canonical correlations and coefficients; however, canonical variable scores are computed for that observation for the OUT= data set.

Computational Details

General Formulas

Canonical discriminant analysis is equivalent to canonical correlation analysis between the quantitative variables and a set of dummy variables coded from the class variable. In the following notation, the dummy variables are denoted by $y$ and the quantitative variables by $x$. The total sample covariance matrix for the $x$ and $y$ variables is

$$ S = \begin{bmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{bmatrix} $$

When $c$ is the number of groups, $n_t$ is the number of observations in group $t$, and $S_t$ is the sample covariance matrix for the $x$ variables in group $t$, the within-class pooled covariance matrix for the $x$ variables is

$$ S_p = \frac{1}{\sum_{t=1}^{c} n_t - c} \sum_{t=1}^{c} (n_t - 1)\, S_t $$

The canonical correlations, $\rho_i$, are the square roots of the eigenvalues, $\lambda_i$, of the following matrix. The corresponding eigenvectors are $v_i$.

$$ S_p^{-1/2}\, S_{xy}\, S_{yy}^{-1}\, S_{yx}\, S_p^{-1/2} $$

Let $V$ be the matrix with the eigenvectors $v_i$ that correspond to nonzero eigenvalues as columns. The raw canonical coefficients are calculated as follows:

$$ R = S_p^{-1/2}\, V $$

The pooled within-class standardized canonical coefficients are

$$ P = \mathrm{diag}(S_p)^{1/2}\, R $$

and the total sample standardized canonical coefficients are

$$ T = \mathrm{diag}(S_{xx})^{1/2}\, R $$

Let $X_c$ be the matrix with the centered $x$ variables as columns. The canonical scores can be calculated by any of the following:

$$ X_c R \qquad X_c\, \mathrm{diag}(S_p)^{-1/2} P \qquad X_c\, \mathrm{diag}(S_{xx})^{-1/2} T $$

For the multivariate tests based on $E^{-1}H$,

$$ E = (n - 1)\left( S_{yy} - S_{yx} S_{xx}^{-1} S_{xy} \right) $$

$$ H = (n - 1)\, S_{yx} S_{xx}^{-1} S_{xy} $$

where n is the total number of observations.
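The scaling of the coefficients can be checked with a short derivation. This is a sketch that assumes raw coefficients of the form $R = S_p^{-1/2} V$ with orthonormal eigenvectors ($V'V = I$), and standardized coefficients obtained by rescaling $R$ with the diagonal of $S_p$ or $S_{xx}$. Under those assumptions, the canonical variables have unit pooled within-class variance and the three score formulas coincide:

```latex
% Pooled within-class covariance of the canonical variables X_c R
R' S_p R = V' S_p^{-1/2}\, S_p\, S_p^{-1/2} V = V' V = I

% Equivalence of the score formulas, with
% P = diag(S_p)^{1/2} R  and  T = diag(S_xx)^{1/2} R
X_c\,\mathrm{diag}(S_p)^{-1/2} P
  = X_c\,\mathrm{diag}(S_p)^{-1/2}\,\mathrm{diag}(S_p)^{1/2} R
  = X_c R

X_c\,\mathrm{diag}(S_{xx})^{-1/2} T
  = X_c\,\mathrm{diag}(S_{xx})^{-1/2}\,\mathrm{diag}(S_{xx})^{1/2} R
  = X_c R
```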

Input Data Set

The input DATA= data set can be an ordinary SAS data set or one of several specially structured data sets created by statistical procedures available with SAS/STAT software. For more information on special types of data sets, see Appendix A, Special SAS Data Sets. The BY variable in these data sets becomes the CLASS variable in PROC CANDISC. These specially structured data sets include

  • TYPE=CORR data sets created by PROC CORR using a BY statement

  • TYPE=COV data sets created by PROC PRINCOMP using both the COV option and a BY statement

  • TYPE=CSSCP data sets created by PROC CORR using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option

  • TYPE=SSCP data sets created by PROC REG using both the OUTSSCP= option and a BY statement.

When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, PROC CANDISC reads the number of observations for each class from the observations with _TYPE_='N' and the variable means in each class from the observations with _TYPE_='MEAN'. The CANDISC procedure then reads the within-class correlations from the observations with _TYPE_='CORR', the standard deviations from the observations with _TYPE_='STD' (data set TYPE=CORR), the within-class covariances from the observations with _TYPE_='COV' (data set TYPE=COV), or the within-class corrected sums of squares and crossproducts from the observations with _TYPE_='CSSCP' (data set TYPE=CSSCP).

When the data set does not include any observations with _TYPE_='CORR' (data set TYPE=CORR), _TYPE_='COV' (data set TYPE=COV), or _TYPE_='CSSCP' (data set TYPE=CSSCP) for each class, PROC CANDISC reads the pooled within-class information from the data set. In this case, PROC CANDISC reads the pooled within-class correlations from the observations with _TYPE_='PCORR', the pooled within-class standard deviations from the observations with _TYPE_='PSTD' (data set TYPE=CORR), the pooled within-class covariances from the observations with _TYPE_='PCOV' (data set TYPE=COV), or the pooled within-class corrected SSCP matrix from the observations with _TYPE_='PSSCP' (data set TYPE=CSSCP).

When the input data set is TYPE=SSCP, PROC CANDISC reads the number of observations for each class from the observations with _TYPE_='N', the sum of weights of observations from the variable INTERCEPT in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', the variable sums from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', and the uncorrected sums of squares and crossproducts from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_='variablenames'.
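The following is a minimal sketch of the first case in the list above: a TYPE=CORR data set created by PROC CORR with a BY statement and then passed to PROC CANDISC. The use of Sashelp.Iris and the data set names Iris and IrisCorr are assumptions for illustration only; substitute your own data.

```sas
/* Assumption: Sashelp.Iris is available; any data set with a       */
/* class variable and numeric variables would serve equally well.   */
proc sort data=sashelp.iris out=Iris;
   by Species;
run;

/* OUTP= creates a TYPE=CORR data set; with a BY statement it holds */
/* the N, MEAN, STD, and CORR observations for each BY group.       */
proc corr data=Iris noprint outp=IrisCorr;
   by Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* The BY variable of the TYPE=CORR data set is named as the CLASS  */
/* variable here.                                                   */
proc candisc data=IrisCorr;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Because the input is not an ordinary SAS data set, this run cannot produce an OUT= data set of canonical scores.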

Output Data Sets

OUT= Data Set

The OUT= data set contains all the variables in the original data set plus new variables containing the canonical variable scores. You determine the number of new variables using the NCAN= option. The names of the new variables are formed as described in the PREFIX= option. The new variables have means equal to zero and pooled within-class variances equal to one. An OUT= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
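As an illustration of the OUT= data set, the sketch below requests two canonical variables and then checks their overall means. The input data set Sashelp.Iris and the names IrisScores and Can are assumptions chosen for the example.

```sas
/* NCAN=2 keeps two canonical variables; PREFIX=Can names them      */
/* Can1 and Can2 in the OUT= data set.                               */
proc candisc data=sashelp.iris out=IrisScores ncan=2 prefix=Can noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* The overall means of Can1 and Can2 should be zero up to rounding. */
/* The overall variances are generally not one: it is the pooled     */
/* within-class variance that equals one.                            */
proc means data=IrisScores mean var;
   var Can1 Can2;
run;
```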

OUTSTAT= Data Set

The OUTSTAT= data set is similar to the TYPE=CORR data set produced by the CORR procedure but contains many results in addition to those produced by the CORR procedure.

The OUTSTAT= data set is TYPE=CORR, and it contains the following variables:

  • the BY variables, if any

  • the CLASS variable

  • _TYPE_, a character variable of length 8 that identifies the type of statistic

  • _NAME_, a character variable of length 32 that identifies the row of the matrix or the name of the canonical variable

  • the quantitative variables (those in the VAR statement, or if there is no VAR statement, all numeric variables not listed in any other statement)

The observations, as identified by the variable _TYPE_, have the following _TYPE_ values:

_TYPE_      Contents
N           number of observations for both the total sample (CLASS variable missing) and within each class (CLASS variable present)
SUMWGT      sum of weights for both the total sample (CLASS variable missing) and within each class (CLASS variable present) if a WEIGHT statement is specified
MEAN        means for both the total sample (CLASS variable missing) and within each class (CLASS variable present)
STDMEAN     total-standardized class means
PSTDMEAN    pooled within-class standardized class means
STD         standard deviations for both the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSTD        pooled within-class standard deviations
BSTD        between-class standard deviations
RSQUARED    univariate $R^2$ values

The following kinds of observations are identified by the combination of the variables _TYPE_ and _NAME_ . When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the row of the matrix.

_TYPE_      Contents
CSSCP       corrected SSCP matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSSCP       pooled within-class corrected SSCP matrix
BSSCP       between-class SSCP matrix
COV         covariance matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCOV        pooled within-class covariance matrix
BCOV        between-class covariance matrix
CORR        correlation matrix for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCORR       pooled within-class correlation matrix
BCORR       between-class correlation matrix

When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the canonical variable:

_TYPE_      Contents
CANCORR     canonical correlations
STRUCTUR    canonical structure
BSTRUCT     between canonical structure
PSTRUCT     pooled within-class canonical structure
SCORE       total sample standardized canonical coefficients
PSCORE      pooled within-class standardized canonical coefficients
RAWSCORE    raw canonical coefficients
CANMEAN     means of the canonical variables for each class

You can use this data set with PROC SCORE to get scores on the canonical variables for new data using one of the following forms.

   * The CLASS variable C is numeric;
   proc score data=NewData score=Coef(where=(c = .)) out=Scores;
   run;

   * The CLASS variable C is character;
   proc score data=NewData score=Coef(where=(c = ' ')) out=Scores;
   run;

The WHERE clause is used to exclude the within-class means and standard deviations. PROC SCORE standardizes the new data by subtracting the original variable means that are stored in the _TYPE_='MEAN' observations, and dividing by the original variable standard deviations from the _TYPE_='STD' observations. Then PROC SCORE multiplies the standardized variables by the coefficients from the _TYPE_='SCORE' observations to get the canonical scores.
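For context, here is a sketch of the full round trip: create the OUTSTAT= data set, look at the observations PROC SCORE relies on, and score data with them. The data set names Coef and Scores follow the forms above; Sashelp.Iris (with a character CLASS variable, Species) is an assumption for illustration and stands in for both the training data and the new data.

```sas
/* Write the statistics to Coef without displaying any output. */
proc candisc data=sashelp.iris outstat=Coef noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Inspect the total-sample rows that PROC SCORE uses: the     */
/* means, the standard deviations, and the SCORE coefficients. */
proc print data=Coef(where=(_TYPE_ in ('MEAN', 'STD', 'SCORE')
                            and Species = ' '));
run;

/* Score the data; a blank Species selects the total-sample    */
/* rows and drops the within-class rows.                       */
proc score data=sashelp.iris score=Coef(where=(Species = ' ')) out=Scores;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```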

Computational Resources

In the following discussion, let

  • n = number of observations

  • c = number of class levels

  • v = number of variables in the VAR list

  • l = length of the CLASS variable

Memory Requirements

The amount of memory in bytes for temporary storage needed to process the data is

[formula not reproduced]

With the ANOVA option, the temporary storage must be increased by $16v$ bytes. The DISTANCE option requires an additional temporary storage of $4v^2 + 4v$ bytes.

Time Requirements

The following factors determine the time requirements of the CANDISC procedure.

  • The time needed for reading the data and computing covariance matrices is proportional to $nv^2$. PROC CANDISC must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from $n$ to $n \log(c)$.

  • The time for inverting a covariance matrix is proportional to $v^3$.

  • The time required for the canonical discriminant analysis is proportional to $v^3$.

Each of the preceding factors has a different constant of proportionality.
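Since class-level lookup is faster on sorted data, one option for large inputs is to sort by the CLASS variable first. The sketch below assumes Sashelp.Iris as input and uses the hypothetical names IrisSorted and Scores; whether the sort pays for itself depends on the data size and is not something this example measures.

```sas
/* Sort once by the CLASS variable so that observations from */
/* the same class are adjacent.                              */
proc sort data=sashelp.iris out=IrisSorted;
   by Species;
run;

proc candisc data=IrisSorted out=Scores noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```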

Displayed Output

The output produced by PROC CANDISC includes

  • Class Level Information, including the values of the classification variable, the Frequency and Weight of each value, and its Proportion in the total sample.

Optional output includes

  • Within-Class SSCP Matrices for each group

  • Pooled Within-Class SSCP Matrix

  • Between-Class SSCP Matrix

  • Total-Sample SSCP Matrix

  • Within-Class Covariance Matrices for each group

  • Pooled Within-Class Covariance Matrix

  • Between-Class Covariance Matrix, equal to the between-class SSCP matrix divided by $n(c-1)/c$, where $n$ is the number of observations and $c$ is the number of classes

  • Total-Sample Covariance Matrix

  • Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the within-class population correlation coefficients are zero

  • Pooled Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the partial population correlation coefficients are zero

  • Between-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the between-class population correlation coefficients are zero

  • Total-Sample Correlation Coefficients and Pr > |r| to test the hypothesis that the total population correlation coefficients are zero

  • Simple Statistics including N (the number of observations), Sum, Mean, Variance, and Standard Deviation both for the total sample and within each class

  • Total-Sample Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the total sample standard deviation

  • Pooled Within-Class Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the pooled within-class standard deviation

  • Pairwise Squared Distances Between Groups

  • Univariate Test Statistics, including Total-Sample Standard Deviations, Pooled Within-Class Standard Deviations, Between-Class Standard Deviations, $R^2$, $R^2/(1-R^2)$, F, and Pr > F (univariate F values and probability levels for one-way analyses of variance)

By default, PROC CANDISC displays these statistics:

  • Multivariate Statistics and F Approximations including Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, and Roy's Greatest Root with F approximations, degrees of freedom (Num DF and Den DF), and probability values (Pr > F). Each of these four multivariate statistics tests the hypothesis that the class means are equal in the population. See the Multivariate Tests section in Chapter 2, Introduction to Regression Procedures, for more information.

  • Canonical Correlations

  • Adjusted Canonical Correlations (Lawley 1959). These are asymptotically less biased than the raw correlations and can be negative. The adjusted canonical correlations may not be computable and are displayed as missing values if two canonical correlations are nearly equal or if some are close to zero. A missing value is also displayed if an adjusted canonical correlation is larger than a previous adjusted canonical correlation.

  • Approx Standard Error, approximate standard error of the canonical correlations

  • Squared Canonical Correlations

  • Eigenvalues of $E^{-1}H$. Each eigenvalue is equal to $\rho^2/(1-\rho^2)$, where $\rho^2$ is the corresponding squared canonical correlation, and can be interpreted as the ratio of between-class variation to pooled within-class variation for the corresponding canonical variable. The table includes Eigenvalues, Differences between successive eigenvalues, the Proportion of the sum of the eigenvalues, and the Cumulative proportion.

  • Likelihood Ratio for the hypothesis that the current canonical correlation and all smaller ones are zero in the population. The likelihood ratio for the hypothesis that all canonical correlations equal zero is Wilks' lambda.

  • Approx F statistic based on Rao's approximation to the distribution of the likelihood ratio (Rao 1973, p. 556; Kshirsagar 1972, p. 326)

  • Num DF (numerator degrees of freedom), Den DF (denominator degrees of freedom), and Pr > F, the probability level associated with the F statistic
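As a quick arithmetic illustration of the eigenvalue relationship in the list above, consider two hypothetical squared canonical correlations, 0.64 and 0.20. These values are made up for the example and are not taken from any data set.

```latex
% Eigenvalues from squared canonical correlations
\lambda_1 = \frac{0.64}{1 - 0.64} = \frac{0.64}{0.36} \approx 1.778
\qquad
\lambda_2 = \frac{0.20}{1 - 0.20} = \frac{0.20}{0.80} = 0.25

% Inverting the relationship recovers the squared correlation
\rho_1^2 = \frac{\lambda_1}{1 + \lambda_1} \approx \frac{1.778}{2.778} \approx 0.64

% Proportion of the sum of the eigenvalues for the first canonical variable
\frac{\lambda_1}{\lambda_1 + \lambda_2} \approx \frac{1.778}{2.028} \approx 0.877
```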

The following statistics can be suppressed with the SHORT option:

  • Total Canonical Structure, giving total-sample correlations between the canonical variables and the original variables

  • Between Canonical Structure, giving between-class correlations between the canonical variables and the original variables

  • Pooled Within Canonical Structure, giving pooled within-class correlations between the canonical variables and the original variables

  • Total-Sample Standardized Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the total-sample standardized variables

  • Pooled Within-Class Standardized Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the pooled within-class standardized variables

  • Raw Canonical Coefficients, standardized to give canonical variables with zero mean and unit pooled within-class variance when applied to the centered variables

  • Class Means on Canonical Variables
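The optional output listed above is requested through options in the PROC CANDISC statement. The sketch below, which assumes Sashelp.Iris as input, asks for the univariate tests, the squared distances between groups, simple statistics, and the pooled within-class correlations, and uses SHORT to suppress the canonical structure and coefficient tables.

```sas
/* ANOVA    : univariate test statistics                   */
/* DISTANCE : pairwise squared distances between groups    */
/* SIMPLE   : simple descriptive statistics                */
/* PCORR    : pooled within-class correlation coefficients */
/* SHORT    : suppress the structure/coefficient tables    */
proc candisc data=sashelp.iris anova distance simple pcorr short;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```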

ODS Table Names

PROC CANDISC assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 21.2: ODS Tables Produced in PROC CANDISC

ODS Table Name    Description                                            PROC CANDISC Option
ANOVA             Univariate statistics                                  ANOVA
AveRSquare        Average R-square                                       ANOVA
BCorr             Between-class correlations                             BCORR
BCov              Between-class covariances                              BCOV
BSSCP             Between-class SSCP matrix                              BSSCP
BStruc            Between canonical structure                            default
CanCorr           Canonical correlations                                 default
CanonicalMeans    Class means on canonical variables                     default
Counts            Number of observations, variables, classes, df         default
CovDF             DF for covariance matrices, not printed                any *COV option
Dist              Squared distances                                      MAHALANOBIS
DistFValues       F statistics based on squared distances                MAHALANOBIS
DistProb          Probabilities for F statistics from squared distances  MAHALANOBIS
Levels            Class level information                                default
MultStat          MANOVA                                                 default
PCoef             Pooled standard canonical coefficients                 default
PCorr             Pooled within-class correlations                       PCORR
PCov              Pooled within-class covariances                        PCOV
PSSCP             Pooled within-class SSCP matrix                        PSSCP
PStdMeans         Pooled standardized class means                        STDMEAN
PStruc            Pooled within canonical structure                      default
RCoef             Raw canonical coefficients                             default
SimpleStatistics  Simple statistics                                      SIMPLE
TCoef             Total-sample standard canonical coefficients           default
TCorr             Total-sample correlations                              TCORR
TCov              Total-sample covariances                               TCOV
TSSCP             Total-sample SSCP matrix                               TSSCP
TStdMeans         Total standardized class means                         STDMEAN
TStruc            Total canonical structure                              default
WCorr             Within-class correlations                              WCORR
WCov              Within-class covariances                               WCOV
WSSCP             Within-class SSCP matrices                             WSSCP
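The table names can be used in ODS statements placed before the procedure step. The sketch below assumes Sashelp.Iris as input; the output data set names CanCorrDS and MultStatDS are hypothetical.

```sas
/* Save two of the default tables as data sets. */
ods output CanCorr=CanCorrDS MultStat=MultStatDS;
proc candisc data=sashelp.iris;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Display only the canonical correlations and the raw coefficients. */
ods select CanCorr RCoef;
proc candisc data=sashelp.iris;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

A table that is tied to an option, such as Dist, is available only when that option is specified in the PROC CANDISC statement.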




SAS.STAT 9.1 Users Guide (Vol. 1)
SAS/STAT 9.1 Users Guide, Volumes 1-7
ISBN: 1590472438
EAN: 2147483647
Year: 2004
Pages: 156
