Observations containing missing values are omitted from the analysis.
The input data set can be an ordinary SAS data set or one of several specially structured data sets created by statistical procedures available with SAS/STAT software. For more information on these data sets, see Appendix A, 'Special SAS Data Sets.' The BY variable in these data sets becomes the CLASS variable in PROC STEPDISC. These specially structured data sets include
TYPE=CORR data sets created by PROC CORR using a BY statement
TYPE=COV data sets created by PROC PRINCOMP using both the COV option and a BY statement
TYPE=CSSCP data sets created by PROC CORR using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option
TYPE=SSCP data sets created by PROC REG using both the OUTSSCP= option and a BY statement
When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, the STEPDISC procedure reads the number of observations for each class from the observations with _TYPE_='N' and the variable means in each class from the observations with _TYPE_='MEAN'. The procedure then reads the within-class correlations from the observations with _TYPE_='CORR', the standard deviations from the observations with _TYPE_='STD' (data set TYPE=CORR), the within-class covariances from the observations with _TYPE_='COV' (data set TYPE=COV), or the within-class corrected sums of squares and crossproducts from the observations with _TYPE_='CSSCP' (data set TYPE=CSSCP).
When the data set does not include any observations with _TYPE_='CORR' (data set TYPE=CORR), _TYPE_='COV' (data set TYPE=COV), or _TYPE_='CSSCP' (data set TYPE=CSSCP) for each class, PROC STEPDISC reads the pooled within-class information from the data set. In this case, the STEPDISC procedure reads the pooled within-class correlations from the observations with _TYPE_='PCORR', the pooled within-class standard deviations from the observations with _TYPE_='PSTD' (data set TYPE=CORR), the pooled within-class covariances from the observations with _TYPE_='PCOV' (data set TYPE=COV), or the pooled within-class corrected SSCP matrix from the observations with _TYPE_='PSSCP' (data set TYPE=CSSCP).
When the input data set is TYPE=SSCP, the STEPDISC procedure reads the number of observations for each class from the observations with _TYPE_='N', the sum of weights of observations from the variable INTERCEPT in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', the variable sums from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', and the uncorrected sums of squares and crossproducts from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_=variablenames.
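The relationship between the uncorrected SSCP layout described above and the class means and corrected sums of squares can be sketched as follows. This is a minimal illustration in Python, not the procedure's internal code: the function name `summarize_sscp`, the list-of-lists layout with INTERCEPT as the first row and column, and the toy numbers are all assumptions made for the example.

```python
def summarize_sscp(sscp):
    """Split an uncorrected SSCP matrix (INTERCEPT first) into
    the sum of weights, the variable means, and the corrected SSCP."""
    sum_w = sscp[0][0]          # INTERCEPT x INTERCEPT cell: sum of weights
    sums = sscp[0][1:]          # rest of the INTERCEPT row: variable sums
    means = [s / sum_w for s in sums]
    p = len(sums)
    # corrected SSCP = uncorrected SSCP - (sum_i * sum_j) / sum of weights
    csscp = [
        [sscp[i + 1][j + 1] - sums[i] * sums[j] / sum_w for j in range(p)]
        for i in range(p)
    ]
    return sum_w, means, csscp


# Hypothetical class with unit weights, x1 = 1, 2, 3 and x2 = 2, 4, 7.
sscp = [
    [3.0, 6.0, 13.0],    # INTERCEPT row: n, sum(x1), sum(x2)
    [6.0, 14.0, 31.0],   # x1 row: sum(x1), sum(x1*x1), sum(x1*x2)
    [13.0, 31.0, 69.0],  # x2 row: sum(x2), sum(x1*x2), sum(x2*x2)
]
sum_w, means, csscp = summarize_sscp(sscp)
```

With unit weights the sum of weights equals the number of observations, so the same arithmetic yields the per-class means and the within-class corrected SSCP matrix.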
In the following discussion, let
n = number of observations
c = number of class levels
v = number of variables in the VAR list
l = length of the CLASS variable
t = v + c − 1.
The amount of memory in bytes for temporary storage needed to process the data is
Additional temporary storage of 72 bytes at each step is also required to store the results.
The following factors determine the time requirements of a stepwise discriminant analysis.
The time needed for reading the data and computing covariance matrices is proportional to nv². The STEPDISC procedure must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from n to n ln(c).
The time needed for stepwise discriminant analysis is proportional to the number of steps required to select the set of variables in the discrimination model. The number of steps required depends on the data set itself and the selection method and criterion used in the procedure. Each forward or backward step takes time proportional to (v + c)².
The STEPDISC procedure displays the following output:
Class Level Information, including the values of the classification variable, the Frequency of each value, the Weight of each value, and the Proportion of each value in the total sample
Optional output includes
Within-Class SSCP Matrices for each group
Pooled Within-Class SSCP Matrix
Between-Class SSCP Matrix
Total-Sample SSCP Matrix
Within-Class Covariance Matrices for each group
Pooled Within-Class Covariance Matrix
Between-Class Covariance Matrix, equal to the between-class SSCP matrix divided by n(c − 1)/c, where n is the number of observations and c is the number of classes
Total-Sample Covariance Matrix
Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the within-class population correlation coefficients are zero
Pooled Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the partial population correlation coefficients are zero
Between-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the between-class population correlation coefficients are zero
Total-Sample Correlation Coefficients and Pr > |r| to test the hypothesis that the total population correlation coefficients are zero
descriptive Simple Statistics including N (the number of observations), Sum, Mean, Variance, and Standard Deviation for the total sample and within each class
Total-Sample Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the total-sample standard deviation
Pooled Within-Class Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the pooled within-class standard deviation
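The two kinds of standardized class means and the between-class covariance divisor listed above can be illustrated for a single variable. The sketch below is plain Python with made-up data; the function name `standardized_class_means` is hypothetical, and the divisors n − 1 (total-sample variance) and n − c (pooled within-class variance) are the usual textbook choices, assumed here rather than taken from the text.

```python
from math import sqrt


def standardized_class_means(groups):
    """groups maps each class level to a list of values for one variable."""
    values = [x for xs in groups.values() for x in xs]
    n, c = len(values), len(groups)
    grand = sum(values) / n
    means = {k: sum(xs) / len(xs) for k, xs in groups.items()}

    total_ss = sum((x - grand) ** 2 for x in values)
    within_ss = sum((x - means[k]) ** 2 for k, xs in groups.items() for x in xs)
    between_ss = total_ss - within_ss

    total_std = sqrt(total_ss / (n - 1))    # assumed divisor: n - 1
    pooled_std = sqrt(within_ss / (n - c))  # assumed divisor: n - c
    # divisor n(c - 1)/c as stated for the between-class covariance matrix
    between_var = between_ss / (n * (c - 1) / c)

    total_std_means = {k: (m - grand) / total_std for k, m in means.items()}
    pooled_std_means = {k: (m - grand) / pooled_std for k, m in means.items()}
    return total_std_means, pooled_std_means, between_var


# Hypothetical data: two classes, three observations each.
tstd, pstd, bvar = standardized_class_means(
    {"A": [1.0, 2.0, 3.0], "B": [5.0, 6.0, 7.0]}
)
```

In this toy data the classes are far apart relative to their internal spread, so the pooled within-class standardized means are larger in magnitude than the total-sample ones.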
At each step, the following statistics are displayed:
for each variable considered for entry or removal: Partial R-Square, the squared (partial) correlation, the F statistic, and Pr > F, the probability level, from a one-way analysis of covariance
the minimum Tolerance for entering each variable. A variable is entered only if its tolerance and the tolerances for all variables already in the model are greater than the value specified in the SINGULAR= option. The tolerance for the entering variable is 1 − R² from regressing the entering variable on the other variables already in the model. The tolerance for a variable already in the model is 1 − R² from regressing that variable on the entering variable and the other variables already in the model. With m variables already in the model, for each entering variable, m + 1 multiple regressions are performed using the entering variable and each of the m variables already in the model as a dependent variable. These m + 1 tolerances are computed for each entering variable, and the minimum tolerance is displayed for each.
The tolerance is computed using the total-sample correlation matrix. It is customary to compute tolerance using the pooled within-class correlation matrix (Jennrich 1977), but it is possible for a variable with excellent discriminatory power to have a high total-sample tolerance and a low pooled within-class tolerance. For example, PROC STEPDISC enters a variable that yields perfect discrimination (that is, produces a canonical correlation of one), but a program using pooled within-class tolerance does not.
the variable Label, if any
the name of the variable chosen
the variables already selected or removed
Wilks' Lambda and the associated F approximation with degrees of freedom and Pr < F, the associated probability level after the selected variable has been entered or removed. Wilks' lambda is the likelihood ratio statistic for testing the hypothesis that the means of the classes on the selected variables are equal in the population (see the 'Multivariate Tests' section in Chapter 2, 'Introduction to Regression Procedures'). Lambda is close to zero if any two groups are well separated.
Pillai's Trace and the associated F approximation with degrees of freedom and Pr > F, the associated probability level after the selected variable has been entered or removed. Pillai's trace is a multivariate statistic for testing the hypothesis that the means of the classes on the selected variables are equal in the population (see the 'Multivariate Tests' section in Chapter 2).
Average Squared Canonical Correlation (ASCC). The ASCC is Pillai's trace divided by the number of groups minus 1. The ASCC is close to 1 if all groups are well separated and if all or most directions in the discriminant space show good separation for at least two groups.
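The minimum tolerance and the multivariate statistics described in this list can be sketched numerically. The Python below is an illustration under stated assumptions, not the procedure's algorithm: the helper names (`inverse`, `det`, `min_tolerance`, `multivariate_stats`) and the matrices are invented for the example. It relies on two standard identities: the tolerance 1 − R² of a variable regressed on the others equals the reciprocal of the matching diagonal element of the inverse correlation matrix, and, with W the within-class SSCP and T the total-sample SSCP, Wilks' lambda is det(W)/det(T) and Pillai's trace is the trace of (T − W)T⁻¹.

```python
def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (no singularity check)."""
    n = len(m)
    a = [[float(x) for x in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def det(m):
    """Determinant by cofactor expansion; adequate for small matrices."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def min_tolerance(corr, in_model, candidate):
    """Smallest 1 - R^2 among the candidate and the variables already in the model."""
    idx = in_model + [candidate]
    sub = [[corr[i][j] for j in idx] for i in idx]
    inv = inverse(sub)
    return min(1.0 / inv[k][k] for k in range(len(idx)))


def multivariate_stats(within, total, n_classes):
    """Return Wilks' lambda, Pillai's trace, and ASCC from SSCP matrices."""
    p = len(total)
    between = [[total[i][j] - within[i][j] for j in range(p)] for i in range(p)]
    t_inv = inverse(total)
    wilks = det(within) / det(total)
    pillai = sum(between[i][k] * t_inv[k][i] for i in range(p) for k in range(p))
    return wilks, pillai, pillai / (n_classes - 1)


# Hypothetical total-sample correlation matrix for three variables.
corr = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
tol = min_tolerance(corr, in_model=[0], candidate=1)

# Hypothetical within-class and total-sample SSCP matrices for two classes.
W = [[4.0, 1.0], [1.0, 4.0]]
T = [[28.0, 13.0], [13.0, 10.0]]
wilks, pillai, ascc = multivariate_stats(W, T, n_classes=2)
```

A candidate would pass the entry check only if `tol` exceeds the SINGULAR= value. With two classes there is a single canonical correlation, so in this example Wilks' lambda and Pillai's trace sum to 1 and the ASCC equals Pillai's trace.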
Summary to give statistics associated with the variable chosen at each step. The summary includes the following:
Step number
Variable Entered or Removed
Number In, the number of variables in the model
Partial R-Square
the F Value for entering or removing the variable
Pr > F, the probability level for the F statistic
Wilks' Lambda
Pr < Lambda based on the F approximation to Wilks' Lambda
Average Squared Canonical Correlation
Pr > ASCC based on the F approximation to Pillai's trace
the variable Label, if any
PROC STEPDISC assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'
ODS Table Name | Description | PROC STEPDISC Option |
---|---|---|
BCorr | Between-class correlations | BCORR |
BCov | Between-class covariances | BCOV |
BSSCP | Between-class SSCP matrix | BSSCP |
Counts | Number of observations, variables, classes, df | default |
CovDF | DF for covariance matrices, not printed | any *COV option |
Levels | Class level information | default |
Messages | Entry/removal messages | default |
Multivariate | Multivariate statistics | default |
PCorr | Pooled within-class correlations | PCORR |
PCov | Pooled within-class covariances | PCOV |
PSSCP | Pooled within-class SSCP matrix | PSSCP |
PStdMeans | Pooled standardized class means | STDMEAN |
SimpleStatistics | Simple statistics | SIMPLE |
Steps | Stepwise selection entry/removal | default |
Summary | Stepwise selection summary | default |
TCorr | Total-sample correlations | TCORR |
TCov | Total-sample covariances | TCOV |
TSSCP | Total-sample SSCP matrix | TSSCP |
TStdMeans | Total standardized class means | STDMEAN |
Variables | Variable lists | default |
WCorr | Within-class correlations | WCORR |
WCov | Within-class covariances | WCOV |
WSSCP | Within-class SSCP matrices | WSSCP |