PROC CORRESP can read two kinds of input:
raw category responses on two or more classification variables with the TABLES statement
a two-way contingency table with the VAR statement
You can use output from PROC FREQ as input for PROC CORRESP.
The classification variables referred to by the TABLES statement can be either numeric or character variables. Normally, all observations for a given variable that have the same formatted value are placed in the same level, and observations with different values are placed in different levels.
The variables in the VAR statement must be numeric. The values of the observations specify the cell frequencies. These values are not required to be integers, but only those observations with all nonnegative, nonmissing values are used in the correspondence analysis. Observations with one or more negative values are removed from the analysis.
The WEIGHT variable must be numeric. Observations with negative weights are treated as supplementary observations. The absolute values of the weights are used to weight the observations.
The following example explains correspondence analysis and illustrates some capabilities of PROC CORRESP.
data Neighbor;
   input Name $ 1-10 Age $ 12-18 Sex $ 19-25 Height $ 26-30 Hair $ 32-37;
   datalines;
Jones      Old    Male   Short White
Smith      Young  Female Tall  Brown
Kasavitz   Old    Male   Short Brown
Ernst      Old    Female Tall  White
Zannoria   Old    Female Short Brown
Spangel    Young  Male   Tall  Blond
Myers      Young  Male   Tall  Brown
Kasinski   Old    Male   Short Blond
Colman     Young  Female Short Blond
Delafave   Old    Male   Tall  Brown
Singer     Young  Male   Tall  Brown
Igor       Old           Short
;
There are several types of tables, N, that can be used as input to correspondence analysis; all of them can be defined in terms of a binary matrix, Z.
With the BINARY option, N = Z is directly analyzed. The binary matrix has one column for each category and one row for each individual or case. A binary table constructed from m categorical variables has m partitions. The following table has four partitions, one for each of the four categorical variables. Each partition has exactly one 1 in each row, so each row contains exactly four 1s, since there are four categorical variables. More generally, the binary design matrix has exactly m 1s in each row. The 1s indicate the categories to which the observation applies.
The nine columns form four partitions: Z_Hair (Blond, Brown, White), Z_Height (Short, Tall), Z_Sex (Female, Male), and Z_Age (Old, Young). The observation Igor is excluded because of its missing values, leaving eleven rows.

Blond | Brown | White | Short | Tall | Female | Male | Old | Young |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
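The construction of Z from raw responses is mechanical. The following NumPy sketch (illustrative only, not part of PROC CORRESP; the helper name binary_matrix is invented) builds the first few rows of the table above.

```python
# Illustrative sketch (not PROC CORRESP): build a binary design matrix Z
# from categorical responses. binary_matrix is a hypothetical helper.
import numpy as np

def binary_matrix(obs, levels):
    """One row per case, one column per category level; a 1 marks
    the level to which the case belongs within each partition."""
    cols = [lev for var in levels for lev in var]
    Z = np.zeros((len(obs), len(cols)), dtype=int)
    for i, row in enumerate(obs):
        for val in row:
            Z[i, cols.index(val)] = 1
    return Z, cols

# Hair, Height, Sex, Age for the first three complete observations
obs = [
    ("White", "Short", "Male",   "Old"),    # Jones
    ("Brown", "Tall",  "Female", "Young"),  # Smith
    ("Brown", "Short", "Male",   "Old"),    # Kasavitz
]
levels = [("Blond", "Brown", "White"), ("Short", "Tall"),
          ("Female", "Male"), ("Old", "Young")]
Z, cols = binary_matrix(obs, levels)
print(Z.sum(axis=1))   # each row has exactly m = 4 ones
```

Each row sums to the number of categorical variables, matching the rule stated above.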
With the MCA option, the Burt table (Z′Z) is analyzed. A Burt table is a partitioned symmetric matrix containing all pairs of crosstabulations among a set of categorical variables. Each diagonal partition is a diagonal matrix containing marginal frequencies (a crosstabulation of a variable with itself). Each off-diagonal partition is an ordinary contingency table. Each contingency table above the diagonal has a transposed counterpart below the diagonal.
 | Blond | Brown | White | Short | Tall | Female | Male | Old | Young |
---|---|---|---|---|---|---|---|---|---|
Blond | 3 | 0 | 0 | 2 | 1 | 1 | 2 | 1 | 2 |
Brown | 0 | 6 | 0 | 2 | 4 | 2 | 4 | 3 | 3 |
White | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 2 | 0 |
Short | 2 | 2 | 1 | 5 | 0 | 2 | 3 | 4 | 1 |
Tall | 1 | 4 | 1 | 0 | 6 | 2 | 4 | 2 | 4 |
Female | 1 | 2 | 1 | 2 | 2 | 4 | 0 | 2 | 2 |
Male | 2 | 4 | 1 | 3 | 4 | 0 | 7 | 4 | 3 |
Old | 1 | 3 | 2 | 4 | 2 | 2 | 4 | 6 | 0 |
Young | 2 | 3 | 0 | 1 | 4 | 2 | 3 | 0 | 5 |
This Burt table is composed of all pairs of crosstabulations among the variables Hair, Height, Sex, and Age. It is composed of sixteen individual subtables: the number of variables squared. Both the rows and the columns have the same nine categories (in this case Blond, Brown, White, Short, Tall, Female, Male, Old, and Young). The off-diagonal partitions are crosstabulations of each variable with every other variable. Below the diagonal are the following crosstabulations (from left to right, top to bottom): Height * Hair, Sex * Hair, Sex * Height, Age * Hair, Age * Height, and Age * Sex. Each crosstabulation below the diagonal has a transposed counterpart above the diagonal. Each diagonal partition contains a crosstabulation of a variable with itself (Hair * Hair, Height * Height, Sex * Sex, and Age * Age). The diagonal elements of the diagonal partitions contain the marginal frequencies of the off-diagonal partitions.
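As a concrete check of the Z′Z relationship, here is a small NumPy sketch (illustrative, outside SAS) that recomputes the Burt table from the binary matrix of the eleven complete observations:

```python
# Illustrative NumPy sketch: form the Burt table as Z'Z from the binary
# matrix Z of the Neighbor example (11 complete observations, 9 columns).
import numpy as np

# columns: Blond Brown White | Short Tall | Female Male | Old Young
Z = np.array([
    [0,0,1, 1,0, 0,1, 1,0],   # Jones
    [0,1,0, 0,1, 1,0, 0,1],   # Smith
    [0,1,0, 1,0, 0,1, 1,0],   # Kasavitz
    [0,0,1, 0,1, 1,0, 1,0],   # Ernst
    [0,1,0, 1,0, 1,0, 1,0],   # Zannoria
    [1,0,0, 0,1, 0,1, 0,1],   # Spangel
    [0,1,0, 0,1, 0,1, 0,1],   # Myers
    [1,0,0, 1,0, 0,1, 1,0],   # Kasinski
    [1,0,0, 1,0, 1,0, 0,1],   # Colman
    [0,1,0, 0,1, 0,1, 1,0],   # Delafave
    [0,1,0, 0,1, 0,1, 0,1],   # Singer
])
burt = Z.T @ Z
# the diagonal holds the marginal frequencies of each category
print(np.diag(burt)[:3])      # Hair marginals: [3 6 2]
```

The result is symmetric, and its diagonal partitions are diagonal matrices of marginal frequencies, as described above.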
For example, the table Hair * Height has three rows for Hair and two columns for Height . The values of the Hair * Height table, summed across rows, sum to the diagonal values of the Height * Height table, as displayed in the following table.
 | Short | Tall |
---|---|---|
Blond | 2 | 1 |
Brown | 2 | 4 |
White | 1 | 1 |
Short | 5 | 0 |
Tall | 0 | 6 |
A simple crosstabulation of Hair by Height is N = ZHair′ZHeight. Crosstabulations such as this, involving only two variables, are the input to simple correspondence analysis.
Short | Tall | |
---|---|---|
Blond | 2 | 1 |
Brown | 2 | 4 |
White | 1 | 1 |
Tables such as the following (N = ZHair′ZHeight,Sex), made up of several crosstabulations, can also be analyzed in simple correspondence analysis.
Short | Tall | Female | Male | |
---|---|---|---|---|
Blond | 2 | 1 | 1 | 2 |
Brown | 2 | 4 | 2 | 4 |
White | 1 | 1 | 1 | 1 |
You can use an indicator matrix as input to PROC CORRESP using the VAR statement. An indicator matrix is composed of several submatrices, each of which is a design matrix with one column for each category of a categorical variable. In order to create an indicator matrix, you must code an indicator variable for each level of each categorical variable. For example, the categorical variable Sex , with two levels (Female and Male), would be coded using two indicator variables.
A binary indicator variable is coded 1 to indicate the presence of an attribute and 0 to indicate its absence. For the variable Sex , a male would be coded Female =0 and Male =1, and a female would be coded Female =1 and Male =0. The indicator variables representing a categorical variable must sum to 1.0. You can specify the BINARY option to create a binary table.
Sometimes binary data such as Yes/No data are available. For example, 1 means "Yes, I have bought this brand in the last month," and 0 means "No, I have not bought this brand in the last month."
title 'Doubling Yes/No Data';

proc format;
   value yn 0 = 'No ' 1 = 'Yes';
run;

data BrandChoice;
   input a b c;
   label a = 'Brand A' b = 'Brand B' c = 'Brand C';
   format a b c yn.;
   datalines;
0 0 1
1 1 0
0 1 1
0 1 0
1 0 0
;
Data such as these cannot be analyzed directly because the raw data do not consist of partitions, each with one column per level and exactly one 1 in each row. The data must be doubled so that Yes and No are each represented by a column in the data matrix. The TRANSREG procedure provides one way of doubling. In the following statements, the DESIGN option specifies that PROC TRANSREG is being used only for coding, not analysis. The option SEPARATORS=': ' specifies that labels for the coded columns are constructed from input variable labels, followed by a colon and a space, followed by the formatted value. The variables are designated in the MODEL statement as CLASS variables, and the ZERO=NONE option creates binary variables for all levels. The OUTPUT statement specifies the output data set and drops the _NAME_, _TYPE_, and Intercept variables. PROC TRANSREG stores a list of coded variable names in the macro variable &_TRGIND, which in this case has the value aNo aYes bNo bYes cNo cYes. This macro variable can be used directly in the VAR statement in PROC CORRESP.
proc transreg data=BrandChoice design separators=': ';
   model class(a b c / zero=none);
   output out=Doubled(drop=_: Intercept);
run;

proc print label;
run;

proc corresp data=Doubled norow short;
   var &_trgind;
run;
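Outside SAS, the same doubling can be sketched in a few lines of NumPy (illustrative only; the column order matches the aNo aYes bNo bYes cNo cYes list above):

```python
# Illustrative sketch of "doubling" Yes/No data: each binary variable x
# becomes two columns, No = 1 - x and Yes = x, so every partition of the
# doubled matrix has exactly one 1 per row.
import numpy as np

raw = np.array([[0, 0, 1],
                [1, 1, 0],
                [0, 1, 1],
                [0, 1, 0],
                [1, 0, 0]])         # brands a, b, c
doubled = np.empty((raw.shape[0], 2 * raw.shape[1]), dtype=int)
doubled[:, 0::2] = 1 - raw          # aNo, bNo, cNo columns
doubled[:, 1::2] = raw              # aYes, bYes, cYes columns
print(doubled[0])                   # [1 0 1 0 0 1]
```

Every row of the doubled matrix sums to the number of original variables, which is what correspondence analysis of a binary table requires.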
A fuzzy-coded indicator also sums to 1.0 across levels of the categorical variable, but it is coded with fractions rather than with 1 and 0. The fractions represent the distribution of the attribute across several levels of the categorical variable.
Ordinal variables, such as survey responses of 1 to 3, can be represented as two design variables.
Ordinal Values | Coding | |
---|---|---|
1 | 0.25 | 0.75 |
2 | 0.50 | 0.50 |
3 | 0.75 | 0.25 |
Values of the coding sum to one across the two coded variables.
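For equally spaced categories like these, the coding can be computed directly. A small Python sketch (the function fuzzy_code is a hypothetical illustration, not a SAS feature):

```python
# Hypothetical illustration of the 1-to-3 fuzzy coding shown above:
# each ordinal value maps to two coded variables that sum to one.
def fuzzy_code(value):
    """Ordinal value in {1, 2, 3} -> (first, second) coded variables."""
    first = 0.25 * value          # 0.25, 0.50, 0.75
    return (first, 1.0 - first)   # the pair always sums to 1

for v in (1, 2, 3):
    print(v, fuzzy_code(v))
```

The output reproduces the table above, one coding pair per ordinal value.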
This next example illustrates the use of binary and fuzzy-coded indicator variables. Fuzzy-coded indicators are used to represent missing data. Note that the missing values in the observation Igor are coded with equal proportions.
proc transreg data=Neighbor design cprefix=0;
   model class(Age Sex Height Hair / zero=none);
   output out=Neighbor2(drop=_: Intercept);
   id Name;
run;

data Neighbor3;
   set Neighbor2;
   if Sex = ' ' then do;
      Female = 0.5;
      Male = 0.5;
   end;
   if Hair = ' ' then do;
      White = 1/3;
      Brown = 1/3;
      Blond = 1/3;
   end;
run;

proc print label;
run;
There is one set of coded variables for each input categorical variable. If observation 12 is excluded, each set is a binary design matrix. Each design matrix has one column for each category and exactly one 1 in each row.
Fuzzy-coding is shown in the final observation, Igor. The observation Igor has missing values for the variables Sex and Hair . The design matrix variables are coded with fractions that sum to one within each categorical variable.
An alternative way to represent missing data is to treat missing values as an additional level of the categorical variable. This alternative is available with the MISSING option in the PROC statement. This approach yields coordinates for missing responses, allowing the comparison of missing along with the other levels of the categorical variables.
Greenacre and Hastie (1987) discuss additional coding schemes, including one for continuous variables. Continuous variables can be coded with PROC TRANSREG by specifying BSPLINE( variables / degree=1) in the MODEL statement.
In the following TABLES statement, each variable list consists of a single variable:
proc corresp data=Neighbor dimens=1 observed short;
   ods select observed;
   tables Sex, Age;
run;
These statements create a contingency table with two rows (Female and Male) and two columns (Old and Young) and show the neighbors broken down by age and sex. The DIMENS=1 option overrides the default, which is DIMENS=2. The OBSERVED option displays the contingency table. The SHORT option limits the displayed output. Because it contains missing values, the observation where Name = Igor is omitted from the analysis. Figure 24.4 displays the contingency table.
The CORRESP Procedure

Contingency Table

         Old  Young  Sum
Female     2      2    4
Male       4      3    7
Sum        6      5   11
The following statements create a table with six rows ( Blond*Short , Blond*Tall , Brown*Short , Brown*Tall , White*Short , and White*Tall ), and four columns ( Female , Male , Old , and Young ). The levels of the row variables are crossed, forming mutually exclusive categories, whereas the categories of the column variables overlap.
proc corresp data=Neighbor cross=row observed short;
   ods select observed;
   tables Hair Height, Sex Age;
run;
The CORRESP Procedure

Contingency Table

               Female  Male  Old  Young  Sum
Blond * Short       1     1    1      1    4
Blond * Tall        0     1    0      1    2
Brown * Short       1     1    2      0    4
Brown * Tall        1     3    1      3    8
White * Short       0     1    1      0    2
White * Tall        1     0    1      0    2
Sum                 4     7    6      5   22
You can enter supplementary variables with TABLES input by including a SUPPLEMENTARY statement. Variables named in the SUPPLEMENTARY statement indicate TABLES variables with categories that are supplementary. In other words, the categories of the variable Age are represented in the row and column space, but they are not used in determining the scores of the categories of the variables Hair , Height , and Sex . The variable used in the SUPPLEMENTARY statement must be listed in the TABLES statement as well. For example, the following statements create a Burt table with seven active rows and columns ( Blond , Brown , White , Short , Tall , Female , Male ) and two supplementary rows and columns ( Old and Young ).
proc corresp data=Neighbor observed short mca;
   ods select burt supcols;
   tables Hair Height Sex Age;
   supplementary Age;
run;
The following statements create a binary table with 7 active columns ( Blond , Brown , White , Short , Tall , Female , Male ), 2 supplementary columns ( Old and Young ), and 11 rows for the 11 observations with nonmissing values.
proc corresp data=Neighbor observed short binary;
   ods select binary supcols;
   tables Hair Height Sex Age;
   supplementary Age;
run;
The CORRESP Procedure

Binary Table

      Blond  Brown  White  Short  Tall  Female  Male
 1        0      0      1      1     0       0     1
 2        0      1      0      0     1       1     0
 3        0      1      0      1     0       0     1
 4        0      0      1      0     1       1     0
 5        0      1      0      1     0       1     0
 6        1      0      0      0     1       0     1
 7        0      1      0      0     1       0     1
 8        1      0      0      1     0       0     1
 9        1      0      0      1     0       1     0
10        0      1      0      0     1       0     1
11        0      1      0      0     1       0     1

Supplementary Columns

      Old  Young
 1      1      0
 2      0      1
 3      1      0
 4      1      0
 5      1      0
 6      0      1
 7      0      1
 8      1      0
 9      0      1
10      1      0
11      0      1
With VAR statement input, the rows of the contingency table correspond to the observations of the input data set, and the columns correspond to the VAR statement variables. The values of the variables typically contain the table frequencies. The example displayed in Figure 24.4 could be run with VAR statement input using the following code:
data Ages;
   input Sex $ Old Young;
   datalines;
Female 2 2
Male   4 3
;

proc corresp data=Ages dimens=1 observed short;
   var Old Young;
   id Sex;
run;
Only nonnegative values are accepted. Negative values are treated as missing, causing the observation to be excluded from the analysis. The values are not required to be integers. Row labels for the table are specified with an ID variable. Column labels are constructed from the variable name or variable label if one is specified. When you specify multiple correspondence analysis (MCA), the row and column labels are the same and are constructed from the variable names or labels, so you cannot include an ID statement. With MCA, the VAR statement must list the variables in the order in which the rows occur. For example, the table displayed in Figure 24.6, which was created with the following TABLES statement,
The CORRESP Procedure

Burt Table

        Blond  Brown  White  Short  Tall  Female  Male
Blond       3      0      0      2     1       1     2
Brown       0      6      0      2     4       2     4
White       0      0      2      1     1       1     1
Short       2      2      1      5     0       2     3
Tall        1      4      1      0     6       2     4
Female      1      2      1      2     2       4     0
Male        2      4      1      3     4       0     7

Supplementary Columns

        Old  Young
Blond     1      2
Brown     3      3
White     2      0
Short     4      1
Tall      2      4
Female    2      2
Male      4      3
tables Hair Height Sex Age;
is input as follows with the VAR statement:
proc corresp data=table nvars=4 mca;
   var Blond Brown White Short Tall Female Male Old Young;
run;
You must specify the NVARS= option to specify the number of original categorical variables with the MCA option. The option NVARS=n is needed to find the boundaries between the subtables of the Burt table. If f is the sum of all elements in the Burt table Z′Z, then f/n² is the number of rows in the binary matrix Z. The sum of all elements in each diagonal subtable of the Burt table must also be f/n².
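The f/n² rule can be verified numerically. The following NumPy sketch (illustrative, with a randomly generated design) builds a binary Z for n = 4 categorical variables and checks both conditions:

```python
# Numeric check of the rule above: with n categorical variables, the sum f
# of all Burt-table elements satisfies f / n**2 = number of rows of Z, and
# each diagonal subtable also sums to f / n**2.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 11, 4
sizes = [3, 2, 2, 2]                 # partition sizes, as in the example
parts = []
for k in sizes:
    pick = rng.integers(0, k, n_obs)         # random level per observation
    parts.append(np.eye(k, dtype=int)[pick]) # one-hot partition
Z = np.hstack(parts)
burt = Z.T @ Z
f = burt.sum()
assert f / n_vars**2 == n_obs
# each diagonal subtable sums to f / n**2 as well
start = 0
for k in sizes:
    assert burt[start:start+k, start:start+k].sum() == n_obs
    start += k
print(f, f / n_vars**2)              # 176 11.0
```

Since every row of Z contains exactly n ones, f = n_obs · n², so f/n² recovers the number of rows regardless of which levels were drawn.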
To enter supplementary observations, include a WEIGHT statement with negative weights for those observations. Specify the SUPPLEMENTARY statement to include supplementary variables. You must list supplementary variables in both the VAR and SUPPLEMENTARY statements.
With VAR statement input, observations with missing or negative frequencies are excluded from the analysis. Supplementary variables and supplementary observations with missing or negative frequencies are also excluded. Negative weights are valid with VAR statement input.
With TABLES statement input, observations with negative weights are excluded from the analysis. With this form of input, missing cell frequencies cannot occur. Observations with missing values on the categorical variables are excluded unless you specify the MISSING option. If you specify the MISSING option, ordinary missing values and special missing values are treated as additional levels of a categorical variable. In all cases, if any row or column of the constructed table contains only zeros, that row or column is excluded from the analysis.
Observations with missing weights are excluded from the analysis.
The CORRESP procedure can read or create a contingency or Burt table. PROC CORRESP is generally more efficient with VAR statement input than with TABLES statement input. TABLES statement input requires that the table be created from raw categorical variables, whereas the VAR statement is used to read an existing table. If PROC CORRESP runs out of memory, it may be possible to use some other method to create the table and then use VAR statement input with PROC CORRESP.
The following example uses the CORRESP, FREQ, and TRANSPOSE procedures to create rectangular tables from a SAS data set WORK.A that contains the categorical variables V1-V5. The Burt table examples assume that no categorical variable has a value found in any of the other categorical variables (that is, that each row and column label is unique).
You can use PROC CORRESP and ODS to create a rectangular two-way contingency table from two categorical variables.
proc corresp data=a observed short;
   ods listing close;
   ods output Observed=Obs(drop=Sum where=(Label ne 'Sum'));
   tables v1, v2;
run;

ods listing;
You can use PROC FREQ and PROC TRANSPOSE to create a rectangular two-way contingency table from two categorical variables.
proc freq data=a;
   tables v1 * v2 / sparse noprint out=freqs;
run;

proc transpose data=freqs out=rfreqs;
   id v2;
   var count;
   by v1;
run;
You can use PROC CORRESP and ODS to create a Burt table from five categorical variables.
proc corresp data=a observed short mca;
   ods listing close;
   ods output Burt=Obs;
   tables v1-v5;
run;

ods listing;
You can use a DATA step, PROC FREQ, and PROC TRANSPOSE to create a Burt table from five categorical variables.
data b;
   set a;
   array v[5] $ v1-v5;
   do i = 1 to 5;
      row = v[i];
      do j = 1 to 5;
         column = v[j];
         output;
      end;
   end;
   keep row column;
run;

proc freq data=b;
   tables row * column / sparse noprint out=freqs;
run;

proc transpose data=freqs out=rfreqs;
   id column;
   var count;
   by row;
run;
The OUTC= data set contains two or three character variables and 4n + 4 numeric variables, where n is the number of axes from DIMENS=n (two by default). The OUTC= data set contains one observation for each row, column, supplementary row, and supplementary column point, and one observation for inertias.
The first variable is named _TYPE_ and identifies the type of observation. The values of _TYPE_ are as follows:
The 'INERTIA' observation contains the total inertia in the INERTIA variable, and each dimension's inertia in the Contr1-Contrn variables.
The 'OBS' observations contain the coordinates and statistics for the rows of the table.
The 'SUPOBS' observations contain the coordinates and statistics for the supplementary rows of the table.
The 'VAR' observations contain the coordinates and statistics for the columns of the table.
The 'SUPVAR' observations contain the coordinates and statistics for the supplementary columns of the table.
If you specify the SOURCE option, then the data set also contains a variable _VAR_ containing the name or label of the input variable from which that row originates. The name of the next variable is either _NAME_ or (if you specify an ID statement) the name of the ID variable.
For observations with a value of 'OBS' or 'SUPOBS' for the _TYPE_ variable, the values of the second variable are constructed as follows:
When you use a VAR statement without an ID statement, the values are 'Row1', 'Row2', and so on.
When you specify a VAR statement with an ID statement, the values are set equal to the values of the ID variable.
When you specify a TABLES statement, the _NAME_ variable has values formed from the appropriate row variable values.
For observations with a value of 'VAR' or 'SUPVAR' for the _TYPE_ variable, the values of the second variable are equal to the names or labels of the VAR (or SUPPLEMENTARY) variables. When you specify a TABLES statement, the values are formed from the appropriate column variable values.
The third and subsequent variables contain the numerical results of the correspondence analysis.
Quality contains the quality of each point's representation in the DIMENS=n dimensional display, which is the sum of squared cosines over the first n dimensions.
Mass contains the masses or marginal sums of the relative frequency matrix.
Inertia contains each point's relative contribution to the total inertia.
Dim1-Dimn contain the point coordinates.
Contr1-Contrn contain the partial contributions to inertia.
SqCos1-SqCosn contain the squared cosines.
Best1-Bestn and Best contain the summaries of the partial contributions to inertia.
The OUTF= data set contains frequencies and percentages. It is similar to a PROC FREQ output data set. The OUTF= data set begins with a variable called _TYPE_, which contains the observation type. If the SOURCE option is specified, the data set contains two variables _ROWVAR_ and _COLVAR_ that contain the names or labels of the row and column input variables from which each cell originates. The next two variables are classification variables that contain the row and column levels. If you use TABLES statement input and each variable list consists of a single variable, the names of the first two variables match the names of the input variables; otherwise, these variables are named Row and Column. The next two variables are Count and Percent, which contain frequencies and percentages.
The _TYPE_ variable can have the following values:
'OBSERVED' observations contain the contingency table.
'SUPOBS' observations contain the supplementary rows.
'SUPVAR' observations contain the supplementary columns.
'EXPECTED' observations contain the product of the row marginals and the column marginals divided by the grand frequency of the observed frequency table. For ordinary two-way contingency tables, these are the expected frequencies under the hypothesis of row and column independence.
'DEVIATION' observations contain the matrix of deviations between the observed frequency matrix and the product of its row marginals and column marginals divided by its grand frequency. For ordinary two-way contingency tables, these are the observed minus expected frequencies under the hypothesis of row and column independence.
'CELLCHI2' observations contain contributions to the total chi-square test statistic.
'RP' observations contain the row profiles.
'SUPRP' observations contain supplementary row profiles.
'CP' observations contain the column profiles.
'SUPCP' observations contain supplementary column profiles.
Let
nr = number of rows in the table
nc = number of columns in the table
n = number of observations
v = number of VAR statement variables
t = number of TABLES statement variables
c = max(nr, nc)
d = min(nr, nc)
For TABLES statement input, more than
bytes of array space are required.
For VAR statement input, more than
bytes of array space are required.
The computational resources formulas are underestimates of the amounts of memory needed to handle most problems. If you use a utility data set, and if memory could be used with perfect efficiency, then roughly the stated amount of memory would be needed. In reality, most problems require at least two or three times the minimum.
PROC CORRESP tries to store the raw data (TABLES input) and the contingency table in memory. If there is not enough memory, a utility data set is used, potentially resulting in a large increase in execution time.
The time required to perform the generalized singular value decomposition is roughly proportional to 2cd² + 5d³. Overall computation time increases with table size at a rate roughly proportional to (nr·nc)^{3/2}.
This section is primarily based on the theory of correspondence analysis found in Greenacre (1984). If you are interested in other references, see the Background section on page 1069.
Let N be the contingency table formed from those observations and variables that are not supplementary and from those observations that have no missing values and have a positive weight. This table is an (nr × nc) rank-q matrix of nonnegative numbers with nonzero row and column sums. If Za is the binary coding for variable A, and Zb is the binary coding for variable B, then N = Za′Zb is a contingency table. Similarly, if Zb,c contains the binary coding for both variables B and C, then N = Za′Zb,c can also be input to a correspondence analysis. With the BINARY option, N = Z, and the analysis is based on a binary table. In multiple correspondence analysis, the analysis is based on a Burt table, Z′Z.
Let 1 be a vector of 1s of the appropriate order, let I be an identity matrix, and let diag(·) be a matrix-valued function that creates a diagonal matrix from a vector. Let

   f = 1′N1
   P = (1/f)N
   r = P1          c = P′1
   Dr = diag(r)    Dc = diag(c)
   R = Dr^{-1}P    C = P Dc^{-1}
The scalar f is the sum of all elements in N. The matrix P is a matrix of relative frequencies. The vector r contains row marginal proportions, or row masses. The vector c contains column marginal proportions, or column masses. The matrices Dr and Dc are diagonal matrices of these marginals.
The rows of R contain the row profiles. The elements of each row of R sum to one. Each (i, j) element of R contains the observed probability of being in column j given membership in row i. Similarly, the columns of C contain the column profiles. The coordinates in correspondence analysis are based on the generalized singular value decomposition of P,

   P = A Du B′

where

   A′ Dr^{-1} A = B′ Dc^{-1} B = I
In multiple correspondence analysis, the table is symmetric and r = c, so B = A and

   P = A Du A′
The matrix A, which is the rectangular matrix of left generalized singular vectors, has nr rows and q columns; the matrix Du, which is a diagonal matrix of singular values, has q rows and columns; and the matrix B, which is the rectangular matrix of right generalized singular vectors, has nc rows and q columns. The columns of A and B define the principal axes of the column and row point clouds, respectively.
The generalized singular value decomposition of P - rc′, discarding the last singular value (which is zero) and the last left and right singular vectors, is exactly the same as a generalized singular value decomposition of P, discarding the first singular value (which is one), the first left singular vector, r, and the first right singular vector, c. The first (trivial) column of A and B and the first singular value in Du are discarded before any results are displayed. You can obtain the generalized singular value decomposition of P - rc′ from the ordinary singular value decomposition of Dr^{-1/2}(P - rc′)Dc^{-1/2}:

   Dr^{-1/2}(P - rc′)Dc^{-1/2} = U Du V′

Hence, A = Dr^{1/2}U and B = Dc^{1/2}V.
The default row coordinates are Dr^{-1}A Du, and the default column coordinates are Dc^{-1}B Du. Typically the first two columns of Dr^{-1}A Du and Dc^{-1}B Du are plotted to display graphically associations between the row and column categories. The plot consists of two overlaid plots, one for rows and one for columns. The row points are row profiles, rescaled so that distances between profiles can be displayed as ordinary Euclidean distances, then orthogonally rotated to a principal axes orientation. The column points are column profiles, rescaled and rotated in the same way. Distances between row points and other row points have meaning. Distances between column points and other column points have meaning. However, distances between column points and row points are not interpretable.
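The computation just described can be sketched in NumPy (illustrative only, not the PROC CORRESP implementation) for the Sex by Age table shown earlier: an ordinary SVD of Dr^{-1/2}(P - rc′)Dc^{-1/2} yields the generalized singular vectors and the default coordinates.

```python
# Hedged NumPy sketch: generalized SVD of P - rc' via an ordinary SVD,
# then the default (PROFILE=BOTH) coordinates.
import numpy as np

N = np.array([[2., 2.],
              [4., 3.]])            # Sex x Age table from Figure 24.4
f = N.sum()
P = N / f
r = P.sum(axis=1)                   # row masses
c = P.sum(axis=0)                   # column masses
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, sv, Vt = np.linalg.svd(S)        # ordinary SVD of the rescaled matrix
A = np.diag(np.sqrt(r)) @ U         # left generalized singular vectors
B = np.diag(np.sqrt(c)) @ Vt.T      # right generalized singular vectors
row_coords = np.diag(1 / r) @ A @ np.diag(sv)   # Dr^{-1} A Du
col_coords = np.diag(1 / c) @ B @ np.diag(sv)   # Dc^{-1} B Du
print(row_coords[:, 0], col_coords[:, 0])       # first-dimension coordinates
```

For this 2 x 2 table there is one nontrivial dimension; the second singular value is zero, and the signs of the coordinates are arbitrary up to a joint reflection.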
The PROFILE=, ROW=, and COLUMN= options standardize the coordinates before they are displayed and placed in the output data set. The options PROFILE=BOTH, PROFILE=ROW, and PROFILE=COLUMN provide the standardizations that are typically used in correspondence analysis. There are six choices each for row and column coordinates. However, most of the combinations of the ROW= and COLUMN= options are not useful. The ROW= and COLUMN= options are provided for completeness, but they are not intended for general use.
ROW= | Matrix Formula |
---|---|
A | A |
AD | A Du |
DA | Dr^{-1}A |
DAD | Dr^{-1}A Du |
DAD1/2 | Dr^{-1}A Du^{1/2} |
DAID1/2 | Dr^{-1}A ((I + Du)/2)^{1/2} |
COLUMN= | Matrix Formula |
---|---|
B | B |
BD | B Du |
DB | Dc^{-1}B |
DBD | Dc^{-1}B Du |
DBD1/2 | Dc^{-1}B Du^{1/2} |
DBID1/2 | Dc^{-1}B ((I + Du)/2)^{1/2} |
When PROFILE=ROW (ROW=DAD and COLUMN=DB), the row coordinates Dr^{-1}A Du and column coordinates Dc^{-1}B provide a correspondence analysis based on the row profile matrix. The row profile (conditional probability) matrix is defined as R = Dr^{-1}P. The elements of each row of R sum to one. Each (i, j) element of R contains the observed probability of being in column j given membership in row i. The principal row coordinates Dr^{-1}A Du and standard column coordinates Dc^{-1}B provide a decomposition of Dr^{-1}(P - rc′)Dc^{-1}. Since Dr^{-1}A Du = Dr^{-1}P(Dc^{-1}B) = R(Dc^{-1}B), the row coordinates are weighted centroids of the column coordinates. Each column point, with coordinates scaled to standard coordinates, defines a vertex in (nc - 1)-dimensional space. All of the principal row coordinates are located in the space defined by the standard column coordinates. Distances among row points have meaning, but distances among column points and distances between row and column points are not interpretable.
The option PROFILE=COLUMN can be described as applying the PROFILE=ROW formulas to the transpose of the contingency table. When PROFILE=COLUMN (ROW=DA and COLUMN=DBD), the principal column coordinates Dc^{-1}B Du are weighted centroids of the standard row coordinates Dr^{-1}A. Each row point, with coordinates scaled to standard coordinates, defines a vertex in (nr - 1)-dimensional space. All of the principal column coordinates are located in the space defined by the standard row coordinates. Distances among column points have meaning, but distances among row points and distances between row and column points are not interpretable.
The usual sets of coordinates are given by the default PROFILE=BOTH (ROW=DAD and COLUMN=DBD). All of the summary statistics, such as the squared cosines and contributions to inertia, apply to these two sets of points. One advantage to using these coordinates is that both sets (Dr^{-1}A Du and Dc^{-1}B Du) are postmultiplied by the diagonal matrix Du, which has diagonal values that are all less than or equal to one. When Du is a part of the definition of only one set of coordinates, that set forms a tight cluster near the centroid, whereas the other set of points is more widely dispersed. Including Du in both sets makes a better graphical display. However, care must be taken in interpreting such a plot. No correct interpretation of distances between row points and column points can be made.
Another property of this choice of coordinates concerns the geometry of distances between points within each set. The default row coordinates can be decomposed into Dr^{-1}A Du = (Dr^{-1}P)(Dc^{-1/2})(Dc^{-1/2}B). The row coordinates are row profiles (Dr^{-1}P), rescaled by Dc^{-1/2} (so that distances between profiles are transformed from a chi-square metric to a Euclidean metric), then orthogonally rotated (with Dc^{-1/2}B) to a principal axes orientation. Similarly, the column coordinates are column profiles rescaled to a Euclidean metric and orthogonally rotated to a principal axes orientation.
The rationale for computing distances between row profiles using the non-Euclidean chi-square metric is as follows. Each row of the contingency table can be viewed as a realization of a multinomial distribution conditional on its row marginal frequency. The null hypothesis of row and column independence is equivalent to the hypothesis of homogeneity of the row profiles. A significant chi-square statistic is geometrically interpreted as a significant deviation of the row profiles from their centroid, c′. The chi-square metric is the Mahalanobis metric between row profiles based on their estimated covariance matrix under the homogeneity assumption (Greenacre and Hastie 1987). A parallel argument can be made for the column profiles.
When ROW=DAD1/2 and COLUMN=DBD1/2 (Gifi 1990; van der Heijden and de Leeuw 1985), the row coordinates Dr^{-1}A Du^{1/2} and column coordinates Dc^{-1}B Du^{1/2} provide a decomposition of Dr^{-1}(P - rc′)Dc^{-1}.
In all of the preceding pairs, distances between row and column points are not meaningful. This prompted Carroll, Green, and Schaffer (1986) to propose an alternative standardization of the row and column coordinates.
These coordinates are (except for a constant scaling) the coordinates from a multiple correspondence analysis of a Burt table created from two categorical variables. This standardization is available with ROW=DAID1/2 and COLUMN=DBID1/2. However, this approach has been criticized on both theoretical and empirical grounds by Greenacre (1989). The Carroll, Green, and Schaffer standardization relies on the assumption that the chi-square metric is an appropriate metric for measuring the distance between the columns of a bivariate indicator matrix. See the section Types of Tables Used as Input on page 1083 for a description of indicator matrices. Greenacre (1989) showed that this assumption cannot be justified.
The MCA option performs a multiple correspondence analysis (MCA). This option requires a Burt table. You can specify the MCA option with a table created from a design matrix with fuzzy coding schemes as long as every row of every partition of the design matrix has the same marginal sum. For example, each row of each partition could contain the probabilities that the observation is a member of each level. Then the Burt table constructed from this matrix no longer contains all integers, and the diagonal partitions are no longer diagonal matrices, but MCA is still valid.
A TABLES statement with a single variable list creates a Burt table. Thus, you can always specify the MCA option with this type of input. If you use the MCA option when reading an existing table with a VAR statement, you must ensure that the table is a Burt table.
If you perform MCA on a table that is not a Burt table, the results of the analysis are invalid. If the table is not symmetric, or if the sums of all elements in each diagonal partition are not equal, PROC CORRESP displays an error message and quits.
A subset of the columns of a Burt table is not necessarily a Burt table, so in MCA it is not appropriate to designate arbitrary columns as supplementary. You can, however, designate all columns from one or more categorical variables as supplementary.
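For example, the following statements (a sketch using the Neighbor data set from this chapter) make all of the categories of the variable Age supplementary in an MCA. A variable named in the SUPPLEMENTARY statement must also appear in the TABLES statement.

```
proc corresp data=Neighbor mca observed;
   tables Hair Height Sex Age;
   supplementary Age;
run;
```

Because whole variables are designated rather than individual columns, the supplementary partition of the Burt table remains structurally valid for MCA.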
The results of a multiple correspondence analysis of a Burt table Z′Z are the same as the column results from a simple correspondence analysis of the binary (or fuzzy) matrix Z. Multiple correspondence analysis is not a simple correspondence analysis of the Burt table, and it is not appropriate to perform a simple correspondence analysis of a Burt table. The MCA option is based on P = BD_u^{1/2}B′, whereas a simple correspondence analysis of the Burt table would be based on P = BD_uB′.
Since the rows and columns of the Burt table are the same, no row information is displayed or written to the output data sets. The resulting inertias and the default (COLUMN=DBD) column coordinates are the appropriate inertias and coordinates for an MCA. The supplementary column coordinates, cosines, and quality of representation formulas for MCA differ from the simple correspondence analysis formulas because the design matrix column profiles and left singular vectors are not available.
The following statements create a Burt table and perform a multiple correspondence analysis:
proc corresp data=Neighbor observed short mca;
   tables Hair Height Sex Age;
run;
Both the rows and the columns have the same nine categories (Blond, Brown, White, Short, Tall, Female, Male, Old, and Young).
The usual principal inertias of a Burt table constructed from m categorical variables in MCA are the eigenvalues u_k. The problem with these inertias is that they provide a pessimistic indication of fit. Benzécri (1979) proposed the following inertia adjustment, which is also described by Greenacre (1984, p. 145):
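A standard statement of this adjustment (the formula is given here in the form published by Greenacre 1984) replaces each principal inertia u_k that exceeds 1/m by the adjusted value below, and discards the rest:

```latex
\left(\frac{m}{m-1}\right)^{2}\left(u_k - \frac{1}{m}\right)^{2}
\qquad \text{for } u_k > \frac{1}{m}
```

The adjustment discounts the portion of each inertia that is attributable purely to the diagonal partitions of the Burt table.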
The Benzécri adjustment is available with the BENZECRI option.
Greenacre (1994, p. 156) argues that the Benzécri adjustment overestimates the quality of fit. Greenacre proposes instead the following inertia adjustment:
The Greenacre adjustment is available with the GREENACRE option.
Ordinary unadjusted inertias are printed by default with MCA when neither the BENZECRI nor the GREENACRE option is specified. However, the unadjusted inertias are not printed by default when either the BENZECRI or the GREENACRE option is specified. To display both adjusted and unadjusted inertias, specify the UNADJUSTED option in addition to the relevant adjusted inertia option (BENZECRI, GREENACRE, or both).
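For example, the following statements (an illustrative sketch) request both adjusted inertia tables along with the ordinary unadjusted inertias:

```
proc corresp data=Neighbor mca benzecri greenacre unadjusted;
   tables Hair Height Sex Age;
run;
```

Omitting the UNADJUSTED option from these statements would suppress the unadjusted inertia table.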
Supplementary rows and columns are represented as points in the joint row and column space, but they are not used when determining the locations of the other, active rows and columns of the table. The formulas that are used to compute coordinates for the supplementary rows and columns depend on the PROFILE= option or on the ROW= and COLUMN= options. Let S_o be the matrix with rows that contain the supplementary observations and S_v be a matrix with rows that contain the supplementary variables. Note that S_v is defined to be the transpose of the supplementary variable partition of the table. Let R_s = diag(S_o 1)^{-1} S_o be the supplementary observation profile matrix and C_s = diag(S_v 1)^{-1} S_v be the supplementary variable profile matrix. Note that the notation diag(·)^{-1} means to convert the vector to a diagonal matrix and then invert the diagonal matrix. The coordinates for the supplementary observations and variables are as follows.
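In a simple (non-MCA) analysis, supplementary points are requested with the SUPPLEMENTARY statement in the same way. The following sketch treats the Age categories as supplementary rows of a Sex-by-Hair cross-tabulation; the particular variable lists are illustrative.

```
proc corresp data=Neighbor observed short;
   tables Sex Age, Hair;
   supplementary Age;
run;
```

The Age rows receive coordinates from the formulas below but contribute nothing to the locations of the active Sex rows or Hair columns.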
ROW= | Matrix Formula
---|---
A |
AD |
DA |
DAD |
DAD1/2 |
DAID1/2 |
COLUMN= | Matrix Formula
---|---
B |
BD |
DB |
DBD |
DBD1/2 |
DBID1/2 |
MCA COLUMN= | Matrix Formula
---|---
B | not allowed
BD | not allowed
DB |
DBD |
DBD1/2 |
DBID1/2 |
The partial contributions to inertia, squared cosines, quality of representation, inertia, and mass provide additional information about the coordinates. These statistics are displayed by default. Include the SHORT or NOPRINT option in the PROC CORRESP statement to avoid having these statistics displayed.
These statistics pertain to the default PROFILE=BOTH coordinates, no matter what values you specify for the ROW=, COLUMN=, or PROFILE= option. Let sq(·) be a matrix-valued function denoting element-wise squaring of the argument matrix. Let t be the total inertia (the sum of the principal inertias).
In MCA, let D_s be the Burt table partition containing the intersection of the supplementary columns and the supplementary rows. The matrix D_s is a diagonal matrix of marginal frequencies of the supplementary columns of the binary matrix Z. Let p be the number of rows in this design matrix.
Statistic | Matrix Formula
---|---
Row partial contributions to inertia |
Column partial contributions to inertia |
Row squared cosines |
Column squared cosines |
Row mass | r
Column mass | c
Row inertia |
Column inertia |
Supplementary row squared cosines |
Supplementary column squared cosines |
MCA supplementary column squared cosines |
The quality of representation in the DIMENS= n dimensional display of any point is the sum of its squared cosines over only the n dimensions. Inertia and mass are not defined for supplementary points.
A table that summarizes the partial contributions to inertia is also computed. The points that best explain the inertia of each dimension and the dimension to which each point contributes the most inertia are indicated. The output data set variable names for this table are Best1–Bestn (where DIMENS=n) and Best. The Best column contains the dimension number of the largest partial contribution to inertia for each point (the index of the maximum value in each row of the partial contributions to inertia matrix).
For each row, the Best1–Bestn columns contain either the corresponding value of Best if the point is one of the biggest contributors to the dimension's inertia or 0 if it is not. Specifically, Best1 contains the value of Best for the point with the largest contribution to dimension one's inertia. A cumulative proportion sum is initialized to this point's partial contribution to the inertia of dimension one. If this sum is less than the value of the MININERTIA= option, then Best1 contains the value of Best for the point with the second largest contribution to dimension one's inertia. Otherwise, this point's Best1 is 0. This point's partial contribution to inertia is added to the sum. This process continues for the point with the third largest partial contribution, and so on, until adding a point's contribution to the sum increases the sum beyond the value of the MININERTIA= option. This same algorithm is then used for Best2, and so on.
For example, the following table contains contributions to inertia and the corresponding Best variables. The contribution to inertia variables are proportions that sum to 1 within each column. The first point makes its greatest contribution to the inertia of dimension two, so Best for point one is set to 2, and Best1–Best3 for point one must all be 0 or 2. The second point also makes its greatest contribution to the inertia of dimension two, so Best for point two is set to 2 and Best1–Best3 for point two must all be 0 or 2, and so on.
Assume MININERTIA=0.8, the default. In dimension one, the largest contribution is 0.41302 for the fourth point, so Best1 for that point is set to 1, its value of Best. Because this value is less than 0.8, the second largest value (0.36456 for point five) is found, and that point's Best1 is set to its Best value of 1. Because 0.41302 + 0.36456 = 0.77758 is still less than 0.8, the third largest value (0.08820 for point eight) is found, and that point's Best1 is set to its Best value of 3, since point eight contributes more to dimension three than to dimension one. Adding this contribution increases the sum beyond 0.8, so the remaining Best1 values are all 0.
Contr1 | Contr2 | Contr3 | Best1 | Best2 | Best3 | Best
---|---|---|---|---|---|---
0.01593 | 0.32178 | 0.07565 | 0 | 2 | 2 | 2
0.03014 | 0.24826 | 0.07715 | 0 | 2 | 2 | 2
0.00592 | 0.02892 | 0.02698 | 0 | 0 | 0 | 2
0.41302 | 0.05191 | 0.05773 | 1 | 0 | 0 | 1
0.36456 | 0.00344 | 0.15565 | 1 | 0 | 1 | 1
0.03902 | 0.30966 | 0.11717 | 0 | 2 | 2 | 2
0.00019 | 0.01840 | 0.00734 | 0 | 0 | 0 | 2
0.08820 | 0.00527 | 0.16555 | 3 | 0 | 3 | 3
0.01447 | 0.00024 | 0.03851 | 0 | 0 | 0 | 3
0.02855 | 0.01213 | 0.27827 | 0 | 0 | 3 | 3
The display options control the amount of displayed output. By default, the following information is displayed:
an inertia and chi-square decomposition table including the total inertia, the principal inertias of each dimension (eigenvalues), the singular values (square roots of the eigenvalues), each dimension's percentage of inertia, a horizontal bar chart of the percentages, and the total chi-square with its degrees of freedom and decomposition. The chi-square statistics and degrees of freedom are valid only when the constructed table is an ordinary two-way contingency table.
the coordinates of the rows and columns on the dimensions
the mass, relative contribution to the total inertia, and quality of representation in the DIMENS= n dimensional display of each row and column
the squared cosines of the angles between each axis and a vector from the origin to the point
the partial contributions of each point to each dimension s inertia
the Best table, indicators of which points best explain the inertia of each dimension
Specific display options and combinations of options display output as follows.
If you specify the OBSERVED or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays
the contingency table including the row and column marginal frequencies; or with BINARY, the binary table; or the Burt table in MCA
the supplementary rows
the supplementary columns
If you specify the OBSERVED or ALL option, with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays
the contingency table or Burt table in MCA, scaled to percentages, including the row and column marginal percentages
the supplementary rows, scaled to percentages
the supplementary columns, scaled to percentages
If you specify the EXPECTED or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the product of the row marginals and the column marginals divided by the grand frequency of the observed frequency table. For ordinary two-way contingency tables, these are the expected frequencies under the hypothesis of row and column independence.
If you specify the EXPECTED or ALL option with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays the product of the row marginals and the column marginals divided by the grand frequency of the observed percentages table. For ordinary two-way contingency tables, these are the expected percentages under the hypothesis of row and column independence.
If you specify the DEVIATION or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the observed minus expected frequencies. For ordinary two-way contingency tables, the expected frequencies are computed under the hypothesis of row and column independence.
If you specify the DEVIATION or ALL option with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays the observed minus expected percentages. For ordinary two-way contingency tables, the expected percentages are computed under the hypothesis of row and column independence.
If you specify the CELLCHI2 or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays contributions to the total chi-square test statistic, including the row and column marginals. The intersection of the marginals contains the total chi-square statistic.
If you specify the CELLCHI2 or ALL option with the PRINT=PERCENT or the PRINT=BOTH option, PROC CORRESP displays contributions to the total chi-square, scaled to percentages, including the row and column marginals.
If you specify the RP or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the row profiles and the supplementary row profiles.
If you specify the RP or ALL option with the PRINT=PERCENT or the PRINT=BOTH option, PROC CORRESP displays the row profiles (scaled to percentages) and the supplementary row profiles (scaled to percentages).
If you specify the CP or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the column profiles and the supplementary column profiles.
If you specify the CP or ALL option with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays the column profiles (scaled to percentages) and the supplementary column profiles (scaled to percentages).
If you do not specify the NOPRINT option, PROC CORRESP displays the inertia and chi-square decomposition table. This includes the nonzero singular values of the contingency table (or, in MCA, the binary matrix Z used to create the Burt table), the nonzero principal inertias (or eigenvalues) for each dimension, the total inertia, the total chi-square, the decomposition of chi-square, the chi-square degrees of freedom (appropriate only when the table is an ordinary two-way contingency table), the percent of the total chi-square and inertia for each dimension, and a bar chart of the percents.
If you specify the MCA option and you do not specify the NOPRINT option, PROC CORRESP displays the adjusted inertias. This includes the nonzero adjusted inertias, percents, cumulative percents, and a bar chart of the percents.
If you do not specify the NOROW, NOPRINT, or MCA option, PROC CORRESP displays the row coordinates and the supplementary row coordinates (displayed when there are supplementary row points).
If you do not specify the NOROW, NOPRINT, MCA, or SHORT option, PROC CORRESP displays
the summary statistics for the row points including the quality of representation of the row points in the n -dimensional display, the mass, and the relative contributions to inertia
the quality of representation of the supplementary row points in the n -dimensional display (displayed when there are supplementary row points)
the partial contributions to inertia for the row points
the row Best table, indicators of which row points best explain the inertia of each dimension
the squared cosines for the row points
the squared cosines for the supplementary row points (displayed when there are supplementary row points)
If you do not specify the NOCOLUMN or NOPRINT option, PROC CORRESP displays the column coordinates and the supplementary column coordinates (displayed when there are supplementary column points).
If you do not specify the NOCOLUMN, NOPRINT, or SHORT option, PROC CORRESP displays
the summary statistics for the column points including the quality of representation of the column points in the n -dimensional display, the mass, and the relative contributions to inertia
the quality of representation of the supplementary column points in the n -dimensional display (displayed when there are supplementary column points)
the partial contributions to inertia for the column points
the column Best table, indicators of which column points best explain the inertia of each dimension
the squared cosines for the column points
the squared cosines for the supplementary column points
PROC CORRESP assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information about ODS, see Chapter 14, "Using the Output Delivery System."
ODS Table Name | Description | Option |
---|---|---|
AdjInGreenacre | Greenacre Inertia Adjustment | GREENACRE |
AdjInBenzecri | Benzécri Inertia Adjustment | BENZECRI |
Binary | Binary table | OBSERVED, BINARY |
BinaryPct | Binary table percents | OBSERVED, BINARY [*] |
Burt | Burt table | OBSERVED, MCA |
BurtPct | Burt table percents | OBSERVED, MCA [*] |
CellChiSq | Contributions to Chi Square | CELLCHI2 |
CellChiSqPct | Contributions, pcts | CELLCHI2 [*] |
ColBest | Col best indicators | default |
ColContr | Col contributions to inertia | default |
ColCoors | Col coordinates | default |
ColProfiles | Col profiles | CP |
ColProfilesPct | Col profiles, pcts | CP [*] |
ColQualMassIn | Col quality, mass, inertia | default |
ColSqCos | Col squared cosines | default |
DF | DF, Chi Square (not displayed) | default |
Deviations | Observed - expected freqs | DEVIATIONS |
DeviationsPct | Observed - expected pcts | DEVIATIONS [*] |
Expected | Expected frequencies | EXPECTED |
ExpectedPct | Expected percents | EXPECTED [*] |
Inertias | Inertia decomposition table | default |
Observed | Observed frequencies | OBSERVED |
ObservedPct | Observed percents | OBSERVED [*] |
RowBest | Row best indicators | default |
RowContr | Row contributions to inertia | default |
RowCoors | Row coordinates | default |
RowProfiles | Row profiles | RP |
RowProfilesPct | Row profiles, pcts | RP [*] |
RowQualMassIn | Row quality, mass, inertia | default |
RowSqCos | Row squared cosines | default |
SupColCoors | Supp col coordinates | default |
SupColProfiles | Supp col profiles | CP |
SupColProfilesPct | Supp col profiles, pcts | CP [*] |
SupColQuality | Supp col quality | default |
SupCols | Supplementary col freq | OBSERVED |
SupColsPct | Supplementary col pcts | OBSERVED [*] |
SupColSqCos | Supp col squared cosines | default |
SupRows | Supplementary row freqs | OBSERVED |
SupRowCoors | Supp row coordinates | default |
SupRowProfiles | Supp row profiles | RP |
SupRowProfilesPct | Supp row profiles, pcts | RP [*] |
SupRowQuality | Supp row quality | default |
SupRowsPct | Supplementary row pcts | OBSERVED [*] |
SupRowSqCos | Supp row squared cosines | default |
[*] Percents are displayed when you specify the PRINT=PERCENT or PRINT=BOTH option. |
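For example, the following statements (a sketch; the output data set names ObsFreqs and RowCoords are arbitrary) use two of the table names above with the ODS OUTPUT statement to capture the observed frequencies and the row coordinates in SAS data sets:

```
proc corresp data=Neighbor observed;
   tables Sex, Hair;
   ods output Observed=ObsFreqs RowCoors=RowCoords;
run;
```

Any of the ODS table names in the preceding table can be used the same way, provided the option that produces the table is in effect.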
This section describes the use of ODS for creating graphics with the CORRESP procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request a graph, you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, "Statistical Graphics Using ODS."
PROC CORRESP assigns a name to the graph it creates using ODS. You can use this name to reference the graph when using ODS. The name is listed in Table 24.9.
ODS Graph Name | Plot Description |
---|---|
CorrespPlot | Correspondence analysis plot |
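For example, the following statements (an illustrative sketch) enable ODS Graphics and produce the correspondence analysis plot, which can then be referenced by the name CorrespPlot:

```
ods graphics on;
proc corresp data=Neighbor;
   tables Sex, Hair;
run;
ods graphics off;
```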