Details


Input Data Set

PROC CORRESP can read two kinds of input:

  • raw category responses on two or more classification variables with the TABLES statement

  • a two-way contingency table with the VAR statement

You can use output from PROC FREQ as input for PROC CORRESP.

The classification variables referred to by the TABLES statement can be either numeric or character variables. Normally, all observations for a given variable that have the same formatted value are placed in the same level, and observations with different values are placed in different levels.

The variables in the VAR statement must be numeric. The values of the observations specify the cell frequencies. These values are not required to be integers, but only those observations with all nonnegative, nonmissing values are used in the correspondence analysis. Observations with one or more negative values are removed from the analysis.

The WEIGHT variable must be numeric. Observations with negative weights are treated as supplementary observations. The absolute values of the weights are used to weight the observations.

Types of Tables Used as Input

The following example explains correspondence analysis and illustrates some capabilities of PROC CORRESP.

  data Neighbor;   input Name $ 1   10 Age $ 12   18 Sex $ 19   25   Height $ 26   30 Hair $ 32   37;   datalines;   Jones      Old    Male   Short White   Smith      Young  Female Tall  Brown   Kasavitz   Old    Male   Short Brown   Ernst      Old    Female Tall  White   Zannoria   Old    Female Short Brown   Spangel    Young  Male   Tall  Blond   Myers      Young  Male   Tall  Brown   Kasinski   Old    Male   Short Blond   Colman     Young  Female Short Blond   Delafave   Old    Male   Tall  Brown   Singer     Young  Male   Tall  Brown   Igor       Old           Short   ;  

There are several types of tables, N , that can be used as input to correspondence analysis ”all tables can be defined using a binary matrix, Z .

With the BINARY option, N = Z is directly analyzed . The binary matrix has one column for each category and one row for each individual or case. A binary table constructed from m categorical variables has m partitions. The following table has four partitions, one for each of the four categorical variables. Each partition has a 1 in each row, and each row contains exactly four 1s since there are four categorical variables. More generally , the binary design matrix has exactly m 1s in each row. The 1s indicate the categories to which the observation applies.

Table 24.2: Z, The Binary Coding of Neighbor Data Set

Z Hair

Z Height

Z Sex

Z Age

Blond

Brown

White

Short

Tall

Female

Male

Old

Young

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

With the MCA option, the Burt table ( Z ² Z ) is analyzed. A Burt table is a partitioned symmetric matrix containing all pairs of crosstabulations among a set of categorical variables. Each diagonal partition is a diagonal matrix containing marginal frequencies (a crosstabulation of a variable with itself). Each off-diagonal partition is an ordinary contingency table. Each contingency table above the diagonal has a transposed counterpart below the diagonal.

Table 24.3: Z ² Z, The Burt Table
 

Blond

Brown

White

Short

Tall

Female

Male

Old

Young

Blond

3

2

1

1

2

1

2

Brown

6

2

4

2

4

3

3

White

2

1

1

1

1

2

Short

2

2

1

5

2

3

4

1

Tall

1

4

1

6

2

4

2

4

Female

1

2

1

2

2

4

2

2

Male

2

4

1

3

4

7

4

3

Old

1

3

2

4

2

2

4

6

Young

2

3

1

4

2

3

5

This Burt table is composed of all pairs of crosstabulations among the variables Hair , Height , Sex , and Age . It is composed of sixteen individual subtables ”the number of variables squared. Both the rows and the columns have the same nine categories (in this case Blond, Brown, White, Short, Tall, Female, Male, Old, and Young). The off-diagonal partitions are crosstabulations of each variable with every other variable. Below the diagonal are the following crosstabulations (from left to right, top to bottom): Height * Hair , Sex * Hair , Sex * Height , Age * Hair , Age * Height , and Age * Sex . Each crosstabulation below the diagonal has a transposed counterpart above the diagonal. Each diagonal partition contains a crosstabulation of a variable with itself ( Hair * Hair , Height * Height , Sex * Sex ,and Age * Age ). The diagonal elements of the diagonal partitions contain marginal frequencies of the off-diagonal partitions.

For example, the table Hair * Height has three rows for Hair and two columns for Height . The values of the Hair * Height table, summed across rows, sum to the diagonal values of the Height * Height table, as displayed in the following table.

Table 24.4: Z Hair , Height ² Z Height The (Hair Height) — Height Crosstabulation
 

Short

Tall

Blond

2

1

Brown

2

4

White

1

1

Short

5

Tall

6

A simple crosstabulation of Hair Height is N = Z Hair ² Z Height . Crosstabulations such as this, involving only two variables, are the input to simple correspondence analysis.

Table 24.5: Z Hair ² Z Height The Hair — Height Crosstabulation
 

Short

Tall

Blond

2

1

Brown

2

4

White

1

1

Tables such as the following ( N = Z Hair ² Z Height , Sex ), made up of several crosstabulations, can also be analyzed in simple correspondence analysis.

Table 24.6: Z Hair ² Z Height , Sex The Hair — (Height Sex) Crosstabulation
 

Short

Tall

Female

Male

Blond

2

1

1

2

Brown

2

4

2

4

White

1

1

1

1

Coding, Fuzzy Coding, and Doubling

You can use an indicator matrix as input to PROC CORRESP using the VAR statement. An indicator matrix is composed of several submatrices, each of which is a design matrix with one column for each category of a categorical variable. In order to create an indicator matrix, you must code an indicator variable for each level of each categorical variable. For example, the categorical variable Sex , with two levels (Female and Male), would be coded using two indicator variables.

A binary indicator variable is coded 1 to indicate the presence of an attribute and 0 to indicate its absence. For the variable Sex , a male would be coded Female =0 and Male =1, and a female would be coded Female =1 and Male =0. The indicator variables representing a categorical variable must sum to 1.0. You can specify the BINARY option to create a binary table.

Sometimes binary data such as Yes/No data are available. For example, 1 means Yes, I have bought this brand in the last month and 0 means No, I have not bought this brand in the last month .

  title 'Doubling Yes/No Data';   proc format;   value yn 0 = 'No '  1 = 'Yes';   run;   data BrandChoice;   input a b c;   label a = 'Brand A' b = 'Brand B' c = 'Brand B';   format a b c yn.;   datalines;   0 0 1   1 1 0   0 1 1   0 1 0   1 0 0   ;  

Data such as these cannot be analyzed directly because the raw data do not consist of partitions, each with one column per level and exactly one 1 in each row. The data must be doubled so that both Yes and No are both represented by a column in the data matrix. The TRANSREG procedure provides one way of doubling. In the following statements, the DESIGN option specifies that PROC TRANSREG is being used only for coding, not analysis. The option SEPARATORS= : specifies that labels for the coded columns are constructed from input variable labels, followed by a colon and space, followed by the formatted value. The variables are designated in the MODEL statement as CLASS variables, and the ZERO=NONE option creates binary variables for all levels. The OUTPUT statement specifies the output data set and drops the _NAME_ , _TYPE_ , and Intercept variables. PROC TRANSREG stores a list of coded variable names in a macro variable &_TRGIND , which in this case has the value aNo aYes bNo bYes cNo cYes . This macro can be used directly in the VAR statement in PROC CORRESP.

  proc transreg data=BrandChoice design separators=': ';   model class(a b c / zero=none);   output out=Doubled(drop=_: Intercept);   run;   proc print label;   run;   proc corresp data=Doubled norow short;   var &_trgind;   run;  

A fuzzy-coded indicator also sums to 1.0 across levels of the categorical variable, but it is coded with fractions rather than with 1 and 0. The fractions represent the distribution of the attribute across several levels of the categorical variable.

Ordinal variables, such as survey responses of 1 to 3 can be represented as two design variables.

Table 24.7: Coding an Ordinal Variable

Ordinal Values

Coding

1

0.25

0.75

2

0.50

0.50

3

0.75

0.25

Values of the coding sum to one across the two coded variables.

This next example illustrates the use of binary and fuzzy-coded indicator variables. Fuzzy-coded indicators are used to represent missing data. Note that the missing values in the observation Igor are coded with equal proportions .

  proc transreg data=Neighbor design cprefix=0;   model class(Age Sex Height Hair / zero=none);   output out=Neighbor2(drop=_: Intercept);   id Name;   run;   data Neighbor3;   set Neighbor2;   if Sex = ' ' then do;   Female = 0.5;   Male   = 0.5;   end;   if Hair = ' ' then do;   White = 1/3;   Brown = 1/3;   Blond = 1/3;   end;   run;   proc print label;   run;  

There is one set of coded variables for each input categorical variable. If observation 12 is excluded, each set is a binary design matrix. Each design matrix has one column for each category and exactly one 1 in each row.

Fuzzy-coding is shown in the final observation, Igor. The observation Igor has missing values for the variables Sex and Hair . The design matrix variables are coded with fractions that sum to one within each categorical variable.

An alternative way to represent missing data is to treat missing values as an additional level of the categorical variable. This alternative is available with the MISSING option in the PROC statement. This approach yields coordinates for missing responses, allowing the comparison of missing along with the other levels of the categorical variables.

Greenacre and Hastie (1987) discuss additional coding schemes, including one for continuous variables. Continuous variables can be coded with PROC TRANSREG by specifying BSPLINE( variables / degree=1) in the MODEL statement.

Using the TABLES Statement

In the following TABLES statement, each variable list consists of a single variable:

  proc corresp data=Neighbor dimens=1 observed short;   ods select observed;   tables Sex, Age;   run;  

These statements create a contingency table with two rows (Female and Male) and two columns (Old and Young) and show the neighbors broken down by age and sex. The DIMENS=1 option overrides the default, which is DIMENS=2. The OBSERVED option displays the contingency table. The SHORT option limits the displayed output. Because it contains missing values, the observation where Name = Igor is omitted from the analysis. Figure 24.4 displays the contingency table.

start figure
  The CORRESP Procedure   Contingency Table   Old         Young           Sum   Female               2             2             4   Male                 4             3             7   Sum                  6             5            11  
end figure

Figure 24.4: Contingency Table for Sex, Age

The following statements create a table with six rows ( Blond*Short , Blond*Tall , Brown*Short , Brown*Tall , White*Short , and White*Tall ), and four columns ( Female , Male , Old , and Young ). The levels of the row variables are crossed, forming mutually exclusive categories, whereas the categories of the column variables overlap.

  proc corresp data=Neighbor cross=row observed short;   ods select observed;   tables Hair Height, Sex Age;   run;  
start figure
  The CORRESP Procedure   Contingency Table   Female        Male         Old       Young         Sum   Blond * Short              1           1           1           1           4   Blond * Tall               0           1           0           1           2   Brown * Short              1           1           2           0           4   Brown * Tall               1           3           1           3           8   White * Short              0           1           1           0           2   White * Tall               1           0           1           0           2   Sum                        4           7           6           5          22  
end figure

Figure 24.5: Contingency Table for Hair * Height, Sex Age

You can enter supplementary variables with TABLES input by including a SUPPLEMENTARY statement. Variables named in the SUPPLEMENTARY statement indicate TABLES variables with categories that are supplementary. In other words, the categories of the variable Age are represented in the row and column space, but they are not used in determining the scores of the categories of the variables Hair , Height , and Sex . The variable used in the SUPPLEMENTARY statement must be listed in the TABLES statement as well. For example, the following statements create a Burt table with seven active rows and columns ( Blond , Brown , White , Short , Tall , Female , Male ) and two supplementary rows and columns ( Old and Young ).

  proc corresp data=Neighbor observed short mca;   ods select burt supcols;   tables Hair Height Sex Age;   supplementary Age;   run;  

The following statements create a binary table with 7 active columns ( Blond , Brown , White , Short , Tall , Female , Male ), 2 supplementary columns ( Old and Young ), and 11 rows for the 11 observations with nonmissing values.

  proc corresp data=Neighbor observed short binary;   ods select binary supcols;   tables Hair Height Sex Age;   supplementary Age;   run;  
start figure
  The CORRESP Procedure   Binary Table   Blond     Brown     White     Short      Tall    Female      Male   1               0         0         1         1         0         0         1   2               0         1         0         0         1         1         0   3               0         1         0         1         0         0         1   4               0         0         1         0         1         1         0   5               0         1         0         1         0         1         0   6               1         0         0         0         1         0         1   7               0         1         0         0         1         0         1   8               1         0         0         1         0         0         1   9               1         0         0         1         0         1         0   10               0         1         0         0         1         0         1   11               0         1         0         0         1         0         1   Supplementary Columns   Old         Young   1                   1             0   2                   0             1   3                   1             0   4                   1             0   5                   1             0   6                   0             1   7                   0             1   8                   1             0   9                   0             1   10                   1             0   11                   0             1  
end figure

Figure 24.7: Binary Table from PROC CORRESP

Using the VAR Statement

With VAR statement input, the rows of the contingency table correspond to the observations of the input data set, and the columns correspond to the VAR statement variables. The values of the variables typically contain the table frequencies. The example displayed in Figure 24.4 could be run with VAR statement input using the following code:

  data Ages;   input Sex $ Old Young;   datalines;   Female  2 2   Male    4 3   ;   proc corresp data=Ages dimens=1 observed short;   var Old Young;   id Sex;   run;  

Only nonnegative values are accepted. Negative values are treated as missing, causing the observation to be excluded from the analysis. The values are not required to be integers. Row labels for the table are specified with an ID variable. Column labels are constructed from the variable name or variable label if one is specified. When you specify multiple correspondence analysis (MCA), the row and column labels are the same and are constructed from the variable names or labels, so you cannot include an ID statement. With MCA, the VAR statement must list the variables in the order in which the rows occur. For example, the table displayed in Figure 24.6, which was created with the following TABLES statement,

start figure
  The CORRESP Procedure   Burt Table   Blond     Brown     White     Short      Tall    Female      Male   Blond            3         0         0         2         1         1         2   Brown            0         6         0         2         4         2         4   White            0         0         2         1         1         1         1   Short            2         2         1         5         0         2         3   Tall             1         4         1         0         6         2         4   Female           1         2         1         2         2         4         0   Male             2         4         1         3         4         0         7   Supplementary Columns   Old         Young   Blond                1             2   Brown                3             3   White                2             0   Short                4             1   Tall                 2             4   Female               2             2   Male                 4             3  
end figure

Figure 24.6: Burt Table from PROC CORRESP
  tables Hair Height Sex Age;  

is input as follows with the VAR statement:

  proc corresp data=table nvars=4 mca;   var Blond Brown White Short Tall Female Male Old Young;   run;  

You must specify the NVARS= option to specify the number of original categorical variables with the MCA option. The option NVARS= n is needed to find boundaries between the subtables of the Burt table. If f is the sum of all elements in the Burt table Z ² Z , then fn ˆ’ 2 is the number of rows in the binary matrix Z . Thesumofall elements in each diagonal subtable of the Burt table must be fn ˆ’ 2 .

To enter supplementary observations, include a WEIGHT statement with negative weights for those observations. Specify the SUPPLEMENTARY statement to include supplementary variables. You must list supplementary variables in both the VAR and SUPPLEMENTARY statements.

Missing and Invalid Data

With VAR statement input, observations with missing or negative frequencies are excluded from the analysis. Supplementary variables and supplementary observations with missing or negative frequencies are also excluded. Negative weights are valid with VAR statement input.

With TABLES statement input, observations with negative weights are excluded from the analysis. With this form of input, missing cell frequencies cannot occur. Observations with missing values on the categorical variables are excluded unless you specify the MISSING option. If you specify the MISSING option, ordinary missing values and special missing values are treated as additional levels of a categorical variable. In all cases, if any row or column of the constructed table contains only zeros, that row or column is excluded from the analysis.

Observations with missing weights are excluded from the analysis.

Creating a Data Set Containing the Crosstabulation

The CORRESP procedure can read or create a contingency or Burt table. PROC CORRESP is generally more efficient with VAR statement input than with TABLES statement input. TABLES statement input requires that the table be created from raw categorical variables, whereas the VAR statement is used to read an existing table. If PROC CORRESP runs out of memory, it may be possible to use some other method to create the table and then use VAR statement input with PROC CORRESP.

The following example uses the CORRESP, FREQ, and TRANSPOSE procedures to create rectangular tables from a SAS data set WORK.A that contains the categorical variables V1 V5 . The Burt table examples assume that no categorical variable has a value found in any of the other categorical variables (that is, that each row and column label is unique).

You can use PROC CORRESP and ODS to create a rectangular two-way contingency table from two categorical variables.

  proc corresp data=a observed short;   ods listing close;   ods output Observed=Obs(drop=Sum where=(Label ne 'Sum'));   tables v1, v2;   run;   ods listing;  

You can use PROC FREQ and PROC TRANSPOSE to create a rectangular two-way contingency table from two categorical variables.

  proc freq data=a;   tables v1 * v2 / sparse noprint out=freqs;   run;   proc transpose data=freqs out=rfreqs;   id  v2;   var count;   by  v1;   run;  

You can use PROC CORRESP and ODS to create a Burt table from five categorical variables.

  proc corresp data=a observed short mca;   ods listing close;   ods output Burt=Obs;   tables v1-v5;   run;   ods listing;  

You can use a DATA step, PROC FREQ, and PROC TRANSPOSE to create a Burt table from five categorical variables.

  data b;   set a;   array v[5] $ v1-v5;   do i = 1 to 5;   row = v[i];   do j = 1 to 5;   column = v[j];   output;   end;   end;   keep row column;   run;   proc freq data=b;   tables row * column / sparse noprint out=freqs;   run;   proc transpose data=freqs out=rfreqs;   id  column;   var count;   by  row;   run;  

Output Data Sets

The OUTC= Data Set

The OUTC= data set contains two or three character variables and 4 n + 4 numeric variables, where n is the number of axes from DIMENS= n (two by default). The OUTC= data set contains one observation for each row, column, supplementary row, and supplementary column point, and one observation for inertias.

The first variable is named _TYPE_ and identifies the type of observation. The values of _TYPE_ are as follows:

  • The ˜INERTIA observation contains the total inertia in the INERTIA variable, and each dimension s inertia in the Contr1 Contr n variables.

  • The ˜OBS observations contain the coordinates and statistics for the rows of the table.

  • The ˜SUPOBS observations contain the coordinates and statistics for the supplementary rows of the table.

  • The ˜VAR observations contain the coordinates and statistics for the columns of the table.

  • The ˜SUPVAR observations contain the coordinates and statistics for the supplementary columns of the table.

If you specify the SOURCE option, then the data set also contains a variable _VAR_ containing the name or label of the input variable from which that row originates. The name of the next variable is either _NAME_ or (if you specify an ID statement) the name of the ID variable.

For observations with a value of ˜OBS or ˜SUPOBS for the _TYPE_ variable, the values of the second variable are constructed as follows:

  • When you use a VAR statement without an ID statement, the values are ˜Row1 , ˜Row2 , and so on.

  • When you specify a VAR statement with an ID statement, the values are set equal to the values of the ID variable.

  • When you specify a TABLES statement, the _NAME_ variable has values formed from the appropriate row variable values.

For observations with a value of ˜VAR or ˜SUPVAR for the _TYPE_ variable, the values of the second variable are equal to the names or labels of the VAR (or SUPPLEMENTARY) variables. When you specify a TABLES statement, the values are formed from the appropriate column variable values.

The third and subsequent variables contain the numerical results of the correspondence analysis.

  • Quality contains the quality of each point s representation in the DIMENS= n dimensional display, which is the sum of squared cosines over the first n dimensions.

  • Mass contains the masses or marginal sums of the relative frequency matrix.

  • Inertia contains each point s relative contribution to the total inertia.

  • Dim1 Dim n contain the point coordinates.

  • Contr1 Contr n contain the partial contributions to inertia.

  • SqCos1 SqCos n contain the squared cosines.

  • Best1 Best n and Best contain the summaries of the partial contributions to inertia.

The OUTF= Data Set

The OUTF= data set contains frequencies and percentages. It is similar to a PROC FREQ output data set. The OUTF= data set begins with a variable called _TYPE_ , which contains the observation type. If the SOURCE option is specified, the data set contains two variables _ROWVAR_ and _COLVAR_ that contain the names or labels of the row and column input variables from which each cell originates. The next two variables are classification variables that contain the row and column levels. If you use TABLES statement input and each variable list consists of a single variable, the names of the first two variables match the names of the input variables; otherwise , these variables are named Row and Column . The next two variables are Count and Percent , which contain frequencies and percentages.

The _TYPE_ variable can have the following values:

  • ˜OBSERVED observations contain the contingency table.

  • ˜SUPOBS observations contain the supplementary rows.

  • ˜SUPVAR observations contain the supplementary columns.

  • ˜EXPECTED observations contain the product of the row marginals and the column marginals divided by the grand frequency of the observed frequency table. For ordinary two-way contingency tables, these are the expected frequency matrix under the hypothesis of row and column independence.

  • ˜DEVIATION observations contain the matrix of deviations between the observed frequency matrix and the product of its row marginals and column marginals divided by its grand frequency. For ordinary two-way contingency tables, these are the observed minus expected frequencies under the hypothesis of row and column independence.

  • ˜CELLCHI2 observations contain contributions to the total chi-square test statistic.

  • ˜RP observations contain the row profiles.

  • ˜SUPRP observations contain supplementary row profiles.

  • ˜CP observations contain the column profiles.

  • ˜SUPCP observations contain supplementary column profiles.

Computational Resources

Let

n r

=

number of rows in the table

n c

=

number of columns in the table

n

=

number of observations

v

=

number of VAR statement variables

t

=

number of TABLES statement variables

c

=

max( n r , n c )

d

=

min( n r , n c )

For TABLES statement input, more than

click to expand

bytes of array space are required.

For VAR statement input, more than

click to expand

bytes of array space are required.

Memory

The computational resources formulas are underestimates of the amounts of memory needed to handle most problems. If you use a utility data set, and if memory could be used with perfect efficiency, then roughly the stated amount of memory would be needed. In reality, most problems require at least two or three times the minimum.

PROC CORRESP tries to store the raw data (TABLES input) and the contingency table in memory. If there is not enough memory, a utility data set is used, potentially resulting in a large increase in execution time.

Time

The time required to perform the generalized singular value decomposition is roughly proportional to 2 cd 2 + 5 d 3 . Overall computation time increases with table size at a rate roughly proportional to ( n r n c ) 3/2 .

Algorithm and Notation

This section is primarily based on the theory of correspondence analysis found in Greenacre (1984). If you are interested in other references, see the Background section on page 1069.

Let N be the contingency table formed from those observations and variables that are not supplementary and from those observations that have no missing values and have a positive weight. This table is an ( n r n c ) rank q matrix of nonnegative numbers with nonzero row and column sums. If Z a is the binary coding for variable A , and Z b is the binary coding for variable B , then N = Z b is a contingency table. Similarly, if Z b,c contains the binary coding for both variables B and C , then N = Z b,c can also be input to a correspondence analysis. With the BINARY option, N = Z , and the analysis is based on a binary table. In multiple correspondence analysis, the analysis is based on a Burt table, Z ² Z .

Let 1 be a vector of 1s of the appropriate order, let I be an identity matrix, and let diag(·) be a matrix-valued function that creates a diagonal matrix from a vector. Let

click to expand

The scalar f is the sum of all elements in N . The matrix P is a matrix of relative frequencies. The vector r contains row marginal proportions or row masses. The vector c contains column marginal proportions or column masses. The matrices D r and D c are diagonal matrices of marginals.

The rows of R contain the row profiles. The elements of each row of R sum to one. Each ( i, j ) element of R contains the observed probability of being in column j given membership in row i . Similarly, the columns of C contain the column profiles. The coordinates in correspondence analysis are based on the generalized singular value decomposition of P ,

where

click to expand

In multiple correspondence analysis,

The matrix A , which is the rectangular matrix of left generalized singular vectors, has n r rows and q columns; the matrix D u , which is a diagonal matrix of singular values, has q rows and columns; and the matrix B , which is the rectangular matrix of right generalized singular vectors, has n c rows and q columns. The columns of A and B define the principal axes of the column and row point clouds, respectively.

The generalized singular value decomposition of P ˆ’ rc ² , discarding the last singular value (which is zero) and the last left and right singular vectors, is exactly the same as a generalized singular value decomposition of P , discarding the first singular value (which is one), the first left singular vector, r , and the first right singular vector, c . The first (trivial) column of A and B and the first singular value in D u are discarded before any results are displayed. You can obtain the generalized singular value decomposition of P ˆ’ rc ² from the ordinary singular value decomposition of click to expand .

click to expand
click to expand

Hence, and .

The default row coordinates are , and the default column coordinates are . Typically the first two columns of and are plotted to display graphically associations between the row and column categories. The plot consists of two overlaid plots, one for rows and one for columns. The row points are row profiles, rescaled so that distances between profiles can be displayed as ordinary Euclidean distances, then orthogonally rotated to a principal axes orientation. The column points are column profiles, rescaled so that distances between profiles can be displayed as ordinary Euclidean distances, then orthogonally rotated to a principal axes orientation. Distances between row points and other row points have meaning. Distances between column points and other column points have meaning. However, distances between column points and row points are not interpretable.

The PROFILE=, ROW=, and COLUMN= Options

The PROFILE=, ROW=, and COLUMN= options standardize the coordinates before they are displayed and placed in the output data set. The options PROFILE=BOTH, PROFILE=ROW, and PROFILE=COLUMN provide the standardizations that are typically used in correspondence analysis. There are six choices each for row and column coordinates. However, most of the combinations of the ROW= and COLUMN= options are not useful. The ROW= and COLUMN= options are provided for completeness, but they are not intended for general use.

ROW=

Matrix Formula

A

A

AD

AD u

DA

DAD

DAD1/2

DAID1/2

click to expand

COLUMN=

Matrix Formula

B

B

BD

BD u

DB

DBD

DBD1/2

DBID1/2

click to expand

When PROFILE=ROW (ROW=DAD and COLUMN=DB), the row coordinates and column coordinates provide a correspondence analysis based on the row profile matrix. The row profile (conditional probability) matrix is defined as click to expand . The elements of each row of R sum to one. Each ( i, j ) element of R contains the observed probability of being in column j given membership in row i . The principal row coordinates and standard column coordinates provide a decomposition of click to expand click to expand . Since click to expand , the row coordinates are weighted centroids of the column coordinates. Each column point, with coordinates scaled to standard coordinates, defines a vertex in ( n c ˆ’ 1)-dimensional space. All of the principal row coordinates are located in the space defined by the standard column coordinates. Distances among row points have meaning, but distances among column points and distances between row and column points are not interpretable.

The option PROFILE=COLUMN can be described as applying the PROFILE=ROW formulas to the transpose of the contingency table. When PROFILE=COLUMN (ROW=DA and COLUMN=DBD), the principal column coordinates are weighted centroids of the standard row coordinates . Each row point, with coordinates scaled to standard coordinates, defines a vertex in ( n r ˆ’ 1)-dimensional space. All of the principal column coordinates are located in the space defined by the standard row coordinates. Distances among column points have meaning, but distances among row points and distances between row and column points are not interpretable.

The usual sets of coordinates are given by the default PROFILE=BOTH (ROW=DAD and COLUMN=DBD). All of the summary statistics, such as the squared cosines and contributions to inertia, apply to these two sets of points. One advantage to using these coordinates is that both sets ( click to expand are postmultiplied by the diagonal matrix D u , which has diagonal values that are all less than or equal to one. When D u is a part of the definition of only one set of coordinates, that set forms a tight cluster near the centroid whereas the other set of points is more widely dispersed. Including D u in both sets makes a better graphical display. However, care must be taken in interpreting such a plot. No correct interpretation of distances between row points and column points can be made.

Another property of this choice of coordinates concerns the geometry of distances between points within each set. The default row coordinates can be decomposed into click to expand ). The row coordinates are row profiles , rescaled by ( ) (rescaled so that distances between profiles are transformed from a chi-square metric to a Euclidean metric), then orthogonally rotated (with ) to a principal axes orientation. Similarly, the column coordinates are column profiles rescaled to a Euclidean metric and orthogonally rotated to a principal axes orientation.

The rationale for computing distances between row profiles using the non-Euclidean chi-square metric is as follows. Each row of the contingency table can be viewed as a realization of a multinomial distribution conditional on its row marginal frequency. The null hypothesis of row and column independence is equivalent to the hypothesis of homogeneity of the row profiles. A significant chi-square statistic is geometrically interpreted as a significant deviation of the row profiles from their centroid, c ² . The chi-square metric is the Mahalanobis metric between row profiles based on their estimated covariance matrix under the homogeneity assumption (Greenacre and Hastie 1987). A parallel argument can be made for the column profiles.

When ROW=DAD1/2 and COLUMN=DBD1/2 (Gifi 1990; van der Heijden and de Leeuw 1985), the row coordinates and column coordinates are a decomposition of .

In all of the preceding pairs, distances between row and column points are not meaningful. This prompted Carroll, Green, and Schaffer (1986) to propose that row coordinates click to expand and column coordinates click to expand be used.

These coordinates are (except for a constant scaling) the coordinates from a multiple correspondence analysis of a Burt table created from two categorical variables. This standardization is available with ROW=DAID1/2 and COLUMN=DBID1/2. However, this approach has been criticized on both theoretical and empirical grounds by Greenacre (1989). The Carroll, Green, and Schaffer standardization relies on the assumption that the chi-square metric is an appropriate metric for measuring the distance between the columns of a bivariate indicator matrix. See the section Types of Tables Used as Input on page 1083 for a description of indicator matrices. Greenacre (1989) showed that this assumption cannot be justified.

The MCA Option

The MCA option performs a multiple correspondence analysis (MCA). This option requires a Burt table. You can specify the MCA option with a table created from a design matrix with fuzzy coding schemes as long as every row of every partition of the design matrix has the same marginal sum. For example, each row of each partition could contain the probabilities that the observation is a member of each level. Then the Burt table constructed from this matrix no longer contains all integers, and the diagonal partitions are no longer diagonal matrices, but MCA is still valid.

A TABLES statement with a single variable list creates a Burt table. Thus, you can always specify the MCA option with this type of input. If you use the MCA option when reading an existing table with a VAR statement, you must ensure that the table is a Burt table.

If you perform MCA on a table that is not a Burt table, the results of the analysis are invalid. If the table is not symmetric, or if the sums of all elements in each diagonal partition are not equal, PROC CORRESP displays an error message and quits.

A subset of the columns of a Burt table is not necessarily a Burt table, so in MCA it is not appropriate to designate arbitrary columns as supplementary. You can, however, designate all columns from one or more categorical variables as supplementary.

The results of a multiple correspondence analysis of a Burt table Z ² Z are the same as the column results from a simple correspondence analysis of the binary (or fuzzy) matrix Z . Multiple correspondence analysis is not a simple correspondence analysis of the Burt table. It is not appropriate to perform a simple correspondence analysis of a Burt table. The MCA option is based on , whereas a simple correspondence analysis of the Burt table would be based on P = BD u B ² .

Since the rows and columns of the Burt table are the same, no row information is displayed or written to the output data sets. The resulting inertias and the default (COLUMN=DBD) column coordinates are the appropriate inertias and coordinates for an MCA. The supplementary column coordinates, cosines, and quality of representation formulas for MCA differ from the simple correspondence analysis formulas because the design matrix column profiles and left singular vectors are not available.

The following statements create a Burt table and perform a multiple correspondence analysis:

  proc corresp data=Neighbor observed short mca;   tables Hair Height Sex Age;   run;  

Both the rows and the columns have the same nine categories (Blond, Brown, White, Short, Tall, Female, Male, Old, and Young).

MCA Adjusted Inertias

The usual principal inertias of a Burt Table constructed from m categorical variables in MCA are the eigenvalues u k from . The problem with these inertias is that they provide a pessimistic indication of fit. Benz cri (1979) proposed the following inertia adjustment, which is also described by Greenacre (1984, p. 145):

click to expand

The Benz cri adjustment is available with the BENZECRI option.

Greenacre (1994, p. 156) argues that the Benz cri adjustment overestimates the quality of fit. Greenacre proposes instead the following inertia adjustment:

click to expand

The Greenacre adjustment is available with the GREENACRE option.

Ordinary unadjusted inertias are printed by default with MCA when neither the BENZECRI nor the GREENACRE option is specified. However, the unadjusted inertias are not printed by default when either the BENZECRI or the GREENACRE option is specified. To display both adjusted and unadjusted inertias, specify the UNADJUSTED option in addition to the relevant adjusted inertia option (BENZECRI, GREENACRE, or both).

Supplementary Rows and Columns

Supplementary rows and columns are represented as points in the joint row and column space, but they are not used when determining the locations of the other active rows and columns of the table. The formulas that are used to compute coordinates for the supplementary rows and columns depend on the PROFILE= option or on the ROW= and COLUMN= options. Let S o be the matrix with rows that contain the supplementary observations and S v be a matrix with rows that contain the supplementary variables. Note that S v is defined to be the transpose of the supplementary variable partition of the table. Let R s = diag( S o 1 ) ˆ’ 1 S o be the supplementary observation profile matrix and C s = diag( S v 1 ) ˆ’ 1 S v be the supplementary variable profile matrix. Note that the notation diag(·) ˆ’ 1 means to convert the vector to a diagonal matrix, then invert the diagonal matrix. The coordinates for the supplementary observations and variables are as follows.

ROW=

Matrix Formula

A

AD

DA

DAD

DAD1/2

DAID1/2

click to expand

COLUMN=

Matrix Formula

B

BD

DB

DBD

DBD1/2

DBID1/2

click to expand

MCA COLUMN=

Matrix Formula

B

not allowed

BD

not allowed

DB

DBD

DBD1/2

DBID1/2

click to expand

Statistics that Aid Interpretation

The partial contributions to inertia, squared cosines, quality of representation, inertia, and mass provide additional information about the coordinates. These statistics are displayed by default. Include the SHORT or NOPRINT option in the PROC CORRESP statement to avoid having these statistics displayed.

These statistics pertain to the default PROFILE=BOTH coordinates, no matter what values you specify for the ROW=, COLUMN=, or PROFILE= option. Let sq(·) be a matrix-valued function denoting element-wise squaring of the argument matrix. Let t be the total inertia (the sum of the elements in ).

In MCA, let D s be the Burt table partition containing the intersection of the supplementary columns and the supplementary rows. The matrix D s is a diagonal matrix of marginal frequencies of the supplemental columns of the binary matrix Z . Let p be the number of rows in this design matrix.

Statistic

Matrix Formula

Row partial contributions to inertia

Column partial contributions to inertia

Row squared cosines

click to expand

Column squared cosines

click to expand

Row mass

r

Column mass

c

Row inertia

Column inertia

Supplementary row squared cosines

click to expand

Supplementary column squared cosines

click to expand

MCA supplementary column squared cosines

click to expand

The quality of representation in the DIMENS= n dimensional display of any point is the sum of its squared cosines over only the n dimensions. Inertia and mass are not defined for supplementary points.

A table that summarizes the partial contributions to inertia table is also computed. The points that best explain the inertia of each dimension and the dimension to which each point contributes the most inertia are indicated. The output data set variable names for this table are Best1 Best n (where DIMENS= n ) and Best . The Best column contains the dimension number of the largest partial contribution to inertia for each point (the index of the maximum value in each row of or .

For each row, the Best1 Best n columns contain either the corresponding value of Best if the point is one of the biggest contributors to the dimension s inertia or 0 if it is not. Specifically, Best1 contains the value of Best for the point with the largest contribution to dimension one s inertia. A cumulative proportion sum is initialized to this point s partial contribution to the inertia of dimension one. If this sum is less than the value for the MININERTIA= option, then Best1 contains the value of Best for the point with the second largest contribution to dimension one s inertia. Otherwise, this point s Best1 is 0. This point s partial contribution to inertia is added to the sum. This process continues for the point with the third largest partial contribution, and so on, until adding a point s contribution to the sum increases the sum beyond the value of the MININERTIA= option. This same algorithm is then used for Best2 , and so on.

For example, the following table contains contributions to inertia and the correspond-ing Best variables. The contribution to inertia variables are proportions that sum to 1 within each column. The first point makes its greatest contribution to the inertia of dimension two, so Best for point one is set to 2 and Best1 _ Best3 for point one must all be 0 or 2. The second point also makes its greatest contribution to the inertia of dimension two, so Best for point two is set to 2 and Best1 Best3 for point two must all be 0 or 2, and so on.

Assume MININERTIA=0.8, the default. In dimension one, the largest contribution is 0.41302 for the fourth point, so Best1 is set to 1, the value of Best for the fourth point. Because this value is less than 0.8, the second largest value (0.36456 for point five) is found and its Best1 is set to its Best s value of 1. Because 0 . 41302 + 0 . 36456 = 0 . 77758 is less than 0.8, the third point (0.0882 at point eight) is found and Best1 is set to 3 since the contribution to dimension 3 for that point is greater than the contribution to dimension 1. This increases the sum of the partial contributions to greater than 0.8, so the remaining Best1 values are all 0.

Contr1

Contr2

Contr3

Best1

Best2

Best3

Best

0.01593

0.32178

0.07565

2

2

2

0.03014

0.24826

0.07715

2

2

2

0.00592

0.02892

0.02698

2

0.41302

0.05191

0.05773

1

1

0.36456

0.00344

0.15565

1

1

1

0.03902

0.30966

0.11717

2

2

2

0.00019

0.01840

0.00734

2

0.08820

0.00527

0.16555

3

3

3

0.01447

0.00024

0.03851

3

0.02855

0.01213

0.27827

3

3

Displayed Output

The display options control the amount of displayed output. By default, the following information is displayed:

  • an inertia and chi-square decomposition table including the total inertia, the principal inertias of each dimension (eigenvalues), the singular values (square roots of the eigenvalues), each dimension s percentage of inertia, a horizontal bar chart of the percentages, and the total chi-square with its degrees of freedom and decomposition. The chi-square statistics and degrees of freedom are valid only when the constructed table is an ordinary two-way contingency table.

  • the coordinates of the rows and columns on the dimensions

  • the mass, relative contribution to the total inertia, and quality of representation in the DIMENS= n dimensional display of each row and column

  • the squared cosines of the angles between each axis and a vector from the origin to the point

  • the partial contributions of each point to each dimension s inertia

  • the Best table, indicators of which points best explain the inertia of each dimension

Specific display options and combinations of options display output as follows.

If you specify the OBSERVED or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays

  • the contingency table including the row and column marginal frequencies; or with BINARY, the binary table; or the Burt table in MCA

  • the supplementary rows

  • the supplementary columns

If you specify the OBSERVED or ALL option, with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays

  • the contingency table or Burt table in MCA, scaled to percentages, including the row and column marginal percentages

  • the supplementary rows, scaled to percentages

  • the supplementary columns, scaled to percentages

If you specify the EXPECTED or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the product of the row marginals and the column marginals divided by the grand frequency of the observed frequency table. For ordinary two-way contingency tables, these are the expected frequencies under the hypothesis of row and column independence.

If you specify the EXPECTED or ALL option with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays the product of the row marginals and the column marginals divided by the grand frequency of the observed percentages table. For ordinary two-way contingency tables, these are the expected percentages under the hypothesis of row and column independence.

If you specify the DEVIATION or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the observed minus expected frequencies. For ordinary two-way contingency tables, these are the expected frequencies under the hypothesis of row and column independence.

If you specify the DEVIATION or ALL option with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays the observed minus expected percentages. For ordinary two-way contingency tables, these are the expected percentages under the hypothesis of row and column independence.

If you specify the CELLCHI2 or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays contributions to the total chi-square test statistic, including the row and column marginals. The intersection of the marginals contains the total chi-square statistic.

If you specify the CELLCHI2 or ALL option with the PRINT=PERCENT or the PRINT=BOTH option, PROC CORRESP displays contributions to the total chi-square, scaled to percentages, including the row and column marginals.

If you specify the RP or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the row profiles and the supplementary row profiles.

If you specify the RP or ALL option with the PRINT=PERCENT or the PRINT=BOTH option, PROC CORRESP displays the row profiles (scaled to percentages) and the supplementary row profiles (scaled to percentages).

If you specify the CP or ALL option and you do not specify PRINT=PERCENT, PROC CORRESP displays the column profiles and the supplementary column profiles.

If you specify the CP or ALL option with the PRINT=PERCENT or PRINT=BOTH option, PROC CORRESP displays the column profiles (scaled to percentages) and the supplementary column profiles (scaled to percentages).

If you do not specify the NOPRINT option, PROC CORRESP displays the inertia and chi-square decomposition table. This includes the nonzero singular values of the contingency table (or, in MCA, the binary matrix Z used to create the Burt table), the nonzero principal inertias (or eigenvalues) for each dimension, the total inertia, the total chi-square, the decomposition of chi-square, the chi-square degrees of freedom (appropriate only when the table is an ordinary two-way contingency table), the percent of the total chi-square and inertia for each dimension, and a bar chart of the percents.

If you specify the MCA option and you do not specify the NOPRINT option, PROC CORRESP displays the adjusted inertias. This includes the nonzero adjusted inertias, percents, cumulative percents, and a bar chart of the percents.

If you do not specify the NOROW, NOPRINT, or MCA option, PROC CORRESP displays the row coordinates and the supplementary row coordinates (displayed when there are supplementary row points).

If you do not specify the NOROW, NOPRINT, MCA, or SHORT option, PROC CORRESP displays

  • the summary statistics for the row points including the quality of representation of the row points in the n -dimensional display, the mass, and the relative contributions to inertia

  • the quality of representation of the supplementary row points in the n -dimensional display (displayed when there are supplementary row points)

  • the partial contributions to inertia for the row points

  • the row Best table, indicators of which row points best explain the inertia of each dimension

  • the squared cosines for the row points

  • the squared cosines for the supplementary row points (displayed when there are supplementary row points)

If you do not specify the NOCOLUMN or NOPRINT option, PROC CORRESP displays the column coordinates and the supplementary column coordinates (displayed when there are supplementary column points).

If you do not specify the NOCOLUMN, NOPRINT, or SHORT option, PROC CORRESP displays

  • the summary statistics for the column points including the quality of representation of the column points in the n -dimensional display, the mass, and the relative contributions to inertia for the supplementary column points

  • the quality of representation of the supplementary column points in the n -dimensional display (displayed when there are supplementary column points)

  • the partial contributions to inertia for the column points

  • the column Best table, indicators of which column points best explain the inertia of each dimension

  • the squared cosines for the column points

  • the squared cosines for the supplementary column points

ODS Table Names

PROC CORRESP assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 24.8: ODS Tables Produced in PROC CORRESP

ODS Table Name

Description

Option

AdjInGreenacre

Greenacre Inertia Adjustment

GREENACRE

AdjInBenzecri

Benz cri Inertia Adjustment

BENZECRI

Binary

Binary table

OBSERVED, BINARY

BinaryPct

Binary table percents

OBSERVED, BINARY [*]

Burt

Burt table

OBSERVED, MCA

BurtPct

Burt table percents

OBSERVED, MCA [*]

CellChiSq

Contributions to Chi Square

CELLCHI2

CellChiSqPct

Contributions, pcts

CELLCHI2 [*]

ColBest

Col best indicators

default

ColContr

Col contributions to inertia

default

ColCoors

Col coordinates

default

ColProfiles

Col profiles

CP

ColProfilesPct

Col profiles, pcts

CP [*]

ColQualMassIn

Col quality, mass, inertia

default

ColSqCos

Col squared cosines

default

DF

DF, Chi Square (not displayed)

default

Deviations

Observed - expected freqs

DEVIATIONS

DeviationsPct

Observed - expected pcts

DEVIATIONS [*]

Expected

Expected frequencies

EXPECTED

ExpectedPct

Expected percents

EXPECTED [*]

Inertias

Inertia decomposition table

default

Observed

Observed frequencies

OBSERVED

ObservedPct

Observed percents

OBSERVED [*]

RowBest

Row best indicators

default

RowContr

Row contributions to inertia

default

RowCoors

Row coordinates

default

RowProfiles

Row profiles

RP

RowProfilesPct

Row profiles, pcts

RP [*]

RowQualMassIn

Row quality, mass, inertia

default

RowSqCos

Row squared cosines

default

SupColCoors

Supp col coordinates

default

SupColProfiles

Supp col profiles

CP

SupColProfilesPct

Supp col profiles, pcts

CP [*]

SupColQuality

Supp col quality

default

SupCols

Supplementary col freq

OBSERVED

SupColsPct

Supplementary col pcts

OBSERVED [*]

SupColSqCos

Supp col squared cosines

default

SupRows

Supplementary row freqs

OBSERVED

SupRowCoors

Supp row coordinates

default

SupRowProfiles

Supp row profiles

RP

SupRowProfilesPct

Supp row profiles, pcts

RP [*]

SupRowQuality

Supp row quality

default

SupRowsPct

Supplementary row pcts

OBSERVED [*]

SupRowSqCos

Supp row squared cosines

default

[*] Percents are displayed when you specify the PRINT=PERCENT or PRINT=BOTH option.

ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the CORRESP procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request a graph you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

ODS Graph Names

PROC CORRESP assigns a name to the graph it creates using ODS. You can use this name to reference the graph when using ODS. The name is listed in Table 24.9.

Table 24.9: ODS Graphics Produced by PROC CORRESP

ODS Graph Name

Plot Description

CorrespPlot

Correspondence analysis plot

To request a graph you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.




SAS.STAT 9.1 Users Guide (Vol. 2)
SAS/STAT 9.1 Users Guide Volume 2 only
ISBN: B003ZVJDOK
EAN: N/A
Year: 2004
Pages: 92

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net