Details


Pearson Product-Moment Correlation

The Pearson product-moment correlation is a parametric measure of association for two variables. It measures both the strength and direction of a linear relationship. If one variable X is an exact linear function of another variable Y, a positive relationship exists if the correlation is 1 and a negative relationship exists if the correlation is −1. If there is no linear predictability between the two variables, the correlation is 0. If the two variables are bivariate normal with a correlation of 0, they are independent. However, correlation does not imply causality because, in some cases, an underlying causal relationship may not exist.

The following scatter plot matrix displays the relationship between two numeric random variables in several situations.

Figure 1.4: Correlations between Two Variables

The scatter plot matrix shows a positive correlation between variables Y1 and X1, a negative correlation between Y1 and X2, and no clear correlation between Y2 and X1. The plot also shows no clear linear correlation between Y2 and X2, even though Y2 is dependent on X2.

The formula for the population Pearson product-moment correlation, denoted $\rho_{xy}$, is

$$\rho_{xy} \;=\; \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{V}(x)\,\mathrm{V}(y)}} \;=\; \frac{\mathrm{E}\big[(x-\mathrm{E}x)(y-\mathrm{E}y)\big]}{\sqrt{\mathrm{E}(x-\mathrm{E}x)^2\;\mathrm{E}(y-\mathrm{E}y)^2}}$$

The sample correlation, such as a Pearson product-moment correlation or weighted product-moment correlation, estimates the population correlation. The formula for the sample Pearson product-moment correlation is

$$r_{xy} \;=\; \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2\,\sum_i (y_i-\bar{y})^2}}$$

where $\bar{x}$ is the sample mean of $x$ and $\bar{y}$ is the sample mean of $y$. The formula for a weighted Pearson product-moment correlation is

$$r_{xy} \;=\; \frac{\sum_i w_i (x_i-\bar{x}_w)(y_i-\bar{y}_w)}{\sqrt{\sum_i w_i (x_i-\bar{x}_w)^2\,\sum_i w_i (y_i-\bar{y}_w)^2}}$$

where $w_i$ is the weight, $\bar{x}_w$ is the weighted mean of $x$, and $\bar{y}_w$ is the weighted mean of $y$.
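For example, the following statements (a minimal sketch; the data set fitness, its variables, and the weight variable w are hypothetical) request both correlations; the WEIGHT statement yields the weighted product-moment correlation:

   data fitness;
      input x y w @@;
      datalines;
   1.2 2.3 1  2.1 2.9 2  3.4 4.0 1  4.6 4.8 3  5.1 6.2 1
   ;
   run;

   proc corr data=fitness pearson;
      var x y;
      weight w;   /* weighted Pearson product-moment correlation */
   run;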

Probability Values

Probability values for the Pearson correlation are computed by treating

$$t \;=\; r\,\sqrt{\frac{n-2}{1-r^2}}$$

as coming from a $t$ distribution with $(n-2)$ degrees of freedom, where $r$ is the sample correlation.
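As a quick check of this computation, the following DATA step sketch (with hypothetical values of r and n) evaluates the statistic and its two-sided p-value with the PROBT function:

   data pvalue;
      r = 0.42;                            /* hypothetical sample correlation */
      n = 25;                              /* hypothetical sample size        */
      t = r * sqrt((n - 2) / (1 - r*r));   /* t statistic with n-2 df         */
      p = 2 * (1 - probt(abs(t), n - 2));  /* two-sided p-value               */
      put t= p=;
   run;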

Spearman Rank-Order Correlation

Spearman rank-order correlation is a nonparametric measure of association based on the ranks of the data values. The formula is

$$\theta \;=\; \frac{\sum_i (R_i-\bar{R})(S_i-\bar{S})}{\sqrt{\sum_i (R_i-\bar{R})^2\,\sum_i (S_i-\bar{S})^2}}$$

where $R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, $\bar{R}$ is the mean of the $R_i$ values, and $\bar{S}$ is the mean of the $S_i$ values.

PROC CORR computes the Spearman correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, the averaged ranks are used.
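The following sketch replicates that computation with PROC RANK; the input data set test and its variables x and y are hypothetical, and TIES=MEAN assigns averaged ranks to tied values:

   proc rank data=test out=ranked ties=mean;
      var x y;
      ranks rx ry;
   run;

   /* the Pearson correlation of the ranks equals the Spearman correlation */
   proc corr data=ranked pearson;
      var rx ry;
   run;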

Probability Values

Probability values for the Spearman correlation are computed by treating

$$t \;=\; r\,\sqrt{\frac{n-2}{1-r^2}}$$

as coming from a $t$ distribution with $(n-2)$ degrees of freedom, where $r$ is the sample Spearman correlation.

Kendall's Tau-b Correlation Coefficient

Kendall's tau-b is a nonparametric measure of association based on the number of concordances and discordances in paired observations. Concordance occurs when paired observations vary together, and discordance occurs when paired observations vary differently. The formula for Kendall's tau-b is

$$\tau \;=\; \frac{\sum_{i<j} \operatorname{sgn}(x_i-x_j)\,\operatorname{sgn}(y_i-y_j)}{\sqrt{(T_0-T_1)(T_0-T_2)}}$$

where $T_0 = n(n-1)/2$, $T_1 = \sum_k t_k(t_k-1)/2$, and $T_2 = \sum_l u_l(u_l-1)/2$. Here $t_k$ is the number of tied $x$ values in the $k$th group of tied $x$ values, $u_l$ is the number of tied $y$ values in the $l$th group of tied $y$ values, $n$ is the number of observations, and $\operatorname{sgn}(z)$ is defined as

$$\operatorname{sgn}(z) \;=\; \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}$$

PROC CORR computes Kendall's tau-b by ranking the data and using a method similar to Knight (1966). The data are double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. PROC CORR computes Kendall's tau-b from the number of interchanges of the first variable and corrects for tied pairs (pairs of observations with equal values of X or equal values of Y).
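For illustration only, the following DATA step sketch computes tau-b directly from the definition by comparing all pairs; it assumes a hypothetical data set pairs with variables x and y and at most 100 observations. (PROC CORR's double-sorting method is far more efficient than this O(n²) loop.)

   data _null_;
      array xv[100] _temporary_;
      array yv[100] _temporary_;
      /* load the pairs into temporary arrays */
      do n = 1 by 1 until (eof);
         set pairs end=eof;
         xv[n] = x;
         yv[n] = y;
      end;
      s = 0; t1 = 0; t2 = 0;
      do i = 1 to n - 1;
         do j = i + 1 to n;
            s  + sign(xv[i] - xv[j]) * sign(yv[i] - yv[j]); /* concordances minus discordances */
            t1 + (xv[i] = xv[j]);   /* tied pairs in x: sum of t_k(t_k-1)/2 */
            t2 + (yv[i] = yv[j]);   /* tied pairs in y: sum of u_l(u_l-1)/2 */
         end;
      end;
      t0 = n * (n - 1) / 2;
      taub = s / sqrt((t0 - t1) * (t0 - t2));
      put taub=;
   run;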

Probability Values

Probability values for Kendall's tau-b are computed by treating

$$\frac{s}{\sqrt{\mathrm{V}(s)}}$$

as coming from a standard normal distribution, where

$$s \;=\; \sum_{i<j} \operatorname{sgn}(x_i-x_j)\,\operatorname{sgn}(y_i-y_j)$$

and $\mathrm{V}(s)$, the variance of $s$, is computed as

$$\mathrm{V}(s) \;=\; \frac{v_0 - v_t - v_u}{18} \;+\; \frac{v_1}{2n(n-1)} \;+\; \frac{v_2}{9n(n-1)(n-2)}$$

where

$$v_0 = n(n-1)(2n+5), \qquad v_t = \sum_i t_i(t_i-1)(2t_i+5), \qquad v_u = \sum_i u_i(u_i-1)(2u_i+5)$$

$$v_1 = \Big(\sum_i t_i(t_i-1)\Big)\Big(\sum_i u_i(u_i-1)\Big), \qquad v_2 = \Big(\sum_i t_i(t_i-1)(t_i-2)\Big)\Big(\sum_i u_i(u_i-1)(u_i-2)\Big)$$

The sums are over groups of tied values, where $t_i$ is the number of tied $x$ values and $u_i$ is the number of tied $y$ values (Noether 1967). The sampling distribution of Kendall's partial tau-b is unknown; therefore, the probability values are not available.

Hoeffding Dependence Coefficient

Hoeffding's measure of dependence, $D$, is a nonparametric measure of association that detects more general departures from independence. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables (Hoeffding 1948). Each set of $(x, y)$ values provides the cut points for the classification. The formula for Hoeffding's $D$ is

$$D \;=\; 30\,\frac{(n-2)(n-3)\,D_1 + D_2 - 2(n-2)\,D_3}{n(n-1)(n-2)(n-3)(n-4)}$$

where $D_1 = \sum_i (Q_i-1)(Q_i-2)$, $D_2 = \sum_i (R_i-1)(R_i-2)(S_i-1)(S_i-2)$, and $D_3 = \sum_i (R_i-2)(S_i-2)(Q_i-1)$. $R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, and $Q_i$ (also called the bivariate rank) is 1 plus the number of points with both $x$ and $y$ values less than the $i$th point.

A point that is tied on only the $x$ value or the $y$ value contributes 1/2 to $Q_i$ if the other value is less than the corresponding value for the $i$th point.

A point that is tied on both $x$ and $y$ contributes 1/4 to $Q_i$. PROC CORR obtains the $Q_i$ values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. Hoeffding's $D$ statistic is computed using the number of interchanges of the first variable. When no ties occur among data set observations, the $D$ statistic values are between −0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the $D$ statistic may result in a smaller value. That is, for a pair of variables with identical values, Hoeffding's $D$ statistic may be less than 1. With a large number of ties in a small data set, the $D$ statistic may be less than −0.5. For more information about Hoeffding's $D$, refer to Hollander and Wolfe (1973, p. 228).
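As a sketch of this behavior, the following hypothetical example creates an exact nonlinear dependence; the Pearson correlation is near zero for these data, while Hoeffding's D is large:

   data curve;
      do x = -5 to 5 by 0.5;
         y = x * x;   /* y depends on x, but not linearly */
         output;
      end;
   run;

   proc corr data=curve pearson hoeffding;
      var x y;
   run;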

Probability Values

The probability values for Hoeffding's $D$ statistic are computed using the asymptotic distribution computed by Blum, Kiefer, and Rosenblatt (1961). If the sample size is less than 10, refer to the tables for the distribution of $D$ in Hollander and Wolfe (1973).

Partial Correlation

A partial correlation measures the strength of a relationship between two variables, while controlling for the effects of other variables. The Pearson partial correlation between two variables, after controlling for the variables in the PARTIAL statement, is equivalent to the Pearson correlation between the residuals of the two variables after regression on the controlling variables.

Let $\mathbf{y} = (y_1, y_2, \ldots, y_v)$ be the set of variables to correlate and $\mathbf{z} = (z_1, z_2, \ldots, z_p)$ be the set of controlling variables. The population Pearson partial correlation between the $i$th and the $j$th variables of $\mathbf{y}$ given $\mathbf{z}$ is the correlation between the errors $(y_i - \mathrm{E}(y_i))$ and $(y_j - \mathrm{E}(y_j))$, where

$$\mathrm{E}(y_i) = \alpha_i + \mathbf{z}\,\boldsymbol{\beta}_i \qquad \text{and} \qquad \mathrm{E}(y_j) = \alpha_j + \mathbf{z}\,\boldsymbol{\beta}_j$$

are the regression models for variables $y_i$ and $y_j$ given the set of controlling variables $\mathbf{z}$, respectively.

For a given sample of observations, a sample Pearson partial correlation between $y_i$ and $y_j$ given $\mathbf{z}$ is derived from the residuals $y_i - \hat{y}_i$ and $y_j - \hat{y}_j$, where

$$\hat{y}_i = \hat{\alpha}_i + \mathbf{z}\,\hat{\boldsymbol{\beta}}_i \qquad \text{and} \qquad \hat{y}_j = \hat{\alpha}_j + \mathbf{z}\,\hat{\boldsymbol{\beta}}_j$$

are fitted values from the regression models for variables $y_i$ and $y_j$ given $\mathbf{z}$.

The partial corrected sums of squares and crossproducts (CSSCP) of $\mathbf{y}$ given $\mathbf{z}$ are the corrected sums of squares and crossproducts of the residuals $\mathbf{y} - \hat{\mathbf{y}}$. Using these partial corrected sums of squares and crossproducts, you can calculate the partial covariances and partial correlations.

PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let $S$ be the partitioned CSSCP matrix between two sets of variables, $\mathbf{z}$ and $\mathbf{y}$:

$$S \;=\; \begin{pmatrix} S_{zz} & S_{zy} \\ S'_{zy} & S_{yy} \end{pmatrix}$$

PROC CORR calculates $S_{yy.z}$, the partial CSSCP matrix of $\mathbf{y}$ after controlling for $\mathbf{z}$, by applying the Cholesky decomposition algorithm sequentially on the rows associated with $\mathbf{z}$, the variables being partialled out.

After applying the Cholesky decomposition algorithm to each row associated with variables $\mathbf{z}$, PROC CORR checks all higher numbered diagonal elements associated with $\mathbf{z}$ for singularity. A variable is considered singular if the value of the corresponding diagonal element is less than $\varepsilon$ times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion $\varepsilon$ by using the SINGULAR= option. For Pearson partial correlations, a controlling variable $\mathbf{z}$ is considered singular if the $R^2$ for predicting this variable from the variables that are already partialled out exceeds $1-\varepsilon$. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the $R^2$ for predicting this variable from the controlling variables exceeds $1-\varepsilon$. When this happens, its associated diagonal element and all higher numbered elements in this row or column are set to zero.

After the Cholesky decomposition algorithm is performed on all rows associated with $\mathbf{z}$, the resulting matrix has the form

$$\begin{pmatrix} T_{zz} & T_{zy} \\ 0 & S_{yy.z} \end{pmatrix}$$

where $T_{zz}$ is an upper triangular matrix with $T'_{zz}\,T_{zz} = S_{zz}$ and $T'_{zz}\,T_{zy} = S_{zy}$.

If $S_{zz}$ is positive definite, then $T_{zy} = (T'_{zz})^{-1}\,S_{zy}$ and the partial CSSCP matrix $S_{yy.z}$ is identical to the matrix derived from the formula

$$S_{yy.z} \;=\; S_{yy} - S'_{zy}\,S_{zz}^{-1}\,S_{zy}$$

The partial variance-covariance matrix is calculated with the variance divisor (VARDEF= option). PROC CORR then uses the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix.

When a correlation matrix is positive definite, the resulting partial correlation between variables x and y after adjusting for a single variable z is identical to that obtained from the first-order partial correlation formula

$$r_{xy.z} \;=\; \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{\left(1-r_{xz}^2\right)\left(1-r_{yz}^2\right)}}$$

where $r_{xy}$, $r_{xz}$, and $r_{yz}$ are the appropriate correlations.

The formula for higher-order partial correlations is a straightforward extension of the preceding first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between x and y controlling for both z 1 and z 2 is identical to the second-order partial correlation formula

$$r_{xy.z_1 z_2} \;=\; \frac{r_{xy.z_1} - r_{xz_2.z_1}\,r_{yz_2.z_1}}{\sqrt{\left(1-r_{xz_2.z_1}^2\right)\left(1-r_{yz_2.z_1}^2\right)}}$$

where $r_{xy.z_1}$, $r_{xz_2.z_1}$, and $r_{yz_2.z_1}$ are first-order partial correlations among variables $x$, $y$, and $z_2$ given $z_1$.
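The following DATA step sketch evaluates the first-order formula for hypothetical values of the three correlations:

   data firstorder;
      rxy = 0.50; rxz = 0.40; ryz = 0.30;   /* hypothetical correlations */
      rxy_z = (rxy - rxz*ryz) / sqrt((1 - rxz**2) * (1 - ryz**2));
      put rxy_z=;   /* partial correlation of x and y given z */
   run;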

To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and Kendall's tau-b correlation matrix and uses the correlation formula. That is, the Spearman partial correlation is equivalent to the Pearson correlation between the residuals of the linear regression of the ranks of the two variables on the ranks of the partialled variables. Thus, if a PARTIAL statement is specified with the CORR=SPEARMAN option, the residuals of the ranks of the two variables are displayed in the plot. The partial tau-b correlations range from −1 to 1. However, the sampling distribution of this partial tau-b is unknown; therefore, the probability values are not available.

Probability Values

Probability values for the Pearson and Spearman partial correlations are computed by treating

$$t \;=\; r\,\sqrt{\frac{n-k-2}{1-r^2}}$$

as coming from a $t$ distribution with $(n-k-2)$ degrees of freedom, where $r$ is the partial correlation and $k$ is the number of variables being partialled out.

Fisher's z Transformation

For a sample correlation $r$ using a sample from a bivariate normal distribution with correlation $\rho = 0$, the statistic

$$t \;=\; r\,\sqrt{\frac{n-2}{1-r^2}}$$

has a Student's $t$ distribution with $(n-2)$ degrees of freedom.

With the monotone transformation of the correlation $r$ (Fisher 1921)

$$z_r \;=\; \tanh^{-1}(r) \;=\; \tfrac{1}{2}\,\ln\!\left(\frac{1+r}{1-r}\right)$$

the statistic $z_r$ has an approximate normal distribution with mean and variance

$$\mathrm{E}(z_r) \;\approx\; \zeta + \frac{\rho}{2(n-1)}, \qquad \mathrm{V}(z_r) \;\approx\; \frac{1}{n-3}$$

where $\zeta = \tanh^{-1}(\rho)$.

For the transformed $z_r$, the approximate variance $\mathrm{V}(z_r) = 1/(n-3)$ is independent of the correlation $\rho$. Furthermore, although the distribution of $z_r$ is not strictly normal, it tends to normality rapidly as the sample size increases, for any value of $\rho$ (Fisher 1970, pp. 200-201).

For the null hypothesis $H_0\colon \rho = \rho_0$, the $p$-values are computed by treating

$$z_r - \zeta_0 - \frac{\rho_0}{2(n-1)}$$

as a normal random variable with mean zero and variance $1/(n-3)$, where $\zeta_0 = \tanh^{-1}(\rho_0)$ (Fisher 1970, p. 207; Anderson 1984, p. 123).

Note that the bias adjustment, $\rho_0/(2(n-1))$, is always used when computing $p$-values under the null hypothesis $H_0\colon \rho = \rho_0$ in the CORR procedure.

The ALPHA= option in the FISHER option specifies the value $\alpha$ for the confidence level $1-\alpha$, the RHO0= option specifies the value $\rho_0$ in the hypothesis $H_0\colon \rho = \rho_0$, and the BIASADJ= option specifies whether the bias adjustment is to be used for the confidence limits.

The TYPE= option specifies the type of confidence limits. The TYPE=TWOSIDED option requests two-sided confidence limits and a $p$-value under the hypothesis $H_0\colon \rho = \rho_0$. For a one-sided confidence limit, the TYPE=LOWER option requests a lower confidence limit and a $p$-value under the hypothesis $H_0\colon \rho \le \rho_0$, and the TYPE=UPPER option requests an upper confidence limit and a $p$-value under the hypothesis $H_0\colon \rho \ge \rho_0$.
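For example, the following statements (with a hypothetical data set and variables) request two-sided confidence limits and a test of $H_0\colon \rho = 0.3$:

   proc corr data=test fisher(rho0=0.3 biasadj=yes type=twosided alpha=0.05);
      var x y;
   run;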

Confidence Limits for the Correlation

The confidence limits for the correlation $\rho$ are derived through the confidence limits for the parameter $\zeta$, with or without the bias adjustment.

Without a bias adjustment, confidence limits for $\zeta$ are computed by treating

$$z_r - \zeta$$

as having a normal distribution with mean zero and variance $1/(n-3)$.

That is, the two-sided confidence limits for $\zeta$ are computed as

$$l \;=\; z_r - z_{(1-\alpha/2)}\sqrt{\frac{1}{n-3}}, \qquad u \;=\; z_r + z_{(1-\alpha/2)}\sqrt{\frac{1}{n-3}}$$

where $z_{(1-\alpha/2)}$ is the $100(1-\alpha/2)$ percentage point of the standard normal distribution.

With a bias adjustment, confidence limits for $\zeta$ are computed by treating

$$z_r - \zeta - \mathrm{bias}(\rho)$$

as having a normal distribution with mean zero and variance $1/(n-3)$, where the bias adjustment function (Keeping 1962, p. 308) is

$$\mathrm{bias}(\rho) \;=\; \frac{\rho}{2(n-1)}$$

That is, the two-sided confidence limits for $\zeta$ are computed as

$$l \;=\; z_r - \mathrm{bias}(r) - z_{(1-\alpha/2)}\sqrt{\frac{1}{n-3}}$$

$$u \;=\; z_r - \mathrm{bias}(r) + z_{(1-\alpha/2)}\sqrt{\frac{1}{n-3}}$$

These computed confidence limits $l$ and $u$ are then transformed back to derive the confidence limits for the correlation $\rho$:

$$r_l \;=\; \tanh(l), \qquad r_u \;=\; \tanh(u)$$

Note that with a bias adjustment, the CORR procedure also displays the following correlation estimate:

$$r_{\mathrm{adj}} \;=\; \tanh\!\left(z_r - \mathrm{bias}(r)\right)$$
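The following DATA step sketch carries out these computations for hypothetical values of r, n, and alpha; PROBIT supplies the normal percentage point, and TANH transforms the limits back to the correlation scale:

   data fisherci;
      r = 0.55;  n = 40;  alpha = 0.05;      /* hypothetical inputs */
      zr   = 0.5 * log((1 + r) / (1 - r));   /* Fisher's z transformation of r */
      se   = 1 / sqrt(n - 3);
      bias = r / (2 * (n - 1));              /* bias adjustment bias(r) */
      zq   = probit(1 - alpha / 2);          /* standard normal percentage point */
      lower = tanh(zr - bias - zq * se);     /* lower confidence limit for rho */
      upper = tanh(zr - bias + zq * se);     /* upper confidence limit for rho */
      radj  = tanh(zr - bias);               /* bias-adjusted correlation estimate */
      put lower= upper= radj=;
   run;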

Applications of Fisher's z Transformation

Fisher (1970, p. 199) describes the following practical applications of the z transformation:

  • Testing whether a population correlation is equal to a given value

  • Testing for equality of two population correlations

  • Combining correlation estimates from different samples

To test if a population correlation $\rho_1$ from a sample of $n_1$ observations with sample correlation $r_1$ is equal to a given $\rho_0$, first apply the $z$ transformation to $r_1$ and $\rho_0$: $z_1 = \tanh^{-1}(r_1)$ and $\zeta_0 = \tanh^{-1}(\rho_0)$.

The $p$-value is then computed by treating

$$z_1 - \zeta_0 - \frac{\rho_0}{2(n_1-1)}$$

as a normal random variable with mean zero and variance $1/(n_1-3)$.

Assume that sample correlations $r_1$ and $r_2$ are computed from two independent samples of $n_1$ and $n_2$ observations, respectively. To test whether the two corresponding population correlations, $\rho_1$ and $\rho_2$, are equal, first apply the $z$ transformation to the two sample correlations: $z_1 = \tanh^{-1}(r_1)$ and $z_2 = \tanh^{-1}(r_2)$.

The $p$-value is derived under the null hypothesis of equal correlation. That is, the difference $z_1 - z_2$ is distributed as a normal random variable with mean zero and variance $1/(n_1-3) + 1/(n_2-3)$.

Assuming further that the two samples are from populations with identical correlation, a combined correlation estimate can be computed. The weighted average of the corresponding $z$ values is

$$\bar{z} \;=\; \frac{(n_1-3)\,z_1 + (n_2-3)\,z_2}{n_1+n_2-6}$$

where the weights are inversely proportional to their variances.

Thus, a combined correlation estimate is $\bar{r} = \tanh(\bar{z})$, and $\mathrm{V}(\bar{z}) = 1/(n_1+n_2-6)$. See Example 1.4 for further illustrations of these applications.

Note that this approach can be extended to include more than two samples.
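The following DATA step sketch (with hypothetical sample correlations and sizes) tests the equality of the two population correlations and computes the combined estimate:

   data compare;
      r1 = 0.60; n1 = 30;                    /* hypothetical sample 1 */
      r2 = 0.45; n2 = 50;                    /* hypothetical sample 2 */
      z1 = 0.5 * log((1 + r1) / (1 - r1));
      z2 = 0.5 * log((1 + r2) / (1 - r2));
      /* test of equal population correlations */
      z  = (z1 - z2) / sqrt(1/(n1 - 3) + 1/(n2 - 3));
      p  = 2 * (1 - probnorm(abs(z)));
      /* combined estimate: weights inversely proportional to the variances */
      zbar = ((n1 - 3)*z1 + (n2 - 3)*z2) / (n1 + n2 - 6);
      rbar = tanh(zbar);
      put p= rbar=;
   run;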

Cronbach's Coefficient Alpha

Analyzing latent constructs such as job satisfaction, motor ability, sensory recognition, or customer satisfaction requires instruments to accurately measure the constructs. Interrelated items may be summed to obtain an overall score for each participant. Cronbach's coefficient alpha estimates the reliability of this type of scale by determining the internal consistency of the test or the average correlation of items within the test (Cronbach 1951).

When a value is recorded, the observed value contains some degree of measurement error. Two sets of measurements on the same variable for the same individual may not have identical values. However, repeated measurements for a series of individuals will show some consistency. Reliability measures internal consistency from one set of measurements to another. The observed value $Y$ is divided into two components, a true value $T$ and a measurement error $E$. The measurement error is assumed to be independent of the true value, that is,

$$Y \;=\; T + E, \qquad \mathrm{Cov}(T, E) \;=\; 0$$

The reliability coefficient of a measurement test is defined as the squared correlation between the observed value $Y$ and the true value $T$, that is,

$$\rho^2(Y, T) \;=\; \frac{\mathrm{Cov}(Y,T)^2}{\mathrm{V}(Y)\,\mathrm{V}(T)} \;=\; \frac{\mathrm{V}(T)^2}{\mathrm{V}(Y)\,\mathrm{V}(T)} \;=\; \frac{\mathrm{V}(T)}{\mathrm{V}(Y)}$$

which is the proportion of the observed variance due to true differences among individuals in the sample. If $Y$ is the sum of several observed variables measuring the same feature, you can estimate $\mathrm{V}(T)$. Cronbach's coefficient alpha, based on a lower bound for $\mathrm{V}(T)$, is an estimate of the reliability coefficient.

Suppose $p$ variables are used with $Y_j = T_j + E_j$ for $j = 1, 2, \ldots, p$, where $Y_j$ is the observed value, $T_j$ is the true value, and $E_j$ is the measurement error. The measurement errors $(E_j)$ are independent of the true values $(T_j)$ and are also independent of each other. Let $Y = \sum_j Y_j$ be the total observed score and $T = \sum_j T_j$ be the total true score. Because

$$\mathrm{V}(T) \;=\; \sum_j \mathrm{V}(T_j) \;+\; \sum_{i \ne j} \mathrm{Cov}(T_i, T_j)$$

a lower bound for $\mathrm{V}(T)$ is given by

$$\frac{p}{p-1} \sum_{i \ne j} \mathrm{Cov}(T_i, T_j)$$

With $\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(T_i, T_j)$ for $i \ne j$, a lower bound for the reliability coefficient, $\mathrm{V}(T)/\mathrm{V}(Y)$, is then given by Cronbach's coefficient alpha:

$$\alpha \;=\; \frac{p}{p-1}\,\frac{\sum_{i \ne j} \mathrm{Cov}(Y_i, Y_j)}{\mathrm{V}(Y)} \;=\; \frac{p}{p-1}\left(1 - \frac{\sum_j \mathrm{V}(Y_j)}{\mathrm{V}(Y)}\right)$$

If the variances of the items vary widely, you can standardize the items to a standard deviation of 1 before computing the coefficient alpha. If the variables are dichotomous (0,1), the coefficient alpha is equivalent to the Kuder-Richardson 20 (KR-20) reliability measure.

When the correlation between each pair of variables is 1, the coefficient alpha has a maximum value of 1. With negative correlations between some variables, the coefficient alpha can have a value less than zero. The larger the overall alpha coefficient, the more likely it is that the items contribute to a reliable scale. Nunnally and Bernstein (1994) suggest 0.70 as an acceptable reliability coefficient; smaller reliability coefficients are seen as inadequate. However, this varies by discipline.

To determine how each item reflects the reliability of the scale, you calculate a coefficient alpha after deleting each variable independently from the scale. Cronbach's coefficient alpha from all variables except the $k$th variable is given by

$$\alpha_k \;=\; \frac{p-1}{p-2}\left(1 - \frac{\sum_{j \ne k} \mathrm{V}(Y_j)}{\mathrm{V}\!\left(\sum_{j \ne k} Y_j\right)}\right)$$

If the reliability coefficient increases after an item is deleted from the scale, you can assume that the item is not correlated highly with other items in the scale. Conversely, if the reliability coefficient decreases, you can assume that the item is highly correlated with other items in the scale. Refer to SAS Communications, Fourth Quarter 1994, for more information on how to interpret Cronbach's coefficient alpha.

Listwise deletion of observations with missing values is necessary to correctly calculate Cronbach s coefficient alpha. PROC CORR does not automatically use listwise deletion if you specify the ALPHA option. Therefore, you should use the NOMISS option if the data set contains missing values. Otherwise, PROC CORR prints a warning message indicating the need to use the NOMISS option with the ALPHA option.
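For example, the following statements (the data set survey and the items item1 through item10 are hypothetical) request coefficient alpha with listwise deletion:

   proc corr data=survey alpha nomiss;
      var item1-item10;
   run;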

Missing Values

PROC CORR excludes observations with missing values in the WEIGHT and FREQ variables. By default, PROC CORR uses pairwise deletion when observations contain missing values. PROC CORR includes all nonmissing pairs of values for each pair of variables in the statistical computations. Therefore, the correlation statistics may be based on different numbers of observations.

If you specify the NOMISS option, PROC CORR uses listwise deletion when a value of the VAR or WITH statement variable is missing. PROC CORR excludes all observations with missing values from the analysis. Therefore, the number of observations for each pair of variables is identical.

The PARTIAL statement always excludes the observations with missing values by automatically invoking the NOMISS option. With the NOMISS option, the data are processed more efficiently because fewer resources are needed. Also, the resulting correlation matrix is nonnegative definite.

In contrast, if the data set contains missing values for the analysis variables and the NOMISS option is not specified, the resulting correlation matrix may not be nonnegative definite. This leads to several statistical difficulties if you use the correlations as input to regression or other statistical procedures.

Output Tables

By default, PROC CORR prints a report that includes descriptive statistics and correlation statistics for each variable. The descriptive statistics include the number of observations with nonmissing values, the mean, the standard deviation, the minimum, and the maximum.

If a nonparametric measure of association is requested, the descriptive statistics include the median. Otherwise, the sample sum is included. If a Pearson partial correlation is requested, the descriptive statistics also include the partial variance and partial standard deviation.

If variable labels are available, PROC CORR labels the variables. If you specify the CSSCP, SSCP, or COV option, the appropriate sum-of-squares and crossproducts and covariance matrix appears at the top of the correlation report. If the data set contains missing values, PROC CORR prints additional statistics for each pair of variables. These statistics, calculated from the observations with nonmissing row and column variable values, may include

  • SSCP(W,V), uncorrected sum-of-squares and crossproducts

  • USS(W), uncorrected sum-of-squares for the row variable

  • USS(V), uncorrected sum-of-squares for the column variable

  • CSSCP(W,V), corrected sum-of-squares and crossproducts

  • CSS(W), corrected sum-of-squares for the row variable

  • CSS(V), corrected sum-of-squares for the column variable

  • COV(W,V), covariance

  • VAR(W), variance for the row variable

  • VAR(V), variance for the column variable

  • DF(W,V), divisor for calculating covariance and variances

For each pair of variables, PROC CORR prints the correlation coefficients, the number of observations used to calculate the coefficient, and the p-value.

If you specify the ALPHA option, PROC CORR prints Cronbach's coefficient alpha, the correlation between the variable and the total of the remaining variables, and Cronbach's coefficient alpha using the remaining variables for the raw variables and the standardized variables.

Output Data Sets

If you specify the OUTP=, OUTS=, OUTK=, or OUTH= option, PROC CORR creates an output data set containing statistics for Pearson correlation, Spearman correlation, Kendall's tau-b, or Hoeffding's D, respectively. By default, the output data set is a special data set type (TYPE=CORR) that many SAS/STAT procedures recognize, including PROC REG and PROC FACTOR. When you specify the NOCORR option and the COV, CSSCP, or SSCP option, use the TYPE= data set option to change the data set type to COV, CSSCP, or SSCP.
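For example, the following sketch (with a hypothetical data set and variables) stores the Pearson correlations in a TYPE=CORR data set and prints it; the variables it contains are described below:

   /* store the Pearson correlations for later use */
   proc corr data=test outp=pcorr noprint;
      var x1-x4;
   run;

   proc print data=pcorr;
   run;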

The output data set includes the following variables:

  • BY variables, which identify the BY group when using a BY statement

  • _TYPE_ variable, which identifies the type of observation

  • _NAME_ variable, which identifies the variable that corresponds to a given row of the correlation matrix

  • INTERCEPT variable, which identifies variable sums when specifying the SSCP option

  • VAR variables, which identify the variables listed in the VAR statement

You can use a combination of the _TYPE_ and _NAME_ variables to identify the contents of an observation. The _NAME_ variable indicates which row of the correlation matrix the observation corresponds to. The values of the _TYPE_ variable are

  • SSCP, uncorrected sums of squares and crossproducts

  • CSSCP, corrected sums of squares and crossproducts

  • COV, covariances

  • MEAN, mean of each variable

  • STD, standard deviation of each variable

  • N, number of nonmissing observations for each variable

  • SUMWGT, sum of the weights for each variable when using a WEIGHT statement

  • CORR, correlation statistics for each variable.

If you specify the SSCP option, the OUTP= data set includes an additional observation that contains intercept values. If you specify the ALPHA option, the OUTP= data set also includes observations with the following _TYPE_ values:

  • RAWALPHA, Cronbach's coefficient alpha for raw variables

  • STDALPHA, Cronbach's coefficient alpha for standardized variables

  • RAWALDEL, Cronbach's coefficient alpha for raw variables after deleting one variable

  • STDALDEL, Cronbach's coefficient alpha for standardized variables after deleting one variable

  • RAWCTDEL, the correlation between a raw variable and the total of the remaining raw variables

  • STDCTDEL, the correlation between a standardized variable and the total of the remaining standardized variables

If you use a PARTIAL statement, the statistics are calculated after the variables are partialled. If PROC CORR computes Pearson correlation statistics, MEAN equals zero and STD equals the partial standard deviation associated with the partial variance for the OUTP=, OUTK=, and OUTS= data sets. Otherwise, PROC CORR assigns missing values to MEAN and STD.

Determining Computer Resources

The only factor limiting the number of variables that you can analyze is the amount of available memory. The computer resources that PROC CORR requires depend on which statements and options you specify. To determine the computer resources, define the following variables:

  • N = number of observations in the data set

  • C = number of correlation types (C = 1, 2, 3, or 4)

  • V = number of VAR statement variables

  • W = number of WITH statement variables

  • P = number of PARTIAL statement variables

Furthermore, define the following variables:

  • T = V + W + P

  • L = T(T + 1)/2

For small N and large K, the CPU time varies as K for all types of correlations. For large N, the CPU time also depends on the type of correlation requested.

You can reduce CPU time by specifying NOMISS. With NOMISS, processing is much faster when most observations do not contain missing values. The options and statements you use in the procedure require different amounts of storage to process the data. For Pearson correlations, the amount of temporary storage needed (in bytes) is

   40T + 16L + 56T

The NOMISS option decreases the amount of temporary storage by 56T bytes, the FISHER option increases the storage by 24T bytes, the PARTIAL statement increases the storage by 12T bytes, and the ALPHA option increases the storage by 32V + 16 bytes.

The following example uses a PARTIAL statement, which excludes missing values.

   proc corr;
      var x1 x2;
      with y1 y2 y3;
      partial z1;
   run;

Therefore, using 40T + 16L + 56T + 12T, the minimum temporary storage equals 984 bytes (with T = 2 + 3 + 1 = 6 and L = T(T + 1)/2 = 21).

Using the SPEARMAN, KENDALL, or HOEFFDING option requires additional temporary storage for each observation, so for the most time-efficient processing the amount of temporary storage needed grows with N, the number of observations.

The following example requests Kendall's tau-b coefficients:

   proc corr kendall;
      var x1 x2 x3;
   run;

Here the minimum temporary storage includes a component proportional to N, the number of observations.

If this amount of temporary storage is not available, PROC CORR must process the data multiple times to compute all the statistics. This reduces the minimum temporary storage you need by 12(T − 2)N bytes. When this occurs, PROC CORR prints a note suggesting a larger memory region.

ODS Table Names

PROC CORR assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets.

Table 1.3: ODS Tables Produced with the PROC CORR Statement

   ODS Name             Description                                                        Option
   Cov                  Covariances                                                        COV
   CronbachAlpha        Coefficient alpha                                                  ALPHA
   CronbachAlphaDel     Coefficient alpha with deleted variable                            ALPHA
   Csscp                Corrected sums of squares and crossproducts                        CSSCP
   FisherPearsonCorr    Pearson correlation statistics using Fisher's z transformation    FISHER
   FisherSpearmanCorr   Spearman correlation statistics using Fisher's z transformation   FISHER SPEARMAN
   HoeffdingCorr        Hoeffding's D statistics                                           HOEFFDING
   KendallCorr          Kendall's tau-b coefficients                                       KENDALL
   PearsonCorr          Pearson correlations                                               PEARSON
   SimpleStats          Simple descriptive statistics
   SpearmanCorr         Spearman correlations                                              SPEARMAN
   Sscp                 Sums of squares and crossproducts                                  SSCP
   VarInformation       Variable information
Table 1.4: ODS Tables Produced with the PARTIAL Statement

   ODS Name                    Description                                                                Option
   FisherPearsonPartialCorr    Pearson partial correlation statistics using Fisher's z transformation    FISHER
   FisherSpearmanPartialCorr   Spearman partial correlation statistics using Fisher's z transformation   FISHER SPEARMAN
   PartialCsscp                Partial corrected sums of squares and crossproducts                       CSSCP
   PartialCov                  Partial covariances                                                       COV
   PartialKendallCorr          Partial Kendall tau-b coefficients                                        KENDALL
   PartialPearsonCorr          Partial Pearson correlations
   PartialSpearmanCorr         Partial Spearman correlations                                             SPEARMAN

ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the CORR procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.

To request these graphs, you must specify the ODS GRAPHICS statement in addition to the following options in the PROC CORR statement. For more information on the ODS GRAPHICS statement, refer to Chapter 15, Statistical Graphics Using ODS (SAS/STAT User's Guide).

PLOTS

PLOTS=MATRIX <(matrix-options)>

PLOTS=SCATTER <(scatter-options)>

PLOTS=(MATRIX <(matrix-options)> SCATTER <(scatter-options)>)

  • requests a scatter plot matrix for all variables, scatter plots for all pairs of variables, or both. If only the option keyword PLOTS is specified, the PLOTS=MATRIX option is used. When you specify the PLOTS option, the Pearson correlations will also be displayed.

  • You can specify the following with the PLOTS= option:

MATRIX <(matrix-options)>

  • requests a scatter plot matrix for all variables. That is, the procedure displays a symmetric matrix plot with the variables in the VAR list if a WITH statement is not specified. Otherwise, the procedure displays a rectangular matrix plot with the WITH variables down the side and the VAR variables across the top.

    The available matrix-options are:

  • NMAXVAR= n

    • specifies the maximum number of variables in the VAR list to be displayed in the matrix plot, where n ≥ 0. If you specify NMAXVAR=0, then the total number of variables in the VAR list is used and no restriction occurs. By default, NMAXVAR=5.

  • NMAXWITH= n

    • specifies the maximum number of variables in the WITH list to be displayed in the matrix plot, where n ≥ 0. If you specify NMAXWITH=0, then the total number of variables in the WITH list is used and no restriction occurs. By default, NMAXWITH=5.

SCATTER <(scatter-options)>

  • requests a scatter plot for each pair of variables. That is, the procedure displays a scatter plot for each pair of distinct variables from the VAR list if a WITH statement is not specified. Otherwise, the procedure displays a scatter plot for each pair of variables, one from the WITH list and the other from the VAR list.

  • The available scatter-options are:

  • ALPHA= numbers

    • specifies the α values for the confidence or prediction ellipses to be displayed in the scatter plots, where 0 < α < 1. For each α value specified, a (1 − α) confidence or prediction ellipse is created. By default, α = 0.05.

  • ELLIPSE=PREDICTION | MEAN | NONE

    • requests prediction ellipses for new observations (ELLIPSE=PREDICTION), confidence ellipses for the mean (ELLIPSE=MEAN), or no ellipses (ELLIPSE=NONE) to be created in the scatter plots. By default, ELLIPSE=PREDICTION.

  • NOINSET

    • suppresses the default inset of summary information for the scatter plot. The inset table is displayed next to the scatter plot and contains statistics such as the number of observations (NObs), the correlation, and the p-value (Prob > |r|).

  • NOLEGEND

    • suppresses the default legend for overlaid prediction or confidence ellipses. The legend table is displayed next to the scatter plot and identifies each ellipse displayed in the plot.

  • NMAXVAR= n

    • specifies the maximum number of variables in the VAR list to be displayed in the plots, where n ≥ 0. If you specify NMAXVAR=0, then the total number of variables in the VAR list is used and no restriction occurs. By default, NMAXVAR=5.

  • NMAXWITH= n

    • specifies the maximum number of variables in the WITH list to be displayed in the plots, where n ≥ 0. If you specify NMAXWITH=0, then the total number of variables in the WITH list is used and no restriction occurs. By default, NMAXWITH=5.

When the relationship between two variables is nonlinear or when outliers are present, the correlation coefficient may incorrectly estimate the strength of the relationship. Plotting the data enables you to verify the linear relationship and to identify the potential outliers.

The partial correlation between two variables, after controlling for variables in the PARTIAL statement, is the correlation between the residuals of the linear regression of the two variables on the partialled variables. Thus, if a PARTIAL statement is also specified, the residuals of the analysis variables are displayed in the scatter plot matrix and scatter plots.
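For example, the following statements (with a hypothetical data set and variables) display scatter plots with overlaid 95% and 99% confidence ellipses for the mean:

   ods graphics on;

   proc corr data=test plots=scatter(ellipse=mean alpha=.05 .01);
      var height weight;
   run;

   ods graphics off;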

Confidence and Prediction Ellipses

The CORR procedure optionally provides two types of ellipses for each pair of variables in a scatter plot. One is a confidence ellipse for the population mean, and the other is a prediction ellipse for a new observation. Both assume a bivariate normal distribution.

Let $\bar{Z}$ and $S$ be the sample mean and sample covariance matrix of a random sample of size $n$ from a bivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. The variable $\bar{Z}-\boldsymbol{\mu}$ is distributed as a bivariate normal variate with mean zero and covariance $(1/n)\Sigma$, and it is independent of $S$. Using Hotelling's $T^2$ statistic, which is defined as

$$T^2 \;=\; n\,(\bar{Z}-\boldsymbol{\mu})'\,S^{-1}\,(\bar{Z}-\boldsymbol{\mu})$$

a $100(1-\alpha)\%$ confidence ellipse for $\boldsymbol{\mu}$ is computed from the equation

$$\frac{n-2}{2(n-1)}\,T^2 \;\le\; F_{2,\,n-2}(1-\alpha)$$

where $F_{2,\,n-2}(1-\alpha)$ is the $(1-\alpha)$ critical value of an $F$ distribution with degrees of freedom 2 and $n-2$.

A prediction ellipse is a region for predicting a new observation in the population. It also approximates a region containing a specified percentage of the population.

Denote a new observation as the bivariate random variable $Z_{\mathrm{new}}$. The variable

$$Z_{\mathrm{new}} - \bar{Z}$$

is distributed as a bivariate normal variate with mean zero (the zero vector) and covariance $(1 + 1/n)\Sigma$, and it is independent of $S$. A $100(1-\alpha)\%$ prediction ellipse is then given by the equation

$$\frac{n(n-2)}{2(n^2-1)}\,(Z_{\mathrm{new}} - \bar{Z})'\,S^{-1}\,(Z_{\mathrm{new}} - \bar{Z}) \;\le\; F_{2,\,n-2}(1-\alpha)$$

The family of ellipses generated by different critical values of the F distribution has a common center (the sample mean) and common major and minor axis directions.

The shape of an ellipse depends on the aspect ratio of the plot. The ellipse indicates the correlation between the two variables if the variables are standardized (by dividing the variables by their respective standard deviations). In this situation, the ratio between the major and minor axis lengths is

$$\sqrt{\frac{1+|r|}{1-|r|}}$$

In particular, if $r = 0$, the ratio is 1, which corresponds to a circular confidence contour and indicates that the variables are uncorrelated. A larger value of the ratio indicates a larger positive or negative correlation between the variables.

ODS Graph Names

The CORR procedure assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 1.5.

Table 1.5: ODS Graphics Produced by PROC CORR

   ODS Graph Name   Plot Description                  Option           Statement
   ScatterPlot      Scatter plot                      PLOTS=SCATTER
   RecMatrixPlot    Rectangular scatter plot matrix   PLOTS=MATRIX     WITH
   SymMatrixPlot    Symmetric scatter plot matrix     PLOTS=MATRIX     (omit WITH)

To request these graphs, you must specify the ODS GRAPHICS statement in addition to the options and statements indicated in Table 1.5. For more information on the ODS GRAPHICS statement, refer to Chapter 15, Statistical Graphics Using ODS (SAS/STAT User's Guide).



