Details


Specifying the Sample Design

PROC SURVEYFREQ produces tables and statistics based on the sample design used to obtain the survey data. The procedure uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. See the section 'Statistical Computations' on page 4206 for details. This method is appropriate for all designs where the first-stage sample is selected with replacement, or where the first-stage sampling fraction is small, as it often is in practice.

PROC SURVEYFREQ can be used for single-stage designs or for multistage designs, with or without stratification, and with or without unequal weighting. You provide sample design information with the STRATA, CLUSTER, and WEIGHT statements, and with the RATE= or TOTAL= option in the PROC SURVEYFREQ statement.

When there are clusters, or PSUs, in the sample design, the procedure estimates variance from the variance among PSUs. For a multistage sample design, the variance estimation method depends only on the first stage of the sample design. So, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling.

Stratification

If your sample design is stratified at the first stage of sampling, use the STRATA statement to name variables that form the strata. The combinations of categories of STRATA variables define the strata in the sample, where strata are nonoverlapping subgroups that were sampled independently. If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. If you do not specify a STRATA statement, PROC SURVEYFREQ assumes there is no stratification at the first stage.

Clustering

If your sample design selects clusters at the first stage of sampling, use the CLUSTER statement to name variables that identify the first-stage clusters, or primary sampling units (PSUs). The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata.

If your sample design has clustering at multiple stages, you should specify only the first-stage clusters, or PSUs, in the CLUSTER statement. PROC SURVEYFREQ assumes that each cluster defined by the CLUSTER statement variables represents a PSU in the sample, and that each observation belongs to one PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.

Weighting

If your sample design includes unequal weighting, use the WEIGHT statement to name the variable that contains the sampling weights. If you do not specify a WEIGHT statement, PROC SURVEYFREQ assigns all observations a weight of 1. Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. See the section 'Missing Values' on page 4205 for more information.

Population Totals and Sampling Rates

If your analysis needs to include a finite population correction (fpc), you can input either the sampling rate or the population total using the RATE= option or the TOTAL= option in the PROC SURVEYFREQ statement. (You cannot specify both of these options in the same PROC SURVEYFREQ statement.) If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the study population. If you are specifying the TOTAL= option for a multistage design, you should input the total number of PSUs in the study population.

For a nonstratified sample design, or for a stratified sample design with the same sampling rate or the same population total in all strata, you should use the RATE=value option or the TOTAL=value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you can use the RATE=SAS-data-set option or the TOTAL=SAS-data-set option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option.

The secondary data set must contain all the stratification variables listed in the STRATA statement and all the variables in the BY statement. Furthermore, the BY groups must appear in the same order as in the primary data set. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL= SAS-data-set option, the secondary data set must have a variable named _TOTAL_ that contains the stratum population totals. Alternatively, if you specify the RATE= SAS-data-set option, the secondary data set must have a variable named _RATE_ that contains the stratum sampling rates. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of _TOTAL_ or _RATE_ for that stratum and ignores the rest.

The value in the RATE= option or the values of _RATE_ in the secondary data set must be nonnegative numbers. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYFREQ will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.
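The interpretation rule for rate values can be sketched in Python as follows. This is an illustration only, not the procedure's code: the function name `normalize_rate` is hypothetical, and rejecting values above 100 is an assumption of the sketch rather than documented behavior.

```python
def normalize_rate(value):
    """Return a RATE=-style value as a proportion between 0 and 1."""
    if value < 0:
        raise ValueError("a sampling rate must be nonnegative")
    if value <= 1:
        # Values up to 1 are read as proportions, so 1 means 100%, not 1%.
        return float(value)
    if value <= 100:
        # Values above 1 are read as percentages.
        return value / 100.0
    # Rejecting values above 100 is an assumption of this sketch.
    raise ValueError("a sampling rate cannot exceed 100%")
```

Under this reading, `normalize_rate(25)` and `normalize_rate(0.25)` describe the same 25% sampling rate.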

If you specify the TOTAL=value option, value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.

Domain Analysis

PROC SURVEYFREQ provides domain analysis through its multiway table capability. Domain analysis refers to the computation of statistics for subpopulations, or domains, in addition to the computation of statistics for the entire study population. Formation of these domains may be unrelated to the sample design, so the domain sample sizes may actually be random variables. Domain analysis takes this variability into account by using the entire sample when estimating the variance of domain estimates. Domain analysis is also known as subgroup analysis, subpopulation analysis, or subdomain analysis. For more information on domain analysis, refer to Lohr (1999), Cochran (1977), and Fuller et al. (1989).

To request domain analysis with PROC SURVEYFREQ, you should include the domain variable(s) in your TABLES statement request. For example, specifying DOMAIN*A*B in a TABLES statement produces separate two-way tables of A by B for each level of DOMAIN. If your domains are formed by more than one variable, you can specify DomainVariable_1*DomainVariable_2*A*B, for example, to obtain two-way tables of A by B for each domain formed by the different combinations of levels for DomainVariable_1 and DomainVariable_2.

Including the domain variables in a TABLES statement request gives a different analysis from that obtained by using a BY statement, which provides completely separate analyses of the BY groups. Although you can use the BY statement to analyze the data set by subgroups, doing so does not produce a valid domain analysis. The BY statement is appropriate only when the number of units in each subgroup is known with certainty; when the subgroup sample size is a random variable, include the domain variables in your TABLES statement request.

Missing Values

If an observation has a missing value or a nonpositive value for the WEIGHT variable, then PROC SURVEYFREQ excludes that observation from the analysis.

An observation is also excluded from the analysis if it has a missing value for any STRATA or CLUSTER variable, unless you specify the MISSING option in the PROC SURVEYFREQ statement. The MISSING option requests that the procedure treat missing values as a valid category for all categorical variables, which include strata variables, cluster variables, and classification or table variables.

Additionally, PROC SURVEYFREQ excludes an observation from a crosstabulation table (and any associated analyses) if that observation has a missing value for any of the table variables, unless you specify the MISSING option. When the procedure excludes observations with missing values from a table, it displays the total frequency of missing observations below that table. With the MISSING option, the procedure treats the missing values as a valid category and includes them in calculations of percentages and other statistics.

If all values in a stratum are excluded from the analysis of a table as missing values, then that stratum is called an empty stratum. Empty strata are not counted in the total number of strata for the table, which is used to determine the degrees of freedom for confidence limits and tests. Similarly, empty clusters and missing observations are not included in the total counts of clusters and observations used in the analysis of the table.

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS summary table that contains the number of (nonmissing) observations, strata, and clusters that are included in the analysis of the requested table. When there are missing observations, empty strata, or empty clusters for the requested table, then these numbers in the 'Table Summary' differ from the total number of observations, strata, and clusters that are present in the input data set and reported in the 'Data Summary.' See Example 68.3 on page 4236 for more information on the 'Table Summary.'

If you have missing values in your survey data for any reason (such as nonresponse), this can compromise the quality of your survey results. An observation without missing values is called a complete respondent, and an observation with missing values is called an incomplete respondent. If the complete respondents are different from the incomplete respondents with regard to a survey effect or outcome, then survey estimates will be biased and will not accurately represent the survey population. There are a variety of techniques in sample design and survey operations that can reduce nonresponse. Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEYFREQ. Refer to Cochran (1977), Kalton and Kasprzyk (1986), and Brick and Kalton (1996) for more details.

Statistical Computations

The SURVEYFREQ procedure uses the Taylor series expansion method to estimate standard errors of estimators of proportions for crosstabulation tables. For sample survey data, the proportion estimator is a ratio estimator formed from estimators of totals. For example, to estimate the proportion in a crosstabulation table cell, the procedure uses the ratio of the estimator of the cell total frequency to the estimator of the overall population total, where these totals are linear statistics computed from the survey data. The Taylor series expansion method obtains a first-order linear approximation for the ratio estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971, Fuller 1975).

When there are clusters, or PSUs, in the sample design, the procedure estimates variance from the variance among PSUs. When the design is stratified, the procedure combines stratum variance estimates to compute the overall variance estimate. For a multistage sample design, the variance estimation method depends only on the first stage of the sample design. So, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling. This variance estimation method assumes that the first-stage sample is drawn with replacement, or that the first-stage sampling fraction is small, as it often is in practice.

In addition to this required sample design information, you also need to specify the sampling weights for a valid analysis, if the weights are not equal. Quite often in complex surveys, respondents have unequal weights, which reflect unequal selection probabilities and adjustments for nonresponse.

For more information on the analysis of sample survey data, refer to Lohr (1999), Särndal, Swensson, and Wretman (1992), Lee, Forthofer, and Lorimor (1989), Cochran (1977), Kish (1965), and Hansen, Hurwitz, and Madow (1953).

Definitions and Notation

For a stratified clustered sample design, define the following:

  • $h = 1, 2, \ldots, H$ is the stratum number, with a total of $H$ strata

  • $i = 1, 2, \ldots, n_h$ is the cluster number within stratum $h$, with a total of $n_h$ sample clusters from stratum $h$

  • $j = 1, 2, \ldots, m_{hi}$ is the unit number within cluster $i$ of stratum $h$, with a total of $m_{hi}$ sample units from cluster $i$ of stratum $h$

  • $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample

  • $f_h$ is the first-stage sampling rate for stratum $h$

  • $W_{hij}$ is the sampling weight of unit $j$ in cluster $i$ of stratum $h$

The sampling rate $f_h$ is the fraction of first-stage units (PSUs) selected for the sample. You can specify the stratum sampling rates with the RATE= option. Or if you specify population totals with the TOTAL= option, PROC SURVEYFREQ computes $f_h$ as the ratio of stratum sample size to the stratum total, in terms of PSUs. See the section 'Population Totals and Sampling Rates' on page 4204 for details. If you do not specify the RATE= option or the TOTAL= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible and does not use a finite population correction when computing variances.

This notation is also applicable to other sample designs. For example, for a design without stratification, you can let $H = 1$; for a sample design without clustering, you can let $m_{hi} = 1$ for every $h$ and $i$, replacing clusters with individual sampling units.

For a two-way table representing the crosstabulation of two variables, define the following, where there are R levels of the row variable and C levels of the column variable:

  • $r = 1, 2, \ldots, R$ is the row number, with a total of $R$ rows

  • $c = 1, 2, \ldots, C$ is the column number, with a total of $C$ columns

  • $N_{rc}$ is the population total in row $r$ and column $c$

  • $N_{r\cdot} = \sum_{c=1}^{C} N_{rc}$ is the total in row $r$

  • $N_{\cdot c} = \sum_{r=1}^{R} N_{rc}$ is the total in column $c$

  • $N = \sum_{r=1}^{R} \sum_{c=1}^{C} N_{rc}$ is the overall total

  • $P_{rc} = N_{rc} / N$ is the population proportion in row $r$ and column $c$

  • $P_{r\cdot} = N_{r\cdot} / N$ is the proportion in row $r$

  • $P_{\cdot c} = N_{\cdot c} / N$ is the proportion in column $c$

  • $P^{r}_{rc} = N_{rc} / N_{r\cdot}$ is the row proportion for cell $(r, c)$

  • $P^{c}_{rc} = N_{rc} / N_{\cdot c}$ is the column proportion for cell $(r, c)$

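For a specified observation (identified by stratum, cluster, and unit number within the cluster), define the following function to indicate whether that observation belongs to cell $(r, c)$, that is, row $r$ and column $c$ of the two-way table, for $r = 1, 2, \ldots, R$ and $c = 1, 2, \ldots, C$:

$$\delta_{hij}(r, c) = \begin{cases} 1 & \text{if observation } (hij) \text{ is in cell } (r, c) \\ 0 & \text{otherwise} \end{cases}$$

Similarly, define the following functions to indicate the observation's row classification and the observation's column classification.

$$\delta_{hij}(r\,\cdot) = \begin{cases} 1 & \text{if observation } (hij) \text{ is in row } r \\ 0 & \text{otherwise} \end{cases} \qquad \delta_{hij}(\cdot\,c) = \begin{cases} 1 & \text{if observation } (hij) \text{ is in column } c \\ 0 & \text{otherwise} \end{cases}$$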

Totals

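PROC SURVEYFREQ estimates population frequency totals for the specified crosstabulation tables, including totals for two-way table cells, rows, columns, and overall totals. The procedure computes the estimate of the total frequency in table cell $(r, c)$ as the weighted frequency sum

$$\hat{N}_{rc} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c)\, W_{hij}$$

Similarly, PROC SURVEYFREQ computes estimates of the row totals, column totals, and overall totals as

$$\hat{N}_{r\cdot} = \sum_{c=1}^{C} \hat{N}_{rc}, \qquad \hat{N}_{\cdot c} = \sum_{r=1}^{R} \hat{N}_{rc}, \qquad \hat{N} = \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{N}_{rc}$$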

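The estimators of totals are linear sample statistics, and so their variances can be estimated directly, without the Taylor series approximation that is used for proportions. PROC SURVEYFREQ estimates the variance of the total frequency in table cell $(r, c)$ as

$$\widehat{\mathrm{Var}}(\hat{N}_{rc}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\hat{N}_{rc})$$

where, if $n_h > 1$,

$$\widehat{\mathrm{Var}}_h(\hat{N}_{rc}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( n^{hi}_{rc} - \bar{n}^{h}_{rc} \right)^2$$

$$n^{hi}_{rc} = \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c)\, W_{hij}, \qquad \bar{n}^{h}_{rc} = \frac{1}{n_h} \sum_{i=1}^{n_h} n^{hi}_{rc}$$

and, if $n_h = 1$,

$$\widehat{\mathrm{Var}}_h(\hat{N}_{rc}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard deviation of the total is computed as

$$\mathrm{StdDev}(\hat{N}_{rc}) = \sqrt{\widehat{\mathrm{Var}}(\hat{N}_{rc})}$$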

The variances and standard deviations are computed in a similar manner for row totals, column totals, and overall table totals.
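The following Python sketch illustrates the kind of computation described in this section; it is not PROC SURVEYFREQ's implementation. The function name and the record layout are hypothetical. The sketch assumes the usual between-PSU variance form, in which each stratum with more than one PSU contributes n_h(1 - f_h)/(n_h - 1) times the sum of squared deviations of the weighted PSU cell totals from their stratum mean. It further assumes that a single-PSU stratum contributes zero unless every stratum has a single PSU, in which case the variance is missing.

```python
from collections import defaultdict


def cell_total_and_variance(records, cell, rates=None):
    """Estimate a cell total and its variance from between-PSU variation.

    records: list of (stratum, cluster, weight, row, col) tuples.
    cell:    (row, col) pair that identifies the table cell.
    rates:   optional {stratum: f_h}; a missing stratum means f_h = 0.
    """
    rates = rates or {}
    # Weighted cell total within each PSU, grouped by stratum.
    psu_sums = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, weight, row, col in records:
        psu_sums[stratum][cluster] += weight if (row, col) == cell else 0.0

    total = sum(sum(psus.values()) for psus in psu_sums.values())

    if all(len(psus) == 1 for psus in psu_sums.values()):
        return total, None  # no stratum has two PSUs: variance is missing

    variance = 0.0
    for stratum, psus in psu_sums.items():
        n_h = len(psus)
        if n_h == 1:
            continue  # a single-PSU stratum contributes zero
        f_h = rates.get(stratum, 0.0)
        mean = sum(psus.values()) / n_h
        ss = sum((x - mean) ** 2 for x in psus.values())
        variance += n_h * (1.0 - f_h) / (n_h - 1) * ss
    return total, variance
```

A PSU with no observations in the cell still enters the computation with a PSU total of zero, which is why the loop adds 0.0 rather than skipping the record.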

Covariance of Totals

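PROC SURVEYFREQ estimates the covariance between total frequency estimates for table cells $(r, c)$ and $(a, b)$ as

$$\widehat{\mathrm{Cov}}(\hat{N}_{rc}, \hat{N}_{ab}) = \sum_{h=1}^{H} \left\{ \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( n^{hi}_{rc} - \bar{n}^{h}_{rc} \right) \left( n^{hi}_{ab} - \bar{n}^{h}_{ab} \right) \right\}$$

The estimated covariance matrix of the table cell totals is an $RC \times RC$ matrix, which contains the pairwise table cell covariances $\widehat{\mathrm{Cov}}(\hat{N}_{rc}, \hat{N}_{ab})$, for $r = 1, \ldots, R$; $c = 1, \ldots, C$; $a = 1, \ldots, R$; and $b = 1, \ldots, C$.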

Proportions

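PROC SURVEYFREQ computes the estimate of the proportion in table cell $(r, c)$ as the ratio of the estimated total for the table cell to the estimated overall total,

$$\hat{P}_{rc} = \hat{N}_{rc} / \hat{N}$$

Using the Taylor series expansion method, PROC SURVEYFREQ estimates the variance of this proportion estimate as

$$\widehat{\mathrm{Var}}(\hat{P}_{rc}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\hat{P}_{rc})$$

where, if $n_h > 1$,

$$\widehat{\mathrm{Var}}_h(\hat{P}_{rc}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( e^{hi}_{rc} - \bar{e}^{h}_{rc} \right)^2$$

$$e^{hi}_{rc} = \frac{1}{\hat{N}} \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r, c) - \hat{P}_{rc} \right) W_{hij}, \qquad \bar{e}^{h}_{rc} = \frac{1}{n_h} \sum_{i=1}^{n_h} e^{hi}_{rc}$$

and, if $n_h = 1$,

$$\widehat{\mathrm{Var}}_h(\hat{P}_{rc}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard error of the proportion is computed as

$$\mathrm{StdErr}(\hat{P}_{rc}) = \sqrt{\widehat{\mathrm{Var}}(\hat{P}_{rc})}$$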

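Similarly, the estimate of the proportion in row $r$ is

$$\hat{P}_{r\cdot} = \hat{N}_{r\cdot} / \hat{N}$$

and its variance estimate is

$$\widehat{\mathrm{Var}}(\hat{P}_{r\cdot}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\hat{P}_{r\cdot})$$

where, if $n_h > 1$,

$$\widehat{\mathrm{Var}}_h(\hat{P}_{r\cdot}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( e^{hi}_{r\cdot} - \bar{e}^{h}_{r\cdot} \right)^2$$

$$e^{hi}_{r\cdot} = \frac{1}{\hat{N}} \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r\,\cdot) - \hat{P}_{r\cdot} \right) W_{hij}, \qquad \bar{e}^{h}_{r\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} e^{hi}_{r\cdot}$$

and, if $n_h = 1$,

$$\widehat{\mathrm{Var}}_h(\hat{P}_{r\cdot}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard error of the proportion in row $r$ is computed as

$$\mathrm{StdErr}(\hat{P}_{r\cdot}) = \sqrt{\widehat{\mathrm{Var}}(\hat{P}_{r\cdot})}$$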

Computations for the proportion in column c are done in the same way.
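As a rough illustration of the Taylor linearization for a cell proportion, the following Python sketch builds a linearized value for each PSU and then applies a between-PSU variance formula within strata. It is a sketch under stated assumptions, not the procedure's code: the function name and record layout are hypothetical, the linearized PSU value is taken to be the weighted sum of (indicator minus estimated proportion) divided by the estimated overall total, and a stratum with a single PSU is assumed to contribute zero unless all strata have a single PSU.

```python
from collections import defaultdict
from math import sqrt


def cell_proportion_and_std_err(records, cell, rates=None):
    """Estimate a cell proportion and its Taylor-linearized standard error.

    records: list of (stratum, cluster, weight, row, col) tuples.
    cell:    (row, col) pair that identifies the table cell.
    rates:   optional {stratum: f_h}; a missing stratum means f_h = 0.
    """
    rates = rates or {}
    n_hat = sum(w for _, _, w, _, _ in records)
    n_hat_rc = sum(w for _, _, w, r, c in records if (r, c) == cell)
    p_hat = n_hat_rc / n_hat

    # Linearized PSU values, grouped by stratum.
    e = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, r, c in records:
        delta = 1.0 if (r, c) == cell else 0.0
        e[stratum][cluster] += (delta - p_hat) * w / n_hat

    if all(len(psus) == 1 for psus in e.values()):
        return p_hat, None  # no stratum has two PSUs: variance is missing

    variance = 0.0
    for stratum, psus in e.items():
        n_h = len(psus)
        if n_h == 1:
            continue  # a single-PSU stratum contributes zero
        mean = sum(psus.values()) / n_h
        ss = sum((x - mean) ** 2 for x in psus.values())
        variance += n_h * (1.0 - rates.get(stratum, 0.0)) / (n_h - 1) * ss
    return p_hat, sqrt(variance)
```

Replacing the cell indicator with a row or column indicator gives the corresponding sketch for a row or column proportion.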

Row and Column Proportions

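PROC SURVEYFREQ computes the estimate of the row proportion for table cell $(r, c)$ as the ratio of the estimated total for the table cell to the estimated total for row $r$,

$$\hat{P}^{r}_{rc} = \hat{N}_{rc} / \hat{N}_{r\cdot}$$

Again using the Taylor series expansion method, PROC SURVEYFREQ estimates the variance of this row proportion estimate as

$$\widehat{\mathrm{Var}}(\hat{P}^{r}_{rc}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\hat{P}^{r}_{rc})$$

where, if $n_h > 1$,

$$\widehat{\mathrm{Var}}_h(\hat{P}^{r}_{rc}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( g^{hi}_{rc} - \bar{g}^{h}_{rc} \right)^2$$

$$g^{hi}_{rc} = \frac{1}{\hat{N}_{r\cdot}} \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r, c) - \hat{P}^{r}_{rc}\, \delta_{hij}(r\,\cdot) \right) W_{hij}, \qquad \bar{g}^{h}_{rc} = \frac{1}{n_h} \sum_{i=1}^{n_h} g^{hi}_{rc}$$

and, if $n_h = 1$,

$$\widehat{\mathrm{Var}}_h(\hat{P}^{r}_{rc}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard error of the row proportion is computed as

$$\mathrm{StdErr}(\hat{P}^{r}_{rc}) = \sqrt{\widehat{\mathrm{Var}}(\hat{P}^{r}_{rc})}$$

Similarly, PROC SURVEYFREQ estimates the column proportion for table cell $(r, c)$ as the ratio of the estimated total for the table cell to the estimated total for column $c$,

$$\hat{P}^{c}_{rc} = \hat{N}_{rc} / \hat{N}_{\cdot c}$$

The variance estimate for the column proportion is computed as described above for the row proportion, but with

$$g^{hi}_{rc} = \frac{1}{\hat{N}_{\cdot c}} \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r, c) - \hat{P}^{c}_{rc}\, \delta_{hij}(\cdot\,c) \right) W_{hij}$$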

Confidence Limits

If you specify the CL option in the TABLES statement, PROC SURVEYFREQ computes confidence limits for the proportions in the frequency and crosstabulation tables. The confidence coefficient is determined according to the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits.

For the proportion in table cell $(r, c)$, the confidence limits are computed as

$$\hat{P}_{rc} \pm \left( t_{df,\,\alpha/2} \times \mathrm{StdErr}(\hat{P}_{rc}) \right)$$

where $\hat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, $\mathrm{StdErr}(\hat{P}_{rc})$ is the standard error of the estimate, and $t_{df,\,\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of the t distribution with $df$ degrees of freedom calculated as described in the 'Degrees of Freedom' section on page 4214. The confidence limits for row proportions and column proportions are computed similarly to the confidence limits for table cell proportions.

If you specify the CLWT option in the TABLES statement, PROC SURVEYFREQ computes confidence limits for the weighted frequencies, or totals, in the crosstabulation tables.

For the total in table cell $(r, c)$, the confidence limits are computed as

$$\hat{N}_{rc} \pm \left( t_{df,\,\alpha/2} \times \mathrm{StdErr}(\hat{N}_{rc}) \right)$$

where $\hat{N}_{rc}$ is the estimate of the total frequency in table cell $(r, c)$, $\mathrm{StdErr}(\hat{N}_{rc})$ is the standard error of the estimate, and $t_{df,\,\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of the t distribution with $df$ degrees of freedom calculated as described in the 'Degrees of Freedom' section on page 4214. The confidence limits for row totals, column totals, and the overall total are computed similarly to the confidence limits for table cell totals.
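The confidence limit arithmetic is the same for proportions and totals: the estimate plus or minus a t percentile times the standard error. The short Python sketch below shows that step only. The function name is hypothetical, and the t percentile is passed in as an argument because the Python standard library does not provide a t distribution.

```python
def confidence_limits(estimate, std_err, t_value):
    """Return (lower, upper) limits: estimate +/- t_value * std_err."""
    half_width = t_value * std_err
    return estimate - half_width, estimate + half_width
```

For example, with 16 degrees of freedom and 95% limits the t percentile is roughly 2.12, so an estimated proportion of 0.40 with standard error 0.05 gives limits of about 0.294 and 0.506.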

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS summary table that contains the number of (nonmissing) observations, strata, and clusters that are included in the analysis of the requested table. When you request confidence limits, this table also contains the degrees of freedom $df$ and the value of $t_{df,\,\alpha/2}$ used to compute the confidence limits. See Example 68.3 on page 4236 for more information on the table summary.

Degrees of Freedom

To compute confidence limits for proportions and totals, PROC SURVEYFREQ uses the $100(1 - \alpha/2)$ percentile from the t distribution with $df$ degrees of freedom. PROC SURVEYFREQ calculates the degrees of freedom for t as the number of clusters minus the number of strata. If there are no clusters, then $df$ equals the number of observations minus the number of strata. If the design is not stratified, then $df$ equals the number of clusters minus one. If missing values or missing weights are present in the data, the number of strata, the number of observations, and the number of clusters are counted based on the observations in nonempty strata. See the section 'Missing Values' on page 4205 for details.
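The degrees-of-freedom rule can be written as a small Python function. This is only a sketch with a hypothetical name and signature; treating a design with neither clusters nor strata as observations minus one is an inference from combining the two stated special cases, not something the text spells out.

```python
def degrees_of_freedom(n_obs, n_strata=None, n_clusters=None):
    """Degrees of freedom: (clusters or observations) minus (strata or 1).

    Counts are assumed to exclude empty strata, empty clusters, and
    observations dropped for missing values.
    """
    units = n_clusters if n_clusters is not None else n_obs
    strata = n_strata if n_strata is not None else 1
    return units - strata
```

For instance, 20 clusters in 4 strata give 16 degrees of freedom, while 100 unclustered observations in 4 strata give 96.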

For the Wald F statistics, PROC SURVEYFREQ also calculates the denominator degrees of freedom as the number of clusters minus the number of strata. Alternatively you can specify the denominator degrees of freedom for these tests with the DDF= option in the TABLES statement. See the section 'Wald Chi-Square Test' on page 4221 and the section 'Wald Log-Linear Chi-Square Test' on page 4223 for details.

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS summary table that contains the number of (nonmissing) observations, strata, and clusters that are included in the analysis of the requested table. When you request confidence limits or chi-square tests, this table also contains the degrees of freedom $df$. See Example 68.3 on page 4236 for more information on the table summary.

Coefficient of Variation

If you specify the CV option in the TABLES statement, PROC SURVEYFREQ computes the coefficients of variation for the proportion estimates in the frequency and crosstabulation tables. The coefficient of variation is the ratio of the standard error to the estimate.

For the proportion in table cell $(r, c)$, the coefficient of variation is computed as

$$\mathrm{CV}(\hat{P}_{rc}) = \mathrm{StdErr}(\hat{P}_{rc}) / \hat{P}_{rc}$$

where $\hat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, and $\mathrm{StdErr}(\hat{P}_{rc})$ is the standard error of the estimate. The coefficients of variation for row proportions and column proportions are computed similarly.

If you specify the CVWT option in the TABLES statement, PROC SURVEYFREQ computes the coefficients of variation for the weighted frequencies, or estimated totals, in the crosstabulation tables. For the total in table cell $(r, c)$, the coefficient of variation is computed as

$$\mathrm{CV}(\hat{N}_{rc}) = \mathrm{StdErr}(\hat{N}_{rc}) / \hat{N}_{rc}$$

where $\hat{N}_{rc}$ is the estimate of the total in table cell $(r, c)$, and $\mathrm{StdErr}(\hat{N}_{rc})$ is the standard error of the estimate. The coefficients of variation for row totals, column totals, and the overall total are computed similarly.

Design Effect

If you specify the DEFF option in the TABLES statement, PROC SURVEYFREQ computes design effects for the overall proportion estimates in the frequency and crosstabulation tables. The design effect for an estimate is the ratio of the actual variance (estimated based on the sample design) to the variance of a simple random sample with the same number of observations. Refer to Lohr (1999) and Kish (1965).

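The design effect for the proportion in table cell $(r, c)$ is computed as

$$\mathrm{DEFF}(\hat{P}_{rc}) = \frac{\widehat{\mathrm{Var}}(\hat{P}_{rc})}{\mathrm{Var}_{SRS}(\hat{P}_{rc})} = \frac{\widehat{\mathrm{Var}}(\hat{P}_{rc})}{(1 - f)\, \hat{P}_{rc} (1 - \hat{P}_{rc}) / (n - 1)}$$

where $\hat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, $\widehat{\mathrm{Var}}(\hat{P}_{rc})$ is the variance of the estimate, $f$ is the overall sampling fraction, and $n$ is the number of observations in the sample.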

PROC SURVEYFREQ determines the value of $f$, the overall sampling fraction, based on the RATE= and TOTAL= options. If you do not specify either of these options, then PROC SURVEYFREQ assumes the value of $f$ is negligible and does not use a finite population correction in the analysis, as described in the section 'Population Totals and Sampling Rates' on page 4204. If you specify RATE=value, then PROC SURVEYFREQ uses this value for the overall sampling fraction $f$. If you specify TOTAL=value, then PROC SURVEYFREQ computes $f$ as the ratio of the number of PSUs in the sample to the specified total.

If you specify stratum sampling rates with the RATE= SAS-data-set option, then PROC SURVEYFREQ computes stratum totals based on these stratum sampling rates and the number of sample PSUs in each stratum. The procedure sums the stratum totals to form the overall total, and computes f as the ratio of the number of sample PSUs to the overall total. Alternatively, if you specify stratum totals with the TOTAL= SAS-data-set option, then PROC SURVEYFREQ sums these totals to compute the overall total. The overall sampling fraction f is then computed as the ratio of the number of sample PSUs to the overall total.
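The following Python sketch shows one way to derive the overall sampling fraction from stratum-level inputs and then form a design effect. The function names and dictionary layout are hypothetical. The simple random sampling variance used in the denominator, (1 - f) p (1 - p) / (n - 1), is an assumption of this sketch, and stratum rates are assumed to be strictly positive proportions.

```python
def overall_sampling_fraction(sample_psus, stratum_rates=None, stratum_totals=None):
    """Overall fraction f = (sample PSUs) / (overall PSU total).

    sample_psus:    {stratum: number of sample PSUs}
    stratum_rates:  optional {stratum: f_h}, as proportions greater than 0
    stratum_totals: optional {stratum: PSU total in the population}
    """
    n_psus = sum(sample_psus.values())
    if stratum_totals is not None:
        overall_total = sum(stratum_totals.values())
    elif stratum_rates is not None:
        # Back out each stratum total from its rate and sample PSU count.
        overall_total = sum(n_h / stratum_rates[h] for h, n_h in sample_psus.items())
    else:
        return 0.0  # neither input given: f is treated as negligible
    return n_psus / overall_total


def design_effect(var_p, p, n_obs, f=0.0):
    """Ratio of the design-based variance to an SRS variance for a proportion."""
    srs_variance = (1.0 - f) * p * (1.0 - p) / (n_obs - 1)
    return var_p / srs_variance
```

With 2 sample PSUs at rate 0.2 and 5 sample PSUs at rate 0.25, the implied stratum totals are 10 and 20, so the overall fraction is 7/30.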

Expected Weighted Frequency

If you specify the EXPECTED option in the TABLES statement, PROC SURVEYFREQ displays expected weighted frequencies for the table cells in two-way tables. The expected weighted frequencies are computed under the null hypothesis that the row and column variables are independent, as

$$E_{rc} = \hat{N}_{r\cdot}\, \hat{N}_{\cdot c} / \hat{N}$$

where $\hat{N}_{r\cdot}$ is the estimated total for row $r$, $\hat{N}_{\cdot c}$ is the estimated total for column $c$, and $\hat{N}$ is the estimated overall total. Equivalently,

$$E_{rc} = \hat{P}_{r\cdot}\, \hat{P}_{\cdot c}\, \hat{N}$$

These expected values are used in the design-based chi-square tests of independence, as described in the section 'Rao-Scott Chi-Square Test' and the section 'Wald Chi-Square Test' on page 4221.
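As a small worked illustration, the Python sketch below computes expected weighted frequencies under independence as (row total times column total) divided by the overall total. The function name and the nested-list table layout are hypothetical choices for this example.

```python
def expected_weighted_frequencies(totals):
    """Expected cell totals under independence for a two-way table.

    totals: list of rows, each a list of estimated (weighted) cell totals.
    """
    row_totals = [sum(row) for row in totals]
    col_totals = [sum(col) for col in zip(*totals)]
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]
```

For the table with rows (10, 20) and (30, 40), the row totals are 30 and 70, the column totals are 40 and 60, and the expected values are (12, 18) and (28, 42).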

Rao-Scott Chi-Square Test

The Rao-Scott chi-square test is a design-adjusted version of the Pearson chi-square test, which involves differences between observed and expected frequencies. For two-way tables, the null hypothesis for this test is no association between the row and column variables. For one-way tables, the Rao-Scott chi-square tests the null hypothesis of equal proportions, or you can specify null proportions for one-way tables with the TESTP= option.

Two forms of the design correction are available for the Rao-Scott tests. One form of the design correction uses the proportion estimates, and you request the corresponding Rao-Scott chi-square test with the CHISQ option. The other form of the design correction uses the null hypothesis proportions. You request this test, called the Rao-Scott modified chi-square test, with the CHISQ1 option.

Refer to Lohr (1999), Thomas, Singh, and Roberts (1996), and Rao and Scott (1981, 1984, 1987) for details on design-adjusted chi-square tests.

Two-Way Tables

The Rao-Scott chi-square statistic is computed from the Pearson chi-square statistic and a design correction based on the design effects of the proportions. Under the null hypothesis of no association between the row and column variables, this statistic approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom. An F approximation is also given.

The Rao-Scott chi-square is computed as

$$Q_{RS} = Q_P / D$$

where $D$ is the design correction described in the 'Design Correction for Two-Way Tables' section on page 4217, and $Q_P$ is the Pearson chi-square based on the estimated totals,

$$Q_P = \frac{n}{\hat{N}} \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(\hat{N}_{rc} - E_{rc})^2}{E_{rc}}$$

where $\hat{N}_{rc}$ is the estimated total for table cell $(r, c)$, and $E_{rc}$ is the expected total for cell $(r, c)$ under the null hypothesis of no association,

$$E_{rc} = \hat{N}_{r\cdot}\, \hat{N}_{\cdot c} / \hat{N}$$

Under the null hypothesis of no association, the Rao-Scott chi-square $Q_{RS}$ approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$F = \frac{Q_{RS}}{(R - 1)(C - 1)}$$

which has an F distribution with $(R - 1)(C - 1)$ and $\kappa (R - 1)(C - 1)$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section 'Degrees of Freedom' on page 4214.
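The steps for the two-way statistic can be sketched in Python as follows. This is a hedged illustration rather than the procedure's algorithm: the function name is hypothetical, the design correction is taken as a given input, and the Pearson statistic is assumed to be the usual sum of squared differences between estimated and expected totals divided by expected totals, rescaled by the ratio of the sample size to the estimated overall total.

```python
def rao_scott_two_way(totals, n_obs, design_correction):
    """Return (Q_P, Q_RS, F) for a two-way table of estimated totals.

    totals:            list of rows of estimated (weighted) cell totals
    n_obs:             number of observations in the sample
    design_correction: the design correction D, supplied by the caller
    """
    n_rows, n_cols = len(totals), len(totals[0])
    row_totals = [sum(row) for row in totals]
    col_totals = [sum(col) for col in zip(*totals)]
    overall = sum(row_totals)

    pearson_sum = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            expected = row_totals[r] * col_totals[c] / overall
            pearson_sum += (totals[r][c] - expected) ** 2 / expected

    q_p = n_obs / overall * pearson_sum        # Pearson chi-square on totals
    q_rs = q_p / design_correction             # design-adjusted chi-square
    f_stat = q_rs / ((n_rows - 1) * (n_cols - 1))
    return q_p, q_rs, f_stat
```

A design correction of 1 leaves the Pearson statistic unchanged, which corresponds to a design that behaves like simple random sampling.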

Design Correction for Two-Way Tables

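If you specify the CHISQ option, the design correction is computed from the estimated proportions, as

$$D = \frac{ \sum_{r} \sum_{c} (1 - \hat{P}_{rc})\, \mathrm{DEFF}(\hat{P}_{rc}) - \sum_{r} (1 - \hat{P}_{r\cdot})\, \mathrm{DEFF}(\hat{P}_{r\cdot}) - \sum_{c} (1 - \hat{P}_{\cdot c})\, \mathrm{DEFF}(\hat{P}_{\cdot c}) }{ (R - 1)(C - 1) }$$

where

$$\mathrm{DEFF}(\hat{P}_{rc}) = \frac{\widehat{\mathrm{Var}}(\hat{P}_{rc})}{\mathrm{Var}_{SRS}(\hat{P}_{rc})} = \frac{\widehat{\mathrm{Var}}(\hat{P}_{rc})}{(1 - f)\, \hat{P}_{rc} (1 - \hat{P}_{rc}) / (n - 1)}$$

as described in the section 'Design Effect' on page 4215. Here $\hat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, $\widehat{\mathrm{Var}}(\hat{P}_{rc})$ is the variance of the estimate, $f$ is the overall sampling fraction, and $n$ is the number of observations in the sample. $\mathrm{DEFF}(\hat{P}_{r\cdot})$, the design effect for the estimate of the proportion in row $r$, and $\mathrm{DEFF}(\hat{P}_{\cdot c})$, the design effect for the estimate of the proportion in column $c$, are computed similarly.

If you specify the CHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis cell proportions, computed as the product of the corresponding estimated row and column proportions.

$$D_0 = \frac{ \sum_{r} \sum_{c} (1 - P^{0}_{rc})\, \mathrm{DEFF}_0(\hat{P}_{rc}) - \sum_{r} (1 - \hat{P}_{r\cdot})\, \mathrm{DEFF}(\hat{P}_{r\cdot}) - \sum_{c} (1 - \hat{P}_{\cdot c})\, \mathrm{DEFF}(\hat{P}_{\cdot c}) }{ (R - 1)(C - 1) }$$

where

$$P^{0}_{rc} = \hat{P}_{r\cdot} \times \hat{P}_{\cdot c}$$

and

$$\mathrm{DEFF}_0(\hat{P}_{rc}) = \frac{\widehat{\mathrm{Var}}(\hat{P}_{rc})}{\mathrm{Var}_{SRS}(P^{0}_{rc})} = \frac{\widehat{\mathrm{Var}}(\hat{P}_{rc})}{(1 - f)\, P^{0}_{rc} (1 - P^{0}_{rc}) / (n - 1)}$$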
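A design correction of this kind can be sketched in Python as shown below. The sketch assumes the first-order Rao-Scott form, in which the correction is the sum over cells of (1 - cell proportion) times the cell design effect, minus the corresponding sums over rows and over columns, all divided by (R - 1)(C - 1). The function name and argument layout are hypothetical, and the design effects are taken as inputs rather than computed here.

```python
def design_correction_two_way(p, deff_cell, deff_row, deff_col):
    """First-order design correction for a two-way table.

    p:         R x C nested list of estimated cell proportions
    deff_cell: R x C nested list of cell design effects
    deff_row:  list of R design effects for the row proportions
    deff_col:  list of C design effects for the column proportions
    """
    n_rows, n_cols = len(p), len(p[0])
    p_row = [sum(row) for row in p]
    p_col = [sum(col) for col in zip(*p)]

    cell_term = sum(
        (1.0 - p[r][c]) * deff_cell[r][c]
        for r in range(n_rows)
        for c in range(n_cols)
    )
    row_term = sum((1.0 - p_row[r]) * deff_row[r] for r in range(n_rows))
    col_term = sum((1.0 - p_col[c]) * deff_col[c] for c in range(n_cols))
    return (cell_term - row_term - col_term) / ((n_rows - 1) * (n_cols - 1))
```

As a sanity check, when every design effect equals 1 the numerator is (RC - 1) - (R - 1) - (C - 1) = (R - 1)(C - 1), so the correction equals 1 and the adjusted statistic reduces to the Pearson statistic.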
One-Way Tables

For one-way tables, the Rao-Scott chi-square statistic provides a design-based goodness-of-fit test for equal proportions. Or if you specify null proportions with the TESTP= option, PROC SURVEYFREQ computes the goodness-of-fit test for the specified proportions. Under the null hypothesis, the Rao-Scott chi-square statistic approximately follows a chi-square distribution with $(C - 1)$ degrees of freedom for a table with $C$ levels. PROC SURVEYFREQ also computes an F statistic that may provide a better approximation.

The Rao-Scott chi-square is computed as

$$Q_{RS} = Q_P / D$$

where $D$ is the design correction described in the section 'Design Correction for One-Way Tables' on page 4219, and $Q_P$ is the Pearson chi-square based on the estimated totals,

$$Q_P = \frac{n}{\hat{N}} \sum_{c=1}^{C} \frac{(\hat{N}_{c} - E_{c})^2}{E_{c}}$$

where $\hat{N}_c$ is the estimated total for level $c$, and $E_c$ is the expected total for level $c$ under the null hypothesis. For the null hypothesis of equal proportions,

$$E_c = \hat{N} / C$$

For specified null proportions,

$$E_c = \hat{N} \times P^{0}_{c}$$

where $P^{0}_{c}$ is the null proportion for level $c$.

Under the null hypothesis, the one-way Rao-Scott chi-square $Q_{RS}$ approximately follows a chi-square distribution with $(C - 1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$F = \frac{Q_{RS}}{C - 1}$$

which has an F distribution with $(C - 1)$ and $\kappa (C - 1)$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section 'Degrees of Freedom' on page 4214.
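The one-way computation can be illustrated with the Python sketch below. The function name is hypothetical and the design correction is supplied by the caller. The sketch assumes that null proportions are given as proportions that sum to 1 (not as percentages), that expected totals equal the estimated overall total times the null proportion, and that the Pearson statistic is rescaled by the ratio of the sample size to the estimated overall total.

```python
def rao_scott_one_way(totals, n_obs, design_correction, null_proportions=None):
    """Return (Q_P, Q_RS, F) for a one-way table of estimated totals.

    totals:            list of estimated (weighted) totals, one per level
    n_obs:             number of observations in the sample
    design_correction: the design correction D, supplied by the caller
    null_proportions:  optional list of null proportions summing to 1;
                       equal proportions are used when omitted
    """
    n_levels = len(totals)
    overall = sum(totals)
    if null_proportions is None:
        null_proportions = [1.0 / n_levels] * n_levels

    expected = [overall * p0 for p0 in null_proportions]
    pearson_sum = sum((t - e) ** 2 / e for t, e in zip(totals, expected))

    q_p = n_obs / overall * pearson_sum
    q_rs = q_p / design_correction
    f_stat = q_rs / (n_levels - 1)
    return q_p, q_rs, f_stat
```

For estimated totals of 30, 50, and 20 from 50 observations, the equal-proportion Pearson statistic is 7.0; with a design correction of 1.25 the adjusted statistic is 5.6 and the F statistic is 2.8.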

Design Correction for One-Way Tables

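If you specify the CHISQ option, the design correction is computed from the estimated proportions, as

$$D = \frac{1}{C - 1} \sum_{c=1}^{C} (1 - \hat{P}_c)\, \mathrm{DEFF}(\hat{P}_c)$$

where

$$\mathrm{DEFF}(\hat{P}_c) = \frac{\widehat{\mathrm{Var}}(\hat{P}_c)}{\mathrm{Var}_{SRS}(\hat{P}_c)} = \frac{\widehat{\mathrm{Var}}(\hat{P}_c)}{(1 - f)\, \hat{P}_c (1 - \hat{P}_c) / (n - 1)}$$

$\hat{P}_c$ is the proportion estimate for table level $c$, $\widehat{\mathrm{Var}}(\hat{P}_c)$ is the variance of the estimate, $f$ is the overall sampling fraction, and $n$ is the number of observations in the sample.

If you specify the CHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis proportions: either equal proportions for all levels, or the proportions you specify with the TESTP= option.

$$D_0 = \frac{1}{C - 1} \sum_{c=1}^{C} (1 - P^{0}_c)\, \mathrm{DEFF}_0(\hat{P}_c)$$

where

$$\mathrm{DEFF}_0(\hat{P}_c) = \frac{\widehat{\mathrm{Var}}(\hat{P}_c)}{\mathrm{Var}_{SRS}(P^{0}_c)} = \frac{\widehat{\mathrm{Var}}(\hat{P}_c)}{(1 - f)\, P^{0}_c (1 - P^{0}_c) / (n - 1)}$$

and $P^{0}_c = 1/C$ for equal proportions, or $P^{0}_c$ takes the value specified with the TESTP= option.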

Rao-Scott Likelihood Ratio Chi-Square Test

The Rao-Scott likelihood ratio chi-square test is a design-adjusted version of the likelihood ratio test, which involves ratios between observed and expected frequencies and tests the null hypothesis of no association between the row and column variables in a two-way table. For one-way tables, the null hypothesis is equal proportions for the table levels, or you can specify other null proportions with the TESTP= option. Refer to Lohr (1999), Thomas, Singh, and Roberts (1996), and Rao and Scott (1981, 1984, 1987).

Two forms of the design correction are available for the Rao-Scott tests. One form of the design correction uses the proportion estimates, and you request the corresponding Rao-Scott likelihood ratio test with the LRCHISQ option. The other form of the design correction uses the null hypothesis proportions. You request this test, called the Rao-Scott modified likelihood ratio test, with the LRCHISQ1 option.

Two-Way Tables

The Rao-Scott likelihood ratio statistic is computed from the likelihood ratio chi-square statistic and a design correction based on the design effects of the proportions. Under the null hypothesis of no association, this statistic approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom. An F approximation is also given.

The Rao-Scott likelihood ratio chi-square is computed as

$$Q_{RS} = G^2 / D$$

where $G^2$ is the likelihood ratio chi-square based on the estimated totals, and $D$ is the design correction.

$$G^2 = 2 \, \frac{n}{\hat{N}} \sum_{r} \sum_{c} \hat{N}_{rc} \, \ln\!\left( \frac{\hat{N}_{rc}}{E_{rc}} \right)$$

where $\hat{N}_{rc}$ is the estimated total for table cell $(r, c)$, $n$ is the number of observations in the sample, $\hat{N}$ is the estimated overall total, and $E_{rc}$ is the expected total for cell $(r, c)$ under the null hypothesis of no association,

$$E_{rc} = \hat{N}_{r \cdot} \, \hat{N}_{\cdot c} / \hat{N}$$

with $\hat{N}_{r \cdot}$ and $\hat{N}_{\cdot c}$ the estimated totals for row $r$ and column $c$.

The Rao-Scott likelihood ratio chi-square uses the same design correction D as the Rao-Scott (Pearson) chi-square uses, which is described in the section 'Design Correction for Two-Way Tables' on page 4217. If you specify the LRCHISQ option, the design correction is computed from the estimated proportions. If you specify the LRCHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis cell proportions, computed as the product of the corresponding estimated row and column proportions.

Under the null hypothesis of no association, the Rao-Scott likelihood ratio chi-square approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$F = \frac{Q_{RS}}{(R - 1)(C - 1)}$$

which has an F distribution with $(R - 1)(C - 1)$ and $(R - 1)(C - 1)\kappa$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section 'Degrees of Freedom' on page 4214.
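The two-way computation can be sketched in Python as a rough illustration. The sketch assumes that the likelihood ratio chi-square is scaled by n over the estimated overall total (so that it is on the scale of the sample size), that empty cells contribute zero, and that the design correction has already been computed; the function name `rao_scott_lr_two_way` and its arguments are hypothetical.

```python
import math

def rao_scott_lr_two_way(totals, n, design_corr, kappa):
    """Rao-Scott likelihood ratio statistics for a two-way table (sketch).

    totals      -- R x C list of estimated cell totals
    n           -- number of observations in the sample
    design_corr -- design correction D, computed separately
    kappa       -- number of clusters minus number of strata
    """
    rows, cols = len(totals), len(totals[0])
    row_tot = [sum(row) for row in totals]
    col_tot = [sum(totals[r][c] for r in range(rows)) for c in range(cols)]
    grand = sum(row_tot)
    g2 = 0.0
    for r in range(rows):
        for c in range(cols):
            expected = row_tot[r] * col_tot[c] / grand  # total under no association
            if totals[r][c] > 0:  # assumption: empty cells contribute zero
                g2 += totals[r][c] * math.log(totals[r][c] / expected)
    g2 *= 2.0 * n / grand  # assumption: rescale totals to the sample size
    q_rs = g2 / design_corr
    k = (rows - 1) * (cols - 1)
    return {"G2": g2, "Q_RS": q_rs, "F": q_rs / k,
            "num_df": k, "den_df": k * kappa}
```

A table whose cell totals factor exactly into row and column totals gives a statistic of zero, which is a quick sanity check on any implementation.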

One-Way Tables

For one-way tables, the Rao-Scott likelihood ratio chi-square statistic provides a design-based goodness-of-fit test for equal proportions. If you specify null proportions with the TESTP= option, PROC SURVEYFREQ instead computes the goodness-of-fit test for the specified proportions. Under the null hypothesis, the Rao-Scott likelihood ratio statistic approximately follows a chi-square distribution with $(C - 1)$ degrees of freedom for a table with $C$ levels. An F approximation is also given.

The Rao-Scott likelihood ratio chi-square is computed as

$$Q_{RS} = G^2 / D$$

where $G^2$ is the likelihood ratio chi-square based on the estimated totals, and $D$ is the design correction.

$$G^2 = 2 \, \frac{n}{\hat{N}} \sum_{c} \hat{N}_{c} \, \ln\!\left( \frac{\hat{N}_{c}}{E_{c}} \right)$$

where $\hat{N}_c$ is the estimated total for level $c$, $n$ is the number of observations in the sample, $\hat{N}$ is the estimated overall total, and $E_c$ is the expected total for level $c$ under the null hypothesis. For the null hypothesis of equal proportions,

$$E_c = \hat{N} / C$$

For specified null proportions $P_c^0$,

$$E_c = \hat{N} \, P_c^0$$

The Rao-Scott likelihood ratio chi-square uses the same design correction D as the Rao-Scott (Pearson) chi-square uses, which is described in the section 'Design Correction for One-Way Tables' on page 4219. If you specify the LRCHISQ option, the design correction is computed from the estimated proportions. If you specify the LRCHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis cell proportions.

Under the null hypothesis, the Rao-Scott likelihood ratio chi-square approximately follows a chi-square distribution with $(C - 1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$F = \frac{Q_{RS}}{C - 1}$$

which has an F distribution with $(C - 1)$ and $(C - 1)\kappa$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section 'Degrees of Freedom' on page 4214.
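For the one-way case, a Python sketch of the goodness-of-fit statistic looks like the following. It assumes the same scaling of the likelihood ratio chi-square by n over the estimated overall total, treats empty levels as contributing zero, and takes the design correction as given; `rao_scott_lr_one_way` and its arguments are hypothetical names used only for illustration.

```python
import math

def rao_scott_lr_one_way(totals, n, design_corr, null_props=None):
    """Rao-Scott likelihood ratio chi-square for a one-way table (sketch).

    totals      -- estimated totals, one per table level
    n           -- number of observations in the sample
    design_corr -- design correction D, computed separately
    null_props  -- null proportions (TESTP=-style); None means equal proportions
    Returns (Q_RS, degrees of freedom).
    """
    levels = len(totals)
    grand = sum(totals)
    if null_props is None:
        null_props = [1.0 / levels] * levels
    g2 = 0.0
    for total, p0 in zip(totals, null_props):
        expected = grand * p0  # expected total under the null hypothesis
        if total > 0:          # assumption: empty levels contribute zero
            g2 += total * math.log(total / expected)
    g2 *= 2.0 * n / grand      # assumption: rescale totals to the sample size
    return g2 / design_corr, levels - 1
```

Passing `null_props=[0.25, 0.75]` for a two-level table whose totals are in a 1:3 ratio returns a statistic of zero, mirroring a TESTP= specification that matches the data exactly.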

Wald Chi-Square Test

PROC SURVEYFREQ provides two Wald chi-square tests for independence of the row and column variables in two-way tables: a Wald chi-square test based on the difference between observed and expected weighted cell frequencies, and a Wald log-linear chi-square test based on the log odds ratio. These statistics test for independence of the row and column variables in two-way tables, taking into account the complex survey design. Refer to Bedrick (1983), Koch, Freeman, and Freeman (1975), and Wald (1943) for information on Wald statistics and their applications to categorical data analysis.

For these two tests, PROC SURVEYFREQ computes the generalized Wald chi-square statistic, the corresponding Wald F statistic, and also an adjusted Wald F statistic for tables larger than 2 × 2. Under the null hypothesis of independence, the Wald chi-square statistic approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom for very large samples. However, it has been shown that this test may perform poorly in terms of actual significance level and power, especially for tables with a large number of cells or for samples with a relatively small number of clusters. Refer to Thomas and Rao (1984 and 1985) and Lohr (1999) for more information. Refer to Fellegi (1980) and Hidiroglou, Fuller, and Hickman (1980) for information on the adjusted Wald F statistic. Thomas and Rao (1984) found that the adjusted Wald F statistic provides a more stable test than the chi-square statistic, although its power may be low when the number of sample clusters is not large. Refer also to Korn and Graubard (1990) and Thomas, Singh, and Roberts (1996).

If you specify the WCHISQ option in the TABLES statement, PROC SURVEYFREQ computes a Wald test for independence in the two-way table based on the differences between the observed (weighted) cell frequencies and the expected frequencies.

Under the null hypothesis of independence of the row and column variables, the expected cell frequencies are computed as

$$E_{rc} = \hat{N}_{r \cdot} \, \hat{N}_{\cdot c} / \hat{N}$$

where $\hat{N}_{r \cdot}$ is the estimated total for row $r$, $\hat{N}_{\cdot c}$ is the estimated total for column $c$, and $\hat{N}$ is the estimated overall total, as described in the section 'Expected Weighted Frequency' on page 4215. The null hypothesis that the population weighted frequencies equal the expected frequencies is

$$H_0 : Y_{rc} = N_{rc} - E_{rc} = 0$$

for all $r = 1, \ldots, (R - 1)$ and $c = 1, \ldots, (C - 1)$. This null hypothesis can be stated equivalently in terms of cell proportions, with the expected cell proportions computed as the products of the marginal row and column proportions.

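The generalized Wald chi-square statistic is computed as

$$Q_{Wald} = \hat{Y}' \left( H \, \hat{V}(\hat{N}) \, H' \right)^{-1} \hat{Y}$$

where $\hat{Y}$ is the $(R - 1)(C - 1)$ array of the differences between the observed and expected weighted frequencies $(\hat{N}_{rc} - E_{rc})$, and $(H \, \hat{V}(\hat{N}) \, H')$ estimates the variance of $\hat{Y}$.

$\hat{V}(\hat{N})$ is the covariance matrix of the estimates $\hat{N}_{rc}$, and its computation is described in the section 'Covariance of Totals' on page 4210.

$H$ is an $(R - 1)(C - 1)$ by $RC$ matrix containing the partial derivatives of the elements of $\hat{Y}$ with respect to the elements of $\hat{N}$. The elements of $H$ are computed as follows, where $a$ denotes a row different from row $r$, and $b$ denotes a column different from column $c$.

$$\begin{aligned}
\partial \hat{Y}_{rc} / \partial \hat{N}_{rc} &= 1 - \left( \hat{N}_{r \cdot} + \hat{N}_{\cdot c} - \hat{N}_{r \cdot} \hat{N}_{\cdot c} / \hat{N} \right) / \hat{N} \\
\partial \hat{Y}_{rc} / \partial \hat{N}_{ac} &= - \left( \hat{N}_{r \cdot} - \hat{N}_{r \cdot} \hat{N}_{\cdot c} / \hat{N} \right) / \hat{N} \\
\partial \hat{Y}_{rc} / \partial \hat{N}_{rb} &= - \left( \hat{N}_{\cdot c} - \hat{N}_{r \cdot} \hat{N}_{\cdot c} / \hat{N} \right) / \hat{N} \\
\partial \hat{Y}_{rc} / \partial \hat{N}_{ab} &= \hat{N}_{r \cdot} \hat{N}_{\cdot c} / \hat{N}^2
\end{aligned}$$

Under the null hypothesis of independence, the statistic $Q_{Wald}$ approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom for very large samples.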
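To make the roles of the difference array, the derivative matrix H, and the covariance matrix concrete, here is a Python sketch restricted to a 2 × 2 table. In that case there is a single difference (observed minus expected total for cell (1, 1)), H has one row, and the quadratic form reduces to a scalar division, so no matrix inverse is needed. The derivatives are obtained by differentiating the difference, with the expected total taken as row total times column total over the overall total. The function name `wald_chi_square_2x2` is hypothetical, and the covariance matrix is assumed to be supplied by the caller.

```python
def wald_chi_square_2x2(totals, cov):
    """Generalized Wald chi-square for a 2 x 2 table (illustrative sketch).

    totals -- [[N11, N12], [N21, N22]], the estimated cell totals
    cov    -- 4 x 4 covariance matrix of (N11, N12, N21, N22)
    Returns (Q, h), where h is the single row of the derivative matrix H.
    """
    n11, n12 = totals[0]
    n21, n22 = totals[1]
    grand = n11 + n12 + n21 + n22
    row1 = n11 + n12
    col1 = n11 + n21
    expected = row1 * col1 / grand       # expected total under independence
    y = n11 - expected                   # observed minus expected
    h = [
        1.0 - (row1 + col1 - expected) / grand,  # same row, same column
        -(col1 - expected) / grand,              # same row, other column
        -(row1 - expected) / grand,              # other row, same column
        expected / grand,                        # other row, other column
    ]
    # Variance of y by the Taylor expansion: h * cov * h'
    var_y = sum(h[i] * cov[i][j] * h[j] for i in range(4) for j in range(4))
    return y * y / var_y, h
```

For larger tables the same structure applies, but H has (R - 1)(C - 1) rows and the quadratic form requires inverting the resulting variance matrix.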

PROC SURVEYFREQ computes the Wald F statistic as

$$F_{Wald} = \frac{Q_{Wald}}{(R - 1)(C - 1)}$$

Under the null hypothesis of independence, $F_{Wald}$ approximately follows an F distribution with $(R - 1)(C - 1)$ numerator degrees of freedom. By default, PROC SURVEYFREQ computes the denominator degrees of freedom as the number of clusters minus the number of strata, as described in the section 'Degrees of Freedom' on page 4214. Alternatively, you can specify the denominator degrees of freedom with the DDF= option in the TABLES statement.

For tables larger than 2 × 2, PROC SURVEYFREQ also computes the adjusted Wald F statistic

$$F_{Adj\,Wald} = \frac{s - k + 1}{k \, s} \, Q_{Wald}$$

where $k = (R - 1)(C - 1)$, and $s$ is the number of clusters minus the number of strata. Alternatively, you can specify the value of $s$ with the DDF= option in the TABLES statement. Note that for 2 × 2 tables, $k = (R - 1)(C - 1) = 1$, so the adjusted Wald F statistic equals the (unadjusted) Wald F statistic, with the same numerator and denominator degrees of freedom.

Under the null hypothesis, $F_{Adj\,Wald}$ approximately follows an F distribution with $k$ numerator degrees of freedom and $(s - k + 1)$ denominator degrees of freedom.
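The relationship between the two F statistics is easy to check numerically. The Python sketch below assumes the standard forms: the Wald F statistic is the chi-square divided by k, and the adjusted statistic multiplies the chi-square by (s - k + 1) / (k s). The function name `wald_f_statistics` is hypothetical, and p-values are omitted because they would require an F distribution function from outside the standard library.

```python
def wald_f_statistics(q_wald, rows, cols, s):
    """Wald F and adjusted Wald F from a Wald chi-square (illustrative sketch).

    q_wald -- generalized Wald chi-square statistic
    rows   -- number of table rows R
    cols   -- number of table columns C
    s      -- denominator degrees of freedom (clusters minus strata,
              or a user-supplied value in the spirit of DDF=)
    """
    k = (rows - 1) * (cols - 1)
    return {
        "F": q_wald / k,
        "num_df": k,
        "den_df": s,
        "F_adj": q_wald * (s - k + 1) / (k * s),
        "adj_num_df": k,
        "adj_den_df": s - k + 1,
    }
```

With `rows=2, cols=2`, k is 1 and the two statistics and their degrees of freedom coincide, as the text notes. The adjustment matters most when k is large relative to s, where the unadjusted statistic is least stable.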

Wald Log-Linear Chi-Square Test

If you specify the WLLCHISQ option in the TABLES statement, PROC SURVEYFREQ computes a Wald test for independence based on the log odds ratios. See the section 'Wald Chi-Square Test' on page 4221 for more information on Wald tests.

For a two-way table of $R$ rows and $C$ columns, the Wald log-linear test is based on the $(R - 1)(C - 1)$ array of

$$\hat{Y}_{rc} = \log \hat{N}_{rc} - \log \hat{N}_{rC} - \log \hat{N}_{Rc} + \log \hat{N}_{RC}$$

where $\hat{N}_{rc}$ is the estimated total for table cell $(r, c)$. The null hypothesis of independence between the row and column variables is $H_0 : Y_{rc} = 0$ for all $r = 1, \ldots, (R - 1)$ and $c = 1, \ldots, (C - 1)$. This null hypothesis can be stated equivalently in terms of cell proportions.
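As an illustration, the array of log odds ratio contrasts can be computed in Python as follows. The sketch assumes that each contrast compares cell (r, c) against the last row and last column of the table, that is, log N(r,c) - log N(r,C) - log N(R,c) + log N(R,C), and that every estimated total is strictly positive. The function name `log_linear_contrasts` is hypothetical.

```python
import math

def log_linear_contrasts(totals):
    """(R - 1) x (C - 1) array of log odds ratio contrasts (sketch).

    totals -- R x C list of strictly positive estimated cell totals.
    The last row and last column serve as the reference (an assumption).
    """
    last_r = len(totals) - 1
    last_c = len(totals[0]) - 1
    logs = [[math.log(value) for value in row] for row in totals]
    return [
        [
            logs[r][c] - logs[r][last_c] - logs[last_r][c] + logs[last_r][last_c]
            for c in range(last_c)
        ]
        for r in range(last_r)
    ]
```

When the rows of the table are proportional to one another, every contrast is zero, which corresponds to the null hypothesis of independence.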

The generalized Wald log-linear chi-square statistic is computed as

$$Q_{Wald\,LL} = \hat{Y}' \, \hat{V}(\hat{Y})^{-1} \, \hat{Y}$$

where $\hat{Y}$ is the $(R - 1)(C - 1)$ array of the $\hat{Y}_{rc}$, and $\hat{V}(\hat{Y})$ estimates the variance of $\hat{Y}$.

$$\hat{V}(\hat{Y}) = A \, D^{-1} \, \hat{V}(\hat{N}) \, D^{-1} A'$$

where $\hat{V}(\hat{N})$ is the covariance matrix of the estimates $\hat{N}_{rc}$, as described in the section 'Covariance of Totals' on page 4210, $D$ is a diagonal matrix with the estimated totals $\hat{N}_{rc}$ on the diagonal, and $A$ is the $(R - 1)(C - 1)$ by $RC$ linear contrast matrix.

Under the null hypothesis of independence, the statistic $Q_{Wald\,LL}$ approximately follows a chi-square distribution with $(R - 1)(C - 1)$ degrees of freedom for very large samples.

PROC SURVEYFREQ computes the Wald log-linear F statistic as

$$F_{Wald\,LL} = \frac{Q_{Wald\,LL}}{(R - 1)(C - 1)}$$

Under the null hypothesis of independence, $F_{Wald\,LL}$ approximately follows an F distribution with $(R - 1)(C - 1)$ numerator degrees of freedom. By default, PROC SURVEYFREQ computes the denominator degrees of freedom as the number of clusters minus the number of strata, as described in the section 'Degrees of Freedom' on page 4214. Alternatively, you can specify the denominator degrees of freedom with the DDF= option in the TABLES statement.

For tables larger than 2 × 2, PROC SURVEYFREQ also computes the adjusted Wald log-linear F statistic

$$F_{Adj\,Wald\,LL} = \frac{s - k + 1}{k \, s} \, Q_{Wald\,LL}$$

where $k = (R - 1)(C - 1)$, and $s$ is the number of clusters minus the number of strata. Alternatively, you can specify the value of $s$ with the DDF= option in the TABLES statement. Note that for 2 × 2 tables, $k = (R - 1)(C - 1) = 1$, so the adjusted Wald F statistic equals the (unadjusted) Wald F statistic, with the same numerator and denominator degrees of freedom.

Under the null hypothesis, $F_{Adj\,Wald\,LL}$ approximately follows an F distribution with $k$ numerator degrees of freedom and $(s - k + 1)$ denominator degrees of freedom.

Displayed Output

Data and Sample Design Summary Table

The 'Data Summary' table provides information on the input data set and the sample design. PROC SURVEYFREQ displays this table unless you specify the NOSUMMARY option in the PROC SURVEYFREQ statement.

The 'Data Summary' table displays the total number of valid observations. To be considered valid, an observation must have a nonmissing, positive WEIGHT value if you specify a WEIGHT statement. If you do not specify the MISSING option, a valid observation must also have nonmissing values for all STRATA and CLUSTER variables. The number of valid observations may differ from the number of nonmissing observations for an individual analysis variable, which the procedure displays in the frequency or crosstabulation tables. See the section 'Missing Values' on page 4205 for more information.
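The validity rule above can be restated as a small Python predicate. This is only a paraphrase of the rule as written in this section, not the procedure's implementation: missing values are represented by `None`, and the function name `is_valid_observation` and its parameters are hypothetical.

```python
def is_valid_observation(weight, strata_values, cluster_values,
                         has_weight_statement, missing_option=False):
    """Return True if an observation counts as valid (illustrative sketch).

    weight               -- WEIGHT value, or None if missing
    strata_values        -- values of the STRATA variables (None = missing)
    cluster_values       -- values of the CLUSTER variables (None = missing)
    has_weight_statement -- whether a WEIGHT statement is specified
    missing_option       -- whether the MISSING option is specified
    """
    # With a WEIGHT statement, the weight must be nonmissing and positive.
    if has_weight_statement and (weight is None or weight <= 0):
        return False
    # Without MISSING, every STRATA and CLUSTER value must be nonmissing.
    if not missing_option:
        design_values = list(strata_values) + list(cluster_values)
        if any(value is None for value in design_values):
            return False
    return True
```

For example, `is_valid_observation(1.0, [None], [1], True)` is `False`, while the same call with `missing_option=True` is `True`, because the missing stratum value is then treated as a valid level.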

PROC SURVEYFREQ displays the following information in the 'Data Summary' table:

  • Number of Strata, if you specify a STRATA statement

  • Number of Clusters, if you specify a CLUSTER statement

  • Number of Observations, which is the total number of valid observations

  • Sum of Weights, which is the sum over all valid observations, if you specify a WEIGHT statement

Stratum Information Table

If you specify the LIST option in the STRATA statement, PROC SURVEYFREQ displays a 'Stratum Information' table. This table provides the following information for each stratum:

  • Stratum Index, which is a sequential stratum identification number

  • STRATA variable(s), which lists the levels of STRATA variables for the stratum

  • Number of Observations, which is the number of valid observations in the stratum

  • Population Total for the stratum, if you specify the TOTAL= option

  • Sampling Rate for the stratum, if you specify the TOTAL= option or the RATE= option. If you specify the TOTAL= option, the sampling rate is based on the number of valid observations in the stratum.

  • Number of Clusters, which is the number of clusters in the stratum, if you specify a CLUSTER statement

One-Way Frequency Tables

PROC SURVEYFREQ displays one-way frequency tables for all one-way table requests in the TABLES statements, unless you specify the NOPRINT option in the TABLES statement. A one-way table shows the sample frequency distribution of a single variable, and provides estimates for its population distribution in terms of totals and proportions. For each level of the variable, PROC SURVEYFREQ displays the following information in the one-way table:

  • Frequency count, giving the number of sample observations for the level

  • Weighted Frequency total, estimating the total population frequency for the level

  • Standard Deviation of Weighted Frequency

  • Percent, estimating the population proportion for the level

  • Standard Error of Percent

The one-way table displays weighted frequencies if your analysis includes a WEIGHT statement, or if you specify the WTFREQ option in the TABLES statement.

The one-way table also displays the Frequency Missing, or the number of observations with missing values.

You can suppress the frequency counts by specifying the NOFREQ option in the TABLES statement. Also, the NOWT option suppresses the weighted frequencies and their standard deviations. The NOPERCENT option suppresses the percentages and their standard errors. The NOSTD option suppresses the standard errors of the percentages and the standard deviations of the weighted frequencies. The NOTOTAL option suppresses the total row of the one-way table.

PROC SURVEYFREQ optionally displays the following information for a one-way table:

  • Variance of Weighted Frequency, if you specify the VARWT option

  • Confidence Limits for Weighted Frequency, if you specify the CLWT option

  • Coefficient of Variation for Weighted Frequency, if you specify the CVWT option

  • Test Percent, if you specify the TESTP= option

  • Variance of Percent, if you specify the VAR option

  • Confidence Limits for Percent, if you specify the CL option

  • Coefficient of Variation for Percent, if you specify the CV option

  • Design Effect for Percent, if you specify the DEFF option

Crosstabulation Tables

PROC SURVEYFREQ displays all multiway table requests in the TABLES statements, unless you specify the NOPRINT option in the TABLES statement. For two-way to multiway crosstabulation tables, the values of the last variable in the table request form the table columns. The values of the next-to-last variable form the rows. Each level (or combination of levels) of the other variables forms one layer. PROC SURVEYFREQ produces a separate two-way crosstabulation table for each layer.
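The mapping from a table request to layers, rows, and columns can be illustrated with a short Python sketch. It handles only simple requests of the form `A*B*C`, with variables joined by asterisks and no parenthesized groupings; the function name `table_layout` and the variable names in the usage note are hypothetical.

```python
def table_layout(request):
    """Split a simple multiway table request into layer, row, and column roles.

    request -- a string such as "A*B*C" (no parenthesized groupings)
    """
    names = [name.strip() for name in request.split("*")]
    if len(names) < 2 or any(not name for name in names):
        raise ValueError("expected at least two variables joined by '*'")
    return {
        "layers": names[:-2],   # all earlier variables define the layers
        "row": names[-2],       # next-to-last variable forms the rows
        "column": names[-1],    # last variable forms the columns
    }
```

For a request such as `Region*Sex*Response`, this gives one layer variable (`Region`), rows from `Sex`, and columns from `Response`, so one two-way table of `Sex` by `Response` would be produced for each level of `Region`.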

For each layer, the crosstabulation table displays the row and column variable names and values (or levels). Each two-way table lists levels of the column variable within each level of the row variable.

By default, the procedure displays all levels of the column variable within each level of the row variable, including any column variable levels with zero frequency for that row. For multiway tables, the procedure displays all levels of the row variable for each layer of the table by default, including any row levels with zero frequency for that layer. You can suppress the display of zero frequency levels by specifying the NOSPARSE option.

For each combination of variable levels, or table cell, the two-way table displays the following information:

  • Frequency, giving the number of observations that have the indicated values of the two variables

  • Weighted Frequency total, estimating the total population frequency for the table cell

  • Standard Deviation of Weighted Frequency

  • Percent, estimating the population proportion for the table cell

  • Standard Error of Percent

The two-way table displays weighted frequencies if your analysis includes a WEIGHT statement, or if you specify the WTFREQ option in the TABLES statement.

The two-way table also displays the Frequency Missing, or the number of observations with missing values.

You can suppress the frequency counts by specifying the NOFREQ option in the TABLES statement. Also, the NOWT option suppresses the weighted frequencies and their standard deviations. The NOPERCENT option suppresses the percentages and their standard errors. The NOSTD option suppresses the standard errors of the percentages and the standard deviations of the weighted frequencies. The NOTOTAL option suppresses the row totals and column totals.

PROC SURVEYFREQ optionally displays the following information for a two-way table:

  • Expected Weighted Frequency, if you specify the EXPECTED option

  • Variance of Weighted Frequency, if you specify the VARWT option

  • Confidence Limits for Weighted Frequency, if you specify the CLWT option

  • Coefficient of Variation for Weighted Frequency, if you specify the CVWT option

  • Variance of Percent, if you specify the VAR option

  • Confidence Limits for Percent, if you specify the CL option

  • Coefficient of Variation for Percent, if you specify the CV option

  • Design Effect for Percent, if you specify the DEFF option

  • Row Percent, estimating the cell's proportion of the population total for that cell's row, if you specify the ROW option

  • Standard Error of Row Percent, if you specify the ROW option

  • Variance of Row Percent, if you specify the VAR option and the ROW option

  • Confidence Limits for Row Percent, if you specify the CL option and the ROW option

  • Coefficient of Variation for Row Percent, if you specify the CV option and the ROW option

  • Column Percent, estimating the cell's proportion of the population total for that cell's column, if you specify the COL option

  • Standard Error of Column Percent, if you specify the COL option

  • Variance of Column Percent, if you specify the VAR option and the COL option

  • Confidence Limits for Column Percent, if you specify the CL option and the COL option

  • Coefficient of Variation for Column Percent, if you specify the CV option and the COL option

If you specify the ROW option, the NOPERCENT option suppresses the row percentages and their standard errors. The NOSTD option suppresses the standard errors of the row percentages. Similarly, if you specify the COL option, the NOPERCENT option suppresses the column percentages and their standard errors. The NOSTD option suppresses the standard errors of the column percentages.

Statistical Tests

If you specify the CHISQ option for the Rao-Scott chi-square test, the CHISQ1 option for the modified test, the LRCHISQ option for the Rao-Scott likelihood ratio chi-square test, or the LRCHISQ1 option for the modified test, PROC SURVEYFREQ displays the following information:

  • Pearson Chi-Square, if you specify the CHISQ or CHISQ1 option

  • Likelihood Ratio Chi-Square, if you specify the LRCHISQ or LRCHISQ1 option

  • Design Correction

  • Rao-Scott Chi-Square, if you specify the CHISQ or CHISQ1 option

  • Rao-Scott Likelihood Ratio Chi-Square, if you specify the LRCHISQ or LRCHISQ1 option

  • DF, the degrees of freedom for the chi-square test

  • Pr > ChiSq, the p-value for the chi-square test

  • F Value

  • Num DF, the numerator degrees of freedom for F

  • Den DF, the denominator degrees of freedom for F

  • Pr > F, the p-value for the F test

If you specify the WCHISQ option for the Wald chi-square test or the WLLCHISQ option for the Wald log-linear chi-square test, PROC SURVEYFREQ displays the following information:

  • Wald Chi-Square, if you specify the WCHISQ option

  • Wald Log-Linear Chi-Square, if you specify the WLLCHISQ option

  • F Value

  • Num DF, the numerator degrees of freedom for F

  • Den DF, the denominator degrees of freedom for F

  • Pr > F, the p-value for the F test

  • Adjusted F Value, for tables larger than 2 × 2

  • Num DF, the numerator degrees of freedom for Adjusted F

  • Den DF, the denominator degrees of freedom for Adjusted F

  • Pr > Adj F, the p-value for the Adjusted F test

ODS Table Names

PROC SURVEYFREQ assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 68.3: ODS Tables Produced in PROC SURVEYFREQ

| ODS Table Name | Description | Statement | Option |
|---|---|---|---|
| ChiSq | Chi-square test | TABLES | CHISQ |
| ChiSq1 | Modified chi-square test | TABLES | CHISQ1 |
| CrossTabs | Crosstabulation table | TABLES | (n-way table request, n > 1) |
| LRChiSq | Likelihood ratio test | TABLES | LRCHISQ |
| LRChiSq1 | Modified likelihood ratio test | TABLES | LRCHISQ1 |
| OneWay | One-way frequency table | PROC or TABLES | (with no TABLES stmt) or (one-way table request) |
| StrataInfo | Stratum information | STRATA | LIST |
| Summary | Data summary | PROC | default |
| TableSummary | Table summary (not displayed) | TABLES | default |
| WChiSq | Wald chi-square test | TABLES | WCHISQ |
| WLLChiSq | Wald log-linear chi-square test | TABLES | WLLCHISQ |




SAS.STAT 9.1 Users Guide (Vol. 6)
Year: 2004
Pages: 127