Details


Missing Values

If an observation has a missing value or a nonpositive value for the WEIGHT variable, then PROC SURVEYREG excludes that observation from the analysis. An observation is also excluded if it has a missing value for any STRATA variable, CLUSTER variable, dependent variable, or any variable used in the independent effects. The analysis includes all observations in the data set that have nonmissing values for all these design and analysis variables.
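Before running the procedure, it can help to count how many observations these rules would drop. The following is a minimal sketch, not part of the procedure itself: it borrows the IceCream data set and its variables (Grade, Kids, Income, Spending, Weight) from the ODS example at the end of this section, and the data set name ExcludedCheck and the flag variable Excluded are hypothetical.

```sas
/* Sketch only: flag observations that PROC SURVEYREG would exclude.   */
/* IceCream and its variables come from the example later in this      */
/* section; ExcludedCheck and Excluded are hypothetical names.         */
data ExcludedCheck;
   set IceCream;
   Excluded = (missing(Weight) or Weight <= 0   /* missing or nonpositive weight */
               or missing(Grade)                /* STRATA variable               */
               or missing(Spending)             /* dependent variable            */
               or missing(Income)               /* regressor                     */
               or missing(Kids));               /* CLASS variable in the model   */
run;

/* Tabulate how many observations would be kept (0) or dropped (1) */
proc freq data=ExcludedCheck;
   tables Excluded;
run;
```

If your design has a CLUSTER variable, add it to the condition in the same way.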

If you have missing values in your survey data for any reason (such as nonresponse), this can compromise the quality of your survey results. If the respondents are different from the nonrespondents with regard to a survey effect or outcome, then survey estimates will be biased and will not accurately represent the survey population. There are a variety of techniques in sample design and survey operations that can reduce nonresponse. Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEYREG. Refer to Cochran (1977) for more details.

Survey Design Information

Specification of Population Totals and Sampling Rates

If your analysis should include a finite population correction (fpc), you can input either the sampling rate or the population total using the RATE= option or the TOTAL= option. You cannot specify both of these options in the same PROC SURVEYREG statement. If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the study population. If you are specifying the TOTAL= option for a multistage design, you should input the total number of PSUs in the study population.

For a nonstratified sample design, or for a stratified sample design with the same sampling rate or the same population total in all strata, you should use the RATE=value option or the TOTAL=value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you can use the RATE=SAS-data-set option or the TOTAL=SAS-data-set option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option.

The secondary data set must contain all the stratification variables listed in the STRATA statement and all the variables in the BY statement. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL= SAS-data-set option, the secondary data set must have a variable named _TOTAL_ that contains the stratum population totals. Or if you specify the RATE= SAS-data-set option, the secondary data set must have a variable named _RATE_ that contains the stratum sampling rates.

The secondary data set must contain all BY and STRATA groups that occur in the primary data set. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of _TOTAL_ or _RATE_ for that stratum and ignores the rest.

The value in the RATE= option, or the values of _RATE_ in the secondary data set, must be nonnegative numbers. You can specify a sampling rate as a number between 0 and 1. Or you can specify a sampling rate in percentage form as a number between 1 and 100, and PROC SURVEYREG will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

If you specify the TOTAL=value option, value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.
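The two ways of supplying this information can be sketched as follows. The data set names IceCream and StudentTotals and the variables Grade, Spending, Income, and Weight are taken from the example at the end of this section. The stratum totals and the 2% rate are illustrative placeholders rather than real figures, and the sketch assumes that Grade takes the values 7, 8, and 9 in the primary data set.

```sas
/* Secondary data set: one row per stratum with the required _TOTAL_   */
/* variable. The totals below are made-up placeholder values.          */
data StudentTotals;
   input Grade _TOTAL_;
   datalines;
7 1800
8 1000
9 1200
;

/* Stratum-specific totals: name the secondary data set in TOTAL= */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade;
   model Spending = Income;
   weight Weight;
run;

/* Same rate everywhere (or no strata): give a single value instead.   */
/* The rate 0.02 is a placeholder.                                     */
proc surveyreg data=IceCream rate=0.02;
   model Spending = Income;
   weight Weight;
run;
```

Because 0.02 lies between 0 and 1, it is read as a proportion; writing rate=2 would be read as 2%.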

Primary Sampling Units (PSUs)

When you have clusters, or primary sampling units (PSUs), in your sample design, the procedure estimates variance from the variation among PSUs. For more information, see the section 'Variance Estimation' on page 4385. You can use the CLUSTER statement to identify the first stage clusters in your design. PROC SURVEYREG assumes that each cluster represents a PSU in the sample and that each observation is an element of a PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.
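A minimal sketch of a clustered design follows. Every name in it is hypothetical: StudentSurvey is a primary data set of students, School identifies the first-stage clusters (PSUs), Region is the stratification variable, and SchoolTotals is a secondary data set holding Region and a _TOTAL_ variable that counts the schools (PSUs) in each region.

```sas
/* Hypothetical two-stage design: schools (PSUs) are sampled within    */
/* regions, and students are sampled within schools.                   */
proc surveyreg data=StudentSurvey total=SchoolTotals;
   strata Region;              /* first-stage strata                   */
   cluster School;             /* first-stage clusters (PSUs)          */
   model Score = StudyHours;
   weight SamplingWeight;
run;
```

Omitting the CLUSTER statement from this sketch would make each student a PSU.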

Computational Details

Notation

For a stratified clustered sample design, observations are represented by an $n \times (p+2)$ matrix

$$(w, y, X) = (w_{hij},\; y_{hij},\; x_{hij})$$

where

  • w denotes the sampling weight vector

  • y denotes the dependent variable

  • X denotes the design matrix. (When an effect contains only classification variables, the columns of X corresponding to this effect contain only 0s and 1s; no reparameterization is made.)

  • $h = 1, 2, \ldots, H$ is the stratum number, with a total of $H$ strata

  • $i = 1, 2, \ldots, n_h$ is the cluster number within stratum $h$, with a total of $n_h$ clusters

  • $j = 1, 2, \ldots, m_{hi}$ is the unit number within cluster $i$ of stratum $h$, with a total of $m_{hi}$ units

  • $p$ is the total number of parameters (including an intercept if the INTERCEPT effect is included in the MODEL statement)

  • $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample

Also, $f_h$ denotes the sampling rate for stratum $h$. You can use the TOTAL= option or the RATE= option to input population totals or sampling rates. See the section 'Specification of Population Totals and Sampling Rates' on page 4382 for details. If you input stratum totals, PROC SURVEYREG computes $f_h$ as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYREG uses these values directly for $f_h$. If you do not specify the TOTAL= option or the RATE= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible, and a finite population correction is not used when computing variances.

Regression Coefficients

PROC SURVEYREG solves the normal equations $X'WX\beta = X'Wy$ using a modified sweep routine that produces a generalized (g2) inverse $(X'WX)^{-}$ and a solution (Pringle and Raynor 1971)

$$\hat{\beta} = (X'WX)^{-} X'Wy$$

where W is the diagonal matrix constructed from WEIGHT variable values.

For models with class variables, there are more design matrix columns than there are degrees of freedom (DF) for the effect. Thus, there are linear dependencies among the columns. In this case, the parameters are not estimable; there is an infinite number of least-squares solutions. PROC SURVEYREG uses a generalized (g2) inverse to obtain values for the estimates. The solution values are not displayed unless you specify the SOLUTION option in the MODEL statement. The solution has the characteristic that estimates are 0 whenever the design column for that parameter is a linear combination of previous columns. (Strictly termed, the solution values should not be called estimates.) With this full parameterization, hypothesis tests are constructed to test linear functions of the parameters that are estimable.

Variance Estimation

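PROC SURVEYREG uses the Taylor series expansion theory to estimate the covariance-variance matrix of the estimated regression coefficients $\hat{\beta}$ (Fuller 1975). Let

$$r = y - X\hat{\beta}$$

where the $(h, i, j)$th element is $r_{hij}$. Compute the $1 \times p$ row vectors

$$e_{hij} = w_{hij}\, r_{hij}\, x_{hij}, \qquad e_{hi\cdot} = \sum_{j=1}^{m_{hi}} e_{hij}, \qquad \bar{e}_{h\cdot\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} e_{hi\cdot}$$

and calculate the $p \times p$ matrix

$$G = \frac{n-1}{n-p} \sum_{h=1}^{H} \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} (e_{hi\cdot} - \bar{e}_{h\cdot\cdot})' (e_{hi\cdot} - \bar{e}_{h\cdot\cdot})$$

PROC SURVEYREG computes the covariance matrix of $\hat{\beta}$ as

$$\hat{V}(\hat{\beta}) = (X'WX)^{-}\, G\, (X'WX)^{-}$$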

The factor $(n-1)/(n-p)$ in the computation of the $p \times p$ matrix $G$ should reduce the small sample bias associated with using the estimated function to calculate deviations (Hidiroglou et al. 1980). For simple random sampling, this factor contributes to the degrees of freedom correction applied to the residual mean square for ordinary least squares in which $p$ parameters are estimated. By default, the procedure uses this adjustment in the variance estimation; this is equivalent to specifying the VADJUST=DF option in the MODEL statement. If you do not wish to use this multiplier in the variance estimation, you can specify the VADJUST=NONE option in the MODEL statement to suppress this factor.
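As a brief sketch of suppressing the multiplier, the following reuses the IceCream and StudentTotals data sets and the variables from the example at the end of this section; only the VADJUST=NONE option is the point of the illustration.

```sas
/* Sketch: fit the model without the (n-1)/(n-p) variance multiplier.  */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade;
   model Spending = Income / vadjust=none;   /* default is VADJUST=DF  */
   weight Weight;
run;
```

Running the same step with and without the option shows how much the adjustment changes the standard errors for a given sample size.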

Degrees of Freedom

PROC SURVEYREG produces tests for the significance of model effects, regression parameters, estimable functions specified in the ESTIMATE statement, and contrasts specified in the CONTRAST statement. It computes all these tests taking into account the sample design. The degrees of freedom for these tests differ from the degrees of freedom for the ANOVA table, which does not consider the sample design.

Denominator Degrees of Freedom

The denominator DF refers to the denominator degrees of freedom for F tests and to the degrees of freedom for t tests in the analysis. By default, the denominator DF equals the number of clusters minus the actual number of strata. If there are no clusters, the denominator DF equals the number of observations minus the actual number of strata. The actual number of strata equals

  • one, if there is no STRATA statement

  • the number of strata in the sample, if there is a STRATA statement but the procedure does not collapse any strata

  • the number of strata in the sample after collapsing, if there is a STRATA statement and the procedure collapses strata that have only one sampling unit

Alternatively, you can specify the denominator DF using the DF= option on page 4380 in the MODEL statement.
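The default rule and the override can be sketched with made-up counts. Suppose, hypothetically, that the analysis uses 40 observations in 3 strata with no CLUSTER statement and no collapsing, so the default denominator DF would be 40 - 3 = 37. The data set and variable names are borrowed from the example at the end of this section, and the value 30 is an arbitrary placeholder.

```sas
/* Hypothetical counts: 40 observations, 3 strata, no clusters,        */
/* so the default denominator DF would be 40 - 3 = 37.                 */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade;
   model Spending = Income / df=30;   /* override the default DF       */
   weight Weight;
run;
```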

Numerator Degrees of Freedom

The numerator DF refers to the numerator degrees of freedom for the Wald F statistic associated with an effect or with a contrast. The procedure computes the Wald F statistic for an effect as a Type III test; that is, the test has the following properties:

  • The hypothesis for an effect does not involve parameters of other effects except for containing effects (which it must involve to be estimable).

  • The hypotheses to be tested are invariant to the ordering of effects in the model.

See the section 'Testing Effects' on page 4386 for more information. The numerator DF for the Wald F statistic for a contrast is the rank of the L matrix that defines the contrast.

Testing Effects

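For each effect in the model, PROC SURVEYREG computes an L matrix such that every element of $L\beta$ is estimable; the L matrix has the maximum possible rank associated with the effect. To test the effect, the procedure uses the Wald F statistic for the hypothesis $H_0\colon L\beta = 0$. The Wald F statistic equals

$$F_{Wald} = \frac{(L\hat{\beta})' \,(L \hat{V} L')^{-1}\, (L\hat{\beta})}{\mathrm{rank}(L)}$$

where $\hat{V}$ is the estimated covariance matrix of $\hat{\beta}$,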

with numerator degrees of freedom equal to rank( L ) and denominator degrees of freedom equal to the number of clusters minus the number of strata (unless you have specified the denominator degrees of freedom with the DF= option in the MODEL statement; see the section 'Denominator Degrees of Freedom' on page 4386). It is possible that the L matrix cannot be constructed for an effect, in which case that effect is not testable. For more information on how the matrix L is constructed, see the discussion in Chapter 11, 'The Four Types of Estimable Functions.'

Analysis of Variance (ANOVA)

PROC SURVEYREG produces an analysis of variance table for the model specified in the MODEL statement. This table is identical to the one produced by the GLM procedure for the model. PROC SURVEYREG computes ANOVA table entries using the sampling weights, but not the sample design information on stratification and clustering.

The degrees of freedom (DF) displayed in the ANOVA table are the same as those in the ANOVA table produced by PROC GLM. The Total DF is the total degrees of freedom used to obtain the regression coefficient estimates. The Total DF equals the total number of observations minus 1 if the model includes an intercept. If the model does not include an intercept, the Total DF equals the total number of observations. The Model DF equals the degrees of freedom for the effects in the MODEL statement, not including the intercept. The Error DF equals the total DF minus the model DF.

Multiple R-square

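PROC SURVEYREG computes a multiple R-square for the weighted regression as

$$R^2 = 1 - \frac{SS_{error}}{SS_{total}}$$

where $SS_{error}$ is the error sum of squares in the ANOVA table, computed from the residual vector $r = y - X\hat{\beta}$ as

$$SS_{error} = r'Wr$$

and $SS_{total}$ is the total sum of squares

$$SS_{total} = \begin{cases} y'Wy & \text{if the model has no intercept} \\ y'Wy - \left( \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij} \right)^{2} \Big/ \, w_{\cdot\cdot\cdot} & \text{otherwise} \end{cases}$$

where $w_{\cdot\cdot\cdot}$ is the sum of the sampling weights over all observations.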

Adjusted R-square

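If you specify the ADJRSQ option in the MODEL statement, PROC SURVEYREG computes a multiple R-square adjusted for the weighted regression as

$$ADJRSQ = \begin{cases} 1 - \dfrac{n\,(1 - R^2)}{n - p} & \text{if the model has no intercept} \\[2ex] 1 - \dfrac{(n-1)\,(1 - R^2)}{n - p} & \text{otherwise} \end{cases}$$

where $R^2$ is the multiple R-square.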

Root Mean Square Errors

PROC SURVEYREG computes the square root of the mean square error as

$$\sqrt{MSE} = \sqrt{\frac{n \; SS_{error}}{(n - p)\; w_{\cdot\cdot\cdot}}}$$

where $w_{\cdot\cdot\cdot}$ is the sum of the sampling weights over all observations.

Design Effect

If you specify the DEFF option in the MODEL statement, PROC SURVEYREG calculates the design effects for the regression coefficients. The design effect of an estimate is the ratio of the actual variance to the variance computed under the assumption of simple random sampling.

$$DEFF = \frac{\text{variance under the sample design}}{\text{variance under simple random sampling}}$$

Refer to Kish (1965, p. 258). PROC SURVEYREG computes the numerator as described in the section 'Variance Estimation' on page 4385. The denominator is computed under the assumption that the sample design is simple random sampling, with no stratification and no clustering.

To compute the variance under the assumption of simple random sampling, PROC SURVEYREG calculates the sampling rate as follows. If you specify both sampling weights and sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is calculated as

$$f_{SRS} = \frac{n}{w_{\cdot\cdot\cdot}}$$

where $n$ is the sample size and $w_{\cdot\cdot\cdot}$ (the sum of the weights over all observations) estimates the population size. If the sum of the weights is less than the sample size, $f_{SRS}$ is set to zero. If you specify sampling rates for the analysis but not sampling weights, then PROC SURVEYREG computes the sampling rate under simple random sampling as the average of the stratum sampling rates.

$$f_{SRS} = \frac{1}{H} \sum_{h=1}^{H} f_h$$

If you do not specify sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is assumed to be zero.
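A short sketch of requesting design effects follows. It reuses the IceCream and StudentTotals data sets and the variables from the example at the end of this section. Because both weights and population totals are supplied here, the simple random sampling rate would be based on the sample size and the sum of the weights.

```sas
/* Sketch: request design effects for the regression coefficients.     */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade;
   model Spending = Income / deff;
   weight Weight;
run;
```

With the DEFF option, the 'Design Summary' table also reports the Overall Sampling Rate used in the calculation.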

Stratum Collapse

If there is only one sampling unit in a stratum, then PROC SURVEYREG cannot estimate the variance for this stratum. To estimate stratum variances, by default the procedure collapses, or combines, those strata that contain only one sampling unit. If you specify the NOCOLLAPSE option in the STRATA statement, PROC SURVEYREG does not collapse strata and uses a variance estimate of 0 for any stratum that contains only one sampling unit.

If you do not specify the NOCOLLAPSE option, PROC SURVEYREG collapses strata according to the following rules. If there are multiple strata that each contain only one sampling unit, then the procedure collapses, or combines, all these strata into a new pooled stratum. If there is only one stratum with a single sampling unit, then PROC SURVEYREG collapses that stratum with the preceding stratum, where strata are ordered by the STRATA variable values. If the stratum with one sampling unit is the first stratum, then the procedure combines it with the following stratum.

If you specify stratum sampling rates using the RATE= SAS-data-set option, PROC SURVEYREG computes the sampling rate for the new pooled stratum as the weighted average of the sampling rates for the collapsed strata. See the section 'Computational Details' on page 4384 for details. If the specified sampling rate equals 0 for any of the collapsed strata, then the pooled stratum is assigned a sampling rate of 0. If you specify stratum totals using the TOTAL= SAS-data-set option, PROC SURVEYREG combines the totals for the collapsed strata to compute the sampling rate for the new pooled stratum.
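To see which strata are affected, or to turn collapsing off, you can combine the LIST and NOCOLLAPSE options. This is a minimal sketch that reuses the data set and variable names from the example at the end of this section.

```sas
/* Sketch: list stratum information and do not collapse single-unit    */
/* strata; such strata then contribute a variance estimate of 0.       */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade / list nocollapse;
   model Spending = Income;
   weight Weight;
run;
```

Dropping NOCOLLAPSE and keeping LIST shows the default behavior, including the Collapsed column in the 'Stratum Information' table.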

Sampling Rate of the Pooled Stratum from Collapse

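Assuming that PROC SURVEYREG collapses single-unit strata $h_1, h_2, \ldots, h_c$ into the pooled stratum, the procedure calculates the sampling rate for the pooled stratum as

$$f_{Pooled\ Stratum} = \begin{cases} 0 & \text{if any of } f_{h_l} = 0, \; l = 1, 2, \ldots, c \\[1ex] \left( \sum_{l=1}^{c} n_{h_l} \right) \left( \sum_{l=1}^{c} \dfrac{n_{h_l}}{f_{h_l}} \right)^{-1} & \text{otherwise} \end{cases}$$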
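As a worked illustration with made-up numbers, take the pooled rate to be the combined number of sampled units divided by the combined implied population count, where each stratum's implied count is $n_h / f_h$. This reading is an assumption consistent with the weighted-average description above. Suppose two single-unit strata with $n_{h_1} = n_{h_2} = 1$ and rates $f_{h_1} = 0.10$ and $f_{h_2} = 0.05$:

```latex
\[
f_{\text{pooled}}
  = \frac{n_{h_1} + n_{h_2}}{n_{h_1}/f_{h_1} + n_{h_2}/f_{h_2}}
  = \frac{1 + 1}{1/0.10 + 1/0.05}
  = \frac{2}{10 + 20}
  \approx 0.0667
\]
```

The result lies between the two stratum rates and is pulled toward the stratum with the larger implied population.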

Contrasts

You can use the CONTRAST statement to perform custom hypothesis tests. If the hypothesis is testable in the univariate case, the Wald F statistic for $H_0\colon L\beta = 0$ is computed as

$$F_{Wald} = \frac{(L_{Full}\hat{\beta})' \,(L_{Full} \hat{V} L_{Full}')^{-1}\, (L_{Full}\hat{\beta})}{\mathrm{rank}(L)}$$

where $L$ is the contrast vector or matrix you specify, $\beta$ is the vector of regression parameters, $\hat{\beta} = (X'WX)^{-} X'Wy$, $\hat{V}$ is the estimated covariance matrix of $\hat{\beta}$, rank($L$) is the rank of $L$, and $L_{Full}$ is a matrix such that

  • $L_{Full}$ has the same number of columns as $L$

  • $L_{Full}$ has full row rank

  • the rank of $L_{Full}$ equals the rank of the $L$ matrix

  • all rows of $L_{Full}$ are estimable functions

  • the Wald F statistic computed using the $L_{Full}$ matrix is equivalent to the Wald F statistic computed using the $L$ matrix with any row deleted that is a linear combination of previous rows

If $L$ is a full-rank matrix, and all rows of $L$ are estimable functions, then $L_{Full}$ is the same as $L$. It is possible that the $L_{Full}$ matrix cannot be constructed for contrasts in a CONTRAST statement, in which case the contrasts are not testable.
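The following sketch shows a single-row and a two-row contrast. All names are hypothetical: StudentSurvey is a data set, and Region is assumed to be a CLASS variable with exactly three levels that sort as East, North, West, so the coefficients are listed in that order.

```sas
/* Hypothetical data: Region has three levels sorted East, North, West. */
proc surveyreg data=StudentSurvey;
   class Region;
   model Score = StudyHours Region;
   /* One row: numerator DF = rank(L) = 1; E prints the coefficients   */
   contrast 'East vs West' Region 1 0 -1 / e;
   /* Two rows separated by a comma: numerator DF = rank(L) = 2        */
   contrast 'Any Region difference' Region 1 0 -1,
                                    Region 0 1 -1;
   weight SamplingWeight;
run;
```

If the actual level order differs from the assumed one, the 'Coefficients of Contrast' table produced by the E option makes the mismatch visible.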

Output

Displayed Output

The SURVEYREG procedure produces the following output.

Data Summary

By default, PROC SURVEYREG displays the following information in the 'Data Summary' table:

  • Number of Observations, which is the total number of observations used in the analysis, excluding observations with missing values

  • Sum of Weights, if you specify a WEIGHT statement

  • Mean of the dependent variable in the MODEL statement, or Weighted Mean if you specify a WEIGHT statement

  • Sum of the dependent variable in the MODEL statement, or Weighted Sum if you specify a WEIGHT statement

Design Summary

When you specify a CLUSTER statement or a STRATA statement, the procedure displays a 'Design Summary' table, which provides the following sample design information:

  • Number of Strata, if you specify a STRATA statement

  • Number of Strata Collapsed, if the procedure collapses strata

  • Number of Clusters, if you specify a CLUSTER statement

  • Overall Sampling Rate used to calculate the design effect, if you specify the DEFF option in the MODEL statement

Fit Statistics

By default, PROC SURVEYREG displays the following regression statistics in the 'Fit Statistics' table:

  • R-square for the regression

  • Root MSE, which is the square root of the mean square error

  • Denominator DF, which is the denominator degrees of freedom for the F tests and also the degrees of freedom for the t tests produced by the procedure

Stratum Information

When you specify the LIST option in the STRATA statement, PROC SURVEYREG displays a 'Stratum Information' table, which provides the following information for each stratum:

  • Stratum Index, which is a sequential stratum identification number

  • STRATA variable(s), which lists the levels of STRATA variables for the stratum

  • Population Total, if you specify the TOTAL= option

  • Sampling Rate, if you specify the TOTAL= option or the RATE= option. If you specify the TOTAL= option, the sampling rate is based on the number of nonmissing observations in the stratum.

  • N Obs, which is the number of observations

  • Number of Clusters, if you specify a CLUSTER statement

  • Collapsed, which has the value 'Yes' if the stratum is collapsed with another stratum before analysis

If PROC SURVEYREG collapses strata, the 'Stratum Information' table also displays stratum information for the new, collapsed stratum. The new stratum has a Stratum Index of 0 and is labeled 'Pooled'.

Class Level Information

If you use a CLASS statement to name classification variables, PROC SURVEYREG displays a 'Class Level Information' table. This table contains the following information for each classification variable:

  • Class Variable, which lists each CLASS variable name

  • Levels, which is the number of values or levels of the classification variable

  • Values, which lists the values of the classification variable. The values are separated by a white space character; therefore, to avoid confusion, you should not include a white space character within a classification variable value.

X'X Matrix

If you specify the XPX option in the MODEL statement, PROC SURVEYREG displays the X'X matrix, or the X'WX matrix when there is a WEIGHT variable. This option also displays the crossproducts vector X'y or X'Wy, where y is the response vector (dependent variable).

Inverse Matrix of X'X

If you specify the INV option in the MODEL statement, PROC SURVEYREG displays the inverse or the generalized inverse of the X'X matrix. When there is a WEIGHT variable, the procedure displays the inverse or the generalized inverse of the X'WX matrix.
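Both matrices can be requested in one step. This sketch reuses the data set and variable names from the example at the end of this section; because a WEIGHT statement is present, the displayed matrices would be the weighted versions.

```sas
/* Sketch: display the crossproducts matrix and its (generalized) inverse */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade;
   model Spending = Income / xpx inv;
   weight Weight;
run;
```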

ANOVA for Dependent Variable

If you specify the ANOVA option in the MODEL statement, PROC SURVEYREG displays an analysis of variance table for the dependent variable. This table is identical to the ANOVA table displayed by the GLM procedure.

Tests of Model Effects

By default, PROC SURVEYREG displays a 'Tests of Model Effects' table, which provides Wald's F test for each effect in the model. The table contains the following information for each effect:

  • Effect, which is the effect name

  • Num DF, which is the numerator degrees of freedom for Wald's F test

  • F Value, which is Wald's F statistic

  • Pr > F, which is the significance probability corresponding to the F Value

A footnote displays the denominator degrees of freedom, which is the same for all effects.

Estimated Regression Coefficients

PROC SURVEYREG displays the 'Estimated Regression Coefficients' table by default when there is no CLASS statement. The procedure also displays this table when you specify a CLASS statement together with the SOLUTION option in the MODEL statement. This table contains the following information for each regression parameter:

  • Parameter, which identifies the effect or regressor variable

  • Estimate, which is the estimate of the regression coefficient

  • Standard Error, which is the standard error of the estimate

  • t Value, which is the t statistic for testing $H_0$: Parameter = 0

  • Pr > |t|, which is the two-sided significance probability corresponding to the t Value

Covariance of Estimated Regression Coefficients

When you specify the COVB option in the MODEL statement, PROC SURVEYREG displays the 'Covariance of Estimated Regression Coefficients' matrix.

Coefficients of Contrast

When you specify the E option in a CONTRAST statement, PROC SURVEYREG displays a 'Coefficients of Contrast' table for the contrast. You can use this table to check the coefficients you specified in the CONTRAST statement. Also, this table gives a note for a nonestimable contrast.

Analysis of Contrasts

If you specify a CONTRAST statement, PROC SURVEYREG produces an 'Analysis of Contrasts' table, which displays Wald's F test for the contrast. If you use more than one CONTRAST statement, the procedure displays all results in the same table. The 'Analysis of Contrasts' table contains the following information for each contrast:

  • Contrast, which is the label of the contrast

  • Num DF, which is the numerator degrees of freedom for Wald's F test

  • F Value, which is Wald's F statistic for testing $H_0$: Contrast = 0

  • Pr > F, which is the significance probability corresponding to the F Value

Coefficients of Estimate

When you specify the E option in an ESTIMATE statement, PROC SURVEYREG displays a 'Coefficients of Estimate' table for the linear function of the regression parameters in the ESTIMATE statement. You can use this table to check the coefficients you specified in the ESTIMATE statement. Also, this table gives a note for a nonestimable function.

Analysis of Estimable Functions

If you specify an ESTIMATE statement, PROC SURVEYREG checks the function for estimability. If the function is estimable, PROC SURVEYREG produces an 'Analysis of Estimable Functions' table, which displays the estimate and the corresponding t test. If you use more than one ESTIMATE statement, the procedure displays all results in the same table. The table contains the following information for each estimable function:

  • Parameter, which is the label of the function

  • Estimate, which is the estimate of the estimable linear function

  • Standard Error, which is the standard error of the estimate

  • t Value, which is the t statistic for testing $H_0$: Estimable Function = 0

  • Pr > |t|, which is the two-sided significance probability corresponding to the t Value
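As a sketch of how such a table might be requested, the following reuses the data set and variable names from the example at the end of this section. The labels and the Income value of 50 are arbitrary placeholders chosen for illustration.

```sas
/* Sketch: two linear functions of the regression parameters.          */
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade;
   model Spending = Income;
   /* Slope alone; E prints the 'Coefficients of Estimate' table       */
   estimate 'Income slope' Income 1 / e;
   /* Intercept + 50 * slope: predicted spending at Income = 50        */
   estimate 'Spending at Income 50' intercept 1 Income 50;
   weight Weight;
run;
```

Both functions appear as rows of the same 'Analysis of Estimable Functions' table.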

Output Data Sets

Output data sets from PROC SURVEYREG are produced using ODS (Output Delivery System). ODS encompasses more than just the production of output data sets. For example, you can use ODS to manipulate the format of your output, the headers and titles of the tables, and the order of the columns in a table. For a more detailed description on using ODS, see Chapter 14, 'Using the Output Delivery System.'

ODS Table Names

PROC SURVEYREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 71.2: ODS Tables Produced in PROC SURVEYREG

ODS Table Name      Description                                      Statement       Option
------------------  -----------------------------------------------  --------------  -------
ANOVA               ANOVA for dependent variable                     MODEL           ANOVA
ClassVarInfo        Class level information                          CLASS           default
ContrastCoef        Coefficients of contrast                         CONTRAST        E
Contrasts           Analysis of contrasts                            CONTRAST        default
CovB                Covariance of estimated regression coefficients  MODEL           COVB
DataSummary         Data summary                                     MODEL           default
DesignSummary       Design summary                                   STRATA CLUSTER  default
Effects             Tests of model effects                           MODEL
EstimateCoef        Coefficients of estimate                         ESTIMATE        E
Estimates           Analysis of estimable functions                  ESTIMATE        default
FitStatistics       Fit Statistics                                   MODEL           default
InvXPX              Inverse matrix of X'X                            MODEL           INV
ParameterEstimates  Estimated regression coefficients                MODEL           default
StrataInfo          Stratum information                              STRATA          LIST
XPX                 X'X matrix                                       MODEL           XPX

By referring to the names of such tables, you can use the ODS OUTPUT statement to place one or more of these tables in output data sets.

For example, the following statements create an output data set named MyStrata , which contains the 'StrataInfo' table, an output data set named MyParmEst , which contains the 'ParameterEstimates' table, and an output data set named Cov , which contains the 'CovB' table for the ice cream study discussed in the section 'Stratified Sampling' on page 4368:

   title1 'Ice Cream Spending Analysis';
   title2 'Stratified Simple Random Sample Design';
   proc surveyreg data=IceCream total=StudentTotals;
      strata Grade /list;
      class Kids;
      model Spending = Income Kids / solution covb;
      weight Weight;
      ods output StrataInfo = MyStrata
                 ParameterEstimates = MyParmEst
                 CovB = Cov;
   run;

Note that the option CovB is specified in the MODEL statement in order to produce the covariance matrix table.
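If you are unsure which table names a particular run produces, the ODS TRACE statement writes them to the SAS log. The sketch below is a companion to the preceding example and assumes that example has already been run, so that the MyParmEst data set exists.

```sas
/* Sketch: record the name of each table the procedure creates.        */
ods trace on;
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade / list;
   model Spending = Income;
   weight Weight;
run;
ods trace off;

/* Inspect one of the data sets created by the ODS OUTPUT statement    */
/* in the preceding example.                                           */
proc print data=MyParmEst;
run;
```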




SAS.STAT 9.1 Users Guide (Vol. 6)
Year: 2004
Pages: 127
