If an observation has a missing value or a nonpositive value for the WEIGHT variable, then PROC SURVEYREG excludes that observation from the analysis. An observation is also excluded if it has a missing value for any STRATA variable, CLUSTER variable, dependent variable, or any variable used in the independent effects. The analysis includes all observations in the data set that have nonmissing values for all these design and analysis variables.
If you have missing values in your survey data for any reason (such as nonresponse), this can compromise the quality of your survey results. If the respondents are different from the nonrespondents with regard to a survey effect or outcome, then survey estimates will be biased and will not accurately represent the survey population. There are a variety of techniques in sample design and survey operations that can reduce nonresponse. Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEYREG. Refer to Cochran (1977) for more details.
If your analysis should include a finite population correction (fpc), you can input either the sampling rate or the population total by using the RATE= option or the TOTAL= option. You cannot specify both of these options in the same PROC SURVEYREG statement. If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).
If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the study population. If you are specifying the TOTAL= option for a multistage design, you should input the total number of PSUs in the study population.
For a nonstratified sample design, or for a stratified sample design with the same sampling rate or the same population total in all strata, you should use the RATE= value option or the TOTAL= value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you can use the RATE= SAS-data-set option or the TOTAL= SAS-data-set option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option.
The secondary data set must contain all the stratification variables listed in the STRATA statement and all the variables in the BY statement. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL= SAS-data-set option, the secondary data set must have a variable named _TOTAL_ that contains the stratum population totals. Or if you specify the RATE= SAS-data-set option, the secondary data set must have a variable named _RATE_ that contains the stratum sampling rates.
The secondary data set must contain all BY and STRATA groups that occur in the primary data set. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of _TOTAL_ or _RATE_ for that stratum and ignores the rest.
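The requirements on the secondary data set can be sketched as follows. This is a minimal illustration, not an example from the original text: the data set names MySample and RegionTotals, the variables Region, Income, Spending, and Weight, and all data values are hypothetical.

```sas
/* Hypothetical primary data set: one row per sampled unit */
data MySample;
   input Region $ Income Spending Weight;
   datalines;
East 30 10 25
East 42 14 25
East 51 15 25
East 38 11 25
West 45 13 40
West 60 19 40
West 33  9 40
West 55 16 40
;

/* Hypothetical secondary data set: one row per stratum.          */
/* It carries the STRATA variable and the required _TOTAL_ column */
data RegionTotals;
   input Region $ _TOTAL_;
   datalines;
East 100
West 160
;

/* TOTAL= names the secondary data set; LIST prints the stratum table */
proc surveyreg data=MySample total=RegionTotals;
   strata Region / list;
   model Spending = Income;
   weight Weight;
run;
```

Each stratum total here (100 and 160) is at least as large as the stratum sample size (4), and every stratum in MySample has a matching row in RegionTotals.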
The value in the RATE= option, or the values of _RATE_ in the secondary data set, must be nonnegative numbers. You can specify a sampling rate as a number between 0 and 1. Or you can specify a sampling rate in percentage form as a number between 1 and 100, and PROC SURVEYREG will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.
If you specify the TOTAL= value option, value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.
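As a small sketch of the two ways to write a single sampling rate, the following steps request the same 4% rate, first as a proportion and then in percentage form. The data set MySample and its values are hypothetical and serve only to make the steps runnable.

```sas
/* Hypothetical unstratified sample */
data MySample;
   input Income Spending;
   datalines;
30 10
42 14
51 15
38 11
45 13
60 19
33  9
55 16
;

/* Sampling rate given as a proportion (4%) */
proc surveyreg data=MySample rate=0.04;
   model Spending = Income;
run;

/* The same rate in percentage form; the procedure converts it to 0.04 */
proc surveyreg data=MySample rate=4;
   model Spending = Income;
run;
```

Because the value 1 is read as 100%, a 1% rate has to be written as RATE=0.01.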
When you have clusters, or primary sampling units (PSUs), in your sample design, the procedure estimates variance from the variation among PSUs. For more information, see the section 'Variance Estimation' on page 4385. You can use the CLUSTER statement to identify the first-stage clusters in your design. PROC SURVEYREG assumes that each cluster represents a PSU in the sample and that each observation is an element of a PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.
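A minimal sketch of a stratified, clustered analysis follows. The data set SchoolSample, the variable School that identifies the PSU, and all values are hypothetical.

```sas
/* Hypothetical sample: 2 strata, 3 schools (PSUs) per stratum, 2 students per school */
data SchoolSample;
   input Region $ School Income Spending Weight;
   datalines;
East 1 30 10 25
East 1 42 14 25
East 2 51 15 25
East 2 38 11 25
East 3 47 12 25
East 3 35 10 25
West 4 45 13 40
West 4 60 19 40
West 5 33  9 40
West 5 55 16 40
West 6 41 12 40
West 6 58 17 40
;

/* CLUSTER names the first-stage units; variance comes from variation among schools */
proc surveyreg data=SchoolSample;
   strata Region;
   cluster School;
   model Spending = Income;
   weight Weight;
run;
```

Dropping the CLUSTER statement would make the procedure treat each of the 12 students as a separate PSU.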
For a stratified clustered sample design, observations are represented by an $n \times (p+2)$ matrix
$$(w, y, X) = (w_{hij},\; y_{hij},\; x_{hij})$$
where
$w$ denotes the sampling weight vector
$y$ denotes the dependent variable
$X$ denotes the $n \times p$ design matrix. (When an effect contains only classification variables, the columns of $X$ corresponding to this effect contain only 0s and 1s; no reparameterization is made.)
$h = 1, 2, \ldots, H$ is the stratum number, with a total of $H$ strata
$i = 1, 2, \ldots, n_h$ is the cluster number within stratum $h$, with a total of $n_h$ clusters
$j = 1, 2, \ldots, m_{hi}$ is the unit number within cluster $i$ of stratum $h$, with a total of $m_{hi}$ units
$p$ is the total number of parameters (including an intercept if the INTERCEPT effect is included in the MODEL statement)
$n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample
Also, $f_h$ denotes the sampling rate for stratum $h$. You can use the TOTAL= option or the RATE= option to input population totals or sampling rates. See the section 'Specification of Population Totals and Sampling Rates' on page 4382 for details. If you input stratum totals, PROC SURVEYREG computes $f_h$ as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYREG uses these values directly for $f_h$. If you do not specify the TOTAL= option or the RATE= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible, and a finite population correction is not used when computing variances.
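As a worked illustration of the ratio described above, suppose a stratum has a population total of 4000 and a sample size of 40. The symbol $N_h$ for the stratum total and both numbers are assumptions made for this example only.

```latex
\[
  f_h = \frac{n_h}{N_h} = \frac{40}{4000} = 0.01,
  \qquad
  1 - f_h = 0.99
\]
```

With a rate this small, the correction factor $1 - f_h$ is close to 1, which is why the correction is often ignored for small sampling fractions.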
PROC SURVEYREG solves the normal equations $X'WX\beta = X'Wy$ using a modified sweep routine that produces a generalized (g2) inverse $(X'WX)^{-}$ and a solution (Pringle and Raynor 1971)
$$\hat{\beta} = (X'WX)^{-} X'Wy$$
where $W$ is the diagonal matrix constructed from WEIGHT variable values.
For models with class variables, there are more design matrix columns than there are degrees of freedom (DF) for the effect. Thus, there are linear dependencies among the columns. In this case, the parameters are not estimable; there is an infinite number of least-squares solutions. PROC SURVEYREG uses a generalized (g2) inverse to obtain values for the estimates. The solution values are not displayed unless you specify the SOLUTION option in the MODEL statement. The solution has the characteristic that estimates are 0 whenever the design column for that parameter is a linear combination of previous columns. (Strictly termed, the solution values should not be called estimates.) With this full parameterization, hypothesis tests are constructed to test linear functions of the parameters that are estimable.
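PROC SURVEYREG uses the Taylor series expansion theory to estimate the covariance-variance matrix of the estimated regression coefficients (Fuller 1975). Let
$$r = y - X\hat{\beta}$$
where the $(h, i, j)$th element is $r_{hij}$. Compute $1 \times p$ row vectors
$$e_{hij} = w_{hij}\, r_{hij}\, x_{hij}, \qquad e_{hi\cdot} = \sum_{j=1}^{m_{hi}} e_{hij}, \qquad \bar{e}_{h\cdot\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} e_{hi\cdot}$$
where $x_{hij}$ is the row of $X$ for observation $(h, i, j)$, and calculate the $p \times p$ matrix
$$G = \frac{n-1}{n-p} \sum_{h=1}^{H} \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} (e_{hi\cdot} - \bar{e}_{h\cdot\cdot})' (e_{hi\cdot} - \bar{e}_{h\cdot\cdot})$$
PROC SURVEYREG computes the covariance matrix of $\hat{\beta}$ as
$$\hat{V}(\hat{\beta}) = (X'WX)^{-}\, G\, (X'WX)^{-}$$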
The factor $(n-1)/(n-p)$ in the computation of the $p \times p$ matrix should reduce the small sample bias associated with using the estimated function to calculate deviations (Hidiroglou et al. 1980). For simple random sampling, this factor contributes to the degrees of freedom correction applied to the residual mean square for ordinary least squares in which $p$ parameters are estimated. By default, the procedure uses this adjustment in the variance estimation; this is equivalent to specifying the VADJUST=DF option in the MODEL statement. If you do not want to use this multiplier in the variance estimation, you can specify the VADJUST=NONE option in the MODEL statement to suppress this factor.
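A short sketch of suppressing the adjustment is shown below. The data set MySample and its values are hypothetical. With 8 observations and 2 parameters (intercept and slope), the default multiplier would be (8 - 1)/(8 - 2) = 7/6.

```sas
/* Hypothetical sample with n = 8 observations */
data MySample;
   input Income Spending;
   datalines;
30 10
42 14
51 15
38 11
45 13
60 19
33  9
55 16
;

/* Default behavior: same as VADJUST=DF, the multiplier (n-1)/(n-p) is applied */
proc surveyreg data=MySample;
   model Spending = Income;
run;

/* VADJUST=NONE omits the multiplier from the variance estimate */
proc surveyreg data=MySample;
   model Spending = Income / vadjust=none;
run;
```

Comparing the standard errors from the two steps shows the size of the adjustment for a small sample.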
PROC SURVEYREG produces tests for the significance of model effects, regression parameters, estimable functions specified in the ESTIMATE statement, and contrasts specified in the CONTRAST statement. It computes all these tests taking into account the sample design. The degrees of freedom for these tests differ from the degrees of freedom for the ANOVA table, which does not consider the sample design.
The denominator DF refers to the denominator degrees of freedom for F tests and to the degrees of freedom for t tests in the analysis. By default, the denominator DF equals the number of clusters minus the actual number of strata. If there are no clusters, the denominator DF equals the number of observations minus the actual number of strata. The actual number of strata equals
one, if there is no STRATA statement
the number of strata in the sample, if there is a STRATA statement but the procedure does not collapse any strata
the number of strata in the sample after collapsing, if there is a STRATA statement and the procedure collapses strata that have only one sampling unit
Alternatively, you can specify the denominator DF by using the DF= option in the MODEL statement (see page 4380).
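The default rule can be illustrated with two hypothetical designs; the counts below are assumptions chosen only for the arithmetic. The first design has 30 clusters in 4 strata with no collapsing. The second has 200 observations, no CLUSTER statement, and no STRATA statement, so the actual number of strata is one.

```latex
\[
  \text{DF}_{\text{clustered}} = 30 - 4 = 26,
  \qquad
  \text{DF}_{\text{unclustered}} = 200 - 1 = 199
\]
```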
The numerator DF refers to the numerator degrees of freedom for the Wald F statistic associated with an effect or with a contrast. The procedure computes the Wald F statistic for an effect as a Type III test; that is, the test has the following properties:
The hypothesis for an effect does not involve parameters of other effects except for containing effects (which it must involve to be estimable).
The hypotheses to be tested are invariant to the ordering of effects in the model.
See the section 'Testing Effects' on page 4386 for more information. The numerator DF for the Wald F statistic for a contrast is the rank of the L matrix that defines the contrast.
For each effect in the model, PROC SURVEYREG computes an $L$ matrix such that every element of $L\beta$ is estimable; the $L$ matrix has the maximum possible rank associated with the effect. To test the effect, the procedure uses the Wald $F$ statistic for the hypothesis $H_0\colon L\beta = 0$. The Wald $F$ statistic equals
$$F_{\text{Wald}} = \frac{(L\hat{\beta})' (L \hat{V} L')^{-1} (L\hat{\beta})}{\operatorname{rank}(L)}$$
where $\hat{V}$ is the estimated covariance matrix of $\hat{\beta}$, with numerator degrees of freedom equal to $\operatorname{rank}(L)$ and denominator degrees of freedom equal to the number of clusters minus the number of strata (unless you have specified the denominator degrees of freedom with the DF= option in the MODEL statement; see the section 'Denominator Degrees of Freedom' on page 4386). It is possible that the $L$ matrix cannot be constructed for an effect, in which case that effect is not testable. For more information on how the matrix $L$ is constructed, see the discussion in Chapter 11, 'The Four Types of Estimable Functions.'
PROC SURVEYREG produces an analysis of variance table for the model specified in the MODEL statement. This table is identical to the one produced by the GLM procedure for the model. PROC SURVEYREG computes ANOVA table entries using the sampling weights, but not the sample design information on stratification and clustering.
The degrees of freedom (DF) displayed in the ANOVA table are the same as those in the ANOVA table produced by PROC GLM. The Total DF is the total degrees of freedom used to obtain the regression coefficient estimates. The Total DF equals the total number of observations minus 1 if the model includes an intercept. If the model does not include an intercept, the Total DF equals the total number of observations. The Model DF equals the degrees of freedom for the effects in the MODEL statement, not including the intercept. The Error DF equals the total DF minus the model DF.
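As a worked illustration of these rules, take a hypothetical model with an intercept and two continuous regressors fitted to 40 observations (numbers assumed for this example only).

```latex
\[
  \text{Total DF} = 40 - 1 = 39,
  \qquad
  \text{Model DF} = 2,
  \qquad
  \text{Error DF} = 39 - 2 = 37
\]
```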
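PROC SURVEYREG computes a multiple R-square for the weighted regression as
$$R^2 = 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}}$$
where $SS_{\text{error}}$ is the error sum of squares in the ANOVA table
$$SS_{\text{error}} = r'Wr, \qquad r = y - X\hat{\beta}$$
and $SS_{\text{total}}$ is the total sum of squares
$$SS_{\text{total}} = \begin{cases} y'Wy & \text{if there is no intercept} \\ y'Wy - \left( \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij} \right)^{2} \Big/ \, w_{\cdots} & \text{otherwise} \end{cases}$$
where $w_{\cdots}$ is the sum of the sampling weights over all observations.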
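If you specify the ADJRSQ option in the MODEL statement, PROC SURVEYREG computes an adjusted multiple R-square for the weighted regression as
$$\text{ADJ } R^2 = \begin{cases} 1 - \dfrac{n (1 - R^2)}{n - p} & \text{if there is no intercept} \\ 1 - \dfrac{(n - 1)(1 - R^2)}{n - p} & \text{otherwise} \end{cases}$$
where $R^2$ is the multiple R-square.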
PROC SURVEYREG computes the square root of the mean square error as
$$\sqrt{\text{MSE}} = \sqrt{\frac{n \; SS_{\text{error}}}{(n - p)\, w_{\cdots}}}$$
where $w_{\cdots}$ is the sum of the sampling weights over all observations.
If you specify the DEFF option in the MODEL statement, PROC SURVEYREG calculates the design effects for the regression coefficients. The design effect of an estimate is the ratio of the actual variance to the variance computed under the assumption of simple random sampling:
$$\text{DEFF} = \frac{\text{variance under the sample design}}{\text{variance under simple random sampling}}$$
Refer to Kish (1965, p. 258). PROC SURVEYREG computes the numerator as described in the section 'Variance Estimation' on page 4385. The denominator is computed under the assumption that the sample design is simple random sampling, with no stratification and no clustering.
To compute the variance under the assumption of simple random sampling, PROC SURVEYREG calculates the sampling rate as follows. If you specify both sampling weights and sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is calculated as
$$f_{\text{SRS}} = \frac{n}{w_{\cdots}}$$
where $n$ is the sample size and $w_{\cdots}$ (the sum of the weights over all observations) estimates the population size. If the sum of the weights is less than the sample size, $f_{\text{SRS}}$ is set to zero. If you specify sampling rates for the analysis but not sampling weights, then PROC SURVEYREG computes the sampling rate under simple random sampling as the average of the stratum sampling rates.
If you do not specify sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is assumed to be zero.
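A worked illustration of the weighted case follows, assuming the rate is the sample size divided by the sum of the weights, with $w_{\cdots}$ denoting that sum. The sample size of 40 and the weight sums of 4000 and 30 are hypothetical.

```latex
\[
  f_{\text{SRS}} = \frac{n}{w_{\cdots}} = \frac{40}{4000} = 0.01
\]
\[
  w_{\cdots} = 30 < n = 40 \;\Longrightarrow\; f_{\text{SRS}} = 0
\]
```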
If there is only one sampling unit in a stratum, then PROC SURVEYREG cannot estimate the variance for this stratum. To estimate stratum variances, by default the procedure collapses, or combines, those strata that contain only one sampling unit. If you specify the NOCOLLAPSE option in the STRATA statement, PROC SURVEYREG does not collapse strata and uses a variance estimate of 0 for any stratum that contains only one sampling unit.
If you do not specify the NOCOLLAPSE option, PROC SURVEYREG collapses strata according to the following rules. If there are multiple strata that each contain only one sampling unit, then the procedure collapses, or combines, all these strata into a new pooled stratum. If there is only one stratum with a single sampling unit, then PROC SURVEYREG collapses that stratum with the preceding stratum, where strata are ordered by the STRATA variable values. If the stratum with one sampling unit is the first stratum, then the procedure combines it with the following stratum.
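The collapsing rules can be observed with a small sketch such as the following. The data set SmallSample and its values are hypothetical; the North stratum deliberately contains a single sampling unit.

```sas
/* Hypothetical sample: stratum North has only one sampling unit */
data SmallSample;
   input Region $ Income Spending Weight;
   datalines;
North 36 12 30
South 30 10 25
South 42 14 25
South 51 15 25
West  45 13 40
West  60 19 40
West  33  9 40
;

/* Default: the single-unit stratum is collapsed. North sorts first, */
/* so by the rules above it is combined with the following stratum   */
proc surveyreg data=SmallSample;
   strata Region / list;
   model Spending = Income;
   weight Weight;
run;

/* NOCOLLAPSE: strata are kept as is, and North contributes a variance of 0 */
proc surveyreg data=SmallSample;
   strata Region / list nocollapse;
   model Spending = Income;
   weight Weight;
run;
```

The LIST option prints the 'Stratum Information' table, which makes it easy to confirm which strata were collapsed in the first step.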
If you specify stratum sampling rates using the RATE= SAS-data-set option, PROC SURVEYREG computes the sampling rate for the new pooled stratum as the weighted average of the sampling rates for the collapsed strata. See the section 'Computational Details' on page 4384 for details. If the specified sampling rate equals 0 for any of the collapsed strata, then the pooled stratum is assigned a sampling rate of 0. If you specify stratum totals using the TOTAL= SAS-data-set option, PROC SURVEYREG combines the totals for the collapsed strata to compute the sampling rate for the new pooled stratum.
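Assuming that PROC SURVEYREG collapses single-unit strata $h_1, h_2, \ldots, h_c$ into the pooled stratum, the procedure calculates the sampling rate for the pooled stratum as
$$f_{\text{Pooled}} = \begin{cases} 0 & \text{if any of } f_{h_l} = 0,\; l = 1, \ldots, c \\ \left( \sum_{l=1}^{c} n_{h_l} f_{h_l}^{-1} \right)^{-1} \sum_{l=1}^{c} n_{h_l} & \text{otherwise} \end{cases}$$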
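As a worked illustration, assume the pooled rate is the combined sample size divided by the combined implied population size, $\sum_l n_{h_l} / \sum_l (n_{h_l} / f_{h_l})$. Take two hypothetical single-unit strata with sampling rates 0.10 and 0.05, so each stratum has $n_{h_l} = 1$ and implied population sizes of 10 and 20.

```latex
\[
  f_{\text{Pooled}}
  = \frac{1 + 1}{\dfrac{1}{0.10} + \dfrac{1}{0.05}}
  = \frac{2}{10 + 20}
  = \frac{2}{30}
  \approx 0.0667
\]
```

This matches what the TOTAL= route would give for the same strata: combining the totals 10 and 20 yields a pooled total of 30 for 2 sampled units.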
You can use the CONTRAST statement to perform custom hypothesis tests. If the hypothesis is testable in the univariate case, the Wald $F$ statistic for $H_0\colon L\beta = 0$ is computed as
$$F_{\text{Wald}} = \frac{(L_{\text{Full}}\hat{\beta})' (L_{\text{Full}} \hat{V} L_{\text{Full}}')^{-1} (L_{\text{Full}}\hat{\beta})}{\operatorname{rank}(L)}$$
where $L$ is the contrast vector or matrix you specify, $\beta$ is the vector of regression parameters, $\hat{\beta} = (X'WX)^{-} X'Wy$, $\hat{V}$ is the estimated covariance matrix of $\hat{\beta}$, $\operatorname{rank}(L)$ is the rank of $L$, and $L_{\text{Full}}$ is a matrix such that
$L_{\text{Full}}$ has the same number of columns as $L$
$L_{\text{Full}}$ has full row rank
the rank of $L_{\text{Full}}$ equals the rank of the $L$ matrix
all rows of $L_{\text{Full}}$ are estimable functions
the Wald $F$ statistic computed using the $L_{\text{Full}}$ matrix is equivalent to the Wald $F$ statistic computed using the $L$ matrix with any row deleted that is a linear combination of previous rows
If $L$ is a full-rank matrix, and all rows of $L$ are estimable functions, then $L_{\text{Full}}$ is the same as $L$. It is possible that the $L_{\text{Full}}$ matrix cannot be constructed for contrasts in a CONTRAST statement, in which case the contrasts are not testable.
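A minimal sketch of single-row and multi-row contrasts follows. The data set FamilySample, the three-level classification variable Kids, and all values are hypothetical.

```sas
/* Hypothetical sample with a three-level classification variable */
data FamilySample;
   input Kids Income Spending;
   datalines;
1 30 10
1 42 13
1 39 12
2 51 16
2 38 13
2 45 15
3 60 21
3 33 14
3 55 19
;

proc surveyreg data=FamilySample;
   class Kids;
   model Spending = Income Kids / solution;
   /* One-row contrast: level 1 versus level 2; E prints the coefficients */
   contrast 'Kids 1 vs 2' Kids 1 -1 0 / e;
   /* Two-row contrast (rows separated by a comma): numerator DF is rank(L) = 2 */
   contrast 'Any Kids difference' Kids 1 0 -1,
                                  Kids 0 1 -1;
run;
```

Both contrasts have coefficients that sum to zero within Kids, so each row is an estimable function under the full parameterization.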
The SURVEYREG procedure produces the following output.
By default, PROC SURVEYREG displays the following information in the 'Data Summary' table:
Number of Observations, which is the total number of observations used in the analysis, excluding observations with missing values
Sum of Weights, if you specify a WEIGHT statement
Mean of the dependent variable in the MODEL statement, or Weighted Mean if you specify a WEIGHT statement
Sum of the dependent variable in the MODEL statement, or Weighted Sum if you specify a WEIGHT statement
When you specify a CLUSTER statement or a STRATA statement, the procedure displays a 'Design Summary' table, which provides the following sample design information:
Number of Strata, if you specify a STRATA statement
Number of Strata Collapsed, if the procedure collapses strata
Number of Clusters, if you specify a CLUSTER statement
Overall Sampling Rate used to calculate the design effect, if you specify the DEFF option in the MODEL statement
By default, PROC SURVEYREG displays the following regression statistics in the 'Fit Statistics' table:
R-square for the regression
Root MSE, which is the square root of the mean square error
Denominator DF, which is the denominator degrees of freedom for the F tests and also the degrees of freedom for the t tests produced by the procedure
When you specify the LIST option in the STRATA statement, PROC SURVEYREG displays a 'Stratum Information' table, which provides the following information for each stratum:
Stratum Index, which is a sequential stratum identification number
STRATA variable(s), which lists the levels of STRATA variables for the stratum
Population Total, if you specify the TOTAL= option
Sampling Rate, if you specify the TOTAL= option or the RATE= option. If you specify the TOTAL= option, the sampling rate is based on the number of nonmissing observations in the stratum.
N Obs, which is the number of observations
Number of Clusters, if you specify a CLUSTER statement
Collapsed, which has the value 'Yes' if the stratum is collapsed with another stratum before analysis
If PROC SURVEYREG collapses strata, the 'Stratum Information' table also displays stratum information for the new, collapsed stratum. The new stratum has a Stratum Index of 0 and is labeled 'Pooled'.
If you use a CLASS statement to name classification variables, PROC SURVEYREG displays a 'Class Level Information' table. This table contains the following information for each classification variable:
Class Variable, which lists each CLASS variable name
Levels, which is the number of values or levels of the classification variable
Values, which lists the values of the classification variable. The values are separated by a white space character; therefore, to avoid confusion, you should not include a white space character within a classification variable value.
If you specify the XPX option in the MODEL statement, PROC SURVEYREG displays the $X'X$ matrix, or the $X'WX$ matrix when there is a WEIGHT variable. This option also displays the crossproducts vector $X'y$ or $X'Wy$, where $y$ is the response vector (dependent variable).
If you specify the INV option in the MODEL statement, PROC SURVEYREG displays the inverse or the generalized inverse of the $X'X$ matrix. When there is a WEIGHT variable, the procedure displays the inverse or the generalized inverse of the $X'WX$ matrix.
If you specify the ANOVA option in the MODEL statement, PROC SURVEYREG displays an analysis of variance table for the dependent variable. This table is identical to the ANOVA table displayed by the GLM procedure.
By default, PROC SURVEYREG displays a 'Tests of Model Effects' table, which provides Wald's F test for each effect in the model. The table contains the following information for each effect:
Effect, which is the effect name
Num DF, which is the numerator degrees of freedom for Wald's F test
F Value, which is Wald's F statistic
Pr > F, which is the significance probability corresponding to the F Value
A footnote displays the denominator degrees of freedom, which is the same for all effects.
PROC SURVEYREG displays the 'Estimated Regression Coefficients' table by default when there is no CLASS statement. The procedure also displays this table when you specify a CLASS statement together with the SOLUTION option in the MODEL statement. This table contains the following information for each regression parameter:
Parameter, which identifies the effect or regressor variable
Estimate, which is the estimate of the regression coefficient
Standard Error, which is the standard error of the estimate
t Value, which is the $t$ statistic for testing $H_0$: Parameter = 0
Pr > |t|, which is the two-sided significance probability corresponding to the t Value
When you specify the COVB option in the MODEL statement, PROC SURVEYREG displays the 'Covariance of Estimated Regression Coefficients' matrix.
When you specify the E option in a CONTRAST statement, PROC SURVEYREG displays a 'Coefficients of Contrast' table for the contrast. You can use this table to check the coefficients you specified in the CONTRAST statement. Also, this table gives a note for a nonestimable contrast.
If you specify a CONTRAST statement, PROC SURVEYREG produces an 'Analysis of Contrasts' table, which displays Wald's F test for the contrast. If you use more than one CONTRAST statement, the procedure displays all results in the same table. The 'Analysis of Contrasts' table contains the following information for each contrast:
Contrast, which is the label of the contrast
Num DF, which is the numerator degrees of freedom for Wald's F test
F Value, which is Wald's F statistic for testing $H_0$: Contrast = 0
Pr > F, which is the significance probability corresponding to the F Value
When you specify the E option in an ESTIMATE statement, PROC SURVEYREG displays a 'Coefficients of Estimate' table for the linear function of the regression parameters in the ESTIMATE statement. You can use this table to check the coefficients you specified in the ESTIMATE statement. Also, this table gives a note for a nonestimable function.
If you specify an ESTIMATE statement, PROC SURVEYREG checks the function for estimability. If the function is estimable, PROC SURVEYREG produces an 'Analysis of Estimable Functions' table, which displays the estimate and the corresponding t test. If you use more than one ESTIMATE statement, the procedure displays all results in the same table. The table contains the following information for each estimable function:
Parameter, which is the label of the function
Estimate, which is the estimate of the estimable linear function
Standard Error, which is the standard error of the estimate
t Value, which is the $t$ statistic for testing $H_0$: Estimable Function = 0
Pr > |t|, which is the two-sided significance probability corresponding to the t Value
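A minimal sketch of the ESTIMATE statement follows. The data set FamilySample, the classification variable Kids, and all values are hypothetical.

```sas
/* Hypothetical sample with a three-level classification variable */
data FamilySample;
   input Kids Income Spending;
   datalines;
1 30 10
1 42 13
1 39 12
2 51 16
2 38 13
2 45 15
3 60 21
3 33 14
3 55 19
;

proc surveyreg data=FamilySample;
   class Kids;
   model Spending = Income Kids / solution;
   /* Difference between two class levels; E prints the coefficients used */
   estimate 'Kids 1 minus Kids 3' Kids 1 0 -1 / e;
   /* Slope of a continuous regressor */
   estimate 'Income slope' Income 1;
run;
```

Because two ESTIMATE statements are used, both results appear in the same 'Analysis of Estimable Functions' table.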
Output data sets from PROC SURVEYREG are produced using ODS (Output Delivery System). ODS encompasses more than just the production of output data sets. For example, you can use ODS to manipulate the format of your output, the headers and titles of the tables, and the order of the columns in a table. For a more detailed description of using ODS, see Chapter 14, 'Using the Output Delivery System.'
PROC SURVEYREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'
ODS Table Name | Description | Statement | Option |
---|---|---|---|
ANOVA | ANOVA for dependent variable | MODEL | ANOVA |
ClassVarInfo | Class level information | CLASS | default |
ContrastCoef | Coefficients of contrast | CONTRAST | E |
Contrasts | Analysis of contrasts | CONTRAST | default |
CovB | Covariance of estimated regression coefficients | MODEL | COVB |
DataSummary | Data summary | MODEL | default |
DesignSummary | Design summary | STRATA, CLUSTER | default |
Effects | Tests of model effects | MODEL | default |
EstimateCoef | Coefficients of estimate | ESTIMATE | E |
Estimates | Analysis of estimable functions | ESTIMATE | default |
FitStatistics | Fit Statistics | MODEL | default |
InvXPX | Inverse matrix of $X'X$ | MODEL | INV |
ParameterEstimates | Estimated regression coefficients | MODEL | default |
StrataInfo | Stratum information | STRATA | LIST |
XPX | $X'X$ matrix | MODEL | XPX |
By referring to the names of such tables, you can use the ODS OUTPUT statement to place one or more of these tables in output data sets.
For example, the following statements create an output data set named MyStrata , which contains the 'StrataInfo' table, an output data set named MyParmEst , which contains the 'ParameterEstimates' table, and an output data set named Cov , which contains the 'CovB' table for the ice cream study discussed in the section 'Stratified Sampling' on page 4368:
    title1 'Ice Cream Spending Analysis';
    title2 'Stratified Simple Random Sample Design';
    proc surveyreg data=IceCream total=StudentTotals;
       strata Grade /list;
       class Kids;
       model Spending = Income Kids / solution covb;
       weight Weight;
       ods output StrataInfo = MyStrata
                  ParameterEstimates = MyParmEst
                  CovB = Cov;
    run;
Note that the COVB option is specified in the MODEL statement in order to produce the covariance matrix table.