This example investigates the relationship between the labor force participation rate (LFPR) of women in 1968 and 1972 in large cities in the United States. A simple random sample of 19 cities is drawn from a total of 200 cities. For each selected city, the LFPRs are recorded and saved in a SAS data set named Labor. The LFPR in 1972 is contained in the variable LFPR1972, and the LFPR in 1968 is contained in the variable LFPR1968:
data Labor;
   input City $ 1-16 LFPR1972 LFPR1968;
   datalines;
New York         .45 .42
Los Angeles      .50 .50
Chicago          .52 .52
Philadelphia     .45 .45
Detroit          .46 .43
San Francisco    .55 .55
Boston           .60 .45
Pittsburgh       .49 .34
St. Louis        .35 .45
Connecticut      .55 .54
Washington D.C.  .52 .42
Cincinnati       .53 .51
Baltimore        .57 .49
Newark           .53 .54
Minn/St. Paul    .59 .50
Buffalo          .64 .58
Houston          .50 .49
Patterson        .57 .56
Dallas           .64 .63
;
Assume that the LFPRs in 1968 and 1972 have a linear relationship, as shown in the following model:

$$\text{LFPR1972} = \beta_0 + \beta_1 \times \text{LFPR1968} + \text{error}$$
You can use PROC SURVEYREG to obtain the estimated regression coefficients and estimated standard errors of the regression coefficients. The following statements perform the regression analysis:
title 'Study of Labor Force Participation Rates of Women';
proc surveyreg data=Labor total=200;
   model LFPR1972 = LFPR1968;
run;
Here, the TOTAL=200 option specifies the finite population total from which the simple random sample of 19 cities is drawn. You can specify the same information by using the sampling rate option RATE=0.095 (19/200=.095).
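The RATE= form mentioned above can be sketched as follows. This is a minimal illustration that reuses the Labor data set and the same MODEL statement; because 19/200 = 0.095, it is expected to apply the same finite population correction as TOTAL=200.

```sas
title 'Study of Labor Force Participation Rates of Women';
/* Sampling rate 19/200 = 0.095 replaces the TOTAL=200 option */
proc surveyreg data=Labor rate=0.095;
   model LFPR1972 = LFPR1968;
run;
```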
Output 71.1.1 summarizes the data and the fit information.
Study of Labor Force Participation Rates of Women

The SURVEYREG Procedure
Regression Analysis for Dependent Variable LFPR1972

Data Summary | |
---|---|
Number of Observations | 19 |
Mean of LFPR1972 | 0.52684 |
Sum of LFPR1972 | 10.01000 |

Fit Statistics | |
---|---|
R-square | 0.3970 |
Root MSE | 0.05657 |
Denominator DF | 18 |
Output 71.1.2 presents the significance tests for the model effects and estimated regression coefficients. The F tests and t tests for the effects in the model are also presented in these tables.
Study of Labor Force Participation Rates of Women

The SURVEYREG Procedure
Regression Analysis for Dependent Variable LFPR1972

Tests of Model Effects

Effect | Num DF | F Value | Pr > F |
---|---|---|---|
Model | 1 | 13.84 | 0.0016 |
Intercept | 1 | 4.63 | 0.0452 |
LFPR1968 | 1 | 13.84 | 0.0016 |

NOTE: The denominator degrees of freedom for the F tests is 18.

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | 0.20331056 | 0.09444296 | 2.15 | 0.0452 |
LFPR1968 | 0.65604048 | 0.17635810 | 3.72 | 0.0016 |

NOTE: The denominator degrees of freedom for the t tests is 18.
From the regression performed by PROC SURVEYREG, you obtain a positive estimated slope for the linear relationship between the LFPR in 1968 and the LFPR in 1972. Both regression coefficients, Intercept and LFPR1968, are significant at the 5% level. Because the model contains a single regressor, the F test for the overall model (excluding the intercept) is the same as the F test for the effect LFPR1968.
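For an effect with one numerator degree of freedom, the F statistic equals the square of the corresponding t statistic. The following sketch checks this against the values in Output 71.1.2; the variable names are illustrative, and the printed values should agree with the listed t and F values up to rounding.

```sas
data _null_;
   /* Estimates and standard errors copied from Output 71.1.2 */
   t_int   = 0.20331056 / 0.09444296;
   t_slope = 0.65604048 / 0.17635810;
   /* With one numerator degree of freedom, F is the squared t */
   f_int   = t_int**2;
   f_slope = t_slope**2;
   put t_int= 6.2 f_int= 6.2 t_slope= 6.2 f_slope= 6.2;
run;
```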
This example illustrates the use of regression analysis in a simple random cluster sample design. The data are from Särndal, Swensson, and Wretman (1992, p. 652).
A total of 284 Swedish municipalities are grouped into 50 clusters of neighboring municipalities. Five clusters with a total of 32 municipalities are randomly selected. The results from the regression analysis in which clusters are used in the sample design are compared to the results of a regression analysis that ignores the clusters. The linear relationship between the population in 1975 and in 1985 is investigated.
The 32 selected municipalities in the sample are saved in the data set Municipalities:

data Municipalities;
   input Municipality Cluster Population85 Population75;
   datalines;
205 37   5   5
206 37  11  11
207 37  13  13
208 37   8   8
209 37  17  19
  6  2  16  15
  7  2  70  62
  8  2  66  54
  9  2  12  12
 10  2  60  50
 94 17   7   7
 95 17  16  16
 96 17  13  11
 97 17  12  11
 98 17  70  67
 99 17  20  20
100 17  31  28
101 17  49  48
276 50   6   7
277 50   9  10
278 50  24  26
279 50  10   9
280 50  67  64
281 50  39  35
282 50  29  27
283 50  10   9
284 50  27  31
 52 10   7   6
 53 10   9   8
 54 10  28  27
 55 10  12  11
 56 10 107 108
;
The variable Municipality identifies the municipalities in the sample; the variable Cluster indicates the cluster to which a municipality belongs; and the variables Population85 and Population75 contain the municipality populations in 1985 and in 1975 (in thousands), respectively. A regression analysis is performed by PROC SURVEYREG with a CLUSTER statement:
title1 'Regression Analysis for Swedish Municipalities';
title2 'Cluster Simple Random Sampling';
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
run;
The TOTAL=50 option specifies the total number of clusters in the sampling frame.
Output 71.2.1 displays the data summary, design summary, fit statistics, and regression coefficient estimates. Because the sample design includes clusters, the procedure displays the total number of clusters in the sample in the 'Design Summary' table. In the 'Estimated Regression Coefficients' table, the estimated slope for the linear relationship is 1.05, which is significant at the 5% level, but the intercept is not significant. This suggests that a regression line through the origin can be established between the populations in 1975 and in 1985.
Regression Analysis for Swedish Municipalities
Cluster Simple Random Sampling

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Population85

Data Summary | |
---|---|
Number of Observations | 32 |
Mean of Population85 | 27.50000 |
Sum of Population85 | 880.00000 |

Design Summary | |
---|---|
Number of Clusters | 5 |

Fit Statistics | |
---|---|
R-square | 0.9860 |
Root MSE | 3.0488 |
Denominator DF | 4 |

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | -0.0191292 | 0.89204053 | -0.02 | 0.9839 |
Population75 | 1.0546253 | 0.05167565 | 20.41 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 4.
The CLUSTER statement is necessary in PROC SURVEYREG in order to incorporate the sample design. If you do not specify a CLUSTER statement in the regression analysis, the standard errors of the regression coefficients are estimated incorrectly, as the following statements illustrate:
title1 'Regression Analysis for Swedish Municipalities';
title2 'Simple Random Sampling';
proc surveyreg data=Municipalities total=284;
   model Population85=Population75;
run;
The analysis ignores the clusters in the sample and treats the sample design as simple random sampling. Therefore, the TOTAL= option specifies the total number of municipalities, which is 284.
Output 71.2.2 displays the regression results when the clusters are ignored. The regression coefficient estimates are the same as in Output 71.2.1. However, without the clusters, the regression coefficients have smaller variance estimates. With clusters in the analysis, the estimated regression coefficient for the effect Population75 is 1.05 with an estimated standard error of 0.05 (Output 71.2.1); without the clusters, the estimate is still 1.05, but the estimated standard error is 0.04 (Output 71.2.2). To estimate the variance of the regression coefficients correctly, you should include the clustering information in the regression analysis.
Regression Analysis for Swedish Municipalities
Simple Random Sampling

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Population85

Data Summary | |
---|---|
Number of Observations | 32 |
Mean of Population85 | 27.50000 |
Sum of Population85 | 880.00000 |

Fit Statistics | |
---|---|
R-square | 0.9860 |
Root MSE | 3.0488 |
Denominator DF | 31 |

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | -0.0191292 | 0.67417606 | -0.03 | 0.9775 |
Population75 | 1.0546253 | 0.03668414 | 28.75 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 31.
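To gauge how much the clustering matters, you can compare the two standard errors of the Population75 coefficient directly. The sketch below uses illustrative variable names and the values from Output 71.2.1 and Output 71.2.2; the standard error ratio comes out at about 1.4, so the variance estimate that accounts for clusters is close to twice the one that ignores them.

```sas
data _null_;
   se_cluster = 0.05167565;   /* Output 71.2.1, CLUSTER statement used */
   se_srs     = 0.03668414;   /* Output 71.2.2, clusters ignored       */
   se_ratio   = se_cluster / se_srs;
   var_ratio  = se_ratio**2;  /* ratio of the variance estimates       */
   put se_ratio= 6.3 var_ratio= 6.3;
run;
```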
Using auxiliary information, you can construct the regression estimators to provide more accurate estimates of the population characteristics that are of interest. With ESTIMATE statements in PROC SURVEYREG, you can specify a regression estimator as a linear function of the regression parameters to estimate the population total. This example illustrates this application, using the data in the previous example.
In this sample, a linear model between the Swedish populations in 1975 and in 1985 is established:

$$\text{Population85} = \alpha + \beta \times \text{Population75} + \text{error}$$
Assuming that the total population in 1975 is known to be 8200 (in thousands), you can use the ESTIMATE statement to predict the 1985 total population using the following statements:
title1 'Regression Analysis for Swedish Municipalities';
title2 'Estimate Total Population';
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
   estimate '1985 population' Intercept 284 Population75 8200;
run;
Because each observation in the sample is a municipality, and there are 284 municipalities in Sweden in total, the coefficient for Intercept (α) in the ESTIMATE statement is 284, and the coefficient for Population75 (β) is the total population in 1975 (8.2 million, entered as 8200 thousands).
Output 71.3.1 displays the regression results and the estimation of the total population. Using the linear model, you can predict the total population in 1985 to be 8.64 million, with a standard error of 0.26 million.
Regression Analysis for Swedish Municipalities
Estimate Total Population

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Population85

Analysis of Estimable Functions

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
1985 population | 8642.49485 | 258.558613 | 33.43 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 4.
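The point estimate can be reproduced by hand as 284 times the fitted intercept plus 8200 times the fitted slope. The following sketch uses illustrative variable names; it takes the intercept as -0.0191292, which is the sign that follows from the data (mean of Population85 minus slope times mean of Population75) and that reproduces the reported total of about 8642.49 thousand.

```sas
data _null_;
   alpha_hat = -0.0191292;   /* fitted intercept (negative sign, see above) */
   beta_hat  =  1.0546253;   /* fitted slope for Population75               */
   /* Regression estimator of the 1985 total, in thousands */
   total85   = 284 * alpha_hat + 8200 * beta_hat;
   put total85= 10.2;
run;
```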
This example illustrates using the SURVEYREG procedure to perform a regression in a stratified sample design. Consider a population of 235 farms producing corn in Nebraska and Iowa. You are interested in the relationship between corn yield (CornYield) and the total farm size (FarmArea).
Each state is divided into several regions, and each region is used as a stratum. Within each stratum, a simple random sample with replacement is drawn. A total of 19 farms is selected using a stratified simple random sample. The sample size and population size within each stratum are displayed in Table 71.3.
Stratum | State | Region | Number of Farms in Population | Number of Farms in Sample |
---|---|---|---|---|
1 | Iowa | 1 | 100 | 3 |
2 | Iowa | 2 | 50 | 5 |
3 | Iowa | 3 | 15 | 3 |
4 | Nebraska | 1 | 30 | 6 |
5 | Nebraska | 2 | 40 | 2 |
Total | | | 235 | 19 |
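Three models for the data are considered:

Model I - Common intercept and slope:

$$\text{CornYield} = \alpha + \beta \times \text{FarmArea} + \text{error}$$

Model II - Common intercept, different slope:

$$\text{CornYield} = \begin{cases} \alpha + \beta_{\text{Iowa}} \times \text{FarmArea} + \text{error} & \text{if the farm is in Iowa} \\ \alpha + \beta_{\text{Nebraska}} \times \text{FarmArea} + \text{error} & \text{if the farm is in Nebraska} \end{cases}$$

Model III - Different intercept and different slope:

$$\text{CornYield} = \begin{cases} \alpha_{\text{Iowa}} + \beta_{\text{Iowa}} \times \text{FarmArea} + \text{error} & \text{if the farm is in Iowa} \\ \alpha_{\text{Nebraska}} + \beta_{\text{Nebraska}} \times \text{FarmArea} + \text{error} & \text{if the farm is in Nebraska} \end{cases}$$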
Data from the stratified sample are saved in the SAS data set Farms. In the data set Farms, the variable Weight represents the sampling weight. In this example, the sampling weights are the reciprocals of the selection probabilities:
data Farms;
   input State $ Region FarmArea CornYield Weight;
   datalines;
Iowa     1 100  54 33.333
Iowa     1  83  25 33.333
Iowa     1  25  10 33.333
Iowa     2 120  83 10.000
Iowa     2  50  35 10.000
Iowa     2 110  65 10.000
Iowa     2  60  35 10.000
Iowa     2  45  20 10.000
Iowa     3  23   5  5.000
Iowa     3  10   8  5.000
Iowa     3 350 125  5.000
Nebraska 1 130  20  5.000
Nebraska 1 245  25  5.000
Nebraska 1 150  33  5.000
Nebraska 1 263  50  5.000
Nebraska 1 320  47  5.000
Nebraska 1 204  25  5.000
Nebraska 2  80  11 20.000
Nebraska 2  48   8 20.000
;
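The weights in Farms can be checked against Table 71.3: in each stratum the weight is the population size divided by the sample size. The sketch below is only an illustration; the data set name StratumWeights and the variables PopSize, SampleSize, and SamplingRate are hypothetical names introduced here, and the counts are copied from Table 71.3.

```sas
data StratumWeights;
   input State $ Region PopSize SampleSize;
   /* Weight is the reciprocal of the selection probability */
   Weight       = PopSize / SampleSize;
   SamplingRate = SampleSize / PopSize;
   datalines;
Iowa     1 100 3
Iowa     2  50 5
Iowa     3  15 3
Nebraska 1  30 6
Nebraska 2  40 2
;

proc print data=StratumWeights noobs;
   format Weight 8.3 SamplingRate percent8.2;
run;
```

The Weight column should match the values 33.333, 10, 5, 5, and 20 used in the Farms data set.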
The information on population size in each stratum is saved in the SAS data set StratumTotals:

data StratumTotals;
   input State $ Region _TOTAL_;
   datalines;
Iowa     1 100
Iowa     2  50
Iowa     3  15
Nebraska 1  30
Nebraska 2  40
;
Using the sample data from the data set Farms and the control information data from the data set StratumTotals, you can fit Model I using PROC SURVEYREG with the following statements:

title1 'Analysis of Farm Area and Corn Yield';
title2 'Model I: Same Intercept and Slope';
proc surveyreg data=Farms total=StratumTotals;
   strata State Region / list;
   model CornYield = FarmArea;
   weight Weight;
run;
Output 71.4.1 displays the data summary and stratification information fitting Model I. The sampling rates are automatically computed by the procedure based on the sample sizes and the population totals in strata.
Analysis of Farm Area and Corn Yield
Model I: Same Intercept and Slope

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Data Summary | |
---|---|
Number of Observations | 19 |
Sum of Weights | 234.99900 |
Weighted Mean of CornYield | 31.56029 |
Weighted Sum of CornYield | 7416.6 |

Design Summary | |
---|---|
Number of Strata | 5 |

Fit Statistics | |
---|---|
R-square | 0.3882 |
Root MSE | 20.6422 |
Denominator DF | 14 |

Stratum Information

Stratum Index | State | Region | N Obs | Population Total | Sampling Rate |
---|---|---|---|---|---|
1 | Iowa | 1 | 3 | 100 | 3.00% |
2 | | 2 | 5 | 50 | 10.0% |
3 | | 3 | 3 | 15 | 20.0% |
4 | Nebraska | 1 | 6 | 30 | 20.0% |
5 | | 2 | 2 | 40 | 5.00% |
Output 71.4.2 displays tests of model effects and the estimated regression coefficients.
Analysis of Farm Area and Corn Yield
Model I: Same Intercept and Slope

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Tests of Model Effects

Effect | Num DF | F Value | Pr > F |
---|---|---|---|
Model | 1 | 21.74 | 0.0004 |
Intercept | 1 | 4.93 | 0.0433 |
FarmArea | 1 | 21.74 | 0.0004 |

NOTE: The denominator degrees of freedom for the F tests is 14.

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | 11.8162978 | 5.31981027 | 2.22 | 0.0433 |
FarmArea | 0.2126576 | 0.04560949 | 4.66 | 0.0004 |

NOTE: The denominator degrees of freedom for the t tests is 14.
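Alternatively, you can assume that the linear relationship between corn yield (CornYield) and farm area (FarmArea) is different among the states (Model II). In order to analyze the data using this model, you create auxiliary variables FarmAreaNE and FarmAreaIA to represent farm area in different states:

$$\text{FarmAreaNE} = \begin{cases} \text{FarmArea} & \text{if the farm is in Nebraska} \\ 0 & \text{otherwise} \end{cases} \qquad \text{FarmAreaIA} = \begin{cases} \text{FarmArea} & \text{if the farm is in Iowa} \\ 0 & \text{otherwise} \end{cases}$$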
The following statements create these variables in a new data set called FarmsByState:

title1 'Analysis of Farm Area and Corn Yield';
title2 'Model II: Same Intercept, Different Slopes';
data FarmsByState;
   set Farms;
   if State='Iowa' then do;
      FarmAreaIA=FarmArea;
      FarmAreaNE=0;
   end;
   else do;
      FarmAreaIA=0;
      FarmAreaNE=FarmArea;
   end;
run;
The following statements perform the regression using the new data set FarmsByState. The analysis uses the auxiliary variables FarmAreaIA and FarmAreaNE as the regressors:

proc surveyreg data=FarmsByState total=StratumTotals;
   strata State Region;
   model CornYield = FarmAreaIA FarmAreaNE;
   weight Weight;
run;
Output 71.4.3 displays the data summary, design information, fit statistics, and parameter estimates. The estimated slope parameters for each state are quite different from the estimated slope in Model I. The results from the regression show that Model II fits these data better than Model I.
Analysis of Farm Area and Corn Yield
Model II: Same Intercept, Different Slopes

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Data Summary | |
---|---|
Number of Observations | 19 |
Sum of Weights | 234.99900 |
Weighted Mean of CornYield | 31.56029 |
Weighted Sum of CornYield | 7416.6 |

Design Summary | |
---|---|
Number of Strata | 5 |

Fit Statistics | |
---|---|
R-square | 0.8158 |
Root MSE | 11.6759 |
Denominator DF | 14 |

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | 4.04234816 | 3.80934848 | 1.06 | 0.3066 |
FarmAreaIA | 0.41696069 | 0.05971129 | 6.98 | <.0001 |
FarmAreaNE | 0.12851012 | 0.02495495 | 5.15 | 0.0001 |

NOTE: The denominator degrees of freedom for the t tests is 14.
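As a possible alternative to building FarmAreaIA and FarmAreaNE by hand, the state-specific slopes of Model II might be requested through a CLASS variable crossed with FarmArea. This is a hedged sketch, not the approach used in this example: it assumes that the MODEL statement accepts a continuous-by-classification effect such as FarmArea*State, and the parameter labels in the output would differ from those in Output 71.4.3.

```sas
title1 'Analysis of Farm Area and Corn Yield';
title2 'Model II: Same Intercept, Different Slopes (CLASS form)';
proc surveyreg data=Farms total=StratumTotals;
   strata State Region;
   class State;
   /* One slope for FarmArea within each level of State */
   model CornYield = FarmArea*State / solution;
   weight Weight;
run;
```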
For Model III, different intercepts are used for the linear relationship in two states. The following statements illustrate the use of the NOINT option in the MODEL statement associated with the CLASS statement to fit Model III:
title2 'Model III: Different Intercepts and Slopes';
proc surveyreg data=FarmsByState total=StratumTotals;
   strata State Region;
   class State;
   model CornYield = State FarmAreaIA FarmAreaNE / noint covb solution;
   weight Weight;
run;
The MODEL statement includes the classification effect State as a regressor. Therefore, the parameter estimates for the effect State represent the intercepts in the two states.
Output 71.4.4 displays the regression results for fitting Model III, including the data summary, parameter estimates, and covariance matrix of the regression coefficients. The estimated covariance matrix shows a lack of correlation between the regression coefficients from different states. This suggests that Model III might be the best choice for building a model for farm area and corn yield in these two states.
Analysis of Farm Area and Corn Yield
Model III: Different Intercepts and Slopes

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Data Summary | |
---|---|
Number of Observations | 19 |
Sum of Weights | 234.99900 |
Weighted Mean of CornYield | 31.56029 |
Weighted Sum of CornYield | 7416.6 |

Design Summary | |
---|---|
Number of Strata | 5 |

Fit Statistics | |
---|---|
R-square | 0.9300 |
Root MSE | 11.9810 |
Denominator DF | 14 |

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
State Iowa | 5.27797099 | 5.27170400 | 1.00 | 0.3337 |
State Nebraska | 0.65275201 | 1.70031616 | 0.38 | 0.7068 |
FarmAreaIA | 0.40680971 | 0.06458426 | 6.30 | <.0001 |
FarmAreaNE | 0.14630563 | 0.01997085 | 7.33 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 14.

Covariance of Estimated Regression Coefficients

 | State Iowa | State Nebraska | FarmAreaIA | FarmAreaNE |
---|---|---|---|---|
State Iowa | 27.790863033 | 0 | 0.205517205 | 0 |
State Nebraska | 0 | 2.8910750385 | 0 | 0.027354011 |
FarmAreaIA | 0.205517205 | 0 | 0.0041711265 | 0 |
FarmAreaNE | 0 | 0.027354011 | 0 | 0.0003988349 |
However, some statistics remain the same under different regression models, for example, the Weighted Mean of CornYield. These estimators do not depend on the particular model that you use.
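Because these summary statistics depend only on the data and the weights, you can reproduce them without fitting any regression model. The following sketch uses PROC MEANS with a WEIGHT statement on the Farms data set; the results should agree with the 'Data Summary' tables (sum of weights 234.999, weighted mean about 31.56, weighted sum about 7416.6) up to rounding.

```sas
/* Weighted summary of CornYield, independent of the regression model */
proc means data=Farms sumwgt mean sum maxdec=5;
   var CornYield;
   weight Weight;
run;
```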
This example uses the corn yield data from the previous example to illustrate how to construct a regression estimator for a stratified sample design.
As in Example 71.3, by incorporating auxiliary information into a regression estimator, the procedure can produce more accurate estimates of the population characteristics that are of interest. In this example, the sample design is a stratified sample design. The auxiliary information is the total farm area in each state, as displayed in Table 71.4. You want to estimate the total corn yield using this information under the three linear models given in Example 71.4.
Stratum | State | Region | Number of Farms in Population | Number of Farms in Sample | Total Farm Area (state total) |
---|---|---|---|---|---|
1 | Iowa | 1 | 100 | 3 | 13,200 |
2 | Iowa | 2 | 50 | 5 | |
3 | Iowa | 3 | 15 | 3 | |
4 | Nebraska | 1 | 30 | 6 | 8,750 |
5 | Nebraska | 2 | 40 | 2 | |
Total | | | 235 | 19 | 21,950 |
The regression estimator to estimate the total corn yield under Model I can be obtained by using PROC SURVEYREG with an ESTIMATE statement:
title1 'Estimate Corn Yield from Farm Size';
title2 'Model I: Same Intercept and Slope';
proc surveyreg data=Farms total=StratumTotals;
   strata State Region / list;
   class State Region;
   model CornYield = FarmArea State*Region / solution;
   weight Weight;
   estimate 'Estimate of CornYield under Model I'
            INTERCEPT 235 FarmArea 21950
            State*Region 100 50 15 30 40 / e;
run;
To apply the constraint in each stratum that the weighted total number of farms equals the total number of farms in the stratum, you can include the strata as an effect in the MODEL statement, namely the effect State*Region. Thus, the CLASS statement must list the STRATA variables, State and Region, as classification variables. The following ESTIMATE statement specifies the regression estimator, which is a linear function of the regression parameters:
estimate 'Estimate of CornYield under Model I'
         INTERCEPT 235 FarmArea 21950
         State*Region 100 50 15 30 40 / e;
This linear function contains the total for each explanatory variable in the model. Because the sampling units are farms in this example, the coefficient for Intercept in the ESTIMATE statement is the total number of farms (235); the coefficient for FarmArea is the total farm area listed in Table 71.4 (21950); and the coefficients for the effect State*Region are the total numbers of farms in each stratum (as displayed in Table 71.4).
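Written out, the ESTIMATE statement evaluates a linear combination of the fitted coefficients with the known population totals as multipliers. The notation below is introduced here for illustration only: $\hat{\beta}_0$ denotes the fitted intercept, $\hat{\beta}_{\text{FarmArea}}$ the fitted slope, and $\hat{\gamma}_h$ the fitted coefficient for the $h$th level of State*Region, in the stratum order of Table 71.4.

```latex
\hat{Y}_{\text{reg}}
  = 235\,\hat{\beta}_0
  + 21950\,\hat{\beta}_{\text{FarmArea}}
  + 100\,\hat{\gamma}_1 + 50\,\hat{\gamma}_2 + 15\,\hat{\gamma}_3
  + 30\,\hat{\gamma}_4 + 40\,\hat{\gamma}_5
```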
Output 71.5.1 displays the results of the ESTIMATE statement. The regression estimator for the total of CornYield in Iowa and Nebraska is 7464 under Model I, with a standard error of 927.
Estimate Corn Yield from Farm Size
Model I: Same Intercept and Slope

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Analysis of Estimable Functions

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Estimate of CornYield under Model I | 7463.52329 | 926.841541 | 8.05 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 14.
Under Model II, a regression estimator for totals can be obtained using the following statements:
title1 'Estimate Corn Yield from Farm Size';
title2 'Model II: Same Intercept, Different Slopes';
proc surveyreg data=FarmsByState total=StratumTotals;
   strata State Region;
   class State Region;
   model CornYield = FarmAreaIA FarmAreaNE State*Region / solution;
   weight Weight;
   estimate 'Total of CornYield under Model II'
            INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750
            State*Region 100 50 15 30 40 / e;
run;
In this model, you also need to include the strata as a fixed effect in the MODEL statement. The other regressors are the auxiliary variables FarmAreaIA and FarmAreaNE (defined in Example 71.4). In the following ESTIMATE statement, the coefficient for Intercept is still the total number of farms, and the coefficients for FarmAreaIA and FarmAreaNE are the total farm areas in Iowa and Nebraska, respectively, as displayed in Table 71.4. The total numbers of farms in each stratum are the coefficients for the strata effect:
estimate 'Total of CornYield under Model II'
         INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750
         State*Region 100 50 15 30 40 / e;
Output 71.5.2 shows that the regression estimator for the total corn yield in the two states under Model II is 7580, with a standard error of 859. The regression estimator under Model II has a slightly smaller standard error than the one under Model I.
Estimate Corn Yield from Farm Size
Model II: Same Intercept, Different Slopes

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Analysis of Estimable Functions

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Total of CornYield under Model II | 7580.48657 | 859.180439 | 8.82 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 14.
Finally, you can apply Model III to the data and estimate the total corn yield. Under Model III, you can also obtain the regression estimators for the total corn yield for each state. Three ESTIMATE statements are used in the following statements to create the three regression estimators:
title1 'Estimate Corn Yield from Farm Size';
title2 'Model III: Different Intercepts and Slopes';
proc surveyreg data=FarmsByState total=StratumTotals;
   strata State Region;
   class State Region;
   model CornYield = State FarmAreaIA FarmAreaNE State*Region / noint solution;
   weight Weight;
   estimate 'Total CornYield in Iowa under Model III'
            State 165 0 FarmAreaIA 13200 FarmAreaNE 0
            State*Region 100 50 15 0 0 / e;
   estimate 'Total CornYield in Nebraska under Model III'
            State 0 70 FarmAreaIA 0 FarmAreaNE 8750
            State*Region 0 0 0 30 40 / e;
   estimate 'Total CornYield in both states under Model III'
            State 165 70 FarmAreaIA 13200 FarmAreaNE 8750
            State*Region 100 50 15 30 40 / e;
run;
The fixed effect State is added to the MODEL statement to obtain different intercepts in different states, using the NOINT option. Among the ESTIMATE statements, the coefficients for explanatory variables are different depending on which regression estimator is estimated. For example, in the ESTIMATE statement
estimate 'Total CornYield in Iowa under Model III'
         State 165 0 FarmAreaIA 13200 FarmAreaNE 0
         State*Region 100 50 15 0 0 / e;
the coefficients for the effect State are 165 and 0, respectively. That is, the number of farms counted for Iowa is 165 and the number counted for Nebraska is 0, because the quantity being estimated is the total corn yield in Iowa only. Similarly, the total numbers of farms in the three regions in Iowa are used as the coefficients of the strata effect State*Region, as displayed in Table 71.4.
Output 71.5.3 displays the results from the three regression estimators using Model III. Because the estimates are independent across the two states, the total corn yield from both states is equal to the sum of the estimated totals of corn yield in Iowa and Nebraska, 6246 + 1334 = 7580. This regression estimator is the same as the one under Model II. The variance of the regression estimator of the total corn yield in both states is the sum of the variances of the regression estimators for the total corn yield in each state. Therefore, it is not necessary to use Model III to obtain the regression estimator for the total corn yield unless you need to estimate the total corn yield for each individual state.
Estimate Corn Yield from Farm Size
Model III: Different Intercepts and Slopes

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Analysis of Estimable Functions

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Total CornYield in Iowa under Model III | 6246.10697 | 851.272 | 7.34 | <.0001 |
Total CornYield in Nebraska under Model III | 1334.37961 | 116.302948 | 11.47 | <.0001 |
Total CornYield in both states under Model III | 7580.48657 | 859.180439 | 8.82 | <.0001 |

NOTE: The denominator degrees of freedom for the t tests is 14.
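The additivity of the estimates and of their variances can be checked with a few lines of arithmetic. The sketch below uses illustrative variable names and the values printed in Output 71.5.3; the combined estimate comes out at about 7580.49 and the combined standard error at about 859.18, matching the third row of the table up to rounding.

```sas
data _null_;
   est_ia = 6246.10697;   se_ia = 851.272;      /* Iowa     */
   est_ne = 1334.37961;   se_ne = 116.302948;   /* Nebraska */
   /* Totals add; variances add because the state estimates are independent */
   est_both = est_ia + est_ne;
   se_both  = sqrt(se_ia**2 + se_ne**2);
   put est_both= 12.5 se_both= 10.3;
run;
```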
In a stratified sample, it is possible that some strata will have only one sampling unit. When this happens, PROC SURVEYREG collapses the strata that contain a single sampling unit into a pooled stratum. For more detailed information about stratum collapse, see the section 'Stratum Collapse'.
Suppose that you have the following data:
data Sample;
   input Stratum X Y W;
   datalines;
10 0 0  5
10 1 1  5
11 1 1 10
11 1 2 10
12 3 3 16
33 4 4 45
14 6 7 50
12 3 4 16
;
The variable Stratum is the stratification variable, the variable X is the independent variable, and the variable Y is the dependent variable. You want to regress Y on X. In the data set Sample, both Stratum=33 and Stratum=14 contain one observation. By default, PROC SURVEYREG collapses these strata into one pooled stratum in the regression analysis.
To input the finite population correction information, you create the SAS data set StratumTotals:

data StratumTotals;
   input Stratum _TOTAL_;
   datalines;
10 10
11 20
12 32
33 40
33 45
14 50
15 .
66 70
;
The variable Stratum is the stratification variable, and the variable _TOTAL_ contains the stratum totals. The data set StratumTotals contains more strata than the data set Sample. Also in the data set StratumTotals, more than one observation contains the stratum totals for Stratum=33:

33 40
33 45
PROC SURVEYREG allows this type of input. The procedure simply ignores strata that are not present in the data set Sample; for the multiple entries of a stratum, the procedure uses the first observation. In this example, Stratum=33 has the stratum total _TOTAL_=40.
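If you prefer to make the "first observation wins" rule explicit before running the procedure, one option is to remove the duplicate entries yourself. The following sketch is an illustration of that rule, not a description of what PROC SURVEYREG does internally; the output data set name FirstTotals is a hypothetical name. The NODUPKEY option keeps the first observation that is encountered for each value of Stratum, so Stratum=33 retains the total 40.

```sas
/* Keep only the first total listed for each stratum */
proc sort data=StratumTotals out=FirstTotals nodupkey;
   by Stratum;
run;

proc print data=FirstTotals noobs;
run;
```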
The following SAS statements perform the regression analysis:
title1 'Stratified Sample with Single Sampling Unit in Strata';
title2 'With Stratum Collapse';
proc surveyreg data=Sample total=StratumTotals;
   strata Stratum / list;
   model Y=X;
   weight W;
run;
Output 71.6.1 shows that the input data set contains five strata in total, two of which are collapsed into a pooled stratum. Because of the collapse, the denominator degrees of freedom is 4 (see the section 'Denominator Degrees of Freedom').
Stratified Sample with Single Sampling Unit in Strata
With Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Data Summary | |
---|---|
Number of Observations | 8 |
Sum of Weights | 157.00000 |
Weighted Mean of Y | 4.31210 |
Weighted Sum of Y | 677.00000 |

Design Summary | |
---|---|
Number of Strata | 5 |
Number of Strata Collapsed | 2 |

Fit Statistics | |
---|---|
R-square | 0.9564 |
Root MSE | 0.5111 |
Denominator DF | 4 |
Output 71.6.2 displays the stratification information, including stratum collapse. Under the column Collapsed, the fourth stratum (Stratum=14) and the fifth (Stratum=33) are marked as 'Yes', which indicates that these two strata are collapsed into the pooled stratum (Stratum Index=0). The sampling rate for the pooled stratum is 2.22%, that is, 2 observations out of a pooled population total of 90 (see the section 'Sampling Rate of the Pooled Stratum from Collapse').
Stratified Sample with Single Sampling Unit in Strata
With Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Stratum Information

Stratum Index | Collapsed | Stratum | N Obs | Population Total | Sampling Rate |
---|---|---|---|---|---|
1 | | 10 | 2 | 10 | 20.0% |
2 | | 11 | 2 | 20 | 10.0% |
3 | | 12 | 2 | 32 | 6.25% |
4 | Yes | 14 | 1 | 50 | 2.00% |
5 | Yes | 33 | 1 | 40 | 2.50% |
0 | Pooled | | 2 | 90 | 2.22% |

NOTE: Strata with only one observation are collapsed into the stratum with Stratum Index "0".
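The pooled row of the table can be reproduced by adding the observation counts and the population totals of the two collapsed strata. The sketch below uses illustrative variable names and the values shown for Stratum=14 and Stratum=33.

```sas
data _null_;
   obs_pooled   = 1 + 1;      /* one observation in each collapsed stratum */
   total_pooled = 50 + 40;    /* population totals of Stratum 14 and 33    */
   rate_pooled  = obs_pooled / total_pooled;
   put obs_pooled= total_pooled= rate_pooled= percent8.2;
run;
```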
Output 71.6.3 displays the parameter estimates and the tests of the significance of the model effects.
Stratified Sample with Single Sampling Unit in Strata
With Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Tests of Model Effects

Effect | Num DF | F Value | Pr > F |
---|---|---|---|
Model | 1 | 173.01 | 0.0002 |
Intercept | 1 | 0.00 | 0.9961 |
X | 1 | 173.01 | 0.0002 |

NOTE: The denominator degrees of freedom for the F tests is 4.

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | 0.00179469 | 0.34306373 | 0.01 | 0.9961 |
X | 1.12598708 | 0.08560466 | 13.15 | 0.0002 |

NOTE: The denominator degrees of freedom for the t tests is 4.
Alternatively, if you prefer not to collapse strata with a single sampling unit, you can specify the NOCOLLAPSE option in the STRATA statement:
title1 'Stratified Sample with Single Sampling Unit in Strata';
title2 'Without Stratum Collapse';
proc surveyreg data=Sample total=StratumTotals;
   strata Stratum / list nocollapse;
   model Y = X;
   weight W;
run;
Output 71.6.4 does not contain the stratum collapse information displayed in Output 71.6.1, and the denominator degrees of freedom is 3 instead of 4.
Stratified Sample with Single Sampling Unit in Strata
Without Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Data Summary | |
---|---|
Number of Observations | 8 |
Sum of Weights | 157.00000 |
Weighted Mean of Y | 4.31210 |
Weighted Sum of Y | 677.00000 |

Design Summary | |
---|---|
Number of Strata | 5 |

Fit Statistics | |
---|---|
R-square | 0.9564 |
Root MSE | 0.5111 |
Denominator DF | 3 |
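The change in the denominator degrees of freedom follows from a simple count. As a sketch, assuming the usual rule for a design without a CLUSTER statement (number of observations $n$ minus number of strata $H$ used in the variance estimation), the two analyses give the following; with collapse, the three untouched strata plus the pooled stratum make $H = 4$, and without collapse $H = 5$.

```latex
\text{df} = n - H, \qquad
\text{with collapse: } 8 - 4 = 4, \qquad
\text{without collapse: } 8 - 5 = 3
```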
In Output 71.6.5, although the fourth stratum and the fifth stratum contain only one observation, no stratum collapse occurs.
Stratified Sample with Single Sampling Unit in Strata
Without Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Stratum Information

Stratum Index | Stratum | N Obs | Population Total | Sampling Rate |
---|---|---|---|---|
1 | 10 | 2 | 10 | 20.0% |
2 | 11 | 2 | 20 | 10.0% |
3 | 12 | 2 | 32 | 6.25% |
4 | 14 | 1 | 50 | 2.00% |
5 | 33 | 1 | 40 | 2.50% |
Stratified Sample with Single Sampling Unit in Strata
Without Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Tests of Model Effects

Effect | Num DF | F Value | Pr > F |
---|---|---|---|
Model | 1 | 347.27 | 0.0003 |
Intercept | 1 | 0.00 | 0.9962 |
X | 1 | 347.27 | 0.0003 |

NOTE: The denominator degrees of freedom for the F tests is 3.

Estimated Regression Coefficients

Parameter | Estimate | Standard Error | t Value | Pr > t |
---|---|---|---|---|
Intercept | 0.00179469 | 0.34302581 | 0.01 | 0.9962 |
X | 1.12598708 | 0.06042241 | 18.64 | 0.0003 |

NOTE: The denominator degrees of freedom for the t tests is 3.
As a result of not collapsing strata, the standard error estimates of the parameters differ from those in Output 71.6.3, as do the tests of significance of the model effects.