Examples


Example 71.1. Simple Random Sampling

This example investigates the relationship between the labor force participation rate (LFPR) of women in 1968 and 1972 in large cities in the United States. A simple random sample of 19 cities is drawn from a total of 200 cities. For each selected city, the LFPRs are recorded and saved in a SAS data set named Labor. The LFPR in 1972 is contained in the variable LFPR1972, and the LFPR in 1968 is contained in the variable LFPR1968:

  data Labor;
     /* City is read from columns 1-16, so data lines must start in column 1 */
     input City $ 1-16 LFPR1972 LFPR1968;
     datalines;
  New York        .45     .42
  Los Angeles     .50     .50
  Chicago         .52     .52
  Philadelphia    .45     .45
  Detroit         .46     .43
  San Francisco   .55     .55
  Boston          .60     .45
  Pittsburgh      .49     .34
  St. Louis       .35     .45
  Connecticut     .55     .54
  Washington D.C. .52     .42
  Cincinnati      .53     .51
  Baltimore       .57     .49
  Newark          .53     .54
  Minn/St. Paul   .59     .50
  Buffalo         .64     .58
  Houston         .50     .49
  Patterson       .57     .56
  Dallas          .64     .63
  ;

Assume that the LFPRs in 1968 and 1972 have a linear relationship, as shown in the following model:

$$\text{LFPR1972} = \beta_0 + \beta_1 \cdot \text{LFPR1968} + \text{error}$$

You can use PROC SURVEYREG to obtain the estimated regression coefficients and estimated standard errors of the regression coefficients. The following statements perform the regression analysis:

  title 'Study of Labor Force Participation Rates of Women';
  proc surveyreg data=Labor total=200;
     model LFPR1972 = LFPR1968;
  run;

Here, the TOTAL=200 option specifies the finite population total from which the simple random sample of 19 cities is drawn. You can specify the same information by using the sampling rate option RATE=0.095 (19/200=.095).
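As a sketch of the equivalent specification mentioned above, the same analysis can be requested with the RATE= option in place of TOTAL=. It assumes the Labor data set created earlier; the title text is only illustrative:

```sas
title 'Study of Labor Force Participation Rates of Women (RATE= form)';
/* 19 sampled cities out of 200 gives a sampling rate of 19/200 = 0.095 */
proc surveyreg data=Labor rate=0.095;
   model LFPR1972 = LFPR1968;
run;
```

Both forms supply the same finite population correction, so the estimates and standard errors should agree with those from the TOTAL=200 run.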

Output 71.1.1 summarizes the data and the fit statistics.

Output 71.1.1: Summary of Regression Using Simple Random Sampling
  Study of Labor Force Participation Rates of Women

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable LFPR1972

  Data Summary
  Number of Observations            19
  Mean of LFPR1972             0.52684
  Sum of LFPR1972             10.01000

  Fit Statistics
  R-square            0.3970
  Root MSE           0.05657
  Denominator DF          18
 

Output 71.1.2 presents the F tests for the model effects and the estimated regression coefficients with their t tests.

Output 71.1.2: Regression Coefficient Estimates
  Study of Labor Force Participation Rates of Women

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable LFPR1972

  Tests of Model Effects
  Effect       Num DF    F Value    Pr > F
  Model             1      13.84    0.0016
  Intercept         1       4.63    0.0452
  LFPR1968          1      13.84    0.0016
  NOTE: The denominator degrees of freedom for the F tests is 18.

  Estimated Regression Coefficients
                                Standard
  Parameter      Estimate          Error    t Value    Pr > |t|
  Intercept    0.20331056     0.09444296       2.15      0.0452
  LFPR1968     0.65604048     0.17635810       3.72      0.0016
  NOTE: The denominator degrees of freedom for the t tests is 18.
 

From the regression performed by PROC SURVEYREG, you obtain a positive estimated slope for the linear relationship between the LFPR in 1968 and the LFPR in 1972. Both regression coefficients, for Intercept and for LFPR1968, are significant at the 5% level. In this example, the F test for the overall model (excluding the intercept) is the same as the F test for the effect LFPR1968.
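Because each effect in this model has one numerator degree of freedom, the F statistic for an effect should equal the square of the corresponding t statistic. The following check uses only the estimates and standard errors reported in Output 71.1.2; small differences are due to rounding:

```latex
\[
t_{\text{LFPR1968}} = \frac{0.65604048}{0.17635810} \approx 3.72,
\qquad t_{\text{LFPR1968}}^{2} \approx 13.84 = F_{\text{LFPR1968}}
\]
\[
t_{\text{Intercept}} = \frac{0.20331056}{0.09444296} \approx 2.15,
\qquad t_{\text{Intercept}}^{2} \approx 4.63 = F_{\text{Intercept}}
\]
```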

Example 71.2. Simple Random Cluster Sampling

This example illustrates the use of regression analysis in a simple random cluster sample design. The data are from Särndal, Swenson, and Wretman (1992, p. 652).

A total of 284 Swedish municipalities are grouped into 50 clusters of neighboring municipalities. Five clusters with a total of 32 municipalities are randomly selected. The results from the regression analysis in which clusters are used in the sample design are compared to the results of a regression analysis that ignores the clusters. The linear relationship between the population in 1975 and in 1985 is investigated.

The 32 selected municipalities in the sample are saved in the data set Municipalities:

  data Municipalities;
     input Municipality Cluster Population85 Population75;
     datalines;
  205   37    5    5
  206   37   11   11
  207   37   13   13
  208   37    8    8
  209   37   17   19
    6    2   16   15
    7    2   70   62
    8    2   66   54
    9    2   12   12
   10    2   60   50
   94   17    7    7
   95   17   16   16
   96   17   13   11
   97   17   12   11
   98   17   70   67
   99   17   20   20
  100   17   31   28
  101   17   49   48
  276   50    6    7
  277   50    9   10
  278   50   24   26
  279   50   10    9
  280   50   67   64
  281   50   39   35
  282   50   29   27
  283   50   10    9
  284   50   27   31
   52   10    7    6
   53   10    9    8
   54   10   28   27
   55   10   12   11
   56   10  107  108
  ;

The variable Municipality identifies the municipalities in the sample; the variable Cluster indicates the cluster to which a municipality belongs; and the variables Population85 and Population75 contain the municipality populations in 1985 and in 1975 (in thousands), respectively. A regression analysis is performed by PROC SURVEYREG with a CLUSTER statement:

  title1 'Regression Analysis for Swedish Municipalities';
  title2 'Cluster Simple Random Sampling';
  proc surveyreg data=Municipalities total=50;
     cluster Cluster;
     model Population85=Population75;
  run;

The TOTAL=50 option specifies the total number of clusters in the sampling frame.

Output 71.2.1 displays the data summary, design summary, fit statistics, and regression coefficient estimates. Since the sample design includes clusters, the procedure displays the total number of clusters in the sample in the 'Design Summary' table. In the 'Estimated Regression Coefficients' table, the estimated slope for the linear relationship is 1.05, which is significant at the 5% level, but the intercept is not significant. This suggests that a regression line passing through the origin can be established between the populations in 1975 and in 1985.

Output 71.2.1: Regression Analysis for Simple Random Cluster Sampling
  Regression Analysis for Swedish Municipalities
  Cluster Simple Random Sampling

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable Population85

  Data Summary
  Number of Observations            32
  Mean of Population85        27.50000
  Sum of Population85        880.00000

  Design Summary
  Number of Clusters             5

  Fit Statistics
  R-square            0.9860
  Root MSE            3.0488
  Denominator DF           4

  Estimated Regression Coefficients
                                   Standard
  Parameter         Estimate          Error    t Value    Pr > |t|
  Intercept       -0.0191292     0.89204053      -0.02      0.9839
  Population75     1.0546253     0.05167565      20.41      <.0001
  NOTE: The denominator degrees of freedom for the t tests is 4.
 

The CLUSTER statement is necessary in PROC SURVEYREG in order to incorporate the sample design. If you do not specify a CLUSTER statement in the regression analysis, the standard errors of the regression coefficients are estimated incorrectly. The following statements fit the same model without the CLUSTER statement:

  title1 'Regression Analysis for Swedish Municipalities';
  title2 'Simple Random Sampling';
  proc surveyreg data=Municipalities total=284;
     model Population85=Population75;
  run;

This analysis ignores the clusters in the sample and treats the sample design as a simple random sample of municipalities. Therefore, the TOTAL= option specifies the total number of municipalities, which is 284.

Output 71.2.2 displays the regression results when the clusters are ignored. Compared with the results in Output 71.2.1, the regression coefficient estimates are the same, but the variance estimates are smaller. With clusters in the analysis, the estimated regression coefficient for the effect Population75 is 1.05 with an estimated standard error of 0.05, as displayed in Output 71.2.1; without the clusters, the estimate is again 1.05, but the estimated standard error is 0.04, as displayed in Output 71.2.2. To estimate the variance of the regression coefficients correctly, you should include the clustering information in the regression analysis.
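To put the difference in perspective, the following ratio compares the two standard errors for Population75 reported in Output 71.2.1 and Output 71.2.2. It is only an arithmetic illustration of the figures already shown:

```latex
\[
\frac{\text{SE}_{\text{cluster}}}{\text{SE}_{\text{SRS}}}
  = \frac{0.05167565}{0.03668414} \approx 1.41,
\qquad
\left(\frac{0.05167565}{0.03668414}\right)^{2} \approx 1.98
\]
```

In other words, the variance estimate that accounts for clustering is roughly twice the one obtained by treating the 32 municipalities as a simple random sample.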

Output 71.2.2: Regression Analysis for Simple Random Sampling
  Regression Analysis for Swedish Municipalities
  Simple Random Sampling

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable Population85

  Data Summary
  Number of Observations            32
  Mean of Population85        27.50000
  Sum of Population85        880.00000

  Fit Statistics
  R-square            0.9860
  Root MSE            3.0488
  Denominator DF          31

  Estimated Regression Coefficients
                                   Standard
  Parameter         Estimate          Error    t Value    Pr > |t|
  Intercept       -0.0191292     0.67417606      -0.03      0.9775
  Population75     1.0546253     0.03668414      28.75      <.0001
  NOTE: The denominator degrees of freedom for the t tests is 31.
 

Example 71.3. Regression Estimator for Simple Random Sample

Using auxiliary information, you can construct regression estimators that provide more accurate estimates of the population characteristics of interest. With ESTIMATE statements in PROC SURVEYREG, you can specify a regression estimator as a linear function of the regression parameters to estimate the population total. This example illustrates this application, using the data from the previous example.

In this sample, a linear model between the Swedish populations in 1975 and in 1985 is established:

$$\text{Population85} = \alpha + \beta \cdot \text{Population75} + \text{error}$$

Assuming that the total population in 1975 is known to be 8200 (in thousands), you can use the ESTIMATE statement to predict the 1985 total population using the following statements:

  title1 'Regression Analysis for Swedish Municipalities';
  title2 'Estimate Total Population';
  proc surveyreg data=Municipalities total=50;
     cluster Cluster;
     model Population85=Population75;
     estimate '1985 population' Intercept 284 Population75 8200;
  run;

Since each observation in the sample is a municipality, and there is a total of 284 municipalities in Sweden, the coefficient for Intercept (the intercept parameter α) in the ESTIMATE statement is 284, and the coefficient for Population75 (the slope parameter β) is the total population in 1975 (8.2 million).

Output 71.3.1 displays the regression results and the estimation of the total population. Using the linear model, you can predict the total population in 1985 to be 8.64 million, with a standard error of 0.26 million.
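The point estimate can be reproduced by hand as a linear combination of the fitted coefficients. The sketch below uses the slope 1.0546253 from Output 71.2.1 and assumes that the intercept estimate is −0.0191292; the negative sign is an assumption, chosen because it is the sign that reproduces the reported total:

```latex
\[
\hat{T}_{1985} = 284\,\hat{\alpha} + 8200\,\hat{\beta}
  = 284\,(-0.0191292) + 8200\,(1.0546253)
  \approx -5.43 + 8647.93 = 8642.49
\]
```

The result is in thousands, which corresponds to the 8.64 million quoted above.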

Output 71.3.1: Use the Regression Estimator to Estimate the Population Total
  Regression Analysis for Swedish Municipalities
  Estimate Total Population

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable Population85

  Analysis of Estimable Functions
                                      Standard
  Parameter            Estimate          Error    t Value    Pr > |t|
  1985 population    8642.49485     258.558613      33.43      <.0001
  NOTE: The denominator degrees of freedom for the t tests is 4.
 

Example 71.4. Stratified Sampling

This example illustrates using the SURVEYREG procedure to perform a regression in a stratified sample design. Consider a population of 235 farms producing corn in Nebraska and Iowa. You are interested in the relationship between corn yield (CornYield) and the total farm size (FarmArea).

Each state is divided into several regions, and each region is used as a stratum. Within each stratum, a simple random sample with replacement is drawn. A total of 19 farms is selected using a stratified simple random sample. The sample size and population size within each stratum are displayed in Table 71.3.

Table 71.3: Number of Farms in Each Stratum

                                   Number of Farms
  Stratum   State      Region   Population   Sample
  1         Iowa       1               100        3
  2                    2                50        5
  3                    3                15        3
  4         Nebraska   1                30        6
  5                    2                40        2
  Total                                235       19

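Three models for the data are considered:

  • Model I - Common intercept and slope:

    $$\text{CornYield} = \alpha + \beta \cdot \text{FarmArea} + \text{error}$$
  • Model II - Common intercept, different slope:

    $$\text{CornYield} = \begin{cases} \alpha + \beta_{\text{Iowa}} \cdot \text{FarmArea} + \text{error} & \text{if the farm is in Iowa} \\ \alpha + \beta_{\text{Nebraska}} \cdot \text{FarmArea} + \text{error} & \text{if the farm is in Nebraska} \end{cases}$$
  • Model III - Different intercept and different slope:

    $$\text{CornYield} = \begin{cases} \alpha_{\text{Iowa}} + \beta_{\text{Iowa}} \cdot \text{FarmArea} + \text{error} & \text{if the farm is in Iowa} \\ \alpha_{\text{Nebraska}} + \beta_{\text{Nebraska}} \cdot \text{FarmArea} + \text{error} & \text{if the farm is in Nebraska} \end{cases}$$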

Data from the stratified sample are saved in the SAS data set Farms. In the data set Farms, the variable Weight represents the sampling weight. In this example, the sampling weights are the reciprocals of the selection probabilities:

  data Farms;
     input State $ Region FarmArea CornYield Weight;
     datalines;
  Iowa     1 100  54 33.333
  Iowa     1  83  25 33.333
  Iowa     1  25  10 33.333
  Iowa     2 120  83 10.000
  Iowa     2  50  35 10.000
  Iowa     2 110  65 10.000
  Iowa     2  60  35 10.000
  Iowa     2  45  20 10.000
  Iowa     3  23   5  5.000
  Iowa     3  10   8  5.000
  Iowa     3 350 125  5.000
  Nebraska 1 130  20  5.000
  Nebraska 1 245  25  5.000
  Nebraska 1 150  33  5.000
  Nebraska 1 263  50  5.000
  Nebraska 1 320  47  5.000
  Nebraska 1 204  25  5.000
  Nebraska 2  80  11 20.000
  Nebraska 2  48   8 20.000
  ;

The information on population size in each stratum is saved in the SAS data set StratumTotals:

  data StratumTotals;
     input State $ Region _TOTAL_;
     datalines;
  Iowa     1 100
  Iowa     2  50
  Iowa     3  15
  Nebraska 1  30
  Nebraska 2  40
  ;

Using the sample data from the data set Farms and the control information data from the data set StratumTotals, you can fit Model I using PROC SURVEYREG with the following statements:

  title1 'Analysis of Farm Area and Corn Yield';
  title2 'Model I: Same Intercept and Slope';
  proc surveyreg data=Farms total=StratumTotals;
     strata State Region / list;
     model CornYield = FarmArea;
     weight Weight;
  run;

Output 71.4.1 displays the data summary and stratification information fitting Model I. The sampling rates are automatically computed by the procedure based on the sample sizes and the population totals in strata.
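The relationship between the stratum population totals, the sampling rates, and the weights in the Farms data set can be checked with a short query. This is only a sketch: it assumes the Farms and StratumTotals data sets defined above, and the table name StratumCheck and its column names are illustrative, not part of the original example:

```sas
/* Sketch: recompute the sampling rate and design weight for each stratum */
proc sql;
   create table StratumCheck as
   select f.State,
          f.Region,
          count(*)             as nSample,      /* farms sampled in the stratum */
          t._TOTAL_            as nPopulation,  /* farms in the stratum         */
          count(*) / t._TOTAL_ as SamplingRate, /* n_h / N_h                    */
          t._TOTAL_ / count(*) as DesignWeight  /* N_h / n_h                    */
   from Farms as f, StratumTotals as t
   where f.State = t.State and f.Region = t.Region
   group by f.State, f.Region, t._TOTAL_;
quit;
```

For example, the first Iowa stratum has 3 sampled farms out of 100, which gives a rate of 0.03 and a weight of about 33.333, matching the Weight value recorded in Farms.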

Output 71.4.1: Data Summary and Stratum Information Fitting Model I
  Analysis of Farm Area and Corn Yield
  Model I: Same Intercept and Slope

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Data Summary
  Number of Observations                19
  Sum of Weights                 234.99900
  Weighted Mean of CornYield      31.56029
  Weighted Sum of CornYield         7416.6

  Design Summary
  Number of Strata             5

  Fit Statistics
  R-square            0.3882
  Root MSE           20.6422
  Denominator DF          14

  Stratum Information
  Stratum                                    Population    Sampling
    Index    State       Region    N Obs          Total        Rate
        1    Iowa             1        3            100       3.00%
        2                     2        5             50       10.0%
        3                     3        3             15       20.0%
        4    Nebraska         1        6             30       20.0%
        5                     2        2             40       5.00%
 

Output 71.4.2 displays tests of model effects and the estimated regression coefficients.

Output 71.4.2: Estimated Regression Coefficients and the Estimated Covariance Matrix
  Analysis of Farm Area and Corn Yield
  Model I: Same Intercept and Slope

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Tests of Model Effects
  Effect       Num DF    F Value    Pr > F
  Model             1      21.74    0.0004
  Intercept         1       4.93    0.0433
  FarmArea          1      21.74    0.0004
  NOTE: The denominator degrees of freedom for the F tests is 14.

  Estimated Regression Coefficients
                                Standard
  Parameter      Estimate          Error    t Value    Pr > |t|
  Intercept    11.8162978     5.31981027       2.22      0.0433
  FarmArea      0.2126576     0.04560949       4.66      0.0004
  NOTE: The denominator degrees of freedom for the t tests is 14.
 

Alternatively, you can assume that the linear relationship between corn yield (CornYield) and farm area (FarmArea) is different among the states (Model II). In order to analyze the data using this model, you create auxiliary variables FarmAreaNE and FarmAreaIA to represent farm area in different states:

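$$\text{FarmAreaNE} = \begin{cases} 0 & \text{if the farm is in Iowa} \\ \text{FarmArea} & \text{if the farm is in Nebraska} \end{cases}$$

$$\text{FarmAreaIA} = \begin{cases} \text{FarmArea} & \text{if the farm is in Iowa} \\ 0 & \text{if the farm is in Nebraska} \end{cases}$$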

The following statements create these variables in a new data set called FarmsByState:

  title1 'Analysis of Farm Area and Corn Yield';
  title2 'Model II: Same Intercept, Different Slopes';
  data FarmsByState;
     set Farms;
     if State='Iowa' then do;
        FarmAreaIA=FarmArea;
        FarmAreaNE=0;
     end;
     else do;
        FarmAreaIA=0;
        FarmAreaNE=FarmArea;
     end;
  run;

The following statements perform the regression using the new data set FarmsByState. The analysis uses the auxiliary variables FarmAreaIA and FarmAreaNE as the regressors:

  proc SURVEYREG data=FarmsByState total=StratumTotals;
     strata State Region;
     model CornYield = FarmAreaIA FarmAreaNE;
     weight Weight;
  run;

Output 71.4.3 displays the data summary, design information, fit statistics, and parameter estimates. The estimated slope parameters for the two states (0.417 for Iowa and 0.129 for Nebraska) are quite different from the estimated slope in Model I (0.213). Model II also fits these data better than Model I: the R-square increases from 0.3882 to 0.8158, and the root MSE decreases from 20.64 to 11.68.
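A possible alternative to building the auxiliary variables by hand is to let a CLASS variable generate the state-specific slopes. The following is a sketch rather than part of the original example: it assumes that PROC SURVEYREG accepts a continuous-by-classification effect such as FarmArea*State in the same way as other SAS linear modeling procedures, and the title text is illustrative:

```sas
title2 'Model II (sketch): Same Intercept, State-Specific Slopes';
proc surveyreg data=Farms total=StratumTotals;
   strata State Region;
   class State;
   /* FarmArea*State requests one FarmArea slope per level of State */
   model CornYield = FarmArea*State / solution;
   weight Weight;
run;
```

If the assumption holds, the two FarmArea*State coefficients should correspond to the FarmAreaIA and FarmAreaNE estimates obtained from FarmsByState.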

Output 71.4.3: Regression Results from Fitting Model II
  Analysis of Farm Area and Corn Yield
  Model II: Same Intercept, Different Slopes

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Data Summary
  Number of Observations                19
  Sum of Weights                 234.99900
  Weighted Mean of CornYield      31.56029
  Weighted Sum of CornYield         7416.6

  Design Summary
  Number of Strata             5

  Fit Statistics
  R-square            0.8158
  Root MSE           11.6759
  Denominator DF          14

  Estimated Regression Coefficients
                                 Standard
  Parameter       Estimate          Error    t Value    Pr > |t|
  Intercept     4.04234816     3.80934848       1.06      0.3066
  FarmAreaIA    0.41696069     0.05971129       6.98      <.0001
  FarmAreaNE    0.12851012     0.02495495       5.15      0.0001
  NOTE: The denominator degrees of freedom for the t tests is 14.
 

For Model III, different intercepts are used for the linear relationship in two states. The following statements illustrate the use of the NOINT option in the MODEL statement associated with the CLASS statement to fit Model III:

  title2 'Model III: Different Intercepts and Slopes';
  proc SURVEYREG data=FarmsByState total=StratumTotals;
     strata State Region;
     class State;
     model CornYield = State FarmAreaIA FarmAreaNE / noint covb solution;
     weight Weight;
  run;

The MODEL statement includes the classification effect State as a regressor. Therefore, the parameter estimates for the effect State represent the intercepts in the two states.

Output 71.4.4 displays the regression results for fitting Model III, including the data summary, parameter estimates, and covariance matrix of the regression coefficients. The estimated covariance matrix shows a lack of correlation between the regression coefficients from different states. This suggests that Model III might be the best choice for building a model for farm area and corn yield in these two states.
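As a quick consistency check on the covariance matrix, the square roots of its diagonal elements should reproduce the standard errors in the 'Estimated Regression Coefficients' table of Output 71.4.4. The values below are taken from that output; only the arithmetic is added here:

```latex
\[
\sqrt{27.790863033} \approx 5.2717, \qquad
\sqrt{2.8910750385} \approx 1.7003,
\]
\[
\sqrt{0.0041711265} \approx 0.06458, \qquad
\sqrt{0.0003988349} \approx 0.01997
\]
```

These agree with the reported standard errors for State Iowa, State Nebraska, FarmAreaIA, and FarmAreaNE, respectively.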

Output 71.4.4: Regression Results for Fitting Model III
  Analysis of Farm Area and Corn Yield
  Model III: Different Intercepts and Slopes

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Data Summary
  Number of Observations                19
  Sum of Weights                 234.99900
  Weighted Mean of CornYield      31.56029
  Weighted Sum of CornYield         7416.6

  Design Summary
  Number of Strata             5

  Fit Statistics
  R-square            0.9300
  Root MSE           11.9810
  Denominator DF          14

  Estimated Regression Coefficients
                                      Standard
  Parameter            Estimate          Error    t Value    Pr > |t|
  State Iowa         5.27797099     5.27170400       1.00      0.3337
  State Nebraska     0.65275201     1.70031616       0.38      0.7068
  FarmAreaIA         0.40680971     0.06458426       6.30      <.0001
  FarmAreaNE         0.14630563     0.01997085       7.33      <.0001
  NOTE: The denominator degrees of freedom for the t tests is 14.

  Covariance of Estimated Regression Coefficients
                      State Iowa  State Nebraska      FarmAreaIA      FarmAreaNE
  State Iowa        27.790863033               0    -0.205517205               0
  State Nebraska               0    2.8910750385               0    -0.027354011
  FarmAreaIA        -0.205517205               0    0.0041711265               0
  FarmAreaNE                   0    -0.027354011               0    0.0003988349
 

However, some statistics remain the same under the different regression models, for example, the weighted mean of CornYield. These estimators do not depend on the particular model you use.

Example 71.5. Regression Estimator for Stratified Sample

This example uses the corn yield data from the previous example to illustrate how to construct a regression estimator for a stratified sample design.

As in Example 71.3, incorporating auxiliary information into a regression estimator enables the procedure to produce more accurate estimates of the population characteristics that are of interest. In this example, the sample design is a stratified sample design. The auxiliary information is the total farm area in the regions of each state, as displayed in Table 71.4. You want to estimate the total corn yield by using this information under the three linear models given in Example 71.4.

Table 71.4: Information for Each Stratum

                                  Number of Farms in     Total Farm Area
  Stratum   State      Region   Population   Sample      (by State)
  1         Iowa       1               100        3           13,200
  2                    2                50        5
  3                    3                15        3
  4         Nebraska   1                30        6            8,750
  5                    2                40        2
  Total                                235       19           21,950

The regression estimator to estimate the total corn yield under Model I can be obtained by using PROC SURVEYREG with an ESTIMATE statement:

  title1 'Estimate Corn Yield from Farm Size';
  title2 'Model I: Same Intercept and Slope';
  proc surveyreg data=Farms total=StratumTotals;
     strata State Region / list;
     class State Region;
     model CornYield = FarmArea State*Region / solution;
     weight Weight;
     estimate 'Estimate of CornYield under Model I'
              INTERCEPT 235 FarmArea 21950
              State*Region 100 50 15 30 40 / e;
  run;

To apply the constraint in each stratum that the weighted total number of farms equals the total number of farms in the stratum, you can include the strata as an effect in the MODEL statement, effect State*Region. Thus, the CLASS statement must list the STRATA variables, State and Region, as classification variables. The following ESTIMATE statement specifies the regression estimator, which is a linear function of the regression parameters:

  estimate 'Estimate of CornYield under Model I'
           INTERCEPT 235 FarmArea 21950
           State*Region 100 50 15 30 40 / e;

This linear function contains the total for each explanatory variable in the model. Because the sampling units are farms in this example, the coefficient for Intercept in the ESTIMATE statement is the total number of farms (235); the coefficient for FarmArea is the total farm area listed in Table 71.4 (21950); and the coefficients for the effect State*Region are the total numbers of farms in each stratum (as displayed in Table 71.4).
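Written out, the ESTIMATE statement requests the following linear combination of the fitted parameters. The symbols are introduced here only for illustration: \(\hat{\beta}_0\) denotes the intercept estimate, \(\hat{\beta}_1\) the FarmArea slope estimate, and \(\hat{\gamma}_1, \dots, \hat{\gamma}_5\) the State*Region estimates for strata 1 through 5 in the order of Table 71.4:

```latex
\[
\hat{Y}_{\text{Model I}}
  = 235\,\hat{\beta}_0 + 21950\,\hat{\beta}_1
  + 100\,\hat{\gamma}_1 + 50\,\hat{\gamma}_2 + 15\,\hat{\gamma}_3
  + 30\,\hat{\gamma}_4 + 40\,\hat{\gamma}_5
\]
\[
100 + 50 + 15 + 30 + 40 = 235
\]
```

The second line shows that the stratum coefficients add up to the Intercept coefficient, since both count the same 235 farms.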

Output 71.5.1 displays the results of the ESTIMATE statement. The regression estimator for the total of CornYield in Iowa and Nebraska is 7464 under Model I, with a standard error of 927.

Output 71.5.1: Regression Estimator for the Total of CornYield under Model I
  Estimate Corn Yield from Farm Size
  Model I: Same Intercept and Slope

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Analysis of Estimable Functions
                                                      Standard
  Parameter                              Estimate        Error   t Value   Pr > |t|
  Estimate of CornYield under Model I  7463.52329   926.841541      8.05     <.0001
  NOTE: The denominator degrees of freedom for the t tests is 14.
 

Under Model II, a regression estimator for totals can be obtained using the following statements:

  title1 'Estimate Corn Yield from Farm Size';
  title2 'Model II: Same Intercept, Different Slopes';
  proc surveyreg data=FarmsByState total=StratumTotals;
     strata State Region;
     class State Region;
     model CornYield = FarmAreaIA FarmAreaNE
                       State*Region / solution;
     weight Weight;
     estimate 'Total of CornYield under Model II'
              INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750
              State*Region 100 50 15 30 40 / e;
  run;

In this model, you also need to include the strata as a fixed effect in the MODEL statement. The other regressors are the auxiliary variables FarmAreaIA and FarmAreaNE (defined in Example 71.4). In the following ESTIMATE statement, the coefficient for Intercept is still the total number of farms, and the coefficients for FarmAreaIA and FarmAreaNE are the total farm areas in Iowa and Nebraska, respectively, as displayed in Table 71.4. The total numbers of farms in each stratum are the coefficients for the strata effect:

  estimate 'Total of CornYield under Model II'
           INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750
           State*Region 100 50 15 30 40 / e;

Output 71.5.2 shows that the regression estimator for the total corn yield in the two states under Model II is 7580, with a standard error of 859. The regression estimator under Model II has a slightly smaller standard error than the one under Model I (859 compared with 927).

Output 71.5.2: Regression Estimator for the Total of CornYield under Model II
  Estimate Corn Yield from Farm Size
  Model II: Same Intercept, Different Slopes

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Analysis of Estimable Functions
                                                    Standard
  Parameter                            Estimate        Error   t Value   Pr > |t|
  Total of CornYield under Model II  7580.48657   859.180439      8.82     <.0001
  NOTE: The denominator degrees of freedom for the t tests is 14.
 

Finally, you can apply Model III to the data and estimate the total corn yield. Under Model III, you can also obtain the regression estimators for the total corn yield for each state. Three ESTIMATE statements are used in the following statements to create the three regression estimators:

  title1 'Estimate Corn Yield from Farm Size';
  title2 'Model III: Different Intercepts and Slopes';
  proc SURVEYREG data=FarmsByState total=StratumTotals;
     strata State Region;
     class State Region;
     model CornYield = State FarmAreaIA FarmAreaNE
                       State*Region / noint solution;
     weight Weight;
     estimate 'Total CornYield in Iowa under Model III'
              State 165 0 FarmAreaIA 13200 FarmAreaNE 0
              State*Region 100 50 15 0 0 / e;
     estimate 'Total CornYield in Nebraska under Model III'
              State 0 70 FarmAreaIA 0 FarmAreaNE 8750
              State*Region 0 0 0 30 40 / e;
     estimate 'Total CornYield in both states under Model III'
              State 165 70 FarmAreaIA 13200 FarmAreaNE 8750
              State*Region 100 50 15 30 40 / e;
  run;

The fixed effect State is added to the MODEL statement to obtain different intercepts in different states, using the NOINT option. Among the ESTIMATE statements, the coefficients for explanatory variables are different depending on which regression estimator is estimated. For example, in the ESTIMATE statement

  estimate 'Total CornYield in Iowa under Model III'
           State 165 0 FarmAreaIA 13200 FarmAreaNE 0
           State*Region 100 50 15 0 0 / e;

the coefficients for the effect State are 165 and 0, respectively. These coefficients count the 165 farms in Iowa and none of the farms in Nebraska, because the quantity being estimated is the total corn yield in Iowa only. Similarly, the total numbers of farms in the three regions of Iowa are used as the coefficients of the strata effect State*Region, as displayed in Table 71.4.

Output 71.5.3 displays the results from the three regression estimators under Model III. Because the estimations are independent in each state, the total corn yield from both states equals the sum of the estimated totals for Iowa and Nebraska, 6246 + 1334 = 7580. This regression estimator is the same as the one under Model II. The variance of the regression estimator of the total corn yield in both states is the sum of the variances of the regression estimators for each state. Therefore, it is not necessary to use Model III to obtain the regression estimator for the total corn yield unless you need to estimate the total corn yield for each individual state.
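Both additivity statements can be checked against the figures reported in Output 71.5.3. The following arithmetic uses only those reported estimates and standard errors; small discrepancies are due to rounding in the printed values:

```latex
\[
6246.10697 + 1334.37961 \approx 7580.4866
\]
\[
\sqrt{851.272^{2} + 116.302948^{2}}
  \approx \sqrt{724664.0 + 13526.4}
  \approx 859.18
\]
```

The first line matches the combined estimate, and the second matches the standard error reported for the total in both states.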

Output 71.5.3: Regression Estimator for the Total of CornYield under Model III
  Estimate Corn Yield from Farm Size
  Model III: Different Intercepts and Slopes

  The SURVEYREG Procedure
  Regression Analysis for Dependent Variable CornYield

  Analysis of Estimable Functions
                                                                  Standard
  Parameter                                         Estimate         Error   t Value
  Total CornYield in Iowa under Model III         6246.10697       851.272      7.34
  Total CornYield in Nebraska under Model III     1334.37961    116.302948     11.47
  Total CornYield in both states under Model III  7580.48657    859.180439      8.82

  Analysis of Estimable Functions
  Parameter                                       Pr > |t|
  Total CornYield in Iowa under Model III           <.0001
  Total CornYield in Nebraska under Model III       <.0001
  Total CornYield in both states under Model III    <.0001
  NOTE: The denominator degrees of freedom for the t tests is 14.
 

Example 71.6. Stratum Collapse

In a stratified sample, it is possible that some strata will have only one sampling unit. When this happens, PROC SURVEYREG collapses the strata that contain a single sampling unit into a pooled stratum. For more detailed information on stratum collapse, see the section 'Stratum Collapse'.

Suppose that you have the following data:

  data Sample;
     input Stratum X Y W;
     datalines;
  10 0 0 5
  10 1 1 5
  11 1 1 10
  11 1 2 10
  12 3 3 16
  33 4 4 45
  14 6 7 50
  12 3 4 16
  ;

The variable Stratum is the stratification variable, the variable X is the independent variable, and the variable Y is the dependent variable. You want to regress Y on X. In the data set Sample, both Stratum=33 and Stratum=14 contain one observation. By default, PROC SURVEYREG collapses these strata into one pooled stratum in the regression analysis.

To input the finite population correction information, you create the SAS data set StratumTotals :

  data StratumTotals;
     input Stratum _TOTAL_;
     datalines;
  10 10
  11 20
  12 32
  33 40
  33 45
  14 50
  15  .
  66 70
  ;

The variable Stratum is the stratification variable, and the variable _TOTAL_ contains the stratum totals. The data set StratumTotals contains more strata than the data set Sample. Also, in the data set StratumTotals, more than one observation contains the stratum total for Stratum=33:

  33 40
  33 45

PROC SURVEYREG allows this type of input. The procedure simply ignores strata that are not present in the data set Sample; for multiple entries of a stratum, the procedure uses the first observation. In this example, Stratum=33 has the stratum total _TOTAL_=40.
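If you prefer not to rely on this behavior, the totals data set can be cleaned before the analysis. The following step is a hypothetical sketch, not part of the original example: the output name StratumTotalsClean is illustrative, and the step relies on PROC SORT keeping the first observation of each BY group under its default EQUALS behavior:

```sas
/* Hypothetical cleanup step; StratumTotalsClean is an illustrative name */
proc sort data=StratumTotals out=StratumTotalsClean nodupkey;
   where _TOTAL_ is not missing;   /* drops Stratum=15, which has no total   */
   by Stratum;                     /* NODUPKEY keeps one row per Stratum, so */
run;                               /* Stratum=33 keeps the first entry (40)  */
```

The cleaned data set keeps the same total for Stratum=33 that the procedure would use, so substituting it in the TOTAL= option should not change the analysis.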

The following SAS statements perform the regression analysis:

  title1 'Stratified Sample with Single Sampling Unit in Strata';
  title2 'With Stratum Collapse';
  proc SURVEYREG data=Sample total=StratumTotals;
     strata Stratum / list;
     model Y=X;
     weight W;
  run;

Output 71.6.1 shows that the input data set contains five strata, two of which are collapsed into a pooled stratum. Because of the collapse, the denominator degrees of freedom is 4 (see the section 'Denominator Degrees of Freedom' on page 4386).

Output 71.6.1: Summary of Data and Regression
start example
  Stratified Sample with Single Sampling Unit in Strata
  With Stratum Collapse

  The SURVEYREG Procedure

  Regression Analysis for Dependent Variable Y

  Data Summary
  Number of Observations             8
  Sum of Weights             157.00000
  Weighted Mean of Y           4.31210
  Weighted Sum of Y          677.00000

  Design Summary
  Number of Strata                       5
  Number of Strata Collapsed             2

  Fit Statistics
  R-square            0.9564
  Root MSE            0.5111
  Denominator DF           4
end example
 

Output 71.6.2 displays the stratification information, including stratum collapse. Under the column Collapsed, the fourth stratum (Stratum=14) and the fifth (Stratum=33) are marked as 'Yes', which indicates that these two strata are collapsed into the pooled stratum (Stratum Index=0). The sampling rate for the pooled stratum is 2.22% (see the section 'Sampling Rate of the Pooled Stratum from Collapse' on page 4389).

Output 71.6.2: Stratification Information
start example
  Stratified Sample with Single Sampling Unit in Strata
  With Stratum Collapse

  The SURVEYREG Procedure

  Regression Analysis for Dependent Variable Y

  Stratum Information

  Stratum                                      Population    Sampling
    Index    Collapsed    Stratum    N Obs          Total        Rate
        1                      10        2             10       20.0%
        2                      11        2             20       10.0%
        3                      12        2             32       6.25%
        4    Yes               14        1             50       2.00%
        5    Yes               33        1             40       2.50%
        0    Pooled                      2             90       2.22%

  NOTE: Strata with only one observation are collapsed into the stratum with
  Stratum Index "0".
end example
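
The pooled-stratum row in Output 71.6.2 can be checked by hand. The following DATA step is a minimal sketch, assuming that the pooled sampling rate is the pooled number of observations divided by the pooled population total, which is consistent with the values in the table; the variable names are illustrative only:

```sas
data _null_;
   n_pooled     = 1 + 1;       /* one observation each from strata 14 and 33 */
   total_pooled = 50 + 40;     /* population totals of strata 14 and 33      */
   rate_pooled  = n_pooled / total_pooled;
   put n_pooled= total_pooled= rate_pooled= percent8.2;
run;
```

The rate written to the log should agree with the 2.22% shown for Stratum Index 0.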
 

Output 71.6.3 displays the parameter estimates and the tests of the significance of the model effects.

Output 71.6.3: Parameter Estimates and Effect Tests
start example
  Stratified Sample with Single Sampling Unit in Strata
  With Stratum Collapse

  The SURVEYREG Procedure

  Regression Analysis for Dependent Variable Y

  Tests of Model Effects

  Effect       Num DF    F Value    Pr > F
  Model             1     173.01    0.0002
  Intercept         1       0.00    0.9961
  X                 1     173.01    0.0002

  NOTE: The denominator degrees of freedom for the F tests is 4.

  Estimated Regression Coefficients

                               Standard
  Parameter      Estimate         Error    t Value  Pr > |t|
  Intercept    0.00179469    0.34306373       0.01    0.9961
  X            1.12598708    0.08560466      13.15    0.0002

  NOTE: The denominator degrees of freedom for the t tests is 4.
end example
 

Alternatively, if you prefer not to collapse strata with a single sampling unit, you can specify the NOCOLLAPSE option in the STRATA statement:

  title1 'Stratified Sample with Single Sampling Unit in Strata';
  title2 'Without Stratum Collapse';
  proc SURVEYREG data=Sample total=StratumTotals;
     strata Stratum / list nocollapse;
     model Y = X;
     weight W;
  run;

Output 71.6.4 does not contain the stratum collapse information displayed in Output 71.6.1, and the denominator degrees of freedom is 3 instead of 4.

Output 71.6.4: Summary of Data and Regression
start example
  Stratified Sample with Single Sampling Unit in Strata
  Without Stratum Collapse

  The SURVEYREG Procedure

  Regression Analysis for Dependent Variable Y

  Data Summary
  Number of Observations             8
  Sum of Weights             157.00000
  Weighted Mean of Y           4.31210
  Weighted Sum of Y          677.00000

  Design Summary
  Number of Strata             5

  Fit Statistics
  R-square            0.9564
  Root MSE            0.5111
  Denominator DF           3
end example
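
The change in the denominator degrees of freedom can be reproduced by hand. The following DATA step is a minimal sketch, assuming the default rule for a design without clustering that the denominator degrees of freedom equal the number of observations minus the number of strata used in variance estimation (see the section 'Denominator Degrees of Freedom'); the variable names are illustrative only:

```sas
data _null_;
   n_obs             = 8;      /* observations in Sample                     */
   strata_collapse   = 3 + 1;  /* strata 10, 11, 12 plus the pooled stratum  */
   strata_nocollapse = 5;      /* strata 10, 11, 12, 14, 33 kept separate    */
   df_collapse   = n_obs - strata_collapse;
   df_nocollapse = n_obs - strata_nocollapse;
   put df_collapse= df_nocollapse=;
run;
```

The two values written to the log correspond to the Denominator DF of 4 in Output 71.6.1 and 3 in Output 71.6.4.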
 

In Output 71.6.5, although the fourth stratum and the fifth stratum contain only one observation, no stratum collapse occurs.

Output 71.6.5: Stratification Information
start example
  Stratified Sample with Single Sampling Unit in Strata
  Without Stratum Collapse

  The SURVEYREG Procedure

  Regression Analysis for Dependent Variable Y

  Stratum Information

  Stratum                         Population    Sampling
    Index    Stratum    N Obs          Total        Rate
        1         10        2             10       20.0%
        2         11        2             20       10.0%
        3         12        2             32       6.25%
        4         14        1             50       2.00%
        5         33        1             40       2.50%
end example
 
Output 71.6.6: Parameter Estimates and Effect Tests
start example
  Stratified Sample with Single Sampling Unit in Strata
  Without Stratum Collapse

  The SURVEYREG Procedure

  Regression Analysis for Dependent Variable Y

  Tests of Model Effects

  Effect       Num DF    F Value    Pr > F
  Model             1     347.27    0.0003
  Intercept         1       0.00    0.9962
  X                 1     347.27    0.0003

  NOTE: The denominator degrees of freedom for the F tests is 3.

  Estimated Regression Coefficients

                               Standard
  Parameter      Estimate         Error    t Value  Pr > |t|
  Intercept    0.00179469    0.34302581       0.01    0.9962
  X            1.12598708    0.06042241      18.64    0.0003

  NOTE: The denominator degrees of freedom for the t tests is 3.
end example
 

Because the strata are not collapsed, the standard error estimates of the parameters differ from those in Output 71.6.3, as do the tests of the significance of the model effects.
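
The effect of the two analyses on the test for X can be recomputed from the reported estimate and standard errors. The following DATA step is a sketch under the assumption that each t value is the estimate divided by its standard error and that the p-value is two-sided with the denominator degrees of freedom reported in the output; the variable names are illustrative only:

```sas
data _null_;
   estimate = 1.12598708;                 /* coefficient of X in both analyses */

   /* With stratum collapse: standard error from Output 71.6.3, 4 df */
   t_collapse = estimate / 0.08560466;
   p_collapse = 2 * (1 - probt(abs(t_collapse), 4));

   /* Without stratum collapse: standard error from Output 71.6.6, 3 df */
   t_nocollapse = estimate / 0.06042241;
   p_nocollapse = 2 * (1 - probt(abs(t_nocollapse), 3));

   put t_collapse= 8.2 p_collapse= 8.4;
   put t_nocollapse= 8.2 p_nocollapse= 8.4;
run;
```

Up to rounding, the values written to the log should match the t Value and p-value entries for X in Outputs 71.6.3 and 71.6.6. The point estimate is the same in both analyses; only the standard error and the degrees of freedom change.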




SAS.STAT 9.1 Users Guide (Vol. 6)
Year: 2004
Pages: 127
