The 'Getting Started' section on page 4315 contains examples of analyzing data from simple random sampling and stratified simple random sample designs. This section provides more examples that illustrate how to use PROC SURVEYMEANS.
Consider the example in the section 'Stratified Sampling' on page 4318. The study population is a junior high school with a total of 4,000 students in grades 7, 8, and 9. Researchers want to know how much these students spend weekly for ice cream, on average, and what percentage of students spend at least $10 weekly for ice cream.
The example in the section 'Stratified Sampling' on page 4318 assumes that the sample of students was selected using a stratified simple random sample design. This example shows analysis based on a more complex sample design.
Suppose that every student belongs to a study group and that study groups are formed within each grade level. Each study group contains between two and four students. Table 70.4 shows the total number of study groups for each grade.
Grade | Number of Study Groups | Number of Students |
---|---|---|
7 | 608 | 1,824 |
8 | 252 | 1,025 |
9 | 403 | 1,151 |
Total | 1,263 | 4,000 |
It is quicker and more convenient to collect data from students in the same study group than to collect data from students individually. Therefore, this study uses a stratified clustered sample design. The primary sampling units, or clusters, are study groups. The list of all study groups in the school is stratified by grade level. From each grade level, a sample of study groups is randomly selected, and all students in each selected study group are interviewed. The sample consists of eight study groups from the 7th grade, three groups from the 8th grade, and five groups from the 9th grade.
The SAS data set named IceCreamStudy saves the responses of the selected students:
    data IceCreamStudy;
       input Grade StudyGroup Spending @@;
       if (Spending < 10) then Group='less';
          else Group='more';
       datalines;
    7  34  7   7  34  7   7 412  4   9  27 14
    7  34  2   9 230 15   9  27 15   7 501  2
    9 230  8   9 230  7   7 501  3   8  59 20
    7 403  4   7 403 11   8  59 13   8  59 17
    8 143 12   8 143 16   8  59 18   9 235  9
    8 143 10   9 312  8   9 235  6   9 235 11
    9 312 10   7 321  6   8 156 19   8 156 14
    7 321  3   7 321 12   7 489  2   7 489  9
    7  78  1   7  78 10   7 489  2   7 156  1
    7  78  6   7 412  6   7 156  2   9 301  8
    ;
In the data set IceCreamStudy, the variable Grade contains a student's grade. The variable StudyGroup identifies a student's study group. It is possible for students from different grades to have the same study group number because study groups are sequentially numbered within each grade. The variable Spending contains a student's response regarding how much he or she spends per week for ice cream, in dollars. The variable Group indicates whether a student spends at least $10 weekly for ice cream. It is not necessary to store the data in order of grade and study group.
The SAS data set StudyGroups is created to provide PROC SURVEYMEANS with the sample design information shown in Table 70.4:
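Because study group numbers repeat across grades, a cluster is identified by the combination of Grade and StudyGroup. The following is a minimal sketch, not part of the original example, of one way to confirm the number of sampled clusters and interviewed students in each grade before running the analysis. The column names SampledGroups and Students are hypothetical names chosen for illustration.

```sas
/* Sketch: count sampled study groups and interviewed students by grade. */
/* The distinct count is taken within each grade because study group     */
/* numbers are reused from one grade to the next.                        */
proc sql;
   select Grade,
          count(distinct StudyGroup) as SampledGroups,  /* hypothetical name */
          count(*)                   as Students        /* hypothetical name */
   from IceCreamStudy
   group by Grade
   order by Grade;
quit;
```

For the data shown above, this query should report 8, 3, and 5 study groups and 20, 9, and 11 students for grades 7, 8, and 9, which agrees with the sample description.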
    data StudyGroups;
       input Grade _total_;
       datalines;
    7 608
    8 252
    9 403
    ;
The variable Grade identifies the strata, and the variable _TOTAL_ contains the total number of study groups in each stratum. As discussed in the section 'Specification of Population Totals and Sampling Rates' on page 4334, the population totals stored in the variable _TOTAL_ should be expressed in terms of the primary sampling units (PSUs), which are study groups in this example. Therefore, the variable _TOTAL_ contains the total number of study groups for each grade, rather than the total number of students.
In order to obtain unbiased estimates, you create sampling weights using the following SAS statements:
    data IceCreamStudy;
       set IceCreamStudy;
       if Grade=7 then Prob=8/608;
       if Grade=8 then Prob=3/252;
       if Grade=9 then Prob=5/403;
       Weight=1/Prob;

The sampling weights are the reciprocals of the selection probabilities. The variable Weight contains the sampling weights. Because the sampling design is clustered, and all students from each selected cluster are interviewed, the sampling weights equal the inverse of the cluster (or study group) selection probabilities.
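The weights produced by this DATA step can be checked by hand. The following worked sketch uses only the counts in Table 70.4 and the sample sizes given earlier. The symbols are notation introduced here for illustration: N_h is the number of study groups in grade h, n_h is the number of sampled study groups, and m_h is the number of interviewed students (20, 9, and 11 in this sample).

```latex
\[
w_h = \frac{N_h}{n_h}: \qquad
w_7 = \frac{608}{8} = 76, \quad
w_8 = \frac{252}{3} = 84, \quad
w_9 = \frac{403}{5} = 80.6
\]
\[
\sum_h m_h\, w_h = 20(76) + 9(84) + 11(80.6) = 1520 + 756 + 886.6 = 3162.6
\]
```

The total 3162.6 agrees with the Sum of Weights that PROC SURVEYMEANS reports in the 'Data Summary' table of Output 70.1.1.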
The following SAS statements perform the analysis for this sample design:
    title1 'Analysis of Ice Cream Spending';
    title2 'Stratified Clustered Sample Design';
    proc surveymeans data=IceCreamStudy total=StudyGroups;
       strata Grade / list;
       cluster StudyGroup;
       var Spending Group;
       weight Weight;
    run;
Output 70.1.1 provides information on the sample design and the input data set. There are 3 strata in the sample design, and the sample contains 16 clusters and 40 observations. The variable Group has two levels, 'less' and 'more'.

                     Analysis of Ice Cream Spending
                   Stratified Clustered Sample Design

                       The SURVEYMEANS Procedure

                             Data Summary

                  Number of Strata                   3
                  Number of Clusters                16
                  Number of Observations            40
                  Sum of Weights                3162.6

                        Class Level Information

                  Class
                  Variable      Levels    Values

                  Group              2    less more
Output 70.1.2 displays information for each stratum. Since the primary sampling units in this design are study groups, the population totals shown in Output 70.1.2 are the total numbers of study groups for each stratum or grade. This differs from Figure 70.3 on page 4320, which provides the population totals in terms of students since students were the primary sampling units for that design. Output 70.1.2 also displays the number of clusters for each stratum and analysis variable.
                        Analysis of Ice Cream Spending
                      Stratified Clustered Sample Design

                          The SURVEYMEANS Procedure

                             Stratum Information

    Stratum         Population  Sampling
      Index  Grade       Total      Rate  N Obs  Variable  Level         N
    ----------------------------------------------------------------------
          1      7         608     1.32%     20  Spending               20
                                                 Group     less         17
                                                           more          3
          2      8         252     1.19%      9  Spending                9
                                                 Group     less          0
                                                           more          9
          3      9         403     1.24%     11  Spending               11
                                                 Group     less          6
                                                           more          5
    ----------------------------------------------------------------------

                             Stratum Information

    Stratum         Population  Sampling
      Index  Grade       Total      Rate  N Obs  Variable  Level  Clusters
    ----------------------------------------------------------------------
          1      7         608     1.32%     20  Spending                8
                                                 Group     less          8
                                                           more          3
          2      8         252     1.19%      9  Spending                3
                                                 Group     less          0
                                                           more          3
          3      9         403     1.24%     11  Spending                5
                                                 Group     less          4
                                                           more          4
    ----------------------------------------------------------------------
Output 70.1.3 displays the estimates of the average weekly ice cream expense and the percentage of students spending at least $10 weekly for ice cream.
                     Analysis of Ice Cream Spending
                   Stratified Clustered Sample Design

                       The SURVEYMEANS Procedure

                               Statistics

                                             Std Error     Lower 95%
    Variable  Level      N          Mean       of Mean   CL for Mean
    ----------------------------------------------------------------
    Spending            40      8.923860      0.650859      7.517764
    Group     less      23      0.561437      0.056368      0.439661
              more      17      0.438563      0.056368      0.316787
    ----------------------------------------------------------------

                               Statistics

                        Upper 95%
    Variable  Level   CL for Mean
    -----------------------------
    Spending            10.329957
    Group     less       0.683213
              more       0.560339
    -----------------------------
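The confidence limits for the mean of Spending can be reproduced from the reported mean and standard error. This is a worked sketch that assumes the degrees of freedom equal the number of clusters minus the number of strata, which is the usual choice for a stratified clustered design, and that takes the t quantile (about 2.1604) from a standard t table rather than from the output.

```latex
\[
\mathit{df} = 16 - 3 = 13, \qquad t_{0.975,\,13} \approx 2.1604
\]
\[
8.923860 \pm 2.1604 \times 0.650859 \approx 8.923860 \pm 1.4061
\;\Longrightarrow\; (7.5178,\ 10.3300)
\]
```

These values agree with the reported limits 7.517764 and 10.329957 to three decimal places.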
Suppose that you are studying profiles of the 800 top-performing companies to provide information on their impact on the economy. You are also interested in the company profiles within each market type. A sample of 66 companies is selected with unequal probability across market types. However, market type is not included in the sample design. Thus, the number of companies within each market type is a random variable in your sample. To obtain statistics within each market type, you should use domain analysis. The data of the 66 companies are saved in the following data set:
    data Company;
       length Type $14;
       input Type$ Asset Sale Value Profit Employee Weight;
       datalines;
    Other            2764.0   1828.0  1850.3  144.0   18.7   9.6
    Energy          13246.2   4633.5  4387.7  462.9   24.3  42.6
    Finance          3597.7    377.8    93.0   14.0    1.1  12.2
    Transportation   6646.1   6414.2  2377.5  348.2   47.1  21.8
    HiTech           1068.4   1689.8  1430.2   72.9    4.6   4.3
    Manufacturing    1125.0   1719.4  1057.5   98.1   20.4   4.5
    Other            1459.0   1241.4   452.7   24.5   20.1   5.5
    Finance          2672.3    262.5   296.2   23.1    2.2   9.3
    Finance           311.0    566.2   932.0   52.8    2.7   1.9
    Energy           1148.6   1014.6   485.1   60.6    4.0   4.5
    Finance          5327.0    572.4   372.9   25.2    4.2  17.7
    Energy           1602.7    678.4   653.0   75.6    2.8   6.0
    Energy           5808.8   1288.4  2007.0  318.8    5.9  19.2
    Medical           268.8    204.4   820.9   45.6    3.7   1.8
    Transportation   5222.6   2627.8  1910.0  245.6   22.8  17.4
    Other             872.7   1419.4   939.3   69.7   12.2   3.7
    Retail           4461.7   8946.8  4662.7  289.0  132.1  15.0
    HiTech           6719.2   6942.0  8240.2  381.3   85.8  22.1
    Retail            833.4   1538.8  1090.3   64.9   15.4   3.5
    Finance           415.9    167.3  1126.8   56.8    0.7   2.2
    HiTech            442.4   1139.9  1039.9   57.6   22.7   2.3
    Other             801.5   1157.0   664.2   56.9   15.5   3.4
    Finance          4954.8    468.8   366.4   41.7    3.0  16.5
    Finance          2661.9    257.9   181.1   21.2    2.1   9.3
    Finance          5345.8    530.1   337.4   36.4    4.3  17.8
    Energy           3334.3   1644.7  1407.8  157.6    6.4  11.4
    Manufacturing    1826.6   2671.7   483.2   71.3   25.3   6.7
    Retail            618.8   2354.7   767.7   58.6   19.0   2.9
    Retail           1529.1   6534.0   826.3   58.3   65.8   5.7
    Manufacturing    4458.4   4824.5  3132.1   28.9   67.0  15.0
    HiTech           5831.7   6611.1  9464.7  459.6   86.7  19.3
    Medical          6468.3   4199.2  3170.4  270.1   59.5  21.3
    Energy           1720.7    473.1   811.1   86.6    1.6   6.3
    Energy           1679.7   1379.9   721.1   91.8    4.5   6.2
    Retail           4018.2  16823.4  2038.3  178.1  162.0  13.6
    Other             227.1    575.8  1083.8   62.6    1.9   1.6
    Finance          3872.8    362.0   209.3   27.6    2.4  13.1
    Retail           3359.3   4844.7  2651.4  224.1   75.6  11.5
    Energy           1295.6    356.9   180.8  162.3    0.6   5.0
    Energy           1658.0    626.6   688.0  126.0    3.5   6.1
    Finance         12156.7   1345.5   680.7  106.6    9.4  39.2
    HiTech           3982.6   4196.0  3946.8  313.9   64.3  13.5
    Finance          8760.7    886.4  1006.9   90.0    7.5  28.5
    Manufacturing    2362.2   3153.3  1080.0  137.0   25.2   8.4
    Transportation   2499.9   3419.0   992.6   47.2   25.3   8.8
    Energy           1430.4   1610.0   664.3   77.7    3.5   5.4
    Energy          13666.5  15465.4  2736.7  411.4   26.6  43.9
    Manufacturing    4069.3   4174.7  2907.6  289.2   38.2  13.7
    Energy           2924.7    711.9  1067.8  146.7    3.4  10.1
    Transportation   1262.1   1716.0   364.3   71.2   14.5   4.9
    Medical           684.4    672.9   287.4   61.8    6.0   3.1
    Energy           3069.3   1719.0  1439.0  196.4    4.9  10.6
    Medical           246.5    318.8   924.1   43.8    3.1   1.7
    Finance         11562.2   1128.5   580.4   64.2    6.7  37.3
    Finance          9316.0   1059.4   816.5   95.9    8.0  30.2
    Retail           1094.3   3848.0   563.3   29.4   44.7   4.4
    Retail           1102.1   4878.3   932.4   65.2   47.3   4.4
    HiTech            466.4    675.8   845.7   64.5    5.2   2.4
    Manufacturing   10839.4   5468.7  1895.4  232.8   47.8  35.0
    Manufacturing     733.5   2135.3    96.6   10.9    2.7   3.2
    Manufacturing   10354.2  14477.4  5607.2  321.9  188.5  33.5
    Energy           1902.1   2697.9   329.3   34.2    2.2   6.9
    Other            2245.2   2132.2  2230.4  198.9    8.0   8.0
    Transportation    949.4   1248.3   298.9   35.4   10.4   3.9
    Retail           2834.4   2884.6   458.2   41.2   49.8   9.8
    Retail           2621.1   6173.8  1992.7  183.7  115.1   9.2
    ;
For each company in your sample,
the variable Type identifies the type of market for the company.
the variable Asset contains the company's assets in millions of dollars.
the variable Sale contains sales in millions of dollars.
the variable Value contains the market value of the company in millions of dollars.
the variable Profit contains the profit in millions of dollars.
the variable Employee stores the number of employees in thousands.
the variable Weight contains the sampling weight.
The following SAS statements use PROC SURVEYMEANS to perform the domain analysis, estimating means and other statistics for the overall population and also for the subpopulations (or domains) defined by market type. The DOMAIN statement specifies Type as the domain variable:

    title1 'Top Companies Profile Study';
    proc surveymeans data=Company total=800 mean sum;
       var Asset Sale Value Profit Employee;
       weight Weight;
       domain Type;
    run;
Output 70.2.1 shows that there are 66 observations in the sample. The sum of the sampling weights equals 799.8, which is close to the total number of companies in the study population.
                        Top Companies Profile Study

                         The SURVEYMEANS Procedure

                                Data Summary

                    Number of Observations            66
                    Sum of Weights                 799.8

                                 Statistics

                                  Std Error
    Variable           Mean        of Mean            Sum        Std Dev
    --------------------------------------------------------------------
    Asset       6523.488510     720.557075        5217486        1073829
    Sale        4215.995799     839.132506        3371953         847885
    Value       2145.935121     342.531720        1716319         359609
    Profit       188.788210      25.057876         150993          30144
    Employee      36.874869       7.787857          29493    7148.003298
    --------------------------------------------------------------------
The 'Statistics' table in Output 70.2.1 displays the estimates of the mean and total of each analysis variable for all 800 companies, while Output 70.2.2 shows the mean and total estimates for each company type.
                              Top Companies Profile Study

                               The SURVEYMEANS Procedure

                                 Domain Analysis: Type

                                               Std Error
    Type            Variable          Mean       of Mean           Sum       Std Dev
    --------------------------------------------------------------------------------
    Energy          Asset      7868.302932   1941.699163       1449341        785962
                    Sale       5419.679099   2416.214417        998305        673373
                    Value      2249.297177    520.295162        414321        213580
                    Profit      289.564658     52.512141         53338         25927
                    Employee     14.151194      3.974697   2606.650000   1481.777769
    Finance         Asset      7890.190264   1057.185336       1855773        704506
                    Sale        829.210502    115.762531        195030         74436
                    Value       565.068197     76.964547        132904         48156
                    Profit       63.716837     10.099341         14986   5801.108513
                    Employee      5.806293      0.811555   1365.640000    519.658410
    HiTech          Asset      5031.959781    732.436967        321542        183302
                    Sale       5464.292019    731.296997        349168        196013
                    Value      6707.828482   1194.160584        428630        249154
                    Profit      346.407042     42.299004         22135         12223
                    Employee     70.766980      8.683595   4522.010000   2524.778281
    Manufacturing   Asset      7403.004250   1454.921083        888361        492577
                    Sale       7207.638833   2112.444703        864917        501679
                    Value      2986.442750    799.121544        358373        196979
                    Profit      211.933583     39.993255         25432         13322
                    Employee     83.314333     31.089019   9997.720000   6294.309490
    Medical         Asset      5046.570609   1218.444638        140799        131942
                    Sale       3313.219713    758.216303         92439         85655
                    Value      2561.614695    530.802245         71469         64663
                    Profit      218.682796     44.051447   6101.250000   5509.560969
                    Employee     46.518996     11.135955   1297.880000   1213.651734
    Other           Asset      1850.250000    338.128984         58838         31375
                    Sale       1620.784906    168.686773         51541         24593
                    Value      1432.820755    297.869828         45564         24204
                    Profit      115.089937     27.970560   3659.860000   2018.201371
                    Employee     14.306604      2.313733    454.950000    216.327710
    Retail          Asset      2939.845750    393.692369        235188         94605
                    Sale       7395.453500   1746.187580        591636        263263
                    Value      2103.863125    529.756409        168309         78304
                    Profit      157.171875     31.734253         12574   5478.281027
                    Employee     93.624000     15.726743   7489.920000   3093.832061
    Transportation  Asset      4712.047359    888.954411        267644        163516
                    Sale       4030.233275   1015.555708        228917        142669
                    Value      1703.330282    313.841326         96749         58947
                    Profit      224.762324     56.168925         12767   8287.585418
                    Employee     30.946303      6.786270   1757.750000   1066.586615
    --------------------------------------------------------------------------------
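Each domain mean is a ratio of weighted totals computed over the observations that fall in the domain. The following worked check reproduces the Asset estimates for the 'Other' market type from the six 'Other' rows of the Company data set. The symbols w_i (sampling weight), y_i (Asset value), and D (the set of sampled companies in the domain) are notation assumed here for illustration.

```latex
\[
\hat{\bar{Y}}_D = \frac{\sum_{i \in D} w_i\, y_i}{\sum_{i \in D} w_i}
\]
\[
\sum_{i \in D} w_i = 9.6 + 5.5 + 3.7 + 3.4 + 1.6 + 8.0 = 31.8
\]
\[
\sum_{i \in D} w_i\, y_i = 9.6(2764.0) + 5.5(1459.0) + 3.7(872.7)
  + 3.4(801.5) + 1.6(227.1) + 8.0(2245.2) = 58837.95
\]
\[
\hat{\bar{Y}}_D = \frac{58837.95}{31.8} = 1850.25
\]
```

The weighted total 58837.95 rounds to the Sum of 58838 shown for Asset under 'Other', and the quotient matches the reported mean of 1850.250000.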
Suppose you are interested in the profit per employee and the sale per employee among the 800 top-performing companies in the data in the previous example. The following SAS statements illustrate how you can use PROC SURVEYMEANS to estimate these ratios:
    title1 'Ratio Analysis in Top Companies Profile Study';
    proc surveymeans data=Company total=800 ratio;
       var Profit Sale Employee;
       weight Weight;
       ratio Profit Sale / Employee;
    run;
The RATIO statement requests the ratio of the profit and the sale to the number of employees.
Output 70.3.1 shows the estimated ratios and their standard errors. Because the profit and the sale figures are in millions of dollars, and the employee numbers are in thousands, each ratio is expressed in thousands of dollars per employee. The profit per employee is estimated as $5,120 with a standard error of $1,059, and the sale per employee is estimated as $114,332 with a standard error of $20,503.

      Ratio Analysis in Top Companies Profile Study

                The SURVEYMEANS Procedure

                      Ratio Analysis

    Numerator  Denominator         Ratio       Std Err
    --------------------------------------------------
    Sale       Employee       114.332497     20.502742
    Profit     Employee         5.119698      1.058939
    --------------------------------------------------
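The estimated ratios and their units can be checked against the means reported in Output 70.2.1. This is a worked sketch. The symbols w_i, y_i (numerator variable), and x_i (denominator variable) are notation assumed here, and the ratio of weighted totals equals the ratio of weighted means because both means share the same sum of weights.

```latex
\[
\hat{R} = \frac{\sum_i w_i\, y_i}{\sum_i w_i\, x_i} = \frac{\hat{\bar{Y}}}{\hat{\bar{X}}}
\]
\[
\hat{R}_{\text{Profit}} = \frac{188.788210}{36.874869} \approx 5.1197, \qquad
\hat{R}_{\text{Sale}} = \frac{4215.995799}{36.874869} \approx 114.3325
\]
\[
\frac{10^{6}\ \text{dollars}}{10^{3}\ \text{employees}} = 10^{3}\ \text{dollars per employee}
\quad\Longrightarrow\quad 5.1197 \times \$1{,}000 \approx \$5{,}120
\]
```

The same conversion applied to 114.332497 gives about $114,332 of sales per employee.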
As described in the section 'Missing Values' on page 4333, the SURVEYMEANS procedure excludes an observation from the analysis if it has a missing value for the analysis variable or a nonpositive value for the WEIGHT variable.
However, if there is evidence indicating that the nonrespondents are different from the respondents for your study, you can use the DOMAIN statement to compute descriptive statistics among respondents from your survey data without imputation for nonrespondents. Note that although the variance estimation for respondents takes into account the assumption that the study population consists of distinct groups of respondents and nonrespondents, the degrees of freedom are not adjusted for the nonrespondents because they are deleted from the computation. As a result, there are fewer degrees of freedom and wider confidence limits than there would be if the nonrespondents were counted in the degrees of freedom. When the sample size and the number of respondents are large, the difference may be ignored.
Consider the ice cream example in the section 'Stratified Sampling' on page 4318. Suppose that some of the students failed to provide the amounts spent on ice cream, as shown in the following data set IceCream:
    data IceCream;
       input Grade Spending @@;
       datalines;
    7  7   7  7   8  .   9 10   7  .   7 10   7  3   8 20   8 19   7  2
    7  .   9 15   8 16   7  6   7  6   7  6   9 15   8 17   8 14   9  .
    9  8   9  7   7  3   7 12   7  4   9 14   8 18   9  9   7  2   7  1
    7  4   7 11   9  8   8  .   8 13   7  .   9  .   9 11   7  2   7  9
    ;

    data StudentTotals;
       input Grade _total_;
       datalines;
    7 1824
    8 1025
    9 1151
    ;
Considering the possibility that those students who did not respond spend differently than those students who did respond, you can create an indicator variable to identify the respondents and nonrespondents with the following SAS DATA step statements:

    data IceCream;
       set IceCream;
       if Spending=. then Indicator='Nonrespondent';
       else do;
          Indicator='Respondent';
          if (Spending < 10) then Group='less';
             else Group='more';
       end;
       if Grade=7 then Prob=20/1824;
       if Grade=8 then Prob=9/1025;
       if Grade=9 then Prob=11/1151;
       Weight=1/Prob;

The variable Indicator identifies a student in the data set as either a respondent or a nonrespondent. For respondents, the variable Group specifies whether a student spends at least $10 weekly for ice cream. Group is left blank for nonrespondents.
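Before running the domain analysis, it can help to see how the respondents and nonrespondents are distributed across the strata. The following is a minimal sketch, not part of the original example, that uses PROC FREQ to produce unweighted counts by grade. It assumes that the DATA step that creates Indicator has already been run.

```sas
/* Sketch: unweighted counts of respondents and nonrespondents by grade. */
/* NOROW, NOCOL, and NOPERCENT suppress the percentages so that the      */
/* table shows cell frequencies only.                                    */
proc freq data=IceCream;
   tables Grade*Indicator / norow nocol nopercent;
run;
```

For the data shown above, the table should show 17 respondents and 3 nonrespondents in grade 7, 7 and 2 in grade 8, and 9 and 2 in grade 9.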
The following SAS statements produce the desired analysis:
    title1 'Analysis of Ice Cream Spending';
    proc surveymeans data=IceCream total=StudentTotals mean sum;
       strata Grade / list;
       var Spending Group;
       weight Weight;
       domain Indicator;
    run;
Output 70.4.1 shows the mean and total estimates excluding those students who failed to provide the spending amount on ice cream.

                         Analysis of Ice Cream Spending

                           The SURVEYMEANS Procedure

                                  Data Summary

                     Number of Strata                   3
                     Number of Observations            40
                     Sum of Weights                  4000

                                   Statistics

                                   Std Error
    Variable  Level          Mean       of Mean           Sum       Std Dev
    -----------------------------------------------------------------------
    Spending             9.770542      0.541381         32139   1780.792065
    Group     less       0.515404      0.067092   1695.345455    220.690305
              more       0.484596      0.067092   1594.004040    220.690305
    -----------------------------------------------------------------------

Output 70.4.2 shows the mean and total estimates treating respondents as a domain in the student population. Compared to the estimates in Output 70.4.1, the point estimates are the same, but the variance estimates for Spending are higher: the standard error of the mean is 0.652347 in the domain analysis, compared with 0.541381 in Output 70.4.1.

                         Analysis of Ice Cream Spending

                           The SURVEYMEANS Procedure

                           Domain Analysis: Indicator

                                                     Std Error
    Indicator      Variable  Level          Mean       of Mean           Sum
    ------------------------------------------------------------------------
    Nonrespondent  Spending                    .             .             .
                   Group     less              .             .             .
                             more              .             .             .
    Respondent     Spending             9.770542      0.652347         32139
                   Group     less       0.515404      0.067092   1695.345455
                             more       0.484596      0.067092   1594.004040
    ------------------------------------------------------------------------

                           Domain Analysis: Indicator

    Indicator      Variable  Level       Std Dev
    --------------------------------------------
    Nonrespondent  Spending                    .
                   Group     less              .
                             more              .
    Respondent     Spending          3515.126876
                   Group     less     220.690305
                             more     220.690305
    --------------------------------------------