This section demonstrates how you can use the SURVEYMEANS procedure to produce descriptive statistics from sample survey data. For a complete description of PROC SURVEYMEANS, see the 'Syntax' section on page 4322. The 'Examples' section on page 4350 provides more complex examples that illustrate applications of PROC SURVEYMEANS.
This example illustrates how you can use PROC SURVEYMEANS to estimate population means and proportions from sample survey data. The study population is a junior high school with a total of 4,000 students in grades 7, 8, and 9. Researchers want to know how much these students spend weekly for ice cream, on average, and what percentage of students spend at least $10 weekly for ice cream.
To answer these questions, 40 students were selected from the entire student population using simple random sampling (SRS). Selection by simple random sampling means that all students have an equal chance of being selected, and no student can be selected more than once. Each student selected for the sample was asked how much he or she spends on ice cream per week, on average. The SAS data set named IceCream contains the responses of the 40 students:
    data IceCream;
       input Grade Spending @@;
       if (Spending < 10) then Group='less';
          else Group='more';
       datalines;
    7 7 7 7 8 12 9 10 7 1 7 10 7 3 8 20 8 19 7 2
    7 2 9 15 8 16 7 6 7 6 7 6 9 15 8 17 8 14 9 8
    9 8 9 7 7 3 7 12 7 4 9 14 8 18 9 9 7 2 7 1
    7 4 7 11 9 8 8 10 8 13 7 2 9 6 9 11 7 2 7 9
    ;
The variable Grade contains a student's grade. The variable Spending contains a student's response on how much he or she spends per week on ice cream, in dollars. The variable Group is created to indicate whether a student spends at least $10 weekly on ice cream: Group='more' if a student spends at least $10, or Group='less' if a student spends less than $10.
You can use PROC SURVEYMEANS to produce estimates for the entire student population, based on this random sample of 40 students:
    title1 'Analysis of Ice Cream Spending';
    title2 'Simple Random Sample Design';
    proc surveymeans data=IceCream total=4000;
       var Spending Group;
    run;
The PROC SURVEYMEANS statement invokes the procedure. The TOTAL=4000 option specifies the total number of students in the study population, or school. The procedure uses this total to adjust variance estimates for the effects of sampling from a finite population. The VAR statement names the variables to analyze, Spending and Group.
Figure 70.1 displays the results from this analysis. A total of 40 observations are used in the analysis. The 'Class Level Information' table lists the two levels of the variable Group. This variable is a character variable, and so PROC SURVEYMEANS provides a categorical analysis for it, estimating the relative frequency or proportion for each level. If you want a categorical analysis for a numeric variable, you can name that variable in the CLASS statement.
Analysis of Ice Cream Spending
Simple Random Sample Design

The SURVEYMEANS Procedure

Data Summary

Number of Observations: 40

Class Level Information

Class Variable | Levels | Values |
---|---|---|
Group | 2 | less more |

Statistics

Variable | Level | N | Mean | Std Error of Mean | Lower 95% CL for Mean | Upper 95% CL for Mean |
---|---|---|---|---|---|---|
Spending | | 40 | 8.750000 | 0.845139 | 7.040545 | 10.459455 |
Group | less | 23 | 0.575000 | 0.078761 | 0.415690 | 0.734310 |
Group | more | 17 | 0.425000 | 0.078761 | 0.265690 | 0.584310 |
The 'Statistics' table displays the estimates for each analysis variable. By default, PROC SURVEYMEANS displays the number of observations, the estimate of the mean, its standard error, and 95% confidence limits for the mean. You can obtain other statistics by specifying the corresponding statistic-keywords in the PROC SURVEYMEANS statement.
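As a minimal sketch of requesting other statistics, the following step lists statistic-keywords in the PROC SURVEYMEANS statement. The particular selection (NOBS, MEAN, STDERR, CV, CLM, DF) is only an illustration; when you list keywords, the procedure displays the statistics you name rather than the default set.

```sas
/* Illustrative keyword selection; adjust to the statistics you need */
proc surveymeans data=IceCream total=4000
                 nobs mean stderr cv clm df;
   var Spending;   /* numeric analysis variable from the IceCream data set */
run;
```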
The estimate of the average weekly ice cream expense is $8.75 for students in this school. The standard error of this estimate is $0.85, and the 95% confidence interval for the average weekly ice cream expense is from $7.04 to $10.46.
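The standard error in Figure 70.1 can be reproduced by hand. The following is a sketch that assumes the usual simple random sampling variance estimator with a finite population correction; the sample variance $s^2 \approx 28.859$ is computed here from the 40 Spending values in the IceCream data set and does not appear in the procedure output.

$$
\widehat{\mathrm{SE}}(\bar{y})
  = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}
  = \sqrt{\left(1 - \frac{40}{4000}\right)\frac{28.859}{40}}
  \approx 0.845
$$

Without the correction factor $1 - 40/4000 = 0.99$, the standard error would be about 0.849, so the adjustment is small here because only 1% of the population is sampled.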
The analysis variable Group is a character variable, and so PROC SURVEYMEANS analyzes it as categorical, estimating the relative frequency or proportion for each level or category. These estimates are displayed in the Mean column of the 'Statistics' table. It is estimated that 57.5% of all students spend less than $10 weekly on ice cream, while 42.5% of the students spend at least $10 weekly. The standard error of each estimate is 7.9%.
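A numeric variable can be analyzed the same way by naming it in the CLASS statement. As a sketch, the following step treats the numeric variable Grade as categorical, so that the procedure estimates the proportion of students in each grade instead of the mean of the grade numbers.

```sas
/* Treat the numeric variable Grade as categorical */
proc surveymeans data=IceCream total=4000;
   class Grade;   /* request a categorical analysis for Grade */
   var Grade;     /* Grade must also be named as an analysis variable */
run;
```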
Suppose that the sample of students described in the previous section was actually selected using stratified random sampling. In stratified sampling, the study population is divided into nonoverlapping strata, and samples are selected from each stratum independently.
The list of students in this junior high school was stratified by grade, yielding three strata: grades 7, 8, and 9. A simple random sample of students was selected from each grade. Table 70.1 shows the total number of students in each grade.
Grade | Number of Students |
---|---|
7 | 1,824 |
8 | 1,025 |
9 | 1,151 |
Total | 4,000 |
To analyze this stratified sample, you need to provide the population totals for each stratum to PROC SURVEYMEANS. The SAS data set named StudentTotals contains the information from Table 70.1:
    data StudentTotals;
       input Grade _total_;
       datalines;
    7 1824
    8 1025
    9 1151
    ;
The variable Grade is the stratum identification variable, and the variable _TOTAL_ contains the total number of students for each stratum. PROC SURVEYMEANS requires you to use the variable name _TOTAL_ for the stratum population totals.
The procedure uses the stratum population totals to adjust variance estimates for the effects of sampling from a finite population. If you do not provide population totals or sampling rates, then the procedure assumes that the proportion of the population in the sample is very small, and the computation does not involve a finite population correction.
In a stratified sample design, when the sampling rates in the strata are unequal, you need to use sampling weights to reflect this information in order to produce an unbiased mean estimator. In this example, the appropriate sampling weights are the reciprocals of the probabilities of selection. You can use the following DATA step to create the sampling weights:
    data IceCream;
       set IceCream;
       /* selection probability = stratum sample size / stratum population total */
       if Grade=7 then Prob=20/1824;
       else if Grade=8 then Prob=9/1025;
       else if Grade=9 then Prob=11/1151;
       /* sampling weight = reciprocal of the selection probability */
       Weight=1/Prob;
    run;
If you use PROC SURVEYSELECT to select your sample, PROC SURVEYSELECT creates these sampling weights for you.
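As a rough sketch of that approach, suppose a sampling frame named StudentList holds one observation per student with the variable Grade. The data set names StudentList and IceCreamSample and the seed value are assumptions made for illustration and are not part of this example's data. The STATS option is included so that the output data set contains the variables SelectionProb and SamplingWeight.

```sas
/* Hypothetical sampling frame: one row per student, with variable Grade */
proc sort data=StudentList;
   by Grade;                 /* the STRATA variable must be sorted */
run;

proc surveyselect data=StudentList
                  method=srs       /* simple random sampling within strata */
                  n=(20 9 11)      /* sample sizes for grades 7, 8, and 9  */
                  seed=12345       /* arbitrary seed for reproducibility   */
                  stats            /* include SelectionProb, SamplingWeight */
                  out=IceCreamSample;
   strata Grade;
run;
```

In an analysis of such a sample, the WEIGHT statement would name SamplingWeight instead of a hand-computed weight variable.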
The following SAS statements perform the stratified analysis of the survey data:
    title1 'Analysis of Ice Cream Spending';
    title2 'Stratified Simple Random Sample Design';
    proc surveymeans data=IceCream total=StudentTotals;
       stratum Grade / list;
       var Spending Group;
       weight Weight;
    run;
The PROC SURVEYMEANS statement invokes the procedure. The DATA= option names the SAS data set IceCream as the input data set to be analyzed. The TOTAL= option names the data set StudentTotals as the input data set containing the stratum population totals. Comparing this to the analysis in the 'Simple Random Sampling' section on page 4315, notice that the TOTAL=StudentTotals option is used here instead of the TOTAL=4000 option. In this stratified sample design, the population totals are different for different strata, and so you need to provide them to PROC SURVEYMEANS in a SAS data set.
The STRATA statement identifies the stratification variable Grade. The LIST option in the STRATA statement requests that the procedure display stratum information. The WEIGHT statement tells the procedure that the variable Weight contains the sampling weights.
Figure 70.2 displays information on the input data set. There are three strata in the design, and 40 observations in the sample. The categorical variable Group has two levels, 'less' and 'more'.
Analysis of Ice Cream Spending
Stratified Simple Random Sample Design

The SURVEYMEANS Procedure

Data Summary

Number of Strata: 3
Number of Observations: 40
Sum of Weights: 4000

Class Level Information

Class Variable | Levels | Values |
---|---|---|
Group | 2 | less more |
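The Sum of Weights value in Figure 70.2 can be checked by hand. Each student's weight is the reciprocal of the selection probability for that student's grade, so, writing $w_g$ for the weight in grade $g$ (notation introduced here for illustration):

$$
w_7 = \frac{1824}{20} = 91.2, \qquad
w_8 = \frac{1025}{9} \approx 113.89, \qquad
w_9 = \frac{1151}{11} \approx 104.64
$$

$$
20\,w_7 + 9\,w_8 + 11\,w_9 = 1824 + 1025 + 1151 = 4000
$$

The weights sum to the population size because each sampled student represents the students in the same grade who were not selected.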
Figure 70.3 displays information for each stratum. The table displays a Stratum Index and the values of the STRATA variable. The Stratum Index identifies each stratum by a sequentially assigned number. For each stratum, the table gives the population total (total number of students), the sampling rate, and the sample size. The stratum sampling rate is the ratio of the number of students in the sample to the number of students in the population for that stratum. The table also lists each analysis variable and the number of stratum observations for that variable. For categorical variables, the table lists each level and the number of sample observations in that level.
Analysis of Ice Cream Spending
Stratified Simple Random Sample Design

The SURVEYMEANS Procedure

Stratum Information

Stratum Index | Grade | Population Total | Sampling Rate | N Obs | Variable | Level | N |
---|---|---|---|---|---|---|---|
1 | 7 | 1824 | 1.10% | 20 | Spending | | 20 |
1 | 7 | 1824 | 1.10% | 20 | Group | less | 17 |
1 | 7 | 1824 | 1.10% | 20 | Group | more | 3 |
2 | 8 | 1025 | 0.88% | 9 | Spending | | 9 |
2 | 8 | 1025 | 0.88% | 9 | Group | less | 0 |
2 | 8 | 1025 | 0.88% | 9 | Group | more | 9 |
3 | 9 | 1151 | 0.96% | 11 | Spending | | 11 |
3 | 9 | 1151 | 0.96% | 11 | Group | less | 6 |
3 | 9 | 1151 | 0.96% | 11 | Group | more | 5 |
Figure 70.4 shows that the estimate of average weekly ice cream expense is $9.14 for students in this school, with a standard error of $0.53 and a 95% confidence interval from $8.06 to $10.22. An estimated 54.5% of all students spend less than $10 weekly on ice cream, and 45.5% spend at least $10, each with a standard error of 5.8%.
Analysis of Ice Cream Spending
Stratified Simple Random Sample Design

The SURVEYMEANS Procedure

Statistics

Variable | Level | N | Mean | Std Error of Mean | Lower 95% CL for Mean | Upper 95% CL for Mean |
---|---|---|---|---|---|---|
Spending | | 40 | 9.141298 | 0.531799 | 8.063771 | 10.218825 |
Group | less | 23 | 0.544555 | 0.058424 | 0.426177 | 0.662932 |
Group | more | 17 | 0.455445 | 0.058424 | 0.337068 | 0.573823 |
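The stratified mean in Figure 70.4 can be verified by hand as a population-weighted average of the stratum sample means. In the sketch below, the stratum means (5.000 for grade 7, 15.444 for grade 8, and 10.091 for grade 9) are computed from the IceCream data set and do not appear in the procedure output; $N_h$ and $\bar{y}_h$ denote the population total and sample mean of stratum $h$.

$$
\bar{y}_{st} = \sum_{h} \frac{N_h}{N}\,\bar{y}_h
  = \frac{1824(5.000) + 1025(15.444) + 1151(10.091)}{4000}
  \approx 9.141
$$

This estimate is higher than the unweighted sample mean of 8.75 because grade 7, whose students spend the least, is slightly oversampled relative to grades 8 and 9.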
PROC SURVEYMEANS uses the Output Delivery System (ODS) to create output data sets. This is a departure from older SAS procedures that provide OUTPUT statements for similar functionality. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'
For example, to save the 'Statistics' table shown in Figure 70.4 in the previous section in an output data set, you use the ODS OUTPUT statement as follows:
    title1 'Analysis of Ice Cream Spending';
    title2 'Stratified Simple Random Sample Design';
    proc surveymeans data=IceCream total=StudentTotals;
       stratum Grade / list;
       var Spending Group;
       weight Weight;
       ods output Statistics=MyStat;
    run;
The statement
ods output Statistics=MyStat;
requests that the 'Statistics' table that appears in Figure 70.4 be placed in a SAS data set named MyStat.
The PRINT procedure displays observations of the data set MyStat:
    proc print data=MyStat;
    run;
Figure 70.5 displays the data set MyStat.
Analysis of Ice Cream Spending
Stratified Simple Random Sample Design

Obs | VarName | VarLevel | N | Mean | StdErr | LowerCLMean | UpperCLMean |
---|---|---|---|---|---|---|---|
1 | Spending | | 40 | 9.141298 | 0.531799 | 8.063771 | 10.218825 |
2 | Group | less | 23 | 0.544555 | 0.058424 | 0.426177 | 0.662932 |
3 | Group | more | 17 | 0.455445 | 0.058424 | 0.337068 | 0.573823 |
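Once the estimates are in a data set, you can process them like any other SAS data. As a small sketch, the following DATA step adds a relative standard error to each row of MyStat, using the variable names shown in Figure 70.5. The output data set name MyStatRSE and the variable name RelStdErr are made up for this illustration.

```sas
/* Add the relative standard error (StdErr / Mean) to each estimate */
data MyStatRSE;                    /* hypothetical output data set name */
   set MyStat;
   RelStdErr = StdErr / Mean;      /* hypothetical variable name */
   format RelStdErr percent8.1;
run;

proc print data=MyStatRSE;
   var VarName VarLevel Mean StdErr RelStdErr;
run;
```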
The section 'ODS Table Names' on page 4349 gives the complete list of the tables produced by PROC SURVEYMEANS.
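Table names can also be discovered interactively. As a sketch, the ODS TRACE statement writes the name of each table that a procedure produces to the SAS log, which shows the names available to the ODS OUTPUT statement for a particular analysis.

```sas
ods trace on;    /* write a trace record for each output table to the log */
proc surveymeans data=IceCream total=StudentTotals;
   stratum Grade / list;
   var Spending Group;
   weight Weight;
run;
ods trace off;   /* stop writing trace records */
```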