In this example, an Internet service provider conducts a customer satisfaction survey. The survey population consists of the company's current subscribers. The company plans to select a sample of customers from this population, interview the selected customers, and then make inferences about the entire survey population from the sample data.
The SAS data set Customers contains the sampling frame, which is the list of units in the survey population. The sample of customers will be selected from this sampling frame. The data set Customers is constructed from the company's customer database. It contains one observation for each customer, with a total of 13,471 observations. Figure 72.1 displays the first 10 observations of the data set Customers.
Internet Service Provider Customers
(First 10 Observations)

Obs    CustomerID     State    Type    Usage

  1    416 87 4322     AL      New       839
  2    288 13 9763     GA      Old       224
  3    339 00 8654     GA      Old      2451
  4    118 98 0542     GA      New       349
  5    421 67 0342     FL      New       562
  6    623 18 9201     SC      New        68
  7    324 55 0324     FL      Old       137
  8    832 90 2397     AL      Old      1563
  9    586 45 0178     GA      New       615
 10    801 24 5317     SC      New       728
In the SAS data set Customers, the variable CustomerID uniquely identifies each customer. The variable State contains the state of the customer's address. The company has customers in the following four states: Georgia (GA), Alabama (AL), Florida (FL), and South Carolina (SC). The variable Type equals 'Old' if the customer has subscribed to the service for more than one year; otherwise, the variable Type equals 'New'. The variable Usage contains the customer's average monthly service usage, in minutes.
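To experiment with the variable layout described above without access to the full customer database, a small stand-in data set can be built from the ten observations shown in Figure 72.1. This is only a sketch: the data set name CustomersDemo and the column positions are assumptions made for illustration, and ten rows are far too few to reproduce any of the samples in this section.

```sas
/* Hypothetical 10-row stand-in for the Customers sampling frame.  */
/* CustomerID contains blanks, so it is read with column input.    */
data CustomersDemo;
   input CustomerID $ 1-11 State $ 13-14 Type $ 16-18 Usage;
   datalines;
416 87 4322 AL New 839
288 13 9763 GA Old 224
339 00 8654 GA Old 2451
118 98 0542 GA New 349
421 67 0342 FL New 562
623 18 9201 SC New 68
324 55 0324 FL Old 137
832 90 2397 AL Old 1563
586 45 0178 GA New 615
801 24 5317 SC New 728
;
run;

proc print data=CustomersDemo;
run;
```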
The following sections illustrate the use of PROC SURVEYSELECT for probability sampling with three different designs for the customer satisfaction survey. All three designs are one stage, with customers as the sampling units. The first design is simple random sampling without stratification. In the second design, customers are stratified by state and type, and the sample is selected by simple random sampling within strata. In the third design, customers are sorted within strata by usage, and the sample is selected by systematic random sampling within strata.
The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using simple random sampling:
   title 'Customer Satisfaction Survey';
   proc surveyselect data=Customers
         method=srs n=100 out=SampleSRS;
   run;
The PROC SURVEYSELECT statement invokes the procedure. The DATA= option names the SAS data set Customers as the input data set from which to select the sample. The METHOD=SRS option specifies simple random sampling as the sample selection method. In simple random sampling, each unit has an equal probability of selection, and sampling is without replacement. Without-replacement sampling means that a unit cannot be selected more than once. The N=100 option specifies a sample size of 100 customers. The OUT= option stores the sample in the SAS data set named SampleSRS.
Figure 72.2 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A sample of 100 customers is selected from the data set Customers by simple random sampling. With simple random sampling and no stratification in the sample design, the selection probability is the same for all units in the sample. In this sample, the selection probability for each customer equals 0.007423, which is the sample size (100) divided by the population size (13,471). The sampling weight equals 134.71 for each customer in the sample, where the weight is the inverse of the selection probability. If you specify the STATS option, PROC SURVEYSELECT includes the selection probabilities and sampling weights in the output data set. (This information is always included in the output data set for more complex designs.)
Customer Satisfaction Survey

The SURVEYSELECT Procedure

Selection Method    Simple Random Sampling

Input Data Set               CUSTOMERS
Random Number Seed               39647
Sample Size                        100
Selection Probability         0.007423
Sampling Weight                 134.71
Output Data Set              SAMPLESRS
The random number seed is 39647. PROC SURVEYSELECT uses this number as the initial seed for random number generation. Since the SEED= option is not specified in the PROC SURVEYSELECT statement, the seed value is obtained using the time of day from the computer's clock. You can specify SEED=39647 to reproduce this sample.
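As a minimal sketch of the two options discussed above, the following statements rerun the simple random sample with SEED=39647 and add the STATS option so that the selection probabilities and sampling weights appear in the output data set. The output data set name SampleSRSStats is a hypothetical name chosen for illustration, and the run can be expected to reproduce the earlier sample only if the Customers data set is unchanged and in the same order.

```sas
/* Rerun the SRS design with the seed reported in the procedure output. */
/* STATS requests the selection probability and sampling weight for     */
/* each selected unit in the output data set.                           */
proc surveyselect data=Customers
      method=srs n=100
      seed=39647 stats
      out=SampleSRSStats;   /* hypothetical output data set name */
run;

/* Each observation should show probability 100/13471 and weight 134.71. */
proc print data=SampleSRSStats(obs=5);
run;
```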
The sample of 100 customers is stored in the SAS data set SampleSRS. PROC SURVEYSELECT does not display this output data set. The following PROC PRINT statements display the first 20 observations of SampleSRS:
   title1 'Customer Satisfaction Survey';
   title2 'Sample of 100 Customers, Selected by SRS';
   title3 '(First 20 Observations)';
   proc print data=SampleSRS(obs=20);
   run;
Figure 72.3 displays the first 20 observations of the output data set SampleSRS, which contains the sample of customers. This data set includes all the variables from the DATA= input data set Customers. If you do not want to include all variables, you can use the ID statement to specify which variables to copy from the input data set to the output (sample) data set.
Customer Satisfaction Survey
Sample of 100 Customers, Selected by SRS
(First 20 Observations)

Obs    CustomerID     State    Type    Usage

  1    036 89 0212     FL      New        74
  2    045 53 3676     AL      New       411
  3    050 99 2380     GA      Old       167
  4    066 93 5368     AL      Old      1232
  5    082 99 9234     FL      New        90
  6    097 17 4766     FL      Old       131
  7    110 73 1051     FL      Old       102
  8    111 91 6424     GA      New       247
  9    127 39 4594     GA      New        61
 10    162 50 3866     FL      New       100
 11    162 56 1370     FL      New       224
 12    167 21 6808     SC      New        60
 13    168 02 5189     AL      Old      7553
 14    174 07 8711     FL      New       284
 15    187 03 7510     SC      New        21
 16    190 78 5019     GA      New       185
 17    200 75 0054     GA      New       224
 18    201 14 1003     GA      Old      3437
 19    207 15 7701     GA      Old        24
 20    211 14 1373     AL      Old        88
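The ID statement mentioned above can be sketched as follows. Here only CustomerID and State are copied to the sample, so Type and Usage are left out of the output data set. The output data set name SampleSRSID and the seed value are hypothetical choices for illustration.

```sas
/* Copy only selected variables from the sampling frame to the sample. */
proc surveyselect data=Customers
      method=srs n=100
      seed=39647
      out=SampleSRSID;      /* hypothetical output data set name */
   id CustomerID State;     /* variables to keep in the sample   */
run;

proc print data=SampleSRSID(obs=5);
run;
```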
In this section, stratification is added to the sample design for the customer satisfaction survey. The sampling frame, or list of all customers, is stratified by State and Type. This divides the sampling frame into nonoverlapping subgroups formed from the values of the State and Type variables. Samples are then selected independently within the strata.
PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the Customers data set by the stratification variables State and Type:
   proc sort data=Customers;
      by State Type;
   run;
The following PROC FREQ statements display the crosstabulation of the Customers data set by State and Type:
   proc freq data=Customers;
      tables State*Type;
   run;
Figure 72.4 presents the table of State by Type for the 13,471 customers. There are four states and two levels of Type, forming a total of eight strata.
The FREQ Procedure

Table of State by Type

State     Type

Frequency|
Percent  |
Row Pct  |
Col Pct  |New     |Old     |  Total
---------+--------+--------+
AL       |   1238 |    706 |   1944
         |   9.19 |   5.24 |  14.43
         |  63.68 |  36.32 |
         |  14.43 |  14.43 |
---------+--------+--------+
FL       |   2170 |   1370 |   3540
         |  16.11 |  10.17 |  26.28
         |  61.30 |  38.70 |
         |  25.29 |  28.01 |
---------+--------+--------+
GA       |   3488 |   1940 |   5428
         |  25.89 |  14.40 |  40.29
         |  64.26 |  35.74 |
         |  40.65 |  39.66 |
---------+--------+--------+
SC       |   1684 |    875 |   2559
         |  12.50 |   6.50 |  19.00
         |  65.81 |  34.19 |
         |  19.63 |  17.89 |
---------+--------+--------+
Total        8580     4891    13471
            63.69    36.31   100.00
The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set according to the stratified sample design:
   title1 'Customer Satisfaction Survey';
   title2 'Stratified Sampling';
   proc surveyselect data=Customers
         method=srs n=15
         seed=1953 out=SampleStrata;
      strata State Type;
   run;
The STRATA statement names the stratification variables State and Type. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=15 option specifies a sample size of 15 customers for each stratum. If you want to specify different sample sizes for different strata, you can use the N= SAS-data-set option to name a secondary data set that contains the stratum sample sizes. The SEED=1953 option specifies '1953' as the initial seed for random number generation.
Figure 72.5 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A total of 120 customers are selected.
Customer Satisfaction Survey
Stratified Sampling

The SURVEYSELECT Procedure

Selection Method    Simple Random Sampling
Strata Variables    State
                    Type

Input Data Set              CUSTOMERS
Random Number Seed               1953
Stratum Sample Size                15
Number of Strata                    8
Total Sample Size                 120
Output Data Set          SAMPLESTRATA
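The N= SAS-data-set form mentioned earlier can be sketched as follows. The stratum sample sizes shown are hypothetical values chosen only for illustration, as are the data set names StrataSizes and SampleStrataVar. The sketch assumes that the secondary data set carries the sample sizes in a variable named _NSIZE_, contains the same STRATA variables as the input data set, and lists the strata in the same sorted order; check these requirements against the documentation for your SAS release.

```sas
/* Hypothetical per-stratum sample sizes, one row per State*Type stratum, */
/* listed in the same order as the sorted Customers data set.             */
data StrataSizes;
   input State $ Type $ _NSIZE_;
   datalines;
AL New 20
AL Old 10
FL New 25
FL Old 15
GA New 30
GA Old 20
SC New 20
SC Old 10
;
run;

/* Customers must already be sorted by State and Type. */
proc surveyselect data=Customers
      method=srs n=StrataSizes
      seed=1953 out=SampleStrataVar;
   strata State Type;
run;
```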
The following PROC PRINT statements display the first 30 observations of the output data set SampleStrata:
   title1 'Customer Satisfaction Survey';
   title2 'Sample Selected by Stratified Design';
   title3 '(First 30 Observations)';
   proc print data=SampleStrata(obs=30);
   run;
Figure 72.6 displays the first 30 observations of the output data set SampleStrata, which contains the sample of 120 customers, 15 customers from each of the eight strata. The variable SelectionProb contains the selection probability for each customer in the sample. Since customers are selected with equal probability within strata in this design, the selection probability equals the stratum sample size (15) divided by the stratum population size. The selection probabilities differ from stratum to stratum since the population sizes differ. The selection probability for each customer in the first stratum (State='AL' and Type='New') is 15/1238 = 0.012116, and the selection probability for customers in the second stratum (State='AL' and Type='Old') is 15/706 = 0.021246. The variable SamplingWeight contains the sampling weights, which are computed as inverse selection probabilities.
Customer Satisfaction Survey
Sample Selected by Stratified Design
(First 30 Observations)

                                              Selection    Sampling
Obs    State    Type    CustomerID     Usage       Prob      Weight

  1     AL      New     002 26 1498     1189    0.012116     82.5333
  2     AL      New     070 86 8494      106    0.012116     82.5333
  3     AL      New     121 28 6895       76    0.012116     82.5333
  4     AL      New     131 79 7630      265    0.012116     82.5333
  5     AL      New     211 88 4991      108    0.012116     82.5333
  6     AL      New     222 81 3742       83    0.012116     82.5333
  7     AL      New     238 46 3776      278    0.012116     82.5333
  8     AL      New     370 01 0671      123    0.012116     82.5333
  9     AL      New     407 07 5479     1580    0.012116     82.5333
 10     AL      New     550 90 3188      177    0.012116     82.5333
 11     AL      New     582 40 9610       46    0.012116     82.5333
 12     AL      New     672 59 9114       66    0.012116     82.5333
 13     AL      New     848 60 3119       28    0.012116     82.5333
 14     AL      New     886 83 4909      170    0.012116     82.5333
 15     AL      New     993 31 7677       64    0.012116     82.5333
 16     AL      Old     124 60 0495       80    0.021246     47.0667
 17     AL      Old     128 54 9590       56    0.021246     47.0667
 18     AL      Old     204 05 4017       17    0.021246     47.0667
 19     AL      Old     210 68 8704     4363    0.021246     47.0667
 20     AL      Old     239 75 4343      430    0.021246     47.0667
 21     AL      Old     317 70 6496      452    0.021246     47.0667
 22     AL      Old     365 37 1340       21    0.021246     47.0667
 23     AL      Old     399 78 7900      108    0.021246     47.0667
 24     AL      Old     404 90 6273      824    0.021246     47.0667
 25     AL      Old     421 04 8548     1332    0.021246     47.0667
 26     AL      Old     604 48 0587       16    0.021246     47.0667
 27     AL      Old     774 04 0162      318    0.021246     47.0667
 28     AL      Old     849 66 4156       79    0.021246     47.0667
 29     AL      Old     937 69 9106      182    0.021246     47.0667
 30     AL      Old     985 09 8691       24    0.021246     47.0667
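The relationship between stratum size, selection probability, and sampling weight can be checked with a short sketch such as the following. The PROC SQL step computes the expected values directly from the sampling frame, assuming the fixed stratum sample size of 15 used in this design; the column names N, ExpectedProb, and ExpectedWeight are hypothetical labels. The PROC MEANS step then sums the sampling weights in the sample by stratum; because each weight is the stratum population size divided by 15, the 15 weights in a stratum should add up to that stratum's population size.

```sas
/* Expected probability and weight per stratum, from the frame itself. */
proc sql;
   select State, Type,
          count(*)      as N,
          15 / count(*) as ExpectedProb   format=8.6,
          count(*) / 15 as ExpectedWeight format=8.4
   from Customers
   group by State, Type;
quit;

/* Sum of sampling weights in the sample, by stratum. */
proc means data=SampleStrata n sum;
   class State Type;
   var SamplingWeight;
run;
```

For the first stratum, for example, the first step should report N=1238, a probability of 0.012116, and a weight of 82.5333, which match the values shown in Figure 72.6.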
The next sample design for the customer satisfaction survey uses stratification by State. The sampling frame is also sorted by Type and Usage before sample selection, to provide additional control over the distribution of the sample. Customers are then selected by systematic random sampling within strata. Selection by systematic sampling, together with control sorting, spreads the sample uniformly over the range of type and usage values within each stratum or state. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using this design:
   title1 'Customer Satisfaction Survey';
   title2 'Stratified Sampling with Control Sorting';
   proc surveyselect data=Customers
         method=sys rate=.02
         seed=1234 out=SampleControl;
      strata State;
      control Type Usage;
   run;
The STRATA statement names the stratification variable State. The CONTROL statement names the control variables Type and Usage. In the PROC SURVEYSELECT statement, the METHOD=SYS option requests systematic random sampling. The RATE=.02 option specifies a sampling rate of 2% for each stratum. The SEED=1234 option specifies the initial seed for random number generation.
Figure 72.7 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A sample of 271 customers is selected, using systematic random sampling within strata determined by State. The sampling frame Customers is sorted by control variables Type and Usage within strata. The type of sorting is serpentine, which is used by default since SORT=NEST is not specified. See the section 'Sorting by CONTROL Variables' on page 4445 for a description of serpentine sorting. The sorted data set replaces the input data set. (To store the sorted input data in another data set, leaving the input data set unsorted, use the OUTSORT= option.) The output data set SampleControl contains the sample of customers.
Customer Satisfaction Survey
Stratified Sampling with Control Sorting

The SURVEYSELECT Procedure

Selection Method     Systematic Random Sampling
Strata Variable      State
Control Variables    Type
                     Usage
Control Sorting      Serpentine

Input Data Set               CUSTOMERS
Random Number Seed                1234
Stratum Sampling Rate             0.02
Number of Strata                     4
Total Sample Size                  271
Output Data Set          SAMPLECONTROL
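The SORT=NEST and OUTSORT= options mentioned above can be sketched as follows. This variant requests nested rather than serpentine control sorting and writes the sorted frame to a separate data set, so that Customers itself is not replaced by the sorted version. The data set names CustomersSorted and SampleControlNest are hypothetical. Because the sort order differs from the default, the selected customers can be expected to differ from those in SampleControl even with the same seed. The PROC FREQ step is one simple way to inspect how the sample is spread over State and Type.

```sas
/* Same design as above, but with nested control sorting and the */
/* sorted frame stored separately from the input data set.       */
proc surveyselect data=Customers
      method=sys rate=.02
      seed=1234
      sort=nest
      outsort=CustomersSorted    /* hypothetical: sorted copy of the frame */
      out=SampleControlNest;     /* hypothetical: sample data set          */
   strata State;
   control Type Usage;
run;

/* Inspect the distribution of the sample across State and Type. */
proc freq data=SampleControlNest;
   tables State*Type / norow nocol nopercent;
run;
```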