Getting Started | SAS/STAT 9.1 User's Guide (Vol. 6)

In this example, an Internet service provider conducts a customer satisfaction survey. The survey population consists of the company's current subscribers. The company plans to select a sample of customers from this population, interview the selected customers, and then make inferences about the entire survey population from the sample data.

The SAS data set Customers contains the sampling frame, which is the list of units in the survey population. The sample of customers will be selected from this sampling frame. The data set Customers is constructed from the company's customer database. It contains one observation for each customer, with a total of 13,471 observations. Figure 72.1 displays the first 10 observations of the data set Customers.

  Internet Service Provider Customers
  (First 10 Observations)

  Obs    CustomerID     State    Type    Usage

    1    416 87 4322     AL      New       839
    2    288 13 9763     GA      Old       224
    3    339 00 8654     GA      Old      2451
    4    118 98 0542     GA      New       349
    5    421 67 0342     FL      New       562
    6    623 18 9201     SC      New        68
    7    324 55 0324     FL      Old       137
    8    832 90 2397     AL      Old      1563
    9    586 45 0178     GA      New       615
   10    801 24 5317     SC      New       728

Figure 72.1: Customers Data Set (First 10 Observations)

In the SAS data set Customers, the variable CustomerID uniquely identifies each customer. The variable State contains the state of the customer's address. The company has customers in the following four states: Georgia (GA), Alabama (AL), Florida (FL), and South Carolina (SC). The variable Type equals 'Old' if the customer has subscribed to the service for more than one year; otherwise, the variable Type equals 'New'. The variable Usage contains the customer's average monthly service usage, in minutes.

The following sections illustrate the use of PROC SURVEYSELECT for probability sampling with three different designs for the customer satisfaction survey. All three designs are one stage, with customers as the sampling units. The first design is simple random sampling without stratification. In the second design, customers are stratified by state and type, and the sample is selected by simple random sampling within strata. In the third design, customers are sorted within strata by usage, and the sample is selected by systematic random sampling within strata.

Simple Random Sampling

The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using simple random sampling:

  title 'Customer Satisfaction Survey';
  proc surveyselect data=Customers
        method=srs n=100
        out=SampleSRS;
  run;

The PROC SURVEYSELECT statement invokes the procedure. The DATA= option names the SAS data set Customers as the input data set from which to select the sample. The METHOD=SRS option specifies simple random sampling as the sample selection method. In simple random sampling, each unit has an equal probability of selection, and sampling is without replacement. Without-replacement sampling means that a unit cannot be selected more than once. The N=100 option specifies a sample size of 100 customers. The OUT= option stores the sample in the SAS data set named SampleSRS.

Figure 72.2 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A sample of 100 customers is selected from the data set Customers by simple random sampling. With simple random sampling and no stratification in the sample design, the selection probability is the same for all units in the sample. In this sample, the selection probability for each customer equals 0.007423, which is the sample size (100) divided by the population size (13,471). The sampling weight equals 134.71 for each customer in the sample, where the weight is the inverse of the selection probability. If you specify the STATS option, PROC SURVEYSELECT includes the selection probabilities and sampling weights in the output data set. (This information is always included in the output data set for more complex designs.)

  Customer Satisfaction Survey

  The SURVEYSELECT Procedure

  Selection Method    Simple Random Sampling

  Input Data Set            CUSTOMERS
  Random Number Seed            39647
  Sample Size                     100
  Selection Probability      0.007423
  Sampling Weight              134.71
  Output Data Set           SAMPLESRS

Figure 72.2: Sample Selection Summary
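The selection probability and sampling weight reported in Figure 72.2 can be checked by hand. The following DATA step is only an illustrative sketch and is not part of the original example; the variable names SampleSize and PopSize are chosen here for readability.

```sas
/* Illustrative check of the values in Figure 72.2.            */
/* SampleSize and PopSize are names assumed for this sketch.   */
data _null_;
   SampleSize     = 100;                   /* N=100 in PROC SURVEYSELECT  */
   PopSize        = 13471;                 /* observations in Customers   */
   SelectionProb  = SampleSize / PopSize;  /* equal probability under SRS */
   SamplingWeight = 1 / SelectionProb;     /* inverse of the probability  */
   put SelectionProb= 8.6 SamplingWeight= 8.2;
run;
```

The values written to the log should agree with the 0.007423 and 134.71 shown in the figure.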

The random number seed is 39647. PROC SURVEYSELECT uses this number as the initial seed for random number generation. Since the SEED= option is not specified in the PROC SURVEYSELECT statement, the seed value is obtained using the time of day from the computer's clock. You can specify SEED=39647 to reproduce this sample.
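As a sketch of how the sample could be reproduced, the following statements rerun the selection with the seed reported in the summary and also add the STATS option, so that the selection probabilities and sampling weights are written to the output data set. This combination is an illustration assembled from the options described in this section, not a listing from the original example.

```sas
/* Reproduce the simple random sample by fixing the seed,     */
/* and request selection statistics in the output data set.   */
title 'Customer Satisfaction Survey';
proc surveyselect data=Customers
      method=srs n=100
      seed=39647          /* seed taken from the selection summary    */
      stats               /* adds SelectionProb and SamplingWeight    */
      out=SampleSRS;
run;
```

Reproducing the same sample also assumes that the input data set Customers has the same observations in the same order as in the original run.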

The sample of 100 customers is stored in the SAS data set SampleSRS. PROC SURVEYSELECT does not display this output data set. The following PROC PRINT statements display the first 20 observations of SampleSRS:

  title1 'Customer Satisfaction Survey';
  title2 'Sample of 100 Customers, Selected by SRS';
  title3 '(First 20 Observations)';
  proc print data=SampleSRS(obs=20);
  run;

Figure 72.3 displays the first 20 observations of the output data set SampleSRS, which contains the sample of customers. This data set includes all the variables from the DATA= input data set Customers. If you do not want to include all variables, you can use the ID statement to specify which variables to copy from the input data set to the output (sample) data set.

  Customer Satisfaction Survey
  Sample of 100 Customers, Selected by SRS
  (First 20 Observations)

  Obs    CustomerID     State    Type    Usage

    1    036 89 0212     FL      New        74
    2    045 53 3676     AL      New       411
    3    050 99 2380     GA      Old       167
    4    066 93 5368     AL      Old      1232
    5    082 99 9234     FL      New        90
    6    097 17 4766     FL      Old       131
    7    110 73 1051     FL      Old       102
    8    111 91 6424     GA      New       247
    9    127 39 4594     GA      New        61
   10    162 50 3866     FL      New       100
   11    162 56 1370     FL      New       224
   12    167 21 6808     SC      New        60
   13    168 02 5189     AL      Old      7553
   14    174 07 8711     FL      New       284
   15    187 03 7510     SC      New        21
   16    190 78 5019     GA      New       185
   17    200 75 0054     GA      New       224
   18    201 14 1003     GA      Old      3437
   19    207 15 7701     GA      Old        24
   20    211 14 1373     AL      Old        88

Figure 72.3: Customer Sample (First 20 Observations)
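If only some variables are wanted in the sample data set, the ID statement mentioned above can restrict what is copied. The following is a minimal sketch; the output data set name SampleSRSID and the choice of CustomerID and State are assumptions made for illustration.

```sas
/* Copy only selected variables to the sample data set.       */
/* SampleSRSID is a hypothetical output name for this sketch. */
proc surveyselect data=Customers
      method=srs n=100
      out=SampleSRSID;
   id CustomerID State;   /* only these input variables are copied */
run;
```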

Stratified Sampling

In this section, stratification is added to the sample design for the customer satisfaction survey. The sampling frame, or list of all customers, is stratified by State and Type. This divides the sampling frame into nonoverlapping subgroups formed from the values of the State and Type variables. Samples are then selected independently within the strata.

PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the Customers data set by the stratification variables State and Type:

  proc sort data=Customers;
     by State Type;
  run;

The following PROC FREQ statements display the crosstabulation of the Customers data set by State and Type:

  proc freq data=Customers;
     tables State*Type;
  run;

Figure 72.4 presents the table of State by Type for the 13,471 customers. There are four states and two levels of Type, forming a total of eight strata.

  The FREQ Procedure

  Table of State by Type

  State     Type

  Frequency|
  Percent  |
  Row Pct  |
  Col Pct  |New     |Old     |  Total
  ---------+--------+--------+
  AL       |   1238 |    706 |   1944
           |   9.19 |   5.24 |  14.43
           |  63.68 |  36.32 |
           |  14.43 |  14.43 |
  ---------+--------+--------+
  FL       |   2170 |   1370 |   3540
           |  16.11 |  10.17 |  26.28
           |  61.30 |  38.70 |
           |  25.29 |  28.01 |
  ---------+--------+--------+
  GA       |   3488 |   1940 |   5428
           |  25.89 |  14.40 |  40.29
           |  64.26 |  35.74 |
           |  40.65 |  39.66 |
  ---------+--------+--------+
  SC       |   1684 |    875 |   2559
           |  12.50 |   6.50 |  19.00
           |  65.81 |  34.19 |
           |  19.63 |  17.89 |
  ---------+--------+--------+
  Total        8580     4891    13471
              63.69    36.31   100.00

Figure 72.4: Stratification of Customers by State and Type

The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set according to the stratified sample design:

  title1 'Customer Satisfaction Survey';
  title2 'Stratified Sampling';
  proc surveyselect data=Customers
        method=srs n=15
        seed=1953 out=SampleStrata;
     strata State Type;
  run;

The STRATA statement names the stratification variables State and Type. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=15 option specifies a sample size of 15 customers for each stratum. If you want to specify different sample sizes for different strata, you can use the N=SAS-data-set option to name a secondary data set that contains the stratum sample sizes. The SEED=1953 option specifies 1953 as the initial seed for random number generation.
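The N=SAS-data-set form might be used as in the following sketch. It builds the secondary data set from the PROC FREQ counts so that State and Type keep the same type, length, and order as in Customers. The data set names StrataCounts, StrataSizes, and SampleStrataVar and the stratum sizes of 20 and 10 are hypothetical. The sketch also assumes that the procedure reads stratum sample sizes from a variable named _NSIZE_; check that name against the SAMPSIZE= option documentation for your release.

```sas
/* One row per stratum, in the same order as the sorted Customers data */
proc freq data=Customers noprint;
   tables State*Type / out=StrataCounts;
run;

/* Hypothetical allocation: 20 per 'New' stratum, 10 per 'Old' stratum */
data StrataSizes;
   set StrataCounts;
   if Type = 'New' then _NSIZE_ = 20;
   else _NSIZE_ = 10;
   keep State Type _NSIZE_;
run;

/* Stratified SRS with stratum-specific sample sizes */
proc surveyselect data=Customers
      method=srs n=StrataSizes
      seed=1953 out=SampleStrataVar;
   strata State Type;
run;
```

As with the N=15 example, Customers must already be sorted by State and Type.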

Figure 72.5 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A total of 120 customers are selected.

  Customer Satisfaction Survey
  Stratified Sampling

  The SURVEYSELECT Procedure

  Selection Method    Simple Random Sampling

  Strata Variables    State
                      Type

  Input Data Set            CUSTOMERS
  Random Number Seed             1953
  Stratum Sample Size              15
  Number of Strata                  8
  Total Sample Size               120
  Output Data Set        SAMPLESTRATA

Figure 72.5: Sample Selection Summary

The following PROC PRINT statements display the first 30 observations of the output data set SampleStrata:

  title1 'Customer Satisfaction Survey';
  title2 'Sample Selected by Stratified Design';
  title3 '(First 30 Observations)';
  proc print data=SampleStrata(obs=30);
  run;

Figure 72.6 displays the first 30 observations of the output data set SampleStrata, which contains the sample of 120 customers, 15 customers from each of the eight strata. The variable SelectionProb contains the selection probability for each customer in the sample. Since customers are selected with equal probability within strata in this design, the selection probability equals the stratum sample size (15) divided by the stratum population size. The selection probabilities differ from stratum to stratum since the population sizes differ. The selection probability for each customer in the first stratum (State = 'AL' and Type = 'New') is 0.012116, and the selection probability is 0.021246 for customers in the second stratum. The variable SamplingWeight contains the sampling weights, which are computed as inverse selection probabilities.

  Customer Satisfaction Survey
  Sample Selected by Stratified Design
  (First 30 Observations)

                                                  Selection    Sampling
  Obs    State    Type    CustomerID     Usage         Prob      Weight

    1     AL      New     002 26 1498     1189     0.012116     82.5333
    2     AL      New     070 86 8494      106     0.012116     82.5333
    3     AL      New     121 28 6895       76     0.012116     82.5333
    4     AL      New     131 79 7630      265     0.012116     82.5333
    5     AL      New     211 88 4991      108     0.012116     82.5333
    6     AL      New     222 81 3742       83     0.012116     82.5333
    7     AL      New     238 46 3776      278     0.012116     82.5333
    8     AL      New     370 01 0671      123     0.012116     82.5333
    9     AL      New     407 07 5479     1580     0.012116     82.5333
   10     AL      New     550 90 3188      177     0.012116     82.5333
   11     AL      New     582 40 9610       46     0.012116     82.5333
   12     AL      New     672 59 9114       66     0.012116     82.5333
   13     AL      New     848 60 3119       28     0.012116     82.5333
   14     AL      New     886 83 4909      170     0.012116     82.5333
   15     AL      New     993 31 7677       64     0.012116     82.5333
   16     AL      Old     124 60 0495       80     0.021246     47.0667
   17     AL      Old     128 54 9590       56     0.021246     47.0667
   18     AL      Old     204 05 4017       17     0.021246     47.0667
   19     AL      Old     210 68 8704     4363     0.021246     47.0667
   20     AL      Old     239 75 4343      430     0.021246     47.0667
   21     AL      Old     317 70 6496      452     0.021246     47.0667
   22     AL      Old     365 37 1340       21     0.021246     47.0667
   23     AL      Old     399 78 7900      108     0.021246     47.0667
   24     AL      Old     404 90 6273      824     0.021246     47.0667
   25     AL      Old     421 04 8548     1332     0.021246     47.0667
   26     AL      Old     604 48 0587       16     0.021246     47.0667
   27     AL      Old     774 04 0162      318     0.021246     47.0667
   28     AL      Old     849 66 4156       79     0.021246     47.0667
   29     AL      Old     937 69 9106      182     0.021246     47.0667
   30     AL      Old     985 09 8691       24     0.021246     47.0667

Figure 72.6: Customer Sample (First 30 Observations)
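The stratum selection probabilities and weights in Figure 72.6 follow directly from the stratum counts in Figure 72.4. For example, the first stratum has 1,238 customers, so the probability is 15/1238 = 0.012116 and the weight is 1238/15 = 82.5333. The following sketch, which is not part of the original example, computes the same quantities for all eight strata; the data set names StrataCounts and StrataProbs are assumptions.

```sas
/* Stratum population sizes; COUNT is created by the OUT= option */
proc freq data=Customers noprint;
   tables State*Type / out=StrataCounts;
run;

/* Equal-probability selection of 15 units within each stratum */
data StrataProbs;
   set StrataCounts;
   SelectionProb  = 15 / Count;    /* stratum sample size / stratum size */
   SamplingWeight = Count / 15;    /* inverse selection probability      */
   keep State Type Count SelectionProb SamplingWeight;
run;

proc print data=StrataProbs noobs;
run;
```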

Stratified Sampling with Control Sorting

The next sample design for the customer satisfaction survey uses stratification by State. The sampling frame is also sorted by Type and Usage before sample selection, to provide additional control over the distribution of the sample. Customers are then selected by systematic random sampling within strata. Selection by systematic sampling, together with control sorting, spreads the sample uniformly over the range of type and usage values within each stratum or state. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using this design:

  title1 'Customer Satisfaction Survey';
  title2 'Stratified Sampling with Control Sorting';
  proc surveyselect data=Customers
        method=sys rate=.02
        seed=1234 out=SampleControl;
     strata State;
     control Type Usage;
  run;

The STRATA statement names the stratification variable State. The CONTROL statement names the control variables Type and Usage. In the PROC SURVEYSELECT statement, the METHOD=SYS option requests systematic random sampling. The RATE=.02 option specifies a sampling rate of 2% for each stratum. The SEED=1234 option specifies the initial seed for random number generation.

Figure 72.7 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A sample of 271 customers is selected, using systematic random sampling within strata determined by State. The sampling frame Customers is sorted by control variables Type and Usage within strata. The type of sorting is serpentine, which is used by default since SORT=NEST is not specified. See the section 'Sorting by CONTROL Variables' on page 4445 for a description of serpentine sorting. The sorted data set replaces the input data set. (To store the sorted input data in another data set, leaving the input data set unsorted, use the OUTSORT= option.) The output data set SampleControl contains the sample of customers.
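To leave Customers in its original order, the OUTSORT= option can be added to the same statements. The following is a sketch only; the data set name CustomersSorted is hypothetical.

```sas
/* Same design, but write the control-sorted frame to a separate */
/* data set. CustomersSorted is a hypothetical name.             */
proc surveyselect data=Customers
      method=sys rate=.02
      seed=1234 out=SampleControl
      outsort=CustomersSorted;   /* Customers itself is left unsorted */
   strata State;
   control Type Usage;
run;
```

Adding SORT=NEST to the PROC SURVEYSELECT statement would request nested rather than serpentine sorting of the control variables.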

  Customer Satisfaction Survey
  Stratified Sampling with Control Sorting

  The SURVEYSELECT Procedure

  Selection Method     Systematic Random Sampling

  Strata Variable      State
  Control Variables    Type
                       Usage
  Control Sorting      Serpentine

  Input Data Set               CUSTOMERS
  Random Number Seed                1234
  Stratum Sampling Rate             0.02
  Number of Strata                     4
  Total Sample Size                  271
  Output Data Set          SAMPLECONTROL

Figure 72.7: Sample Selection Summary
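As a rough arithmetic check on the total sample size, the 2% rate can be applied to the state totals from Figure 72.4. The sketch below is not part of the original example, and the data set and variable names are assumptions. It only shows that rounding each product up to the next integer gives 39 + 71 + 109 + 52 = 271, which agrees with the total in Figure 72.7. It does not describe how the procedure itself determines each stratum's sample size under systematic selection.

```sas
/* State totals copied from Figure 72.4; names are illustrative */
data RateCheck;
   input State $ PopSize;
   Expected  = 0.02 * PopSize;    /* rate times stratum size */
   RoundedUp = ceil(Expected);    /* next integer at or above */
   datalines;
AL 1944
FL 3540
GA 5428
SC 2559
;
run;

/* The SUM statement prints column totals below the listing */
proc print data=RateCheck noobs;
   sum Expected RoundedUp;
run;
```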