Examples | SAS/STAT 9.1 User's Guide (Vol. 6)

Example 72.1. Replicated Sampling

This example uses the Customers data set from the section 'Getting Started' on page 4422. The data set Customers contains an Internet service provider's current subscribers, and the service provider wants to select a sample from this population for a customer satisfaction survey.

This example illustrates replicated sampling, which selects multiple samples from the survey population according to the same design. You can use replicated sampling to provide a simple method of variance estimation, or to evaluate variable nonsampling errors such as interviewer differences. Refer to Lohr (1999), Kish (1965, 1987), and Kalton (1983) for information on replicated sampling.

This design includes four replicates, each with a sample size of 50 customers. The sampling frame is stratified by State and sorted by Type and Usage within strata. Customers are selected by sequential random sampling with equal probability within strata. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using this design.

   title1 'Customer Satisfaction Survey';
   title2 'Replicated Sampling';
   proc surveyselect data=Customers
         method=seq n=(8 12 20 10)
         rep=4
         seed=40070 out=SampleRep;
      strata State;
      control Type Usage;
   run;

The STRATA statement names the stratification variable State. The CONTROL statement names the control variables Type and Usage. In the PROC SURVEYSELECT statement, the METHOD=SEQ option requests sequential random sampling. The REP=4 option specifies four replicates of this sample. The N=(8 12 20 10) option specifies the stratum sample sizes for each replicate. The N= option lists the stratum sample sizes in the same order as the strata appear in the Customers data set, which has been sorted by State. The sample size of eight customers corresponds to the first stratum, State = 'AL'. The sample size of 12 corresponds to the next stratum, State = 'FL', and so on. The SEED=40070 option specifies '40070' as the initial seed for random number generation.
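Because the N= list is matched to the strata purely by position, it can help to confirm the order and size of the strata in the frame before sampling. The following is a minimal sketch of such a check, assuming that Customers is already sorted by State as described above; it is not part of the original example.

```sas
/* Sketch: list the strata in the frame with their counts.        */
/* PROC FREQ reports State in sorted order, which matches the     */
/* order of the strata in a data set that is sorted by State.     */
proc freq data=Customers;
   tables State / nocum nopercent;
run;
```

Each stratum count shown should be at least as large as the corresponding value in the N= list.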

Output 72.1.1 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A total of 200 customers is selected in four replicates. PROC SURVEYSELECT selects each replicate using sequential random sampling within strata determined by State. The sampling frame Customers is sorted by control variables Type and Usage within strata, according to hierarchic serpentine sorting. The output data set SampleRep contains the sample.

Output 72.1.1: Sample Selection Summary

               Customer Satisfaction Survey
                   Replicated Sampling

                The SURVEYSELECT Procedure

   Selection Method     Sequential Random Sampling
                        With Equal Probability
   Strata Variable      State
   Control Variables    Type
                        Usage
   Control Sorting      Serpentine

   Input Data Set           CUSTOMERS
   Random Number Seed           40070
   Number of Strata                 4
   Number of Replicates             4
   Total Sample Size              200
   Output Data Set          SAMPLEREP

The following PROC PRINT statements display the selected customers for the first stratum, State = 'AL', from the output data set SampleRep.

   title1 'Customer Satisfaction Survey';
   title2 'Sample Selected by Replicated Design';
   title3 '(First Stratum)';
   proc print data=SampleRep;
      where State = 'AL';
   run;

Output 72.1.2 displays the 32 sample customers of the first stratum (State = 'AL') from the output data set SampleRep, which includes the entire sample of 200 customers. The variable SelectionProb contains the selection probability, and SamplingWeight contains the sampling weight. Since customers are selected with equal probability within strata in this design, all customers in the same stratum have the same selection probability. These selection probabilities and sampling weights apply to a single replicate, and the variable Replicate contains the sample replicate number.
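The arithmetic behind these per-replicate values can be written out as follows. This is a worked sketch that relies only on the equal-probability design and the values 0.004115226 and 243 reported for the first stratum; the symbols n, N, pi, and w are notation introduced here, not names used by the procedure.

```latex
% Per-replicate selection probability and weight, stratum AL (n = 8)
\pi_{AL} = \frac{n_{AL}}{N_{AL}} = 0.004115226 = \frac{1}{243},
\qquad
w_{AL} = \frac{1}{\pi_{AL}} = \frac{N_{AL}}{n_{AL}} = 243
\quad\Longrightarrow\quad
N_{AL} = 8 \times 243 = 1944 .

% Averaging R = 4 replicate totals is the same as weighting the
% pooled sample of 32 customers by w / R
\frac{1}{4}\sum_{r=1}^{4}\sum_{i \in s_r} 243\, y_i
  \;=\; \sum_{i \in s_1 \cup \dots \cup s_4} 60.75\, y_i .
```

In other words, the weight of 243 is appropriate when a single replicate is analyzed by itself; if the four replicates are pooled into one estimate, each AL customer effectively carries a weight of 243/4 = 60.75.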

Output 72.1.2: Customer Sample (First Stratum)

                      Customer Satisfaction Survey
                  Sample Selected by Replicated Design
                            (First Stratum)

                                                         Selection   Sampling
   Obs  State  Replicate   CustomerID   Type   Usage          Prob     Weight

     1   AL        1      882-37-7496   New      572    .004115226      243
     2   AL        1      581-32-5534   New      863    .004115226      243
     3   AL        1      980-29-2898   Old      571    .004115226      243
     4   AL        1      172-56-4743   Old      128    .004115226      243
     5   AL        1      998-55-5227   Old       35    .004115226      243
     6   AL        1      625-44-3396   New       60    .004115226      243
     7   AL        1      627-48-2509   New      114    .004115226      243
     8   AL        1      257-66-6558   New      172    .004115226      243
     9   AL        2      622-83-1680   New       22    .004115226      243
    10   AL        2      343-57-1186   New       53    .004115226      243
    11   AL        2      976-05-3796   New      110    .004115226      243
    12   AL        2      859-74-0652   New      303    .004115226      243
    13   AL        2      476-48-1066   New      839    .004115226      243
    14   AL        2      109-27-8914   Old     2102    .004115226      243
    15   AL        2      743-25-0298   Old      376    .004115226      243
    16   AL        2      722-08-2215   Old      105    .004115226      243
    17   AL        3      668-57-7696   New      200    .004115226      243
    18   AL        3      300-72-0129   New      471    .004115226      243
    19   AL        3      073-60-0765   New      656    .004115226      243
    20   AL        3      526-87-0258   Old      672    .004115226      243
    21   AL        3      726-61-0387   Old      150    .004115226      243
    22   AL        3      632-29-9020   Old       51    .004115226      243
    23   AL        3      417-17-8378   New       56    .004115226      243
    24   AL        3      091-26-2366   New       93    .004115226      243
    25   AL        4      336-04-1288   New      419    .004115226      243
    26   AL        4      827-04-7407   New      650    .004115226      243
    27   AL        4      317-70-6496   Old      452    .004115226      243
    28   AL        4      002-38-4582   Old      206    .004115226      243
    29   AL        4      181-83-3990   Old       33    .004115226      243
    30   AL        4      675-34-7393   New       47    .004115226      243
    31   AL        4      228-07-6671   New       65    .004115226      243
    32   AL        4      298-46-2434   New      161    .004115226      243
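One common way to use the replicates for variance estimation is to compute the same estimate separately in each replicate and then take the variability among those replicate estimates as the measure of precision. The following is a minimal sketch of that idea for the mean of Usage. The data set names SampleRepSorted and RepMeans and the variable MeanUsage are hypothetical names introduced here, and the approach shown is one simple option rather than the method prescribed by this example.

```sas
/* Sketch: replicate-based standard error for mean Usage.          */
/* Step 1: order the sample by replicate for BY processing.        */
proc sort data=SampleRep out=SampleRepSorted;
   by Replicate;
run;

/* Step 2: weighted mean of Usage within each replicate.           */
proc means data=SampleRepSorted noprint;
   by Replicate;
   var Usage;
   weight SamplingWeight;
   output out=RepMeans mean=MeanUsage;
run;

/* Step 3: the mean of the four replicate means is the overall     */
/* estimate; STDERR is the standard deviation of the replicate     */
/* means divided by the square root of the number of replicates.   */
proc means data=RepMeans n mean stderr;
   var MeanUsage;
run;
```

With only four replicates, the standard error from the last step rests on three degrees of freedom, so it is a rough indication of precision at best.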

Example 72.2. PPS Selection of Two Units Per Stratum

A state health agency plans to conduct a state-wide survey of a variety of different hospital services. The agency plans to select a probability sample of individual discharge records within hospitals using a two-stage sample design. First stage units are hospitals, and second stage units are patient discharges during the study time period. Hospitals are stratified first according to geographic region and then by rural/urban type and size of hospital. Two hospitals are selected from each stratum with probability proportional to size. This example describes hospital selection for this survey using PROC SURVEYSELECT.

The data set HospitalFrame contains all hospitals in the first geographical region of this state.

   data HospitalFrame;
      input Hospital$ Type$ SizeMeasure @@;
      if (SizeMeasure < 20) then Size='Small ';
      else if (SizeMeasure < 50) then Size='Medium';
      else Size='Large ';
      datalines;
   034 Rural  0.870   107 Rural  1.316   079 Rural  2.127
   223 Rural  3.960   236 Rural  5.279   165 Rural  5.893
   086 Rural  0.501   141 Rural 11.528   042 Urban  3.104
   124 Urban  4.033   006 Urban  4.249   261 Urban  4.376
   195 Urban  5.024   190 Urban 10.373   038 Urban 17.125
   083 Urban 40.382   259 Urban 44.942   129 Urban 46.702
   133 Urban 46.992   218 Urban 48.231   026 Urban 61.460
   058 Urban 65.931   119 Urban 66.352
   ;

In the SAS data set HospitalFrame, the variable Hospital identifies the hospital. The variable Type equals 'Urban' if the hospital is located in an urbanized area, and 'Rural' otherwise. The variable SizeMeasure contains the hospital's size measure, which is constructed from past data on service utilization for the hospital together with the desired sampling rates for each service. This size measure reflects the amount of relevant survey information expected from the hospital. Refer to Drummond et al. (1982) for details on this type of size measure. The variable Size equals 'Small', 'Medium', or 'Large', depending on the value of the hospital's size measure.

The following PROC PRINT statements display the data set HospitalFrame.

   title1 'Hospital Utilization Survey';
   title2 'Sampling Frame, Region 1';
   proc print data=HospitalFrame;
   run;

Output 72.2.1: Sampling Frame

              Hospital Utilization Survey
               Sampling Frame, Region 1

                                   Size
   Obs    Hospital    Type      Measure    Size

     1      034       Rural       0.870    Small
     2      107       Rural       1.316    Small
     3      079       Rural       2.127    Small
     4      223       Rural       3.960    Small
     5      236       Rural       5.279    Small
     6      165       Rural       5.893    Small
     7      086       Rural       0.501    Small
     8      141       Rural      11.528    Small
     9      042       Urban       3.104    Small
    10      124       Urban       4.033    Small
    11      006       Urban       4.249    Small
    12      261       Urban       4.376    Small
    13      195       Urban       5.024    Small
    14      190       Urban      10.373    Small
    15      038       Urban      17.125    Small
    16      083       Urban      40.382    Medium
    17      259       Urban      44.942    Medium
    18      129       Urban      46.702    Medium
    19      133       Urban      46.992    Medium
    20      218       Urban      48.231    Medium
    21      026       Urban      61.460    Large
    22      058       Urban      65.931    Large
    23      119       Urban      66.352    Large

The following PROC SURVEYSELECT statements select a probability sample of hospitals from the HospitalFrame data set, using a stratified design with PPS selection of two units from each stratum.

   title1 'Hospital Utilization Survey';
   proc surveyselect data=HospitalFrame
         method=pps_brewer
         seed=48702 out=SampleHospitals;
      size SizeMeasure;
      strata Type Size notsorted;
   run;

The STRATA statement names the stratification variables Type and Size. The NOTSORTED option specifies that observations with the same STRATA variable values are grouped together but are not necessarily sorted in alphabetical or increasing numerical order. The option is needed here because, in the HospitalFrame data set, Size = 'Small' precedes Size = 'Medium', which is not alphabetical order.

In the PROC SURVEYSELECT statement, the METHOD=PPS_BREWER option requests sample selection by Brewer's method, which selects two units per stratum with probability proportional to size. The SEED=48702 option specifies 48702 as the initial seed for random number generation. The SIZE statement specifies the size measure variable. It is not necessary to specify the sample size with the N= option, since Brewer's method always selects two units from each stratum.
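Under PPS selection of two units per stratum, a hospital's selection probability is twice its size measure divided by the total size measure of its stratum. The following sketch computes that quantity for every hospital in the frame so that it can be compared with the SelectionProb values that the procedure writes to the output data set. The data set name HospitalProb and the variable ExpectedProb are hypothetical names introduced here.

```sas
/* Sketch: expected selection probability for n = 2 PPS selection, */
/* computed as 2 * size / (stratum total size).                    */
/* PROC SQL remerges the stratum total with each hospital row.     */
proc sql;
   create table HospitalProb as
   select Type, Size, Hospital, SizeMeasure,
          2 * SizeMeasure / sum(SizeMeasure) as ExpectedProb
   from HospitalFrame
   group by Type, Size
   order by Type, Size, Hospital;
quit;

proc print data=HospitalProb;
run;
```

For instance, the eight Rural hospitals have a total size measure of 31.474, so hospital 079 with size measure 2.127 has an expected selection probability of 2(2.127)/31.474, or about 0.135. A value above 1 for any hospital would indicate a unit too large for this design.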

Output 72.2.2 displays the output from PROC SURVEYSELECT. A total of 8 hospitals were selected from the 4 strata. The data set SampleHospitals contains the selected hospitals.

Output 72.2.2: Sample Selection Summary

               Hospital Utilization Survey

               The SURVEYSELECT Procedure

   Selection Method    Brewer's PPS Method
   Size Measure        SizeMeasure
   Strata Variables    Type
                       Size

   Input Data Set           HOSPITALFRAME
   Random Number Seed               48702
   Stratum Sample Size                  2
   Number of Strata                     4
   Total Sample Size                    8
   Output Data Set        SAMPLEHOSPITALS

The following PROC PRINT statements display the sample hospitals.

   title1 'Hospital Utilization Survey';
   title2 'Sample Selected by Stratified PPS Design';
   proc print data=SampleHospitals;
   run;

Output 72.2.3: Sample Hospitals

                         Hospital Utilization Survey
                   Sample Selected by Stratified PPS Design

                                                                          Jt
                                     Size     Selection    Sampling    Selection
   Obs   Type    Size     Hospital  Measure        Prob      Weight         Prob

    1    Rural   Small      079       2.127     0.13516     7.39868      0.01851
    2    Rural   Small      236       5.279     0.33545     2.98106      0.01851
    3    Urban   Small      006       4.249     0.17600     5.68181      0.01454
    4    Urban   Small      195       5.024     0.20810     4.80533      0.01454
    5    Urban   Medium     133      46.992     0.41357     2.41795      0.11305
    6    Urban   Medium     218      48.231     0.42448     2.35584      0.11305
    7    Urban   Large      026      61.460     0.63445     1.57617      0.31505
    8    Urban   Large      058      65.931     0.68060     1.46929      0.31505

The variable SelectionProb contains the selection probability for each hospital in the sample. The variable JtSelectionProb contains the joint probability of selection for the two sample hospitals in the same stratum. The variable SamplingWeight contains the sampling weight component for this first stage of the design. The final-stage weight components, which correspond to patient record selection within hospitals, can be multiplied by the hospital weight components to obtain the overall sampling weights.
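The multiplication of weight components can be sketched with a DATA step merge. The second-stage data set below is entirely hypothetical: the data set DischargeSample, the variables DischargeID and DischargeWeight, and every value in its data lines are made up for illustration, since the example does not carry out the second stage of selection.

```sas
/* Hypothetical second-stage sample: one row per sampled discharge. */
/* DischargeWeight is the within-hospital weight component.         */
/* All values here are illustrative, not survey results.            */
data DischargeSample;
   input Hospital$ DischargeID$ DischargeWeight;
   datalines;
079 D001 12.5
079 D002 12.5
236 D003 30.0
;

proc sort data=SampleHospitals out=HospSorted;
   by Hospital;
run;

proc sort data=DischargeSample;
   by Hospital;
run;

/* Attach each hospital's first-stage weight to its discharges     */
/* and multiply the two components.                                */
data OverallWeights;
   merge DischargeSample (in=InStage2)
         HospSorted (keep=Hospital SamplingWeight
                     rename=(SamplingWeight=HospitalWeight));
   by Hospital;
   if InStage2;
   OverallWeight = HospitalWeight * DischargeWeight;
run;
```

The IN= flag keeps only hospitals that actually have second-stage records, and the RENAME= option avoids confusing the hospital weight with any weight variable already present in the discharge data.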

Example 72.3. PPS (Dollar-Unit) Sampling

A small company wants to audit employee travel expenses in an effort to improve the expense reporting procedure and possibly reduce expenses. The company does not have resources to examine all expense reports and wants to use statistical sampling to objectively select expense reports for audit.

The data set TravelExpense contains the dollar amount of all employee travel expense transactions during the past month.

   data TravelExpense;
      input ID$ Amount @@;
      if (Amount < 500) then Level='1_Low ';
      else if (Amount > 1500) then Level='3_High';
      else Level='2_Avg ';
      datalines;
   110  237.18   002  567.89   234  118.50   743   74.38
   411 1287.23   782  258.10   216  325.36   174  218.38
   568 1670.80   302  134.71   285 2020.70   314   47.80
   139 1183.45   775  330.54   425  780.10   506  895.80
   239  620.10   011  420.18   672  979.66   142  810.25
   738  670.85   192  314.58   243   87.50   263 1893.40
   496  753.30   332  540.65   486 2580.35   614  230.56
   654  185.60   308  688.43   784  505.14   017  205.48
   162  650.42   289 1348.34   691   30.50   545 2214.80
   517  940.35   382  217.85   024  142.90   478  806.90
   107  560.72
   ;

In the SAS data set TravelExpense, the variable ID identifies the travel expense report. The variable Amount contains the dollar amount of the reported expense. The variable Level equals '1_Low', '2_Avg', or '3_High', depending on the value of Amount.

In the sample design for this audit, expense reports are stratified by Level. This ensures that each of these expense levels is included in the sample and also permits a disproportionate allocation of the sample, selecting proportionately more of the expense reports from the higher levels. Within strata, the sample of expense reports is selected with probability proportional to the amount of the expense, thus giving a greater chance of selection to larger expenses. In auditing terms, this is known as monetary-unit (or dollar-unit) sampling. Refer to Wilburn (1984).

PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the TravelExpense data set by the stratification variable Level.

   proc sort data=TravelExpense;
      by Level;
   run;

The following PROC PRINT statements display the sampling frame data set TravelExpense, which contains 41 observations.

   title1 'Travel Expense Audit';
   proc print data=TravelExpense;
   run;

Output 72.3.1: Sampling Frame

          Travel Expense Audit

   Obs    ID      Amount    Level

     1    110     237.18    1_Low
     2    002     567.89    2_Avg
     3    234     118.50    1_Low
     4    743      74.38    1_Low
     5    411    1287.23    2_Avg
     6    782     258.10    1_Low
     7    216     325.36    1_Low
     8    174     218.38    1_Low
     9    568    1670.80    3_High
    10    302     134.71    1_Low
    11    285    2020.70    3_High
    12    314      47.80    1_Low
    13    139    1183.45    2_Avg
    14    775     330.54    1_Low
    15    425     780.10    2_Avg
    16    506     895.80    2_Avg
    17    239     620.10    2_Avg
    18    011     420.18    1_Low
    19    672     979.66    2_Avg
    20    142     810.25    2_Avg
    21    738     670.85    2_Avg
    22    192     314.58    1_Low
    23    243      87.50    1_Low
    24    263    1893.40    3_High
    25    496     753.30    2_Avg
    26    332     540.65    2_Avg
    27    486    2580.35    3_High
    28    614     230.56    1_Low
    29    654     185.60    1_Low
    30    308     688.43    2_Avg
    31    784     505.14    2_Avg
    32    017     205.48    1_Low
    33    162     650.42    2_Avg
    34    289    1348.34    2_Avg
    35    691      30.50    1_Low
    36    545    2214.80    3_High
    37    517     940.35    2_Avg
    38    382     217.85    1_Low
    39    024     142.90    1_Low
    40    478     806.90    2_Avg
    41    107     560.72    2_Avg

The following PROC SURVEYSELECT statements select a probability sample of expense reports from the TravelExpense data set using the stratified design with PPS selection within strata.

   title1 'Travel Expense Audit';
   proc surveyselect data=TravelExpense
         method=pps n=(6 10 4)
         seed=47279 out=AuditSample;
      size Amount;
      strata Level;
   run;

The STRATA statement names the stratification variable Level. The SIZE statement specifies the size measure variable Amount. In the PROC SURVEYSELECT statement, the METHOD=PPS option requests sample selection with probability proportional to size and without replacement. The N=(6 10 4) option specifies the stratum sample sizes, listing the sample sizes in the same order that the strata appear in the TravelExpense data set. The sample size of 6 corresponds to the first stratum, Level = '1_Low', the sample size of 10 corresponds to the second stratum, Level = '2_Avg', and 4 corresponds to the last stratum, Level = '3_High'. The SEED=47279 option specifies '47279' as the initial seed for random number generation.
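With PPS selection without replacement, an expense report's selection probability is the stratum sample size times the report's share of the stratum total, so no report may account for more than 1/n of its stratum's total amount. The following sketch computes these expected probabilities from the frame as a feasibility check on the N=(6 10 4) allocation. The data set ExpectedAuditProb and the variables RelativeSize, StratumN, and ExpectedProb are hypothetical names introduced here.

```sas
/* Sketch: each report's share of its stratum total.               */
/* PROC SQL remerges the stratum total with each report row.       */
proc sql;
   create table ExpectedAuditProb as
   select Level, ID, Amount,
          Amount / sum(Amount) as RelativeSize
   from TravelExpense
   group by Level
   order by Level, Amount;
quit;

/* Multiply by the stratum sample size from N=(6 10 4).            */
data ExpectedAuditProb;
   set ExpectedAuditProb;
   select (Level);
      when ('1_Low') StratumN = 6;
      when ('2_Avg') StratumN = 10;
      otherwise      StratumN = 4;
   end;
   ExpectedProb = StratumN * RelativeSize;
run;

proc print data=ExpectedAuditProb;
run;
```

In this frame the tightest case is report 486 in the 3_High stratum: its amount of 2580.35 is just under one quarter of the stratum total of 10380.05, giving an expected probability of about 0.994. Any value above 1 would mean the requested stratum sample size is too large for that size distribution.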

Output 72.3.2 displays the output from PROC SURVEYSELECT. A total of 20 expense reports is selected for audit. The data set AuditSample contains the sample of travel expense reports.

Output 72.3.2: Sample Selection Summary

                 Travel Expense Audit

              The SURVEYSELECT Procedure

   Selection Method    PPS, Without Replacement
   Size Measure        Amount
   Strata Variable     Level

   Input Data Set        TRAVELEXPENSE
   Random Number Seed            47279
   Number of Strata                  3
   Total Sample Size                20
   Output Data Set         AUDITSAMPLE

The following PROC PRINT statements display the audit sample, which is shown in Output 72.3.3.

   title1 'Travel Expense Audit';
   title2 'Sample Selected by Stratified PPS Design';
   proc print data=AuditSample;
   run;

Output 72.3.3: Audit Sample

                    Travel Expense Audit
           Sample Selected by Stratified PPS Design

                                      Selection    Sampling
   Obs    Level     ID      Amount         Prob      Weight

     1    1_Low     654     185.60      0.31105     3.21489
     2    1_Low     017     205.48      0.34437     2.90385
     3    1_Low     382     217.85      0.36510     2.73896
     4    1_Low     614     230.56      0.38640     2.58797
     5    1_Low     782     258.10      0.43256     2.31183
     6    1_Low     775     330.54      0.55396     1.80518
     7    2_Avg     784     505.14      0.34623     2.88823
     8    2_Avg     332     540.65      0.37057     2.69853
     9    2_Avg     002     567.89      0.38924     2.56909
    10    2_Avg     239     620.10      0.42503     2.35278
    11    2_Avg     738     670.85      0.45981     2.17479
    12    2_Avg     496     753.30      0.51633     1.93676
    13    2_Avg     425     780.10      0.53470     1.87022
    14    2_Avg     478     806.90      0.55307     1.80810
    15    2_Avg     672     979.66      0.67148     1.48925
    16    2_Avg     139    1183.45      0.81116     1.23280
    17    3_High    568    1670.80      0.64385     1.55316
    18    3_High    263    1893.40      0.72963     1.37056
    19    3_High    285    2020.70      0.77869     1.28421
    20    3_High    486    2580.35      0.99435     1.00568
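A quick consistency check on the sampling weights is possible because the size measure is also a variable in the sample. Under PPS selection, each sampled report's weight times its amount equals the stratum total divided by the stratum sample size, so the weighted sum of Amount within a stratum should reproduce that stratum's total in the frame, up to rounding. The following is a minimal sketch of the check and is not part of the original example.

```sas
/* Sketch: weighted sum of Amount by stratum in the audit sample.  */
/* With a WEIGHT statement, SUM is the sum of weight * Amount.     */
proc means data=AuditSample sum;
   class Level;
   var Amount;
   weight SamplingWeight;
run;

/* Frame totals by stratum, for comparison.                        */
proc means data=TravelExpense sum;
   class Level;
   var Amount;
run;
```

Because this agreement holds by construction, it verifies the weights rather than estimating anything. In an actual audit the same weighted sum would be applied to an audited quantity, such as the amount found to be in error on each sampled report.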