This example uses the Customers data set from the section 'Getting Started' on page 4422. The data set Customers contains an Internet service provider's current subscribers, and the service provider wants to select a sample from this population for a customer satisfaction survey.
This example illustrates replicated sampling, which selects multiple samples from the survey population according to the same design. You can use replicated sampling to provide a simple method of variance estimation, or to evaluate variable nonsampling errors such as interviewer differences. Refer to Lohr (1999), Kish (1965, 1987), and Kalton (1983) for information on replicated sampling.
This design includes four replicates, each with a sample size of 50 customers. The sampling frame is stratified by State and sorted by Type and Usage within strata. Customers are selected by sequential random sampling with equal probability within strata. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using this design.
title1 'Customer Satisfaction Survey';
title2 'Replicated Sampling';
proc surveyselect data=Customers
      method=seq n=(8 12 20 10)
      rep=4 seed=40070 out=SampleRep;
   strata State;
   control Type Usage;
run;
The STRATA statement names the stratification variable State. The CONTROL statement names the control variables Type and Usage. In the PROC SURVEYSELECT statement, the METHOD=SEQ option requests sequential random sampling. The REP=4 option specifies four replicates of this sample. The N=(8 12 20 10) option specifies the stratum sample sizes for each replicate. The N= option lists the stratum sample sizes in the same order as the strata appear in the Customers data set, which has been sorted by State. The sample size of 8 customers corresponds to the first stratum, State = 'AL'. The sample size of 12 corresponds to the next stratum, State = 'FL', and so on. The SEED=40070 option specifies 40070 as the initial seed for random number generation.
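Because the N= list is matched to the strata by position, it can help to confirm the order of the strata in the frame before selecting the sample. The following is a minimal sketch, assuming the Customers data set from 'Getting Started' is available in the current session. PROC FREQ lists the State values in sorted order by default, which is the order the N= values should follow when the frame is sorted by State.

```sas
/* List the strata in sorted order with their frame counts. */
proc freq data=Customers;
   tables State / nocum nopercent;
run;
```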
Output 72.1.1 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A total of 200 customers is selected in four replicates. PROC SURVEYSELECT selects each replicate using sequential random sampling within strata determined by State . The sampling frame Customers is sorted by control variables Type and Usage within strata, according to hierarchic serpentine sorting. The output data set SampleRep contains the sample.
Customer Satisfaction Survey
Replicated Sampling

The SURVEYSELECT Procedure

Selection Method       Sequential Random Sampling
                       With Equal Probability
Strata Variable        State
Control Variables      Type Usage
Control Sorting        Serpentine

Input Data Set            CUSTOMERS
Random Number Seed        40070
Number of Strata          4
Number of Replicates      4
Total Sample Size         200
Output Data Set           SAMPLEREP
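One way to confirm that each replicate follows the requested allocation is to cross-tabulate the output data set by replicate and stratum. This is only a sketch; it assumes the SampleRep data set created above and uses the Replicate and State variables it contains. Each replicate row should show the stratum sample sizes 8, 12, 20, and 10, for 50 customers per replicate.

```sas
/* Count sampled customers in each replicate-by-stratum cell. */
proc freq data=SampleRep;
   tables Replicate*State / norow nocol nopercent;
run;
```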
The following PROC PRINT statements display the selected customers for the first stratum, State = 'AL', from the output data set SampleRep.
title1 'Customer Satisfaction Survey';
title2 'Sample Selected by Replicated Design';
title3 '(First Stratum)';
proc print data=SampleRep;
   where State = 'AL';
run;
Output 72.1.2 displays the 32 sample customers of the first stratum (State = 'AL') from the output data set SampleRep, which includes the entire sample of 200 customers. The variable SelectionProb contains the selection probability, and SamplingWeight contains the sampling weight. Because customers are selected with equal probability within strata in this design, all customers in the same stratum have the same selection probability. These selection probabilities and sampling weights apply to a single replicate, and the variable Replicate contains the sample replicate number.
Customer Satisfaction Survey
Sample Selected by Replicated Design
(First Stratum)

                                                          Selection    Sampling
Obs   State   Replicate   CustomerID     Type   Usage          Prob      Weight

  1   AL          1       882 37 7496    New      572    .004115226         243
  2   AL          1       581 32 5534    New      863    .004115226         243
  3   AL          1       980 29 2898    Old      571    .004115226         243
  4   AL          1       172 56 4743    Old      128    .004115226         243
  5   AL          1       998 55 5227    Old       35    .004115226         243
  6   AL          1       625 44 3396    New       60    .004115226         243
  7   AL          1       627 48 2509    New      114    .004115226         243
  8   AL          1       257 66 6558    New      172    .004115226         243
  9   AL          2       622 83 1680    New       22    .004115226         243
 10   AL          2       343 57 1186    New       53    .004115226         243
 11   AL          2       976 05 3796    New      110    .004115226         243
 12   AL          2       859 74 0652    New      303    .004115226         243
 13   AL          2       476 48 1066    New      839    .004115226         243
 14   AL          2       109 27 8914    Old     2102    .004115226         243
 15   AL          2       743 25 0298    Old      376    .004115226         243
 16   AL          2       722 08 2215    Old      105    .004115226         243
 17   AL          3       668 57 7696    New      200    .004115226         243
 18   AL          3       300 72 0129    New      471    .004115226         243
 19   AL          3       073 60 0765    New      656    .004115226         243
 20   AL          3       526 87 0258    Old      672    .004115226         243
 21   AL          3       726 61 0387    Old      150    .004115226         243
 22   AL          3       632 29 9020    Old       51    .004115226         243
 23   AL          3       417 17 8378    New       56    .004115226         243
 24   AL          3       091 26 2366    New       93    .004115226         243
 25   AL          4       336 04 1288    New      419    .004115226         243
 26   AL          4       827 04 7407    New      650    .004115226         243
 27   AL          4       317 70 6496    Old      452    .004115226         243
 28   AL          4       002 38 4582    Old      206    .004115226         243
 29   AL          4       181 83 3990    Old       33    .004115226         243
 30   AL          4       675 34 7393    New       47    .004115226         243
 31   AL          4       228 07 6671    New       65    .004115226         243
 32   AL          4       298 46 2434    New      161    .004115226         243
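The relationship between the two printed columns can be checked with a little arithmetic: the sampling weight is the reciprocal of the selection probability. The following sketch uses the values printed for State = 'AL'. It further assumes that, for equal probability selection within a stratum, the selection probability is the stratum sample size divided by the stratum frame count, so the frame count can be recovered as the sample size times the weight. The variable names StratumN and FrameCount are introduced here for illustration only.

```sas
data _null_;
   SelectionProb  = 0.004115226;               /* value printed for State='AL'      */
   StratumN       = 8;                         /* AL sample size in one replicate   */
   SamplingWeight = 1 / SelectionProb;         /* weight = inverse of probability   */
   FrameCount     = StratumN * SamplingWeight; /* implied AL frame size (assumed)   */
   put SamplingWeight= 8.2 FrameCount= 8.1;    /* write both values to the SAS log  */
run;
```

The log should show a weight of about 243 and an implied frame count of about 1,944 customers for this stratum.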
A state health agency plans to conduct a state-wide survey of a variety of different hospital services. The agency plans to select a probability sample of individual discharge records within hospitals using a two-stage sample design. First stage units are hospitals, and second stage units are patient discharges during the study time period. Hospitals are stratified first according to geographic region and then by rural/urban type and size of hospital. Two hospitals are selected from each stratum with probability proportional to size. This example describes hospital selection for this survey using PROC SURVEYSELECT.
The data set HospitalFrame contains all hospitals in the first geographical region of this state.
data HospitalFrame;
   input Hospital$ Type$ SizeMeasure @@;
   if (SizeMeasure < 20) then Size='Small ';
   else if (SizeMeasure < 50) then Size='Medium';
   else Size='Large ';
   datalines;
034 Rural  0.870   107 Rural  1.316   079 Rural  2.127   223 Rural  3.960
236 Rural  5.279   165 Rural  5.893   086 Rural  0.501   141 Rural 11.528
042 Urban  3.104   124 Urban  4.033   006 Urban  4.249   261 Urban  4.376
195 Urban  5.024   190 Urban 10.373   038 Urban 17.125   083 Urban 40.382
259 Urban 44.942   129 Urban 46.702   133 Urban 46.992   218 Urban 48.231
026 Urban 61.460   058 Urban 65.931   119 Urban 66.352
;
In the SAS data set HospitalFrame, the variable Hospital identifies the hospital. The variable Type equals 'Urban' if the hospital is located in an urbanized area, and 'Rural' otherwise. The variable SizeMeasure contains the hospital's size measure, which is constructed from past data on service utilization for the hospital together with the desired sampling rates for each service. This size measure reflects the amount of relevant survey information expected from the hospital. Refer to Drummond et al. (1982) for details on this type of size measure. The variable Size equals 'Small', 'Medium', or 'Large', depending on the value of the hospital's size measure.
The following PROC PRINT statements display the data set HospitalFrame.
title1 'Hospital Utilization Survey';
title2 'Sampling Frame, Region 1';
proc print data=HospitalFrame;
run;
Hospital Utilization Survey
Sampling Frame, Region 1

                               Size
Obs   Hospital   Type       Measure   Size

  1   034        Rural        0.870   Small
  2   107        Rural        1.316   Small
  3   079        Rural        2.127   Small
  4   223        Rural        3.960   Small
  5   236        Rural        5.279   Small
  6   165        Rural        5.893   Small
  7   086        Rural        0.501   Small
  8   141        Rural       11.528   Small
  9   042        Urban        3.104   Small
 10   124        Urban        4.033   Small
 11   006        Urban        4.249   Small
 12   261        Urban        4.376   Small
 13   195        Urban        5.024   Small
 14   190        Urban       10.373   Small
 15   038        Urban       17.125   Small
 16   083        Urban       40.382   Medium
 17   259        Urban       44.942   Medium
 18   129        Urban       46.702   Medium
 19   133        Urban       46.992   Medium
 20   218        Urban       48.231   Medium
 21   026        Urban       61.460   Large
 22   058        Urban       65.931   Large
 23   119        Urban       66.352   Large
The following PROC SURVEYSELECT statements select a probability sample of hospitals from the HospitalFrame data set, using a stratified design with PPS selection of two units from each stratum.
title1 'Hospital Utilization Survey';
proc surveyselect data=HospitalFrame
      method=pps_brewer seed=48702 out=SampleHospitals;
   size SizeMeasure;
   strata Type Size notsorted;
run;
The STRATA statement names the stratification variables Type and Size. The NOTSORTED option specifies that observations with the same STRATA variable values are grouped together but are not necessarily sorted in alphabetical or increasing numerical order. In the HospitalFrame data set, Size = 'Small' precedes Size = 'Medium'.
In the PROC SURVEYSELECT statement, the METHOD=PPS_BREWER option requests sample selection by Brewer's method, which selects two units per stratum with probability proportional to size. The SEED=48702 option specifies 48702 as the initial seed for random number generation. The SIZE statement specifies the size measure variable. It is not necessary to specify the sample size with the N= option, since Brewer's method always selects two units from each stratum.
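As a rough check on what "probability proportional to size" means here, the expected selection probability of each hospital can be computed directly from the frame. The sketch below assumes that, with two units selected per stratum, each hospital's selection probability is twice its size measure divided by the stratum total of SizeMeasure. The table name CheckBrewer and the column name ExpectedProb are hypothetical names used only for this illustration.

```sas
proc sql;
   /* SAS remerges the stratum total onto each row and notes this in the log. */
   create table CheckBrewer as
   select Type, Size, Hospital, SizeMeasure,
          2 * SizeMeasure / sum(SizeMeasure) as ExpectedProb format=8.5
   from HospitalFrame
   group by Type, Size
   order by Type, Size, Hospital;
quit;

proc print data=CheckBrewer;
run;
```

For hospital 079 in the Rural, Small stratum, the stratum total is 31.474, so the expected value is 2 × 2.127 / 31.474, or about 0.13516. These values can be compared with the SelectionProb variable that PROC SURVEYSELECT writes to the output data set for the selected hospitals.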
Output 72.2.2 displays the output from PROC SURVEYSELECT. A total of 8 hospitals were selected from the 4 strata. The data set SampleHospitals contains the selected hospitals.
Hospital Utilization Survey

The SURVEYSELECT Procedure

Selection Method     Brewer's PPS Method
Size Measure         SizeMeasure
Strata Variables     Type Size

Input Data Set          HOSPITALFRAME
Random Number Seed      48702
Stratum Sample Size     2
Number of Strata        4
Total Sample Size       8
Output Data Set         SAMPLEHOSPITALS
The following PROC PRINT statements display the sample hospitals.
title1 'Hospital Utilization Survey';
title2 'Sample Selected by Stratified PPS Design';
proc print data=SampleHospitals;
run;
Hospital Utilization Survey
Sample Selected by Stratified PPS Design

                                                                         Jt
                                    Size    Selection    Sampling    Selection
Obs   Type    Size     Hospital   Measure        Prob      Weight         Prob

  1   Rural   Small    079          2.127     0.13516     7.39868      0.01851
  2   Rural   Small    236          5.279     0.33545     2.98106      0.01851
  3   Urban   Small    006          4.249     0.17600     5.68181      0.01454
  4   Urban   Small    195          5.024     0.20810     4.80533      0.01454
  5   Urban   Medium   133         46.992     0.41357     2.41795      0.11305
  6   Urban   Medium   218         48.231     0.42448     2.35584      0.11305
  7   Urban   Large    026         61.460     0.63445     1.57617      0.31505
  8   Urban   Large    058         65.931     0.68060     1.46929      0.31505
The variable SelectionProb contains the selection probability for each hospital in the sample. The variable JtSelectionProb contains the joint probability of selection for the two sample hospitals in the same stratum. The variable SamplingWeight contains the sampling weight component for this first stage of the design. The final-stage weight components, which correspond to patient record selection within hospitals, can be multiplied by the hospital weight components to obtain the overall sampling weights.
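The multiplication of weight components could be sketched as follows. This is an illustration only: the second-stage data set DischargeSample and its variable DischargeWeight are hypothetical and do not appear in this example; they stand in for whatever the discharge-record selection step produces, with one row per sampled discharge and a Hospital identifier that matches SampleHospitals.

```sas
/* Sort both stages by the hospital identifier before merging. */
proc sort data=SampleHospitals out=HospSorted;
   by Hospital;
run;

proc sort data=DischargeSample out=DischSorted;   /* hypothetical data set */
   by Hospital;
run;

data OverallWeights;
   merge HospSorted(keep=Hospital SamplingWeight
                    rename=(SamplingWeight=HospitalWeight) in=inHosp)
         DischSorted(in=inDisch);
   by Hospital;
   if inHosp and inDisch;                            /* keep matched records only */
   OverallWeight = HospitalWeight * DischargeWeight; /* product of stage weights  */
run;
```

Renaming SamplingWeight on input keeps the first-stage component distinct in case the second-stage data set carries a variable of the same name.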
A small company wants to audit employee travel expenses in an effort to improve the expense reporting procedure and possibly reduce expenses. The company does not have resources to examine all expense reports and wants to use statistical sampling to objectively select expense reports for audit.
The data set TravelExpense contains the dollar amount of all employee travel expense transactions during the past month.
data TravelExpense;
   input ID$ Amount @@;
   if (Amount < 500) then Level='1_Low ';
   else if (Amount > 1500) then Level='3_High';
   else Level='2_Avg ';
   datalines;
110  237.18   002  567.89   234  118.50   743   74.38
411 1287.23   782  258.10   216  325.36   174  218.38
568 1670.80   302  134.71   285 2020.70   314   47.80
139 1183.45   775  330.54   425  780.10   506  895.80
239  620.10   011  420.18   672  979.66   142  810.25
738  670.85   192  314.58   243   87.50   263 1893.40
496  753.30   332  540.65   486 2580.35   614  230.56
654  185.60   308  688.43   784  505.14   017  205.48
162  650.42   289 1348.34   691   30.50   545 2214.80
517  940.35   382  217.85   024  142.90   478  806.90
107  560.72
;
In the SAS data set TravelExpense, the variable ID identifies the travel expense report. The variable Amount contains the dollar amount of the reported expense. The variable Level equals '1_Low', '2_Avg', or '3_High', depending on the value of Amount.
In the sample design for this audit, expense reports are stratified by Level. This ensures that each of these expense levels is included in the sample and also permits a disproportionate allocation of the sample, selecting proportionately more of the expense reports from the higher levels. Within strata, the sample of expense reports is selected with probability proportional to the amount of the expense, thus giving a greater chance of selection to larger expenses. In auditing terms, this is known as monetary-unit sampling. Refer to Wilburn (1984).
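Both the allocation and the size-proportional selection depend on how many reports, and how many dollars, fall in each stratum. A minimal sketch for summarizing the frame by Level, assuming the TravelExpense data set created above, is:

```sas
/* Count and total dollar amount of expense reports in each stratum. */
proc means data=TravelExpense n sum mean maxdec=2;
   class Level;
   var Amount;
run;
```

From the data listed above, the three strata contain 18, 18, and 5 expense reports, respectively.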
PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the TravelExpense data set by the stratification variable Level .
proc sort data=TravelExpense;
   by Level;
run;
The following PROC PRINT statements display the sampling frame data set TravelExpense , which contains 41 observations.
title1 'Travel Expense Audit';
proc print data=TravelExpense;
run;
Travel Expense Audit

Obs   ID      Amount   Level

  1   110     237.18   1_Low
  2   002     567.89   2_Avg
  3   234     118.50   1_Low
  4   743      74.38   1_Low
  5   411    1287.23   2_Avg
  6   782     258.10   1_Low
  7   216     325.36   1_Low
  8   174     218.38   1_Low
  9   568    1670.80   3_High
 10   302     134.71   1_Low
 11   285    2020.70   3_High
 12   314      47.80   1_Low
 13   139    1183.45   2_Avg
 14   775     330.54   1_Low
 15   425     780.10   2_Avg
 16   506     895.80   2_Avg
 17   239     620.10   2_Avg
 18   011     420.18   1_Low
 19   672     979.66   2_Avg
 20   142     810.25   2_Avg
 21   738     670.85   2_Avg
 22   192     314.58   1_Low
 23   243      87.50   1_Low
 24   263    1893.40   3_High
 25   496     753.30   2_Avg
 26   332     540.65   2_Avg
 27   486    2580.35   3_High
 28   614     230.56   1_Low
 29   654     185.60   1_Low
 30   308     688.43   2_Avg
 31   784     505.14   2_Avg
 32   017     205.48   1_Low
 33   162     650.42   2_Avg
 34   289    1348.34   2_Avg
 35   691      30.50   1_Low
 36   545    2214.80   3_High
 37   517     940.35   2_Avg
 38   382     217.85   1_Low
 39   024     142.90   1_Low
 40   478     806.90   2_Avg
 41   107     560.72   2_Avg
The following PROC SURVEYSELECT statements select a probability sample of expense reports from the TravelExpense data set using the stratified design with PPS selection within strata.
title1 'Travel Expense Audit';
proc surveyselect data=TravelExpense
      method=pps n=(6 10 4)
      seed=47279 out=AuditSample;
   size Amount;
   strata Level;
run;
The STRATA statement names the stratification variable Level. The SIZE statement specifies the size measure variable Amount. In the PROC SURVEYSELECT statement, the METHOD=PPS option requests sample selection with probability proportional to size and without replacement. The N=(6 10 4) option specifies the stratum sample sizes, listing the sample sizes in the same order that the strata appear in the TravelExpense data set. The sample size of 6 corresponds to the first stratum, Level = '1_Low'; the sample size of 10 corresponds to the second stratum, Level = '2_Avg'; and 4 corresponds to the last stratum, Level = '3_High'. The SEED=47279 option specifies 47279 as the initial seed for random number generation.
Output 72.3.2 displays the output from PROC SURVEYSELECT. A total of 20 expense reports is selected for audit. The data set AuditSample contains the sample of travel expense reports.
Travel Expense Audit

The SURVEYSELECT Procedure

Selection Method     PPS, Without Replacement
Size Measure         Amount
Strata Variable      Level

Input Data Set          TRAVELEXPENSE
Random Number Seed      47279
Number of Strata        3
Total Sample Size       20
Output Data Set         AUDITSAMPLE
The following PROC PRINT statements display the audit sample, which is shown in Output 72.3.3.
title1 'Travel Expense Audit';
title2 'Sample Selected by Stratified PPS Design';
proc print data=AuditSample;
run;

Travel Expense Audit
Sample Selected by Stratified PPS Design

                                 Selection    Sampling
Obs   Level     ID     Amount         Prob      Weight

  1   1_Low     654    185.60      0.31105     3.21489
  2   1_Low     017    205.48      0.34437     2.90385
  3   1_Low     382    217.85      0.36510     2.73896
  4   1_Low     614    230.56      0.38640     2.58797
  5   1_Low     782    258.10      0.43256     2.31183
  6   1_Low     775    330.54      0.55396     1.80518
  7   2_Avg     784    505.14      0.34623     2.88823
  8   2_Avg     332    540.65      0.37057     2.69853
  9   2_Avg     002    567.89      0.38924     2.56909
 10   2_Avg     239    620.10      0.42503     2.35278
 11   2_Avg     738    670.85      0.45981     2.17479
 12   2_Avg     496    753.30      0.51633     1.93676
 13   2_Avg     425    780.10      0.53470     1.87022
 14   2_Avg     478    806.90      0.55307     1.80810
 15   2_Avg     672    979.66      0.67148     1.48925
 16   2_Avg     139   1183.45      0.81116     1.23280
 17   3_High    568   1670.80      0.64385     1.55316
 18   3_High    263   1893.40      0.72963     1.37056
 19   3_High    285   2020.70      0.77869     1.28421
 20   3_High    486   2580.35      0.99435     1.00568
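The selection probabilities in the audit sample can be related back to the frame. The sketch below assumes that, for this PPS design, each report's selection probability is its stratum sample size times its Amount, divided by the stratum total of Amount. The stratum sample sizes are taken from the N=(6 10 4) option; the table name CheckAudit and the column names TotalAmount and ExpectedProb are hypothetical names used only for this illustration.

```sas
proc sql;
   create table CheckAudit as
   select a.Level, a.ID, a.Amount, a.SelectionProb,
          /* stratum sample size from N=(6 10 4), times Amount / stratum total */
          (case a.Level
              when '1_Low' then 6
              when '2_Avg' then 10
              else 4
           end) * a.Amount / t.TotalAmount as ExpectedProb format=8.5
   from AuditSample as a,
        (select Level, sum(Amount) as TotalAmount
           from TravelExpense
           group by Level) as t
   where a.Level = t.Level
   order by a.Level, a.Amount;
quit;

proc print data=CheckAudit;
run;
```

For example, the '1_Low' stratum totals 3580.10, so report 654 has an expected probability of 6 × 185.60 / 3580.10, or about 0.31105, which agrees with the SelectionProb value for that report.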